SCICAP: Scientific Figures Dataset

This is the Github repo of the EMNLP 2021 Findings' paper, SCICAP: Generating Captions for Scientific Figures (Hsu et. al, 2021)

SCICAP a large-scale figure caption dataset based on Computer Science arXiv papers published between 2010 and 2020. SCICAP contained 410k figures that focused on one of the dominent figure type - graphplot, extracted from over 290,000 papers.

How to Cite?

@inproceedings{hsu2021scicap,
  title={SciCap: Generating Captions for Scientific Figures},
  author={Hsu, Ting-Yao E. and Giles, C. Lee and Huang, Ting-Hao K.},
  booktitle={Findings of 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021 Findings)},
  year={2021}
}

Download the Dataset

You can dowload the SCICAP dataset here: Download Link (18.15 GB)

Folder Structure

scicap_data.zip
├── SciCap-Caption-All                  #caption text for all figures
│	├── Train
│	├── Val
│	└── Test
├── SciCap-No-Subfig-Img                #image files for the figures without subfigures
│	├── Train
│	├── Val
│	└── Test
├── SciCap-Yes-Subfig-Img               #image files for the figures with subfigures
│	├── Train
│	├── Val
│	└── Test
├── arxiv-metadata-oai-snapshot.json    #arXiv paper's metadata (from arXiv dataset)
└── List-of-Files-for-Each-Experiments  #list of figure names used in each experiment 
    ├── Single-Sentence-Caption
    │   ├── No-Subfig
    │   │   ├── Train
    │	│   ├── Val
    │	│   └── Test
    │	└── Yes-Subfig
    │       ├── Train
    │       ├── Val
    │       └── Test
    ├── First-Sentence                  #Same as in Single-Sentence-Caption
    └── Caption-No-More-Than-100-Tokens #Same as in Single-Sentence-Caption

Number of Figures in Each Subset

Data Collection	Does the figure have subfigures?	Train	Validate	Test
First Sentence	Yes	226,608	28,326	28,327
First Sentence	No	106,834	13,354	13,355
Single-Sent Caption	Yes	123,698	15,469	15,531
Single-Sent Caption	No	75,494	9,242	9,459
Caption w/ <=100 words	Yes	216,392	27,072	27,036
Caption w/ <=100 words	No	105,687	13,215	13,226

JSON Data Format

Example Data Instance (Caption and Figure)

An actual JSON object from SCICAP:

{
  "contains-subfigure": true, 
  "Img-text": ["(b)", "s]", "[m", "fs", "et", "e", "of", "T", "im", "Attack", "duration", "[s]", "350", "300", "250", "200", "150", "100", "50", "0", "50", "100", "150", "200", "250", "300", "0", "(a)", "]", "[", "m", "fs", "et", "e", "of", "ta", "nc", "D", "is", "Attack", "duration", "[s]", "10000", "9000", "8000", "7000", "6000", "5000", "4000", "3000", "2000", "1000", "0", "50", "100", "150", "200", "250", "300", "0"], 
  "paper-ID": "1001.0025v1", 
  "figure-ID": "1001.0025v1-Figure2-1.png", 
  "figure-type": "Graph Plot", 
  "0-originally-extracted": "Figure 2: Impact of the replay attack, as a function of the spoofing attack duration. (a) Location offset or error: Distance between the attack-induced and the actual victim receiver position. (b) Time offset or error: Time difference between the attack-induced clock value and the actual time.", 
  "1-lowercase-and-token-and-remove-figure-index": {
    "caption": "impact of the replay attack , as a function of the spoofing attack duration . ( a ) location offset or error : distance between the attack-induced and the actual victim receiver position . ( b ) time offset or error : time difference between the attack-induced clock value and the actual time .", 
    "sentence": ["impact of the replay attack , as a function of the spoofing attack duration .", "( a ) location offset or error : distance between the attack-induced and the actual victim receiver position .", "( b ) time offset or error : time difference between the attack-induced clock value and the actual time ."], 
    "token": ["impact", "of", "the", "replay", "attack", ",", "as", "a", "function", "of", "the", "spoofing", "attack", "duration", ".", "(", "a", ")", "location", "offset", "or", "error", ":", "distance", "between", "the", "attack-induced", "and", "the", "actual", "victim", "receiver", "position", ".", "(", "b", ")", "time", "offset", "or", "error", ":", "time", "difference", "between", "the", "attack-induced", "clock", "value", "and", "the", "actual", "time", "."]
  }, 
  "2-normalized": {
    "2-1-basic-num": {
      "caption": "impact of the replay attack , as a function of the spoofing attack duration . ( a ) location offset or error : distance between the attack-induced and the actual victim receiver position . ( b ) time offset or error : time difference between the attack-induced clock value and the actual time .", 
      "sentence": ["impact of the replay attack , as a function of the spoofing attack duration .", "( a ) location offset or error : distance between the attack-induced and the actual victim receiver position .", "( b ) time offset or error : time difference between the attack-induced clock value and the actual time ."], 
      "token": ["impact", "of", "the", "replay", "attack", ",", "as", "a", "function", "of", "the", "spoofing", "attack", "duration", ".", "(", "a", ")", "location", "offset", "or", "error", ":", "distance", "between", "the", "attack-induced", "and", "the", "actual", "victim", "receiver", "position", ".", "(", "b", ")", "time", "offset", "or", "error", ":", "time", "difference", "between", "the", "attack-induced", "clock", "value", "and", "the", "actual", "time", "."]
      }, 
    "2-2-advanced-euqation-bracket": {
      "caption": "impact of the replay attack , as a function of the spoofing attack duration . BRACKET-TK location offset or error : distance between the attack-induced and the actual victim receiver position . BRACKET-TK time offset or error : time difference between the attack-induced clock value and the actual time .", 
      "sentence": ["impact of the replay attack , as a function of the spoofing attack duration .", "BRACKET-TK location offset or error : distance between the attack-induced and the actual victim receiver position .", "BRACKET-TK time offset or error : time difference between the attack-induced clock value and the actual time ."], 
      "tokens": ["impact", "of", "the", "replay", "attack", ",", "as", "a", "function", "of", "the", "spoofing", "attack", "duration", ".", "BRACKET-TK", "location", "offset", "or", "error", ":", "distance", "between", "the", "attack-induced", "and", "the", "actual", "victim", "receiver", "position", ".", "BRACKET-TK", "time", "offset", "or", "error", ":", "time", "difference", "between", "the", "attack-induced", "clock", "value", "and", "the", "actual", "time", "."]
      }
    }
  }

Corresponding Figure: 1001.0025v1-Figure2-1.png

JSON Scheme

contains-subfigure: boolean (check if contain subfigure)
paper-ID: the unique paper ID in the arXiv dataset
figure-ID: the extracted figure ID of paper (the index is not the same as the label in the caption)
figure-type: the figure type
0-originally-extracted: original captions of figures
1-lowercase-and-token-and-remove-figure-index: Removed figure index and the captions in lowercase
2-normalized:
- 2-1-basic-num: caption after replacing the number
- 2-2-advanced-euqation-bracket: caption after replacing the equations and contents in the bracket
Img-text: texts extracted from the figure, such as the texts for labels, legends ... etc.

Within the caption content, we have three attributes:

caption: caption after each normalization
sentence: a list of segmented sentences
token: a list of tokenized words

Normalized Token

In the paper, we used [NUM], [BRACKET], [EQUATION], but we decided to use NUM-TK, BRACKET-TK, EQUAT-TK in the final data release to avoid the extra problems caused by "[]".

Token	Description
NUM-TK	Numbers (e.g., 0, -0.2, 3.44%, 1,000,000).
BRACKET-TK	Text spans enclosed by any types of bracket pairs, including {}, [], and ().
EQUAT-TK	Math equations identified using regular expressions.

Baseline Performance

To examine the feasibility and challenges of creating an image-captioning model for scientific figures, we established several baselines and tested them using SCICAP. The caption quality was measured by BLEU-4, using the test set of the corresponding data collection as a reference. We trained the models on each data collection with varying levels of data filtering and text normalization. Table 2 shows the results. We also designed three variations of the baseline models, Vision-only, Vision+Text, and Text-only. Table 3 shows the results.

Data License

The arXiv dataset uses the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication license, which grants permission to remix, remake, annotate, and publish the data.

Acknowledgements

We thank Chieh-Yang Huang, Hua Shen, and Chacha Chen for helping with the data annotation. We thank Chieh-Yang Huang for the feedback and strong technical support. We also thank the anonymous reviewers for their constructive feedback. This research was partially supported by the Seed Grant (2020) from the College of Information Sciences and Technology (IST), Pennsylvania State University.

EMNLP 2021 Findings' paper, SCICAP: Generating Captions for Scientific Figures

Related tags

Overview

SCICAP: Scientific Figures Dataset

This is the Github repo of the EMNLP 2021 Findings' paper, SCICAP: Generating Captions for Scientific Figures (Hsu et. al, 2021)

How to Cite?

Download the Dataset

You can dowload the SCICAP dataset here: Download Link (18.15 GB)

Folder Structure

Number of Figures in Each Subset

JSON Data Format

Example Data Instance (Caption and Figure)

JSON Scheme

Normalized Token

Baseline Performance

Data License

Acknowledgements

Owner

Edward

Decorator for PyMC3

Graph InfoClust: Leveraging cluster-level node information for unsupervised graph representation learning

Neural network-based build time estimation for additive manufacturing

Automatically download the cwru data set, and then divide it into training data set and test data set

Pipeline for employing a Lightweight deep learning models for LOW-power systems

Code implementation from my Medium blog post: [Transformers from Scratch in PyTorch]

This repository includes the official project for the paper: TransMix: Attend to Mix for Vision Transformers.

Optimal Adaptive Allocation using Deep Reinforcement Learning in a Dose-Response Study

TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks

Facial Image Inpainting with Semantic Control

The 1st Place Solution of the Facebook AI Image Similarity Challenge (ISC21) : Descriptor Track.

Multi-task Learning of Order-Consistent Causal Graphs (NeuRIPs 2021)

The source code for CATSETMAT: Cross Attention for Set Matching in Bipartite Hypergraphs

Spectral normalization (SN) is a widely-used technique for improving the stability and sample quality of Generative Adversarial Networks (GANs)

Open source implementation of AceNAS: Learning to Rank Ace Neural Architectures with Weak Supervision of Weight Sharing

Differentiable Wavetable Synthesis

a grammar based feedback fuzzer

HiFi++: a Unified Framework for Neural Vocoding, Bandwidth Extension and Speech Enhancement

Deep Q-Learning Network in pytorch (not actively maintained)

Code for "PV-RAFT: Point-Voxel Correlation Fields for Scene Flow Estimation of Point Clouds", CVPR 2021