Arxiv harvester - Poor man's simple harvester for arXiv resources

Overview

Poor man's simple harvester for arXiv resources

This modest Python script takes advantage of arXiv resources hosted by Kaggle to harvest arXiv metadata and PDF, without using the AWS requester paid buckets.

The harvester performs the following tasks:

  • parse the full JSON arXiv metadata file available at Kaggle

  • parallel download PDF located at the public access bucket gs://arxiv-dataset and store them (also in parallel) on a cloud storage, AWS S3 and Swift OpenStack supported, or on the local file system

  • store the metadata of the uploaded article along with the PDF in JSON format

To save storage space, only the most recent available version of the PDF for an article is harvested, not every available versions.

Resuming interrupted and incremental update are automatically supported.

In case an article is only available in postcript, it will be converted into PDF too - but it is extremely rare (and usually when it happens the conversion fails because the PostSript is corrupted...).

Install

The tool is supposed to work on a POSIX environment. External call to the following command lines are used: gzip, gunzip and ps2pdf.

First, download the full arXiv metadata JSON file available at https://www.kaggle.com/Cornell-University/arxiv (1GB compressed). It's actually a JSONL file (one JSON document per line), currently named arxiv-metadata-oai-snapshot.json.zip. You can also generate yourself this file with arxiv-public-dataset OAI harvester using the arXiv OAI-PMH service.

Get this github repo:

git clone https://github.com/kermitt2/arxiv_harvester
cd arxiv_harvester

Setup a virtual environment:

virtualenv --system-site-packages -p python3.8 env
source env/bin/activate

Install the dependencies:

pip3 install -r requirements.txt

Finally install the project in editable state:

pip3 install -e .

Usage

First check the configuration file:

  • set the parameters according to your selected storage (AWS S3, SWIFT OepnStack or local storage), see below for more details,
  • the default batch_size for parallel download/upload is 10, change it as you wish and dare,
  • by default gzip compression of files on the target storage is selected.
arXiv harvester

optional arguments:
  -h, --help           show this help message and exit
  --config CONFIG      path to the config file, default is ./config.json
  --reset              ignore previous processing states and re-init the harvesting process from
                       the beginning
  --metadata METADATA  arXiv metadata json file
  --diagnostic         produce a summary of the harvesting

For example, to harvest articles from a metadata snapshot file:

python3 arxiv_harvester/harvester.py --metadata arxiv-metadata-oai-snapshot.json.zip --config config.json

To reset an existing harvesting and starts the harvesting again from scratch, add the --reset argument:

python3 arxiv_harvester/harvester.py --metadata arxiv-metadata-oai-snapshot.json.zip --config config.json --reset

Note that with --reset, no actual stored PDF file is removed - only the harvesting process is reinitialized.

Interrupted harvesting / Incremental update

Launching the harvesting command on an interrupted harvesting will resume the harvesting automatically where it stopped.

If the arXiv metadata file has been updated to a newer version (downloaded from https://www.kaggle.com/Cornell-University/arxiv or generated with arxiv-public-dataset OAI harvester), launching the harvesting command on the updated metadata file will harvest only the new and updated articles (new most recent PDF version).

Resource file organization

The organization of harvested files permits a direct access to the PDF based on the arxiv identifier. More particularly, the Open Access link given for an arXiv resource by Unpaywall is enough to create a direct access path. It also avoids storing too many files in the same directory for performance reasons.

The stored PDF is always the most recent version. There is no need to know what is the exact latest version (an information that we don't have with the Unpaywall arXiv full text links for example). The local metadata file for the article gives the version number of the stored PDF.

For example, to get access path from the identifiers or Unpaywall OA url:

  • post-2007 arXiv identifiers (pattern arXiv:YYMM.numbervV or commonly YYMM.numbervV):

    • 1501.00001v1 -> $root/arXiv/1501/1501.00001/1501.00001.pdf (most recent version of the PDF), $root/arXiv/1501/1501.00001/1501.00001.json (arXiv metadata for the article)
    • Unpaywall link http://arxiv.org/pdf/1501.00001 -> $root/arXiv/1501/1501.00001/1501.00001.pdf, $root/arXiv/1501/1501.00001/1501.00001.json
  • pre-2007 arXiv identifiers (pattern archive.subject_call/YYMMnumber):

    • quant-ph/0602109 -> $root/quant-ph/0602/0602109/0602109.pdf (most recent version of the PDF), $root/quant-ph/0602/0602109/0602109.json (arXiv metadata for the article)

    • Unpaywall link https://arxiv.org/pdf/quant-ph/0602109 -> $root/quant-ph/0602/0602109/0602109.pdf, $root/quant-ph/0602/0602109/0602109.json

If the compression option is set to True in the configuration file config.json, all the resources have an additional .gz extension.

$root in the above examples should be adapted to the storage of choice, as configured in the configuration file config.json. For instance with AWS S3: https://bucket_name.s3.amazonaws.com/arXiv/1501/1501.00001/1501.00001.pdf (if access rights are appropriate). The same applies to a SWIFT object storage based on the container name indicated in the config file.

AWS S3 and SWIFT configuration

For a local storage, just indicate the path where to store the PDF with the parameter data_path in the configuration file config.json.

The configuration for a S3 storage uses the following parameters:

{
    "aws_access_key_id": "",
    "aws_secret_access_key": "",
    "bucket_name": "",
    "region": ""
}

If you are not using a S3 storage, remove these keys or leave these values empty.

The configuration for a SWIFT object storage uses the following parameters:

{
    "swift": {},
    "swift_container": ""
}

If you are not using a SWIFT storage, remove these keys or leave these above values empty.

The "swift" key will contain the account and authentication information, typically via Keystone, something like this:

{
    "swift": {
        "auth_version": "3",
        "auth_url": "https://auth......./v3",
        "os_username": "user-007",
        "os_password": "1234",
        "os_user_domain_name": "Default",
        "os_project_domain_name": "Default",
        "os_project_name": "myProjectName",
        "os_project_id": "myProjectID",
        "os_region_name": "NorthPole",
        "os_auth_url": "https://auth......./v3"
    },
    "swift_container": "my_arxiv_harvesting"
}

Limitations

Source files (LaTeX sources) are not available via the Kaggle dataset and thus via this modest harvester. The LaTeX source files are available via AWS S3 Bulk Source File Access.

There are 44 articles only available in HTML format. These articles will not be harvested.

Acknowledgements

Kaggle arXiv dataset relies on arxiv-public-datasets:

Clement, C. B., Bierbaum, M., O'Keeffe, K. P., & Alemi, A. A. (2019). On the Use of ArXiv as a Dataset. arXiv preprint arXiv:1905.00075.

License and contact

This modest tool is distributed under Apache 2.0 license. The dependencies used in the project are either themselves also distributed under Apache 2.0 license or distributed under a compatible license.

If you contribute to this Open Source project, you agree to share your contribution following this license.

Kaggle dataset arXiv Metadata is distributed under CC0 1.0 license. Note that most articles on arXiv are submitted with the default arXiv license, which does usually not allow redistribution. See here about the possible usage of the harvested PDF.

Main author and contact: Patrice Lopez ([email protected])

Owner
Patrice Lopez
Patrice Lopez
Trainable Bilateral Filter Layer (PyTorch)

Trainable Bilateral Filter Layer (PyTorch) This repository contains our GPU-accelerated trainable bilateral filter layer (three spatial and one range

FabianWagner 26 Dec 25, 2022
SOFT: Softmax-free Transformer with Linear Complexity, NeurIPS 2021 Spotlight

SOFT: Softmax-free Transformer with Linear Complexity SOFT: Softmax-free Transformer with Linear Complexity, Jiachen Lu, Jinghan Yao, Junge Zhang, Xia

Fudan Zhang Vision Group 272 Dec 25, 2022
Unofficial implementation of "Coordinate Attention for Efficient Mobile Network Design"

Unofficial implementation of "Coordinate Attention for Efficient Mobile Network Design". CoordAttention tensorflow slim

Billy 9 Aug 22, 2022
A PyTorch Implementation of the Luna: Linear Unified Nested Attention

Unofficial PyTorch implementation of Luna: Linear Unified Nested Attention The quadratic computational and memory complexities of the Transformer’s at

Soohwan Kim 32 Nov 07, 2022
[NeurIPS-2020] Self-paced Contrastive Learning with Hybrid Memory for Domain Adaptive Object Re-ID.

Self-paced Contrastive Learning (SpCL) The official repository for Self-paced Contrastive Learning with Hybrid Memory for Domain Adaptive Object Re-ID

Yixiao Ge 286 Dec 21, 2022
VSR-Transformer - This paper proposes a new Transformer for video super-resolution (called VSR-Transformer).

VSR-Transformer By Jiezhang Cao, Yawei Li, Kai Zhang, Luc Van Gool This paper proposes a new Transformer for video super-resolution (called VSR-Transf

Jiezhang Cao 225 Nov 13, 2022
tensorrt int8 量化yolov5 4.0 onnx模型

onnx模型转换为 int8 tensorrt引擎

123 Dec 28, 2022
Code repository for the paper: Hierarchical Kinematic Probability Distributions for 3D Human Shape and Pose Estimation from Images in the Wild (ICCV 2021)

Hierarchical Kinematic Probability Distributions for 3D Human Shape and Pose Estimation from Images in the Wild Akash Sengupta, Ignas Budvytis, Robert

Akash Sengupta 149 Dec 14, 2022
Pytorch Implementation for (STANet+ and STANet)

Pytorch Implementation for (STANet+ and STANet) V2-Weakly Supervised Visual-Auditory Saliency Detection with Multigranularity Perception (arxiv), pdf:

GuotaoWang 14 Nov 29, 2022
Readings for "A Unified View of Relational Deep Learning for Polypharmacy Side Effect, Combination Therapy, and Drug-Drug Interaction Prediction."

Polypharmacy - DDI - Synergy Survey The Survey Paper This repository accompanies our survey paper A Unified View of Relational Deep Learning for Polyp

AstraZeneca 79 Jan 05, 2023
A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch

Introduction This is a Python package available on PyPI for NVIDIA-maintained utilities to streamline mixed precision and distributed training in Pyto

Artit 'Art' Wangperawong 5 Sep 29, 2021
Code for Max-Margin Contrastive Learning - AAAI 2022

Max-Margin Contrastive Learning This is a pytorch implementation for the paper Max-Margin Contrastive Learning accepted to AAAI 2022. This repository

Anshul Shah 12 Oct 22, 2022
BBB streaming without Xorg and Pulseaudio and Chromium and other nonsense (heavily WIP)

BBB Streamer NG? Makes a conference like this... ...streamable like this! I also recorded a small video showing the basic features: https://www.youtub

Lukas Schauer 60 Oct 21, 2022
WORD: Revisiting Organs Segmentation in the Whole Abdominal Region

WORD: Revisiting Organs Segmentation in the Whole Abdominal Region (Paper and DataSet). [New] Note that all the emails about the download permission o

Healthcare Intelligence Laboratory 71 Dec 22, 2022
A PyTorch port of the Neural 3D Mesh Renderer

Neural 3D Mesh Renderer (CVPR 2018) This repo contains a PyTorch implementation of the paper Neural 3D Mesh Renderer by Hiroharu Kato, Yoshitaka Ushik

Daniilidis Group University of Pennsylvania 1k Jan 09, 2023
PyTorch implemention of ICCV'21 paper SGPA: Structure-Guided Prior Adaptation for Category-Level 6D Object Pose Estimation

SGPA: Structure-Guided Prior Adaptation for Category-Level 6D Object Pose Estimation This is the PyTorch implemention of ICCV'21 paper SGPA: Structure

Chen Kai 24 Dec 05, 2022
Scales, Chords, and Cadences: Practical Music Theory for MIR Researchers

ISMIR-musicTheoryTutorial This repository has slides and Jupyter notebooks for the ISMIR 2021 tutorial Scales, Chords, and Cadences: Practical Music T

Johanna Devaney 58 Oct 11, 2022
This is 2nd term discrete maths project done by UCU students that uses backtracking to solve various problems.

Backtracking Project Sponsors This is a project made by UCU students: Olha Liuba - crossword solver implementation Hanna Yershova - sudoku solver impl

Dasha 4 Oct 17, 2021
Aerial Imagery dataset for fire detection: classification and segmentation (Unmanned Aerial Vehicle (UAV))

Aerial Imagery dataset for fire detection: classification and segmentation using Unmanned Aerial Vehicle (UAV) Title FLAME (Fire Luminosity Airborne-b

79 Jan 06, 2023
Official PyTorch implementation of the Fishr regularization for out-of-distribution generalization

Fishr: Invariant Gradient Variances for Out-of-distribution Generalization Official PyTorch implementation of the Fishr regularization for out-of-dist

62 Dec 22, 2022