Princeton NLP's pre-training library based on fairseq with DeepSpeed kernel integration 🚃

Overview


This repository provides a library for efficient training of masked language models (MLM), built with fairseq. We fork fairseq to give researchers more flexibility when using our training scripts, while also making it easier to adapt our code contributions into other projects.

Why DinkyTrain?

The Dinky runs between Princeton Junction and Princeton and is the shortest scheduled commuter rail line in the United States. We also aim to make pre-training short and accessible to everyone.

Our Contributions

  • DeepSpeed transformer kernel integration
  • A training recipe for efficient MLM pre-training
  • An easy-to-follow guide to using fairseq for MLM pre-training

Other fairseq features:

See the fairseq repo and its documentation for more details on how to use and extend fairseq.

DinkyTrain for Efficient MLM Pre-training

Overview

You can reproduce the pre-training experiments of our recent paper Should You Mask 15% in Masked Language Modeling?, where we find that higher masking rates can lead to more efficient pre-training.

Installation

  • PyTorch version >= 1.5.0
  • Python version >= 3.6
  • To install fairseq and develop locally:
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
  • For faster training (FP16) install NVIDIA's apex library:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./
  • For faster training (DeepSpeed CUDA kernel), install the DeepSpeed library and compile the DeepSpeed kernel:
DS_BUILD_TRANSFORMER=1 DS_BUILD_STOCHASTIC_TRANSFORMER=1 pip install deepspeed
  • For large datasets install PyArrow: pip install pyarrow
  • If you use Docker, make sure to increase the shared memory size with either --ipc=host or --shm-size as a command-line option to nvidia-docker run.

Troubleshooting:

  • If you are using an older version of Python, you may encounter import problems with importlib.metadata. Try pip install importlib-metadata.
  • To install apex and deepspeed, you will need nvcc (the CUDA compiler).
  • When installing apex, if you encounter the error Cuda extensions are being compiled with a version of Cuda that does not match ..., go to setup.py and comment out the line that raised the error (at your own risk).
  • Installing apex or deepspeed requires a gcc version recent enough to support C++14. If you encounter related errors, update your gcc.
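
The requirements above can be checked before launching a long job. The following is a minimal sketch, not part of this repository: the function names parse_version and check_environment are hypothetical, and the script only reports whether each package is importable and whether nvcc is on PATH. It does not verify that the apex or DeepSpeed kernels were actually compiled.

```python
import importlib.util
import re
import shutil
import sys

# Minimum versions taken from the requirements list above.
MIN_PYTHON = (3, 6)
MIN_TORCH = (1, 5, 0)


def parse_version(text):
    """Turn a version string such as '1.13.1+cu117' into a 3-tuple of ints."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:  # pad '2.0' to (2, 0, 0) so comparisons are uniform
        parts.append(0)
    return tuple(parts[:3])


def check_environment():
    """Return a dict mapping each requirement to True or False."""
    report = {"python>=3.6": sys.version_info[:2] >= MIN_PYTHON}
    # Only report whether each package can be found; nothing is installed.
    for name in ("torch", "fairseq", "apex", "deepspeed", "pyarrow"):
        report[name] = importlib.util.find_spec(name) is not None
    # Building apex and the DeepSpeed kernels needs the CUDA compiler.
    report["nvcc"] = shutil.which("nvcc") is not None
    if report["torch"]:
        import torch
        report["torch>=1.5.0"] = parse_version(torch.__version__) >= MIN_TORCH
    return report


if __name__ == "__main__":
    for requirement, ok in check_environment().items():
        print(f"{requirement:15s} {'ok' if ok else 'MISSING'}")
```

A MISSING entry for apex, deepspeed, or pyarrow is not necessarily an error, since those packages are only needed for faster training or large datasets.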

Data Pre-processing

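Tokenization: First, download the GPT2 BPE vocabulary and the matching fairseq dictionary (dict.txt is needed for the binarization step):

mkdir -p gpt2_bpe
wget -O gpt2_bpe/encoder.json https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
wget -O gpt2_bpe/vocab.bpe https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe
wget -O gpt2_bpe/dict.txt https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt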

Then, tokenize your raw data:

python -m examples.roberta.multiprocessing_bpe_encoder \
    --encoder-json gpt2_bpe/encoder.json \
    --vocab-bpe gpt2_bpe/vocab.bpe \
    --inputs ${SPLIT}.raw \
    --outputs ${SPLIT}.bpe \
    --keep-empty \
    --workers 8

Finally, index and binarize your data:

fairseq-preprocess \
    --only-source \
    --srcdict gpt2_bpe/dict.txt \
    --trainpref ${TRAIN_SPLIT}.bpe \
    --validpref ${VALID_SPLIT}.bpe \
    --testpref ${TEST_SPLIT}.bpe \
    --destdir output-bin \
    --workers 8
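
The two steps above are run once per split and then once for binarization. The sketch below shows one way to generate those commands from Python, assuming the splits are named train, valid, and test (hypothetical names standing in for ${SPLIT}, ${TRAIN_SPLIT}, and so on). The helper names are illustrative, and all flags are copied from the commands above. It is a dry run that only prints the commands.

```python
import shlex

SPLITS = ("train", "valid", "test")  # hypothetical split names


def encode_command(split, workers=8):
    """Command that BPE-encodes {split}.raw into {split}.bpe."""
    return [
        "python", "-m", "examples.roberta.multiprocessing_bpe_encoder",
        "--encoder-json", "gpt2_bpe/encoder.json",
        "--vocab-bpe", "gpt2_bpe/vocab.bpe",
        "--inputs", f"{split}.raw",
        "--outputs", f"{split}.bpe",
        "--keep-empty",
        "--workers", str(workers),
    ]


def binarize_command(destdir="output-bin", workers=8):
    """Command that indexes and binarizes the three encoded splits."""
    return [
        "fairseq-preprocess",
        "--only-source",
        "--srcdict", "gpt2_bpe/dict.txt",
        "--trainpref", "train.bpe",
        "--validpref", "valid.bpe",
        "--testpref", "test.bpe",
        "--destdir", destdir,
        "--workers", str(workers),
    ]


if __name__ == "__main__":
    # Dry run: print each command. To execute, pass the list to
    # subprocess.run(cmd, check=True) from the fairseq repository root.
    commands = [encode_command(s) for s in SPLITS] + [binarize_command()]
    for cmd in commands:
        print(" ".join(shlex.quote(arg) for arg in cmd))
```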

Alternatively, use our pre-processed data: we preprocessed Wikipedia+BookCorpus and shared it as a Hugging Face dataset. It is ~22GB and contains two epochs of data, with each epoch sliced into 8 shards. You can download it using git:

git lfs install # Git lfs is needed for downloading
git clone https://huggingface.co/datasets/princeton-nlp/wikibook_fairseq_format

Pre-training

Use our script for efficient pre-training:

GPU={number of GPUs} DATA_DIR={data path} [DEEPSPEED=1] bash run_efficient_mlm_recipe.sh

Flags explained

  • GPU: number of GPUs.
  • DATA_DIR: path to the processed pre-training data. If you are using our preprocessed dataset, DATA_DIR should be:
DATA_DIR=$(seq 0 15 | sed -e 's/^/wikibook_fairseq_format\/bin-shard/' | sed -e 's/$/-8/' | paste -sd ':')
  • DEEPSPEED (optional): if set to 1, the DeepSpeed CUDA kernel will be used.
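
The DATA_DIR one-liner above produces a colon-separated list of the 16 shard directories (two epochs of 8 shards each). For readers who find the seq/sed/paste pipeline hard to read, here is an equivalent sketch in Python. The function name build_data_dir and its parameters are hypothetical; the directory pattern is taken from the shell command.

```python
def build_data_dir(root="wikibook_fairseq_format", num_shards=16, suffix="-8"):
    """Join shard directories with ':' exactly as the shell one-liner does."""
    return ":".join(f"{root}/bin-shard{i}{suffix}" for i in range(num_shards))


if __name__ == "__main__":
    # Print the value so it can be exported, e.g. DATA_DIR=$(python this_file.py)
    print(build_data_dir())
```

If the dataset was cloned somewhere other than the current directory, pass that location as root.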

Please refer to the script for more hyperparameter choices.

Fine-tuning on GLUE and SQuAD

All our checkpoints can be converted to HuggingFace transformers models (see the next section) and then fine-tuned with the transformers package. Fairseq also supports fine-tuning on GLUE.

First, download the preprocessed GLUE data (you can also process it yourself by following the Data Pre-processing section above):

git lfs install # Git lfs is needed for downloading
git clone https://huggingface.co/datasets/princeton-nlp/glue_fairseq_format

Then use the following script for fine-tuning:

DATA_DIR={path to the data directory} \
TASK={glue task name (mnli qnli qqp rte sst2 mrpc cola stsb)} \
LR={learning rate} \
BSZ={batch size} \
EPOCHS={number of epochs} \
SEED={random seed} \
CKPT_DIR={checkpoint's directory} \
CKPT_NAME={checkpoint's name} \
[DEEPSPEED=1] bash finetune_glue.sh
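
Because finetune_glue.sh takes all of its settings from environment variables, a small grid over tasks, learning rates, and seeds is easy to script. The sketch below only builds and prints the variable assignments. The helper name glue_runs is hypothetical, and every value in the example grid (learning rates, batch size, epochs, paths, checkpoint name) is a placeholder for illustration, not a setting from the paper.

```python
import itertools

# Task names accepted by finetune_glue.sh, as listed above.
TASKS = ("mnli", "qnli", "qqp", "rte", "sst2", "mrpc", "cola", "stsb")


def glue_runs(tasks, lrs, batch_sizes, seeds, epochs, data_dir, ckpt_dir, ckpt_name):
    """Yield one environment-variable dict per (task, lr, batch size, seed)."""
    for task, lr, bsz, seed in itertools.product(tasks, lrs, batch_sizes, seeds):
        if task not in TASKS:
            raise ValueError(f"unknown GLUE task: {task}")
        yield {
            "DATA_DIR": data_dir,
            "TASK": task,
            "LR": str(lr),
            "BSZ": str(bsz),
            "EPOCHS": str(epochs),
            "SEED": str(seed),
            "CKPT_DIR": ckpt_dir,
            "CKPT_NAME": ckpt_name,
        }


if __name__ == "__main__":
    # Placeholder grid for illustration only.
    runs = glue_runs(
        tasks=("rte", "mrpc"), lrs=(1e-5, 2e-5), batch_sizes=(16,), seeds=(1, 2),
        epochs=10, data_dir="glue_fairseq_format",
        ckpt_dir="checkpoints", ckpt_name="checkpoint_last.pt",
    )
    for env in runs:
        # To launch instead of print (needs `import os, subprocess`):
        # subprocess.run(["bash", "finetune_glue.sh"],
        #                env={**os.environ, **env}, check=True)
        print(" ".join(f"{k}={v}" for k, v in env.items()), "bash finetune_glue.sh")
```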

For fine-tuning on SQuAD, please convert the models to HuggingFace checkpoints following the next section and use HuggingFace's examples.

Convert to HuggingFace

We also provide conversion code so that you can easily turn Fairseq checkpoints into HuggingFace checkpoints. Usage:

cd scripts
[PRELAYERNORM=1] [FROM_DS=1] python convert_fs_ckpt_to_hf_ckpt.py --fr {fairseq checkpoint} --to {huggingface checkpoint path} --hf_model_config {roberta-base/roberta-large}

Flags explained:

  • PRELAYERNORM=1: Use pre layer-norm (the default is post layer-norm).
  • FROM_DS=1: The Fairseq checkpoint was trained with DeepSpeed's CUDA kernel.
  • --fr: The path to the Fairseq checkpoint.
  • --to: The path you want to save the HuggingFace checkpoint to.
  • --hf_model_config: roberta-base or roberta-large.

IMPORTANT: all our models use pre layer norm, which is not supported by HuggingFace yet. To use it, import the model class from huggingface/modeling_roberta_prelayernorm.py. For example:

from huggingface.modeling_roberta_prelayernorm import RobertaForSequenceClassification

For more configuration, please refer to convert_fs_ckpt_to_hf_ckpt.py.
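
For readers unfamiliar with the distinction behind the PRELAYERNORM flag, the toy sketch below contrasts the two residual orderings on a plain Python vector. It illustrates the general technique only and is not the code in modeling_roberta_prelayernorm.py: the layer norm here has no learned scale or bias, and toy_sublayer is a made-up stand-in for an attention or feed-forward sublayer.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln_block(x, sublayer):
    """Post layer-norm: add the residual first, then normalize the sum."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln_block(x, sublayer):
    """Pre layer-norm: normalize the sublayer input; the residual path is untouched."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


if __name__ == "__main__":
    toy_sublayer = lambda v: [0.5 * t for t in v]  # stand-in for attention or FFN
    x = [1.0, 2.0, 4.0]
    print("post-LN:", post_ln_block(x, toy_sublayer))
    print("pre-LN: ", pre_ln_block(x, toy_sublayer))
```

The two orderings place the normalization parameters at different points in the computation, which is why a pre layer-norm checkpoint has to be loaded with a matching model class rather than the stock post layer-norm RoBERTa.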

Model List

Here are the HuggingFace checkpoints of the models from our paper, Should You Mask 15% in Masked Language Modeling? All results are development-set performance.

Model                                     MNLI  QNLI  QQP   SST-2
princeton-nlp/efficient_mlm_m0.15         84.2  90.9  87.8  93.3
princeton-nlp/efficient_mlm_m0.20         84.1  91.3  87.9  92.7
princeton-nlp/efficient_mlm_m0.30         84.2  91.6  88.0  93.0
princeton-nlp/efficient_mlm_m0.40         84.5  91.6  88.1  92.8
princeton-nlp/efficient_mlm_m0.50         84.1  91.1  88.1  92.7
princeton-nlp/efficient_mlm_m0.60         83.2  90.7  87.8  92.6
princeton-nlp/efficient_mlm_m0.70         82.3  89.4  87.5  91.9
princeton-nlp/efficient_mlm_m0.80         80.8  87.9  87.1  90.5
princeton-nlp/efficient_mlm_m0.15-801010  83.7  90.4  87.8  93.2
princeton-nlp/efficient_mlm_m0.40-801010  84.3  91.2  87.9  93.0
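
To compare the rows at a glance, the sketch below computes an unweighted mean over the four tasks for each checkpoint. The numbers are copied from the table above. The mean itself is only a convenience summary, not a metric reported in the table, and the name average_scores is hypothetical.

```python
# Development-set scores from the table above, in the order (MNLI, QNLI, QQP, SST-2).
RESULTS = {
    "princeton-nlp/efficient_mlm_m0.15": (84.2, 90.9, 87.8, 93.3),
    "princeton-nlp/efficient_mlm_m0.20": (84.1, 91.3, 87.9, 92.7),
    "princeton-nlp/efficient_mlm_m0.30": (84.2, 91.6, 88.0, 93.0),
    "princeton-nlp/efficient_mlm_m0.40": (84.5, 91.6, 88.1, 92.8),
    "princeton-nlp/efficient_mlm_m0.50": (84.1, 91.1, 88.1, 92.7),
    "princeton-nlp/efficient_mlm_m0.60": (83.2, 90.7, 87.8, 92.6),
    "princeton-nlp/efficient_mlm_m0.70": (82.3, 89.4, 87.5, 91.9),
    "princeton-nlp/efficient_mlm_m0.80": (80.8, 87.9, 87.1, 90.5),
    "princeton-nlp/efficient_mlm_m0.15-801010": (83.7, 90.4, 87.8, 93.2),
    "princeton-nlp/efficient_mlm_m0.40-801010": (84.3, 91.2, 87.9, 93.0),
}


def average_scores(results):
    """Unweighted mean over the four tasks for each checkpoint."""
    return {name: sum(scores) / len(scores) for name, scores in results.items()}


if __name__ == "__main__":
    averages = average_scores(RESULTS)
    # Print checkpoints from highest to lowest mean score.
    for name, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:42s} {avg:.2f}")
```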

We also offer the original (deepspeed) fairseq checkpoints here.

Bugs or Questions?

If you have any questions, encounter any problems when using the code, or want to report a bug, you can open an issue. Please describe the problem in detail so we can help you better and more quickly!

Citation

@article{wettig2022should,
   title={Should You Mask 15% in Masked Language Modeling?},
   author={Wettig, Alexander and Gao, Tianyu and Zhong, Zexuan and Chen, Danqi},
   journal={arXiv preprint arXiv:2202.08005},
   year={2022}
}

Acknowledgment

  • Our library is based on fairseq:

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53.

  • Our efficient training recipe is based on the following paper:

Peter Izsak, Moshe Berchansky, and Omer Levy. 2021. How to train BERT with an academic budget. In Empirical Methods in Natural Language Processing (EMNLP), pages 10644–10652.
