Repository for the paper "Optimal Subarchitecture Extraction for BERT"

Related tags

Text Data & NLPbort
Overview

Bort

Companion code for the paper "Optimal Subarchitecture Extraction for BERT."

Bort is an optimal subset of architectural parameters for the BERT architecture, extracted by applying a fully polynomial-time approximation scheme (FPTAS) for neural architecture search. Bort has an effective (that is, not counting the embedding layer) size of 5.5% the original BERT-large architecture, and 16% of the net size. It is also able to be pretrained in 288 GPU hours, which is 1.2% of the time required to pretrain the highest-performing BERT parametric architectural variant, RoBERTa-large. It is also 7.9x faster than BERT-base (20x faster than BERT/RoBERTa-large) on a CPU, and performs better than other compressed variants of the architecture, and some of the non-compressed variants; it obtains an average performance improvement of between 0.3% and 31%, relative, with respect to BERT-large on multiple public natural language understanding (NLU) benchmarks.

Here are the corresponding GLUE scores on the test set:

Model Score CoLA SST-2 MRPC STS-B QQP MNLI-m MNLI-mm QNLI(v2) RTE WNLI AX
Bort 83.6 63.9 96.2 94.1/92.3 89.2/88.3 66.0/85.9 88.1 87.8 92.3 82.7 71.2 51.9
BERT-Large 80.5 60.5 94.9 89.3/85.4 87.6/86.5 72.1/89.3 86.7 85.9 92.7 70.1 65.1 39.6

And SuperGLUE scores on the test set:

Model Score BoolQ CB COPA MultiRC ReCoRD RTE WiC WSC AX-b AX-g
Bort 74.1 83.7 81.9/86.5 89.6 83.7/54.1 49.8/49.0 81.2 70.1 65.8 48.0 96.1/61.5
BERT-Large 69.0 77.4 75.7/83.6 70.6 70.0/24.1 72.0/71.3 71.7 69.6 64.4 23.0 97.8/51.7

And here are the architectural parameters:

Model Parameters (M) Layers Attention heads Hidden size Intermediate size Embedding size (M) Encoder proportion (%)
Bort 56 4 8 1024 768 39 30.3
BERT-Large 340 24 16 1024 4096 31.8 90.6

Setup:

  1. You need to install the requirements from the requirements.txt file:
pip install -r requirements.txt

This code has been tested with Python 3.6.5+. To save yourself some headache we recommend you install Horovod from source, after you install MxNet. This is only needed if you are pre-training the architecture. For this, run the following commands (you'll need a C++ compiler which supports c++11 standards, like gcc > 4.8):

    pip uninstall horovod
    HOROVOD_CUDA_HOME=/usr/local/cuda-10.1 \
    HOROVOD_WITH_MXNET=1 \
    HOROVOD_GPU_ALLREDUCE=NCCL \
    pip install horovod==0.16.2 --no-cache-dir
  1. You also need to download the model from here. If you have the AWS CLI, all you need to do is run:
aws s3 cp s3://alexa-saif-bort/bort.params model/
  1. To run the tests, you also need to download the sample text from Gluon and put it in test_data/:
wget https://github.com/dmlc/gluon-nlp/blob/v0.9.x/scripts/bert/sample_text.txt
mv sample_text.txt test_data/

Pre-training:

Bort is already pre-trained, but if you want to try out other datasets, you can follow the steps here. Note that this does not run the FPTAS described in the paper, and works for a fixed architecture (Bort).

  1. First, you will need to tokenize the pre-training text:
python create_pretraining_data.py \
            --input_file <input text> \
            --output_dir <output directory> \
            --dataset_name <dataset name> \
            --dupe_factor <duplication factor> \
            --num_outputs <number of output files>

We recommend using --dataset_name openwebtext_ccnews_stories_books_cased for the vocabulary. If your data file is too large, the script will throw out-of-memory errors. We recommend splitting it into smaller chunks and then calling the script one-by-one.

  1. Then run the pre-training distillation script:
./run_pretraining_distillation.sh <num gpus> <training data> <testing data> [optional teacher checkpoint]

Please see the contents of run_pretraining_distillation.sh for example usages and additional optional configuration. If you have installed Horovod, we highly recommend you use run_pretraining_distillation_hvd.py instead.

Fine-tuning:

  1. To fine-tune Bort, run:
./run_finetune.sh <your task here>

We recommend you play with the hyperparameters from run_finetune.sh. This code supports all the tasks outlined in the paper, but for the case of the RACE dataset, you need to download the data and extract it. The default location for extraction is ~/.mxnet/datasets/race. Same goes for SuperGLUE's MultiRC, since the Gluon implementation is the old version. You can download the data and extract it to ~/.mxnet/datasets/superglue_multirc/.

It is normal to get very odd results for the fine-tuning step, since this repository only contains the training part of Agora. However, you can easily implement your own version of that algorithm. We recommend you use the following initial set of hyperparameters, and follow the requirements described in the papers at the end of this file:

seeds={0,1,2,3,4}
learning_rates={1e-4, 1e-5, 9e-6}
weight_decays={0, 10, 100, 350}
warmup_rates={0.35, 0.40, 0.45, 0.50}
batch_sizes={8, 16}

Troubleshooting:

Dependency errors

Bort requires a rather unusual environment to run. For this reason, most of the problems regarding runtime can be fixed by installing the requirements from the requirements.txt file. Also make sure to have reinstalled Horovod as outlined above.

Script failing when downloading the data

This is inherent to the way Bort is fine-tuned, since it expects the data to be preexisting for some arbitrary implementation of Agora. You can get around that error by downloading the data before running the script, e.g.:

from data.classification import BoolQTask
task = BoolQTask()
task.dataset_train()[1]; task.dataset_val()[1]; task.dataset_test()[1]
Out-of-memory errors

While Bort is designed to be efficient in terms of the space it occupies in memory, a very large batch size or sequence length will still cause you to run out of memory. More often than ever, reducing the sequence length from 512 to 256 will solve out-of-memory issues. 80% of the time, it works every time.

Slow fine-tuning/pre-training

We strongly recommend using distributed training for both fine-tuning and pre-training. If your Horovod acts weird, remember that it needs to be built after the installation of MXNet (or any framework for that matter).

Low task-specific performance

If you observe near-random task-specific performance, that is to be expected. Bort is a rather small architecture and the optimizer/scheduler/learning rate combination is quite aggressive. We highly recommend you fine-tune Bort using an implementation of Agora. More details on how to do that are in the references below, specifically the second paper. Note that we needed to implement "replay" (i.e., re-doing some iterations of Agora) to get it to converge better.

References

If you use Bort or the other algorithms in your work, we'd love to hear from it! Also, please cite the so-called "Bort trilogy" papers:

@article{deWynterApproximation,
    title={An Approximation Algorithm for Optimal Subarchitecture Extraction},
    author={Adrian de Wynter},
    year={2020},
    eprint={2010.08512},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    journal={CoRR},
    volume={abs/2010.08512},
    url={http://arxiv.org/abs/2010.08512}
}
@article{deWynterAlgorithm,
      title={An Algorithm for Learning Smaller Representations of Models With Scarce Data},
      author={Adrian de Wynter},
      year={2020},
      eprint={2010.07990},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      journal={CoRR},
      volume={abs/2010.07990},
      url={http://arxiv.org/abs/2010.07990}
}
@article{deWynterPerryOptimal,
      title={Optimal Subarchitecture Extraction for BERT},
      author={Adrian de Wynter and Daniel J. Perry},
      year={2020},
      eprint={2010.10499},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      journal={CoRR},
      volume={abs/2010.10499},
      url={http://arxiv.org/abs/2010.10499}
}

Lastly, if you use the GLUE/SuperGLUE/RACE tasks, don't forget to give proper attribution to the original authors.

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

Owner
Alexa
Alexa
LightSeq: A High-Performance Inference Library for Sequence Processing and Generation

LightSeq is a high performance inference library for sequence processing and generation implemented in CUDA. It enables highly efficient computation of modern NLP models such as BERT, GPT2, Transform

Bytedance Inc. 2.5k Jan 03, 2023
Conditional Transformer Language Model for Controllable Generation

CTRL - A Conditional Transformer Language Model for Controllable Generation Authors: Nitish Shirish Keskar, Bryan McCann, Lav Varshney, Caiming Xiong,

Salesforce 1.7k Dec 28, 2022
HuggingTweets - Train a model to generate tweets

HuggingTweets - Train a model to generate tweets Create in 5 minutes a tweet generator based on your favorite Tweeter Make my own model with the demo

Boris Dayma 318 Jan 04, 2023
LV-BERT: Exploiting Layer Variety for BERT (Findings of ACL 2021)

LV-BERT Introduction In this repo, we introduce LV-BERT by exploiting layer variety for BERT. For detailed description and experimental results, pleas

Weihao Yu 14 Aug 24, 2022
Twitter-Sentiment-Analysis - Twitter sentiment analysis for india's top online retailers(2019 to 2022)

Twitter-Sentiment-Analysis Twitter sentiment analysis for india's top online retailers(2019 to 2022) Project Overview : Sentiment Analysis helps us to

Balaji R 1 Jan 01, 2022
A python wrapper around the ZPar parser for English.

NOTE This project is no longer under active development since there are now really nice pure Python parsers such as Stanza and Spacy. The repository w

ETS 49 Sep 12, 2022
DensePhrases provides answers to your natural language questions from the entire Wikipedia in real-time

DensePhrases provides answers to your natural language questions from the entire Wikipedia in real-time. While it efficiently searches the answers out of 60 billion phrases in Wikipedia, it is also v

Jinhyuk Lee 543 Jan 08, 2023
Cải thiện Elasticsearch trong bài toán semantic search sử dụng phương pháp Sentence Embeddings

Cải thiện Elasticsearch trong bài toán semantic search sử dụng phương pháp Sentence Embeddings Trong bài viết này mình sẽ sử dụng pretrain model SimCS

Vo Van Phuc 18 Nov 25, 2022
Free and Open Source Machine Translation API. 100% self-hosted, offline capable and easy to setup.

LibreTranslate Try it online! | API Docs | Community Forum Free and Open Source Machine Translation API, entirely self-hosted. Unlike other APIs, it d

3.4k Dec 27, 2022
A toolkit for document-level event extraction, containing some SOTA model implementations

Document-level Event Extraction via Heterogeneous Graph-based Interaction Model with a Tracker Source code for ACL-IJCNLP 2021 Long paper: Document-le

84 Dec 15, 2022
基于GRU网络的句子判断程序/A program based on GRU network for judging sentences

SentencesJudger SentencesJudger 是一个基于GRU神经网络的句子判断程序,基本的功能是判断文章中的某一句话是否为一个优美的句子。 English 如何使用SentencesJudger 确认Python运行环境 安装pyTorch与LTP python3 -m pip

8 Mar 24, 2022
Torchrecipes provides a set of reproduci-able, re-usable, ready-to-run RECIPES for training different types of models, across multiple domains, on PyTorch Lightning.

Recipes are a standard, well supported set of blueprints for machine learning engineers to rapidly train models using the latest research techniques without significant engineering overhead.Specifica

Meta Research 193 Dec 28, 2022
List of GSoC organisations with number of times they have been selected.

Welcome to GSoC Organisation Frequency And Details 👋 List of GSoC organisations with number of times they have been selected, techonologies, topics,

Shivam Kumar Jha 41 Oct 01, 2022
⚡ boost inference speed of T5 models by 5x & reduce the model size by 3x using fastT5.

Reduce T5 model size by 3X and increase the inference speed up to 5X. Install Usage Details Functionalities Benchmarks Onnx model Quantized onnx model

Kiran R 399 Jan 05, 2023
Toward Model Interpretability in Medical NLP

Toward Model Interpretability in Medical NLP LING380: Topics in Computational Linguistics Final Project James Cross ( 1 Mar 04, 2022

Community and sentiment analysis based on tweets

The project has set itself the goal of analyzing the thoughts and interaction of Italian users through the social posts expressed through the Twitter platform on the day of the entry into force of th

3 Nov 17, 2022
Paradigm Shift in NLP - "Paradigm Shift in Natural Language Processing".

Paradigm Shift in NLP Welcome to the webpage for "Paradigm Shift in Natural Language Processing". Some resources of the paper are constantly maintaine

Tianxiang Sun 41 Dec 30, 2022
This github repo is for Neurips 2021 paper, NORESQA A Framework for Speech Quality Assessment using Non-Matching References.

NORESQA: Speech Quality Assessment using Non-Matching References This is a Pytorch implementation for using NORESQA. It contains minimal code to predi

Meta Research 36 Dec 08, 2022
🤗🖼️ HuggingPics: Fine-tune Vision Transformers for anything using images found on the web.

🤗 🖼️ HuggingPics Fine-tune Vision Transformers for anything using images found on the web. Check out the video below for a walkthrough of this proje

Nathan Raw 185 Dec 21, 2022