The code for the Subformer, from the EMNLP 2021 Findings paper: "Subformer: Exploring Weight Sharing for Parameter Efficiency in Generative Transformers", by Machel Reid, Edison Marrese-Taylor, and Yutaka Matsuo

Overview

Subformer

This repository contains the code for the Subformer. To help overcome this we propose the Subformer, allowing us to retain performance while reducing parameters in generative Transformers from 25% ~ 70%. The Subformer consists of the following two techniques:

  1. Sandwich-style parameter sharing, in which we share all the layers in a block except the first and last. This allows us the use the central shared layers --"sandwich module" -- as a large representation learner (similar to BERT vs ALBERT) while the input and output model layers are able to focus on more specific representations for token prediction/generation while maintaining performance.
  2. For our sequence to sequence tasks, we also introduce SAFE (self-attentive factorized embeddings), which help us reduce embedding parameters significantly, while still retaining performance.

If you used this code or found our work useful, please cite:

@inproceedings{reid2021subformer,
    title = {{S}ubformer: {E}xploring {W}eight {S}haring for {P}arameter {E}fficiency in {G}enerative {T}ransformers},
    author = {Machel Reid and Edison Marrese-Taylor and Yutaka Matsuo},
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
}

Requirements and Installation

(As this code is based on fairseq, some installation instructions are taken straight from their README)

  • PyTorch version >= 1.5.0
  • Python version >= 3.6
  • For training new models, you'll also need an NVIDIA GPU and NCCL
  • To install and develop locally:
git clone https://github.com/machelreid/subformer
cd subformer
pip install --e ./

# on MacOS:
# CFLAGS="-stdlib=libc++" pip install --editable ./
  • For faster training install NVIDIA's apex library:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./
  • For large datasets install PyArrow: pip install pyarrow
  • If you use Docker make sure to increase the shared memory size either with --ipc=host or --shm-size as command line options to nvidia-docker run .

Training

Machine Translation

python train.py $DATA_BIN --arch transformer_wmt_en_de \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --lr 5e-4 \
    --warmup-init-lr 1e-7 --stop-min-lr 1e-9 --lr-scheduler inverse_sqrt --warmup-updates 10000 \
    --optimizer adam --adam-betas '(0.9, 0.999)' --adam-eps 1e-6 --task translation \
    --max-tokens 8192 --weight-decay 0.01 --dropout 0.2 --encoder-layers 6 --encoder-embed-dim 512 \
    --decoder-layers 6 --decoder-embed-dim 512 --fp16 --max-source-positions 10000 \
    --max-target-positions 10000 --max-update 200000 --seed 1 \
    --save-dir $CHECKPOINT_DIR --share-all-embeddings \
    --share-encoder-parameters-sandwich --share-decoder-parameters-sandwich \ #for sandwich-style parameter sharing
    --reduction-dim 320 #for SAFE embeddings

Generation

python generate.py --path $CHECKPOINT --gen-subset $SPLIT --beam 5 --lenpen $LENPEN --batch-size 400 --remove-bpe

CNN-DM Summarization

fairseq-train $DATA_BIN \
   --share-decoder-input-output-embed \
   --max-update 30000 \
   --optimizer adam --adam-betas '(0.9, 0.98)' --skip-invalid-size-inputs-valid-test \
   --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 10000 --lr 0.0005 \
   --stop-min-lr 1e-09 --clip-norm 0.1 --dropout 0.3 --weight-decay 0.0 \
   --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --update-freq 7 --attention-dropout 0.2 \
   --max-tokens 8192 --arch transformer_wmt_en_de --seed 1 --warmup-init-lr 1e-7 \
   --source-lang source_bpe --target-lang target_bpe --save-dir $CHECKPOINT_DIR --no-epoch-checkpoints --keep-best-checkpoints 10 --truncate-source --max-source-positions 512 --share-encoder-parameters-sandwich --share-decoder-parameters-sandwich --sandwich-embed-dim 1024 --sandwich-ffn-embed-dim 3072 --reduction-dim 256

Generation

fairseq-generate $DATA_BIN --task translation --gen-subset $SPLIT --batch-size 32 --path $CHECKPOINT --remove-bpe  --min-len 55 --beam 5 --max-len-b 140 --no-repeat-ngram-size 3 --lenpen $LENPEN -s source_bpe -t target_bpe --truncate-source --max-source-positions 512

Note that the min,max len parameters can be tuned for better performance

For post processing and ROUGE calculation feel free to take a look at this.

Citation

Please cite as:

@inproceedings{reid2021subformer,
    title = {{S}ubformer: {E}xploring {W}eight {S}haring for {P}arameter {E}fficiency in {G}enerative {T}ransformers},
    author = {Machel Reid and Edison Marrese-Taylor and Yutaka Matsuo},
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
}
Owner
Machel Reid
Researcher at University of Tokyo. Research Intern at CMU. Masason Foundation Scholar. Won the Rakuten Hackathon 2018.
Machel Reid
I label phrases on a scale of five values: negative, somewhat negative, neutral, somewhat positive, positive

I label phrases on a scale of five values: negative, somewhat negative, neutral, somewhat positive, positive. Obstacles like sentence negation, sarcasm, terseness, language ambiguity, and many others

1 Jan 13, 2022
GooAQ 🥑 : Google Answers to Google Questions!

This repository contains the code/data accompanying our recent work on long-form question answering.

AI2 112 Nov 06, 2022
edge-SR: Super-Resolution For The Masses

edge-SR: Super Resolution For The Masses Citation Pablo Navarrete Michelini, Yunhua Lu and Xingqun Jiang. "edge-SR: Super-Resolution For The Masses",

Pablo 40 Nov 10, 2022
LeBenchmark: a reproducible framework for assessing SSL from speech

LeBenchmark: a reproducible framework for assessing SSL from speech

11 Nov 30, 2022
Deep learning for NLP crash course at ABBYY.

Deep NLP Course at ABBYY Deep learning for NLP crash course at ABBYY. Suggested textbook: Neural Network Methods in Natural Language Processing by Yoa

Dan Anastasyev 597 Dec 18, 2022
A toolkit for document-level event extraction, containing some SOTA model implementations

Document-level Event Extraction via Heterogeneous Graph-based Interaction Model with a Tracker Source code for ACL-IJCNLP 2021 Long paper: Document-le

84 Dec 15, 2022
skweak: A software toolkit for weak supervision applied to NLP tasks

Labelled data remains a scarce resource in many practical NLP scenarios. This is especially the case when working with resource-poor languages (or text domains), or when using task-specific labels wi

Norsk Regnesentral (Norwegian Computing Center) 850 Dec 28, 2022
Maha is a text processing library specially developed to deal with Arabic text.

An Arabic text processing library intended for use in NLP applications Maha is a text processing library specially developed to deal with Arabic text.

Mohammad Al-Fetyani 184 Nov 27, 2022
🚀 RocketQA, dense retrieval for information retrieval and question answering, including both Chinese and English state-of-the-art models.

In recent years, the dense retrievers based on pre-trained language models have achieved remarkable progress. To facilitate more developers using cutt

475 Jan 04, 2023
A text file containing 479k English words for all your dictionary/word-based projects e.g: auto-completion / autosuggestion

List Of English Words A text file containing over 466k English words. While searching for a list of english words (for an auto-complete tutorial) I fo

dwyl 8.5k Jan 03, 2023
Training and evaluation codes for the BertGen paper (ACL-IJCNLP 2021)

BERTGEN This repository is the implementation of the paper "BERTGEN: Multi-task Generation through BERT" (https://arxiv.org/abs/2106.03484). The codeb

<a href=[email protected]"> 9 Oct 26, 2022
🌐 Translation microservice powered by AI

Dot Translate 🌐 A microservice for quick and local translation using A.I. This service starts a local webserver used for neural machine translation.

Dot HQ 48 Nov 22, 2022
Large-scale pretraining for dialogue

A State-of-the-Art Large-scale Pretrained Response Generation Model (DialoGPT) This repository contains the source code and trained model for a large-

Microsoft 1.8k Jan 07, 2023
Code Implementation of "Learning Span-Level Interactions for Aspect Sentiment Triplet Extraction".

Span-ASTE: Learning Span-Level Interactions for Aspect Sentiment Triplet Extraction ***** New March 31th, 2022: Scikit-Style API for Easy Usage *****

Chia Yew Ken 111 Dec 23, 2022
Lattice methods in TensorFlow

TensorFlow Lattice TensorFlow Lattice is a library that implements constrained and interpretable lattice based models. It is an implementation of Mono

504 Dec 20, 2022
🤗 Transformers: State-of-the-art Natural Language Processing for Pytorch, TensorFlow, and JAX.

English | 简体中文 | 繁體中文 State-of-the-art Natural Language Processing for Jax, PyTorch and TensorFlow 🤗 Transformers provides thousands of pretrained mo

Hugging Face 77.2k Jan 03, 2023
A Streamlit web app that generates Rick and Morty stories using GPT2.

Rick and Morty Story Generator This project uses a pre-trained GPT2 model, which was fine-tuned on Rick and Morty transcripts, to generate new stories

₸ornike 33 Oct 13, 2022
This is a MD5 password/passphrase brute force tool

CROWES-PASS-CRACK-TOOl This is a MD5 password/passphrase brute force tool How to install: Do 'git clone https://github.com/CROW31/CROWES-PASS-CRACK-TO

9 Mar 02, 2022
jiant is an NLP toolkit

jiant is an NLP toolkit The multitask and transfer learning toolkit for natural language processing research Why should I use jiant? jiant supports mu

ML² AT CILVR 1.5k Jan 04, 2023
The guide to tackle with the Text Summarization

The guide to tackle with the Text Summarization

Takahiro Kubo 1.2k Dec 30, 2022