EMNLP 2021 - Frustratingly Simple Pretraining Alternatives to Masked Language Modeling

Overview

This is the official implementation for "Frustratingly Simple Pretraining Alternatives to Masked Language Modeling" (EMNLP 2021).

Requirements

  • torch
  • transformers
  • datasets
  • scikit-learn
  • tensorflow
  • spacy

How to pre-train

1. Clone this repository

git clone https://github.com/gucci-j/light-transformer-emnlp2021.git

2. Install required packages

cd ./light-transformer-emnlp2021
pip install -r requirements.txt

requirements.txt is located in the root directory of light-transformer-emnlp2021.

We also need spaCy's en_core_web_sm for preprocessing. If you have not installed this model, please run python -m spacy download en_core_web_sm.
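
Before preprocessing, it can help to confirm that the spaCy model is available. The following is a minimal sketch, assuming that models installed with python -m spacy download are importable as ordinary Python packages. The helper name is_package_installed is hypothetical and not part of this repository.

```python
import importlib.util


def is_package_installed(name: str) -> bool:
    """Return True if a top-level package called `name` can be imported."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    model = "en_core_web_sm"
    if is_package_installed(model):
        print(f"{model} is installed.")
    else:
        print(f"{model} is missing; run: python -m spacy download {model}")
```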

3. Preprocess datasets

cd ./src/utils
python preprocess_roberta.py --path=/path/to/save/data/

You need to specify the following argument:

  • path: (str) Directory in which to save the processed data.

4. Pre-training

You need to specify configs as command-line arguments. Sample configs for pre-training with MLM are shown below. Running python pretrainer.py --help displays help messages.

cd ../
python pretrainer.py \
--data_dir=/path/to/dataset/ \
--do_train \
--learning_rate=1e-4 \
--weight_decay=0.01 \
--adam_epsilon=1e-8 \
--max_grad_norm=1.0 \
--num_train_epochs=1 \
--warmup_steps=12774 \
--save_steps=12774 \
--seed=42 \
--per_device_train_batch_size=16 \
--logging_steps=100 \
--output_dir=/path/to/save/weights/ \
--overwrite_output_dir \
--logging_dir=/path/to/save/log/files/ \
--disable_tqdm=True \
--prediction_loss_only \
--fp16 \
--mlm_prob=0.15 \
--pretrain_model=RobertaForMaskedLM

  • pretrain_model should be selected from:
    • RobertaForMaskedLM (MLM)
    • RobertaForShuffledWordClassification (Shuffle)
    • RobertaForRandomWordClassification (Random)
    • RobertaForShuffleRandomThreeWayClassification (Shuffle+Random)
    • RobertaForFourWayTokenTypeClassification (Token Type)
    • RobertaForFirstCharPrediction (First Char)
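
When launching several pre-training runs from a script, the class names above are easy to mistype. The sketch below collects them in one place and rejects unknown values before a run starts. The dictionary and the helper pretrain_model_flag are illustrative assumptions and not part of pretrainer.py; only the class names and objective labels come from the list above.

```python
# Class names and objective labels as listed in this README.
PRETRAIN_MODELS = {
    "RobertaForMaskedLM": "MLM",
    "RobertaForShuffledWordClassification": "Shuffle",
    "RobertaForRandomWordClassification": "Random",
    "RobertaForShuffleRandomThreeWayClassification": "Shuffle+Random",
    "RobertaForFourWayTokenTypeClassification": "Token Type",
    "RobertaForFirstCharPrediction": "First Char",
}


def pretrain_model_flag(name: str) -> str:
    """Return the --pretrain_model flag, rejecting names not listed above."""
    if name not in PRETRAIN_MODELS:
        valid = ", ".join(PRETRAIN_MODELS)
        raise ValueError(f"Unknown pretrain_model {name!r}; choose from: {valid}")
    return f"--pretrain_model={name}"


print(pretrain_model_flag("RobertaForShuffledWordClassification"))
```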

Check the pre-training process

You can monitor the progress of pre-training via TensorBoard. Simply run the following:

tensorboard --logdir=/path/to/log/dir/

Distributed training

pretrainer.py is compatible with distributed training. Sample configs for pre-training MLM are as follows.

python -m torch.distributed.launch \
--nproc_per_node=8 \
pretrainer.py \
--data_dir=/path/to/dataset/ \
--model_path=None \
--do_train \
--learning_rate=5e-5 \
--weight_decay=0.01 \
--adam_epsilon=1e-8 \
--max_grad_norm=1.0 \
--num_train_epochs=1 \
--warmup_steps=24000 \
--save_steps=1000 \
--seed=42 \
--per_device_train_batch_size=8 \
--logging_steps=100 \
--output_dir=/path/to/save/weights/ \
--overwrite_output_dir \
--logging_dir=/path/to/save/log/files/ \
--disable_tqdm \
--prediction_loss_only \
--fp16 \
--mlm_prob=0.15 \
--pretrain_model=RobertaForMaskedLM 

For more details about launch.py, please refer to https://github.com/pytorch/pytorch/blob/master/torch/distributed/launch.py.
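
The two sample commands use different per-device batch sizes and learning rates, so it may be useful to compare the number of sequences per optimizer update. The sketch below does this arithmetic under two assumptions that are not stated in this README: the single-process command runs on one GPU, and gradient accumulation is left at 1. The function name effective_batch_size is hypothetical.

```python
def effective_batch_size(per_device: int, num_processes: int, grad_accum: int = 1) -> int:
    """Number of sequences contributing to each optimizer update."""
    return per_device * num_processes * grad_accum


# Single-process sample: --per_device_train_batch_size=16 (one GPU assumed)
single = effective_batch_size(per_device=16, num_processes=1)

# Distributed sample: --nproc_per_node=8 and --per_device_train_batch_size=8
distributed = effective_batch_size(per_device=8, num_processes=8)

print(single, distributed)  # 16 64
```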

Mixed precision training

Installation

  • For PyTorch version >= 1.6, mixed precision training is supported natively.
  • For older versions, NVIDIA apex must be installed.
    • You might encounter some errors when installing apex due to permission problems. To fix these, run export TMPDIR='/path/to/your/favourite/dir/' and change permissions of all files under apex/.git/ to 777.
    • You also need to choose an optimisation level (opt_level) from https://nvidia.github.io/apex/amp.html.

Usage

To use mixed precision during pre-training, just specify --fp16 as an input argument. For older PyTorch versions, also set --fp16_opt_level to one of O0, O1, O2, or O3.
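
The rule above can be sketched as a small helper that picks the flags from a PyTorch version string. This is an illustration only: the function fp16_args is hypothetical, and the default level O1 is an assumption chosen for the example rather than a recommendation from this repository.

```python
import re


def fp16_args(torch_version: str, opt_level: str = "O1") -> list:
    """Build mixed precision flags for pretrainer.py from a PyTorch version string."""
    if opt_level not in {"O0", "O1", "O2", "O3"}:
        raise ValueError(f"Unsupported opt level: {opt_level}")
    # Keep only the leading "major.minor" part, e.g. "1.7.1+cu110" -> (1, 7).
    match = re.match(r"(\d+)\.(\d+)", torch_version)
    if match is None:
        raise ValueError(f"Cannot parse version: {torch_version}")
    major, minor = int(match.group(1)), int(match.group(2))
    if (major, minor) >= (1, 6):
        return ["--fp16"]  # native mixed precision
    return ["--fp16", f"--fp16_opt_level={opt_level}"]  # apex is required here


print(fp16_args("1.7.1+cu110"))
print(fp16_args("1.5.0", opt_level="O2"))
```

In practice the version string would come from torch.__version__.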

How to fine-tune

GLUE

  1. Download GLUE data

    git clone https://github.com/huggingface/transformers
    python transformers/utils/download_glue_data.py
    
  2. Create a JSON config file

    You need to create a .json file for configuration or use command-line arguments.

    {
        "model_name_or_path": "/path/to/pretrained/weights/",
        "tokenizer_name": "roberta-base",
        "task_name": "MNLI",
        "do_train": true,
        "do_eval": true,
        "data_dir": "/path/to/MNLI/dataset/",
        "max_seq_length": 128,
        "learning_rate": 2e-5,
        "num_train_epochs": 3, 
        "per_device_train_batch_size": 32,
        "per_device_eval_batch_size": 128,
        "logging_steps": 500,
        "logging_first_step": true,
        "save_steps": 1000,
        "save_total_limit": 2,
        "evaluate_during_training": true,
        "output_dir": "/path/to/save/models/",
        "overwrite_output_dir": true,
        "logging_dir": "/path/to/save/log/files/",
        "disable_tqdm": true
    }

    Set task_name to one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, and WNLI, and point data_dir at the directory of the same task.

  3. Fine-tune

    python run_glue.py /path/to/json/
    

    Instead of specifying a JSON path, you can pass the configs directly as command-line arguments.
    You can also monitor training via TensorBoard.
    The --help option displays a help message.
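
The JSON config from step 2 can also be generated with a short script, which is convenient when fine-tuning on several GLUE tasks in turn. The sketch below mirrors the sample values shown above. The helper make_glue_config is hypothetical, and the layout glue_root/<task_name> for data_dir is an assumption about where the downloaded data lives.

```python
import json
from pathlib import Path

GLUE_TASKS = ("CoLA", "SST-2", "MRPC", "STS-B", "QQP", "MNLI", "QNLI", "RTE", "WNLI")


def make_glue_config(task_name, glue_root, weights_dir, output_dir, log_dir):
    """Return a config dict that mirrors the sample JSON for one GLUE task."""
    if task_name not in GLUE_TASKS:
        raise ValueError(f"task_name must be one of {GLUE_TASKS}, got {task_name!r}")
    return {
        "model_name_or_path": weights_dir,
        "tokenizer_name": "roberta-base",
        "task_name": task_name,
        "do_train": True,
        "do_eval": True,
        # Assumed layout: one sub-directory per task under glue_root.
        "data_dir": (Path(glue_root) / task_name).as_posix(),
        "max_seq_length": 128,
        "learning_rate": 2e-5,
        "num_train_epochs": 3,
        "per_device_train_batch_size": 32,
        "per_device_eval_batch_size": 128,
        "logging_steps": 500,
        "logging_first_step": True,
        "save_steps": 1000,
        "save_total_limit": 2,
        "evaluate_during_training": True,
        "output_dir": output_dir,
        "overwrite_output_dir": True,
        "logging_dir": log_dir,
        "disable_tqdm": True,
    }


config = make_glue_config(
    "RTE",
    "/path/to/glue_data",
    "/path/to/pretrained/weights/",
    "/path/to/save/models/",
    "/path/to/save/log/files/",
)
print(json.dumps(config, indent=4))
```

Writing the printed output to a .json file gives a config that can be passed to run_glue.py as in step 3.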

SQuAD

  1. Download SQuAD data

    cd ./utils
    python download_squad_data.py --save_dir=/path/to/squad/
    
  2. Fine-tune

    cd ..
    export SQUAD_DIR=/path/to/squad/
    python run_squad.py \
    --model_type roberta \
    --model_name_or_path=/path/to/pretrained/weights/ \
    --tokenizer_name roberta-base \
    --do_train \
    --do_eval \
    --do_lower_case \
    --data_dir=$SQUAD_DIR \
    --train_file $SQUAD_DIR/train-v1.1.json \
    --predict_file $SQUAD_DIR/dev-v1.1.json \
    --per_gpu_train_batch_size 16 \
    --per_gpu_eval_batch_size 32 \
    --learning_rate 3e-5 \
    --weight_decay=0.01 \
    --warmup_steps=3327 \
    --num_train_epochs 10.0 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --logging_steps=278 \
    --save_steps=50000 \
    --patience=5 \
    --objective_type=maximize \
    --metric_name=f1 \
    --overwrite_output_dir \
    --evaluate_during_training \
    --output_dir=/path/to/save/weights/ \
    --logging_dir=/path/to/save/logs/ \
    --seed=42 
    

    As with pre-training, you can monitor fine-tuning via TensorBoard.
    The --help option displays a help message.
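
The flags --patience, --objective_type, and --metric_name in the command above suggest early stopping on a validation metric. The sketch below shows one common way such a rule works; it is an assumption about the semantics of these flags, not the code in run_squad.py, and the class EarlyStopping and the F1 values are made up for illustration.

```python
class EarlyStopping:
    """Track a validation metric and signal when it stops improving."""

    def __init__(self, patience: int = 5, objective_type: str = "maximize"):
        if objective_type not in {"maximize", "minimize"}:
            raise ValueError(f"Unknown objective_type: {objective_type}")
        self.patience = patience
        self.maximize = objective_type == "maximize"
        self.best = None
        self.bad_evals = 0  # consecutive evaluations without improvement

    def step(self, value: float) -> bool:
        """Record one evaluation; return True when training should stop."""
        if self.best is None:
            improved = True
        elif self.maximize:
            improved = value > self.best
        else:
            improved = value < self.best
        if improved:
            self.best = value
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


stopper = EarlyStopping(patience=2, objective_type="maximize")
for f1 in [80.1, 82.3, 82.0, 81.9, 83.0]:  # made-up F1 scores
    if stopper.step(f1):
        print(f"Stopping; best f1 = {stopper.best}")
        break
```

With patience=2, the loop stops after the second evaluation that fails to beat the best score so far, so the final value in the list is never reached.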

Citation

@inproceedings{yamaguchi-etal-2021-frustratingly,
    title = "Frustratingly Simple Pretraining Alternatives to Masked Language Modeling",
    author = "Yamaguchi, Atsuki  and
      Chrysostomou, George  and
      Margatina, Katerina  and
      Aletras, Nikolaos",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2021",
    publisher = "Association for Computational Linguistics",
}

License

MIT License
