LV-BERT: Exploiting Layer Variety for BERT (Findings of ACL 2021)

Overview

[Figure: LV-BERT overview]

Introduction

This repo introduces LV-BERT, which exploits layer variety for BERT. For a detailed description and experimental results, please refer to our paper LV-BERT: Exploiting Layer Variety for BERT (Findings of ACL 2021).

Requirements

  • Python 3.6
  • TensorFlow 1.15
  • numpy
  • scikit-learn

Experiments

First, set DATA_DIR to the absolute path of the directory that will hold the datasets and models:

DATA_DIR=/path/to/data/dir

Fine-tuning

The following instructions fine-tune a pre-trained LV-BERT-small (13M parameters) on GLUE. You can refer to this Google Colab notebook for a quick example. All pre-trained models are provided in this Google Drive folder. The models are pre-trained for 1M steps with sequence length 128 to save compute. Models whose names end in _seq512 are trained for 100K more steps with sequence length 512 and are used for long-sequence tasks such as SQuAD. See our paper for more details on model performance.

  1. Create your data directory.

mkdir -p $DATA_DIR/models && cp vocab.txt $DATA_DIR/

Put the pre-trained model in the corresponding directory.

mv lv-bert_small $DATA_DIR/models/

  2. Download the GLUE data by running

python3 download_glue_data.py

  3. Set up the data by running

cd glue_data && mv CoLA cola && mv MNLI mnli && mv MRPC mrpc && mv QNLI qnli && mv QQP qqp && mv RTE rte && mv SST-2 sst && mv STS-B sts && mv diagnostic/diagnostic.tsv mnli && mkdir -p $DATA_DIR/finetuning_data && mv * $DATA_DIR/finetuning_data && cd ..

  4. Fine-tune the model by running

bash finetune.sh $DATA_DIR
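
The long shell command in the "Set up the data" step renames the GLUE task folders and moves them under $DATA_DIR/finetuning_data. For readers who prefer to see that layout spelled out, here is a minimal Python sketch of the same renaming. The function name `arrange_glue` and its arguments are hypothetical and not part of this repo; the folder mapping is copied from the command above.

```python
import shutil
from pathlib import Path

# Folder mapping copied from the shell command in the "Set up the data" step.
RENAMES = {
    "CoLA": "cola", "MNLI": "mnli", "MRPC": "mrpc", "QNLI": "qnli",
    "QQP": "qqp", "RTE": "rte", "SST-2": "sst", "STS-B": "sts",
}


def arrange_glue(glue_dir, data_dir):
    """Rename GLUE task folders and move everything to data_dir/finetuning_data.

    Hypothetical helper for illustration; the repo itself uses the shell command.
    """
    glue_dir = Path(glue_dir)
    target = Path(data_dir) / "finetuning_data"

    # Rename each task folder that is present (mv CoLA cola, ...).
    for old, new in RENAMES.items():
        if (glue_dir / old).is_dir():
            (glue_dir / old).rename(glue_dir / new)

    # Place the diagnostic set next to the MNLI data (mv diagnostic/diagnostic.tsv mnli).
    diagnostic = glue_dir / "diagnostic" / "diagnostic.tsv"
    if diagnostic.is_file() and (glue_dir / "mnli").is_dir():
        shutil.move(str(diagnostic), str(glue_dir / "mnli" / "diagnostic.tsv"))

    # Move every remaining entry into the fine-tuning data directory (mv * ...).
    target.mkdir(parents=True, exist_ok=True)
    for entry in sorted(glue_dir.iterdir()):
        shutil.move(str(entry), str(target / entry.name))
    return target
```

Calling `arrange_glue("glue_data", "/path/to/data/dir")` after the download step should leave the same layout as the shell command, assuming the download script produced the folder names listed in `RENAMES`.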

PS: (a) You can test different tasks by changing the configs in finetune.sh. (b) Some of the GLUE datasets are small, so results may vary substantially across random seeds. Following ELECTRA, we report the median of 10 fine-tuning runs from the same pre-trained model for each result.
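
To make the reporting protocol in (b) concrete, the sketch below takes the scores of several fine-tuning runs and returns their median. The helper `median_score` is hypothetical, and the scores are made-up placeholders rather than results from the paper; how you collect per-run scores from finetune.sh is left open here.

```python
import statistics


def median_score(scores):
    """Return the median of per-run scores for a single task."""
    if not scores:
        raise ValueError("need at least one fine-tuning run")
    return statistics.median(scores)


# Placeholder scores for 10 runs with different random seeds (NOT real results).
example_runs = [60.1, 58.7, 61.3, 59.4, 60.8, 57.9, 60.5, 59.9, 61.0, 58.2]
print("median over", len(example_runs), "runs:", median_score(example_runs))
```

With an even number of runs, `statistics.median` averages the two middle values, so a single unusually good or bad seed does not move the reported number much.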

Pre-training

The following instructions pre-train LV-BERT-small (13M parameters) on the OpenWebText corpus.

  1. First, download the OpenWebText pre-training corpus (12 GB).

  2. After downloading the corpus, build the pre-training dataset in TFRecord format by running

bash build_data.sh $DATA_DIR

  3. Then, pre-train the model by running

bash pretrain.sh $DATA_DIR

Bibtex

@inproceedings{yu2021lv-bert,
        author = {Yu, Weihao and Jiang, Zihang and Chen, Fei and Hou, Qibin and Feng, Jiashi},
        title = {LV-BERT: Exploiting Layer Variety for BERT},
        booktitle = {Findings of ACL},
        month = {August},
        year = {2021}
}

Reference

This repo is based on the repo ELECTRA.
