Code associated with the "Data Augmentation using Pre-trained Transformer Models" paper

Last update: Dec 31, 2022

Overview

Data Augmentation using Pre-trained Transformer Models

Code associated with the Data Augmentation using Pre-trained Transformer Models paper

Code contains implementation of the following data augmentation methods

EDA (Baseline)
Backtranslation (Baseline)
CBERT (Baseline)
BERT Prepend (Our paper)
GPT-2 Prepend (Our paper)
BART Prepend (Our paper)

DataSets

In paper, we use three datasets from following resources

Low-data regime experiment setup

Run src/utils/download_and_prepare_datasets.sh file to prepare all datsets.
download_and_prepare_datasets.sh performs following steps

Download data from github
Replace numeric labels with text for STSA-2 and TREC dataset
For a given dataset, creates 15 random splits of train and dev data.

Dependencies

To run this code, you need following dependencies

Pytorch 1.5
fairseq 0.9
transformers 2.9

How to run

To run data augmentation experiment for a given dataset, run bash script in scripts folder. For example, to run data augmentation on snips dataset,

run scripts/bart_snips_lower.sh for BART experiment
run scripts/bert_snips_lower.sh for rest of the data augmentation methods

How to cite

@inproceedings{kumar-etal-2020-data,
    title = "Data Augmentation using Pre-trained Transformer Models",
    author = "Kumar, Varun  and
      Choudhary, Ashutosh  and
      Cho, Eunah",
    booktitle = "Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems",
    month = dec,
    year = "2020",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.lifelongnlp-1.3",
    pages = "18--26",
}

Contact

Please reachout to [email protected] for any questions related to this code.

License

This project is licensed under the Creative Common Attribution Non-Commercial 4.0 license.

Code associated with the "Data Augmentation using Pre-trained Transformer Models" paper

Related tags

Overview

Data Augmentation using Pre-trained Transformer Models

DataSets

Low-data regime experiment setup

Dependencies

How to run

How to cite

Contact

License

Owner

✨Fast Coreference Resolution in spaCy with Neural Networks

A relatively simple python program to generate one of those reddit text to speech videos dominating youtube.

A pytorch implementation of the ACL2019 paper "Simple and Effective Text Matching with Richer Alignment Features".

Text to speech for Vietnamese, ez to use, ez to update

ProteinBERT is a universal protein language model pretrained on ~106M proteins from the UniRef90 dataset.

Creating an Audiobook (mp3 file) using a Ebook (epub) using BeautifulSoup and Google Text to Speech

An implementation of WaveNet with fast generation

Guide to using pre-trained large language models of source code

숭실대학교 컴퓨터학부 전공종합설계프로젝트

GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

State of the art faster Natural Language Processing in Tensorflow 2.0 .

MEDIALpy: MEDIcal Abbreviations Lookup in Python

Malaya-Speech is a Speech-Toolkit library for bahasa Malaysia, powered by Deep Learning Tensorflow.

Kurumi ChatBot

Beautiful visualizations of how language differs among document types.

Code for Findings of ACL 2022 Paper "Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors"

Shared, streaming Python dict

Crowd sourced training data for Rasa NLU models

Opal-lang - A WIP programming language based on Python

Code and data accompanying Natural Language Processing with PyTorch