The source code for Generating Training Data with Language Models: Towards Zero-Shot Language Understanding.

Overview

SuperGen

The source code for Generating Training Data with Language Models: Towards Zero-Shot Language Understanding.

Requirements

Before running, you need to first install the required packages by typing following commands (Using a virtual environment is recommended):

pip3 install -r requirements.txt

Overview

SuperGen is a Supervision Generation method for zero-shot learning on NLU tasks. Instead of training on task-specific data, SuperGen generates training data guided by label-descriptive prompts with a unidirectional language model and fine-tunes another language model on the generated data.

Training and Test Data: Our method does not use any task-specific data (e.g., original training set). We provide our generated training set and original dev set (used as the test set) of each GLUE task under the data directory: train.json files are the generated training set (after data selection); test.tsv files are the original GLUE dev set (used as the test set for evaluation purpose).
Pretraining Corpus: We provide the processed pretraining corpus (Wikipedia and OpenWebText) for generating training data for sequence-pair tasks under the pretrain_corpus directory; see the README file there for details.

Generating Training Data

The generated training set used in the paper are provided as train.json files under each task directory; you should be able to obtain very similar generated data by following the steps below:

Data Generation: The entry script for generating training data for GLUE tasks is gen_train_data.py. The basic usage is

python gen_train_data.py --task $TASK --label $LABEL --save_dir $SAVE_DIR --num_gen $NUM_GEN

You can generate training data of each label either by setting individual label name $LABEL one at a time or by setting $LABEL=all to generate data for all labels (this will still be done sequentially). You may want to set $NUM_GEN to be larger than the desired training set size, as only those texts with the highest generated probability will be used to form the final training set.

Data Selection: After generating the training data, the final training set can be constructed by running the following:

python src/gen_utils.py --task $TASK --num_select_samples $NUM_SELECT \
                        --read_dir $SAVE_DIR --save_dir $DATA_DIR

Example: We provide an example script run_gen.sh that includes the entire generation process for all GLUE tasks under the setting described in the paper.

Fine-Tuning

The entry script for fine-tuning on generated data is finetune.py. The basic usage is

python finetune.py \
    --task_name $TASK \
    --data_dir data/$TASK \
    --overwrite_output_dir \
    --do_train \
    --do_predict \
    --smooth $SM \
    --momentum $MOMENT \
    --eval_steps $INTERVAL \
    --threshold $TH \
    --reg_weight $REG \
    --temp_ensemble_rampup $RAMP \
    --model_name_or_path $MODEL \
    --max_seq_length 128 \
    --first_sent_limit 100 \
    --per_device_train_batch_size $BS \
    --learning_rate $LR \
    --num_train_epochs 3 \
    --output_dir $OUT_DIR \
    --template $TEMPLATE \
    --mapping $MAPPING \
    --warmup_ratio 0.1 \
    --save_at_last \

Example: We provide an example script run_finetune.sh with command line arguments set up for all GLUE tasks under the setting described in the paper.

Results: When using the same prompt-based fine-tuning pipeline (with the same manual prompts and label words), zero-shot SuperGen even achieves better performance than few-shot LM-BFF using 32 annotated samples per class across seven GLUE classification tasks:

Method MNLI-m/mm QQP QNLI SST-2 CoLA RTE MRPC AVG
LM-BFF 32-Sample Few-Shot 68.3/70.5 65.5 64.5 92.7 9.3 69.1 74.5 63.6
SuperGen Zero-Shot 72.3/73.8 66.1 73.3 92.8 32.7 65.3 82.2 69.4

Acknowledgement

Some scripts in this repository are adapted from COCO-LM (for COCO-LM model), LM-BFF (for prompt-based fine-tuning) and huggingface transformers (for text generation and GLUE processor/trainer).

Citations

Please cite the following paper if you find the code helpful for your research.

@article{meng2022generating,
  title={Generating Training Data with Language Models: Towards Zero-Shot Language Understanding},
  author={Meng, Yu and Huang, Jiaxin and Zhang, Yu and Han, Jiawei},
  journal={arXiv preprint arXiv:2202.04538},
  year={2022}
}
Owner
Yu Meng
Ph.D. student, Text Mining
Yu Meng
Examples of using f2py to get high-speed Fortran integrated with Python easily

f2py Examples Simple examples of using f2py to get high-speed Fortran integrated with Python easily. These examples are also useful to troubleshoot pr

Michael 35 Aug 21, 2022
CLUES: Few-Shot Learning Evaluation in Natural Language Understanding

CLUES: Few-Shot Learning Evaluation in Natural Language Understanding This repo contains the data and source code for baseline models in the NeurIPS 2

Microsoft 29 Dec 29, 2022
How to train a CNN to 99% accuracy on MNIST in less than a second on a laptop

Training a NN to 99% accuracy on MNIST in 0.76 seconds A quick study on how fast you can reach 99% accuracy on MNIST with a single laptop. Our answer

Tuomas Oikarinen 42 Dec 10, 2022
A free, multiplatform SDK for real-time facial motion capture using blendshapes, and rigid head pose in 3D space from any RGB camera, photo, or video.

mocap4face by Facemoji mocap4face by Facemoji is a free, multiplatform SDK for real-time facial motion capture based on Facial Action Coding System or

Facemoji 591 Dec 27, 2022
A playable implementation of Fully Convolutional Networks with Keras.

keras-fcn A re-implementation of Fully Convolutional Networks with Keras Installation Dependencies keras tensorflow Install with pip $ pip install git

JihongJu 202 Sep 07, 2022
Implementation for Homogeneous Unbalanced Regularized Optimal Transport

HUROT: An Homogeneous formulation of Unbalanced Regularized Optimal Transport. This repository provides code related to this preprint. This is an alph

Théo Lacombe 1 Feb 17, 2022
ECCV18 Workshops - Enhanced SRGAN. Champion PIRM Challenge on Perceptual Super-Resolution. The training codes are in BasicSR.

ESRGAN (Enhanced SRGAN) [ 🚀 BasicSR] [Real-ESRGAN] ✨ New Updates. We have extended ESRGAN to Real-ESRGAN, which is a more practical algorithm for rea

Xintao 4.7k Jan 02, 2023
RRL: Resnet as representation for Reinforcement Learning

Resnet as representation for Reinforcement Learning (RRL) is a simple yet effective approach for training behaviors directly from visual inputs. We demonstrate that features learned by standard image

Meta Research 21 Dec 07, 2022
Voice Gender Recognition

In this project it was used some different Machine Learning models to identify the gender of a voice (Female or Male) based on some specific speech and voice attributes.

Anne Livia 1 Jan 27, 2022
The Codebase for Causal Distillation for Language Models.

Causal Distillation for Language Models Zhengxuan Wu*,Atticus Geiger*, Josh Rozner, Elisa Kreiss, Hanson Lu, Thomas Icard, Christopher Potts, Noah D.

Zen 20 Dec 31, 2022
Code for DeepXML: A Deep Extreme Multi-Label Learning Framework Applied to Short Text Documents

DeepXML Code for DeepXML: A Deep Extreme Multi-Label Learning Framework Applied to Short Text Documents Architectures and algorithms DeepXML supports

Extreme Classification 49 Nov 06, 2022
Sequential model-based optimization with a `scipy.optimize` interface

Scikit-Optimize Scikit-Optimize, or skopt, is a simple and efficient library to minimize (very) expensive and noisy black-box functions. It implements

Scikit-Optimize 2.5k Jan 04, 2023
Generates all variables from your .tf files into a variables.tf file.

tfvg Generates all variables from your .tf files into a variables.tf file. It searches for every var.variable_name in your .tf files and generates a v

1 Dec 01, 2022
ObjectDetNet is an easy, flexible, open-source object detection framework

Getting started with the ObjectDetNet ObjectDetNet is an easy, flexible, open-source object detection framework which allows you to easily train, resu

5 Aug 25, 2020
Companion code for the paper "An Infinite-Feature Extension for Bayesian ReLU Nets That Fixes Their Asymptotic Overconfidence" (NeurIPS 2021)

ReLU-GP Residual (RGPR) This repository contains code for reproducing the following NeurIPS 2021 paper: @inproceedings{kristiadi2021infinite, title=

Agustinus Kristiadi 4 Dec 26, 2021
WiFi-based Multi-task Sensing

WiFi-based Multi-task Sensing Introduction WiFi-based sensing has aroused immense attention as numerous studies have made significant advances over re

zhangx289 6 Nov 24, 2022
Dynamic Environments with Deformable Objects (DEDO)

DEDO - Dynamic Environments with Deformable Objects DEDO is a lightweight and customizable suite of environments with deformable objects. It is aimed

Rika 32 Dec 22, 2022
Liquid Warping GAN with Attention: A Unified Framework for Human Image Synthesis

Liquid Warping GAN with Attention: A Unified Framework for Human Image Synthesis, including human motion imitation, appearance transfer, and novel view synthesis. Currently the paper is under review

2.3k Jan 05, 2023
Official PyTorch implementation for paper "Efficient Two-Stage Detection of Human–Object Interactions with a Novel Unary–Pairwise Transformer"

UPT: Unary–Pairwise Transformers This repository contains the official PyTorch implementation for the paper Frederic Z. Zhang, Dylan Campbell and Step

Frederic Zhang 109 Dec 20, 2022
The Fundamental Clustering Problems Suite (FCPS) summaries 54 state-of-the-art clustering algorithms, common cluster challenges and estimations of the number of clusters as well as the testing for cluster tendency.

FCPS Fundamental Clustering Problems Suite The package provides over sixty state-of-the-art clustering algorithms for unsupervised machine learning pu

9 Nov 27, 2022