When BERT Plays the Lottery, All Tickets Are Winning

Last update: Nov 10, 2022

Overview

Large Transformer-based models were shown to be reducible to a smaller number of self-attention heads and layers. We consider this phenomenon from the perspective of the lottery ticket hypothesis, using both structured and magnitude pruning. For fine-tuned BERT, we show that (a) it is possible to find subnetworks achieving performance that is comparable with that of the full model, and (b) similarly-sized subnetworks sampled from the rest of the model perform worse. Strikingly, with structured pruning even the worst possible subnetworks remain highly trainable, indicating that most pre-trained BERT weights are potentially useful. We also study the "good" subnetworks to see if their success can be attributed to superior linguistic knowledge, but find them unstable, and not explained by meaningful self-attention patterns.
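As a rough illustration of what magnitude pruning means here, the sketch below builds a binary mask that zeroes out the smallest-magnitude weights of a flat weight list. It is a one-shot toy version in plain Python, not the code used in this repository: the function name `magnitude_mask` and the `sparsity` parameter are hypothetical, and the actual scripts may prune iteratively and operate on full BERT weight tensors.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that drops the `sparsity` fraction of weights
    with the smallest absolute value (hypothetical helper, for illustration)."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


weights = [0.5, -0.1, 0.03, -0.8, 0.2, 0.01]
mask = magnitude_mask(weights, sparsity=0.5)
# Applying the mask keeps the surviving weights and zeroes the rest.
pruned_weights = [w * m for w, m in zip(weights, mask)]
```

Structured pruning differs in that the mask covers whole components (self-attention heads or layers) rather than individual weights.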

Environment

Install the requirements in a Python 3.7.7 virtual environment.

pip install -r requirements.txt

These experiments were run in a multi-GPU environment, where some experiments and benchmarks ran in parallel. You may need to change the bash scripts to make them work in your environment.

Dataset

Download the GLUE dataset using data/download_glue.py and data/download_mnli_data.py. Follow the instructions in the data/download_glue.py docstring for MRPC.
All data for the tasks should be organized in a data/glue/task_name/ structure.
Extract the labelled data for attention pattern classification:
```
cd data
tar -xvf head_classification_data.tar.gz
```
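Before training, it can help to confirm that the data/glue/task_name/ layout is in place. The sketch below is a small optional check, not part of the repository: the helper `missing_task_dirs` is hypothetical, and the task directory names in `GLUE_TASKS` are an assumption that should be adjusted to whatever the download scripts actually create.

```python
from pathlib import Path

# Assumed task directory names; adjust to match the output of the download scripts.
GLUE_TASKS = ["CoLA", "SST-2", "MRPC", "QQP", "STS-B", "MNLI", "QNLI", "RTE", "WNLI"]


def missing_task_dirs(glue_root, tasks=GLUE_TASKS):
    """Return the task names that have no directory under glue_root."""
    root = Path(glue_root)
    return [task for task in tasks if not (root / task).is_dir()]


if __name__ == "__main__":
    # Run from the repository root so that data/glue resolves correctly.
    missing = missing_task_dirs("data/glue")
    if missing:
        print("Missing task directories:", ", ".join(missing))
    else:
        print("All expected task directories are present.")
```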

Training, Masking, and Evaluation

Switch the working directory to src (cd src), as many paths are relative to that directory.

Fine-tune BERT on the GLUE tasks

./train.sh

Obtain the masks

./find_masks.sh

Train models with the masks applied in good, random and bad settings.

./train_with_masks.sh
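To make the three settings concrete, the sketch below derives equally sized "good", "random" and "bad" head masks from a list of per-head importance scores. This is an illustration under stated assumptions, not the logic in train_with_masks.sh: the function `subnetwork_masks`, its parameters, and the choice to draw the random mask uniformly from all heads are hypothetical, and the repository may construct these masks differently.

```python
import random


def subnetwork_masks(importance, keep, seed=0):
    """Build three 0/1 masks of equal size from per-head importance scores.

    good:   the `keep` highest-scoring heads
    bad:    the `keep` lowest-scoring heads
    random: `keep` heads sampled uniformly (an assumption of this sketch)
    """
    n = len(importance)
    if not 0 <= keep <= n:
        raise ValueError("keep must be between 0 and the number of heads")

    def to_mask(chosen):
        return [1 if i in chosen else 0 for i in range(n)]

    # Head indices ordered from most to least important.
    ranked = sorted(range(n), key=lambda i: importance[i], reverse=True)
    good = set(ranked[:keep])
    bad = set(ranked[n - keep:])
    rand = set(random.Random(seed).sample(range(n), keep))
    return {"good": to_mask(good), "random": to_mask(rand), "bad": to_mask(bad)}


# Made-up importance scores for six heads, keeping two heads per mask.
masks = subnetwork_masks([0.9, 0.1, 0.4, 0.7, 0.05, 0.3], keep=2)
```

Keeping all three masks the same size is what makes the comparison between settings fair: any difference in fine-tuned performance then reflects which components were kept, not how many.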

Evaluate the trained models

./evaluate.sh

Note: These experiments were run over a period of time and later stitched together into single scripts, so it may be better to run the training and evaluation commands in them one by one.

Train the CNN classifier on raw and normed attention patterns.

python classify_attention_patterns.py
python classify_normed_patterns.py

These scripts only train the classifier.

Evaluation Analysis and Final Results

These are primarily done in Jupyter notebooks in the experiment_analysis directory. There are many experimental notebooks there. Here are the important ones, used to generate the results included in the paper.

Importance pruning Heatmaps. Ignore the final "train_subset" and "hans" settings.
Magnitude pruning Heatmap
Overlap of surviving components
Generate the random baseline
Attention Classification Patterns
Evaluation Result Comparisons and table
Statistics on mask correlation across seeds


Owner

Sai
