Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics

Last update: Dec 22, 2022

Dataset Cartography

Code for the paper "Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics" (EMNLP 2020).

This repository contains an implementation of data maps and of other data selection baselines, along with notebooks for data map visualizations.

If you use this code, please cite:

@inproceedings{swayamdipta2020dataset,
    title={Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics},
    author={Swabha Swayamdipta and Roy Schwartz and Nicholas Lourie and Yizhong Wang and Hannaneh Hajishirzi and Noah A. Smith and Yejin Choi},
    booktitle={Proceedings of EMNLP},
    url={https://arxiv.org/abs/2009.10795},
    year={2020}
}

This repository can be used to build Data Maps, for example one for SNLI using a RoBERTa-Large classifier.

Pre-requisites

This repository is based on the HuggingFace Transformers library.

Train GLUE-style model and compute training dynamics

To train a GLUE-style model using this repository:

python -m cartography.classification.run_glue \
    -c configs/$TASK.jsonnet \
    --do_train \
    --do_eval \
    -o $MODEL_OUTPUT_DIR

The best configurations for our experiments for each of the $TASKs (SNLI, MNLI, QNLI or WINOGRANDE) are provided under configs.

This produces a training dynamics directory, $MODEL_OUTPUT_DIR/training_dynamics; see a sample here.

Note: you can use any other setup to train your model (independent of this repository), as long as you produce a dynamics_epoch_$X.jsonl file for each epoch $X. These files are needed for plotting data maps and for filtering different regions of the data. Each .jsonl file must contain the following fields for every training instance:

guid : instance ID matching that in the original data file, for filtering,
logits_epoch_$X : logits for the training instance under epoch $X,
gold : index of the gold label, must match the logits array.
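
For a custom training loop, the three fields above could be written out along the following lines. This is a minimal sketch rather than the repository's own code: the helper names write_epoch_dynamics and read_epoch_dynamics are hypothetical, and the zero-based epoch index and integer guid values are assumptions made for illustration.

```python
import json
import os
import tempfile


def write_epoch_dynamics(output_dir, epoch, records):
    """Write one JSON object per line to dynamics_epoch_<epoch>.jsonl.

    `records` is an iterable of (guid, logits, gold) tuples.
    (Hypothetical helper; not part of the repository.)
    """
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, f"dynamics_epoch_{epoch}.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for guid, logits, gold in records:
            row = {
                "guid": guid,                                      # instance ID from the original data file
                f"logits_epoch_{epoch}": [float(x) for x in logits],  # one logit per label
                "gold": int(gold),                                 # index into the logits array
            }
            f.write(json.dumps(row) + "\n")
    return path


def read_epoch_dynamics(path):
    """Load every non-empty line of a .jsonl file as a dict."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Toy data: two instances, three labels, epoch index 0 (assumed zero-based).
toy_records = [(0, [2.1, -0.3, 0.4], 0), (1, [0.2, 0.1, 1.5], 2)]
toy_dir = tempfile.mkdtemp()
toy_path = write_epoch_dynamics(toy_dir, epoch=0, records=toy_records)
toy_rows = read_epoch_dynamics(toy_path)
```

Calling write_epoch_dynamics once per epoch from the training loop yields one file per epoch in the same directory.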

Plot Data Maps

To plot data maps for a trained $MODEL (e.g. RoBERTa-Large) on a given $TASK (e.g. SNLI, MNLI, QNLI or WINOGRANDE):

python -m cartography.selection.train_dy_filtering \
    --plot \
    --task_name $TASK \
    --model_dir $PATH_TO_MODEL_OUTPUT_DIR_WITH_TRAINING_DYNAMICS \
    --model $MODEL_NAME

Data Selection

To select (different amounts of) data based on various metrics from training dynamics:

python -m cartography.selection.train_dy_filtering \
    --filter \
    --task_name $TASK \
    --model_dir $PATH_TO_MODEL_OUTPUT_DIR_WITH_TRAINING_DYNAMICS \
    --metric $METRIC \
    --data_dir $PATH_TO_GLUE_DIR_WITH_ORIGINAL_DATA_IN_TSV_FORMAT

Supported $TASKs include SNLI, QNLI, MNLI and WINOGRANDE, and $METRICs include confidence, variability, correctness, forgetfulness and threshold_closeness; see paper for more details.
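
As a rough illustration of the first three metrics, the sketch below follows the definitions in the paper: confidence is the mean probability assigned to the gold label across epochs, variability is the standard deviation of that probability, and correctness is the fraction of epochs in which the gold label is predicted. The function names are hypothetical and the code is independent of the repository's implementation, which may differ in details such as normalization; forgetfulness and threshold_closeness are not covered.

```python
import math
import statistics


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def instance_metrics(logits_per_epoch, gold):
    """Compute per-instance metrics from the logits of each epoch.

    `logits_per_epoch` is a list with one logits list per epoch;
    `gold` is the index of the gold label.
    (Hypothetical helper; a sketch of the paper's definitions.)
    """
    gold_probs = [softmax(logits)[gold] for logits in logits_per_epoch]
    correct = [
        max(range(len(logits)), key=logits.__getitem__) == gold
        for logits in logits_per_epoch
    ]
    return {
        "confidence": statistics.fmean(gold_probs),     # mean gold-label probability
        "variability": statistics.pstdev(gold_probs),   # population std across epochs
        "correctness": sum(correct) / len(correct),     # fraction of epochs predicted correctly
    }


# Toy instance: right in the first epoch, wrong in the second.
toy_metrics = instance_metrics([[2.0, 0.0], [0.0, 2.0]], gold=0)
```

In a data map, confidence and variability are the two axes, so an instance like the toy one (mid confidence, high variability) would fall in the ambiguous region.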

To select hard-to-learn instances, set $METRIC to "confidence"; for ambiguous instances, set $METRIC to "variability". For easy-to-learn instances, set $METRIC to "confidence" and add the --worst flag.
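
The selection logic described above can be pictured as a sort followed by a cut. The sketch below is an assumption inferred from this description, not the code behind train_dy_filtering: select_instances, its fraction parameter, and the default sort directions (lowest confidence first, highest variability first, reversed by worst) are all hypothetical.

```python
def select_instances(metrics_by_guid, metric, fraction, worst=False):
    """Return the guids of the top `fraction` of instances under `metric`.

    Assumed defaults: "confidence" ranks lowest first (hard-to-learn),
    any other metric ranks highest first (e.g. "variability" for ambiguous).
    `worst=True` flips the order, mirroring the --worst flag described above.
    """
    ascending = metric == "confidence"
    if worst:
        ascending = not ascending
    ranked = sorted(
        metrics_by_guid,
        key=lambda guid: metrics_by_guid[guid][metric],
        reverse=not ascending,
    )
    keep = max(1, int(fraction * len(ranked)))  # always keep at least one instance
    return ranked[:keep]


# Toy metrics for four instances (made-up values).
toy = {
    "a": {"confidence": 0.95, "variability": 0.02},
    "b": {"confidence": 0.15, "variability": 0.05},
    "c": {"confidence": 0.55, "variability": 0.40},
    "d": {"confidence": 0.45, "variability": 0.35},
}
hard_to_learn = select_instances(toy, "confidence", 0.5)
ambiguous = select_instances(toy, "variability", 0.5)
easy_to_learn = select_instances(toy, "confidence", 0.5, worst=True)
```

The returned guids can then be matched against the guid column of the original data file to write out the selected subset.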

Owner

AI2
