XtremeDistil framework for distilling/compressing massive multilingual neural network models to tiny and efficient models for AI at scale

Overview

XtremeDistilTransformers for Distilling Massive Multilingual Neural Networks

ACL 2020 Microsoft Research [Paper] [Video]

Releasing [XtremeDistilTransformers] with TensorFlow 2.3 and HuggingFace Transformers under a unified API with the following features:

  • Distil any supported pre-trained language model as the teacher (e.g., BERT, Electra, RoBERTa)
  • Initialize the student model with any pre-trained model (e.g., MiniLM, DistilBERT, TinyBERT), or initialize it from scratch
  • Multilingual text classification and sequence tagging
  • Distil multiple hidden states from the teacher
  • Distil deep attention networks from the teacher
  • Pairwise and instance-level classification tasks (e.g., MNLI, MRPC, SST)
  • Progressive knowledge transfer with gradual unfreezing
  • Fast mixed-precision training for distillation (e.g., mixed_float16, mixed_bfloat16)
  • ONNX runtime inference

Install requirements: pip install -r requirements.txt

Initialize XtremeDistilTransformer with the [6/384 pre-trained checkpoint](https://huggingface.co/microsoft/xtremedistil-l6-h384-uncased) or a [TinyBERT] (4/312 pre-trained checkpoint)
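For a quick sanity check, the 6/384 student checkpoint can also be loaded directly with the HuggingFace Transformers API. This is a minimal sketch outside the XtremeDistil scripts; the exact output format depends on your transformers version:

from transformers import AutoTokenizer, TFAutoModel

# Load the pre-trained student checkpoint released with XtremeDistil.
tokenizer = AutoTokenizer.from_pretrained("microsoft/xtremedistil-l6-h384-uncased")
model = TFAutoModel.from_pretrained("microsoft/xtremedistil-l6-h384-uncased")

# Encode a sample sentence and inspect the hidden-state shape (hidden size 384).
inputs = tokenizer("XtremeDistil compresses massive multilingual models.", return_tensors="tf")
outputs = model(inputs)
print(outputs[0].shape)  # (1, seq_len, 384)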

Sample usage for distilling different pre-trained language models (tested with Python 3.6.9 and CUDA 10.2):

Training

Sequence Labeling for Wiki NER

PYTHONHASHSEED=42 python run_xtreme_distil.py \
  --task $$PT_DATA_DIR/datasets/NER \
  --model_dir $$PT_OUTPUT_DIR \
  --seq_len 32 \
  --transfer_file $$PT_DATA_DIR/datasets/NER/unlabeled.txt \
  --do_NER \
  --pt_teacher TFBertModel \
  --pt_teacher_checkpoint bert-base-multilingual-cased \
  --student_distil_batch_size 256 \
  --student_ft_batch_size 32 \
  --teacher_batch_size 128 \
  --pt_student_checkpoint microsoft/xtremedistil-l6-h384-uncased \
  --distil_chunk_size 10000 \
  --teacher_model_dir $$PT_OUTPUT_DIR \
  --distil_multi_hidden_states \
  --distil_attention \
  --compress_word_embedding \
  --freeze_word_embedding \
  --opt_policy mixed_float16

Text Classification for MNLI

PYTHONHASHSEED=42 python run_xtreme_distil.py \
  --task $$PT_DATA_DIR/glue_data/MNLI \
  --model_dir $$PT_OUTPUT_DIR \
  --seq_len 128 \
  --transfer_file $$PT_DATA_DIR/glue_data/MNLI/train.tsv \
  --do_pairwise \
  --pt_teacher TFElectraModel \
  --pt_teacher_checkpoint google/electra-base-discriminator \
  --student_distil_batch_size 128 \
  --student_ft_batch_size 32 \
  --pt_student_checkpoint microsoft/xtremedistil-l6-h384-uncased \
  --teacher_model_dir $$PT_OUTPUT_DIR \
  --teacher_batch_size 32 \
  --distil_chunk_size 300000 \
  --opt_policy mixed_float16

Alternatively, use the TinyBERT pre-trained student model checkpoint with --pt_student_checkpoint nreimers/TinyBERT_L-4_H-312_v2

Arguments


- task folder contains
	-- train/dev/test '.tsv' files with text and classification labels / token-wise tags (space-separated)
	--- Example 1: feel good about themselves <tab> 1
	--- Example 2: '' Atelocentra '' Meyrick , 1884 <tab> O B-LOC O O O O
	-- label files containing class labels for sequence labeling
	-- transfer file containing unlabeled data
	
- model_dir to store/restore model checkpoints

- task arguments
-- do_pairwise for pairwise classification tasks like MNLI and MRPC
-- do_NER for sequence labeling

- teacher arguments
-- pt_teacher for teacher model to distil (e.g., TFBertModel, TFRobertaModel, TFElectraModel)
-- pt_teacher_checkpoint for pre-trained teacher model checkpoints (e.g., bert-base-multilingual-cased, roberta-large, google/electra-base-discriminator)

- student arguments
-- pt_student_checkpoint to initialize from pre-trained small student models (e.g., MiniLM, DistilBERT, TinyBERT)
-- instead of a pre-trained checkpoint, initialize a raw student from scratch with:
--- hidden_size
--- num_hidden_layers
--- num_attention_heads

- distillation features
-- distil_multi_hidden_states to distil multiple hidden states from the teacher
-- distil_attention to distil deep attention network of the teacher
-- compress_word_embedding to initialize the student word embedding with an SVD-compressed teacher word embedding (useful for multilingual distillation; see the sketch after this list)
-- freeze_word_embedding to keep the student word embeddings frozen during distillation (useful for multilingual distillation)
-- opt_policy (e.g., mixed_float16 for GPU and mixed_bfloat16 for TPU)
-- distil_chunk_size for using transfer data in chunks during distillation (reduce it for OOM issues; checkpoints are saved after every distil_chunk_size steps)
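The compress_word_embedding option initializes the student's word embeddings from an SVD-compressed copy of the teacher's embedding matrix. The sketch below illustrates the general idea with a numpy rank truncation; the matrix sizes and variable names are hypothetical and this is not the repository's exact implementation:

import numpy as np

# Hypothetical sizes for illustration: 30k-token vocab, 768-d teacher, 384-d student.
vocab_size, teacher_hidden, student_hidden = 30522, 768, 384
teacher_emb = np.random.randn(vocab_size, teacher_hidden).astype(np.float32)  # stand-in for the real teacher embedding matrix

# Rank-truncated SVD: keep the top `student_hidden` singular directions.
U, S, Vt = np.linalg.svd(teacher_emb, full_matrices=False)
student_emb = U[:, :student_hidden] * S[:student_hidden]  # shape: (vocab_size, student_hidden)

np.save("word_embedding.npy", student_emb)  # hypothetical output path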

Model Outputs

The above training code generates intermediate model checkpoints so that training can be resumed after an abrupt termination instead of starting from scratch; all checkpoints are saved in $$PT_OUTPUT_DIR. The final output of the model consists of (i) xtremedistil.h5 with the distilled model weights, (ii) xtremedistil-config.json with the training configuration, and (iii) word_embedding.npy with the input word embeddings of the student model.
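For reference, the configuration and embedding outputs can be inspected with standard Python tooling. This is a minimal sketch; the file names are those listed above, assumed here to sit in the current directory:

import json
import numpy as np

# Training configuration used for distillation.
with open("xtremedistil-config.json") as f:
    config = json.load(f)
print(config)

# Student input word embeddings.
word_emb = np.load("word_embedding.npy")
print(word_emb.shape)  # (vocab_size, student_hidden_size)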

Prediction

PYTHONHASHSEED=42 python run_xtreme_distil_predict.py \
  --do_eval \
  --model_dir $$PT_OUTPUT_DIR \
  --do_predict \
  --pred_file ../../datasets/NER/unlabeled.txt \
  --opt_policy mixed_float16

ONNX Runtime Inference

You can also use ONNX Runtime for an inference speedup with the following script:

PYTHONHASHSEED=42 python run_xtreme_distil_predict_onnx.py \
  --do_eval \
  --model_dir $$PT_OUTPUT_DIR \
  --do_predict \
  --pred_file ../../datasets/NER/unlabeled.txt

For details on ONNX Runtime inference, environment, and arguments, refer to this Notebook. The script performs online inference with batch_size=1.
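For orientation, online inference with ONNX Runtime at batch_size=1 generally follows the pattern below. This is a hedged sketch, not the interface of run_xtreme_distil_predict_onnx.py: the exported model path and input tensor names are assumptions that depend on how the model was exported.

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

session = ort.InferenceSession("xtremedistil.onnx")  # hypothetical export path
tokenizer = AutoTokenizer.from_pretrained("microsoft/xtremedistil-l6-h384-uncased")

# Encode one example (batch_size = 1) and run it through the ONNX graph.
enc = tokenizer("feel good about themselves", max_length=32,
                padding="max_length", truncation=True, return_tensors="np")
outputs = session.run(None, {
    "input_ids": enc["input_ids"].astype(np.int64),            # assumed input name
    "attention_mask": enc["attention_mask"].astype(np.int64),  # assumed input name
})
print(outputs[0].shape)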

Continued Fine-tuning

You can continue fine-tuning the distilled/compressed student model on more labeled data with the following script:

PYTHONHASHSEED=42 python run_xtreme_distil_ft.py --model_dir $$PT_OUTPUT_DIR 

If you use this code, please cite:

@inproceedings{mukherjee-hassan-awadallah-2020-xtremedistil,
    title = "{X}treme{D}istil: Multi-stage Distillation for Massive Multilingual Models",
    author = "Mukherjee, Subhabrata  and
      Hassan Awadallah, Ahmed",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.202",
    pages = "2221--2234",
    abstract = "Deep and large pre-trained language models are the state-of-the-art for various natural language processing tasks. However, the huge size of these models could be a deterrent to using them in practice. Some recent works use knowledge distillation to compress these huge models into shallow ones. In this work we study knowledge distillation with a focus on multilingual Named Entity Recognition (NER). In particular, we study several distillation strategies and propose a stage-wise optimization scheme leveraging teacher internal representations, that is agnostic of teacher architecture, and show that it outperforms strategies employed in prior works. Additionally, we investigate the role of several factors like the amount of unlabeled data, annotation resources, model architecture and inference latency to name a few. We show that our approach leads to massive compression of teacher models like mBERT by upto 35x in terms of parameters and 51x in terms of latency for batch inference while retaining 95{\%} of its F1-score for NER over 41 languages.",
}

Code is released under the MIT license.
