Load What You Need: Smaller Multilingual Transformers for Pytorch and TensorFlow 2.0.

Overview

Smaller Multilingual Transformers

This repository shares smaller versions of multilingual transformers that keep the same representations offered by the original ones. The idea came from a simple observation: after massively multilingual pretraining, not all embeddings are needed to perform finetuning and inference. In practice one would rarely require a model that supports more than 100 languages as the original mBERT. Therefore, we extracted several smaller versions that handle fewer languages. Since most of the parameters of multilingual transformers are located in the embeddings layer, our models are between 21% and 45% smaller in size.

The table bellow compares two of our exracted versions with the original mBERT. It shows the models size, memory footprint and the obtained accuracy on the XNLI dataset (Cross-lingual Transfer from english for french). These measurements have been computed on a Google Cloud n1-standard-1 machine (1 vCPU, 3.75 GB).

Model Num parameters Size Memory Accuracy
bert-base-multilingual-cased 178 million 714 MB 1400 MB 73.8
Geotrend/bert-base-15lang-cased 141 million 564 MB 1098 MB 74.1
Geotrend/bert-base-en-fr-cased 112 million 447 MB 878 MB 73.8

Reducing the size of multilingual transformers facilitates their deployment on public cloud platforms. For instance, Google Cloud Platform requires that the model size on disk should be lower than 500 MB for serveless deployments (Cloud Functions / Cloud ML).

For more information, please refer to our paper: Load What You Need.

Available Models

Until now, we generated 70 smaller models from the original mBERT cased version. These models have been uploaded to the Hugging Face Model Hub in order to facilitate their use: https://huggingface.co/Geotrend.

They can be downloaded easily using the transformers library:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-en-fr-cased")
model = AutoModel.from_pretrained("Geotrend/bert-base-en-fr-cased")

More models will be released soon.

Generating new Models

We also share a python script that allows users to generate smaller transformers by their own based on a subset of the original vocabulary (the method does not only concern multilingual transformers):

pip install -r requirements.txt

python3 reduce_model.py \
	--source_model bert-base-multilingual-cased \
	--vocab_file vocab_5langs.txt \
	--output_model bert-base-5lang-cased \
	--convert_to_tf False

Where:

  • --source_model is the multilingual transformer to reduce
  • --vocab_file is the intended vocabulary file path
  • --output_model is the name of the final reduced model
  • --convert_to_tf tells the scipt whether to generate a tenserflow version or not

How to Cite

@inproceedings{smallermbert,
  title={Load What You Need: Smaller Versions of Multilingual BERT},
  author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
  booktitle={SustaiNLP / EMNLP},
  year={2020}
}

Contact

Please contact [email protected] for any question, feedback or request.

Owner
Geotrend
Geotrend
State-Relabeling Adversarial Active Learning

State-Relabeling Adversarial Active Learning Code for SRAAL [2020 CVPR Oral] Requirements torch = 1.6.0 numpy = 1.19.1 tqdm = 4.31.1 AL Results The

10 Jul 14, 2022
Keyword spotting on Arm Cortex-M Microcontrollers

Keyword spotting for Microcontrollers This repository consists of the tensorflow models and training scripts used in the paper: Hello Edge: Keyword sp

Arm Software 1k Dec 30, 2022
Maximum Spatial Perturbation for Image-to-Image Translation (Official Implementation)

MSPC for I2I This repository is by Yanwu Xu and contains the PyTorch source code to reproduce the experiments in our CVPR2022 paper Maximum Spatial Pe

51 Dec 14, 2022
CVPR 2022 "Online Convolutional Re-parameterization"

OREPA: Online Convolutional Re-parameterization This repo is the PyTorch implementation of our paper to appear in CVPR2022 on "Online Convolutional Re

Mu Hu 121 Dec 21, 2022
A collection of papers about Transformer in the field of medical image analysis.

A collection of papers about Transformer in the field of medical image analysis.

Junyu Chen 377 Jan 05, 2023
A short code in python, Enchpyter, is able to encrypt and decrypt words as you determine, of course

Enchpyter Enchpyter is a program do encrypt and decrypt any word you want (just letters). You enter how many letters jumps and write the word, so, the

João Assalim 2 Oct 10, 2022
RoFormer_pytorch

PyTorch RoFormer 原版Tensorflow权重(https://github.com/ZhuiyiTechnology/roformer) chinese_roformer_L-12_H-768_A-12.zip (提取码:xy9x) 已经转化为PyTorch权重 chinese_r

yujun 283 Dec 12, 2022
RADIal is available now! Check the download section

Latest news: RADIal is available now! Check the download section. However, because we are currently working on the data anonymization, we provide for

valeo.ai 55 Jan 03, 2023
基于AlphaPose的TensorRT加速

1. Requirements CUDA 11.1 TensorRT 7.2.2 Python 3.8.5 Cython PyTorch 1.8.1 torchvision 0.9.1 numpy 1.17.4 (numpy版本过高会出报错 this issue ) python-package s

52 Dec 06, 2022
Little Ball of Fur - A graph sampling extension library for NetworKit and NetworkX (CIKM 2020)

Little Ball of Fur is a graph sampling extension library for Python. Please look at the Documentation, relevant Paper, Promo video and External Resour

Benedek Rozemberczki 619 Dec 14, 2022
HomeAssitant custom integration for dyson

HomeAssistant Custom Integration for Dyson This custom integration is still under development. This is a HA custom integration for dyson. There are se

Xiaonan Shen 232 Dec 31, 2022
A deep learning object detector framework written in Python for supporting Land Search and Rescue Missions.

AIR: Aerial Inspection RetinaNet for supporting Land Search and Rescue Missions AIR is a deep learning based object detection solution to automate the

Accenture 13 Dec 22, 2022
[ICCV21] Self-Calibrating Neural Radiance Fields

Self-Calibrating Neural Radiance Fields, ICCV, 2021 Project Page | Paper | Video Author Information Yoonwoo Jeong [Google Scholar] Seokjun Ahn [Google

381 Dec 30, 2022
Underwater industrial application yolov5m6

This project wins the intelligent algorithm contest finalist award and stands out from over 2000teams in China Underwater Robot Professional Contest, entering the final of China Underwater Robot Prof

8 Nov 09, 2022
[CVPR 2021] Forecasting the panoptic segmentation of future video frames

Panoptic Segmentation Forecasting Colin Graber, Grace Tsai, Michael Firman, Gabriel Brostow, Alexander Schwing - CVPR 2021 [Link to paper] We propose

Niantic Labs 44 Nov 29, 2022
A Web API for automatic background removal using Deep Learning. App is made using Flask and deployed on Heroku.

Automatic_Background_Remover A Web API for automatic background removal using Deep Learning. App is made using Flask and deployed on Heroku. 👉 https:

Gaurav 16 Oct 29, 2022
Official repository of OFA. Paper: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

Paper | Blog OFA is a unified multimodal pretrained model that unifies modalities (i.e., cross-modality, vision, language) and tasks (e.g., image gene

OFA Sys 1.4k Jan 08, 2023
Deep Image Matting implementation in PyTorch

Deep Image Matting Deep Image Matting paper implementation in PyTorch. Differences "fc6" is dropped. Indices pooling. "fc6" is clumpy, over 100 millio

Yang Liu 724 Dec 27, 2022
SpeechBrain is an open-source and all-in-one speech toolkit based on PyTorch.

The SpeechBrain Toolkit SpeechBrain is an open-source and all-in-one speech toolkit based on PyTorch. The goal is to create a single, flexible, and us

SpeechBrain 5.1k Jan 02, 2023
Animation of solving the traveling salesman problem to optimality using mixed-integer programming and iteratively eliminating sub tours

tsp-streamlit Animation of solving the traveling salesman problem to optimality using mixed-integer programming and iteratively eliminating sub tours.

4 Nov 05, 2022