Code for EMNLP2021 paper "Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training"

Related tags

Deep LearningVoCapXLM
Overview

VoCapXLM

Code for EMNLP2021 paper Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training

Environment

DockerFile: dancingsoul/pytorch:VoCapXLM

Manully build the sentencepiece with following command:

cd sentencepiece
mkdir build
cd build
cmake ..
make -j $(nproc)
sudo make install
sudo ldconfig -v

Data Preparation

  1. Create a folder with mkdir -p monolingual_text in the root of this project.
  2. Sample monolingual corpus for each language individually, move them to the monolingual_text directory, named after their language codes (e.g., en.txt).
  3. Sample the multilingual corpus from monolingual corpora with the following command:
python sample_multilingual_corpus.py \
    --lang_prob_path ./lang_prob_wiki.json \ 
    --input_dir ./monolingual_text/ \ 
    --output_path ./multilingual_corpus.text \
    --n_sample <n_sample> --beta <beta> --rescale

where the options are described as follows:

  • --lang_prob_path: the probability of sampling training instances from each language during pre-training, lang_prob_wiki.json is counted on Wikipedia corpus and the probabilities are rescaled with alpha=0.7 from Equation (3) in our paper.
  • --n_sample: number of sentences in the multilingual corpus where the final multilingual sentencepiece model is trained, the default value is 20000000.
  • --rescale: further rescale the probability with another value beta from Equation (2) in our paper.
  • --beta: the rescaling factor in Equation (2), the default value is 0.7.

Training Monolingual SentencePiece Models

Train monolingual sentencepiece models in different sizes to obtain vocabularies with different ALP, i.e., language-specific vocabulary capacity.

python train_mono_spm.py \
    --input_dir ./monolingual_text/ \
    --output_dir ~/monolingual_spm/ \
    --languages <all_languages> \
    --min_vocab_size <min_vocab_size> \
    --max_vocab_size <max_vocab_size> \
    --delta_vocab_size <delta_vocab_size> \
    --n_sample <n_sample>

where the options are described as follows:

  • --languages: all languages under the monolingual_text directory, separated with ,, e.g. en,fr,zh.
  • --min_vocab_size: minimum vocabulary size allocated for each language, the default value is 1000.
  • --max_vocab_size: maximum vocabulary size allocated for each language, the default value is 50000.
  • --delta_vocab_size: the value of interval to learn vocabularies, the default value is 1000.
  • --n_sample: the number of sentences to calculate ALP for each language, the default value is 1000000.

or you can download our pre-trained monolingual sentencepiece models and vocabularies from [here][2].

Allocating Multilingual Vocabulary

Allocate the multilingual vocabulary from monolingual vocabularies:

python train_vocap.py \
    --lang_prob_path ./lang_prob_wiki.json \
    --input_dir ./monolingual_spm/ \
    --output_path ./multilingual.vocab \
    --beta <beta> --rescale --target_vocab_size <target_vocab_size>

where the options are described as follows:

  • --lang_prob_path: same as the above.
  • --rescale: same as the above.
  • --beta: same as the above.
  • --target_vocab_size: the desired vocabulary size of the multilingual vocabulary, the default value is 500000.

Then Use sentencepiece to train the tokenizer given the multilingual vocabulary:

spm_train --input=./multilingual_corpus.text --model_prefix=<model_name> --vocab_size=<target_vocab_size> \
--character_coverage=0.9995 --model_type=unigram --shuffle_input_sentence=true \
--input_sentence_size=<input_sentence_size> --vocab_path=./multilingual.vocab

where the options are described as follows:

  • --model_prefix: output model name prefix. <model_name>.model and <model_name>.vocab are generated.
  • --character_coverage: amount of characters covered by the model.
  • --vocab_size: same as --target_vocab_size.
  • --vocab_path: the required subwords in the final learned tokenizer.

Paper

Please cite our paper \cite{bo2021vocapxlm} if you found the resources in the repository useful.

@inproceedings{bo2021vocapxlm,
author = {Bo Zheng, Li Dong, Shaohan Huang, Saksham Singhal, Wanxiang Che, Ting Liu, Xia Song, Furu Wei},
booktitle = {Proceedings of EMNLP 2021},
title = {{Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training}},
year = {2021}
}

Reference

  1. https://github.com/google/sentencepiece
  2. https://drive.google.com/file/d/1VttgE30xo-i1ig5xsMF_7R4AB2sA5J9F/view?usp=sharing
Owner
Bo Zheng
Bo Zheng
Codes for AAAI22 paper "Learning to Solve Travelling Salesman Problem with Hardness-Adaptive Curriculum"

Paper For more details, please see our paper Learning to Solve Travelling Salesman Problem with Hardness-Adaptive Curriculum which has been accepted a

14 Sep 30, 2022
GraphGT: Machine Learning Datasets for Graph Generation and Transformation

GraphGT: Machine Learning Datasets for Graph Generation and Transformation Dataset Website | Paper Installation Using pip To install the core environm

y6q9 50 Aug 18, 2022
DeepLab is a state-of-art deep learning system for semantic image segmentation built on top of Caffe.

DeepLab Introduction DeepLab is a state-of-art deep learning system for semantic image segmentation built on top of Caffe. It combines densely-compute

Ali 234 Nov 14, 2022
Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch

Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch

Phil Wang 12.6k Jan 09, 2023
Combinatorially Hard Games where the levels are procedurally generated

puzzlegen Implementation of two procedurally simulated environments with gym interfaces. IceSlider: the agent needs to reach and stop on the pink squa

Autonomous Learning Group 3 Jun 26, 2022
Biomarker identification for COVID-19 Severity in BALF cells Single-cell RNA-seq data

scBALF Covid-19 dataset Analysis Here is the Github page that has the codes for the bioinformatics pipeline described in the paper COVID-Datathon: Bio

Nami Niyakan 2 May 21, 2022
Improving Query Representations for DenseRetrieval with Pseudo Relevance Feedback:A Reproducibility Study.

APR The repo for the paper Improving Query Representations for DenseRetrieval with Pseudo Relevance Feedback:A Reproducibility Study. Environment setu

ielab 8 Nov 26, 2022
FAST Aiming at the problems of cumbersome steps and slow download speed of GNSS data

FAST Aiming at the problems of cumbersome steps and slow download speed of GNSS data, a relatively complete set of integrated multi-source data download terminal software fast is developed. The softw

ChangChuntao 23 Dec 31, 2022
Extreme Dynamic Classifier Chains - XGBoost for Multi-label Classification

Extreme Dynamic Classifier Chains Classifier chains is a key technique in multi-label classification, sinceit allows to consider label dependencies ef

6 Oct 08, 2022
Starter kit for getting started in the Music Demixing Challenge.

Music Demixing Challenge - Starter Kit 👉 Challenge page This repository is the Music Demixing Challenge Submission template and Starter kit! Clone th

AIcrowd 106 Dec 20, 2022
Official Pytorch implementation of "DivCo: Diverse Conditional Image Synthesis via Contrastive Generative Adversarial Network" (CVPR'21)

DivCo: Diverse Conditional Image Synthesis via Contrastive Generative Adversarial Network Pytorch implementation for our DivCo. We propose a simple ye

64 Nov 22, 2022
Emotional conditioned music generation using transformer-based model.

This is the official repository of EMOPIA: A Multi-Modal Pop Piano Dataset For Emotion Recognition and Emotion-based Music Generation. The paper has b

hung anna 96 Nov 09, 2022
Pytorch implementation of SenFormer: Efficient Self-Ensemble Framework for Semantic Segmentation

SenFormer: Efficient Self-Ensemble Framework for Semantic Segmentation Efficient Self-Ensemble Framework for Semantic Segmentation by Walid Bousselham

61 Dec 26, 2022
Omniscient Video Super-Resolution

Omniscient Video Super-Resolution This is the official code of OVSR (Omniscient Video Super-Resolution, ICCV 2021). This work is based on PFNL. Datase

36 Oct 27, 2022
g2o: A General Framework for Graph Optimization

g2o - General Graph Optimization Linux: Windows: g2o is an open-source C++ framework for optimizing graph-based nonlinear error functions. g2o has bee

Rainer Kümmerle 2.5k Dec 30, 2022
Patch SVDD for Image anomaly detection

Patch SVDD Patch SVDD for Image anomaly detection. Paper: https://arxiv.org/abs/2006.16067 (published in ACCV 2020). Original Code : https://github.co

Hong-Jeongmin 0 Dec 03, 2021
PyTorch version of the paper 'Enhanced Deep Residual Networks for Single Image Super-Resolution' (CVPRW 2017)

About PyTorch 1.2.0 Now the master branch supports PyTorch 1.2.0 by default. Due to the serious version problem (especially torch.utils.data.dataloade

Sanghyun Son 2.1k Jan 01, 2023
Forecasting directional movements of stock prices for intraday trading using LSTM and random forest

Forecasting directional movements of stock-prices for intraday trading using LSTM and random-forest https://arxiv.org/abs/2004.10178 Pushpendu Ghosh,

Pushpendu Ghosh 270 Dec 24, 2022
SASM - simple crossplatform IDE for NASM, MASM, GAS and FASM assembly languages

SASM (SimpleASM) - простая кроссплатформенная среда разработки для языков ассемблера NASM, MASM, GAS, FASM с подсветкой синтаксиса и отладчиком. В SA

Dmitriy Manushin 5.6k Jan 06, 2023
Curvlearn, a Tensorflow based non-Euclidean deep learning framework.

English | 简体中文 Why Non-Euclidean Geometry Considering these simple graph structures shown below. Nodes with same color has 2-hop distance whereas 1-ho

Alibaba 123 Dec 12, 2022