Contrastive Learning for Many-to-many Multilingual Neural Machine Translation(mCOLT/mRASP2), ACL2021

Related tags

Deep LearningmRASP2
Overview

Contrastive Learning for Many-to-many Multilingual Neural Machine Translation(mCOLT/mRASP2), ACL2021

The code for training mCOLT/mRASP2, a multilingual NMT training framework, implemented based on fairseq.

mRASP2: paper

mRASP: paper, code


News

We have released two versions, this version is the original one. In this implementation:

  • You should first merge all data, by pre-pending language token before each sentence to indicate the language.
  • AA/RAS muse be done off-line (before binarize), check this toolkit.

New implementation: https://github.com/PANXiao1994/mRASP2/tree/new_impl

  • Acknowledgement: This work is supported by Bytedance. We thank Chengqi for uploading all files and checkpoints.

Introduction

mRASP2/mCOLT, representing multilingual Contrastive Learning for Transformer, is a multilingual neural machine translation model that supports complete many-to-many multilingual machine translation. It employs both parallel corpora and multilingual corpora in a unified training framework. For detailed information please refer to the paper.

img.png

Pre-requisite

pip install -r requirements.txt

Training Data and Checkpoints

We release our preprocessed training data and checkpoints in the following.

Dataset

We merge 32 English-centric language pairs, resulting in 64 directed translation pairs in total. The original 32 language pairs corpus contains about 197M pairs of sentences. We get about 262M pairs of sentences after applying RAS, since we keep both the original sentences and the substituted sentences. We release both the original dataset and dataset after applying RAS.

Dataset #Pair
32-lang-pairs-TRAIN 197603294
32-lang-pairs-RAS-TRAIN 262662792
mono-split-a -
mono-split-b -
mono-split-c -
mono-split-d -
mono-split-e -
mono-split-de-fr-en -
mono-split-nl-pl-pt -
32-lang-pairs-DEV-en-centric -
32-lang-pairs-DEV-many-to-many -
Vocab -
BPE Code -

Checkpoints & Results

  • Please note that the provided checkpoint is sightly different from that in the paper. In the following sections, we report the results of the provided checkpoints.

English-centric Directions

We report tokenized BLEU in the following table. (check eval.sh for details)

6e6d-no-mono 12e12d-no-mono 12e12d
en2cs/wmt16 21.0 22.3 23.8
cs2en/wmt16 29.6 32.4 33.2
en2fr/wmt14 42.0 43.3 43.4
fr2en/wmt14 37.8 39.3 39.5
en2de/wmt14 27.4 29.2 29.5
de2en/wmt14 32.2 34.9 35.2
en2zh/wmt17 33.0 34.9 34.1
zh2en/wmt17 22.4 24.0 24.4
en2ro/wmt16 26.6 28.1 28.7
ro2en/wmt16 36.8 39.0 39.1
en2tr/wmt16 18.6 20.3 21.2
tr2en/wmt16 22.2 25.5 26.1
en2ru/wmt19 17.4 18.5 19.2
ru2en/wmt19 22.0 23.2 23.6
en2fi/wmt17 20.2 22.1 22.9
fi2en/wmt17 26.1 29.5 29.7
en2es/wmt13 32.8 34.1 34.6
es2en/wmt13 32.8 34.6 34.7
en2it/wmt09 28.9 30.0 30.8
it2en/wmt09 31.4 32.7 32.8

Unsupervised Directions

We report tokenized BLEU in the following table. (check eval.sh for details)

12e12d
en2pl/wmt20 6.2
pl2en/wmt20 13.5
en2nl/iwslt14 8.8
nl2en/iwslt14 27.1
en2pt/opus100 18.9
pt2en/opus100 29.2

Zero-shot Directions

  • row: source language
  • column: target language We report sacreBLEU in the following table.
12e12d ar zh nl fr de ru
ar - 32.5 3.2 22.8 11.2 16.7
zh 6.5 - 1.9 32.9 7.6 23.7
nl 1.7 8.2 - 7.5 10.2 2.9
fr 6.2 42.3 7.5 - 18.9 24.4
de 4.9 21.6 9.2 24.7 - 14.4
ru 7.1 40.6 4.5 29.9 13.5 -

Training

export NUM_GPU=4 && bash train_w_mono.sh ${model_config}
  • We give example of ${model_config} in ${PROJECT_REPO}/examples/configs/parallel_mono_12e12d_contrastive.yml

Inference

  • You must pre-pend the corresponding language token to the source side before binarize the test data.
${final_res_file} python3 ${repo_dir}/scripts/utils.py ${res_file} ${ref_file} || exit 1; ">
fairseq-generate ${test_path} \
    --user-dir ${repo_dir}/mcolt \
    -s ${src} \
    -t ${tgt} \
    --skip-invalid-size-inputs-valid-test \
    --path ${ckpts} \
    --max-tokens ${batch_size} \
    --task translation_w_langtok \
    ${options} \
    --lang-prefix-tok "LANG_TOK_"`echo "${tgt} " | tr '[a-z]' '[A-Z]'` \
    --max-source-positions ${max_source_positions} \
    --max-target-positions ${max_target_positions} \
    --nbest 1 | grep -E '[S|H|P|T]-[0-9]+' > ${final_res_file}
python3 ${repo_dir}/scripts/utils.py ${res_file} ${ref_file} || exit 1;

Synonym dictionaries

We use the bilingual synonym dictionaries provised by MUSE.

We generate multilingual synonym dictionaries using this script, and apply RAS using this script.

Description File Size
dep=1 synonym_dict_raw_dep1 138.0 M
dep=2 synonym_dict_raw_dep2 1.6 G
dep=3 synonym_dict_raw_dep3 2.2 G

Contact

Please contact me via e-mail [email protected] or via wechat/zhihu

Citation

Please cite as:

@inproceedings{mrasp2,
  title = {Contrastive Learning for Many-to-many Multilingual Neural Machine Translation},
  author= {Xiao Pan and
           Mingxuan Wang and
           Liwei Wu and
           Lei Li},
  booktitle = {Proceedings of ACL 2021},
  year = {2021},
}
Implementation of Diverse Semantic Image Synthesis via Probability Distribution Modeling

Diverse Semantic Image Synthesis via Probability Distribution Modeling (CVPR 2021) Paper Zhentao Tan, Menglei Chai, Dongdong Chen, Jing Liao, Qi Chu,

tzt 45 Nov 17, 2022
STEM: An approach to Multi-source Domain Adaptation with Guarantees

STEM: An approach to Multi-source Domain Adaptation with Guarantees Introduction This is the official implementation of ``STEM: An approach to Multi-s

5 Dec 19, 2022
An implementation of the proximal policy optimization algorithm

PPO Pytorch C++ This is an implementation of the proximal policy optimization algorithm for the C++ API of Pytorch. It uses a simple TestEnvironment t

Martin Huber 59 Dec 09, 2022
Sketch-Based 3D Exploration with Stacked Generative Adversarial Networks

pix2vox [Demonstration video] Sketch-Based 3D Exploration with Stacked Generative Adversarial Networks. Generated samples Single-category generation M

Takumi Moriya 232 Nov 14, 2022
SparseInst: Sparse Instance Activation for Real-Time Instance Segmentation, CVPR 2022

SparseInst 🚀 A simple framework for real-time instance segmentation, CVPR 2022 by Tianheng Cheng, Xinggang Wang†, Shaoyu Chen, Wenqiang Zhang, Qian Z

Hust Visual Learning Team 458 Jan 05, 2023
A Tensorfflow implementation of Attend, Infer, Repeat

Attend, Infer, Repeat: Fast Scene Understanding with Generative Models This is an unofficial Tensorflow implementation of Attend, Infear, Repeat (AIR)

Adam Kosiorek 82 May 27, 2022
Cross-media Structured Common Space for Multimedia Event Extraction (ACL2020)

Cross-media Structured Common Space for Multimedia Event Extraction Table of Contents Overview Requirements Data Quickstart Citation Overview The code

Manling Li 49 Nov 21, 2022
Implementation of ML models like Decision tree, Naive Bayes, Logistic Regression and many other

ML_Model_implementaion Implementation of ML models like Decision tree, Naive Bayes, Logistic Regression and many other dectree_model: Implementation o

Anshuman Dalai 3 Jan 24, 2022
EMNLP'2021: SimCSE: Simple Contrastive Learning of Sentence Embeddings

SimCSE: Simple Contrastive Learning of Sentence Embeddings This repository contains the code and pre-trained models for our paper SimCSE: Simple Contr

Princeton Natural Language Processing 2.5k Dec 29, 2022
Code for CVPR 2021 oral paper "Exploring Data-Efficient 3D Scene Understanding with Contrastive Scene Contexts"

Exploring Data-Efficient 3D Scene Understanding with Contrastive Scene Contexts The rapid progress in 3D scene understanding has come with growing dem

Facebook Research 182 Dec 30, 2022
Official implementation of NeurIPS'21: Implicit SVD for Graph Representation Learning

isvd Official implementation of NeurIPS'21: Implicit SVD for Graph Representation Learning If you find this code useful, you may cite us as: @inprocee

Sami Abu-El-Haija 16 Jan 08, 2023
"Learning and Analyzing Generation Order for Undirected Sequence Models" in Findings of EMNLP, 2021

undirected-generation-dev This repo contains the source code of the models described in the following paper "Learning and Analyzing Generation Order f

Yichen Jiang 0 Mar 25, 2022
This project deploys a yolo fastest model in the form of tflite on raspberry 3b+. The model is from another repository of mine called -Trash-Classification-Car

Deploy-yolo-fastest-tflite-on-raspberry 觉得有用的话可以顺手点个star嗷 这个项目将垃圾分类小车中的tflite模型移植到了树莓派3b+上面。 该项目主要是为了记录在树莓派部署yolo fastest tflite的流程 (之后有时间会尝试用C++部署来提升

7 Aug 16, 2022
Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition in CVPR19

2s-AGCN Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition in CVPR19 Note PyTorch version should be 0.3! For PyTor

LShi 547 Dec 26, 2022
Pytorch Implementation of Interaction Networks for Learning about Objects, Relations and Physics

Interaction-Network-Pytorch Pytorch Implementraion of Interaction Networks for Learning about Objects, Relations and Physics. Interaction Network is a

117 Nov 05, 2022
Code in conjunction with the publication 'Contrastive Representation Learning for Hand Shape Estimation'

HanCo Dataset & Contrastive Representation Learning for Hand Shape Estimation Code in conjunction with the publication: Contrastive Representation Lea

Computer Vision Group, Albert-Ludwigs-Universität Freiburg 38 Dec 13, 2022
A code implementation of AC-GC: Activation Compression with Guaranteed Convergence, in NeurIPS 2021.

Code For AC-GC: Lossy Activation Compression with Guaranteed Convergence This code is intended to be used as a supplemental material for submission to

Dave Evans 2 Nov 01, 2022
COLMAP - Structure-from-Motion and Multi-View Stereo

COLMAP About COLMAP is a general-purpose Structure-from-Motion (SfM) and Multi-View Stereo (MVS) pipeline with a graphical and command-line interface.

4.7k Jan 07, 2023
To propose and implement a multi-class classification approach to disaster assessment from the given data set of post-earthquake satellite imagery.

To propose and implement a multi-class classification approach to disaster assessment from the given data set of post-earthquake satellite imagery.

Kunal Wadhwa 2 Jan 05, 2022
An auto discord account and token generator. Automatically verifies the phone number. Works without proxy. Bypasses captcha.

JOIN DISCORD SERVER https://discord.gg/uAc3agBY FREE HCAPTCHA SOLVING API Discord-Token-Gen An auto discord token generator. Auto verifies phone numbe

3kp 271 Jan 01, 2023