Implementaion of our ACL 2022 paper Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation

Overview

Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation

This is the implementaion of our paper:

Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation
Zhiwei He*, Xing Wang, Rui Wang, Shuming Shi, Zhaopeng Tu
ACL 2022 (long paper, main conference)

We based this code heavily on the original code of XLM, MASS and Deepaicode.

Dependencies

  • Python3

  • Pytorch1.7.1

    pip3 install torch==1.7.1+cu110
  • fastBPE

  • Apex

    git clone https://github.com/NVIDIA/apex
    cd apex
    git reset --hard 0c2c6ee
    pip3 install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .

Data ready

We prepared the data following the instruction from XLM (Section III). We used their released scripts, BPE codes and vocabularies. However, there are some differences with them:

  • All available data is used, not just 5,000,000 sentences per language

  • For Romanian, we augment it with the monolingual data from WMT16.

  • Noisy sentences are removed:

    python3 filter_noisy_data.py --input all.en --lang en --output clean.en
  • For English-German, we used the processed data provided by KaiTao Song.

Considering that it can take a very long time to prepare the data, we provide the processed data for download:

Pre-trained models

We adopted the released XLM and MASS models for all language pairs. In order to better reproduce the results for MASS on En-De, we used monolingual data to continue pre-training the MASS pre-trained model for 300 epochs and selected the best model ([email protected]) by perplexity (PPL) on the validation set.

Here are pre-trained models we used:

Languages XLM MASS
English-French Model Model
English-German Model Model
English-Romanian Model Model

Model training

We provide training scripts and trained models for UNMT baseline and our approach with online self-training.

Training scripts

Train UNMT model with online self-training and XLM initialization:

cd scripts
sh run-xlm-unmt-st-ende.sh

Note: remember to modify the path variables in the header of the shell script.

Trained model

We selected the best model by BLEU score on the validation set for both directions. Therefore, we release En-X and X-En models for each experiment.

Approch XLM MASS
UNMT En-Fr Fr-En En-Fr Fr-En
En-De De-En En-De De-En
En-Ro Ro-En En-Ro Ro-En
UNMT-ST En-Fr Fr-En En-Fr Fr-En
En-De De-En En-De De-En
En-Ro Ro-En En-Ro Ro-En

Evaluation

Generate translations

Input sentences must have the same tokenization and BPE codes than the ones used in the model.

cat input.en.bpe | \
python3 translate.py \
  --exp_name translate  \
  --src_lang en --tgt_lang de \
  --model_path trained_model.pth  \
  --output_path output.de.bpe \
  --batch_size 8

Remove bpe

sed  -r 's/(@@ )|(@@ ?$)//g' output.de.bpe > output.de.tok

Evaluate

BLEU_SCRIPT_PATH=src/evaluation/multi-bleu.perl
BLEU_SCRIPT_PATH ref.de.tok < output.de.tok
Owner
hezw.tkcw
PhD student @ SJTU
hezw.tkcw
HuggingSound: A toolkit for speech-related tasks based on HuggingFace's tools

HuggingSound HuggingSound: A toolkit for speech-related tasks based on HuggingFace's tools. I have no intention of building a very complex tool here.

Jonatas Grosman 247 Dec 26, 2022
Healthsea is a spaCy pipeline for analyzing user reviews of supplementary products for their effects on health.

Welcome to Healthsea ✨ Create better access to health with spaCy. Healthsea is a pipeline for analyzing user reviews to supplement products by extract

Explosion 75 Dec 19, 2022
Fake news detector filters - Smart filter project allow to classify the quality of information and web pages

fake-news-detector-1.0 Lists, lists and more lists... Spam filter list, quality keyword list, stoplist list, top-domains urls list, news agencies webs

Memo Sim 1 Jan 04, 2022
This library is testing the ethics of language models by using natural adversarial texts.

prompt2slip This library is testing the ethics of language models by using natural adversarial texts. This tool allows for short and simple code and v

9 Dec 28, 2021
Summarization module based on KoBART

KoBART-summarization Install KoBART pip install git+https://github.com/SKT-AI/KoBART#egg=kobart Requirements pytorch==1.7.0 transformers==4.0.0 pytor

seujung hwan, Jung 148 Dec 28, 2022
Official code repository of the paper Linear Transformers Are Secretly Fast Weight Programmers.

Linear Transformers Are Secretly Fast Weight Programmers This repository contains the code accompanying the paper Linear Transformers Are Secretly Fas

Imanol Schlag 77 Dec 19, 2022
Ongoing research training transformer language models at scale, including: BERT & GPT-2

Megatron (1 and 2) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA.

NVIDIA Corporation 3.5k Dec 30, 2022
Source code for AAAI20 "Generating Persona Consistent Dialogues by Exploiting Natural Language Inference".

Generating Persona Consistent Dialogues by Exploiting Natural Language Inference Source code for RCDG model in AAAI20 Generating Persona Consistent Di

16 Oct 08, 2022
Code to use Augmented Shapiro Wilks Stopping, as well as code for the paper "Statistically Signifigant Stopping of Neural Network Training"

This codebase is being actively maintained, please create and issue if you have issues using it Basics All data files are included under losses and ea

Justin Terry 32 Nov 09, 2021
Paddle2.x version AI-Writer

Paddle2.x 版本AI-Writer 用魔改 GPT 生成网文。Tuned GPT for novel generation.

yujun 74 Jan 04, 2023
Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts

gpt-2-simple A simple Python package that wraps existing model fine-tuning and generation scripts for OpenAI's GPT-2 text generation model (specifical

Max Woolf 3.1k Jan 07, 2023
BERT Attention Analysis

BERT Attention Analysis This repository contains code for What Does BERT Look At? An Analysis of BERT's Attention. It includes code for getting attent

Kevin Clark 401 Dec 11, 2022
Poetry PEP 517 Build Backend & Core Utilities

Poetry Core A PEP 517 build backend implementation developed for Poetry. This project is intended to be a light weight, fully compliant, self-containe

Poetry 293 Jan 02, 2023
Official Stanford NLP Python Library for Many Human Languages

Official Stanford NLP Python Library for Many Human Languages

Stanford NLP 6.4k Jan 02, 2023
SentimentArcs: a large ensemble of dozens of sentiment analysis models to analyze emotion in text over time

SentimentArcs - Emotion in Text An end-to-end pipeline based on Jupyter notebooks to detect, extract, process and anlayze emotion over time in text. E

jon_chun 14 Dec 19, 2022
Client library to download and publish models and other files on the huggingface.co hub

huggingface_hub Client library to download and publish models and other files on the huggingface.co hub Do you have an open source ML library? We're l

Hugging Face 644 Jan 01, 2023
This repository contains Python scripts for extracting linguistic features from Filipino texts.

Filipino Text Linguistic Feature Extractors This repository contains scripts for extracting linguistic features from Filipino texts. The scripts were

Joseph Imperial 1 Oct 05, 2021
MASS: Masked Sequence to Sequence Pre-training for Language Generation

MASS: Masked Sequence to Sequence Pre-training for Language Generation

Microsoft 1.1k Dec 17, 2022
Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation (SIGGRAPH Asia 2021)

Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation This repository contains the implementation of the following paper: Live Speech

OldSix 575 Dec 31, 2022
Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources (NAACL-2021).

Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources Description This is the repository for the paper Unifying Cross-

Sapienza NLP group 16 Sep 09, 2022