Implementation of our ACL 2022 paper "Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation"

Overview


This is the implementation of our paper:

Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation
Zhiwei He*, Xing Wang, Rui Wang, Shuming Shi, Zhaopeng Tu
ACL 2022 (long paper, main conference)

This code is based heavily on the original code of XLM, MASS, and Deepaicode.

Dependencies

  • Python3

  • PyTorch 1.7.1

    pip3 install torch==1.7.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html
  • fastBPE (see the build sketch after this list)

  • Apex

    git clone https://github.com/NVIDIA/apex
    cd apex
    git reset --hard 0c2c6ee
    pip3 install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
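
fastBPE ships as C++ source. A typical build, following the upstream fastBPE README (the install location is up to you):

    git clone https://github.com/glample/fastBPE
    cd fastBPE
    g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast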

Data preparation

We prepared the data following the instructions from XLM (Section III), using their released scripts, BPE codes, and vocabularies. However, our preprocessing differs in a few ways:

  • All available data is used, not just 5,000,000 sentences per language

  • For Romanian, we augment the training data with the monolingual data from WMT16.

  • Noisy sentences are removed (an illustrative sketch follows this list):

    python3 filter_noisy_data.py --input all.en --lang en --output clean.en
  • For English-German, we used the processed data provided by Kaitao Song.
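
On the noise filtering above: the exact rules live in filter_noisy_data.py (which also takes a --lang argument). As a rough illustration only, with heuristics that are our assumption rather than the script's actual logic, such a filter typically drops empty, overly long, or mostly non-alphabetic lines:

    # Illustrative noise filter; filter_noisy_data.py may use different rules.
    import argparse

    def is_noisy(line: str, max_words: int = 100) -> bool:
        """Flag lines that are empty, overly long, or mostly non-alphabetic."""
        words = line.split()
        if not words or len(words) > max_words:
            return True
        alpha = sum(ch.isalpha() for ch in line)
        return alpha / len(line) < 0.5  # markup, tables, URLs, etc.

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("--input", required=True)
        parser.add_argument("--output", required=True)
        args = parser.parse_args()
        with open(args.input, encoding="utf-8") as fin, \
             open(args.output, "w", encoding="utf-8") as fout:
            for line in fin:
                if not is_noisy(line.strip()):
                    fout.write(line)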

Since preparing the data can take a very long time, we provide the processed data for download:

Pre-trained models

We adopted the released XLM and MASS models for all language pairs. In order to better reproduce the results for MASS on En-De, we used monolingual data to continue pre-training the MASS pre-trained model for 300 epochs and selected the best model ([email protected]) by perplexity (PPL) on the validation set.

Here are the pre-trained models we used:

Languages          XLM     MASS
English-French     Model   Model
English-German     Model   Model
English-Romanian   Model   Model

Model training

We provide training scripts and trained models for the UNMT baseline and for our approach with online self-training.
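
For context, online self-training adds one step to the standard back-translation loop: besides training on (synthetic source, natural target) pairs, the model is also trained on (natural source, its own translation) pairs, so that training inputs resemble the natural inputs seen at inference. A conceptual sketch with stub callables, not the repository's actual training code:

    # Conceptual sketch only; the real loop lives in the XLM/MASS-based trainer.
    def unmt_step(translate, train, batch_src, batch_tgt):
        """One UNMT update over two monolingual batches.

        translate(batch, direction) -> synthetic translation of `batch`
        train(src, tgt, direction)  -> one gradient step on the pair
        """
        # Standard back-translation: map synthetic inputs back to the
        # natural sentences they came from.
        synth_tgt = translate(batch_src, "src->tgt")
        train(synth_tgt, batch_src, "tgt->src")
        synth_src = translate(batch_tgt, "tgt->src")
        train(synth_src, batch_tgt, "src->tgt")

        # Online self-training: also train on (natural input, model's own
        # translation), narrowing the train/inference data gap.
        train(batch_src, synth_tgt, "src->tgt")
        train(batch_tgt, synth_src, "tgt->src")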

Training scripts

Train a UNMT model with online self-training and XLM initialization (En-De example):

cd scripts
sh run-xlm-unmt-st-ende.sh

Note: remember to modify the path variables in the header of the shell script.
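
The variable names below are illustrative, not necessarily those used in run-xlm-unmt-st-ende.sh; the header typically points at the processed data, the pre-trained checkpoint, and an output directory:

    # Illustrative only; use the variable names actually defined in the script.
    DATA_PATH=/path/to/processed/de-en     # output of the data preparation step
    PRETRAINED=/path/to/xlm_ende.pth       # XLM or MASS checkpoint
    DUMP_PATH=/path/to/experiments         # where checkpoints and logs are written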

Trained models

We selected the best model by BLEU score on the validation set for both directions. Therefore, we release En-X and X-En models for each experiment.

Approach   XLM            MASS
UNMT       En-Fr  Fr-En   En-Fr  Fr-En
           En-De  De-En   En-De  De-En
           En-Ro  Ro-En   En-Ro  Ro-En
UNMT-ST    En-Fr  Fr-En   En-Fr  Fr-En
           En-De  De-En   En-De  De-En
           En-Ro  Ro-En   En-Ro  Ro-En

Evaluation

Generate translations

Input sentences must use the same tokenization and BPE codes as the ones used to train the model.

cat input.en.bpe | \
python3 translate.py \
  --exp_name translate  \
  --src_lang en --tgt_lang de \
  --model_path trained_model.pth  \
  --output_path output.de.bpe \
  --batch_size 8
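
If you start from raw text, first tokenize it and apply the same BPE codes and vocabulary that the model was trained with. A sketch using the Moses tokenizer and fastBPE, with illustrative file names:

    # Illustrative preprocessing; use the codes/vocab released with the model.
    cat input.en.raw | perl tokenizer.perl -l en -no-escape > input.en.tok
    ./fast applybpe input.en.bpe input.en.tok codes.ende vocab.en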

Remove BPE

sed -r 's/(@@ )|(@@ ?$)//g' output.de.bpe > output.de.tok

Evaluate

BLEU_SCRIPT_PATH=src/evaluation/multi-bleu.perl
$BLEU_SCRIPT_PATH ref.de.tok < output.de.tok