Code repository for the paper "Doubly-Trained Adversarial Data Augmentation for Neural Machine Translation" with instructions to reproduce the results.

Overview

Doubly Trained Neural Machine Translation System for Adversarial Attack and Data Augmentation

Languages Experimented:

  • Data Overview:

    Source  Target  Training Data            Valid1                  Valid2                   Test data
    ZH      EN      WMT17 without UN corpus  WMT2017 newstest        WMT2018 newstest         WMT2020 newstest
    DE      EN      WMT17                    WMT2017 newstest        WMT2018 newstest         WMT2014 newstest
    FR      EN      WMT14 without UN corpus  WMT2015 newsdiscussdev  WMT2015 newsdiscusstest  WMT2014 newstest
  • Corpus Statistics:

    Lang-pair  Data Type  #Sentences  #Tokens (English side)
    zh-en      Train         9355978               161393634
               Valid1           2001                   47636
               Valid2           3981                   98308
               Test             2000                   65561
    de-en      Train         4001246               113777884
               Valid1           2941                   74288
               Valid2           2970                   78358
               Test             3003                   78182
    fr-en      Train        23899064                73523616
               Valid1           1442                   30888
               Valid2           1435                   30215
               Test             3003                   81967
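
A quick way to sanity-check the table is to derive the average English sentence length for each split. The sketch below only divides the token count by the sentence count for the training rows; the dictionary and function names are made up for illustration, and the numbers are copied from the table as-is. A ratio that differs sharply from the others is worth double-checking against the original data.

```python
# (sentences, English-side tokens) for the training splits,
# copied from the Corpus Statistics table.
TRAIN_STATS = {
    "zh-en": (9355978, 161393634),
    "de-en": (4001246, 113777884),
    "fr-en": (23899064, 73523616),
}


def avg_tokens_per_sentence(sentences, tokens):
    """Average number of English tokens per sentence."""
    return tokens / sentences


if __name__ == "__main__":
    for pair, (n_sent, n_tok) in TRAIN_STATS.items():
        ratio = avg_tokens_per_sentence(n_sent, n_tok)
        print(f"{pair}: {ratio:.2f} tokens per sentence")
```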

Scripts (as shown in the paper's appendix)

  • Set-up:

    • The scripts below require fairseq 0.9.0 and COMET. After cloning this repo, install both by running the following commands from the repo root:
      cd fairseq-0.9.0
      pip install --editable ./
      cd ../COMET
      pip install .
    • COMET can also be installed directly through pip (pip install unbabel-comet), but recent versions may depend on different versions of other packages such as fairseq. Please check COMET's official website for up-to-date information.
    • Scripts that rely on a COMET model (the dual-comet case) need a COMET model to be downloaded first. This can be done by running the following Python snippet:
      from comet.models import download_model
      download_model("wmt-large-da-estimator-1719")
  • Pretrain the model:

    fairseq-train $DATADIR \
        --source-lang $src \
        --target-lang $tgt \
        --save-dir $SAVEDIR \
        --share-decoder-input-output-embed \
        --arch transformer_wmt_en_de \
        --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
        --lr-scheduler inverse_sqrt \
        --warmup-init-lr 1e-07 --warmup-updates 4000 \
        --lr 0.0005 --min-lr 1e-09 \
        --dropout 0.3 --weight-decay 0.0001 \
        --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
        --max-tokens 2048 --update-freq 16 \
        --seed 2 
  • Adversarial Attack:

    fairseq-train $DATADIR \
        --source-lang $src \
        --target-lang $tgt \
        --save-dir $SAVEDIR \
        --share-decoder-input-output-embed \
        --train-subset valid \
        --arch transformer_wmt_en_de \
        --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
        --lr-scheduler inverse_sqrt \
        --warmup-init-lr 1e-07 --warmup-updates 4000 \
        --lr 0.0005 --min-lr 1e-09 \
        --dropout 0.3 --weight-decay 0.0001 \
        --criterion dual_bleu --mrt-k 16 \
        --batch-size 2 --update-freq 64 \
        --seed 2 \
        --restore-file $PRETRAIN_MODEL \
        --reset-optimizer \
        --reset-dataloader 
  • Data Augmentation:

    fairseq-train $DATADIR \
        -s $src -t $tgt \
        --train-subset valid \
        --valid-subset valid1 \
        --left-pad-source False \
        --share-decoder-input-output-embed \
        --encoder-embed-dim 512 \
        --arch transformer_wmt_en_de \
        --dual-training \
        --auxillary-model-path $AUX_MODEL \
        --auxillary-model-save-dir $AUX_MODEL_SAVE \
        --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
        --lr-scheduler inverse_sqrt \
        --warmup-init-lr 0.000001 --warmup-updates 1000 \
        --lr 0.00001 --min-lr 1e-09 \
        --dropout 0.3 --weight-decay 0.0001 \
        --criterion dual_comet/dual_mrt --mrt-k 8 \
        --comet-route $COMET_PATH \
        --batch-size 4 \
        --skip-invalid-size-inputs-valid-test \
        --update-freq 1 \
        --on-the-fly-train --adv-percent 30 \
        --seed 2 \
        --restore-file $PRETRAIN_MODEL \
        --reset-optimizer \
        --reset-dataloader \
        --save-dir $CHECKPOINT_FOLDER 
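
The three training commands above all use the inverse_sqrt scheduler together with gradient accumulation through --update-freq. The sketch below approximates how such a schedule behaves: a linear warmup from --warmup-init-lr to --lr, followed by decay proportional to the inverse square root of the update number. It is written from the general description of the scheduler, not copied from the fairseq source, and it does not model the --min-lr stopping threshold. The function names and the num_gpus parameter are hypothetical; the number of GPUs is not stated here, so it defaults to 1.

```python
def inverse_sqrt_lr(num_updates, lr, warmup_init_lr, warmup_updates):
    """Approximate learning rate after `num_updates` optimizer steps."""
    if num_updates < warmup_updates:
        # Linear warmup from warmup_init_lr up to lr.
        step = (lr - warmup_init_lr) / warmup_updates
        return warmup_init_lr + num_updates * step
    # After warmup: decay with the inverse square root of the update number.
    return lr * (warmup_updates / num_updates) ** 0.5


def effective_batch(per_step, update_freq, num_gpus=1):
    """Tokens (or sentences) contributing to one optimizer update.

    `num_gpus` is an assumption; adjust it to the actual hardware.
    """
    return per_step * update_freq * num_gpus


if __name__ == "__main__":
    # Pretraining: --lr 0.0005 --warmup-init-lr 1e-07 --warmup-updates 4000
    for n in (0, 2000, 4000, 16000):
        print(n, inverse_sqrt_lr(n, 5e-4, 1e-7, 4000))

    # Pretraining: --max-tokens 2048 --update-freq 16 (tokens per update)
    print(effective_batch(2048, 16))
    # Adversarial attack: --batch-size 2 --update-freq 64 (sentences per update)
    print(effective_batch(2, 64))
    # Data augmentation: --batch-size 4 --update-freq 1 (sentences per update)
    print(effective_batch(4, 1))
```

Under these assumptions, the pretraining run accumulates far more data per update than the two fine-tuning runs, which also use a much smaller peak learning rate in the data augmentation case (--lr 0.00001).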

Generation and Test:

  • For Chinese-English, we use sentencepiece for BPE, so the sentencepiece segmentation has to be removed in the generation step. All tests use beam size 5. Notice that we modified the fairseq generation code to tokenize Chinese with sacrebleu.tokenizers.TokenizerZh() when the direction is en-zh.

    fairseq-generate $DATA_FOLDER \
        -s zh -t en \
        --task translation \
        --gen-subset $file \
        --path $CHECKPOINT \
        --batch-size 64 --quiet \
        --lenpen 1.0 \
        --remove-bpe sentencepiece \
        --sacrebleu \
        --beam 5
  • For French-English and German-English, we modified the script to detokenize the output of the Moses tokenizer (which we used to preprocess the data). To reproduce the results, use the following script (set -s to de or fr):

    fairseq-generate $DATA_FOLDER \
        -s de/fr -t en \
        --task translation \
        --gen-subset $file \
        --path $CHECKPOINT \
        --batch-size 64 --quiet \
        --lenpen 1.0 \
        --remove-bpe \
        --detokenize-moses \
        --sacrebleu \
        --beam 5

    Here, --detokenize-moses calls the detokenizer during the generation step and detokenizes the predictions before they are evaluated, which slows generation down. Alternatively, retrieve the prediction and target sentences from fairseq's output file and detokenize them manually with detokenizer.perl.
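
The manual route can be sketched as follows. This assumes the output was produced without --quiet (with --quiet, fairseq-generate prints only the final score) and that it follows the usual tab-separated layout, in which target lines start with T-<id> and hypothesis lines start with H-<id> followed by a score. The function name and the sample lines are made up for illustration.

```python
def parse_generate_output(lines):
    """Return (hypotheses, targets) ordered by sentence id.

    Assumed line layout:
        T-<id>\t<target text>
        H-<id>\t<score>\t<hypothesis text>
    All other lines (S-, P-, log messages) are ignored.
    """
    targets, hypotheses = {}, {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("T-"):
            parts = line.split("\t", 1)
            targets[int(parts[0][2:])] = parts[1] if len(parts) > 1 else ""
        elif line.startswith("H-"):
            parts = line.split("\t", 2)
            hypotheses[int(parts[0][2:])] = parts[2] if len(parts) > 2 else ""
    # fairseq batches sentences by length, so restore the original order.
    ids = sorted(hypotheses)
    return [hypotheses[i] for i in ids], [targets[i] for i in ids]


# Made-up sample in the assumed layout, deliberately out of order.
SAMPLE = [
    "S-1\tGuten Morgen .",
    "T-1\tGood morning .",
    "H-1\t-0.31\tGood morning .",
    "P-1\t-0.20 -0.40 -0.33",
    "S-0\tHallo Welt !",
    "T-0\tHello world !",
    "H-0\t-0.25\tHello world !",
    "P-0\t-0.10 -0.30 -0.35",
]

if __name__ == "__main__":
    hyps, refs = parse_generate_output(SAMPLE)
    for hyp, ref in zip(hyps, refs):
        print(f"{hyp}\t{ref}")
```

The two lists can then be written to one-sentence-per-line files and passed through detokenizer.perl (for example with -l en) before scoring.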

BibTeX

@misc{tan2021doublytrained,
      title={Doubly-Trained Adversarial Data Augmentation for Neural Machine Translation}, 
      author={Weiting Tan and Shuoyang Ding and Huda Khayrallah and Philipp Koehn},
      year={2021},
      eprint={2110.05691},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}