Code for paper "Vocabulary Learning via Optimal Transport for Neural Machine Translation"

Overview

**Codebase and data upload is in progress.**

VOLT(-py) is a vocabulary learning codebase that allows researchers and developers to automatically generate a vocabulary with suitable granularity for machine translation.

What's New:

  • July 2021: Support En-De translation, TED bilingual translation, and multilingual translation.
  • July 2021: Support subword-nmt tokenization.
  • July 2021: Support sentencepiece tokenization.

What's On-going:

  • Add translation training/evaluation codes.
  • Support classification tasks.
  • Support pip usage.

Features:

  • Efficient: vocabulary learning runs on CPU on a single machine.
  • Simple: The core code is no more than 200 lines.
  • Easy-to-use: Supports the widely-used tokenization toolkits subword-nmt and sentencepiece.
  • Flexible: Users can customize their own tokenization rules (see the sketch below this list).
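
As an illustration of that flexibility, here is a minimal, hypothetical sketch of building a custom token-candidate file from raw text (simple character n-grams instead of BPE merges). The function name and the output format (one "token frequency" pair per line) are assumptions, so check them against what --token_candidate_file actually expects:

    import sys
    from collections import Counter

    def ngram_candidates(corpus_path, out_path, max_n=4, top_k=30000):
        # Count character n-grams (n = 1..max_n) inside whitespace-separated words.
        counts = Counter()
        with open(corpus_path, encoding="utf-8") as f:
            for line in f:
                for word in line.split():
                    for n in range(1, max_n + 1):
                        for i in range(len(word) - n + 1):
                            counts[word[i:i + n]] += 1
        # Assumed candidate-file format: one "token frequency" pair per line.
        with open(out_path, "w", encoding="utf-8") as out:
            for token, freq in counts.most_common(top_k):
                out.write(f"{token} {freq}\n")

    if __name__ == "__main__":
        ngram_candidates(sys.argv[1], sys.argv[2])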

Requirements and Installation

The required environments:

  • python 3
  • tqdm
  • mosesdecoder
  • subword-nmt
  • sentencepiece

To use VOLT and develop locally:

git clone https://github.com/Jingjing-NLP/VOLT/
cd VOLT
git clone https://github.com/moses-smt/mosesdecoder
git clone https://github.com/rsennrich/subword-nmt
pip3 install sentencepiece
pip3 install tqdm 

Usage

  • The first step is to get vocabulary candidates and tokenized texts. The sub-word vocabulary can be generated by subword-nmt and sentencepiece. Here are two examples:

    
    #Assume source_data is the file storing data in the source language
    #Assume target_data is the file storing data in the target language
    BPEROOT=subword-nmt
    size=30000 # the number of BPE merge operations (candidate vocabulary size)
    cat source_data > training_data
    cat target_data >> training_data
    
    #subword-nmt style:
    mkdir bpeoutput
    BPE_CODE=code # the path to save the learned BPE codes (token candidates)
    python3 $BPEROOT/learn_bpe.py -s $size  < training_data > $BPE_CODE
    python3 $BPEROOT/apply_bpe.py -c $BPE_CODE < source_data > bpeoutput/source.file
    python3 $BPEROOT/apply_bpe.py -c $BPE_CODE < target_data > bpeoutput/target.file
    
    #sentencepiece style:
    mkdir spmout
    python3 spm/spm_train.py --input=training_data --model_prefix=spm --vocab_size=$size --character_coverage=1.0 --model_type=bpe
    #After this step, you will see spm.vocab and spm.model
    python3 spm/spm_encoder.py --model spm.model --inputs source_data --outputs spmout/source_data --output_format piece
    python3 spm/spm_encoder.py --model spm.model --inputs target_data --outputs spmout/target_data --output_format piece
    
  • The second step is to run VOLT scripts. It accepts the following parameters:

    • --source_file: the file storing data in the source language.
    • --target_file: the file storing data in the target language.
    • --token_candidate_file: the file storing token candidates.
    • --vocab_file: the file to store the vocabulary generated by VOLT.
    • --max_number: the maximum size of the vocabulary generated by VOLT.
    • --interval: the search granularity in VOLT.
    • --loop_in_ot: the maximum number of iterations in the Sinkhorn solver (see the sketch after these steps).
    • --tokenizer: the toolkit used to generate vocabulary candidates. Only subword-nmt and sentencepiece are supported.
    • --size_file: the file to store the vocabulary size generated by VOLT.
    • --threshold: the threshold for deciding which tokens from the optimal transport matrix enter the final vocabulary. A lower threshold means that fewer token candidates are dropped.
    #subword-nmt style
    python3 ../ot_run.py --source_file bpeoutput/source.file --target_file bpeoutput/target.file \
              --token_candidate_file $BPE_CODE \
              --vocab_file bpeoutput/vocab --max_number 10000 --interval 1000  --loop_in_ot 500 --tokenizer subword-nmt --size_file bpeoutput/size 
    #sentencepiece style
    python3 ../ot_run.py --source_file spmout/source_data --target_file spmout/target_data \
              --token_candidate_file spm.vocab \
              --vocab_file spmout/vocab --max_number 10000 --interval 1000  --loop_in_ot 500 --tokenizer sentencepiece --size_file spmout/size
    
  • The third step is to use the generated vocabulary to tokenize your texts:

      #for subword-nmt toolkit
      python3 $BPEROOT/apply_bpe.py -c bpeoutput/vocab < source_data > bpeoutput/source.file
      python3 $BPEROOT/apply_bpe.py -c bpeoutput/vocab < target_data > bpeoutput/target.file
    
      #for the sentencepiece toolkit, retrain sentencepiece with the optimal size found by VOLT
      best_size=$(cat spmoutput/size)
      python3 spm/spm_train.py --input=training_data --model_prefix=spm --vocab_size=$best_size --character_coverage=1.0 --model_type=bpe
    
      #After this step, you will see spm.vocab and spm.model
      python3 spm/spm_encoder.py --model spm.model --inputs source_data --outputs spmout/source_data --output_format piece
      python3 spm/spm_encoder.py --model spm.model --inputs target_data --outputs spmout/target_data --output_format piece
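
To make --loop_in_ot and --threshold more concrete, below is a generic Sinkhorn-Knopp sketch in plain NumPy; it is not VOLT's actual implementation, and all names are illustrative. --loop_in_ot corresponds to the number of scaling iterations, and a candidate whose transported mass in the resulting plan falls below --threshold would be dropped from the final vocabulary:

    import numpy as np

    def sinkhorn(cost, row_marginals, col_marginals, reg=0.1, loops=500):
        # Entropic-regularization kernel derived from the cost matrix.
        K = np.exp(-cost / reg)
        u = np.ones(cost.shape[0])
        v = np.ones(cost.shape[1])
        # Alternating scaling updates; `loops` plays the role of --loop_in_ot.
        for _ in range(loops):
            u = row_marginals / (K @ v)
            v = col_marginals / (K.T @ u)
        # Transport plan whose row/column sums approach the given marginals.
        return u[:, None] * K * v[None, :]

    # Toy usage: per-candidate kept mass compared against an illustrative threshold.
    plan = sinkhorn(np.random.rand(5, 8), np.full(5, 1 / 5), np.full(8, 1 / 8))
    keep = plan.sum(axis=0) > 0.01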
    

Examples

We provide several examples in the "examples/" directory.

Datasets

The WMT-14 En-De translation data can be downloaded via the provided running scripts.

For TED, you can download the data at TED.

Citation

Please cite as:

@inproceedings{volt,
  title = {Vocabulary Learning via Optimal Transport for Neural Machine Translation},
  author = {Jingjing Xu and Hao Zhou and Chun Gan and Zaixiang Zheng and Lei Li},
  booktitle = {Proceedings of ACL 2021},
  year = {2021},
}