Selecting Parallel In-domain Sentences for Neural Machine Translation Using Monolingual Texts

Last update: Jan 07, 2023

Overview

DataSelection-NMT

Quick update: The paper was accepted on Dec 6, 2021! I will link the repository to the paper as soon as it is published.

Our Pre-trained models on Hugging Face

Systems	Link	Systems	Link
Top1	Download	Top1	Download
Top2+Top1	Download	Top2	Download
Top3+Top2+...	Download	Top3	Download
Top4+Top3+...	Download	Top4	Download
Top5+Top4+...	Download	Top5	Download
Top6+Top5+...	Download	Top6	Download

How to use

Note: we ported the best checkpoints of our trained models to Hugging Face (HF). Because the models were trained with OpenNMT-py, they cannot be used directly for inference on HF. To work around this, we use CTranslate2, an inference engine for Transformer models.

Follow the steps below to translate your sentences:

1. Install the Python package:

pip install --upgrade pip
pip install ctranslate2

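2. Download models from our HF repository: You can do this manually or use the following Python script:

import requests

url = "Download Link"      # replace with the model's download URL
model_path = "Model Path"  # replace with the local path to save the checkpoint to
r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # stop on an HTTP error instead of saving an error page
with open(model_path, "wb") as f:
    f.write(r.content)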

3. Convert the downloaded model:

ct2-opennmt-py-converter --model_path model_path --output_dir output_directory
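
If you prefer to script the conversion, the command above can be wrapped in Python. This is only a minimal sketch: the helper names build_convert_command and convert_model are hypothetical (they are not part of this repository or of CTranslate2), and the optional --force flag is assumed to be what you want when the output directory already exists.

```python
import subprocess
from typing import List


def build_convert_command(model_path: str, output_dir: str, force: bool = False) -> List[str]:
    """Assemble the ct2-opennmt-py-converter command as an argument list."""
    command = [
        "ct2-opennmt-py-converter",
        "--model_path", model_path,
        "--output_dir", output_dir,
    ]
    if force:
        # --force lets the converter overwrite an existing output directory.
        command.append("--force")
    return command


def convert_model(model_path: str, output_dir: str, force: bool = False) -> None:
    """Run the converter; raises CalledProcessError if the command fails."""
    subprocess.run(build_convert_command(model_path, output_dir, force), check=True)
```

Calling convert_model("model_path", "output_directory") runs the same command as above, provided ctranslate2 is installed and the checkpoint has been downloaded.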

4. Translate tokenized inputs:

Note: the inputs should be tokenized by SentencePiece. You can also use the tokenized versions of the IWSLT test sets.

To translate a single tokenized sentence:

import ctranslate2
translator = ctranslate2.Translator("output_directory/")
translator.translate_batch([["▁H", "ello", "▁world", "!"]])

To translate a whole tokenized file:

import ctranslate2
translator = ctranslate2.Translator("output_directory/")
# input_file: path to the tokenized source file; output_file: path to write translations to
# batch_type is either "examples" or "tokens"
translator.translate_file(input_file, output_file, batch_type="examples")

To customize the CTranslate2 functions, read this API document.

5. Detokenize the outputs:

Note: you need to detokenize the output with the same SentencePiece model as used in step 4.

tools/detokenize.perl -no-escape -l fr \
< output_file \
> output_file.detok

6. Remove the @@ tokens:

cat output_file.detok | sed -E 's/(@@)|(@@ )|(@@ ?$)//g' \
> output_file.detok.postprocessed

Use grep to check that the @@ tokens were removed; no output means none are left:

grep @@ output_file.detok.postprocessed
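
The same clean-up and check can be done in Python if sed and grep are not available. This is a minimal sketch, not the authors' script: the function names are hypothetical, and the pattern is assumed to have the same effect as the sed expression above (remove each @@ together with one following space, if present).

```python
import re

# Drop "@@" and one optional following space, mirroring the sed expression.
BPE_MARKER = re.compile(r"@@ ?")


def remove_bpe_markers(line: str) -> str:
    """Return the line with every '@@' continuation marker removed."""
    return BPE_MARKER.sub("", line)


def find_leftover_markers(lines):
    """Return (line_number, line) pairs that still contain '@@', like the grep check."""
    return [(i, line) for i, line in enumerate(lines, start=1) if "@@" in line]
```

For example, remove_bpe_markers("trans@@ lation is fun@@") returns "translation is fun", and find_leftover_markers applied to the cleaned lines returns an empty list when no markers remain.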

Authors

Javad Pourmostafa - Email, Website
Dimitar Shterionov - Email, Website
Pieter Spronck - Email, Website


Releases (1.1)

1.1 (Oct 25, 2021)

Owner

Javad Pourmostafa
