Use the state-of-the-art m2m100 to translate large data on CPU/GPU/TPU. Super Easy!

Last update: Dec 15, 2022

Related tags

Overview

Easy-Translate is a script for translating large text files in your machine using the M2M100 models from Facebook/Meta AI. We also privide a script for Easy-Evaluation of your translations 🥳

M2M100 is a multilingual encoder-decoder (seq-to-seq) model trained for Many-to-Many multilingual translation introduced in this paper and first released in this repository.

M2M100 can directly translate between 9,900 directions of 100 languages.

Easy-Translate is built on top of 🤗 HuggingFace's Transformers and 🤗 HuggingFace's Accelerate library.

We currently support:

CPU / multi-CPU / GPU / multi-GPU / TPU acceleration
BF16 / FP16 / FP32 precision.
Automatic batch size finder: Forget CUDA OOM errors. Set an initial batch size, if it doesn't fit, we will automatically adjust it.
Sharded Data Parallel to load huge models sharded on multiple GPUs (See: https://huggingface.co/docs/accelerate/fsdp).

Test the 🔌 Online Demo here: https://huggingface.co/spaces/Iker/Translate-100-languages

Supported languages

See the Supported languages table for a table of the supported languages and their ids.

List of supported languages: Afrikaans, Amharic, Arabic, Asturian, Azerbaijani, Bashkir, Belarusian, Bulgarian, Bengali, Breton, Bosnian, Catalan, Cebuano, Czech, Welsh, Danish, German, Greeek, English, Spanish, Estonian, Persian, Fulah, Finnish, French, WesternFrisian, Irish, Gaelic, Galician, Gujarati, Hausa, Hebrew, Hindi, Croatian, Haitian, Hungarian, Armenian, Indonesian, Igbo, Iloko, Icelandic, Italian, Japanese, Javanese, Georgian, Kazakh, CentralKhmer, Kannada, Korean, Luxembourgish, Ganda, Lingala, Lao, Lithuanian, Latvian, Malagasy, Macedonian, Malayalam, Mongolian, Marathi, Malay, Burmese, Nepali, Dutch, Norwegian, NorthernSotho, Occitan, Oriya, Panjabi, Polish, Pushto, Portuguese, Romanian, Russian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Albanian, Serbian, Swati, Sundanese, Swedish, Swahili, Tamil, Thai, Tagalog, Tswana, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Wolof, Xhosa, Yiddish, Yoruba, Chinese, Zulu

Supported Models

Facebook/m2m100_418M: https://huggingface.co/facebook/m2m100_418M
Facebook/m2m100_1.2B: https://huggingface.co/facebook/m2m100_1.2B
Facebook/m2m100_12B: https://huggingface.co/facebook/m2m100-12B-avg-5-ckpt
Any other m2m100 model from HuggingFace's Hub: https://huggingface.co/models?search=m2m100

Requirements

Pytorch >= 1.10.0
See: https://pytorch.org/get-started/locally/

Accelerate >= 0.7.1
pip install --upgrade accelerate

HuggingFace Transformers 
pip install --upgrade transformers

Translate a file

Run python translate.py -h for more info.

Using a single CPU / GPU

accelerate launch translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B

Multi-GPU

See Accelerate documentation for more information (multi-node, TPU, Sharded model...): https://huggingface.co/docs/accelerate/index
You can use the Accelerate CLI to configure the Accelerate environment (Run accelerate config in your terminal) instead of using the --multi_gpu and --num_processes flags.

# Use 2 GPUs
accelerate launch --multi_gpu --num_processes 2 --num_machines 1 translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B

Automatic batch size finder

We will automatically find a batch size that fits in your GPU memory. The default initial batch size is 128 (You can set it with the --starting_batch_size 128 flag). If we find an Out Of Memory error, we will automatically decrease the batch size until we find a working one.

Choose precision

Use the --precision flag to choose the precision of the model. You can choose between: bf16, fp16 and 32.

accelerate launch translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B \
--precision fp16

Evaluate translations

To run the evaluation script you need to install bert_score: pip install bert_score and 🤗 HuggingFace's Datasets model: pip install datasets.

The evaluation script will calculate the following metrics:

Run the following command to evaluate the translations:

accelerate launch eval.py \
--pred_path sample_text/es.txt \
--gold_path sample_text/en2es.translation.m2m100_1.2B.txt

If you want to save the results to a file use the --output_path flag.

See sample_text/en2es.m2m100_1.2B.json for a sample output.

Use the state-of-the-art m2m100 to translate large data on CPU/GPU/TPU. Super Easy!

Related tags

Overview

Supported languages

Supported Models

Requirements

Translate a file

Using a single CPU / GPU

Multi-GPU

Automatic batch size finder

Choose precision

Evaluate translations

Owner

Iker García-Ferrero

PyTorch Implementation of Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation

Machine learning models from Singapore's NLP research community

PyTorch implementation of the paper: Text is no more Enough! A Benchmark for Profile-based Spoken Language Understanding

Findings of ACL 2021

构建一个多源（公众号、RSS）、干净、个性化的阅读环境

Implementation of Memorizing Transformers (ICLR 2022), attention net augmented with indexing and retrieval of memories using approximate nearest neighbors, in Pytorch

This project is part of Eleuther AI's quest to create a massive repository of high quality text data for training language models.

Composed Image Retrieval using Pretrained LANguage Transformers (CIRPLANT)

使用Mask LM预训练任务来预训练Bert模型。训练垂直领域语料的模型表征，提升下游任务的表现。

Longformer: The Long-Document Transformer

This is the writeup of all the challenges from Advent-of-cyber-2019 of TryHackMe

Suite of 500 procedurally-generated NLP tasks to study language model adaptability

Code for "Generative adversarial networks for reconstructing natural images from brain activity".

This is a project of data parallel that running on NLP tasks.

What are the best Systems? New Perspectives on NLP Benchmarking

SGMC: Spectral Graph Matrix Completion

OCR을 이용하여 인원수를 인식 후 줌을 Kill 해줍니다

This repository contains the code for "Generating Datasets with Pretrained Language Models".

End-to-end MLOps pipeline of a BERT model for emotion classification.

Research code for ECCV 2020 paper "UNITER: UNiversal Image-TExt Representation Learning"