Training open neural machine translation models

Last update: Jan 03, 2023

Overview

Train Opus-MT models

This package includes scripts for training NMT models for OPUS-MT using MarianNMT and OPUS data. More details are given in the Makefile, but the documentation still needs to be improved. Also, the make targets require a specific environment and currently only work well on the CSC HPC cluster in Finland.

Pre-trained models

The subdirectory models contains information about pre-trained models that can be downloaded from this project. They are distributed under a CC-BY 4.0 license. More pre-trained models trained with the OPUS-MT training pipeline are available from the Tatoeba translation challenge, also under a CC-BY 4.0 license.

Quickstart

Setting up:

git clone https://github.com/Helsinki-NLP/OPUS-MT-train.git
cd OPUS-MT-train
git submodule update --init --recursive --remote
make install

Training a multilingual NMT model (Finnish and Estonian to Danish, Swedish and English):

make SRCLANGS="fi et" TRGLANGS="da sv en" train
make SRCLANGS="fi et" TRGLANGS="da sv en" eval
make SRCLANGS="fi et" TRGLANGS="da sv en" release
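The three steps above share the same language variables, so they can be chained in a small wrapper. The following is only a sketch: the function name run_pipeline and the DRY_RUN switch are illustrative assumptions and are not part of OPUS-MT-train. It relies only on the make targets and variables shown above.

```shell
#!/bin/sh
# Sketch: run the train, eval and release targets in order for one language set.
# run_pipeline and DRY_RUN are hypothetical names, not part of OPUS-MT-train.
run_pipeline() {
  src="$1"   # space-separated source languages, e.g. "fi et"
  trg="$2"   # space-separated target languages, e.g. "da sv en"
  for target in train eval release; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      # Only print the command that would be executed
      printf 'make SRCLANGS="%s" TRGLANGS="%s" %s\n' "$src" "$trg" "$target"
    else
      # Stop at the first failing step so later targets never run after a failure
      make SRCLANGS="$src" TRGLANGS="$trg" "$target" || return 1
    fi
  done
}

# Dry run: print the three commands without executing them
DRY_RUN=1
run_pipeline "fi et" "da sv en"
```

Setting DRY_RUN=0 (or leaving it unset) would execute the targets instead of printing them, which only makes sense inside the cloned repository and in an environment where the make targets work.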

More information is available in the documentation linked below.

Documentation

Tutorials

References

Please cite the following paper if you use OPUS-MT software and models:

@InProceedings{TiedemannThottingal:EAMT2020,
  author = {J{\"o}rg Tiedemann and Santhosh Thottingal},
  title = {{OPUS-MT} — {B}uilding open translation services for the {W}orld},
  booktitle = {Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT)},
  year = {2020},
  address = {Lisbon, Portugal}
 }

Acknowledgements

None of this would be possible without all the great open source software including

GNU/Linux tools
Marian-NMT
eflomal

... and many other tools like terashuf, pigz, jq, Moses SMT, fast_align, sacrebleu ...

We would also like to acknowledge the support of the University of Helsinki and CSC (IT Center for Science), the funding through projects in the EU Horizon 2020 framework (FoTran, MeMAD, ELG), and the contributors to OPUS, the open collection of parallel corpora.

Owner

Language Technology at the University of Helsinki
