The repository for the paper: Multilingual Translation via Grafting Pre-trained Language Models

Last update: Dec 14, 2022

Graformer

Graformer (named BridgeTransformer in the code) is a sequence-to-sequence model aimed mainly at neural machine translation. It improves multilingual translation by grafting together separately pre-trained language models: a pre-trained encoder (BERT, a masked language model) and a pre-trained decoder (GPT). The code is based on Fairseq.

Examples

Start from run/run.sh, which needs only minor modifications. Each script corresponds to one step:

Pre-train the multilingual BERT (masked language model):
    run_arnold_multilingual_masked_lm_6e6d.sh

Pre-train the multilingual GPT (language model):
    run_arnold_multilingual_lm_6e6d.sh

Train a Graformer:
    run_arnold_multilingual_graft_transformer_12e12d_ted.sh

Run inference with a trained Graformer:
    run_arnold_multilingual_graft_inference_ted.sh
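
The mapping from steps to scripts above can be kept in a small helper. This is only a sketch: the function name stage_script, the stage labels, and the RUN_DIR variable are hypothetical, and it assumes the scripts sit in the same run/ directory as run.sh, which the README does not state. It prints the script paths in the order listed above and does not execute them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a step name to the script listed in this README.
# Assumption: the scripts live next to run.sh in run/; set RUN_DIR otherwise.
RUN_DIR="${RUN_DIR:-run}"

stage_script() {
    case "$1" in
        bert)      echo "$RUN_DIR/run_arnold_multilingual_masked_lm_6e6d.sh" ;;
        gpt)       echo "$RUN_DIR/run_arnold_multilingual_lm_6e6d.sh" ;;
        graformer) echo "$RUN_DIR/run_arnold_multilingual_graft_transformer_12e12d_ted.sh" ;;
        inference) echo "$RUN_DIR/run_arnold_multilingual_graft_inference_ted.sh" ;;
        *)         echo "unknown stage: $1" >&2; return 1 ;;
    esac
}

# Print each step and its script in the order the README lists them.
for stage in bert gpt graformer inference; do
    echo "$stage -> $(stage_script "$stage")"
done
```

Any arguments or environment settings the scripts expect are not covered here; check run/run.sh for those before launching a step.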

Released Models

We release our pre-trained mBERT and mGPT, along with the trained Graformer model, here.

TensorFlow Version

We will provide a TensorFlow version in NeurST, a popular toolkit for sequence processing.

Citation

Please cite as:

@inproceedings{sun2021mulilingual,
    title = "Multilingual Translation via Grafting Pre-trained Language Models",
    author = "Sun, Zewei and Wang, Mingxuan and Li, Lei",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    year = "2021"
}

Contact

If you have any questions, please feel free to contact me: [email protected]
