BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese


Table of contents

  1. Introduction
  2. Using BARTpho in fairseq
  3. Using BARTpho in transformers
  4. Notes

BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese

The two BARTpho versions, BARTpho-syllable and BARTpho-word, are the first public large-scale monolingual sequence-to-sequence models pre-trained for Vietnamese. BARTpho uses the "large" architecture and pre-training scheme of the sequence-to-sequence denoising model BART, making it especially suitable for generative NLP tasks. Experiments on the downstream task of Vietnamese text summarization show that, in both automatic and human evaluations, BARTpho outperforms the strong baseline mBART and improves the state of the art.

The general architecture and experimental results of BARTpho can be found in our paper:

@article{bartpho,
title     = {{BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese}},
author    = {Nguyen Luong Tran and Duong Minh Le and Dat Quoc Nguyen},
journal   = {arXiv preprint},
volume    = {arXiv:2109.09701},
year      = {2021}
}

Please CITE our paper when BARTpho is used to help produce published results or is incorporated into other software.

Using BARTpho in fairseq

Installation

There is an issue with the encode function in the BART hub_interface, as discussed in this pull request: https://github.com/pytorch/fairseq/pull/3905. Until that pull request is merged, please install fairseq from the following fork:

git clone https://github.com/datquocnguyen/fairseq.git
cd fairseq
pip install --editable ./

Pre-trained models

Model              #params   Download                       Input text
BARTpho-syllable   396M      fairseq-bartpho-syllable.zip   Syllable level
BARTpho-word       420M      fairseq-bartpho-word.zip       Word level
  • unzip fairseq-bartpho-syllable.zip
  • unzip fairseq-bartpho-word.zip

Example usage

from fairseq.models.bart import BARTModel  

#Load BARTpho-syllable model:  
model_folder_path = '/PATH-TO-FOLDER/fairseq-bartpho-syllable/'  
spm_model_path = '/PATH-TO-FOLDER/fairseq-bartpho-syllable/sentence.bpe.model'  
bartpho_syllable = BARTModel.from_pretrained(model_folder_path, checkpoint_file='model.pt', bpe='sentencepiece', sentencepiece_model=spm_model_path).eval()
#Input syllable-level/raw text:  
sentence = 'Chúng tôi là những nghiên cứu viên.'  
#Apply SentencePiece to the input text
tokenIDs = bartpho_syllable.encode(sentence, add_if_not_exist=False)
#Extract features from BARTpho-syllable
last_layer_features = bartpho_syllable.extract_features(tokenIDs)

#Load BARTpho-word model:
model_folder_path = '/PATH-TO-FOLDER/fairseq-bartpho-word/'  
bpe_codes_path = '/PATH-TO-FOLDER/fairseq-bartpho-word/bpe.codes'  
bartpho_word = BARTModel.from_pretrained(model_folder_path, checkpoint_file='model.pt', bpe='fastbpe', bpe_codes=bpe_codes_path).eval()
#Input word-level text:  
sentence = 'Chúng_tôi là những nghiên_cứu_viên .'  
#Apply BPE to the input text
tokenIDs = bartpho_word.encode(sentence, add_if_not_exist=False)
#Extract features from BARTpho-word
last_layer_features = bartpho_word.extract_features(tokenIDs)
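
A quick sanity check on the outputs above (a minimal sketch, reusing the variables already defined; extract_features returns the final decoder layer's hidden states):

#Map the token IDs back to text and inspect the feature tensor's shape
print(bartpho_word.decode(tokenIDs))
print(last_layer_features.shape)  #(batch_size, sequence_length, hidden_size)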

Using BARTpho in transformers

Installation

  • Installation with pip (v4.12+): pip install transformers
  • Installing from source:
git clone https://github.com/huggingface/transformers.git
cd transformers
pip install -e .

Pre-trained models

Model                    #params   Input text
vinai/bartpho-syllable   396M      Syllable level
vinai/bartpho-word       420M      Word level

Example usage

import torch
from transformers import AutoModel, AutoTokenizer

#BARTpho-syllable
syllable_tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable", use_fast=False)
bartpho_syllable = AutoModel.from_pretrained("vinai/bartpho-syllable")
TXT = 'Chúng tôi là những nghiên cứu viên.'  
input_ids = syllable_tokenizer(TXT, return_tensors='pt')['input_ids']
features = bartpho_syllable(input_ids)
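#features is a Seq2SeqModelOutput; features.last_hidden_state holds the decoder's final
#hidden states of shape (batch_size, sequence_length, hidden_size)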

from transformers import MBartForConditionalGeneration
bartpho_syllable = MBartForConditionalGeneration.from_pretrained("vinai/bartpho-syllable")
TXT = 'Chúng tôi là <mask> nghiên cứu viên.'
input_ids = syllable_tokenizer(TXT, return_tensors='pt')['input_ids']
logits = bartpho_syllable(input_ids).logits
masked_index = (input_ids[0] == syllable_tokenizer.mask_token_id).nonzero().item()
probs = logits[0, masked_index].softmax(dim=0)
values, predictions = probs.topk(5)
print(syllable_tokenizer.decode(predictions).split())

#BARTpho-word
word_tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-word", use_fast=False)
bartpho_word = AutoModel.from_pretrained("vinai/bartpho-word")
TXT = 'Chúng_tôi là những nghiên_cứu_viên .'  
input_ids = word_tokenizer(TXT, return_tensors='pt')['input_ids']
features = bartpho_word(input_ids)

bartpho_word = MBartForConditionalGeneration.from_pretrained("vinai/bartpho-word")
TXT = 'Chúng_tôi là những <mask> .'
input_ids = word_tokenizer(TXT, return_tensors='pt')['input_ids']
logits = bartpho_word(input_ids).logits
masked_index = (input_ids[0] == word_tokenizer.mask_token_id).nonzero().item()
probs = logits[0, masked_index].softmax(dim=0)
values, predictions = probs.topk(5)
print(word_tokenizer.decode(predictions).split())
  • Following mBART, BARTpho uses the "large" architecture of BART with an additional layer-normalization layer on top of both the encoder and decoder. Thus, when converted for use with transformers, BARTpho is called via the mBART-based classes.
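
Because BARTpho is exposed through these mBART classes, standard sequence-to-sequence generation also works with the usual transformers API. A minimal sketch (the beam-search settings are illustrative only; for a real use case such as summarization, the model should first be fine-tuned on downstream data):

#Generate text with BARTpho-syllable via the standard mBART generation API
#(outputs are only meaningful after fine-tuning on a downstream task)
from transformers import AutoTokenizer, MBartForConditionalGeneration

syllable_tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable", use_fast=False)
bartpho_syllable = MBartForConditionalGeneration.from_pretrained("vinai/bartpho-syllable")
input_ids = syllable_tokenizer('Chúng tôi là những nghiên cứu viên.', return_tensors='pt')['input_ids']
output_ids = bartpho_syllable.generate(input_ids, num_beams=4, max_length=40, early_stopping=True)
print(syllable_tokenizer.decode(output_ids[0], skip_special_tokens=True))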

Notes

  • Before fine-tuning BARTpho on a downstream task, users should perform Vietnamese tone normalization on the downstream task's data, as this pre-processing step was also applied to the pre-training corpus. A Python script for Vietnamese tone normalization is available HERE.
  • For BARTpho-word, users should use VnCoreNLP to word-segment raw input texts, as it was used to perform both Vietnamese tone normalization and word segmentation on the pre-training corpus; see the sketch below.
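
A minimal sketch of this word-segmentation step, assuming the vncorenlp Python wrapper and the VnCoreNLP-1.1.1.jar file (with its models) have been set up as described in the VnCoreNLP repository; the jar path below is a placeholder:

#Word-segment raw Vietnamese text with VnCoreNLP before feeding it to BARTpho-word
from vncorenlp import VnCoreNLP

rdrsegmenter = VnCoreNLP('/PATH-TO-FOLDER/VnCoreNLP-1.1.1.jar', annotators='wseg', max_heap_size='-Xmx500m')
sentences = rdrsegmenter.tokenize('Chúng tôi là những nghiên cứu viên.')
word_segmented_text = ' '.join(' '.join(sentence) for sentence in sentences)
print(word_segmented_text)  #expected to match the word-level input format above, e.g. 'Chúng_tôi là những nghiên_cứu_viên .'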

License

MIT License

Copyright (c) 2021 VinAI Research

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.