PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing

Overview


PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing

PhoNLP is a multi-task learning model for joint part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing. Experiments on Vietnamese benchmark datasets show that PhoNLP produces state-of-the-art results, outperforming a single-task learning approach that fine-tunes the pre-trained Vietnamese language model PhoBERT for each task independently.


Details of the PhoNLP model architecture and experimental results can be found in the following paper:

@article{PhoNLP,
title     = {{PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing}},
author    = {Linh The Nguyen and Dat Quoc Nguyen},
journal   = {arXiv preprint},
volume    = {arXiv:2101.01476},
year      = {2021}
}

Please CITE our paper when PhoNLP is used to help produce published results or incorporated into other software.

Although PhoNLP is presented for Vietnamese, the usage examples below also work directly for other languages, provided that gold-annotated corpora are available for the three tasks of POS tagging, NER and dependency parsing, and that a pre-trained BERT-based language model is available from transformers.

Installation

  • Python version >= 3.6; PyTorch version >= 1.4.0
  • PhoNLP can be installed using pip as follows: pip3 install phonlp
  • Alternatively, PhoNLP can be installed from source with the following commands:
     git clone https://github.com/VinAIResearch/PhoNLP
     cd PhoNLP
     pip3 install -e .
    

Usage example: Command lines

To run the command-line examples, install phonlp from source:

git clone https://github.com/VinAIResearch/PhoNLP
cd PhoNLP
pip3 install -e . 

Training

cd phonlp/models
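python3 run_phonlp.py --mode train --save_dir MODEL_FOLDER_PATH \
	--pretrained_lm TRANSFORMERS_PRETRAINED_MODEL \
	--lr FLOAT_VALUE --batch_size INT_VALUE --num_epoch INT_VALUE \
	--lambda_pos FLOAT_VALUE --lambda_ner FLOAT_VALUE --lambda_dep FLOAT_VALUE \
	--train_file_pos PATH_TO_POS_TRAINING_FILE --eval_file_pos PATH_TO_POS_VALIDATION_FILE \
	--train_file_ner PATH_TO_NER_TRAINING_FILE --eval_file_ner PATH_TO_NER_VALIDATION_FILE \
	--train_file_dep PATH_TO_DEP_TRAINING_FILE --eval_file_dep PATH_TO_DEP_VALIDATION_FILE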

--lambda_pos, --lambda_ner and --lambda_dep represent mixture weights associated with POS tagging, NER and dependency parsing losses, respectively, and lambda_pos + lambda_ner + lambda_dep = 1.
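As a rough illustration of how such mixture weights might be used, the sketch below assumes the joint training objective is a weighted sum of the three per-task losses. The function names `check_mixture_weights` and `joint_loss` are hypothetical and are not part of PhoNLP; the toolkit's actual loss computation may differ.

```python
import math


def check_mixture_weights(lambda_pos, lambda_ner, lambda_dep, tol=1e-6):
    """Return True if the three mixture weights sum to 1 (within tol)."""
    total = lambda_pos + lambda_ner + lambda_dep
    return math.isclose(total, 1.0, abs_tol=tol)


def joint_loss(loss_pos, loss_ner, loss_dep, lambda_pos, lambda_ner, lambda_dep):
    """Combine per-task losses as a weighted sum.

    Assumption: the joint objective is
    lambda_pos * L_pos + lambda_ner * L_ner + lambda_dep * L_dep.
    """
    if not check_mixture_weights(lambda_pos, lambda_ner, lambda_dep):
        raise ValueError("lambda_pos + lambda_ner + lambda_dep must equal 1")
    return lambda_pos * loss_pos + lambda_ner * loss_ner + lambda_dep * loss_dep


# Weights taken from the training example in this README; the loss values
# are made-up numbers used only to exercise the function.
combined = joint_loss(1.0, 2.0, 3.0, lambda_pos=0.4, lambda_ner=0.2, lambda_dep=0.4)
```

Checking the weights before launching a long training run is a cheap way to catch a mistyped value on the command line.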

Example:

cd phonlp/models
python3 run_phonlp.py --mode train --save_dir ./phonlp_tmp \
	--pretrained_lm "vinai/phobert-base" \
	--lr 1e-5 --batch_size 32 --num_epoch 40 \
	--lambda_pos 0.4 --lambda_ner 0.2 --lambda_dep 0.4 \
	--train_file_pos ../sample_data/pos_train.txt --eval_file_pos ../sample_data/pos_valid.txt \
	--train_file_ner ../sample_data/ner_train.txt --eval_file_ner ../sample_data/ner_valid.txt \
	--train_file_dep ../sample_data/dep_train.conll --eval_file_dep ../sample_data/dep_valid.conll

Evaluation

cd phonlp/models
python3 run_phonlp.py --mode eval --save_dir MODEL_FOLDER_PATH \
	--batch_size INT_VALUE \
	--eval_file_pos PATH_TO_POS_TEST_FILE \
	--eval_file_ner PATH_TO_NER_TEST_FILE \
	--eval_file_dep PATH_TO_DEP_TEST_FILE

Example:

cd phonlp/models
python3 run_phonlp.py --mode eval --save_dir ./phonlp_tmp \
	--batch_size 8 \
	--eval_file_pos ../sample_data/pos_test.txt \
	--eval_file_ner ../sample_data/ner_test.txt \
	--eval_file_dep ../sample_data/dep_test.conll 

Annotate a corpus

cd phonlp/models
python3 run_phonlp.py --mode annotate --save_dir MODEL_FOLDER_PATH \
	--batch_size INT_VALUE \
	--input_file PATH_TO_INPUT_FILE \
	--output_file PATH_TO_OUTPUT_FILE

Example:

cd phonlp/models
python3 run_phonlp.py --mode annotate --save_dir ./phonlp_tmp \
	--batch_size 8 \
	--input_file ../sample_data/input.txt \
	--output_file ../sample_data/output.txt 

The pre-trained PhoNLP model for Vietnamese is available HERE!

Usage example: Python API

import phonlp
# Automatically download the pre-trained PhoNLP model 
# and save it in a local machine folder
phonlp.download(save_dir='./pretrained_phonlp')
# Load the pre-trained PhoNLP model
model = phonlp.load(save_dir='./pretrained_phonlp')
# Annotate a corpus where each line represents a word-segmented sentence
model.annotate(input_file='input.txt', output_file='output.txt')
# Annotate a word-segmented sentence
model.print_out(model.annotate(text="Tôi đang làm_việc tại VinAI ."))

By default, the output for each input sentence is formatted with 6 columns representing word index, word form, POS tag, NER label, head index of the current word and its dependency relation type:

1	Tôi	P	O	3	sub
2	đang	R	O	3	adv
3	làm_việc	V	O	0	root
4	tại	E	O	3	loc
5	VinAI	Np	B-ORG	4	prob
6	.	CH	O	3	punct
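For downstream processing, the 6-column output can be read back into structured records. The following is a minimal sketch, assuming the columns are separated by whitespace and that no column contains internal whitespace (word-segmented forms such as làm_việc use underscores). The `Token` type and `parse_annotation` function are hypothetical helpers, not part of the PhoNLP API.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    index: int    # word index (1-based)
    form: str     # word form
    pos: str      # POS tag
    ner: str      # NER label
    head: int     # head index of the current word (0 for the root)
    deprel: str   # dependency relation type


def parse_annotation(text: str) -> List[Token]:
    """Parse 6-column annotation lines into Token records."""
    tokens = []
    for line in text.splitlines():
        cols = line.split()  # tolerant of stray trailing tabs or spaces
        if not cols:
            continue  # skip blank lines
        if len(cols) != 6:
            raise ValueError(f"expected 6 columns, got {len(cols)}: {line!r}")
        idx, form, pos, ner, head, deprel = cols
        tokens.append(Token(int(idx), form, pos, ner, int(head), deprel))
    return tokens


sample = (
    "1\tTôi\tP\tO\t3\tsub\n"
    "2\tđang\tR\tO\t3\tadv\n"
    "3\tlàm_việc\tV\tO\t0\troot\n"
    "4\ttại\tE\tO\t3\tloc\n"
    "5\tVinAI\tNp\tB-ORG\t4\tprob\n"
    "6\t.\tCH\tO\t3\tpunct\n"
)
tokens = parse_annotation(sample)
root = next(t for t in tokens if t.head == 0)          # the token whose head index is 0
entities = [t.form for t in tokens if t.ner != "O"]    # forms carrying a non-O NER label
```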

In addition, the output can be formatted following the 10-column CoNLL format, where the last column holds the NER predictions, by passing output_type='conll' to model.annotate(). The batch_size parameter of model.annotate() can also be raised from its default of 1 (batch_size=1) to fit your computer's memory; a larger batch_size generally speeds up annotation.
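The two options can be combined in a single call. The sketch below wraps that call in a small helper; `annotate_corpus_conll` is a hypothetical name, and the `model` argument is assumed to be the object returned by phonlp.load(). Only the keyword arguments described in this README (input_file, output_file, output_type, batch_size) are used.

```python
def annotate_corpus_conll(model, input_file, output_file, batch_size=8):
    """Annotate a corpus file and write 10-column CoNLL output.

    `model` is assumed to expose the annotate() method described above.
    The default batch_size of 8 is an arbitrary illustrative value.
    """
    model.annotate(
        input_file=input_file,
        output_file=output_file,
        output_type="conll",
        batch_size=batch_size,
    )


# Intended usage (requires phonlp and a downloaded pre-trained model):
#   import phonlp
#   model = phonlp.load(save_dir='./pretrained_phonlp')
#   annotate_corpus_conll(model, 'input.txt', 'output.txt', batch_size=8)
```

If annotation runs out of memory, lowering batch_size is the first thing to try.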
