Th2En & Th2Zh: The large-scale datasets for Thai text cross-lingual summarization

Overview

Th2En & Th2Zh: The large-scale datasets for Thai text cross-lingual summarization

📥 Download Datasets
📥 Download Trained Models

INTRODUCTION

TH2ZH (Thai-to-Simplified Chinese) and TH2EN (Thai-to-English) are cross-lingual summarization (CLS) datasets. The source articles of these datasets are from TR-TPBS dataset, a monolingual Thai text summarization dataset. To create CLS dataset out of TR-TPBS, we used a neural machine translation service to translate articles into target languages. For some reasons, we were strongly recommended not to mention the name of the service that we used 🥺 . We will refer to the service we used as ‘main translation service’.

Cross-lingual summarization (cross-sum) is a task to summarize a given document written in one language to another language short summary.

crosslingual summarization

Traditional cross-sum approaches are based on two techniques namely early translation technique and late translation technique. Early translation can be explained easily as translate-then-summarize method. Late translation, in reverse, is summarize-then-translate method.

However, classical cross-sum methods tend to carry errors from monolingual summarization process or translation process to final cross-language output summary. Several end-to-end approaches have been proposed to tackle problems of traditional ones. Couple of end-to-end models are available to download as well.

DATASET CONSTRUCTION

💡 Important Note In contrast to Zhu, et al, in our experiment, we found that filtering out articles using RTT technique worsen the overall performance of the end-to-end models significantly. Therefore, full datasets are highly recommended.

We used TR-TPBS as source documents for creating cross-lingual summarization dataset. In the same way as Zhu, et al., we constructed Th2En and Th2Zh by translating the summary references into target languages using translation service and filtered out those poorly-translated summaries using round-trip translation technique (RTT). The overview of cross-lingual summarization dataset construction is presented in belowe figure. Please refer to the corresponding paper for more details on RTT.

crosslingual summarization In our experiment, we set 𝑇1 and 𝑇2 equal to 0.45 and 0.2 respectively, backtranslation technique filtered out 27.98% from Th2En and 56.79% documents from Th2Zh.

python3 src/tools/cls_dataset_construction.py \
--dataset th2en \
--input_csv path/to/full_dataset.csv \
--output_csv path/to/save/filtered_csv \
--r1 0.45 \
--r2 0.2
  • --dataset can be {th2en, th2zh}.
  • --r1 and --r2 are where you can set ROUGE score thresholds (r1 and r2 represent ROUGE-1 and ROUGE-2 respectively) for filtering (assumingly) poor translated articles.

Dataset Statistic

Click the file name to download.

File Number of Articles Size
th2en_full.csv 310,926 2.96 GB
th2zh_full.csv 310,926 2.81 GB
testset.csv 3,000 44 MB
validation.csv 3,000 43 MB

Data Fields

Please refer to th2enzh_data_exploration.ipynb for more details.

Column Description
th_body Original Thai body text
th_sum Original Thai summary
th_title Original Thai Article headline
{en/zh}_body Translated body text
{en/zh}_sum Translated summary
{en/zh}_title Translated article's headline
{en/zh}2th Back translation of{en/zh}_body
{en/zh}_gg_sum Translated summary (by Google Translation)
url URL to original article’s webpage
  • {th/en/zh}_title are only available in test set.
  • {en/zh}_gg_sum are also only available in test set. We (at the time this experiment took place) assumed that Google translation was better than the main translation service we were using. We intended to use these Google translated summaries as some kind of alternative summary references, but in the end, they never been used. We decided to make them available in the test set anyway, just in case the others find them useful.
  • {en/zh}_body were not presented during training end-to-end models. They were used only in early translation methods.

AVAILABLE TRAINED MODELS

Model Corresponding Paper Thai -> English Thai -> Simplified Chinese
Full Filtered Full Filtered
TNCLS Zhu et al., 2019 - Available - -
CLS+MS Zhu et al., 2019 Available - - -
CLS+MT Zhu et al., 2019 Available - Available -
XLS – RL-ROUGE Dou et al., 2020 Available - Available -

To evaluate these trained models, please refer to xls_model_evaluation.ipynb and ncls_model_evaluation.ipynb.

If you wish to evaluate the models with our test sets, you can use below script to create test files for XLS and NCLS models.

python3 src/tools/create_cls_test_manifest.py \
--test_csv_path path/to/testset.csv \
--output_dir path/to/save/testset_files \
--use_google_sum {true/false} \
--max_tokens 500 \
--create_ms_ref {true/false}
  • output_dir is path to directory that you want to save test set files
  • use_google_sum can be {true/false}. If true, it will select summary reference from columns {en/zh}_gg_sum. Default is false.
  • max_tokens number of maximum words in input articles. Default is 500 words. Too short or too long articles can significantly worsen performance of the models.
  • create_ms_ref whether to create Thai summary reference file to evaluate MS task in NCLS:CLS+MS model.

This script will produce three files namely test.CLS.source.thai.txt and test.CLS.target.{en/zh}.txt. test.CLS.source.thai.txt is used as a test file for cls task. test.CLS.target.{en/zh}.txt are the crosslingual summary reference for English and Chinese, they are used to evaluate ROUGE and BertScore. Each line is corresponding to the body articles in test.CLS.source.thai.txt.

🥳 We also evaluated MT tasks in XLS and NCLS:CLS+MT models. Please refers to xls_model_evaluation.ipynb and ncls_model_evaluation.ipynb for BLUE score results . For test sets that we used to evaluate MT task, please refer to data/README.md.

EXPERIMENT RESULTS

🔆 It has to be noted that all of end-to-end models reported in this section were trained on filtered datasets NOT full datasets. And for all end-to-end models, only `th_body` and `{en/zh}_sum` were present during training. We trained end-to-end models for 1,000,000 steps and selected model checkpoints that yielded the highest overall ROUGE scores to report the experiment.

In this experiment, we used two automatic evaluation matrices namely ROUGE and BertScore to assess the performance of CLS models. We evaluated ROUGE on Chinese text at word-level, NOT character level.

We only reported BertScore on abstractive summarization models. To evaluate the results with BertScore we used weights from ‘roberta-large’ and ‘bert-base-chinese’ pretrained models for Th2En and Th2Zh respectively.

Model Thai to English Thai to Chinese
ROUGE BertScore ROUGE BertScore
R1 R2 RL F1 R1 R2 RL F1
Traditional Approaches
Translated Headline 23.44 6.99 21.49 - 21.55 4.66 18.58 -
ETrans → LEAD2 51.96 42.15 50.01 - 44.18 18.83 43.84 -
ETrans → BertSumExt 51.85 38.09 49.50 - 34.58 14.98 34.84 -
ETrans → BertSumExtAbs 52.63 32.19 48.14 88.18 35.63 16.02 35.36 70.42
BertSumExt → LTrans 42.33 27.33 34.85 - 28.11 18.85 27.46 -
End-to-End Training Approaches
TNCLS 26.48 6.65 21.66 85.03 27.09 6.69 21.99 63.72
CLS+MS 32.28 15.21 34.68 87.22 34.34 12.23 28.80 67.39
CLS+MT 42.85 19.47 39.48 88.06 42.48 19.10 37.73 71.01
XLS – RL-ROUGE 42.82 19.62 39.53 88.03 43.20 19.19 38.52 72.19

LICENSE

Thai crosslingual summarization datasets including TH2EN, TH2ZH, test and validation set are licensed under MIT License.

ACKNOWLEDGEMENT

  • These cross-lingual datasets and the experiments are parts of Nakhun Chumpolsathien ’s master’s thesis at school of computer science, Beijing Institute of Technology. Therefore, as well, a great appreciation goes to his supervisor, Assoc. Prof. Gao Yang.
  • Shout out to Tanachat Arayachutinan for the initial data processing and for introducing me 麻辣烫, 黄焖鸡.
  • We would like to thank Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications for providing computing resources to conduct the experiment.
  • In this experiment, we used PyThaiNLP v. 2.2.4 to tokenize (on both word & sentence levels) Thai texts. For Chinese and English segmentation, we used Stanza.
Owner
Nakhun Chumpolsathien
I thought it was fun.
Nakhun Chumpolsathien
Maix Speech AI lib, including ASR, chat, TTS etc.

Maix-Speech 中文 | English Brief Now only support Chinese, See 中文 Build Clone code by: git clone https://github.com/sipeed/Maix-Speech Compile x86x64 c

Sipeed 267 Dec 25, 2022
Nateve compiler developed with python.

Adam Adam is a Nateve Programming Language compiler developed using Python. Nateve Nateve is a new general domain programming language open source ins

Nateve 7 Jan 15, 2022
Deal or No Deal? End-to-End Learning for Negotiation Dialogues

Introduction This is a PyTorch implementation of the following research papers: (1) Hierarchical Text Generation and Planning for Strategic Dialogue (

Facebook Research 1.4k Dec 29, 2022
COVID-19 Chatbot with Rasa 2.0: open source conversational AI

COVID-19 chatbot implementation with Rasa open source 2.0, conversational AI framework.

Aazim Parwaz 1 Dec 23, 2022
Switch spaces for knowledge graph embeddings

SwisE Switch spaces for knowledge graph embeddings. Requirements: python3 pytorch numpy tqdm Reproduce the results To reproduce the reported results,

Shuai Zhang 4 Dec 01, 2021
Ongoing research training transformer language models at scale, including: BERT & GPT-2

What is this fork of Megatron-LM and Megatron-DeepSpeed This is a detached fork of https://github.com/microsoft/Megatron-DeepSpeed, which in itself is

BigScience Workshop 316 Jan 03, 2023
This repo is to provide a list of literature regarding Deep Learning on Graphs for NLP

This repo is to provide a list of literature regarding Deep Learning on Graphs for NLP

Graph4AI 230 Nov 22, 2022
TextFlint is a multilingual robustness evaluation platform for natural language processing tasks,

TextFlint is a multilingual robustness evaluation platform for natural language processing tasks, which unifies general text transformation, task-specific transformation, adversarial attack, sub-popu

TextFlint 587 Dec 20, 2022
OpenChat: Opensource chatting framework for generative models

OpenChat is opensource chatting framework for generative models.

Hyunwoong Ko 427 Jan 06, 2023
KakaoBrain KoGPT (Korean Generative Pre-trained Transformer)

KoGPT KoGPT (Korean Generative Pre-trained Transformer) https://github.com/kakaobrain/kogpt https://huggingface.co/kakaobrain/kogpt Model Descriptions

Kakao Brain 797 Dec 26, 2022
In this Notebook I've build some machine-learning and deep-learning to classify corona virus tweets, in both multi class classification and binary classification.

Hello, This Notebook Contains Example of Corona Virus Tweets Multi Class Classification. - Classes is: Extremely Positive, Positive, Extremely Negativ

Khaled Tofailieh 3 Dec 06, 2022
A PyTorch-based model pruning toolkit for pre-trained language models

English | 中文说明 TextPruner是一个为预训练语言模型设计的模型裁剪工具包,通过轻量、快速的裁剪方法对模型进行结构化剪枝,从而实现压缩模型体积、提升模型速度。 其他相关资源: 知识蒸馏工具TextBrewer:https://github.com/airaria/TextBrewe

Ziqing Yang 231 Jan 08, 2023
DeLighT: Very Deep and Light-Weight Transformers

DeLighT: Very Deep and Light-weight Transformers This repository contains the source code of our work on building efficient sequence models: DeFINE (I

Sachin Mehta 440 Dec 18, 2022
Minimal GUI for accessing the Watson Text to Speech service.

Description Minimal graphical application for accessing the Watson Text to Speech service. Requirements Python 3 plus all dependencies listed in requi

Moritz Maxeiner 1 Oct 22, 2021
A PyTorch implementation of VIOLET

VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling A PyTorch implementation of VIOLET Overview VIOLET is an implementati

Tsu-Jui Fu 119 Dec 30, 2022
text to speech toolkit. 好用的中文语音合成工具箱,包含语音编码器、语音合成器、声码器和可视化模块。

ttskit Text To Speech Toolkit: 语音合成工具箱。 安装 pip install -U ttskit 注意 可能需另外安装的依赖包:torch,版本要求torch=1.6.0,=1.7.1,根据自己的实际环境安装合适cuda或cpu版本的torch。 ttskit的

KDD 483 Jan 04, 2023
A PyTorch implementation of the WaveGlow: A Flow-based Generative Network for Speech Synthesis

WaveGlow A PyTorch implementation of the WaveGlow: A Flow-based Generative Network for Speech Synthesis Quick Start: Install requirements: pip install

Yuchao Zhang 204 Jul 14, 2022
PG-19 Language Modelling Benchmark

PG-19 Language Modelling Benchmark This repository contains the PG-19 language modeling benchmark. It includes a set of books extracted from the Proje

DeepMind 161 Oct 30, 2022
Japanese Long-Unit-Word Tokenizer with RemBertTokenizerFast of Transformers

Japanese-LUW-Tokenizer Japanese Long-Unit-Word (国語研長単位) Tokenizer for Transformers based on 青空文庫 Basic Usage from transformers import RemBertToken

Koichi Yasuoka 3 Dec 22, 2021
🗣️ NALP is a library that covers Natural Adversarial Language Processing.

NALP: Natural Adversarial Language Processing Welcome to NALP. Have you ever wanted to create natural text from raw sources? If yes, NALP is for you!

Gustavo Rosa 21 Aug 12, 2022