Th2En & Th2Zh: The large-scale datasets for Thai text cross-lingual summarization

Overview

Th2En & Th2Zh: The large-scale datasets for Thai text cross-lingual summarization

📥 Download Datasets
📥 Download Trained Models

INTRODUCTION

TH2ZH (Thai-to-Simplified Chinese) and TH2EN (Thai-to-English) are cross-lingual summarization (CLS) datasets. The source articles of these datasets are from TR-TPBS dataset, a monolingual Thai text summarization dataset. To create CLS dataset out of TR-TPBS, we used a neural machine translation service to translate articles into target languages. For some reasons, we were strongly recommended not to mention the name of the service that we used 🥺 . We will refer to the service we used as ‘main translation service’.

Cross-lingual summarization (cross-sum) is a task to summarize a given document written in one language to another language short summary.

crosslingual summarization

Traditional cross-sum approaches are based on two techniques namely early translation technique and late translation technique. Early translation can be explained easily as translate-then-summarize method. Late translation, in reverse, is summarize-then-translate method.

However, classical cross-sum methods tend to carry errors from monolingual summarization process or translation process to final cross-language output summary. Several end-to-end approaches have been proposed to tackle problems of traditional ones. Couple of end-to-end models are available to download as well.

DATASET CONSTRUCTION

💡 Important Note In contrast to Zhu, et al, in our experiment, we found that filtering out articles using RTT technique worsen the overall performance of the end-to-end models significantly. Therefore, full datasets are highly recommended.

We used TR-TPBS as source documents for creating cross-lingual summarization dataset. In the same way as Zhu, et al., we constructed Th2En and Th2Zh by translating the summary references into target languages using translation service and filtered out those poorly-translated summaries using round-trip translation technique (RTT). The overview of cross-lingual summarization dataset construction is presented in belowe figure. Please refer to the corresponding paper for more details on RTT.

crosslingual summarization In our experiment, we set 𝑇1 and 𝑇2 equal to 0.45 and 0.2 respectively, backtranslation technique filtered out 27.98% from Th2En and 56.79% documents from Th2Zh.

python3 src/tools/cls_dataset_construction.py \
--dataset th2en \
--input_csv path/to/full_dataset.csv \
--output_csv path/to/save/filtered_csv \
--r1 0.45 \
--r2 0.2
  • --dataset can be {th2en, th2zh}.
  • --r1 and --r2 are where you can set ROUGE score thresholds (r1 and r2 represent ROUGE-1 and ROUGE-2 respectively) for filtering (assumingly) poor translated articles.

Dataset Statistic

Click the file name to download.

File Number of Articles Size
th2en_full.csv 310,926 2.96 GB
th2zh_full.csv 310,926 2.81 GB
testset.csv 3,000 44 MB
validation.csv 3,000 43 MB

Data Fields

Please refer to th2enzh_data_exploration.ipynb for more details.

Column Description
th_body Original Thai body text
th_sum Original Thai summary
th_title Original Thai Article headline
{en/zh}_body Translated body text
{en/zh}_sum Translated summary
{en/zh}_title Translated article's headline
{en/zh}2th Back translation of{en/zh}_body
{en/zh}_gg_sum Translated summary (by Google Translation)
url URL to original article’s webpage
  • {th/en/zh}_title are only available in test set.
  • {en/zh}_gg_sum are also only available in test set. We (at the time this experiment took place) assumed that Google translation was better than the main translation service we were using. We intended to use these Google translated summaries as some kind of alternative summary references, but in the end, they never been used. We decided to make them available in the test set anyway, just in case the others find them useful.
  • {en/zh}_body were not presented during training end-to-end models. They were used only in early translation methods.

AVAILABLE TRAINED MODELS

Model Corresponding Paper Thai -> English Thai -> Simplified Chinese
Full Filtered Full Filtered
TNCLS Zhu et al., 2019 - Available - -
CLS+MS Zhu et al., 2019 Available - - -
CLS+MT Zhu et al., 2019 Available - Available -
XLS – RL-ROUGE Dou et al., 2020 Available - Available -

To evaluate these trained models, please refer to xls_model_evaluation.ipynb and ncls_model_evaluation.ipynb.

If you wish to evaluate the models with our test sets, you can use below script to create test files for XLS and NCLS models.

python3 src/tools/create_cls_test_manifest.py \
--test_csv_path path/to/testset.csv \
--output_dir path/to/save/testset_files \
--use_google_sum {true/false} \
--max_tokens 500 \
--create_ms_ref {true/false}
  • output_dir is path to directory that you want to save test set files
  • use_google_sum can be {true/false}. If true, it will select summary reference from columns {en/zh}_gg_sum. Default is false.
  • max_tokens number of maximum words in input articles. Default is 500 words. Too short or too long articles can significantly worsen performance of the models.
  • create_ms_ref whether to create Thai summary reference file to evaluate MS task in NCLS:CLS+MS model.

This script will produce three files namely test.CLS.source.thai.txt and test.CLS.target.{en/zh}.txt. test.CLS.source.thai.txt is used as a test file for cls task. test.CLS.target.{en/zh}.txt are the crosslingual summary reference for English and Chinese, they are used to evaluate ROUGE and BertScore. Each line is corresponding to the body articles in test.CLS.source.thai.txt.

🥳 We also evaluated MT tasks in XLS and NCLS:CLS+MT models. Please refers to xls_model_evaluation.ipynb and ncls_model_evaluation.ipynb for BLUE score results . For test sets that we used to evaluate MT task, please refer to data/README.md.

EXPERIMENT RESULTS

🔆 It has to be noted that all of end-to-end models reported in this section were trained on filtered datasets NOT full datasets. And for all end-to-end models, only `th_body` and `{en/zh}_sum` were present during training. We trained end-to-end models for 1,000,000 steps and selected model checkpoints that yielded the highest overall ROUGE scores to report the experiment.

In this experiment, we used two automatic evaluation matrices namely ROUGE and BertScore to assess the performance of CLS models. We evaluated ROUGE on Chinese text at word-level, NOT character level.

We only reported BertScore on abstractive summarization models. To evaluate the results with BertScore we used weights from ‘roberta-large’ and ‘bert-base-chinese’ pretrained models for Th2En and Th2Zh respectively.

Model Thai to English Thai to Chinese
ROUGE BertScore ROUGE BertScore
R1 R2 RL F1 R1 R2 RL F1
Traditional Approaches
Translated Headline 23.44 6.99 21.49 - 21.55 4.66 18.58 -
ETrans → LEAD2 51.96 42.15 50.01 - 44.18 18.83 43.84 -
ETrans → BertSumExt 51.85 38.09 49.50 - 34.58 14.98 34.84 -
ETrans → BertSumExtAbs 52.63 32.19 48.14 88.18 35.63 16.02 35.36 70.42
BertSumExt → LTrans 42.33 27.33 34.85 - 28.11 18.85 27.46 -
End-to-End Training Approaches
TNCLS 26.48 6.65 21.66 85.03 27.09 6.69 21.99 63.72
CLS+MS 32.28 15.21 34.68 87.22 34.34 12.23 28.80 67.39
CLS+MT 42.85 19.47 39.48 88.06 42.48 19.10 37.73 71.01
XLS – RL-ROUGE 42.82 19.62 39.53 88.03 43.20 19.19 38.52 72.19

LICENSE

Thai crosslingual summarization datasets including TH2EN, TH2ZH, test and validation set are licensed under MIT License.

ACKNOWLEDGEMENT

  • These cross-lingual datasets and the experiments are parts of Nakhun Chumpolsathien ’s master’s thesis at school of computer science, Beijing Institute of Technology. Therefore, as well, a great appreciation goes to his supervisor, Assoc. Prof. Gao Yang.
  • Shout out to Tanachat Arayachutinan for the initial data processing and for introducing me 麻辣烫, 黄焖鸡.
  • We would like to thank Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications for providing computing resources to conduct the experiment.
  • In this experiment, we used PyThaiNLP v. 2.2.4 to tokenize (on both word & sentence levels) Thai texts. For Chinese and English segmentation, we used Stanza.
Owner
Nakhun Chumpolsathien
I thought it was fun.
Nakhun Chumpolsathien
DataCLUE: 国内首个以数据为中心的AI测评(含模型分析报告)

DataCLUE 以数据为中心的AI测评(DataCLUE) DataCLUE: A Chinese Data-centric Language Evaluation Benchmark 内容导引 章节 描述 简介 介绍以数据为中心的AI测评(DataCLUE)的背景 任务描述 任务描述 实验结果

CLUE benchmark 135 Dec 22, 2022
Pretty-doc - Composable text objects with python

pretty-doc from __future__ import annotations from dataclasses import dataclass

Taine Zhao 2 Jan 17, 2022
Study German declensions (dER nettE Mann, ein nettER Mann, mit dEM nettEN Mann, ohne dEN nettEN Mann ...) Generate as many exercises as you want using the incredible power of SPACY!

Study German declensions (dER nettE Mann, ein nettER Mann, mit dEM nettEN Mann, ohne dEN nettEN Mann ...) Generate as many exercises as you want using the incredible power of SPACY!

Hans Alemão 4 Jul 20, 2022
Just a Basic like Language for Zeno INC

zeno-basic-language Just a Basic like Language for Zeno INC This is written in 100% python. this is basic language like language. so its not for big p

Voidy Devleoper 1 Dec 18, 2021
A framework for evaluating Knowledge Graph Embedding Models in a fine-grained manner.

A framework for evaluating Knowledge Graph Embedding Models in a fine-grained manner.

NEC Laboratories Europe 13 Sep 08, 2022
Sequence-to-Sequence Framework in PyTorch

nmtpytorch allows training of various end-to-end neural architectures including but not limited to neural machine translation, image captioning and au

LIUM 395 Nov 21, 2022
What are the best Systems? New Perspectives on NLP Benchmarking

What are the best Systems? New Perspectives on NLP Benchmarking In Machine Learning, a benchmark refers to an ensemble of datasets associated with one

Pierre Colombo 12 Nov 03, 2022
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

English | 简体中文 | 繁體中文 | 한국어 State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow 🤗 Transformers provides thousands of pretrained models

Hugging Face 77.1k Dec 31, 2022
Speech Recognition for Uyghur using Speech transformer

Speech Recognition for Uyghur using Speech transformer Training: this model using CTC loss and Cross Entropy loss for training. Download pretrained mo

Uyghur 11 Nov 17, 2022
txtai: Build AI-powered semantic search applications in Go

txtai: Build AI-powered semantic search applications in Go txtai executes machine-learning workflows to transform data and build AI-powered semantic s

NeuML 49 Dec 06, 2022
Understand Text Summarization and create your own summarizer in python

Automatic summarization is the process of shortening a text document with software, in order to create a summary with the major points of the original document. Technologies that can make a coherent

Sreekanth M 1 Oct 18, 2022
Some embedding layer implementation using ivy library

ivy-manual-embeddings Some embedding layer implementation using ivy library. Just for fun. It is based on NYCTaxiFare dataset from kaggle (cut down to

Ishtiaq Hussain 2 Feb 10, 2022
Official PyTorch implementation of "Dual Path Learning for Domain Adaptation of Semantic Segmentation".

Dual Path Learning for Domain Adaptation of Semantic Segmentation Official PyTorch implementation of "Dual Path Learning for Domain Adaptation of Sema

27 Dec 22, 2022
Unsupervised Abstract Reasoning for Raven’s Problem Matrices

Unsupervised Abstract Reasoning for Raven’s Problem Matrices This code is the implementation of our TIP paper. This is the first unsupervised abstract

Tao Zhuo 9 Dec 17, 2022
2021海华AI挑战赛·中文阅读理解·技术组·第三名

文字是人类用以记录和表达的最基本工具,也是信息传播的重要媒介。透过文字与符号,我们可以追寻人类文明的起源,可以传播知识与经验,读懂文字是认识与了解的第一步。对于人工智能而言,它的核心问题之一就是认知,而认知的核心则是语义理解。

21 Dec 26, 2022
PyTorch implementation of Tacotron speech synthesis model.

tacotron_pytorch PyTorch implementation of Tacotron speech synthesis model. Inspired from keithito/tacotron. Currently not as much good speech quality

Ryuichi Yamamoto 279 Dec 09, 2022
Code for Editing Factual Knowledge in Language Models

KnowledgeEditor Code for Editing Factual Knowledge in Language Models (https://arxiv.org/abs/2104.08164). @inproceedings{decao2021editing, title={Ed

Nicola De Cao 86 Nov 28, 2022
This github repo is for Neurips 2021 paper, NORESQA A Framework for Speech Quality Assessment using Non-Matching References.

NORESQA: Speech Quality Assessment using Non-Matching References This is a Pytorch implementation for using NORESQA. It contains minimal code to predi

Meta Research 36 Dec 08, 2022
A paper list for aspect based sentiment analysis.

Aspect-Based-Sentiment-Analysis A paper list for aspect based sentiment analysis. Survey [IEEE-TAC-20]: Issues and Challenges of Aspect-based Sentimen

jiangqn 419 Dec 20, 2022
Topic Modelling for Humans

gensim – Topic Modelling in Python Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Targ

RARE Technologies 13.8k Jan 02, 2023