LOT: A Benchmark for Evaluating Chinese Long Text Understanding and Generation

Overview

Introduction

LOT is a benchmark for evaluating Chinese long text modeling. LOT consists of two understanding tasks and two generation tasks. We construct new datasets for these tasks based on human-written Chinese stories.

Furthermore, we release LongLM, an encoder-decoder Chinese long text pretraining model with up to 1 billion parameters. We pretrain LongLM on 120G of Chinese novels with two generative tasks: text infilling and conditional continuation. Extensive experiments show that LongLM substantially outperforms pretrained models of similar size on both the understanding and generation tasks in LOT.

Tasks

LOT aggregates two understanding tasks, Cloze Test (ClozeT) and Sentence Position Prediction (SenPos), and two generation tasks, Plot Completion (PlotCom) and Outline-conditioned Generation (OutGen). We show the task descriptions in the table below.

Datasets

We show the data statistics in the table below. The abbreviations sent and len stand for sentence and length, respectively. The datasets and evaluation scripts can be downloaded from THUCloud.

LongLM

1. Parameters

  • $d_m$: the dimension of hidden states
  • $d_{ff}$: the dimension of feed forward layers
  • $d_{kv}$: the dimension of the keys/values in the self-attention layers
  • $n_h$: the number of attention heads
  • $n_e$: the number of hidden layers of the encoder
  • $n_d$: the number of hidden layers of the decoder
  • #P: the number of parameters
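
To make the relationship between these symbols concrete, the following is a rough sketch of how the attention and feed-forward weight counts of an encoder-decoder Transformer follow from them. It assumes a T5-style layout with a non-gated feed-forward block and ignores embeddings, layer norms, and position biases, so it underestimates #P. The function name and the example values are hypothetical and are not LongLM's actual configuration.

```python
def approx_transformer_params(d_m, d_ff, d_kv, n_h, n_e, n_d):
    """Approximate weight count of attention and feed-forward matrices only.

    Assumes a T5-style encoder-decoder with a non-gated feed-forward block.
    Embeddings, layer norms, and relative position biases are ignored.
    """
    inner = n_h * d_kv                 # total width of all attention heads
    attention = 4 * d_m * inner        # query, key, value, and output projections
    feed_forward = 2 * d_m * d_ff      # input and output projections
    encoder_layer = attention + feed_forward
    # A decoder layer has self-attention plus cross-attention.
    decoder_layer = 2 * attention + feed_forward
    return n_e * encoder_layer + n_d * decoder_layer


# Hypothetical values for illustration only (not taken from LongLM).
print(approx_transformer_params(d_m=512, d_ff=2048, d_kv=64, n_h=8, n_e=6, n_d=6))
```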

2. Pretraining Tasks
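
The two pretraining objectives named in the introduction, text infilling and conditional continuation, can be illustrated with a minimal sketch of how a source/target pair might be built for each. The helper names, the `<extra_id_i>` sentinel naming (borrowed from the T5 convention), the span positions, and the split ratio are all assumptions for illustration; the actual LongLM preprocessing may differ.

```python
def infilling_pair(tokens, spans):
    """Build a text-infilling example.

    `spans` is a sorted list of non-overlapping (start, end) index pairs.
    Each span is replaced by a sentinel in the source, and the target lists
    each sentinel followed by the tokens it hides.
    """
    source, target = [], []
    prev = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"   # assumed sentinel naming
        source.extend(tokens[prev:start])
        source.append(sentinel)
        target.append(sentinel)
        target.extend(tokens[start:end])
        prev = end
    source.extend(tokens[prev:])
    return source, target


def continuation_pair(tokens, split_ratio=0.5):
    """Build a conditional-continuation example: prefix as source, rest as target."""
    cut = int(len(tokens) * split_ratio)
    return tokens[:cut], tokens[cut:]


tokens = list("abcdefgh")
print(infilling_pair(tokens, [(1, 3), (5, 6)]))
print(continuation_pair(tokens))
```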

3. Pretraining Data

We collect 120G novels as the pretraining data for LongLM. The pretraining data will be publicly available soon.

4. Checkpoints

  1. Download: The checkpoints and example data can be downloaded from THUCloud. The training and generation scripts are under the directory longlm. You can also use the official script provided by Transformers to fine-tune the model.

  2. Model Loading:

    from transformers import T5Tokenizer, T5ForConditionalGeneration
    tokenizer = T5Tokenizer.from_pretrained('LongLM-large')
    model = T5ForConditionalGeneration.from_pretrained('LongLM-large')
    
    • Dependencies: torch=1.8.1, transformers=4.6.1
  3. Training:

    Execute bash ./finetune.sh to fine-tune LongLM. If deepspeed is available, you can execute bash ./finetune_deepspped.sh to accelerate.

    # Collect the options in a bash array so that each one can carry a comment.
    args=(
      --data_dir=./data                    # directory of data
      --train_name=train                   # file prefix of the training data
      --output_dir=./save_model            # output directory to save the checkpoint
      --save_total_limit=10                # maximum number of saved checkpoints
      --per_gpu_train_batch_size=3         # batch size for training
      --per_gpu_eval_batch_size=3          # batch size for evaluation
      --num_train_epochs=1                 # number of training epochs
      --logging_steps=5                    # number of steps between loss logs
      --model_name_or_path=./LongLM-small  # path to the pretrained model
      --warmup_steps=100                   # number of steps for warmup
      --learning_rate=1e-4                 # learning rate
      --n_val=100                          # number of examples for validation
      --do_train --do_eval                 # run training and validation
      --evaluation_strategy steps          # strategy of evaluation
      --gradient_accumulation_steps=40     # number of steps for gradient accumulation
      --overwrite_output_dir
      --load_best_model_at_end
    )
    env CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 CUDA_LAUNCH_BLOCKING=1 \
    python3 -m torch.distributed.launch --nproc_per_node=8 \
    finetune_trainer.py "${args[@]}"
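
As a quick sanity check on the settings above, the number of examples that contribute to each optimizer update is the product of the process count, the per-GPU batch size, and the gradient accumulation steps. The helper below is a hypothetical illustration of that arithmetic, not part of the released scripts.

```python
def effective_batch_size(num_gpus, per_gpu_batch_size, accumulation_steps):
    """Examples consumed per optimizer update under data-parallel training."""
    return num_gpus * per_gpu_batch_size * accumulation_steps


# Values taken from the command above: 8 processes, batch size 3, 40 accumulation steps.
print(effective_batch_size(8, 3, 40))  # 960
```

If you train on fewer GPUs, raising the accumulation steps by the same factor keeps this number unchanged.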
  4. Generation:

    import torch

    # Reuses `tokenizer` and `model` from the Model Loading step.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    input_ids = tokenizer("小咕噜对,", return_tensors="pt", padding=True, truncation=True, max_length=512).input_ids.to(device)

    gen = model.generate(input_ids, do_sample=True, decoder_start_token_id=1, top_p=0.9, max_length=512)
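
The `top_p=0.9` argument in the call above enables nucleus (top-p) sampling. The following pure-Python sketch shows the filtering idea on a toy distribution: keep the smallest set of most probable tokens whose cumulative probability reaches `top_p`, then renormalize. It is an illustration only, not the Transformers implementation, and the function name is hypothetical.

```python
def top_p_filter(probs, top_p=0.9):
    """Return {token_index: renormalized_prob} for the nucleus of `probs`."""
    # Visit token indices from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


# Toy distribution over four tokens; the least likely token is dropped.
print(top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9))
```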

Baselines

1. Understanding Tasks

The example data, training and evaluation scripts of LongLM are under the directory ./baselines/understanding. You can execute bash ./finetune.sh to fine-tune LongLM and execute bash ./eval.sh to evaluate the fine-tuned model.

2. Generation Tasks

The training script of LongLM for the generation tasks is the same as the pretraining script. The generation script and example data can be found under ./baselines/generation. You can execute bash ./gen.sh for generation.

Citation

@misc{guan2021lot,
      title={LOT: A Benchmark for Evaluating Chinese Long Text Understanding and Generation}, 
      author={Jian Guan and Zhuoer Feng and Yamei Chen and Ruilin He and Xiaoxi Mao and Changjie Fan and Minlie Huang},
      year={2021},
      eprint={2108.12960},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
Owner
Conversational AI groups from Tsinghua University