Code and data for ACL2021 paper Cross-Lingual Abstractive Summarization with Limited Parallel Resources.

Related tags

Deep LearningMCLAS
Overview

Multi-Task Framework for Cross-Lingual Abstractive Summarization (MCLAS)

The code for ACL2021 paper Cross-Lingual Abstractive Summarization with Limited Parallel Resources (Paper).

Some codes are borrowed from PreSumm (https://github.com/nlpyang/PreSumm).

Environments

Python version: This code is in Python3.7

Package Requirements: torch==1.1.0, transformers, tensorboardX, multiprocess, pyrouge

Needs few changes to be compatible with torch 1.4.0~1.8.0, mainly tensor type (bool) bugs.

Data Preparation

To improve training efficiency, we preprocessed concatenated dataset (with target "monolingual summary + [LSEP] + cross-lingual summary") and normal dataset (with target "cross-lingual summary") in advance.

You can build your own dataset or download our preprocessed dataset.

Download Preprocessed dataset.

  1. En2De dataset: Google Drive Link.
  2. En2EnDe (concatenated) dataset: Google Drive Link.
  3. En2Zh dataset: Google Drive Link.
  4. En2EnZh (concatenated) dataset: Google Drive Link.

PS: Our implementation filter some invalid samples (if the target of a sample is too short). Hence the number of the training samples may be smaller than what is reported in the paper.

Build Your Own Dataset.

Remain to be origanized. Some of the code needs to be debug, plz use it carefully.

Build tokenized files.

Plz refer to function tokenize_xgiga() or tokenize_new() in ./src/data_builder.py to write your code to preprocess your own training, validation, and test dataset. And then run the following commands:

python preprocess.py -mode tokenize_xgiga -raw_path PATH_TO_YOUR_RAW_DATA -save_path PATH_TO_YOUR_SAVE_PATH
  • Stanford CoreNLP needs to be installed.

Plz substitute tokenize_xgiga to your own process function.

In our case, we made the raw data directory as follows:

.
└── raw_directory
    ├── train
    |   ├── 1.story
    |   ├── 2.story
    |   ├── 3.story
    |   └── ...
    ├── test
    |   ├── 1.story
    |   ├── 2.story
    |   ├── 3.story
    |   └── ...
    └─ dev
        ├── 1.story
        ├── 2.story
        ├── 3.story
        └── ...

Correspondingly, the tokenized data directory is as follows

.
└── raw_directory
    ├── train
    |   ├── 1.story.json
    |   ├── 2.story.json
    |   ├── 3.story.json
    |   └── ...
    ├── test
    |   ├── 1.story.json
    |   ├── 2.story.json
    |   ├── 3.story.json
    |   └── ...
    └─ dev
        ├── 1.story.json
        ├── 2.story.json
        ├── 3.story.json
        └── ...

Build tokenized files to json files.

python preprocess.py -mode format_to_lines_new -raw_path RAW_PATH -save_path JSON_PATH -n_cpus 1 -use_bert_basic_tokenizer false -map_path MAP_PATH -shard_size 3000

Shard size is pretty important and needs to be selected carefully. This implementation use a shard as a base data unit for low-resource training. In our setting, the shard size of En2Zh, Zh2En, and En2De is 1.5k, 5k, and 3k, respectively.

Build json files to pytorch(pt) files.

python preprocess.py -mode format_to_bert_new -raw_path JSON_PATH -save_path BERT_DATA_PATH  -lower -n_cpus 1 -log_file ../logs/preprocess.log

Model Training

Full dataset scenario training

To train our model in full dataset scenario, plz use following command. Change the data path to switch the trained model between NCLS and MCLAS.

When using NCLS type datasets, arguement --multi_task enables training with NCLS+MS model.

 python train.py  \
 -task abs -mode train \
 -temp_dir ../tmp \
 -bert_data_path PATH_TO_DATA/ncls \  
 -dec_dropout 0.2  \
 -model_path ../model_abs_en2zh_noseg \
 -sep_optim true \
 -lr_bert 0.005 -lr_dec 0.2 \
 -save_checkpoint_steps 5000 \
 -batch_size 1300 \
 -train_steps 400000 \
 -report_every 50 -accum_count 5 \
 -use_bert_emb true -use_interval true \
 -warmup_steps_bert 20000 -warmup_steps_dec 10000 \
 -max_pos 512 -visible_gpus 0  -max_length 1000 -max_tgt_len 1000 \
 -log_file ../logs/abs_bert_en2zh  
 # --multi_task

Low-resource scenario training

Monolingual summarization pretraining

First we should train a monolingual summarization model using following commands:

You can change the trained model type using the same methods mentioned above (change dataset or --multi_task)

python train.py  \
-task abs -mode train \
-dec_dropout 0.2  \
-model_path ../model_abs_en2en_de/ \
-bert_data_path PATH_TO_DATA/xgiga.en \
-temp_dir ../tmp \
-sep_optim true \
-lr_bert 0.002 -lr_dec 0.2 \
-save_checkpoint_steps 2000 \
-batch_size 210 \
-train_steps 200000 \
-report_every 50 -accum_count 5 \
-use_bert_emb true -use_interval true \
-warmup_steps_bert 25000 -warmup_steps_dec 15000 \
-max_pos 512 -visible_gpus 0,1,2 -max_length 1000 -max_tgt_len 1000 \
-log_file ../logs/abs_bert_mono_enen_de \
--train_first  

# -train_from is used as continue training from certain training checkpoints.
# example:
# -train_from ../model_abs_en2en_de/model_step_70000.pt \

Low-resource scenario fine-tuning

After obtaining the monolingual model, we use it to initialize the low-resource models and continue training process.

Note:

-train_from should be omitted if you want to train a model without monolingual initialization.

--new_optim is necessary since we need to restart warm-up and learning rate decay during this process.

--few_shot controls whether to use limited resource to train the model. Meanwhile, '-few_shot_rate' controls the number of samples that you want to use. More specifically, the number of dataset's chunks.

For each scenario in our paper (using our preprocessed dataset), the few_shot_rate is set as 1, 5, and 10.

python train.py  \
-task abs -mode train \
-dec_dropout 0.2  \
-model_path ../model_abs_enende_fewshot1/ \
-train_from ../model_abs_en2en_de/model_step_50000.pt \
-bert_data_path PATH_TO_YOUR_DATA/xgiga.en \
-temp_dir ../tmp \
-sep_optim true \
-lr_bert 0.002 -lr_dec 0.2 \
-save_checkpoint_steps 1000 \
-batch_size 270 \
-train_steps 10000 \
-report_every 50 -accum_count 5 \
-use_bert_emb true -use_interval true \
-warmup_steps_bert 25000 -warmup_steps_dec 15000 \
-max_pos 512 -visible_gpus 0,2,3 -max_length 1000 -max_tgt_len 1000 \
-log_file ../logs/abs_bert_enende_fewshot1 \
--few_shot -few_shot_rate 1 --new_optim

Model Evaluation

To evaluate a model, use a command as follows:

python train.py -task abs \
-mode validate \
-batch_size 5 \
-test_batch_size 5 \
-temp_dir ../tmp \
-bert_data_path PATH_TO_YOUR_DATA/xgiga.en \
-log_file ../results/val_abs_bert_enende_fewshot1_noinit \
-model_path ../model_abs_enende_fewshot1_noinit -sep_optim true \
-use_interval true -visible_gpus 1 \
-max_pos 512 -max_length 150 \
-alpha 0.95 -min_length 20 \
-max_tgt_len 1000 \
-result_path ../logs/abs_bert_enende_fewshot1 -test_all \
--predict_2language

If you are not evaluating a MCLAS model, plz remove --predict_2language.

If you are predicting Chinese summaries, plz add --predict_chinese to the command.

If you are evaluating a NCLS+MS model, plz add --multi_task to the command.

Using following two commands will slightly improve all models' performance.

--language_limit means that the predictor will only predict words appearing in summaries of training data.

--tgt_mask is a list, recording all the words appearing in summaries of the training set. We provided chiniese and english dict in ./src directory .

Other Notable Commands

Plz ignore these arguments, these command were added and abandoned when trying new ideas¸ I will delete these related code in the future.

  • --sep_decoder
  • --few_sep_decoder
  • --tgt_seg
  • --few_sep_decoder
  • -bart

Besides, --batch_verification is used to debug, printing all the attributes in a training batch.

Owner
Yu Bai
https://ybai-nlp.github.io/
Yu Bai
Wordle-solver - Wordle answer generation program in python

🟨 Wordle Solver 🟩 Wordle answer generation program in python ✔️ Requirements U

Dahyun Kang 4 May 28, 2022
magiCARP: Contrastive Authoring+Reviewing Pretraining

magiCARP: Contrastive Authoring+Reviewing Pretraining Welcome to the magiCARP API, the test bed used by EleutherAI for performing text/text bi-encoder

EleutherAI 43 Dec 29, 2022
Efficient semidefinite bounds for multi-label discrete graphical models.

Low rank solvers #################################### benchmark/ : folder with the random instances used in the paper. ############################

1 Dec 08, 2022
Pocsploit is a lightweight, flexible and novel open source poc verification framework

Pocsploit is a lightweight, flexible and novel open source poc verification framework

cckuailong 208 Dec 24, 2022
Unsupervised Image to Image Translation with Generative Adversarial Networks

Unsupervised Image to Image Translation with Generative Adversarial Networks Paper: Unsupervised Image to Image Translation with Generative Adversaria

Hao 71 Oct 30, 2022
PyTorch implementation of "ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context" (INTERSPEECH 2020)

ContextNet ContextNet has CNN-RNN-transducer architecture and features a fully convolutional encoder that incorporates global context information into

Sangchun Ha 24 Nov 24, 2022
GUPNet - Geometry Uncertainty Projection Network for Monocular 3D Object Detection

GUPNet This is the official implementation of "Geometry Uncertainty Projection Network for Monocular 3D Object Detection". citation If you find our wo

Yan Lu 103 Dec 28, 2022
Code for the paper "Graph Attention Tracking". (CVPR2021)

SiamGAT 1. Environment setup This code has been tested on Ubuntu 16.04, Python 3.5, Pytorch 1.2.0, CUDA 9.0. Please install related libraries before r

122 Dec 24, 2022
Everything you want about DP-Based Federated Learning, including Papers and Code. (Mechanism: Laplace or Gaussian, Dataset: femnist, shakespeare, mnist, cifar-10 and fashion-mnist. )

Differential Privacy (DP) Based Federated Learning (FL) Everything about DP-based FL you need is here. (所有你需要的DP-based FL的信息都在这里) Code Tip: the code o

wenzhu 83 Dec 24, 2022
DECA: Detailed Expression Capture and Animation (SIGGRAPH 2021)

DECA: Detailed Expression Capture and Animation (SIGGRAPH2021) input image, aligned reconstruction, animation with various poses & expressions This is

Yao Feng 1.5k Jan 02, 2023
[ICCV2021] Official code for "Channel-wise Topology Refinement Graph Convolution for Skeleton-Based Action Recognition"

CTR-GCN This repo is the official implementation for Channel-wise Topology Refinement Graph Convolution for Skeleton-Based Action Recognition. The pap

Yuxin Chen 148 Dec 16, 2022
Price-Prediction-For-a-Dream-Home - A machine learning based linear regression trained model for house price prediction.

Price-Prediction-For-a-Dream-Home ROADMAP TO THIS LINEAR REGRESSION BASED HOUSE PRICE PREDICTION PREDICTION MODEL Import all the dependencies of the p

DIKSHA DESWAL 1 Dec 29, 2021
Unconstrained Text Detection with Box Supervisionand Dynamic Self-Training

SelfText Beyond Polygon: Unconstrained Text Detection with Box Supervisionand Dynamic Self-Training Introduction This is a PyTorch implementation of "

weijiawu 34 Nov 09, 2022
Gym environments used in the paper: "Developmental Reinforcement Learning of Control Policy of a Quadcopter UAV with Thrust Vectoring Rotors"

gym_multirotor Gym to train reinforcement learning agents on UAV platforms Quadrotor Tiltrotor Requirements This package has been tested on Ubuntu 18.

Aditya M. Deshpande 19 Dec 29, 2022
Python based framework for Automatic AI for Regression and Classification over numerical data.

Python based framework for Automatic AI for Regression and Classification over numerical data. Performs model search, hyper-parameter tuning, and high-quality Jupyter Notebook code generation.

BlobCity, Inc 141 Dec 21, 2022
[NeurIPS-2021] Mosaicking to Distill: Knowledge Distillation from Out-of-Domain Data

MosaicKD Code for NeurIPS-21 paper "Mosaicking to Distill: Knowledge Distillation from Out-of-Domain Data" 1. Motivation Natural images share common l

ZJU-VIPA 37 Nov 10, 2022
TensorFlow-based neural network library

Sonnet Documentation | Examples Sonnet is a library built on top of TensorFlow 2 designed to provide simple, composable abstractions for machine learn

DeepMind 9.5k Jan 07, 2023
Source code for paper "ATP: AMRize Than Parse! Enhancing AMR Parsing with PseudoAMRs" @NAACL-2022

ATP: AMRize Then Parse! Enhancing AMR Parsing with PseudoAMRs Hi this is the source code of our paper "ATP: AMRize Then Parse! Enhancing AMR Parsing w

Chen Liang 13 Nov 23, 2022
AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty

AugMix Introduction We propose AugMix, a data processing technique that mixes augmented images and enforces consistent embeddings of the augmented ima

Google Research 876 Dec 17, 2022
Ensembling Off-the-shelf Models for GAN Training

Data-Efficient GANs with DiffAugment project | paper | datasets | video | slides Generated using only 100 images of Obama, grumpy cats, pandas, the Br

MIT HAN Lab 1.2k Dec 26, 2022