Code for our EMNLP 2021 paper “Heterogeneous Graph Neural Networks for Keyphrase Generation”

Last update: Nov 24, 2022

Overview

GATER

This repository contains the code for our EMNLP 2021 paper “Heterogeneous Graph Neural Networks for Keyphrase Generation”.

Our implementation is built on the source code from keyphrase-generation-rl and fastNLP. Thanks for their work.

If you use this code, please cite our paper:

@inproceedings{ye2021heterogeneous,
  title={Heterogeneous Graph Neural Networks for Keyphrase Generation},
  author={Ye, Jiacheng and Cai, Ruijian and Gui, Tao and Zhang, Qi},
  booktitle={Proceedings of EMNLP},
  year={2021}
}

Dependency

python 3.5+
pytorch 1.0+
dgl 0.4.3
sentence_transformers 1.1.0
faiss 1.6.3

Dataset

The datasets can be downloaded from here, which are the tokenized version of the datasets provided by Ken Chen:

The testsets directory contains the five datasets for testing (i.e., inspec, krapivin, nus, and semeval and kp20k), where each of the datasets contains test_src.txt and test_trg.txt.
The kp20k_sorted directory contains the training and validation files (i.e., train_src.txt, train_trg.txt, valid_src.txt and valid_trg.txt).
Each line of the *_src.txt file is the source document, which contains the tokenized words of title <eos> abstract .
Each line of the *_trg.txt file contains the target keyphrases separated by an ; character. For example, each line can be like present keyphrase one;present keyphrase two;absent keyprhase one;absent keyphrase two.

Quick Start

The whole process includes the following steps:

Build tfidf: The retrievers/build_tfidf.py script is used to build the index for document retrieval.
Preprocessing: The preprocess.py script numericalizes the train_src.txt, train_trg.txt,valid_src.txt and valid_trg.txt files, and produces train.one2many.pt, valid.one2many.pt and vocab.pt.
Training: The train.py script loads the train.one2many.pt, valid.one2many.pt and vocab.pt file and performs training. We evaluate the model every 8000 batches on the valid set, and the model will be saved if the valid loss is lower than the previous one.
Decoding: The predict.py script loads the trained model and performs decoding on the five test datasets. The prediction file will be saved, which is like predicted keyphrase one;predicted keyphrase two;….
Evaluation: The evaluate_prediction.py script loads the ground-truth and predicted keyphrases, and calculates the [email protected]$ and [email protected]$ metrics.

For the sake of simplicity, we provide an one-click script in the script directory. You can run the following command to run the whole process with Gater model:

# under `One2One` paradigm
bash scripts/run_gater_one2one.sh

# under `One2Seq` paradigm
bash scripts/run_gater_one2seq.sh

You can also run the baseline model with the following command:

# under `One2One` paradigm
bash scripts/run_one2one.sh

# under `One2Seq` paradigm
bash scripts/run_one2seq.sh

Note:

Please download and unzip the datasets in the ./data directory first.
The Preprocessing procedure takes time because we have to pre-retrieve similiar references for each samples, and we also store them for the preparation of the training stage.
To run all the bash files smoothly, you may need to specify the correct home_dir (i.e., the absolute path to kg_gater dictionary) and the gpu id for CUDA_VISIBLE_DEVICES. We provide a small amount of data to quickly test whether your running environment is correct. You can test by running the following command:

bash scripts/run_small_gater_one2seq.sh

Code for our EMNLP 2021 paper “Heterogeneous Graph Neural Networks for Keyphrase Generation”

Related tags

Overview

GATER

Dependency

Dataset

Quick Start

Owner

Jiacheng Ye

A tool to prepare websites grabbed with wget for local viewing.

RATE: Overcoming Noise and Sparsity of Textual Features in Real-Time Location Estimation (CIKM'17)

Code for Neural-GIF: Neural Generalized Implicit Functions for Animating People in Clothing(ICCV21)

Two-Stage Peer-Regularized Feature Recombination for Arbitrary Image Style Transfer

An example showing how to use jax to train resnet50 on multi-node multi-GPU

A collection of models for image<->text generation in ACM MM 2021.

Simple and Robust Loss Design for Multi-Label Learning with Missing Labels

FLAVR is a fast, flow-free frame interpolation method capable of single shot multi-frame prediction

Weakly-Supervised Semantic Segmentation Network with Deep Seeded Region Growing (CVPR 2018).

This project aims to be a handler for input creation and running of multiple RICEWQ simulations.

ML-PersonalWork - Big assignment PersonalWork in Machine Learning, 2021 autumn BUAA.

Brain Tumor Detection with Tensorflow Neural Networks.

A collection of 100 Deep Learning images and visualizations

Orthogonal Jacobian Regularization for Unsupervised Disentanglement in Image Generation (ICCV 2021)

Creating a custom CNN hypertunned architeture for the Fashion MNIST dataset with Python, Keras and Tensorflow.

Code for "Long-tailed Distribution Adaptation"

Python implementation of Wu et al (2018)'s registration fusion

Dyalog-apl-docset - Dyalog APL Dash Docset Generator

Repository for Multimodal AutoML Benchmark

My coursework for Machine Learning (2021 Spring) at National Taiwan University (NTU)