Codebase for the Summary Loop paper at ACL2020

Last update: Nov 04, 2022

Overview

Summary Loop

This repository contains the code for ACL2020 paper: The Summary Loop: Learning to Write Abstractive Summaries Without Examples.

Training Procedure

We provide pre-trained models for each component needed in the Summary Loop Release:

keyword_extractor.joblib: An sklearn pipeline that will extract can be used to compute tf-idf scores of words according to the BERT vocabulary, which is used by the Masking Procedure,
bert_coverage.bin: A bert-base-uncased finetuned model on the task of Coverage for the news domain,
fluency_news_bs32.bin: A GPT2 (base) model finetuned on a large corpus of news articles, used as the Fluency model,
gpt2_copier23.bin: A GPT2 (base) model that can be used as an initial point for the Summarizer model.

In the release, we also provide:

pretrain_coverage.py script to train a coverage model from scratch,
train_generator.py to train a fluency model from scratch (we recommend Fluency model on domain of summaries, such as news, legal, etc.)

Once all the pretraining models are ready, training a summarizer can be done using the train_summary_loop.py:

python train_summary_loop.py --experiment wikinews_test --dataset_file data/wikinews.db

Scorer Models

The Coverage and Fluency model and Guardrails scores can be used separately for analysis, evaluation, etc. They are respectively in model_coverage.py and model_guardrails.py, each model is implemented as a class with a score(document, summary) function. The Fluency model is a Language model, which is also the generator (in model_generator.py). Examples of how to run each model are included in the class files, at the bottom of the files.

Bringing in your own data

Want to test out the Summary Loop on a different language/type of text? A Jupyter Notebook can help you bring your own data into the SQLite format we use in the pre-training scripts. Otherwise you can modify the scripts' data loading (DataLoader) and collate function (collate_fn).

Cite the work

If you make use of the code, models, or algorithm, please cite our paper:

@inproceedings{laban2020summary,
  title={The Summary Loop: Learning to Write Abstractive Summaries Without Examples},
  author={Laban, Philippe and Hsi, Andrew and Canny, John and Hearst, Marti A},
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  volume={1},
  year={2020}
}

Contributing

If you'd like to contribute, or have questions or suggestions, you can contact us at [email protected]. All contributions welcome! For example, if you have a type of text data on which you want to apply the Summary Loop.

Comments

Error Loading Model RuntimeError: Error(s) in loading state_dict for GPT2LMHeadModel:

Traceback (most recent call last):
  File "train_summary_loop.py", line 59, in <module>
    summarizer = GeneTransformer(max_output_length=args.max_output_length, device=args.device, tokenizer_type='gpt2', starter_model=summarizer_model_start)
  File "/home/tait-dev-0/summary_loop/summary_loop/model_generator.py", line 30, in __init__
    self.reload(starter_model)
  File "/home/tait-dev-0/summary_loop/summary_loop/model_generator.py", line 39, in reload
    print(self.model.load_state_dict(torch.load(from_file)))
  File "/home/tait-dev-0/anaconda2/envs/summary_loop/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1045, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for GPT2LMHeadModel:
	Missing key(s) in state_dict: "transformer.h.0.attn.masked_bias", "transformer.h.1.attn.masked_bias", "transformer.h.2.attn.masked_bias", "transformer.h.3.attn.masked_bias", "transformer.h.4.attn.masked_bias", "transformer.h.5.attn.masked_bias", "transformer.h.6.attn.masked_bias", "transformer.h.7.attn.masked_bias", "transformer.h.8.attn.masked_bias", "transformer.h.9.attn.masked_bias", "transformer.h.10.attn.masked_bias", "transformer.h.11.attn.masked_bias".

opened by raviolli 6

Missing models for training

Dear author, I tried to load the fluency_news_model_file models but failed. It seems that the "news_gpt2_bs32.bin" is not provided in the release.

I tried to replace it with "fluency_news_bs32.bin", but it does not seem to match the GeneTransformer. I.e. when I tried to load the fluency model using modelf=GeneTransformer(max_output_length=args.max_output_length, device=args.device, starter_model=fluency_news_model_file) it shows "IncompatibleKeys(missing_keys=['transformer.h.0.attn.masked_bias', 'transformer.h.1.attn.masked_bias', 'transformer.h.2.attn.masked_bias', 'transformer.h.3.attn.masked_bias', 'transformer.h.4.attn.masked_bias', 'transformer.h.5.attn.masked_bias', 'transformer.h.6.attn.masked_bias', 'transformer.h.7.attn.masked_bias', 'transformer.h.8.attn.masked_bias', 'transformer.h.9.attn.masked_bias', 'transformer.h.10.attn.masked_bias', 'transformer.h.11.attn.masked_bias'], unexpected_keys=[]) "

Is this fine?

In addition, when I tried to load the key word coverage model, the keys do not match either I.e. When running modelc = KeywordCoverage(args.device, keyword_model_file=coverage_keyword_model_file, model_file=coverage_model_file)} It shows IncompatibleKeys(missing_keys=['bert.embeddings.position_ids', 'cls.predictions.decoder.bias'], unexpected_keys=[])

Wondering how I could deal with this situation

opened by pengshancai 2

IndexError when decode with beam_size > 1

Followed the instruction from here and changed the beam_size to more than 1. IndexError occur:

~/summary_loop/model_generator.py in decode(self, bodies, max_output_length, max_batch_size, beam_size, return_scores, sample, progress)
    232             with torch.no_grad():
    233                 if beam_size > 1:
--> 234                     batch_outputs = self.decode_beam_batch(batch_bodies, beam_size=beam_size, max_output_length=max_output_length, sample=sample)
    235                 else:
    236                     batch_outputs = self.decode_batch(batch_bodies, max_output_length=max_output_length, sample=sample, return_scores=return_scores)

~/summary_loop/model_generator.py in decode_beam_batch(self, bodies, beam_size, max_output_length, sample)
    200             if build_up is not None:
    201                 build_up = build_up[tracks, :]
--> 202             past = [p[:, tracks, :] for p in past]
    203 
    204             # Update the latest scores, and the current_build

~/summary_loop/model_generator.py in <listcomp>(.0)
    200             if build_up is not None:
    201                 build_up = build_up[tracks, :]
--> 202             past = [p[:, tracks, :] for p in past]
    203 
    204             # Update the latest scores, and the current_build

IndexError: tensors used as indices must be long, byte or bool tensors

opened by s103321048 2

cannot reshape tensor of 0 elements into shape [-1, 0]

I followed the instruction training model with the provided example wikinews.db: python train_summary_loop.py --experiment wikinews_test --dataset_file data/wikinews.db

It did start training but later stop due to Runtimeerror:

Traceback (most recent call last):
  File "train_summary_loop.py", line 138, in <module>
    sampled_summaries, sampled_logprobs, sampled_tokens, input_past, sampled_end_idxs = summarizer.decode_batch(bodies, max_output_length=args.max_output_length, return_logprobs=True, sample=True)
  File "/home/robin/TrySomethingNew/summary_loop/model_generator.py", line 100, in decode_batch
    _, input_past = self.model(input_ids=inputs, past_key_values=None)
  File "/home/robin/virtual-env/summary-loop/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/robin/virtual-env/summary-loop/lib/python3.6/site-packages/transformers/modeling_gpt2.py", line 731, in forward
    return_dict=return_dict,
  File "/home/robin/virtual-env/summary-loop/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/robin/virtual-env/summary-loop/lib/python3.6/site-packages/transformers/modeling_gpt2.py", line 533, in forward
    input_ids = input_ids.view(-1, input_shape[-1])
RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 0] because the unspecified
dimension size -1 can be any value and is ambiguous

opened by s103321048 2

Code for summary generation from the given model is not provided

You mentioned "Releasing the 11,490 summaries generated by the Summary Loop model (summary_loop_length46.bin) on the CNN/DM test set." and provided the json file "cnndm_test_summary_loop.json". Is there any code to get the json file(summaries) from the given model(.bin). If you have such code, then please share.

opened by tarunyadav 1

Resuming training

Is resuming training simply starting from the checkpoint instead of the gpt3 copier bin?

For example:

#summarizer_model_start = os.path.join(models_folder, "gpt2_copier23.bin")
summarizer_model_start = os.path.join(models_folder, "summarizer_wikinews_test_0_ckpt.bin")

opened by RevanthRameshkumar 1

Encoding error in bin file

(dlenv) D:\summary loop\summary_loop-0.1>python summary_loop_length10.bin --experiment wikinews_test --dataset_file data/wikinews.db File "summary_loop_length10.bin", line 1 SyntaxError: Non-UTF-8 code starting with '\x80' in file summary_loop_length10.bin on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

Anyone else get this issue? Currently debugging

opened by RevanthRameshkumar 1
a sample of data in hdf5 format

Hi,

I'm trying to train the models from scratch since I'd like to use them on a different language. It seems that one needs a dataset in hdf5 format instead of SQL to do that. Can you please release a sample of data in hdf5 format?

Thanks

opened by azagsam 1
Missing Model to run example

I'm trying to run the example:

python train_summary_loop.py --experiment wikinews_test --dataset_file ../data/wikinews.db --root_folder ../ --device cuda

but it seems I'm missing the ../models/fluency_news_bs32.bin

it doesn't seem to be in the list of downloadable models. Mistake??

opened by raviolli 1

Error running training_summary example

python train_summary_loop.py --experiment wikinews_test --dataset_file ../data/wikinews.db

Traceback (most recent call last):
  File "train_summary_loop.py", line 56, in <module>
    bert_tokenizer = utils_tokenizer.BERTCacheTokenizer()
  File "/home/tait-dev-0/summary_loop/summary_loop/utils_tokenizer.py", line 88, in __init__
    self.tokenizer.max_len = 10000
AttributeError: can't set attribute

transformers 3.0.2 py_0 conda-forge

I created a separate conda environment. Is this a transformer version issue?

opened by raviolli 1

updated torch.load params

Updated occurrences of torch.load to include map_location parameter. When attempting to train with --device set to cpu, torch.load may attempt to load a file with GPU tensors, which will lead to loading to GPU by default (see: https://pytorch.org/docs/stable/generated/torch.load.html). If --device is set to cpu, this will error on a cpu-only machine. Otherwise, it will go against desired functionality. This pull request resolves this issue.

opened by bsh98 0

Releases(0.3)

0.3(Jun 11, 2021)
Releasing the 11,490 summaries generated by the Summary Loop model (summary_loop_length46.bin) on the CNN/DM test set. Each summary is released attached with the CNN/DM id. The following code snippet can be used to evaluate ROUGE scores:

from datasets import load_dataset, load_metric import json with open("/home/phillab/data/cnndm_test_summary_loop.json", "r") as f: summary_loop_gens = json.load(f) rouge = load_metric("rouge") dataset_test = load_dataset("cnn_dailymail", "3.0.0")["test"] id2summary_loop = {d["id"]: d["summary_loop_gen"] for d in summary_loop_gens} candidates, references = [], [] for d in dataset_test: references.append(d["highlights"]) candidates.append(id2summary_loop[d["id"]]) print(len(references), len(candidates)) print(rouge.compute(predictions=candidates, references=references))

Notes: (1) this relies on HuggingFace's datasets repository (https://github.com/huggingface/datasets) to load the CNN/DM dataset, and the ROUGE metric. (2) The ROUGE metric implementation used in the above example is not the original, PERL-based implementation of ROUGE used for official numbers in the paper. This serves for demonstration purposes to show how to use the file.
Source code(tar.gz)
Source code(zip)
cnndm_test_summary_loop.json(3.40 MB)
0.2(Sep 8, 2020)
We release an upgraded set of initial models for the training script that are compatible with transformers==3.1.0 to make it easier to get started. The original release (0.1) used version 2.8.0 of transformers, and there were some breaking changed introduced since, which leads to some model loading failing. The requirements.txt in the latest release has been updated with compatible library versions to simplify installation.

Initial Models

These sets of models work using Python 3.6.10, Transformers 3.1.0 and Sklearn 0.22.1:

keyword_extractor.joblib: An sklearn pipeline that will extract can be used to compute tf-idf scores of words according to the BERT vocabulary, which is used by the Masking Procedure,

bert_coverage.bin: A bert-base-uncased finetuned model on the task of Coverage for the news domain,

fluency_news_bs32.bin: A GPT2 (base) model finetuned on a large corpus of news articles, used as the Fluency model,

gpt2_copier23.bin: A GPT2 (base) model that can be used as an initial point for the Summarizer model.

Final Models

Unfortunately, the three final models (trained summarizers) released in v0.1 do not work anymore in the latest transformers library, and only work in versions 2.8.0 and before. Once we retrain these models, we will reupload them. If this is of interest, feel free to add an issue or contact us.
Source code(tar.gz)
Source code(zip)
bert_coverage.bin(420.06 MB)
fluency_news_bs32.bin(486.73 MB)
gpt2_copier23.bin(633.97 MB)
keyword_extractor.joblib(667.33 KB)
v0.1(Jun 25, 2020)
We release models and data needed to run the Summary Loop and use the models we trained.

Initial models

Here are the models needed to run the train_summary_loop.py:

keyword_extractor.joblib: An sklearn pipeline that will extract can be used to compute tf-idf scores of words according to the BERT vocabulary, which is used by the Masking Procedure,

bert_coverage.bin: A bert-base-uncased finetuned model on the task of Coverage for the news domain,

fluency_news_bs32.bin: A GPT2 (base) model finetuned on a large corpus of news articles, used as the Fluency model,

gpt2_copier23.bin: A GPT2 (base) model that can be used as an initial point for the Summarizer model.

Sample dataset

We release a sample dataset of Wikinews news articles to get researchers started using the Summary Loop: wikinews.db. We cannot release the full dataset we used for copyright reasons. We note that we do not expect this to be enough to train to best performance, and recommend finding larger datasets (such as Newsroom or CNN/DM) for full-fledged training.

Final models

We release 3 Summarizer models obtained through the Summary Loop procedure for 3 target lengths: summary_loop_length_12.bin, summary_loop_length_27.bin, summary_loop_length_61.bin
Source code(tar.gz)
Source code(zip)
bert_coverage.bin(420.06 MB)
fluency_news_bs32.bin(522.73 MB)
gpt2_copier23.bin(633.97 MB)
keyword_extractor.joblib(667.33 KB)
summary_loop_length10.bin(522.73 MB)
summary_loop_length24.bin(522.73 MB)
summary_loop_length46.bin(522.73 MB)
wikinews.db(91.20 MB)

Owner

Canny Lab @ The University of California, Berkeley

GitHub Repository

pytorch implementation of the ICCV'21 paper "MVTN: Multi-View Transformation Network for 3D Shape Recognition"

MVTN: Multi-View Transformation Network for 3D Shape Recognition (ICCV 2021) By Abdullah Hamdi, Silvio Giancola, Bernard Ghanem Paper | Video | Tutori

64 Jan 03, 2023

OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark

Introduction English | 简体中文 MMAction2 is an open-source toolbox for video understanding based on PyTorch. It is a part of the OpenMMLab project. The m

2.7k Jan 07, 2023

An SMPC companion library for Syft

SyMPC A library that extends PySyft with SMPC support SyMPC /ˈsɪmpəθi/ is a library which extends PySyft ≥0.3 with SMPC support. It allows computing o

0 Oct 13, 2021

Canonical Appearance Transformations

CAT-Net: Learning Canonical Appearance Transformations Code to accompany our paper "How to Train a CAT: Learning Canonical Appearance Transformations

54 Dec 24, 2022

Graph Posterior Network: Bayesian Predictive Uncertainty for Node Classification (NeurIPS 2021)

Graph Posterior Network This is the official code repository to the paper Graph Posterior Network: Bayesian Predictive Uncertainty for Node Classifica

30 Dec 05, 2022

This repository contains a PyTorch implementation of "AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis".

AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis | Project Page | Paper | PyTorch implementation for the paper "AD-NeRF: Audio

551 Dec 29, 2022

Pre-trained NFNets with 99% of the accuracy of the official paper

NFNet Pytorch Implementation This repo contains pretrained NFNet models F0-F6 with high ImageNet accuracy from the paper High-Performance Large-Scale

133 Dec 09, 2022

HugsVision is a easy to use huggingface wrapper for state-of-the-art computer vision

HugsVision is an open-source and easy to use all-in-one huggingface wrapper for computer vision. The goal is to create a fast, flexible and user-frien

166 Nov 27, 2022

Official repository of "BasicVSR++: Improving Video Super-Resolution with Enhanced Propagation and Alignment"

BasicVSR_PlusPlus (CVPR 2022) [Paper] [Project Page] [Code] This is the official repository for BasicVSR++. Please feel free to raise issue related to

227 Jan 01, 2023

Open source hardware and software platform to build a small scale self driving car.

Donkeycar is minimalist and modular self driving library for Python. It is developed for hobbyists and students with a focus on allowing fast experimentation and easy community contributions.

2.4k Jan 04, 2023

Cartoon-StyleGan2 🙃 : Fine-tuning StyleGAN2 for Cartoon Face Generation

Fine-tuning StyleGAN2 for Cartoon Face Generation

520 Jan 04, 2023

The source code of "SIDE: Center-based Stereo 3D Detector with Structure-aware Instance Depth Estimation", accepted to WACV 2022.

SIDE: Center-based Stereo 3D Detector with Structure-aware Instance Depth Estimation The source code of our work "SIDE: Center-based Stereo 3D Detecto

10 Dec 18, 2022

Python periodic table module

elemenpy Hello! elements.py is a small Python periodic table module that is used for calling certain information about an element. Installation Instal

2 Dec 27, 2021

Start-to-finish tutorial for interactive music co-creation in PyTorch and Tensorflow.js

98 Dec 14, 2022

Face Recognize System on camera AI OAK1

FRS on OAK1 Face Recognize System on camera OAK1 This project contains our work that deploy on camera OAK1 Features Anti-Spoofing Face detection Face

6 Aug 08, 2022

ContourletNet: A Generalized Rain Removal Architecture Using Multi-Direction Hierarchical Representation

ContourletNet: A Generalized Rain Removal Architecture Using Multi-Direction Hierarchical Representation (Accepted by BMVC'21) Abstract: Images acquir

10 Dec 08, 2022

implicit displacement field

Geometry-Consistent Neural Shape Representation with Implicit Displacement Fields [project page][paper][cite] Geometry-Consistent Neural Shape Represe

100 Dec 19, 2022

[CVPR 21] Vectorization and Rasterization: Self-Supervised Learning for Sketch and Handwriting, IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2021.

Vectorization and Rasterization: Self-Supervised Learning for Sketch and Handwriting, CVPR 2021. Ayan Kumar Bhunia, Pinaki nath Chowdhury, Yongxin Yan

44 Dec 12, 2022

[ECCVW2020] Robust Long-Term Object Tracking via Improved Discriminative Model Prediction (RLT-DiMP)

Feel free to visit my homepage Robust Long-Term Object Tracking via Improved Discriminative Model Prediction (RLT-DIMP) [ECCVW2020 paper] Presentation

35 Oct 26, 2022

Data augmentation for NLP, accepted at EMNLP 2021 Findings

AEDA: An Easier Data Augmentation Technique for Text Classification This is the code for the EMNLP 2021 paper AEDA: An Easier Data Augmentation Techni

81 Dec 09, 2022