🏖 Easy training and deployment of seq2seq models.

Overview

Headliner


Headliner is a sequence modeling library that eases the training and, in particular, the deployment of custom sequence models for both researchers and developers. You can deploy your models in just a few lines of code. It was originally built for our own research to generate headlines from Welt news articles (see figure 1), which is why we chose the name Headliner.

Figure 1: One example from our Welt.de headline generator.

Update 21.01.2020

The library now supports fine-tuning pre-trained BERT models with custom preprocessing as in Text Summarization with Pretrained Encoders!

Check out this tutorial on Colab!

🧠 Internals

We use sequence-to-sequence (seq2seq) under the hood, an encoder-decoder framework (see figure 2), and provide a very simple interface to train and deploy seq2seq models. Although this library was created internally to generate headlines, you can also use it for other tasks like machine translation (see the sketch below figure 2), text summarization, and more.

Figure 2: Encoder-decoder sequence-to-sequence model.
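
As a quick illustration of using the same interface for a task other than headline generation, here is a minimal machine-translation-style sketch (toy data and placeholder hyperparameters; see the Usage section below for the full API):

from headliner.trainer import Trainer
from headliner.model.transformer_summarizer import TransformerSummarizer

# A toy translation-style dataset: (source, target) text pairs.
data = [('Guten Morgen.', 'Good morning.'),
        ('Wie geht es dir?', 'How are you?')]

summarizer = TransformerSummarizer(embedding_size=64, max_prediction_len=20)
trainer = Trainer(batch_size=2, steps_per_epoch=10)
trainer.train(summarizer, data, num_epochs=2)
summarizer.predict('Guten Morgen.')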

Why Headliner?

You may ask: why another seq2seq library? There are already a couple of them out there. For example, Facebook has fairseq, Google has seq2seq and there is also OpenNMT. Although those libraries are great, they have a few drawbacks for our use case: fairseq doesn't focus much on production, and Google's seq2seq is not actively maintained. OpenNMT came closest to matching our requirements, as it has a strong focus on production. However, its workflow (preparing data, training and evaluation) is mainly done via the command line. It does expose a well-defined API, but the complexity there is still too high, with too much custom code (see their minimal transformer training example).

Therefore, we built this library for us with the following goals in mind:

  • Easy-to-use API for training and deployment (only a few lines of code)
  • Uses TensorFlow 2.0 with all its new features (tf.function, tf.keras.layers etc.)
  • Modular classes: text preprocessing, modeling, evaluation
  • Extensible for different encoder-decoder models
  • Works on large text data

For more details on the library, read the documentation at: https://as-ideas.github.io/headliner/

Headliner is compatible with Python 3.6 and is distributed under the MIT license.

⚙️ Installation

⚠️ Before installing Headliner, you need to install TensorFlow, which we use as our deep learning framework. For more details on how to install it, have a look at the TensorFlow installation instructions.
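
A quick sanity check that TensorFlow is installed and is a 2.x version (which Headliner relies on):

import tensorflow as tf

# Headliner is built on TensorFlow 2.0, so this should print a 2.x version.
print(tf.__version__)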

Then you can install Headliner itself in one of two ways:

  • Install Headliner from PyPI (recommended):
pip install headliner
  • Install Headliner from the GitHub source:
git clone https://github.com/as-ideas/headliner.git
cd headliner
python setup.py install

📖 Usage

Training

For training, you need to import one of our provided models or create your own custom one. Then you create the dataset, a list of input-output tuples, and train the model:

from headliner.trainer import Trainer
from headliner.model.transformer_summarizer import TransformerSummarizer

data = [('You are the stars, earth and sky for me!', 'I love you.'),
        ('You are great, but I have other plans.', 'I like you.')]

summarizer = TransformerSummarizer(embedding_size=64, max_prediction_len=20)
trainer = Trainer(batch_size=2, steps_per_epoch=100)
trainer.train(summarizer, data, num_epochs=2)
summarizer.save('/tmp/summarizer')

Prediction

The prediction can be done in a few lines of code:

from headliner.model.transformer_summarizer import TransformerSummarizer

summarizer = TransformerSummarizer.load('/tmp/summarizer')
summarizer.predict('You are the stars, earth and sky for me!')

Models

Currently available models include a basic encoder-decoder, an encoder-decoder with Luong attention, the Transformer, and a Transformer on top of a pre-trained BERT model:

from headliner.model.basic_summarizer import BasicSummarizer
from headliner.model.attention_summarizer import AttentionSummarizer
from headliner.model.transformer_summarizer import TransformerSummarizer
from headliner.model.bert_summarizer import BertSummarizer

basic_summarizer = BasicSummarizer()
attention_summarizer = AttentionSummarizer()
transformer_summarizer = TransformerSummarizer()
bert_summarizer = BertSummarizer()

Advanced training

Training using a validation split and model checkpointing:

from headliner.model.transformer_summarizer import TransformerSummarizer
from headliner.trainer import Trainer

train_data = [('You are the stars, earth and sky for me!', 'I love you.'),
              ('You are great, but I have other plans.', 'I like you.')]
val_data = [('You are great, but I have other plans.', 'I like you.')]

summarizer = TransformerSummarizer(num_heads=1,
                                   feed_forward_dim=512,
                                   num_layers=1,
                                   embedding_size=64,
                                   max_prediction_len=50)
trainer = Trainer(batch_size=8,
                  steps_per_epoch=50,
                  max_vocab_size_encoder=10000,
                  max_vocab_size_decoder=10000,
                  tensorboard_dir='/tmp/tensorboard',
                  model_save_path='/tmp/summarizer')

trainer.train(summarizer, train_data, val_data=val_data, num_epochs=3)

Advanced prediction

Prediction information such as attention weights and logits can be accessed via predict_vectors, which returns a dictionary:

from headliner.model.transformer_summarizer import TransformerSummarizer

summarizer = TransformerSummarizer.load('/tmp/summarizer')
summarizer.predict_vectors('You are the stars, earth and sky for me!')
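
The exact keys of the returned dictionary depend on the model, so the simplest way to see what is available is to inspect it (a small sketch, not an exhaustive list of keys):

pred_vectors = summarizer.predict_vectors('You are the stars, earth and sky for me!')

# The dictionary contains prediction details such as logits and attention
# weights; print the keys and value types to see what this model exposes.
for key, value in pred_vectors.items():
    print(key, type(value))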

Resume training

A previously trained summarizer can be loaded and then retrained. In this case the data preprocessing and vectorization are loaded from the model.

from headliner.model.transformer_summarizer import TransformerSummarizer
from headliner.trainer import Trainer

train_data = [('Some new training data.', 'New data.')] * 10

summarizer_loaded = TransformerSummarizer.load('/tmp/summarizer')
trainer = Trainer(batch_size=2)
trainer.train(summarizer_loaded, train_data)
summarizer_loaded.save('/tmp/summarizer_retrained')

Use pretrained GloVe embeddings

Embeddings in GloVe format can be injected into the trainer as follows. Optionally, set the embeddings to non-trainable.

from headliner.trainer import Trainer
from headliner.model.transformer_summarizer import TransformerSummarizer

trainer = Trainer(embedding_path_encoder='/tmp/embedding_encoder.txt',
                  embedding_path_decoder='/tmp/embedding_decoder.txt')

# make sure the embedding size matches the embedding size of the files
summarizer = TransformerSummarizer(embedding_size=64,
                                   embedding_encoder_trainable=False,
                                   embedding_decoder_trainable=False)
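
For a quick experiment without downloading real GloVe vectors, a toy embedding file in GloVe text format (one token per line, followed by its space-separated vector values) can be generated like this; the file path and the 64-dimensional random vectors are just placeholders that need to match the trainer and summarizer settings above:

import numpy as np

# Write a toy embedding file in GloVe text format: "<token> v1 v2 ... vN" per line.
# The vector dimension (64) must match the summarizer's embedding_size.
vocab = ['i', 'you', 'love', 'like']
with open('/tmp/embedding_encoder.txt', 'w', encoding='utf-8') as f:
    for word in vocab:
        vector = np.random.uniform(-0.1, 0.1, 64)
        f.write(word + ' ' + ' '.join(f'{v:.5f}' for v in vector) + '\n')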

Custom preprocessing

A model can be initialized with custom preprocessing and tokenization:

from headliner.preprocessing.preprocessor import Preprocessor
from headliner.preprocessing.vectorizer import Vectorizer
from headliner.model.transformer_summarizer import TransformerSummarizer
from headliner.trainer import Trainer

train_data = [('Some inputs.', 'Some outputs.')] * 10

preprocessor = Preprocessor(filter_pattern='',
                            lower_case=True,
                            hash_numbers=False)
train_prep = [preprocessor(t) for t in train_data]
inputs_prep = [t[0] for t in train_prep]
targets_prep = [t[1] for t in train_prep]

# Build tf subword tokenizers. Other custom tokenizers can be implemented
# by subclassing headliner.preprocessing.Tokenizer
from tensorflow_datasets.core.features.text import SubwordTextEncoder
tokenizer_input = SubwordTextEncoder.build_from_corpus(
    inputs_prep, target_vocab_size=2**13, reserved_tokens=[preprocessor.start_token, preprocessor.end_token])
tokenizer_target = SubwordTextEncoder.build_from_corpus(
    targets_prep, target_vocab_size=2**13, reserved_tokens=[preprocessor.start_token, preprocessor.end_token])

vectorizer = Vectorizer(tokenizer_input, tokenizer_target)
summarizer = TransformerSummarizer(embedding_size=64, max_prediction_len=50)
summarizer.init_model(preprocessor, vectorizer)

trainer = Trainer(batch_size=2)
trainer.train(summarizer, train_data, num_epochs=3)
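
To get a feeling for what the preprocessor does, you can apply it to a single pair and print the result (the exact output strings depend on the preprocessor settings, e.g. its start and end tokens):

sample = ('Some inputs.', 'Some outputs.')
# Returns an (input, target) tuple of preprocessed strings, lower-cased and
# wrapped in preprocessor.start_token / preprocessor.end_token.
print(preprocessor(sample))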

Use pre-trained BERT embeddings

Pre-trained BERT models can be included as follows. Be aware that pre-trained BERT models are expensive to train and require custom preprocessing!

from headliner.preprocessing.bert_preprocessor import BertPreprocessor
from spacy.lang.en import English

train_data = [('Some inputs.', 'Some outputs.')] * 10

# use BERT-specific start and end token
preprocessor = BertPreprocessor(nlp=English())
train_prep = [preprocessor(t) for t in train_data]
targets_prep = [t[1] for t in train_prep]


from tensorflow_datasets.core.features.text import SubwordTextEncoder
from transformers import BertTokenizer
from headliner.model.bert_summarizer import BertSummarizer
from headliner.preprocessing.bert_vectorizer import BertVectorizer
from headliner.trainer import Trainer

# Use a pre-trained BERT embedding and BERT tokenizer for the encoder 
tokenizer_input = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer_target = SubwordTextEncoder.build_from_corpus(
    targets_prep, target_vocab_size=2**13,  reserved_tokens=[preprocessor.start_token, preprocessor.end_token])

vectorizer = BertVectorizer(tokenizer_input, tokenizer_target)
summarizer = BertSummarizer(num_heads=2,
                            feed_forward_dim=512,
                            num_layers_encoder=0,
                            num_layers_decoder=4,
                            bert_embedding_encoder='bert-base-uncased',
                            embedding_size_encoder=768,
                            embedding_size_decoder=768,
                            dropout_rate=0.1,
                            max_prediction_len=50)
summarizer.init_model(preprocessor, vectorizer)

trainer = Trainer(batch_size=2)
trainer.train(summarizer, train_data, num_epochs=3)

Training on large datasets

Large datasets can be handled by using an iterator:

from headliner.model.transformer_summarizer import TransformerSummarizer
from headliner.trainer import Trainer

def read_data_iteratively():
    return (('Some inputs.', 'Some outputs.') for _ in range(1000))

class DataIterator:
    def __iter__(self):
        return read_data_iteratively()

data_iter = DataIterator()

summarizer = TransformerSummarizer(embedding_size=10, max_prediction_len=20)
trainer = Trainer(batch_size=16, steps_per_epoch=1000)
trainer.train(summarizer, data_iter, num_epochs=3)
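
In practice the iterator would typically stream the pairs from disk rather than generate them in memory; a sketch that lazily reads tab-separated (input, target) pairs from a text file (the file path and format are assumptions) could look like this:

class FileDataIterator:
    """Lazily yields (input, target) pairs from a tab-separated text file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding='utf-8') as f:
            for line in f:
                input_text, target_text = line.rstrip('\n').split('\t')
                yield input_text, target_text

trainer.train(summarizer, FileDataIterator('/tmp/train_pairs.tsv'), num_epochs=3)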

🤝 Contribute

We welcome all kinds of contributions, such as new models, new examples, and more. See the Contribution guide for more details.

📝 Cite this work

Please cite Headliner in your publications if this is useful for your research. Here is an example BibTeX entry:

@misc{axelspringerai2019headliners,
  title={Headliner},
  author={Christian Schäfer and Dat Tran},
  year={2019},
  howpublished={\url{https://github.com/as-ideas/headliner}},
}

🏗 Maintainers

© Copyright

See LICENSE for details.

References

Text Summarization with Pretrained Encoders

Effective Approaches to Attention-based Neural Machine Translation

Acknowledgements

https://www.tensorflow.org/tutorials/text/transformer

https://github.com/huggingface/transformers

https://machinetalk.org/2019/03/29/neural-machine-translation-with-attention-mechanism/
