BLEURT is a metric for Natural Language Generation based on transfer learning.

BLEURT: a Transfer Learning-Based Metric for Natural Language Generation

BLEURT is an evaluation metric for Natural Language Generation. It takes a pair of sentences as input, a reference and a candidate, and it returns a score that indicates to what extent the candidate is fluent and conveys the meaning of the reference. It is comparable to sentence-BLEU, BERTScore, and COMET.

BLEURT is a trained metric, that is, it is a regression model trained on ratings data. The model is based on BERT and RemBERT. This repository contains all the code necessary to use it and/or fine-tune it for your own applications. BLEURT uses Tensorflow, and it benefits greatly from modern GPUs (it runs on CPU too).

An overview of BLEURT can be found in our blog post. Further details are provided in the ACL paper BLEURT: Learning Robust Metrics for Text Generation and our EMNLP paper.

Installation

BLEURT runs in Python 3. It relies heavily on Tensorflow (>=1.15) and the library tf-slim (>=1.1). You may install it as follows:

pip install --upgrade pip  # ensures that pip is current
git clone https://github.com/google-research/bleurt.git
cd bleurt
pip install .

You may check your install with unit tests:

python -m unittest bleurt.score_test
python -m unittest bleurt.score_not_eager_test
python -m unittest bleurt.finetune_test
python -m unittest bleurt.score_files_test

Using BLEURT - TL;DR Version

The following commands download the recommended checkpoint and run BLEURT:

# Downloads the BLEURT-20 checkpoint.
wget https://storage.googleapis.com/bleurt-oss-21/BLEURT-20.zip
unzip BLEURT-20.zip

# Runs the scoring.
python -m bleurt.score_files \
  -candidate_file=bleurt/test_data/candidates \
  -reference_file=bleurt/test_data/references \
  -bleurt_checkpoint=BLEURT-20

The files bleurt/test_data/candidates and references contain test sentences, included by default in the BLEURT distribution. The input format is one sentence per line. You may replace them with your own files. The command outputs one score per sentence pair.

Oct 8th 2021 Update: we upgraded the recommended checkpoint to BLEURT-20, a more accurate, multilingual model 🎉 .

Using BLEURT - the Long Version

Command-line tools and APIs

Currently, there are three methods to invoke BLEURT: the command-line interface, the Python API, and the Tensorflow API.

Command-line interface

The simplest way to use BLEURT is through the command line, as shown below.

python -m bleurt.score_files \
  -candidate_file=bleurt/test_data/candidates \
  -reference_file=bleurt/test_data/references \
  -bleurt_checkpoint=bleurt/test_checkpoint \
  -scores_file=scores

The files candidates and references contain one sentence per line (see the folder test_data for the exact format). Invoking the command should produce a file scores which contains one BLEURT score per sentence pair. Alternatively you may use a JSONL file, as follows:

python -m bleurt.score_files \
  -sentence_pairs_file=bleurt/test_data/sentence_pairs.jsonl \
  -bleurt_checkpoint=bleurt/test_checkpoint
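
If you want to build such a JSONL file from your own data, the following is a minimal sketch. It assumes that each record uses the keys `reference` and `candidate`; check the bundled bleurt/test_data/sentence_pairs.jsonl to confirm the exact format. The helper name `to_jsonl_lines` and the output file name `my_pairs.jsonl` are hypothetical and not part of BLEURT.

```python
import json


def to_jsonl_lines(references, candidates):
    """Return one JSON-encoded record per sentence pair, in input order."""
    if len(references) != len(candidates):
        raise ValueError("references and candidates must have the same length")
    # Assumed field names; verify against bleurt/test_data/sentence_pairs.jsonl.
    return [
        json.dumps({"reference": ref, "candidate": cand}, ensure_ascii=False)
        for ref, cand in zip(references, candidates)
    ]


def write_sentence_pairs(path, references, candidates):
    """Write the records to `path`, one per line."""
    with open(path, "w", encoding="utf-8") as f:
        for line in to_jsonl_lines(references, candidates):
            f.write(line + "\n")


if __name__ == "__main__":
    write_sentence_pairs("my_pairs.jsonl", ["This is a test."], ["This is the test."])
```

The resulting file can then be passed to score_files through the -sentence_pairs_file flag.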

The flags bleurt_checkpoint and scores_file are optional. If bleurt_checkpoint is not specified, BLEURT will default to a test checkpoint, based on BERT-Tiny, which is very light but also very inaccurate (we recommend against using it). If scores_file is not specified, BLEURT will write the scores to standard output.

The following command lists all the other command-line options:

python -m bleurt.score_files -helpshort

Python API

BLEURT may be used as a Python library as follows:

from bleurt import score

checkpoint = "bleurt/test_checkpoint"
references = ["This is a test."]
candidates = ["This is the test."]

scorer = score.BleurtScorer(checkpoint)
scores = scorer.score(references=references, candidates=candidates)
assert isinstance(scores, list) and len(scores) == 1
print(scores)

Here again, BLEURT will default to BERT-Tiny if no checkpoint is specified.

BLEURT works both in eager_mode (default in TF 2.0) and in a tf.Session (TF 1.0), but the latter mode is slower and may be deprecated in the near future.

Tensorflow API

BLEURT may be embedded in a TF computation graph, e.g., to visualize it in TensorBoard while training a model.

The following piece of code shows an example:

import tensorflow as tf
# Set tf.enable_eager_execution() if using TF 1.x.

from bleurt import score

references = tf.constant(["This is a test."])
candidates = tf.constant(["This is the test."])

bleurt_ops = score.create_bleurt_ops()
bleurt_out = bleurt_ops(references=references, candidates=candidates)

assert bleurt_out["predictions"].shape == (1,)
print(bleurt_out["predictions"])

The crucial part is the call to score.create_bleurt_ops, which creates the TF ops.

Checkpoints

A BLEURT checkpoint is a self-contained folder that contains a regression model and some information that BLEURT needs to run. BLEURT checkpoints can be downloaded, copy-pasted, and stored anywhere. Furthermore, checkpoints are tunable, which means that they can be fine-tuned on custom ratings data.

BLEURT defaults to the test checkpoint, which is very inaccurate. We recommend using BLEURT-20 for results reporting. You may use it as follows:

wget https://storage.googleapis.com/bleurt-oss-21/BLEURT-20.zip
unzip BLEURT-20.zip
python -m bleurt.score_files \
  -candidate_file=bleurt/test_data/candidates \
  -reference_file=bleurt/test_data/references \
  -bleurt_checkpoint=BLEURT-20

The checkpoints page provides more information about how these checkpoints were trained, as well as pointers to smaller models. Additionally, you can fine-tune BERT or existing BLEURT checkpoints on your own ratings data. The checkpoints page describes how to do so.

Interpreting BLEURT Scores

Different BLEURT checkpoints yield different scores. The currently recommended checkpoint BLEURT-20 generates scores which are roughly between 0 and 1 (sometimes less than 0, sometimes more than 1), where 0 indicates a random output and 1 a perfect one. As with all automatic metrics, BLEURT scores are noisy. For a robust evaluation of a system's quality, we recommend averaging BLEURT scores across the sentences in a corpus. See the WMT Metrics Shared Task for a comparison of metrics on this aspect.
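
As an illustration of the corpus-level averaging recommended above, the sketch below computes the mean of per-sentence scores. It assumes the scores file written by score_files holds one floating-point number per line; the helper name `corpus_score` is hypothetical and not part of BLEURT.

```python
from statistics import mean


def corpus_score(lines):
    """Average per-sentence scores given as text lines, skipping blank lines."""
    scores = [float(line) for line in lines if line.strip()]
    if not scores:
        raise ValueError("no scores found")
    return mean(scores)


if __name__ == "__main__":
    # "scores" is the file produced with -scores_file=scores.
    with open("scores", encoding="utf-8") as f:
        print(corpus_score(f))
```

Because scores from different checkpoints are on different scales, only averages obtained with the same checkpoint should be compared.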

In principle, BLEURT should measure adequacy: most of its training data was collected by the WMT organizers, who asked annotators "How much do you agree that the system output adequately expresses the meaning of the reference?" (WMT Metrics'18, Graham et al., 2015). In practice, however, the answers tend to be highly correlated with fluency ("Is the text fluent English?"), and we added synthetic noise to the training set, which makes the distinction between adequacy and fluency somewhat fuzzy.

Language Coverage

So far, BLEURT-20 has been tested on 13 languages: Chinese, Czech, English, French, German, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Tamil, Vietnamese (these are languages for which we have held-out ratings data). In theory, it should work for the 100+ languages of multilingual C4, on which RemBERT was trained.

If you tried any other language and would like to share your experience, either positive or negative, please send us feedback!

Speeding Up BLEURT

We describe three methods to speed up BLEURT, and how to combine them.

Batch size tuning

You may specify the flag -bleurt_batch_size, which determines the number of sentence pairs processed at once by BLEURT. The default value is 16; you may want to increase or decrease it based on the memory available and the presence of a GPU (we typically use 16 on a laptop without a GPU and 100 on a workstation with a GPU).

Length-based batching

Length-based batching is an optimization that batches examples of similar length and crops the resulting tensor, to avoid wasting computation on padding tokens. This technique often yields large speed-ups (typically ~2-10X). It is described here, and it was successfully used by BERTScore in the field of learned metrics.

You can enable length-based batching by specifying -batch_same_length=True when calling score_files from the command line, or by instantiating a LengthBatchingBleurtScorer instead of a BleurtScorer when using the Python API.
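
To make the idea concrete, here is a small conceptual sketch of length-based batching in plain Python. It is not BLEURT's implementation: it uses whitespace token counts as a stand-in for the model's tokenizer, and the names `length_batches`, `score_with_length_batching`, and `score_fn` are hypothetical. `score_fn` is assumed to be any callable that takes a list of references and a list of candidates and returns one score per pair.

```python
def length_batches(references, candidates, batch_size):
    """Yield lists of original indices so that each batch holds pairs of similar length."""
    # Whitespace token count is only a rough proxy for the tokenized length.
    lengths = [len(r.split()) + len(c.split()) for r, c in zip(references, candidates)]
    order = sorted(range(len(lengths)), key=lengths.__getitem__)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]


def score_with_length_batching(score_fn, references, candidates, batch_size=16):
    """Score the pairs batch by batch, then return the scores in the original input order."""
    scores = [None] * len(references)
    for indices in length_batches(references, candidates, batch_size):
        batch_scores = score_fn(
            [references[i] for i in indices],
            [candidates[i] for i in indices],
        )
        for i, s in zip(indices, batch_scores):
            scores[i] = s
    return scores
```

Since each batch only contains pairs of similar length, it needs to be padded only to the longest pair in that batch rather than to the longest pair in the whole file, which is where the savings come from.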

Distilled models

We provide pointers to several compressed checkpoints on the checkpoints page. These models were obtained by distillation, a lossy process, and therefore the outputs cannot be directly compared to those of the original BLEURT model (though they should be strongly correlated).

Putting everything together

The following command illustrates how to combine these three techniques, speeding up BLEURT by an order of magnitude (up to 20X with our configuration) on larger files:

# Downloads the 12-layer distilled model, which is ~3.5X smaller.
wget https://storage.googleapis.com/bleurt-oss-21/BLEURT-20-D12.zip
unzip BLEURT-20-D12.zip

# Optimization 1: larger batch size (-bleurt_batch_size).
# Optimization 2: length-based batching (-batch_same_length).
# Optimization 3: distilled checkpoint (-bleurt_checkpoint).
python -m bleurt.score_files \
  -candidate_file=bleurt/test_data/candidates \
  -reference_file=bleurt/test_data/references \
  -bleurt_batch_size=100 \
  -batch_same_length=True \
  -bleurt_checkpoint=BLEURT-20-D12

Reproducibility

Information about how to work with ratings from the WMT Metrics Shared Task and how to reproduce results from our ACL paper, as well as a selection of models from our EMNLP paper, can be found here.

How to Cite

Please cite our ACL paper:

@inproceedings{sellam2020bleurt,
  title = {BLEURT: Learning Robust Metrics for Text Generation},
  author = {Thibault Sellam and Dipanjan Das and Ankur P Parikh},
  year = {2020},
  booktitle = {Proceedings of ACL}
}