Build Text Rerankers with Deep Language Models

Overview

Reranker

Reranker is a lightweight, effective and efficient package for training and deploying deep languge model reranker in information retrieval (IR), question answering (QA) and many other natural language processing (NLP) pipelines. The training procedure follows our ECIR paper Rethink Training of BERT Rerankers in Multi-Stage Retrieval Pipeline using a localized constrastive esimation (LCE) loss.

Reranker speaks Huggingface 🤗 language! This means that you instantly get all state-of-the-art pre-trained models as soon as they are ported to HF transformers. You also get the familiar model and trainer interfaces.

Stae of the Art Performance.

Reranker has two submissions to MS MARCO document leaderboard. Each got 1st place, advancing the SOTA!

Date Submission Name Dev [email protected] Eval [email protected]
2021/01/20 LCE loss + HDCT (ensemble) 0.464 0.405
2020/09/09 HDCT top100 + BERT-base FirstP (single) 0.434 0.382

Features

  • Training rerankers from the state-of-the-art pre-trained language models like BERT, RoBERTa and ELECTRA.
  • The state-of-the-art reranking performance with our LCE loss based training pipeline.
  • GPU memory optimizations: Loss Parallelism and Gradient Cache which allow training of larger model.
  • Faster training
    • Distributed Data Parallel (DDP) for multi GPUs.
    • Automatic Mixed Precision (AMP) training and inference with up to 2x speedup!
  • Break CPU RAM limitation by memory mapping datasets with pyarrow through datasets package interface.
  • Checkpoint interoperability with Hugging Face transformers.

Design Philosophy

The library is designed to be dedicated for text reranking modeling, training and testing. This helps us keep the code concise and focus on a more specific task.

Under the hood, Reranker provides a thin layer of wrapper over Huggingface libraries. Our model wraps PreTrainedModel and our trainer sub-class Huggingface Trainer. You can then work with the familiar interfaces.

Installation and Dependencies

Reranker uses Pytorch, Huggingface Transformers and Datasets. Install with the following commands,

git clone https://github.com/luyug/Reranker.git
cd Reranker
pip install .

Reranker has been tested with torch==1.6.0, transformers==4.2.0, datasets==1.1.3.

For development, install as editable,

pip install -e .

Workflow

Inference (Reranking)

The easiest way to do inference is to use one of our uploaded trained checkpoints with RerankerForInference.

from reranker import RerankerForInference
rk = RerankerForInference.from_pretrained("Luyu/bert-base-mdoc-bm25")  # load checkpoint

inputs = rk.tokenize('weather in new york', 'it is cold today in new york', return_tensors='pt')
score = rk(inputs).logits

Training

For training, you will need a model, a dataset and a trainer. Say we have parsed arguments into model_args, data_args and training_args with reranker.arguments. First, initialize the reranker and tokenizer from one of pre-tained language models from Hugging Face. For example, let's use RoBERTa by loading roberta-base.

from reranker import Reranker 
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
model = Reranker.from_pretrained(model_args, data_args, training_args, 'roberta-base')

Then create the dataset,

from reranker.data import GroupedTrainDataset
train_dataset = GroupedTrainDataset(
    data_args, data_args.train_path, 
    tokenizer=tokenizer, train_args=training_args
)

Create a trainer and train,

from reranker import RerankerTrainer
trainer = RerankerTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        data_collator=GroupCollator(tokenizer),
    )
trainer.train()

See full examples in our examples.

Examples

MS MARCO Document Ranking with Reranker

More to come

Large Models

Loss Paralellism

We support computing a query's LCE loss with multiple GPUs with flag --collaborative. Note that a group size (pos + neg) not divisible by number of GPUs may incur undefined behaviours. You will typically want to use it with gradient accumulation steps greater than one.

Detailed instruction ot be added.

Gradient Cache

Experimental We provide subclasses RerankerDC and RerankerDCTrainer. In the MS MARCO example, You can use them with --distance_cahce argument to activate gradient caching with respect to computed unnormalized distance. This allows potentially training with unlimited number of negatives beyond GPU memory limitation up to numerical precision. The method is described in our preprint Scaling Deep Contrastive Learning Batch Size with Almost Constant Peak Memory Usage.

Detailed instruction to be added.

Helpers

We provide a few helpers in the helper directory for data formatting,

Score Formatting

  • score_to_marco.py turns a raw score txt file into MS MARCO format.
  • score_to_tein.py turns a raw score txt file into trec eval format.

For example,

python score_to_tein.py --score_file {path to raw score txt}

This generates a trec eval format file in the same directory as the raw score file.

Data Format

Reranker core utilities (batch training, batch inference) expect processed and tokenized text in token id format. This means pre-processing should be done beforehand, e.g. with BERT tokenizer.

Training Data

Training data is grouped by query into a json file where each line has a query, its corresponding positives and sampled negatives.

{
    "qry": {
        "qid": str,
        "query": List[int],
    },
    "pos": List[
        {
            "pid": str,
            "passage": List[int],
        }
    ],
    "neg": List[
        {
            "pid": str,
            "passage": List[int]
        }
    ]
}

Training data is handled by class reranker.data.GroupedTrainDataset.

Inference (Reranking) Data

Inference data is grouped by query document(passage) pairs. Each line is a json entry to be rereanked (scored).

{
    "qid": str,
    "pid": str,
    "qry": List[int],
    "psg": List[int]
}

To speed up postprocessing, we currently take an additional tsv specifying text ids,

qid0     pid0
qid0     pid1
...

The ordering in the two files are expected to be the same.

Inference data is handled by class reranker.data.PredictionDataset.

Result Scores

Scores are stored in a tsv file with columns corresponding to qid, pid and score.

qid0     pid0     s0
qid0     pid1     s1
...

You can post-process it with our helper scirpt into MS MARCO format or TREC eval format.

Contribution

We welcome contribution to the package, either adding new dataset interface or new models.

Contact

You can reach me by email [email protected]. As a 2nd year master, I get busy days from time to time and may not reply very promptly. Feel free to ping me if you don't get replies.

Citation

If you use Reranker in your research, please consider citing our ECIR paper,

@inproceedings{gao2021lce,
               title={Rethink Training of BERT Rerankers in Multi-Stage Retrieval Pipeline}, 
               author={Luyu Gao and Zhuyun Dai and Jamie Callan},
               year={2021},
               booktitle={The 43rd European Conference On Information Retrieval (ECIR)},
      
}

For the gradient cache utility, consider citing our preprint,

@misc{gao2021scaling,
      title={Scaling Deep Contrastive Learning Batch Size with Almost Constant Peak Memory Usage}, 
      author={Luyu Gao and Yunyi Zhang},
      year={2021},
      eprint={2101.06983},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

License

Reranker is currently licensed under CC-BY-NC 4.0.

Owner
Luyu Gao
NLP Research [email protected], CMU
Luyu Gao
An ultra fast tiny model for lane detection, using onnx_parser, TensorRTAPI, torch2trt to accelerate. our model support for int8, dynamic input and profiling. (Nvidia-Alibaba-TensoRT-hackathon2021)

Ultra_Fast_Lane_Detection_TensorRT An ultra fast tiny model for lane detection, using onnx_parser, TensorRTAPI to accelerate. our model support for in

steven.yan 121 Dec 27, 2022
Translation to python of Chris Sims' optimization function

pycsminwel This is a locol minimization algorithm. Uses a quasi-Newton method with BFGS update of the estimated inverse hessian. It is robust against

Gustavo Amarante 1 Mar 21, 2022
Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"

T5: Text-To-Text Transfer Transformer The t5 library serves primarily as code for reproducing the experiments in Exploring the Limits of Transfer Lear

Google Research 4.6k Jan 01, 2023
The entmax mapping and its loss, a family of sparse softmax alternatives.

entmax This package provides a pytorch implementation of entmax and entmax losses: a sparse family of probability mappings and corresponding loss func

DeepSPIN 330 Dec 22, 2022
CLIPfa: Connecting Farsi Text and Images

CLIPfa: Connecting Farsi Text and Images OpenAI released the paper Learning Transferable Visual Models From Natural Language Supervision in which they

Sajjad Ayoubi 66 Dec 14, 2022
An open-source NLP library: fast text cleaning and preprocessing.

An open-source NLP library: fast text cleaning and preprocessing

Iaroslav 21 Mar 18, 2022
Idea is to build a model which will take keywords as inputs and generate sentences as outputs.

keytotext Idea is to build a model which will take keywords as inputs and generate sentences as outputs. Potential use case can include: Marketing Sea

Gagan Bhatia 364 Jan 03, 2023
Code for Editing Factual Knowledge in Language Models

KnowledgeEditor Code for Editing Factual Knowledge in Language Models (https://arxiv.org/abs/2104.08164). @inproceedings{decao2021editing, title={Ed

Nicola De Cao 86 Nov 28, 2022
Text-to-Speech for Belarusian language

title emoji colorFrom colorTo sdk app_file pinned Belarusian TTS 🐸 green green gradio app.py false Belarusian TTS 📢 🤖 Belarusian TTS (text-to-speec

Yurii Paniv 1 Nov 27, 2021
Python3 to Crystal Translation using Python AST Walker

py2cr.py A code translator using AST from Python to Crystal. This is basically a NodeVisitor with Crystal output. See AST documentation (https://docs.

66 Jul 25, 2022
CCQA A New Web-Scale Question Answering Dataset for Model Pre-Training

CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training This is the official repository for the code and models of the paper CCQA: A N

Meta Research 29 Nov 30, 2022
Almost State-of-the-art Text Generation library

Ps: we are adding transformer model soon Text Gen 🐐 Almost State-of-the-art Text Generation library Text gen is a python library that allow you build

Emeka boris ama 63 Jun 24, 2022
ChainKnowledgeGraph, 产业链知识图谱包括A股上市公司、行业和产品共3类实体

ChainKnowledgeGraph, 产业链知识图谱包括A股上市公司、行业和产品共3类实体,包括上市公司所属行业关系、行业上级关系、产品上游原材料关系、产品下游产品关系、公司主营产品、产品小类共6大类。 上市公司4,654家,行业511个,产品95,559条、上游材料56,824条,上级行业480条,下游产品390条,产品小类52,937条,所属行业3,946条。

liuhuanyong 415 Jan 06, 2023
An Open-Source Package for Neural Relation Extraction (NRE)

OpenNRE We have a DEMO website (http://opennre.thunlp.ai/). Try it out! OpenNRE is an open-source and extensible toolkit that provides a unified frame

THUNLP 3.9k Jan 03, 2023
The first online catalogue for Arabic NLP datasets.

Masader The first online catalogue for Arabic NLP datasets. This catalogue contains 200 datasets with more than 25 metadata annotations for each datas

ARBML 94 Dec 26, 2022
TextAttack 🐙 is a Python framework for adversarial attacks, data augmentation, and model training in NLP

TextAttack 🐙 Generating adversarial examples for NLP models [TextAttack Documentation on ReadTheDocs] About • Setup • Usage • Design About TextAttack

QData 2.2k Jan 03, 2023
Reproduction process of BERT on SST2 dataset

BERT-SST2-Prod Reproduction process of BERT on SST2 dataset 安装说明 下载代码库 git clone https://github.com/JunnYu/BERT-SST2-Prod 进入文件夹,安装requirements pip ins

yujun 1 Nov 18, 2021
Kerberoast with ACL abuse capabilities

targetedKerberoast targetedKerberoast is a Python script that can, like many others (e.g. GetUserSPNs.py), print "kerberoast" hashes for user accounts

Shutdown 213 Dec 22, 2022
NLP applications using deep learning.

NLP-Natural-Language-Processing NLP applications using deep learning like text generation etc. 1- Poetry Generation: Using a collection of Irish Poem

KASHISH 1 Jan 27, 2022
[ICLR 2021 Spotlight] Pytorch implementation for "Long-tailed Recognition by Routing Diverse Distribution-Aware Experts."

RIDE: Long-tailed Recognition by Routing Diverse Distribution-Aware Experts. by Xudong Wang, Long Lian, Zhongqi Miao, Ziwei Liu and Stella X. Yu at UC

Xudong (Frank) Wang 205 Dec 16, 2022