Improving Query Representations for DenseRetrieval with Pseudo Relevance Feedback:A Reproducibility Study.

Related tags

Deep LearningAPR
Overview

APR

The repo for the paper Improving Query Representations for DenseRetrieval with Pseudo Relevance Feedback:A Reproducibility Study.

Environment setup

To reproduce the results in the paper, we rely on two open-source IR toolkits: Pyserini and tevatron.

We cloned, merged, and modified the two toolkits in this repo and will use them to train and inference the PRF models. We refer to the original github repos to setup the environment:

Install Pyserini: https://github.com/castorini/pyserini/blob/master/docs/installation.md.

Install tevatron: https://github.com/texttron/tevatron#installation.

You also need MS MARCO passage ranking dataset, including the collection and queries. We refer to the official github repo for downloading the data.

To reproduce ANCE-PRF inference results with the original model checkpoint

The code, dataset, and model for reproducing the ANCE-PRF results presented in the original paper:

HongChien Yu, Chenyan Xiong, Jamie Callan. Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback

have been merged into Pyserini source. Simply just need to follow this instruction, which includes the instructions of downloading the dataset, model checkpoint (provided by the original authors), dense index, and PRF inference.

To train dense retriever PRF models

We use tevatron to train the dense retriever PRF query encodes that we investigated in the paper.

First, you need to have train queries run files to build hard negative training set for each DR.

You can use Pyserini to generate run files for ANCE, TCT-ColBERTv2 and DistilBERT KD TASB by changing the query set flag --topics to queries.train.tsv.

Once you have the run file, cd to /tevatron and run:

python make_train_from_ranking.py \
	--ranking_file /path/to/train/run \
	--model_type (ANCE or TCT or DistilBERT) \
	--output /path/to/save/hard/negative

Apart from the hard negative training set, you also need the original DR query encoder model checkpoints to initial the model weights. You can download them from Huggingface modelhub: ance, tct_colbert-v2-hnp-msmarco, distilbert-dot-tas_b-b256-msmarco. Please use the same name as the link in Huggingface modelhub for each of the folders that contain the model.

After you generated the hard negative training set and downloaded all the models, you can kick off the training for DR-PRF query encoders by:

python -m torch.distributed.launch \
    --nproc_per_node=2 \
    -m tevatron.driver.train \
    --output_dir /path/to/save/mdoel/checkpoints \
    --model_name_or_path /path/to/model/folder \
    --do_train \
    --save_steps 5000 \
    --train_dir /path/to/hard/negative \
    --fp16 \
    --per_device_train_batch_size 32 \
    --learning_rate 1e-6 \
    --num_train_epochs 10 \
    --train_n_passages 21 \
    --q_max_len 512 \
    --dataloader_num_workers 10 \
    --warmup_steps 5000 \
    --add_pooler

To inference dense retriever PRF models

Install Pyserini by following the instructions within pyserini/README.md

Then run:

python -m pyserini.dsearch --topics /path/to/query/tsv/file \
    --index /path/to/index \
    --encoder /path/to/encoder \ # This encoder is for first round retrieval
    --batch-size 64 \
    --output /path/to/output/run/file \
    --prf-method tctv2-prf \
    --threads 12 \
    --sparse-index msmarco-passage \
    --prf-encoder /path/to/encoder \ # This encoder is for PRF query generation
    --prf-depth 3

An example would be:

python -m pyserini.dsearch --topics ./data/msmarco-test2020-queries.tsv \
    --index ./dindex-msmarco-passage-tct_colbert-v2-hnp-bf \
    --encoder ./tct_colbert_v2_hnp \
    --batch-size 64 \
    --output ./runs/tctv2-prf3.res \
    --prf-method tctv2-prf \
    --threads 12 \
    --sparse-index msmarco-passage \
    --prf-encoder ./tct-colbert-v2-prf3/checkpoint-10000 \
    --prf-depth 3

Or one can use pre-built index and models available in Pyserini:

python -m pyserini.dsearch --topics dl19-passage \
    --index msmarco-passage-tct_colbert-v2-hnp-bf \
    --encoder castorini/tct_colbert-v2-hnp-msmarco \
    --batch-size 64 \
    --output ./runs/tctv2-prf3.res \
    --prf-method tctv2-prf \
    --threads 12 \
    --sparse-index msmarco-passage \
    --prf-encoder ./tct-colbert-v2-prf3/checkpoint-10000 \
    --prf-depth 3

The PRF depth --prf-depth 3 depends on the PRF encoder trained, if trained with PRF 3, here only can use PRF 3.

Where --topics can be: TREC DL 2019 Passage: dl19-passage TREC DL 2020 Passage: dl20 MS MARCO Passage V1: msmarco-passage-dev-subset

--encoder can be: ANCE: castorini/ance-msmarco-passage TCT-ColBERT V2 HN+: castorini/tct_colbert-v2-hnp-msmarco DistilBERT Balanced: sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco

--index can be: ANCE index with MS MARCO V1 passage collection: msmarco-passage-ance-bf TCT-ColBERT V2 HN+ index with MS MARCO V1 passage collection: msmarco-passage-tct_colbert-v2-hnp-bf DistillBERT Balanced index with MS MARCO V1 passage collection: msmarco-passage-distilbert-dot-tas_b-b256-bf

To evaluate the run:

TREC DL 2019

python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 -m recall.1000 -l 2 dl19-passage ./runs/tctv2-prf3.res

TREC DL 2020

python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 -m recall.1000 -l 2 dl20-passage ./runs/tctv2-prf3.res

MS MARCO Passage Ranking V1

python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset ./runs/tctv2-prf3.res
Owner
ielab
The Information Engineering Lab
ielab
Applications using the GTN library and code to reproduce experiments in "Differentiable Weighted Finite-State Transducers"

gtn_applications An applications library using GTN. Current examples include: Offline handwriting recognition Automatic speech recognition Installing

Facebook Research 68 Dec 29, 2022
A Python package for faster, safer, and simpler ML processes

Bender 🤖 A Python package for faster, safer, and simpler ML processes. Why use bender? Bender will make your machine learning processes, faster, safe

Otovo 6 Dec 13, 2022
High-Resolution 3D Human Digitization from A Single Image.

PIFuHD: Multi-Level Pixel-Aligned Implicit Function for High-Resolution 3D Human Digitization (CVPR 2020) News: [2020/06/15] Demo with Google Colab (i

Meta Research 8.4k Dec 29, 2022
Adversarial Attacks on Probabilistic Autoregressive Forecasting Models.

Attack-Probabilistic-Models This is the source code for Adversarial Attacks on Probabilistic Autoregressive Forecasting Models. This repository contai

SRI Lab, ETH Zurich 25 Sep 14, 2022
Code for EMNLP2020 long paper: BERT-Attack: Adversarial Attack Against BERT Using BERT

BERT-ATTACK Code for our EMNLP2020 long paper: BERT-ATTACK: Adversarial Attack Against BERT Using BERT Dependencies Python 3.7 PyTorch 1.4.0 transform

Linyang Li 142 Jan 04, 2023
Simple tutorials on Pytorch DDP training

pytorch-distributed-training Distribute Dataparallel (DDP) Training on Pytorch Features Easy to study DDP training You can directly copy this code for

Ren Tianhe 188 Jan 06, 2023
Simple and Robust Loss Design for Multi-Label Learning with Missing Labels

Simple and Robust Loss Design for Multi-Label Learning with Missing Labels Official PyTorch Implementation of the paper Simple and Robust Loss Design

Xinyu Huang 28 Oct 27, 2022
Official repository for the paper, MidiBERT-Piano: Large-scale Pre-training for Symbolic Music Understanding.

MidiBERT-Piano Authors: Yi-Hui (Sophia) Chou, I-Chun (Bronwin) Chen Introduction This is the official repository for the paper, MidiBERT-Piano: Large-

137 Dec 15, 2022
Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand

Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand Introduction We propose a generalization of leaderboards, bidimensional leader

4 Dec 03, 2022
Explainability for Vision Transformers (in PyTorch)

Explainability for Vision Transformers (in PyTorch) This repository implements methods for explainability in Vision Transformers

Jacob Gildenblat 442 Jan 04, 2023
A framework for annotating 3D meshes using the predictions of a 2D semantic segmentation model.

Semantic Meshes A framework for annotating 3D meshes using the predictions of a 2D semantic segmentation model. Paper If you find this framework usefu

Florian 40 Dec 09, 2022
A code generator from ONNX to PyTorch code

onnx-pytorch Generating pytorch code from ONNX. Currently support onnx==1.9.0 and torch==1.8.1. Installation From PyPI pip install onnx-pytorch From

Wenhao Hu 94 Jan 06, 2023
Aiming at the common training datsets split, spectrum preprocessing, wavelength select and calibration models algorithm involved in the spectral analysis process

Aiming at the common training datsets split, spectrum preprocessing, wavelength select and calibration models algorithm involved in the spectral analysis process, a complete algorithm library is esta

Fu Pengyou 50 Jan 07, 2023
Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch

Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch

Phil Wang 12.6k Jan 09, 2023
ZeroVL - The official implementation of ZeroVL

This repository contains source code necessary to reproduce the results presente

31 Nov 04, 2022
(CVPR2021) DANNet: A One-Stage Domain Adaptation Network for Unsupervised Nighttime Semantic Segmentation

DANNet: A One-Stage Domain Adaptation Network for Unsupervised Nighttime Semantic Segmentation CVPR2021(oral) [arxiv] Requirements python3.7 pytorch==

W-zx-Y 85 Dec 07, 2022
Wordle-solver - Wordle answer generation program in python

🟨 Wordle Solver 🟩 Wordle answer generation program in python ✔️ Requirements U

Dahyun Kang 4 May 28, 2022
Convert Python 3 code to CUDA code.

Py2CUDA Convert python code to CUDA. Usage To convert a python file say named py_file.py to CUDA, run python generate_cuda.py --file py_file.py --arch

Yuval Rosen 3 Jul 14, 2021
Code for paper PairRE: Knowledge Graph Embeddings via Paired Relation Vectors.

PairRE Code for paper PairRE: Knowledge Graph Embeddings via Paired Relation Vectors. This implementation of PairRE for Open Graph Benchmak datasets (

Alipay 65 Dec 19, 2022
Training a deep learning model on the noisy CIFAR dataset

Training-a-deep-learning-model-on-the-noisy-CIFAR-dataset This repository contai

1 Jun 14, 2022