SIGIR'22 paper: Axiomatically Regularized Pre-training for Ad hoc Search

Overview

img

THUIR License made-with-python code-size

Introduction

This codebase contains source-code of the Python-based implementation (ARES) of our SIGIR 2022 paper.

Requirements

  • python 3.7
  • torch==1.9.0
  • transformers==4.9.2
  • tqdm, nltk, numpy, boto3
  • trec_eval for evaluation on TREC DL 2019
  • anserini for generating "RANK" axiom scores

Why this repo?

In this repo, you can pre-train ARESsimple and TransformerICT models, and fine-tune all pre-trained models with the same architecture as BERT. The papers are listed as follows:

You can download the pre-trained ARES checkpoint ARESsimple from Google drive and extract it.

Pre-training Data

Download data

Download the MS MARCO corpus from the official website.
Download the ADORE+STAR Top100 Candidates files from this repo.

Pre-process data

To save memory, we store most files using the numpy memmap or jsonl format in the ./preprocess directory.

Document files:

  • doc_token_ids.memmap: each line is the token ids for a document
  • docid2idx.json: {docid: memmap_line_id}

Query files:

  • queries.doctrain.jsonl: MS MARCO training queries {"id" qid, "ids": token_ids} for each line
  • queries.docdev.jsonl: MS MARCO validating queries {"id" qid, "ids": token_ids} for each line
  • queries.dl2019.jsonl: TREC DL 2019 queries {"id" qid, "ids": token_ids} for each line

Human label files:

  • msmarco-doctrain-qrels.tsv: qid 0 docid 1 for training set
  • dev-qrels.txt: qid relevant_docid for validating set
  • 2019qrels-docs.txt: qid relevant_docid for TREC DL 2019 set

Top 100 candidate files:

  • train.rank.tsv, dev.rank.tsv, test.rank.tsv: qid docid rank for each line

Pseudo queries and axiomatic features:

  • doc2qs.jsonl: {"docid": docid, "queries": [qids]} for each line
  • sample_qs_token_ids.memmap: each line is the token ids for a pseudo query
  • sample_qid2id.json: {qid: memmap_line_id}
  • axiom.memmap: axiom can be one of the ['rank', 'prox-1', 'prox-2', 'rep-ql', 'rep-tfidf', 'reg', 'stm-1', 'stm-2', 'stm-3'], each line is an axiomatic score for a query

Quick Start

Note that to accelerate the training process, we adopt the parallel training technique. The scripts for pre-training and fine-tuning are as follow:

Pre-training

export BERT_DIR=/path/to/bert-base/
export XGB_DIR=/path/to/xgboost.model

cd pretrain

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 NCCL_BLOCKING_WAIT=1 \
python  -m torch.distributed.launch --nproc_per_node=6 --nnodes=1 train.py \
        --model_type ARES \
        --PRE_TRAINED_MODEL_NAME BERT_DIR \
        --gpu_num 6 --world_size 6 \
        --MLM --axiom REP RANK REG PROX STM \
        --clf_model XGB_DIR

Here model type can be ARES or ICT.

Zero-shot evaluation (based on AS top100)

export MODEL_DIR=/path/to/ares-simple/
export CKPT_NAME=ares.ckpt

cd finetune

CUDA_VISIBLE_DEVICES=0 python train.py \
        --test \
        --PRE_TRAINED_MODEL_NAME MODEL_DIR \
        --model_type ARES \
        --model_name ARES_simple \
        --load_ckpt \
        --model_path CKPT_NAME

You can get:

#####################
<----- MS Dev ----->
MRR @10: 0.2991
MRR @100: 0.3130
QueriesRanked: 5193
#####################

on MS MARCO dev set and:

#############################
<--------- DL 2019 --------->
QueriesRanked: 43
nDCG @10: 0.5955
nDCG @100: 0.4863
#############################

on DL 2019 set.

Fine-tuning

export MODEL_DIR=/path/to/ares-simple/

cd finetune

CUDA_VISIBLE_DEVICES=0,1,2,3 NCCL_BLOCKING_WAIT=1 \
python -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 train.py \
        --model_type ARES \
        --distributed_train \
        --PRE_TRAINED_MODEL_NAME MODEL_DIR \
        --gpu_num 4 --world_size 4 \
        --model_name ARES_simple

Visualization

export MODEL_DIR=/path/to/ares-simple/
export SAVE_DIR=/path/to/output/
export CKPT_NAME=ares.ckpt

cd visualization

CUDA_VISIBLE_DEVICES=0 python visual.py \
    --PRE_TRAINED_MODEL_NAME MODEL_DIR \
    --model_name ARES_simple \
    --visual_q_num 1 \
    --visual_d_num 5 \
    --save_path SAVE_DIR \
    --model_path CKPT_NAME

Results

Zero-shot performance:

Model Name MS MARCO [email protected] MS MARCO [email protected] DL [email protected] DL [email protected] COVID EQ
BM25 0.2962 0.3107 0.5776 0.4795 0.4857 0.6690
BERT 0.1820 0.2012 0.4059 0.4198 0.4314 0.6055
PROPwiki 0.2429 0.2596 0.5088 0.4525 0.4857 0.5991
PROPmarco 0.2763 0.2914 0.5317 0.4623 0.4829 0.6454
ARESstrict 0.2630 0.2785 0.4942 0.4504 0.4786 0.6923
AREShard 0.2627 0.2780 0.5189 0.4613 0.4943 0.6822
ARESsimple 0.2991 0.3130 0.5955 0.4863 0.4957 0.6916

Few-shot performance: img

Visualization (attribution values have been normalized within a document): img

Citation

If you find our work useful, please do not save your star and cite our work:

@inproceedings{chen2022axiomatically,
  title={Axiomatically Regularized Pre-training for Ad hoc Search},
  author={Chen, Jia and Liu, Yiqun and Fang, Yan and Mao, Jiaxin and Fang, Hui and Yang, Shenghao and Xie, Xiaohui and Zhang, Min and Ma, Shaoping},
  booktitle={Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2022}
}

Notice

  • Please make sure that all the pre-trained model parameters have been loaded correctly, or the zero-shot and the fine-tuning performance will be greatly impacted.
  • We welcome anyone who would like to contribute to this repo. 🤗
  • If you have any other questions, please feel free to contact me via [email protected] or open an issue.
  • Code for data preprocessing will come soon. Please stay tuned~
Owner
Jia Chen
My life is a beauty. 🦋
Jia Chen
The Internet Archive Research Assistant - Daily search Internet Archive for new items matching your keywords

The Internet Archive Research Assistant - Daily search Internet Archive for new items matching your keywords

Kay Savetz 60 Dec 25, 2022
NLPShala , the best IDE for all Natural language processing tasks.

The revolutionary IDE for all NLP (Natural language processing) stuffs on the internet.

Abhi 3 Aug 08, 2021
PyTorch implementation of Tacotron speech synthesis model.

tacotron_pytorch PyTorch implementation of Tacotron speech synthesis model. Inspired from keithito/tacotron. Currently not as much good speech quality

Ryuichi Yamamoto 279 Dec 09, 2022
STonKGs is a Sophisticated Transformer that can be jointly trained on biomedical text and knowledge graphs

STonKGs STonKGs is a Sophisticated Transformer that can be jointly trained on biomedical text and knowledge graphs. This multimodal Transformer combin

STonKGs 27 Aug 11, 2022
This is the 25 + 1 year anniversary version of the 1995 Rachford-Rice contest

Rachford-Rice Contest This is the 25 + 1 year anniversary version of the 1995 Rachford-Rice contest. Can you solve the Rachford-Rice problem for all t

13 Sep 20, 2022
Pretty-doc - Composable text objects with python

pretty-doc from __future__ import annotations from dataclasses import dataclass

Taine Zhao 2 Jan 17, 2022
Pipelines de datos, 2021.

Este repo ilustra un proceso sencillo de automatización de transformación y modelado de datos, a través de un pipeline utilizando Luigi. Stack princip

Rodolfo Ferro 8 May 19, 2022
Every Google, Azure & IBM text to speech voice for free

TTS-Grabber Quick thing i made about a year ago to download any text with any tts voice, over 630 voices to choose from currently. It will split the i

16 Dec 07, 2022
Meta learning algorithms to train cross-lingual NLI (multi-task) models

Meta learning algorithms to train cross-lingual NLI (multi-task) models

M.Hassan Mojab 4 Nov 20, 2022
초성 해석기 based on ko-BART

초성 해석기 개요 한국어 초성만으로 이루어진 문장을 입력하면, 완성된 문장을 예측하는 초성 해석기입니다. 초성: ㄴㄴ ㄴㄹ ㅈㅇㅎ 예측 문장: 나는 너를 좋아해 모델 모델은 SKT-AI에서 공개한 Ko-BART를 이용합니다. 데이터 문장 단위로 이루어진 아무 코퍼스나

Dawoon Jung 29 Oct 28, 2022
Outreachy TFX custom component project

Schema Curation Custom Component Outreachy TFX custom component project This repo contains the code for Schema Curation Custom Component made as a par

Robert Crowe 5 Jul 16, 2021
NeMo: a toolkit for conversational AI

NVIDIA NeMo Introduction NeMo is a toolkit for creating Conversational AI applications. NeMo product page. Introductory video. The toolkit comes with

NVIDIA Corporation 5.3k Jan 04, 2023
Language-Agnostic SEntence Representations

LASER Language-Agnostic SEntence Representations LASER is a library to calculate and use multilingual sentence embeddings. NEWS 2019/11/08 CCMatrix is

Facebook Research 3.2k Jan 04, 2023
Source code for AAAI20 "Generating Persona Consistent Dialogues by Exploiting Natural Language Inference".

Generating Persona Consistent Dialogues by Exploiting Natural Language Inference Source code for RCDG model in AAAI20 Generating Persona Consistent Di

16 Oct 08, 2022
Search with BERT vectors in Solr and Elasticsearch

Search with BERT vectors in Solr and Elasticsearch

Dmitry Kan 123 Dec 29, 2022
Research Code for NeurIPS 2020 Spotlight paper "Large-Scale Adversarial Training for Vision-and-Language Representation Learning": UNITER adversarial training part

VILLA: Vision-and-Language Adversarial Training This is the official repository of VILLA (NeurIPS 2020 Spotlight). This repository currently supports

Zhe Gan 109 Dec 31, 2022
Prompt tuning toolkit for GPT-2 and GPT-Neo

mkultra mkultra is a prompt tuning toolkit for GPT-2 and GPT-Neo. Prompt tuning injects a string of 20-100 special tokens into the context in order to

61 Jan 01, 2023
Word2Wave: a framework for generating short audio samples from a text prompt using WaveGAN and COALA.

Word2Wave is a simple method for text-controlled GAN audio generation. You can either follow the setup instructions below and use the source code and CLI provided in this repo or you can have a play

Ilaria Manco 91 Dec 23, 2022
Telegram AI chat bot written in Python using Pyrogram

Aurora_Al Just another Telegram AI chat bot written in Python using Pyrogram. A public running instance can be found on telegram as @AuroraAl. Require

♗CσNϙUҽRσR_MҽSƙEƚҽҽR 1 Oct 31, 2021
novel deep learning research works with PaddlePaddle

Research 发布基于飞桨的前沿研究工作,包括CV、NLP、KG、STDM等领域的顶会论文和比赛冠军模型。 目录 计算机视觉(Computer Vision) 自然语言处理(Natrual Language Processing) 知识图谱(Knowledge Graph) 时空数据挖掘(Spa

1.5k Jan 03, 2023