Python library containing BART query generation and BERT-based Siamese models for neural retrieval.

Overview

Neural Retrieval

Embedding-based Zero-shot Retrieval through Query Generation leverages query synthesis over large corpora of unlabeled text (such as Wikipedia) to pre-train siamese neural retrieval models. The resulting models significantly improve over previous BM25 baselines as well as state-of-the-art neural methods.

This package supports query synthesis with BART-large and includes code for training and finetuning a transformer-based neural retriever. We also provide pre-generated synthetic queries on Wikipedia and the corresponding pre-trained models, both obtainable through our download scripts.
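
To make the idea of embedding-based retrieval concrete, here is a minimal sketch of the ranking step. It assumes that queries and passages have already been encoded into fixed-size vectors and that relevance is scored by inner product; the scoring function, the helper names (dot, top_k) and the toy vectors are illustrative assumptions, not this package's API.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k(
    query_vec: Sequence[float],
    passage_vecs: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k passages with the highest inner-product score."""
    scored = ((dot(query_vec, vec), pid) for pid, vec in passage_vecs.items())
    best = heapq.nlargest(k, scored)
    return [(pid, score) for score, pid in best]


# Toy 3-dimensional embeddings (hypothetical values, for illustration only).
passages = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.1, 0.8, 0.1],
    "p3": [0.7, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]

ranking = top_k(query, passages, k=2)
print(ranking)
```

In a siamese setup the same kind of encoder produces both the query and the passage vectors, so passage embeddings can be computed once offline and only the query has to be encoded at search time.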

Paper: Davis Liang*, Peng Xu*, Siamak Shakeri, Cicero Nogueira dos Santos, Ramesh Nallapati, Zhiheng Huang, Bing Xiang, Embedding-based Zero-shot Retrieval through Query Generation, 2020.

Getting Started

dependencies:

pip install torch torchvision transformers tqdm

running setup

python setup.py install --user

Package        Version
torch          >=1.6.0
transformers   >=3.0.2
tqdm           4.43.0

WikiGQ dataset and Pretrained Neural Retrieval Model

  • WikiGQ: We process the Wikipedia 2016 dump and split it into passages with a maximum length of 100 while respecting sentence boundaries. We synthesize over 100M queries using BART-large models. The split passage and synthetic query files can be downloaded from here.
  • Siamese-BERT-base-model: We release our siamese-bert-base-model trained on the WikiGQ dataset. The model files can be downloaded from here.
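
The passage splitting described for WikiGQ can be sketched as a greedy packing of whole sentences. This is an illustration only: it assumes that length is counted in whitespace-separated words and that sentences end at ".", "!" or "?", neither of which is confirmed by this README, and split_into_passages is a hypothetical helper rather than a function shipped with the package.

```python
import re
from typing import List


def split_into_passages(text: str, max_len: int = 100) -> List[str]:
    """Greedily pack whole sentences into passages of at most max_len words.

    A single sentence longer than max_len is kept whole in its own passage,
    because this sketch never cuts inside a sentence.
    """
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    passages: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new passage if adding this sentence would exceed the budget.
        if current and current_len + n_words > max_len:
            passages.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        passages.append(" ".join(current))
    return passages


demo = "One two three. Four five six seven. Eight nine. Ten."
chunks = split_into_passages(demo, max_len=5)
for chunk in chunks:
    print(chunk)
```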

Training and Evaluation

Example: Natural Questions (NQ)

Here we use the Natural Questions data as an example. Please download the simplified version of the training set, and use the supplied simplify_nq_example function in simplify_nq_data.py to create the simplified dev set as well.

process the data

We provide a Python script to convert the data into the format our model consumes.

NQ_DIR=YOUR PATH TO SIMPLIFIED NQ TRAIN AND DEV FILES
python data_processsing/nq_preprocess.py \
--trainfile $NQ_DIR/v1.0-simplified-train.jsonl.gz \
--devfile $NQ_DIR/v1.0-simplified-dev.jsonl.gz \
--passagefile $NQ_DIR/all_passages.jsonl \
--queries_trainfile $NQ_DIR/train_queries.json \
--answers_trainfile $NQ_DIR/train_answers.json \
--queries_devfile $NQ_DIR/dev_queries.json \
--answers_devfile $NQ_DIR/dev_answers.json \
--qrelsfile $NQ_DIR/all_qrels.txt

training

OUTPUT_DIR=./output
mkdir -p $OUTPUT_DIR
python examples/neural_retrieval.py \
--query_len 64 \
--passage_len 288 \
--epochs 10 \
--sample_size 0 \
--batch_size 50 \
--embed_size 128 \
--print_iter 200 \
--eval_iter 0 \
--passagefile $NQ_DIR/all_passages.jsonl \
--train_queryfile $NQ_DIR/train_queries.json \
--train_answerfile $NQ_DIR/train_answers.json \
--save_model $OUTPUT_DIR/siamese_model.pt \
--share \
--gpu \
--num_nodes 1 \
--num_gpus 1 \
--train 

This will generate two model files in OUTPUT_DIR: siamese_model.pt.doc and siamese_model.pt.query. They are identical if you pass --share during training.

Inference

  • Passage Embedding
python examples/neural_retrieval.py \
--query_len 64 \
--passage_len 288 \
--embed_size 128 \
--passagefile $NQ_DIR/all_passages.jsonl \
--gpu \
--num_nodes 1 \
--num_gpus 1 \
--local_rank 0 \
--doc_embed \
--doc_embed_file $OUTPUT_DIR/psg_embeds.csv \
--save_model $OUTPUT_DIR/siamese_model.pt 
  • Running Retrieval
python examples/neural_retrieval.py \
--query_len 64 \
--passage_len 288 \
--batch_size 100 \
--embed_size 128 \
--test_queryfile $NQ_DIR/dev_queries.json \
--gpu \
--num_nodes 1 \
--num_gpus 1 \
--local_rank 0 \
--topk 100 \
--query_embed \
--query_embed_file $OUTPUT_DIR/dev_query_embeds.csv \
--generate_retrieval \
--doc_embed_file $OUTPUT_DIR/psg_embeds.csv \
--save_model $OUTPUT_DIR/siamese_model.pt  \
--retrieval_outputfile $OUTPUT_DIR/dev_results.json
  • Evaluation

We use trec_eval to do the evaluation.

trec_eval $NQ_DIR/all_qrels.txt $OUTPUT_DIR/dev_results.json.txt -m recall 
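
As a quick sanity check next to trec_eval, recall at a cutoff k can be approximated in a few lines of Python. The sketch below assumes the standard TREC layouts (qrels lines of the form "qid iteration docid relevance" and run lines of the form "qid Q0 docid rank score tag"). It averages over every query that has at least one relevant passage, so its numbers may differ slightly from trec_eval's defaults. The helper names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def parse_qrels(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each query id to the set of passage ids judged relevant."""
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        qid, _iteration, docid, rel = parts
        if int(rel) > 0:
            relevant[qid].add(docid)
    return dict(relevant)


def parse_run(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each query id to its passage ids, sorted by descending score."""
    scored = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        qid, _q0, docid, _rank, score, _tag = parts
        scored[qid].append((float(score), docid))
    return {
        qid: [docid for _, docid in sorted(items, key=lambda x: -x[0])]
        for qid, items in scored.items()
    }


def recall_at_k(qrels: Dict[str, Set[str]], run: Dict[str, List[str]], k: int) -> float:
    """Mean fraction of relevant passages found in the top k results."""
    per_query = []
    for qid, relevant in qrels.items():
        if not relevant:
            continue
        retrieved = set(run.get(qid, [])[:k])  # missing queries score 0
        per_query.append(len(retrieved & relevant) / len(relevant))
    return sum(per_query) / len(per_query) if per_query else 0.0


# Tiny in-memory example (made-up ids and scores).
qrels_lines = ["q1 0 p1 1", "q1 0 p2 1", "q2 0 p9 1"]
run_lines = [
    "q1 Q0 p1 1 0.9 demo",
    "q1 Q0 p5 2 0.8 demo",
    "q1 Q0 p2 3 0.1 demo",
    "q2 Q0 p3 1 0.7 demo",
    "q2 Q0 p9 2 0.6 demo",
]
qrels = parse_qrels(qrels_lines)
run = parse_run(run_lines)
print(recall_at_k(qrels, run, k=2))
```

To use it on real files, pass open file handles (for example the qrels file and the retrieval output given to trec_eval above) in place of the in-memory lists.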

BART Model for Query Generation

Finetune BART-QG Model on MSMARCO-PR dataset

MSMARCO_PATH=YOUR PATH TO MSMARCO FILES
QG_MODEL_OUTPUT=./qg_model_output
mkdir -p $QG_MODEL_OUTPUT
CUDA_VISIBLE_DEVICES=0,1,2,3 python examples/bart_qg.py \
--corpusfile $MSMARCO_PATH/collection.tsv \
--train_queryfile $MSMARCO_PATH/queries.train.tsv \
--train_qrelfile $MSMARCO_PATH/qrels.train.tsv \
--valid_queryfile $MSMARCO_PATH/queries.dev.tsv \
--valid_qrelfile $MSMARCO_PATH/qrels.dev.tsv \
--max_input_len 300 \
--max_output_len 100 \
--epochs 5 \
--lr 3e-5 \
--warmup 0.1 \
--wd 1e-3 \
--batch_size 24 \
--print_iter 100 \
--eval_iter 5000 \
--log ms_log \
--save_model $QG_MODEL_OUTPUT/best_qg.pt \
--gpu

Generate Synthetic Queries

As an example, we generate synthetic queries on NQ passages.

QG_OUTPUT_DIR=./qg_output
mkdir -p $QG_OUTPUT_DIR
python examples/bart_qg.py \
--test_corpusfile $QG_OUTPUT_DIR/all_passages.jsonl \
--test_outputfile $QG_OUTPUT_DIR/generated_questions.txt \
--generated_queriesfile $QG_OUTPUT_DIR/syn_queries.json \
--generated_answersfile $QG_OUTPUT_DIR/syn_answers.json \
--model_path $QG_MODEL_OUTPUT/best_qg_ms.pt \
--test \
--num_beams 5 \
--do_sample \
--num_samples 10 \
--top_p 0.95 \
--gpu
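
For readers unfamiliar with the --top_p flag, the sketch below shows the usual nucleus (top-p) filtering step on a toy next-token distribution: keep the smallest set of most probable tokens whose cumulative probability reaches top_p, renormalize, and sample from that set. It is a generic illustration and assumes the script follows this common definition. The token names and probabilities are made up, and top_p_filter is a hypothetical helper.

```python
import random
from typing import Dict


def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most probable tokens until their mass reaches top_p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Toy next-token distribution (hypothetical values).
next_token_probs = {"what": 0.5, "who": 0.3, "when": 0.15, "banana": 0.05}

nucleus = top_p_filter(next_token_probs, top_p=0.9)
print(sorted(nucleus))

# Draw one token from the filtered distribution.
rng = random.Random(0)
tokens = list(nucleus)
sampled = rng.choices(tokens, weights=[nucleus[t] for t in tokens], k=1)[0]
print(sampled)
```

A larger top_p keeps more of the low-probability tail and so tends to yield more varied queries per passage, while a smaller value concentrates sampling on the most likely tokens.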

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.
