Repo for WWW 2022 paper: Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval

Related tags

Deep LearningBiDR
Overview

BiDR

Repo for WWW 2022 paper: Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval.

Requirements

torch==1.7
transformers==4.6
faiss-gpu==1.6.4.post2

Data Download and Preprocess

bash download_data.sh
python preprocess.py

These commands will download and preprocess the MSMARCO Passage and Doc dataset, then the resutls will be saved to ./data.
We take the Passage dataset as the example to show the running workflow.

Conventional Workflow

Representation Learning

Train the encoder with random negative (or set --hardneg_json to provied bm25/hard negatives) :

mkdir log
dataset=passage
savename=dense_global_model
python train.py --model_name_or_path roberta-base \
--max_query_length 24 --max_doc_length 128 \
--data_dir ./data/${dataset}/preprocess \
--learning_rate 1e-4 --optimizer_str adamw \
--per_device_train_batch_size 128 \
--per_query_neg_num 1 \
--generate_batch_method random \
--loss_method multi_ce  \
--savename ${savename} --save_model_path ./model \
--world_size 8 --gpu_rank 0_1_2_3_4_5_6_7  --master_port 13256 \
--num_train_epochs 30  \
--use_pq False \
|tee ./log/${savename}.log

Unsupervised Quantization

Generate dense embeddings of queries and docs:

data_type=passage
savename=dense_global_model
epoch=20
python ./inference.py \
--data_type ${data_type} \
--preprocess_dir ./data/${data_type}/preprocess/ \
--max_doc_length 256 --max_query_length 32 \
--eval_batch_size 512 \
--ckpt_path ./model/${savename}/${epoch}/ \
--output_dir  evaluate/${savename}_${epoch} 

Product Quantization based on Faiss and recall performance:

data_type=passage
savename=dense_global_model
epoch=20
python ./test/lightweight_ann.py \
--output_dir ./data/${data_type}/evaluate/${savename}_${epoch} \
--ckpt_path /model/${savename}/${epoch}/ \
--subvector_num 96 \
--index opq \
--topk 1000 \
--data_type ${data_type} \
--MRR_cutoff 10 \
--Recall_cutoff 5 10 30 50 100

Progressively Optimized Bi-Granular Document Representation

Sparse Representation Learning

Instead of running unsupervised quantization for the well-learned dense embeddings, the sparse embeddings are generated from contrastive learning, which optimizes the global discrimination and helps to enable high-quality answers to be covered in candidate search.

Train

We find that using Faiss OPQ to initialize the PQ module has a significant gain for MSMARCO dataset. But for the largest dataset: Ads dataset, initialization with Faiss OPQ is redundant and has no promotion.

dataset=passage
savename=sparse_global_model
python train.py --model_name_or_path ./model/dense_global_model/20 \
--max_query_length 24 --max_doc_length 128 \
--data_dir ./data/${dataset}/preprocess \
--learning_rate 1e-4 --optimizer_str adamw \
--per_device_train_batch_size 128 \
--per_query_neg_num 1 \
--generate_batch_method random \
--loss_method multi_ce  \
--savename ${savename} --save_model_path ./model \
--world_size 8 --gpu_rank 0_1_2_3_4_5_6_7  --master_port 13256 \
--num_train_epochs 30  \
--use_pq True \
--init_index_path ./data/${data_type}/evaluate/dense_global_model_20/OPQ96,PQ96x8.index \
--partition 96 --centroids 256 --quantloss_weight 0.0 \
|tee ./log/${savename}.log

where the ./model/dense_global_model/20 and ./data/${data_type}/evaluate/dense_global_model_20/OPQ96,PQ96x8.index is generated by conventional workflow.

Test

data_type=passage
savename=sparse_global_model
epoch=20

python ./inference.py \
--data_type ${data_type} \
--preprocess_dir ./data/${data_type}/preprocess/ \
--max_doc_length 256 --max_query_length 32 \
--eval_batch_size 512 \
--ckpt_path ./model/${savename}/${epoch}/ \
--output_dir  evaluate/${savename}_${epoch} 

python ./test/lightweight_ann.py \
--output_dir ./data/${data_type}/evaluate/${savename}_${epoch} \
--subvector_num 96 \
--index opq \
--topk 1000 \
--data_type ${data_type} \
--MRR_cutoff 10 \
--Recall_cutoff 5 10 30 50 100 \
--ckpt_path ./model/${savename}/${epoch}/ \
--init_index_path ./data/${data_type}/evaluate/dense_global_model_20/OPQ96,PQ96x8.index

Dense Representation Learning

The dense embeddings are optimized based on the candidate distribution generated by sparse embeddings. We propose a novel sampling strategy called locality-centric sampling to enhance local discrimination: construct a bipartite proximity graph and conduct random walk or snow sample on it.

Train

Encode the quries in train set and generate the candidates for all train queries:

data_type=passage
savename=sparse_global_model
epoch=20

python ./inference.py \
--data_type ${data_type} \
--preprocess_dir ./data/${data_type}/preprocess/ \
--max_doc_length 256 --max_query_length 32 \
--eval_batch_size 512 \
--ckpt_path ./model/${savename}/${epoch}/ \
--output_dir  evaluate/${savename}_${epoch} \
--mode train

python ./test/lightweight_ann.py \
--output_dir ./data/${data_type}/evaluate/${savename}_${epoch} \
--subvector_num 96 \
--index opq \
--topk 1000 \
--data_type ${data_type} \
--MRR_cutoff 10 \
--Recall_cutoff 5 10 30 50 100 \
--ckpt_path ./model/${savename}/${epoch}/ \
--init_index_path ./data/${data_type}/evaluate/dense_global_model_20/OPQ96,PQ96x8.index \
--mode train \
--save_hardneg_to_json

This command will save the train_hardneg.json to output_dir. Then train the dense embeddings to distinguish the ground truth from the negative in candidate:

dataset=passage
savename=dense_local_model
python train.py --model_name_or_path roberta-base \
--max_query_length 24 --max_doc_length 128 \
--data_dir ./data/${dataset}/preprocess \
--learning_rate 1e-4 --optimizer_str adamw \
--per_device_train_batch_size 128 \
--per_query_neg_num 1 \
--generate_batch_method {random_walk or snow_sample} \
--loss_method multi_ce  \
--savename ${savename} --save_model_path ./model \
--world_size 8 --gpu_rank 0_1_2_3_4_5_6_7  --master_port 13256 \
--num_train_epochs 30  \
--use_pq False \
--hardneg_json ./data/${data_type}/evaluate/sparse_global_model_20/train_hardneg.json \
--mink 0  --maxk 200 \
|tee ./log/${savename}.log

Test

data_type=passage
savename=dense_local_model
epoch=10

python ./inference.py \
--data_type ${data_type} \
--preprocess_dir ./data/${data_type}/preprocess/ \
--ckpt_path ./model/${savename}/${epoch}/ \
--max_doc_length 256 --max_query_length 32 \
--eval_batch_size 512 \
--ckpt_path ./model/${savename}/${epoch}/ \
--output_dir  evaluate/${savename}_${epoch} 

python ./test/post_verification.py \
--data_type ${data_type} \
--output_dir  evaluate/${savename}_${epoch} \
--candidate_from_ann ./data/${data_type}/evaluate/sparse_global_model_20/dev.rank_1000_score_faiss_opq.tsv \
--MRR_cutoff 10 \
--Recall_cutoff 5 10 30 50 100

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

Owner
Microsoft
Open source projects and samples from Microsoft
Microsoft
Neural Articulated Radiance Field

Neural Articulated Radiance Field NARF Neural Articulated Radiance Field Atsuhiro Noguchi, Xiao Sun, Stephen Lin, Tatsuya Harada ICCV 2021 [Paper] [Co

Atsuhiro Noguchi 144 Jan 03, 2023
A visualization tool to show a TensorFlow's graph like TensorBoard

tfgraphviz tfgraphviz is a module to visualize a TensorFlow's data flow graph like TensorBoard using Graphviz. tfgraphviz enables to provide a visuali

44 Nov 09, 2022
A light weight data augmentation tool for training CNNs and Viola Jones detectors

hey-daug A light weight data augmentation tool for training CNNs and Viola Jones detectors (Haar Cascades). This tool inflates your data by up to six

Jaiyam Sharma 2 Nov 23, 2019
The codes reproduce the figures and statistics in the paper, "Controlling for multiple covariates," by Mark Tygert.

The accompanying codes reproduce all figures and statistics presented in "Controlling for multiple covariates" by Mark Tygert. This repository also pr

Meta Research 1 Dec 02, 2021
🛠 All-in-one web-based IDE specialized for machine learning and data science.

All-in-one web-based development environment for machine learning Getting Started • Features & Screenshots • Support • Report a Bug • FAQ • Known Issu

Machine Learning Tooling 2.9k Jan 09, 2023
URIE: Universal Image Enhancementfor Visual Recognition in the Wild

URIE: Universal Image Enhancementfor Visual Recognition in the Wild This is the implementation of the paper "URIE: Universal Image Enhancement for Vis

Taeyoung Son 43 Sep 12, 2022
Multiview 3D object detection on MultiviewC dataset through moft3d.

Voxelized 3D Feature Aggregation for Multiview Detection [arXiv] Multiview 3D object detection on MultiviewC dataset through VFA. Introduction We prop

Jiahao Ma 20 Dec 21, 2022
Vit-ImageClassification - Pytorch ViT for Image classification on the CIFAR10 dataset

Vit-ImageClassification Introduction This project uses ViT to perform image clas

Kaicheng Yang 4 Jun 01, 2022
Source code for paper "Deep Superpixel-based Network for Blind Image Quality Assessment"

DSN-IQA Source code for paper "Deep Superpixel-based Network for Blind Image Quality Assessment" Requirements Python =3.8.0 Pytorch =1.7.1 Usage wit

7 Oct 13, 2022
The software associated with a paper accepted at EMNLP 2021 titled "Open Knowledge Graphs Canonicalization using Variational Autoencoders".

Open-KG-canonicalization The software associated with a paper accepted at EMNLP 2021 titled "Open Knowledge Graphs Canonicalization using Variational

International Business Machines 13 Nov 11, 2022
A tool to visualise the results of AlphaFold2 and inspect the quality of structural predictions

AlphaFold Analyser This program produces high quality visualisations of predicted structures produced by AlphaFold. These visualisations allow the use

Oliver Powell 3 Nov 13, 2022
PINN(s): Physics-Informed Neural Network(s) for von Karman vortex street

PINN(s): Physics-Informed Neural Network(s) for von Karman vortex street This is

ShotaDEGUCHI 2 Apr 18, 2022
Sentinel-1 vessel detection model used in the xView3 challenge

sar_vessel_detect Code for the AI2 Skylight team's submission in the xView3 competition (https://iuu.xview.us) for vessel detection in Sentinel-1 SAR

AI2 6 Sep 10, 2022
Block-wisely Supervised Neural Architecture Search with Knowledge Distillation (CVPR 2020)

DNA This repository provides the code of our paper: Blockwisely Supervised Neural Architecture Search with Knowledge Distillation. Illustration of DNA

Changlin Li 215 Dec 19, 2022
Fast SHAP value computation for interpreting tree-based models

FastTreeSHAP FastTreeSHAP package is built based on the paper Fast TreeSHAP: Accelerating SHAP Value Computation for Trees published in NeurIPS 2021 X

LinkedIn 369 Jan 04, 2023
SANet: A Slice-Aware Network for Pulmonary Nodule Detection

SANet: A Slice-Aware Network for Pulmonary Nodule Detection This paper (SANet) has been accepted and early accessed in IEEE TPAMI 2021. This code and

Jie Mei 39 Dec 17, 2022
A Dynamic Residual Self-Attention Network for Lightweight Single Image Super-Resolution

DRSAN A Dynamic Residual Self-Attention Network for Lightweight Single Image Super-Resolution Karam Park, Jae Woong Soh, and Nam Ik Cho Environments U

4 May 10, 2022
GraphRNN: Generating Realistic Graphs with Deep Auto-regressive Models

GraphRNN: Generating Realistic Graphs with Deep Auto-regressive Model This repository is the official PyTorch implementation of GraphRNN, a graph gene

Jiaxuan 568 Dec 29, 2022
Code for "PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation" CVPR 2019 oral

Good news! We release a clean version of PVNet: clean-pvnet, including how to train the PVNet on the custom dataset. Use PVNet with a detector. The tr

ZJU3DV 722 Dec 27, 2022
A set of tools for converting a darknet dataset to COCO format working with YOLOX

darknet格式数据→COCO darknet训练数据目录结构(详情参见dataset/darknet): darknet ├── class.names ├── gen_config.data ├── gen_train.txt ├── gen_valid.txt └── images

RapidAI-NG 148 Jan 03, 2023