An Open-Source Package for Information Retrieval.

OpenMatch

😃 What's New

  • Top Spot on TREC-COVID Challenge (May 2020, Round 2)

    The twin goals of the challenge are to evaluate search algorithms and systems for helping scientists, clinicians, policy makers, and others manage the existing and rapidly growing corpus of scientific literature related to COVID-19, and to discover methods that will assist with managing scientific information in future global biomedical crises.
    >> Reproduce Our Submission >> About COVID-19 Dataset >> Our Paper

Overview

OpenMatch integrates neural methods and technologies into a complete solution for deep text matching and understanding. The documentation and tutorials for OpenMatch are available here.

1/ Document Retrieval

Document Retrieval refers to extracting a set of relevant documents from a large-scale document collection based on user queries.

* Sparse Retrieval

Sparse Retriever is defined as a sparse bag-of-words retrieval model.
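
As a rough illustration of what a sparse bag-of-words scorer computes, the sketch below implements the standard Okapi BM25 formula in plain Python. It is not OpenMatch code (the model table lists BM25 as an external tool); the function name `bm25_scores`, the whitespace tokenization, and the defaults `k1=1.2` and `b=0.75` are assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25.

    Tokenization is naive lowercase whitespace splitting (an assumption
    that keeps the sketch self-contained). `docs` must be non-empty.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF; the "1 +" inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "lymphocyte count is an indicator for covid-19 prognosis",
    "neural ranking models for ad-hoc search",
    "covid-19 treatment and classification of patients",
]
scores = bm25_scores("classification treatment covid-19", docs)
```

Only documents that share at least one exact term with the query receive a non-zero score, which is the defining property of this kind of sparse matching.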

* Dense Retrieval

Dense Retriever performs retrieval by encoding documents and queries into dense low-dimensional vectors and selecting the documents that have the highest inner product with the query.
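
The inner-product selection can be sketched in a few lines of plain Python. This is an exhaustive search over toy vectors, whereas the model table lists approximate nearest neighbor (ANN) search for this stage; the names `inner_product` and `dense_retrieve` and the 3-dimensional embeddings are hypothetical and stand in for the output of a real encoder.

```python
import heapq


def inner_product(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def dense_retrieve(query_vec, doc_vecs, k=2):
    """Return the k (doc_index, score) pairs with the highest inner product."""
    scored = ((idx, inner_product(query_vec, vec)) for idx, vec in enumerate(doc_vecs))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy embeddings; a real system would obtain these from a trained encoder.
query_vec = [0.9, 0.1, 0.0]
doc_vecs = [
    [0.8, 0.2, 0.1],  # close to the query
    [0.0, 0.1, 0.9],  # unrelated
    [0.5, 0.5, 0.0],  # partially related
]
top = dense_retrieve(query_vec, doc_vecs, k=2)
```

Exhaustive scoring costs one inner product per document, which is why large collections typically rely on an approximate index instead.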

2/ Document Reranking

Document reranking further matches the user query against the documents retrieved in the previous step to produce a ranked list of relevant documents.

* Neural Ranker

Neural Ranker uses a neural network as the ranker to reorder documents.
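
The reordering step itself is independent of which ranker produces the scores. The sketch below shows it under stated assumptions: `rerank` and `overlap_score` are hypothetical names, and the term-overlap scorer is only a stand-in for a neural ranker such as the ones listed in the model table.

```python
def rerank(query, candidates, score_fn, top_k=None):
    """Reorder first-stage candidates by a ranker's relevance score."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    # list.sort() is stable, so tied documents keep their first-stage order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k] if top_k is not None else scored


def overlap_score(query, doc):
    """Stand-in scorer: fraction of query terms found in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms)


candidates = [
    "neural ranking for search",
    "covid-19 treatment classification",
    "treatment of influenza",
]
ranked = rerank("classification treatment covid-19", candidates, overlap_score, top_k=2)
```

Passing the scorer as a function keeps the pipeline unchanged when the stand-in is replaced by a model that returns a ranking score.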

* Feature Ensemble

Feature Ensemble fuses the neural features learned by the neural ranker with features from non-neural methods to obtain more robust performance.
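
One simple way to picture this fusion is a linear combination of per-document features whose weights are tuned one at a time. The sketch below is a simplified grid-based variant written for illustration, not the Coordinate Ascent implementation used by OpenMatch; the function names, the weight grid, the use of MRR as the objective, and the toy feature values are all assumptions.

```python
def fuse(features, weights):
    """Linear combination of one document's feature values."""
    return sum(w * f for w, f in zip(weights, features))


def mean_reciprocal_rank(queries, weights):
    """MRR of the first relevant document under the fused score."""
    total = 0.0
    for docs in queries:  # docs: list of (features, is_relevant)
        ranked = sorted(docs, key=lambda d: fuse(d[0], weights), reverse=True)
        for rank, (_, relevant) in enumerate(ranked, start=1):
            if relevant:
                total += 1.0 / rank
                break
    return total / len(queries)


def coordinate_ascent(queries, n_features, grid=(0.0, 0.25, 0.5, 0.75, 1.0), sweeps=3):
    """Tune one weight at a time while keeping the others fixed."""
    weights = [1.0] * n_features
    best = mean_reciprocal_rank(queries, weights)
    for _ in range(sweeps):
        for i in range(n_features):
            for value in grid:
                trial = weights[:i] + [value] + weights[i + 1:]
                score = mean_reciprocal_rank(queries, trial)
                if score > best:  # accept strict improvements only
                    best, weights = score, trial
    return weights, best


# Each document: ([neural_score, lexical_score], is_relevant). Toy values.
queries = [
    [([0.9, 0.2], True), ([0.3, 0.9], False)],  # neural feature is right
    [([0.4, 0.8], True), ([0.6, 0.1], False)],  # lexical feature is right
]
weights, best = coordinate_ascent(queries, n_features=2)
```

In this toy data neither feature alone ranks both relevant documents first, but a weighted combination does, which is the kind of robustness the ensemble is meant to provide.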

3/ Domain Transfer Learning

Domain Transfer Learning leverages external knowledge graphs or weak supervision data to guide the ranker and help it overcome data scarcity.

* Knowledge Enhancement

Knowledge Enhancement incorporates entity semantics from external knowledge graphs to enhance the neural ranker.

* Data Augmentation

Data Augmentation leverages weak supervision data to improve ranking accuracy in domains that lack large-scale relevance labels.

| Stage | Model | Paper |
|---|---|---|
| 1/ Sparse Retrieval | BM25 | Best Match25 ~Tool |
| 1/ Dense Retrieval | ANN | Approximate nearest neighbor ~Tool |
| 2/ Neural Ranker | K-NRM | End-to-End Neural Ad-hoc Ranking with Kernel Pooling ~Paper |
| 2/ Neural Ranker | Conv-KNRM | Convolutional Neural Networks for Soft-Matching N-Grams in Ad-hoc Search ~Paper |
| 2/ Neural Ranker | TK | Interpretable & Time-Budget-Constrained Contextualization for Re-Ranking ~Paper |
| 2/ Neural Ranker | BERT | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding ~Paper |
| 2/ Feature Ensemble | Coordinate Ascent | Linear feature-based models for information retrieval. Information Retrieval ~Paper |
| 3/ Knowledge Enhancement | EDRM | Entity-Duet Neural Ranking: Understanding the Role of Knowledge Graph Semantics in Neural Information Retrieval ~Paper |
| 3/ Data Augmentation | ReInfoSelect | Selective Weak Supervision for Neural Information Retrieval ~Paper |

Note that the BERT model follows Hugging Face's transformers implementation, so other BERT-like models, e.g. ELECTRA and SciBERT, are also available in our toolkit.

Installation

* Using pip (installs directly from the GitHub repository)

pip install git+https://github.com/thunlp/OpenMatch.git

* From Source

git clone https://github.com/thunlp/OpenMatch.git
cd OpenMatch
python setup.py install

* From Docker

To build an OpenMatch Docker image from the Dockerfile:

docker build -t <image_name> .

To run the Docker image you just built as a container:

docker run --gpus all --name=<container_name> -it -v /:/all/ --rm <image_name>:<TAG>

Quick Start

* Detailed examples are available here.

import torch
import OpenMatch as om

query = "Classification treatment COVID-19"
doc = "By retrospectively tracking the dynamic changes of LYM% in death cases and cured cases, this study suggests that lymphocyte count is an effective and reliable indicator for disease classification and prognosis in COVID-19 patients."

* For BERT-like models:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
input_ids = tokenizer.encode(query, doc)
model = om.models.Bert("allenai/scibert_scivocab_uncased")
ranking_score, ranking_features = model(torch.tensor(input_ids).unsqueeze(0))

* For other models:

tokenizer = om.data.tokenizers.WordTokenizer(pretrained="./data/glove.6B.300d.txt")
query_ids, query_masks = tokenizer.process(query, max_len=16)
doc_ids, doc_masks = tokenizer.process(doc, max_len=128)
model = om.models.KNRM(vocab_size=tokenizer.get_vocab_size(),
                       embed_dim=tokenizer.get_embed_dim(),
                       embed_matrix=tokenizer.get_embed_matrix())
ranking_score, ranking_features = model(torch.tensor(query_ids).unsqueeze(0),
                                        torch.tensor(query_masks).unsqueeze(0),
                                        torch.tensor(doc_ids).unsqueeze(0),
                                        torch.tensor(doc_masks).unsqueeze(0))

* The GloVe embeddings can be downloaded using:

wget http://nlp.stanford.edu/data/glove.6B.zip -P ./data
unzip ./data/glove.6B.zip -d ./data

* Evaluation

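# qrels (relevance judgments) and ranking_list (the ranked results to score)
# must be prepared beforehand.
metric = om.Metric()
ndcg = metric.get_metric(qrels, ranking_list, 'ndcg_cut_20')
mrr = metric.get_mrr(qrels, ranking_list, 'mrr_cut_10')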
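
For readers unfamiliar with the two metric names, the sketch below shows what an NDCG cutoff and an MRR cutoff measure on a single ranked list of graded relevance labels. It is a standalone illustration, not the `om.Metric` implementation: the function names are hypothetical, and the ideal ordering is computed from the ranked list only, which is a simplification compared with tools that use all judged documents.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k graded labels."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_labels, k):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(ranked_labels, reverse=True), k)
    return dcg_at_k(ranked_labels, k) / ideal if ideal > 0 else 0.0


def mrr_at_k(ranked_labels, k):
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, label in enumerate(ranked_labels[:k], start=1):
        if label > 0:
            return 1.0 / rank
    return 0.0


# Relevance labels of the documents in ranked order (0 = not relevant).
labels = [0, 2, 1, 0]
ndcg = ndcg_at_k(labels, k=20)
mrr = mrr_at_k(labels, k=10)
```

In practice both values are averaged over all queries in the evaluation set.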

Experiments

* Ad-hoc Search

| Retriever | Reranker | Coor-Ascent | ClueWeb09 | Robust04 | ClueWeb12 |
|---|---|---|---|---|---|
| SDM | KNRM | - | 0.1880 | 0.3016 | 0.0968 |
| SDM | Conv-KNRM | - | 0.1894 | 0.2907 | 0.0896 |
| SDM | EDRM | - | 0.2015 | 0.2993 | 0.0937 |
| SDM | TK | - | 0.2306 | 0.2822 | 0.0966 |
| SDM | BERT Base | - | 0.2701 | 0.4168 | 0.1183 |
| SDM | ELECTRA Base | - | 0.2861 | 0.4668 | 0.1078 |

* MS MARCO Passage Ranking

| Retriever | Reranker | Coor-Ascent | dev | eval |
|---|---|---|---|---|
| BM25 | BERT Base | - | 0.349 | 0.345 |
| BM25 | ELECTRA Base | - | 0.352 | 0.344 |
| BM25 | RoBERTa Large | - | 0.386 | 0.375 |
| BM25 | ELECTRA Large | - | 0.388 | 0.376 |

* MS MARCO Document Ranking

| Retriever | Reranker | Coor-Ascent | dev | eval |
|---|---|---|---|---|
| ANCE FirstP | - | - | 0.373 | 0.334 |
| ANCE MaxP | - | - | 0.383 | 0.342 |
| ANCE FirstP+BM25 | BERT Base FirstP | + | 0.431 | 0.380 |
| ANCE MaxP | BERT Base MaxP | + | 0.432 | 0.391 |

* Classic Features

| Methods | ClueWeb09-B NDCG@20 | ClueWeb09-B ERR@20 | Robust04 NDCG@20 | Robust04 ERR@20 | TREC-COVID NDCG@20 | TREC-COVID P@20 |
|---|---|---|---|---|---|---|
| BM25 (Anserini) | 0.2773 | 0.1426 | 0.4129 | 0.1117 | 0.6979 | 0.7670 |
| RankSVM (Dai et al.) | 0.289 | n.a. | 0.420 | n.a. | n.a. | n.a. |
| RankSVM (OpenMatch) | 0.2825 | 0.1476 | 0.4309 | 0.1173 | 0.6995 | 0.7570 |
| Coor-Ascent (Dai et al.) | 0.295 | n.a. | 0.427 | n.a. | n.a. | n.a. |
| Coor-Ascent (OpenMatch) | 0.2969 | 0.1581 | 0.4340 | 0.1171 | 0.7041 | 0.7770 |

Contribution

Thanks to all the people who contributed to OpenMatch!

Kaitao Zhang, Si Sun, Zhenghao Liu, Aowei Lu

Project Organizers

  • Zhiyuan Liu
  • Chenyan Xiong
  • Maosong Sun

Citation

@inproceedings{openmatch,
  author = {Liu, Zhenghao and Zhang, Kaitao and Xiong, Chenyan and Liu, Zhiyuan and Sun, Maosong},
  title = {OpenMatch: An Open Source Library for Neu-IR Research},
  booktitle = {Proceedings of SIGIR},
  year = {2021},
  url = {https://doi.org/10.1145/3404835.3462789},
  pages = {2531–2535}
}
Owner
THUNLP
Natural Language Processing Lab at Tsinghua University