Establishing Strong Baselines for TripClick Health Retrieval; ECIR 2022

Overview

TripClick Baselines with Improved Training Data

Welcome 🙌 to the hub-repo of our paper:

Establishing Strong Baselines for TripClick Health Retrieval Sebastian Hofstätter, Sophia Althammer, Mete Sertkan and Allan Hanbury

https://arxiv.org/abs/2201.00365

tl;dr We create strong re-ranking and dense retrieval baselines (BERTCAT, BERTDOT, ColBERT, and TK) for TripClick (health ad-hoc retrieval). We improve the – originally too noisy – training data with a simple negative sampling policy. We achieve large gains over BM25 in the re-ranking and retrieval setting on TripClick, which were not achieved with the original baselines. We publish the improved training files for everyone to use.

If you have any questions, suggestions, or want to collaborate please don't hesitate to get in contact with us via Twitter or mail to [email protected]

Please cite our work as:

@misc{hofstaetter2022tripclick,
      title={Establishing Strong Baselines for TripClick Health Retrieval}, 
      author={Sebastian Hofst{\"a}tter and Sophia Althammer and Mete Sertkan and Allan Hanbury},
      year={2022},
      eprint={2201.00365},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}

Training Files

We publish the improved training files without the text content instead using the ids from TripClick (with permission from the TripClick owners); for the text content please get the full TripClick dataset from the TripClick Github page.

Our training files have the format query_id pos_passage_id neg_passage_id (with tab separation) and are available as a HuggingFace dataset: https://huggingface.co/datasets/sebastian-hofstaetter/tripclick-training

Source Code

The full source-code for our paper is here, as part of our matchmaker library: https://github.com/sebastian-hofstaetter/matchmaker

We provide getting started guides for training re-ranking and retrieval models, as well as a range of evaluation setups.

Pre-Trained Models

Unfortunately, the license of TripClick does not allow us to publish the trained models.

TripClick Baselines Results

For more information and commentary on the results, please see our ECIR paper.

BM25 Top200 Re-Ranking

Model BERT Instance HEAD TORSO TAIL
nDCG MRR nDCG MRR nDCG MRR
Original Baselines
BM25 -- .140 .276 .206 .283 .267 .258
ConvKNRM -- .198 .420 .243 .347 .271 .265
TK -- .208 .434 .272 .381 .295 .280
Our Improved Baselines
TK -- .232 .472 .300 .390 .345 .319
ColBERT SciBERT .270 .556 .326 .426 .374 .347
PubMedBERT-Abstract .278 .557 .340 .431 .387 .361
BERT_CAT DistilBERT .272 .556 .333 .427 .381 .355
BERT-Base .287 .579 .349 .453 .396 .366
SciBERT .294 .595 .360 .459 .408 .377
PubMedBERT-Full .298 .582 .365 .462 .412 .381
PubMedBERT-Abstract .296 .587 .359 .456 .409 .380
Ensemble (Last 3 BERT_CAT) .303 .601 .370 .472 .420 .392

Dense Retrieval Results

Model BERT Instance Head(DCTR)
[email protected] [email protected] [email protected] [email protected] [email protected] [email protected]
Original Baselines
BM25 -- 31% .140 .276 .499 .621 .834
Our Improved Baselines
BERT_DOT DistilBERT 39% .236 .512 .550 .648 .813
SciBERT 41% .243 .530 .562 .640 .793
PubMedBERT 40% .235 .509 .582 .673 .828
Owner
Sebastian Hofstätter
PhD student; working on machine learning and information retrieval
Sebastian Hofstätter
GarmentNets: Category-Level Pose Estimation for Garments via Canonical Space Shape Completion

GarmentNets This repository contains the source code for the paper GarmentNets: Category-Level Pose Estimation for Garments via Canonical Space Shape

Columbia Artificial Intelligence and Robotics Lab 43 Nov 21, 2022
A Novel Incremental Learning Driven Instance Segmentation Framework to Recognize Highly Cluttered Instances of the Contraband Items

A Novel Incremental Learning Driven Instance Segmentation Framework to Recognize Highly Cluttered Instances of the Contraband Items This repository co

Taimur Hassan 3 Mar 16, 2022
Official Implementation for the "An Empirical Investigation of 3D Anomaly Detection and Segmentation" paper.

An Empirical Investigation of 3D Anomaly Detection and Segmentation Project | Paper Official PyTorch Implementation for the "An Empirical Investigatio

Eliahu Horwitz 55 Dec 14, 2022
Code of the lileonardo team for the 2021 Emotion and Theme Recognition in Music task of MediaEval 2021

Emotion and Theme Recognition in Music The repository contains code for the submission of the lileonardo team to the 2021 Emotion and Theme Recognitio

Vincent Bour 8 Aug 02, 2022
Few-shot Neural Architecture Search

One-shot Neural Architecture Search uses a single supernet to approximate the performance each architecture. However, this performance estimation is super inaccurate because of co-adaption among oper

Yiyang Zhao 38 Oct 18, 2022
Mining-the-Social-Web-3rd-Edition - The official online compendium for Mining the Social Web, 3rd Edition (O'Reilly, 2018)

Mining the Social Web, 3rd Edition The official code repository for Mining the Social Web, 3rd Edition (O'Reilly, 2019). The book is available from Am

Mikhail Klassen 838 Jan 01, 2023
An implementation of the BADGE batch active learning algorithm.

Batch Active learning by Diverse Gradient Embeddings (BADGE) An implementation of the BADGE batch active learning algorithm. Details are provided in o

125 Dec 24, 2022
Scikit-learn compatible estimation of general graphical models

skggm : Gaussian graphical models using the scikit-learn API In the last decade, learning networks that encode conditional independence relationships

213 Jan 02, 2023
Diagnostic tests for linguistic capacities in language models

LM diagnostics This repository contains the diagnostic datasets and experimental code for What BERT is not: Lessons from a new suite of psycholinguist

61 Jan 02, 2023
PyTorch wrapper for Taichi data-oriented class

Stannum PyTorch wrapper for Taichi data-oriented class PRs are welcomed, please see TODOs. Usage from stannum import Tin import torch data_oriented =

86 Dec 23, 2022
Pytorch implementation of TailCalibX : Feature Generation for Long-tail Classification

TailCalibX : Feature Generation for Long-tail Classification by Rahul Vigneswaran, Marc T. Law, Vineeth N. Balasubramanian, Makarand Tapaswi [arXiv] [

Rahul Vigneswaran 34 Jan 02, 2023
A Rao-Blackwellized Particle Filter for 6D Object Pose Tracking

PoseRBPF: A Rao-Blackwellized Particle Filter for 6D Object Pose Tracking PoseRBPF Paper Self-supervision Paper Pose Estimation Video Robot Manipulati

NVIDIA Research Projects 107 Dec 25, 2022
Implementation of the paper: "SinGAN: Learning a Generative Model from a Single Natural Image"

SinGAN This is an unofficial implementation of SinGAN from someone who's been sitting right next to SinGAN's creator for almost five years. Please ref

35 Nov 10, 2022
😇A pyTorch implementation of the DeepMoji model: state-of-the-art deep learning model for analyzing sentiment, emotion, sarcasm etc

------ Update September 2018 ------ It's been a year since TorchMoji and DeepMoji were released. We're trying to understand how it's being used such t

Hugging Face 865 Dec 24, 2022
Neural network pruning for finding a sparse computational model for controlling a biological motor task.

MothPruning Scientific Overview Originally inspired by biological nervous systems, deep neural networks (DNNs) are powerful computational tools for mo

Olivia Thomas 0 Dec 14, 2022
This is the official pytorch implementation for our ICCV 2021 paper "TRAR: Routing the Attention Spans in Transformers for Visual Question Answering" on VQA Task

🌈 ERASOR (RA-L'21 with ICRA Option) Official page of "ERASOR: Egocentric Ratio of Pseudo Occupancy-based Dynamic Object Removal for Static 3D Point C

Hyungtae Lim 225 Dec 29, 2022
Open-Ended Commonsense Reasoning (NAACL 2021)

Open-Ended Commonsense Reasoning Quick links: [Paper] | [Video] | [Slides] | [Documentation] This is the repository of the paper, Differentiable Open-

(Bill) Yuchen Lin 31 Oct 19, 2022
Graph-based community clustering approach to extract protein domains from a predicted aligned error matrix

Using a predicted aligned error matrix corresponding to an AlphaFold2 model , returns a series of lists of residue indices, where each list corresponds to a set of residues clustering together into a

Tristan Croll 24 Nov 23, 2022
Code for CVPR2021 paper "Robust Reflection Removal with Reflection-free Flash-only Cues"

Robust Reflection Removal with Reflection-free Flash-only Cues (RFC) Paper | To be released: Project Page | Video | Data Tensorflow implementation for

Chenyang LEI 162 Jan 05, 2023
StyleGAN2-ada for practice

This version of the newest PyTorch-based StyleGAN2-ada is intended mostly for fellow artists, who rarely look at scientific metrics, but rather need a working creative tool. Tested on Python 3.7 + Py

vadim epstein 170 Nov 16, 2022