System-oriented IR evaluations are limited to rather abstract understandings of real user behavior

Last update: Nov 23, 2022

Related tags

Deep Learning ecir2022-uqv-sim

Overview

Validating Simulations of User Query Variants

This repository contains the scripts of the experiments and evaluations, simulated queries, as well as the figures of:

Timo Breuer, Norbert Fuhr, and Philipp Schaer. 2022. Validating Simulations of User Query Variants. In Proceedings of the 44th European Conference on IR Research, ECIR 2022.

System-oriented IR evaluations are limited to rather abstract understandings of real user behavior. As a solution, simulating user interactions provides a cost-efficient way to support system-oriented experiments with more realistic directives when no interaction logs are available. While there are several user models for simulated clicks or result list interactions, very few attempts have been made towards query simulations, and it has not been investigated if these can reproduce properties of real queries. In this work, we validate simulated user query variants with the help of TREC test collections in reference to real user queries that were made for the corresponding topics. Besides, we introduce a simple yet effective method that gives better reproductions of real queries than the established methods. Our evaluation framework validates the simulations regarding the retrieval performance, reproducibility of topic score distributions, shared task utility, effort and effect, and query term similarity when compared with real user query variants. While the retrieval effectiveness and statistical properties of the topic score distributions as well as economic aspects are close to that of real queries, it is still challenging to simulate exact term matches and later query reformulations.

Directory overview

| Directory | Description |
| --- | --- |
| `config/` | Contains configuration files for the query simulations, experiments, and evaluations. |
| `data/` | Contains (intermediate) output data of the simulations and experiments as well as the figures of the paper. |
| `eval/` | Contains scripts of the experiments and evaluations. |
| `sim/` | Contains scripts of the query simulations. |

Setup

Install Anserini and index Core17 (The New York Times Annotated Corpus) according to the regression guide:

anserini/target/appassembler/bin/IndexCollection \
    -collection NewYorkTimesCollection \
    -input /path/to/core17/ \
    -index anserini/indexes/lucene-index.core17 \
    -generator DefaultLuceneDocumentGenerator \
    -threads 4 \
    -storePositions \
    -storeDocvectors \
    -storeRaw \
    -storeContents \
    > anserini/logs/log.core17 &

Install the required Python packages:

pip install -r requirements.txt

Query simulation

To prepare the language models and simulate the queries, the scripts have to be executed in the order shown in the following table. All outputs can be found in the `data/` directory. For better code readability, the names of the query reformulation strategies have been mapped from the names used in the paper to consecutive names used in the code (paper → code): S1 → S1; S2 → S2; S2' → S3; S3 → S4; S3' → S5; S4 → S6; S4' → S7; S4'' → S8. The names of the scripts and output files comply with this name mapping.
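
The name mapping above can be written down as a small lookup table, which may help when reading output files such as `data/queries/s678.csv`. This is only an illustrative sketch: the names `PAPER_TO_CODE`, `CODE_TO_PAPER`, and `paper_name` are hypothetical and are not taken from the repository.

```python
# Strategy names as used in the paper, mapped to the names used in
# script and output file names (taken from the mapping listed above).
PAPER_TO_CODE = {
    "S1": "S1",
    "S2": "S2",
    "S2'": "S3",
    "S3": "S4",
    "S3'": "S5",
    "S4": "S6",
    "S4'": "S7",
    "S4''": "S8",
}

# Reverse lookup: code name -> paper name.
CODE_TO_PAPER = {code: paper for paper, code in PAPER_TO_CODE.items()}


def paper_name(code_name: str) -> str:
    """Return the paper's strategy name for a code name such as 's5'."""
    return CODE_TO_PAPER[code_name.upper()]
```

For example, `paper_name("s5")` resolves the code name S5 back to the strategy called S3' in the paper.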

| Script | Description | Output files |
| --- | --- | --- |
| `sim/make_background.py` | Make the background language model from all index terms of Core17. The background model is required for Controlled Query Generation (CQG) by Jordan et al. | `data/lm/background.csv` |
| `sim/make_cqg.py` | Make the CQG language models with different parameters of lambda from 0.0 to 1.0. | `data/lm/cqg.json` |
| `sim/simulate_queries_s12345.py` | Simulate TTS and KIS queries with strategies S1 to S3' | `data/queries/s12345.csv` |
| `sim/simulate_queries_s678.py` | Simulate TTS and KIS queries with strategies S4 to S4'' | `data/queries/s678.csv` |
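
To give a rough idea of how a background model and a lambda parameter can interact in a CQG-style language model, the sketch below linearly interpolates a topic term distribution with a background distribution for lambda values from 0.0 to 1.0. It is an assumption about the general shape of such a model, not the code in `sim/make_cqg.py`: the function names, the toy term counts, and the convention that lambda weights the topic model are all hypothetical.

```python
from collections import Counter


def to_distribution(counts):
    """Turn raw term counts into a probability distribution."""
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def interpolate(topic_lm, background_lm, lam):
    """Mix two term distributions; lam is the weight of the topic model (assumed convention)."""
    terms = set(topic_lm) | set(background_lm)
    return {
        t: lam * topic_lm.get(t, 0.0) + (1.0 - lam) * background_lm.get(t, 0.0)
        for t in terms
    }


# Toy counts for illustration only; real models would be built from the index.
background = to_distribution(Counter({"the": 50, "court": 5, "tax": 3, "reform": 2}))
topic = to_distribution(Counter({"tax": 6, "reform": 4}))

# One mixed model per lambda in 0.0, 0.1, ..., 1.0.
models = {step / 10: interpolate(topic, background, step / 10) for step in range(11)}
```

With this convention, `models[0.0]` equals the background model and `models[1.0]` puts all probability mass on the topic terms.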

Experimental evaluation and results

To reproduce the experiments of the study, the scripts have to be executed in the order shown in the following table.

| Script | Description | Output files | Reproduction of ... |
| --- | --- | --- | --- |
| `eval/arp.py`, `eval/arp_first.py`, `eval/arp_max.py` | Retrieval performance: Evaluate the Average Retrieval Performance (ARP). | `data/experimental_results/arp.csv`, `data/experimental_results/arp_first.csv`, `data/experimental_results/arp_max.csv` | `Tab. A.1` |
| `eval/rmse_s12345.py`, `eval/rmse_s678.py` | Retrieval performance: Evaluate the Root-Mean-Square-Error (RMSE). | `data/experimental_results/rmse_map.csv`, `data/experimental_results/rmse_ndcg.csv`, `data/experimental_results/rmse_p1000.csv`, `data/experimental_results/rmse_uqv_vs_s12345_kis_ndcg.csv`, `data/experimental_results/rmse_uqv_vs_s12345_tts_ndcg.csv`, `data/figures/rmse_map.pdf`, `data/figures/rmse_ndcg.pdf`, `data/figures/rmse_p1000.pdf`, `data/figures/rmse_uqv_vs_s12345_kis_ndcg.pdf`, `data/figures/rmse_uqv_vs_s12345_tts_ndcg.pdf` | `Fig. A.1`, `Fig. 1` |
| `eval/t-test.py` | Retrieval performance: Evaluate the p-values of paired t-tests. | `data/experimental_results/ttest.csv`, `data/figures/ttest.pdf` | `Fig. A.2` |
| `eval/system_orderings.py` | Shared task utility: Evaluate Kendall's tau between relative system orderings. | `data/experimental_results/system_orderings.csv`, `data/figures/system_orderings.pdf` | `Fig. 2 (left)` |
| `eval/sdcg.py` | Effort and effect: Evaluate the Session Discounted Cumulative Gain (sDCG). | `data/experimental_results/sdcg_3queries.csv`, `data/experimental_results/sdcg_5queries.csv`, `data/experimental_results/sdcg_10queries.csv`, `data/figures/sdcg_3queries.pdf`, `data/figures/sdcg_5queries.pdf`, `data/figures/sdcg_10queries.pdf` | `Fig. 3 (top)` |
| `eval/economic.py` | Effort and effect: Evaluate tradeoffs between number of queries and browsing depth by isoquants. | `data/experimental_results/economic0.3.csv`, `data/experimental_results/economic0.4.csv`, `data/experimental_results/economic0.5.csv`, `data/figures/economic0.3.pdf`, `data/figures/economic0.4.pdf`, `data/figures/economic0.5.pdf` | `Fig. 3 (bottom)` |
| `eval/jaccard_similarity.py` | Query term similarity: Evaluate query term similarities. | `data/experimental_results/jacc.csv`, `data/figures/jacc.pdf` | `Fig. 2 (right)` |
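
For readers unfamiliar with some of the measures named in the table, the following sketch shows how Jaccard similarity between query terms, RMSE between per-topic scores, and sDCG over a session of queries could be computed. It is a minimal illustration under stated assumptions, not the implementation in `eval/`: queries are tokenized by whitespace, and sDCG uses a commonly cited form with log base 2 for the rank discount and base 4 for the query discount. All function names and default parameters are hypothetical.

```python
import math


def jaccard(query_a, query_b):
    """Jaccard similarity between the term sets of two queries."""
    a, b = set(query_a.lower().split()), set(query_b.lower().split())
    if not a and not b:
        return 1.0  # two empty queries are treated as identical
    return len(a & b) / len(a | b)


def rmse(scores_a, scores_b):
    """Root-mean-square error between two equally long lists of topic scores."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    squared = sum((x - y) ** 2 for x, y in zip(scores_a, scores_b))
    return math.sqrt(squared / len(scores_a))


def dcg(gains, base=2):
    """DCG where ranks below the log base are not discounted."""
    return sum(
        gain / max(1.0, math.log(rank, base))
        for rank, gain in enumerate(gains, start=1)
    )


def sdcg(session_gains, base=2, query_base=4):
    """Session DCG: the DCG of the q-th query is divided by 1 + log_query_base(q)."""
    return sum(
        dcg(gains, base) / (1.0 + math.log(q, query_base))
        for q, gains in enumerate(session_gains, start=1)
    )
```

As a usage example, `jaccard("tax reform", "reform of tax law")` compares the term sets {tax, reform} and {reform, of, tax, law}, which share two of four distinct terms. In `sdcg`, later queries in a session contribute less, reflecting the additional effort of reformulating.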

Owner

IR Group at Technische Hochschule Köln
