This repo provides code for QB-Norm (Cross Modal Retrieval with Querybank Normalisation)

Last update: Dec 29, 2022

Overview

This repo provides code for QB-Norm (Cross Modal Retrieval with Querybank Normalisation)
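
As a rough illustration of querybank normalisation, the sketch below applies an inverted softmax over a querybank of similarities, and only to queries whose top-ranked gallery item is also a top-k match for some querybank query. It follows one reading of the cited paper and is not the code in dynamic_inverted_softmax.py. The function name, argument names, matrix orientation (rows are queries, columns are gallery items) and the values of beta and k are assumptions made for illustration.

```python
import numpy as np


def dynamic_inverted_softmax(sims_querybank, sims_test, beta=20.0, k=1):
    """Illustrative sketch (assumed layout: rows = queries, columns = gallery items).

    sims_querybank: (num_bank_queries, num_gallery) similarities.
    sims_test:      (num_test_queries, num_gallery) similarities.
    beta, k:        illustrative values, not taken from the repo.
    """
    # Gallery items that are a top-k match for at least one querybank query.
    topk = np.argsort(-sims_querybank, axis=1)[:, :k]
    activation_set = np.unique(topk)

    # Per-gallery-item normaliser: sum over the querybank of exp(beta * sim).
    normaliser = np.exp(beta * sims_querybank).sum(axis=0)
    normalised = np.exp(beta * sims_test) / normaliser

    # Only normalise test queries whose top-1 gallery item is in the activation set.
    top1 = sims_test.argmax(axis=1)
    gate = np.isin(top1, activation_set)
    return np.where(gate[:, None], normalised, sims_test)
```

With this gating, a gallery item that attracts many querybank queries (a hub) is down-weighted, while queries that do not retrieve such an item keep their original similarities.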

Usage example

python dynamic_inverted_softmax.py --sims_train_test_path msrvtt/tt-ce-train-captions-test-videos-seed0.pkl --sims_test_path msrvtt/tt-ce-test-captions-test-videos-seed0.pkl --test_query_masks_path msrvtt/tt-ce-test-query_masks.pkl

To test QB-Norm on your own data, you need to:

1. Extract the similarity matrix between the captions from the training split and the videos from the testing split (path/to/sims/train/test).
2. Extract the testing split similarity matrix, i.e. the similarities between testing captions and testing videos (path/to/sims/test).
3. Run QB-Norm:

python dynamic_inverted_softmax.py --sims_train_test_path path/to/sims/train/test --sims_test_path path/to/sims/test
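
The two input files can be produced in many ways. The sketch below is one minimal option, assuming your model yields caption and video embeddings in a shared space and that a pickled NumPy array with captions as rows and videos as columns is an acceptable layout. Both the helper names and that layout are assumptions, so compare against the provided .pkl files before relying on it.

```python
import pickle

import numpy as np


def cosine_similarity_matrix(text_emb, video_emb):
    """Return a (num_captions, num_videos) matrix of cosine similarities."""
    # L2-normalise each embedding so the dot product equals the cosine similarity.
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    video_emb = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    return text_emb @ video_emb.T


def save_sims(sims, path):
    """Pickle a similarity matrix to disk (hypothetical helper, assumed file layout)."""
    with open(path, "wb") as f:
        pickle.dump(sims, f)
```

In this sketch you would call cosine_similarity_matrix once with training captions and testing videos, once with testing captions and testing videos, and save each result to the path passed to the corresponding command-line argument.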

Data

The similarity matrices for each method were extracted using the official repositories: CE+, TT-CE+, CLIP2Video, CLIP4Clip, CLIP, MMT, Audio-Retrieval. For CLIP4Clip, we used the official repo to train new models from scratch, since pre-trained weights are not provided.

You can download the extracted similarity matrices for training and testing here: MSRVTT, MSVD, DiDeMo, LSMDC.

Text-Video retrieval results

QB-Norm Results on MSRVTT Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
CE+	Full	t2v	14.4 (0.1)	37.4 (0.1)	50.2 (0.1)	10.0 (0.0)	30.0 (0.1)
CE+ (+QB-Norm)	Full	t2v	16.4 (0.0)	40.3 (0.1)	52.9 (0.1)	9.0 (0.0)	32.7 (0.1)
TT-CE+	Full	t2v	14.9 (0.1)	38.3 (0.1)	51.5 (0.1)	10.0 (0.0)	30.9 (0.1)
TT-CE+ (+QB-Norm)	Full	t2v	17.3 (0.0)	42.1 (0.2)	54.9 (0.1)	8.0 (0.0)	34.2 (0.1)
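
For readers unfamiliar with the reported metrics, the sketch below computes recall at rank K (R@K), median rank (MdR) and Geom from a similarity matrix. It assumes one ground-truth gallery item per query and takes Geom to be the geometric mean of R@1, R@5 and R@10, which matches the rows above (for CE+, the cube root of 14.4 × 37.4 × 50.2 is about 30.0). The function and argument names are illustrative and not taken from the repo.

```python
import numpy as np


def retrieval_metrics(sims, gt_index):
    """Compute R@1, R@5, R@10, MdR and Geom.

    sims:     (num_queries, num_gallery) similarity matrix.
    gt_index: (num_queries,) index of the correct gallery item for each query.
    """
    # Sort gallery items by decreasing similarity, then find where the
    # ground-truth item lands (rank 1 is the best possible position).
    order = np.argsort(-sims, axis=1)
    ranks = np.argmax(order == gt_index[:, None], axis=1) + 1

    metrics = {f"R@{k}": 100.0 * float(np.mean(ranks <= k)) for k in (1, 5, 10)}
    metrics["MdR"] = float(np.median(ranks))
    # Geometric mean of the three recall values.
    metrics["Geom"] = float(np.cbrt(metrics["R@1"] * metrics["R@5"] * metrics["R@10"]))
    return metrics
```

Higher is better for R@K and Geom, while lower is better for MdR.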

QB-Norm Results on MSVD Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
TT-CE+	Full	t2v	25.4 (0.3)	56.9 (0.4)	71.3 (0.2)	4.0 (0.0)	46.9 (0.3)
TT-CE+ (+QB-Norm)	Full	t2v	26.6 (1.0)	58.6 (1.3)	71.8 (1.1)	4.0 (0.0)	48.2 (1.2)
CLIP2Video	Full	t2v	47.0	76.8	85.9	2.0	67.7
CLIP2Video (+QB-Norm)	Full	t2v	48.0	77.9	86.2	2.0	68.5

QB-Norm Results on DiDeMo Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
TT-CE+	Full	t2v	21.6 (0.7)	48.6 (0.4)	62.9 (0.6)	6.0 (0.0)	40.4 (0.4)
TT-CE+ (+QB-Norm)	Full	t2v	24.2 (0.7)	50.8 (0.7)	64.4 (0.1)	5.3 (0.5)	43.0 (0.2)
CLIP4Clip	Full	t2v	43.0	70.5	80.0	2.0	62.4
CLIP4Clip (+QB-Norm)	Full	t2v	43.5	71.4	80.9	2.0	63.1

QB-Norm Results on LSMDC Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
TT-CE+	Full	t2v	17.2 (0.4)	36.5 (0.6)	46.3 (0.3)	13.7 (0.5)	30.7 (0.3)
TT-CE+ (+QB-Norm)	Full	t2v	17.8 (0.4)	37.7 (0.5)	47.6 (0.6)	12.7 (0.5)	31.7 (0.3)
CLIP4Clip	Full	t2v	21.3	40.0	49.5	11.0	34.8
CLIP4Clip (+QB-Norm)	Full	t2v	22.4	40.1	49.5	11.0	35.4

QB-Norm Results on VaTeX Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
TT-CE+	Full	t2v	53.2 (0.2)	87.4 (0.1)	93.3 (0.0)	1.0 (0.0)	75.7 (0.1)
TT-CE+ (+QB-Norm)	Full	t2v	54.8 (0.1)	88.2 (0.1)	93.8 (0.1)	1.0 (0.0)	76.8 (0.0)
CLIP2Video	Full	t2v	57.4	87.9	93.6	1.0	77.9
CLIP2Video (+QB-Norm)	Full	t2v	58.8	88.3	93.8	1.0	78.7

QB-Norm Results on QuerYD Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
CE+	Full	t2v	13.2 (2.0)	37.1 (2.9)	50.5 (1.9)	10.3 (1.2)	29.1 (2.2)
CE+ (+QB-Norm)	Full	t2v	14.1 (1.8)	38.6 (1.3)	51.1 (1.6)	10.0 (0.8)	30.2 (1.7)
TT-CE+	Full	t2v	14.4 (0.5)	37.7 (1.7)	50.9 (1.6)	9.8 (1.0)	30.3 (0.9)
TT-CE+ (+QB-Norm)	Full	t2v	15.1 (1.6)	38.3 (2.4)	51.2 (2.8)	10.3 (1.7)	30.9 (2.3)

Text-Image retrieval results

QB-Norm Results on MSCoCo Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
CLIP	5k	t2i	30.3	56.1	67.1	4.0	48.5
CLIP (+QB-Norm)	5k	t2i	34.8	59.9	70.4	3.0	52.8
MMT-Oscar	5k	t2i	52.2	80.2	88.0	1.0	71.7
MMT-Oscar (+QB-Norm)	5k	t2i	53.9	80.5	88.1	1.0	72.6

Text-Audio retrieval results

QB-Norm Results on AudioCaps Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
AR-CE	Full	t2a	23.1 (0.6)	55.1 (0.7)	70.7 (0.6)	4.7 (0.5)	44.8 (0.7)
AR-CE (+QB-Norm)	Full	t2a	23.9 (0.2)	57.1 (0.3)	71.6 (0.4)	4.0 (0.0)	46.0 (0.3)

References

If you find this code useful or use the extracted similarity matrices, please consider citing:

@misc{bogolin2021cross,
      title={Cross Modal Retrieval with Querybank Normalisation}, 
      author={Simion-Vlad Bogolin and Ioana Croitoru and Hailin Jin and Yang Liu and Samuel Albanie},
      year={2021},
      eprint={2112.12777},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
