This is the repository for our paper Ditch the Gold Standard: Re-evaluating Conversational Question Answering

Overview

Ditch the Gold Standard: Re-evaluating Conversational Question Answering

This is the repository for our paper Ditch the Gold Standard: Re-evaluating Conversational Question Answering.

Overview

In this work, we conduct the first large-scale human evaluation of state-of-the-art conversational QA systems. In our evaluation, human annotators chat with conversational QA models about passages from the QuAC development set, and after that the annotators judge the correctness of model answers. We release the human annotated dataset in the following section.

We also identify a critical issue with the current automatic evaluation, which pre-collectes human-human conversations and uses ground-truth answers as conversational history (differences between different evaluations are shown in the following figure). By comparison, we find that the automatic evaluation does not always agree with the human evaluation. We propose a new evaluation protocol that is based on predicted history and question rewriting. Our experiments show that the new protocol better reflects real-world performance compared to the original automatic evaluation. We also provide the new evaluation protocol code in the following.

Different evaluation protocols

Human Evaluation Dataset

You can download the human annotation dataset from data/human_annotation_data.json. The json file contains one data field data, which is a list of conversations. Each conversation contains the following fields:

  • model_name: The model evaluated. One of bert4quac, graphflow, ham, excord.
  • context: The passage used in this conversation.
  • dialog_id: The ID from the original QuAC dataset.
  • qas: The conversation, which contains a list of QA pairs. Each QA pair has the following fields:
    • turn_id: The number of turn.
    • question: The question from the human annotator.
    • answer: The answer from the model.
    • valid: Whether the question is valid (annotated by our human annotator).
    • answerable: Whether the question is answerable (annotated by our human annotator).
    • correct: Whether the model's answer is correct (annotated by our human annotator).

Automatic model evaluation interface

We provide a convenient interface to test model performance on a few evaluation protocols compared in our paper, including Auto-Pred, Auto-Replace and our proposed evaluation protocol, Auto-Rewrite, which better demonstrates models' performance in human-model conversations. Please refer to our paper for more details. Following is a figure describing how Auto-Rewrite works.

Auto-rewrite

To use our evaluation interface on your own model, follow the steps:

  • Step 1: Download the QuAC dataset.

  • Step 2: Install allennlp, allennlp_models, ncr.replace_corefs through pip if you would like to use Auto-Rewrite.

  • Step 3: Download the CANARD dataset and set --canard_path if you would like to use Auto-Replace.

  • Step 4: Write a model interface following the template interface.py. Explanations to each function are provided through in-line comments. Make sure to import all your model dependencies at the top.

  • Step 5: Add the model to the evaluation script run_quac_eval.py. Changes that are need to be made are marked with #TODO.

  • Step 6: Run evaluation script. See run.sh for reference. Explanations of all arguments are provided in run_quac_eval.py. Make sure to turn on only one of --pred, --rewrite or --replace.

Citation

@article{li2021ditch,
   title={Ditch the Gold Standard: Re-evaluating Conversational Question Answering},
   author={Li, Huihan and Gao, Tianyu and Goenka, Manan and Chen, Danqi},
   journal={arXiv preprint arXiv:2112.08812},
   year={2021}
}
Owner
Princeton Natural Language Processing
Princeton Natural Language Processing
Python implementation of Project Fluent

Project Fluent This is a collection of Python packages to use the Fluent localization system. python-fluent consists of these packages: fluent.syntax

Project Fluent 155 Dec 28, 2022
Build a medical knowledge graph based on Unified Language Medical System (UMLS)

UMLS-Graph Build a medical knowledge graph based on Unified Language Medical System (UMLS) Requisite Install MySQL Server 5.6 and import UMLS data int

Donghua Chen 6 Dec 25, 2022
Post-Training Quantization for Vision transformers.

PTQ4ViT Post-Training Quantization Framework for Vision Transformers. We use the twin uniform quantization method to reduce the quantization error on

Zhihang Yuan 61 Dec 28, 2022
Using Hotel Data to predict High Value And Potential VIP Guests

Description Using hotel data and AI to predict high value guests and potential VIP guests. Hotel can leverage on prediction resutls to run more effect

HCG 12 Feb 14, 2022
Fully Adaptive Bayesian Algorithm for Data Analysis (FABADA) is a new approach of noise reduction methods. In this repository is shown the package developed for this new method based on \citepaper.

Fully Adaptive Bayesian Algorithm for Data Analysis FABADA FABADA is a novel non-parametric noise reduction technique which arise from the point of vi

18 Oct 20, 2022
RLHive: a framework designed to facilitate research in reinforcement learning.

RLHive is a framework designed to facilitate research in reinforcement learning. It provides the components necessary to run a full RL experiment, for both single agent and multi agent environments.

88 Jan 05, 2023
Code for the ACL2021 paper "Lexicon Enhanced Chinese Sequence Labelling Using BERT Adapter"

Lexicon Enhanced Chinese Sequence Labeling Using BERT Adapter Code and checkpoints for the ACL2021 paper "Lexicon Enhanced Chinese Sequence Labelling

274 Dec 06, 2022
This is the official implementation of the paper "Object Propagation via Inter-Frame Attentions for Temporally Stable Video Instance Segmentation".

[CVPRW 2021] - Object Propagation via Inter-Frame Attentions for Temporally Stable Video Instance Segmentation

Anirudh S Chakravarthy 6 May 03, 2022
PyTorch implementation of MoCo: Momentum Contrast for Unsupervised Visual Representation Learning

MoCo: Momentum Contrast for Unsupervised Visual Representation Learning This is a PyTorch implementation of the MoCo paper: @Article{he2019moco, aut

Meta Research 3.7k Jan 02, 2023
《Rethinking Sptil Dimensions of Vision Trnsformers》(2021)

Rethinking Spatial Dimensions of Vision Transformers Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, Seong Joon Oh | Paper NAVER

NAVER AI 224 Dec 27, 2022
Advancing mathematics by guiding human intuition with AI

Advancing mathematics by guiding human intuition with AI This repo contains two colab notebooks which accompany the paper, available online at https:/

DeepMind 315 Dec 26, 2022
Real-time face detection and emotion/gender classification using fer2013/imdb datasets with a keras CNN model and openCV.

Real-time face detection and emotion/gender classification using fer2013/imdb datasets with a keras CNN model and openCV.

Octavio Arriaga 5.3k Dec 30, 2022
Make Watson Assistant send messages to your Discord Server

Make Watson Assistant send messages to your Discord Server Prerequisites Sign up for an IBM Cloud account. Fill in the required information and press

1 Jan 10, 2022
PyTorch implementation for Stochastic Fine-grained Labeling of Multi-state Sign Glosses for Continuous Sign Language Recognition.

Stochastic CSLR This is the PyTorch implementation for the ECCV 2020 paper: Stochastic Fine-grained Labeling of Multi-state Sign Glosses for Continuou

Zhe Niu 28 Dec 19, 2022
Code release for General Greedy De-bias Learning

General Greedy De-bias for Dataset Biases This is an extention of "Greedy Gradient Ensemble for Robust Visual Question Answering" (ICCV 2021, Oral). T

4 Mar 15, 2022
Library for machine learning stacking generalization.

stacked_generalization Implemented machine learning *stacking technic[1]* as handy library in Python. Feature weighted linear stacking is also availab

114 Jul 19, 2022
Fluency ENhanced Sentence-bert Evaluation (FENSE), metric for audio caption evaluation. And Benchmark dataset AudioCaps-Eval, Clotho-Eval.

FENSE The metric, Fluency ENhanced Sentence-bert Evaluation (FENSE), for audio caption evaluation, proposed in the paper "Can Audio Captions Be Evalua

Zhiling Zhang 13 Dec 23, 2022
🤗 Paper Style Guide

🤗 Paper Style Guide (Work in progress, send a PR!) Libraries to Know booktabs natbib cleveref Either seaborn, plotly or altair for graphs algorithmic

Hugging Face 66 Dec 12, 2022
A Low Complexity Speech Enhancement Framework for Full-Band Audio (48kHz) based on Deep Filtering.

DeepFilterNet A Low Complexity Speech Enhancement Framework for Full-Band Audio (48kHz) based on Deep Filtering. libDF contains Rust code used for dat

Hendrik Schröter 292 Dec 25, 2022
Code for "Localization with Sampling-Argmax", NeurIPS 2021

Localization with Sampling-Argmax [Paper] [arXiv] [Project Page] Localization with Sampling-Argmax Jiefeng Li, Tong Chen, Ruiqi Shi, Yujing Lou, Yong-

JeffLi 71 Dec 17, 2022