Unified MultiWOZ evaluation scripts for the context-to-response task.

Last update: Dec 13, 2022

Overview

MultiWOZ Context-to-Response Evaluation

Standardized and easy to use Inform, Success, BLEU

~ See the paper ~

Easy-to-use scripts for standardized evaluation of response generation on the MultiWOZ benchmark. This repository contains an implementation of the MultiWOZ database with fuzzy matching, functions for normalization of slot names and values, and a careful implementation of the BLEU score and the Inform & Success rates.

🚀 Usage

Install the repository:

pip install git+https://github.com/Tomiinek/MultiWOZ_Evaluation.git

Use it directly from your code. Instantiate an evaluator and then call the evaluate method with a dictionary of your predictions in a specific format (described later). Set bleu to evaluate the BLEU score, success to get the Inform & Success rates, and richness to get lexical richness metrics such as the number of unique unigrams and trigrams, token entropy, bigram conditional entropy, corpus MSTTR-50, and average turn length. Pseudo-code:

from mwzeval.metrics import Evaluator
...

e = Evaluator(bleu=True, success=False, richness=False)
my_predictions = {}
for item in data:
    my_predictions[item.dialog_id] = model.predict(item)
    ...
    
results = e.evaluate(my_predictions)
print(f"Epoch {epoch} BLEU: {results['bleu']}")

Alternative usage:

git clone https://github.com/Tomiinek/MultiWOZ_Evaluation.git && cd MultiWOZ_Evaluation
pip install -r requirements.txt

And evaluate your predictions from the input file:

python evaluate.py [--bleu] [--success] [--richness] --input INPUT.json [--output OUTPUT.json]

Set the options --bleu, --success, and --richness as you wish.

Input format:

{
    "xxx0000" : [
        {
            "response": "Your generated delexicalized response.",
            "state": {
                "restaurant" : {
                    "food" : "eatable"
                }, ...
            }, 
            "active_domains": ["restaurant"]
        }, ...
    ], ...
}

The input to the evaluator should be a dictionary (or a .json file) with keys matching dialogue ids in the xxx0000 format (e.g. sng0073 instead of SNG0073.json), and values containing a list of turns. Each turn is a dictionary with keys:

response – Your generated delexicalized response. You can use either the slot names with domain names, e.g. restaurant_food, or the domain adaptive delexicalization scheme, e.g. food.
state – Optional, the predicted dialog state. If not present (for example in the case of policy optimization models), the ground truth dialog state from MultiWOZ 2.2 is used during the Inform & Success computation. Slot names and values are normalized before use.
active_domains – Optional, list of active domains for the corresponding turn. If not present, the active domains are estimated from changes in the dialog state during the Inform & Success rate computation. If your model predicts the domain for each turn, place them here. If you use domains in slot names, run the following command to extract the active domains from slot names automatically:
```
python add_slot_domains.py [-h] -i INPUT.json -o OUTPUT.json
```
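The idea of estimating active domains from changes in the dialog state can be illustrated with a rough sketch. The helper name `estimate_active_domains`, the fallback of carrying over the previous domains when nothing changes, and the example slot values are all assumptions made for illustration; this is not the repository's own routine.

```python
def estimate_active_domains(states):
    """Guess the active domains of each turn from dialog-state changes.

    `states` holds one state dictionary per turn, each mapping a domain
    name to a {slot: value} dictionary. Hypothetical helper, not the
    evaluator's actual implementation.
    """
    active, previous, last_active = [], {}, []
    for state in states:
        # A domain counts as changed if its slots differ from the previous turn.
        changed = [
            domain
            for domain, slots in state.items()
            if slots != previous.get(domain, {})
        ]
        # Assumption: when nothing changed, the previous domains stay active.
        if changed:
            last_active = changed
        active.append(list(last_active))
        previous = state
    return active


# Illustrative states only; the values are made up for this example.
turn_states = [
    {"restaurant": {"food": "italian"}},
    {"restaurant": {"food": "italian", "area": "centre"}},
    {"restaurant": {"food": "italian", "area": "centre"}},
    {"restaurant": {"food": "italian", "area": "centre"},
     "taxi": {"destination": "the restaurant"}},
]
estimated = estimate_active_domains(turn_states)
print(estimated)
```

If your model already predicts a domain per turn, passing it in active_domains avoids relying on any such heuristic.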

See the predictions folder with examples.
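As a minimal sketch of the input format described above, the snippet below assembles a predictions dictionary and runs a few sanity checks before serialization. The helpers `make_turn` and `check_predictions` are hypothetical names introduced here and are not part of the package; the checks cover only the rules stated in this README.

```python
import json


def make_turn(response, state=None, active_domains=None):
    """Build one turn dictionary; the optional keys are omitted when unset."""
    turn = {"response": response}
    if state is not None:
        turn["state"] = state
    if active_domains is not None:
        turn["active_domains"] = active_domains
    return turn


def check_predictions(predictions):
    """Return a list of format problems (empty if none were found)."""
    problems = []
    for dialog_id, turns in predictions.items():
        # Ids are expected in lower case and without the .json suffix.
        if dialog_id != dialog_id.lower() or dialog_id.endswith(".json"):
            problems.append(f"{dialog_id}: id should look like 'sng0073'")
        for index, turn in enumerate(turns):
            if "response" not in turn:
                problems.append(f"{dialog_id}[{index}]: missing 'response'")
    return problems


predictions = {
    "sng0073": [
        make_turn(
            "Your generated delexicalized response.",
            state={"restaurant": {"food": "eatable"}},
            active_domains=["restaurant"],
        ),
        # A turn without the optional keys is also valid.
        make_turn("Another delexicalized response."),
    ]
}

problems = check_predictions(predictions)
serialized = json.dumps(predictions, indent=4)
print(problems)
```

Writing `serialized` to a file such as INPUT.json gives something that can be passed to evaluate.py via --input, or the dictionary itself can be handed to the evaluate method.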

Output format:

{
    "bleu" : {'damd': … , 'uniconv': … , 'hdsa': … , 'lava': … , 'augpt': … , 'mwz22': … },
    "success" : {
        "inform"  : {'attraction': … , 'hotel': … , 'restaurant': … , 'taxi': … , 'total': … , 'train': … },
        "success" : {'attraction': … , 'hotel': … , 'restaurant': … , 'taxi': … , 'total': … , 'train': … },
    },
    "richness" : {
        'entropy': … , 'cond_entropy': … , 'avg_lengths': … , 'msttr': … , 
        'num_unigrams': … , 'num_bigrams': … , 'num_trigrams': … 
    }
}

The evaluation script outputs a dictionary with keys bleu, success, and richness corresponding to BLEU, Inform & Success rates, and lexical richness metrics, respectively. Their values can be None if not evaluated, otherwise:

BLEU results contain multiple scores corresponding to different delexicalization styles and references. Currently included references are DAMD, HDSA, AuGPT, LAVA, UniConv, and MultiWOZ 2.2, which we consider to be the canonical one that should be reported in the future.
Inform & Success rates are reported for each domain (i.e. attraction, restaurant, hotel, taxi, and train in the case of the test set) separately and in total.
Lexical richness contains the number of distinct uni-, bi-, and tri-grams, average number of tokens in generated responses, token entropy, conditional bigram entropy, and MSTTR-50 calculated on concatenated responses.
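Because any of the three top-level values can be None, code that consumes the output needs to guard each lookup. The following is a small sketch of that, using only the keys shown in the output format above; `summarize` is a hypothetical helper, and the numbers in `dummy` are placeholders, not real evaluation results.

```python
def summarize(results):
    """Turn the evaluator's output dictionary into printable lines."""
    lines = []
    bleu = results.get("bleu")
    if bleu is not None:
        # The MultiWOZ 2.2 reference is the one recommended for reporting.
        lines.append(f"BLEU (mwz22 reference): {bleu['mwz22']}")
    success = results.get("success")
    if success is not None:
        lines.append(f"Inform (total): {success['inform']['total']}")
        lines.append(f"Success (total): {success['success']['total']}")
    richness = results.get("richness")
    if richness is not None:
        lines.append(f"MSTTR-50: {richness['msttr']}")
    return lines


# Placeholder numbers only, NOT real evaluation results.
dummy = {
    "bleu": {"mwz22": 0.0},
    "success": None,
    "richness": None,
}
summary = summarize(dummy)
print(summary)
```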

Secret feature

You can use this code even for evaluation of dialogue state tracking (DST) on MultiWOZ 2.2. Set dst=True during initialization of the Evaluator to get joint state accuracy, slot precision, recall, and F1. Note that the resulting numbers are very different from the DST results in the original MultiWOZ evaluation. This is because we use slot name and value normalization, and careful fuzzy slot value matching.
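To make the four DST metrics concrete, here is a simplified sketch that computes them with exact matching over (domain, slot, value) triples. It deliberately omits the slot normalization and fuzzy value matching that this repository applies, so its numbers would differ from the evaluator's; the function names, the result keys, and the toy states are assumptions for illustration.

```python
def flatten(state):
    """Convert {domain: {slot: value}} into a set of (domain, slot, value)."""
    return {
        (domain, slot, value)
        for domain, slots in state.items()
        for slot, value in slots.items()
    }


def dst_metrics(predicted_states, gold_states):
    """Exact-match joint accuracy and slot precision/recall/F1 over turns."""
    joint_hits = true_pos = n_pred = n_gold = 0
    for predicted, gold in zip(predicted_states, gold_states):
        p, g = flatten(predicted), flatten(gold)
        joint_hits += p == g          # the whole turn state must match
        true_pos += len(p & g)        # correctly predicted slot values
        n_pred += len(p)
        n_gold += len(g)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_pos else 0.0
    return {
        "joint_accuracy": joint_hits / len(gold_states) if gold_states else 0.0,
        "slot_precision": precision,
        "slot_recall": recall,
        "slot_f1": f1,
    }


# Toy states for illustration only.
gold = [
    {"hotel": {"area": "north"}},
    {"hotel": {"area": "north", "stars": "4"}},
]
pred = [
    {"hotel": {"area": "north"}},
    {"hotel": {"area": "north", "stars": "3"}},
]
metrics = dst_metrics(pred, gold)
print(metrics)
```

In this toy case the second turn has one wrong slot value, so the joint accuracy drops to one half while two of the three predicted slot values still count as correct.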

🏆 Results

Please see the original MultiWOZ repository for the benchmark results.

👏 Contributing

If you would like to add your results, modify the particular table in the original repository via a pull request, add the file with predictions into the predictions folder in this repository, and create another pull request here.
If you need to update the slot name mapping because of your different delexicalization style, feel free to make the changes, and create a pull request.
If you would like to improve normalization of slot values, add your new rules, and create a pull request.

💭 Citation

@inproceedings{nekvinda-dusek-2021-shades,
    title = "Shades of {BLEU}, Flavours of Success: The Case of {M}ulti{WOZ}",
    author = "Nekvinda, Tom{\'a}{\v{s}} and Du{\v{s}}ek, Ond{\v{r}}ej",
    booktitle = "Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.gem-1.4",
    doi = "10.18653/v1/2021.gem-1.4",
    pages = "34--46"
}

Owner

Tomáš Nekvinda
