Unified MultiWOZ evaluation scripts for the context-to-response task.

Overview

MultiWOZ Context-to-Response Evaluation

Standardized and easy to use Inform, Success, BLEU

~ See the paper ~

 


Easy-to-use scripts for standardized evaluation of response generation on the MultiWOZ benchmark. This repository contains an implementation of the MultiWOZ database with fuzzy matching, functions for normalization of slot names and values, and a careful implementation of the BLEU score and Inform & Succes rates.

🚀 Usage

Install the repository:

pip install git+https://github.com/Tomiinek/[email protected]

Use it directly from your code. Instantiate an evaluator and then call the evaluate method with dictionary of your predictions with a specific format (described later). Set bleu to evaluate the BLEU score, success to get the Success & Inform rate, and use richness for getting lexical richness metrics such as the number of unique unigrams, trigrams, token entropy, bigram conditional entropy, corpus MSTTR-50, and average turn length. Pseudo-code:

from mwzeval.metrics import Evaluator
...

e = Evaluator(bleu=True, success=False, richness=False)
my_predictions = {}
for item in data:
    my_predictions[item.dialog_id] = model.predict(item)
    ...
    
results = e.evaluate(my_predictions)
print(f"Epoch {epoch} BLEU: {results}")

Alternative usage:

git clone https://github.com/Tomiinek/MultiWOZ_Evaluation.git && cd MultiWOZ_Evaluation
pip install -r requirements.txt

And evaluate you predictions from the input file:

python evaluate.py [--bleu] [--success] [--richness] --input INPUT.json [--output OUTPUT.json]

Set the options --bleu, --success, and --richness as you wish.

Input format:

{
    "xxx0000" : [
        {
            "response": "Your generated delexicalized response.",
            "state": {
                "restaurant" : {
                    "food" : "eatable"
                }, ...
            }, 
            "active_domains": ["restaurant"]
        }, ...
    ], ...
}

The input to the evaluator should be a dictionary (or a .json file) with keys matching dialogue ids in the xxx0000 format (e.g. sng0073 instead of SNG0073.json), and values containing a list of turns. Each turn is a dictionary with keys:

  • response – Your generated delexicalized response. You can use either the slot names with domain names, e.g. restaurant_food, or the domain adaptive delexicalization scheme, e.g. food.

  • stateOptional, the predicted dialog state. If not present (for example in the case of policy optimization models), the ground truth dialog state from MultiWOZ 2.2 is used during the Inform & Success computation. Slot names and values are normalized prior the usage.

  • active_domainsOptional, list of active domains for the corresponding turn. If not present, the active domains are estimated from changes in the dialog state during the Inform & Success rate computation. If your model predicts the domain for each turn, place them here. If you use domains in slot names, run the following command to extract the active domains from slot names automatically:

    python add_slot_domains.py [-h] -i INPUT.json -o OUTPUT.json

See the predictions folder with examples.

Output format:

{
    "bleu" : {'damd': … , 'uniconv': … , 'hdsa': … , 'lava': … , 'augpt': … , 'mwz22': … },
    "success" : {
        "inform"  : {'attraction': … , 'hotel': … , 'restaurant': … , 'taxi': … , 'total': … , 'train': … },
        "success" : {'attraction': … , 'hotel': … , 'restaurant': … , 'taxi': … , 'total': … , 'train': … },
    },
    "richness" : {
        'entropy': … , 'cond_entropy': … , 'avg_lengths': … , 'msttr': … , 
        'num_unigrams': … , 'num_bigrams': … , 'num_trigrams': … 
    }
}

The evaluation script outputs a dictionary with keys bleu, success, and richness corresponding to BLEU, Inform & Success rates, and lexical richness metrics, respectively. Their values can be None if not evaluated, otherwise:

  • BLEU results contain multiple scores corresponding to different delexicalization styles and refernces. Currently included references are DAMD, HDSA, AuGPT, LAVA, UniConv, and MultiWOZ 2.2 whitch we consider to be the canonical one that should be reported in the future.
  • Inform & Succes rates are reported for each domain (i.e. attraction, restaurant, hotel, taxi, and train in case of the test set) separately and in total.
  • Lexical richness contains the number of distinct uni-, bi-, and tri-grams, average number of tokens in generated responses, token entropy, conditional bigram entropy, and MSTTR-50 calculated on concatenated responses.

Secret feature

You can use this code even for evaluation of dialogue state tracking (DST) on MultiWOZ 2.2. Set dst=True during initialization of the Evaluator to get joint state accuracy, slot precision, recall, and F1. Note that the resulting numbers are very different from the DST results in the original MultiWOZ evaluation. This is because we use slot name and value normalization, and careful fuzzy slot value matching.

🏆 Results

Please see the orginal MultiWOZ repository for the benchmark results.

👏 Contributing

  • If you would like to add your results, modify the particular table in the original reposiotry via a pull request, add the file with predictions into the predictions folder in this repository, and create another pull request here.
  • If you need to update the slot name mapping because of your different delexicalization style, feel free to make the changes, and create a pull request.
  • If you would like to improve normalization of slot values, add your new rules, and create a pull request.

💭 Citation

@inproceedings{nekvinda-dusek-2021-shades,
    title = "Shades of {BLEU}, Flavours of Success: The Case of {M}ulti{WOZ}",
    author = "Nekvinda, Tom{\'a}{\v{s}} and Du{\v{s}}ek, Ond{\v{r}}ej",
    booktitle = "Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.gem-1.4",
    doi = "10.18653/v1/2021.gem-1.4",
    pages = "34--46"
}

Owner
Tomáš Nekvinda
Wisdom giver, bacon & eggs master, ant lover
Tomáš Nekvinda
Simple machine learning library / 簡單易用的機器學習套件

FukuML Simple machine learning library / 簡單易用的機器學習套件 Installation $ pip install FukuML Tutorial Lesson 1: Perceptron Binary Classification Learning Al

Fukuball Lin 279 Sep 15, 2022
Semiconductor Machine learning project

Wafer Fault Detection Problem Statement: Wafer (In electronics), also called a slice or substrate, is a thin slice of semiconductor, such as a crystal

kunal suryawanshi 1 Jan 15, 2022
Create animations for the optimization trajectory of neural nets

Animating the Optimization Trajectory of Neural Nets loss-landscape-anim lets you create animated optimization path in a 2D slice of the loss landscap

Logan Yang 81 Dec 25, 2022
GAN-based 3D human pose estimation model for 3DV'17 paper

Tensorflow implementation for 3DV 2017 conference paper "Adversarially Parameterized Optimization for 3D Human Pose Estimation". @inproceedings{jack20

Dominic Jack 15 Feb 27, 2021
Constrained Logistic Regression - How to apply specific constraints to logistic regression's coefficients

Constrained Logistic Regression Sample implementation of constructing a logistic regression with given ranges on each of the feature's coefficients (v

1 Dec 29, 2021
💡 Learnergy is a Python library for energy-based machine learning models.

Learnergy: Energy-based Machine Learners Welcome to Learnergy. Did you ever reach a bottleneck in your computational experiments? Are you tired of imp

Gustavo Rosa 57 Nov 17, 2022
All of the figures and notebooks for my deep learning book, for free!

"Deep Learning - A Visual Approach" by Andrew Glassner This is the official repo for my book from No Starch Press. Ordering the book My book is called

Andrew Glassner 227 Jan 04, 2023
PyTorch Implementation of Sparse DETR

Sparse DETR By Byungseok Roh*, Jaewoong Shin*, Wuhyun Shin*, and Saehoon Kim at Kakao Brain. (*: Equal contribution) This repository is an official im

Kakao Brain 113 Dec 28, 2022
RTSeg: Real-time Semantic Segmentation Comparative Study

Real-time Semantic Segmentation Comparative Study The repository contains the official TensorFlow code used in our papers: RTSEG: REAL-TIME SEMANTIC S

Mennatullah Siam 592 Nov 18, 2022
This repository contains code from the paper "TTS-GAN: A Transformer-based Time-Series Generative Adversarial Network"

TTS-GAN: A Transformer-based Time-Series Generative Adversarial Network This repository contains code from the paper "TTS-GAN: A Transformer-based Tim

Intelligent Multimodal Computing and Sensing Laboratory (IMICS Lab) - Texas State University 108 Dec 29, 2022
Bio-OFC gym implementation and Gym-Fly environment

Bio-OFC gym implementation and Gym-Fly environment This repository includes the gym compatible implementation of the Bio-OFC algorithm from the paper

Siavash Golkar 1 Nov 16, 2021
A generalist algorithm for cell and nucleus segmentation.

Cellpose | A generalist algorithm for cell and nucleus segmentation. Cellpose was written by Carsen Stringer and Marius Pachitariu. To learn about Cel

MouseLand 733 Dec 29, 2022
Beancount-mercury - Beancount importer for Mercury Startup Checking

beancount-mercury beancount-mercury provides an Importer for converting CSV expo

Michael Lynch 4 Oct 31, 2022
Knowledgeable Prompt-tuning: Incorporating Knowledge into Prompt Verbalizer for Text Classification

Knowledgeable Prompt-tuning: Incorporating Knowledge into Prompt Verbalizer for Text Classification

DingDing 143 Jan 01, 2023
Predicting 10 different clothing types using Xception pre-trained model.

Predicting-Clothing-Types Predicting 10 different clothing types using Xception pre-trained model from Keras library. It is reimplemented version from

AbdAssalam Ahmad 3 Dec 29, 2021
ICLR21 Tent: Fully Test-Time Adaptation by Entropy Minimization

⛺️ Tent: Fully Test-Time Adaptation by Entropy Minimization This is the official project repository for Tent: Fully-Test Time Adaptation by Entropy Mi

Dequan Wang 204 Dec 25, 2022
Official implementation of "A Shared Representation for Photorealistic Driving Simulators" in PyTorch.

A Shared Representation for Photorealistic Driving Simulators The official code for the paper: "A Shared Representation for Photorealistic Driving Sim

VITA lab at EPFL 7 Oct 13, 2022
A PyTorch Implementation of PGL-SUM from "Combining Global and Local Attention with Positional Encoding for Video Summarization", Proc. IEEE ISM 2021

PGL-SUM: Combining Global and Local Attention with Positional Encoding for Video Summarization PyTorch Implementation of PGL-SUM From "PGL-SUM: Combin

Evlampios Apostolidis 35 Dec 22, 2022
A program that uses computer vision to detect hand gestures, used for controlling movie players.

HandGestureDetection This program uses a Haar Cascade algorithm to detect the presence of your hand, and then passes it on to a self-created and self-

2 Nov 22, 2022
Thermal Control of Laser Powder Bed Fusion using Deep Reinforcement Learning

This repository is the implementation of the paper "Thermal Control of Laser Powder Bed Fusion Using Deep Reinforcement Learning", linked here. The project makes use of the Deep Reinforcement Library

BaratiLab 11 Dec 27, 2022