PIZZA - a task-oriented semantic parsing dataset

Overview

PIZZA - a task-oriented semantic parsing dataset

The PIZZA dataset continues the exploration of task-oriented parsing by introducing a new dataset for parsing pizza and drink orders, whose semantics cannot be captured by flat slots and intents.

The dataset comes in two main versions, one in a recently introduced utterance-level hierarchical notation that we call TOP, and one whose targets are executable representations (EXR).

Below are two examples of orders that can be found in the data:

{
    "dev.SRC": "five medium pizzas with tomatoes and ham",
    "dev.EXR": "(ORDER (PIZZAORDER (NUMBER 5 ) (SIZE MEDIUM ) (TOPPING HAM ) (TOPPING TOMATOES ) ) )",
    "dev.TOP": "(ORDER (PIZZAORDER (NUMBER five ) (SIZE medium ) pizzas with (TOPPING tomatoes ) and (TOPPING ham ) ) )"
}
{
    "dev.SRC": "i want to order two medium pizzas with sausage and black olives and two medium pizzas with pepperoni and extra cheese and three large pizzas with pepperoni and sausage",
    "dev.EXR": "(ORDER (PIZZAORDER (NUMBER 2 ) (SIZE MEDIUM ) (COMPLEX_TOPPING (QUANTITY EXTRA ) (TOPPING CHEESE ) ) (TOPPING PEPPERONI ) ) (PIZZAORDER (NUMBER 2 ) (SIZE MEDIUM ) (TOPPING OLIVES ) (TOPPING SAUSAGE ) ) (PIZZAORDER (NUMBER 3 ) (SIZE LARGE ) (TOPPING PEPPERONI ) (TOPPING SAUSAGE ) ) )",
    "dev.TOP": "(ORDER i want to order (PIZZAORDER (NUMBER two ) (SIZE medium ) pizzas with (TOPPING sausage ) and (TOPPING black olives ) ) and (PIZZAORDER (NUMBER two ) (SIZE medium ) pizzas with (TOPPING pepperoni ) and (COMPLEX_TOPPING (QUANTITY extra ) (TOPPING cheese ) ) ) and (PIZZAORDER (NUMBER three ) (SIZE large ) pizzas with (TOPPING pepperoni ) and (TOPPING sausage ) ) )"
}

While more details on the dataset conventions and construction can be found in the paper, we give a high level idea of the main rules our target semantics follow:

  • Each order can include any number of pizza and/or drink suborders. These suborders are labeled with the constructors PIZZAORDER and DRINKORDER, respectively.
  • Each top-level order is always labeled with the root constructor ORDER.
  • Both pizza and drink orders can have NUMBER and SIZE attributes.
  • A pizza order can have any number of TOPPING attributes, each of which can be negated. Negative particles can have larger scope with the use of the or particle, e.g., no peppers or onions will negate both peppers and onions.
  • Toppings can be modified by quantifiers such as a lot or extra, a little, etc.
  • A pizza order can have a STYLE attribute (e.g., thin crust style or chicago style).
  • Styles can be negated.
  • Each drink order must have a DRINKTYPE (e.g. coke), and can also have a CONTAINERTYPE (e.g. bottle) and/or a volume modifier (e.g., three 20 fl ounce coke cans).

We view ORDER, PIZZAORDER, and DRINKORDER as intents, and the rest of the semantic constructors as composite slots, with the exception of the leaf constructors, which are viewed as entities (resolved slot values).

Dataset statistics

In the below table we give high level statistics of the data.

Train Dev Test
Number of utterances 2,456,446 348 1,357
Unique entities 367 109 180
Avg entities per utterance 5.32 5.37 5.42
Avg intents per utterance 1.76 1.25 1.28

More details can be found in appendix of our publication.

Getting Started

The repo structure is as follows:

PIZZA
|
|_____ data
|      |_____ PIZZA_train.json.zip             # a zipped version of the training data
|      |_____ PIZZA_train.10_percent.json.zip  # a random subset representing 10% of training data
|      |_____ PIZZA_dev.json                   # the dev portion of the data
|      |_____ PIZZA_test.json                  # the test portion of the data
|
|_____ utils
|      |_____ __init__.py
|      |_____ entity_resolution.py # entity resolution script
|      |_____ express_utils.py     # utilities
|      |_____ semantic_matchers.py # metric functions
|      |_____ sexp_reader.py       # tree reader helper functions 
|      |_____ trees.py             # tree classes and readers
|      |
|      |_____ catalogs             # directory with catalog files of entities
|      
|_____ doc 
|      |_____ PIZZA_dataset_reader_metrics_examples.ipynb
|
|_____ READMED.md

The dev and test data files are json lines files where each line represents one utterance and contains 4 keys:

  • *.SRC: the natural language order input
  • *.EXR: the ground truth target semantic representation in EXR format
  • *.TOP: the ground truth target semantic representation in TOP format
  • *.PCFG_ERR: a boolean flag indicating whether our PCFG system parsed the utterance correctly. See publication for more details

The training data file comes in a similar format, with two differences:

  • there is no train.PCFG_ERR flag since the training data is all synthetically generated hence parsable with perfect accuracy. In other words, this flag would be True for all utterances in that file.
  • there is an extra train.TOP-DECOUPLED key that is the ground truth target semantic representation in TOP-Decoupled format. See publication for more details.

Security

See CONTRIBUTING for more information.

License

This library is licensed under the CC-BY-NC-4.0 License.

Generate text line images for training deep learning OCR model (e.g. CRNN)

Generate text line images for training deep learning OCR model (e.g. CRNN)

532 Jan 06, 2023
InferSent sentence embeddings

InferSent InferSent is a sentence embeddings method that provides semantic representations for English sentences. It is trained on natural language in

Facebook Research 2.2k Dec 27, 2022
Flaxformer: transformer architectures in JAX/Flax

Flaxformer: transformer architectures in JAX/Flax Flaxformer is a transformer library for primarily NLP and multimodal research at Google. It is used

Google 114 Dec 29, 2022
Th2En & Th2Zh: The large-scale datasets for Thai text cross-lingual summarization

Th2En & Th2Zh: The large-scale datasets for Thai text cross-lingual summarization 📥 Download Datasets 📥 Download Trained Models INTRODUCTION TH2ZH (

Nakhun Chumpolsathien 5 Jan 03, 2022
StarGAN - Official PyTorch Implementation

StarGAN - Official PyTorch Implementation ***** New: StarGAN v2 is available at https://github.com/clovaai/stargan-v2 ***** This repository provides t

Yunjey Choi 5.1k Dec 30, 2022
Natural Language Processing with transformers

we want to create a repo to illustrate usage of transformers in chinese

Datawhale 763 Dec 27, 2022
Extract city and country mentions from Text like GeoText without regex, but FlashText, a Aho-Corasick implementation.

flashgeotext ⚡ 🌍 Extract and count countries and cities (+their synonyms) from text, like GeoText on steroids using FlashText, a Aho-Corasick impleme

Ben 57 Dec 16, 2022
Large-scale Knowledge Graph Construction with Prompting

Large-scale Knowledge Graph Construction with Prompting across tasks (predictive and generative), and modalities (language, image, vision + language, etc.)

ZJUNLP 161 Dec 28, 2022
Turn clang-tidy warnings and fixes to comments in your pull request

clang-tidy pull request comments A GitHub Action to post clang-tidy warnings and suggestions as review comments on your pull request. What platisd/cla

Dimitris Platis 30 Dec 13, 2022
Задания КЕГЭ по информатике 2021 на Python

КЕГЭ 2021 на Python В этом репозитории мои решения типовых заданий КЕГЭ по информатике в 2021 году, БЕСПЛАТНО! Задания Взяты с https://inf-ege.sdamgia

8 Oct 13, 2022
Multilingual Emotion classification using BERT (fine-tuning). Published at the WASSA workshop (ACL2022).

XLM-EMO: Multilingual Emotion Prediction in Social Media Text Abstract Detecting emotion in text allows social and computational scientists to study h

MilaNLP 35 Sep 17, 2022
This repository is home to the Optimus data transformation plugins for various data processing needs.

Transformers Optimus's transformation plugins are implementations of Task and Hook interfaces that allows execution of arbitrary jobs in optimus. To i

Open Data Platform 37 Dec 14, 2022
Code for the paper "Are Sixteen Heads Really Better than One?"

Are Sixteen Heads Really Better than One? This repository contains code to reproduce the experiments in our paper Are Sixteen Heads Really Better than

Paul Michel 143 Dec 14, 2022
Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers

beyond masking Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers The code is coming Figure 1: Pipeline of token-based pre-

Yunjie Tian 23 Sep 27, 2022
Generate product descriptions, blogs, ads and more using GPT architecture with a single request to TextCortex API a.k.a Hemingwai

TextCortex - HemingwAI Generate product descriptions, blogs, ads and more using GPT architecture with a single request to TextCortex API a.k.a Hemingw

TextCortex AI 27 Nov 28, 2022
A Python module made to simplify the usage of Text To Speech and Speech Recognition.

Nav Module The solution for voice related stuff in Python Nav is a Python module which simplifies voice related stuff in Python. Just import the Modul

Snm Logic 1 Dec 20, 2021
Korean extractive summarization. 2021 AI 텍스트 요약 온라인 해커톤 화성갈끄니까팀 코드

korean extractive summarization 2021 AI 텍스트 요약 온라인 해커톤 화성갈끄니까팀 코드 Leaderboard Notice Text Summarization with Pretrained Encoders에 나오는 bertsumext모델(ext

3 Aug 10, 2022
Code for PED: DETR For (Crowd) Pedestrian Detection

Code for PED: DETR For (Crowd) Pedestrian Detection

36 Sep 13, 2022
A natural language modeling framework based on PyTorch

Overview PyText is a deep-learning based NLP modeling framework built on PyTorch. PyText addresses the often-conflicting requirements of enabling rapi

Facebook Research 6.4k Dec 27, 2022
An assignment on creating a minimalist neural network toolkit for CS11-747

minnn by Graham Neubig, Zhisong Zhang, and Divyansh Kaushik This is an exercise in developing a minimalist neural network toolkit for NLP, part of Car

Graham Neubig 63 Dec 29, 2022