GooAQ 🥑 : Google Answers to Google Questions!

This repository contains the code/data accompanying our recent work on long-form question answering.

NOTE: This dataset should not be used for any commercial purposes. See the license for the detailed terms.

Data

To get the data, see the data/ directory. Note that the data is stored via git-lfs. If you're cloning the project (git clone git@github.com:allenai/gooaq.git), make sure to also run git lfs pull.

Each row of the data file should look like this:

{
  "id": 3339543,
  "question": "what is the difference between collagen and whey protein?",
  "short_answer": null,
  "answer": "The main differences between the amino acid profiles of whey and collagen are that whey contains all 9 essential amino acids, while collagen only has 8. ... Collagen is a fibrous protein found in the skin, cartilage, and bones of animals whereas whey comes from milk.",
  "answer_type": "feat_snip"
}

where the questions (question) are collected via Google auto-complete.
The answers (short_answer and answer) were collected from Google's answer boxes. The answer types (answer_type) are inferred from the HTML content of Google's response. Here are the dominant types in the current dataset:

feat_snip: explanatory responses; the majority of the questions/responses are of this type.
collection: list responses (e.g., steps to accomplish something).
knowledge: typically short responses for knowledge-seeking questions.
unit_conv: questions about converting units.
time_conv: questions about converting times.
curr_conv: questions about converting currencies.
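
As a rough illustration of working with these fields, here is a minimal sketch that reads rows and tallies them by answer_type. It assumes the data file holds one JSON object per line, which this README suggests but does not state outright. The helper names (iter_rows, count_answer_types) are hypothetical and not part of this repository, and the sample rows are copied from the examples in this README.

```python
import io
import json
from collections import Counter

# Two rows copied from the examples in this README, one JSON object per line.
# The one-object-per-line layout is an assumption about the data file.
SAMPLE = io.StringIO(
    '{"id": 3547745, "question": "what is the symbol for the element aluminum?", '
    '"short_answer": "Al", "answer": null, "answer_type": "knowledge"}\n'
    '{"id": 1032187, "question": "exajoule is how many joules?", '
    '"short_answer": "1e+18 Joule", "answer": null, "answer_type": "unit_conv"}\n'
)


def iter_rows(lines):
    """Yield one dict per non-empty line of JSON text."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_answer_types(rows):
    """Tally rows by their answer_type field."""
    return Counter(row["answer_type"] for row in rows)


type_counts = count_answer_types(iter_rows(SAMPLE))
print(type_counts)
```

To run this on the real data, pass an open file handle for whichever file you pulled from the data/ directory to iter_rows in place of SAMPLE.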

Here are several more examples from the data:

{
  "id": 5009708,
  "question": "carbon dioxide comprises approximately what percentage of tropospheric gases?",
  "short_answer": "04%",
  "answer": "Carbon dioxide comprise approximately . 04% of tropospheric gases.",
  "answer_type": "feat_snip"
}
{
  "id": 8317711,
  "question": "what is the distance between uranus and earth?",
  "short_answer": "1.7858 billion mi",
  "answer": null,
  "answer_type": "knowledge"
}
{
  "id": 3547745,
  "question": "what is the symbol for the element aluminum?",
  "short_answer": "Al",
  "answer": null,
  "answer_type": "knowledge"
}
{
  "id": 3552841,
  "question": "what is the volume of a 12 oz can?",
  "short_answer": "340.957",
  "answer": null,
  "answer_type": "unit_conv"
}
{
  "id": 1032187,
  "question": "exajoule is how many joules?",
  "short_answer": "1e+18 Joule",
  "answer": null,
  "answer_type": "unit_conv"
}
{
  "id": 610247,
  "question": "are words that start with e?",
  "short_answer": null,
  "answer": "['eager.', 'eagle.', 'eagre.', 'eared.', 'earls.', 'early.', 'earns.', 'earth.']",
  "answer_type": "collection"
}
{
  "id": 1309258,
  "question": "how long does it take to boil a hard egg?",
  "short_answer": null,
  "answer": "['Place your eggs in a single layer on the bottom of your pot and cover with cold water. ... ', 'Over high heat, bring your eggs to a rolling boil.', 'Remove from heat and let stand in water for 10-12 minutes for large eggs. ... ', 'Drain water and immediately run cold water over eggs until cooled.']",
  "answer_type": "collection"
}
{
  "id": 2518757,
  "question": "is ways to lose weight?",
  "short_answer": null,
  "answer": "['Trying intermittent fasting. ... ', 'Tracking your diet and exercise. ... ', 'Eating mindfully. ... ', 'Eating protein for breakfast. ... ', 'Cutting back on sugar and refined carbohydrates. ... ', 'Eating plenty of fiber. ... ', 'Balancing gut bacteria. ... ', \"Getting a good night's sleep.\"]",
  "answer_type": "collection"
}
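
In the collection examples above, the answer field holds a list written out as a single string. Here is a minimal sketch for turning such a string back into a list of items, assuming that collection answers are Python-style list literals as they appear in these examples. The function name parse_collection_answer is hypothetical, and the fallback behavior is an illustrative choice.

```python
import ast


def parse_collection_answer(answer):
    """Turn a stringified list answer into a list of strings.

    Falls back to a one-element list if the text is not a list literal.
    """
    try:
        # literal_eval only evaluates literals, so it does not execute code.
        parsed = ast.literal_eval(answer)
    except (ValueError, SyntaxError):
        return [answer]
    if isinstance(parsed, list):
        return [str(item).strip() for item in parsed]
    return [answer]


# Answer string copied from the "are words that start with e?" example above.
raw = "['eager.', 'eagle.', 'eagre.', 'eared.', 'earls.', 'early.', 'earns.', 'earth.']"
items = parse_collection_answer(raw)
print(len(items), items[0])
```

Using ast.literal_eval rather than eval keeps the parsing limited to literals, which is the safer choice for text scraped from the web.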

Baselines

For the scripts to reproduce our T5 baselines, see the experiments/ directory.

Reproducing Human Evaluation

TBD

Owner

AI2
