CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training

This is the official repository for the code and models of the paper CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training. If you use our dataset, code or any parts thereof, please cite this paper:

@misc{huber-etal-2021-ccqa,
  title={CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training}, 
  author={Patrick Huber and Armen Aghajanyan and Barlas Oğuz and Dmytro Okhonko and Wen-tau Yih and Sonal Gupta and Xilun Chen},
  year={2021},
  eprint={2110.07731},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Getting Common Crawl Snapshots

The Common Crawl project provides monthly web snapshots of new and updated websites in raw HTML format. Every monthly snapshot (~50-70TB) is further separated into smaller WARC (Web ARChive) files. To download a single WARC file, go to the Common Crawl website for the respective month (e.g. May 2021) and download the WARC paths file. The downloaded WARC paths file contains a newline-separated list of download paths for the actual files. Pick a path and prepend s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to form the complete URL. Once downloaded, gunzip the archive, and a single Common Crawl web archive is ready to be processed.
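The URL construction described above can be sketched in a few lines of Python. This is only an illustration and not part of the repository: the function name build_warc_urls and the example path are hypothetical, while the two prefixes are the ones quoted in the paragraph above.

```python
from typing import Iterable, List

# Prefixes quoted in the text above.
HTTPS_PREFIX = "https://commoncrawl.s3.amazonaws.com/"
S3_PREFIX = "s3://commoncrawl/"


def build_warc_urls(paths: Iterable[str], prefix: str = HTTPS_PREFIX) -> List[str]:
    """Turn the lines of a WARC paths file into complete download URLs.

    `paths` is any iterable of lines (for example an open file handle).
    Blank lines are skipped and surrounding whitespace is stripped.
    """
    urls = []
    for line in paths:
        path = line.strip()
        if not path:
            continue
        # Avoid a double slash if a path happens to start with "/".
        urls.append(prefix + path.lstrip("/"))
    return urls


if __name__ == "__main__":
    # Hypothetical path, for illustration only; real paths come from the
    # WARC paths file of the chosen monthly snapshot.
    example_lines = ["crawl-data/EXAMPLE-SNAPSHOT/segments/0/warc/example-00000.warc.gz\n"]
    for url in build_warc_urls(example_lines):
        print(url)
```

Passing prefix=S3_PREFIX yields the s3:// form instead, which is convenient when the files are fetched with an S3 client rather than over HTTPS.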

Dataset Generation

Dependencies

Below are the required dependencies to run the dataset generation, curation and model evaluations.

  • Rust
  • Rust packages: clap, html-escape, indicatif, kuchiki, rayon, regex, serde, serde_json, warc (see Cargo.toml file for versions)
  • Python 3.7.3
  • Python dependencies: fasttext language identification model (lid.176.bin), fasttext==0.9.2, lxml==4.3.2

Processing Common Crawl data (Rust)

  • Build the cargo package with cargo build from within the rust folder
  • Run the script with cargo run <path/to/warc/file> <path/to/output/file.mhtml>

Curating the minified HTML data (Python)

To generate JSON objects for every webpage in the minified HTML, run:

python mhtml_to_json.py <path/to/fasttext/lid.176.bin> <path/to/mhtml/file> <path/to/output/file>
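The script takes the fasttext language identification model as its first argument. As a rough sketch of how such a model is typically queried (this is not the code in mhtml_to_json.py, and the helper name detect_language is hypothetical), the snippet below wraps a fasttext-style predict callable, which returns a tuple of labels such as "__label__en" together with their probabilities. The predictor is passed in as an argument, so the sketch runs without the model file.

```python
from typing import Callable, Tuple

LABEL_PREFIX = "__label__"


def detect_language(predict: Callable, text: str) -> Tuple[str, float]:
    """Return (language code, probability) for `text`.

    `predict` is assumed to behave like the `predict` method of a fasttext
    model: it takes one line of text and returns (labels, probabilities),
    most likely label first.
    """
    # fasttext processes one line at a time, so collapse all whitespace
    # (including newlines) into single spaces first.
    cleaned = " ".join(text.split())
    if not cleaned:
        return ("unknown", 0.0)
    labels, probs = predict(cleaned)
    label = labels[0]
    if label.startswith(LABEL_PREFIX):
        label = label[len(LABEL_PREFIX):]
    return (label, float(probs[0]))


# With the real model (requires fasttext and lid.176.bin):
#   import fasttext
#   model = fasttext.load_model("<path/to/fasttext/lid.176.bin>")
#   detect_language(model.predict, "What is the capital of France?")
```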

Aggregating datapoints to remove duplicate URL entries (Python)

As mentioned in the paper, we use the original dataset for our in-domain pre-training experiments. However, we also provide a cleaned version of the dataset, aggregating same-URL duplicates into a single object. To run the datapoint aggregation script, execute:

python json_duplicate_filter.py <path/to/json/file> <path/to/output/file>
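To illustrate what aggregating same-URL duplicates into a single object can look like, here is a minimal sketch. It is not the logic of json_duplicate_filter.py: the function name aggregate_by_url and the field names "url" and "questions" are assumptions made for this example, and the actual JSON schema may differ.

```python
from collections import OrderedDict
from typing import Dict, Iterable, List


def aggregate_by_url(
    records: Iterable[Dict],
    url_key: str = "url",            # hypothetical field name
    items_key: str = "questions",    # hypothetical field name
) -> List[Dict]:
    """Merge records that share a URL into one record per URL.

    The first record seen for a URL supplies the remaining fields; the
    list stored under `items_key` is extended by later duplicates.
    First-seen order of URLs is preserved.
    """
    merged = OrderedDict()
    for record in records:
        url = record[url_key]
        if url not in merged:
            # Copy the record and its list so the input is not mutated.
            entry = dict(record)
            entry[items_key] = list(record.get(items_key, []))
            merged[url] = entry
        else:
            merged[url][items_key].extend(record.get(items_key, []))
    return list(merged.values())


if __name__ == "__main__":
    sample = [
        {"url": "https://example.com/faq", "questions": ["q1"]},
        {"url": "https://example.com/other", "questions": ["q2"]},
        {"url": "https://example.com/faq", "questions": ["q3"]},
    ]
    for obj in aggregate_by_url(sample):
        print(obj)
```

Keying on the URL keeps the merge to a single pass over the data, at the cost of holding one entry per distinct URL in memory.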

Converting json dataset into closed-book and passage retrieval formats (Python)

To train closed-book (sequence-to-sequence) and passage retrieval (DPR) models on the CCQA dataset, the corpus needs to be processed further.

Closed-book processing

To prepare the dataset for closed-book question-answering training, run:

python closed_book_processing.py <path/to/json/file> <path/to/output/file> <--only_english> <--keep_markup>

Passage retrieval (DPR) processing

To prepare the dataset for passage retrieval (DPR) training, run:

python passage_retrieval_processing.py <path/to/json/file> <path/to/output/file> <--only_english> <--keep_markup>

CCQA In-Domain Pre-Trained Model Checkpoints

BART and T5 checkpoints are Hugging Face Transformers models, tested with transformers version 4.8.2.

The DPR model checkpoint can be downloaded for the original DPR codebase or the DPR v2 codebase.

LICENSE

The majority of CCQA is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: crowbook-text-processing is licensed under the MPL-2.0 license.

Owner
Meta Research