JaQuAD: Japanese Question Answering Dataset

Last update: Dec 27, 2022

Related tags

Overview

JaQuAD: Japanese Question Answering Dataset

Overview

Japanese Question Answering Dataset (JaQuAD), released in 2022, is a human-annotated dataset created for Japanese Machine Reading Comprehension. JaQuAD is developed to provide a SQuAD-like QA dataset in Japanese. JaQuAD contains 39,696 question-answer pairs. Questions and answers are manually curated by human annotators. Contexts are collected from Japanese Wikipedia articles.

For more information on how the dataset was created, refer to our paper, JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension.

Data

JaQuAD consists of three sets: train, validation, and test. They were created from disjoint sets of Wikipedia articles. The following table shows statistics for each set:

Set	Number of Articles	Number of Contexts	Number of Questions
Train	691	9713	31748
Validation	101	1431	3939
Test	109	1479	4009

You can also download our dataset here. (The test set is not publicly released yet.)

from datasets import load_dataset
jaquad_data = load_dataset('SkelterLabsInc/JaQuAD')

Baseline

We also provide a baseline model for JaQuAD for comparison. We created this model by fine-tuning a publicly available Japanese BERT model on JaQuAD. You can see the performance of the baseline model in the table below.

For more information on the model's creation, refer to JaQuAD.ipynb.

Pre-trained LM	Dev F1	Dev EM	Test F1	Test EM
BERT-Japanese	77.35	61.01	78.92	63.38

You can download the baseline model here.

Usage

from transformers import AutoModelForQuestionAnswering, AutoTokenizer

question = 'アレクサンダー・グラハム・ベルは、どこで生まれたの?'
context = 'アレクサンダー・グラハム・ベルは、スコットランド生まれの科学者、発明家、工学者である。世界初の>実用的電話の発明で知られている。'

model = AutoModelForQuestionAnswering.from_pretrained(
    'SkelterLabsInc/bert-base-japanese-jaquad')
tokenizer = AutoTokenizer.from_pretrained(
    'SkelterLabsInc/bert-base-japanese-jaquad')

inputs = tokenizer(
    question, context, add_special_tokens=True, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]
outputs = model(**inputs)
answer_start_scores = outputs.start_logits
answer_end_scores = outputs.end_logits

# Get the most likely start of the answer with the argmax of the score.
answer_start = torch.argmax(answer_start_scores)
# Get the most likely end of the answer with the argmax of the score.
# 1 is added to `answer_end` because the index of the score is inclusive.
answer_end = torch.argmax(answer_end_scores) + 1

answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
# answer = 'スコットランド'

Limitations

This dataset is not yet complete. The social biases of this dataset have not yet been investigated.

If you find any errors in JaQuAD, please contact [email protected].

Reference

If you use our dataset or code, please cite our paper:

@misc{so2022jaquad,
      title={{JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension}},
      author={ByungHoon So and Kyuhong Byun and Kyungwon Kang and Seongjin Cho},
      year={2022},
      eprint={2202.01764},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

LICENSE

The JaQuAD dataset is licensed under the [CC BY-SA 3.0] (https://creativecommons.org/licenses/by-sa/3.0/) license.

Have Questions?

Ask us at [email protected].

JaQuAD: Japanese Question Answering Dataset

Related tags

Overview

JaQuAD: Japanese Question Answering Dataset

Overview

Data

Baseline

Usage

Limitations

Reference

LICENSE

Have Questions?

Owner

SkelterLabs

Source code for the paper "TearingNet: Point Cloud Autoencoder to Learn Topology-Friendly Representations"

Text classification on IMDB dataset using Keras and Bi-LSTM network

ASCEND Chinese-English code-switching dataset

auto_code_complete is a auto word-completetion program which allows you to customize it on your need

An implementation of model parallel GPT-3-like models on GPUs, based on the DeepSpeed library. Designed to be able to train models in the hundreds of billions of parameters or larger.

Finetune gpt-2 in google colab

XLNet: Generalized Autoregressive Pretraining for Language Understanding

Code for papers "Generation-Augmented Retrieval for Open-Domain Question Answering" and "Reader-Guided Passage Reranking for Open-Domain Question Answering", ACL 2021

Train 🤗transformers with DeepSpeed: ZeRO-2, ZeRO-3

Under the hood working of transformers, fine-tuning GPT-3 models, DeBERTa, vision models, and the start of Metaverse, using a variety of NLP platforms: Hugging Face, OpenAI API, Trax, and AllenNLP

German Text-To-Speech Engine using Tacotron and Griffin-Lim

🧪 Cutting-edge experimental spaCy components and features

This project uses word frequency and Term Frequency-Inverse Document Frequency to summarize a text.

Python functions for summarizing and improving voice dictation input.

Mednlp - Medical natural language parsing and utility library

Quick insights from Zoom meeting transcripts using Graph + NLP

The official repository of the ISBI 2022 KNIGHT Challenge

PyTorch original implementation of Cross-lingual Language Model Pretraining.

PyTorch Implementation of "Bridging Pre-trained Language Models and Hand-crafted Features for Unsupervised POS Tagging" (Findings of ACL 2022)

Anomaly Detection 이상치 탐지 전처리 모듈