Automated question generation and question answering from Turkish texts using text-to-text transformers

Last update: Dec 14, 2022

Overview

Turkish Question Generation

Official source code for
"Automated question generation and question answering from Turkish texts using text-to-text transformers"

citation

If you use this software in your work, please cite as:

@article{akyon2021automated,
  title={Automated question generation and question answering from Turkish texts using text-to-text transformers},
  author={Akyon, Fatih Cagatay and Cavusoglu, Devrim and Cengiz, Cemil and Altinuc, Sinan Onur and Temizel, Alptekin},
  journal={arXiv preprint arXiv:2111.06476},
  year={2021}
}

install

git clone https://github.com/obss/turkish-question-generation.git
cd turkish-question-generation
pip install -r requirements.txt

train

start a training run with command-line arguments:

python run.py --model_name_or_path google/mt5-small --output_dir runs/exp1 --do_train --do_eval --tokenizer_name_or_path mt5_qg_tokenizer --per_device_train_batch_size 4 --gradient_accumulation_steps 2 --learning_rate 1e-4 --seed 42 --save_total_limit 1

download a JSON config file and start a training run:

python run.py config.json

download a YAML config file and start a training run:

python run.py config.yaml
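The config file carries the same options as the command-line call above. The following is a minimal sketch, assuming the config keys mirror the flag names one-to-one (a common convention for Hugging Face style argument parsers, not confirmed by this README). `write_config` is a hypothetical helper and is not part of the repository.

```python
import json
from pathlib import Path

# Options copied from the command-line example above.
# Assumption: config keys map 1:1 to the flag names.
TRAIN_CONFIG = {
    "model_name_or_path": "google/mt5-small",
    "output_dir": "runs/exp1",
    "do_train": True,
    "do_eval": True,
    "tokenizer_name_or_path": "mt5_qg_tokenizer",
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 2,
    "learning_rate": 1e-4,
    "seed": 42,
    "save_total_limit": 1,
}


def write_config(config: dict, path: str) -> Path:
    """Hypothetical helper: dump the options to a JSON file and return its path."""
    target = Path(path)
    target.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return target
```

Calling `write_config(TRAIN_CONFIG, "config.json")` would produce a file that can then be passed to `python run.py config.json`, provided the key names match what the script expects.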

evaluate

set the evaluation-related parameters in the config:

do_train: false
do_eval: true
eval_dataset_list: ["tquad2-valid", "xquad.tr"]
prepare_data: true
mt5_task_list: ["qa", "qg", "ans_ext"]
mt5_qg_format: "both"
no_cuda: false

start an evaluation:

python run.py config.yaml
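An evaluation config can be derived from an existing training config by overriding only the keys listed above. This is a sketch under the assumption that the remaining keys (model path, output directory, and so on) stay unchanged. `to_eval_config` is a hypothetical helper name.

```python
from copy import deepcopy

# Evaluation overrides copied from the config fragment above.
EVAL_OVERRIDES = {
    "do_train": False,
    "do_eval": True,
    "eval_dataset_list": ["tquad2-valid", "xquad.tr"],
    "prepare_data": True,
    "mt5_task_list": ["qa", "qg", "ans_ext"],
    "mt5_qg_format": "both",
    "no_cuda": False,
}


def to_eval_config(train_config: dict) -> dict:
    """Hypothetical helper: return a copy of train_config set up for evaluation only."""
    config = deepcopy(train_config)          # leave the caller's dict untouched
    config.update(deepcopy(EVAL_OVERRIDES))  # evaluation keys win over training keys
    return config
```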

neptune

install neptune:

pip install neptune-client

download a config file and set the neptune parameters:

run_name: 'exp1'
neptune_project: 'name/project'
neptune_api_token: 'YOUR_API_TOKEN'

start a training:

python train.py config.yaml

wandb

install wandb:

pip install wandb

download a config file and set the wandb parameters:

run_name: 'exp1'
wandb_project: 'turque'

start a training:

python train.py config.yaml

finetuned checkpoints

Name	Model	Training data	Params	Model size
turque-s1	mt5-small	tquad2-train+tquad2-valid+xquad.tr	60M	1.2GB
mt5-small-3task-both-tquad2	mt5-small	tquad2-train	60M	1.2GB
mt5-small-3task-prepend-tquad2	mt5-small	tquad2-train	60M	1.2GB
mt5-base-3task-both-tquad2	mt5-base	tquad2-train	220M	2.3GB

format

answer extraction:

input:

"<hl> Osman Bey 1258 yılında Söğüt’te doğdu. <hl> Osman Bey 1 Ağustos 1326’da Bursa’da hayatını kaybetmiştir.1281 yılında Osman Bey 23 yaşında iken Ahi teşkilatından olan Şeyh Edebali’nin kızı Malhun Hatun ile evlendi."

target:

"<sep> 1258 <sep> Söğüt’te <sep>"
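The answer extraction pair can be assembled from a list of sentences and the answers found in one of them. The sketch below assumes the special tokens are the literal strings `<hl>` (highlight) and `<sep>` (separator), with a separator before, between, and after the answers. The token strings, their placement, and the helper name are assumptions and should be checked against the tokenizer used for training.

```python
from typing import List, Tuple

HL = "<hl>"    # assumed highlight token
SEP = "<sep>"  # assumed separator token


def build_answer_extraction_example(
    sentences: List[str], target_idx: int, answers: List[str]
) -> Tuple[str, str]:
    """Hypothetical helper: wrap one sentence in highlight tokens and
    list its answers separated by separator tokens."""
    if not 0 <= target_idx < len(sentences):
        raise IndexError("target_idx is out of range")
    parts = [
        f"{HL} {sentence} {HL}" if i == target_idx else sentence
        for i, sentence in enumerate(sentences)
    ]
    source = " ".join(parts)
    # Separator before the first answer and after every answer.
    target = " ".join([SEP] + [f"{answer} {SEP}" for answer in answers])
    return source, target
```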

question answering:

input:

"question: Osman Bey nerede doğmuştur? context: Osman Bey 1258 yılında Söğüt’te doğdu. Osman Bey 1 Ağustos 1326’da Bursa’da hayatını kaybetmiştir.1281 yılında Osman Bey 23 yaşında iken Ahi teşkilatından olan Şeyh Edebali’nin kızı Malhun Hatun ile evlendi."

target:

"Söğüt’te"
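The question answering source string follows the `question: ... context: ...` template shown above. A minimal sketch of building such a pair, assuming a record with `question`, `context`, and `answer` keys (a hypothetical layout chosen for illustration, not the repository's data schema):

```python
from typing import Dict, Tuple


def build_qa_example(record: Dict[str, str]) -> Tuple[str, str]:
    """Hypothetical helper: turn a record into a (source, target) pair.

    Only the "question: ... context: ..." template comes from the README;
    the record keys are an assumption.
    """
    question = record["question"].strip()
    context = record["context"].strip()
    source = f"question: {question} context: {context}"
    return source, record["answer"].strip()
```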

question generation (prepend):

input:

"answer: Söğüt’te context: Osman Bey 1258 yılında Söğüt’te doğdu. Osman Bey 1 Ağustos 1326’da Bursa’da hayatını kaybetmiştir.1281 yılında Osman Bey 23 yaşında iken Ahi teşkilatından olan Şeyh Edebali’nin kızı Malhun Hatun ile evlendi."

target:

"Osman Bey nerede doğmuştur?"

question generation (highlight):

input:

"generate question: Osman Bey 1258 yılında <hl> Söğüt’te <hl> doğdu. Osman Bey 1 Ağustos 1326’da Bursa’da hayatını kaybetmiştir.1281 yılında Osman Bey 23 yaşında iken Ahi teşkilatından olan Şeyh Edebali’nin kızı Malhun Hatun ile evlendi."

target:

"Osman Bey nerede doğmuştur?"

question generation (both):

input:

"answer: Söğüt’te context: Osman Bey 1258 yılında <hl> Söğüt’te <hl> doğdu. Osman Bey 1 Ağustos 1326’da Bursa’da hayatını kaybetmiştir.1281 yılında Osman Bey 23 yaşında iken Ahi teşkilatından olan Şeyh Edebali’nin kızı Malhun Hatun ile evlendi."

target:

"Osman Bey nerede doğmuştur?"
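The three question generation formats differ only in how the answer is marked in the source string. The sketch below builds each of them from a context, an answer, and the answer's character offset. It assumes the highlight token is the literal string `<hl>`. The function name and the `answer_start` argument are hypothetical, while the format names match the section titles above and the `mt5_qg_format` config value.

```python
HL = "<hl>"  # assumed highlight token


def build_qg_input(context: str, answer: str, answer_start: int, qg_format: str = "both") -> str:
    """Hypothetical helper: build the source string for question generation."""
    answer_end = answer_start + len(answer)
    if context[answer_start:answer_end] != answer:
        raise ValueError("answer_start does not point at the answer span")
    # Context with the answer span wrapped in highlight tokens.
    highlighted = f"{context[:answer_start]}{HL} {answer} {HL}{context[answer_end:]}"
    if qg_format == "prepend":
        return f"answer: {answer} context: {context}"
    if qg_format == "highlight":
        return f"generate question: {highlighted}"
    if qg_format == "both":
        return f"answer: {answer} context: {highlighted}"
    raise ValueError(f"unknown qg_format: {qg_format!r}")
```

In every format the target is the question itself, for example "Osman Bey nerede doğmuştur?".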

paper results

BERTurk-base and mT5-base QA evaluation results for TQuADv2 fine-tuning.

mT5-base QG evaluation results for single-task (ST) and multi-task (MT) for TQuADv2 fine-tuning.

TQuADv1 and TQuADv2 fine-tuning QG evaluation results for multi-task mT5 variants. MT-Both denotes an mT5 model fine-tuned with the 'Both' input format in a multi-task setting.

paper configs

You can find the config files used in the paper under configs/paper.

contributing

Before opening a PR:

Install required development packages:

pip install "black==21.7b0" "flake8==3.9.2" "isort==5.9.2"

Reformat with black and isort:

black . --config pyproject.toml
isort .

Releases (0.0.1)

0.0.1 (Nov 10, 2021)

Owner

Open Business Software Solutions
