This is the official implementation of "One Question Answering Model for Many Languages with Cross-lingual Dense Passage Retrieval".

Related tags

Deep LearningCORA
Overview

CORA

This is the official implementation of the following paper: Akari Asai, Xinyan Yu, Jungo Kasai and Hannaneh Hajishirzi. One Question Answering Model for Many Languages with Cross-lingual Dense Passage Retrieval. Preptint. 2021.

cora_image

In this paper, we introduce CORA, a single, unified multilingual open QA model for many languages.
CORA consists of two main components: mDPR and mGEN.
mDPR retrieves documents from multilingual document collections and mGEN generates the answer in the target languages directly instead of using any external machine translation or language-specific retrieval module.
Our experimental results show state-of-the-art results across two multilingual open QA dataset: XOR QA and MKQA.

Contents

  1. Quick Run on XOR QA
  2. Overview
  3. Data
  4. Installation
  5. Training
  6. Evaluation
  7. Citations and Contact

Quick Run on XOR QA

We provide quick_start_xorqa.sh, with which you can easily set up and run evaluation on the XOR QA full dev set.

The script will

  1. download our trained mDPR, mGEN and encoded Wikipedia embeddings,
  2. run the whole pipeline on the evaluation set, and
  3. calculate the QA scores.

You can download the prediction results from here.

Overview

To run CORA, you first need to preprocess Wikipedia using the codes in wikipedia_preprocess.
Then you train mDPR and mGEN.
Once you finish training those components, please run evaluations, and then evaluate the performance using eval_scripts.

Please see the details of each components in each directory.

  • mDPR: codes for training and evaluating our mDPR.
  • mGEN: codes for training and evaluating our mGEN.
  • wikipedia_preprocess: codes for preprocessing Wikipedias.
  • eval_scripts: scripts to evaluate the performance.

Data

Training data

We will upload the final training data for mDPR. Please stay tuned!

Evaluation data

We evaluate our models performance on XOR QA and MKQA.

  • XOR QA Please download the XOR QA (full) data by running the command below.
mkdir data
cd data
wget https://nlp.cs.washington.edu/xorqa/XORQA_site/data/xor_dev_full_v1_1.jsonl
wget https://nlp.cs.washington.edu/xorqa/XORQA_site/data/xor_test_full_q_only_v1_1.jsonl
cd ..
  • MKQA Please download the original MKQA data from the original repository.
wget https://github.com/apple/ml-mkqa/raw/master/dataset/mkqa.jsonl.gz
gunzip mkqa.jsonl.gz

Before evaluating on MKQA, you need to preprocess the MKQA data to convert them into the same format as XOR QA. Please follow the instructions at eval_scripts/README.md.

Installation

Dependencies

  • Python 3
  • PyTorch (currently tested on version 1.7.0)
  • Transformers (version 4.2.1; unlikely to work with a different version)

Trained models

You can download trained models by running the commands below:

mkdir models
wget https://nlp.cs.washington.edu/xorqa/cora/models/all_w100.tsv
wget https://nlp.cs.washington.edu/xorqa/cora/models/mGEN_model.zip
wget https://nlp.cs.washington.edu/xorqa/cora/models/mDPR_biencoder_best.cpt
unzip mGEN_model.zip
mkdir embeddings
cd embeddings
for i in 0 1 2 3 4 5 6 7;
do 
  wget https://nlp.cs.washington.edu/xorqa/cora/models/wikipedia_split/wiki_emb_en_$i 
done
for i in 0 1 2 3 4 5 6 7;
do 
  wget https://nlp.cs.washington.edu/xorqa/cora/models/wikipedia_split/wiki_emb_others_$i  
done
cd ../..

Training

CORA is trained with our iterative training process, where each iteration proceeds over two states: parameter updates and cross-lingual data expansion.

  1. Train mDPR with the current training data. For the first iteration, the training data is the gold paragraph data from Natural Questions and TyDi-XOR QA.
  2. Retrieve top documents using trained mDPR
  3. Train mGEN with retrieved data
  4. Run mGEN on each passages from mDPR and synthetic data retrieval to label the new training data.
  5. Go back to step 1.

overview_training

See the details of each training step in mDPR/README.md and mGEN/README.md.

Evaluation

  1. Run mDPR on the input data
python dense_retriever.py \
    --model_file ../models/mDPR_biencoder_best.cpt \
    --ctx_file ../models/all_w100.tsv \
    --qa_file ../data/xor_dev_full_v1_1.jsonl \
    --encoded_ctx_file "../models/embeddings/wiki_emb_*" \
    --out_file xor_dev_dpr_retrieval_results.json \
    --n-docs 20 --validation_workers 1 --batch_size 256 --add_lang
  1. Convert the retrieved results into mGEN input format
cd mGEN
python3 convert_dpr_retrieval_results_to_seq2seq.py \
    --dev_fp ../mDPR/xor_dev_dpr_retrieval_results.json \
    --output_dir xorqa_dev_final_retriever_results \
    --top_n 15 \
    --add_lang \
    --xor_engspan_train data/xor_train_retrieve_eng_span.jsonl \
    --xor_full_train data/xor_train_full.jsonl \
    --xor_full_dev data/xor_dev_full_v1_1.jsonl
  1. Run mGEN
CUDA_VISIBLE_DEVICES=0 python eval_mgen.py \
    --model_name_or_path \
    --evaluation_set xorqa_dev_final_retriever_results/val.source \
    --gold_data_path xorqa_dev_final_retriever_results/gold_para_qa_data_dev.tsv \
    --predictions_path xor_dev_final_results.txt \
    --gold_data_mode qa \
    --model_type mt5 \
    --max_length 20 \
    --eval_batch_size 4
cd ..
  1. Run the XOR QA full evaluation script
cd eval_scripts
python eval_xor_full.py --data_file ../data/xor_dev_full_v1_1.jsonl --pred_file ../mGEN/xor_dev_final_results.txt --txt_file

Baselines

In our paper, we have tested several baselines such as Translate-test or multilingual baselines. The codes for machine translations or BM 25-based retrievers are at baselines. To run the baselines, you may need to download code and mdoels from the XOR QA repository. Those codes are implemented by Velocity :)

Citations and Contact

If you find this codebase is useful or use in your work, please cite our paper.

@article{
asai2021cora,
title={One Question Answering Model for Many Languages with Cross-lingual Dense Passage Retrieval},
author={Akari Asai and Xinyan Yu and Jungo Kasai and Hannaneh Hajishirzi},
journal={Arxiv Preprint},
year={2021}
}

Please contact Akari Asai (@AkariAsai on Twitter, akari[at]cs.washington.edu) for questions and suggestions.

Owner
Akari Asai
PhD student at @uwnlp . NLP & ML.
Akari Asai
Allows including an action inside another action (by preprocessing the Yaml file). This is how composite actions should have worked.

actions-includes Allows including an action inside another action (by preprocessing the Yaml file). Instead of using uses or run in your action step,

Tim Ansell 70 Nov 04, 2022
Differential fuzzing for the masses!

NEZHA NEZHA is an efficient and domain-independent differential fuzzer developed at Columbia University. NEZHA exploits the behavioral asymmetries bet

147 Dec 05, 2022
A toy compiler that can convert Python scripts to pickle bytecode 🥒

Pickora 🐰 A small compiler that can convert Python scripts to pickle bytecode. Requirements Python 3.8+ No third-party modules are required. Usage us

ꌗᖘ꒒ꀤ꓄꒒ꀤꈤꍟ 68 Jan 04, 2023
An Implicit Function Theorem (IFT) optimizer for bi-level optimizations

iftopt An Implicit Function Theorem (IFT) optimizer for bi-level optimizations. Requirements Python 3.7+ PyTorch 1.x Installation $ pip install git+ht

The Money Shredder Lab 2 Dec 02, 2021
[3DV 2021] A Dataset-Dispersion Perspective on Reconstruction Versus Recognition in Single-View 3D Reconstruction Networks

dispersion-score Official implementation of 3DV 2021 Paper A Dataset-dispersion Perspective on Reconstruction versus Recognition in Single-view 3D Rec

Yefan 7 May 28, 2022
ScaleNet: A Shallow Architecture for Scale Estimation

ScaleNet: A Shallow Architecture for Scale Estimation Repository for the code of ScaleNet paper: "ScaleNet: A Shallow Architecture for Scale Estimatio

Axel Barroso 34 Nov 09, 2022
Spatial Temporal Graph Convolutional Networks (ST-GCN) for Skeleton-Based Action Recognition in PyTorch

Reminder ST-GCN has transferred to MMSkeleton, and keep on developing as an flexible open source toolbox for skeleton-based human understanding. You a

sijie yan 1.1k Dec 25, 2022
Gesture recognition on Event Data

Event based Gesture Recognition Gesture recognition on Event Data usually involv

2 Feb 14, 2022
Source code for Zalo AI 2021 submission

zalo_ltr_2021 Source code for Zalo AI 2021 submission Solution: Pipeline We use the pipepline in the picture below: Our pipeline is combination of BM2

128 Dec 27, 2022
Cmsc11 arcade - Final Project for CMSC11

cmsc11_arcade Final Project for CMSC11 Developers: Limson, Mark Vincent Peñafiel

Gregory 1 Jan 18, 2022
Reverse engineer your pytorch vision models, in style

🔍 Rover Reverse engineer your CNNs, in style Rover will help you break down your CNN and visualize the features from within the model. No need to wri

Mayukh Deb 32 Sep 24, 2022
What can linearized neural networks actually say about generalization?

What can linearized neural networks actually say about generalization? This is the source code to reproduce the experiments of the NeurIPS 2021 paper

gortizji 11 Dec 09, 2022
Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training.

English | 简体中文 Easy Parallel Library Overview Easy Parallel Library (EPL) is a general and efficient library for distributed model training. Usability

Alibaba 185 Dec 21, 2022
Neural Magic Eye: Learning to See and Understand the Scene Behind an Autostereogram, arXiv:2012.15692.

Neural Magic Eye Preprint | Project Page | Colab Runtime Official PyTorch implementation of the preprint paper "NeuralMagicEye: Learning to See and Un

Zhengxia Zou 56 Jul 15, 2022
Experiments for Operating Systems Lab (ETCS-352)

Operating Systems Lab (ETCS-352) Experiments for Operating Systems Lab (ETCS-352) performed by me in 2021 at uni. All codes are written by me except t

Deekshant Wadhwa 0 Sep 06, 2022
PyTorch implementation of SwAV (Swapping Assignments between Views)

Unsupervised Learning of Visual Features by Contrasting Cluster Assignments This code provides a PyTorch implementation and pretrained models for SwAV

Meta Research 1.7k Jan 04, 2023
SWA Object Detection

SWA Object Detection This project hosts the scripts for training SWA object detectors, as presented in our paper: @article{zhang2020swa, title={SWA

237 Nov 28, 2022
Rendering color and depth images for ShapeNet models.

Color & Depth Renderer for ShapeNet This library includes the tools for rendering multi-view color and depth images of ShapeNet models. Physically bas

Yinyu Nie 41 Dec 19, 2022
Intrusion Test Tool with Python

P3ntsT00L Uma ferramenta escrita em Python, feita para Teste de intrusão. Requisitos ter o python 3.9.8 instalado em sua máquina. ter a git instalada

josh washington 2 Dec 27, 2021
Streamlit Tutorial (ex: stock price dashboard, cartoon-stylegan, vqgan-clip, stylemixing, styleclip, sefa)

Streamlit Tutorials Install pip install streamlit Run cd [directory] streamlit run app.py --server.address 0.0.0.0 --server.port [your port] # http:/

Jihye Back 30 Jan 06, 2023