Weakly Supervised Text-to-SQL Parsing through Question Decomposition

Overview

Weakly Supervised Text-to-SQL Parsing through Question Decomposition

The official repository for the paper "Weakly Supervised Text-to-SQL Parsing through Question Decomposition" by Tomer Wolfson, Daniel Deutch and Jonathan Berant, accepted to the Finings of NAACL 2022.

This repository contains the code and data used in our paper:

  1. Code for automatically synthesizing SQL queries from question decompositions + answers
  2. Code for the models used in our paper mapping text-to-SQL and text-to-QDMR

Setup ๐Ÿ™Œ๐Ÿผ

  1. Create the virtual environment
conda create -n [ENV_NAME] python=3.8
conda activate [ENV_NAME]
  1. Clone the repository
git clone https://github.com/tomerwolgithub/question-decomposition-to-sql
cd question-decomposition-to-sql
  1. Install the relevant requirements
pip install -r requirements.txt 
python -m spacy download en_core_web_lg
  1. To train the QDMR parser model please setup a separate environment (due to different Hugginface versions):
conda create -n qdmr_parser_env python=3.8
conda activate qdmr_parser_env
pip install -r requirements_qdmr_parser.txt 
python -m spacy download en_core_web_lg

Download Resources ๐Ÿ—๏ธ

1. QDMR Parsing Datasets:

2. Text-to-SQL Datasets:

3. Databases (schema & contents):

Convert the MySQL databases of Academic, IMDB, Yelp and GeoQuery to sqlite format using the tool of Jean-Luc Lacroix:

./mysql2sqlite academic_mysql.sql | sqlite3 academic_sqlite.db

Data Generation ๐Ÿ”จ

Our SQL synthesis is given examples of <QDMR, database, answer> and automatically generates a SQL that executes to the correct answer. The QDMR decompositions are either manually annotated or automatically predicted by a trained QDMR parser.

Begin by copying all relevant sqlite databases to the data_generation directory.

mkdir data_generation/data
mkdir data_generation/data/spider_databases # copy Spider databases here
mkdir data_generation/data/other_databases # copy Academic, IMDB, Yelp and Geo databases here
  1. The SQL synthesis expects a formatted csv file, see example. Note that the SQL query in these files is only used to compute the answer.
  2. This may take several hours, as multiple candidate SQL are being executed on their respective database.
  3. To synthesize SQL from the <QDMR, database, answer> examples run:
python data_generation/main.py \
--input_file input_qdmr_examples.csv \
--output_file qdmr_grounded_sql.csv \
--json_steps True

Synthesized Data

The SQL synthesized using QDMR + answer supervision is available for each dataset in the data/sql_synthesis_results/ directory.

  • data/sql_synthesis_results/gold_qdmr_supervision: contains SQL synthesized using gold QDMRs that are manually annotated
  • data/sql_synthesis_results/predicted_qdmr_supervision: contains SQL synthesized using QDMRs predicted by a trained parser

Models ๐Ÿ—‚๏ธ

QDMR Parser

The QDMR parser is a T5-large sequence-to-sequence model that is finetuned to map questions to their QDMR. The model expects as input two csv files as its train and dev sets. Use the files from the downloaded Break dataset to train the parser. Make sure that you are in the relevant python environment (requirements_qdmr_parser.txt).

To train the QDMR parser configure the following parameters in train.py:

  • data_dir: the path to the directory containing the NL to QDMR datasets
  • training_set_file: name of the train set csv (e.g. break_train.csv)
  • dev_set_file: name of the dev set csv (e.g. break_dev.csv)
  • output_dir: the directory to store the trained model

After configuration, train the model as follows:

TOKENIZERS_PARALLELISM=false CUDA_VISIBLE_DEVICES=0 python src/qdmr_parser/train.py

To test a trained model and store its predictions, configure the following parameters in test.py:

  • checkpoint_path: path to the trained QDMR parser model to be evaluated
  • dev_set_file: name of the dev set csv to generate predictions for
  • predictions_output_file: the output file to store the parser's generated predictions

And run the following command:

TOKENIZERS_PARALLELISM=false CUDA_VISIBLE_DEVICES=0 python src/qdmr_parser/test.py

Text-to-SQL

The text-to-SQL models are T5-large sequence-to-sequence models, finetuned to map questions to executable SQL queries. We compare the models trained on gold SQL queries, annotated by experts, to our synthesized SQL from QDMR and answer supervision.

1. Setup directory

Setup the data for the text-to-SQL experiments as follows:

data
โ”œโ”€โ”€ tables.json			# Spider tables.json
โ””โ”€โ”€ databases
โ”‚   โ””โ”€โ”€ academic			
โ”‚       โ””โ”€โ”€ academic.sqlite	# Sqlite version of the populated Academic database (see downloads)
โ”‚   โ””โ”€โ”€ geo			
โ”‚       โ””โ”€โ”€ geo.sqlite		# Sqlite version of the populated Geo database (see downloads)
โ”‚   โ””โ”€โ”€ imdb			
โ”‚       โ””โ”€โ”€ imdb.sqlite		# Sqlite version of the populated IMDB database (see downloads)
โ”‚   โ””โ”€โ”€ spider_databases 	# Spider databases directory
โ”‚       โ””โ”€โ”€ activity_1
โ”‚           โ””โ”€โ”€ activity_1.sqlite
โ”‚       โ””โ”€โ”€ ...   
โ”‚   โ””โ”€โ”€ yelp			
โ”‚       โ””โ”€โ”€ yelp.sqlite		# Sqlite version of the populated Yelp database (see downloads)
โ””โ”€โ”€ queries
    โ””โ”€โ”€ geo	# See experiments data
        โ”œโ”€โ”€ geo_qdmr_train.json
	โ””โ”€โ”€ geo_qdmr_predicted_train.json
	โ””โ”€โ”€ geo_gold_train.json
	โ””โ”€โ”€ geo_gold_dev.json
	โ””โ”€โ”€ geo_gold_test.json
	โ””โ”€โ”€ geo_gold_train.sql
	โ””โ”€โ”€ geo_gold_dev.sql
	โ””โ”€โ”€ geo_gold_test.sql
    โ””โ”€โ”€ spider
        โ”œโ”€โ”€ spider_qdmr_train.json		# See experiments data
	โ””โ”€โ”€ spider_qdmr_predicted_train.json 	# See experiments data
	โ””โ”€โ”€ spider_gold_train.json 	# Spider training set
	โ””โ”€โ”€ spider_gold_dev.json 	# Spider dev set
	โ””โ”€โ”€ spider_gold_train.sql 	# Spider training set SQL queries
	โ””โ”€โ”€ spider_gold_dev.sql 	# Spider dev set SQL queries

Database files are described in the downloads section. See the experiments section for the exact train and test files.

2. Train model

To train the text-to-SQL model configure its following parameters in train.py:

  • dataset: either spider or geo
  • target_encoding: sql for gold sql and either qdmr_formula or qdmr_sql for the QDMR experiments
  • data_dir: path to the directory containing the experiments data
  • output_dir: the directory to store the trained model
  • db_dir: the directory to store the trained model
  • training_set_file: training set file in the data directory e.g. spider/spider_gold_train.json
  • dev_set_file: dev set file in the data directory e.g. spider/spider_gold_dev.json
  • dev_set_sql: dev set SQL queries in the data directory e.g. spider/spider_gold_dev.sql

Following configuration, to train the model run:

CUDA_VISIBLE_DEVICES=0 python train.py 

3. Test model

To test the text-to-SQL model first configure the relevant parameters and checkpoint_path in test.py. Following the configuration, generate the trained model predictions using:

CUDA_VISIBLE_DEVICES=0 python test.py 

Experiments โš—๏ธ

Data

Gold SQL:

For the Spider experiments we use its original train and dev json and sql files. For Geo880, Academic, IMDB and Yelp we format the original datasets in json files available here.

QDMR Synthesized SQL:

The QDMR text-to-SQL models are not trained directly on the synthesized SQL. Instead, we train on an encoded QDMR representation with its phrase-DB linking (from the SQL synthesis). This representation is automatically mapped to SQL to evaluate the models execution accuracy. To generate these grounded QDMRs we use the output of the data generation phase. The function encoded_grounded_qdmr in src/data_generation/write_encoding.py recieves the json file containing the synthesized SQL examples. It then encodes them as lisp style formulas of QDMR steps and their relevant phrase-DB linking.

For convenience, you can download the encoded QDMR training sets used in our experiments here. These include:

  • qdmr_ground_enc_spider_train.json: 5,349 examples, synthesized using gold QDMR + answer supervision
  • qdmr_ground_enc_predicted_spider_train_few_shot: 5,075 examples, synthesized examples using 700 gold QDMRs, predicted QDMR + answer supervision
  • qdmr_ground_enc_predicted_spider_train_30_db.json: 1,129 examples, synthesized using predicted QDMR + answer supervision
  • qdmr_ground_enc_predicted_spider_train_40_db.json: 1,440 examples, synthesized using predicted QDMR + answer supervision
  • qdmr_ground_enc_predicted_spider_train_40_db_V2.json: 1,552 examples, synthesized using predicted QDMR + answer supervision
  • qdmr_ground_enc_geo880_train.json: 454 examples, synthesized using gold QDMR + answer supervision
  • qdmr_ground_enc_predicted_geo_train_zero_shot.json: 432 examples, synthesized using predicted QDMR + answer supervision

Configurations

The configurations for training the text-to-SQL models on Spider. Other parameters are fixed in train.py.

SQL Gold (Spider):

{'dataset': 'spider',
'target_encoding': 'sql',
'db_dir': 'databases/spider_databases',
'training_set_file': 'queries/spider/spider_gold_train.json',
'dev_set_file': 'queries/spider/spider_gold_dev.json',
'dev_set_sql': 'queries/spider/spider_gold_dev.sql'}

QDMR Gold (Spider):

{'dataset': 'spider',
'target_encoding': 'qdmr_formula',
'db_dir': 'databases/spider_databases',
'training_set_file': 'queries/spider/spider_qdmr_train.json',
'dev_set_file': 'queries/spider/spider_gold_dev.json',
'dev_set_sql': 'queries/spider/spider_gold_dev.sql'}

SQL Predicted (Spider):

{'dataset': 'spider',
'target_encoding': 'qdmr_formula',
'db_dir': `databases/spider_databases',
'training_set_file': 'queries/spider/spider_qdmr_predicted_train.json',
'dev_set_file': 'queries/spider/spider_gold_dev.json',
'dev_set_sql': 'queries/spider/spider_gold_dev.sql'}

The configurations for training the text-to-SQL models on Geo880.

SQL Gold (Geo):

{'dataset': 'geo',
'target_encoding': 'sql',
'db_dir': 'databases',
'training_set_file': 'queries/geo/geo_gold_train.json',
'dev_set_file': 'queries/spider/geo_gold_dev.json',
'dev_set_sql': 'queries/spider/geo_gold_dev.sql'}

QDMR Gold (Geo):

{'dataset': 'geo',
'target_encoding': 'qdmr_sql',
'db_dir': 'databases',
'training_set_file': 'queries/geo/geo_qdmr_train.json',
'dev_set_file': 'queries/spider/geo_gold_dev.json',
'dev_set_sql': 'queries/spider/geo_gold_dev.sql'}

QDMR Predicted (Geo):

{'dataset': 'geo',
'target_encoding': 'qdmr_sql',
'db_dir': 'databases',
'training_set_file': 'queries/geo/geo_qdmr_predicted_train.json',
'dev_set_file': 'queries/spider/geo_gold_dev.json',
'dev_set_sql': 'queries/spider/geo_gold_dev.sql'}

Evaluation

Text-to-SQL model performance is evaluated using SQL execution accuracy in src/text_to_sql/eval_spider.py. The script automatically converts encoded QDMR predictions to SQL before executing them on the target database.

Citation โœ๐Ÿฝ

bibtex
@inproceedings{wolfson-etal-2022-weakly,
    title={"Weakly Supervised Text-to-SQL Parsing through Question Decomposition"},
    author={"Wolfson, Tomer and Deutch, Daniel and Berant, Jonathan"},
    booktitle = {"Findings of the Association for Computational Linguistics: NAACL 2022"},
    year={"2022"},
}

License

This repository and its data is released under the MIT license.

For the licensing of all external datasets and databases used throughout our experiments:

PyTorch implementation for paper "Full-Body Visual Self-Modeling of Robot Morphologies".

Full-Body Visual Self-Modeling of Robot Morphologies Boyuan Chen, Robert Kwiatkowskig, Carl Vondrick, Hod Lipson Columbia University Project Website |

Boyuan Chen 32 Jan 02, 2023
NuPIC Studio is an allยญ-in-ยญone tool that allows users create a HTM neural network from scratch

NuPIC Studio is an allยญ-in-ยญone tool that allows users create a HTM neural network from scratch, train it, collect statistics, and share it among the members of the community. It is not just a visual

HTM Community 93 Sep 30, 2022
load .txt to train YOLOX, same as Yolo others

YOLOX train your data you need generate data.txt like follow format (per line- one image). prepare one data.txt like this: img_path1 x1,y1,x2,y2,clas

LiMingf 18 Aug 18, 2022
EasyMocap is an open-source toolbox for markerless human motion capture from RGB videos.

EasyMocap is an open-source toolbox for markerless human motion capture from RGB videos. In this project, we provide the basic code for fitt

ZJU3DV 2.2k Jan 05, 2023
OBBDetection: an oriented object detection toolbox modified from MMdetection

OBBDetection note: If you have questions or good suggestions, feel free to propose issues and contact me. introduction OBBDetection is an oriented obj

MIXIAOXIN_HO 3 Nov 11, 2022
Final project for Intro to CS class.

Financial Analysis Web App https://share.streamlit.io/mayurk1/fin-web-app-final-project/webApp.py 1. Project Description This project is a technical a

Mayur Khanna 1 Dec 10, 2021
Dictionary Learning with Uniform Sparse Representations for Anomaly Detection

Dictionary Learning with Uniform Sparse Representations for Anomaly Detection Implementation of the Uniform DL Representation for AD algorithm describ

Paul Irofti 1 Nov 23, 2022
A library for low-memory inferencing in PyTorch.

Pylomin Pylomin (PYtorch LOw-Memory INference) is a library for low-memory inferencing in PyTorch. Installation ... Usage For example, the following c

3 Oct 26, 2022
RL algorithm PPO and IRL algorithm AIRL written with Tensorflow.

RL algorithm PPO and IRL algorithm AIRL written with Tensorflow. They have a parallel sampling feature in order to increase computation speed (especially in high-performance computing (HPC)).

Fangjian Li 3 Dec 28, 2021
YOLOv5 Series Multi-backbone, Pruning and quantization Compression Tool Box.

YOLOv5-Compression Update News Requirements ็Žฏๅขƒๅฎ‰่ฃ… pip install -r requirements.txt Evaluation metric Visdrone Model mAP ZhangYuan 719 Jan 02, 2023

Nested cross-validation is necessary to avoid biased model performance in embedded feature selection in high-dimensional data with tiny sample sizes

Pruner for nested cross-validation - Sphinx-Doc Nested cross-validation is necessary to avoid biased model performance in embedded feature selection i

1 Dec 15, 2021
ไธ€ไธชๅ…่ดนๅผ€ๆบไธ€้”ฎๆญๅปบ็š„้€š็”จ้ชŒ่ฏ็ ่ฏ†ๅˆซๅนณๅฐ๏ผŒๅคง้ƒจๅˆ†ๅธธ่ง็š„ไธญ่‹ฑๆ•ฐ้ชŒ่ฏ็ ่ฏ†ๅˆซ้ƒฝๆฒกๅ•ฅ้—ฎ้ข˜ใ€‚

captcha_server ไธ€ไธชๅ…่ดนๅผ€ๆบไธ€้”ฎๆญๅปบ็š„้€š็”จ้ชŒ่ฏ็ ่ฏ†ๅˆซๅนณๅฐ๏ผŒๅคง้ƒจๅˆ†ๅธธ่ง็š„ไธญ่‹ฑๆ•ฐ้ชŒ่ฏ็ ่ฏ†ๅˆซ้ƒฝๆฒกๅ•ฅ้—ฎ้ข˜ใ€‚ ไฝฟ็”จๆ–นๆณ• python = 3.8 ไปฅไธŠ็Žฏๅขƒ pip install -r requirements.txt -i https://pypi.douban.com/simple gun

Sml2h3 189 Dec 02, 2022
The code for our paper Semi-Supervised Learning with Multi-Head Co-Training

Semi-Supervised Learning with Multi-Head Co-Training (PyTorch) Abstract Co-training, extended from self-training, is one of the frameworks for semi-su

cmc 6 Dec 04, 2022
Neural-fractal - Create Fractals Using Complex-Valued Neural Networks!

Neural Fractal Create Fractals Using Complex-Valued Neural Networks! Home Page Features Define Dynamical Systems Using Complex-Valued Neural Networks

Amirabbas Asadi 10 Dec 17, 2022
The datasets and code of ACL 2021 paper "Aspect-Category-Opinion-Sentiment Quadruple Extraction with Implicit Aspects and Opinions".

Aspect-Category-Opinion-Sentiment (ACOS) Quadruple Extraction This repo contains the data sets and source code of our paper: Aspect-Category-Opinion-S

NUSTM 144 Jan 02, 2023
GenGNN: A Generic FPGA Framework for Graph Neural Network Acceleration

GenGNN: A Generic FPGA Framework for Graph Neural Network Acceleration Stefan Abi-Karam*, Yuqi He*, Rishov Sarkar*, Lakshmi Sathidevi, Zihang Qiao, Co

Sharc-Lab 19 Dec 15, 2022
Code repository accompanying the paper "On Adversarial Robustness: A Neural Architecture Search perspective"

On Adversarial Robustness: A Neural Architecture Search perspective Preparation: Clone the repository: https://github.com/tdchaitanya/nas-robustness.g

Chaitanya Devaguptapu 4 Nov 10, 2022
OptaPlanner wrappers for Python. Currently significantly slower than OptaPlanner in Java or Kotlin.

OptaPy is an AI constraint solver for Python to optimize the Vehicle Routing Problem, Employee Rostering, Maintenance Scheduling, Task Assignment, School Timetabling, Cloud Optimization, Conference S

OptaPy 211 Jan 02, 2023
Code to reproduce the results for Compositional Attention

Compositional-Attention This repository contains the official implementation for the paper Compositional Attention: Disentangling Search and Retrieval

Sarthak Mittal 58 Nov 30, 2022
SEAN: Image Synthesis with Semantic Region-Adaptive Normalization (CVPR 2020, Oral)

SEAN: Image Synthesis with Semantic Region-Adaptive Normalization (CVPR 2020 Oral) Figure: Face image editing controlled via style images and segmenta

Peihao Zhu 579 Dec 30, 2022