Code for the paper "Flexible Generation of Natural Language Deductions"

Overview

Flexible Generation of Natural Language Deductions

a.k.a. ParaPattern

https://arxiv.org/abs/2104.08825

Kaj Bostrom, Lucy Zhao, Swarat Chaudhuri, and Greg Durrett

This repository contains all the code needed to replicate the experiments from the paper, and additionally provides a set of tools to put together new natural language deduction operations from scratch.

In the data/ folder, you'll find all the data used to train and evaluate our models, already preprocessed and ready to go, with the exception of the MNLI dataset due to its size - if you want to replicate our MNLI-BART baseline, you'll need to download a copy of MNLI and run data/mnli/filter.py for yourself. The data folder also contains several generic conversion scripts, which you may find useful for processing operation training examples, as well as paraphrase.py, which does automatic paraphrase generation if you pass it a path to a suitable sequence-to-sequence paraphrasing model checkpoint, e.g. https://huggingface.co/tuner007/pegasus_paraphrase

In the modeling/ folder, you'll find the fine-tuning code needed to train operation models, as well as scripts to run all the evaluations described in the paper. Just make sure you're on transformers version 4.2.1, not the latest version, since several of the scripts are carefully built around bugs that have since been patched out of the library.

If you have access to multiple GPUs, you can change the --nproc_per_node argument in finetune.sh from 1 to whatever number of GPUs you want to use for training.

In the dep_search/ folder, you'll find tools to perform bulk dependency parsing using spaCy, as well as scripts to index the resulting stream of dependency trees and scrape them using dependency patterns. For reference, the templates used in the paper live in dep_search/templates/. If you want to write your own templates, a good place to start is playing around with the dependency pattern DSL using dep_search.struct_query.parse_query - if you're wondering how to express a given syntactic pattern, you can start by calling dep_search.struct_query.Head.from_spacy on a spaCy token; this will construct a syntactic pattern without any slots from that token's dependency subtree. Printing patterns this way is a great way to familiarize yourself with dependency structure if you need to brush up on that stuff (I can never remember what POS tag/arc label conventions spaCy uses so I was printing out a lot of these trees while I was developing the templates we used in the paper).

Unfortunately, I never got around to optimizing the syntactic search process all that well, so for large free-text corpora (~=100M sentences or more) it can take a day or two to do a full run of parsing and indexing using dep_search/scrape.py. I find a good way to iterate on a pattern is to start by casting a really broad net, and then narrow down your pattern on a subset of those results so that you don't have to re-index your whole original corpus each time you make a small change to a template.

Owner
Kaj Bostrom
PhD student at UT Austin Computer Science. Studying NLP (reading comprehension/language understanding in particular)
Kaj Bostrom
Galois is an auto code completer for code editors (or any text editor) based on OpenAI GPT-2.

Galois is an auto code completer for code editors (or any text editor) based on OpenAI GPT-2. It is trained (finetuned) on a curated list of approximately 45K Python (~470MB) files gathered from the

Galois Autocompleter 91 Sep 23, 2022
Full Spectrum Bioinformatics - a free online text designed to introduce key topics in Bioinformatics using the Python

Full Spectrum Bioinformatics is a free online text designed to introduce key topics in Bioinformatics using the Python programming language. The text is written in interactive Jupyter Notebooks, whic

Jesse Zaneveld 33 Dec 28, 2022
A simple Flask site that allows users to create, update, and delete posts in a database, as well as perform basic NLP tasks on the posts.

A simple Flask site that allows users to create, update, and delete posts in a database, as well as perform basic NLP tasks on the posts.

Ian 1 Jan 15, 2022
NLP library designed for reproducible experimentation management

Welcome to the Transfer NLP library, a framework built on top of PyTorch to promote reproducible experimentation and Transfer Learning in NLP You can

Feedly 290 Dec 20, 2022
Unsupervised text tokenizer focused on computational efficiency

YouTokenToMe YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE)

VK.com 847 Dec 19, 2022
ADCS - Automatic Defect Classification System (ADCS) for SSMC

Table of Contents Table of Contents ADCS Overview Summary Operator's Guide Demo System Design System Logic Training Mode Production System Flow Folder

Tam Zher Min 2 Jun 24, 2022
Simple NLP based project without any use of AI

Simple NLP based project without any use of AI

Shripad Rao 1 Apr 26, 2022
A Chinese to English Neural Model Translation Project

ZH-EN NMT Chinese to English Neural Machine Translation This project is inspired by Stanford's CS224N NMT Project Dataset used in this project: News C

Zhenbang Feng 29 Nov 26, 2022
Deep learning for NLP crash course at ABBYY.

Deep NLP Course at ABBYY Deep learning for NLP crash course at ABBYY. Suggested textbook: Neural Network Methods in Natural Language Processing by Yoa

Dan Anastasyev 597 Dec 18, 2022
Reformer, the efficient Transformer, in Pytorch

Reformer, the Efficient Transformer, in Pytorch This is a Pytorch implementation of Reformer https://openreview.net/pdf?id=rkgNKkHtvB It includes LSH

Phil Wang 1.8k Dec 30, 2022
A simple command line tool for text to image generation, using OpenAI's CLIP and a BigGAN

artificial intelligence cosmic love and attention fire in the sky a pyramid made of ice a lonely house in the woods marriage in the mountains lantern

Phil Wang 2.3k Jan 01, 2023
TensorFlow code and pre-trained models for BERT

BERT ***** New March 11th, 2020: Smaller BERT Models ***** This is a release of 24 smaller BERT models (English only, uncased, trained with WordPiece

Google Research 32.9k Jan 08, 2023
숭실대학교 컴퓨터학부 전공종합설계프로젝트

✨ 시각장애인을 위한 버스도착 알림 장치 ✨ 👀 개요 현대 사회에서 대중교통 위치 정보를 이용하여 사람들이 간단하게 이용할 대중교통의 정보를 얻고 쉽게 대중교통을 이용할 수 있다. 해당 정보는 각종 어플리케이션과 대중교통 이용시설에서 위치 정보를 제공하고 있지만 시각

taegyun 3 Jan 25, 2022
Creating a chess engine using GPT-3

GPT3Chess Creating a chess engine using GPT-3 Code for my article : https://towardsdatascience.com/gpt-3-play-chess-d123a96096a9 My game (white) vs GP

19 Dec 17, 2022
Final Project for the Intel AI Readiness Boot Camp NLP (Jan)

NLP Boot Camp (Jan) Synopsis Full Name: Prameya Mohanty Name of your School: Delhi Public School, Rourkela Class: VIII Title of the Project: iTransect

TheCodingHub 1 Feb 01, 2022
Synthetic data for the people.

zpy: Synthetic data in Blender. Website • Install • Docs • Examples • CLI • Contribute • Licence Abstract Collecting, labeling, and cleaning data for

Zumo Labs 253 Dec 21, 2022
Fake Shakespearean Text Generator

Fake Shakespearean Text Generator This project contains an impelementation of stateful Char-RNN model to generate fake shakespearean texts. Files and

Recep YILDIRIM 1 Feb 15, 2022
voice2json is a collection of command-line tools for offline speech/intent recognition on Linux

Command-line tools for speech and intent recognition on Linux

Michael Hansen 988 Jan 04, 2023
Wikipedia-Utils: Preprocessing Wikipedia Texts for NLP

Wikipedia-Utils: Preprocessing Wikipedia Texts for NLP This repository maintains some utility scripts for retrieving and preprocessing Wikipedia text

Masatoshi Suzuki 44 Oct 19, 2022