Code related to "Have Your Text and Use It Too! End-to-End Neural Data-to-Text Generation with Semantic Fidelity" paper

Overview

DataTuner

You have just found the DataTuner. This repository provides tools for fine-tuning language models for a task.

Installation

Environment Creation

Assuming you have an existing conda setup, you can setup the environment with the following script. In order to activate the conda environment within the bash script, you need the location of the conda.sh file:

bash setup.sh  ~/miniconda3/etc/profile.d/conda.sh

You can update your existing environment:

conda env update -f=environment.yml

To start development, activate your environment:

conda activate finetune

Alternatively, you can always use the python binary with the absolute path, e.g.: ~/miniconda3/envs/finetune/bin/python.

Data

For any task you want to fine-tune on, you need the data to be a json file containing a list of json objects, one per data point. For example:

[
  {
    "question": "question text 1",
    "query": "query 1"
  },
  {
    "question": "question text 2",
    "query": "query 2 with [SpecialToken example]"
  }
]

The library assumes that you have placed your data in a single directory with three files: train.json, validation.json, and test.json.

Configuration

Now that we have the data in shape, we need to create a new task configuration file that specifies how we want the data to be formatted and what fields should be considered. You can create new config files in the folder src/datatuner/lm/task_configs.

A typical config file would look as follows:

{
"name": "dataset_name",
"data_shape": [
        {
            "id": "<question>",
            "type": "special",
            "learn": false
        },
        {
            "id": "question",
            "type": "text",
            "learn": false
        },
        {
            "id": "<query>",
            "type": "special",
            "learn": false
        },
        {
            "id": "query",
            "type": "text",
            "learn": true,
            "metrics": [
                "match"
            ]
        }
    ],
"extra_special_tokens": ["[SpecialToken"],
"extra_fields": []
}

For each item in the data shape:

  • type (required): special if special token, text if normal text.
  • id (required): the special token ID if type is special; the key for the text in the json data if type is text
  • learn (required): whether to allow the model to learn this part of the text. If false, the model masks that part during fine-tuning.
  • metrics (optional): the list of metrics that the model should compute upon evaluation. Each metric should have a corresponding function with the same name in metrics.py.
  • converter (optional): the name of the converter function in converters.py to apply on that text field after reading the text from the file.

The value of extra_special_tokens is a list of special tokens to be added to the vocabulary. Alternatively (especially if the list is too long or is generated automatically), you can create a text file with one special token per line and pass that as an argument during training via the --special_tokens_file argument.

The value of extra_fields is a list of additional fields to include from the input json files to output during evaluation, aside from the main fields used as inputs/outputs.

Training

The training script train.py can be used in single GPU or multi GPU settings.

cd src/datatuner/lm

# single gpu
python train.py --model_checkpoint ~/data/openai-gpt/  --dataset_path ../../../data/my_dataset/  --task_config ./task_configs/my_task_config.json --n_epoch 3 --lr 1e-5

# multi gpu
python -m torch.distributed.launch --nproc_per_node=4 train.py --model_checkpoint ~/data/openai-gpt/  --dataset_path ../../../data/my_dataset/  --task_config ./task_configs/my_task_config.json --n_epoch 3 --lr 1e-5

Evaluating the Model

You can run the following to evaluate the model on any test set. The data format is the same as the training data. Notice that you have to currently specify the model_type parameter matching the model you're loading:

cd src/datatuner/lm

python ./evaluate.py --task_config ./task_configs/my_task_config.json --model_checkpoint runs/2020-01-01_01-01-01  --filename ../../../data/my_dataset/test.json --max_length 200 --model_type gpt --top_k 1

# or if you just want to evaluate the latest model you trained 
RUN=$(ls -t ./runs | head -1) && python ./evaluate.py --task_config ./task_configs/my_task_config.json --model_checkpoint runs/$RUN  --filename ../../../data/my_dataset/test.json --max_length 200 --model_type gpt  --top_k 1

# or if you want to use the latest intermediate checkpoint while the model is training:
RUN=$(ls -t ./runs | head -1) && CHECKPOINT=$(ls -t ./runs/$RUN/checkpoint* | head -1) && cp $CHECKPOINT runs/$RUN/pytorch_model.bin

During evaluation, the outputs that do not exactly match the expected outputs will be printed. Also, the metrics will be printed (a dictionary with keys <metric_name>_<field_name>). At the end of evaluation, you will find the file with all the generated ouputs in the file eval_results/<run_folder_name>/<task_name>_<test_file_name>_<model_type>_generated.json.

Interacting with the model

You can also interact with the models. The client will ask you to input the fields required, and it will generate the fields it learnt.

cd src/datatuner/lm

python ./evaluate.py --task_config ./task_configs/my_task_config.json --model_checkpoint runs/2020-01-01_01-01-01  --max_length 200 --model_type gpt  --top_k 1 --input

# or if you just want to evaluate the latest model you trained 
RUN=$(ls -t ./runs | head -1) && python ./evaluate.py --task_config ./task_configs/my_task_config.json --model_checkpoint runs/$RUN  --max_length 200 --model_type gpt  --top_k 1 --input
A webcam-based 3x3x3 rubik's cube solver written in Python 3 and OpenCV.

Qbr Qbr, pronounced as Cuber, is a webcam-based 3x3x3 rubik's cube solver written in Python 3 and OpenCV. 🌈 Accurate color detection 🔍 Accurate 3x3x

Kim 金可明 502 Dec 29, 2022
Code for AAAI 2021 paper: Sequential End-to-end Network for Efficient Person Search

This repository hosts the source code of our paper: [AAAI 2021]Sequential End-to-end Network for Efficient Person Search. SeqNet achieves the state-of

Zj Li 218 Dec 31, 2022
This project proposes a camera vision based cursor control system, using hand moment captured from a webcam through a landmarks of hand by using Mideapipe module

This project proposes a camera vision based cursor control system, using hand moment captured from a webcam through a landmarks of hand by using Mideapipe module

Chandru 2 Feb 20, 2022
A curated list of papers, code and resources pertaining to image composition

A curated list of resources including papers, datasets, and relevant links pertaining to image composition.

BCMI 391 Dec 30, 2022
A tool to enhance your old/damaged pictures built using python & opencv.

Breathe Life into your Old Pictures Table of Contents About The Project Getting Started Prerequisites Usage Contact Acknowledgments About The Project

Shah Anwaar Khalid 5 Dec 16, 2021
Use Convolutional Recurrent Neural Network to recognize the Handwritten line text image without pre segmentation into words or characters. Use CTC loss Function to train.

Handwritten Line Text Recognition using Deep Learning with Tensorflow Description Use Convolutional Recurrent Neural Network to recognize the Handwrit

sushant097 224 Jan 07, 2023
Augmenting Anchors by the Detector Itself

Augmenting Anchors by the Detector Itself Introduction It is difficult to determine the scale and aspect ratio of anchors for anchor-based object dete

4 Nov 06, 2022
Amazing 3D explosion animation using Pygame module.

3D Explosion Animation 💣 💥 🔥 Amazing explosion animation with Pygame. 💣 Explosion physics An Explosion instance is made of a set of Particle objec

Dylan Tintenfich 12 Mar 11, 2022
Deskewing images with slanted content

skew_correction De-skewing images with slanted content by finding the deviation using Canny Edge Detection. To Run: In python 3.6, from deskew import

13 Aug 27, 2022
Code for the paper STN-OCR: A single Neural Network for Text Detection and Text Recognition

STN-OCR: A single Neural Network for Text Detection and Text Recognition This repository contains the code for the paper: STN-OCR: A single Neural Net

Christian Bartz 496 Jan 05, 2023
Image processing using OpenCv

Image processing using OpenCv Write a program that opens the webcam, and the user selects one of the following on the video: ✅ If the user presses the

M.Najafi 4 Feb 18, 2022
Generic framework for historical document processing

dhSegment dhSegment is a tool for Historical Document Processing. Its generic approach allows to segment regions and extract content from different ty

Digital Humanities Laboratory 343 Dec 24, 2022
This is a real life mario project using python and mediapipe

real-life-mario This is a real life mario project using python and mediapipe How to run to run this just run - realMario.py file requirements This req

Programminghut 42 Dec 22, 2022
Demo for the paper "Overlap-aware low-latency online speaker diarization based on end-to-end local segmentation"

Streaming speaker diarization Overlap-aware low-latency online speaker diarization based on end-to-end local segmentation by Juan Manuel Coria, Hervé

Juanma Coria 185 Jan 01, 2023
Code for the paper "Controllable Video Captioning with an Exemplar Sentence"

SMCG Code for the paper "Controllable Video Captioning with an Exemplar Sentence" Introduction We investigate a novel and challenging task, namely con

10 Dec 04, 2022
Color Picker and Color Detection tool for METR4202

METR4202 Color Detection Help This is sample code that can be used for the METR4202 project demo. There are two files provided, both running on Python

Miguel Valencia 1 Oct 23, 2021
A python programusing Tkinter graphics library to randomize questions and answers contained in text files

RaffleOfQuestions Um programa simples em python, utilizando a biblioteca gráfica Tkinter para randomizar perguntas e respostas contidas em arquivos de

Gabriel Ferreira Rodrigues 1 Dec 16, 2021
This is a repository to learn and get more computer vision skills, make robotics projects integrating the computer vision as a perception tool and create a lot of awesome advanced controllers for the robots of the future.

This is a repository to learn and get more computer vision skills, make robotics projects integrating the computer vision as a perception tool and create a lot of awesome advanced controllers for the

Elkin Javier Guerra Galeano 17 Nov 03, 2022
ISI's Optical Character Recognition (OCR) software for machine-print and handwriting data

VistaOCR ISI's Optical Character Recognition (OCR) software for machine-print and handwriting data Publications "How to Efficiently Increase Resolutio

ISI Center for Vision, Image, Speech, and Text Analytics 21 Dec 08, 2021