Dataloader tools for language modelling

Overview

Installation:

pip install lm_dataloader

Design Philosophy

  • A library to unify lm dataloading at large scale

  • Simple interface, any tokenizer can be integrated

  • Minimal changes needed from small -> large scale (many multiple GPU nodes)

  • follows fairseq / megatron's 'mmap' dataformat, but with improvements. Those being:

    • Easily combine multiple datasets
    • Easily split a dataset into train / val / test splits
    • Easily build a weighted dataset out of a list of existing ones along with weights.
    • unified into a single 'file' (which is actually a directory containing a .bin / .idx file)
    • index files that are built on the fly are hidden files, leaving less mess in the directory.
    • More straightforward interface, better documentation.
    • Inspectable with a command line tool
    • Can load from urls
    • Can load from S3 buckets
    • Can load from GCS buckets
    • Can tokenize on the fly instead of preprocessing

Misc. TODO: - [ ] Option to set mpu globally (for distributed dataloading)

Example usage

To tokenize a dataset contained in a .jsonl file (where the text to be tokenized can be accessed under the 'text' key):

import lm_dataloader as lmdl
from transformers import GPT2TokenizerFast 

jsonl_path = "test.jsonl"
output = "my_dataset.lmd"
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')

lmdl.encode(
    jsonl_path,
    output_prefix=output,
    tokenize_fn=tokenizer.encode,
    tokenizer_vocab_size=len(tokenizer),
    eod_token=tokenizer.eos_token_id,
)

This will create a dataset at "my_dataset.lmd" which can be loaded as an indexed torch dataset like so:

from lm_dataloader import LMDataset
from transformers import GPT2TokenizerFast 

tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')
seq_length = tokenizer.model_max_length # or whatever the sequence length of your model is

dataset = LMDataset("my_dataset.lmd", seq_length=seq_length)

# peek at 0th index
print(dataset[0])

Command line utilities

There are also command line utilities provided to inspect / merge datasets, e.g:

lm-dataloader inspect my_dataset.lmd

Launches an interactive terminal to inspect the data in my_dataset.lmd

And:

lm-dataloader merge my_dataset.lmd,my_dataset_2.lmd new_dataset.lmd

Merges the datasets at "my_dataset.lmd" and "my_dataset_2.lmd" into a new file at "new_dataset.lmd".

DeepProbLog is an extension of ProbLog that integrates Probabilistic Logic Programming with deep learning by introducing the neural predicate.

DeepProbLog DeepProbLog is an extension of ProbLog that integrates Probabilistic Logic Programming with deep learning by introducing the neural predic

KU Leuven Machine Learning Research Group 94 Dec 18, 2022
IPATool-py: download ipa easily

IPATool-py Python version of IPATool! Installation pip3 install -r requirements.txt Usage Quickstart: download app with specific bundleId into DIR: p

159 Dec 30, 2022
Bottom-up attention model for image captioning and VQA, based on Faster R-CNN and Visual Genome

bottom-up-attention This code implements a bottom-up attention model, based on multi-gpu training of Faster R-CNN with ResNet-101, using object and at

Peter Anderson 1.3k Jan 09, 2023
Bridging the Gap between Label- and Reference based Synthesis(ICCV 2021)

Bridging the Gap between Label- and Reference based Synthesis(ICCV 2021) Tensorflow implementation of Bridging the Gap between Label- and Reference-ba

huangqiusheng 8 Jul 13, 2022
PySOT - SenseTime Research platform for single object tracking, implementing algorithms like SiamRPN and SiamMask.

PySOT is a software system designed by SenseTime Video Intelligence Research team. It implements state-of-the-art single object tracking algorit

STVIR 4.1k Dec 29, 2022
OptaPlanner wrappers for Python. Currently significantly slower than OptaPlanner in Java or Kotlin.

OptaPy is an AI constraint solver for Python to optimize the Vehicle Routing Problem, Employee Rostering, Maintenance Scheduling, Task Assignment, School Timetabling, Cloud Optimization, Conference S

OptaPy 211 Jan 02, 2023
Code for Generating Disentangled Arguments with Prompts: A Simple Event Extraction Framework that Works

GDAP Code for Generating Disentangled Arguments with Prompts: A Simple Event Extraction Framework that Works Environment Python (verified: v3.8) CUDA

45 Oct 29, 2022
[ICCV 2021 Oral] Mining Latent Classes for Few-shot Segmentation

Mining Latent Classes for Few-shot Segmentation Lihe Yang, Wei Zhuo, Lei Qi, Yinghuan Shi, Yang Gao. This codebase contains baseline of our paper Mini

Lihe Yang 66 Nov 29, 2022
Official Implementation of DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation

DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation [Arxiv] [Paper] As acquiring pixel-wise an

Lukas Hoyer 305 Dec 29, 2022
Creating Multi Task Models With Keras

Creating Multi Task Models With Keras About The Project! I used the keras and Tensorflow Library, To build a Deep Learning Neural Network to Creating

Srajan Chourasia 4 Nov 28, 2022
KoCLIP: Korean port of OpenAI CLIP, in Flax

KoCLIP This repository contains code for KoCLIP, a Korean port of OpenAI's CLIP. This project was conducted as part of Hugging Face's Flax/JAX communi

Jake Tae 100 Jan 02, 2023
Creative Applications of Deep Learning w/ Tensorflow

Creative Applications of Deep Learning w/ Tensorflow This repository contains lecture transcripts and homework assignments as Jupyter Notebooks for th

Parag K Mital 1.5k Dec 30, 2022
A hobby project which includes a hand-gesture based virtual piano using a mobile phone camera and OpenCV library functions

Overview This is a hobby project which includes a hand-gesture controlled virtual piano using an android phone camera and some OpenCV library. My moti

Abhinav Gupta 1 Nov 19, 2021
Implementation of "Deep Implicit Templates for 3D Shape Representation"

Deep Implicit Templates for 3D Shape Representation Zerong Zheng, Tao Yu, Qionghai Dai, Yebin Liu. arXiv 2020. This repository is an implementation fo

Zerong Zheng 144 Dec 07, 2022
Video Instance Segmentation using Inter-Frame Communication Transformers (NeurIPS 2021)

Video Instance Segmentation using Inter-Frame Communication Transformers (NeurIPS 2021) Paper Video Instance Segmentation using Inter-Frame Communicat

Sukjun Hwang 81 Dec 29, 2022
Many Class Activation Map methods implemented in Pytorch for CNNs and Vision Transformers. Including Grad-CAM, Grad-CAM++, Score-CAM, Ablation-CAM and XGrad-CAM

Class Activation Map methods implemented in Pytorch pip install grad-cam ⭐ Tested on many Common CNN Networks and Vision Transformers. ⭐ Includes smoo

Jacob Gildenblat 6.6k Jan 06, 2023
A Keras implementation of CapsNet in the paper: Sara Sabour, Nicholas Frosst, Geoffrey E Hinton. Dynamic Routing Between Capsules

NOTE This implementation is fork of https://github.com/XifengGuo/CapsNet-Keras , applied to IMDB texts reviews dataset. CapsNet-Keras A Keras implemen

Lauro Moraes 5 Oct 23, 2022
Image processing in Python

scikit-image: Image processing in Python Website (including documentation): https://scikit-image.org/ Mailing list: https://mail.python.org/mailman3/l

Image Processing Toolbox for SciPy 5.2k Dec 31, 2022
NBEATSx: Neural basis expansion analysis with exogenous variables

NBEATSx: Neural basis expansion analysis with exogenous variables We extend the NBEATS model to incorporate exogenous factors. The resulting method, c

Cristian Challu 100 Dec 31, 2022
A U-Net combined with a variational auto-encoder that is able to learn conditional distributions over semantic segmentations.

Probabilistic U-Net + **Update** + An improved Model (the Hierarchical Probabilistic U-Net) + LIDC crops is now available. See below. Re-implementatio

Simon Kohl 498 Dec 26, 2022