PyTorch Language Model for 1-Billion Word (LM1B / GBW) Dataset

Last update: Nov 04, 2022

Overview

PyTorch Large-Scale Language Model

A large-scale PyTorch language model trained on the 1-Billion Word (LM1B / GBW) dataset.

Latest Results

39.98 perplexity after 5 training epochs using an LSTM language model with the Adam optimizer
Trained in ~26 hours on 1 Nvidia V100 GPU (~5.1 hours per epoch) with a batch size of 2048 (~10.7 GB of GPU memory)
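
For reference, perplexity is the exponential of the average per-token negative log-likelihood. The sketch below only illustrates that definition. The function name is hypothetical and is not taken from this repository.

```python
import math

def perplexity(token_nlls):
    """Return exp(mean negative log-likelihood), with NLLs given in nats."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that gives every token probability 1/40 has a perplexity of about 40.
uniform_nlls = [math.log(40.0)] * 1000
print(perplexity(uniform_nlls))
```

A lower value means the model spreads its probability over fewer candidate words per position.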

Previous Results

46.47 Perplexity after 5 training epochs on a 1-layer, 2048-unit, 256-projection LSTM Language Model [3]
Trained for 3 days using 1 Nvidia P100 GPU (~12.5 hours per epoch)
Implemented Sampled Softmax and Log-Uniform Sampler functions
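
As a rough illustration of what a log-uniform sampler does, the sketch below draws word IDs with probability P(k) = (log(k + 2) - log(k + 1)) / log(N + 1), which favors small IDs. It assumes word IDs are sorted by decreasing frequency. The function names are hypothetical, and the Cython sampler in this repository may be implemented differently.

```python
import math
import random

def log_uniform_prob(k, vocab_size):
    """Probability of drawing word ID k from a vocabulary of vocab_size IDs."""
    return (math.log(k + 2) - math.log(k + 1)) / math.log(vocab_size + 1)

def log_uniform_sample(vocab_size, rng):
    """Draw one word ID by inverting the log-uniform CDF."""
    u = rng.random()  # uniform in [0, 1)
    k = int(math.exp(u * math.log(vocab_size + 1))) - 1
    return min(k, vocab_size - 1)  # guard against floating-point overshoot

rng = random.Random(0)
vocab_size = 1000
samples = [log_uniform_sample(vocab_size, rng) for _ in range(10000)]
print("share of ID 0:", samples.count(0) / len(samples))
print("expected     :", log_uniform_prob(0, vocab_size))
```

Sampled softmax then scores the true word against a small set of such negative samples instead of the full vocabulary.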

GPU Hardware Requirement

Type	LM Memory Size	GPU
w/o tied weights	~9 GB	Nvidia 1080 Ti, Nvidia Titan X
w/ tied weights [6]	~7 GB	Nvidia 1070 or higher

There is an option to tie the word embedding and softmax weight matrices together to save GPU memory.
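
A back-of-the-envelope estimate of the weight storage that tying removes is sketched below. The vocabulary size is an assumption (a commonly reported figure for this dataset, not stated in this README), and the estimate covers 32-bit weights only. Gradients and optimizer state add further savings on top.

```python
def matrix_bytes(rows, cols, bytes_per_value=4):
    """Size in bytes of a dense rows x cols matrix of 32-bit floats."""
    return rows * cols * bytes_per_value

VOCAB_SIZE = 793_471   # assumption: not given in this README
EMBEDDING_SIZE = 256   # from the hyper-parameter table
PROJECTION_SIZE = 256  # from the hyper-parameter table

# Tying only works when both matrices have the same shape.
assert EMBEDDING_SIZE == PROJECTION_SIZE

embedding_bytes = matrix_bytes(VOCAB_SIZE, EMBEDDING_SIZE)
softmax_bytes = matrix_bytes(VOCAB_SIZE, PROJECTION_SIZE)

untied_bytes = embedding_bytes + softmax_bytes
tied_bytes = embedding_bytes  # one shared matrix serves both roles
saved_gib = (untied_bytes - tied_bytes) / 2**30
print(f"weights saved by tying: {saved_gib:.2f} GiB")
```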

Hyper-Parameters [3]

Parameter	Value
# Epochs	5
Training Batch Size	128
Evaluation Batch Size	1
BPTT	20
Embedding Size	256
Hidden Size	2048
Projection Size	256
Tied Embedding + Softmax	False
# Layers	1
Optimizer	AdaGrad
Learning Rate	0.10
Gradient Clipping	1.00
Dropout	0.01
Weight-Decay (L2 Penalty)	1e-6
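
The table above can be captured as a single configuration object. This is a sketch only: the class and field names are hypothetical and do not correspond to the command-line flags of this repository.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LMConfig:
    """Hyper-parameters from the table above (field names are illustrative)."""
    epochs: int = 5
    train_batch_size: int = 128
    eval_batch_size: int = 1
    bptt: int = 20
    embedding_size: int = 256
    hidden_size: int = 2048
    projection_size: int = 256
    tied: bool = False
    num_layers: int = 1
    optimizer: str = "adagrad"
    learning_rate: float = 0.10
    grad_clip: float = 1.00
    dropout: float = 0.01
    weight_decay: float = 1e-6

    @property
    def tokens_per_step(self):
        """Number of target tokens processed in one training step."""
        return self.train_batch_size * self.bptt

config = LMConfig()
print(config.tokens_per_step)
```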

Setup - Torch Data Format

Download Google Billion Word Dataset for Torch - Link
Run "process_gbw.py" on the "train_data.th7" file to create the "train_data.sid" file
Install Cython and build the Log_Uniform Sampler
Convert the Torch data tensors to PyTorch tensor format (requires PyTorch v0.4.1)

I use the GBW data preprocessed for the Torch framework. (See Torch GBW) Each data tensor contains all the words in its data partition. The "train_data.sid" file marks the start position and length of each independent sentence. The preprocessing step and the "train_data.sid" file speed up loading the massive training data.

Data Tensors - (test_data, valid_data, train_data, train_small, train_tiny) - (#words x 2) matrix - (sentence id, word id)
Sentence ID Tensor - (#sentences x 2) matrix - (start position, sentence length)
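
The two layouts above can be illustrated with plain Python lists standing in for the tensors. The sketch assumes 0-based start positions and uses hypothetical helper names. It is not the code in "process_gbw.py", which may index differently.

```python
def build_sid(data):
    """Scan (sentence id, word id) rows once and record (start, length) per sentence."""
    sid = []
    start = 0
    for pos in range(1, len(data) + 1):
        # A sentence ends at the end of the data or where the sentence id changes.
        if pos == len(data) or data[pos][0] != data[start][0]:
            sid.append((start, pos - start))
            start = pos
    return sid

def get_sentence(data, sid, index):
    """Return the word ids of one sentence without scanning the whole data list."""
    start, length = sid[index]
    return [word_id for _, word_id in data[start:start + length]]

# Toy data partition: three sentences of 3, 2, and 1 words.
data = [(1, 10), (1, 11), (1, 12), (2, 20), (2, 21), (3, 30)]
sid = build_sid(data)
print(sid)
print(get_sentence(data, sid, 1))
```

With the index precomputed, any sentence can be fetched by a single slice, which is why the ".sid" file avoids rescanning the training data at load time.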

Setup - Original Data Format

Download 1-Billion Word Dataset - Link

The Torch Data Format loads the entire dataset at once, so it requires at least 32 GB of memory. The original format partitions the dataset into smaller chunks, but it runs slower.
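
The memory trade-off of the chunked format can be sketched with a generator that reads one shard at a time. The sketch assumes each shard is a text file with one whitespace-tokenized sentence per line. The file names and the helper are hypothetical, not this repository's loader.

```python
import os
import tempfile

def iter_sentences(shard_dir):
    """Yield tokenized sentences shard by shard, holding one line in memory at a time."""
    for name in sorted(os.listdir(shard_dir)):
        path = os.path.join(shard_dir, name)
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens

# Demo with two tiny stand-in shards written to a temporary directory.
with tempfile.TemporaryDirectory() as shard_dir:
    shards = [("shard-00001", "the cat sat\n\n"), ("shard-00002", "a dog ran\n")]
    for name, text in shards:
        with open(os.path.join(shard_dir, name), "w", encoding="utf-8") as handle:
            handle.write(text)
    sentences = list(iter_sentences(shard_dir))

print(sentences)
```

Streaming keeps peak memory low, at the cost of re-reading and re-tokenizing the files on every epoch.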


Owner

Ryan Spring
