PyTorch Language Model for 1-Billion Word (LM1B / GBW) Dataset

Overview

PyTorch Large-Scale Language Model

A Large-Scale PyTorch Language Model trained on the 1-Billion Word (LM1B) / (GBW) dataset

Latest Results

  • 39.98 Perplexity after 5 training epochs using LSTM Language Model with Adam Optimizer
  • Trained in ~26 hours using 1 Nvidia V100 GPU (~5.1 hours per epoch) with 2048 batch size (~10.7 GB GPU memory)

Previous Results

  • 46.47 Perplexity after 5 training epochs on a 1-layer, 2048-unit, 256-projection LSTM Language Model [3]
  • Trained for 3 days using 1 Nvidia P100 GPU (~12.5 hours per epoch)
  • Implemented Sampled Softmax and Log-Uniform Sampler functions

GPU Hardware Requirement

Type LM Memory Size GPU
w/o tied weights ~9 GB Nvidia 1080 TI, Nvidia Titan X
w/ tied weights [6] ~7 GB Nvidia 1070 or higher
  • There is an option to tie the word embedding and softmax weight matrices together to save GPU memory.

Hyper-Parameters [3]

Parameter Value
# Epochs 5
Training Batch Size 128
Evaluation Batch Size 1
BPTT 20
Embedding Size 256
Hidden Size 2048
Projection Size 256
Tied Embedding + Softmax False
# Layers 1
Optimizer AdaGrad
Learning Rate 0.10
Gradient Clipping 1.00
Dropout 0.01
Weight-Decay (L2 Penalty) 1e-6

Setup - Torch Data Format

  1. Download Google Billion Word Dataset for Torch - Link
  2. Run "process_gbw.py" on the "train_data.th7" file to create the "train_data.sid" file
  3. Install Cython framework and build Log_Uniform Sampler
  4. Convert Torch data tensors to PyTorch tensor format (Requires Pytorch v0.4.1)

I leverage the GBW data preprocessed for the Torch framework. (See Torch GBW) Each data tensor contains all the words in data partition. The "train_data.sid" file marks the start and end positions for each independent sentence. The preprocessing step and "train_data.sid" file speeds up loading the massive training data.

  • Data Tensors - (test_data, valid_data, train_data, train_small, train_tiny) - (#words x 2) matrix - (sentence id, word id)
  • Sentence ID Tensor - (#sentences x 2) matrix - (start position, sentence length)

Setup - Original Data Format

  1. Download 1-Billion Word Dataset - Link

The Torch Data Format loads the entire dataset at once, so it requires at least 32 GB of memory. The original format partitions the dataset into smaller chunks, but it runs slower.

References

  1. Exploring the Limits of Language Modeling Github
  2. Factorization Tricks for LSTM networks Github
  3. Efficient softmax approximation for GPUs Github
  4. Candidate Sampling
  5. Torch GBW
  6. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling
Owner
Ryan Spring
A PhD student researching Deep Learning, Locality-Sensitive Hashing, and other large-scale machine learning algorithms.
Ryan Spring
Generate Cartoon Images using Generative Adversarial Network

AvatarGAN ✨ Generate Cartoon Images using DC-GAN Deep Convolutional GAN is a generative adversarial network architecture. It uses a couple of guidelin

Aakash Jhawar 50 Dec 29, 2022
Code for Low-Cost Algorithmic Recourse for Users With Uncertain Cost Functions

EMS-COLS-recourse Initial Code for Low-Cost Algorithmic Recourse for Users With Uncertain Cost Functions Folder structure: data folder contains raw an

Prateek Yadav 1 Nov 25, 2022
KGDet: Keypoint-Guided Fashion Detection (AAAI 2021)

KGDet: Keypoint-Guided Fashion Detection (AAAI 2021) This is an official implementation of the AAAI-2021 paper "KGDet: Keypoint-Guided Fashion Detecti

Qian Shenhan 35 Dec 29, 2022
SlotRefine: A Fast Non-Autoregressive Model forJoint Intent Detection and Slot Filling

SlotRefine: A Fast Non-Autoregressive Model for Joint Intent Detection and Slot Filling Reference Main paper to be cited (Di Wu et al., 2020) @article

Moore 34 Nov 03, 2022
An unofficial implementation of "Unpaired Image Super-Resolution using Pseudo-Supervision." CVPR2020

UnpairedSR An unofficial implementation of "Unpaired Image Super-Resolution using Pseudo-Supervision." CVPR2020 turn RCAN(modified) -- xmodel(xilinx

JiaKui Hu 10 Oct 28, 2022
Constrained Language Models Yield Few-Shot Semantic Parsers

Constrained Language Models Yield Few-Shot Semantic Parsers This repository contains tools and instructions for reproducing the experiments in the pap

Microsoft 43 Nov 23, 2022
A machine learning benchmark of in-the-wild distribution shifts, with data loaders, evaluators, and default models.

WILDS is a benchmark of in-the-wild distribution shifts spanning diverse data modalities and applications, from tumor identification to wildlife monitoring to poverty mapping.

P-Lambda 437 Dec 30, 2022
Using deep learning model to detect breast cancer.

Breast-Cancer-Detection Breast cancer is the most frequent cancer among women, with around one in every 19 women at risk. The number of cases of breas

1 Feb 13, 2022
Log4j JNDI inj. vuln scanner

Log-4-JAM - Log 4 Just Another Mess Log4j JNDI inj. vuln scanner Requirements pip3 install requests_toolbelt Usage # make sure target list has http/ht

Ashish Kunwar 66 Nov 09, 2022
[CVPR'21 Oral] Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning [CVPR'21, Oral] By Zhicheng Huang*, Zhaoyang Zeng*, Yupan H

Multimedia Research 196 Dec 13, 2022
Adversarial Attacks are Reversible via Natural Supervision

Adversarial Attacks are Reversible via Natural Supervision ICCV2021 Citation @InProceedings{Mao_2021_ICCV, author = {Mao, Chengzhi and Chiquier

Computer Vision Lab at Columbia University 20 May 22, 2022
The code for SAG-DTA: Prediction of Drug–Target Affinity Using Self-Attention Graph Network.

SAG-DTA The code is the implementation for the paper 'SAG-DTA: Prediction of Drug–Target Affinity Using Self-Attention Graph Network'. Requirements py

Shugang Zhang 7 Aug 02, 2022
Deep Crop Rotation

Deep Crop Rotation Paper (to come very soon!) We propose a deep learning approach to modelling both inter- and intra-annual patterns for parcel classi

Félix Quinton 5 Sep 23, 2022
A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)

MMF is a modular framework for vision and language multimodal research from Facebook AI Research. MMF contains reference implementations of state-of-t

Facebook Research 5.1k Jan 04, 2023
Just playing with getting VQGAN+CLIP running locally, rather than having to use colab.

Just playing with getting VQGAN+CLIP running locally, rather than having to use colab.

Nerdy Rodent 2.3k Jan 04, 2023
Official code for "EagerMOT: 3D Multi-Object Tracking via Sensor Fusion" [ICRA 2021]

EagerMOT: 3D Multi-Object Tracking via Sensor Fusion Read our ICRA 2021 paper here. Check out the 3 minute video for the quick intro or the full prese

Aleksandr Kim 276 Dec 30, 2022
Neural Reprojection Error: Merging Feature Learning and Camera Pose Estimation

Neural Reprojection Error: Merging Feature Learning and Camera Pose Estimation This is the official repository for our paper Neural Reprojection Error

Hugo Germain 78 Dec 01, 2022
Implementation of "Meta-rPPG: Remote Heart Rate Estimation Using a Transductive Meta-Learner"

Meta-rPPG: Remote Heart Rate Estimation Using a Transductive Meta-Learner This repository is the official implementation of Meta-rPPG: Remote Heart Ra

Eugene Lee 137 Dec 13, 2022
Point detection through multi-instance deep heatmap regression for sutures in endoscopy

Suture detection PyTorch This repo contains the reference implementation of suture detection model in PyTorch for the paper Point detection through mu

artificial intelligence in the area of cardiovascular healthcare 3 Jul 16, 2022
A tensorflow=1.13 implementation of Deconvolutional Networks on Graph Data (NeurIPS 2021)

GDN A tensorflow=1.13 implementation of Deconvolutional Networks on Graph Data (NeurIPS 2021) Abstract In this paper, we consider an inverse problem i

4 Sep 13, 2022