WarpRNNT loss ported in Numba CPU/CUDA for Pytorch

Last update: Oct 22, 2022

Overview

RNNT loss in Pytorch - Numba JIT compiled (warprnnt_numba)

Warp RNN Transducer loss for ASR in PyTorch, ported from HawkAaron/warp-transducer. It is a replica of the stable version in the NVIDIA Neural Modules repository (NVIDIA NeMo).

NOTE: The code here includes experimental extensions and may be unstable. For a long-term supported RNNT loss for PyTorch, use the version in NeMo.

Supported Features

Currently supports:

WarpRNNT loss in pytorch for CPU / CUDA (jit compiled)
FastEmit
Gradient Clipping (from Torch Audio)

Installation

You will need PyTorch (usually the latest version) and Numba, installed in a Conda environment. A pip-only environment is untested but may work.

# Follow the installation instructions on the PyTorch website to install PyTorch (with CUDA if required)
conda install -c conda-forge numba
# or, to get the latest version:
conda update -c conda-forge numba

# Then install this library
pip install --upgrade git+https://github.com/titu1994/warprnnt_numba.git

Usage

Import warprnnt_numba and use RNNTLossNumba. If you plan to use the CUDA version of the loss, first check that your installed CUDA version is compatible with your Numba version using numba_utils.
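
As a rough, library-agnostic sanity check, the sketch below reports the installed Numba version and whether Numba can see a CUDA GPU. It is not the numba_utils helper mentioned above, whose exact interface is not shown here. The function name numba_cuda_status is hypothetical; it relies only on numba.__version__ and numba.cuda.is_available().

```python
def numba_cuda_status():
    """Return the Numba version and whether Numba reports a usable CUDA GPU.

    Hypothetical helper for illustration only; it is not part of warprnnt_numba.
    """
    try:
        import numba
        from numba import cuda
    except ImportError:
        # Numba is not installed, so the JIT-compiled loss cannot be used at all.
        return {"numba_version": None, "cuda_available": False}

    return {
        "numba_version": numba.__version__,
        # is_available() returns False when no CUDA driver or GPU can be found.
        "cuda_available": bool(cuda.is_available()),
    }


if __name__ == "__main__":
    print(numba_cuda_status())
```

If cuda_available comes back False on a machine that has a GPU, the CUDA toolkit and the Numba build are probably mismatched, and the CPU path is the safer choice until that is resolved.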

A very slow explicit-loop NumPy/PyTorch implementation of the loss is also included, for verifying that results match exactly.

import torch
import numpy as np
import warprnnt_numba

# Define the loss function
fastemit_lambda = 0.001  # any float >= 0.0
loss_pt = warprnnt_numba.RNNTLossNumba(blank=4, reduction='sum', fastemit_lambda=fastemit_lambda)

# --------------
# Example usage

device = "cuda"
torch.random.manual_seed(0)

# Assume Batchsize=2, Acoustic Timesteps = 8, Label Timesteps = 5 (including BLANK=BOS token),
# and Vocabulary size of 5 tokens (including RNNT BLANK)
acts = torch.randn(2, 8, 5, 5, device=device, requires_grad=True)
sequence_length = torch.tensor([5, 8], dtype=torch.int32,
                               device=device)  # acoustic sequence length. One element must be == acts.shape[1].

# Let 0 be the MASK/PAD value, 1-3 be token ids, and 4 represent the RNNT BLANK token.
# The BLANK token is also used as the BOS token here, but BOS can be a different token.
# Let the first sample be padded with 0 (actual length = 3). The loss is computed according to the supplied `label_lengths`,
# so no gradients are produced from the 4th index onwards (0-based indexing).
labels = torch.tensor([[4, 1, 1, 3, 0], [4, 2, 2, 3, 1]], dtype=torch.int32, device=device)
label_lengths = torch.tensor([3, 4], dtype=torch.int32,
                             device=device)  # Lengths here must be WITHOUT the BOS token.

# On CUDA, log_softmax is computed internally and efficiently (saving memory and time).
# On CPU it must be computed explicitly; forward() of the loss does this for you automatically.
# The last (-1) vocabulary index is the RNNT blank token here.
loss_func = warprnnt_numba.RNNTLossNumba(blank=4, reduction='none',
                                         fastemit_lambda=0.0, clamp=0.0)
loss = loss_func(acts, labels, sequence_length, label_lengths)
print("Loss :", loss)
loss.sum().backward()

# When inspecting the gradients, look at grads[0]:
# Since it was padded in T (sequence_length=5 < T=8), there are gradients only for grads[0, :5, :, :].
# Since it was padded in U (label_lengths=3+1 < U=5), there are gradients only for grads[0, :5, :3+1, :].
grads = acts.grad
print("Gradients of activations :")
print(grads)
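
The library's own explicit-loop reference implementation is not reproduced here. As a rough illustration of what such a check computes, the sketch below evaluates the RNNT negative log-likelihood for a single sample with plain Python loops. The function names are hypothetical, and the sketch assumes labels are passed WITHOUT a BOS token and that the caller has already sliced the scores to the valid T and U + 1 for that sample. It omits FastEmit and gradient clamping.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one vocabulary vector."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def logaddexp(a, b):
    """log(exp(a) + exp(b)), treating -inf as probability zero."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def rnnt_loss_reference(acts, labels, blank):
    """Negative log-likelihood of `labels` under the RNNT lattice.

    acts:   nested lists of unnormalized scores with shape T x (U + 1) x V
    labels: U token ids, without BOS (an assumption of this sketch)
    blank:  index of the RNNT blank token in the vocabulary
    """
    T, U = len(acts), len(labels)
    assert all(len(frame) == U + 1 for frame in acts), "expected U + 1 label steps"
    logp = [[log_softmax(acts[t][u]) for u in range(U + 1)] for t in range(T)]

    # alpha[t][u]: log-probability of having emitted labels[:u] on reaching frame t.
    alpha = [[-math.inf] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            # Arrive from (t - 1, u) by emitting blank, which advances time.
            from_blank = alpha[t - 1][u] + logp[t - 1][u][blank] if t > 0 else -math.inf
            # Arrive from (t, u - 1) by emitting the next label.
            from_label = alpha[t][u - 1] + logp[t][u - 1][labels[u - 1]] if u > 0 else -math.inf
            alpha[t][u] = logaddexp(from_blank, from_label)

    # Every alignment ends with a final blank emitted from (T - 1, U).
    return -(alpha[T - 1][U] + logp[T - 1][U][blank])


if __name__ == "__main__":
    # Uniform scores with T=2, U=1, V=2: two alignments of three steps each,
    # so the loss is approximately log(4).
    uniform = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
    print(rnnt_loss_reference(uniform, labels=[1], blank=0))
```

Comparing a value like this against RNNTLossNumba with reduction='none' on a tiny input is one way to spot indexing or padding mistakes before training, at the cost of quadratic Python loops.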

Tests

If no GPU is present, the tests run CPU-only checks. If GPUs are present, all tests are also run once on cuda:0.

pytest tests/

Requirements

PyTorch >= 1.10. Older versions might work but are untested.
Numba: the minimum required version is 0.53.0; 0.54+ is preferred.
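
One way to check these minimums before importing the loss is sketched below. The helper names parse_version, meets_minimum, and check_installed are hypothetical, and the comparison deliberately ignores pre-release and local suffixes such as rc1 or +cu113. It also assumes the installed distributions are named torch and numba, which may not hold for every packaging setup.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Keep only the leading numeric components, e.g. '1.10.0+cu113' -> (1, 10, 0)."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if `installed` is at least `minimum`, comparing numeric parts only."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '1.10' and '1.10.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_installed(package, minimum):
    """Look up an installed distribution and compare it against `minimum`."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return False
    return meets_minimum(installed, minimum)


if __name__ == "__main__":
    for name, minimum in (("torch", "1.10"), ("numba", "0.53.0")):
        print(name, ">=", minimum, ":", check_installed(name, minimum))
```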

Comments

GPU under utilization due to low occupancy.

Thank you for warprnnt_numba. I got the warning (shown below) when I used this loss in my code. Is this a known issue? How can it be debugged and solved?

Thank you!

opened by jiay7 2
Fix runtime speed
Improve runtime speed of numba loss

Fix an issue with moving the costs tensor data from llForward to the PyTorch data view in Numba.

This data movement alone requires a linear loop (scaling with batch size) that costs roughly 10x as much as the kernels themselves.

The fix is a small kernel that copies the data and updates the costs.
opened by titu1994 0

Releases(v0.4.0)

v0.4.0(Jan 30, 2022)
Supports

Simple RNNT loss with Atomic Locks implementation

Improvements

Improve runtime speed of numba loss

Fix an issue with moving the costs tensor data from llForward to the PyTorch data view in Numba.

This data movement alone requires a linear loop (scaling with batch size) that costs roughly 10x as much as the kernels themselves.

The fix is a small kernel that copies the data and updates the costs.

v0.2.2(Jan 24, 2022)
Initial release of Warp RNNT loss with Numba JIT compile (CPU/CUDA)

Supports:

Pytorch RNNT loss (CPU and JIT compiled CUDA)

FastEmit

Gradient clipping



Owner

Somshubra Majumdar
