FewBit — a library for memory efficient training of large neural networks

Overview

FewBit is a library for memory-efficient training of large neural networks. Its efficiency originates from storage optimizations applied to the backward pass and from reducing the memory footprint of tensors saved between the forward and backward passes. Specifically, the library provides its own implementations of common activation functions and of the linear layer, since these contribute the most to memory usage during training. The optimized linear layer saves up to 15-20% of memory, and the optimized activation functions save up to 15-30%, with negligible loss in performance (see [1][2] for details).

The table below compares different optimizations applied to the RoBERTa model. The compression rate of the randomized linear layer is 20% (it keeps only 20% of the input), and the GELU approximation uses only 3 bits.

  #   Task   Batch Size   GELU      Linear Layer   Peak Memory, GiB   Saving, %
  1   MRPC   128          Vanilla   Vanilla        11.30              0.0
  2   MRPC   128          3-bit     Vanilla        9.75               13.8
  3   MRPC   128          Vanilla   Randomized     9.20               18.6
  4   MRPC   128          3-bit     Randomized     7.60               32.7

Usage

The fewbit library implements basic activation functions with backward-pass optimizations that reduce the memory footprint during model training. All activation functions exported by the library can be used as drop-in replacements for most of the standard activation functions implemented in PyTorch. The common pattern is to replace the torch.nn qualifier with the fewbit package qualifier.

import fewbit
import torch as T

model = T.nn.Sequential(
    ...,
    fewbit.GELU(bits=3),  # Use the 3-bit GELU approximation.
    ...,
)

For pre-trained models, one can rebuild the model with the map_module routine, which walks the model tree recursively and allows individual modules or activation functions to be replaced. The user only needs to supply a suitable constructor for the new module. As an example, the code below replaces all default linear layers with randomized ones.

from fewbit import RandomizedLinear
from fewbit.util import convert_linear, map_module

converter = lambda x: convert_linear(x, RandomizedLinear, proj_dim_ratio=0.1)
new_model = map_module(old_model, converter)  # In-place model construction.
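
Here the proj_dim_ratio argument presumably controls the compression rate of the randomized projection; under that reading, the 20% compression reported in the table above would correspond to proj_dim_ratio=0.2.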

[Figure: quantized gradients of activation functions]

Installation

The simplest and preferred way to install the library is from PyPI.

pip install -U fewbit

FewBit is written in Python, but it implements some operations in C++/CUDA to achieve better performance, so building from source requires the CUDA Toolkit and CMake as a build system. The latest release can be installed with the following command.

pip install -U git+https://github.com/SkoltechAI/fewbit.git

List of Activation Functions

The library supports the following activation functions.

Piece-wise Activation Functions

All activation functions in this section have a 1-bit derivative; they differ only in their band. Determining which gradient domain an input falls into requires two comparisons against the band boundaries. The complete list of activation functions is leaky_relu, relu, threshold, hardsigmoid, hardtanh, relu6, hardshrink, and softshrink.
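
For intuition, below is a minimal PyTorch sketch of the idea behind a 1-bit backward pass, using relu as an example. This is an illustration only, not the library's packed C++/CUDA kernel: a real implementation stores the mask as packed bits rather than one boolean byte per element.

import torch

class OneBitReLU(torch.autograd.Function):
    # Sketch only: save a boolean mask (1 bit of information per element)
    # between forward and backward passes instead of the full input tensor.

    @staticmethod
    def forward(ctx, x):
        mask = x > 0
        ctx.save_for_backward(mask)
        return x.clamp_min(0.0)

    @staticmethod
    def backward(ctx, grad_output):
        (mask,) = ctx.saved_tensors
        return grad_output * mask  # derivative of relu is either 0 or 1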

Continuous Activation Functions

All continuous activation functions can be divided into three classes according to their parity: odd, even, and neither even nor odd. The parity property enables a small optimization that increases the precision of the approximation. The complete list of reimplemented activation functions in this category is celu, elu, hardswish, logsigmoid, mish, selu, sigmoid, silu, softplus, softsign, tanh, and tanhshrink.
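
The general recipe for these functions is to approximate the derivative by a step function with a handful of levels and to store only the small level index per element. The sketch below illustrates the idea for gelu; the boundaries and levels are placeholder values rather than the library's optimized ones, and a real implementation packs the codes and keeps the lookup tables on the right device.

import torch
import torch.nn.functional as F

# Placeholder step-function approximation of the GELU derivative;
# the library optimizes boundaries and levels to minimize the error.
BOUNDARIES = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0])
LEVELS = torch.tensor([0.0, -0.05, 0.3, 0.7, 1.05, 1.0])

class QuantizedGELUSketch(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        codes = torch.bucketize(x, BOUNDARIES)  # small integer code per element
        ctx.save_for_backward(codes.to(torch.uint8))
        return F.gelu(x)

    @staticmethod
    def backward(ctx, grad_output):
        (codes,) = ctx.saved_tensors
        return grad_output * LEVELS[codes.long()]  # look up quantized derivative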

List of Modules

The RandomizedLinear module is a replacement for the default Linear module. It uses the power of approximate (randomized) matrix multiplication to save memory.
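
Conceptually, the saving comes from storing a random projection of the layer input instead of the input itself and using it to approximate the weight gradient. The following is a minimal sketch of that idea under stated assumptions; the names and details are not the library's actual implementation (which, among other things, would store an RNG seed rather than the projection matrix itself).

import torch

class RandomizedLinearSketch(torch.autograd.Function):
    # Sketch only: for y = x @ W^T, the weight gradient dW = dy^T @ x is
    # approximated by (S @ dy)^T @ (S @ x), where S is a random projection
    # with E[S^T S] = I. Only S and S @ x are saved, not the full input x.

    @staticmethod
    def forward(ctx, x, weight, proj_dim):
        batch = x.shape[0]
        s = torch.randn(proj_dim, batch, device=x.device) / proj_dim ** 0.5
        ctx.save_for_backward(s, s @ x, weight)
        return x @ weight.t()

    @staticmethod
    def backward(ctx, grad_output):
        s, sx, weight = ctx.saved_tensors
        grad_weight = (s @ grad_output).t() @ sx  # approximates grad_output^T @ x
        grad_input = grad_output @ weight         # exact, does not need x
        return grad_input, grad_weight, None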

Assembly

The preliminary step depends on one's PyTorch distribution and available tooling. Building the native components requires CMake and a build system such as Make or Ninja. If PyTorch is installed system-wide, the following step is not necessary; otherwise, one likely needs to add the search path for CMake modules to the environment as follows.

export CMAKE_PREFIX_PATH="$(python -c 'import torch.utils; print(torch.utils.cmake_prefix_path)')"

The next step is useful in a development environment. It builds the PyTorch operator library in the source tree (option --inplace) with forced CUDA support (option --cuda). By default, CUDA support is not forced.

python setup.py build_ext --inplace --cuda

With options similar to those of the previous step, one can build a binary wheel distribution of the package.

python setup.py bdist_wheel --inplace --cuda
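
Assuming the default setuptools layout, the resulting wheel lands in the dist/ directory and can then be installed with pip, for example:

pip install dist/fewbit-*.whl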

Development Environment with Docker

In order to develop on different platforms, we use a custom Docker image for a non-privileged user, based on the NVIDIA CUDA image. The image contains the pre-built native extension and is parametrized by the user name and user ID in the host system; the latter is crucial when binding host volumes.

docker build -t fewbit --build-arg UID=$(id -u) .
docker run --rm -ti -e TERM=$TERM fewbit
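
To work on the sources inside the container, one would typically bind the repository as a host volume; a sketch (the mount point here is an assumption, not fixed by the image):

docker run --rm -ti -e TERM=$TERM -v "$(pwd)":/workspace fewbit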

Citation

Please cite the following papers if the library is used in an academic paper.

@misc{bershatsky2022memoryefficient,
    title={{M}emory-{E}fficient {B}ackpropagation through {L}arge {L}inear {L}ayers},
    author={Daniel Bershatsky and Aleksandr Mikhalev and Alexandr Katrutsa and Julia Gusak and Daniil Merkulov and Ivan Oseledets},
    year={2022},
    eprint={2201.13195},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
}

@misc{novikov2022fewbit,
    title={{F}ew-{B}it {B}ackward: {Q}uantized {G}radients of {A}ctivation {F}unctions for {M}emory {F}ootprint {R}eduction},
    author={Georgii Novikov and Daniel Bershatsky and Julia Gusak and Alex Shonenkov and Denis Dimitrov and Ivan Oseledets},
    year={2022},
    eprint={2202.00441},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
}

License

© The FewBit authors, 2022–present. Licensed under the BSD 3-Clause License. See the AUTHORS and LICENSE files for more details.¹

Footnotes

  1. The work was supported by Sber AI and the Analytical center under the RF Government (subsidy agreement 000000D730321P5Q0002, Grant No. 70-2021-00145 02.11.2021).
