This is an early in-development version of training CLIP models with hivemind.

Overview

A transformer that does not hog your GPU memory

This is an early in-development codebase: if you want a stable and documented hivemind codebase, look at CALM or dalle-hivemind.

Readme under construction

LeanTransformer implements a specific version of transformer with two goals in mind:

  • using as little GPU memory as possible
  • stable training for very large models

The core philosophy of LeanTransformer is to replace torch.autograd with grad students. Automatic differentiation is great if you want to test ideas quickly, less so if a single training run can cost over $4 million (or >1000 years in grad school).

Related work: GSO

Our implementation partially replaces automatic differentiation with Grad Student Optimization (GSO) - a biologically inspired black box optimization algorithm. In the past, GSO has seen widespread adoption thanks to its strong theoretical foundations and unparalleled cost efficiency (Chom et al). Previous successfully applied GSO for hyperparameter tuning and natural language generation. To the best of our knowledge we are the first work to successfully apply distributed fault-tolerant GSO for optimizing the memory footprint of transformers. We summarize our findings below:

Memory saving features:

Other features:

Not implemented:

  • In reversible mode, one can further save memory by computing backward in chunks:
    • a few tokens at a time for feedforward layers, since grad(concat(mlp(x1), mlp(x2))) = concat(grad(mlp(x1)), grad(mlp(x2)))
    • a few heads at a time for self-attention, since grad(head1 + head2) = grad(head1) + grad(head2), where head1 and head2 are attention outputs after linear projection
  • Attention could be computed in O(sqrt(n)) memory (Rabe et al, 2021)
  • No sparse or linear attention: they are great for very long sequences. However, for large models, attention is not a bottleneck in typical NLP and vision tasks (tested gpt-3 up to length 4096).
  • Per-block grad scaling as described in (Ramesh et al, 2021) - we rely on Sandwich Norm to maintain stability up to 96 layers (did not test more). However, it would be nice to have per-block scaling to avoid the need for an extra LayerNorm.
  • Something else that we missed - please find us on discord.

A day will come a day when we explain all these modifications and provide instructions on how to tune them. But it is not this day!. Until then, we'll happily answer any questions on our discord.

Running the code

[under constructuion] - use the instructions from CALM readme

Acknowledgements:

  • Most of the architecture and stability optimizations were learned through the BigScience research workshop
  • YSDA community helped us survive through the early messy versions of this code
  • NeuroPark trained the first practical model (SahajBERT-XL, SoTA in bengali, details here)
  • TODO DALLE community: at least mention the demo, maybe we end up training something even cooler
  • TODO NCAI community: ask them how best to acknowledge them
  • TODO Hugging Face: ask them how best to acknowledge them
  • TODO Personal: stas00, samyam, jared, more? (this does not include co-authors: Tim,Lucile,Quentin,Denis,Gennady,etc; also, this does not include hivemind contributors)
Owner
<a href=[email protected]">
Official Pytorch implementation for video neural representation (NeRV)

NeRV: Neural Representations for Videos (NeurIPS 2021) Project Page | Paper | UVG Data Hao Chen, Bo He, Hanyu Wang, Yixuan Ren, Ser-Nam Lim, Abhinav S

hao 214 Dec 28, 2022
SpiroMask: Measuring Lung Function Using Consumer-Grade Masks

SpiroMask: Measuring Lung Function Using Consumer-Grade Masks Anonymised repository for paper submitted for peer review at ACM HEALTH (October 2021).

0 May 10, 2022
Source code of our BMVC 2021 paper: AniFormer: Data-driven 3D Animation with Transformer

AniFormer This is the PyTorch implementation of our BMVC 2021 paper AniFormer: Data-driven 3D Animation with Transformer. Haoyu Chen, Hao Tang, Nicu S

24 Nov 02, 2022
Multiple style transfer via variational autoencoder

ST-VAE Multiple style transfer via variational autoencoder By Zhi-Song Liu, Vicky Kalogeiton and Marie-Paule Cani This repo only provides simple testi

13 Oct 29, 2022
The VeriNet toolkit for verification of neural networks

VeriNet The VeriNet toolkit is a state-of-the-art sound and complete symbolic interval propagation based toolkit for verification of neural networks.

9 Dec 21, 2022
Robotics environments

Robotics environments Details and documentation on these robotics environments are available in OpenAI's blog post and the accompanying technical repo

Farama Foundation 121 Dec 28, 2022
Crossover Learning for Fast Online Video Instance Segmentation (ICCV 2021)

TL;DR: CrossVIS (Crossover Learning for Fast Online Video Instance Segmentation) proposes a novel crossover learning paradigm to fully leverage rich c

Hust Visual Learning Team 79 Nov 25, 2022
A more easy-to-use implementation of KPConv based on PyTorch.

A more easy-to-use implementation of KPConv This repo contains a more easy-to-use implementation of KPConv based on PyTorch. Introduction KPConv is a

Zheng Qin 36 Dec 29, 2022
Informal Persian Universal Dependency Treebank

Informal Persian Universal Dependency Treebank (iPerUDT) Informal Persian Universal Dependency Treebank, consisting of 3000 sentences and 54,904 token

Roya Kabiri 0 Jan 05, 2022
Boosted CVaR Classification (NeurIPS 2021)

Boosted CVaR Classification Runtian Zhai, Chen Dan, Arun Sai Suggala, Zico Kolter, Pradeep Ravikumar NeurIPS 2021 Table of Contents Quick Start Train

Runtian Zhai 4 Feb 15, 2022
Creating multimodal multitask models

Fusion Brain Challenge The English version of the document can be found here. Обновления 01.11 Мы выкладываем пример данных, аналогичных private test

Sber AI 43 Nov 28, 2022
Lightweight, Python library for fast and reproducible experimentation :microscope:

Steppy What is Steppy? Steppy is a lightweight, open-source, Python 3 library for fast and reproducible experimentation. Steppy lets data scientist fo

minerva.ml 134 Jul 10, 2022
Hierarchical Few-Shot Generative Models

Hierarchical Few-Shot Generative Models Giorgio Giannone, Ole Winther This repo contains code and experiments for the paper Hierarchical Few-Shot Gene

Giorgio Giannone 6 Dec 12, 2022
Source code for CAST - Crisis Domain Adaptation Using Sequence-to-sequence Transformers (Accepted to ISCRAM 2021, CorePaper).

Source code for CAST: Crisis Domain Adaptation UsingSequence-to-sequenceTransformers (Paper, BibTeX, Accepted to ISCRAM 2021, CorePaper) Quick start D

Congcong Wang 0 Jul 14, 2021
This repository provides some of the code implemented and the data used for the work proposed in "A Cluster-Based Trip Prediction Graph Neural Network Model for Bike Sharing Systems".

cluster-link-prediction This repository provides some of the code implemented and the data used for the work proposed in "A Cluster-Based Trip Predict

Bárbara 0 Dec 28, 2022
Code and real data for the paper "Counterfactual Temporal Point Processes", available at arXiv.

counterfactual-tpp This is a repository containing code and real data for the paper Counterfactual Temporal Point Processes. Pre-requisites This code

Networks Learning 11 Dec 09, 2022
95.47% on CIFAR10 with PyTorch

Train CIFAR10 with PyTorch I'm playing with PyTorch on the CIFAR10 dataset. Prerequisites Python 3.6+ PyTorch 1.0+ Training # Start training with: py

5k Dec 30, 2022
Towards the D-Optimal Online Experiment Design for Recommender Selection (KDD 2021)

Towards the D-Optimal Online Experiment Design for Recommender Selection (KDD 2021) Contact 0 Jan 11, 2022

This project uses reinforcement learning on stock market and agent tries to learn trading. The goal is to check if the agent can learn to read tape. The project is dedicated to hero in life great Jesse Livermore.

Reinforcement-trading This project uses Reinforcement learning on stock market and agent tries to learn trading. The goal is to check if the agent can

Deepender Singla 1.4k Dec 22, 2022
Lightweight Salient Object Detection in Optical Remote Sensing Images via Feature Correlation

CorrNet This project provides the code and results for 'Lightweight Salient Object Detection in Optical Remote Sensing Images via Feature Correlation'

Gongyang Li 13 Nov 03, 2022