Run Effective Large Batch Contrastive Learning on Limited Memory GPU

Overview

Gradient Cache

Gradient Cache is a simple technique for unlimitedly scaling contrastive learning batch far beyond GPU memory constraint. This means training that used to take heavy hardware, e.g. 8 V100 GPU, can be done on a single GPU. In addition, Gradient Cache allow users to replace big RAM GPU with much more cost efficient high FLOP low RAM cards.

This repo holds a generic Pytorch implementation of Gradient Cache described in our paper Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup .

@inproceedings{gao2021scaling,
     title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
     author={Luyu Gao, Yunyi Zhang, Jiawei Han, Jamie Callan},
     booktitle ={Proceedings of the 6th Workshop on Representation Learning for NLP},
     year={2021},
}

Gradient Cache has also been integrated into dense passage retrieval (DPR). Checkout our GC-DPR toolkit.

Installation

The package depends only on pytorch>=1.6. To install, clone this repo and run pip.

git clone https://github.com/luyug/GradCache
cd GradCache
pip install .

For development,

pip install --editable .

Usage

Gradient caching functionalities are implemented in GradCache class. If you are developing a new project instead of patching an old one, also checkout our functional approach for a effort reduced approach.

Initialization

The class's __init__ method defines the cache and has several functional parameters *_fn for easy adjust of model behaviors. Alternatively you can also sub-class GradCache.

grad_cache.GradCache(  
  models: List[nn.Module],  
  chunk_sizes: Union[int, List[int]],  
  loss_fn: Callable[..., Tensor],  
  split_input_fn: Callable[[Any, int], Any] = None,  
  get_rep_fn: Callable[..., Tensor] = None,  
  fp16: bool = False,  
  scaler: GradScaler = None,  
)

models - A list of encoder models to be updated with with the Gradient Cache.

chunk_sizes - An integer indicating chunk size. Or a list of integers of chunk size for each model. This controls for each model the sub-batch size to run forward-backward pass and should be set based on available GPU memory. A value too small will leave the GPU under utilized.

loss_fn - A loss function that takes representation tensors of number equal to number of models in models and arbitrary numbers of keyword arguments. It should compute loss based on the input tensors, and in no case modify the input tensors' relations in the autograd graph, which are later relied upon to create the gradient cache.

split_input_fn - An optional function that split generic model input into chunks based on defined chunk_sizes. If not provided, this class will try its best to split the inputs of supported types. See split_inputs function.

get_rep_fn - An optional function that takes generic model output and return representation tensors. If not provided, the generic output is assumed to be the representation tensor.

fp16 - If True, run mixed precision training, which requires scaler to also be set.

scaler - A GradScaler object for automatic mixed precision training.

Cache Gradient Step

To run a cached gradient computatoin step, call cache_step function,

cache_step(  
  *model_inputs,  
  no_sync_except_last: bool = False,  
  **loss_kwargs  
)

Run a single gradient cache step. Upon function return, updates are computed for each model in self.models with gradient populated on the weights, as if the model_inputs are run as a huge single batch on sufficiently large hardware. Calling an GradCache object with __call__ will also invoke this function.

model_inputs - List of inputs to each encoder model. Should be in similar order as self.models.

no_sync_except_last - If True, under distributed setup, for each model, only trigger gradient reduction across processes for the last sub-batch's forward-backward pass. This could come in handy when dealing with a) large model, and/or b) non trivial number of sub-batches.

loss_kwargs - Additional keyword arguments to the loss function loss_fn. This is intended to enable flexible loss computation (thanks to dynamic graph in Pytorch) such as reduction, weighting, etc. Potentially, using loss_kwargs you can incorporate outputs from those encoder models not tracked by the cache.

Return - loss, the current steps loss scaler tensor (detached from the graph).

Natively Supported Input Types

  • x: Tensor - will be passed in as model(x)
  • x: List[Tensor] - will be passed in as model(*x)
  • x: Dict[str, Tensor] (or UserDict[str, Tensor]) - will be passed in as model(**x)
  • x: Tuple[List[Tensor], Dict[str, Tensor]] - will be passed in as model(*x[0], **x[1])

Other generic input are not fully supported, we perform model call using the following heuristics,

  • x: List[Any] - will be passed in as model(*x)
  • x: Dict[str, Any] - will be passed in as model(**x)
  • x: Tuple[List[Any], Dict[str, Any]] - will be passed in as model(*x[0], **x[1])

To run with them, split_input_fn should be specified during cache initialization to break these inputs into smaller batches. In some rare cases, you may also need to override get_input_tensors when its heuristic can not grab enough tensors that covers all cuda devices that hold some tensors in the input.

Example Usage with Huggingface Transformers

Learning a Bi-encoder

Say we want to learn a embedding space of labels and text. Consider the following four pairs. (In practice, you will have many more and much longer text entries.)

labels = ['fruit', 'meat', 'school', 'company']
texts = [
  'this is an apple', 
  'steak should be cooked medium rare', 
  'cmu is pittsburgh', 
  'apple sells laptop'
]

Initialize our encoder models,

from transformers import AutoTokenizer, TFAutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder1 = AutoModel.from_pretrained("bert-base-uncased").cuda()
encoder2 = AutoModel.from_pretrained("bert-base-uncased").cuda()

Initialize the GradCache object,

from grad_cache import GradCache
from grad_cache.loss import SimpleContrastiveLoss

loss_fn = SimpleContrastiveLoss()
gc = GradCache(
  models=[encoder1, encoder2], 
  chunk_sizes=2, 
  loss_fn=loss_fn, 
  get_rep_fn=lambda v: v.pooler_output
)

Here we use the get_rep_fn argument to specify a function that takes generic Huggingface model output and return the actual representation tensor.

Create model input,

xx = tokenizer(tt, return_tensors='pt', padding=True)
yy = tokenizer(tt2, return_tensors='pt', padding=True)

Run a cache step,

gc(xx, yy, reduction='mean')

Here we use reduction='mean' as a loss_kwargs to control loss behavior. With a defined optimizer, the full gradient update can be done as,

optimizer.zero_grad()
gc(xx, yy, reduction='mean')
optimizer.step()

Use Tied Encoder?

This is naturally handled by the (magic of) dynamic graph. You pass shallow copies of the same encoder model to the GradCache init method.

tied_encoder = AutoModel.from_pretrained("bert-base-uncased").cuda()
gc = GradCache(
  models=[tied_encoder , tied_encoder], 
  chunk_sizes=2, 
  loss_fn=loss_fn, 
  get_rep_fn=lambda v: v.pooler_output
)

Under the hood, distinct hooks will be registered to make correct gradient computation.

Distributed Training with Multiple GPUs?

We expect cross process communication of representations to be handled by the loss_fn.

from grad_cache.loss import DistributedContrastiveLoss
loss_fn_dist = DistributedContrastiveLoss()

Properly wrap the the encoder models for gradient reduction,

encoder1_ddp = DistributedDataParallel(
	encoder1, device_ids=[local_rank], output_device=local_rank, find_unused_parameters=True)
encoder2_ddp = DistributedDataParallel(
	encoder2, device_ids=[local_rank], output_device=local_rank, find_unused_parameters=True)

You can initialize the cache use the distributed loss and the DDP models,

gc = GradCache(
  models=[encoder1_ddp, encoder2_ddp], 
  chunk_sizes=2, 
  loss_fn=loss_fn_dist, 
  get_rep_fn=lambda v: v.pooler_output
)

Run a cache step,

gc(xx, yy, no_sync_except_last=True, reduction='mean')

Set no_sync_except_last=True to avoid unnecessary gradient reduction.

Functional Approach

Decorators

If you are developing a new project, we recommend also checking out the decorators we have provided to create higher order functions for cache.

grad_cache.functional.cached(func: Callable[..., Tensor])

A decorator that takes a model call function into a cached compatible version.

func - A function that calls the model and return representation tensor.

Return - A function that returns 1) representation leaf tensors for cache construction, 2) a closure function for the 2nd forward and the cached backward. Call 2) with 1) as argument after calling backward on the loss Tensor.

grad_cache.functional.cat_input_tensor(func: Callable[..., Tensor])

A decorator that concatenates positional and keyword arguments of type List[Tensor] into a single Tensor on the 0th dimension. This can come in handy dealing with results of representation tensors from multiple cached forward.

func - A loss function

Return - Decorated loss function for cached results.

Usage

The functional decorators are particular useful if your data loader is emitting small batches, from which you can construct the big batch. Say you also want to do automatic mixed precision, we first define the model call function and loss function,

from grad_cache.functional import cached, cat_input_tensor

import torch
import torch.nn.functional as F
from torch.cuda.amp import autocast

@cached
@autocast()
def  call_model(model, input):
	return model(**input).pooler_output

@cat_input_tensor
@autocast()
def  contrastive_loss(x, y):
	target = torch.arange(0, y.size(0), int(y.size(0) / x.size(0)), device=x.device)
	scores = torch.matmul(x, y.transpose(0, 1))
	return F.cross_entropy(scores, target=target)

Say you have a DataLoader loader emitting small batches of tuple (xx, yy) of size (M * N) and that you want to train by aggregating 16 small batches to get a batch of (16M * 16N),

cache_x = []
cache_y = []
closures_x = []
closures_y = []

for step, sub_batch in enumerate(loader):  
    xx, yy = sub_batch
    rx, cx = call_model(bert, xx)
    ry, cy = call_model(bert, yy)
    
    cache_x.append(rx)
    cache_y.append(ry)
    closuresx.append(cx)
    closuresy.append(cy)
    
    if (step + 1) % 16 == 0:
        loss = contrastive_loss(cache_x, cache_y)
        scaler.scale(loss).backward()
        
	for f, r in zip(closuresx, cache_x):
            f(r)
        for f, r in zip(closuresy, cache_y):
            f(r)

        cache_x = []
        cache_y = []
        closures_x = []
        closures_y = []
	
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()

Code Structure

grad_cache/grad_cache.py - Define the GradCache class. The code is under 300 lines including comments. For development, we encourage you to read through it.

grad_cache/functional.py - Define decorators to create higher order function for gradient caching from ordinary model call functions and loss functions.

Owner
Luyu Gao
NLP Research [email protected], CMU
Luyu Gao
A Python library that provides a simplified alternative to DBAPI 2

A Python library that provides a simplified alternative to DBAPI 2. It provides a facade in front of DBAPI 2 drivers.

Tony Locke 44 Nov 17, 2021
Code for our CVPR 2021 paper "MetaCam+DSCE"

Joint Noise-Tolerant Learning and Meta Camera Shift Adaptation for Unsupervised Person Re-Identification (CVPR'21) Introduction Code for our CVPR 2021

FlyingRoastDuck 59 Oct 31, 2022
Explainability for Vision Transformers (in PyTorch)

Explainability for Vision Transformers (in PyTorch) This repository implements methods for explainability in Vision Transformers

Jacob Gildenblat 442 Jan 04, 2023
LogDeep is an open source deeplearning-based log analysis toolkit for automated anomaly detection.

LogDeep is an open source deeplearning-based log analysis toolkit for automated anomaly detection.

donglee 279 Dec 13, 2022
A Python framework for developing parallelized Computational Fluid Dynamics software to solve the hyperbolic 2D Euler equations on distributed, multi-block structured grids.

pyHype: Computational Fluid Dynamics in Python pyHype is a Python framework for developing parallelized Computational Fluid Dynamics software to solve

Mohamed Khalil 21 Nov 22, 2022
source code for https://arxiv.org/abs/2005.11248 "Accelerating Antimicrobial Discovery with Controllable Deep Generative Models and Molecular Dynamics"

Accelerating Antimicrobial Discovery with Controllable Deep Generative Models and Molecular Dynamics This work will be published in Nature Biomedical

International Business Machines 71 Nov 15, 2022
Code in conjunction with the publication 'Contrastive Representation Learning for Hand Shape Estimation'

HanCo Dataset & Contrastive Representation Learning for Hand Shape Estimation Code in conjunction with the publication: Contrastive Representation Lea

Computer Vision Group, Albert-Ludwigs-Universität Freiburg 38 Dec 13, 2022
A compendium of useful, interesting, inspirational usage of pandas functions, each example will be an ipynb file

Pandas_by_examples A compendium of useful/interesting/inspirational usage of pandas functions, each example will be an ipynb file What is this reposit

Guangyuan(Frank) Li 32 Nov 20, 2022
Official PyTorch implementation of Learning Intra-Batch Connections for Deep Metric Learning (ICML 2021) published at International Conference on Machine Learning

About This repository the official PyTorch implementation of Learning Intra-Batch Connections for Deep Metric Learning. The config files contain the s

Dynamic Vision and Learning Group 41 Dec 10, 2022
A Deep Reinforcement Learning Framework for Stock Market Trading

DQN-Trading This is a framework based on deep reinforcement learning for stock market trading. This project is the implementation code for the two pap

61 Jan 01, 2023
High performance Cross-platform Inference-engine, you could run Anakin on x86-cpu,arm, nv-gpu, amd-gpu,bitmain and cambricon devices.

Anakin2.0 Welcome to the Anakin GitHub. Anakin is a cross-platform, high-performance inference engine, which is originally developed by Baidu engineer

514 Dec 28, 2022
TensorFlow-based implementation of "ICNet for Real-Time Semantic Segmentation on High-Resolution Images".

ICNet_tensorflow This repo provides a TensorFlow-based implementation of paper "ICNet for Real-Time Semantic Segmentation on High-Resolution Images,"

HsuanKung Yang 406 Nov 27, 2022
A PyTorch implementation of the paper "Semantic Image Synthesis via Adversarial Learning" in ICCV 2017

Semantic Image Synthesis via Adversarial Learning This is a PyTorch implementation of the paper Semantic Image Synthesis via Adversarial Learning. Req

Seonghyeon Nam 146 Nov 25, 2022
Implementation for paper MLP-Mixer: An all-MLP Architecture for Vision

MLP Mixer Implementation for paper MLP-Mixer: An all-MLP Architecture for Vision. Give us a star if you like this repo. Author: Github: bangoc123 Emai

Ngoc Nguyen Ba 86 Dec 10, 2022
Reaction SMILES-AA mapping via language modelling

rxn-aa-mapper Reactions SMILES-AA sequence mapping setup conda env create -f conda.yml conda activate rxn_aa_mapper In the following we consider on ex

16 Dec 13, 2022
Implementation of a memory efficient multi-head attention as proposed in the paper, "Self-attention Does Not Need O(n²) Memory"

Memory Efficient Attention Pytorch Implementation of a memory efficient multi-head attention as proposed in the paper, Self-attention Does Not Need O(

Phil Wang 180 Jan 05, 2023
《Image2Reverb: Cross-Modal Reverb Impulse Response Synthesis》(2021)

Image2Reverb Image2Reverb is an end-to-end neural network that generates plausible audio impulse responses from single images of acoustic environments

Nikhil Singh 48 Nov 27, 2022
Implementation of Perceiver, General Perception with Iterative Attention in TensorFlow

Perceiver This Python package implements Perceiver: General Perception with Iterative Attention by Andrew Jaegle in TensorFlow. This model builds on t

Rishit Dagli 84 Oct 15, 2022
A rough implementation of the paper "A Steering Algorithm for Redirected Walking Using Reinforcement Learning"

A rough implementation of the paper "A Steering Algorithm for Redirected Walking Using Reinforcement Learning"

Somnus `Chen 2 Jun 09, 2022
SSL_SLAM2: Lightweight 3-D Localization and Mapping for Solid-State LiDAR (mapping and localization separated) ICRA 2021

SSL_SLAM2 Lightweight 3-D Localization and Mapping for Solid-State LiDAR (Intel Realsense L515 as an example) This repo is an extension work of SSL_SL

Wang Han 王晗 1.3k Jan 08, 2023