Implementation of a memory efficient multi-head attention as proposed in the paper, "Self-attention Does Not Need O(n²) Memory"

Last update: Jan 05, 2023

Overview

Memory Efficient Attention Pytorch

Implementation of a memory efficient multi-head attention as proposed in the paper, Self-attention Does Not Need O(n²) Memory. In addition, the module will take care of masking, causal masking, as well as cross attention.
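
To make the idea concrete, here is a minimal single-head sketch of chunked attention in plain PyTorch. It is not this library's implementation: `chunked_attention` and its `q_chunk` / `k_chunk` arguments are hypothetical names, and it covers only the forward pass without masking. For each block of queries it keeps a running max, a running sum of exponentials, and a running weighted sum of values, so only one (q_chunk, k_chunk) block of scores exists at a time.

```python
import torch

def chunked_attention(q, k, v, q_chunk=4, k_chunk=8):
    """Softmax attention computed block by block (illustrative sketch only).

    q: (n, d), k: (m, d), v: (m, d_v). Only a (q_chunk, k_chunk) block of
    scores is materialized at a time instead of the full (n, m) matrix.
    """
    scale = q.shape[-1] ** -0.5
    outputs = []
    for qc in q.split(q_chunk, dim=0):
        rows = qc.shape[0]
        # Running statistics for this block of queries.
        run_max = q.new_full((rows, 1), float("-inf"))
        run_sum = q.new_zeros(rows, 1)
        run_val = q.new_zeros(rows, v.shape[-1])
        for kc, vc in zip(k.split(k_chunk, dim=0), v.split(k_chunk, dim=0)):
            scores = (qc @ kc.t()) * scale  # (rows, k_chunk)
            new_max = torch.maximum(run_max, scores.amax(dim=-1, keepdim=True))
            # Rescale what was accumulated under the previous max.
            correction = torch.exp(run_max - new_max)
            weights = torch.exp(scores - new_max)
            run_sum = run_sum * correction + weights.sum(dim=-1, keepdim=True)
            run_val = run_val * correction + weights @ vc
            run_max = new_max
        outputs.append(run_val / run_sum)
    return torch.cat(outputs, dim=0)

torch.manual_seed(0)
q, k, v = torch.randn(10, 16), torch.randn(20, 16), torch.randn(20, 16)
out = chunked_attention(q, k, v)
# Reference: ordinary attention with the full (10, 20) score matrix.
reference = torch.softmax((q @ k.t()) * 16 ** -0.5, dim=-1) @ v
```

The chunked result should agree with the reference up to floating point error. Saving memory in the backward pass as well needs something extra, such as checkpointing each block, which this sketch leaves out.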

Install

$ pip install memory-efficient-attention-pytorch

Usage

For an autoregressive language model

import torch
from memory_efficient_attention_pytorch import Attention

attn = Attention(
    dim = 512,
    dim_head = 64,                # dimension per head
    heads = 8,                    # number of attention heads
    causal = True,                # autoregressive or not
    memory_efficient = True,      # whether to use memory efficient attention (can be turned off to test against normal attention)
    q_bucket_size = 1024,         # bucket size along queries dimension
    k_bucket_size = 2048          # bucket size along key / values dimension
).cuda()

x = torch.randn(1, 65536, 512).cuda()
out = attn(x) # (1, 65536, 512)
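
As a rough back-of-the-envelope check on why bucketing matters at this sequence length, the snippet below compares the size of the full attention score matrix with the size of a single bucket-sized block. It assumes float32 scores and that one block is held for all heads at once, and it ignores every other buffer; `score_bytes` is a hypothetical helper, not part of the library.

```python
def score_bytes(batch, heads, q_len, k_len, bytes_per_el=4):
    # Bytes needed to hold attention scores of shape (batch, heads, q_len, k_len).
    return batch * heads * q_len * k_len * bytes_per_el

full = score_bytes(1, 8, 65536, 65536)   # whole score matrix at once
block = score_bytes(1, 8, 1024, 2048)    # one (q_bucket_size, k_bucket_size) block

print(full / 2**30, "GiB")   # 128.0 GiB
print(block / 2**20, "MiB")  # 64.0 MiB
```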

Cross attention

import torch
from memory_efficient_attention_pytorch import Attention

cross_attn = Attention(
    dim = 512,
    dim_head = 64,
    heads = 8,
    memory_efficient = True,
    q_bucket_size = 1024,
    k_bucket_size = 2048
).cuda()

x = torch.randn(1, 65536, 512).cuda()
context = torch.randn(1, 65536, 512).cuda()
mask = torch.ones(1, 65536).bool().cuda()

out = cross_attn(x, context = context, mask = mask) # (1, 65536, 512)

Todo:
- benchmark and see how much torch jit helps
- look at Triton and KeOps and see if either can be a fit

Citations

@misc{rabe2021selfattention,
    title   = {Self-attention Does Not Need $O(n^2)$ Memory}, 
    author  = {Markus N. Rabe and Charles Staats},
    year    = {2021},
    eprint  = {2112.05682},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}

@misc{liu2021swin,
    title   = {Swin Transformer V2: Scaling Up Capacity and Resolution},
    author  = {Ze Liu and Han Hu and Yutong Lin and Zhuliang Yao and Zhenda Xie and Yixuan Wei and Jia Ning and Yue Cao and Zheng Zhang and Li Dong and Furu Wei and Baining Guo},
    year    = {2021},
    eprint  = {2111.09883},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

Comments

[feature request] Combining with flash attention?

There is a new algorithm for optimizing QKV attention: https://github.com/HazyResearch/flash-attention (https://arxiv.org/abs/2205.14135). Maybe you can look into integrating it with this.

opened by Vbansal21 15

i did this, we could build on top

Hi there!

It seems I had already written some of the code: https://github.com/CHARM-Tx/linear_mem_attention_pytorch. Could we build on top of this? I talked to https://github.com/Chillee about an experimental functionality from functorch (https://github.com/pytorch/functorch) that would allow for increased speed (mainly I want to match JAX performance, but it's just difficult with PyTorch's imperative style).

I would love to collaborate on this if you want!

opened by hypnopump 5

Added dropout support to memory efficient variant

Hey Phil,

I have been using this repository for a project and wanted to add dropout for completeness. I checked consistency with the perceiver-ar implementation. I hope this is helpful.

-Matt

opened by usryokousha 2

Making this work with relative position bias from XTransformers

Is there a way to make this work with RelativePositionBias? Currently this produces an attention bias of size $BHN^2$, where B is the batch size, H is the number of heads, and N is the input size. Can this be chunked and computed per chunk?

opened by pfeatherstone 5
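
One possible approach to the question above is to compute the bias only for the (query chunk, key chunk) block currently being processed. The sketch below assumes a learned per-head bias indexed by clamped relative distance. That is a simplification, not the bucketing scheme of RelativePositionBias in x-transformers, and `ChunkedRelPosBias` with its arguments is a hypothetical name, not an API of either library.

```python
import torch
from torch import nn

class ChunkedRelPosBias(nn.Module):
    """Hypothetical per-head bias indexed by clamped relative distance."""

    def __init__(self, heads, max_distance=128):
        super().__init__()
        self.max_distance = max_distance
        # One learned scalar per head for each offset in [-max_distance, max_distance].
        self.table = nn.Embedding(2 * max_distance + 1, heads)

    def forward(self, q_start, q_len, k_start, k_len):
        device = self.table.weight.device
        # Absolute positions covered by this block only.
        q_pos = torch.arange(q_start, q_start + q_len, device=device)
        k_pos = torch.arange(k_start, k_start + k_len, device=device)
        rel = k_pos[None, :] - q_pos[:, None]
        rel = rel.clamp(-self.max_distance, self.max_distance)
        bias = self.table(rel + self.max_distance)  # (q_len, k_len, heads)
        return bias.permute(2, 0, 1)                # (heads, q_len, k_len)

rel_bias = ChunkedRelPosBias(heads=8)
block = rel_bias(q_start=1024, q_len=4, k_start=2048, k_len=6)  # (8, 4, 6)
```

Because the bias depends only on positions, a block computed this way matches the corresponding slice of the full bias, and peak memory for the bias scales with the chunk sizes rather than with $N^2$.
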
save_for_backward can only save variables, but argument 5 is of type bool

Hi,

Thank you for your indescribable work. I was trying to test your method specifically for cross-attention, but I get the error "save_for_backward can only save variables, but argument 5 is of type bool". I am not sure what I am doing wrong. I tried your own examples too and get the same error.

Can you please help me out?

Code:

import torch
from memory_efficient_attention_pytorch import Attention

cross_attn = Attention(
    dim = 512,
    dim_head = 64,
    heads = 8,
    memory_efficient = True,
    q_bucket_size = 1024,
    k_bucket_size = 2048
).cuda()

(# out = sm_mod(inp1)) did this to avoid being a header
x = torch.randn(1, 65536, 512).cuda()
context = torch.randn(1, 65536, 512).cuda()
(# mask = torch.ones(1, 65536).bool().cuda()) did this to avoid being a heading
out = cross_attn(x

ERROR:

  File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/abali/.vscode-server/extensions/ms-python.python-2022.8.1/pythonFiles/lib/python/debugpy/__main__.py", line 45, in <module>
    cli.main()
  File "/home/abali/.vscode-server/extensions/ms-python.python-2022.8.1/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 444, in main
    run()
  File "/home/abali/.vscode-server/extensions/ms-python.python-2022.8.1/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 285, in run_file
    runpy.run_path(target_as_str, run_name=compat.force_str("__main__"))
  File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/stars/user/abali/Phd_work/ISBI2023/X3D-Multigrid/CrossAttn_X3d_v2.py", line 872, in <module>
    out = cross_attn(x, context = context, mask = mask) # (1, 65536, 512) print(out)
  File "/home/abali/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/site-packages/memory_efficient_attention_pytorch/memory_efficient_attention.py", line 215, in forward
    out = attn_fn(q, k, v, mask = mask, attn_bias = attn_bias, causal = self.causal, q_bucket_size = q_bucket_size, k_bucket_size = k_bucket_size)
  File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/site-packages/memory_efficient_attention_pytorch/memory_efficient_attention.py", line 127, in memory_efficient_attention
    exp_weight_chunk, weighted_value_chunk, weight_max_chunk = summarize_qkv_fn(
  File "/home/abali/.local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 163, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
TypeError: save_for_backward can only save variables, but argument 5 is of type bool

opened by aliabid2243 1
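
The traceback above ends inside `torch.utils.checkpoint`, where the installed PyTorch hands the checkpointed function's positional arguments to `save_for_backward`, which accepts only tensors there. More recent PyTorch releases let checkpointed functions take non-tensor arguments, so upgrading may be enough. Where upgrading is not an option, one workaround is to bind non-tensor arguments before checkpointing, as in the sketch below. `summarize_chunk` is a hypothetical stand-in, not the library's function, and the `use_reentrant` handling is only there so the snippet runs on both older and newer releases.

```python
import inspect
from functools import partial

import torch
from torch.utils.checkpoint import checkpoint

def summarize_chunk(q, k, v, causal):
    # Hypothetical stand-in for a per-chunk attention summary.
    scores = q @ k.transpose(-1, -2)
    if causal:
        n, m = scores.shape[-2:]
        future = torch.ones(n, m, dtype=torch.bool, device=q.device).triu(1)
        scores = scores.masked_fill(future, float("-inf"))
    return scores.softmax(dim=-1) @ v

# Newer PyTorch expects use_reentrant to be passed explicitly; older releases
# do not have the parameter, so pass it only when it exists.
has_flag = "use_reentrant" in inspect.signature(checkpoint).parameters
ckpt_kwargs = {"use_reentrant": True} if has_flag else {}

q, k, v = (torch.randn(4, 8, requires_grad=True) for _ in range(3))

# Bind the bool up front so that only tensors are passed through checkpoint.
fn = partial(summarize_chunk, causal=True)
out = checkpoint(fn, q, k, v, **ckpt_kwargs)
out.sum().backward()
```
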
Checkpointing is not compatible with .grad() or when an `inputs` parameter is passed to .backward()

https://github.com/lucidrains/memory-efficient-attention-pytorch/blob/35559a05572f9d4eb982a8e2e399b40a2d61b85c/memory_efficient_attention_pytorch/memory_efficient_attention.py#L95

Should this be:

summarize_qkv_fn = summarize_qkv_chunk if needs_backwards else checkpointed_summarize_qkv_chunk

instead of:

summarize_qkv_fn = checkpointed_summarize_qkv_chunk if needs_backwards else summarize_qkv_chunk

opened by vrobot 0

Releases(0.1.1)

0.1.1 (Dec 30, 2022)
0.1.0 (Dec 30, 2022)
0.0.27 (Nov 1, 2022)
0.0.26 (Jul 23, 2022)
0.0.25 (Jul 23, 2022)
0.0.24 (Jul 23, 2022)
0.0.23 (Jul 23, 2022)
0.0.22 (Jul 23, 2022)
0.0.21 (Jul 23, 2022)
0.0.20 (Jul 23, 2022)
0.0.19 (Jul 23, 2022)
0.0.18 (Jul 23, 2022)
0.0.17 (Mar 22, 2022)
0.0.16 (Mar 21, 2022)
0.0.15 (Mar 13, 2022)
0.0.14 (Mar 4, 2022)
0.0.12 (Mar 4, 2022)
0.0.11 (Mar 4, 2022)
0.0.10 (Mar 4, 2022)
0.0.9 (Mar 4, 2022)
0.0.8 (Mar 4, 2022)
0.0.7 (Mar 4, 2022)
0.0.6 (Mar 4, 2022)
0.0.5 (Mar 4, 2022)
0.0.4 (Mar 4, 2022)
0.0.2 (Mar 4, 2022)
0.0.1 (Mar 3, 2022)

Owner

Phil Wang

Working with Attention. It's all we need

