Implementation of the Transformer variant proposed in "Transformer Quality in Linear Time"

Last update: Dec 28, 2022

Overview

FLASH - Pytorch

Install

$ pip install FLASH-pytorch

Usage

The main novel circuit in this paper is the "Gated Attention Unit", which they claim can replace multi-headed attention while reducing it to just one head.

It uses a squared ReLU activation in place of the softmax. The squared ReLU activation was first seen in the Primer paper, and the use of ReLU in attention in the ReLA Transformer. The gating style seems mostly inspired by gMLPs.
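
To make the difference from softmax concrete, here is a minimal framework-free sketch in plain Python. The helper names (relu_squared_weights, softmax_weights) are hypothetical and not part of flash_pytorch, and the scaling and positional bias that an actual implementation applies to the scores are left out.

```python
import math

def relu_squared_weights(scores):
    """Turn raw attention scores into weights with relu(score) ** 2."""
    return [[max(s, 0.0) ** 2 for s in row] for row in scores]

def softmax_weights(scores):
    """Standard row-wise softmax, shown only for comparison."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# toy scores: 2 queries attending over 3 keys
scores = [[2.0, -1.0, 0.5],
          [0.0, 4.0, -3.0]]

relu2 = relu_squared_weights(scores)
soft = softmax_weights(scores)
```

Unlike softmax, the squared ReLU rows are not normalized to sum to one, and any non-positive score gets exactly zero weight.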

import torch
from flash_pytorch import GAU

gau = GAU(
    dim = 512,
    query_key_dim = 128,     # query / key dimension
    causal = True,           # autoregressive or not
    expansion_factor = 2,    # hidden dimension = dim * expansion_factor
)

x = torch.randn(1, 1024, 512)
out = gau(x) # (1, 1024, 512)
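
As a rough sketch of the gating idea, following the formulation in the paper where the attention-weighted values are multiplied element-wise by a gate before the output projection, the snippet below uses plain Python lists. The function name gated_attention_output and the toy numbers are made up for illustration; the projections, normalization, and the activation on the gate are omitted.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gated_attention_output(attn, values, gate):
    """Element-wise gate applied to the attention-weighted values: gate * (attn @ values)."""
    mixed = matmul(attn, values)
    return [[g * m for g, m in zip(g_row, m_row)] for g_row, m_row in zip(gate, mixed)]

attn = [[1.0, 0.0],      # token 0 attends only to itself
        [0.5, 0.5]]      # token 1 averages both tokens
values = [[2.0, 4.0],
          [6.0, 8.0]]
gate = [[1.0, 0.0],      # a gate of 0 blocks a feature entirely
        [2.0, 1.0]]

out = gated_attention_output(attn, values, gate)
```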

The authors then combine GAU with Katharopoulos linear attention, using grouping of the sequences to overcome a known issue with autoregressive linear attention.
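
Roughly, causal linear attention needs a running sum of key-value products up to each position. A minimal sketch of the grouping idea, in plain Python with hypothetical helper names: the running sum is only handed out once per group, so every token in a group sees the summary of all strictly earlier groups, while interactions inside the group are left to the quadratic attention. Normalization and other details of the actual implementation are omitted.

```python
def outer(k, v):
    """Outer product of a key vector and a value vector."""
    return [[ki * vj for vj in v] for ki in k]

def add(a, b):
    """Element-wise sum of two matrices (returns a new matrix)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def grouped_causal_kv(keys, values, group_size):
    """Return, for each group, the sum of k^T v over all strictly earlier groups."""
    d, e = len(keys[0]), len(values[0])
    running = [[0.0] * e for _ in range(d)]
    states = []
    for start in range(0, len(keys), group_size):
        states.append(running)  # summary visible to every token in this group
        group_k = keys[start:start + group_size]
        group_v = values[start:start + group_size]
        for k, v in zip(group_k, group_v):
            running = add(running, outer(k, v))
    return states

keys = [[1.0], [2.0], [3.0], [4.0]]
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
states = grouped_causal_kv(keys, values, group_size=2)
```

A query in group g would then be multiplied with states[g] to obtain its linear attention contribution from the past groups.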

The authors named this combination of the quadratic gated attention unit with grouped linear attention FLASH.

You can also use this quite easily:

import torch
from flash_pytorch import FLASH

flash = FLASH(
    dim = 512,
    group_size = 256,             # group size
    causal = True,                # autoregressive or not
    query_key_dim = 128,          # query / key dimension
    expansion_factor = 2.         # hidden dimension = dim * expansion_factor
)

x = torch.randn(1, 1111, 512)     # sequence will be auto-padded to the nearest multiple of the group size
out = flash(x) # (1, 1111, 512)
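
The padding arithmetic for the example above can be sketched as follows, assuming the sequence is padded up to the next multiple of group_size. The helper padded_length is hypothetical, not a library function.

```python
import math

def padded_length(seq_len, group_size):
    """Smallest multiple of group_size that is >= seq_len."""
    return math.ceil(seq_len / group_size) * group_size

seq_len, group_size = 1111, 256
total = padded_length(seq_len, group_size)   # 1280, i.e. 5 groups of 256
padding = total - seq_len                    # 169 padded positions
```

As the shape comment in the example indicates, the output still has the original length of 1111.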

Finally, you can use the full FLASH transformer as described in the paper. It contains all the positional embeddings mentioned there: the absolute positional embedding is a scaled sinusoidal embedding, and the GAU quadratic attention gets a one-headed T5 relative positional bias. On top of all this, both the GAU attention and the linear attention are rotary embedded (RoPE).
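
For readers unfamiliar with rotary embeddings, here is a small plain-Python sketch of the idea: pairs of features are rotated by an angle proportional to the token position, which makes the query-key dot product depend only on the relative position. The function names, the pairing of adjacent features, and the base of 10000 are assumptions for illustration and may differ from what the library does internally.

```python
import math

def rotary(vec, pos, base=10000.0):
    """Rotate consecutive feature pairs of vec by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(a, b):
    """Plain dot product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

near = dot(rotary(q, 5), rotary(k, 3))       # positions 5 and 3, offset 2
far = dot(rotary(q, 105), rotary(k, 103))    # positions 105 and 103, same offset
```

The two scores agree up to floating point error, since only the offset between the two positions enters the dot product.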

import torch
from flash_pytorch import FLASHTransformer

model = FLASHTransformer(
    num_tokens = 20000,          # number of tokens
    dim = 512,                   # model dimension
    depth = 12,                  # depth
    causal = True,               # autoregressive or not
    group_size = 256,            # size of the groups
    query_key_dim = 128,         # dimension of queries / keys
    expansion_factor = 2.,       # hidden dimension = dim * expansion_factor
    norm_type = 'scalenorm',     # in the paper, they claimed scalenorm led to faster training at no performance hit. the other option is 'layernorm' (also default)
    shift_tokens = True          # discovered by an independent researcher in Shenzhen @BlinkDL, this simply shifts half of the feature space forward one step along the sequence dimension - greatly improved convergence even more in my local experiments
)

x = torch.randint(0, 20000, (1, 1024))
logits = model(x) # (1, 1024, 20000)
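
The shift_tokens option described in the comment above can be illustrated with plain Python lists. This is only a sketch with a hypothetical function name; it mirrors the description of shifting half of the features forward one step along the sequence, with zeros entering at the first position.

```python
def shift_tokens(x):
    """Shift the first half of each token's features forward by one position.

    x is a list of token feature vectors (seq_len x dim, dim even).
    """
    half = len(x[0]) // 2
    # first halves, delayed by one step; a zero vector enters at position 0
    shifted = [[0.0] * half] + [row[:half] for row in x[:-1]]
    # second halves pass through unchanged
    return [s + row[half:] for s, row in zip(shifted, x)]

x = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0]]

shifted = shift_tokens(x)
```

Each position now sees half of the previous token's features next to half of its own. No position receives features from a later token, so the shift is compatible with causal = True.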

Test on Autoregressive Enwik8

$ python train.py

Citations

@article{Hua2022TransformerQI,
    title   = {Transformer Quality in Linear Time},
    author  = {Weizhe Hua and Zihang Dai and Hanxiao Liu and Quoc V. Le},
    journal = {ArXiv},
    year    = {2022},
    volume  = {abs/2202.10447}
}

@software{peng_bo_2021_5196578,
    author    = {PENG Bo},
    title     = {BlinkDL/RWKV-LM: 0.01},
    month     = {aug},
    year      = {2021},
    publisher = {Zenodo},
    version   = {0.01},
    doi       = {10.5281/zenodo.5196578},
    url       = {https://doi.org/10.5281/zenodo.5196578}
}

Comments

einsum operation in Linear Attention Part
Hi, thanks a lot for your FLASH_pytorch, which helps a lot. I found that there are some differences from the paper in the linear attention part: https://github.com/lucidrains/FLASH-pytorch/blob/main/flash_pytorch/flash_pytorch.py#L342-L343

lin_kv = einsum('b g n d, b g n e -> b d e', lin_k, v) / n
lin_out = einsum('b g n d, b d e -> b g n e', lin_q, lin_kv)

Here lin_kv is three-dimensional (b d e). The code in the paper is

lin_kv = tf.einsum('bhke,bgh→bgke', lin_kv, mask)
linear = tf.einsum('bgnk,bgke→bgne', lin_q, lin_kv)

where lin_kv is four-dimensional (b g k e). It seems that the two ways are not equivalent.

Looking forward to your reply. Best,
opened by ShomyLiu 5
mask error
x = torch.randint(0, 20000, (1, 1024))
mask = x.ne(0)
logits = model(x, mask=mask)

RuntimeError: The size of tensor a (1024) must match the size of tensor b (128) at non-singleton dimension 2
opened by keyunluo 1
Speed on TPU

Hi, thanks for the code! I tested it on Google TPU v3, and the training speed seems slower than I expected. Maybe there is some operation which is not lowered on TPU.

opened by magicknight 0
About the "shift_tokens"

Thank you for your amazing code.

In the FLASH class, I found a flag, shift_tokens, and the corresponding code is as follows:

if self.shift_tokens:
    x_shift, x_pass = normed_x.chunk(2, dim = -1)
    x_shift = F.pad(x_shift, (0, 0, 1, -1), value = 0.)
    normed_x = torch.cat((x_shift, x_pass), dim = -1)

Assume normed_x has shape [1024, 512], so x_shift and x_pass each have shape [1024, 256]. The code then prepends a row of zeros to x_shift, removes its last row, and concatenates x_shift and x_pass to get the new normed_x.

In my opinion, after the F.pad operation the rows of x_shift and x_pass no longer line up.

May I know why it works?

Kang

opened by kangzhao2 1
Cross-Attention?

Hi, @lucidrains. Thank you for sharing this excellent implementation with us all! Do you have any thoughts as to what changes would need to be made to make cross-attention possible with your FLASH model?

opened by amorehead 2

Releases(0.1.6)

0.1.6 (Sep 23, 2022)
v0.1.5 (Jun 19, 2022)
v0.1.4 (Jun 18, 2022)
0.1.2 (Apr 8, 2022)
0.1.1 (Mar 29, 2022)
0.0.15a (Mar 29, 2022)
0.0.14 (Mar 29, 2022)
0.0.12 (Mar 29, 2022)
0.0.11 (Mar 29, 2022)
0.0.10 (Mar 29, 2022)
0.0.9 (Mar 29, 2022)
0.0.8 (Mar 29, 2022)
0.0.7 (Mar 29, 2022)
0.0.6 (Mar 29, 2022)
0.0.1a (Mar 29, 2022)
0.0.5 (Mar 28, 2022)
0.0.4 (Mar 28, 2022)
0.0.3 (Mar 28, 2022)
0.0.2a (Mar 28, 2022)
0.0.1 (Mar 28, 2022)

Owner

Phil Wang

Working with Attention. It's all we need

