Implementation of Feedback Transformer in Pytorch

Last update: Oct 04, 2022

Overview

Feedback Transformer - Pytorch

Simple implementation of Feedback Transformer in Pytorch. They improve on Transformer-XL by having each token have access to the representations of all previous layers through time. This is achieved by aggregating the outputs of all layers into a shared memory, which each token across layers can attend to at each time step.

The main drawback is longer training time, due to its non-parallel nature. But I thought I'd build it to further exploration and research into this line of work.

Yannic Kilcher video

I also took the liberty to add some various enhancements, including pre-normalization, GLU gated feedforwards, as well as simplified T5 relative positional embeddings.

Install

$ pip install feedback-transformer-pytorch

Usage

import torch
from feedback_transformer_pytorch import FeedbackTransformer

model = FeedbackTransformer(
    num_tokens = 20000,           # number of tokens
    dim = 512,                    # dimension
    depth = 6,                    # depth
    seq_len = 2,                  # the sequence length of each segment or window
    mem_len = 256,                # length of the memory buffer
    dim_head = 64,                # dimension of each head
    heads = 8,                    # number of heads
    attn_dropout = 0.1,           # attention dropout
    ff_dropout = 0.1              # feedforward dropout
).cuda()

x = torch.randint(0, 20000, (2, 64)).cuda()
model(x)  # (2, 64, 20000)

If you would like to have fine control over the memory (when to detach, etc), you can do it with some extra keyword arguments on .forward

import torch
from feedback_transformer_pytorch import FeedbackTransformer

model = FeedbackTransformer(
    num_tokens = 20000,
    dim = 512,
    depth = 6,
    seq_len = 32,
    mem_len = 256
).cuda()

x1 = torch.randint(0, 20000, (2, 32)).cuda()
x2 = torch.randint(0, 20000, (2, 32)).cuda()
x3 = torch.randint(0, 20000, (2, 32)).cuda()

out1, mem1 = model(x1, return_memory = True)
out2, mem2 = model(x2, memory = mem1, return_memory = True)
out3, mem3 = model(x3, memory = mem2, return_memory = True)  # (2, 32, 20000)

Citations

@misc{fan2021addressing,
    title   = {Addressing Some Limitations of Transformers with Feedback Memory}, 
    author  = {Angela Fan and Thibaut Lavril and Edouard Grave and Armand Joulin and Sainbayar Sukhbaatar},
    year    = {2021},
    eprint  = {2002.09402},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}

Comments

Should it really be using lower layers output for keys and values?

Could you explain the logic of how the key-value pairs are formed at these lines and whether it is necessary?

https://github.com/lucidrains/feedback-transformer-pytorch/blob/d7d8939910d1491f01a3d93ce81d4663925fb389/feedback_transformer_pytorch/feedback_transformer_pytorch.py#L146-L151

It looks to me that line 146 transforms the output of the layer below (x) to keys and values, and the following lines combine these keys and values with the memory. I thought that x should only be used for forming the query here, and only the existing memory is used for keys and values.

opened by tarvaina 6
In place operation with gradient

https://github.com/lucidrains/feedback-transformer-pytorch/blob/main/feedback_transformer_pytorch/feedback_transformer_pytorch.py#L173 I think this is an error.

opened by hadaev8 4
Bug in weighted sum

Bug in https://github.com/lucidrains/feedback-transformer-pytorch/blob/main/feedback_transformer_pytorch/feedback_transformer_pytorch.py#L264

Should be layer_weight = rearrange(layer_weight, 'd -> d () () ()')

opened by Victor0118 1

Input/Output dimensions

Hey @lucidrains

Can I check the dimensions of the input and output, is it (seq_len, dim) -> (? ,dim, tokens)?

model = FeedbackTransformer(
    num_tokens = 20000,           # number of tokens
    dim = 512,                    # dimension
    depth = 6,                    # depth
    seq_len = 2,                  # the sequence length of each segment or window
    mem_len = 256,                # length of the memory buffer
    dim_head = 64,                # dimension of each head
    heads = 8,                    # number of heads
    attn_dropout = 0.1,           # attention dropout
    ff_dropout = 0.1              # feedforward dropout
).cuda()

x = torch.randint(0, 256, (2, 512)).cuda()
model(x)  # (1, 512, 20000)

opened by iiSeymour 1

Non intuitive memory usage with cross attention

Give simple 256 dim and 512 len tensor and memory len 16 feedback transformer uses 3.6gm memory after forward pass. With cross attention on 100 len tensor usage grows to 14gb.

While parallel version uses 3.1gb and 3.5gb.

Notebooks for testing https://colab.research.google.com/drive/1dRImydFn3WthOXdLYIvdf5bsqjXcmhC5?usp=sharing https://colab.research.google.com/drive/1n653j4Pz9_U7OukhTlUbomAHMvpPXwx0?usp=sharing

opened by hadaev8 0
I think mask padding value should be False

Here https://github.com/lucidrains/feedback-transformer-pytorch/blob/with-cross-attention/feedback_transformer_pytorch/feedback_transformer_pytorch.py#L181

opened by hadaev8 0
ETA for the enwiki8 example

Hey @lucidrains,

Any eta on the example for auto-regressive enwiki8 example? I and others would really appreciate it as always :)

Also, if you can provide an example for training on custom line-by-line TXT datasets, it would be absolutely fantastic.

Thank you.

opened by asigalov61 0

Releases(0.0.11)

0.0.11(Mar 2, 2021)

Source code(tar.gz)
Source code(zip)
0.0.10(Feb 22, 2021)

Source code(tar.gz)
Source code(zip)
0.0.9(Feb 3, 2021)

Source code(tar.gz)
Source code(zip)
0.0.8(Feb 2, 2021)

Source code(tar.gz)
Source code(zip)
0.0.7(Feb 2, 2021)

Source code(tar.gz)
Source code(zip)
0.0.6(Feb 2, 2021)

Source code(tar.gz)
Source code(zip)
0.0.5(Feb 2, 2021)

Source code(tar.gz)
Source code(zip)
0.0.4(Feb 2, 2021)

Source code(tar.gz)
Source code(zip)
0.0.3(Feb 2, 2021)

Source code(tar.gz)
Source code(zip)
0.0.2(Feb 2, 2021)

Source code(tar.gz)
Source code(zip)
0.0.1(Feb 2, 2021)

Source code(tar.gz)
Source code(zip)

Owner

Phil Wang

Working with Attention. It's all we need.

GitHub Repository

The Submission for SIMMC 2.0 Challenge 2021

The Submission for SIMMC 2.0 Challenge 2021 challenge website Requirements python 3.8.8 pytorch 1.8.1 transformers 4.8.2 apex for multi-gpu nltk Prepr

5 Jul 26, 2022

Real-Time Semantic Segmentation in Mobile device

Real-Time Semantic Segmentation in Mobile device This project is an example project of semantic segmentation for mobile real-time app. The architectur

708 Jan 01, 2023

Convenient tool for speeding up the intern/officer review process.

icpc-app-screen Convenient tool for speeding up the intern/officer applicant review process. Eliminates the pain from reading application responses of

1 Oct 30, 2021

Code to produce syntactic representations that can be used to study syntax processing in the human brain

Can fMRI reveal the representation of syntactic structure in the brain? The code base for our paper on understanding syntactic representations in the

4 Dec 18, 2022

This is a Pytorch implementation of paper: DropEdge: Towards Deep Graph Convolutional Networks on Node Classification

DropEdge: Towards Deep Graph Convolutional Networks on Node Classification This is a Pytorch implementation of paper: DropEdge: Towards Deep Graph Con

401 Dec 16, 2022

DeepVoxels is an object-specific, persistent 3D feature embedding.

DeepVoxels is an object-specific, persistent 3D feature embedding. It is found by globally optimizing over all available 2D observations of

196 Dec 25, 2022

Implementation of OmniNet, Omnidirectional Representations from Transformers, in Pytorch

Omninet - Pytorch Implementation of OmniNet, Omnidirectional Representations from Transformers, in Pytorch. The authors propose that we should be atte

48 Nov 21, 2022

Code for layerwise detection of linguistic anomaly paper (ACL 2021)

Layerwise Anomaly This repository contains the source code and data for our ACL 2021 paper: "How is BERT surprised? Layerwise detection of linguistic

6 Dec 07, 2022

Get started learning C# with C# notebooks powered by .NET Interactive and VS Code.

.NET Interactive Notebooks for C# Welcome to the home of .NET interactive notebooks for C#! How to Install Download the .NET Coding Pack for VS Code f

425 Dec 25, 2022

CAR-API: Cityscapes Attributes Recognition API

CAR-API: Cityscapes Attributes Recognition API This is the official api to download and fetch attributes annotations for Cityscapes Dataset. Content I

5 Dec 22, 2022

Graph Self-Supervised Learning for Optoelectronic Properties of Organic Semiconductors

SSL_OSC Graph Self-Supervised Learning for Optoelectronic Properties of Organic Semiconductors

2 May 14, 2022

PyTorch Implementation of Vector Quantized Variational AutoEncoders.

Pytorch implementation of VQVAE. This paper combines 2 tricks: Vector Quantization (check out this amazing blog for better understanding.) Straight-Th

2 Oct 06, 2021

Hand-distance-measurement-game - Hand Distance Measurement Game

Hand Distance Measurement Game This is program is made to calculate the distance

2 Jan 12, 2022

ReConsider is a re-ranking model that re-ranks the top-K (passage, answer-span) predictions of an Open-Domain QA Model like DPR (Karpukhin et al., 2020).

ReConsider ReConsider is a re-ranking model that re-ranks the top-K (passage, answer-span) predictions of an Open-Domain QA Model like DPR (Karpukhin

47 Jul 26, 2022

Deep Learning to Improve Breast Cancer Detection on Screening Mammography

Shield: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Deep Learning to Improve Breast

305 Jan 03, 2023

More than a hundred strange attractors

dysts Analyze more than a hundred chaotic systems. Basic Usage Import a model and run a simulation with default initial conditions and parameter value

185 Dec 23, 2022

Code for the paper: Fighting Fake News: Image Splice Detection via Learned Self-Consistency

Fighting Fake News: Image Splice Detection via Learned Self-Consistency [paper] [website] Minyoung Huh *12, Andrew Liu *1, Andrew Owens1, Alexei A. Ef

174 Dec 09, 2022

You Only Hypothesize Once: Point Cloud Registration with Rotation-equivariant Descriptors

You Only Hypothesize Once: Point Cloud Registration with Rotation-equivariant Descriptors In this paper, we propose a novel local descriptor-based fra

80 Dec 15, 2022

Pose Detection and Machine Learning for real-time body posture analysis during exercise to provide audiovisual feedback on improvement of form.

Posture: Pose Tracking and Machine Learning for prescribing corrective suggestions to improve posture and form while exercising. This repository conta

10 Nov 11, 2022

Learning Super-Features for Image Retrieval

Learning Super-Features for Image Retrieval This repository contains the code for running our FIRe model presented in our ICLR'22 paper: @inproceeding

101 Dec 28, 2022

Implementation of Feedback Transformer in Pytorch

Related tags

Overview

Feedback Transformer - Pytorch

Install

Usage

Citations

Comments

Should it really be using lower layers output for keys and values?

In place operation with gradient

Bug in weighted sum

Input/Output dimensions

Non intuitive memory usage with cross attention

I think mask padding value should be False

ETA for the enwiki8 example

Releases(0.0.11)

0.0.11(Mar 2, 2021)

0.0.10(Feb 22, 2021)

0.0.9(Feb 3, 2021)

0.0.8(Feb 2, 2021)

0.0.7(Feb 2, 2021)

0.0.6(Feb 2, 2021)

0.0.5(Feb 2, 2021)

0.0.4(Feb 2, 2021)

0.0.3(Feb 2, 2021)

0.0.2(Feb 2, 2021)

0.0.1(Feb 2, 2021)