Implementation of a Transformer, but completely in Triton

Last update: Dec 22, 2022

Overview

Transformer in Triton (wip)

Implementation of a Transformer, but completely in Triton. I'm completely new to lower-level neural net code, so this repository will mostly be a learning experience. The end goal is a vanilla transformer that is faster and more efficient to train.

Install

$ pip install triton-transformer

Usage

import torch
from triton_transformer import Transformer

model = Transformer(
    num_tokens = 256,
    max_seq_len = 1024,
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64
)

x = torch.randint(0, 256, (1, 1024))
mask = torch.ones(1, 1024).bool()

logits = model(x, mask = mask) # (1, 1024, 256)

Citations

@article{Tillet2019TritonAI,
    title   = {Triton: an intermediate language and compiler for tiled neural network computations},
    author  = {Philippe Tillet and H. Kung and D. Cox},
    journal = {Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages},
    year    = {2019}
}

@misc{vaswani2017attention,
    title   = {Attention Is All You Need}, 
    author  = {Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin},
    year    = {2017},
    eprint  = {1706.03762},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}

Comments

Question concerning PyTorch build

Hello. I find your project very interesting and I have seen your comparison between PyTorch and Triton implementations.

However, I am curious whether your PyTorch environment is a source build optimized for your machine or a pip/conda install.

A source build can run faster, so if a conda install is being used for the comparison, the difference in speed may simply be due to Triton optimizing CUDA for the run environment.

Thank you again for your interesting project.
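
One way to make the comparison in this question concrete is to record which PyTorch build is in use before benchmarking. The following is a minimal sketch, assuming PyTorch may or may not be installed; `describe_torch_build` is a hypothetical helper name, not part of triton-transformer. The reported settings describe the build but do not by themselves prove whether it came from source or from a pip/conda package.

```python
import importlib
import platform


def describe_torch_build():
    """Collect basic facts about the local PyTorch build (hypothetical helper)."""
    info = {"python": platform.python_version(), "torch": None}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed, so there is nothing more to report.
        return info

    info["torch"] = torch.__version__
    # CUDA version the build was compiled against; None for CPU-only builds.
    info["cuda"] = torch.version.cuda
    # Commit the build was made from, when the build records it.
    info["git_version"] = getattr(torch.version, "git_version", None)
    # Multi-line summary of compiler, flags and enabled backends.
    info["build_config"] = torch.__config__.show()
    return info


if __name__ == "__main__":
    for key, value in describe_torch_build().items():
        print(f"{key}: {value}")
```

Posting this output alongside benchmark numbers lets readers see whether two runs used comparable builds.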

opened by veritas9872 13

_layernorm implementation forward result not equal F.layer_norm

I tried your triton-transformer and tested the layernorm module alone. Oddly, the forward result differs between the two implementations while the backward result is equal.

code:

from triton_transformer.layernorm import layernorm
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2,5).cuda()
x.requires_grad_(True)
dy = .1*torch.randn_like(x).cuda()
dim = 5
norm = nn.LayerNorm(dim).cuda()

y1 = layernorm(x, norm.weight, norm.bias, use_triton = True)
y2 = layernorm(x, norm.weight, norm.bias, use_triton = False)
print(y1, y2)
print(torch.allclose(y1, y2))

y1.backward(dy, retain_graph=True)
dx_y1 = x.grad.clone()

x.grad = None

y2.backward(dy, retain_graph=True)
dx_y2 = x.grad.clone()
print(dx_y1, dx_y2)
print(torch.allclose(dx_y1, dx_y2))

result:

tensor([[ 0.9492, -0.0021, -0.9797, 0.4449, -0.4123],
        [-0.7624, 0.4399, 0.7299, -0.3091, -0.0983]], device='cuda:0', grad_fn=<_layernormBackward>)
tensor([[ 1.4217, -0.0031, -1.4674, 0.6663, -0.6175],
        [-1.4342, 0.8276, 1.3732, -0.5815, -0.1850]], device='cuda:0', grad_fn=)
False

tensor([[-0.0706, 0.0288, -0.0813, 0.0446, 0.0785],
        [ 0.0218, -0.0152, 0.0141, -0.0522, 0.0315]], device='cuda:0')
tensor([[-0.0706, 0.0288, -0.0813, 0.0446, 0.0785],
        [ 0.0218, -0.0152, 0.0141, -0.0522, 0.0315]], device='cuda:0')
True
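
A quick way to judge which forward output looks like a layer norm: with the default `nn.LayerNorm` parameters used in the snippet (weight 1, bias 0), every normalized row should have a mean close to 0 and a biased variance close to 1. The sketch below applies that check to the values printed above using only the standard library. `row_stats` is a hypothetical helper, and the check only shows which output satisfies the property, not why the other one does not.

```python
from statistics import fmean, pvariance

# Values copied from the printed forward results:
# first tensor is use_triton=True, second is use_triton=False.
triton_out = [
    [0.9492, -0.0021, -0.9797, 0.4449, -0.4123],
    [-0.7624, 0.4399, 0.7299, -0.3091, -0.0983],
]
torch_out = [
    [1.4217, -0.0031, -1.4674, 0.6663, -0.6175],
    [-1.4342, 0.8276, 1.3732, -0.5815, -0.1850],
]


def row_stats(rows):
    """Return (mean, biased variance) for each row. Hypothetical helper."""
    return [(fmean(row), pvariance(row)) for row in rows]


for name, rows in (("triton", triton_out), ("torch", torch_out)):
    for mean, var in row_stats(rows):
        print(f"{name}: mean={mean:+.4f} var={var:.4f}")
```

On these numbers, both outputs are centered, but only the `use_triton = False` rows have variance close to 1; the `use_triton = True` rows come out at roughly 0.45 and 0.28.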

opened by Tengxu-Sun 1

Current state of benchmarking & contributing?

Hey @lucidrains - hope you're doing well! I have some time to hack over the next couple of weeks and just wanted to get a sense of:

- The current state of benchmarking (which Triton kernels provide how much lift, and the aggregate lift over a "vanilla Transformer implementation")
- Whether there's anything I could help with, especially as I learn Triton!

opened by siddk 0

Official layer norm added

Hi @lucidrains, a layer norm was just added to the Triton examples: https://github.com/openai/triton/commit/d4baad426db72b83c5222e1c83c929c1860cae54. I tested it; it's twice as fast as Torch and often faster than Apex.

I'm looking forward to your implementation of attention. So far the Torch implementation is the fastest on my data at 12.3 / 14.5 (forward / backward), versus 17.3 / 23.0 for the other Triton implementation in DeepSpeed.

opened by olegklimov 2

Releases (latest: 0.1.1)

0.1.1 (Apr 5, 2022)
0.1.0 (Apr 4, 2022)
0.0.28 (Mar 23, 2022)
0.0.27 (Nov 6, 2021)
0.0.26 (Nov 6, 2021)
0.0.25 (Oct 6, 2021)
0.0.24 (Oct 4, 2021)
0.0.23 (Oct 4, 2021)
0.0.22 (Oct 4, 2021)
0.0.21 (Oct 4, 2021)
0.0.20 (Sep 29, 2021)
0.0.19 (Sep 29, 2021)
0.0.18 (Sep 29, 2021)
0.0.17 (Sep 28, 2021)
0.0.16 (Sep 28, 2021)
0.0.15 (Sep 27, 2021)
0.0.14 (Sep 23, 2021)
0.0.12 (Sep 23, 2021)
0.0.10 (Sep 23, 2021)
0.0.9 (Sep 22, 2021)
0.0.8 (Sep 22, 2021)
0.0.7 (Sep 22, 2021)
0.0.6 (Sep 22, 2021)
0.0.5 (Sep 22, 2021)
0.0.4 (Sep 22, 2021)
0.0.3 (Sep 15, 2021)
0.0.2 (Sep 15, 2021)

Owner

Phil Wang

Working with Attention. It's all we need
