PyTorch implementation of the Transformer in Post-LN (Post-LayerNorm) and Pre-LN (Pre-LayerNorm).

Last update: Feb 27, 2022

Overview

Transformer-PyTorch

A PyTorch implementation of the Transformer from the paper Attention is All You Need in both Post-LN (Post-LayerNorm) and Pre-LN (Pre-LayerNorm).

Pre-LN applies LayerNorm to the input of each sublayer, whereas Post-LN applies it after the residual connection. The architecture proposed in the paper is Post-LN; however, the official implementation has since been changed to the Pre-LN version. Experimental results show that the Pre-LN Transformer converges faster, does not even need learning-rate warm-up, and is less sensitive to hyperparameters. For more details on the differences between the two, see the paper On Layer Normalization in the Transformer Architecture.
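The difference between the two variants comes down to where LayerNorm sits relative to the residual connection. The following is a minimal sketch, not this repository's code: the class name SublayerConnection and the toy feed-forward block are assumptions made for illustration, and only the flag name pre_lnorm is taken from the Usage section.

```python
import torch
import torch.nn as nn


class SublayerConnection(nn.Module):
    """Residual wrapper around one sublayer (attention or feed-forward).

    Hypothetical helper for illustration; not taken from this repository.
    """

    def __init__(self, d_model: int, dropout: float = 0.1, pre_lnorm: bool = False):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.pre_lnorm = pre_lnorm

    def forward(self, x, sublayer):
        if self.pre_lnorm:
            # Pre-LN: normalize the sublayer input; the residual path is left untouched.
            return x + self.dropout(sublayer(self.norm(x)))
        # Post-LN: add the residual first, then normalize the sum.
        return self.norm(x + self.dropout(sublayer(x)))


d_model = 512
# Toy position-wise feed-forward sublayer (sizes match the Evaluation settings).
feed_forward = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
x = torch.randn(2, 10, d_model)  # (batch, sequence length, hidden size)

post_ln = SublayerConnection(d_model, pre_lnorm=False)
pre_ln = SublayerConnection(d_model, pre_lnorm=True)
print(post_ln(x, feed_forward).shape)
print(pre_ln(x, feed_forward).shape)
```

Because the Pre-LN output is not normalized, Pre-LN stacks usually apply one extra LayerNorm after the last layer. That step is omitted here.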

A STAR would be so nice if you like it!

Dataset

The small English-German dataset from the WMT 2016 multimodal task (Multi30k), loaded through torchtext.

Prerequisites

Python3
PyTorch >= 1.2.0
torchtext
spacy
nltk
tqdm

Implementation Notes

Beam search is not supported.
Label smoothing is not implemented.
BPE is not used.

Usage

Run transformer.ipynb to download the dataset and train the model.
Set the pre_lnorm flag to choose between Pre-LN and Post-LN.

Evaluation

Parameter settings
- hidden size: 512
- feed forward size: 2048
- number of heads: 8
- number of layers: 6
- warm-up: 2000
- batch size: 128
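To show how the warm-up setting interacts with the hidden size, here is a sketch of the learning-rate schedule from Attention is All You Need. It assumes that "warm-up: 2000" means 2000 optimizer steps and that training follows this schedule; the notebook may do something different. The names CONFIG and noam_lr are hypothetical.

```python
# Settings copied from the list above; the dictionary keys are illustrative names.
CONFIG = {
    "hidden_size": 512,
    "feed_forward_size": 2048,
    "num_heads": 8,
    "num_layers": 6,
    "warmup_steps": 2000,  # assumed to be counted in optimizer steps
    "batch_size": 128,
}


def noam_lr(step: int, d_model: int = 512, warmup_steps: int = 2000) -> float:
    """Learning rate from Attention is All You Need.

    Grows linearly for warmup_steps steps, then decays as 1 / sqrt(step).
    """
    step = max(step, 1)  # avoid 0 ** -0.5 on the first call
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


for step in (1, 500, 2000, 8000):
    lr = noam_lr(step, CONFIG["hidden_size"], CONFIG["warmup_steps"])
    print(f"step {step:>5}: lr = {lr:.6f}")
```

The rate peaks at step == warmup_steps, where both arguments of min are equal. In PyTorch, a function like this can be plugged into torch.optim.lr_scheduler.LambdaLR with a base learning rate of 1.0.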

Generated Examples

Here's an example from the test data:

source
- eine frau verwendet eine bohrmaschine während ein mann sie fotografiert .
gold
- a woman uses a drill while another man takes her picture .
inference
- a woman uses an electric drill as a man takes a picture .

TODO

Label smoothing
Attention visualization


Owner

Jared Wang
