[NeurIPS 2021] Galerkin Transformer: a linear attention without softmax

License: MIT | Python 3.8 | PyTorch 1.9

Summary

Introduction

The new attention operator (for the encoder) is simply Q(K^TV), or the quadratic complexity one (QK^T)V.

  • No softmax, or the approximation thereof, at all.
  • The two latent representations that are multiplied together first are layer-normalized, similar to the Gram-Schmidt process, in which one divides by the squared norm of the basis. Q and K are layer-normalized in the Fourier-type attention (every position attends to every other position), and K and V in the Galerkin-type attention (every basis attends to every other basis). No layer normalization is applied afterward.
  • Some other components are tweaked according to our Hilbertian interpretation of attention.

Overall, this is called a scale-preserving simple attention. For the full operator learner, the feature extractor is a simple linear layer or an interpolation-based CNN. The decoder is a real-parameter re-implementation of the spectral convolution from the Fourier Neural Operator (FNO) of Li et al 2020, the best operator learner to date, if the target is smooth, or just a pointwise FFN otherwise. The resulting network is extremely powerful in learning PDE-related operators (energy decay, inverse coefficient identification).
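The two attention variants described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the repository's implementation: it assumes the learned projections have already produced Q, K, V, uses a layer normalization without learnable scale and shift, and assumes a 1/n scaling by the sequence length. The function names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row over the feature axis (no learnable scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def fourier_attention(q, k, v):
    """(Q~ K~^T) V / n: Q and K are normalized; forms an n x n matrix."""
    n = q.shape[0]
    return (layer_norm(q) @ layer_norm(k).T) @ v / n

def galerkin_attention(q, k, v):
    """Q (K~^T V~) / n: K and V are normalized; forms only a d x d matrix."""
    n = q.shape[0]
    return q @ (layer_norm(k).T @ layer_norm(v)) / n

rng = np.random.default_rng(0)
n, d = 512, 16  # illustrative sequence length and head dimension
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

z_fourier = fourier_attention(q, k, v)
z_galerkin = galerkin_attention(q, k, v)
print(z_fourier.shape, z_galerkin.shape)
```

Both outputs have shape (n, d). The difference lies in the intermediate product: the Fourier-type variant costs on the order of n^2 d operations, while the Galerkin-type variant costs on the order of n d^2, which is what makes it linear in the sequence length.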

Hilbertian framework to analyze linear attention

Even though everyone is Transformer'ing, the mathematics behind the attention mechanism is not well understood. We show that the Galerkin-type attention (a linear attention without softmax) has an approximation capacity on par with a Petrov-Galerkin projection under a Hilbertian setup. The analysis uses what is commonly known as a "mixed method" in the finite element analysis community, where it is used to solve fluid and electromagnetics problems. Unlike in finite element methods, the approximation in an attention-based operator learner is not tied to the discretization, in that:

  1. The latent representation is interpreted "column-wise" (each column represents a basis), as opposed to the conventional "row-wise"/"position-wise"/"word-wise" interpretation of attention in NLP.
  2. The dimensions of the approximation spaces are not tied to the geometry as in traditional finite element analysis (or finite difference, spectral methods, radial basis, etc.).
  3. The approximation spaces are dynamically updated by the nonlinear universal approximator due to the presence of the positional encodings, which determine the topology of the approximation space.

For details please refer to: https://arxiv.org/abs/2105.14995

@Misc{Cao:2021transformer,
  author        = {Shuhao Cao},
  title         = {Choose a Transformer: Fourier or Galerkin},
  year          = {2021},
  archiveprefix = {arXiv},
  eprint        = {2105.14995},
  primaryclass  = {cs.CL},
  url           = {https://arxiv.org/abs/2105.14995},
}

Install

Requirements

(Updated Jun 17 2021) The PyTorch requirement is updated to 1.9.0 because the batch_first argument introduced in that release conforms with our pipeline.

This package can be cloned locally and used with the following requirements:

git clone https://github.com/scaomath/galerkin-transformer.git
cd galerkin-transformer
python3 -m pip install -r requirements.txt

The requirements are:

seaborn==0.11.1
torchinfo==0.0.8
numpy==1.20.2
torch==1.9.0
plotly==4.14.3
scipy==1.6.2
psutil==5.8.0
matplotlib==3.3.4
tqdm==4.56.0
PyYAML==5.4.1

If interactive mode is to be used, please install

jupyterthemes==0.20.0
ipython==7.23.1

Installing using pip

This package can be installed using pip.

python3 -m pip install galerkin-transformer

Example usage of the Simple Fourier/Galerkin Transformer encoder layers:

import copy

import torch
from torch import nn

from galerkin_transformer.model import *

encoder_layer = FourierTransformerEncoderLayer(
                 d_model=128,
                 pos_dim=1,
                 n_head=4,
                 dim_feedforward=512,
                 attention_type='galerkin',
                 layer_norm=False,
                 attn_norm=True,
                 norm_type='layer',
                 dropout=0.05)
encoder_layers = nn.ModuleList([copy.deepcopy(encoder_layer) for _ in range(6)])
x = torch.randn(8, 8192, 128) # embedding
pos = torch.arange(0, 8192).unsqueeze(-1) # Euclidean coordinates
pos = pos.repeat(8, 1, 1)
for layer in encoder_layers:
    x = layer(x, pos)

Data

The data is courtesy of Zongyi Li (Caltech) under the MIT license. Download the following data from here:

burgers_data_R10.mat
piececonst_r421_N1024_smooth1.mat
piececonst_r421_N1024_smooth2.mat

The repo has a semi-environment variable $DATA_PATH set in utils_ft.py. If you have a system-wide environment variable named DATA_PATH, please put the data in that folder. Otherwise, please unzip the Burgers and Darcy flow problem files into the ./data folder.
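The lookup order described above can be sketched as follows. This is only an illustration of the stated behavior, not the code in utils_ft.py; the helper name resolve_data_path is hypothetical.

```python
import os

def resolve_data_path(default="./data"):
    """Return $DATA_PATH if it is set and non-empty, otherwise the local default."""
    return os.environ.get("DATA_PATH") or default

data_path = resolve_data_path()
burgers_file = os.path.join(data_path, "burgers_data_R10.mat")
print(burgers_file)
```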

Examples

All examples learn PDE-related operators. The settings can be found in config.yml. To fully reproduce our results, please refer to the training scripts for all the possible args.

By default, the evaluation is performed on the last 100 samples in the test dataset, as in the code in the FNO repo. All trainers use the 1cycle scheduler in PyTorch for 100 epochs. Every example has a --seed $SEED argument, and the default seed is 1127802. Again, if you have a system-wide environment variable named SEED, the code will use that seed instead.
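One way to picture the seed handling is the sketch below. It assumes that the SEED environment variable replaces the built-in default and that an explicit --seed on the command line still takes precedence; the actual precedence in the training scripts may differ, and get_seed is a hypothetical helper.

```python
import argparse
import os
import random

DEFAULT_SEED = 1127802  # default seed quoted in the text

def get_seed(argv=None):
    """Use $SEED as the default when set; an explicit --seed wins (assumed)."""
    default = int(os.environ.get("SEED", DEFAULT_SEED))
    parser = argparse.ArgumentParser()
    parser.add_argument("--seed", type=int, default=default)
    args, _ = parser.parse_known_args(argv)
    return args.seed

seed = get_seed([])  # empty argument list: fall back to $SEED or the default
random.seed(seed)
```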

A caveat for Darcy problems

Since nn.functional.interpolate is used in the Darcy examples, a fixed seed may still yield different results in each training cycle on GPU according to the PyTorch documentation, but we have verified that the variance is negligible. Some example set-ups are as follows.

Example 1: Burgers equation

The baseline benchmark ex1_burgers.py: the evaluation relative error is about 1e-3 with a simple pointwise forward expansion feature extractor. The input is the initial condition of a viscous Burgers' equation on a discrete grid, and the output is an approximation to the solution marched to time $1$. The initial data are generated using a Gaussian random field (GRF), and the data in the validation set are not in the training set.
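For readers unfamiliar with the metric, a relative error of this kind can be computed as in the sketch below. It assumes a plain relative L2 norm on a uniform grid, averaged over samples; the repository's own metric may weight or reduce differently, and relative_l2_error is a hypothetical name.

```python
import numpy as np

def relative_l2_error(pred, target):
    """Mean over samples of ||pred - target||_2 / ||target||_2."""
    pred = pred.reshape(pred.shape[0], -1)
    target = target.reshape(target.shape[0], -1)
    diff_norm = np.linalg.norm(pred - target, axis=1)
    target_norm = np.linalg.norm(target, axis=1)
    return float(np.mean(diff_norm / target_norm))

x = np.linspace(0.0, 1.0, 2048)
target = np.sin(2 * np.pi * x)[None, :]  # one sample on a 2048-point grid
pred = target * (1.0 + 1e-3)             # uniform 0.1% perturbation
print(relative_l2_error(pred, target))   # approximately 1e-3
```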

Default benchmark on a 2048 grid using a Fourier Transformer, with 4 Fourier-type attention encoder layers as the encoder and 2 spectral convolution layers from Li et al 2020 as the decoder (to reduce overfitting, we decrease the dmodel of the spectral conv from the original 64 to 48):

python ex1_burgers.py

For more choices of arguments, please refer to Example 1 in models.

Example 2: Interface Darcy's flow

The baseline benchmark ex2_darcy.py: the evaluation relative error is about 8e-3 to 1e-2 with a 3-level interpolation-based CNN (CiNN) feature extractor. The coarse grid latent representation is sent to the attention layers. The operator input is a discontinuous coefficient with a random interface sampled on a discrete grid, and the output is a finite difference approximation to the solution restricted to the sampled grid from a fine 421x421 grid. The coefficients in the validation set are not in the training set.

Default benchmark on a 141x141 grid using the Galerkin Transformer, 6 Galerkin-type attention layers with d_model=128 and nhead=4 as the encoder, and 2 spectral conv layers from Li et al 2020 as the decoder. There is a small dropout 5e-2 in the attention layer as well as in the feature extraction layer:

python ex2_darcy.py

For a GPU with less memory or a CPU, please use the 85x85 fine grid, 29x29 coarse grid setting:

python ex2_darcy.py --subsample-attn 15 --subsample-nodes 5 --attention-type 'galerkin' --xavier-init 0.01 --diagonal-weight 0.01
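The grid sizes quoted above are consistent with taking every k-th node of the 421x421 data with both endpoints kept. The short sketch below checks that arithmetic. It assumes --subsample-nodes and --subsample-attn act as such strides, and that the default 141x141 grid corresponds to a stride of 3; subsampled_size is a hypothetical helper.

```python
def subsampled_size(fine_size, stride):
    """Points kept per axis when taking every `stride`-th node, endpoints included."""
    return (fine_size - 1) // stride + 1

FINE = 421  # per-axis resolution of the original Darcy data

for stride in (3, 5, 15):
    print(stride, subsampled_size(FINE, stride))
# 3 141
# 5 85
# 15 29
```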

For more choices of arguments, please refer to Example 2 in models.

Example 3: Inverse coefficient identification for interface Darcy's flow

Example 3 is an inverse interface coefficient identification for Darcy flow based on the same dataset used in Example 2. However, in this example, the input and the target are reversed, i.e., the target is the interface coefficient with a random geometry, and the input is the finite difference approximation to the PDE problem, together with an optional noise added to the input to simulate measurement errors. Because the interpolation operator has no approximation property for nonsmooth functions, the coefficient cannot be resolved at the input resolution, so the target is sampled at a lower resolution than the input.

Evaluation input data with no noise

Evaluation input data with 10% noise fed to the model

True target (diffusion coefficient with a sharp interface)

Reconstructed target

The baseline benchmark ex3_darcy_inv.py: the evaluation relative error is about 1.5e-2 to 2e-2 without noise, 2.5e-2 with 1% noise, and 7e-2 to 8e-2 with 10% noise in both train and test. A model trained on clean data does not generalize well to noisy test data, so training with a reasonable amount of noise is recommended.

Default benchmark is on a 141x141 fine grid input and a 36x36 coarse grid coefficient output. The model is the Galerkin Transformer with 6 stacked Galerkin-type attention layers (d_model=192, nhead=4) and a simple pointwise feed-forward neural network that maps the attention output back to the desired dimension. There is a small dropout (5e-2) in every key component of the network. The noise is added to the normalized input, so 0.01 noise means 1%, and 0.1 means 10%. By default, 1% noise is added.

python ex3_darcy_inv.py --noise 0.01
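The noise convention described above can be illustrated as follows. This sketch assumes a global zero-mean, unit-variance normalization and additive Gaussian noise. The normalizer and noise model used in the repository are not spelled out here, so both are assumptions, and add_relative_noise is a hypothetical helper.

```python
import numpy as np

def add_relative_noise(x, noise, rng):
    """Normalize x to zero mean and unit variance, then add Gaussian noise of std `noise`."""
    x_normalized = (x - x.mean()) / x.std()
    return x_normalized + noise * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
u = rng.standard_normal((141, 141))  # stand-in for one solution sample
u_noisy = add_relative_noise(u, noise=0.01, rng=rng)
print(u_noisy.shape)
```

Because the normalized input has unit standard deviation, a noise level of 0.01 perturbs it by about 1% of its typical magnitude.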

For more choices of arguments, please refer to Example 3 in models.

Evaluation notebooks

Please download the pretrained model's .pt files from Releases and put them in the ./models folder.

Memory and speed profiling using autograd.profiler

Using CUDA, the Fourier Transformer features an over 40% reduction in self_cuda_memory_usage versus the standard softmax-normalized transformers, and the Galerkin Transformer's backpropagation is 20% to 100% faster than that of the standard linearized transformers. If no GPU is available, please enable the --no-cuda switch.

Example 1 memory profile of a model with a hidden dimension of 96 and an input sequence length of 8192. Compare the memory usage of the Fourier Transformer with that of the softmax-normalized one:

python ex1_memory_profile.py --batch-size 4 --seq-len 8192 --dmodel 96 --attention-type 'softmax' 'fourier'

Compare the backpropagation time usage of the Galerkin transformer versus the same net, but with Galerkin-type simple attention replaced by the standard linearized attention.

python ex1_memory_profile.py --batch-size 4 --seq-len 8192 --dmodel 96 --num-iter 100 --attention-type 'linear' 'galerkin'

Encoder layer wrapper profiling: profile a wrapper with 10 encoder layers in a model for operators defined on functions whose domain is isomorphic to a 2D Euclidean space. Example:

python encoder_memory_profile.py --batch-size 4 --dmodel 128 --num-layers 6 -ndim 2

Please refer to the memory profile section in models for more detailed profiling in each example.

License

This software is distributed under the MIT license, which roughly means that you can use it however you want and for whatever reason you want. All the information regarding support, copyright and the license can be found in the LICENSE file.

Acknowledgement

The hardware to perform this work is provided by Andromeda Saving Fund. This work was supported in part by the National Science Foundation under grant DMS-1913080, and no additional revenues are related to this work. We would like to thank Dr. Long Chen (Univ of California Irvine) for the inspiration of and encouragement on the initial conceiving of this paper, as well as numerous pieces of constructive advice on revising this paper, not to mention his persistent dedication to making publicly available tutorials on writing beautiful vectorized code. We would like to thank Dr. Ari Stern (Washington Univ in St. Louis) for the help with the relocation during the COVID-19 pandemic. We would like to thank Dr. Ruchi Guo (Univ of California Irvine) and Dr. Yuanzhe Xi (Emory) for the invaluable feedback on the choice of the numerical experiments. We would like to thank the Kaggle community, including but not limited to Jean-François Puget and Murakami Akira for sharing a simple Graph Transformer in Tensorflow, and Cher Keng Heng for sharing a Graph Transformer in PyTorch. We would like to thank [email protected], OpenVaccine, and Eterna for hosting the COVID-19 mRNA Vaccine competition and Deng Lab (Univ of Georgia) for collaborating in this competition. We would like to thank CHAMPS (Chemistry and Mathematics in Phase Space) for hosting the J-coupling quantum chemistry competition and Corey Levinson (Eligo Energy, LLC) for collaborating in this competition. We would like to thank Zongyi Li (Caltech) for sharing some early dev code in the updated PyTorch torch.fft interface. We would like to thank Ziteng Pang (Univ of Michigan) for updating us with various references on Transformers. We would like to thank Joel Schlosser for incorporating our change to the PyTorch transformer submodule to simplify our testing pipeline.
We are grateful to the PyTorch community for selflessly sharing code, including Phil Wang and the Harvard NLP group Klein et al. (2017). We would like to thank the chebfun Driscoll et al. (2014) for integrating powerful tools into a simple interface to solve PDEs. We would like to thank Dr. Yannic Kilcher and Dr. Hung-yi Lee (National Taiwan Univ) for frequently covering the newest research on Transformers in video formats. We would also like to thank the Python community (Van Rossum and Drake (2009); Oliphant (2007)) for sharing and developing the tools that enabled this work, including PyTorch Paszke et al. (2017), NumPy Harris et al. (2020), SciPy Virtanen et al. (2020), Seaborn Waskom (2021), Plotly Inc. (2015), Matplotlib Hunter (2007), and the Python team for Visual Studio Code. For details please refer to the documents of every function that is not built from the ground up in our open-source software library.

Comments
  • Package path problem

    • The package installs successfully with the command "pip install galerkin-transformer":
    Installing collected packages: galerkin-transformer
    Successfully installed galerkin-transformer-0.1.1

    • However, calling "from galerkin_transformer.model import *" raises the following error:
    Traceback (most recent call last):
      File "xxx/lib/python3.8/site-packages/galerkin_transformer/model.py", line 2, in <module>
        from libs.layers import *
    ModuleNotFoundError: No module named 'libs'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "model_galerkin_transformer.py", line 7, in <module>
        from galerkin_transformer.model import *
      File "xxx/lib/python3.8/site-packages/galerkin_transformer/model.py", line 5, in <module>
        from layers import *
    ModuleNotFoundError: No module named 'layers'

    • There seems to be a problem with the package's path management. Could you please debug it (or did I get something wrong somewhere)?
    opened by huangxiang360729 2
  • A numerical experiment problem of darcy flow experiment.

    Hi Shuhao. Great work. I am running your ex2_darcy.py, and the L2 loss is about 0.00914. I tried many times and the results are similar, never reaching the 0.00847 shown in your paper. The fine resolution is 211 and the coarse resolution is 61. Is this normal?

    Thanks.

    opened by cesare4444 2
  • Question about calculation of elem in DarcyDataset

    Nice job. Is there any reference for calculating the elem matrix in DarcyDataset (libs/ft.py, lines 650-661)? What needs to be done to generalize from 2D to 3D?

    opened by liyang-7 2
  • More than one channel

    Hi,

    Thank you for this great contribution!

    I was just wondering what your thoughts were regarding expanding the input channels so that the models can accept multiple (x,y,z). Furthermore, the new FNO implementation can accommodate different heights and widths; could these changes be merged into this repo?

    opened by NicolaiLassen 1
  • How does parameter initialization influence performance?

    Hi Cao, I notice the parameter initialization in your code.

    def _reset_parameters(self):
        for param in self.linears.parameters():
            if param.ndim > 1:
                xavier_uniform_(param, gain=self.xavier_init)
                if self.diagonal_weight > 0.0:
                    param.data += self.diagonal_weight * \
                        torch.diag(torch.ones(
                            param.size(-1), dtype=torch.float))
                if self.symmetric_init:
                    param.data += param.data.T
                    # param.data /= 2.0
            else:
                constant_(param, 0)
    

    Does it influence the performance greatly? And why do you initialize the linear layers like this? Thank you very much!

    opened by WangChen100 1
Owner
Shuhao Cao
An amateur computational mathematician.