A general U-shaped transformer for image restoration
2022-04-23 06:00:00 【umbrellalalalala】
The paper was submitted to arXiv on June 6, 2021. Eformer (ICCV 2021) is an improvement built on top of Uformer, so the paper seems worth reading; here is a brief record.
Also published simultaneously on my Zhihu account of the same name.
1. Architecture design
The overall structure is shown in the figure from the paper.
Compared with an ordinary UNet, the difference lies in the LeWin Transformer block, which is also the main innovation of this work.
The so-called LeWin Transformer (locally-enhanced window Transformer) consists of W-MSA and LeFF (their composition is written out below):
- W-MSA: non-overlapping window-based multi-head self-attention, whose purpose is to reduce computational overhead (a traditional Transformer computes self-attention globally, which W-MSA does not);
- LeFF: a traditional Transformer uses a plain feed-forward network, which cannot make good use of local context; LeFF is adopted to capture local information.
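If I recall the paper correctly, the two sub-modules are composed in the usual pre-norm residual fashion inside each LeWin Transformer block:

$$
X' = \text{W-MSA}(\text{LN}(X)) + X, \qquad X'' = \text{LeFF}(\text{LN}(X')) + X'
$$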
Two innovations:
- the LeWin Transformer is proposed and introduced into a UNet;
- three skip-connection variants are explored.
2. Main module details
2.1 W-MSA
This is the biggest innovation of this work. (I was reminded that the Swin Transformer already has this.)
First, the C×H×W feature map X is split into N non-overlapping C×M×M patches (windows), where N = HW/M². Each patch is treated as M×M C-dimensional vectors, and these vectors are fed into W-MSA. In other words, X is split into N non-overlapping windows, and self-attention is computed within each window.
The authors note that although self-attention is computed only within one window, in the encoder stage of the UNet the feature maps are progressively downsampled, so computing self-attention on a window at a low-resolution stage corresponds to computing self-attention over a much larger receptive field at the resolution before downsampling.
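A minimal PyTorch sketch of this idea (my own illustration, not the released Uformer code; the relative position bias discussed next is omitted, and `nn.MultiheadAttention` stands in for the paper's attention module): split the feature map into M×M windows, run multi-head self-attention inside each window, then stitch the windows back.

```python
import torch
import torch.nn as nn

def window_partition(x, M):
    # (B, H, W, C) -> (B * H/M * W/M, M*M, C); H and W are assumed divisible by M
    B, H, W, C = x.shape
    x = x.reshape(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

def window_reverse(windows, M, H, W):
    # inverse of window_partition: (B * num_windows, M*M, C) -> (B, H, W, C)
    B = windows.shape[0] // ((H // M) * (W // M))
    x = windows.reshape(B, H // M, W // M, M, M, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

class WindowAttention(nn.Module):
    """Multi-head self-attention applied independently inside each M x M window."""
    def __init__(self, dim, num_heads, M):
        super().__init__()
        self.M = M
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                               # x: (B, H, W, C)
        B, H, W, C = x.shape
        windows = window_partition(x, self.M)           # (B * N, M*M, C)
        out, _ = self.attn(windows, windows, windows)   # attention within each window
        return window_reverse(out, self.M, H, W)        # (B, H, W, C)

# usage: a 32x32 feature map with C=64 and window size M=8
# gives 16 windows of 64 tokens each
x = torch.randn(1, 32, 32, 64)
y = WindowAttention(dim=64, num_heads=4, M=8)(x)
print(y.shape)  # torch.Size([1, 32, 32, 64])
```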
Relative position encoding is adopted, so the attention computation within each window can be expressed as:
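(The formula image is not reproduced here; reconstructing it from the Swin/Uformer formulation, with $B$ the learnable relative position bias:)

$$
\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + B\right) V
$$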
The references cited for this relative position encoding, [48] and [41], are:
[48] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155, 2018.
[41] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021.
2.2 LeFF
LeFF was introduced in "Incorporating Convolution Designs into Visual Transformers", whose Convolution-enhanced image Transformer (CeiT) contains this design.
The essence is to take the N tokens (vectors) output by the self-attention module, rearrange them into a $\sqrt{N} \times \sqrt{N}$ "image", and apply a depth-wise convolution to it. After looking at the diagram given by the CeiT authors and then the one given by the Uformer authors, the idea is not hard to understand:
After each linear layer / convolution layer, a GELU activation function is applied.
(A quick search shows that depth-wise convolution serves to reduce the number of parameters and speed up computation.)
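A minimal sketch of LeFF following the description above (my own reading, not the authors' released implementation; the hidden dimension `hidden_dim` is an assumed hyperparameter):

```python
import torch
import torch.nn as nn

class LeFF(nn.Module):
    """Linear -> GELU -> reshape tokens to sqrt(N) x sqrt(N) -> 3x3 depth-wise conv
    -> GELU -> flatten back -> Linear -> GELU."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.linear1 = nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU())
        # groups=hidden_dim makes the 3x3 convolution depth-wise
        self.dwconv = nn.Sequential(
            nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, padding=1, groups=hidden_dim),
            nn.GELU(),
        )
        self.linear2 = nn.Sequential(nn.Linear(hidden_dim, dim), nn.GELU())

    def forward(self, x):                                # x: (B, N, C), N a perfect square
        B, N, C = x.shape
        s = int(N ** 0.5)
        x = self.linear1(x)                              # (B, N, hidden)
        x = x.transpose(1, 2).reshape(B, -1, s, s)       # tokens -> sqrt(N) x sqrt(N) "image"
        x = self.dwconv(x)                               # local context via depth-wise conv
        x = x.reshape(B, -1, N).transpose(1, 2)          # back to tokens
        return self.linear2(x)                           # (B, N, C)

# usage: 64 tokens (an 8x8 window) with channel dimension 32
out = LeFF(dim=32, hidden_dim=128)(torch.randn(2, 64, 32))
print(out.shape)  # torch.Size([2, 64, 32])
```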
2.3 Three skip-connection variants
The UNet architecture has skip connections; in this work they pass the outputs of the encoder's Transformer blocks to the decoder. There are many ways these skip connections can convey the information, and the authors explore three:
- The first directly concatenates the encoder features;
- The second: each decoder stage has one upsampling layer and two Transformer blocks, where the first block uses self-attention and the second uses cross-attention;
- The third uses the concatenated information as the key and value of a cross-attention.
The authors find that the three perform similarly, with the first slightly better, so the first is used as Uformer's default setting (a rough sketch of this default variant follows).
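A rough sketch of the default concatenation-based variant (my own simplification with a made-up module name `ConcatSkipStage`; in the real model the fused features would then go through the decoder-stage LeWin Transformer blocks):

```python
import torch
import torch.nn as nn

class ConcatSkipStage(nn.Module):
    """Upsample decoder features, concatenate encoder features along channels,
    and project back to the stage's working dimension."""
    def __init__(self, dec_dim, enc_dim, out_dim):
        super().__init__()
        self.upsample = nn.ConvTranspose2d(dec_dim, dec_dim, kernel_size=2, stride=2)
        self.proj = nn.Conv2d(dec_dim + enc_dim, out_dim, kernel_size=1)

    def forward(self, dec_feat, enc_feat):     # (B, dec_dim, H/2, W/2), (B, enc_dim, H, W)
        up = self.upsample(dec_feat)           # (B, dec_dim, H, W)
        fused = torch.cat([up, enc_feat], dim=1)
        return self.proj(fused)                # then fed to the LeWin Transformer blocks

fused = ConcatSkipStage(dec_dim=128, enc_dim=64, out_dim=64)(
    torch.randn(1, 128, 16, 16), torch.randn(1, 64, 32, 32))
print(fused.shape)  # torch.Size([1, 64, 32, 32])
```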
That covers the details of the Uformer architecture design; I won't look closely at the other contents.
3. Computational cost
Since the W-MSA in the LeWin Transformer focuses on reducing computational overhead, it is natural to look at the algorithm's complexity:
Given a feature map X of size C×H×W, traditional global self-attention has complexity $O(H^2 W^2 C)$; splitting it into M×M patches and performing self-attention within each patch gives $O(\frac{HW}{M^2} \cdot M^4 C) = O(M^2 H W C)$, which reduces the complexity.
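As a quick sanity check (my own numbers, not from the paper), plugging in H = W = 128 and M = 8:

$$
\frac{O(H^2 W^2 C)}{O(M^2 H W C)} = \frac{HW}{M^2} = \frac{128 \times 128}{8^2} = 256,
$$

so window attention is roughly 256 times cheaper at that resolution, and the saving grows linearly with the spatial size HW.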
4. Experimental results
The authors ran experiments on denoising, deraining, and deblurring.