
Detailed explanation of VIT transformer

2022-08-09 20:44:00 The romance of cherry blossoms

1. VIT overall structure


Build a patch sequence for image data

For an image, divide it into 9 windows (patches). To feed these windows to the Transformer, each one must be pulled into a vector: a 10*10*3 window, for example, becomes a 300-dimensional vector.
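
As a concrete illustration, here is a minimal NumPy sketch of this patch construction; the 30*30*3 toy image size is an assumption chosen so that it splits into exactly 9 windows of 10*10*3 = 300 values each.

```python
import numpy as np

# Toy image that splits into a 3x3 grid of 10x10 windows (patches).
img = np.random.rand(30, 30, 3)                    # H x W x C
P = 10                                             # window side length
H, W, C = img.shape

# Cut the image into windows, then flatten each window into one vector.
patches = img.reshape(H // P, P, W // P, P, C)     # (3, 10, 3, 10, 3)
patches = patches.transpose(0, 2, 1, 3, 4)         # (3, 3, 10, 10, 3): grid dims first
patches = patches.reshape(-1, P * P * C)           # (9, 300): 9 windows, 300-d each
print(patches.shape)                               # (9, 300)
```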

Position encoding:

    There are two ways of position encoding. The first is one-dimensional encoding: the windows are simply numbered 1, 2, 3, 4, 5, 6, 7, 8, 9 in order. The second is two-dimensional encoding, which records the (row, column) coordinates of each image window.

Finally, a fully-connected layer maps the image (patch) encoding together with the positional encoding into a representation that is easier for the model to compute with.
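
A small sketch of these two numbering schemes and of the fully-connected mapping, continuing the toy example above; the embedding width D = 64 and the random matrices are illustrative assumptions, not values from the paper.

```python
import numpy as np

# The two position-numbering schemes for a 3x3 grid of windows.
grid = 3
ids_1d = np.arange(1, grid * grid + 1)                        # 1D: 1, 2, ..., 9 in order
ids_2d = [(r, c) for r in range(grid) for c in range(grid)]   # 2D: (row, col) of each window

# Fully-connected mapping of each 300-d window, plus a learnable 1D position
# embedding (random here purely for illustration).
D = 64
patches = np.random.rand(grid * grid, 300)     # stand-in for the 9 flattened windows
E = np.random.randn(300, D) * 0.02             # fully-connected projection
pos_embed = np.random.randn(grid * grid, D) * 0.02
tokens = patches @ E + pos_embed               # (9, D) tokens fed to the Transformer encoder
```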

So, what does the 0 token (the class token) in the architecture diagram do?

We generally add the 0 token for image classification; image segmentation and object detection generally do not need it. The 0 patch is mainly used for feature integration: it aggregates the feature vectors of all the windows. Since it is not tied to any particular window, the 0 patch can be added at any position.
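
A minimal sketch of prepending this 0 token, continuing the assumptions above (D = 64, 9 windows); in practice the token is a learnable parameter, and after the encoder only its output is fed to the classification head.

```python
import numpy as np

D = 64
tokens = np.random.rand(9, D)                      # the 9 window tokens (stand-in)
cls_token = np.zeros((1, D))                       # the "0 patch"; learnable in practice
seq = np.concatenate([cls_token, tokens], axis=0)  # (N + 1, D) = (10, D)

# ... Transformer encoder layers run on `seq` ...

cls_out = seq[0]                                   # aggregated feature used for classification
```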

2. Detailed explanation of the formula

The input patch, a flattened vector of dimension (P*P)*C, is mapped through the fully-connected projection E (a ((P*P)*C) x D matrix) to a D-dimensional embedding, i.e. a feature representation of the patch. N+1 means that one additional patch is prepended to the N patch tokens: the classification token, i.e. the 0 patch mentioned above. The positional encodings are then added to these token embeddings.
MSA (multi-head self-attention) is applied with a residual connection.
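
Written out, these are the formulas from the ViT paper that the description above refers to:

```latex
\begin{aligned}
\mathbf{z}_0 &= [\mathbf{x}_{\text{class}};\ \mathbf{x}_p^1\mathbf{E};\ \mathbf{x}_p^2\mathbf{E};\ \dots;\ \mathbf{x}_p^N\mathbf{E}] + \mathbf{E}_{pos},
  \qquad \mathbf{E}\in\mathbb{R}^{(P^2 C)\times D},\ \mathbf{E}_{pos}\in\mathbb{R}^{(N+1)\times D} \\
\mathbf{z}'_{\ell} &= \operatorname{MSA}(\operatorname{LN}(\mathbf{z}_{\ell-1})) + \mathbf{z}_{\ell-1},
  \qquad \ell = 1,\dots,L \\
\mathbf{z}_{\ell} &= \operatorname{MLP}(\operatorname{LN}(\mathbf{z}'_{\ell})) + \mathbf{z}'_{\ell},
  \qquad \ell = 1,\dots,L \\
\mathbf{y} &= \operatorname{LN}(\mathbf{z}_L^0)
\end{aligned}
```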

3. The receptive field of multi-head attention

As shown in the figure, the vertical axis represents the attention distance, which is equivalent to the receptive field of a convolution. When there are only a few heads, some receptive fields are relatively small while others are already large; as the number of heads increases, the receptive field is generally large, which shows that the Transformer extracts global features.
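
For reference, a rough sketch of how such an attention distance could be computed for one head; the 14*14 patch grid and the random attention matrix are assumptions used purely for illustration.

```python
import numpy as np

# For each query patch, average the spatial distances to all key patches weighted by
# the attention; the mean over queries is the head's attention distance (the quantity
# on the figure's vertical axis).
grid = 14
coords = np.array([(r, c) for r in range(grid) for c in range(grid)], dtype=float)
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)  # (196, 196)

attn = np.random.rand(grid * grid, grid * grid)
attn /= attn.sum(axis=1, keepdims=True)           # rows sum to 1, like a softmax output

attention_distance = (attn * dist).sum(axis=1).mean()
print(attention_distance)
```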


4. Position encoding

    Conclusion: positional encoding is useful, but the particular choice of encoding has little effect, so the simple option is fine. 2D encoding (computing the row and column encodings separately and then summing them) is still no better than 1D, and adding a shared positional encoding to every layer is not very useful either.

    Of course, this is a classification task, so positional encoding may not have much effect here anyway.

5. Experimental results (/14 indicates the side length of the patch, as in ViT-H/14)

6. TNT: Transformer in Transformer

     VIT only models the patches themselves, ignoring the smaller details inside each patch.


The external transformer divides the original image into windows and generates a feature vector for each window through image encoding plus position encoding.

The internal transformer further splits each window of the external transformer into multiple superpixels and reorganizes them into new vectors. For example, the external transformer splits the image into 16*16*3 windows, and the internal transformer splits each of these again into a 4*4 grid of superpixels, so the small window becomes 4*4*48; in this way each patch integrates information from multiple channels. The new vectors then go through a fully-connected layer that changes the output feature size; at this point the combined internal vector has the same size as the patch encoding, and the internal vector is added to the external vector, as sketched below.
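
A rough NumPy sketch of this regrouping, under the assumptions that each superpixel gathers a 4*4*3 block of pixels (48 values) and that the external embedding width is D = 64 (an illustrative choice).

```python
import numpy as np

D = 64
window = np.random.rand(16, 16, 3)                 # one external window (16x16x3)

# Regroup the window into a 4x4 grid of superpixels, each of dimension 4*4*3 = 48.
inner = window.reshape(4, 4, 4, 4, 3)              # (grid_h, 4, grid_w, 4, C)
inner = inner.transpose(0, 2, 1, 3, 4)             # (4, 4, 4, 4, 3): grid dims first
inner = inner.reshape(16, 48)                      # 4*4 superpixels, 48-d each

# Fully-connected layer that brings the combined internal vector to the same size D
# as the external patch encoding, so the two can be added.
W_inner = np.random.randn(16 * 48, D) * 0.02
inner_vec = inner.reshape(-1) @ W_inner            # (D,)

outer_vec = np.random.rand(D)                      # external patch embedding (stand-in)
fused = outer_vec + inner_vec                      # internal and external vectors are added
```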

Visualization of TNT's PatchEmbedding

The blue dots represent the features extracted by TNT. From the visualization it can be seen that the blue-dot features are more spread out and have larger variance, which makes them easier to separate: the features are more distinctive and their distribution more diverse.

Experimental Results

Adding positional encoding to both the internal and external transformers gives the best results.
