Detailed explanation of the ViT transformer
2022-08-09 20:44:00 【The romance of cherry blossoms】
1. ViT overall structure
Build a patch sequence for image data
The image is divided into a grid of windows (patches), say 9 of them. Each window is then flattened into a vector: a 10*10*3 window, for example, becomes a 300-dimensional vector.
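The flattening step can be sketched as follows (a minimal NumPy example; the 30*30*3 image size is illustrative, chosen so that it splits into exactly 9 windows of 10*10*3):

```python
import numpy as np

# Toy image: 30x30 pixels, 3 channels (sizes are illustrative).
img = np.random.randn(30, 30, 3)
patch = 10  # each window is 10x10 pixels

# Split into a 3x3 grid of non-overlapping windows and flatten each
# window into a 10*10*3 = 300-dimensional vector.
h, w, c = img.shape
grid = img.reshape(h // patch, patch, w // patch, patch, c)
patches = grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
print(patches.shape)  # (9, 300)
```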
Position encoding:
There are two ways to encode position. The first is one-dimensional encoding: the windows are simply numbered 1, 2, 3, 4, 5, 6, 7, 8, 9 in order. The second is two-dimensional encoding, which records the (row, column) coordinates of each window.
Finally, a fully-connected layer maps the flattened patch, combined with its positional encoding, to an embedding that is easier for the network to compute with.
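The projection-plus-position step can be sketched like this (a minimal example; the embedding size of 64 and the random weight matrices standing in for learned parameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, patch_dim, embed_dim = 9, 300, 64  # embed_dim is illustrative

patches = rng.standard_normal((num_patches, patch_dim))

# A fully-connected layer projects each flattened window to the embedding size.
# W stands in for a learned weight matrix.
W = rng.standard_normal((patch_dim, embed_dim)) * 0.02
tokens = patches @ W                         # (9, 64)

# 1-D position encoding: one learnable vector per window position 1..9,
# simply added to the patch embeddings.
pos_embed = rng.standard_normal((num_patches, embed_dim)) * 0.02
tokens = tokens + pos_embed
print(tokens.shape)  # (9, 64)
```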
So, what does the 0 token in the architecture diagram do?
The 0 token is generally added for image classification; image segmentation and object detection usually do not need it. Its role is feature aggregation: it integrates the feature vectors of all the windows. For that reason, the 0 token can in principle be inserted at any position in the sequence.
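Prepending the 0 token (often called the class token) is just a concatenation along the sequence axis, as in this sketch (the embedding size and random values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, embed_dim = 9, 64
tokens = rng.standard_normal((num_patches, embed_dim))

# The 0 token: a learnable vector prepended to the window sequence.
# After the transformer layers, its output aggregates information from
# every window and is fed to the classification head.
cls_token = rng.standard_normal((1, embed_dim)) * 0.02
sequence = np.concatenate([cls_token, tokens], axis=0)
print(sequence.shape)  # (10, 64): the 0 token plus 9 window tokens
```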
2. Detailed explanation of the formula
3. The receptive field of multi-head attention
As shown in the figure, the vertical axis is the attention distance, which is roughly analogous to the receptive field of a convolution. With few heads, the receptive field varies: some heads attend only locally while others already attend globally. As the number of heads increases, the receptive field is generally large, which shows that the Transformer extracts global features.
4. Position encoding
Conclusion: positional encoding helps, but which encoding is used matters little, so the simple option suffices. 2-D encoding (computing row and column encodings separately, then summing them) is actually no better than 1-D, and adding a shared position encoding to every layer is not very useful either.
Of course, this is a classification task, where positional encoding may simply not matter much.
5. Experimental results (the /14 suffix indicates the patch side length)
6. TNT: Transformer in Transformer
ViT only models whole patches, ignoring the finer details within each patch.
The outer transformer divides the original image into windows and produces a feature vector from the image encoding and position encoding.
The inner transformer further splits each outer window into multiple superpixels and reorganizes them into new vectors. For example, if the outer transformer splits the image into 16*16*3 windows, the inner transformer re-splits each window into 4*4 superpixels, giving small windows of size 4*4*48, so that each patch integrates information across multiple channels. A fully-connected layer then maps the combined inner vectors to the same size as the patch encoding, and the inner vector is added to the outer vector.
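The inner split and the final fusion can be sketched as follows (a minimal example using the text's 16*16*3 window and 4*4 superpixels; the outer embedding size of 64 and the random weight matrix standing in for a learned layer are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# One outer window of size 16x16x3, as in the text.
outer_patch = rng.standard_normal((16, 16, 3))
sub = 4  # superpixel side length

# Re-split the window into a 4x4 grid of superpixels; each superpixel
# flattens to 4*4*3 = 48 dimensions, i.e. 4*4 superpixels of 48 dims each.
grid = outer_patch.reshape(4, sub, 4, sub, 3).transpose(0, 2, 1, 3, 4)
superpixels = grid.reshape(-1, sub * sub * 3)  # (16, 48)

# Combine the inner vectors and map them with a fully-connected layer
# (W is a stand-in for learned weights) to the outer embedding size,
# then add the result to the outer patch token.
embed_dim = 64  # illustrative outer embedding size
W = rng.standard_normal((superpixels.size, embed_dim)) * 0.02
inner_vec = superpixels.reshape(-1) @ W   # (64,)
outer_token = rng.standard_normal(embed_dim)
fused = outer_token + inner_vec           # same size as the patch encoding
```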
Visualization of TNT's PatchEmbedding
The blue dots represent the features extracted by TNT. The visualization shows that these features are more spread out, with larger variance, which makes them easier to separate: the features are more distinctive and more diversely distributed.
Experimental Results
Adding positional encoding to both the inner and outer transformers gives the best results.