Sorting out and answering questions related to the Transformer
2022-04-23 15:27:00 【moletop】
RNN characteristics: given a sequence, the computation proceeds step by step from left to right. For a sentence this means word by word: for the t-th word the model computes a hidden state h_t, determined by the previous hidden state h_{t-1} and the current word itself. In this way, the historical information learned so far is carried forward through h_{t-1}, combined with the current word, and used to produce the output.
Problems: because the computation is sequential, 1. it is hard to parallelize, and 2. information learned early on gets lost. To avoid losing it you may need a large h_t, but a large h_t has to be stored at every time step, so the memory overhead becomes high.
——> Using a pure attention mechanism greatly improves parallelism.
- Consider replacing the recurrent network with a CNN: a CNN has difficulty modeling long sequences, because each convolution only looks at a small window. Taking pixels as an example, if two pixels are far apart, many convolution layers are needed before those two distant pixels are fused. The Transformer, by contrast, can see all positions at once, so it does not have this problem. But convolution has the advantage of multiple output channels, where each channel can be thought of as recognizing a different pattern. This is why multi-head attention is introduced: it simulates the effect of a CNN's multiple output channels. It can also be viewed as a multi-scale idea, letting the model learn in several different subspaces.
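The "multiple output channels" analogy above can be sketched in numpy. This is a minimal toy version, assuming identity projections in place of the learned W_Q/W_K/W_V matrices of a real Transformer: the model dimension is split into heads, each head attends independently, and the heads are concatenated back.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads):
    """Toy multi-head self-attention with identity projections:
    each head works in its own subspace, like a CNN output channel."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    # split d_model into heads: (num_heads, seq_len, d_head)
    heads = x.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    # scaled dot-product attention within each head
    scores = heads @ heads.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, L, L)
    out = softmax(scores) @ heads                                # (h, L, d_head)
    # concatenate heads back: (seq_len, d_model)
    return out.transpose(1, 0, 2).reshape(seq_len, d_model)

x = np.random.randn(5, 8)                 # 5 tokens, d_model = 8
y = multi_head_attention(x, num_heads=2)
print(y.shape)                            # (5, 8)
```

Each head sees every position at once (unlike a convolution window), which is the parallelism point made above.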
The left part: N = 6 identical layers. Each layer has two sub-layers: the first sub-layer is multi-head self-attention, and the second is essentially an MLP. Each sub-layer is wrapped in a residual connection, followed by a LayerNorm. So the output of each sub-layer is LayerNorm(x + Sublayer(x)).
Detail: because the input and output of a residual connection must have the same size, for simplicity every output dimension is set to 512. That is, every word, at every layer, is represented by a vector of length 512. This differs from a CNN or MLP, which typically shrink the spatial dimension while pulling up the channel dimension. It also makes the model simpler to tune: the only hyperparameters to adjust are this 512 and the N = 6 above.
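The LayerNorm(x + Sublayer(x)) pattern and the shared 512 dimension can be sketched directly; here is a minimal numpy version, with a stand-in lambda playing the role of the attention or MLP sub-layer:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's 512-dim vector to zero mean, unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def sublayer_connection(x, sublayer):
    # residual connection followed by LayerNorm: LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer(x))

d_model = 512
x = np.random.randn(10, d_model)                  # 10 tokens, each 512-dim
out = sublayer_connection(x, lambda h: h * 0.1)   # stand-in for attention/MLP
print(out.shape)                                  # (10, 512)
```

The residual addition is why input and output sizes must match, which is the reason every sub-layer keeps the same 512 dimension.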
Positional encoding: without an RNN, the Transformer by itself cannot capture order information; it cannot tell "I bit the dog" from "the dog bit me". The solution is to combine position information with the word vectors at the input, so that the model can learn word-order information.
One approach is to assign each time step a value between 0 and 1, where 0 marks the first word and 1 the last. Although simple, this brings problems: you cannot tell how many words fall within a given range. In other words, the step size differs between sentences of different lengths, so time-step differences carry no consistent meaning.
Another approach is to assign values linearly: 1 to the first word, 2 to the second, and so on. The problems here are that the values can become very large, and the model may encounter sentences longer than any seen in training. Moreover, the training set does not necessarily contain sentences of every length, so the model may never have seen samples at certain positions, which seriously hurts generalization.
In fact, the encoding is computed from sin and cos functions of different periods.
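A minimal numpy sketch of the sinusoidal encoding from "Attention Is All You Need": even dimensions use sin, odd dimensions use cos, and each dimension pair has a different period. Unlike the two schemes criticized above, the values stay bounded in [-1, 1] and extend to any sequence length.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding: PE(pos, 2i)   = sin(pos / 10000^(2i/d)),
                                       PE(pos, 2i+1) = cos(pos / 10000^(2i/d))."""
    pos = np.arange(max_len)[:, None]                   # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dims
    pe[:, 1::2] = np.cos(angles)                        # odd dims
    return pe

pe = positional_encoding(50, 512)
print(pe.shape)        # (50, 512) — one 512-dim vector per position,
                       # added to the word embeddings at the input
```

Because the values are bounded and defined for arbitrary positions, this avoids both the "very large values" and the "unseen lengths" problems mentioned above.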
The decoder has a third sub-layer beyond the encoder's two (also multi-head attention, also using a residual connection and LayerNorm), called masked multi-head attention (it provides Q for the next layer). The decoder is autoregressive: part of the input at the current step is the output of earlier steps, which means that at prediction time you must not see the outputs of later steps. But attention, by default, sees the complete input. To prevent this, a masked attention mechanism is used. This keeps the behavior consistent between training and prediction.
Two kinds of mask:
1. Padding mask: input sequences do not all have the same length. Sequences longer than the expected length are truncated; sequences shorter than it are padded with 0. The padded positions carry no meaning, and we do not want the attention mechanism to allocate any attention to them, so we add negative infinity at the padded positions. Since attention uses a softmax, positions set to negative infinity become 0 after the softmax.
2. Look-ahead mask (for prediction): the model cannot see the following outputs.
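The two masks above can both be sketched as additive score masks, with a large negative constant standing in for negative infinity (a common practical substitute):

```python
import numpy as np

NEG_INF = -1e9  # stands in for "negative infinity" before the softmax

def padding_mask(lengths, max_len):
    """Additive mask: 0 for real tokens, NEG_INF for padded positions."""
    pos = np.arange(max_len)[None, :]
    return np.where(pos < np.array(lengths)[:, None], 0.0, NEG_INF)

def causal_mask(size):
    """Upper-triangular NEG_INF mask: position i cannot attend to j > i."""
    return np.triu(np.full((size, size), NEG_INF), k=1)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# masked positions get zero attention weight after the softmax
scores = np.zeros((4, 4))                  # dummy uniform attention scores
attn = softmax(scores + causal_mask(4))
print(np.round(attn[1], 2))                # [0.5 0.5 0.  0. ]
```

Row 1 of the result shows the point: the second position attends only to positions 0 and 1, and the future positions are exactly 0, matching the softmax argument made above.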
First, a look at batch normalization.
Note: at the end, a scale factor and a shift factor (parameters learned by the model) are applied. This is to ensure the model's expressive power is not reduced by normalization. The upper-layer neurons may be learning hard, but no matter how their outputs change, they are readjusted to roughly the same fixed range before being handed to the lower-layer neurons. And not every sample benefits from normalization; some special samples lose their value after being normalized. On the other hand, nonlinear expressiveness matters: normalization maps almost all data into the unsaturated (near-linear) region of the activation function, leaving only the ability to transform linearly and thus reducing the network's expressive power. The learned scale and shift can move the data back from the linear region into the nonlinear region, restoring the model's expressiveness.
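The two steps described above, normalize then rescale with learned parameters (usually written γ and β), fit in a few lines of numpy; this is a sketch of the forward pass only, without the running statistics a real implementation keeps:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch axis, then apply the learned
    scale (gamma) and shift (beta) that restore expressive power."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # ~zero mean, unit variance
    return gamma * x_hat + beta               # learned rescale and shift

x = np.random.randn(32, 4) * 10 + 5           # batch of 32, 4 features, off-center
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0))                         # close to 0 in each feature
print(y.std(axis=0))                          # close to 1 in each feature
```

With gamma = 1 and beta = 0 the output is plain standardization; during training the model is free to learn other values and undo the normalization wherever that helps.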
Why is LayerNorm better?
Purpose of normalization: a whitening step, making the data approximately independent and identically distributed.
Batch normalization normalizes the same channel across different samples; layer normalization normalizes different channels within the same sample. For a feature map, the two simply cut in different directions, one horizontal and one vertical. Why BN is unsuitable for RNN/text networks: the objects being normalized (positions) come from different distributions. In a CNN, BN standardizes each channel over a batch, and the same channel across multiple training images probably comes from a similar distribution (for example, in images of a tree, the initial 3 channels are the 3 color channels, and they will have similar tree shapes and color depths). In an RNN, BN would standardize each position over a batch, but the same position across multiple sequences can hardly be said to come from a similar distribution (for example, movie reviews use all kinds of sentence patterns, so words at the same position hardly obey a similar distribution). Therefore it is hard for BN in an RNN to learn the right μ and σ. But if you normalize within a single sample itself, this problem does not arise.
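The "different cutting directions" can be made concrete on a 2-D (batch, features) tensor: BN reduces over axis 0, LN over axis 1. This is a minimal sketch without the learned γ/β parameters:

```python
import numpy as np

x = np.random.randn(8, 16)  # (batch, features) — 8 samples, 16 channels

# BatchNorm: normalize each feature/channel across the batch (axis 0)
bn = (x - x.mean(axis=0)) / x.std(axis=0)

# LayerNorm: normalize each sample across its own features (axis 1)
ln = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)

# BN statistics are per-channel across samples; LN statistics are
# per-sample, so LN needs no batch and handles variable-length sequences.
print(np.allclose(bn.mean(axis=0), 0))   # True: each channel centered
print(np.allclose(ln.mean(axis=1), 0))   # True: each sample centered
```

Because LN's μ and σ depend on a single sample only, the "positions from different distributions" problem described above never arises, which is the argument for LN in RNNs and Transformers.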
Other normalizations: Weight Normalization (parameter normalization) and Cosine Normalization.
Reference: https://zhuanlan.zhihu.com/p/33173246
Diagrams pasted here; reference link: https://zhuanlan.zhihu.com/p/264749298