10.2 Attention Pooling: Nadaraya-Watson Kernel Regression
2022-04-21 20:54:00 【mingqian_chu】
The following is my learning record:
The Nadaraya-Watson kernel regression model, proposed in 1964, is a simple yet complete example that can be used to demonstrate machine learning with attention mechanisms.
1. Generating the dataset
For simplicity, consider the following regression problem: given a dataset of "input-output" pairs $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, how do we learn a function $f$ that predicts the output $\hat{y} = f(x)$ for any new input $x$?
We generate an artificial dataset according to the following nonlinear function, with an added noise term $\epsilon$:
$$y_i = 2\sin(x_i) + x_i^{0.8} + \epsilon,$$
where $\epsilon$ follows a normal distribution with mean 0 and standard deviation 0.5.
We generate 50 training samples and 50 test samples. To better visualize the attention patterns later, the training samples are sorted by input value, as in the sketch below.
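A minimal PyTorch sketch of this setup (variable names such as `x_train` and `n_train` are my own choices, not from the original post):

```python
import torch

n_train = 50  # number of training samples
x_train, _ = torch.sort(torch.rand(n_train) * 5)   # sorted inputs in [0, 5)

def f(x):
    return 2 * torch.sin(x) + x**0.8               # the true function

# Training outputs with Gaussian noise (mean 0, std 0.5).
y_train = f(x_train) + torch.normal(0.0, 0.5, (n_train,))
x_test = torch.arange(0, 5, 0.1)                   # 50 test inputs
y_truth = f(x_test)                                # noise-free ground truth
```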
2. Average pooling
We first use perhaps the simplest estimator for this regression problem: average pooling, which predicts the average of the output values over all training samples:
$$f(x) = \frac{1}{n}\sum_{i=1}^n y_i,$$
As the figure below shows, this estimator is not very smart: the true function ("Truth") and the predicted function ("Pred") differ considerably.
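Continuing the sketch above, average pooling amounts to a single line: every test input receives the same prediction, namely the mean of the training outputs.

```python
# Every test point gets the same prediction: the mean of y_train.
y_hat = torch.repeat_interleave(y_train.mean(), len(x_test))
```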

3. Kernel regression as attention pooling
Obviously, average pooling ignores the inputs $x_i$.
Therefore, Nadaraya [Nadaraya, 1964] and Watson [Watson, 1964] proposed a better idea: weight the outputs $y_i$ according to the positions of the inputs:

$$f(x) = \sum_{i=1}^n \frac{K(x - x_i)}{\sum_{j=1}^n K(x - x_j)} y_i,$$
where $K$ is a kernel.
The estimator described by this formula is called Nadaraya-Watson kernel regression.
3.1 Nonparametric attention pooling
We will not dive into the details of kernels here. Inspired by this estimator, we can rewrite (10.2.3) from the perspective of the attention mechanism framework in Fig. 10.1.3, yielding a more general attention pooling formula:
$$f(x) = \sum_{i=1}^n \alpha(x, x_i) y_i,$$
where $x$ is the query and $(x_i, y_i)$ is a key-value pair. The output of the attention pooling is a weighted average of the values $y_i$. The relationship between the query $x$ and the key $x_i$ is modeled as an attention weight $\alpha(x, x_i)$, which is assigned to the corresponding value $y_i$.
For any query, the attention weights over all key-value pairs form a valid probability distribution: they are nonnegative and sum to 1.
To better understand attention pooling, consider a Gaussian kernel, defined as:
$$K(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right).$$
Substituting the Gaussian kernel into the formulas above gives:
$$
\begin{aligned}
f(x) &= \sum_{i=1}^n \alpha(x, x_i) y_i \\
&= \sum_{i=1}^n \frac{\exp\left(-\frac{1}{2}(x - x_i)^2\right)}{\sum_{j=1}^n \exp\left(-\frac{1}{2}(x - x_j)^2\right)} y_i \\
&= \sum_{i=1}^n \mathrm{softmax}\left(-\frac{1}{2}(x - x_i)^2\right) y_i.
\end{aligned}
$$
The closer a key $x_i$ is to a given query $x$, the larger the attention weight assigned to that key's value $y_i$, i.e., it "gets more attention".
It is worth noting that Nadaraya-Watson kernel regression is a nonparametric model; the formula above is therefore an example of nonparametric attention pooling. Next, we plot the predictions of this nonparametric attention pooling model. You will find that the new model's prediction line is smooth, and closer to the ground truth than the average pooling prediction.
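Here is a sketch of that prediction step, continuing the variables from the data-generation code above: each test query attends to all training keys, with attention weights given by a softmax over negative halved squared distances.

```python
# Shape (n_test, n_train): row i repeats test input i across all keys.
X_repeat = x_test.repeat_interleave(n_train).reshape(-1, n_train)
# Softmax over keys turns (negative squared) distances into weights.
attention_weights = torch.softmax(-(X_repeat - x_train)**2 / 2, dim=1)
# Each prediction is an attention-weighted average of training outputs.
y_hat = torch.matmul(attention_weights, y_train)
```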

3.2 Parametric attention pooling
Nonparametric Nadaraya-Watson kernel regression has the benefit of consistency: given enough data, the model converges to the optimal solution. Nonetheless, we can easily integrate learnable parameters into attention pooling.
For example, in the formula below the distance between the query $x$ and the key $x_i$ is multiplied by a learnable parameter $w$:
$$
\begin{aligned}
f(x) &= \sum_{i=1}^n \alpha(x, x_i) y_i \\
&= \sum_{i=1}^n \frac{\exp\left(-\frac{1}{2}((x - x_i)w)^2\right)}{\sum_{j=1}^n \exp\left(-\frac{1}{2}((x - x_j)w)^2\right)} y_i \\
&= \sum_{i=1}^n \mathrm{softmax}\left(-\frac{1}{2}((x - x_i)w)^2\right) y_i.
\end{aligned}
$$
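A minimal training sketch under the same assumptions as the earlier code (the class name `NWKernelRegression` and the training-loop details are illustrative, not from the original post):

```python
from torch import nn

class NWKernelRegression(nn.Module):
    """Nadaraya-Watson kernel regression with a learnable bandwidth w."""
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.rand(1))

    def forward(self, queries, keys, values):
        # queries: (n_queries,); keys, values: (n_queries, n_kv)
        queries = queries.reshape(-1, 1)
        self.attention_weights = torch.softmax(
            -((queries - keys) * self.w)**2 / 2, dim=1)
        return (self.attention_weights * values).sum(dim=1)

# For brevity, each training sample attends to all training samples here;
# excluding a sample's own key-value pair (as d2l does) avoids the
# degenerate solution of attending only to itself.
net = NWKernelRegression()
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.5)
keys = x_train.repeat(n_train, 1)     # (n_train, n_train)
values = y_train.repeat(n_train, 1)   # (n_train, n_train)
for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(net(x_train, keys, values), y_train)
    loss.backward()
    optimizer.step()
```

At prediction time, the test queries attend to all training key-value pairs, exactly as in the nonparametric case, but with the learned $w$ scaling the distances.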
Copyright notice
This article was created by [mingqian_chu]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/04/202204212048517980.html