10.2 Attention Pooling: Nadaraya-Watson Kernel Regression
2022-04-21 20:54:00 【mingqian_chu】
The following is my learning record:
The Nadaraya-Watson kernel regression model, proposed in 1964, is a simple yet complete example that can be used to demonstrate machine learning with attention mechanisms.
1. Generating the dataset
For simplicity, consider the following regression problem: given a dataset of "input-output" pairs $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, how do we learn a function $f$ that predicts the output $\hat{y} = f(x)$ for any new input $x$?
We generate an artificial dataset according to the following nonlinear function, with an added noise term $\epsilon$:
$$y_i = 2\sin(x_i) + x_i^{0.8} + \epsilon,$$
where $\epsilon$ follows a normal distribution with mean 0 and standard deviation 0.5.
We generate 50 training samples and 50 test samples. To better visualize the attention patterns later, the training samples are sorted by input value, as in the sketch below.
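A minimal PyTorch sketch of this setup (variable names such as `x_train` and `n_train` are my own choices, not from the original post):

```python
import torch

n_train = 50  # number of training samples
x_train, _ = torch.sort(torch.rand(n_train) * 5)   # sorted inputs in [0, 5)

def f(x):
    return 2 * torch.sin(x) + x**0.8               # the true function

# Training outputs with Gaussian noise (mean 0, std 0.5).
y_train = f(x_train) + torch.normal(0.0, 0.5, (n_train,))
x_test = torch.arange(0, 5, 0.1)                   # 50 test inputs
y_truth = f(x_test)                                # noise-free ground truth
```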
2. Average pooling
We first use perhaps the simplest estimator for this regression problem: average pooling, which predicts the average of the output values over all training samples:
$$f(x) = \frac{1}{n}\sum_{i=1}^n y_i,$$
As the figure below shows, this estimator is not very smart: the true function ("Truth") and the predicted function ("Pred") differ considerably.
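Continuing the sketch above, average pooling amounts to a single line: every test input receives the same prediction, namely the mean of the training outputs.

```python
# Every test point gets the same prediction: the mean of y_train.
y_hat = torch.repeat_interleave(y_train.mean(), len(x_test))
```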

3. Kernel regression as attention pooling
Obviously, average pooling ignores the inputs $x_i$.
Therefore, Nadaraya [Nadaraya, 1964] and Watson [Watson, 1964] proposed a better idea: weight the outputs $y_i$ according to the positions of the inputs:

$$f(x) = \sum_{i=1}^n \frac{K(x - x_i)}{\sum_{j=1}^n K(x - x_j)} y_i,$$
where $K$ is a kernel.
The estimator described by this formula is called Nadaraya-Watson kernel regression.
3.1 Nonparametric attention pooling
We will not dive into the details of kernels here. Inspired by this estimator, we can rewrite (10.2.3) from the perspective of the attention mechanism framework in Fig. 10.1.3, yielding a more general attention pooling formula:
$$f(x) = \sum_{i=1}^n \alpha(x, x_i) y_i,$$
where $x$ is the query and $(x_i, y_i)$ is a key-value pair. The output of the attention pooling is a weighted average of the values $y_i$. The relationship between the query $x$ and the key $x_i$ is modeled as an attention weight $\alpha(x, x_i)$, which is assigned to the corresponding value $y_i$.
For any query, the attention weights over all key-value pairs form a valid probability distribution: they are nonnegative and sum to 1.
To better understand attention pooling, consider a Gaussian kernel, defined as:
$$K(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right).$$
Substituting the Gaussian kernel into the formulas above gives:
$$
\begin{aligned}
f(x) &= \sum_{i=1}^n \alpha(x, x_i) y_i \\
&= \sum_{i=1}^n \frac{\exp\left(-\frac{1}{2}(x - x_i)^2\right)}{\sum_{j=1}^n \exp\left(-\frac{1}{2}(x - x_j)^2\right)} y_i \\
&= \sum_{i=1}^n \mathrm{softmax}\left(-\frac{1}{2}(x - x_i)^2\right) y_i.
\end{aligned}
$$
The closer a key $x_i$ is to a given query $x$, the larger the attention weight assigned to that key's value $y_i$, i.e., it "gets more attention".
It is worth noting that Nadaraya-Watson kernel regression is a nonparametric model; the formula above is therefore an example of nonparametric attention pooling. Next, we plot the predictions of this nonparametric attention pooling model. You will find that the new model's prediction line is smooth, and closer to the ground truth than the average pooling prediction.
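Here is a sketch of that prediction step, continuing the variables from the data-generation code above: each test query attends to all training keys, with attention weights given by a softmax over negative halved squared distances.

```python
# Shape (n_test, n_train): row i repeats test input i across all keys.
X_repeat = x_test.repeat_interleave(n_train).reshape(-1, n_train)
# Softmax over keys turns (negative squared) distances into weights.
attention_weights = torch.softmax(-(X_repeat - x_train)**2 / 2, dim=1)
# Each prediction is an attention-weighted average of training outputs.
y_hat = torch.matmul(attention_weights, y_train)
```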

3.2 Parametric attention pooling
Nonparametric Nadaraya-Watson kernel regression has the benefit of consistency: given enough data, the model converges to the optimal solution. Nonetheless, we can easily integrate learnable parameters into attention pooling.
For example, in the formula below the distance between the query $x$ and the key $x_i$ is multiplied by a learnable parameter $w$:
$$
\begin{aligned}
f(x) &= \sum_{i=1}^n \alpha(x, x_i) y_i \\
&= \sum_{i=1}^n \frac{\exp\left(-\frac{1}{2}((x - x_i)w)^2\right)}{\sum_{j=1}^n \exp\left(-\frac{1}{2}((x - x_j)w)^2\right)} y_i \\
&= \sum_{i=1}^n \mathrm{softmax}\left(-\frac{1}{2}((x - x_i)w)^2\right) y_i.
\end{aligned}
$$
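A minimal training sketch under the same assumptions as the earlier code (the class name `NWKernelRegression` and the training-loop details are illustrative, not from the original post):

```python
from torch import nn

class NWKernelRegression(nn.Module):
    """Nadaraya-Watson kernel regression with a learnable bandwidth w."""
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.rand(1))

    def forward(self, queries, keys, values):
        # queries: (n_queries,); keys, values: (n_queries, n_kv)
        queries = queries.reshape(-1, 1)
        self.attention_weights = torch.softmax(
            -((queries - keys) * self.w)**2 / 2, dim=1)
        return (self.attention_weights * values).sum(dim=1)

# For brevity, each training sample attends to all training samples here;
# excluding a sample's own key-value pair (as d2l does) avoids the
# degenerate solution of attending only to itself.
net = NWKernelRegression()
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.5)
keys = x_train.repeat(n_train, 1)     # (n_train, n_train)
values = y_train.repeat(n_train, 1)   # (n_train, n_train)
for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(net(x_train, keys, values), y_train)
    loss.backward()
    optimizer.step()
```

At prediction time, the test queries attend to all training key-value pairs, exactly as in the nonparametric case, but with the learned $w$ scaling the distances.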
Copyright notice
This article was created by [mingqian_chu]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/04/202204212048517980.html