Machine learning theory (7): Kernels -- how an SVM can realize a nonlinear decision boundary
2022-04-23 18:33:00 【Flying warm】
The principle of the SVM, and why the SVM always chooses the large-margin decision boundary, is explained in my earlier article:
https://blog.csdn.net/qq_42902997/article/details/124310782
If you are not familiar with that material, you can review it there first.
Review: the SVM optimization objective
$$C\sum_{i=1}^m\left[y_i\,\mathrm{cost}_1(\theta^T x_i)+(1-y_i)\,\mathrm{cost}_0(\theta^T x_i)\right]+\frac{1}{2}\sum_{j=1}^n\theta_j^2\qquad(1)$$
- We all know that the SVM is a linear classifier. Now consider the following linearly inseparable scenario:
- A linear classifier alone cannot separate these classes, so we introduce polynomial terms to construct higher-order features and thereby obtain a nonlinear classification.
- Look at the formula given in the figure: for a sample with feature vector $\vec{x}=\{x_1, x_2\}$, we try to build higher-order features such as $x_1x_2$, $x_1^2$, $x_2^2$ from these raw features in order to construct a nonlinear decision boundary:
- Following this idea, we express it in a more general form. Since the goal is to build higher-order features, we write the decision boundary as $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \dots + \theta_n f_n$, where $n$ is the number of features we end up using and $f_1, f_2, \dots, f_n$ are the new features.
- In the example above, $f_1=x_1$, $f_2=x_2$, $f_3=x_1x_2$, $f_4=x_1^2$, $f_5=x_2^2$, ... (a short code sketch of this polynomial construction appears right after this list).
- Now comes the question: how do we make sure the $f_1,\dots,f_n$ we construct are actually effective?
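As a minimal illustration (not from the original post), here is a sketch of the polynomial feature expansion described above for a two-feature sample; the function name and the toy values are purely illustrative:

```python
import numpy as np

def polynomial_features(x1, x2):
    """Expand the two raw features into the higher-order terms mentioned above."""
    return np.array([x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

# A toy sample with two raw features
f = polynomial_features(1.5, -2.0)
print(f)  # [ 1.5  -2.   -3.    2.25  4.  ]
# The decision value is then theta_0 + theta_1*f[0] + ... + theta_5*f[4]
```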
How to construct $f_1,\dots,f_n$
landmark & similarity
- The idea: measure the similarity between the current sample and certain landmarks (marker points), and use those similarities to generate $f_1,\dots,f_n$.
Sticking with the example above, each sample point $x$ is represented by two features, $\vec{x}=\{x_1,x_2\}$. Now assume we have picked 3 different landmarks (how the landmarks are chosen is explained in detail later).
Next, we measure the similarity between the sample $x$ and each of the three landmarks, and use it to define the new features:
- $f_1=\mathrm{similarity}(x, l^{(1)})$
- $f_2=\mathrm{similarity}(x, l^{(2)})$
- $f_3=\mathrm{similarity}(x, l^{(3)})$
The $\mathrm{similarity}$ function is defined with the Gaussian kernel (explained in more detail later), that is:
$$\mathrm{similarity}(x,z)=\exp\!\left(-\frac{\lVert x-z\rVert^2}{2\sigma^2}\right)$$
Therefore, $f_1,\dots,f_3$ above can be written as:
- $f_1=\exp\!\left(-\frac{\lVert x-l^{(1)}\rVert^2}{2\sigma^2}\right)$
- $f_2=\exp\!\left(-\frac{\lVert x-l^{(2)}\rVert^2}{2\sigma^2}\right)$
- $f_3=\exp\!\left(-\frac{\lVert x-l^{(3)}\rVert^2}{2\sigma^2}\right)$
Note: $\lVert x-l^{(1)}\rVert^2=\sum_{k=1}^{d}(x_k-l^{(1)}_k)^2$, where $d$ is the number of features of the sample $x$; here $x$ has the two features $x_1, x_2$, so $d=2$.
You can see that, under this definition, if $x$ and $l^{(i)}$ are very similar, the kernel output is close to $f_i\approx \exp\!\left(-\frac{0}{2\sigma^2}\right)\approx 1$; conversely, if $x$ is far from $l^{(i)}$, the output is close to $0$.
So we can think of $f_1,\dots,f_n$ as measuring how similar the sample $x$ is to each landmark (a distance-based measure); these $f_1,\dots,f_n$ then become the $n$ new features of the sample $x$. A minimal code sketch of this construction follows below.
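Here is a small illustration (not from the original post) of computing $f_1, f_2, f_3$ for one sample against three landmarks with the Gaussian kernel; the landmark coordinates are made up for the example:

```python
import numpy as np

def gaussian_similarity(x, l, sigma=1.0):
    """Gaussian kernel: exp(-||x - l||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))

x = np.array([2.0, 1.0])                # one sample with features x1, x2
landmarks = [np.array([2.1, 1.1]),      # l(1): very close to x
             np.array([5.0, 5.0]),      # l(2): far from x
             np.array([0.0, 4.0])]      # l(3): far from x

f = [gaussian_similarity(x, l) for l in landmarks]
print(f)  # f1 is close to 1, while f2 and f3 are close to 0
```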
To aid intuition, we visualize the Gaussian kernel:
As the plot shows, the closer $x$ is to $l^{(1)}$, the closer the output of the Gaussian kernel is to the center (peak) of the surface, i.e. close to $1$; conversely, the farther apart they are, the closer the value is to $0$.
We can also draw a simple conclusion: the standard deviation $\sigma$ of the Gaussian kernel determines how smooth the surface is; different choices of $\sigma$ give different shapes:
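Since the figures are not reproduced here, a tiny numeric illustration (with assumed values) of how $\sigma$ controls the falloff: at a fixed distance from a landmark, a larger $\sigma$ keeps the kernel value high (a wider, smoother bump), while a small $\sigma$ makes it drop quickly.

```python
import numpy as np

distance = 1.0  # fixed distance ||x - l|| between a sample and a landmark
for sigma in (0.5, 1.0, 2.0):
    value = np.exp(-distance ** 2 / (2 * sigma ** 2))
    print(f"sigma={sigma}: kernel value = {value:.3f}")
# sigma=0.5: 0.135   sigma=1.0: 0.607   sigma=2.0: 0.882
```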
How to generate a nonlinear boundary
- Suppose we have already obtained $\theta_0, \theta_1, \dots, \theta_3$ from training, with the values shown in the figure:
- If a sample $x$ sits at the position shown in the figure, very close to $l^{(1)}$ but far from $l^{(2)}$ and $l^{(3)}$, then $f_1\approx 1$ and $f_2, f_3\approx 0$. Plugging these into the formula, the resulting prediction is $\geq 0$, i.e. a positive sample.
- If another sample $x$ is far from all three $l^{(i)}$, the same steps give a prediction $< 0$, i.e. a negative sample.
- With enough samples you will find that the final prediction for each sample depends on all the landmarks, and the resulting decision boundary becomes nonlinear: every sample inside the red boundary is classified as positive, and every sample outside it as negative. A small numeric check of this reasoning is sketched below.
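To make the reasoning concrete, here is a minimal check with made-up values (the original figure with the actual $\theta$ values and landmark positions is not reproduced here, so $\theta_0=-0.5$, $\theta_1=1$, $\theta_2=1$, $\theta_3=0$ and the landmark coordinates are assumptions for illustration only):

```python
import numpy as np

theta = np.array([-0.5, 1.0, 1.0, 0.0])     # hypothetical theta_0 .. theta_3
landmarks = np.array([[1.0, 1.0],            # l(1)
                      [4.0, 4.0],            # l(2)
                      [4.0, 0.0]])           # l(3)

def predict(x, sigma=1.0):
    # f = [f0, f1, f2, f3], where f0 = 1 is the intercept feature
    sims = np.exp(-np.sum((landmarks - x) ** 2, axis=1) / (2 * sigma ** 2))
    f = np.concatenate(([1.0], sims))
    return "positive" if theta @ f >= 0 else "negative"

print(predict(np.array([1.1, 0.9])))  # near l(1): f1 ~ 1        -> positive
print(predict(np.array([8.0, 8.0])))  # far from all landmarks   -> negative
```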
How to choose landmarks
- As you may already have guessed, we can simply use all of the sample points in the dataset, $X=\{x^{(1)},x^{(2)},\dots,x^{(n)}\}$, as the landmarks: $l^{(1)}=x^{(1)},\ l^{(2)}=x^{(2)},\ \dots,\ l^{(n)}=x^{(n)}$.
- The sample similarities can then be rewritten as:
  - $f_1=\exp\!\left(-\frac{\lVert x-x^{(1)}\rVert^2}{2\sigma^2}\right)$
  - $f_2=\exp\!\left(-\frac{\lVert x-x^{(2)}\rVert^2}{2\sigma^2}\right)$
  - $f_3=\exp\!\left(-\frac{\lVert x-x^{(3)}\rVert^2}{2\sigma^2}\right)$
  - $\vdots$
  - $f_n=\exp\!\left(-\frac{\lVert x-x^{(n)}\rVert^2}{2\sigma^2}\right)$
- In this way, the feature vector of each sample $x^{(i)}$ changes from the original $\{x^{(i)}_1, x^{(i)}_2\}$ to $\vec{f^{(i)}}=\{f^{(i)}_1, f^{(i)}_2, \dots, f^{(i)}_n\}$. We have omitted the intercept feature $f^{(i)}_0$ here; adding it back is simple: $\vec{f^{(i)}}=\{f^{(i)}_0, f^{(i)}_1, f^{(i)}_2, \dots, f^{(i)}_n\}$ with $f^{(i)}_0=1$.
- What also changes accordingly is the parameter vector paired with this $n$-dimensional feature vector ($n+1$-dimensional if you count $f_0$): $\vec{\theta}=\{\theta_0,\dots,\theta_n\}$.
- Finally, we predict a positive sample when $\theta^T\cdot\vec{f}=\theta_0 f_0+\dots+\theta_n f_n\geq 0$. A sketch that maps an entire training set to these kernel features is given right after this list.
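Below is a minimal sketch (an illustration, not the original author's code) of mapping every sample to its kernel feature vector when all training points are used as landmarks; the variable names and toy data are assumptions:

```python
import numpy as np

def gaussian_kernel_features(X, landmarks, sigma=1.0):
    """Map each row of X to its similarities against every landmark.

    X:         (m, d) array of samples
    landmarks: (n, d) array of landmarks (here: the training samples themselves)
    Returns an (m, n) array F with F[i, j] = exp(-||X[i] - landmarks[j]||^2 / (2*sigma^2)).
    """
    diff = X[:, None, :] - landmarks[None, :, :]   # (m, n, d) pairwise differences
    sq_dist = np.sum(diff ** 2, axis=-1)           # (m, n) squared distances
    return np.exp(-sq_dist / (2 * sigma ** 2))

# Using all training samples as landmarks, as described above (toy data)
X_train = np.array([[1.0, 1.0], [4.0, 4.0], [4.0, 0.0], [1.2, 0.8]])
F = gaussian_kernel_features(X_train, landmarks=X_train, sigma=1.0)
print(F.shape)     # (4, 4): one kernel feature per training sample
print(np.diag(F))  # all 1.0: each sample is maximally similar to itself
```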
Combining the SVM with kernel functions
- Combining all of the above, let's look back at the SVM objective function. For a dataset containing $m$ samples:
$$C\sum_{i=1}^m\left[y_i\,\mathrm{cost}_1(\theta^T x^{(i)})+(1-y_i)\,\mathrm{cost}_0(\theta^T x^{(i)})\right]+\frac{1}{2}\sum_{j=1}^n\theta_j^2\qquad(1)$$
- We can rewrite this optimization objective as:
$$C\sum_{i=1}^m\left[y_i\,\mathrm{cost}_1(\theta^T\cdot\vec{f^{(i)}})+(1-y_i)\,\mathrm{cost}_0(\theta^T\cdot\vec{f^{(i)}})\right]+\frac{1}{2}\sum_{j=1}^n\theta_j^2\qquad(2)$$
- That is, the original feature vector based on $x$ is replaced by the new feature vector based on $f$; the number of samples is still $m$.
- At the same time, the $n$ in the regularization term originally denoted the number of effective features of a sample $x^{(i)}$. It can now be replaced by $m$, because the number of kernel features equals the number of samples. The objective therefore becomes:
$$C\sum_{i=1}^m\left[y_i\,\mathrm{cost}_1(\theta^T\cdot\vec{f^{(i)}})+(1-y_i)\,\mathrm{cost}_0(\theta^T\cdot\vec{f^{(i)}})\right]+\frac{1}{2}\sum_{j=1}^m\theta_j^2\qquad(3)$$
- Here $j=1\dots m$ deliberately excludes the intercept: the regularization sum contains at most $m$ terms, because the intercept parameter $\theta_0$ is not regularized, so it cannot be written as $j=0\dots m$.
- The regularization term can therefore also be written as $\theta^T\cdot\theta~~(\text{ignoring }\theta_0)$. An end-to-end sketch using an off-the-shelf RBF-kernel SVM is given below.
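In practice these kernel features are rarely built by hand; libraries such as scikit-learn implement the Gaussian (RBF) kernel directly inside the SVM solver. The following is a hedged sketch on toy data (not from the original post); note that scikit-learn's `gamma` corresponds to $1/(2\sigma^2)$:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: positives clustered near the origin, negatives placed around them
rng = np.random.default_rng(0)
X_pos = rng.normal(0.0, 0.5, size=(50, 2))
X_neg = rng.normal(0.0, 0.5, size=(50, 2)) + rng.choice([-2.5, 2.5], size=(50, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 50 + [0] * 50)

sigma = 1.0
clf = SVC(C=1.0, kernel="rbf", gamma=1.0 / (2 * sigma ** 2))  # gamma = 1 / (2 * sigma^2)
clf.fit(X, y)

print(clf.predict([[0.0, 0.0], [2.5, 2.5]]))  # expected: inner point -> 1, outer point -> 0
```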
Copyright notice
This article was created by [Flying warm]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/04/202204231826143997.html