KNN, kmeans and GMM
2022-04-23 15:27:00 [moletop]
KNN

Classification algorithm

Supervised learning

Meaning of K: to classify a sample X, look in the dataset for the K data points nearest to X, and assign X to the category that appears most often among them.
Problem: the core of the KNN algorithm is finding the k nearest neighbors of the test sample within the training set. If the training set is very large, the traditional approach of traversing every sample to find the k nearest neighbors causes a sharp drop in performance.
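The brute-force search described above can be sketched in a few lines of Python (the toy data here is hypothetical, purely for illustration):

```python
import math
from collections import Counter

def knn_classify(x, samples, labels, k=3):
    """Brute-force KNN: scan every training sample, keep the k nearest,
    and return the majority label among them (the O(N) traversal the
    text describes)."""
    order = sorted(range(len(samples)),
                   key=lambda i: math.dist(x, samples[i]))
    k_nearest = [labels[i] for i in order[:k]]
    # Majority vote among the k nearest neighbours.
    return Counter(k_nearest).most_common(1)[0][0]

# Hypothetical toy data: two 2-D groups, labels "a" and "b".
samples = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_classify((0.5, 0.5), samples, labels, k=3))  # prints: a
```

Every query scans all N training samples, which is exactly the cost the kd-tree below is designed to avoid.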
Improvement 1: kd-tree
A kd-tree trades space for time: using the sample points in the training set, it partitions the k-dimensional space along one dimension at a time, building a binary tree. This divide-and-conquer idea greatly improves the algorithm's search efficiency. Binary search has complexity O(log N), and kd-tree search efficiency approaches it (depending on how close the constructed kd-tree is to a balanced tree).
1. Each level down the tree switches to a different dimension for splitting. The reason is simple: we want the tree to represent the K-dimensional space well, so queries along different dimensions are all fast.
2. A leaf node actually represents a small region of the sample space.
3. A single query can retrieve the K nearest samples directly.
Recommended blog: https://zhuanlan.zhihu.com/p/127022333
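A minimal kd-tree sketch, assuming 2-D points and Euclidean distance (the toy data and the dict-based node layout are illustrative choices, not from the original post):

```python
import math

def build_kdtree(points, depth=0):
    """Recursively build a kd-tree: each level splits on a different
    axis; the median point becomes the node, halves go left/right."""
    if not points:
        return None
    axis = depth % len(points[0])            # cycle through dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, target, depth=0, best=None):
    """Nearest-neighbour search, pruning subtrees whose splitting plane
    is farther away than the best distance found so far."""
    if node is None:
        return best
    point = node["point"]
    if best is None or math.dist(target, point) < math.dist(target, best):
        best = point
    axis = depth % len(target)
    diff = target[axis] - point[axis]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, target, depth + 1, best)
    # Only descend into the far side if the search hypersphere around
    # `target` (radius = current best distance) crosses the split plane.
    if abs(diff) < math.dist(target, best):
        best = nearest(far, target, depth + 1, best)
    return best

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(nearest(tree, (9, 2)))  # (8, 1)
```

The pruning test on `abs(diff)` is the hypercube-vs-hypersphere intersection check that the ball tree discussion below improves on.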

balltree
In a kd-tree, one core cause of performance degradation is that the partitioned subspaces are hypercubes, while the nearest-neighbor region under Euclidean distance is a hypersphere. A hypercube is very likely to intersect the search hypersphere, and every intersecting subspace must be checked, which greatly reduces efficiency.
If the partition regions are also hyperspheres, the probability of intersection drops sharply. A ball tree partitions space with hyperspheres, removing the corners, so the partitioning hyperspheres are far less likely to intersect the search hypersphere. Especially when the data dimension is very high, this greatly improves the algorithm's efficiency.
Kmeans
 Clustering algorithm
 Unsupervised learning
 The dataset has no labels; the data is unorganized
 Meaning of K: K is a preset number. The dataset is partitioned into K clusters, and choosing K relies on human prior knowledge.
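The basic kmeans loop can be sketched as follows (pure Python on hypothetical toy data; the naive `points[:k]` initialization is exactly the weakness the Kmeans++ improvement below addresses):

```python
import math

def kmeans(points, k, iters=100):
    """Plain kmeans: assign each point to its nearest center, then move
    each center to the mean of its assigned points; repeat until the
    centers stop changing."""
    centers = points[:k]                     # naive initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                     # assignment step
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[j].append(p)
        new_centers = [
            tuple(sum(v) / len(c) for v in zip(*c)) if c else centers[j]
            for j, c in enumerate(clusters)
        ]
        if new_centers == centers:           # converged
            break
        centers = new_centers
    return centers, clusters

# Hypothetical toy data: two well-separated 2-D groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, k=2)
print(sorted(centers))
```

On this easy data the loop converges to the two group means; with a worse initialization it can get stuck in a local optimum, as the shortcomings below note.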
Shortcomings of kmeans
 The number of cluster centers K must be given in advance, but in practice this K value is very hard to estimate.
 Kmeans requires the initial cluster centers to be chosen manually; different initial centers may lead to different clustering results, for example getting stuck in a local optimum. (The Kmeans++ algorithm can address this.)
 Sensitive to noise and outliers.
Improvement:
1. Kmeans++
 Step 1: randomly select one sample as the first cluster center c1;
 Step 2: compute each sample's shortest distance to the existing cluster centers (i.e. its distance to the nearest center), denoted D(x); the larger D(x), the higher the probability of that sample being selected as the next cluster center; finally, use roulette-wheel selection to pick the next center;
 Step 3: repeat step 2 until k cluster centers have been selected.
After the initial centers are chosen, just run the standard kmeans algorithm.
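The three seeding steps above can be sketched like this (pure Python; the toy data and fixed seed are illustrative assumptions):

```python
import math
import random

def kmeans_pp_init(points, k, seed=0):
    """Kmeans++ seeding: the first center is random; each later center
    is drawn by roulette-wheel selection with weight D(x)^2, where D(x)
    is the sample's distance to its nearest already-chosen center."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]                      # step 1
    while len(centers) < k:
        # Step 2: D(x) = shortest distance to any existing center.
        d2 = [min(math.dist(p, c) for c in centers) ** 2 for p in points]
        # Roulette wheel: larger D(x)^2 -> more likely to be picked.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

# Hypothetical toy data: two well-separated 2-D groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(kmeans_pp_init(points, k=2))
```

Because already-chosen centers get weight 0 and far-away samples get large weights, the seeds tend to land in different groups, which is the whole point of the scheme.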
2. Bisecting K-means
To address the KMeans algorithm's sensitivity to the initial cluster centers, bisecting KMeans weakens the influence of the initial centroids. The steps are:
1. Put all the sample data into a queue as a single cluster;
2. Take a cluster from the queue and split it with the Kmeans algorithm (the first iteration is equivalent to splitting all samples with K=2) into two sub-clusters, add the sub-clusters back to the queue, and repeat this step until a stopping condition is met (number of clusters, minimum squared error, number of iterations, etc.);
3. The clusters remaining in the queue are the final cluster set.
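A sketch of the queue-based procedure above (pure Python; the 2-way split helper, its deterministic initialization, and the "target cluster count" stopping rule are simplifying assumptions):

```python
import math

def two_means(points, iters=50):
    """Minimal K=2 kmeans used as the splitting step."""
    centers = [points[0], points[-1]]        # crude deterministic init
    for _ in range(iters):
        halves = [[], []]
        for p in points:
            j = 0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
            halves[j].append(p)
        new = [tuple(sum(v) / len(h) for v in zip(*h)) if h else centers[j]
               for j, h in enumerate(halves)]
        if new == centers:
            break
        centers = new
    return halves

def sse(cluster):
    """Sum of squared errors of a cluster around its mean."""
    mean = tuple(sum(v) / len(cluster) for v in zip(*cluster))
    return sum(math.dist(p, mean) ** 2 for p in cluster)

def bisecting_kmeans(points, k):
    queue = [points]                         # step 1: one big cluster
    while len(queue) < k:                    # stop: target cluster count
        # Step 2: split the cluster with the largest squared error.
        worst = max(queue, key=sse)
        queue.remove(worst)
        queue.extend(h for h in two_means(worst) if h)
    return queue                             # step 3: final cluster set

# Hypothetical toy data: three well-separated 2-D groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (20, 1)]
clusters = bisecting_kmeans(points, k=3)
print(sorted(len(c) for c in clusters))  # [2, 2, 2]
```

Always bisecting the highest-SSE cluster is one common choice of which cluster to split; the text leaves that choice open.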
GMM

Similar to the K-means algorithm, GMM also iterates with the EM algorithm. The Gaussian mixture model assumes that the data of each cluster follows a Gaussian distribution (also called a normal distribution), and that the distribution the data currently presents is the superposition of the clusters' Gaussian distributions.

The core idea of the Gaussian mixture model is to assume the data can be viewed as generated from multiple Gaussian distributions. Under this assumption, each individual sub-model is a standard Gaussian, with mean μi and covariance Σi as the parameters to estimate. In addition, each sub-model has a parameter πi, which can be understood as its weight, or the probability of it generating a data point. The formula of the Gaussian mixture model is:

p(x) = Σ_{i=1}^{K} πi · N(x | μi, Σi), where Σ_{i=1}^{K} πi = 1

The process:
First, randomly initialize the value of each parameter. Then repeat the following two steps until convergence.
1. E step: using the current parameters, compute the probability that each point was generated by each sub-model.
2. M step: using the probabilities estimated in the E step, update each sub-model's mean, covariance and weight.
In other words, we know neither the 3 parameters of each of the K Gaussian distributions, nor which Gaussian distribution generated each data point. So in each cycle we first fix the current Gaussian distributions and obtain, for each data point, the probability of being generated by each Gaussian; then we fix those generation probabilities and, from the data points and the probabilities, obtain a better set of Gaussian distributions. We cycle until the parameters no longer change, or change very little, and we end up with a reasonable set of Gaussian distributions.
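The E/M loop above can be sketched for a one-dimensional, two-component mixture (pure Python; the data and starting parameters are hypothetical):

```python
import math

def gauss(x, mu, var):
    """Gaussian density N(x | mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_em(data, mus, vars_, pis, iters=100):
    """EM for a 1-D Gaussian mixture: the E step computes, for each
    point, the probability it came from each component; the M step
    re-estimates each component's mean, variance and weight from those
    probabilities."""
    k = len(mus)
    for _ in range(iters):
        # E step: gamma[i][j] = P(point i was generated by component j).
        gamma = []
        for x in data:
            w = [pis[j] * gauss(x, mus[j], vars_[j]) for j in range(k)]
            s = sum(w)
            gamma.append([wj / s for wj in w])
        # M step: update each component from its soft-assigned points.
        for j in range(k):
            nj = sum(g[j] for g in gamma)
            mus[j] = sum(g[j] * x for g, x in zip(gamma, data)) / nj
            vars_[j] = sum(g[j] * (x - mus[j]) ** 2
                           for g, x in zip(gamma, data)) / nj
            pis[j] = nj / len(data)
    return mus, vars_, pis

# Hypothetical data: two well-separated 1-D groups.
data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
mus, vars_, pis = gmm_em(data, mus=[0.0, 6.0], vars_=[1.0, 1.0], pis=[0.5, 0.5])
print(round(mus[0], 2), round(mus[1], 2))  # 1.0 5.0
```

Unlike kmeans's hard assignment, each point contributes to every component in proportion to its responsibility gamma[i][j], which is why GMM can report membership probabilities.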
Comparison of GMM and KMeans
What the Gaussian mixture model and the K-means algorithm have in common:
 Both are clustering algorithms;
 Both require a specified K value;
 Both are solved with the EM algorithm;
 Both can only converge to a local optimum.
Its advantages over the K-means algorithm are that it gives the probability that a sample belongs to each class; beyond clustering, it can also be used to estimate probability density; and it can be used to generate new sample points.
Copyright notice
This article was written by [moletop]. Please include the original link when reposting. Thanks.
https://yzsam.com/2022/04/202204231523160709.html