
KNN, kmeans and GMM

2022-04-23 15:27:00 moletop


KNN

  • Classification algorithm

  • Supervised learning

  • The meaning of K — to classify a sample X, first find the K data points in the training set that are nearest to X, then assign X to the class that appears most often among those K neighbors.

    Problem: the core of the KNN algorithm is finding the k nearest neighbors of the test sample within the training set. If the training set is very large, the traditional approach of traversing every sample to find the k nearest neighbors causes a sharp drop in performance (a minimal sketch of this brute-force search is shown below).
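
For concreteness, here is a hedged NumPy sketch of that brute-force majority-vote classifier; the function name `knn_predict` and the choice `k=5` are illustrative, not from the original post.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    # Full traversal: Euclidean distance from x to every training sample.
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k nearest neighbors.
    nearest = np.argsort(dists)[:k]
    # Majority vote among the neighbors' labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]
```

Each query costs O(N·d), which is exactly the expense that the kd-tree and ball-tree structures below try to avoid.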
    Improvements:

    1. kd-tree

    • A kd-tree trades space for time: **it uses the sample points in the training set to partition the k-dimensional space along each dimension in turn, building a binary tree,** and this divide-and-conquer idea greatly improves search efficiency. We know that the complexity of binary search is O(logN), and kd-tree search efficiency approaches this (depending on how close the constructed kd-tree is to a balanced tree).


1. Each level down the tree splits on a different dimension. The reason is simple: we want the tree to represent the K-dimensional space well, so that queries along any dimension can be answered quickly.

2. A leaf node actually represents a small region of the sample space.

3. A single query can then go almost directly to the K nearest samples.

An excellent blog post on kd-trees: https://zhuanlan.zhihu.com/p/127022333
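
As a usage sketch (not from the original post), scikit-learn's KDTree builds the tree once and then answers nearest-neighbor queries; the random data here is purely illustrative.

```python
import numpy as np
from sklearn.neighbors import KDTree

X = np.random.rand(10000, 3)        # 10,000 training points in 3-D space
tree = KDTree(X, leaf_size=40)      # build the kd-tree once ("space for time")

query = np.random.rand(1, 3)
dist, ind = tree.query(query, k=5)  # distances and indices of the 5 nearest samples
```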

  2. ball-tree

    In a kd-tree, one core factor behind the performance degradation is that the subspaces produced by the partition are hyper-rectangles, while the nearest-neighbor region defined by the Euclidean distance is a hypersphere. A hypercube is very likely to intersect the search hypersphere, and every intersecting subspace has to be checked, which greatly reduces efficiency.

    If the partitioned regions are themselves hyperspheres, the probability of intersection drops sharply. A ball-tree partitions the space with hyperspheres; with the corners removed, the probability that a partition hypersphere intersects the search hypersphere is greatly reduced, and especially when the data dimension is very high the algorithm's efficiency improves substantially.

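For comparison, a hedged sketch with scikit-learn's BallTree, which exposes the same query interface as KDTree; the high-dimensional random data is only for illustration.

```python
import numpy as np
from sklearn.neighbors import BallTree

X = np.random.rand(10000, 50)         # high-dimensional data, where ball-trees tend to pay off
tree = BallTree(X, leaf_size=40)      # partitions the space into nested hyperspheres
dist, ind = tree.query(np.random.rand(1, 50), k=5)
```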

K-means

  • Clustering algorithm
  • Unsupervised learning
  • The dataset has no labels; the data is unorganized
  • The meaning of K — K is a preset number; the dataset is partitioned into K clusters, and choosing K relies on human prior knowledge
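
A minimal sketch of running K-means with scikit-learn; the synthetic blobs dataset and the choice K = 3 are assumptions made here for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)   # unlabeled data
km = KMeans(n_clusters=3, init="random", n_init=10, random_state=0)
labels = km.fit_predict(X)        # each sample is assigned to one of the K clusters
print(km.cluster_centers_)
```

`init="random"` is used here to reflect plain K-means; scikit-learn's default is already the K-means++ seeding discussed below.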

K-means shortcomings

  • The number of cluster centers K must be given in advance, but in practice this value of K is very hard to estimate
  • K-means needs the initial cluster centers to be chosen manually; different initial centers may lead to different clustering results, for example getting stuck in a local optimum (the K-means++ algorithm can be used to address this)
  • Sensitive to noise and outliers

Improvements:

1. K-means++

  • Step 1: randomly pick one sample as the first cluster center c1;
  • Step 2: compute, for each sample, the shortest distance to the existing cluster centers (i.e. the distance to the nearest chosen center), denoted D(x); the larger this value, the higher the probability that the sample is chosen as the next cluster center; finally, use roulette-wheel selection to pick the next cluster center;
  • Step 3: repeat step 2 until k cluster centers have been selected.

After the initial centers are selected, the standard K-means algorithm is run as usual. A sketch of this seeding procedure is shown below.
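
A NumPy sketch of the seeding steps above; the function name is hypothetical, and it follows the common K-means++ convention of weighting the roulette wheel by D(x)².

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: the first center is a uniformly random sample.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Step 2: D(x) = distance from each sample to its nearest chosen center.
        d = np.min(np.linalg.norm(X[:, None] - np.array(centers)[None], axis=2), axis=1)
        # Roulette-wheel selection: larger D(x) -> higher chance of being picked.
        probs = d ** 2 / np.sum(d ** 2)
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```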

2. Bisecting K-means

To address K-means's sensitivity to the initial cluster centers, bisecting K-means is an algorithm that weakens the influence of the initial centroids. The specific steps are as follows:

Put all the sample data into a queue as a single cluster;
Take a cluster out of the queue and partition it with the K-means algorithm (the first iteration is equivalent to partitioning all samples with K = 2), producing two sub-clusters; add the sub-clusters back to the queue and repeat this step until a stopping condition is met (number of clusters, minimum squared error, number of iterations, etc.);
The clusters remaining in the queue form the final clustering result.
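
A rough sketch of that queue-based procedure; splitting the cluster with the largest within-cluster squared error each round is one common choice (the post does not specify), and the function name is my own.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k):
    clusters = [X]                       # the queue starts with all samples as one cluster
    while len(clusters) < k:
        # Choose a cluster to split (here: the one with the largest SSE).
        sse = [np.sum((c - c.mean(axis=0)) ** 2) for c in clusters]
        target = clusters.pop(int(np.argmax(sse)))
        # Split it into two sub-clusters with standard K-means (K = 2).
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(target)
        clusters += [target[labels == 0], target[labels == 1]]
    return clusters                      # the final set of k clusters
```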

GMM

  • Like the K-means algorithm, GMM is fitted iteratively with the EM algorithm. The Gaussian mixture model assumes that the data in each cluster follows a Gaussian (also called normal) distribution, so the distribution of the observed data is the superposition of the Gaussian distributions of the individual clusters.

  • The core idea of the Gaussian mixture model is to assume that the data can be seen as generated from multiple Gaussian distributions. Under this assumption, each individual sub-model is a standard Gaussian whose mean μi and variance Σi are the parameters to be estimated. In addition, each sub-model has a parameter πi, which can be understood as its weight, i.e. the probability of generating a data point. The formula of the Gaussian mixture model is:
     p(x) = Σ_{i=1}^{K} πi · N(x | μi, Σi)

  • The procedure:
    First, randomly initialize the value of each parameter. Then repeat the following two steps until convergence.

    1. E-step: based on the current parameters, compute the probability that each point was generated by each sub-model.

    2. M-step: use the probabilities estimated in the E-step to update the mean, variance and weight of each sub-model.

    In other words, we know neither the three parameters of each of the K Gaussian distributions, nor which Gaussian distribution each data point was generated by. So in each iteration we first fix the current Gaussian distributions and obtain the probability that each data point was generated by each of them; then we fix those generation probabilities and, from the data points and the probabilities, obtain a better set of Gaussian distributions. The loop repeats until the parameters stop changing, or change only very little, and we are left with a reasonable set of Gaussian distributions.
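
A condensed one-dimensional NumPy/SciPy sketch of that E-step/M-step loop; the function name, the random initialization and the fixed iteration count are assumptions made here for illustration, not part of the original post.

```python
import numpy as np
from scipy.stats import norm

def gmm_em_1d(x, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    # Random initialization: equal weights, k random means, a shared variance.
    pi = np.full(k, 1.0 / k)
    mu = rng.choice(x, size=k, replace=False)
    var = np.full(k, x.var())
    for _ in range(n_iter):
        # E-step: probability that each point is generated by each sub-model.
        dens = np.stack([pi[j] * norm.pdf(x, mu[j], np.sqrt(var[j])) for j in range(k)])
        resp = dens / dens.sum(axis=0, keepdims=True)
        # M-step: update weight, mean and variance from those probabilities.
        nk = resp.sum(axis=1)
        pi = nk / n
        mu = resp @ x / nk
        var = np.array([resp[j] @ (x - mu[j]) ** 2 for j in range(k)]) / nk
    return pi, mu, var
```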

    GMM vs. K-means

    What the Gaussian mixture model has in common with the K-means algorithm:

    • Both are clustering algorithms;
    • Both require the value of K to be specified;
    • Both are solved with the EM algorithm;
    • Both can only converge to a local optimum.

    Compared with the K-means algorithm, GMM has the advantage that it can give the probability that a sample belongs to each class; beyond clustering, it can also be used to estimate probability density; and it can be used to generate new sample points.
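
These three advantages map onto scikit-learn's GaussianMixture API roughly as sketched below; the blobs dataset and K = 3 are illustrative assumptions.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

proba = gmm.predict_proba(X[:5])    # probability of each sample belonging to each class
logdens = gmm.score_samples(X[:5])  # estimated (log) probability density at each sample
X_new, _ = gmm.sample(10)           # generate new sample points from the fitted model
```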

Copyright notice
This article was written by [moletop]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/04/202204231523160709.html