
KNN, kmeans and GMM

2022-04-23 15:27:00 moletop


KNN

  • Classification algorithm

  • Supervised learning

  • The meaning of K — to classify a sample X, first find the K data points in the training set that are nearest to X, then assign X to the class that appears most often among those K neighbors.

    Problem: the core of the KNN algorithm is finding the k nearest neighbors of the test sample within the training set. If the training set is very large, the traditional approach of traversing every sample to find the k nearest neighbors causes a sharp drop in performance (a minimal sketch of this brute-force search is shown below).
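
For concreteness, here is a hedged NumPy sketch of that brute-force majority-vote classifier; the function name `knn_predict` and the choice `k=5` are illustrative, not from the original post.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    # Full traversal: Euclidean distance from x to every training sample.
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k nearest neighbors.
    nearest = np.argsort(dists)[:k]
    # Majority vote among the neighbors' labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]
```

Each query costs O(N·d), which is exactly the expense that the kd-tree and ball-tree structures below try to avoid.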
    Improvements:

    1. kd-tree

    • A kd-tree trades space for time: **it uses the sample points in the training set to partition the k-dimensional space along each dimension in turn, building a binary tree,** and this divide-and-conquer idea greatly improves search efficiency. We know that the complexity of binary search is O(logN), and kd-tree search efficiency approaches this (depending on how close the constructed kd-tree is to a balanced tree).


1. Each level down the tree splits on a different dimension. The reason is simple: we want the tree to represent the K-dimensional space well, so that queries along any dimension can be answered quickly.

2. A leaf node actually represents a small region of the sample space.

3. A single query can then go almost directly to the K nearest samples.

An excellent blog post on kd-trees: https://zhuanlan.zhihu.com/p/127022333
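
As a usage sketch (not from the original post), scikit-learn's KDTree builds the tree once and then answers nearest-neighbor queries; the random data here is purely illustrative.

```python
import numpy as np
from sklearn.neighbors import KDTree

X = np.random.rand(10000, 3)        # 10,000 training points in 3-D space
tree = KDTree(X, leaf_size=40)      # build the kd-tree once ("space for time")

query = np.random.rand(1, 3)
dist, ind = tree.query(query, k=5)  # distances and indices of the 5 nearest samples
```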

  2. ball-tree

    In a kd-tree, one core factor behind the performance degradation is that the subspaces produced by the partition are hyper-rectangles, while the nearest-neighbor region defined by the Euclidean distance is a hypersphere. A hypercube is very likely to intersect the search hypersphere, and every intersecting subspace has to be checked, which greatly reduces efficiency.

    If the partitioned regions are themselves hyperspheres, the probability of intersection drops sharply. A ball-tree partitions the space with hyperspheres; with the corners removed, the probability that a partition hypersphere intersects the search hypersphere is greatly reduced, and especially when the data dimension is very high the algorithm's efficiency improves substantially.

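For comparison, a hedged sketch with scikit-learn's BallTree, which exposes the same query interface as KDTree; the high-dimensional random data is only for illustration.

```python
import numpy as np
from sklearn.neighbors import BallTree

X = np.random.rand(10000, 50)         # high-dimensional data, where ball-trees tend to pay off
tree = BallTree(X, leaf_size=40)      # partitions the space into nested hyperspheres
dist, ind = tree.query(np.random.rand(1, 50), k=5)
```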

K-means

  • Clustering algorithm
  • Unsupervised learning
  • The dataset has no labels; the data is unorganized
  • The meaning of K — K is a preset number; the dataset is partitioned into K clusters, and choosing K relies on human prior knowledge
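
A minimal sketch of running K-means with scikit-learn; the synthetic blobs dataset and the choice K = 3 are assumptions made here for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)   # unlabeled data
km = KMeans(n_clusters=3, init="random", n_init=10, random_state=0)
labels = km.fit_predict(X)        # each sample is assigned to one of the K clusters
print(km.cluster_centers_)
```

`init="random"` is used here to reflect plain K-means; scikit-learn's default is already the K-means++ seeding discussed below.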

K-means shortcomings

  • The number of cluster centers K must be given in advance, but in practice this value of K is very hard to estimate
  • K-means needs the initial cluster centers to be chosen manually; different initial centers may lead to different clustering results, for example getting stuck in a local optimum (the K-means++ algorithm can be used to address this)
  • Sensitive to noise and outliers

Improvements:

1. K-means++

  • Step 1: randomly pick one sample as the first cluster center c1;
  • Step 2: compute, for each sample, the shortest distance to the existing cluster centers (i.e. the distance to the nearest chosen center), denoted D(x); the larger this value, the higher the probability that the sample is chosen as the next cluster center; finally, use roulette-wheel selection to pick the next cluster center;
  • Step 3: repeat step 2 until k cluster centers have been selected.

After the initial centers are selected, the standard K-means algorithm is run as usual. A sketch of this seeding procedure is shown below.
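
A NumPy sketch of the seeding steps above; the function name is hypothetical, and it follows the common K-means++ convention of weighting the roulette wheel by D(x)².

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: the first center is a uniformly random sample.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Step 2: D(x) = distance from each sample to its nearest chosen center.
        d = np.min(np.linalg.norm(X[:, None] - np.array(centers)[None], axis=2), axis=1)
        # Roulette-wheel selection: larger D(x) -> higher chance of being picked.
        probs = d ** 2 / np.sum(d ** 2)
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```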

2. Bisecting K-means

To address K-means's sensitivity to the initial cluster centers, bisecting K-means is an algorithm that weakens the influence of the initial centroids. The specific steps are as follows:

Put all the sample data into a queue as a single cluster;
Take a cluster out of the queue and partition it with the K-means algorithm (the first iteration is equivalent to partitioning all samples with K = 2), producing two sub-clusters; add the sub-clusters back to the queue and repeat this step until a stopping condition is met (number of clusters, minimum squared error, number of iterations, etc.);
The clusters remaining in the queue form the final clustering result.
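
A rough sketch of that queue-based procedure; splitting the cluster with the largest within-cluster squared error each round is one common choice (the post does not specify), and the function name is my own.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k):
    clusters = [X]                       # the queue starts with all samples as one cluster
    while len(clusters) < k:
        # Choose a cluster to split (here: the one with the largest SSE).
        sse = [np.sum((c - c.mean(axis=0)) ** 2) for c in clusters]
        target = clusters.pop(int(np.argmax(sse)))
        # Split it into two sub-clusters with standard K-means (K = 2).
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(target)
        clusters += [target[labels == 0], target[labels == 1]]
    return clusters                      # the final set of k clusters
```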

GMM

  • Like the K-means algorithm, GMM is fitted iteratively with the EM algorithm. The Gaussian mixture model assumes that the data in each cluster follows a Gaussian (also called normal) distribution, so the distribution of the observed data is the superposition of the Gaussian distributions of the individual clusters.

  • The core idea of the Gaussian mixture model is to assume that the data can be seen as generated from multiple Gaussian distributions. Under this assumption, each individual sub-model is a standard Gaussian whose mean μi and variance Σi are the parameters to be estimated. In addition, each sub-model has a parameter πi, which can be understood as its weight, i.e. the probability of generating a data point. The formula of the Gaussian mixture model is:
     p(x) = Σ_{i=1}^{K} πi · N(x | μi, Σi)

  • The procedure:
    First, randomly initialize the value of each parameter. Then repeat the following two steps until convergence.

    1. E-step: based on the current parameters, compute the probability that each point was generated by each sub-model.

    2. M-step: use the probabilities estimated in the E-step to update the mean, variance and weight of each sub-model.

    In other words, we know neither the three parameters of each of the K Gaussian distributions, nor which Gaussian distribution each data point was generated by. So in each iteration we first fix the current Gaussian distributions and obtain the probability that each data point was generated by each of them; then we fix those generation probabilities and, from the data points and the probabilities, obtain a better set of Gaussian distributions. The loop repeats until the parameters stop changing, or change only very little, and we are left with a reasonable set of Gaussian distributions.
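
A condensed one-dimensional NumPy/SciPy sketch of that E-step/M-step loop; the function name, the random initialization and the fixed iteration count are assumptions made here for illustration, not part of the original post.

```python
import numpy as np
from scipy.stats import norm

def gmm_em_1d(x, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    # Random initialization: equal weights, k random means, a shared variance.
    pi = np.full(k, 1.0 / k)
    mu = rng.choice(x, size=k, replace=False)
    var = np.full(k, x.var())
    for _ in range(n_iter):
        # E-step: probability that each point is generated by each sub-model.
        dens = np.stack([pi[j] * norm.pdf(x, mu[j], np.sqrt(var[j])) for j in range(k)])
        resp = dens / dens.sum(axis=0, keepdims=True)
        # M-step: update weight, mean and variance from those probabilities.
        nk = resp.sum(axis=1)
        pi = nk / n
        mu = resp @ x / nk
        var = np.array([resp[j] @ (x - mu[j]) ** 2 for j in range(k)]) / nk
    return pi, mu, var
```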

    GMM vs. K-means

    What the Gaussian mixture model has in common with the K-means algorithm:

    • Both are clustering algorithms;
    • Both require the value of K to be specified;
    • Both are solved with the EM algorithm;
    • Both can only converge to a local optimum.

    Compared with the K-means algorithm, GMM has the advantage that it can give the probability that a sample belongs to each class; beyond clustering, it can also be used to estimate probability density; and it can be used to generate new sample points.
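
These three advantages map onto scikit-learn's GaussianMixture API roughly as sketched below; the blobs dataset and K = 3 are illustrative assumptions.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

proba = gmm.predict_proba(X[:5])    # probability of each sample belonging to each class
logdens = gmm.score_samples(X[:5])  # estimated (log) probability density at each sample
X_new, _ = gmm.sample(10)           # generate new sample points from the fitted model
```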

Copyright notice
This article was written by [moletop]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/04/202204231523160709.html