KNN, kmeans and GMM
2022-04-23 15:27:00 [moletop]
KNN

Classification algorithm

Supervised learning

The K value: to classify a sample X, look in the dataset for the K data points nearest to X and assign X to the category that appears most often among them.
Problem: the core of the KNN algorithm is finding the k nearest neighbors of the test sample in the training set. If the training set is very large, the naive approach of traversing the whole set to find the k nearest neighbors causes performance to drop sharply.
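The majority-vote rule above can be sketched in a few lines of NumPy. The toy dataset and the `knn_predict` helper are illustrative, not from the original post:

```python
# Minimal brute-force KNN classifier: majority vote among the k nearest points.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify sample x by majority vote among its k nearest neighbors."""
    # Euclidean distance from x to every training sample
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny 2-D toy set: class 0 near the origin, class 1 near (5, 5)
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.05, 0.1]), k=3))  # -> 0
print(knn_predict(X_train, y_train, np.array([5.0, 5.1]), k=3))   # -> 1
```

Note the full distance computation against every training point — exactly the traversal cost that kd-trees are introduced to avoid below.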
Improvement 1: kd-tree
A kd-tree trades space for time: **using the sample points in the training set, it splits the k-dimensional space along each dimension in turn, building a binary tree.** This divide-and-conquer idea greatly improves the search efficiency of the algorithm. Binary search has complexity O(log N), and kd-tree search efficiency approaches that (depending on how close the constructed kd-tree is to a balanced tree).
1. Each level down the tree switches to a different dimension for splitting. The reason is simple: we want the tree to represent the K-dimensional space well, making it easy to query quickly along different dimensions.
2. A leaf node actually represents a small region of the sample space.
3. A single query directly returns the K nearest samples.
Recommended blog: https://zhuanlan.zhihu.com/p/127022333
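A quick sketch of kd-tree nearest-neighbor search, here via scikit-learn's `KDTree` (the random data is purely illustrative):

```python
# Build a kd-tree once, then answer k-NN queries in roughly O(log N) time.
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.random((1000, 3))           # 1000 points in 3-D space

tree = KDTree(X, leaf_size=40)      # build once
dist, ind = tree.query(X[:1], k=5)  # 5 nearest neighbors of the first point

print(ind.shape)   # (1, 5): indices of the 5 nearest points
print(dist[0, 0])  # 0.0 -- the query point itself is its own nearest neighbor
```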

Ball tree
In a kd-tree, one core cause of performance degradation is that the partitioned subspaces are hypercubes, while the nearest-neighbor region is defined by Euclidean distance (a hypersphere). A hypercube is very likely to intersect the search hypersphere, and every intersecting subspace must be checked, which greatly reduces efficiency.
If the partition regions are also hyperspheres, the probability of intersection drops sharply. As shown in the figure below, a ball tree partitions the space with hyperspheres; with the corners removed, the probability that a partition hypersphere intersects the search hypersphere is greatly reduced, especially when the data dimension is very high, so the algorithm's efficiency improves substantially.
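Both index structures are exposed through scikit-learn's `NearestNeighbors`; they return identical neighbors and differ only in how they partition the space. A small sketch on illustrative data:

```python
# Same k-NN query answered by a kd-tree index and a ball-tree index.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X = rng.random((500, 20))  # higher dimension, where ball trees help most

nn_kd = NearestNeighbors(n_neighbors=4, algorithm='kd_tree').fit(X)
nn_ball = NearestNeighbors(n_neighbors=4, algorithm='ball_tree').fit(X)

d_kd, i_kd = nn_kd.kneighbors(X[:3])
d_ball, i_ball = nn_ball.kneighbors(X[:3])

print(np.array_equal(i_kd, i_ball))  # True: same neighbors either way
```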
Kmeans
 Clustering algorithm
 Unsupervised learning
 The dataset has no labels; the data are unorganized
 The K value: K is a preset number; the dataset is divided into K clusters, and choosing K relies on human prior knowledge
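The K-cluster partition described above can be sketched with scikit-learn's `KMeans` (the two-blob toy data is illustrative, not from the original post):

```python
# Cluster unlabeled 2-D points into K=2 clusters.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],   # blob near the origin
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])  # blob near (5, 5)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.labels_)           # one cluster id per sample
print(km.cluster_centers_)  # the two learned centroids
```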
Kmeans shortcomings
 The number of cluster centers K must be given in advance, but in practice this K value is very hard to estimate
 Kmeans needs the initial cluster centers to be set manually; different initial centers may lead to different clustering results, e.g. falling into a local optimum (the Kmeans++ algorithm can address this)
 Sensitive to noise and outliers
Improvements:
1. Kmeans++
 Step 1: randomly select one sample as the first cluster center c1;
 Step 2: compute each sample's shortest distance to the existing cluster centers (i.e. the distance to the nearest cluster center), denoted D(x); the larger D(x) is, the higher the probability of being selected as the next cluster center; finally, use the roulette-wheel method to select the next cluster center;
 Step 3: repeat step 2 until k cluster centers have been selected.
After the initial centers are selected, just run the standard kmeans algorithm.
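The seeding steps above can be sketched in plain NumPy. Following the standard kmeans++ algorithm, the roulette wheel draws with probability proportional to D(x)²; the function name and toy data are ours:

```python
# kmeans++ seeding: spread the initial centers out via D(x)^2 sampling.
import numpy as np

def kmeans_pp_init(X, k, rng):
    """Pick k initial centers from X with the kmeans++ rule."""
    n = X.shape[0]
    # Step 1: the first center is a uniformly random sample
    centers = [X[rng.integers(n)]]
    for _ in range(k - 1):
        # Step 2: D(x)^2 = squared distance from each sample to its nearest center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # Roulette wheel: larger D(x)^2 -> higher chance of being selected
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)  # Step 3 done: k centers chosen

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.1, (50, 2)),   # tight blob at (0, 0)
               rng.normal(5, 0.1, (50, 2))])  # tight blob at (5, 5)
centers = kmeans_pp_init(X, k=2, rng=rng)
print(centers.shape)  # (2, 2): one row per initial center
```

With two well-separated blobs, the second seed almost certainly lands in the opposite blob, which is exactly the behavior the D(x) weighting is designed to produce.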
2. Bisecting K-means
To address the KMeans algorithm's sensitivity to the initial cluster centers, bisecting KMeans weakens the influence of the initial centroids. The specific steps are:
1. Put all the sample data into a queue as one cluster;
2. Take a cluster from the queue and split it with the KMeans algorithm (the first iteration splits all samples with K=2) into two sub-clusters; add the sub-clusters back to the queue and repeat this step until the stopping conditions are met (number of clusters, minimum squared error, number of iterations, etc.);
3. The clusters in the queue form the final cluster set.
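The queue procedure above can be sketched by reusing scikit-learn's `KMeans` for each 2-way split. Picking the cluster with the largest within-cluster squared error to split next is one common choice, assumed here; the function and data are illustrative:

```python
# Bisecting k-means: repeatedly 2-way split the worst cluster until k remain.
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, seed=0):
    clusters = [np.arange(len(X))]   # step 1: one cluster holding all samples
    while len(clusters) < k:         # stopping condition: number of clusters
        # choose the cluster with the largest within-cluster squared error
        sse = [np.sum((X[idx] - X[idx].mean(axis=0)) ** 2) for idx in clusters]
        idx = clusters.pop(int(np.argmax(sse)))
        # step 2: split it into two sub-clusters with a K=2 KMeans run
        labels = KMeans(n_clusters=2, n_init=10,
                        random_state=seed).fit_predict(X[idx])
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    return clusters                  # step 3: the final cluster set

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (30, 2)),
               rng.normal(4, 0.2, (30, 2)),
               rng.normal(8, 0.2, (30, 2))])
clusters = bisecting_kmeans(X, k=3)
print([len(c) for c in clusters])  # three clusters covering all 90 samples
```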
GMM

Like the K-means algorithm, GMM is solved iteratively with the EM algorithm. The Gaussian mixture model assumes that the data in each cluster follow a Gaussian distribution (also called the normal distribution), and that the distribution the data currently present is the superposition of each cluster's Gaussian distribution.

The core idea of the Gaussian mixture model is the assumption that the data can be seen as generated from multiple Gaussian distributions. Under this assumption, each individual sub-model is a standard Gaussian model, whose mean μi and covariance Σi are the parameters to be estimated. In addition, each sub-model has a parameter πi, which can be understood as its weight, or the probability of it generating a data point. The formula of the Gaussian mixture model is:

p(x) = Σ_{i=1..K} π_i · N(x | μ_i, Σ_i),  where Σ_i π_i = 1
The process:
First, randomly initialize the value of each parameter. Then repeat the following two steps until convergence.
1. E step: given the current parameters, compute the probability that each point was generated by each sub-model.
2. M step: use the probabilities estimated in the E step to improve each sub-model's mean, variance, and weight.
In other words, we know neither the best 3 parameters of each of the K Gaussian distributions, nor which Gaussian distribution generated each data point. So in each iteration, we first fix the current Gaussian distributions and obtain the probability that each data point was generated by each Gaussian; then we fix these generation probabilities and, from the data points and the probabilities, obtain a better set of Gaussian distributions. We loop until the parameters stop changing, or change very little, and we end up with a reasonable set of Gaussian distributions.
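The EM loop above is what scikit-learn's `GaussianMixture` runs internally until convergence; a minimal sketch on illustrative two-blob data:

```python
# Fit a 2-component GMM; predict_proba exposes the E-step responsibilities.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (100, 2)),   # Gaussian cluster near (0, 0)
               rng.normal(5, 0.5, (100, 2))])  # Gaussian cluster near (5, 5)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

probs = gmm.predict_proba(X[:1])  # E-step output: soft responsibilities
print(probs.sum())                # sums to 1 across the components
print(gmm.means_.round(1))        # estimated means, near (0,0) and (5,5)
print(gmm.weights_.round(2))      # estimated mixing weights pi_i
```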
GMM vs. KMeans
What the Gaussian mixture model and the K-means algorithm have in common:
 Both are clustering algorithms;
 Both require the K value to be specified;
 Both are solved with the EM algorithm;
 Both can only converge to a local optimum.
Compared with the K-means algorithm, its advantage is that it can give the probability that a sample belongs to each class; beyond clustering, it can also be used to estimate the probability density; and it can be used to generate new sample points.
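The density-estimation and generative uses mentioned above are both available on a fitted `GaussianMixture`; a short sketch on illustrative 1-D data:

```python
# A fitted GMM can score densities and draw brand-new samples.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (80, 1)),
               rng.normal(6, 0.3, (80, 1))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

log_dens = gmm.score_samples(np.array([[0.0], [3.0]]))  # log density estimates
X_new, comp = gmm.sample(50)                            # generate new points

print(log_dens[0] > log_dens[1])  # True: 0.0 lies in a cluster, 3.0 between them
print(X_new.shape)                # (50, 1)
```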
Copyright notice
This article was created by [moletop]. Please include the original link when reposting. Thanks.
https://yzsam.com/2022/04/202204231523160709.html