当前位置：网站首页>KNN, kmeans and GMM
KNN, kmeans and GMM
2022-04-23 15:27:00 【moletop】
K Value means - For a sample X, To classify it , First, from the dataset , stay X Look nearby for the nearest K Data points , Divide it into the category with the most categories .
problem ： KNN The core of the algorithm is to find the location of the sample to be tested in the training sample set k A close neighbor , If the training sample set is too large , Then the traditional traversal of the whole sample to find k The nearest neighbor approach will lead to a sharp decline in performance .
- kd-tree Space for time ,** Use the sample points in the training sample set , Align... In turn along each dimension k Divide the dimensional space , Build a binary tree ,** The idea of divide and conquer is used to greatly improve the search efficiency of the algorithm . We know , The algorithm complexity of binary search is O(logN),kd-tree The search efficiency is close to （ Depending on the construction kd-tree Whether it is close to the balance tree ).
1. Every layer down the tree , We will change a dimension to measure . The reason is also simple , Because we want this tree to be K Dimensional space has a good expression ability , It is convenient for us to query quickly according to different dimensions .
2. Leaf node , Actually Represents a small space in the sample space .
3. In one query Go straight to K A recent sample
Excellent blog ：https://zhuanlan.zhihu.com/p/127022333
stay kd-tree in , One of the core factors leading to performance degradation is kd-tree The partitioned subspace in is a hypercube , The nearest neighbor is the Euclidean distance （ Super ball ）, The possibility of a hypercube intersecting with a hypersphere is very high , All intersecting subspaces , All need to be checked , Greatly reduce the operation efficiency .
If The division area is also a hypersphere , The probability of intersection is greatly reduced . As shown in the figure below , by ball-tree Divide space by hypersphere , Remove edges and corners , The probability of intersection between dividing hypersphere and searching hypersphere is greatly reduced , Especially when the data dimension is very high , The efficiency of the algorithm has been greatly improved .
- clustering algorithm
- Unsupervised learning
- The dataset is null Label, Messy data
- K Value means - K It's a preset number , Divide the dataset into K A cluster of , We need to rely on human prior knowledge
- Number of cluster centers K Need to be given in advance , But in practice, this K The choice of value is very difficult to estimate
- Kmeans The initial clustering center needs to be determined artificially , Different initial clustering centers may lead to different clustering results , For example, falling into local optimization .（ have access to Kmeans++ Algorithm to solve ）
- Sensitive to noise and outliers
- Step one ： Randomly select a sample as the first cluster center c1;
- Step two ： Calculate the shortest distance between each sample and the existing cluster center （ That is, the distance from the nearest cluster center ）, use
D(x)Express ; The bigger this is , It means that the probability of being selected as the cluster center is high ; Last , Use the roulette method to select the next cluster center ;
- Step three ： Repeat step 2 , Know how to choose k Cluster centers .
After selecting the initial point , Just keep using the standard k-means The algorithm .
2. Two points K_means
solve K-Means The algorithm is sensitive to the initial cluster center , Two points K-Means The algorithm is an algorithm that weakens the initial centroid , The specific steps are as follows ：
Put all the sample data into a queue as a cluster ;
Select a cluster from the queue to K-means Algorithm partition （ The first iteration is equivalent to all samples K=2 Division ）, It is divided into two sub clusters , And add the sub cluster to the queue to iterate the second step of operation , Until the suspension conditions are met ( Number of clusters 、 Minimum square error 、 Iterations, etc )
The cluster in the queue is the final cluster set .
And K The mean algorithm is similar to , The same is used EM The algorithm is iterative . The Gaussian mixture model assumes that the data of each cluster conforms to the Gaussian distribution （ It's also called normal distribution ） Of , At present The distribution of data presented is the result of superposition of Gaussian distribution of each cluster .
The core idea of Gaussian mixture model is , Suppose the data can be seen as generated from multiple Gaussian distributions Of . Under this assumption , Each individual sub model is a standard Gaussian model , The average is ui And variance ∑i Is the parameter to be estimated . Besides , Each sub model has a parameter πi, It can be understood as weight or probability of generating data . The formula of Gaussian mixture model is ：
The process ：
First , Initial random selection of the value of each parameter . then , Repeat the following two steps , Until it converges .
1.E step . According to the current parameters , Calculate the probability that each point is generated by a certain sub model .
2.M step . Use E The estimated probability of the step , To improve the mean value of each sub model , Variance and weight .
in other words , We don't know the best K Each of the Gauss distributions 3 Parameters , I don't know every one of them Which Gaussian distribution is the data point generated . So every time I cycle , Fix the current Gaussian distribution first change , Obtain the probability of each data point generated by each Gaussian distribution . Then fix the generation probability to be the same , According to data points and generation probability , Get a better Gaussian distribution for a group . cycle , Until the parameters don't change , Or change very little , Then we get a reasonable set of Gaussian distribution .
GMM And K-Means comparison
Gaussian mixture model and K The same thing about the mean algorithm is ：
- They are all clustering algorithms ;
- Need to be Appoint K value ;
- Is to use EM Algorithm to solve ;
- They can only converge to the local optimum .
And it's compared to K The advantage of mean algorithm is , We can give the probability that a sample belongs to a certain class ; Not just clustering , It can also be used to estimate the probability density ; And can be used to generate new sample points .
- Precautions for use of dispatching system
- Summary of interfaces for JDBC and servlet to write CRUD
- Use of common pod controller of kubernetes
- adobe illustrator 菜单中英文对照
- adobe illustrator 菜單中英文對照
- Redis cluster principle
- Mysql database explanation (IX)
- Openstack command operation
- Educational codeforces round 127 A-E problem solution
My raspberry PI zero 2W tossing notes record some problems encountered and solutions
How to design a good API interface?
About UDP receiving ICMP port unreachable
Wechat applet customer service access to send and receive messages
setcontext getcontext makecontext swapcontext
Have you learned the basic operation of circular queue?
[leetcode daily question] install fence
Code live collection ▏ software test report template Fan Wen is here
Connectez PHP à MySQL via aodbc
adobe illustrator 菜单中英文对照
asp. Net method of sending mail using mailmessage
Sword finger offer (1) -- for Huawei
T2 icloud calendar cannot be synchronized
On the day of entry, I cried (mushroom street was laid off and fought for seven months to win the offer)
Do keyword search, duplicate keyword search, or do not match
Will golang share data with fragment append
What role does the software performance test report play? How much is the third-party test report charged?
如果conda找不到想要安装的库怎么办PackagesNotFoundError: The following packages are not available from current
Openfaas practice 4: template operation
Common types of automated testing framework ▏ automated testing is handed over to software evaluation institutions
Openstack command operation
What are the mobile app software testing tools? Sharing of third-party software evaluation
TLS / SSL protocol details (28) differences between TLS 1.0, TLS 1.1 and TLS 1.2
Precautions for use of dispatching system