当前位置:网站首页>KNN, kmeans and GMM
KNN, kmeans and GMM
2022-04-23 15:27:00 【moletop】
Knn,Kmeans and GMM
KNN
-
Classification algorithm
-
Supervised learning
-
K Value means - For a sample X, To classify it , First, from the dataset , stay X Look nearby for the nearest K Data points , Divide it into the category with the most categories .
problem : KNN The core of the algorithm is to find the location of the sample to be tested in the training sample set k A close neighbor , If the training sample set is too large , Then the traditional traversal of the whole sample to find k The nearest neighbor approach will lead to a sharp decline in performance .
improvement :1.kd-tree
- kd-tree Space for time ,** Use the sample points in the training sample set , Align... In turn along each dimension k Divide the dimensional space , Build a binary tree ,** The idea of divide and conquer is used to greatly improve the search efficiency of the algorithm . We know , The algorithm complexity of binary search is O(logN),kd-tree The search efficiency is close to ( Depending on the construction kd-tree Whether it is close to the balance tree ).
1. Every layer down the tree , We will change a dimension to measure . The reason is also simple , Because we want this tree to be K Dimensional space has a good expression ability , It is convenient for us to query quickly according to different dimensions .
2. Leaf node , Actually Represents a small space in the sample space .
3. In one query Go straight to K A recent sample
Excellent blog :https://zhuanlan.zhihu.com/p/127022333
-
ball-tree
stay kd-tree in , One of the core factors leading to performance degradation is kd-tree The partitioned subspace in is a hypercube , The nearest neighbor is the Euclidean distance ( Super ball ), The possibility of a hypercube intersecting with a hypersphere is very high , All intersecting subspaces , All need to be checked , Greatly reduce the operation efficiency .
If The division area is also a hypersphere , The probability of intersection is greatly reduced . As shown in the figure below , by ball-tree Divide space by hypersphere , Remove edges and corners , The probability of intersection between dividing hypersphere and searching hypersphere is greatly reduced , Especially when the data dimension is very high , The efficiency of the algorithm has been greatly improved .
K-means
- clustering algorithm
- Unsupervised learning
- The dataset is null Label, Messy data
- K Value means - K It's a preset number , Divide the dataset into K A cluster of , We need to rely on human prior knowledge
k-means shortcoming
- Number of cluster centers K Need to be given in advance , But in practice, this K The choice of value is very difficult to estimate
- Kmeans The initial clustering center needs to be determined artificially , Different initial clustering centers may lead to different clustering results , For example, falling into local optimization .( have access to Kmeans++ Algorithm to solve )
- Sensitive to noise and outliers
improvement :
1.Kmeans++
- Step one : Randomly select a sample as the first cluster center c1;
- Step two : Calculate the shortest distance between each sample and the existing cluster center ( That is, the distance from the nearest cluster center ), use
D(x)
Express ; The bigger this is , It means that the probability of being selected as the cluster center is high ; Last , Use the roulette method to select the next cluster center ;- Step three : Repeat step 2 , Know how to choose k Cluster centers .
After selecting the initial point , Just keep using the standard k-means The algorithm .
2. Two points K_means
solve K-Means The algorithm is sensitive to the initial cluster center , Two points K-Means The algorithm is an algorithm that weakens the initial centroid , The specific steps are as follows :
Put all the sample data into a queue as a cluster ;
Select a cluster from the queue to K-means Algorithm partition ( The first iteration is equivalent to all samples K=2 Division ), It is divided into two sub clusters , And add the sub cluster to the queue to iterate the second step of operation , Until the suspension conditions are met ( Number of clusters 、 Minimum square error 、 Iterations, etc )
The cluster in the queue is the final cluster set .
GMM
-
And K The mean algorithm is similar to , The same is used EM The algorithm is iterative . The Gaussian mixture model assumes that the data of each cluster conforms to the Gaussian distribution ( It's also called normal distribution ) Of , At present The distribution of data presented is the result of superposition of Gaussian distribution of each cluster .
-
The core idea of Gaussian mixture model is , Suppose the data can be seen as generated from multiple Gaussian distributions Of . Under this assumption , Each individual sub model is a standard Gaussian model , The average is ui And variance ∑i Is the parameter to be estimated . Besides , Each sub model has a parameter πi, It can be understood as weight or probability of generating data . The formula of Gaussian mixture model is :
-
The process :
First , Initial random selection of the value of each parameter . then , Repeat the following two steps , Until it converges .1.E step . According to the current parameters , Calculate the probability that each point is generated by a certain sub model .
2.M step . Use E The estimated probability of the step , To improve the mean value of each sub model , Variance and weight .
in other words , We don't know the best K Each of the Gauss distributions 3 Parameters , I don't know every one of them Which Gaussian distribution is the data point generated . So every time I cycle , Fix the current Gaussian distribution first change , Obtain the probability of each data point generated by each Gaussian distribution . Then fix the generation probability to be the same , According to data points and generation probability , Get a better Gaussian distribution for a group . cycle , Until the parameters don't change , Or change very little , Then we get a reasonable set of Gaussian distribution .
GMM And K-Means comparison
Gaussian mixture model and K The same thing about the mean algorithm is :
- They are all clustering algorithms ;
- Need to be Appoint K value ;
- Is to use EM Algorithm to solve ;
- They can only converge to the local optimum .
And it's compared to K The advantage of mean algorithm is , We can give the probability that a sample belongs to a certain class ; Not just clustering , It can also be used to estimate the probability density ; And can be used to generate new sample points .
版权声明
本文为[moletop]所创,转载请带上原文链接,感谢
https://yzsam.com/2022/04/202204231523160709.html
边栏推荐
- Code live collection ▏ software test report template Fan Wen is here
- G007-HWY-CC-ESTOR-03 华为 Dorado V6 存储仿真器搭建
- C语言超全学习路线(收藏让你少走弯路)
- Sword finger offer (1) -- for Huawei
- Detailed explanation of kubernetes (XI) -- label and label selector
- Five data types of redis
- Precautions for use of dispatching system
- GFS distributed file system (Theory)
- TLS / SSL protocol details (30) RSA, DHE, ecdhe and ecdh processes and differences in SSL
- 深度学习调参的技巧
猜你喜欢
Kubernetes详解(十一)——标签与标签选择器
My raspberry PI zero 2W tossing notes record some problems encountered and solutions
Mysql database explanation (10)
About UDP receiving ICMP port unreachable
My raspberry PI zero 2W toss notes to record some problems and solutions
MySQL InnoDB transaction
Tun model of flannel principle
深度学习——超参数设置
Set onedrive or Google drive as a drawing bed in upic for free
Mysql连接查询详解
随机推荐
MySQL InnoDB transaction
Redis cluster principle
Code live collection ▏ software test report template Fan Wen is here
Design of digital temperature monitoring and alarm system based on DS18B20 single chip microcomputer [LCD1602 display + Proteus simulation + C program + paper + key setting, etc.]
如果conda找不到想要安装的库怎么办PackagesNotFoundError: The following packages are not available from current
Mysql database explanation (VII)
2022年中国数字科技专题分析
What role does the software performance test report play? How much is the third-party test report charged?
MultiTimer v2 重构版本 | 一款可无限扩展的软件定时器
Adobe Illustrator menu in Chinese and English
API gateway / API gateway (II) - use of Kong - load balancing
码住收藏▏软件测试报告模板范文来了
What are the mobile app software testing tools? Sharing of third-party software evaluation
Detailed explanation of C language knowledge points -- data types and variables [1] - carry counting system
如何设计一个良好的API接口?
Openstack theoretical knowledge
PHP PDO ODBC loads files from one folder into the blob column of MySQL database and downloads the blob column to another folder
Will golang share data with fragment append
How to design a good API interface?
深度学习调参的技巧