Machine learning practice - naive Bayes
2022-04-23 18:34:00 【Xuanche_】
Naive Bayes
I. Overview

Bayesian classification is a statistical, probability-based classification method, and naive Bayes is its simplest form. The classification principle: use Bayes' formula to compute the posterior probability of each class from the prior probabilities and feature likelihoods, then assign the sample to the class with the maximum posterior probability. It is called "naive" because it makes the most primitive, simplest assumption: all features are statistically independent.

Suppose a sample X has attributes a_1, a_2, a_3, ..., a_n. The features are statistically independent if

P(X) = P(a_1, a_2, \dots, a_n) = P(a_1)P(a_2)\cdots P(a_n)
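As a quick sketch of this independence assumption (with made-up per-feature probabilities), the joint probability is just the product of the individual ones:

```python
# Under the naive independence assumption the joint probability of the
# attributes factorizes into a product (probabilities below are made up):
# P(a1, a2, a3) = P(a1) * P(a2) * P(a3)
from functools import reduce

feature_probs = [0.5, 0.4, 0.2]  # hypothetical P(a1), P(a2), P(a3)
joint = reduce(lambda acc, p: acc * p, feature_probs, 1.0)
print(joint)  # 0.5 * 0.4 * 0.2 = 0.04
```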
1. Conditional probability formula

Conditional probability is the probability that event A occurs given that event B has occurred, written P(A|B).

From the Venn diagram: given that B has occurred, the probability that A occurs is P(A∩B) divided by P(B):

P(A|B) = \frac{P(A \cap B)}{P(B)} \;\Rightarrow\; P(A|B)P(B) = P(A \cap B)

Similarly: P(B|A)P(A) = P(A \cap B)

Therefore: P(B|A)P(A) = P(A|B)P(B) \;\Rightarrow\; P(A|B) = \frac{P(B|A)P(A)}{P(B)}
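A quick numeric check of the rearrangement above, with made-up probabilities:

```python
# Verify P(A|B) = P(B|A) * P(A) / P(B) numerically (all values made up).
p_a = 0.3          # P(A)
p_b = 0.5          # P(B)
p_b_given_a = 0.6  # P(B|A)

p_a_and_b = p_b_given_a * p_a  # P(A ∩ B) = P(B|A) P(A)
p_a_given_b = p_a_and_b / p_b  # P(A|B) = P(A ∩ B) / P(B)
print(p_a_given_b)             # ≈ 0.36
```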
Next, the total probability formula: if events A_1, A_2, ..., A_n form a complete partition of the sample space, each with positive probability, then for any event B:

P(B) = P(BA_1) + P(BA_2) + \dots + P(BA_n)

P(B) = \sum_{i=1}^{n} P(A_i)P(B|A_i)
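For instance, with a hypothetical three-event partition, the formula can be checked like this:

```python
# Total probability: P(B) = sum over i of P(A_i) * P(B|A_i).
# The partition probabilities below are made up and sum to 1.
p_a = [0.2, 0.5, 0.3]          # P(A1), P(A2), P(A3)
p_b_given_a = [0.1, 0.4, 0.8]  # P(B|A1), P(B|A2), P(B|A3)

p_b = sum(pa * pb for pa, pb in zip(p_a, p_b_given_a))
print(p_b)  # 0.02 + 0.20 + 0.24 = 0.46
```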
2. Bayesian inference

Combining the conditional probability and total probability formulas gives Bayes' formula:

P(A|B) = P(A)\frac{P(B|A)}{P(B)}

P(A_i|B) = P(A_i)\frac{P(B|A_i)}{\sum_{j=1}^{n} P(A_j)P(B|A_j)}
P(A) is called the "prior probability": our estimate of the probability of A before event B occurs.

P(A|B) is called the "posterior probability": our re-evaluation of the probability of A after event B occurs.

P(B|A)/P(B) is called the "likelihood" factor: an adjustment factor that brings the estimated probability closer to the true probability.

So Bayes' formula can be understood as: posterior probability = prior probability × adjustment factor.
If the adjustment factor > 1, the prior probability is strengthened and event A becomes more likely;
if the adjustment factor = 1, event B gives no help in judging the likelihood of event A;
if the adjustment factor < 1, the prior probability is weakened and event A becomes less likely.
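These three cases can be illustrated with made-up numbers:

```python
# Posterior = prior * adjustment factor, for three made-up likelihood values.
prior = 0.4
p_b = 0.6

for p_b_given_a in (0.9, 0.6, 0.3):
    factor = p_b_given_a / p_b  # the likelihood factor P(B|A)/P(B)
    posterior = prior * factor
    print(factor, posterior)
# factor 1.5 -> posterior 0.6  (prior strengthened)
# factor 1.0 -> posterior 0.4  (B uninformative)
# factor 0.5 -> posterior 0.2  (prior weakened)
```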
II. Types of naive Bayes

scikit-learn provides three naive Bayes classification algorithms: GaussianNB, MultinomialNB, and BernoulliNB.
1. GaussianNB

GaussianNB is naive Bayes with a **Gaussian (normal) prior**: the value of each feature, conditioned on each label, is assumed to follow a simple normal distribution:

P(X_j = x_j \mid Y = C_k) = \frac{1}{\sqrt{2\pi\sigma_k^2}}\exp\left(-\frac{(x_j - \mu_k)^2}{2\sigma_k^2}\right)

where C_k is the k-th class of Y, and \mu_k and \sigma_k^2 are the class-conditional mean and variance estimated from the training set.

Here is a simple GaussianNB implementation using scikit-learn:
# Import packages
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset
from sklearn import datasets
iris = datasets.load_iris()

# Split the dataset
Xtrain, Xtest, ytrain, ytest = train_test_split(iris.data,
                                                iris.target,
                                                random_state=12)

# Build the model
clf = GaussianNB()
clf.fit(Xtrain, ytrain)

# Predict on the test set; predict_proba returns each sample's
# probability of belonging to each class
clf.predict(Xtest)
clf.predict_proba(Xtest)

# Test-set accuracy
accuracy_score(ytest, clf.predict(Xtest))
2. MultinomialNB

MultinomialNB is naive Bayes with a multinomial prior: it assumes the features are generated by a simple multinomial distribution. The multinomial distribution describes the probability of observing each of several outcomes, so multinomial naive Bayes suits features that are occurrence counts or occurrence proportions. The model is often used in text classification, where each feature is a count, e.g. the number of times a word appears.

The smoothed multinomial estimate is:

P(X_j = x_{jl} \mid Y = C_k) = \frac{m_{k,jl} + \xi}{m_k + n\xi}

where P(X_j = x_{jl} \mid Y = C_k) is the conditional probability that the j-th feature of a class-k sample takes the value x_{jl}; m_k is the number of class-k samples in the training set, m_{k,jl} is the number of those samples whose j-th feature equals x_{jl}, and n is the number of possible values of the feature. ξ is a constant greater than 0, usually taken as 1 (Laplace smoothing), though other values are possible.
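A minimal word-count text-classification sketch with MultinomialNB (the toy corpus and labels below are invented for illustration; scikit-learn's `alpha` parameter plays the role of ξ):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus: 1 = spam, 0 = ham (labels are hypothetical).
docs = ["free prize money", "free money now",
        "meeting at noon", "project meeting schedule"]
labels = [1, 1, 0, 0]

vec = CountVectorizer()
X = vec.fit_transform(docs)     # features = word occurrence counts
clf = MultinomialNB(alpha=1.0)  # alpha=1.0 is Laplace smoothing
clf.fit(X, labels)

# Predict on a new, unseen toy document
print(clf.predict(vec.transform(["free money"])))
```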
3. BernoulliNB

BernoulliNB is naive Bayes with a Bernoulli prior: each feature is assumed to follow a two-valued Bernoulli distribution, i.e.

P(X_j = x_{jl} \mid Y = C_k) = P(j \mid Y = C_k)\,x_{jl} + \bigl(1 - P(j \mid Y = C_k)\bigr)(1 - x_{jl})

where x_{jl} can only take the value 0 or 1.

In the Bernoulli model the value of each feature is Boolean, i.e. true/false or 1/0. In text classification, this is whether a word appears in the document.
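A toy BernoulliNB sketch with binary present/absent features (the data below is invented):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Each row: whether each of 3 hypothetical words appears (1) or not (0).
X = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]])
y = np.array([1, 1, 0, 0])

clf = BernoulliNB(alpha=1.0)  # features are already binary; binarize=0.0 is the default
clf.fit(X, y)
print(clf.predict([[1, 0, 0]]))  # first word present only -> class 1
```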
III. Summary

- In general, if the sample features are mostly continuous, GaussianNB works better.
- If the sample features are mostly multi-valued discrete counts, MultinomialNB is more appropriate.
- If the sample features are binary, or very sparse multi-valued discrete values, BernoulliNB should be used.
Copyright notice

This article was created by [Xuanche_]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/04/202204231824246309.html