当前位置：网站首页>Data mining -- naive Bayesian classification

Data mining -- naive Bayesian classification

2022-04-22 05:44:00 【Muxi dare】

《 data mining 》 National Defense University of science and technology
《 data mining 》 Qingdao University
Python： Bayesian classification
Bayesian classification is based on Bayesian theorem , It is one of the core methods of machine learning .
At present, there are four kinds of Bayesian classifiers ：

Naive Bayes classifier (Naive Bayes Classifier, or NBC)
TAN
BAN
GBN

Naive Bayesian classification of data mining

• Naive Bayesian classifier has a solid mathematical foundation , And stable classification efficiency . meanwhile , The estimation required for this model
There are few parameters in the meter , Less sensitive to missing data , The algorithm is also relatively simple .

Bayes theorem

Insert picture description here

The ultimate goal is to achieve p( Category | features ) .
• Simplicity in naive Bayes is to assume that features are independent of each other .

working process

Insert picture description here

Advantages and disadvantages

advantage ：
（1） The logic is simple 、 Easy to implement 、 The time and space overhead of the algorithm in the classification process is relatively small ;
（2） The algorithm is relatively stable 、 It has good robustness
shortcoming ： There is the assumption of conditional independence between attributes , In many practical problems, this independence assumption is not tenable , If you ignore this in the practical problem of correlation between attributes , It will reduce the classification effect .

Add

Insert picture description here
come from ： Whiteboard derivation notes

Python Realization

sklearn.naive_bayes: Naive Bayes modular

According to the characteristic data Prior distribution Different ,scikit-learn The library provides 5 Different naive Bayesian classification algorithms ：

example

import pandas as pd
from sklearn.naive_bayes import GaussianNB

data_url="../1.  data mining - National Defense University of science and technology / Data packets /diabetes.csv"
df = pd.read_csv(data_url)
x = df.iloc[0:735,0:8]
y = df.iloc[0:735,8]
clf = GaussianNB().fit(x,y)
dftest = df.iloc[735:768,0:8]
df1 = pd.DataFrame(columns=['test','true'])

df2 = df.iloc[735:768,8].to_frame()
df2 = df2.reset_index(drop=True)

df1['test'] = clf.predict(dftest)
df1['true']=df2

#  Accuracy calculation 
m = 0
for i in range(0,df1.shape[0]):
    if df1.at[i,'test']==df1.at[i,'true']:
        m = m + 1
    i = i + 1
acc = m/df1.shape[0]
print(" Accuracy rate is ：",acc)

import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
import numpy as np

#  Function encapsulation ：
def Bayes_test(df):
    X = df.iloc[:,1:4]
    y = df.iloc[:,4]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    print(y_test)
    #  Use Gaussian naive Bayes to calculate 
    clf = GaussianNB()
    clf.fit(X_train, y_train)
    #  assessment 
    y_pred = clf.predict(X_test)
    print(y_pred)
    acc = np.sum(y_test == y_pred) / X_test.shape[0]
    return acc

#  Read the data and calculate the accuracy ：
data_url = "../1.  data mining - National Defense University of science and technology / Data packets /iris.csv"
df = pd.read_csv(data_url,index_col=0)
acc = Bayes_test(df)
print(" Accuracy rate is ：",acc)