Introduction to data analysis | Kaggle Titanic task (IV): Data cleaning and feature processing

2022-04-23 10:33:00 【Ape knowledge】

Series index: Introduction to data analysis | Kaggle Titanic task

Contents

I. Data cleaning and feature processing

(1) Overview of data cleaning

(2) Observing missing values

(3) Handling missing values

(4) Handling duplicate values

(5) Observing and processing features

(6) Binning (discretizing) Age

(7) Converting text variables

(8) Extracting the Titles feature from the plain-text Name (Titles are Mr, Miss, Mrs, etc.)

I. Data cleaning and feature processing

(1) Overview of data cleaning

The data we get is usually not clean: it contains missing values, outliers, and so on, and it needs some processing before we can move on to analysis or modeling. So the first step after obtaining the data is to clean it. In this chapter we learn how to handle missing values, duplicate values, and the conversion between strings and numeric data, so that the data ends up in a form that can be analyzed or modeled.

(2) Observing missing values

① Method 1:

df.head(3)  # preview the first three rows

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S

df.info()  # non-null count and dtype of every column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

② Method 2:

df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
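
The per-column counts above can also be turned into ratios, which makes it easier to see that Cabin is mostly empty while Embarked is almost complete. The following is a minimal sketch on a made-up table: the `toy` frame and its values are assumptions for illustration only, not the real Titanic data.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Titanic table (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

missing_count = toy.isnull().sum()    # number of missing cells per column
missing_ratio = toy.isnull().mean()   # fraction of missing cells per column

# Combine both views and list the most incomplete columns first
summary = pd.DataFrame({"count": missing_count, "ratio": missing_ratio})
print(summary.sort_values("ratio", ascending=False))
```

On the real data the same two calls would be applied to `df` instead of `toy`.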

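(3) Handling missing values

import numpy as np

# Attempt 1: compare with None. Missing values in a float column are stored
# as NaN, not None, so the mask is False everywhere and nothing changes.
df[df['Age'] == None] = 0
df.head(3)

# Attempt 2: isnull() does find the missing ages.
# Caution: df[mask] = 0 sets EVERY column of the matching rows to 0, not just Age.
df[df['Age'].isnull()] = 0
df.head(3)

# Attempt 3: compare with np.nan. NaN is never equal to itself, so this matches nothing.
df[df['Age'] == np.nan] = 0
df.head()

# Drop every row that contains at least one missing value (returns a copy)
df.dropna().head(3)

# Fill missing values in Age only (returns a copy)
df.fillna({'Age': 0}).head(3)
# Fill missing values in the whole table with 0 (returns a copy)
df.fillna(0)

# Fill only the missing entries of the Age column with 0, in place
df.loc[df['Age'].isnull(), 'Age'] = 0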

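【Reflection】 For locating missing values, which of np.nan, None and .isnull() works best, and why? If one of these methods fails to find the missing values, what is the reason?

【Answer】 When a numeric column is read in, its missing values are stored as NaN and the column dtype is float64, so an index built with == None generally finds nothing. Comparing with == np.nan does not work either, because NaN is not equal to anything, including itself. np.nan is the right value to represent a missing entry, but to locate missing entries it is best to use .isnull() (or its alias .isna()).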
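
The difference between the three checks can be verified on a tiny example. This is a sketch under stated assumptions: `ages` and `toy` are made-up data, and the variable names are only illustrative. The second half shows why a boolean mask on the whole frame behaves differently from `.loc` with a column label when filling.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, np.nan, 26.0])

found_with_none = int((ages == None).sum())    # element-wise comparison, never True here
found_with_nan = int((ages == np.nan).sum())   # NaN is not equal to itself
found_with_isnull = int(ages.isnull().sum())   # dedicated missing-value check

print(found_with_none, found_with_nan, found_with_isnull)

# Filling: a boolean mask on the whole frame overwrites entire rows,
# while .loc with a column label touches only that column.
toy = pd.DataFrame({"Fare": [7.25, 8.05], "Age": [22.0, np.nan]})

whole_row = toy.copy()
whole_row[whole_row["Age"].isnull()] = 0            # Fare in that row becomes 0 too

age_only = toy.copy()
age_only.loc[age_only["Age"].isnull(), "Age"] = 0   # Fare is left untouched
```

Only `.isnull()` reports the missing entry, and only the `.loc` form keeps the other columns of the affected row intact.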

(4) Handling duplicate values

df[df.duplicated()]  # show the rows that duplicate an earlier row

df = df.drop_duplicates()  # drop duplicate rows, keeping the first occurrence
df.head()
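
As a small illustration of how `duplicated` and `drop_duplicates` decide what counts as a repeat, here is a sketch on made-up rows. The `toy` frame is an assumption for illustration; `subset` and `keep` are standard parameters of `drop_duplicates`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann", "Ann"],
    "Fare": [7.25, 8.05, 7.25, 9.00],
})

# True only for a row whose values all repeat an earlier row
dup_mask = toy.duplicated()

# Default: compare all columns, keep the first occurrence
deduped = toy.drop_duplicates()

# Compare only the Name column and keep the last occurrence instead
by_name = toy.drop_duplicates(subset=["Name"], keep="last")

print(dup_mask.tolist())
print(deduped)
print(by_name)
```

Row 2 is the only full duplicate here; row 3 shares the name but has a different fare, so it survives the default call.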

(5) Observing and processing features

Looking at the features, we can roughly divide them into two groups:

Numerical features: Survived, Pclass, Age, SibSp, Parch, Fare. Of these, Survived and Pclass are discrete numerical features, while Age, SibSp, Parch and Fare are continuous numerical features.

Text features: Name, Sex, Cabin, Embarked, Ticket. Of these, Sex, Cabin, Embarked and Ticket are categorical text features.

Numerical features can generally be fed to a model directly, but continuous variables are sometimes discretized to make the model more stable and robust. Text features usually have to be converted to numerical features before they can be used for modeling and analysis.
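
One way to get a first split into numerical and text features is to let pandas group the columns by dtype. This is a rough sketch on a made-up frame (`toy` is an assumption); the dtype only says how a column is stored, so deciding whether a numeric column is discrete or continuous still needs a manual look.

```python
import pandas as pd

toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# Columns stored as numbers versus everything else
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Few distinct values in a text column suggests a categorical feature
distinct_per_text_col = toy[text_cols].nunique()

print(numeric_cols)
print(text_cols)
print(distinct_per_text_col)
```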

(6) Binning (discretizing) Age

# Split the continuous variable Age into 5 equal-width age bands, labelled 1 to 5
df['AgeBand'] = pd.cut(df['Age'], 5, labels=[1, 2, 3, 4, 5])
df.head()

# Split Age into the five bands (0,5] (5,15] (15,30] (30,50] (50,80], labelled 1 to 5
df['AgeBand'] = pd.cut(df['Age'], [0, 5, 15, 30, 50, 80], labels=[1, 2, 3, 4, 5])
df.head(3)

# Split Age at the 10%, 30%, 50%, 70% and 90% quantiles into five bands, labelled 1 to 5
# (ages above the 90% quantile fall outside the last band and become NaN)
df['AgeBand'] = pd.qcut(df['Age'], [0, 0.1, 0.3, 0.5, 0.7, 0.9], labels=[1, 2, 3, 4, 5])
df.head()
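
To see how the three binning calls differ, it helps to run them on a handful of ages. The sketch below uses a made-up `ages` series (an assumption, not Titanic data) with the same arguments as above. Note how the quantile version leaves the oldest value unlabelled because the quantile list stops at 0.9.

```python
import pandas as pd

ages = pd.Series([2, 10, 20, 25, 35, 40, 55, 60, 70, 80])

# Five bands of equal width between the minimum and maximum age
equal_width = pd.cut(ages, 5, labels=[1, 2, 3, 4, 5])

# Five bands with hand-picked edges: (0,5] (5,15] (15,30] (30,50] (50,80]
fixed_edges = pd.cut(ages, [0, 5, 15, 30, 50, 80], labels=[1, 2, 3, 4, 5])

# Five bands with edges at the 0%, 10%, 30%, 50%, 70% and 90% quantiles;
# anything above the 90% quantile is outside every band and becomes NaN
quantiles = pd.qcut(ages, [0, 0.1, 0.3, 0.5, 0.7, 0.9], labels=[1, 2, 3, 4, 5])

print(pd.DataFrame({
    "age": ages,
    "equal_width": equal_width,
    "fixed_edges": fixed_edges,
    "quantiles": quantiles,
}))
```

Extending the quantile list to end at 1 would assign the top 10% to a band as well, at the cost of needing a sixth label.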

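(7) Converting text variables

# Inspect the categories of a text variable and how often each occurs

# Method 1: value_counts
df['Sex'].value_counts()
-->output:
male      453
female    261
0           1
Name: Sex, dtype: int64
# Note: the 0 category most likely comes from step (3), where whole rows with a
# missing Age were set to 0, and step (4), where those identical rows were de-duplicated.

# Method 2: unique
df['Sex'].unique()
-->output:
array(['male', 'female', 0], dtype=object)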

# Convert categorical text to numeric codes (1, 2, ...)

# Method 1: replace
df['Sex_num'] = df['Sex'].replace(['male', 'female'], [1, 2])  # without inplace=True, replace returns a copy
df.head()

# Method 2: map (values not listed in the dict become NaN)
df['Sex_num'] = df['Sex'].map({'male': 1, 'female': 2})
df.head()
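
The two methods behave differently for values that are not covered by the mapping, which matters when a column holds unexpected entries. A minimal sketch, assuming a made-up `sex` series with one unlisted value:

```python
import pandas as pd

# dtype=object keeps the series able to hold both strings and integers
sex = pd.Series(["male", "female", "unknown"], dtype=object)

mapping = {"male": 1, "female": 2}

mapped = sex.map(mapping)         # values missing from the dict become NaN
replaced = sex.replace(mapping)   # values missing from the dict are left unchanged

print(mapped.tolist())
print(replaced.tolist())
```

With `map`, unexpected categories show up as missing values that can be counted with `.isnull()`; with `replace`, they stay in the column as text.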

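# Method 3: LabelEncoder from sklearn.preprocessing
from sklearn.preprocessing import LabelEncoder
for feat in ['Cabin', 'Ticket']:
    # Option A: build the code table by hand, in order of first appearance.
    # unique() includes NaN but nunique() does not, so when the column has
    # missing values the last distinct value gets no code.
    label_dict = dict(zip(df[feat].unique(), range(df[feat].nunique())))
    df[feat + "_labelEncode"] = df[feat].map(label_dict)
    # Option B: let LabelEncoder assign the codes; this overwrites the column from option A
    lbl = LabelEncoder()
    df[feat + "_labelEncode"] = lbl.fit_transform(df[feat].astype(str))
df.head()

# Encode a single column in place; astype(str) avoids mixing NaN with strings
df['Cabin'] = LabelEncoder().fit_transform(df['Cabin'].astype(str))
df.head()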
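
The hand-built dictionary and an encoder that sorts its categories do not produce the same codes. The sketch below shows this with pandas only: it swaps in a pandas categorical as a stand-in for a sorting encoder (LabelEncoder, as far as I know, also orders its classes by sorting them). The `cabin` series is made up, with the string "nan" mimicking what `astype(str)` produces for a missing value.

```python
import pandas as pd

cabin = pd.Series(["C85", "nan", "E46", "C85"])

# Hand-built dictionary: codes follow the order of first appearance
label_dict = dict(zip(cabin.unique(), range(cabin.nunique())))
codes_by_appearance = cabin.map(label_dict)

# Pandas categorical: categories are sorted first, then numbered
codes_sorted = cabin.astype("category").cat.codes

print(label_dict)
print(codes_by_appearance.tolist())
print(codes_sorted.tolist())
```

Either numbering is fine for a model that treats the codes as arbitrary labels; the point is only that the two columns should not be compared or mixed.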


# Convert categorical text to one-hot encoding

# Method 1: pd.get_dummies
for feat in ["Age", "Embarked"]:
    # x = pd.get_dummies(df["Age"] // 6)
    # x = pd.get_dummies(pd.cut(df['Age'], 5))
    x = pd.get_dummies(df[feat], prefix=feat)  # first argument: the data to encode; prefix: prefix of the new column names
    df = pd.concat([df, x], axis=1)  # append the new columns to the original data
    # df[feat] = pd.get_dummies(df[feat], prefix=feat)

df.head()
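
What `get_dummies` produces is easiest to see on a few rows. A minimal sketch, assuming a made-up `toy` frame with only an Embarked column:

```python
import pandas as pd

toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per distinct value, named <prefix>_<value>
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked")

# Append the indicator columns next to the original column
toy = pd.concat([toy, dummies], axis=1)

print(toy)
```

Each row has exactly one indicator switched on. Applying the same call to a raw continuous column such as Age creates one column per distinct age, which is why binning first (as in the commented-out lines above) is usually the more practical choice.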

(8) Extracting the Titles feature from the plain-text Name (Titles are Mr, Miss, Mrs, etc.)

# Capture the run of letters that comes right before a period, e.g. "Mr" in "Braund, Mr. Owen Harris"
df['Title'] = df.Name.str.extract(r'([A-Za-z]+)\.', expand=False)
df.head()
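
The regular expression can be checked on a few names before running it on the whole column. In this sketch the first and third names are taken from the preview table earlier in the article, while "Doe, Mrs. Jane" is a made-up name used only for illustration.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Doe, Mrs. Jane",            # hypothetical name, not from the dataset
    "Heikkinen, Miss. Laina",
])

# The first run of letters followed by a period is taken as the title
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)

# Frequency of each title, useful for spotting rare ones
title_counts = titles.value_counts()

print(titles.tolist())
print(title_counts)
```

With `expand=False` and a single capture group the result is a Series, so it can be assigned straight to a new column.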

Previous: Introduction to data analysis | Kaggle Titanic task (III): Exploratory data analysis
Copyright notice
This article was written by [Ape knowledge]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/04/202204230619310753.html