Introduction to data analysis | Kaggle Titanic mission (IV) -> Data cleaning and feature processing
2022-04-23 10:33:00 【Ape knowledge】
Series index : Introduction to data analysis | kaggle Titanic mission
Contents
- I. Data cleaning and feature processing
- (1) Brief description of data cleaning
- (2) Observe missing values
- (3) Missing value processing
- (4) Processing of duplicate values
- (5) Feature observation and processing
- (6) Binning age (discretization)
- (7) Convert text variables
- (8) Extracting a Title feature from the plain-text Name feature (titles such as Mr, Miss, Mrs)
I. Data cleaning and feature processing
(1) Brief description of data cleaning
The data we receive is rarely clean: it may contain missing values, outliers and other problems, and it needs some processing before analysis or modeling can continue. Cleaning is therefore the first step after obtaining the data. This chapter covers missing values, duplicate values, and the conversion of strings to numeric data, so that the data ends up in a form that can be analyzed or modeled.
(2) Observe missing values
① Method 1: df.info() lists the non-null count and dtype of each column
df.head(3)  # preview the first three rows
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
② Method 2: df.isnull().sum() counts the missing values in each column
df.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
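The two methods above can be combined into a single summary of counts and shares. The following is a minimal sketch on a made-up miniature table; the name `toy` and its values are illustrative assumptions, not the real Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the Titanic table (not the real data).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", "Q"],
})

missing_count = toy.isnull().sum()    # number of missing values per column
missing_share = toy.isnull().mean()   # fraction of rows that are missing per column

# Put both views side by side, worst column first.
summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary.sort_values("missing", ascending=False))
```

Looking at the share as well as the count helps when deciding whether a column such as Cabin is worth filling at all.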
(3) Missing value processing
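import numpy as np
# Pitfall: == None selects no rows, because missing values in a numeric column are stored as NaN
df[df['Age'] == None] = 0
df.head(3)
# Pitfall: isnull() finds the rows, but assigning to the whole selection overwrites every column of those rows, not only Age
# df[df['Age'].isnull()] = 0
# Pitfall: NaN is not equal to itself, so == np.nan selects no rows either
df[df['Age'] == np.nan] = 0
df.head()
# Drop every row that has at least one missing value (returns a copy)
df.dropna().head(3)
# Fill only the Age column with 0 (returns a copy)
df.fillna({'Age': 0}).head(3)
# Fill every missing value in the table with 0 (returns a copy)
df.fillna(0)
# Fill the missing Age values with 0 in place, leaving the other columns untouched
df.loc[df['Age'].isnull(), 'Age'] = 0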
【Reflection】 For detecting missing values, which of np.nan, None and .isnull() works best, and why? If one of these approaches fails to find the missing values, what is the reason?
【Answer】 When a numeric column is read in, its missing values are stored as the float NaN (which is why the column dtype is float64), so a comparison with None matches nothing. A comparison with np.nan using == fails as well, because NaN is not equal to itself. Use .isnull() (or its alias .isna()), which detects both None and NaN.
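The behaviour of the three tests can be checked on a tiny series. This is a minimal sketch with made-up values; `ages` and `toy` are illustrative names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# None is converted to NaN when the column is numeric (dtype float64).
ages = pd.Series([22.0, None, 26.0], name="Age")

by_none = int((ages == None).sum())    # noqa: E711, deliberate comparison with None
by_nan = int((ages == np.nan).sum())   # NaN is never equal to anything, itself included
by_isnull = int(ages.isnull().sum())   # the dedicated missing-value test

print(by_none, by_nan, by_isnull)

# Filling only the Age column leaves the other columns of the affected rows intact.
toy = pd.DataFrame({"Name": ["a", "b"], "Age": [22.0, np.nan]})
toy.loc[toy["Age"].isnull(), "Age"] = 0
print(toy)
```

Only the `.isnull()` test finds the missing entry, and restricting the assignment to the Age column with `.loc` keeps Name unchanged.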
(4) Processing of duplicate values
df[df.duplicated()]  # show the rows that repeat an earlier row
df = df.drop_duplicates()  # drop the duplicate rows, keeping the first occurrence
df.head()
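How `duplicated` and `drop_duplicates` treat repeated rows can be seen on a small made-up table. This is a sketch under the assumption that whole-row duplicates are what should be removed; `toy` and its values are illustrative.

```python
import pandas as pd

# Hypothetical table in which the third row repeats the first.
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann", "Cid"],
    "Ticket": ["A1", "B2", "A1", "C3"],
})

dup_mask = toy.duplicated()                    # True only for the second and later copies
all_copies = toy[toy.duplicated(keep=False)]   # every row that has a twin, first copy included
deduped = toy.drop_duplicates().reset_index(drop=True)  # renumber the index after dropping

print(dup_mask.tolist())
print(deduped)
```

Passing `keep=False` is useful for inspecting all copies before deciding which one to keep, and `reset_index(drop=True)` avoids gaps in the index afterwards.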
(5) Feature observation and processing
Let's look at the features. They fall roughly into two groups:
Numerical features: Survived, Pclass, Age, SibSp, Parch, Fare. Of these, Survived and Pclass are discrete numerical features, while Age, SibSp, Parch and Fare are treated as continuous numerical features (SibSp and Parch are integer counts, so strictly speaking they are discrete as well).
Text features: Name, Sex, Cabin, Embarked, Ticket. Of these, Sex, Cabin, Embarked and Ticket are categorical text features.
Numerical features can generally be used for model training directly, but continuous variables are sometimes discretized to make the model more stable and robust. Text features usually have to be converted into numerical features before they can be used for modeling and analysis.
(6) Binning age (discretization)
# Split the continuous variable Age into 5 equal-width bins, labelled 1 to 5
df['AgeBand'] = pd.cut(df['Age'], 5, labels=[1, 2, 3, 4, 5])
df.head()
# Split Age into the five groups (0,5] (5,15] (15,30] (30,50] (50,80], labelled 1 to 5
# The first interval is open on the left, so an Age of exactly 0 (for example a filled-in missing value) gets NaN
df['AgeBand'] = pd.cut(df['Age'], [0, 5, 15, 30, 50, 80], labels=[1, 2, 3, 4, 5])
df.head(3)
# Split Age at the 10%, 30%, 50%, 70% and 90% quantiles into five groups, labelled 1 to 5
# Ages above the 90% quantile lie beyond the last edge and get NaN
df['AgeBand'] = pd.qcut(df['Age'], [0, 0.1, 0.3, 0.5, 0.7, 0.9], labels=[1, 2, 3, 4, 5])
df.head()
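The edge cases of `pd.cut` and `pd.qcut` are easiest to see on a handful of values. The sketch below uses made-up ages; the extra quantile edge at 1.0 and the sixth label are assumptions added for illustration, not part of the original exercise.

```python
import pandas as pd

ages = pd.Series([0, 4, 10, 20, 35, 60, 80], name="Age")
edges = [0, 5, 15, 30, 50, 80]
labels = [1, 2, 3, 4, 5]

# Default intervals are (0, 5], (5, 15], ...: the value 0 is not covered.
fixed = pd.cut(ages, edges, labels=labels)

# include_lowest=True closes the first interval on the left, so 0 lands in bin 1.
fixed_closed = pd.cut(ages, edges, labels=labels, include_lowest=True)

# Quantile edges that stop at 0.9 leave the largest values without a bin.
capped = pd.qcut(ages, [0, 0.1, 0.3, 0.5, 0.7, 0.9], labels=labels)

# Adding 1.0 as the final edge (and a sixth label) covers the full range.
full = pd.qcut(ages, [0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0], labels=[1, 2, 3, 4, 5, 6])

print(pd.DataFrame({"Age": ages, "fixed": fixed, "fixed_closed": fixed_closed,
                    "capped": capped, "full": full}))
```

Checking `isna().sum()` on the binned column right after binning is a quick way to notice values that fell outside every interval.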
(7) Convert text variables
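# Inspect the categories of a text variable and how often each occurs
# Method 1: value_counts
df['Sex'].value_counts()
-->output:
male 453
female 261
0 1
Name: Sex, dtype: int64
# The category 0 is not a real value of Sex. It appears only if the row-wide assignment df[df['Age'].isnull()] = 0 from section (3) was run, which overwrites every column of those rows with 0.
# Method 2: unique
df['Sex'].unique()
-->output:
array(['male', 'female', 0], dtype=object)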
# Convert categorical text to integer codes
# Method 1: replace
df['Sex_num'] = df['Sex'].replace(['male', 'female'], [1, 2])  # without inplace=True, replace returns a copy
df.head()
# Method 2: map
df['Sex_num'] = df['Sex'].map({'male': 1, 'female': 2})
df.head()
# Method 3: LabelEncoder from sklearn.preprocessing
from sklearn.preprocessing import LabelEncoder
for feat in ['Cabin', 'Ticket']:
    lbl = LabelEncoder()
    # Variant a: build the label dictionary by hand and map it
    label_dict = dict(zip(df[feat].unique(), range(df[feat].nunique())))
    df[feat + "_labelEncode"] = df[feat].map(label_dict)
    # Variant b: let LabelEncoder assign the codes; this overwrites the column from variant a
    df[feat + "_labelEncode"] = lbl.fit_transform(df[feat].astype(str))
df.head()
# Encode a single column; astype(str) turns NaN into the string 'nan' so that all values share one type
df['Cabin'] = LabelEncoder().fit_transform(df['Cabin'].astype(str))
df.head()
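One practical difference between the mapping methods is what happens to values that are missing from the mapping. The following minimal sketch uses a made-up series in which a stray 0 stands in for a non-text value; `sex`, `mapping` and `unmapped` are illustrative names.

```python
import pandas as pd

# Hypothetical column with one value that the mapping does not cover.
sex = pd.Series(["male", "female", 0, "male"], name="Sex")
mapping = {"male": 1, "female": 2}

mapped = sex.map(mapping)               # values without a dictionary entry become NaN
unmapped = sex[mapped.isna()].unique()  # which original values were not covered

print(mapped.tolist())
print(unmapped)
```

Because `map` turns uncovered values into NaN, checking `mapped.isna()` right after the conversion is a cheap way to catch unexpected categories.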
# Convert categorical text to one-hot encoding
# Method 1: pd.get_dummies
for feat in ["Age", "Embarked"]:
    # x = pd.get_dummies(df["Age"] // 6)
    # x = pd.get_dummies(pd.cut(df['Age'], 5))
    x = pd.get_dummies(df[feat], prefix=feat)  # first argument: the data to encode; prefix: prefix of the new column names
    df = pd.concat([df, x], axis=1)  # append the new columns to the original data
    # df[feat] = pd.get_dummies(df[feat], prefix=feat)
df.head()
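The shape of the result is easier to see on a few rows. This is a minimal sketch with made-up values; `toy` is an illustrative name, and `dtype=int` is an optional choice made here so that the new columns hold 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical miniature table with one categorical column.
toy = pd.DataFrame({
    "Embarked": ["S", "C", "S", "Q"],
    "Fare": [7.25, 71.28, 7.92, 8.46],
})

# One new column per category, named <prefix>_<category>.
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=int)

# Append the indicator columns next to the original ones.
encoded = pd.concat([toy, dummies], axis=1)
print(encoded)
```

Each row has exactly one indicator set to 1, which is what distinguishes one-hot encoding from the integer codes of the previous step.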
(8) Extracting a Title feature from the plain-text Name feature (titles such as Mr, Miss, Mrs)
df['Title'] = df.Name.str.extract(r'([A-Za-z]+)\.', expand=False)  # raw string; captures the first run of letters that is directly followed by a period
df.head()
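The pattern can be tried on the three names shown in the preview at the start of this article. This is a minimal sketch; `names` and `titles` are illustrative variable names.

```python
import pandas as pd

# The three names from the first rows of the dataset.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Th...",
    "Heikkinen, Miss. Laina",
], name="Name")

# Capture the first run of letters that is immediately followed by a period.
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)

print(titles.tolist())  # ['Mr', 'Mrs', 'Miss']
print(titles.value_counts())
```

With `expand=False` and a single capture group the result is a Series, so it can be assigned to a new column directly.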
Previous article: Introduction to data analysis | kaggle Titanic mission (III) -> Exploratory data analysis
Copyright notice
This article was written by [Ape knowledge]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/04/202204230619310753.html