Another powerful data analysis tool: Polars
2022-04-23 20:44:00 【Python data mining】
For many data analysis practitioners, Pandas and SQL are the two most-used tools. Pandas can clean and analyze a dataset and can also draw all kinds of attractive charts, but it starts to struggle once the dataset gets large.
Today I will introduce another data processing and analysis tool, called Polars, which processes data faster. It offers two APIs: the Eager API and the Lazy API. The Eager API is used much like Pandas, with fairly similar syntax, and each call executes immediately and returns a result.
The Lazy API is closer to Spark: the query is built first, so the query logic can be optimized and executed in parallel.
Module installation and import
First, install the module with the pip command:
pip install polars
Once the installation succeeds, we read the same data with Pandas and with Polars to compare their performance. First, import the modules we will use:
import pandas as pd
import polars as pl
import matplotlib.pyplot as plt
%matplotlib inline
Reading the file with Pandas
The dataset used here contains the user names of a website's registered users and is 360 MB in total. We first read the CSV file with Pandas:
%%time
df = pd.read_csv("users.csv")
df.head()
output
Reading the CSV file with Pandas took 12 seconds in total. The dataset has two columns: the user name, and "n", the number of times that user name is repeated. Next we sort the dataset by calling the sort_values() method:
%%time
df.sort_values("n", ascending=False).head()
output
Reading and manipulating the file with Polars
Now let's read and manipulate the same file with Polars and see how long it takes:
%%time
data = pl.read_csv("users.csv")
data.head()
output
Reading the data with Polars took only 730 milliseconds, far less than the 12 seconds Pandas needed. Next we sort the dataset by the "n" column:
%%time
# Older Polars releases named this keyword reverse instead of descending
data.sort("n", descending=True).head()
output
Sorting the dataset took 1.39 seconds.
Next, we use Polars for a preliminary exploratory analysis of a dataset: which columns it has and what they are called. We use the familiar Titanic dataset as the example:
# Read with Polars (pl), since the following steps use the Polars API
df_titanic = pl.read_csv("titanic.csv")
df_titanic.columns
output
['PassengerId',
'Survived',
'Pclass',
'Name',
'Sex',
'Age',
......]
As in Pandas, the column names come from the columns attribute. Next, let's look at how many rows and columns the dataset has:
df_titanic.shape
output
(891, 12)
Look at the data type of each column in the dataset
df_titanic.dtypes
output
[polars.datatypes.Int64,
polars.datatypes.Int64,
polars.datatypes.Int64,
polars.datatypes.Utf8,
polars.datatypes.Utf8,
polars.datatypes.Float64,
......]
Filling null values and statistical analysis
Let's look at the distribution of null values in the dataset by calling the null_count() method:
df_titanic.null_count()
output
We can see that the "Age" and "Cabin" columns contain null values. We can try filling the missing ages with the column mean:
# Missing CSV values are loaded as nulls, so fill_null (not fill_nan) is what fills them
df_titanic = df_titanic.with_columns(pl.col("Age").fill_null(pl.col("Age").mean()))
To calculate the mean of a column, just call the mean() method. The median, maximum, and minimum are calculated the same way:
print(f'Median Age: {df_titanic["Age"].median()}')
print(f'Average Age: {df_titanic["Age"].mean()}')
print(f'Maximum Age: {df_titanic["Age"].max()}')
print(f'Minimum Age: {df_titanic["Age"].min()}')
output
Median Age: 29.69911764705882
Average Age: 29.699117647058817
Maximum Age: 80.0
Minimum Age: 0.42
Data filtering and visualization
Let's select the passengers who are older than 40:
# filter with an expression; recent Polars releases do not accept a boolean mask in square brackets
df_titanic.filter(pl.col("Age") > 40)
output
Finally, let's draw a simple chart:
fig, ax = plt.subplots(figsize=(10, 5))
ax.boxplot(df_titanic["Age"])
plt.xticks(rotation=90)
plt.xlabel('Age Column')
plt.ylabel('Age')
plt.show()
output
Overall, Polars has a lot in common with Pandas for data analysis and processing, although some parts of the API differ. Interested readers can refer to the official website: https://www.pola.rs/
Copyright notice
This article was written by [Python data mining]. Please include the link to the original when reprinting. Thank you.
https://yzsam.com/2022/04/202204232041258613.html