03-DataFrame & Column
2022-04-22 04:09:00 【wangyanglongcc】
Construct columns
A column is a logical construct that is computed from the data in a DataFrame by an expression.
Construct a new column based on the input columns existing in a DataFrame
from pyspark.sql.functions import col
col("device")
df.device
df["device"]  # recommended: the most versatile and easiest to use
Use column objects to form complex expressions
col("ecommerce.purchase_revenue_in_usd") + col("ecommerce.total_item_quantity")
col("event_timestamp").desc()
(col("ecommerce.purchase_revenue_in_usd") * 100).cast("int")
Here’s an example of using these column expressions in the context of a DataFrame
# Assign to revdf so that the display() call below refers to a defined name
revdf = (df.filter(col("ecommerce.purchase_revenue_in_usd").isNotNull())
    .withColumn("purchase_revenue", (col("ecommerce.purchase_revenue_in_usd") * 100).cast("int"))
    .withColumn("avg_purchase_revenue", col("ecommerce.purchase_revenue_in_usd") / col("ecommerce.total_item_quantity"))
    .sort(col("avg_purchase_revenue").desc()))
display(revdf)
Subset columns
Use DataFrame transformations to subset columns
select
devicesDF = eventsDF.select("user_id", "device")
display(devicesDF)
from pyspark.sql.functions import col
locationsDF = eventsDF.select("user_id",
    col("geo.city").alias("city"),
    col("geo.state").alias("state"))
display(locationsDF)
selectExpr
appleDF = eventsDF.selectExpr("user_id", "device in ('macOS', 'iOS') as apple_user")
display(appleDF)
drop
Returns a new DataFrame after dropping the given column, specified as a string or column object
To drop several columns at once, pass their names as strings
anonymousDF = eventsDF.drop("user_id", "geo", "device")
noSalesDF = eventsDF.drop(col("ecommerce"))
Add or replace columns
Use DataFrame transformations to add or replace columns
withColumn (the most common method)
Returns a new DataFrame by adding a column or replacing the existing column that has the same name
mobileDF = eventsDF.withColumn("mobile", col("device").isin("iOS", "Android"))
display(mobileDF)
purchaseQuantityDF = eventsDF.withColumn("purchase_quantity", col("ecommerce.total_item_quantity").cast("int"))
purchaseQuantityDF.printSchema()
withColumnRenamed (rename a column)
Returns a new DataFrame with a column renamed
locationDF = eventsDF.withColumnRenamed("geo", "location")
Subset Rows
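Use DataFrame transformations to subset rows
filter
Returns a new DataFrame with only the rows that satisfy the given SQL expression string or Column condition.
Alias: where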
purchasesDF = eventsDF.filter("ecommerce.total_item_quantity > 0")
revenueDF = eventsDF.filter(col("ecommerce.purchase_revenue_in_usd").isNotNull())
androidDF = eventsDF.filter((col("traffic_source") != "direct") & (col("device") == "Android"))
dropDuplicates
Returns a new DataFrame with duplicate rows removed, optionally considering only a subset of columns.
Alias: drop_duplicates. Calling it with no arguments is equivalent to distinct.
eventsDF.distinct()  # deduplicate on all columns
distinctUsersDF = eventsDF.dropDuplicates(["user_id"])  # deduplicate by user_id
limit
limitDF = eventsDF.limit(100)
Sort Rows
sort()
Returns a new DataFrame sorted by the given columns or expressions.
Alias: orderBy
increaseTimestampsDF = eventsDF.sort("event_timestamp")
display(increaseTimestampsDF)
decreaseTimestampsDF = eventsDF.sort(col("event_timestamp").desc())
display(decreaseTimestampsDF)
increaseSessionsDF = eventsDF.orderBy(["user_first_touch_timestamp", "event_timestamp"])
display(increaseSessionsDF)
decreaseSessionsDF = eventsDF.sort(col("user_first_touch_timestamp").desc(), col("event_timestamp"))
display(decreaseSessionsDF)
Copyright notice
This article was written by [wangyanglongcc]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/04/202204220408303735.html