[Spark] (Task 6) Spark RDD completes the statistical logic
2022-04-22 18:49:00 【Peak evening view】
1. Spark RDD
RDD: Resilient Distributed Dataset
Every Spark application has a driver program that runs the user's main function and executes various parallel operations on a cluster. An RDD can also be persisted in memory so that it can be reused efficiently across parallel operations. The RDD is Spark's most basic data abstraction: a read-only, partitioned collection of records that supports parallel operations and can be created from an external dataset or transformed from another RDD. It has the following characteristics:
- An RDD consists of one or more partitions (Partitions). Each partition is processed by one compute task. The user can specify the number of partitions when creating an RDD; if it is not specified, the default is the number of CPU cores allocated to the program;
- An RDD has a compute function for computing its partitions;
- An RDD keeps track of its dependencies on other RDDs. Each transformation creates a new dependency, so the dependencies between RDDs form a pipeline (lineage). When the data of some partitions is lost, the missing partitions can be recomputed from this lineage, without recomputing all partitions of the RDD;
- A key-value RDD can also have a Partitioner that determines which partition each record is stored in. Spark currently provides HashPartitioner (hash-based partitioning) and RangePartitioner (range-based partitioning);
- An optional list of preferred locations (preferred location) for each partition. For an HDFS file, this list holds the location of the block backing each partition. Following the idea that "moving computation is cheaper than moving data", Spark tries to schedule each compute task onto the node that stores the data block it will process. The short sketch below illustrates the first and third points.
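A minimal sketch of explicit partition counts and lineage, assuming a local SparkSession (the app name 'rdd-demo' is just an illustrative choice):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('rdd-demo').getOrCreate()
sc = spark.sparkContext
# Create an RDD with an explicit partition count (4 here); without the second
# argument, Spark falls back to the default parallelism (roughly the core count).
rdd = sc.parallelize(range(100), 4)
print(rdd.getNumPartitions())  # 4
# Each transformation creates a new RDD that depends on its parent;
# toDebugString() prints this lineage.
squares = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(squares.toDebugString().decode())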
2. Using RDD functions to complete the statistical logic of Task 2
from pyspark.sql import SparkSession
from pyspark import SparkFiles
import pandas as pd

spark = SparkSession.builder.appName('pyspark').getOrCreate()
# Distribute the CSV to the cluster, then read it with a header row and schema inference
spark.sparkContext.addFile('https://cdn.coggle.club/Pokemon.csv')
df = spark.read.csv("file://" + SparkFiles.get("Pokemon.csv"), header=True, inferSchema=True)
# Rename columns whose names contain spaces or dots so they are easier to reference
df = df.withColumnRenamed('Sp. Atk', 'SpAtk')
df = df.withColumnRenamed('Sp. Def', 'SpDef')
df = df.withColumnRenamed('Type 1', 'Type1')
df = df.withColumnRenamed('Type 2', 'Type2')
df.show()
+--------------------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+
| Name|Type1| Type2|Total| HP|Attack|Defense|SpAtk|SpDef|Speed|Generation|Legendary|
+--------------------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+
| Bulbasaur|Grass|Poison| 318| 45| 49| 49| 65| 65| 45| 1| false|
| Ivysaur|Grass|Poison| 405| 60| 62| 63| 80| 80| 60| 1| false|
| Venusaur|Grass|Poison| 525| 80| 82| 83| 100| 100| 80| 1| false|
|VenusaurMega Venu...|Grass|Poison| 625| 80| 100| 123| 122| 120| 80| 1| false|
| Charmander| Fire| null| 309| 39| 52| 43| 60| 50| 65| 1| false|
| Charmeleon| Fire| null| 405| 58| 64| 58| 80| 65| 80| 1| false|
| Charizard| Fire|Flying| 534| 78| 84| 78| 109| 85| 100| 1| false|
|CharizardMega Cha...| Fire|Dragon| 634| 78| 130| 111| 130| 85| 100| 1| false|
|CharizardMega Cha...| Fire|Flying| 634| 78| 104| 78| 159| 115| 100| 1| false|
| Squirtle|Water| null| 314| 44| 48| 65| 50| 64| 43| 1| false|
| Wartortle|Water| null| 405| 59| 63| 80| 65| 80| 58| 1| false|
| Blastoise|Water| null| 530| 79| 83| 100| 85| 105| 78| 1| false|
|BlastoiseMega Bla...|Water| null| 630| 79| 103| 120| 135| 115| 78| 1| false|
| Caterpie| Bug| null| 195| 45| 30| 35| 20| 20| 45| 1| false|
| Metapod| Bug| null| 205| 50| 20| 55| 25| 25| 30| 1| false|
| Butterfree| Bug|Flying| 395| 60| 45| 50| 90| 80| 70| 1| false|
| Weedle| Bug|Poison| 195| 40| 35| 30| 20| 20| 50| 1| false|
| Kakuna| Bug|Poison| 205| 45| 25| 50| 25| 25| 35| 1| false|
| Beedrill| Bug|Poison| 395| 65| 90| 40| 45| 80| 75| 1| false|
|BeedrillMega Beed...| Bug|Poison| 495| 65| 150| 40| 15| 80| 145| 1| false|
+--------------------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+
only showing top 20 rows
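Before switching to the RDD API, it can help to confirm the inferred schema (a quick check, not part of the original statistics):

df.printSchema()
print(df.dtypes)  # e.g. [('Name', 'string'), ('Type1', 'string'), ('Total', 'int'), ...]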
rdd = df.rdd
cols = df.columns
for i in range(len(cols)):
    print('-' * 10, cols[i], '-' * 10)
    # countByValue() returns a {value: count} map, so its length is the number of distinct values
    print('Number of distinct values:', len(rdd.map(lambda x, i=i: x[i]).countByValue()))
    # "is None" is the idiomatic null check; i=i binds the current column index at definition time
    print('Number of null values:', rdd.filter(lambda x, i=i: x[i] is None).count())
""" ---------- Name ---------- Number of different values : 799 Number of null values : 0 ---------- Type1 ---------- Number of different values : 18 Number of null values : 0 ---------- Type2 ---------- Number of different values : 19 Number of null values : 386 ---------- Total ---------- Number of different values : 200 Number of null values : 0 ---------- HP ---------- Number of different values : 94 Number of null values : 0 ---------- Attack ---------- Number of different values : 111 Number of null values : 0 ---------- Defense ---------- Number of different values : 103 Number of null values : 0 ---------- SpAtk ---------- Number of different values : 105 Number of null values : 0 ---------- SpDef ---------- Number of different values : 92 Number of null values : 0 ---------- Speed ---------- Number of different values : 108 Number of null values : 0 ---------- Generation ---------- Number of different values : 6 Number of null values : 0 ---------- Legendary ---------- Number of different values : 2 Number of null values : 0 """
Copyright notice
This article was written by [Peak evening view]; please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/04/202204221833220161.html