[Spark] (Task 6) Spark RDD completes the statistical logic
2022-04-22 18:49:00 【Peak evening view】
1. Spark RDD
RDD: Resilient Distributed Dataset
Every Spark application has a driver program that runs the user's main function and executes various parallel operations on a cluster. An RDD can also be persisted in memory so that it can be reused efficiently across parallel operations. The RDD is Spark's most basic data abstraction: a read-only, partitioned collection of records that supports parallel operations and can be created from an external dataset or transformed from another RDD. It has the following characteristics:
- An RDD consists of one or more partitions (Partitions). Each partition is processed by one compute task. The user can specify the number of partitions when creating an RDD; if it is not specified, the default is the number of CPU cores allocated to the program;
- An RDD has a compute function for computing its partitions;
- An RDD keeps track of its dependencies on other RDDs. Each transformation creates a new dependency, so the dependencies between RDDs form a pipeline (lineage). When the data of some partitions is lost, the missing partitions can be recomputed from this lineage, without recomputing all partitions of the RDD;
- A key-value RDD can also have a Partitioner that determines which partition each record is stored in. Spark currently provides HashPartitioner (hash-based partitioning) and RangePartitioner (range-based partitioning);
- An optional list of preferred locations (preferred location) for each partition. For an HDFS file, this list holds the location of the block backing each partition. Following the idea that "moving computation is cheaper than moving data", Spark tries to schedule each compute task onto the node that stores the data block it will process. The short sketch below illustrates the first and third points.
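A minimal sketch of explicit partition counts and lineage, assuming a local SparkSession (the app name 'rdd-demo' is just an illustrative choice):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('rdd-demo').getOrCreate()
sc = spark.sparkContext
# Create an RDD with an explicit partition count (4 here); without the second
# argument, Spark falls back to the default parallelism (roughly the core count).
rdd = sc.parallelize(range(100), 4)
print(rdd.getNumPartitions())  # 4
# Each transformation creates a new RDD that depends on its parent;
# toDebugString() prints this lineage.
squares = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(squares.toDebugString().decode())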
2. Using RDD functions to complete the statistical logic of Task 2
from pyspark.sql import SparkSession
from pyspark import SparkFiles
import pandas as pd

spark = SparkSession.builder.appName('pyspark').getOrCreate()
# Distribute the CSV to the cluster, then read it with a header row and schema inference
spark.sparkContext.addFile('https://cdn.coggle.club/Pokemon.csv')
df = spark.read.csv("file://" + SparkFiles.get("Pokemon.csv"), header=True, inferSchema=True)
# Rename columns whose names contain spaces or dots so they are easier to reference
df = df.withColumnRenamed('Sp. Atk', 'SpAtk')
df = df.withColumnRenamed('Sp. Def', 'SpDef')
df = df.withColumnRenamed('Type 1', 'Type1')
df = df.withColumnRenamed('Type 2', 'Type2')
df.show()
+--------------------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+
| Name|Type1| Type2|Total| HP|Attack|Defense|SpAtk|SpDef|Speed|Generation|Legendary|
+--------------------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+
| Bulbasaur|Grass|Poison| 318| 45| 49| 49| 65| 65| 45| 1| false|
| Ivysaur|Grass|Poison| 405| 60| 62| 63| 80| 80| 60| 1| false|
| Venusaur|Grass|Poison| 525| 80| 82| 83| 100| 100| 80| 1| false|
|VenusaurMega Venu...|Grass|Poison| 625| 80| 100| 123| 122| 120| 80| 1| false|
| Charmander| Fire| null| 309| 39| 52| 43| 60| 50| 65| 1| false|
| Charmeleon| Fire| null| 405| 58| 64| 58| 80| 65| 80| 1| false|
| Charizard| Fire|Flying| 534| 78| 84| 78| 109| 85| 100| 1| false|
|CharizardMega Cha...| Fire|Dragon| 634| 78| 130| 111| 130| 85| 100| 1| false|
|CharizardMega Cha...| Fire|Flying| 634| 78| 104| 78| 159| 115| 100| 1| false|
| Squirtle|Water| null| 314| 44| 48| 65| 50| 64| 43| 1| false|
| Wartortle|Water| null| 405| 59| 63| 80| 65| 80| 58| 1| false|
| Blastoise|Water| null| 530| 79| 83| 100| 85| 105| 78| 1| false|
|BlastoiseMega Bla...|Water| null| 630| 79| 103| 120| 135| 115| 78| 1| false|
| Caterpie| Bug| null| 195| 45| 30| 35| 20| 20| 45| 1| false|
| Metapod| Bug| null| 205| 50| 20| 55| 25| 25| 30| 1| false|
| Butterfree| Bug|Flying| 395| 60| 45| 50| 90| 80| 70| 1| false|
| Weedle| Bug|Poison| 195| 40| 35| 30| 20| 20| 50| 1| false|
| Kakuna| Bug|Poison| 205| 45| 25| 50| 25| 25| 35| 1| false|
| Beedrill| Bug|Poison| 395| 65| 90| 40| 45| 80| 75| 1| false|
|BeedrillMega Beed...| Bug|Poison| 495| 65| 150| 40| 15| 80| 145| 1| false|
+--------------------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+
only showing top 20 rows
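Before switching to the RDD API, it can help to confirm the inferred schema (a quick check, not part of the original statistics):

df.printSchema()
print(df.dtypes)  # e.g. [('Name', 'string'), ('Type1', 'string'), ('Total', 'int'), ...]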
rdd = df.rdd
cols = df.columns
for i in range(len(cols)):
    print('-' * 10, cols[i], '-' * 10)
    # countByValue() returns a {value: count} map, so its length is the number of distinct values
    print('Number of distinct values:', len(rdd.map(lambda x, i=i: x[i]).countByValue()))
    # "is None" is the idiomatic null check; i=i binds the current column index at definition time
    print('Number of null values:', rdd.filter(lambda x, i=i: x[i] is None).count())
""" ---------- Name ---------- Number of different values : 799 Number of null values : 0 ---------- Type1 ---------- Number of different values : 18 Number of null values : 0 ---------- Type2 ---------- Number of different values : 19 Number of null values : 386 ---------- Total ---------- Number of different values : 200 Number of null values : 0 ---------- HP ---------- Number of different values : 94 Number of null values : 0 ---------- Attack ---------- Number of different values : 111 Number of null values : 0 ---------- Defense ---------- Number of different values : 103 Number of null values : 0 ---------- SpAtk ---------- Number of different values : 105 Number of null values : 0 ---------- SpDef ---------- Number of different values : 92 Number of null values : 0 ---------- Speed ---------- Number of different values : 108 Number of null values : 0 ---------- Generation ---------- Number of different values : 6 Number of null values : 0 ---------- Legendary ---------- Number of different values : 2 Number of null values : 0 """
Copyright notice
This article was written by [Peak evening view]; please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/04/202204221833220161.html