The partitionBy Operator in Spark

2022-04-23 15:45:00 【Uncle flying against the wind】

Preface

In earlier posts we used groupBy to group data by key according to a specified grouping rule. Now imagine a different scenario: the data consists of tuples, that is, key/value pairs, and you want to decide which partition each record goes to. For this case Spark provides the partitionBy operator.

partitionBy

Function signature

def partitionBy(partitioner: Partitioner): RDD[(K, V)]

Function description

Repartitions the data according to the specified Partitioner. It is only available on RDDs of key/value pairs. Spark's default partitioner is HashPartitioner.
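To make the hash rule concrete, here is a minimal sketch in plain Scala, with no Spark dependency, of how a hash partitioner can map a key to a partition: take the key's hashCode modulo the number of partitions, shift negative results into range, and send null keys to partition 0. The names HashPartitionSketch, nonNegativeMod, and partitionFor are illustrative only and are not part of the Spark API. The sketch mirrors the documented behavior of HashPartitioner rather than reproducing Spark's source.

```scala
object HashPartitionSketch {

  // Modulo that never returns a negative number, even for negative hash codes.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Illustrative stand-in for a hash partitioner's key-to-partition rule.
  def partitionFor(key: Any, numPartitions: Int): Int = key match {
    case null => 0
    case _    => nonNegativeMod(key.hashCode, numPartitions)
  }

  def main(args: Array[String]): Unit = {
    // Same shape of data as the example in this post: keys 1 to 4, each with value 1.
    val pairs = List(1, 2, 3, 4).map((_, 1))

    // Group the records by the partition index the rule assigns to their key.
    val byPartition = pairs.groupBy { case (key, _) => partitionFor(key, 2) }

    byPartition.toSeq.sortBy(_._1).foreach { case (index, records) =>
      println(s"partition $index: ${records.mkString(", ")}")
    }
  }
}
```

With two partitions, an Int key's hashCode is the integer itself, so even keys map to partition 0 and odd keys to partition 1.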

Example

Repartition a small data set with partitionBy, then save the result as multiple partition files

import org.apache.spark.rdd.RDD
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionBy_Test {

  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)

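    // Key-value operator: start from a plain RDD with 2 partitions
    val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)

    // Turn every element into a (key, value) pair so that partitionBy is available
    val mapRDD: RDD[(Int, Int)] = rdd.map((_, 1))

    // partitionBy repartitions the data according to the specified partitioner
    val newRDD: RDD[(Int, Int)] = mapRDD.partitionBy(new HashPartitioner(2))

    // Write one part file per partition; the output directory must not already exist
    newRDD.saveAsTextFile("E:\\output")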

    sc.stop()

  }

}

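Run the code above. When it finishes, look at the local output directory: the 4 records have been split across different partition files. With HashPartitioner(2), the even keys (2 and 4) end up in part-00000 and the odd keys (1 and 3) in part-00001.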
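If writing files is inconvenient, the placement can also be checked in memory. The sketch below assumes the same local Spark setup as the example. The object name PartitionByInspect and the helper partitionedRecords are made up for illustration. It uses mapPartitionsWithIndex to tag each record with the index of the partition that holds it, then collects the result to the driver.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionByInspect {

  // Returns (partitionIndex, record) pairs after repartitioning with HashPartitioner(2).
  def partitionedRecords(sc: SparkContext): Array[(Int, (Int, Int))] = {
    val pairs = sc.makeRDD(List(1, 2, 3, 4), 2).map((_, 1))

    pairs
      .partitionBy(new HashPartitioner(2))
      // Tag every record with the index of the partition it now lives in.
      .mapPartitionsWithIndex((index, records) => records.map(record => (index, record)))
      .collect()
  }

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("PartitionByInspect")
    val sc = new SparkContext(sparkConf)

    try {
      partitionedRecords(sc).foreach { case (index, record) =>
        println(s"partition $index: $record")
      }
    } finally {
      sc.stop()
    }
  }
}
```

For this input, keys 2 and 4 should be reported in partition 0 and keys 1 and 3 in partition 1. The order of records inside a partition is not guaranteed after the shuffle.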