Use of the Spark operator groupBy
2022-04-23 15:48:00 【Uncle flying against the wind】
Preface
As its name suggests, groupBy means grouping. GROUP BY is used constantly in MySQL, so most readers will already be familiar with the idea. Since groupBy is also one of the more commonly used operators in Spark, it is worth understanding it in depth.
Function signature
def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])]
Function description
Groups the data according to the specified rule. By default the number of partitions stays the same, but the data is broken up and regrouped across partitions; this operation is called a shuffle. In the extreme case, all the data may end up in a single partition.
Additional explanation:
- All the data of one group lands in a single partition, but a partition is not limited to holding only one group.
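To see this concretely, here is a minimal sketch (assuming a SparkContext named sc, set up as in the cases below) that groups integers by parity and then prints the contents of each partition:
// Group integers by parity: key 0 for even, key 1 for odd
val nums = sc.makeRDD(List(1, 2, 3, 4, 5), numSlices = 3)
val byParity = nums.groupBy(_ % 2)
// glom() gathers each partition into an Array: a group is never split across
// partitions, but one partition may end up holding several groups (or none)
byParity.glom().map(_.toList).collect().foreach(println)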
Case I
Build a collection that contains several strings, then group the elements by their first letter.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object Group_Test {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)

    // Build an RDD of strings and group the elements by their first character
    val rdd: RDD[String] = sc.makeRDD(List("Hello", "spark", "scala", "Hadoop"))
    val result = rdd.groupBy(_.charAt(0))

    result.collect().foreach(println)

    sc.stop()
  }
}
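Run locally, this should print something like the following (CompactBuffer is the Iterable implementation Spark uses here; ordering may vary):
(H,CompactBuffer(Hello, Hadoop))
(s,CompactBuffer(spark, scala))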

Case II
Given an Apache access log file, group the records by time and count how many fall into each time period.
import java.text.SimpleDateFormat

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object Group_Test {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)

    val rdd: RDD[String] = sc.textFile("E:\\code-self\\spi\\datas\\apache.log")

    val result = rdd.map(
      line => {
        // The 4th space-separated field holds the timestamp (dd/MM/yyyy:HH:mm)
        val datas = line.split(" ")
        val time = datas(3)
        val sdf = new SimpleDateFormat("dd/MM/yyyy:HH:mm")
        val date = sdf.parse(time)
        // Re-format down to the period we want to count by (year:hour here)
        val sdf1 = new SimpleDateFormat("yyyy:HH")
        val hour = sdf1.format(date)
        (hour, 1)
      }
    ).groupBy(_._1)

    // Each group's size is the number of log lines in that time period
    result.map {
      case (hour, iter) => (hour, iter.size)
    }.collect().foreach(println)

    sc.stop()
  }
}
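A side note on efficiency: groupBy ships every (hour, 1) pair across the shuffle before the sizes are taken. When only a count per key is needed, reduceByKey is the usual alternative, since it pre-aggregates within each partition first. A minimal sketch of the same count under that approach (same field layout assumed):
rdd.map(line => {
    // Same parsing as above; emit (period, 1) and sum the ones per key
    val date = new SimpleDateFormat("dd/MM/yyyy:HH:mm").parse(line.split(" ")(3))
    (new SimpleDateFormat("yyyy:HH").format(date), 1)
  })
  .reduceByKey(_ + _)
  .collect()
  .foreach(println)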

Copyright notice
This article was written by [Uncle flying against the wind]. Please include the original link when reposting. Thanks.
https://yzsam.com/2022/04/202204231544587523.html