Use of the Spark operator groupBy
2022-04-23 15:48:00 【Uncle flying against the wind】
Preface
As its name suggests, groupBy means grouping. GROUP BY is used constantly in MySQL, so most readers will already be familiar with the idea. Since groupBy is also one of the more commonly used operators in Spark, it is worth understanding it in depth.
Function signature
def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])]
Function description
Groups the data according to the specified rule. By default the number of partitions stays the same, but the data is broken up and regrouped across partitions; this operation is called a shuffle. In the extreme case, all the data may end up in a single partition.
Additional explanation:
- All the data of one group lands in a single partition, but a partition is not limited to holding only one group.
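To see this concretely, here is a minimal sketch (assuming a SparkContext named sc, set up as in the cases below) that groups integers by parity and then prints the contents of each partition:
// Group integers by parity: key 0 for even, key 1 for odd
val nums = sc.makeRDD(List(1, 2, 3, 4, 5), numSlices = 3)
val byParity = nums.groupBy(_ % 2)
// glom() gathers each partition into an Array: a group is never split across
// partitions, but one partition may end up holding several groups (or none)
byParity.glom().map(_.toList).collect().foreach(println)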
Case I
Build a collection that contains several strings, then group the elements by their first letter.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object Group_Test {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)

    // Build an RDD of strings and group the elements by their first character
    val rdd: RDD[String] = sc.makeRDD(List("Hello", "spark", "scala", "Hadoop"))
    val result = rdd.groupBy(_.charAt(0))

    result.collect().foreach(println)

    sc.stop()
  }
}
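Run locally, this should print something like the following (CompactBuffer is the Iterable implementation Spark uses here; ordering may vary):
(H,CompactBuffer(Hello, Hadoop))
(s,CompactBuffer(spark, scala))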

Case II
Given an Apache access log file, group the records by time and count how many fall into each time period.
import java.text.SimpleDateFormat

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object Group_Test {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)

    val rdd: RDD[String] = sc.textFile("E:\\code-self\\spi\\datas\\apache.log")

    val result = rdd.map(
      line => {
        // The 4th space-separated field holds the timestamp (dd/MM/yyyy:HH:mm)
        val datas = line.split(" ")
        val time = datas(3)
        val sdf = new SimpleDateFormat("dd/MM/yyyy:HH:mm")
        val date = sdf.parse(time)
        // Re-format down to the period we want to count by (year:hour here)
        val sdf1 = new SimpleDateFormat("yyyy:HH")
        val hour = sdf1.format(date)
        (hour, 1)
      }
    ).groupBy(_._1)

    // Each group's size is the number of log lines in that time period
    result.map {
      case (hour, iter) => (hour, iter.size)
    }.collect().foreach(println)

    sc.stop()
  }
}
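A side note on efficiency: groupBy ships every (hour, 1) pair across the shuffle before the sizes are taken. When only a count per key is needed, reduceByKey is the usual alternative, since it pre-aggregates within each partition first. A minimal sketch of the same count under that approach (same field layout assumed):
rdd.map(line => {
    // Same parsing as above; emit (period, 1) and sum the ones per key
    val date = new SimpleDateFormat("dd/MM/yyyy:HH:mm").parse(line.split(" ")(3))
    (new SimpleDateFormat("yyyy:HH").format(date), 1)
  })
  .reduceByKey(_ + _)
  .collect()
  .foreach(println)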

Copyright notice
This article was written by [Uncle flying against the wind]. Please include the original link when reposting. Thanks.
https://yzsam.com/2022/04/202204231544587523.html