Using the Spark filter operator
2022-04-23 15:48:00 【Uncle flying against the wind】
Preface
filter can be understood as filtering: intuitively, it selects elements from a set of data according to a specified rule. In Java and other languages, a filter operation likewise makes it easy to pick the desired elements out of a collection.
Function signature
def filter(f: T => Boolean): RDD[T]
Function description
filter keeps the elements that satisfy the specified rule and discards those that do not. Filtering does not change the partitioning, but the data within the partitions may become unevenly distributed, so in a production environment data skew can occur.
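The partition behaviour described above can be checked with a small experiment. The following is a minimal sketch, not part of the original article: the object name FilterPartitionDemo and the helper partitionSizes are made up for illustration, and the example assumes a local Spark installation. It uses glom() to count the elements in each partition before and after filtering.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object, not from the original article.
object FilterPartitionDemo {
  // Returns the per-partition element counts before and after filtering.
  def partitionSizes(sc: SparkContext): (Seq[Int], Seq[Int]) = {
    // Six elements explicitly spread over two partitions.
    val rdd = sc.makeRDD(List(1, 2, 3, 4, 5, 6), 2)
    // Keep only the small values; filter does not move data between partitions.
    val filtered = rdd.filter(_ <= 3)
    // glom() turns each partition into an array so its size can be inspected.
    val before = rdd.glom().map(_.length).collect().toSeq
    val after = filtered.glom().map(_.length).collect().toSeq
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("FilterPartitionDemo")
    val sc = new SparkContext(conf)
    val (before, after) = partitionSizes(sc)
    println(s"sizes before filter: ${before.mkString(", ")}")
    println(s"sizes after filter:  ${after.mkString(", ")}")
    sc.stop()
  }
}
```

The number of partitions is the same before and after the filter, but the surviving elements are no longer spread evenly across them, which is the uneven distribution the description warns about.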
Case 1: keep only the even numbers in a set of data
import org.apache.spark.{SparkConf, SparkContext}

object Filter_Test {
  def main(args: Array[String]): Unit = {
    // Run Spark locally, using all available cores
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)

    val rdd = sc.makeRDD(List(1, 2, 3, 4, 5, 6))
    // Keep the elements for which the predicate returns true (the even numbers)
    val result = rdd.filter(
      item => item % 2 == 0
    )
    result.collect().foreach(println)

    sc.stop()
  }
}
Run this code and observe the console output: 2, 4 and 6 are printed, one per line.
Case 2: extract the records dated 17 May 2015 from a log file
The contents of the log file are as follows:
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Filter_Test {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)

    // Each element of the RDD is one line of the log file
    val rdd: RDD[String] = sc.textFile("E:\\code-self\\spi\\datas\\apache.log")
    rdd.filter(
      line => {
        // Split the line on spaces; the fourth field holds the timestamp
        val datas = line.split(" ")
        val time = datas(3)
        // Keep only the lines whose timestamp contains the target date
        time.contains("17/05/2015")
      }
    ).collect().foreach(println)

    sc.stop()
  }
}
Run the above code and observe the console output: only the log lines whose timestamp contains 17/05/2015 are printed.
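The predicate in Case 2 reads the fourth field directly, so a line with fewer than four space-separated fields would raise an ArrayIndexOutOfBoundsException. A more defensive variant can be sketched as a named function. This is an illustration rather than the author's code: the object LogDateFilter, the function isFromDate and the sample lines are all made up, and plain Scala collections are used so that the sketch runs without Spark.

```scala
// Hypothetical helper object, not from the original article.
object LogDateFilter {
  // True when the fourth space-separated field of the line contains the date.
  def isFromDate(line: String, date: String): Boolean = {
    val fields = line.split(" ")
    // Guard against short or malformed lines instead of throwing on fields(3).
    fields.length > 3 && fields(3).contains(date)
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample lines, used only for illustration.
    val lines = Seq(
      "10.0.0.1 - - 17/05/2015:10:05:03 +0000 GET /index.html",
      "10.0.0.2 - - 18/05/2015:09:00:00 +0000 GET /about.html",
      "malformed line"
    )
    // Only the first sample line passes the predicate.
    lines.filter(isFromDate(_, "17/05/2015")).foreach(println)
  }
}
```

The same function could be passed to the RDD from Case 2, for example rdd.filter(line => LogDateFilter.isFromDate(line, "17/05/2015")), since filter accepts any function of type String => Boolean for an RDD[String].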
Copyright notice
This article was written by [Uncle flying against the wind]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/04/202204231544587482.html