Filter usage of spark operator
2022-04-23 15:48:00 【Uncle flying against the wind】
Preface
filter, as the name suggests, filters a group of data according to a specified rule. Like its counterparts in Java and other languages, the filter operator makes it easy to pick the desired records out of a data set.
Function signature
def filter(f: T => Boolean): RDD[T]
Function description
Filters the data according to the specified rule: records for which the predicate returns true are kept, and records that do not conform to the rule are discarded. Filtering does not change the number of partitions, but it can leave the data unevenly distributed across them, which in a production environment may cause data skew.
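A minimal sketch (my own illustration, not from the original article) that makes this partition behaviour visible: the RDD below is built with two partitions, and after a selective filter the partition count is unchanged while one partition ends up empty.

import org.apache.spark.{SparkConf, SparkContext}

object Filter_Partitions {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)

    // Two partitions: (1, 2, 3) and (4, 5, 6)
    val rdd = sc.makeRDD(List(1, 2, 3, 4, 5, 6), 2)
    val filtered = rdd.filter(item => item > 3)

    // The number of partitions is preserved by filter
    println(filtered.getNumPartitions)                        // 2

    // ... but the per-partition element counts are now uneven
    filtered.glom().map(_.length).collect().foreach(println)  // 0 and 3

    sc.stop()
  }
}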
Case 1: filter the even numbers out of a set of data
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Filter_Test {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)

    val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6))

    // Keep only the elements divisible by 2, i.e. the even numbers
    val result: RDD[Int] = rdd.filter(
      item => item % 2 == 0
    )

    result.collect().foreach(println)
    sc.stop()
  }
}
Run this code and observe the console output; it should print 2, 4 and 6.
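As a side note (not part of the original article), the predicate can also be written with Scala's placeholder syntax, the more idiomatic shorthand for simple conditions:

val result = rdd.filter(_ % 2 == 0)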
Case 2: filter the records of May 17, 2015 out of a log file
Each line of the log file is a space-separated Apache access-log record, and the fourth field (index 3) is a timestamp containing the date in dd/MM/yyyy form.
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Filter_Test {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)

    val rdd: RDD[String] = sc.textFile("E:\\code-self\\spi\\datas\\apache.log")

    rdd.filter(
      line => {
        // Split each record on spaces; the fourth field is the timestamp
        val datas = line.split(" ")
        val time = datas(3)
        time.contains("17/05/2015")
      }
    ).collect().foreach(println)

    sc.stop()
  }
}
Run the above code and observe the console output: only the log lines whose timestamp contains 17/05/2015 are printed.
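Since the function description above warns that a selective filter can leave the partitions unbalanced, a common follow-up (my suggestion, not part of the original article) is to shrink the partition count with coalesce, which merges partitions without triggering a shuffle:

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Filter_Coalesce {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)

    val rdd: RDD[String] = sc.textFile("E:\\code-self\\spi\\datas\\apache.log")
    val filtered = rdd.filter(line => line.split(" ")(3).contains("17/05/2015"))

    // coalesce merges the now sparsely-filled partitions without a shuffle;
    // use repartition(n) instead if the remaining data should be rebalanced evenly
    val compacted = filtered.coalesce(1)
    println(compacted.getNumPartitions)

    sc.stop()
  }
}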
Copyright notice
This article was created by [Uncle flying against the wind]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/04/202204231544587482.html