Spark Operators: Using filter
2022-04-23 15:45:00 【逆风飞翔的小叔】
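Preface
filter can be understood as filtering: it applies a specified rule to a data set and keeps only the elements that satisfy it. The same operator exists in Java and many other languages, and it makes it easy to extract the data we want from a collection.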
Function signature
def filter(f: T => Boolean): RDD[T]
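Function description
filter selects data according to a specified rule: elements that satisfy the rule are kept, and the rest are discarded. The number of partitions does not change after filtering, but the amount of data left in each partition may become uneven, which can lead to data skew in production.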
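The partition behavior described above can be checked with a small sketch. The object name FilterPartitionDemo, the helper sizesAfterFilter, the input range, and the predicate are all made up for illustration; only makeRDD, filter, glom, map, and collect are Spark calls. It assumes a local Spark setup like the one used in the examples of this article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FilterPartitionDemo {
  // Hypothetical helper: returns how many elements remain in each partition after filter.
  def sizesAfterFilter(): Array[Int] = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("FilterPartitionDemo")
    val sc = new SparkContext(conf)
    try {
      // 12 numbers split evenly across 3 partitions: 1-4, 5-8, 9-12
      val rdd = sc.makeRDD(1 to 12, 3)
      // Keep only values up to 5; the partition count is still 3 afterwards
      val filtered = rdd.filter(_ <= 5)
      // glom() gathers each partition into one array so it can be counted separately
      filtered.glom().map(_.length).collect()
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    // Prints: 4, 1, 0
    println(sizesAfterFilter().mkString(", "))
  }
}
```

The three partitions survive the filter, but one keeps four elements, one keeps a single element, and one is empty. When this kind of imbalance matters, a common follow-up is to call repartition or coalesce on the filtered RDD, at the cost of extra data movement.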
Example 1: filter the even numbers from a data set
import org.apache.spark.{SparkConf, SparkContext}

object Filter_Test {
  def main(args: Array[String]): Unit = {
    // Run locally, using all available cores
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)

    val rdd = sc.makeRDD(List(1, 2, 3, 4, 5, 6))
    // Keep only the elements for which the predicate returns true
    val result = rdd.filter(item => item % 2 == 0)

    result.collect().foreach(println)
    sc.stop()
  }
}
Run this code and check the console output: it prints the even numbers 2, 4, and 6, one per line.
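Example 2: filter the records for May 17, 2015 from a log file
The log file is apache.log; the code below treats each line as space-separated fields, with the timestamp in the fourth field (index 3).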
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Filter_Test {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)

    // Each element of the RDD is one line of the log file
    val rdd: RDD[String] = sc.textFile("E:\\code-self\\spi\\datas\\apache.log")
    rdd.filter(line => {
      val datas = line.split(" ")
      // The fourth field holds the timestamp; skip lines that are too short to have one
      datas.length > 3 && datas(3).contains("17/05/2015")
    }).collect().foreach(println)

    sc.stop()
  }
}
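Run the code above and check the console output: only the lines whose timestamp field contains 17/05/2015 are printed.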
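To see how the predicate treats individual lines without needing the log file, the rule can be pulled out into a plain function and tried on a few strings. This is only a sketch: the object name LogDateFilter, the function isOnDate, and the sample lines are hypothetical and are not taken from apache.log. It also skips lines with fewer than four fields instead of failing on them.

```scala
object LogDateFilter {
  // Hypothetical helper: split on spaces and check whether the
  // fourth field (index 3) contains the given date string.
  def isOnDate(line: String, date: String): Boolean = {
    val fields = line.split(" ")
    // Lines with fewer than four fields cannot match
    fields.length > 3 && fields(3).contains(date)
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample lines, for illustration only
    val lines = List(
      "192.0.2.1 - - 17/05/2015:10:05:03 GET /index.html",
      "192.0.2.2 - - 20/05/2015:21:05:11 GET /about.html",
      "malformed line"
    )
    // Only the first sample line passes the filter
    lines.filter(isOnDate(_, "17/05/2015")).foreach(println)
  }
}
```

The same function could be passed to an RDD as rdd.filter(line => LogDateFilter.isOnDate(line, "17/05/2015")), which keeps the rule easy to test apart from the Spark job.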
Copyright notice
This article was written by [逆风飞翔的小叔]. Please include the original link when reposting. Thank you.
https://blog.csdn.net/congge_study/article/details/124355911