Spark Operators: Using filter
2022-04-23 15:45:00 【逆风飞翔的小叔】
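Preface
filter can be understood as filtering: intuitively, it screens a set of data according to a specified rule. The operator exists in Java and many other languages, and it makes it easy to extract the data we want from a collection.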
Function signature
def filter(f: T => Boolean): RDD[T]
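The signature says that filter accepts any function from the element type T to Boolean and returns a collection of the same element type. As a minimal sketch of what that allows, the snippet below uses a plain Scala List instead of an RDD so that it runs without Spark. The names PredicateForms and isEven are made up for illustration and do not come from the original article.

```scala
object PredicateForms {
  // A named predicate of shape Int => Boolean, i.e. f: T => Boolean with T = Int.
  def isEven(n: Int): Boolean = n % 2 == 0

  val numbers: List[Int] = List(1, 2, 3, 4, 5, 6)

  // Three equivalent ways of passing the predicate to filter.
  val withLambda: List[Int] = numbers.filter(n => n % 2 == 0)
  val withPlaceholder: List[Int] = numbers.filter(_ % 2 == 0)
  val withNamedMethod: List[Int] = numbers.filter(isEven)

  def main(args: Array[String]): Unit = {
    println(withLambda)
    println(withPlaceholder)
    println(withNamedMethod)
  }
}
```

All three calls keep the same elements. The same three forms can be passed to RDD.filter, because it expects the same kind of function.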
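Function description
filter screens the data according to the specified rule: elements that satisfy the rule are kept, and the rest are discarded. After filtering, the number of partitions is unchanged, but the amount of data left in each partition may be uneven, so data skew can appear in production environments.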
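The point about partitions can be checked directly. The following is a rough sketch, assuming Spark is on the classpath as in the article's own examples. The object name FilterPartitionSizes, the helper sizes, and the choice of 100 numbers in 4 partitions are illustrative assumptions. It uses glom(), which turns each partition into an array, to count the elements per partition before and after filtering.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object FilterPartitionSizes {
  // Illustrative helper: number of elements held by each partition of the RDD.
  def sizes(rdd: RDD[Int]): Array[Int] = rdd.glom().map(_.length).collect()

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("FilterPartitionSizes")
    val sc = new SparkContext(sparkConf)

    // 100 numbers spread over 4 partitions.
    val rdd = sc.makeRDD(1 to 100, 4)

    // A predicate that only matches the small values.
    val filtered = rdd.filter(_ <= 25)

    println(s"before: ${rdd.getNumPartitions} partitions, sizes: ${sizes(rdd).mkString(", ")}")
    println(s"after:  ${filtered.getNumPartitions} partitions, sizes: ${sizes(filtered).mkString(", ")}")

    sc.stop()
  }
}
```

With this input the partition count stays at 4, while the sizes go from 25, 25, 25, 25 to 25, 0, 0, 0: all surviving elements sit in the first partition. If such an imbalance matters for later stages, calling repartition on the filtered RDD is one common way to spread the data out again.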
Example 1: filter the even numbers out of a set of data
import org.apache.spark.{SparkConf, SparkContext}

object Filter_Test {
  def main(args: Array[String]): Unit = {
    // Run locally, using all available cores
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)

    val rdd = sc.makeRDD(List(1, 2, 3, 4, 5, 6))
    // Keep only the elements for which the predicate returns true
    val result = rdd.filter(item => item % 2 == 0)

    // Bring the result back to the driver and print it
    result.collect().foreach(println)
    sc.stop()
  }
}
Run this code and check the console output: the even numbers 2, 4 and 6 are printed, one per line.
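Example 2: filter the records for May 17, 2015 out of a log file
The log file holds one record per line with space-separated fields. The code below reads the timestamp from the fourth field: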
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Filter_Test {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)

    // Each element of the RDD is one line of the log file
    val rdd: RDD[String] = sc.textFile("E:\\code-self\\spi\\datas\\apache.log")

    rdd.filter(line => {
      // Split on spaces; this assumes every line has at least four fields
      val datas = line.split(" ")
      val time = datas(3)
      // Keep the line only if its timestamp falls on the target date
      time.contains("17/05/2015")
    }).collect().foreach(println)

    sc.stop()
  }
}
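Run the code above and check the console output: only the lines whose timestamp field contains 17/05/2015 are printed.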
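The predicate in Example 2 indexes the fourth field directly, so a line with fewer than four fields would throw an ArrayIndexOutOfBoundsException. One possible variation, sketched below, moves the check into a small named function that treats such lines as non-matching and can be tried out without Spark. LogDateFilter, isOnDate, and the sample lines are hypothetical. The samples are invented to fit the assumed layout and are not taken from the real apache.log.

```scala
object LogDateFilter {
  // Hypothetical helper: true when the 4th space-separated field contains `date`.
  // Lines with fewer than 4 fields are treated as non-matching instead of throwing.
  def isOnDate(line: String, date: String): Boolean = {
    val fields = line.split(" ")
    fields.length > 3 && fields(3).contains(date)
  }

  // Made-up sample lines in the assumed layout (not real log data).
  val sample: List[String] = List(
    "192.0.2.1 - - 17/05/2015:10:05:03 +0000 GET /index.html",
    "192.0.2.2 - - 18/05/2015:11:00:00 +0000 GET /about.html",
    "malformed line"
  )

  def main(args: Array[String]): Unit = {
    // Only the first sample line passes the predicate.
    sample.filter(isOnDate(_, "17/05/2015")).foreach(println)
  }
}
```

In the Spark job the same function could be passed as rdd.filter(line => LogDateFilter.isOnDate(line, "17/05/2015")), which keeps the date rule in one place where it is easy to test.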
Copyright notice
This article was written by [逆风飞翔的小叔]. Please include a link to the original when reposting. Thank you.
https://blog.csdn.net/congge_study/article/details/124355911