Using the Spark filter operator
2022-04-23 15:45:00 【逆风飞翔的小叔】
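Preface
filter does what its name suggests: it applies a given rule to a set of data and keeps only the elements that satisfy it. The same operation exists in Java and many other languages, and it is a convenient way to pull the data you want out of a larger set.
Function signature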
def filter(f: T => Boolean): RDD[T]
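The signature says that filter takes a function from an element of type T to a Boolean and returns an RDD with the same element type. As a minimal sketch, the same predicate shape can be tried on a plain Scala List without starting Spark. The object name PredicateFormsSketch and the helper isLarge are made up for illustration.

```scala
object PredicateFormsSketch {
  // Hypothetical named predicate with the shape Int => Boolean
  def isLarge(n: Int): Boolean = n > 3

  def main(args: Array[String]): Unit = {
    val data = List(1, 2, 3, 4, 5, 6)

    val viaLambda      = data.filter(n => n > 3) // explicit lambda
    val viaPlaceholder = data.filter(_ > 3)      // placeholder syntax
    val viaMethod      = data.filter(isLarge)    // method used as a function

    println(viaLambda)      // List(4, 5, 6)
    println(viaPlaceholder) // List(4, 5, 6)
    println(viaMethod)      // List(4, 5, 6)
  }
}
```

An RDD's filter accepts the same three forms, since each of them is a T => Boolean.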
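Description
filter screens the data against the given rule: elements that satisfy the rule are kept and the rest are discarded. Filtering does not change the partitioning, so the number of partitions stays the same, but the amount of data left in each partition may become uneven. In a production environment this can lead to data skew.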
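To make the skew point concrete, here is a small simulation rather than real Spark code: plain Scala Vectors stand in for the partitions of an RDD, and the predicate is applied inside each one separately. The partition layout, the object name PartitionSkewSketch and the helper filterPerPartition are all assumptions chosen for illustration.

```scala
object PartitionSkewSketch {
  // Apply the predicate inside each simulated partition on its own,
  // without moving any element between partitions.
  def filterPerPartition[T](partitions: Vector[Vector[T]])(f: T => Boolean): Vector[Vector[T]] =
    partitions.map(_.filter(f))

  def main(args: Array[String]): Unit = {
    // Assumed layout: 8 numbers spread evenly over 4 partitions
    val partitions = Vector(Vector(1, 2), Vector(3, 4), Vector(5, 6), Vector(7, 8))

    val filtered = filterPerPartition(partitions)(_ <= 2)

    println(filtered.length)        // 4: the partition count is unchanged
    println(filtered.map(_.length)) // Vector(2, 0, 0, 0): all surviving data sits in one partition
  }
}
```

On a real RDD, rdd.filter(...).glom().map(_.length).collect() gives the same kind of per-partition count, and coalesce or repartition can be used afterwards if the result is too uneven.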
Example 1: filter the even numbers out of a data set
import org.apache.spark.{SparkConf, SparkContext}

object Filter_Test {
  def main(args: Array[String]): Unit = {
    // Run locally, using all available cores
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)

    val rdd = sc.makeRDD(List(1, 2, 3, 4, 5, 6))
    // Keep only the elements for which the predicate returns true
    val result = rdd.filter(item => item % 2 == 0)

    // collect() brings the filtered elements back to the driver for printing
    result.collect().foreach(println)
    sc.stop()
  }
}
Run this code and check the console output: the even numbers 2, 4 and 6 are printed, one per line.
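Example 2: filter the records for 17 May 2015 out of a log file
The code below assumes that each line of the log file is space-separated, with the timestamp as the fourth field (index 3).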
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Filter_Test {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)

    // Each element of the RDD is one line of the log file
    val rdd: RDD[String] = sc.textFile("E:\\code-self\\spi\\datas\\apache.log")

    rdd.filter(line => {
      // Split the line on spaces; the fourth field (index 3) holds the timestamp
      val datas = line.split(" ")
      val time = datas(3)
      // Keep the line only if its timestamp falls on 17/05/2015
      time.contains("17/05/2015")
    }).collect().foreach(println)

    sc.stop()
  }
}
Run the code above and check the console output.
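The predicate in Example 2 reads datas(3) directly, so a line with fewer than four fields would throw an exception. One possible variation, sketched here under assumptions, is to pull the rule out into a named function that rejects short lines and can be tried on a few strings without Spark or the log file. The name isFromDate and the sample lines are made up for illustration and are not taken from the original apache.log.

```scala
object LogDateFilterSketch {
  // Builds a predicate that is true when the 4th space-separated field
  // contains the given date. Lines with fewer than 4 fields are rejected
  // instead of throwing an ArrayIndexOutOfBoundsException.
  def isFromDate(date: String): String => Boolean = line => {
    val fields = line.split(" ")
    fields.length > 3 && fields(3).contains(date)
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample lines in the field layout that Example 2 assumes
    val lines = List(
      "10.0.0.1 - - 17/05/2015:10:05:03 +0000 GET /index.html",
      "10.0.0.2 - - 18/05/2015:11:00:00 +0000 GET /about.html",
      "malformed line"
    )

    // Only the first sample line passes the predicate
    lines.filter(isFromDate("17/05/2015")).foreach(println)
  }
}
```

Because isFromDate returns an ordinary String => Boolean, the same value could be passed to rdd.filter(...) in Example 2 in place of the inline lambda.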
Copyright notice
This article was written by [逆风飞翔的小叔]. Please include a link to the original when reposting. Thank you.
https://blog.csdn.net/congge_study/article/details/124355911