Spark operators: using filter
2022-04-23 15:45:00 【逆风飞翔的小叔】
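Preface
filter does what its name suggests: it applies a given rule to a set of data and keeps only the elements that match. The same operator exists in Java and many other languages, where it makes it easy to pick the desired elements out of a collection.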
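Function signature
def filter(f: T => Boolean): RDD[T]
Function description
filter screens the data against the given rule: elements that satisfy the rule are kept and the rest are discarded. The partitioning is unchanged after filtering, but the amount of data left in each partition may become unbalanced, so in production this can lead to data skew.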
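The point about partitions staying the same while their contents become unbalanced can be checked with a small sketch. The object name `FilterPartitionSketch`, the helper `sizesAfterFilter`, and the sample data are assumptions made for illustration; `glom` is used only to expose the contents of each partition.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FilterPartitionSketch {
  // Hypothetical helper: number of elements left in each partition after filtering.
  def sizesAfterFilter(sc: SparkContext): Array[Int] = {
    val rdd = sc.makeRDD(1 to 12, 3)    // 3 partitions: 1-4, 5-8, 9-12
    val filtered = rdd.filter(_ > 8)    // only the last partition contains matches
    // glom() turns each partition into one array, so its length is the partition size
    filtered.glom().map(_.length).collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("FilterPartitionSketch")
    val sc = new SparkContext(conf)
    // Still three partitions, but expected sizes are 0, 0, 4
    println(sizesAfterFilter(sc).mkString(", "))
    sc.stop()
  }
}
```

If this kind of imbalance becomes a problem, one option is to call `coalesce` or `repartition` on the filtered RDD to redistribute the remaining data.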
Example 1: keep only the even numbers in a set of data
import org.apache.spark.{SparkConf, SparkContext}

object Filter_Test {
  def main(args: Array[String]): Unit = {
    // Run locally, using all available cores
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)

    val rdd = sc.makeRDD(List(1, 2, 3, 4, 5, 6))
    // Keep only the elements for which the predicate returns true
    val result = rdd.filter(item => item % 2 == 0)
    result.collect().foreach(println)

    sc.stop()
  }
}
Run this code and check the console output: it prints the even numbers 2, 4 and 6, one per line.
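Example 2: keep only the entries for 17 May 2015 from a log file
The log file (apache.log) is read line by line. The code below splits each line on spaces and treats the fourth field as the time: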
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Filter_Test {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)

    // Each element of the RDD is one line of the log file
    val rdd: RDD[String] = sc.textFile("E:\\code-self\\spi\\datas\\apache.log")
    rdd.filter(line => {
      val datas = line.split(" ")
      // The fourth space-separated field holds the time
      val time = datas(3)
      time.contains("17/05/2015")
    }).collect().foreach(println)

    sc.stop()
  }
}
Run the code above and check the console output: only the log lines whose time field contains 17/05/2015 are printed.
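The log filter above indexes the fourth field directly, so a line with fewer than four fields would throw an ArrayIndexOutOfBoundsException. A minimal sketch of a more defensive predicate follows. The object `LogDateFilter`, the helper `isOnDate`, and the sample lines are all made up for illustration, and the sample lines only assume that the fourth field holds the date.

```scala
object LogDateFilter {
  // Hypothetical helper: true when the 4th space-separated field contains `date`.
  // Lines with fewer than four fields are rejected instead of throwing.
  def isOnDate(line: String, date: String): Boolean = {
    val fields = line.split(" ")
    fields.length > 3 && fields(3).contains(date)
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample lines, used only to exercise the predicate
    val lines = Seq(
      "192.0.2.1 - - 17/05/2015:10:05:03 +0000 GET /index.html",
      "192.0.2.2 - - 18/05/2015:08:00:00 +0000 GET /about.html",
      "malformed line"
    )
    // Prints only the first sample line
    lines.filter(isOnDate(_, "17/05/2015")).foreach(println)
  }
}
```

With an RDD of lines, the same predicate could be passed as `rdd.filter(line => LogDateFilter.isOnDate(line, "17/05/2015"))`.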
Copyright notice
This article was written by [逆风飞翔的小叔]. Please include a link to the original when reposting. Thank you.
https://blog.csdn.net/congge_study/article/details/124355911