Using the filter operator in Spark
2022-04-23 15:48:00 【Uncle flying against the wind】
Preface
filter can be understood as filtering: intuitively, it screens a set of data according to a specified rule. Whether in Java or in other languages, a filter operator makes it easy to pick the data we want out of a collection.
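To make the comparison with ordinary collections concrete, here is a minimal sketch that uses a plain Scala List with no Spark involved. The names LocalFilterDemo and evens are made up for illustration.

```scala
object LocalFilterDemo {
  // Keep only the elements for which the predicate returns true.
  def evens(xs: List[Int]): List[Int] = xs.filter(x => x % 2 == 0)

  def main(args: Array[String]): Unit = {
    println(evens(List(1, 2, 3, 4, 5, 6))) // prints List(2, 4, 6)
  }
}
```

The Spark operator described next takes a predicate of the same shape, applied to a distributed data set instead of a local list.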
Function signature
def filter(f: T => Boolean ): RDD[T]
Function description
filter screens the data according to the specified rule: elements that satisfy the rule are kept, and elements that do not are discarded. Filtering does not change the partitioning, but the data left in each partition may become uneven, so data skew can appear in a production environment.
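The effect on partitions can be checked with a small sketch like the one below. It assumes a local Spark setup like the one used in the cases that follow. The object name FilterPartitions_Demo, the helper partitionSizes and the sample data are made up for illustration, and the final coalesce call is only one possible way to compact sparse partitions, not something the original text prescribes.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object FilterPartitions_Demo {
  // glom() turns each partition into an array, so its size can be inspected.
  def partitionSizes(rdd: RDD[Int]): Array[Int] =
    rdd.glom().map(_.length).collect()

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("FilterPartitions")
    val sc = new SparkContext(sparkConf)

    // Eight numbers spread evenly over four partitions, two per partition.
    val rdd = sc.makeRDD(1 to 8, 4)

    // Keep only the small values; they all sit in the first two partitions.
    val filtered = rdd.filter(_ <= 3)

    // filter does not shuffle, so the partition count is unchanged.
    println(filtered.getNumPartitions) // 4

    // The surviving elements are now unevenly distributed.
    println(partitionSizes(filtered).mkString(", ")) // 2, 1, 0, 0

    // Assumption: merging partitions is an acceptable remedy for this job.
    val compacted = filtered.coalesce(2)
    println(compacted.getNumPartitions) // 2

    sc.stop()
  }
}
```

Two of the four partitions end up empty after filtering, which is the kind of imbalance the description warns about.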
Case 1: keep the even numbers from a set of data
import org.apache.spark.{SparkConf, SparkContext}

object Filter_Test {
  def main(args: Array[String]): Unit = {
    // Run locally, using all available cores
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)

    val rdd = sc.makeRDD(List(1, 2, 3, 4, 5, 6))

    // Keep only the elements for which the predicate is true (even numbers)
    val result = rdd.filter(item => item % 2 == 0)

    // Bring the result back to the driver and print it
    result.collect().foreach(println)

    sc.stop()
  }
}
Run this code and observe the console output: the even numbers 2, 4 and 6 are printed, one per line.

Case 2: extract the records for 17 May 2015 from a log file
The log file holds one space-separated record per line, with the timestamp in the fourth field; the code below matches on the date 17/05/2015.

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Filter_Test {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)

    // Read the log file line by line
    val rdd: RDD[String] = sc.textFile("E:\\code-self\\spi\\datas\\apache.log")

    rdd.filter(line => {
      // Split the record on spaces; the fourth field is the timestamp
      val datas = line.split(" ")
      val time = datas(3)
      // Keep the line only if the timestamp falls on 17/05/2015
      time.contains("17/05/2015")
    }).collect().foreach(println)

    sc.stop()
  }
}
Run the above code and observe the console output: only the log lines dated 17/05/2015 are printed.
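One possible refinement, sketched here under assumptions: the date check can be pulled out into a named function so it can be tried without a SparkContext. The object LogFilter, the function isFromDate and the sample lines are made up for illustration and are not taken from the real apache.log. The length guard is an added safety check for malformed lines, which the original code does not perform.

```scala
object LogFilter {
  // Returns true when the fourth space-separated field contains the date.
  // Lines with fewer than four fields are treated as non-matching
  // instead of throwing ArrayIndexOutOfBoundsException.
  def isFromDate(line: String, date: String): Boolean = {
    val fields = line.split(" ")
    fields.length > 3 && fields(3).contains(date)
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample lines that only mimic the assumed field layout.
    val matching = "host - - 17/05/2015:10:05:03 +0000 GET /index.html"
    val other = "host - - 18/05/2015:10:05:03 +0000 GET /index.html"
    println(isFromDate(matching, "17/05/2015")) // true
    println(isFromDate(other, "17/05/2015"))    // false
    println(isFromDate("", "17/05/2015"))       // false
  }
}
```

With Spark, the same function could be passed to the operator as rdd.filter(line => LogFilter.isFromDate(line, "17/05/2015")).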

Copyright notice
This article was written by [Uncle flying against the wind]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/04/202204231544587482.html