Spark operators: intersection, union and difference
2022-04-23 15:48:00 【Uncle flying against the wind】
Preface
In daily development, we often need to compute the intersection, union, or difference of different data sets. Spark provides operators for exactly these operations on two RDDs, known as double-Value operators.
intersection
Function signature
def intersection(other: RDD[T]): RDD[T]
Function description
Returns a new RDD containing the intersection of the source RDD and the argument RDD.
Example: find the intersection of two sets
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object DoubleValueTest {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)

    // TODO operators - double-Value type
    // intersection, union and subtract require the two RDDs to have the same element type;
    // the zipper operation (zip) allows the element types to differ.
    val rdd1 = sc.makeRDD(List(1, 2, 3, 4))
    val rdd2 = sc.makeRDD(List(3, 4, 5, 6))
    val rdd7 = sc.makeRDD(List("3", "4", "5", "6"))

    // intersection: [3,4]
    val rdd3: RDD[Int] = rdd1.intersection(rdd2)
    // val rdd8 = rdd1.intersection(rdd7) // does not compile: element types differ
    println(rdd3.collect().mkString(","))

    sc.stop()
  }
}
Run the code above and observe the console output: it should print 3 and 4 (their order may vary across runs).
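One detail the listing does not show: intersection also deduplicates, so each common element appears only once in the result even if the inputs contain repeats. A minimal sketch, reusing the sc defined above:

val dupA = sc.makeRDD(List(1, 2, 2, 3))
val dupB = sc.makeRDD(List(2, 3, 3, 4))
// intersection returns each common element exactly once: 2,3
println(dupA.intersection(dupB).collect().sorted.mkString(","))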

union
Function signature
def union(other: RDD[T]): RDD[T]
Function description
Returns a new RDD that is the union of the source RDD and the argument RDD. Note that union keeps duplicates; it does not deduplicate.
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object DoubleValueTest {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)

    // TODO operators - double-Value type
    // intersection, union and subtract require the two RDDs to have the same element type;
    // the zipper operation (zip) allows the element types to differ.
    val rdd1 = sc.makeRDD(List(1, 2, 3, 4))
    val rdd2 = sc.makeRDD(List(3, 4, 5, 6))
    val rdd7 = sc.makeRDD(List("3", "4", "5", "6"))

    // intersection: [3,4]
    /*
    val rdd3: RDD[Int] = rdd1.intersection(rdd2)
    println(rdd3.collect().mkString(","))
    */

    // union: [1,2,3,4,3,4,5,6]
    val rdd4: RDD[Int] = rdd1.union(rdd2)
    println(rdd4.collect().mkString(","))

    sc.stop()
  }
}
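Since union keeps duplicates, distinct() can be chained on afterwards when true set-union semantics are needed. A minimal sketch, reusing rdd1 and rdd2 from the listing above:

val merged = rdd1.union(rdd2)    // 1,2,3,4,3,4,5,6 -- duplicates kept
val deduped = merged.distinct()  // each element once, in arbitrary order
println(deduped.collect().sorted.mkString(","))  // 1,2,3,4,5,6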
subtract
Function signature
def subtract(other: RDD[T]): RDD[T]
Function description
Takes the source RDD as the base: removes the elements that also appear in the argument RDD and keeps the rest. This is the set difference.
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object DoubleValueTest {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)

    // TODO operators - double-Value type
    // intersection, union and subtract require the two RDDs to have the same element type;
    // the zipper operation (zip) allows the element types to differ.
    val rdd1 = sc.makeRDD(List(1, 2, 3, 4))
    val rdd2 = sc.makeRDD(List(3, 4, 5, 6))
    val rdd7 = sc.makeRDD(List("3", "4", "5", "6"))

    // intersection: [3,4]
    /*
    val rdd3: RDD[Int] = rdd1.intersection(rdd2)
    println(rdd3.collect().mkString(","))
    */

    // union: [1,2,3,4,3,4,5,6]
    /*
    val rdd4: RDD[Int] = rdd1.union(rdd2)
    println(rdd4.collect().mkString(","))
    */

    // subtract (difference): [1,2]
    val rdd5: RDD[Int] = rdd1.subtract(rdd2)
    println(rdd5.collect().mkString(","))

    sc.stop()
  }
}
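Note that subtract is not symmetric: swapping the operands changes the result. A quick sketch, reusing rdd1 and rdd2 from the listing above:

println(rdd1.subtract(rdd2).collect().sorted.mkString(","))  // 1,2
println(rdd2.subtract(rdd1).collect().sorted.mkString(","))  // 5,6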
zip
zip is also known as the zipper operator.
Function signature
def zip[U: ClassTag](other: RDD[U]): RDD[(T, U)]
Function description
Combines the elements of two RDDs into key-value pairs: the key is the element from the first RDD, and the value is the element at the same position in the second RDD.
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object DoubleValueTest {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)

    // TODO operators - double-Value type
    // intersection, union and subtract require the two RDDs to have the same element type;
    // the zipper operation (zip) allows the element types to differ.
    val rdd1 = sc.makeRDD(List(1, 2, 3, 4))
    val rdd2 = sc.makeRDD(List(3, 4, 5, 6))
    val rdd7 = sc.makeRDD(List("3", "4", "5", "6"))

    // intersection: [3,4]
    /*
    val rdd3: RDD[Int] = rdd1.intersection(rdd2)
    println(rdd3.collect().mkString(","))
    */

    // union: [1,2,3,4,3,4,5,6]
    /*
    val rdd4: RDD[Int] = rdd1.union(rdd2)
    println(rdd4.collect().mkString(","))
    */

    // subtract (difference): [1,2]
    /*
    val rdd5: RDD[Int] = rdd1.subtract(rdd2)
    println(rdd5.collect().mkString(","))
    */

    // zip: [(1,3),(2,4),(3,5),(4,6)]
    val rdd6: RDD[(Int, Int)] = rdd1.zip(rdd2)
    val rdd8: RDD[(Int, String)] = rdd1.zip(rdd7) // element types may differ for zip
    println(rdd6.collect().mkString(","))

    sc.stop()
  }
}
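One caveat the listing does not show: zip requires both RDDs to have the same number of partitions and the same number of elements in each partition, otherwise Spark raises an error. A minimal sketch of the failure cases, reusing the sc defined above:

val a = sc.makeRDD(List(1, 2, 3, 4), 2) // 4 elements in 2 partitions
val b = sc.makeRDD(List(1, 2, 3, 4), 4) // 4 elements in 4 partitions
// a.zip(b) // fails: can't zip RDDs with unequal numbers of partitions

val c = sc.makeRDD(List(1, 2, 3, 4, 5, 6), 2) // 6 elements in 2 partitions
// a.zip(c).collect() // fails at runtime: partitions have different numbers of elements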
Copyright notice
This article was written by [Uncle flying against the wind]. When reposting, please include a link to the original. Thanks.
https://yzsam.com/2022/04/202204231544587277.html