Spark / Scala - read rcfile & orcfile
2022-04-21 15:13:00 【BIT_666】
1. Introduction
As mentioned in the previous post, MapReduce - Reading OrcFile and RcFile, reading RcFile and OrcFile was implemented there with Java + MapReduce, and a dependency conflict that came up when reading RcFile and OrcFile at the same time was also resolved. Day-to-day development, however, is usually done in Spark, so this post uses Spark to implement reading OrcFile and RcFile as well as the Map-Reduce step.
2. Reading RcFile

The earlier MR job consumed input in RcFile form. Its key type is not critical: LongWritable carries the row number, and NullWritable simply ignores it. What matters is the value type, BytesRefArrayWritable. RcFile can therefore be read with Spark's hadoopFile API. First, look at hadoopFile's parameters:
def hadoopFile[K, V](
    path: String,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int = defaultMinPartitions): RDD[(K, V)]
keyClass and valueClass are already settled, and inputFormatClass can be settled too: from MR's MultipleInputs.addInputPath we know it is RCFileInputFormat. Now start reading:
import org.apache.hadoop.hive.ql.io.RCFileInputFormat
import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable
import org.apache.hadoop.io.LongWritable
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf().setAppName("TestReadRcFile").setMaster("local[1]")
val spark = SparkSession
  .builder
  .config(conf)
  .getOrCreate()
val sc = spark.sparkContext

val rcFileInput = ""        // RcFile input path
val minPartitions = 100

println("=" * 30 + " Start reading RcFile " + "=" * 30)
val rcFileRdd = sc.hadoopFile(
    rcFileInput,
    classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]],
    classOf[LongWritable],
    classOf[BytesRefArrayWritable],
    minPartitions)
  .map(line => {
    // Each record's value is a BytesRefArrayWritable; read the first two columns
    val key = LazyBinaryRCFileUtils.readString(line._2.get(0))
    val value = LazyBinaryRCFileUtils.readString(line._2.get(1))
    (key, value)
  })
println("=" * 30 + " End reading RcFile " + "=" * 30)
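LazyBinaryRCFileUtils is a helper that is not shown in the post. For a plain RcFile whose columns store raw UTF-8 bytes, a minimal stand-in could look like the sketch below; this is a hypothetical helper under that assumption, not the original implementation:

import org.apache.hadoop.hive.serde2.columnar.BytesRefWritable

// Hypothetical stand-in for the post's LazyBinaryRCFileUtils, assuming each
// column holds raw UTF-8 bytes. BytesRefArrayWritable.get(i) returns a
// BytesRefWritable referencing a slice of a shared byte array, so the
// offset (getStart) and length (getLength) must be respected.
object LazyBinaryRCFileUtils {
  def readString(ref: BytesRefWritable): String =
    new String(ref.getData, ref.getStart, ref.getLength, "UTF-8")
}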

3. Reading OrcFile
Unlike RcFile, which needs hadoopFile, SparkSession provides an API for reading ORC files directly, which makes reading OrcFile with Spark quite smooth. Note that reading ORC returns a Dataset, which has to be converted into a regular Spark RDD via .rdd.
val conf = new SparkConf().setAppName("TestReadOrcFile").setMaster("local[1]")
val spark = SparkSession
  .builder
  .config(conf)
  .getOrCreate()

println("=" * 30 + " Start reading OrcFile " + "=" * 30)
import spark.implicits._

val orcInput = ""           // OrcFile input path
val orcFileRdd = spark.read.orc(orcInput)
  .map(row => {
    // Assumes the first two ORC columns are strings
    val key = row.getString(0)
    val value = row.getString(1)
    (key, value)
  })
  .rdd
println("=" * 30 + " End reading OrcFile " + "=" * 30)
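The row.getString(0) / row.getString(1) calls rely on the first two ORC columns being strings. When the column layout is not known in advance, it can be safer to inspect the schema and select columns by name first. A small sketch, assuming hypothetical column names key and value:

val df = spark.read.orc(orcInput)
df.printSchema()   // verify column order and types before positional access

// Selecting by (assumed) column names avoids depending on column positions
val byNameRdd = df.select("key", "value")
  .map(row => (row.getString(0), row.getString(1)))
  .rdd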

4. Implementing Map-Reduce with Spark
The two PairRDDs above, rcFileRdd and orcFileRdd, can be viewed as two Mappers. To run the reduce step, union merges the two PairRDDs, groupByKey then aggregates the values for each target key, and a following map applies the reduce logic:
rcFileRdd.union(orcFileRdd).groupByKey().map(info => {
  val key: String = info._1
  val values: Iterable[String] = info._2
  // ... reduce func ...
})
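The reduce body is left open above. As a purely illustrative example, counting the records per key could look like this; and when the reduce function is associative and commutative, reduceByKey is usually cheaper than groupByKey because it pre-aggregates on the map side instead of materializing the full value list:

// Illustrative reduce: count how many values arrived for each key
val countsViaGroup = rcFileRdd.union(orcFileRdd)
  .groupByKey()
  .map { case (key, values) => (key, values.size) }

// Equivalent result with map-side pre-aggregation
val countsViaReduce = rcFileRdd.union(orcFileRdd)
  .mapValues(_ => 1)
  .reduceByKey(_ + _)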
5. Summary
Compared with reading RcFile and OrcFile in MR, Spark's API is considerably simpler. Give it a try if you need it, it's very nice ~
Copyright notice
This article was written by [BIT_666]. Please include a link to the original when reposting: https://yzsam.com/2022/04/202204211507231856.html