当前位置:网站首页>12 Spark on RDD 分区器
12 Spark on RDD 分区器
2022-08-08 23:31:00 【YaPengLi.】
RDD 分区器
Spark 目前支持 Hash 分区和 Range 分区,和用户自定义分区。Hash 分区为当前的默认分区。分区器直接决定了 RDD 中分区的个数、RDD 中每条数据经过 Shuffle 后进入哪个分区,进而决定了 Reduce 的个数。只有 Key-Value 类型的 RDD 才有分区器,非 Key-Value 类型的 RDD 分区的值是 None每个 RDD 的分区 ID 范围:0 ~ (numPartitions - 1),决定这个值是属于那个分区的。
Hash 分区:对于给定的 key,计算其 hashCode,并除以分区个数取余。
class HashPartitioner(partitions: Int) extends Partitioner {
require(partitions >= 0, s"Number of partitions ($partitions) cannot be
negative.")
def numPartitions: Int = partitions
def getPartition(key: Any): Int = key match {
case null => 0
case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
}
override def equals(other: Any): Boolean = other match {
case h: HashPartitioner =>
h.numPartitions == numPartitions
case _ =>
false
}
override def hashCode: Int = numPartitions
}Range 分区:将一定范围内的数据映射到一个分区中,尽量保证每个分区数据均匀,而且分区间有序
class RangePartitioner[K : Ordering : ClassTag, V](
partitions: Int,
rdd: RDD[_ <: Product2[K, V]],
private var ascending: Boolean = true)
extends Partitioner {
// We allow partitions = 0, which happens when sorting an empty RDD under the
default settings.
require(partitions >= 0, s"Number of partitions cannot be negative but found
$partitions.")
private var ordering = implicitly[Ordering[K]]
// An array of upper bounds for the first (partitions - 1) partitions
private var rangeBounds: Array[K] = {
...
}
def numPartitions: Int = rangeBounds.length + 1
private var binarySearch: ((Array[K], K) => Int) =
CollectionsUtils.makeBinarySearch[K]
def getPartition(key: Any): Int = {
val k = key.asInstanceOf[K]
var partition = 0
if (rangeBounds.length <= 128) {
// If we have less than 128 partitions naive search
while (partition < rangeBounds.length && ordering.gt(k,
rangeBounds(partition))) {
partition += 1
}
} else {
// Determine which binary search method to use only once.
partition = binarySearch(rangeBounds, k)
// binarySearch either returns the match location or -[insertion point]-1
if (partition < 0) {
partition = -partition-1 }
if (partition > rangeBounds.length) {
partition = rangeBounds.length
} }
if (ascending) {
partition
} else {
rangeBounds.length - partition
} }
override def equals(other: Any): Boolean = other match {
...
}
override def hashCode(): Int = {
...
}
@throws(classOf[IOException])
private def writeObject(out: ObjectOutputStream): Unit =
Utils.tryOrIOException {
...
}
@throws(classOf[IOException])
private def readObject(in: ObjectInputStream): Unit = Utils.tryOrIOException
{
...
} }边栏推荐
- mysql 高级知识【order by 排序优化】
- 2022牛客多校六 A-Array(构造+哈夫曼)
- (2022牛客多校四)N-Particle Arts(思维)
- PHP 正则给img的src添加域名
- (2022牛客多校五)G-KFC Crazy Thursday(二分+哈希/Manacher)
- 2021 RoboCom 世界机器人开发者大赛-本科组(决赛)7-1绿色围栏(模拟)
- 【latex异常与错误】There were undefined references.Reference `xxx‘ on page x undefined.参考引用公式编号时发生错误
- -Wl,--start-group ... -Wl,--end-group 用于解决几个库的循环依赖关系
- 待完善:tf.name_scope() 和 tf.variable_scope()的区别
- 51nod1798 打怪兽
猜你喜欢

(2022牛客多校四)D-Jobs (Easy Version)(三维前缀或)

Virtual router redundancy protocol VRRP - double-machine hot backup

用模态框 实现 注册 登陆

Hi3516 use wifi module

【LaTex异常与错误】 - 公式编号的参考引用命令\eqref发生错误Undefined control sequence——可能是因为没加载宏包amsmath

Introduction to Qt (4) - Continuous playback of pictures (the use of two timers)

线性筛求积性函数

Hi3516 使用 wifi模块

51nod 2887 抓小偷 平面图最小割转换成最短路

2022杭电多校六 1007-Shinobu loves trip(同余方程)
随机推荐
Excel 2013 下拉为“快速分拆”调整为“填充序号”
Kubernetes web网站无法访问
WeChat applet develops some function usage methods
iptables firewall content full solution
2021 RoboCom 世界机器人开发者大赛-本科组(决赛)7-4猛犸不上 Ban(最短路)
STM8L LCD digital tube driver, thermometer LCD display
bp神经网络的学习心得
2022杭电多校六 1006-Maex (树形DP)
A preliminary study on the use of ndk and JNI
51nod1798 打怪兽
(nowcoder22529C)dinner(容斥原理+排列组合)
PHP 正则给img的src添加域名
Use Mongoose populate to implement multi-table associative storage and query, with complete code included
Small program figure display banner
Hi3516 使用 wifi模块
用模态框 实现 注册 登陆
数组去重的几种方法
[PP-YOLOv2] Test a custom dataset
LeetCode:最长有效括号
PHP 类函数和对象函数