MapReduce compression
2022-04-23 10:12:00 【zhaojiew】
Compression principle
- Compute-intensive jobs: use compression sparingly
- I/O-intensive jobs: use compression more heavily
Compression algorithm comparison
| Compression format | Bundled with Hadoop | Algorithm | File extension | Splittable | Does the program need changes after switching to this format? |
|---|---|---|---|---|---|
| DEFLATE | Yes | DEFLATE | .deflate | No | No; handled the same way as plain text |
| Gzip | Yes | DEFLATE | .gz | No | No; handled the same way as plain text |
| bzip2 | Yes | bzip2 | .bz2 | Yes | No; handled the same way as plain text |
| LZO | No | LZO | .lzo | Yes | Yes; an index must be built and the input format must be specified |
| Snappy | Yes | Snappy | .snappy | No | No; handled the same way as plain text |
Compression performance comparison
| Compression algorithm | Original file size | Compressed file size | Compression speed | Decompression speed |
|---|---|---|---|---|
| gzip | 8.3GB | 1.8GB | 17.5MB/s | 58MB/s |
| bzip2 | 8.3GB | 1.1GB | 2.4MB/s | 9.5MB/s |
| LZO | 8.3GB | 2.9GB | 49.3MB/s | 74.6MB/s |
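The figures in the table are easier to compare as ratios and wall-clock estimates. The sketch below is only arithmetic on the table values; the class name `CompressionStats` is hypothetical, and it assumes the quoted speeds are measured against the uncompressed size and that 1 GB = 1024 MB.

```java
import java.util.Locale;

public class CompressionStats {
    // Compressed size as a percentage of the original size (smaller is better).
    public static double ratioPercent(double originalGb, double compressedGb) {
        return 100.0 * compressedGb / originalGb;
    }

    // Seconds needed to process sizeGb at mbPerSec, assuming 1 GB = 1024 MB.
    public static double seconds(double sizeGb, double mbPerSec) {
        return sizeGb * 1024.0 / mbPerSec;
    }

    public static void main(String[] args) {
        // Values copied from the performance table above.
        String[] names = {"gzip", "bzip2", "LZO"};
        double[] compressedGb = {1.8, 1.1, 2.9};
        double[] compressMbPerSec = {17.5, 2.4, 49.3};
        double originalGb = 8.3;

        for (int i = 0; i < names.length; i++) {
            System.out.printf(Locale.ROOT,
                    "%-5s size=%.1f%% of original, compress time=%.1f min%n",
                    names[i],
                    ratioPercent(originalGb, compressedGb[i]),
                    seconds(originalGb, compressMbPerSec[i]) / 60.0);
        }
    }
}
```

Under these assumptions gzip shrinks the file to about 21.7% of its original size, bzip2 to about 13.3%, and LZO to about 34.9%, while the compression time runs in the opposite order: bzip2 is by far the slowest and LZO the fastest.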
Compression algorithm selection
When choosing a compression algorithm, weigh three factors: compression and decompression speed, compression ratio (the size of the data once compressed), and whether the compressed file can still be split.

Mapper input
There is no need to specify the codec explicitly. Hadoop checks the file extension automatically, and if the extension matches a known codec, it uses that codec to compress and decompress the file.
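The extension matching described above can be pictured with a small lookup. This is a simplified illustration rather than Hadoop's own code (in Hadoop the lookup is done by `CompressionCodecFactory`); the class `CodecByExtension` and its method are hypothetical, and the extensions and codec class names are the ones listed in this article's tables.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

public class CodecByExtension {
    // Extension -> codec class name, taken from the tables in this article.
    private static final Map<String, String> CODECS = new LinkedHashMap<>();
    static {
        CODECS.put(".deflate", "org.apache.hadoop.io.compress.DefaultCodec");
        CODECS.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        CODECS.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
        CODECS.put(".lzo", "com.hadoop.compression.lzo.LzopCodec");
        CODECS.put(".snappy", "org.apache.hadoop.io.compress.SnappyCodec");
    }

    // Returns the codec class name whose extension the file name ends with, if any.
    public static Optional<String> codecFor(String fileName) {
        for (Map.Entry<String, String> entry : CODECS.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return Optional.of(entry.getValue());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String[] inputs = {"logs/part-00000.gz", "logs/part-00001.bz2", "logs/plain.txt"};
        for (String name : inputs) {
            System.out.println(name + " -> "
                    + codecFor(name).orElse("no codec matched, read as uncompressed"));
        }
    }
}
```

A file whose extension matches nothing is simply read as uncompressed data, which is why renaming a compressed file can break a job that otherwise needs no code changes.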
Considerations in enterprise development:
- If the data is smaller than the block size, favor the codecs with fast compression and decompression: LZO or Snappy
- If the data is very large, favor the codecs that support splitting: Bzip2 or LZO
Mapper output
In enterprise development, the goal at this stage is to reduce the network I/O between MapTask and ReduceTask, so favor the codecs with fast compression and decompression: LZO or Snappy.
Reducer output
The choice depends on what the output is for:
- If the data is stored permanently, consider the codecs with a high compression ratio: Bzip2 or Gzip
- If the data is the input to the next MapReduce job, consider the amount of data and whether splitting is supported
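The guidelines for the three stages can be condensed into one decision function. The sketch below is only a restatement of the rules above as code; `CodecChooser`, its `Stage` enum and the 128 MB block size used in `main` are assumptions for illustration, and chained reducer output is treated with the same size rule as mapper input.

```java
public class CodecChooser {
    public enum Stage { MAP_INPUT, MAP_OUTPUT, REDUCE_OUTPUT_ARCHIVE, REDUCE_OUTPUT_CHAINED }

    // Returns the codecs this article suggests for the given stage and data size.
    public static String suggest(Stage stage, long dataBytes, long blockBytes) {
        switch (stage) {
            case MAP_OUTPUT:
                // Shuffle traffic: speed matters more than ratio.
                return "LZO or Snappy";
            case REDUCE_OUTPUT_ARCHIVE:
                // Long-term storage: ratio matters more than speed.
                return "Bzip2 or Gzip";
            case MAP_INPUT:
            case REDUCE_OUTPUT_CHAINED:
            default:
                // Data that fits in one block never needs splitting, so pick a fast codec;
                // larger data should stay splittable.
                return dataBytes < blockBytes ? "LZO or Snappy" : "Bzip2 or LZO";
        }
    }

    public static void main(String[] args) {
        long blockBytes = 128L * 1024 * 1024; // assumed HDFS block size
        System.out.println(suggest(Stage.MAP_INPUT, 64L * 1024 * 1024, blockBytes));
        System.out.println(suggest(Stage.MAP_INPUT, 10L * 1024 * 1024 * 1024, blockBytes));
        System.out.println(suggest(Stage.MAP_OUTPUT, 0, blockBytes));
        System.out.println(suggest(Stage.REDUCE_OUTPUT_ARCHIVE, 0, blockBytes));
    }
}
```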
Compression parameter configuration
Hadoop codecs
| Compression format | Codec class |
|---|---|
| DEFLATE | org.apache.hadoop.io.compress.DefaultCodec |
| gzip | org.apache.hadoop.io.compress.GzipCodec |
| bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
| LZO | com.hadoop.compression.lzo.LzopCodec |
| Snappy | org.apache.hadoop.io.compress.SnappyCodec |
Parameter configuration
| Parameter | Default value | Stage | Recommendation |
|---|---|---|---|
| io.compression.codecs (set in core-site.xml) | None; run hadoop checknative on the command line to check | Input compression | Hadoop uses the file extension to determine whether a codec is supported |
| mapreduce.map.output.compress (set in mapred-site.xml) | false | Mapper output | Set this parameter to true to enable compression |
| mapreduce.map.output.compress.codec (set in mapred-site.xml) | org.apache.hadoop.io.compress.DefaultCodec | Mapper output | Enterprises commonly use the LZO or Snappy codec to compress data at this stage |
| mapreduce.output.fileoutputformat.compress (set in mapred-site.xml) | false | Reducer output | Set this parameter to true to enable compression |
| mapreduce.output.fileoutputformat.compress.codec (set in mapred-site.xml) | org.apache.hadoop.io.compress.DefaultCodec | Reducer output | Use a standard tool or codec, such as gzip or bzip2 |
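To see what these settings look like in mapred-site.xml, the sketch below renders a few of them as `<property>` elements, which belong inside the file's `<configuration>` element. The property names come from the table; the class `MapredSiteSnippet` is hypothetical, and the choice of SnappyCodec for map output and GzipCodec for reducer output is only an example that follows the recommendations column.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MapredSiteSnippet {
    // Renders each entry as a Hadoop-style <property> element.
    // The values used here contain no XML special characters, so no escaping is done.
    public static String render(Map<String, String> props) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> entry : props.entrySet()) {
            sb.append("<property>\n")
              .append("  <name>").append(entry.getKey()).append("</name>\n")
              .append("  <value>").append(entry.getValue()).append("</value>\n")
              .append("</property>\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("mapreduce.map.output.compress", "true");
        props.put("mapreduce.map.output.compress.codec",
                "org.apache.hadoop.io.compress.SnappyCodec");
        props.put("mapreduce.output.fileoutputformat.compress", "true");
        props.put("mapreduce.output.fileoutputformat.compress.codec",
                "org.apache.hadoop.io.compress.GzipCodec");
        System.out.print(render(props));
    }
}
```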
Compression routine
Compressing map output
The codecs that ship with the Hadoop source include BZip2Codec and DefaultCodec.
Configure compression in the driver, in much the same way as you configure an HDFS client:
// Requires org.apache.hadoop.conf.Configuration, org.apache.hadoop.mapreduce.Job,
// org.apache.hadoop.io.compress.BZip2Codec and org.apache.hadoop.io.compress.CompressionCodec
Configuration conf = new Configuration();
// Enable compression of map output
conf.setBoolean("mapreduce.map.output.compress", true);
// Set the codec used for map output
conf.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);
Job job = Job.getInstance(conf);
Compressing reduce output
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// Enable compression of reduce output
FileOutputFormat.setCompressOutput(job, true);
// Set the codec used for reduce output
FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
// Alternatives:
// FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
// FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);
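If the GzipCodec alternative is chosen, the reducer output files carry a .gz extension and can be checked with the JDK alone after copying them to a local disk. The sketch below is a stand-alone illustration, not part of the driver: the class `GzipPartFileCheck` is hypothetical, and the file name `part-r-00000.gz` and its contents are made up, written to a temporary directory so the example runs without a cluster.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipPartFileCheck {
    // Writes the lines to a gzip file, one line per record.
    public static void writeGzip(Path file, List<String> lines) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(file)), StandardCharsets.UTF_8)) {
            for (String line : lines) {
                writer.write(line);
                writer.write('\n');
            }
        }
    }

    // Reads a gzip file back into a list of lines.
    public static List<String> readGzip(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(file)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Writes a stand-in part file to a temp directory and reads it back.
    public static List<String> roundTrip(List<String> lines) {
        try {
            Path dir = Files.createTempDirectory("mr-output");
            Path part = dir.resolve("part-r-00000.gz"); // hypothetical output file name
            writeGzip(part, lines);
            return readGzip(part);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(Arrays.asList("hadoop\t3", "mapreduce\t2")));
    }
}
```

The JDK has no bzip2 reader, so output written with BZip2Codec needs a command-line tool or a third-party library to inspect in the same way.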
Copyright notice
This article was written by [zhaojiew]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/04/202204230949431047.html