MapReduce compression
2022-04-23 10:12:00 [zhaojiew]
Compression principles
- For computation-intensive jobs, use less compression.
- For IO-intensive jobs, use more compression.
Compression algorithm comparison

Compression format | Built into Hadoop | Algorithm | File extension | Splittable | Changes needed after switching to this format |
---|---|---|---|---|---|
DEFLATE | Yes | DEFLATE | .deflate | No | None; processed like plain text |
Gzip | Yes | DEFLATE | .gz | No | None; processed like plain text |
bzip2 | Yes | bzip2 | .bz2 | Yes | None; processed like plain text |
LZO | No | LZO | .lzo | Yes | Requires building an index and specifying the input format |
Snappy | Yes | Snappy | .snappy | No | None; processed like plain text |
Compression performance comparison

Algorithm | Original file size | Compressed file size | Compression speed | Decompression speed |
---|---|---|---|---|
gzip | 8.3GB | 1.8GB | 17.5MB/s | 58MB/s |
bzip2 | 8.3GB | 1.1GB | 2.4MB/s | 9.5MB/s |
LZO | 8.3GB | 2.9GB | 49.3MB/s | 74.6MB/s |
Choosing a compression algorithm

When choosing a codec, weigh compression/decompression speed, compression ratio (size after compression), and whether the compressed file can still be split.
Mapper input

There is no need to specify a codec explicitly: Hadoop checks the file extension automatically, and if it matches a registered codec, the file is decompressed with the appropriate codec.

Considerations in enterprise development:

- If the data is smaller than the block size, favor codecs with fast compression/decompression, such as LZO or Snappy.
- If the data volume is very large, favor codecs that support splitting, such as bzip2 and LZO.
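As an illustration of the extension-based detection described above, here is a minimal sketch (not Hadoop's actual `CompressionCodecFactory` implementation; the class and method names are this sketch's own) that maps file extensions to the codec classes listed in this article:

```java
import java.util.Map;

// Illustrative sketch: mimics how Hadoop picks a codec by file extension.
public class CodecByExtension {
    // Extension-to-codec table, taken from the codec list in this article.
    private static final Map<String, String> CODECS = Map.of(
        ".deflate", "org.apache.hadoop.io.compress.DefaultCodec",
        ".gz",      "org.apache.hadoop.io.compress.GzipCodec",
        ".bz2",     "org.apache.hadoop.io.compress.BZip2Codec",
        ".lzo",     "com.hadoop.compression.lzo.LzopCodec",
        ".snappy",  "org.apache.hadoop.io.compress.SnappyCodec"
    );

    // Returns the codec class name for a file, or null when no extension
    // matches (the file is then read as uncompressed text).
    public static String codecFor(String fileName) {
        for (Map.Entry<String, String> e : CODECS.entrySet()) {
            if (fileName.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(codecFor("part-r-00000.bz2"));
    }
}
```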
Mapper output

How to choose in enterprise development: to reduce the network IO between MapTask and ReduceTask, favor fast codecs such as LZO and Snappy.
Reducer output

Depends on the requirement:

- If the data is stored permanently, consider codecs with a high compression ratio, such as bzip2 and gzip.
- If the output is the input of the next MapReduce job, consider the data volume and whether splitting is supported.
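The selection rules above can be encoded as a small helper. This is a hypothetical illustration (the class, method, and return strings are this sketch's own, not a Hadoop API):

```java
// Hypothetical helper encoding the reducer-output codec guidance above.
public class ReducerOutputCodec {
    // permanentStorage: data is archived long-term
    // feedsNextJob: output becomes the next MapReduce job's input
    // needsSplitting: the data is large enough that splits matter
    public static String recommend(boolean permanentStorage,
                                   boolean feedsNextJob,
                                   boolean needsSplitting) {
        if (permanentStorage) {
            return "bzip2 or gzip";  // high compression ratio for archival data
        }
        if (feedsNextJob && needsSplitting) {
            return "bzip2 or LZO";   // splittable formats for large downstream inputs
        }
        return "LZO or Snappy";      // fast codecs when speed matters most
    }

    public static void main(String[] args) {
        System.out.println(recommend(true, false, false));
    }
}
```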
Compression parameter configuration

Hadoop codecs

Compression format | Codec class |
---|---|
DEFLATE | org.apache.hadoop.io.compress.DefaultCodec |
gzip | org.apache.hadoop.io.compress.GzipCodec |
bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
LZO | com.hadoop.compression.lzo.LzopCodec |
Snappy | org.apache.hadoop.io.compress.SnappyCodec |
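As a sketch, registering the built-in codecs for input decompression in core-site.xml might look like the following (which codecs you list depends on what your cluster supports):

```xml
<!-- core-site.xml: codecs Hadoop may use for input decompression -->
<property>
  <name>io.compression.codecs</name>
  <value>
    org.apache.hadoop.io.compress.DefaultCodec,
    org.apache.hadoop.io.compress.GzipCodec,
    org.apache.hadoop.io.compress.BZip2Codec
  </value>
</property>
```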
Parameter configuration

Parameter | Default | Stage | Recommendation |
---|---|---|---|
io.compression.codecs (in core-site.xml) | none; run `hadoop checknative` to see native support | Input decompression | Hadoop uses the file extension to decide whether a codec is supported |
mapreduce.map.output.compress (in mapred-site.xml) | false | Mapper output | Set to true to enable compression |
mapreduce.map.output.compress.codec (in mapred-site.xml) | org.apache.hadoop.io.compress.DefaultCodec | Mapper output | Enterprises commonly use LZO or Snappy to compress data at this stage |
mapreduce.output.fileoutputformat.compress (in mapred-site.xml) | false | Reducer output | Set to true to enable compression |
mapreduce.output.fileoutputformat.compress.codec (in mapred-site.xml) | org.apache.hadoop.io.compress.DefaultCodec | Reducer output | Use standard codecs such as gzip and bzip2 |
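Putting the table's parameters together, a cluster-wide configuration sketch in mapred-site.xml might look like this (Snappy for intermediate data and gzip for final output are example choices, not requirements):

```xml
<!-- mapred-site.xml: compress intermediate map output with Snappy -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<!-- compress the final reducer output with gzip -->
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
```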
Compression examples

Compression at map output

Among the codecs shipped with Hadoop, BZip2Codec and DefaultCodec can be used here.

Configure it in the driver, similar to configuring an HDFS client:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// enable compression of map output
conf.setBoolean("mapreduce.map.output.compress", true);
// set the codec used for map output
conf.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);
Job job = Job.getInstance(conf);
Compression at reduce output

FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// enable compression of the reduce output
FileOutputFormat.setCompressOutput(job, true);
// set the codec used for the output
FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
// FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
// FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);
Copyright notice

This article was written by [zhaojiew]. Please include the original link when reposting. Thanks!
https://yzsam.com/2022/04/202204230949431047.html