HDFS and Hive storage - comparing file formats and compression methods
A few days ago, we have conducted a test in order to compare various Hive file formats and compression methods. Among those file formats, some are native to HDFS and apply to all Hadoop users. The test suite is composed of similar Hive queries which create a table, eventually set a compression type and load the same dataset into the new table. Among all the queries, we tested the “sequence file”, “text file” and “RCFILE” formats and the “default”, “bz”, “gz”, “LZO” and “Snappy” compression codecs.
April 4th 2012: Answer to Huchev comment relative to LZ4.
The environment is a 20 nodes and 120 terabytes Hadoop cluster running Cloudera CDH3U3. The original dataset is a 1.33 Go folder with 80 compressed and unsplitable “bz2” files inside. The data inside is formatted as CSV for a total of about 125,000,000 lines.
-- Prepare CREATE TABLE rc_lzo ( client BIGINT, ctime INT, mtime INT, code STRING, value_1 INT, value_2 INT ) STORED AS RCFILE; -- Compression SET hive.exec.compress.output=true; SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec; -- Import INSERT OVERWRITE TABLE rc_lzo SELECT * FROM ( SELECT client, round(ctime / 1000), round(mtime / 1000), code, value_1, value_2 FROM staging ) T;
The table below are the results we obtained. The query columns describe the type of test. The query name starts with the file format followed by the compression codec. Both “block” and “record” compression types have been reported for the “sequence file” format and the “default” compression codec.
The “serdesf” family of queries makes use of a custom SerDe which encodes each column in an shorter size when appropriate. In our use case, the
code can be stored as 1 character (1 byte), the
value_2 can be represented as the diff between itself and the
value_1 (2 bytes). Overall, a line is stored as 16 bytes compared to 65 bytes originally.
The “bss” query uses the
BinarySortableSerDe serialization which is a custom Hive Serde that we associated to the “sequence file” format.
The “b64” family of queries uses the
base64 package present in the Hive contrib project.
|sf||2mn 3s||7.91 Go|
|sf_string||2mn 22s||8.72 Go|
|sf_df_block||2mn 17s||8.72 Go|
|sf_df_record||2mn 12s||7.32 Go|
|sf_bz||2h 43mn 24s||9.9 Go|
|sf_gz||2mn 29s||8.72 Go|
|sf_lzo||2mn 36s||8.80 Go|
|sf_snappy||3mn 55s||8.23 Go|
|tf||1mn 45s||6.44 Go|
|tf_bz||2mn 14s||1.12 Go|
|tf_df||2mn 16s||1.12 Go|
|tf_lzo||1mn 28s||2.41 Go|
|tf_snappy||1mn 2s||2.55 Go|
|rc||1mn 30s||5.78 Go|
|rc_df||5mn 15s||917.68 Mo|
|rc_gz||4mn 36s||917.80 Mo|
|serdesf_df||1mn 27s||4.61 Go|
|serdesf_bz||3h 6mn 9s||9.63 Go|
|serdesf_gz||1mn 51s||6.02 Go|
|serdesf_snappy||1mn 35s||4.80 Go|
|bss||1mn 25s||5.73 Go|
|b64||2mn 5s||9.17 Go|
|b64_bz||21mn 15s||1.14 Go|
|b64_df||21mn 25s||1.14 Go|
We wish we had run the tests on a larger input data set with a more common file format but cluster time is a scarce resource at the moment. As a result, the file sizes are probably representative but the speed results should be interpreted with care.
The speed reflects importing time only (with not so standard files as input) and not how fast those formats are against map/reduce jobs.
As part of the test, we also tested
block type of compression on other type of codecs but they had no effect so we came to the conclusion that
block type only apply to the default codec in sequence files.
We tried to run the
serdesf test in a similar mode but using the
RCFILE format instead of
sequence file but the results are identical to the
rc family of queries.
The “tf” query act as a reference since it stores our data in an uncompressed CSV format. It is kind of awkward to see that all the “sequence file” queries generate larger file size. This isn’t something we expected but maybe is it due to our high usage of the integer type.
In term of file size, the “RCFILE” format with the “default” and “gz” compression achieve the best results. The “base64” SerDe using “sequence file” with “bz” and “default” compression aren’t so far away. However, those “base64” results are slow enough to be bypassed.
In term of speed, the “RCFILE” formats with the “lzo” and “snappy” are very fast while preserving a high compression rate.
We didn’t test LZ4. From our understanding of HADOOP-7657, the support for LZ4 target Hadoop version 0.23.1, 0.24.0 and is not backported to our running Cloudera CDH3U3. If you are curious about LZ4, here’s an interesing article comparing LZ4 to Snappy. Also worth of interest is the in-memory benchmark of the fastest compressors.