HDFS and Hive storage - comparing file formats and compression methods

A few days ago, we conducted a test to compare various Hive file formats and compression methods. Among those file formats, some are native to HDFS and apply to all Hadoop users. The test suite is composed of similar Hive queries which create a table, optionally set a compression type, and load the same dataset into the new table. Across all the queries, we tested the “sequence file”, “text file” and “RCFILE” formats and the “default”, “bz”, “gz”, “LZO” and “Snappy” compression codecs.

Update (April 4th, 2012): answer to Huchev’s comment about LZ4.

Setup

The environment is a 20-node, 120 TB Hadoop cluster running Cloudera CDH3U3. The original dataset is a 1.33 GB folder containing 80 compressed, unsplittable “bz2” files. The data inside is formatted as CSV, for a total of about 125,000,000 lines.
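The queries below read from a staging table declared on top of those files. The article does not show its definition; the following is a minimal sketch of what it could look like, where the column types, the field delimiter and the HDFS location are assumptions (Hive reads the “bz2” compressed text files transparently):

-- Hypothetical declaration of the staging table over the compressed CSV files.
-- Column types, delimiter and location are assumptions, not taken from the test.
CREATE EXTERNAL TABLE staging (
    client BIGINT, ctime BIGINT, mtime BIGINT,
    code STRING, value_1 INT, value_2 INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/staging';  -- hypothetical path to the 80 "bz2" files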

Below is an example of a Hive query importing the data into a table stored as “RCFILE” with “LZO” compression:

-- Prepare
CREATE TABLE rc_lzo (
    client BIGINT, ctime INT, mtime INT,
    code STRING, value_1 INT, value_2 INT
) STORED AS RCFILE;
-- Compression
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
-- Import
INSERT OVERWRITE TABLE rc_lzo
SELECT * FROM (
    SELECT
        client, round(ctime / 1000), round(mtime / 1000),
        code, value_1, value_2
    FROM staging
) T;
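The other queries of a given family presumably differ only in the two compression “SET” statements. As a sketch, and assuming the standard Hadoop codec classes shipped with CDH3, the codec line would become one of:

-- "default" compression
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec;
-- "gz" compression
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
-- "bz" compression
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;
-- "Snappy" compression
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;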

Results

The table below presents the results we obtained. The query column describes the type of test: the query name starts with the file format, followed by the compression codec. Both the “block” and “record” compression types are reported for the “sequence file” format with the “default” compression codec (queries “sf_df_block” and “sf_df_record”), as sketched below.
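For those two “sequence file” runs, the compression type is controlled by a separate property. A minimal sketch, assuming the standard Hadoop property name of that era:

-- "sf_df_block": compress batches of records together
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
-- "sf_df_record": compress each record value individually
SET mapred.output.compression.type=RECORD;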

The “serdesf” family of queries makes use of a custom SerDe which encodes each column in a shorter form when appropriate. In our use case, the code can be stored as 1 character (1 byte), and value_2 can be represented as the difference between itself and value_1 (2 bytes). Overall, a line is stored in 16 bytes compared to 65 bytes originally.

The “bss” query uses the BinarySortableSerDe serialization, a Hive SerDe that we associated with the “sequence file” format.
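The article does not include the corresponding DDL; assuming the SerDe is simply declared on the table, it could look like this:

-- Sketch of the "bss" table: BinarySortableSerDe on top of a sequence file.
CREATE TABLE bss (
    client BIGINT, ctime INT, mtime INT,
    code STRING, value_1 INT, value_2 INT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe'
STORED AS SEQUENCEFILE;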

The “b64” family of queries uses the base64 package present in the Hive contrib project.
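The exact DDL of the “b64” tables is not shown either. One plausible way to wire the contrib base64 package in, after adding the hive-contrib jar, is through its text input and output formats (the jar path and table layout below are assumptions, not necessarily what was used in the test):

-- Hypothetical "b64" table using the Hive contrib base64 file format classes.
ADD JAR /usr/lib/hive/lib/hive-contrib.jar;  -- path is an assumption
CREATE TABLE b64 (
    client BIGINT, ctime INT, mtime INT,
    code STRING, value_1 INT, value_2 INT
)
STORED AS
    INPUTFORMAT 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextOutputFormat';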

query            time         size
sf               2mn 3s       7.91 GB
sf_string        2mn 22s      8.72 GB
sf_df_block      2mn 17s      8.72 GB
sf_df_record     2mn 12s      7.32 GB
sf_bz            2h 43mn 24s  9.9 GB
sf_gz            2mn 29s      8.72 GB
sf_lzo           2mn 36s      8.80 GB
sf_snappy        3mn 55s      8.23 GB
tf               1mn 45s      6.44 GB
tf_bz            2mn 14s      1.12 GB
tf_df            2mn 16s      1.12 GB
tf_gz            48s          1.34 GB
tf_lzo           1mn 28s      2.41 GB
tf_snappy        1mn 2s       2.55 GB
rc               1mn 30s      5.78 GB
rc_df            5mn 15s      917.68 MB
rc_gz            4mn 36s      917.80 MB
rc_snappy        52s          1.85 GB
rc_lzo           38s          1.67 GB
serdesf          59s          3.63 GB
serdesf_df       1mn 27s      4.61 GB
serdesf_bz       3h 6mn 9s    9.63 GB
serdesf_gz       1mn 51s      6.02 GB
serdesf_snappy   1mn 35s      4.80 GB
bss              1mn 25s      5.73 GB
b64              2mn 5s       9.17 GB
b64_bz           21mn 15s     1.14 GB
b64_df           21mn 25s     1.14 GB
b64_gz           53s          1.62 GB
b64_snappy       59s          2.89 GB

Some notes

We wish we had run the tests on a larger input data set with a more common file format but cluster time is a scarce resource at the moment. As a result, the file sizes are probably representative but the speed results should be interpreted with care.

The speed reflects importing time only (with rather unusual files as input), not how fast those formats perform in map/reduce jobs.

As part of the test, we also applied the block compression type to the other codecs, but it had no effect, so we concluded that the block type only applies to the default codec in sequence files.

We also ran the serdesf test in a similar mode, using the RCFILE format instead of the sequence file format, but the results were identical to the rc family of queries.

Interpretation

The “tf” query acts as a reference since it stores our data in an uncompressed CSV format. It is somewhat surprising that all the “sequence file” queries generate larger files. This is not something we expected; it may be due to our heavy use of the integer type.

In terms of file size, the “RCFILE” format with the “default” and “gz” compression codecs achieves the best results. The “base64” SerDe used with “sequence file” and the “bz” and “default” compression codecs is not far behind. However, those “base64” runs are slow enough to be ruled out.

In terms of speed, the “RCFILE” format with the “lzo” and “snappy” codecs is very fast while preserving a high compression ratio.

About LZ4

We didn’t test LZ4. From our understanding of HADOOP-7657, the support for LZ4 targets Hadoop versions 0.23.1 and 0.24.0 and is not backported to our running Cloudera CDH3U3. If you are curious about LZ4, here’s an interesting article comparing LZ4 to Snappy. Also of interest is the in-memory benchmark of the fastest compressors.
