Timeseries storage in Hadoop and Hive

By David WORMS

Jan 10, 2012

Categories: Data Engineering | Tags: HDFS, Hive, CRM, File Format, timeseries, Tuning, Hadoop

In the next few weeks, we will be exploring the storage and analysis of a large generated dataset. This dataset is composed of CRM tables associated with one timeseries table of about 7,000 billion rows.

Before importing the dataset into Hive, we will be exploring different optimization options expected to impact speed and storage size.

Note that the following advantages and disadvantages of each optimization reflect our current understanding of Hadoop and Hive.

Hive partitioning

Partitioning distributes the data files in HDFS across hierarchical directories.


Advantages:

  • With partition pruning, some queries may only need to read a subset of the files


Disadvantages:

  • Multiplies the number of files managed by the NameNode
  • A query may span multiple partitions and thus have to traverse more files. For example, a query which aggregates the data of a single customer over a year would be less efficient under a table partitioned by day.
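As a sketch (the table and values below are illustrative, not our final schema), partitioning by day and pruning to a single partition looks like:

```sql
-- Illustrative table: the partition column becomes one HDFS
-- directory per day rather than a stored column.
CREATE TABLE metering_by_day (
    fk_client BIGINT,
    diff INT
)
PARTITIONED BY (date_emission_day STRING);

-- Partition pruning: only the files under
-- date_emission_day=2011-12-01 are scanned.
SELECT SUM(diff)
FROM metering_by_day
WHERE date_emission_day = '2011-12-01';
```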

Block size

Each file in HDFS is split into chunks of a fixed size distributed across the Hadoop cluster. This block size defaults to 64 MB but can be increased to 128 MB or more.


Advantages:

  • Fewer map tasks, hence less overhead between M/R tasks


Disadvantages:

  • Decreases the potential for load distribution with smaller files (think Hive partitioning)
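For reference, the block size is a client-side write setting; a sketch of raising it to 128 MB in hdfs-site.xml (value in bytes, Hadoop 0.20.x property name):

```xml
<!-- hdfs-site.xml: default block size for newly written files -->
<property>
  <name>dfs.block.size</name>
  <value>134217728</value> <!-- 128 MB -->
</property>
```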

Hive index

Maintain an index against a Hive table


Advantages:

  • Works on any column without impacting the HDFS layout


Disadvantages:

  • Still slower than a natural file storage layout (think sorted clustering)
  • Duplication of the index for each Hive partition (might not be a problem)
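A sketch of how such an index could be declared on our metering table (Hive 0.7+ COMPACT handler):

```sql
-- The index data lives in its own table and must be
-- rebuilt after each load into the base table.
CREATE INDEX idx_metering_client
ON TABLE metering (fk_client)
AS 'COMPACT'
WITH DEFERRED REBUILD;

-- Populate (or refresh) the index
ALTER INDEX idx_metering_client ON metering REBUILD;
```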


Hive bucketing

Dispatch the data into a defined number of buckets (HDFS files) using a hash key on one or more columns, optionally sorting the data within each bucket.


Advantages:

  • Sampling

Advantages not yet implemented:

  • Bucket pruning when filtering on the bucket column
  • Take advantage of the sorted schema
  • Join optimization between two tables using the same column hash


Disadvantages:

  • Multiplies the number of files managed by the NameNode
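Sampling, the one advantage already usable, reads a single bucket instead of the full table. A sketch against a table bucketed on fk_client into 32 buckets:

```sql
-- Ensure the hash dispatch is applied when loading the table
SET hive.enforce.bucketing = true;

-- Scan roughly 1/32 of the data: the first fk_client bucket
SELECT COUNT(*)
FROM metering TABLESAMPLE(BUCKET 1 OUT OF 32 ON fk_client);
```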

Native file formats

Text versus binary and row versus column layout. Hive offers TEXTFILE, SEQUENCEFILE and RCFILE.


  • RCFILE may yield smaller files and speed up joins
  • SEQUENCEFILE should better handle map phases that reference multiple columns
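The format is fixed per table at creation time; a minimal sketch of the three options on a reduced schema:

```sql
-- Plain text: row-oriented, human readable, largest footprint
CREATE TABLE metering_text (fk_client BIGINT, diff INT)
STORED AS TEXTFILE;

-- Binary row-oriented container, splittable and compressible
CREATE TABLE metering_seq (fk_client BIGINT, diff INT)
STORED AS SEQUENCEFILE;

-- Column-oriented: reads only the referenced columns
CREATE TABLE metering_rc (fk_client BIGINT, diff INT)
STORED AS RCFILE;
```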

Hive format

Make the best use of Hive data types and experiment with Hive data structures in the context of timeseries. For example, instead of storing one measure per row, we could store all the measures of a single customer for a given day in a Hive structure of 144 elements (one measure every 10 minutes).


Advantages:

  • Drastically reduces the storage size (while increasing speed)


Disadvantages:

  • Increases query complexity
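A sketch of the idea with a Hive ARRAY (the table name is hypothetical):

```sql
-- One row per client per day; measures[0] holds the 00:00 value,
-- measures[143] the 23:50 value (one measure every 10 minutes).
CREATE TABLE metering_daily (
    fk_client BIGINT,
    day STRING,
    measures ARRAY<INT>
);

-- The added complexity: readings are now addressed by array index
SELECT measures[36]   -- the 06:00 reading
FROM metering_daily
WHERE day = '2011-12-01';
```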

Custom file formats

Leverage Hive SerDe to implement a custom layout. This is similar to the solution described above but pushed one step further.
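As a sketch, a custom SerDe plugs in at table creation; the class name below is purely hypothetical:

```sql
-- Rows are parsed and serialized by the custom class instead of
-- Hive's default LazySimpleSerDe
CREATE TABLE metering_custom (fk_client BIGINT, diff INT)
ROW FORMAT SERDE 'com.example.hive.TimeseriesSerDe';
```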


Compression

Find the right balance between IO and CPU usage by testing the various codecs.
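Codecs can be toggled per session to compare the IO/CPU trade-off, for example (property names from Hadoop 0.20 / cdh3):

```sql
-- Compress the final query output
SET hive.exec.compress.output = true;
SET mapred.output.compression.codec = org.apache.hadoop.io.compress.GzipCodec;

-- Compress the intermediate map output (saves network IO for some CPU)
SET hive.exec.compress.intermediate = true;
SET mapred.map.output.compression.codec = org.apache.hadoop.io.compress.DefaultCodec;
```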

MR framework

Test various map and reduce tuning properties.
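A few of the knobs worth testing, settable per Hive session (MR1-era property names):

```sql
-- Number of reduce tasks for the job
SET mapred.reduce.tasks = 32;
-- Map-side sort buffer, in MB
SET io.sort.mb = 200;
-- Heap given to each child task JVM
SET mapred.child.java.opts = -Xmx1024m;
```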


Before we start, we need:

  • A sizable cluster (around 20 nodes)
  • A simplified data schema
  • A medium-sized dataset (around 1 TB)
  • A few representative queries


Our cluster:

  • Number of servers: 19 (waiting for 2 more)
  • Number of CPUs: 320 logical threads
  • Number of disks: 124 × 1 TB SATA 7200 rpm
  • Number of racks: 2
  • Hadoop version: Cloudera 0.20.2-cdh3u2

Data schema

The original Hive database schema is:

USE timeseries;

-- Cities table
CREATE TABLE cities (
    id_city BIGINT,
    fk_region INT,
    zipcode INT,
    name STRING
);

-- Client table
CREATE TABLE client (
    id_client BIGINT,
    fk_city INT,
    lastname STRING
);

-- Metering table
CREATE TABLE metering (
    fk_client BIGINT,
    date_emission INT,
    date_reception INT,
    type STRING,
    index INT,
    diff INT
)
PARTITIONED BY (date_emission_day STRING)
CLUSTERED BY (fk_client) SORTED BY (date_emission) INTO 32 BUCKETS;
