Timeseries storage in Hadoop and Hive

Timeseries storage in Hadoop and Hive

In the next few weeks, we will be exploring the storage and analytic of a large generated dataset. This dataset is composed of CRM tables associated to one timeserie table of about 7,000 billiard rows.

Before importing the dataset into Hive, we will be exploring different optimization options expected to impact speed and storage size.

Note, the following advantage and disadvantage of each optimization reflect our current understanding of Hadoop and Hive.

Hive partitioning

The distribution of the data files in HDFS under hierarchical directories.

Advantage:

  • With partition pruning, some queries may only need to read a subset of the files

Disadvantage:

  • Multiply the number of files managed by the namenode
  • A query may work against the partition thus requiring to traverse more files. For example a query which aggregates the data of a single customer over a year would be less efficient under a table partitioned by day.

Block size

Each file in HDFS is split under different chunks of fixed size to be distributed across the Hadoop cluster. This block size is by default set to 64M but could be increased to 126M or more.

Advantage:

  • Less work between M/R tasks

Disadvantage:

  • Decrease the potential load repartition with smaller files (think Hive partitioning)

Hive index

Maintain an index against a Hive table

Advantage:

  • Work on any columns with impacting the HDFS layout

Disadvantage:

  • Still slower than a natural file storing layout (think sorted Clustering)
  • Duplication of index for each Hive partition (might not be a problem)

Clustering

Dispatch data into a defined number of buckets (HDFS files) using a hash key on one or more columns and order the data.

Advantage:

  • Sampling

Advantages not yet implemented:

  • Bucket pruning when filtering on the bucket column
  • Take advantage of the sorted schema
  • Join optimization between two tables using the same column hash

Disadvantage:

  • Multiply the number of files managed by the namenode.

Native file formats

Text versus binary and row versus column layout. Hive offers TEXTFILE, SEQUENCEFILE and RCFILE.

Advantage:

  • RCFILE may benefit better file size and increase speed on joins
  • SEQUENCEFILE shall better handle map phase referencing multiple columns

Hive format

Best usage of Hive data type and experiment Hive data structure in the context of timeseries. For example, instead of storing one measure per row, we could store all the measure of a single customer in a Hive structure type of 144 elements for each day (one data every 10mn).

Advantage:

  • Drastically reduce the storage size (while increasing speed)

Disadvantage:

  • Increase queries complexity

Custom file formats

Leverage Hive SerDe to implement a custom layout. This is similar to the solution described above but pushed one step further.

Compression

Find the right equilibrium between IO and CPU usage by testing the various Codecs.

MR framework

Text various map and reduce tuning properties.

Preparation

Before we start we need:

  • a consequent cluster (around 20 nodes)
  • a simplified data schema
  • a medium sized data sets (around 1To)
  • a few representative queries

Cluster

  • Number of servers: 19 (waiting for 2 more)
  • Number of cpus: 320 logical thread
  • Number of disk: 124 * 1Go SATA 7200rpm
  • Number of rack: 2
  • Hadoop version: Cloudera 0.20.2-cdh3u2

Data schema

The original Hive database schema is:

By | 2017-11-21T20:22:06+00:00 January 10th, 2012|Categories: Big Data|0 Comments

About the Author:

Passionate with programming, data and entrepreneurship, I participate in shaping Adaltas to be a team of talented engineers to share our skills and experiences.

Leave A Comment

Time limit is exhausted. Please reload the CAPTCHA.