In the next few weeks, we will be exploring the storage and analytic of a large generated dataset. This dataset is composed of CRM tables associated to one timeserie table of about 7,000 billiard rows.
Before importing the dataset into Hive, we will be exploring different optimization options expected to impact speed and storage size.
Note, the following advantages and disadvantages of each optimization reflect our current understanding of Hadoop and Hive.
The distribution of the data files in HDFS under hierarchical directories.
- With partition pruning, some queries may only need to read a subset of the files
- Multiply the number of files managed by the namenode
- A query may work against the partition thus requiring to traverse more files. For example a query which aggregates the data of a single customer over a year would be less efficient under a table partitioned by day.
Each file in HDFS is split under different chunks of fixed size to be distributed across the Hadoop cluster. This block size is by default set to 64M but could be increased to 126M or more.
- Less work between M/R tasks
- Decrease the potential load repartition with smaller files (think Hive partitioning)
Maintain an index against a Hive table
- Work on any columns with impacting the HDFS layout
- Still slower than a natural file storing layout (think sorted Clustering)
- Duplication of index for each Hive partition (might not be a problem)
Dispatch data into a defined number of buckets (HDFS files) using a hash key on one or more columns and order the data.
Advantages not yet implemented:
- Bucket pruning when filtering on the bucket column
- Take advantage of the sorted schema
- Join optimization between two tables using the same column hash
- Multiply the number of files managed by the namenode.
Text versus binary and row versus column layout. Hive offers
RCFILEmay benefit better file size and increase speed on joins
SEQUENCEFILEshall better handle map phase referencing multiple columns
Best usage of Hive data type and experiment Hive data structure in the context of timeseries. For example, instead of storing one measure per row, we could store all the measure of a single customer in a Hive structure type of 144 elements for each day (one data every 10mn).
- Drastically reduce the storage size (while increasing speed)
- Increase queries complexity
Leverage Hive SerDe to implement a custom layout. This is similar to the solution described above but pushed one step further.
Find the right equilibrium between IO and CPU usage by testing the various Codecs.
Test various map and reduce tuning properties.
Before we start we need:
- A consequent cluster (around 20 nodes)
- A simplified data schema
- A medium sized data sets (around 1To)
- A few representative queries
- Number of servers: 19 (waiting for 2 more)
- Number of cpus: 320 logical thread
- Number of disk: 124 * 1Go SATA 7200rpm
- Number of rack: 2
- Hadoop version: Cloudera 0.20.2-cdh3u2
The original Hive database schema is:
CREATE DATABASE timeseries; USE timeseries; -- Cities table CREATE TABLE cities ( id_city BIGINT, fk_region INT, zipcode INT, name STRING ) STORED AS SEQUENCEFILE; -- Client table CREATE TABLE client ( id_client BIGINT, fk_city INT, lastname STRING ) STORED AS SEQUENCEFILE; -- Metering table CREATE TABLE metering ( fk_client BIGINT, date_emission INT, date_reception INT, type STRING, index INT, diff INT ) PARTITIONED BY(date_emission_day STRING) CLUSTERED BY(fk_client) SORTED BY(date_emission) INTO 32 BUCKETS STORED AS SEQUENCEFILE;