Hive, Calcite and Druid

Hive, Calcite and Druid

Do you like our work......we hire!

Never miss our publications about Open Source, big data and distributed systems, low frequency of one email every two months.

BI/OLAP requires interactive visualization of complex data streams:

  • Real time bidding events
  • User activity streams
  • Voice call logs
  • Network trafic flows
  • Firewall events
  • Application KPIs

Traditionnal solutions

  • RDBMS (Mysql..): don’t scale, need caching but adhoc queries remain slow
  • Key/value store (HBase…): quick but takes forever to compute (pre-materialization of data)

Context

  • Created in 2011, open-sourced in 2012
  • Built for interactive analytics
  • +150 contributors as of 2016

Main Features

  • Column oriented
  • Batch & real-time ingestion
  • Scalable
  • Sub-second response time for time-based slice-and-dice

Persistent storage

  • Segment files partitionned by date
  • Segment should be around 1GB
  • If larger, use smaller partitions
  • Segment consist of: timestamp column, dimension (eg page, username, city), metrics (for aggregation) and indexes (to speed up the lookup)

Druid + Hive

  • Indexing complex query result in Druid using Hive
  • Introduce a SQL interface to Druid
  • Execute complex operations on Druid data
  • Efficient OLAP queries in Hive
  • Demo: based on Hive 2.20, no release plans yet, rely on Druid 0.9.1.1 and Apache Calcite 1.10.0, (master and awaiting 0.9.2)

Druid data source in Hive

How to index data in Hive, 2 ways

  • Register Druid data sources in Hive when data already in druid
    STORED BY 'org.apache.hadoop.hive.druid.DruidSTorageHandler' TBLPROPERTIES ("druid.datasource" = "wikiticker")
  • Create druid data sources
    STORED BY 'org.apache.hadoop.hive.druid.DruidSTorageHandler' TBLPROPERTIES ("druid.datasource" = "wikiticker" AS SELECT __time, page, user, c_added, c_removed FROM src;
    “_time” is timestamp, “page, user” are dimensions, “c*” are metrics

Need to partition the data by date granularity.

Query Druid data sources from Hive

  • Automatic rewriting with Calcite
  • Use a Druid Input Format to fetch records into the Hive pipeline
  • Will bypass the broker

Roadmap

  • Push more computation to Druid (push down optimization)
  • Parallelize results retrieval (bypass broker)
  • Denormalize start schema
  • Huge perspectives for materialized views in Hive with automatic input query rewriting
Share this article

Canada - Morocco - France

We are a team of Open Source enthusiasts doing consulting in Big Data, Cloud, DevOps, Data Engineering, Data Science…

We provide our customers with accurate insights on how to leverage technologies to convert their use cases to projects in production, how to reduce their costs and increase the time to market.

If you enjoy reading our publications and have an interest in what we do, contact us and we will be thrilled to cooperate with you.

Support Ukrain