BI/OLAP requires interactive visualization of complex data streams:

  • Real time bidding events
  • User activity streams
  • Voice call logs
  • Network trafic flows
  • Firewall events
  • Application KPIs

Traditionnal solutions

  • RDBMS (Mysql..): don’t scale, need caching but adhoc queries remain slow
  • Key/value store (HBase…): quick but takes forever to compute (pre-materialization of data)

Context

  • Created in 2011, open-sourced in 2012
  • Built for interactive analytics
  • +150 contributors as of 2016

Main Features

  • Column oriented
  • Batch & real-time ingestion
  • Scalable
  • Sub-second response time for time-based slice-and-dice

Persistent storage

  • Segment files partitionned by date
  • Segment should be around 1GB
  • If larger, use smaller partitions
  • Segment consist of: timestamp column, dimension (eg page, username, city), metrics (for aggregation) and indexes (to speed up the lookup)

Druid + Hive

  • Indexing complex query result in Druid using Hive
  • Introduce a SQL interface to Druid
  • Execute complex operations on Druid data
  • Efficient OLAP queries in Hive
  • Demo: based on Hive 2.20, no release plans yet, rely on Druid 0.9.1.1 and Apache Calcite 1.10.0, (master and awaiting 0.9.2)

Druid data source in Hive

How to index data in Hive, 2 ways

  • Register Druid data sources in Hive when data already in druid
    STORED BY 'org.apache.hadoop.hive.druid.DruidSTorageHandler'
    TBLPROPERTIES ("druid.datasource" = "wikiticker")
  • Create druid data sources
    STORED BY 'org.apache.hadoop.hive.druid.DruidSTorageHandler'
    TBLPROPERTIES ("druid.datasource" = "wikiticker"
    AS SELECT __time, page, user, c_added, c_removed
    FROM src;

    “__time” is timestamp, “page, user” are dimensions, “c_*” are metrics

Need to partition the data by date granularity.

Query Druid data sources from Hive

  • Automatic rewriting with Calcite
  • Use a Druid Input Format to fetch records into the Hive pipeline
  • Will bypass the broker

Roadmap

  • Push more computation to Druid (push down optimisation)
  • Parallelize results retrieval (bypass broker)
  • Denormalize start schema
  • Huge perspectives for materialized views in Hive with automatic input query rewriting