Hive

Running Apache Hive 3, new features and tips and tricks

Apache Hive 3 brings a bunch of new and nice features to the data warehouse. Unfortunately, like many major FOSS releases, it comes with a few bugs and not much documentation. It has been available since July 2018 as part of HDP3 (Hortonworks Data Platform version 3). I will first review the new features available with [...]

2019-07-25T22:40:14+00:00 | July 25th, 2019 | Categories: Big Data, DataWorks Summit 2019 | 0 Comments

Publish Spark SQL DataFrame and RDD with Spark Thrift Server

The distributed and in-memory nature of the Spark engine makes it an excellent candidate to expose data to clients that expect low latencies. Dashboards, notebooks, BI studios, and KPI-based reporting tools commonly speak the JDBC/ODBC protocols and are examples of such clients. Spark Thrift Server may be used in various ways. It can run independently as Spark standalone [...]

2019-03-25T14:50:18+00:00 | March 25th, 2019 | Categories: Big Data, Data Engineering | 1 Comment

Clusters and workloads migration from Hadoop 2 to Hadoop 3

Migrating from Hadoop 2 to Hadoop 3 is a hot subject. How do you upgrade your clusters? Which features of the new release may solve current problems and bring new opportunities? How are your current processes impacted? Which migration strategy is the most appropriate for your organization? […]

2018-08-17T09:36:26+00:00 | July 25th, 2018 | Categories: Big Data | 0 Comments

Data Lake ingestion best practices

Creating a Data Lake requires rigor and experience. Here are some good practices around data ingestion both for batch and stream architectures that we recommend and implement with our customers. […]

2018-06-18T09:29:50+00:00 | June 18th, 2018 | Categories: Data Engineering, DevOps | 1 Comment

Essential questions about Time Series

Today, the bulk of Big Data is temporal. We see it in the media and among our customers: smart meters, banking transactions, smart factories, connected vehicles … IoT and Big Data go hand in hand. […]

2019-08-14T23:13:42+00:00 | March 19th, 2018 | Categories: Big Data, Data Engineering | 0 Comments

MariaDB integration with Hadoop

During a workshop with one of our customers, Adaltas identified a potential risk in using MariaDB's High Availability (HA) strategy. Since the customer selected Cloudera's CDH 5 distribution, the reasoning below is based on Cloudera's official documentation. However, it applies to all Hadoop distributions, including Hortonworks. Cloudera lists the various databases supported in HA [...]

2019-08-05T21:03:36+00:00 | July 31st, 2017 | Categories: Big Data, Infrastructure | 0 Comments

Hive Metastore HA with DBTokenStore: Failed to initialize master key

This article describes my little adventure around a startup error with the Hive Metastore. It should be reproducible with any secure installation, meaning with Kerberos, with high availability enabled and with the delegation tokens stored in a database. The Hive version is 1.2, as packaged in the Hortonworks 2.4.2 distribution. Storage for [...]

2019-06-18T21:53:47+00:00 | July 21st, 2016 | Categories: Big Data, DevOps | 0 Comments

Hive, Calcite and Druid

BI/OLAP requires interactive visualization of complex data streams: real-time bidding events, user activity streams, voice call logs, network traffic flows, firewall events, application KPIs. Traditional solutions: RDBMS (MySQL..): don't scale, need caching, but ad hoc queries remain slow; key/value stores (HBase...): quick, but take forever to compute (pre-materialization of data). Context: created in 2011, open-sourced [...]

2019-06-21T22:05:23+00:00 | July 14th, 2016 | Categories: Big Data | 0 Comments

HDFS and Hive storage – comparing file formats and compression methods

A few days ago, we conducted a test to compare various Hive file formats and compression methods. Among those file formats, some are native to HDFS and apply to all Hadoop users. The test suite is composed of similar Hive queries which create a table, optionally set a compression type and load [...]

2019-06-25T10:32:24+00:00 | March 13th, 2012 | Categories: Data Engineering | 0 Comments

Two Hive UDAF to convert an aggregation to a map

I am publishing two new Hive UDAFs to help with maps in Apache Hive. The source code is available on GitHub in two Java classes, “UDAFToMap” and “UDAFToOrderedMap”, or you can download the jar file. The first function converts an aggregation into a map and internally uses a Java HashMap. The second function extends [...]

2019-06-25T10:25:53+00:00 | March 6th, 2012 | Categories: Data Engineering | 0 Comments