Two Hive UDAF to convert an aggregation to a map

Two Hive UDAF to convert an aggregation to a map

I am publishing two new Hive UDAF to help with maps in Apache Hive. The source code is available on GitHub in two Java classes: “UDAFToMap” and “UDAFToOrderedMap” or you can download the jar file. The first function converts an aggregation into a map and is internally using a Java HashMap. The second function extends the first one. It converts an aggregation into an ordered map and is internally using a Java TreeMap.

API

As a reminder, an UDF stands for a User Defined Function and an UDAF stands for User Defined Aggregate Function. When an UDF is a custom scalar function on one or more column of a single row (for example the CONCAT function in SQL), an UDAF work on a aggregation of one or multiple columns (for example the MAX function in SQL).

Both function share the same API. They accept two arguments, first is the key and second is the value. Keys may be any primitive type. Values may be a primitive or a complex type. In Hive, primitive types are integer, boolean, floating point number, string, date and binary while complex types are structure, map and array.

Motivation

While working on time series in Hive, I created those functions out of the following needs:

  • Since UDF are much trivial to implement than UDAF, it is convenient to write and use a UDF which takes the result of a generic UDAF as its input, for example:
    select MY_UDF( MY_UDAF(column_1, column_2) ) from my_table group by column_3.
  • The HBase structure we are modeling use keys to store customer identifiers and columns to store dates. Importing data from Hive to HBase could be done in two ways: with a select statement or with an complicated low-level bulk loading strategy. Using a select statement, since it is making usage of a map and since we couldn’t find any way to convert the result of an aggregation into a map, we had to insert on row per metering data which we suspect to be less efficient than inserting one row of multiple metering data for a single customer. Using a bulk loading strategy, we also faced the need to group our dataset into an ordered map to reflect to final layout of the HBase storage file format.
  • We are experimenting a custom HDFS file format integrated to Hive which internally need to group the metering data of a customer to optimize their storage. We first wrote an implementation which takes an ordered result set as its input. This work but is a much more complicated implementation. Also, because of the nature of HDFS and MapReduce, we had no guaranty that all the data of a single customer would be store in a same file which prevent us from a complete optimization. Once again, structuring the data as a map was a requirement.

Usage

Considering a source data set as follow made of 4 columns (customer id, timestamp, meter value, meter state):

The CSV source is imported into Hive with the next statements:

Now we can declare our two UDAF:

And we can finally used them with the following queries:

The output of the last select statement looks like:

If you have clone our GitHub repository, you can run the above sample with the command mvn install && hive -f sample/to_map.hive.

While writing this article, I also published a JIRA enhancement issue (HIVE-2843) proposing the integration of the source code into Hive. I’ll prepare the patch for its inclusion into Hive if the community express its interest.[/fusion_builder_column][/fusion_builder_row][/fusion_builder_container]

By | 2017-11-21T20:23:13+00:00 March 6th, 2012|Categories: Big Data|0 Comments

About the Author:

Leave A Comment

Time limit is exhausted. Please reload the CAPTCHA.