Adaltas

Apache Hive Essentials How-to by Darren Lee

Recently, I was asked to review a new book on Apache Hive called “Apache Hive Essentials How-to”, written by Darren Lee and published by Packt Publishing. In short, I sincerely recommend it. I focus here on what I liked the most and on the topics I would personally have liked to read about.

Looking at the table of contents, the book covers the essential usages of Hive such as partitioning, writing custom UDFs (User Defined Functions) and UDAFs (User Defined Aggregate Functions), and dealing with SerDes.

Having read a large part of the book, I believe most Hive users will benefit from it, including those who have never been in contact with Hadoop before. It starts with the basic Hive commands to define and manipulate a data set and moves on to more advanced usage in the following chapters.
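To give a flavor of those opening recipes, the basics revolve around statements like the following; the table, columns, and file path here are my own invented example, not taken from the book:

```sql
-- Define a partitioned table (made-up schema for illustration)
CREATE TABLE weblogs (
  ip STRING,
  url STRING,
  hits INT
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Load a local file into a single partition
LOAD DATA LOCAL INPATH '/tmp/weblogs-2013-05-01.tsv'
INTO TABLE weblogs PARTITION (dt = '2013-05-01');

-- Query only that partition; the WHERE clause on the partition
-- column lets Hive prune the other partitions entirely
SELECT url, SUM(hits)
FROM weblogs
WHERE dt = '2013-05-01'
GROUP BY url;
```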

I wish the author had spent some time on how to install Hive with or without Hadoop. I hope I’m not wrong on this one, but Hive doesn’t necessarily need a Hadoop cluster, which can be handy when someone only wishes to test a few things. For those who wish to test Hive inside a Hadoop environment, the virtual machines provided by Hortonworks and Cloudera save a lot of time.

There are a few chapters, like the “Join optimizations” chapter, which I use as a reference when I wish to refresh my memory about which type of join to use and how to express it.
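One such optimization is the map join, where the smaller table of a join is loaded into memory so no reduce phase is needed. A minimal sketch, with invented table names, might look like this:

```sql
-- Let Hive convert eligible joins into map joins automatically
SET hive.auto.convert.join = true;

-- Or request it explicitly with a hint, naming the small table
SELECT /*+ MAPJOIN(c) */ l.url, c.name
FROM weblogs l
JOIN countries c ON (l.country_code = c.code);
```

Which form applies, and the thresholds controlling automatic conversion, depend on the Hive version; the book's chapter goes through the different join strategies in detail.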

I also compliment the author, Darren, for dedicating four entire chapters to user defined functions. This is a must for any user who wishes to implement custom algorithms at the core of the Hive SQL engine.
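Once such a function is compiled into a jar, hooking it into Hive only takes a few statements; the jar path, class, and function name below are hypothetical:

```sql
-- Make the jar containing the compiled UDF class visible to Hive
ADD JAR /tmp/my-udfs.jar;

-- Register the Java class under a SQL-callable name
CREATE TEMPORARY FUNCTION normalize_url
AS 'com.example.hive.udf.NormalizeUrl';

-- Then use it like any built-in function
SELECT normalize_url(url) FROM weblogs LIMIT 10;
```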

My main regret is that the book doesn’t cover the interaction between Hive and external systems such as Oracle, Microsoft Power Pivot, MicroStrategy, … In my opinion, one or two chapters could have been dedicated to this topic. It is crucial when integrating Hadoop and its ecosystem into an existing information system.

Also, it could have been interesting to write a chapter about how to write your own file format. The book already covers SerDe usage, but this isn’t always enough. When you know your data well, you may achieve amazing space savings and speed increases by writing your own. I worked on a case with time series where a custom SerDe was less efficient than an RCFile, while a custom file format exceeded all our expectations. In this project, my solution was to serialize customer time series as vectors with my own compression algorithms.
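The distinction shows up in the table DDL: a SerDe only changes how rows are parsed, while a custom file format replaces the InputFormat/OutputFormat classes that read and write the bytes. A sketch of both, where the time-series class names are hypothetical (the RegexSerDe shown ships with Hive contrib):

```sql
-- A SerDe changes row parsing only; storage stays plain text
CREATE TABLE raw_logs (ip STRING, request STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "(\\S+) (.+)")
STORED AS TEXTFILE;

-- A custom file format plugs in at the storage layer instead,
-- through its own InputFormat/OutputFormat implementations
CREATE TABLE timeseries (customer_id STRING, points BINARY)
STORED AS
  INPUTFORMAT 'com.example.hive.TimeSeriesInputFormat'
  OUTPUTFORMAT 'com.example.hive.TimeSeriesOutputFormat';
```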
