Blog, last published articles

Apache Hive Essentials How-to by Darren Lee

Recently, I was asked to review a new book on Apache Hive called "Apache Hive Essentials How-to", written by Darren Lee and published by Packt Publishing. In short, I sincerely recommend it. I focus here on what I liked the most and on the things I would personally have liked to read about. Looking [...]

2018-06-05T22:37:12+00:00 | July 23rd, 2013 | Categories: Big Data | 0 Comments

Splitting HDFS file into multiple hive tables

I am going to show how to split a file stored as CSV inside HDFS into multiple Hive tables based on the content of each record. The context is simple. We are using Flume to collect logs from all over our datacenter through syslog. The stream is dumped into HDFS files partitioned by minute. Oozie [...]

2018-06-05T22:37:13+00:00 | July 15th, 2013 | Categories: Big Data | 0 Comments

Options to connect and integrate Hadoop with Oracle

I will list the different tools and libraries available to us developers for integrating Oracle and Hadoop. The Oracle SQL Connector for HDFS described below is covered in more detail in a follow-up article. To summarize, we have Sqoop originally from Cloudera and now part of Apache, a Sqoop plugin from MapQuest [...]

2018-06-05T22:37:13+00:00 | July 15th, 2013 | Categories: Big Data | 0 Comments

Testing the Oracle SQL Connector for Hadoop HDFS

Using Oracle SQL Connector for HDFS, you can use Oracle Database to access and analyze data residing in HDFS files or a Hive table. You can also query and join data in HDFS or a Hive table with other database-resident data. If required, you can also load data into the database using SQL. For an [...]

2018-06-05T22:37:14+00:00 | July 15th, 2013 | Categories: Big Data | 0 Comments

Merging multiple files in hadoop

This is a command I used to concatenate the files stored in Hadoop HDFS that match a globbing expression into a single file. It uses the "getmerge" utility of "hadoop fs" but, contrary to "getmerge", the final merged file isn't written to the local filesystem but to HDFS. Here's what it looks like: echo '' > [...]

2017-11-21T20:13:25+00:00 | July 12th, 2013 | Categories: Big Data | 0 Comments

Maven 3 behind a proxy

Maven 3 isn't so different from its predecessor, version 2. You will migrate most of your projects quite easily between the two versions. That wasn't the case a few years ago between versions 1 and 2. However, it took me some time to find out how to properly configure my proxy settings and this article [...]

2017-11-21T20:16:01+00:00 | July 11th, 2013 | Categories: Hack | 0 Comments

The state of Hadoop distributions

Apache Hadoop is of course available for download from its official webpage. However, downloading and installing the several components that make up a Hadoop cluster is a daunting task. Below is a list of the main distributions that include Hadoop. This follows an article published a few days ago about [...]

2018-06-05T22:37:15+00:00 | July 11th, 2013 | Categories: Big Data | 0 Comments

Definitions of machine learning algorithms present in Apache Mahout

Apache Mahout is a machine learning library built for scalability. Its core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. It contains various algorithms, which we define below. Each of them may have multiple implementations. A majority but not all of the [...]

2018-06-05T22:37:16+00:00 | July 8th, 2013 | Categories: Big Data | 0 Comments