Articles published in 2013

Merging multiple files in Hadoop

Categories: Hack | Tags: File system, Hadoop, HDFS, Storage

This is a command I used to concatenate the files stored in Hadoop HDFS matching a globbing expression into a single file. It uses the “getmerge” utility but, contrary to “getmerge”, the final…

By David WORMS

Jan 12, 2013

Definitions of machine learning algorithms present in Apache Mahout

Categories: Data Science | Tags: Algorithm, Classification, Hadoop, Mahout, Clustering, Machine Learning

Apache Mahout is a machine learning library built for scalability. Its core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop…

By David WORMS

Mar 8, 2013

Apache Hive Essentials How-to by Darren Lee

Categories: Business Intelligence, Learning | Tags: UDF, Hadoop, Hive, File Format, SQL

Recently, I’ve been asked to review a new book on Apache Hive called “Apache Hive Essentials How-to” (edit: the second edition is now available) written by Darren Lee and published by Packt Publishing…

By David WORMS

Apr 23, 2013

Virtual machines with static IP for your Hadoop development cluster

Categories: Infrastructure | Tags: Ambari, Hortonworks, Red Hat, VirtualBox, VM, VMware, Cloudera, Network

As I am about to install and test Ambari, this article is an opportunity to illustrate how I set up my development environment with multiple virtual machines. Ambari, the deployment and monitoring…

By David WORMS

Feb 27, 2013

The state of Hadoop distributions

Categories: Big Data | Tags: Hortonworks, Intel, Oracle, Hadoop, Cloudera

Apache Hadoop is of course made available for download on its official webpage. However, downloading and installing the several components that make up a Hadoop cluster is not an easy task and is a…

By David WORMS

May 11, 2013

Hadoop development cluster of virtual machines with static IP using VirtualBox

Categories: Infrastructure | Tags: Ambari, Hortonworks, Red Hat, VirtualBox, VM, VMware, Cloudera, Network

A few days ago, I explained how to set up a cluster of virtual machines with static IPs and Internet access suitable to host your Hadoop cluster locally for development. At the time I made use of VMware…

By David WORMS

Mar 14, 2013

Oracle to Apache Hive with the Oracle SQL Connector

Categories: Business Intelligence | Tags: Oracle, HDFS, Hive, Network

In a previous article published last week, I introduced the choices available to connect Oracle and Hadoop. In a follow-up article, I covered the Oracle SQL Connector, its installation and integration…

By David WORMS

May 27, 2013

Oracle and Hive, how data are published?

Categories: Big Data | Tags: Oracle, Hive, Sqoop, Data Lake

In the past few days, I’ve published 3 related articles: a first one covering the option to integrate Oracle and Hadoop, a second one explaining how to install and use the Oracle SQL Connector with…

By David WORMS

Jul 6, 2013

Node CSV version 0.2.7

Categories: Hack | Tags: Pipeline, CoffeeScript, CSV, Node.js

As I release version 0.2.7 of the CSV parser for Node.js, I stop here to drop a few lines about what has made it into this release. We are now using the latest CoffeeScript, which is version 1.4.…

By David WORMS

Jul 9, 2013

Options to connect and integrate Hadoop with Oracle

Categories: Data Engineering | Tags: Database, Java, Oracle, R, RDBMS, Avro, HDFS, Hive, MapReduce, Sqoop, NoSQL, SQL

I will list the different tools and libraries available to us developers in order to integrate Oracle and Hadoop. The Oracle SQL Connector for HDFS described below is covered in a follow-up article…

By David WORMS

May 15, 2013

Maven 3 behind a proxy

Categories: Hack | Tags: Maven, Java, Proxy

Maven 3 isn’t so different from its previous version 2. You will migrate most of your projects quite easily between the two versions. That wasn’t the case a few years ago between versions 1 and…

By David WORMS

Jul 11, 2013

Kerberos and delegation tokens security with WebHDFS

Categories: Cyber Security | Tags: HTTP, HDFS, Big Data, Kerberos

WebHDFS is an HTTP REST server bundled with the latest version of Hadoop. What interests me in this article is to dig into security with the Kerberos and delegation token functionalities. I will cover…

By David WORMS

Jul 25, 2013

About the new BSD license and its difference with other BSD licenses

Categories: Data Governance | Tags: License, Open source

As a non-restrictive Open Source license, the “new BSD license” is a commonly used license across the Node.js community. However, this is only one of the BSD licenses available, alongside the original “BSD…

By David WORMS

Aug 8, 2013

Splitting HDFS files into multiple Hive tables

Categories: Data Engineering | Tags: Flume, Pig, HDFS, Hive, Oozie, Python, SQL

I am going to show how to split a CSV file stored inside HDFS into multiple Hive tables based on the content of each record. The context is simple. We are using Flume to collect logs from all over our…

By David WORMS

Sep 15, 2013

Testing the Oracle SQL Connector for Hadoop HDFS

Categories: Data Engineering | Tags: Database, File system, Oracle, HDFS, CDH, SQL

Using Oracle SQL Connector for HDFS, you can use Oracle Database to access and analyze data residing in HDFS files or a Hive table. You can also query and join data in HDFS or a Hive table with other…

By David WORMS

Jul 15, 2013

Remote connection with SSH

Categories: Cyber Security | Tags: Automation, HTTP, SSH

While I was teaching Big Data and Hadoop, a student asked me about SSH and how to use it. I’ll discuss the protocol and the tools to benefit from it. Lately, I have been automating the deployment of Hadoop clusters…

By David WORMS

Oct 2, 2013

State of the Hadoop open-source ecosystem in early 2013

Categories: Big Data | Tags: Flume, Mesos, Phoenix, Pig, File system, MongoDB, Hadoop, Kafka, Mahout, Consensus, Data Science, File Format, PostgreSQL, Storage

Hadoop is already a large ecosystem and my guess is that 2013 will be the year when it grows even larger. There are some pieces that we no longer need to introduce. ZooKeeper, HBase, Hive, Pig, Flume…

By David WORMS

Jul 8, 2013

Catch 'uncaughtException' error in your mocha test

Categories: Node.js | Tags: DevOps, Mocha, JavaScript, Unit tests

This isn’t the first time I have faced this situation. Today, I finally found the time and energy to look for a solution. In your mocha test, let’s say you need to test an expected “uncaughtException…

By David WORMS

Oct 27, 2013

Tutorial for creating and publishing a new Node.js module

Categories: Front End | Tags: Learning and tutorial, License, Mocha, NPM, Travis CI, CoffeeScript, GitHub, JavaScript, Node.js, Unit tests

In this tutorial, I provide complete instructions for creating a new Node.js module, writing the code in CoffeeScript, publishing it on GitHub, sharing it with other Node.js fellows through NPM…

By David WORMS

Dec 3, 2013

Components for CDH and HDP

Categories: Big Data | Tags: Flume, Hortonworks, Hadoop, Hive, Oozie, Sqoop, Zookeeper, Cloudera, CDH, HDP

I was interested in comparing the different components distributed by Cloudera and Hortonworks. This also gives us an idea of the versions packaged by the two distributions. At the time of this writing…

By David WORMS

Sep 22, 2013

Crawl your website including login form with PhantomJS

Categories: Front End | Tags: Mocha, CoffeeScript, JavaScript, Node.js, Unit tests

With PhantomJS, we start a headless WebKit and pilot it with our own scripts. Said differently, we write a script in JavaScript or CoffeeScript which controls an Internet browser and manipulates the…

By David WORMS

Nov 27, 2013