Merging multiple files in Hadoop

This is a command I use to concatenate files stored in Hadoop HDFS that match a glob expression into a single file. It relies on the “getmerge” utility of “hadoop fs” but, unlike “getmerge” on its own, the final merged file ends up inside HDFS rather than on the local filesystem.

Here’s what it looks like:

echo '' > /tmp/test; hadoop fs -getmerge /user/hdfs/source/*/ /tmp/test && cat /tmp/test | hadoop fs -put - /user/hdfs/merged; rm /tmp/test

Here’s what happens. We start by creating a temporary file at “/tmp/test”. We then run the “getmerge” command, which downloads the matching files and concatenates them into that temporary file. Once it has finished, the content of the file is piped into the Hadoop “put” command. Notice the “-” just after “-put”, which tells Hadoop to read its content from stdin. Finally, we remove the temporary file.
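As a minimal sketch, the same getmerge-then-put sequence can be wrapped in a shell function. The name “merge_into_hdfs” is hypothetical (it is not a Hadoop command), and the use of “mktemp -d” instead of a fixed “/tmp/test” is my own assumption to avoid collisions between concurrent runs.

```shell
# Sketch only: merge_into_hdfs is a hypothetical helper, not part of Hadoop.
# $1: HDFS source path or glob, $2: HDFS destination file
merge_into_hdfs() {
  local src="$1" dst="$2" tmpdir status=0
  # A private temporary directory replaces the fixed /tmp/test path
  tmpdir="$(mktemp -d)" || return 1
  # Download and concatenate the matching files locally, then send the
  # result back to HDFS; "-" makes "put" read its content from stdin.
  hadoop fs -getmerge "$src" "$tmpdir/merged" \
    && hadoop fs -put - "$dst" < "$tmpdir/merged" \
    || status=$?
  # Remove the temporary directory whether or not the merge succeeded
  rm -rf "$tmpdir"
  return "$status"
}

# Usage, with the paths from the command above. The glob is quoted so that
# Hadoop expands it, not the local shell:
#   merge_into_hdfs '/user/hdfs/source/*/' /user/hdfs/merged
```

The function returns the exit status of whichever Hadoop command failed, so it can be chained with other commands in a script.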

You can check the result of the command by comparing the size of the source directory with that of the generated file:

hadoop fs -du -s /user/hdfs/source
hadoop fs -du -s /user/hdfs/merged
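The comparison can also be scripted. The sketch below assumes that the first column printed by “hadoop fs -du -s” is the size in bytes. The helper names “hdfs_size” and “same_size” are hypothetical.

```shell
# Sketch only: hdfs_size and same_size are hypothetical helper names.
# Print the size reported for a single HDFS path.
hdfs_size() {
  # Assumption: the first column of "hadoop fs -du -s" is the size in bytes
  hadoop fs -du -s "$1" | awk 'NR == 1 { print $1 }'
}

# Succeed only when both HDFS paths report the same size.
same_size() {
  [ "$(hdfs_size "$1")" = "$(hdfs_size "$2")" ]
}

# Usage, with the paths from the commands above:
#   same_size /user/hdfs/source /user/hdfs/merged && echo "sizes match"
```

The sizes are compared as strings, which avoids any numeric precision issue with very large files.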

You could also use a “cat”-based implementation, but globbing was more restrictive in my tests. Either way, this isn’t efficient: the content is downloaded to the local machine and, in the command above, even stored there temporarily. You could possibly avoid the storage step if you have HDFS mounted locally.
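A “cat”-based variant might look like the following sketch. It streams the files through the local machine without writing a temporary file. The function name “merge_with_cat” is hypothetical, and the exact glob forms accepted by “hadoop fs -cat” should be checked against your Hadoop version.

```shell
# Sketch only: merge_with_cat is a hypothetical helper, not part of Hadoop.
# $1: HDFS source path or glob, $2: HDFS destination file
merge_with_cat() {
  # "cat" writes the matching files to stdout one after the other;
  # "put -" reads that stream from stdin and stores it in HDFS.
  hadoop fs -cat "$1" | hadoop fs -put - "$2"
}

# Usage (the quoted glob is expanded by Hadoop, not by the local shell):
#   merge_with_cat '/user/hdfs/source/*/*' /user/hdfs/merged
```

Note that the exit status of a plain pipeline only reflects the “put” side, so a failure of “cat” can go unnoticed.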

The latest versions of HDFS will ship with concat functionality, as documented in HDFS-222.

July 12th, 2013 | Categories: Big Data

About the Author:

Passionate about programming, data and entrepreneurship, I help shape Adaltas into a team of talented engineers who share their skills and experience.
