Merging multiple files in Hadoop

By David WORMS

Jan 12, 2013

Categories: Hack | Tags: HDFS, File system, Hadoop

This is a command I use to concatenate files stored in Hadoop HDFS matching a globbing expression into a single file. It relies on the “getmerge” utility of hadoop fs but, contrary to a plain “getmerge”, the final merged file isn’t written to the local filesystem but back into HDFS.

Here’s what it looks like:

echo '' > /tmp/test; hadoop fs -getmerge /user/hdfs/source/**/* /tmp/test && cat /tmp/test | hadoop fs -put - /user/hdfs/merged; rm /tmp/test

Here’s what happens. We start by creating an empty temporary file at “/tmp/test”. We then run the “getmerge” command and, once it succeeds, the merged content is piped into the Hadoop “put” command. Notice the “-” just after “-put”, which tells Hadoop to read its content from stdin. Finally, we remove the temporary file.
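The same pipeline can be written as a small commented script, which is easier to read than the one-liner (the paths and the glob are the same illustrative ones as above, adjust them to your layout):

```shell
#!/usr/bin/env bash
# 1. Create an empty temporary file on the local filesystem
echo '' > /tmp/test
# 2. Merge every HDFS file matching the glob into the local temporary
#    file, then, on success, stream it back into HDFS as a single file
#    ("-put -" reads from stdin)
hadoop fs -getmerge '/user/hdfs/source/**/*' /tmp/test \
  && cat /tmp/test | hadoop fs -put - /user/hdfs/merged
# 3. Clean up the temporary file
rm /tmp/test
```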

You can check the result of your command by comparing the size of the source directory with the size of the generated file:

hadoop fs -du -s /user/hdfs/source
hadoop fs -du -s /user/hdfs/merged

You could also use a “cat” based implementation, but globbing was more restrictive in my tests. In both cases, this isn’t efficient: the whole content is downloaded to the client and even stored there temporarily. You could at least avoid the temporary storage if HDFS is mounted locally.
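As a sketch, a “cat” based variant could look like this; it avoids the temporary file by streaming directly from one command to the other (the glob below is an assumption, adjust it to your directory layout):

```shell
# Stream every matching HDFS file to stdout, then write the
# concatenation back into HDFS as a single file ("-put -" reads
# from stdin); the data still transits through the client machine
hadoop fs -cat '/user/hdfs/source/*/*' | hadoop fs -put - /user/hdfs/merged
```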

Recent versions of HDFS ship with a concat functionality, as documented in HDFS-222.
