Merging multiple files in Hadoop
This is a command I used to concatenate the files stored in HDFS that match a glob expression into a single file. It uses the “getmerge” utility of hadoop fs but, unlike plain “getmerge”, the final merged file ends up in HDFS instead of on the local filesystem.
Here’s what it looks like:
echo '' > /tmp/test; hadoop fs -getmerge /user/hdfs/source/**/* /tmp/test & cat /tmp/test | hadoop fs -put - /user/hdfs/merged; rm /tmp/test
Here’s what happens. We start by creating a temporary file at “/tmp/test”. We then run the “getmerge” command in the background (that is what the “&” does) and, at the same time, its generated content is piped into the Hadoop “put” command. Notice the “-” just after “-put”, which tells Hadoop to read its content from stdin. Finally, we remove the temporary file.
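Because the “&” only puts “getmerge” in the background, “cat” may reach the end of “/tmp/test” before the merge has finished writing to it. A more cautious variation, sketched here under a few assumptions, runs the two steps one after the other. The function name hdfs_merge is hypothetical, the staging file comes from mktemp instead of the fixed “/tmp/test”, and the glob is quoted so that Hadoop, not the local shell, expands it. This is not the original one-liner, just one possible way to write it:

```shell
# Hypothetical helper: merge every HDFS file matching a glob into one HDFS file.
# Usage: hdfs_merge '/user/hdfs/source/**/*' /user/hdfs/merged
hdfs_merge() (
    src_glob=$1
    dest=$2
    # Unique local staging file, so concurrent runs do not collide
    tmp=$(mktemp) || exit 1
    # Remove the staging file when this subshell exits, even on failure
    trap 'rm -f "$tmp"' EXIT
    # Download and concatenate first; upload only if the merge succeeded
    hadoop fs -getmerge "$src_glob" "$tmp" &&
        hadoop fs -put - "$dest" < "$tmp"
)
```

The body runs in a subshell (parentheses instead of braces), so the variables and the cleanup trap do not leak into the calling shell.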
You can check the result of the command by comparing the size of the source directory with the size of the generated file:
hadoop fs -du -s /user/hdfs/source
hadoop fs -du -s /user/hdfs/merged
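If you want the comparison done for you, a small sketch could look like the following. The function names hdfs_size and hdfs_same_size are hypothetical, and the sketch assumes that the first column of each “-du -s” output line is the size in bytes, which may differ on your Hadoop version:

```shell
# Hypothetical helper: total size in bytes reported by "-du -s" for an HDFS path.
hdfs_size() {
    # Sum the first column (assumed to be the size in bytes) of every output line
    hadoop fs -du -s "$1" | awk '{ total += $1 } END { printf "%.0f\n", total }'
}

# Hypothetical helper: succeed only when both paths hold the same number of bytes.
# Usage: hdfs_same_size /user/hdfs/source /user/hdfs/merged && echo "sizes match"
hdfs_same_size() {
    [ "$(hdfs_size "$1")" = "$(hdfs_size "$2")" ]
}
```

Matching sizes only show that no bytes were lost or added; they say nothing about the order in which the files were concatenated.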
You could also use a “cat” implementation, but the globbing was more restrictive in my tests. In both cases, this isn’t efficient: you are downloading the content locally and even storing it temporarily. You could possibly avoid the storage part if you have HDFS mounted locally.
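For reference, a “cat” based variant might be sketched as below. The function name hdfs_cat_merge is hypothetical, and the glob you pass may need to be simpler than the one accepted by “getmerge”. In this form nothing is written to the local disk, although every byte still passes through the client machine:

```shell
# Hypothetical helper: stream every HDFS file matching a glob into one HDFS file.
# Usage: hdfs_cat_merge '/user/hdfs/source/*' /user/hdfs/merged
hdfs_cat_merge() {
    # "-cat" writes the matching files to stdout and "-put -" reads stdin,
    # so the data is piped from one command to the other without a temp file
    hadoop fs -cat "$1" | hadoop fs -put - "$2"
}
```

Note that a plain pipeline reports only the exit status of “-put”, so a failure of “-cat” can go unnoticed unless you check for it separately.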
The latest versions of HDFS will ship with concat functionality, as documented in HDFS-222.