While streaming data to Hadoop with Flume, I have run into several instances where I end up with a large number of 0 byte files.

I am still trying to determine exactly what triggers this to happen, but in the meantime, and for general maintenance and clean up, here is a simple shell command I have come up with to clean them up.

Run the following command from the terminal on your HDFS namenode, replacing /path/to/flume/data with the HDFS directory you want to clean up (line wrapped for readability):

>$HADOOP_HOME/bin/hadoop fs -lsr /path/to/flume/data | \
grep seq | awk '{ if ($5 == 0) print $8 }' | xargs \
$HADOOP_HOME/bin/hadoop fs -rm

Make sure $HADOOP_HOME is set to your Hadoop installation directory, or replace it with the full path.
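If you would rather see what is going to be removed before actually deleting anything, you can run the same pipeline without the xargs step; this simply prints the matching 0 byte files (a quick sketch, using the same placeholder path as above):

>$HADOOP_HOME/bin/hadoop fs -lsr /path/to/flume/data | \
grep seq | awk '{ if ($5 == 0) print $8 }'

One other note: if there happen to be no matches, xargs will still invoke hadoop fs -rm with no arguments and you will get a usage error; with GNU xargs you can add the -r (--no-run-if-empty) option to skip the call in that case.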

Now, what does the above command do? First, it performs a recursive list of the HDFS path specified. The output is piped to grep to include only the seq files; otherwise we also get a listing of directories, which appear to be 0 bytes as well, and trying to remove a directory with the Hadoop rm command will just result in errors. The grep output is then piped to awk, where we check whether the 5th field (the file size field) is equal to 0 and, if so, print the 8th field (the file path); these are our 0 byte files. Alright, we have now filtered the list of files in HDFS, and we pipe the final output through xargs to the Hadoop rm command to remove each file. Nothing to it. Our HDFS filesystem should now be free of 0 byte files.
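For reference, here is roughly what a line of hadoop fs -lsr output looks like (the file name and timestamp are made up for illustration). Counting the whitespace separated fields, the size is field 5 and the full path is field 8, which is exactly what the awk step above relies on:

-rw-r--r--   3 flume supergroup          0 2012-03-14 09:26 /path/to/flume/data/events.1331715984839.seq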
