DISCLAIMER: First off, I want to point out that this is a hack and is not supported by Cloudera. It could be cleaned up and made more elegant, but it was done as a proof of concept to see what was needed to make it work. I am also not a Java developer; I know enough to be dangerous. I am sure a developer or build engineer could clean things up and package it much more cleanly. This method could also lead to issues during the normal Cloudera upgrade process; I have notes below on how to potentially avoid that. This process also does not put Hive into any Maven repository to be included as a dependency in other projects. I would appreciate any comments and feedback on how to make this process better.

Introduction
The Hive 0.11.0 release includes many, if not all, of the features for phase 1 of the Stinger initiative. Most notable for me was the inclusion of many analytical functions available in most SQL implementations. But for those using Cloudera's Distribution of Hadoop (CDH for short), how do we get the new features in Hive? Well, I set out to figure it out. I would not consider this a clean process myself, and it could be made much better for larger, more managed deployments. In addition, I have not tested the implications of Cloudera releasing an update; I believe I have a manual solution for that as well, but again, it is not very polished. I have been running this on a 10-node development cluster for about two weeks now, running queries in both Hue/Beeswax and the Hive CLI, and have not experienced any issues. I have not used Beeline or HiveServer2 yet, so I cannot comment on any issues that might be there. I also have not tested this with Oozie Hive actions, but if you have those working with CDH4.3.0's version of Hive (Hive 0.10.0), then I don't see why they wouldn't work with Hive 0.11.0.

Read More →

Introduction

I have been using Hive, included with Cloudera's Distribution of Hadoop (currently CDH4.3.0), in one form or another for about a year and a half now, and have found it to be a very useful discovery tool for exploring click data stored in Hadoop. However, it lacked the analytical functions available in other SQL implementations and in tools like Aster.

Read More →

The other day I ran across a situation where I was importing an Excel file, and the column where I was expecting an age contained the date of birth (DOB) instead. I could have handled this on the import side, and probably will in the long run; however, there was too much involved and not enough time to get it all working before we needed the data imported. So the search began for an easy way in Excel to convert a DOB to an age.
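The Excel formula itself is behind the link, but the underlying date arithmetic is simple: take the difference in years, then subtract one if the birthday has not yet occurred this year. A minimal sketch of that logic in Python (the function name and the example dates are illustrative, not from the post):

```python
from datetime import date

def age_from_dob(dob, as_of):
    """Age in whole years as of a given date: the year difference,
    minus one if the birthday has not yet occurred in the as_of year."""
    before_birthday = (as_of.month, as_of.day) < (dob.month, dob.day)
    return as_of.year - dob.year - before_birthday

# One day before the 33rd birthday the age is still 32.
print(age_from_dob(date(1980, 6, 15), date(2013, 6, 14)))  # 32
print(age_from_dob(date(1980, 6, 15), date(2013, 6, 15)))  # 33
```

The comparison of `(month, day)` tuples is what keeps the result correct around the birthday boundary, which is the same pitfall a naive year subtraction hits in Excel.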
Read More →

This information is not new, and it can be found in the Flume cookbook. However, these are real-world working examples showing how to send a file from a Flume agent to your collector from the command line. The main use cases I have for this are resending a file of events that was rotated while the agent was down, or sending a file that was created before you started monitoring it.

Read More →

While streaming data to Hadoop with Flume, I have run into several instances where I end up with a large number of 0-byte files.

I am still trying to determine exactly what triggers this, but in the meantime, and for general maintenance and cleanup, here is a simple shell command I have come up with to remove them.
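The actual shell command is behind the link, so I won't guess at it here, but the idea — walk a directory tree and delete any file whose size is 0 — can be sketched like this (shown against a local filesystem for illustration; against HDFS the same filter would run over `hadoop fs -ls -R` output, keying on the size column):

```python
import os

def remove_empty_files(root):
    """Walk root and delete every regular file with size 0.
    Returns the list of paths that were removed."""
    removed = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) == 0:
                os.remove(path)
                removed.append(path)
    return removed
```

Checking the size before deleting is the whole safeguard: non-empty event files in the same landing directory are left untouched.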

Read More →

OBSOLETE: This post is about Flume OG. Unless you have a driving reason to still be using Flume OG, I recommend upgrading to Flume NG 1.4.0. The Flume development community has done a fantastic job rewriting Flume and has added many great improvements.

Lately I have been working with and evaluating Flume as a log aggregation engine for our mid-sized web cluster. Flume is relatively simple to get up and running out of the box, and even easier if you take the time to create your own RPM, which I highly recommend if you are going to be deploying to more than a handful of machines.

Read More →