Hadoop Streaming for Rapid Prototyping of Distributed Algorithms
Posted: January 4th, 2009 | Author: Brad | Filed under: Programming | Tags: distributed computing, ga, genetic algorithms, hadoop, hadoop streaming, hadoop streaming tutorial, hadoop tutorial, ruby, tutorial | 13 Comments »Note: This article assumes that you know a little about MapReduce, or that if you don’t, you might skim the enclosed links so you know what I’m talking about when I get to the examples, or check out the Hadoop Tutorial. It also assumes that you have Hadoop set up – either clustered or pseudo-clustered – if you’re going to run the examples. Or you can just read along.
Hadoop is a framework (written in Java) that supports distributed computing – specifically Google’s MapReduce algorithm. It also comprises HDFS (the Hadoop Distrubuted File System), which allows you to redundantly store large quantities of data across multiple disconnected disks as if they were a single storage unit. I’ve used Hadoop at two jobs and at home, and it rocks.
It comes with a problem, though, which you may spot in the first paragraph: It’s written in Java. Now, don’t get me wrong – Java’s a great language. But developing software in Java with the most commonly used tools (Eclipse, Ant/Maven, &c) is a monumental pain (and you’re hearing this from an old Visual C++ hand). I’m the first to admit that I should shore up my skills with the Java tools, but even if I were better at it, getting a non-trivial Java project from zero to first runnable build is still about as complex as the invasion of Normandy, and the proliferation of XML config files is just inhumane.
Still, the performance and solidity of Java make it the right choice for a production Hadoop project. But what if you just want to kick around an idea or test an algorithm? Wouldn’t it be nice if you could do that in 2 hours instead of a day and a half? Wouldn’t it be nicer still if you could do it in your language of choice?