<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Kickass Labs &#187; hadoop tutorial</title>
	<atom:link href="http://www.kickasslabs.com/tag/hadoop-tutorial/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.kickasslabs.com</link>
	<description>We &#9829; code.</description>
	<lastBuildDate>Wed, 28 Dec 2011 16:57:56 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Hadoop Streaming for Rapid Prototyping of Distributed Algorithms</title>
		<link>http://www.kickasslabs.com/2009/01/04/hadoop-streaming-for-rapid-prototyping-of-distributed-algorithms/</link>
		<comments>http://www.kickasslabs.com/2009/01/04/hadoop-streaming-for-rapid-prototyping-of-distributed-algorithms/#comments</comments>
		<pubDate>Sun, 04 Jan 2009 22:44:06 +0000</pubDate>
		<dc:creator>Brad</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[distributed computing]]></category>
		<category><![CDATA[ga]]></category>
		<category><![CDATA[genetic algorithms]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hadoop streaming]]></category>
		<category><![CDATA[hadoop streaming tutorial]]></category>
		<category><![CDATA[hadoop tutorial]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://www.kickasslabs.com/?p=132</guid>
		<description><![CDATA[Note: This article assumes that you know a little about MapReduce, or that if you don&#8217;t, you might skim the enclosed links so you know what I&#8217;m talking about when I get to the examples, or check out the Hadoop &#8230; <a href="http://www.kickasslabs.com/2009/01/04/hadoop-streaming-for-rapid-prototyping-of-distributed-algorithms/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><i><b>Note:</b> This article assumes that you know a little about MapReduce, or that if you don&#8217;t, you might skim the enclosed links so you know what I&#8217;m talking about when I get to the examples, or check out the <a href="http://hadoop.apache.org/core/docs/current/mapred_tutorial.html" title="Hadoop tutorial" target="hadooptutorial">Hadoop Tutorial</a>.  It also assumes that you have Hadoop set up &#8211; either clustered or pseudo-clustered &#8211; if you&#8217;re going to run the examples.  Or you can just read along.</i></p>
<p><a href="http://hadoop.apache.org/" title="Hadoop" target="hadoop">Hadoop</a> is a framework (written in Java) that supports distributed computing &#8211; specifically Google&#8217;s <a href="http://labs.google.com/papers/mapreduce.html" title="MapReduce" target="mapreduce">MapReduce</a> algorithm.  It also comprises <a href="http://hadoop.apache.org/core/docs/current/hdfs_design.html" title="HDFS" target="hdfs">HDFS</a> (the Hadoop Distributed File System), which allows you to redundantly store large quantities of data across multiple disconnected disks as if they were a single storage unit.  I&#8217;ve used Hadoop at two jobs and at home, and it <i>rocks.</i></p>
<p>It comes with a problem, though, which you may spot in the first paragraph:  It&#8217;s written in <i>Java.</i>  Now, don&#8217;t get me wrong &#8211; Java&#8217;s a great language.  But developing software in Java with the most commonly used tools (Eclipse, Ant/Maven, &amp;c) is a monumental pain (and you&#8217;re hearing this from an old Visual C++ hand).  I&#8217;m the first to admit that I should shore up my skills with the Java tools, but even if I were better at it, getting a non-trivial Java project from zero to first runnable build is still about as complex as the invasion of Normandy, and the proliferation of XML config files is just inhumane.</p>
<p>Still, the performance and solidity of Java make it the right choice for a production Hadoop project.  But what if you just want to kick around an idea or test an algorithm?  Wouldn&#8217;t it be nice if you could do that in 2 hours instead of a day and a half?  Wouldn&#8217;t it be nicer still if you could do it in your language of choice?</p>
<p><span id="more-132"></span></p>
<p><a href="http://hadoop.apache.org/core/docs/r0.19.0/streaming.html" title="Hadoop Streaming" target="hadoopstreaming">Hadoop Streaming</a> has made me a very happy man.  Any language that can take data from <code>stdin</code> and give it to <code>stdout</code> can be used to make Hadoop MapReduce jobs.</p>
<p>So of course, I&#8217;m doing mine in Ruby.</p>
<p>As an example, I&#8217;ll transliterate the <strike>trivial</strike> canonical Hadoop word counting example into Ruby.  This example involves taking a large text, and counting the number of instances of each word it contains.  The mapper takes in rows of text, and emits key-value pairs where the key is a word and the value is the number of times that word has occurred in a given row of text.  It would look something like this:</p>
<p><b>word_count_mapper.rb:</b></p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;"><span style="color:#008000; font-style:italic;">#!/usr/bin/env ruby</span>
&nbsp;
STDIN.<span style="color:#9900CC;">each_line</span> <span style="color:#9966CC; font-weight:bold;">do</span> <span style="color:#006600; font-weight:bold;">|</span>line<span style="color:#006600; font-weight:bold;">|</span>
  word_count = <span style="color:#006600; font-weight:bold;">&#123;</span><span style="color:#006600; font-weight:bold;">&#125;</span>
  line.<span style="color:#CC0066; font-weight:bold;">split</span>.<span style="color:#9900CC;">each</span> <span style="color:#9966CC; font-weight:bold;">do</span> <span style="color:#006600; font-weight:bold;">|</span>word<span style="color:#006600; font-weight:bold;">|</span>
    word_count<span style="color:#006600; font-weight:bold;">&#91;</span>word<span style="color:#006600; font-weight:bold;">&#93;</span> <span style="color:#006600; font-weight:bold;">||</span>= <span style="color:#006666;">0</span>
    word_count<span style="color:#006600; font-weight:bold;">&#91;</span>word<span style="color:#006600; font-weight:bold;">&#93;</span> <span style="color:#006600; font-weight:bold;">+</span>= <span style="color:#006666;">1</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  word_count.<span style="color:#9900CC;">each</span> <span style="color:#9966CC; font-weight:bold;">do</span> <span style="color:#006600; font-weight:bold;">|</span>k,v<span style="color:#006600; font-weight:bold;">|</span>
    <span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;#{k}<span style="color:#000099;">\t</span>#{v}&quot;</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
<span style="color:#9966CC; font-weight:bold;">end</span></pre></div></div>

<p>Already, we&#8217;re at our first non-trivial design decision, and it&#8217;s due to a significant design difference between Hadoop Streaming and regular Hadoop jobs:  When you&#8217;re streaming, a single instance of your script will handle many pieces of input &#8211; it&#8217;s a bit like a filter-style Unix command (e.g., <code>grep</code>).  A regular Hadoop mapping job is more firmly seated in the functional paradigm &#8211; each call to your mapper gets one line of data, with no knowledge of others.  Because of this, <a href="http://www.raja-gopal.com/?p=42" title="Raja Gopal on Hadoop Streaming with Ruby" target="gopal">some tutorials</a> will suggest shortcuts such as keeping the hash-based accumulator outside the <code>STDIN</code> loop, and emitting all your rows of <code>[word, count]</code> pairs at the end of processing.  There are advantages to this, but I&#8217;m going to stick with my code above, for two reasons:</p>
<ol>
<li>If you maintain global state in your mapper, you&#8217;ll incur significant rework if you try to port the code to a more purely functional Java Hadoop mapper.</li>
<li>A global accumulator could get damagingly memory-intensive on large data sets, since it has to hold every distinct key the mapper sees.  Going functional and streaming every row straight to disk will add some storage cost, but for large jobs I&#8217;m much more worried about RAM than disk.</li>
</ol>
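<p>To make the trade-off concrete, here is a minimal sketch of both strategies side by side.  The method names <code>map_per_line</code> and <code>map_with_accumulator</code> and the sample text are made up for illustration, and <code>StringIO</code> stands in for <code>STDIN</code> and <code>STDOUT</code> so the snippet runs without Hadoop:</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;">#!/usr/bin/env ruby
require 'stringio'

# Per-line strategy (the mapper above): the hash is rebuilt for every line,
# so memory use is bounded by the number of distinct words in one line.
def map_per_line(input, output)
  input.each_line do |line|
    counts = Hash.new(0)
    line.split.each { |word| counts[word] += 1 }
    counts.each { |word, count| output.puts "#{word}\t#{count}" }
  end
end

# Accumulator strategy (the shortcut): one hash outlives the loop, so memory
# use grows with the number of distinct words in everything this mapper reads.
def map_with_accumulator(input, output)
  counts = Hash.new(0)
  input.each_line do |line|
    line.split.each { |word| counts[word] += 1 }
  end
  counts.each { |word, count| output.puts "#{word}\t#{count}" }
end

# Illustrative sample input, not taken from a real data set.
sample = "a b a\nb c\n"

per_line = StringIO.new
map_per_line(StringIO.new(sample), per_line)

accumulated = StringIO.new
map_with_accumulator(StringIO.new(sample), accumulated)

puts "per line:    #{per_line.string.inspect}"
puts "accumulated: #{accumulated.string.inspect}"</pre></div></div>

<p>The per-line version emits <code>b</code> twice and leaves the summing to the reducer.  The accumulator version emits each word once, but only by holding every distinct word in memory until its input is exhausted.</p>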
<p>The purpose of the reduce step is to bring together all the per-line counts for each word, and reduce them to a single, global count per word.  This is facilitated by the fact that Hadoop orders all the records by key before handing them to the reducer (where by default the key is everything before the first tab character &#8211; in this case, the word being counted).  My version of the reducer looks like this:</p>
<p><b>word_count_reducer.rb:</b></p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;"><span style="color:#008000; font-style:italic;">#!/usr/bin/env ruby</span>
&nbsp;
current_word = <span style="color:#0000FF; font-weight:bold;">nil</span>
current_count = <span style="color:#006666;">0</span>
STDIN.<span style="color:#9900CC;">each_line</span> <span style="color:#9966CC; font-weight:bold;">do</span> <span style="color:#006600; font-weight:bold;">|</span>line<span style="color:#006600; font-weight:bold;">|</span>
  word, count = line.<span style="color:#9900CC;">strip</span>.<span style="color:#CC0066; font-weight:bold;">split</span>
  <span style="color:#9966CC; font-weight:bold;">if</span> word != current_word
    <span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;#{current_word}<span style="color:#000099;">\t</span>#{current_count}&quot;</span> <span style="color:#9966CC; font-weight:bold;">unless</span> current_word.<span style="color:#0000FF; font-weight:bold;">nil</span>?
    current_word = word
    current_count = <span style="color:#006666;">0</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  current_count <span style="color:#006600; font-weight:bold;">+</span>= count.<span style="color:#9900CC;">to_i</span>
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
<span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;#{current_word}<span style="color:#000099;">\t</span>#{current_count}&quot;</span> <span style="color:#9966CC; font-weight:bold;">unless</span> current_word.<span style="color:#0000FF; font-weight:bold;">nil</span>?</pre></div></div>

<p>The whole business with <code>current_count</code> is necessitated by another Hadoop Streaming-vs.-Hadoop quirk:  In a regular Hadoop reducer, one call to the reducing function gets a collection of all rows associated with a particular key.  When you&#8217;re streaming, you&#8217;re getting only one row at a time, but you&#8217;re guaranteed that the rows will be ordered on keys, and that the rows for a particular key will not be split up across reducer instances.  Again, the temptation might be to place a single accumulator outside the main <code>STDIN</code> loop, but I&#8217;m going to stick with my strategy of lean runtime footprint and going straight to storage.  The price for that is keeping track of which key you&#8217;re on.</p>
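<p>If the key bookkeeping feels noisy, one possible alternative is to rebuild the key-plus-collection view of a regular reducer on top of the sorted stream.  The sketch below is a variant under stated assumptions, not the reducer above: <code>each_key_group</code> and <code>reduce_word_counts</code> are hypothetical helpers, <code>StringIO</code> stands in for <code>STDIN</code> and <code>STDOUT</code>, and it buffers every row for the current key, so it trades some memory for readability:</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;">#!/usr/bin/env ruby
require 'stringio'

# Hypothetical helper: yields each key once, together with all of its values.
# Enumerable#chunk groups consecutive rows that share a key, which is only
# correct because the rows arrive sorted by key.
def each_key_group(input)
  input.each_line.chunk { |line| line.split("\t", 2).first }.each do |key, lines|
    values = lines.map { |line| line.chomp.split("\t", 2).last }
    yield key, values
  end
end

# Hypothetical reducer built on the helper: sums the counts for each word.
def reduce_word_counts(input, output)
  each_key_group(input) do |word, counts|
    total = counts.inject(0) { |sum, count| sum + count.to_i }
    output.puts "#{word}\t#{total}"
  end
end

# Illustrative key-sorted rows, shaped like the mapper output after sorting.
sorted_rows = "a\t2\nb\t1\nb\t1\nc\t1\n"
result = StringIO.new
reduce_word_counts(StringIO.new(sorted_rows), result)
puts result.string</pre></div></div>

<p>This only works because the rows arrive sorted by key.  On unsorted input, the same word would show up in several groups and be emitted more than once.</p>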
<p>I have placed the mapper and reducer in a folder called <code>scripts</code> in my <code>$HADOOP_HOME</code> folder.  To run them against an existing file in HDFS, I do:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">bin<span style="color: #000000; font-weight: bold;">/</span>hadoop jar contrib<span style="color: #000000; font-weight: bold;">/</span>streaming<span style="color: #000000; font-weight: bold;">/</span>hadoop-0.19.0-streaming.jar \
  <span style="color: #660033;">-mapper</span> scripts<span style="color: #000000; font-weight: bold;">/</span>word_count_mapper.rb <span style="color: #660033;">-reducer</span> scripts<span style="color: #000000; font-weight: bold;">/</span>word_count_reducer.rb \
  <span style="color: #660033;">-file</span> <span style="color: #000000; font-weight: bold;">`</span><span style="color: #7a0874; font-weight: bold;">pwd</span><span style="color: #000000; font-weight: bold;">`/</span>scripts<span style="color: #000000; font-weight: bold;">/</span>word_count_mapper.rb \
  <span style="color: #660033;">-file</span> <span style="color: #000000; font-weight: bold;">`</span><span style="color: #7a0874; font-weight: bold;">pwd</span><span style="color: #000000; font-weight: bold;">`/</span>scripts<span style="color: #000000; font-weight: bold;">/</span>word_count_reducer.rb \
  <span style="color: #660033;">-input</span> texts<span style="color: #000000; font-weight: bold;">/</span>my_text <span style="color: #660033;">-output</span> word_counts</pre></div></div>

<p>&#8230;and results may be extracted from the <code>word_counts</code> folder in HDFS.</p>
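<p>For a quick sanity check before submitting a job, the same data flow can be imitated in plain Ruby.  This is a rough sketch that assumes the whole input fits in memory: <code>map_line</code>, <code>reduce_rows</code> and <code>simulate_job</code> are hypothetical names, the sample lines are made up, and an in-memory sort stands in for the ordering Hadoop performs between the map and reduce steps:</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;">#!/usr/bin/env ruby

# Same logic as word_count_mapper.rb, returning rows instead of printing them.
def map_line(line)
  counts = Hash.new(0)
  line.split.each { |word| counts[word] += 1 }
  counts.map { |word, count| "#{word}\t#{count}" }
end

# Same logic as word_count_reducer.rb, over an array of key-sorted rows.
def reduce_rows(sorted_rows)
  totals = []
  current_word = nil
  current_count = 0
  sorted_rows.each do |row|
    word, count = row.split("\t", 2)
    if word != current_word
      totals.push("#{current_word}\t#{current_count}") unless current_word.nil?
      current_word = word
      current_count = 0
    end
    current_count += count.to_i
  end
  totals.push("#{current_word}\t#{current_count}") unless current_word.nil?
  totals
end

# Stand-in for the framework: run the mapper, sort by key, run the reducer.
def simulate_job(lines)
  mapped = lines.flat_map { |line| map_line(line) }
  shuffled = mapped.sort_by { |row| row.split("\t", 2).first }
  reduce_rows(shuffled)
end

puts simulate_job(["the quick fox", "the lazy dog", "the fox"])</pre></div></div>

<p>Because the mapper and reducer only ever see rows of text, logic that behaves in this dry run should carry over to the streaming scripts unchanged.  What the dry run cannot show is how the work gets split across machines.</p>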
<p>This is all there is to the trivial example.  There&#8217;s more, of course &#8211; I&#8217;m currently using this technology to prototype a much more ambitious project that I will eventually port to Java (more on that later).  While I&#8217;m finding that there is (at least subjectively) some performance overhead associated with streaming that I don&#8217;t see otherwise, it&#8217;s proving incredibly useful and speedy for testing out algorithms and designs.</p>
<p>Are you working with Hadoop Streaming?  Do you have questions?  Drop a line in the comments &#8211; I&#8217;d love to swap learnings.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.kickasslabs.com/2009/01/04/hadoop-streaming-for-rapid-prototyping-of-distributed-algorithms/feed/</wfw:commentRss>
		<slash:comments>20</slash:comments>
		</item>
	</channel>
</rss>

