<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Kickass Labs &#187; ruby</title>
	<atom:link href="http://www.kickasslabs.com/tag/ruby/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.kickasslabs.com</link>
	<description>We &#9829; code.</description>
	<lastBuildDate>Sun, 04 Jul 2010 03:00:07 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Quick Hits: Setting the User Agent Header in Webrat</title>
		<link>http://www.kickasslabs.com/2009/03/31/quick-hits-setting-the-user-agent-header-in-webrat/</link>
		<comments>http://www.kickasslabs.com/2009/03/31/quick-hits-setting-the-user-agent-header-in-webrat/#comments</comments>
		<pubDate>Tue, 31 Mar 2009 16:35:51 +0000</pubDate>
		<dc:creator>Brad</dc:creator>
				<category><![CDATA[Rails]]></category>
		<category><![CDATA[Testing]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[cucumber]]></category>
		<category><![CDATA[rspec]]></category>
		<category><![CDATA[webrat]]></category>

		<guid isPermaLink="false">http://www.kickasslabs.com/?p=248</guid>
		<description><![CDATA[If you&#8217;ve read the new PragProg beta e-book on RSpec, you may have read that you can set HTTP headers for your Webrat request like so:

Given /^I am browsing the site using Safari$/ do
  header &#34;User-Agent&#34; , &#34;Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6; en-us)&#34;
end

Like me, you may have found out the hard [...]]]></description>
			<content:encoded><![CDATA[<p>If you&#8217;ve read the new <a href="http://pragprog.com/" title="Pragmatic Programmers" target="pragprog">PragProg</a> <a href="http://www.pragprog.com/titles/achbd/the-rspec-book" title="The RSpec Book beta ebook" target="pragprog">beta e-book on RSpec</a>, you may have read that you can set HTTP headers for your Webrat request like so:</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;">Given <span style="color:#006600; font-weight:bold;">/</span>^I am browsing the site using Safari$<span style="color:#006600; font-weight:bold;">/</span> <span style="color:#9966CC; font-weight:bold;">do</span>
  header <span style="color:#996600;">&quot;User-Agent&quot;</span> , <span style="color:#996600;">&quot;Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6; en-us)&quot;</span>
<span style="color:#9966CC; font-weight:bold;">end</span></pre></div></div>

<p>Like me, you may have found out the hard way that this doesn&#8217;t work.  Webrat does not automagically apply these new HTTP headers to your request &#8211; they certainly don&#8217;t make it to my controller.  What worked for me:</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;">Given <span style="color:#006600; font-weight:bold;">/</span>^I am browsing the site using Safari$<span style="color:#006600; font-weight:bold;">/</span> <span style="color:#9966CC; font-weight:bold;">do</span>
  headers<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">&quot;User-Agent&quot;</span><span style="color:#006600; font-weight:bold;">&#93;</span> = <span style="color:#996600;">&quot;Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6; en-us)&quot;</span>
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
<span style="color:#9966CC; font-weight:bold;">When</span> <span style="color:#006600; font-weight:bold;">/</span>^I visit my precious site$<span style="color:#006600; font-weight:bold;">/</span> <span style="color:#9966CC; font-weight:bold;">do</span>
  get <span style="color:#996600;">'/my/precious/path'</span>, my_query_string, headers
<span style="color:#9966CC; font-weight:bold;">end</span></pre></div></div>

<p>In the code above, <code>headers</code> is a method call that returns all the HTTP headers for your request.  Just tack <code>headers</code> on as the third argument of your request, and you&#8217;re good to go.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.kickasslabs.com/2009/03/31/quick-hits-setting-the-user-agent-header-in-webrat/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Playing with JRuby, Part 1A</title>
		<link>http://www.kickasslabs.com/2009/02/18/playing-with-jruby-part-1a/</link>
		<comments>http://www.kickasslabs.com/2009/02/18/playing-with-jruby-part-1a/#comments</comments>
		<pubDate>Wed, 18 Feb 2009 15:59:08 +0000</pubDate>
		<dc:creator>Brad</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[jruby]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[ruby]]></category>

		<guid isPermaLink="false">http://www.kickasslabs.com/?p=164</guid>
		<description><![CDATA[First, thanks to Lopex, Charles/Headius, and Thomas for their well-coordinated effort to alleviate some of my JRuby ignorance in my earlier post. And for not going out of their way to make fun of my janky &#8220;benchmarks&#8221; &#8211; like I said, those weren&#8217;t meant to be real benchmarks, just something to give me an idea [...]]]></description>
			<content:encoded><![CDATA[<p>First, thanks to Lopex, Charles/Headius, and Thomas for their well-coordinated effort to alleviate some of my JRuby ignorance in <a href="http://www.kickasslabs.com/2009/02/16/playing-with-jruby-part-1/" title="Playing with JRuby, Part 1">my earlier post</a>. And for not going out of their way to make fun of my janky &#8220;benchmarks&#8221; &#8211; like I said, those weren&#8217;t meant to be real benchmarks, just something to give me an idea of relative performance with some operations I care about.</p>
<p>In that same spirit, I took Lopex&#8217;s advice and turned on the <code>--server</code> flag; the speedup over stock Ruby went from roughly 1.7-to-1 to roughly 2.7-to-1, with no other changes or optimizations.</p>
<p>Before:</p>
<pre>
[~/project/jrubytest/primes]> ruby ./primes.rb
FINDING first 100000 PRIMES
1299709
Time: 57.716853s
[~/project/jrubytest/primes]> jruby ./primes.rb
FINDING first 100000 PRIMES
1299709
Time: 33.885273s
</pre>
<p>After:</p>
<pre>
[~/project/jrubytest/primes]> ruby primes.rb
FINDING first 100000 PRIMES
1299709
Time: 57.204178s
[~/project/jrubytest/primes]> jruby --server primes.rb
FINDING first 100000 PRIMES
1299709
Time: 21.067204s
</pre>
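<p>For reference, the speedup figures can be recomputed from the transcripts above. This is nothing more than arithmetic on the four printed timings; the variable names are my own.</p>

```ruby
# Timings copied from the transcripts above, in seconds.
before = { :ruby => 57.716853, :jruby => 33.885273 } # default JVM settings
after  = { :ruby => 57.204178, :jruby => 21.067204 } # jruby run with --server

# Speedup = time under stock Ruby divided by time under JRuby.
speedup_before = before[:ruby] / before[:jruby]
speedup_after  = after[:ruby] / after[:jruby]

puts "default JVM: #{speedup_before.round(2)}x"
puts "--server:    #{speedup_after.round(2)}x"
```

<p>These are single runs on one machine, so treat the ratios as a rough indication rather than a measurement.</p>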
<p>Thanks to the JRuby crew for stopping by here, and for being so helpful generally.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.kickasslabs.com/2009/02/18/playing-with-jruby-part-1a/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Playing with JRuby, Part 1</title>
		<link>http://www.kickasslabs.com/2009/02/16/playing-with-jruby-part-1/</link>
		<comments>http://www.kickasslabs.com/2009/02/16/playing-with-jruby-part-1/#comments</comments>
		<pubDate>Mon, 16 Feb 2009 05:37:12 +0000</pubDate>
		<dc:creator>Brad</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[jruby]]></category>
		<category><![CDATA[ruby]]></category>

		<guid isPermaLink="false">http://www.kickasslabs.com/?p=157</guid>
		<description><![CDATA[As part of a new project, I&#8217;m experimenting with JRuby, to see (a) if the alleged performance gains are all that and (b) see how far the Java integration &#8211; especially interface implementation &#8211; can be pushed.
On the performance front, I did a quick-and-dirty test with a CPU-expensive script that finds prime numbers &#8211; not [...]]]></description>
			<content:encoded><![CDATA[<p>As part of a new project, I&#8217;m experimenting with <a title="JRuby" href="http://jruby.codehaus.org/">JRuby</a>, to see (a) whether the alleged performance gains are all that and (b) how far the Java integration &#8211; especially interface implementation &#8211; can be pushed.</p>
<p>On the performance front, I did a quick-and-dirty test with a CPU-expensive script that finds prime numbers &#8211; not a real benchmark, but something to give me an idea of how JRuby held up under a steady load of basic math operations.  Details follow.</p>
<p><span id="more-157"></span></p>
<p><em><strong>Disclaimer:</strong> I know this is not the optimal algorithm for finding prime numbers. The point here is not to find prime numbers, but to tax my CPU predictably.  If you post a comment with the general intent of HEY FINDING PRIMES UR DOIN IT WRONG, I&#8217;ll just delete it.</em></p>
<p>The CPU is an Intel Core Duo in an old Mac Mini.  The Ruby and JRuby versions under consideration were:</p>
<pre>[~/project/jrubytest/primes]&gt; ruby -v ; jruby -v
ruby 1.8.6 (2007-03-13 patchlevel 0) [i686-darwin8.9.1]
jruby 1.1.6 (ruby 1.8.6 patchlevel 114) (2008-12-17 rev 8388) [i386-java]</pre>
<p>The code for the test was:</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;"><span style="color:#008000; font-style:italic;">#!/usr/bin/env ruby</span>
&nbsp;
limit = <span style="color:#006600; font-weight:bold;">&#40;</span>ARGV<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006666;">0</span><span style="color:#006600; font-weight:bold;">&#93;</span> <span style="color:#006600; font-weight:bold;">||</span> <span style="color:#006666;">100000</span><span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">to_i</span>
<span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;FINDING first #{limit} PRIMES&quot;</span>
&nbsp;
t0 = <span style="color:#CC00FF; font-weight:bold;">Time</span>.<span style="color:#9900CC;">now</span>
primes = <span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006666;">2</span>, <span style="color:#006666;">3</span><span style="color:#006600; font-weight:bold;">&#93;</span>
counter = <span style="color:#006666;">2</span> <span style="color:#008000; font-style:italic;"># starting with 2 and 3</span>
&nbsp;
currnum = primes.<span style="color:#9900CC;">last</span>
<span style="color:#9966CC; font-weight:bold;">while</span> counter <span style="color:#006600; font-weight:bold;">&lt;</span> limit
  currnum <span style="color:#006600; font-weight:bold;">+</span>= <span style="color:#006666;">2</span>
  primes.<span style="color:#9900CC;">each</span> <span style="color:#9966CC; font-weight:bold;">do</span> <span style="color:#006600; font-weight:bold;">|</span>p<span style="color:#006600; font-weight:bold;">|</span>
    <span style="color:#9966CC; font-weight:bold;">break</span> <span style="color:#9966CC; font-weight:bold;">if</span> <span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#006666;">0</span> == <span style="color:#006600; font-weight:bold;">&#40;</span>currnum <span style="color:#006600; font-weight:bold;">%</span> <span style="color:#CC0066; font-weight:bold;">p</span><span style="color:#006600; font-weight:bold;">&#41;</span><span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#9966CC; font-weight:bold;">and</span> <span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#CC0066; font-weight:bold;">p</span> <span style="color:#006600; font-weight:bold;">&lt;</span>= <span style="color:#CC00FF; font-weight:bold;">Math</span>.<span style="color:#9900CC;">sqrt</span><span style="color:#006600; font-weight:bold;">&#40;</span>currnum.<span style="color:#9900CC;">to_f</span><span style="color:#006600; font-weight:bold;">&#41;</span><span style="color:#006600; font-weight:bold;">&#41;</span>
    <span style="color:#9966CC; font-weight:bold;">if</span> <span style="color:#CC0066; font-weight:bold;">p</span> <span style="color:#006600; font-weight:bold;">&gt;</span>= <span style="color:#CC00FF; font-weight:bold;">Math</span>.<span style="color:#9900CC;">sqrt</span><span style="color:#006600; font-weight:bold;">&#40;</span>currnum.<span style="color:#9900CC;">to_f</span><span style="color:#006600; font-weight:bold;">&#41;</span>
      primes <span style="color:#006600; font-weight:bold;">&lt;&lt;</span> currnum
      counter <span style="color:#006600; font-weight:bold;">+</span>= <span style="color:#006666;">1</span>
      <span style="color:#9966CC; font-weight:bold;">break</span>
    <span style="color:#9966CC; font-weight:bold;">end</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
<span style="color:#CC0066; font-weight:bold;">puts</span> primes.<span style="color:#9900CC;">last</span>
<span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;Time: #{Time.now - t0}s&quot;</span></pre></div></div>

<p>The results showed just shy of a factor of 2 gain using JRuby (which is consistent with other tests I&#8217;ve run since writing this up):</p>
<pre>[~/project/jrubytest/primes]&gt; ruby ./primes.rb
FINDING first 100000 PRIMES
1299709
Time: 57.716853s
[~/project/jrubytest/primes]&gt; jruby ./primes.rb
FINDING first 100000 PRIMES
1299709
Time: 33.885273s</pre>
<p>I took these measurements after making sure I had plenty of free RAM so I wasn&#8217;t I/O bound; the difference you&#8217;re seeing is all CPU.  Note also that the timers are <em>inside</em> the Ruby code &#8211; they don&#8217;t account for JVM startup time.</p>
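<p>If you want to repeat this kind of in-process timing, here is a minimal sketch that uses the standard library&#8217;s <code>Benchmark.realtime</code> instead of subtracting <code>Time.now</code> values. The <code>first_primes</code> method is my own restatement of the same trial-division idea, not the script that produced the numbers above, and the limit of 1000 is just an assumption to keep the run short.</p>

```ruby
require 'benchmark'

# Hypothetical helper: returns the first `limit` primes by trial division
# against the primes found so far, stopping at the square root.
def first_primes(limit)
  primes = [2, 3]
  candidate = primes.last
  until primes.size >= limit
    candidate += 2
    root = Math.sqrt(candidate)
    # First prime that either divides the candidate or exceeds its square root.
    divisor = primes.find { |p| p > root || candidate % p == 0 }
    primes.push(candidate) if divisor.nil? || divisor > root
  end
  primes.first(limit)
end

primes = nil
# Like the original timers, this measures only the work inside the block,
# so interpreter or JVM startup is not included.
elapsed = Benchmark.realtime { primes = first_primes(1000) }

puts primes.last
puts "Time: #{elapsed}s"
```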
<p>Just for laughs, let&#8217;s see what the difference is with JVM overhead taken into account:</p>
<pre>[~/project/jrubytest/primes]&gt; date ; ruby ./primes.rb ; date ; jruby ./primes.rb ; date
Sun Feb 15 22:26:36 EST 2009
FINDING first 100000 PRIMES
1299709
Time: 54.065758s
Sun Feb 15 22:27:30 EST 2009
FINDING first 100000 PRIMES
1299709
Time: 34.427694s
Sun Feb 15 22:28:05 EST 2009</pre>
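<p>So startup overhead for JRuby is on the order of a second: about 35 seconds of wall-clock time against 34.4 seconds measured inside the script, keeping in mind that <code>date</code> only resolves to whole seconds.  Not too shabby.</p>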
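<p>The overhead estimate is just the wall-clock time between the two <code>date</code> stamps around the JRuby run, minus the time the script reported for itself. A quick sketch of that subtraction, with a helper name of my own choosing:</p>

```ruby
# Hypothetical helper: seconds since midnight for an h:m:s stamp from `date`.
def seconds_of_day(hours, minutes, seconds)
  hours * 3600 + minutes * 60 + seconds
end

jruby_started  = seconds_of_day(22, 27, 30) # stamp printed before the jruby run
jruby_finished = seconds_of_day(22, 28, 5)  # stamp printed after the jruby run

wall_clock = jruby_finished - jruby_started # whole seconds only
in_script  = 34.427694                      # "Time:" line printed by primes.rb

overhead = wall_clock - in_script
puts "wall clock: #{wall_clock}s, in-script: #{in_script}s, overhead: ~#{overhead.round(2)}s"
```

<p>Since each stamp can be off by up to a second, the result is only good to about a second either way.</p>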
<p>Part 2 of this post is going to cover some of the hairier tasks involved in:</p>
<ul>
<li>Subclassing Java classes and implementing Java interfaces.</li>
<li>Packaging JRuby code in JARs with <a title="Rawr" href="http://rawr.rubyforge.org/">Rawr</a></li>
<li>Calling code in those JARs from Java</li>
</ul>
<p>More to come&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.kickasslabs.com/2009/02/16/playing-with-jruby-part-1/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Hadoop Streaming for Rapid Prototyping of Distributed Algorithms</title>
		<link>http://www.kickasslabs.com/2009/01/04/hadoop-streaming-for-rapid-prototyping-of-distributed-algorithms/</link>
		<comments>http://www.kickasslabs.com/2009/01/04/hadoop-streaming-for-rapid-prototyping-of-distributed-algorithms/#comments</comments>
		<pubDate>Sun, 04 Jan 2009 22:44:06 +0000</pubDate>
		<dc:creator>Brad</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[distributed computing]]></category>
		<category><![CDATA[ga]]></category>
		<category><![CDATA[genetic algorithms]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hadoop streaming]]></category>
		<category><![CDATA[hadoop streaming tutorial]]></category>
		<category><![CDATA[hadoop tutorial]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://www.kickasslabs.com/?p=132</guid>
		<description><![CDATA[Note: This article assumes that you know a little about MapReduce, or that if you don&#8217;t, you might skim the enclosed links so you know what I&#8217;m talking about when I get to the examples, or check out the Hadoop Tutorial.  It also assumes that you have Hadoop set up &#8211; either clustered or [...]]]></description>
			<content:encoded><![CDATA[<p><i><b>Note:</b> This article assumes that you know a little about MapReduce, or that if you don&#8217;t, you might skim the enclosed links so you know what I&#8217;m talking about when I get to the examples, or check out the <a href="http://hadoop.apache.org/core/docs/current/mapred_tutorial.html" title="Hadoop tutorial" target="hadooptutorial">Hadoop Tutorial</a>.  It also assumes that you have Hadoop set up &#8211; either clustered or pseudo-clustered &#8211; if you&#8217;re going to run the examples.  Or you can just read along.</i></p>
<p><a href="http://hadoop.apache.org/" title="Hadoop" target="hadoop">Hadoop</a> is a framework (written in Java) that supports distributed computing &#8211; specifically Google&#8217;s <a href="http://labs.google.com/papers/mapreduce.html" title="MapReduce" target="mapreduce">MapReduce</a> programming model.  It also includes <a href="http://hadoop.apache.org/core/docs/current/hdfs_design.html" title="HDFS" target="hdfs">HDFS</a> (the Hadoop Distributed File System), which allows you to redundantly store large quantities of data across multiple disconnected disks as if they were a single storage unit.  I&#8217;ve used Hadoop at two jobs and at home, and it <i>rocks.</i></p>
<p>It comes with a problem, though, which you may spot in the first paragraph:  It&#8217;s written in <i>Java.</i>  Now, don&#8217;t get me wrong &#8211; Java&#8217;s a great language.  But developing software in Java with the most commonly used tools (Eclipse, Ant/Maven, &amp;c) is a monumental pain (and you&#8217;re hearing this from an old Visual C++ hand).  I&#8217;m the first to admit that I should shore up my skills with the Java tools, but even if I were better at it, getting a non-trivial Java project from zero to first runnable build is still about as complex as the invasion of Normandy, and the proliferation of XML config files is just inhumane.</p>
<p>Still, the performance and solidity of Java make it the right choice for a production Hadoop project.  But what if you just want to kick around an idea or test an algorithm?  Wouldn&#8217;t it be nice if you could do that in 2 hours instead of a day and a half?  Wouldn&#8217;t it be nicer still if you could do it in your language of choice?</p>
<p><span id="more-132"></span></p>
<p><a href="http://hadoop.apache.org/core/docs/r0.19.0/streaming.html" title="Hadoop Streaming" target="hadoopstreaming">Hadoop Streaming</a> has made me a very happy man.  Any language that can take data from <code>stdin</code> and give it to <code>stdout</code> can be used to make Hadoop MapReduce jobs.</p>
<p>So of course, I&#8217;m doing mine in Ruby.</p>
<p>As an example, I&#8217;ll transliterate the <strike>trivial</strike> canonical Hadoop word counting example into Ruby.  This example involves taking a large text, and counting the number of instances of each word it contains.  The mapper takes in rows of text, and emits key-value pairs where the key is a word and the value is the number of times that word has occurred in a given row of text.  It would look something like this:</p>
<p><b>word_count_mapper.rb:</b></p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;"><span style="color:#008000; font-style:italic;">#!/usr/bin/env ruby</span>
&nbsp;
STDIN.<span style="color:#9900CC;">each_line</span> <span style="color:#9966CC; font-weight:bold;">do</span> <span style="color:#006600; font-weight:bold;">|</span>line<span style="color:#006600; font-weight:bold;">|</span>
  word_count = <span style="color:#006600; font-weight:bold;">&#123;</span><span style="color:#006600; font-weight:bold;">&#125;</span>
  line.<span style="color:#CC0066; font-weight:bold;">split</span>.<span style="color:#9900CC;">each</span> <span style="color:#9966CC; font-weight:bold;">do</span> <span style="color:#006600; font-weight:bold;">|</span>word<span style="color:#006600; font-weight:bold;">|</span>
    word_count<span style="color:#006600; font-weight:bold;">&#91;</span>word<span style="color:#006600; font-weight:bold;">&#93;</span> <span style="color:#006600; font-weight:bold;">||</span>= <span style="color:#006666;">0</span>
    word_count<span style="color:#006600; font-weight:bold;">&#91;</span>word<span style="color:#006600; font-weight:bold;">&#93;</span> <span style="color:#006600; font-weight:bold;">+</span>= <span style="color:#006666;">1</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  word_count.<span style="color:#9900CC;">each</span> <span style="color:#9966CC; font-weight:bold;">do</span> <span style="color:#006600; font-weight:bold;">|</span>k,v<span style="color:#006600; font-weight:bold;">|</span>
    <span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;#{k}<span style="color:#000099;">\t</span>#{v}&quot;</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
<span style="color:#9966CC; font-weight:bold;">end</span></pre></div></div>

<p>Already, we&#8217;re at our first non-trivial design decision, and it&#8217;s due to a significant design difference between Hadoop Streaming and regular Hadoop jobs:  When you&#8217;re streaming, a single instance of your script will handle many pieces of input &#8211; it&#8217;s a bit like a filter-style Unix command (e.g., <code>grep</code>).  A regular Hadoop mapping job is more firmly seated in the functional paradigm &#8211; each call to your mapper gets one line of data, with no knowledge of others.  Because of this, <a href="http://www.raja-gopal.com/?p=42" title="Raja Gopal on Hadoop Streaming with Ruby" target="gopal">some tutorials</a> will suggest shortcuts such as keeping the hash-based accumulator outside the <code>STDIN</code> loop, and emitting all your rows of <code>[word, count]</code> pairs at the end of processing.  There are advantages to this, but I&#8217;m going to stick with my code above, for two reasons:</p>
<ol>
<li>If you maintain global state in your mapper, you&#8217;ll incur significant rework if you try to port the code to a more purely functional Java Hadoop mapper.</li>
<li>A single global accumulator could get damagingly memory-intensive on large data sets.  Going functional and streaming everything straight to HDFS will add some storage cost, but for large jobs I&#8217;m much more worried about RAM than disk.</li>
</ol>
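<p>To make the trade-off concrete, here is a rough side-by-side sketch of the two mapper styles. Both methods take an array of lines rather than reading <code>STDIN</code> so they can be run anywhere; the method names are mine, and the second one is my paraphrase of the shortcut, not code from any particular tutorial.</p>

```ruby
# Per-line style (the approach used in this post): the hash lives only for
# one line, so memory use is bounded by the longest line.
def map_per_line(lines)
  lines.flat_map do |line|
    counts = Hash.new(0)
    line.split.each { |word| counts[word] += 1 }
    counts.map { |word, count| "#{word}\t#{count}" }
  end
end

# Global-accumulator style: one hash for the whole input, emitted at the end.
# Fewer rows come out, but the hash grows with the vocabulary of the input.
def map_with_global_accumulator(lines)
  counts = Hash.new(0)
  lines.each { |line| line.split.each { |word| counts[word] += 1 } }
  counts.map { |word, count| "#{word}\t#{count}" }
end

sample = ["a b a", "b c"]
puts map_per_line(sample)
puts "--"
puts map_with_global_accumulator(sample)
```

<p>The per-line version emits <code>b</code> twice, once for each line it appears in, and leaves it to the reducer to add the two rows together.</p>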
<p>The purpose of the reduce step is to bring together all the per-line counts for each word, and reduce them to a single, global count per word.  This is facilitated by the fact that Hadoop orders all the records by key before handing them to the reducer (where by default the key is everything before the first tab character &#8211; in this case, the word being counted.)  My version of the reducer looks like this:</p>
<p><b>word_count_reducer.rb:</b></p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;"><span style="color:#008000; font-style:italic;">#!/usr/bin/env ruby</span>
&nbsp;
current_word = <span style="color:#0000FF; font-weight:bold;">nil</span>
current_count = <span style="color:#006666;">0</span>
STDIN.<span style="color:#9900CC;">each_line</span> <span style="color:#9966CC; font-weight:bold;">do</span> <span style="color:#006600; font-weight:bold;">|</span>line<span style="color:#006600; font-weight:bold;">|</span>
  word, count = line.<span style="color:#9900CC;">strip</span>.<span style="color:#CC0066; font-weight:bold;">split</span>
  <span style="color:#9966CC; font-weight:bold;">if</span> word != current_word
    <span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;#{current_word}<span style="color:#000099;">\t</span>#{current_count}&quot;</span> <span style="color:#9966CC; font-weight:bold;">unless</span> current_word.<span style="color:#0000FF; font-weight:bold;">nil</span>?
    current_word = word
    current_count = <span style="color:#006666;">0</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  current_count <span style="color:#006600; font-weight:bold;">+</span>= count.<span style="color:#9900CC;">to_i</span>
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
<span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;#{current_word}<span style="color:#000099;">\t</span>#{current_count}&quot;</span> <span style="color:#9966CC; font-weight:bold;">unless</span> current_word.<span style="color:#0000FF; font-weight:bold;">nil</span>?</pre></div></div>

<p>The whole business with <code>current_count</code> is necessitated by another Hadoop Streaming-vs.-Hadoop quirk:  In a regular Hadoop reducer, one call to the reducing function gets a collection of all rows associated with a particular key.  When you&#8217;re streaming, you&#8217;re getting only one row at a time, but you&#8217;re guaranteed that the rows will be ordered on keys, and that the rows for a particular key will not be split up across reducer instances.  Again, the temptation might be to place a single accumulator outside the main <code>STDIN</code> loop, but I&#8217;m going to stick with my strategy of lean runtime footprint and going straight to storage.  The price for that is keeping track of which key you&#8217;re on.</p>
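<p>You can check the key-tracking logic without a cluster by feeding it rows that are already sorted. The sketch below wraps the same logic in a method that takes an array; the plain <code>sort</code> call is only my stand-in for Hadoop&#8217;s ordering by key, and the sample rows are made up.</p>

```ruby
# Same key-change logic as the reducer above, but reading from an array
# and returning the output rows instead of printing them.
def reduce_sorted(rows)
  output = []
  current_word = nil
  current_count = 0
  rows.each do |row|
    word, count = row.strip.split("\t")
    if word != current_word
      # Key changed: flush the finished word before starting the next one.
      output.push("#{current_word}\t#{current_count}") unless current_word.nil?
      current_word = word
      current_count = 0
    end
    current_count += count.to_i
  end
  output.push("#{current_word}\t#{current_count}") unless current_word.nil?
  output
end

# Made-up mapper output, in the order a mapper might emit it.
mapped = ["the\t2", "cat\t1", "the\t1", "sat\t1", "cat\t1"]

# Sorting whole rows groups equal words together, which is all the reducer needs.
puts reduce_sorted(mapped.sort)
```

<p>If you skip the sort, the same word shows up in more than one output row, which is exactly the failure the ordering guarantee protects you from.</p>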
<p>I have placed the mapper and reducer in a folder called <code>scripts</code> in my <code>$HADOOP_HOME</code> folder.  To run them against an existing file in HDFS, I do:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">bin<span style="color: #000000; font-weight: bold;">/</span>hadoop jar contrib<span style="color: #000000; font-weight: bold;">/</span>streaming<span style="color: #000000; font-weight: bold;">/</span>hadoop-0.19.0-streaming.jar \
  <span style="color: #660033;">-mapper</span> scripts<span style="color: #000000; font-weight: bold;">/</span>word_count_mapper.rb <span style="color: #660033;">-reducer</span> scripts<span style="color: #000000; font-weight: bold;">/</span>word_count_reducer.rb \
  <span style="color: #660033;">-file</span> <span style="color: #000000; font-weight: bold;">`</span><span style="color: #7a0874; font-weight: bold;">pwd</span><span style="color: #000000; font-weight: bold;">`/</span>scripts<span style="color: #000000; font-weight: bold;">/</span>word_count_mapper.rb \
  <span style="color: #660033;">-file</span> <span style="color: #000000; font-weight: bold;">`</span><span style="color: #7a0874; font-weight: bold;">pwd</span><span style="color: #000000; font-weight: bold;">`/</span>scripts<span style="color: #000000; font-weight: bold;">/</span>word_count_reducer.rb \
  <span style="color: #660033;">-input</span> texts<span style="color: #000000; font-weight: bold;">/</span>my_text <span style="color: #660033;">-output</span> word_counts</pre></div></div>

<p>&#8230;and results may be extracted from the <code>word_counts</code> folder in HDFS.</p>
<p>This is all there is to the trivial example.  There&#8217;s more, of course &#8211; I&#8217;m currently using this technology to prototype a much more ambitious project that I will eventually port to Java (more on that later).  While I&#8217;m finding that there is (at least subjectively) some performance overhead associated with streaming that I don&#8217;t see otherwise, it&#8217;s proving incredibly useful and speedy for testing out algorithms and designs.</p>
<p>Are you working with Hadoop Streaming?  Do you have questions?  Drop a line in the comments &#8211; I&#8217;d love to swap learnings.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.kickasslabs.com/2009/01/04/hadoop-streaming-for-rapid-prototyping-of-distributed-algorithms/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>Quick and Dirty Messaging</title>
		<link>http://www.kickasslabs.com/2008/11/22/93/</link>
		<comments>http://www.kickasslabs.com/2008/11/22/93/#comments</comments>
		<pubDate>Sat, 22 Nov 2008 20:05:55 +0000</pubDate>
		<dc:creator>Brad</dc:creator>
				<category><![CDATA[Great Minds]]></category>
		<category><![CDATA[Javascript]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Rails]]></category>
		<category><![CDATA[gm]]></category>
		<category><![CDATA[message queue]]></category>
		<category><![CDATA[messaging]]></category>
		<category><![CDATA[rails rumble]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[ruby on rails]]></category>
		<category><![CDATA[rumble]]></category>

		<guid isPermaLink="false">http://www.kickasslabs.com/?p=93</guid>
		<description><![CDATA[Our Rails Rumble 2008 entry, Great Minds (you can have a look at the latest version or the original Rumble version), required a messaging system.  Such systems are easy to do wrong, and we knew we&#8217;d need something that would stay solid under unknown load during Rumble judging.
The solution was quick &#38; dirty (as [...]]]></description>
			<content:encoded><![CDATA[<p>Our <a href="http://railsrumble.com/" title="Rails Rumble 2008" target="rr2008">Rails Rumble 2008</a> entry, <i>Great Minds</i> (you can have a look at the <a href="http://greatminds.kickasslabs.com/" title="Great Minds" target="gmkal">latest version</a> or the <a href="http://greatminds.r08.railsrumble.com/" title="Great Minds - Rumble Version" target="gmrr">original Rumble version</a>), required a messaging system.  Such systems are easy to do wrong, and we knew we&#8217;d need something that would stay solid under unknown load during Rumble judging.</p>
<p>The solution was quick &amp; dirty (as most solutions are during Rails Rumble), but the results worked, and allowed us to qualify for judging and reach a respectable 28th place finish in the &#8220;Completeness&#8221; category.</p>
<p>Check out the details after the fold.</p>
<p><span id="more-93"></span></p>
<h3>Design Constraints</h3>
<p><i>Great Minds</i> is a word game played by 2 or more players.  The game is played in a virtual chat <b>Room</b>, containing one or more <b>Players.</b>  Players can take various actions, such as submitting a word to the game, chatting, or changing their handle/username.  The results of these actions must be transmitted to other players in the room.</p>
<p>Other constraints on the design included:</p>
<ul>
<li><b>No New Technologies:</b>  There are tools like <a href="http://juggernaut.rubyforge.org/" title="Juggernaut" target="juggernaut">Juggernaut</a> that make messaging pretty easy, but we had no hands-on experience with them, and learning a new tool on a 48-hour schedule was deemed a risk to completing the project &#038; qualifying for Rumble judging.</li>
<li><b>Broadcast:</b>  We needed something more than a simple message queue like <a href="http://xph.us/software/beanstalkd/" title="Beanstalk" target="beanstalk">Beanstalk</a>; the consumer of a message couldn&#8217;t remove it from the queue, as there might be other consumers needing the same message.</li>
<li><b>Room State Reconstruction:</b>  Someone coming to the room for the first time (or leaving and returning) should be able to see the current room state, including all participants, recently played words, and recent chat.</li>
<li><b>Replay:</b>  Though we didn&#8217;t get around to implementing it, we wanted to offer the ability to review a particularly funny game.</li>
</ul>
<p>The first two constraints affected our design and technology choices; the last two demanded that the messages be persisted.  (Alternatively, we could have saved the room state in some other way, but as long as we had the messages anyway, it seemed simpler to just keep them and reconstruct the state.)</p>
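<p>To make the "keep the messages and reconstruct the state" idea concrete, here is a minimal sketch in plain Ruby.  It is not the code we shipped: the <code>LoggedMessage</code> struct, the message type symbols, and the <code>reconstruct_room_state</code> helper are all made up for illustration, and the real table stores a <code>message_type_id</code> rather than a symbol.</p>
<pre>
```ruby
# Hypothetical message shape, simplified from the real messages table.
LoggedMessage = Struct.new(:id, :type, :user_handle, :data)

# Replay the persisted log, oldest first, into a snapshot of the room.
def reconstruct_room_state(messages, recent_limit = 5)
  state = { :players => [], :words => [], :chat => [] }
  messages.sort_by { |m| m.id }.each do |m|
    case m.type
    when :join
      state[:players] << m.user_handle unless state[:players].include?(m.user_handle)
    when :leave
      state[:players].delete(m.user_handle)
    when :word
      state[:words] << m.data
    when :chat
      state[:chat] << "#{m.user_handle}: #{m.data}"
    end
  end
  # Only the most recent words and chat lines are shown to a new arrival.
  state[:words] = state[:words].last(recent_limit)
  state[:chat]  = state[:chat].last(recent_limit)
  state
end

example_log = [
  LoggedMessage.new(1, :join,  'alice', nil),
  LoggedMessage.new(2, :join,  'bob',   nil),
  LoggedMessage.new(3, :word,  'alice', 'banana'),
  LoggedMessage.new(4, :chat,  'bob',   'nice one'),
  LoggedMessage.new(5, :leave, 'bob',   nil)
]
puts reconstruct_room_state(example_log).inspect
```
</pre>
<p>The same replay loop, run without the "recent" cutoff, is roughly what a game replay feature would need.</p>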
<h3>Client Implementation</h3>
<p>In the absence of a viable push solution like Juggernaut, we opted for a polling solution:  Clients would periodically send a signal that:</p>
<ul>
<li>let the server know they&#8217;re still alive and connected,</li>
<li>sent any new messages to the server, and</li>
<li>requested any pending messages from the server.</li>
</ul>
<p>On the client side, this meant that every message got packaged as JSON, stuck in a queue, and periodically collected and sent to the server as the optional payload of a keep-alive signal.  With some help from Prototype, this was as simple as:</p>
<pre>
  function SyncMessages() {
    // Serialize everything queued since the last sync.  The queue is
    // assumed to be emptied elsewhere once the server has accepted it.
    var data = Object.toJSON(MESSAGES_QUEUE);
    var url = '/room/sync_messages';
    var params = {
      authenticity_token: AUTH_TOKEN,
      messages: data,
      last_message_id: LAST_MESSAGE_ID
    };

    new Ajax.Request(url, {
      method: 'post',
      parameters: params,
      onSuccess: HandleMessages
    });
  }

  SyncMessages(); // run once on page load
  var message_syncer = new PeriodicalExecuter(SyncMessages, 3); // then every 3 seconds
</pre>
<p><code>HandleMessages()</code> updates the DOM based on messages coming back.</p>
<p>The other feature to note is the <code>LAST_MESSAGE_ID</code> token above.  Each client is responsible for keeping track of the last message ID it got.  On the server side, message IDs are assigned sequentially, so all new messages are guaranteed to have a higher ID.</p>
<p>We could have kept track of each client&#8217;s last message on the server side, but this would have been much more complex &#8211; we would have needed to store data either in the session (which already held a fair amount of stuff) or in another table in order to keep track of who had received what.</p>
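<p>Here is a rough sketch of why a client-held cursor gives us broadcast for free.  Everything in it is an assumption for illustration: <code>MessageLog</code> is an in-memory stand-in for the messages table, and <code>PollingClient</code> plays the role of the browser (the real client is the JavaScript above).</p>
<pre>
```ruby
# In-memory stand-in for the messages table (illustration only).
class MessageLog
  def initialize
    @rows = []
    @next_id = 1
  end

  # IDs are assigned sequentially, like an auto-increment primary key.
  def append(room_id, data)
    row = { :id => @next_id, :room_id => room_id, :data => data }
    @next_id += 1
    @rows << row
    row
  end

  # Reading never removes anything, so every consumer sees every message.
  def since(room_id, last_message_id)
    @rows.select { |r| r[:room_id] == room_id && r[:id] > last_message_id }
  end
end

# Hypothetical client: the only per-client state is the cursor it holds.
class PollingClient
  attr_reader :last_message_id, :seen

  def initialize(room_id)
    @room_id = room_id
    @last_message_id = 0
    @seen = []
  end

  def poll(log)
    fresh = log.since(@room_id, @last_message_id)
    fresh.each { |row| @seen << row[:data] }
    @last_message_id = fresh.last[:id] unless fresh.empty?
    fresh
  end
end

log = MessageLog.new
alice = PollingClient.new(1)
bob = PollingClient.new(1)
log.append(1, 'hello')
alice.poll(log)
log.append(1, 'world')
alice.poll(log)  # picks up only 'world'
bob.poll(log)    # picks up both messages in a single poll
```
</pre>
<p>Note that the log itself knows nothing about Alice or Bob, which is exactly the bookkeeping we wanted to avoid on the server.</p>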
<h3>Server Implementation</h3>
<p>On the server side, messages were persisted in a table.  This table would have relatively frequent inserts, somewhat more frequent reads, and no updates or deletions.  For this reason, we went with the default MyISAM storage engine.  (We had considered a design that would have updated the table with delivery status information and would have required InnoDB&#8217;s row-level locking, but in the end we opted for the simpler design.)  The table looks like:</p>
<pre>
  class CreateMessages < ActiveRecord::Migration
    def self.up
      create_table :messages do |t|
        t.integer :message_type_id, :room_id, :game_id, :user_id
        t.string  :data, :user_handle
        t.timestamps
      end
    end

    def self.down
      drop_table :messages
    end
  end
</pre>
<p>It also has an index:</p>
<pre>
  class AddRoomIndexToMessages < ActiveRecord::Migration
    def self.up
      add_index :messages, [:room_id, :id]
    end

    def self.down
      remove_index :messages, [:room_id, :id]
    end
  end
</pre>
<p>Recall that the client keeps track of the last message it got, and periodically requests all messages since.  This index speeds that request - which is just <code>msgs = Message.find :all, :conditions => ['room_id = ? AND id > ?', session['room_id'], last_message_id], :order => 'id'</code>.  (Yes, I know that should be a <a href="http://ryandaigle.com/articles/2008/8/20/named-scope-it-s-not-just-for-conditions-ya-know" title="Named Scopes" target="namedscopes">named scope</a>.  I hadn't gotten around to learning those yet at the time of the Rumble.)</p>
<p>When a keep-alive signal hits the server:</p>
<ul>
<li>any new messages attached to the signal are added to the <code>messages</code> table,</li>
<li>any game mechanics related to those messages are executed (which may result in more messages being added to the table), and</li>
<li>all messages for the caller's room with an ID greater than the caller's LAST_MESSAGE_ID are packaged as JSON and sent back down in the HTTP response.</li>
</ul>
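<p>The three steps above can be sketched in plain Ruby, outside of Rails.  This is a simplified illustration and not our actual controller: the <code>SyncEndpoint</code> class, its in-memory row store, and the "a played word produces a score message" rule are all assumptions made up for the example.</p>
<pre>
```ruby
require 'json'

# Hypothetical stand-in for the sync_messages action and the messages table.
class SyncEndpoint
  def initialize
    @rows = []
  end

  # Handles one keep-alive signal and returns the JSON response body.
  def sync(room_id, messages_json, last_message_id)
    # Steps 1 and 2: store each incoming message, then run game mechanics,
    # which may append further messages of their own.
    JSON.parse(messages_json).each do |attrs|
      run_game_mechanics(store(room_id, attrs))
    end
    # Step 3: reply with everything in this room newer than the caller's cursor.
    pending = @rows.select { |r| r['room_id'] == room_id && r['id'] > last_message_id }
    JSON.generate(pending)
  end

  private

  # The server assigns id and room_id; values sent by the client are overwritten.
  def store(room_id, attrs)
    row = attrs.merge('id' => @rows.size + 1, 'room_id' => room_id)
    @rows << row
    row
  end

  # Placeholder game rule (an assumption): a played word yields a score message.
  def run_game_mechanics(row)
    return unless row['type'] == 'word'
    store(row['room_id'], 'type' => 'score', 'data' => "#{row['data']} scored")
  end
end

endpoint = SyncEndpoint.new
puts endpoint.sync(1, '[{"type":"word","data":"banana"}]', 0)
```
</pre>
<p>Because the sender's own message comes back in the same response, every client handles its own messages the same way it handles everyone else's.</p>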
<p>A note:  If you're in a room and you enter (for example) a chat message, there's actually a server round-trip involved (pushing the message to the server and getting the same message back as a response) before that chat message is shown in your browser.  We worried about this, but not for long - the interactivity delay turned out to be minimal, and it was much simpler to keep it this way (every message is treated the same for all clients) than to create an exception (you update your DOM with your chat messages immediately, but your own chat messages are somehow filtered from your message stream).</p>
<h3>What's Missing?</h3>
<p>Obviously, this was a quick &amp; dirty solution, tailored to the immediate needs of a Rails Rumble project.  Here's what I'd do differently (and indeed, will do differently, as we have plans for the future of Great Minds - stay tuned):</p>
<ul>
<li><b>Security:</b>  We do use the Rails session <code>authenticity_token</code>, but there's still no protection from a server flood from someone who has a token, either by cranking down the polling interval or by fiddling with the LAST_MESSAGE_ID.</li>
<li><b>Server Push:</b>  Polling is wasteful - requests are made whether or not there are any messages to send or receive.  We plan to move to Juggernaut soon.</li>
<li><b>Decoupling from Game Mechanics:</b>  In the first iteration, message receipt and delivery were too intimately tied to game mechanics, with <code>Message</code> objects being explicitly constructed and followed by <code>Message.save!</code> calls.  I've already started the process of teasing them apart in preparation for the move to Juggernaut, factoring this into a <code>deliver_message()</code> method.</li>
</ul>
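<p>As a rough idea of where the decoupling is headed, here is a sketch of game logic that only ever calls <code>deliver_message()</code>.  The <code>TableTransport</code> and <code>Game</code> classes and the method signature are hypothetical; the real refactoring is still in progress and may look quite different.</p>
<pre>
```ruby
# Hypothetical transport: anything that responds to #deliver will do.
class TableTransport
  attr_reader :rows

  def initialize
    @rows = []
  end

  # Stands in for constructing a Message record and calling save! on it.
  def deliver(message)
    @rows << message.merge(:id => @rows.size + 1)
  end
end

class Game
  def initialize(transport)
    @transport = transport
  end

  def play_word(room_id, user_handle, word)
    deliver_message(room_id, :word, user_handle, word)
  end

  private

  # The single place that knows how messages leave the game logic.
  def deliver_message(room_id, type, user_handle, data)
    @transport.deliver(:room_id => room_id, :type => type,
                       :user_handle => user_handle, :data => data)
  end
end

transport = TableTransport.new
Game.new(transport).play_word(1, 'alice', 'banana')
puts transport.rows.inspect
```
</pre>
<p>With this shape, swapping polling for a push server should mean writing another transport rather than touching the game rules.</p>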
<p>Questions?  Thoughts?  Add a comment!  I promise to get back to you.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.kickasslabs.com/2008/11/22/93/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
