<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Kickass Labs &#187; hadoop</title>
	<atom:link href="http://www.kickasslabs.com/tag/hadoop/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.kickasslabs.com</link>
	<description>We &#9829; code.</description>
	<lastBuildDate>Wed, 28 Dec 2011 16:57:56 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Traveling Salesman Attack! (Hadoop and Genetic Algorithms)</title>
		<link>http://www.kickasslabs.com/2011/10/10/traveling-salesman-attack/</link>
		<comments>http://www.kickasslabs.com/2011/10/10/traveling-salesman-attack/#comments</comments>
		<pubDate>Mon, 10 Oct 2011 15:49:15 +0000</pubDate>
		<dc:creator>Brad</dc:creator>
				<category><![CDATA[hadoop]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Just for Kicks]]></category>
		<category><![CDATA[genetic algorithms]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[traveling salesman]]></category>

		<guid isPermaLink="false">http://www.kickasslabs.com/?p=669</guid>
		<description><![CDATA[For one of my &#8220;coffee projects&#8221; (things I work on for about 20-30 minutes each morning to warm my brain up and stay amused), I wrote a genetic algorithm attack on the Traveling Salesman problem. Because I&#8217;m a Big Data &#8230; <a href="http://www.kickasslabs.com/2011/10/10/traveling-salesman-attack/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>For one of my &#8220;coffee projects&#8221; (things I work on for about 20-30 minutes each morning to warm my brain up and stay amused), I wrote a <a href="http://en.wikipedia.org/wiki/Genetic_algorithm">genetic algorithm</a> attack on the <a href="http://en.wikipedia.org/wiki/Travelling_salesman_problem">Traveling Salesman problem</a>. Because I&#8217;m a Big Data geek, I parallelized it with <a href="http://hadoop.apache.org">Hadoop</a>.</p>
<p><i>(Note:  I actually gave a talk on this topic some time back, having implemented this same concept in Ruby using Hadoop Streaming.  That implementation contained a number of flaws, some of which I&#8217;m attempting to rectify.  Also, <a href="http://scholar.google.com/scholar?q=hadoop+genetic+algorithm&#038;hl=en&#038;as_sdt=0&#038;as_vis=1&#038;oi=scholart">it&#8217;s not as though I&#8217;m the first guy to work this angle</a>. I haven&#8217;t perused the latest Hadoop/GA work in the field yet, because I wanted to maximize my first-hand learning at this stage &#8211; but I&#8217;ll be looking to other sources for inspiration as I develop and generalize this code.)</i></p>
<p>This was just done for my own edification; I haven&#8217;t put enough effort into tuning this software or studying its properties for this to be considered a serious work of research (yet). I&#8217;m not a specialist or any kind of expert with genetic algorithms, beyond my recreational reading. That said, I had some success at building a gene representation and evolutionary mechanisms that drove a population of solutions for an NP-hard problem toward (mostly) monotonic improvement.</p>
<p>What follows is a brief description of my approach to the problem, with code attached. If you <em>are</em> an expert with GAs, I&#8217;d love to hear your feedback on what I could do better.</p>
<p>I haven&#8217;t spent any effort in this post describing the <a href="http://en.wikipedia.org/wiki/Travelling_salesman_problem">Traveling Salesman problem</a>, <a href="http://en.wikipedia.org/wiki/Genetic_algorithm">genetic algorithms</a>, or <a href="http://hadoop.apache.org">Hadoop</a>/MapReduce. (It was plenty long already.)  If you&#8217;re interested in any of those topics but need to brush up on the basics, check out the preceding links.</p>
<p>More geekery below the fold&#8230;</p>
<p><span id="more-669"></span></p>
<p><strong>The Map</strong></p>
<p>A &#8220;city&#8221; in my representation is an (x, y) coordinate pair, with each coordinate having a floating-point value in [0.0, 1.0]. The software can take any arrangement of cities, but for testing I used an arrangement with a set of trivial solutions, 20 cities evenly spaced around the perimeter of the unit square:</p>
<table>
<tbody>
<tr>
<td>(0.0, 0.0)</td>
<td>(0.0, 0.2)</td>
<td>(0.0, 0.4)</td>
<td>(0.0, 0.6)</td>
<td>(0.0, 0.8)</td>
</tr>
<tr>
<td>(0.0, 1.0)</td>
<td>(0.2, 1.0)</td>
<td>(0.4, 1.0)</td>
<td>(0.6, 1.0)</td>
<td>(0.8, 1.0)</td>
</tr>
<tr>
<td>(1.0, 1.0)</td>
<td>(1.0, 0.8)</td>
<td>(1.0, 0.6)</td>
<td>(1.0, 0.4)</td>
<td>(1.0, 0.2)</td>
</tr>
<tr>
<td>(1.0, 0.0)</td>
<td>(0.8, 0.0)</td>
<td>(0.6, 0.0)</td>
<td>(0.4, 0.0)</td>
<td>(0.2, 0.0)</td>
</tr>
</tbody>
</table>
<p><strong>Gene Representation</strong></p>
<p>When choosing a non-repeating path among <em>N</em> distinct cities, there are <em>N</em> possible choices for the first city, <em>N-1</em> choices for the second city, and so on, down to a single choice for the final city.</p>
<p>In my representation, a chromosome has <em>N &#8211; 1</em> integer genes. For a gene <em>G<sub>n</sub></em> (counting <em>n</em> from 0), valid values are <em>[0, N - n)</em> &#8211; thus, the first gene has <em>N</em> valid values, the second gene <em>N &#8211; 1,</em> and so on. The final city needs no gene, because only one city remains by then. Each gene corresponds to the 0-based index of a city in an ordered collection of the cities not yet accounted for in the path. For example, if we start with an ordered collection of cities <em>{C<sub>0</sub>, C<sub>1</sub>, C<sub>2</sub>}</em> and the chromosome <em>{0, 0}:</em></p>
<ul>
<li>The first city in this candidate solution path is at index 0 in the collection:  <em>C<sub>0</sub></em></li>
<li>The second city in the path is at index 0 in the collection of remaining cities, <em>{C<sub>1</sub>, C<sub>2</sub>}: C<sub>1</sub></em></li>
<li>The final city is trivially <em>C<sub>2</sub></em>.</li>
</ul>
<p>Likewise, the chromosome <em>{1, 1}</em> encodes the path <em>C<sub>1</sub> -&gt; C<sub>2</sub> -&gt; C<sub>0</sub>.</em></p>
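<p>The decoding steps above can be sketched in a few lines of Java. This is a minimal illustration, not the code from this project: the class name <tt>ChromosomeDecoder</tt> and the method <tt>decode</tt> are made up for the example, cities are identified only by their 0-based index, and the chromosome is assumed to arrive as an <tt>int[]</tt> rather than a string.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class ChromosomeDecoder {
    private ChromosomeDecoder() {}

    /**
     * Decodes a chromosome of N - 1 genes into a route over N cities.
     * Gene n is an index into the ordered list of cities not yet visited.
     */
    static List<Integer> decode(int[] genes) {
        int numCities = genes.length + 1;

        // Ordered collection of cities still available: C0, C1, ..., C(N-1).
        List<Integer> remaining = new ArrayList<>(numCities);
        for (int city = 0; city < numCities; ++city) {
            remaining.add(city);
        }

        List<Integer> route = new ArrayList<>(numCities);
        for (int gene : genes) {
            if (gene < 0 || gene >= remaining.size()) {
                throw new IllegalArgumentException("gene out of range: " + gene);
            }
            // remove(int index) returns the city at that position and shrinks the list.
            route.add(remaining.remove(gene));
        }

        // Exactly one city is left; it is trivially the final stop.
        route.add(remaining.get(0));
        return route;
    }
}
```

<p>With this sketch, <tt>decode(new int[] {0, 0})</tt> yields the route 0, 1, 2 and <tt>decode(new int[] {1, 1})</tt> yields 1, 2, 0, matching the two examples above.</p>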
<p><strong>Fitness Scoring</strong></p>
<p>Here I&#8217;ve implemented the <i>symmetric</i> Traveling Salesman problem (the distance/cost from <i>C<sub>1</sub></i> to <i>C<sub>2</sub></i> is the same as from <i>C<sub>2</sub></i> to <i>C<sub>1</sub></i>), and the distance between cities is just the Euclidean distance between their coordinates.  The maximum possible distance between any two cities in the landscape I&#8217;ve defined is <em>√<span style="text-decoration: overline;">2</span></em>, so an upper bound on the length of any path through an <em>N</em>-city map (which has <em>N &#8211; 1</em> legs) is <em>(N &#8211; 1) * √<span style="text-decoration: overline;">2</span></em>.</p>
<p>A chromosome&#8217;s score is this worst-case distance minus the actual distance of the path it encodes; thus, the greater the improvement over the theoretically pessimal case, the higher the chromosome&#8217;s score.</p>
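<p>As a quick worked check of this scoring rule, the sketch below walks the 20 test cities in the order they appear in the table above, which is one of the trivial solutions. The class <tt>ScoreExample</tt> and its methods are hypothetical names used only for this illustration. Each of the 19 legs has length 0.2, so the path length comes to about 3.8, and the score is 19 * √2 minus that, or roughly 23.07.</p>

```java
// Hypothetical worked example, for illustration only.
final class ScoreExample {
    private ScoreExample() {}

    /** Sum of Euclidean distances between consecutive stops (open path, no return leg). */
    static double pathLength(double[][] route) {
        double total = 0.0;
        for (int i = 1; i < route.length; ++i) {
            double dx = route[i][0] - route[i - 1][0];
            double dy = route[i][1] - route[i - 1][1];
            total += Math.hypot(dx, dy);
        }
        return total;
    }

    /** Worst-case bound (N - 1) * sqrt(2) minus the actual path length. */
    static double score(double[][] route) {
        double worstCase = (route.length - 1) * Math.sqrt(2.0);
        return worstCase - pathLength(route);
    }

    /** The 20 perimeter cities, in the same order as the table of test cities. */
    static double[][] perimeterRoute() {
        double[][] route = new double[20][];
        for (int k = 0; k < 5; ++k) {
            double t = k / 5.0; // 0.0, 0.2, 0.4, 0.6, 0.8
            route[k] = new double[] {0.0, t};             // left edge, going up
            route[5 + k] = new double[] {t, 1.0};         // top edge, going right
            route[10 + k] = new double[] {1.0, 1.0 - t};  // right edge, going down
            route[15 + k] = new double[] {1.0 - t, 0.0};  // bottom edge, going left
        }
        return route;
    }
}
```

<p>Under the gene representation described earlier, this ordering corresponds to a chromosome of nineteen zeros, because each step simply takes the first city that has not been visited yet.</p>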

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">class</span> ChromosomeScorer <span style="color: #009900;">&#123;</span>
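    <span style="color: #000000; font-weight: bold;">protected</span> <span style="color: #003399;">ArrayList</span>&lt;<span style="color: #000066; font-weight: bold;">double</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span>&gt; cities<span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">protected</span> <span style="color: #003399;">ArrayList</span>&lt;<span style="color: #003399;">Boolean</span>&gt; citiesUsed<span style="color: #339933;">;</span>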
    <span style="color: #000000; font-weight: bold;">final</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #000066; font-weight: bold;">double</span> SQRT2 <span style="color: #339933;">=</span> <span style="color: #003399;">Math</span>.<span style="color: #006633;">sqrt</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">2.0</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">protected</span> <span style="color: #000066; font-weight: bold;">double</span> maxDistance<span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #000000; font-weight: bold;">public</span> ChromosomeScorer<span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span> cityString<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
        cities <span style="color: #339933;">=</span> getCitiesFromString<span style="color: #009900;">&#40;</span>cityString<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        citiesUsed <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">ArrayList</span><span style="color: #009900;">&#40;</span>cities.<span style="color: #006633;">size</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        maxDistance <span style="color: #339933;">=</span> SQRT2 <span style="color: #339933;">*</span> <span style="color: #009900;">&#40;</span>cities.<span style="color: #006633;">size</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">-</span> <span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">int</span> i <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span> i <span style="color: #339933;">&lt;</span> cities.<span style="color: #006633;">size</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <span style="color: #339933;">++</span>i<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
            citiesUsed.<span style="color: #006633;">add</span><span style="color: #009900;">&#40;</span><span style="color: #003399;">Boolean</span>.<span style="color: #000066; font-weight: bold;">FALSE</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
    <span style="color: #009900;">&#125;</span>
&nbsp;
    <span style="color: #000000; font-weight: bold;">protected</span> <span style="color: #000066; font-weight: bold;">double</span> score<span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span> chromosomeString<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
        <span style="color: #003399;">ArrayList</span>&lt;<span style="color: #003399;">Integer</span>&gt; route <span style="color: #339933;">=</span> getRouteFromChromosome<span style="color: #009900;">&#40;</span>chromosomeString<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #000066; font-weight: bold;">double</span> distance <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0.0</span><span style="color: #339933;">;</span>
        <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">int</span> i <span style="color: #339933;">=</span> <span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span> i <span style="color: #339933;">&lt;</span> cities.<span style="color: #006633;">size</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <span style="color: #339933;">++</span>i<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
            <span style="color: #000066; font-weight: bold;">int</span> city1Index <span style="color: #339933;">=</span> route.<span style="color: #006633;">get</span><span style="color: #009900;">&#40;</span>i<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
            <span style="color: #000066; font-weight: bold;">int</span> city2Index <span style="color: #339933;">=</span> route.<span style="color: #006633;">get</span><span style="color: #009900;">&#40;</span>i <span style="color: #339933;">-</span> <span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
            <span style="color: #000066; font-weight: bold;">double</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> city1 <span style="color: #339933;">=</span> cities.<span style="color: #006633;">get</span><span style="color: #009900;">&#40;</span>city1Index<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
            <span style="color: #000066; font-weight: bold;">double</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> city2 <span style="color: #339933;">=</span> cities.<span style="color: #006633;">get</span><span style="color: #009900;">&#40;</span>city2Index<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
            <span style="color: #000066; font-weight: bold;">double</span> city1x <span style="color: #339933;">=</span> city1<span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span> <span style="color: #000066; font-weight: bold;">double</span> city1y <span style="color: #339933;">=</span> city1<span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
            <span style="color: #000066; font-weight: bold;">double</span> city2x <span style="color: #339933;">=</span> city2<span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span> <span style="color: #000066; font-weight: bold;">double</span> city2y <span style="color: #339933;">=</span> city2<span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
&nbsp;
            distance <span style="color: #339933;">+=</span> <span style="color: #003399;">Math</span>.<span style="color: #006633;">sqrt</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span>city1x <span style="color: #339933;">-</span> city2x<span style="color: #009900;">&#41;</span> <span style="color: #339933;">*</span> <span style="color: #009900;">&#40;</span>city1x <span style="color: #339933;">-</span> city2x<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> <span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span>city1y <span style="color: #339933;">-</span> city2y<span style="color: #009900;">&#41;</span> <span style="color: #339933;">*</span> <span style="color: #009900;">&#40;</span>city1y <span style="color: #339933;">-</span> city2y<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
&nbsp;
        <span style="color: #000000; font-weight: bold;">return</span> maxDistance <span style="color: #339933;">-</span> distance<span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
&nbsp;
    <span style="color: #666666; font-style: italic;">// city string parsing code elided</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>The <tt>citiesUsed</tt> list holds one flag per city, indicating whether that city has been added to the route yet; it&#8217;s for implementing the chromosome parsing I described above. In my example, I described the process in terms of pulling elements from an ordered collection, and performing further operations on the reduced collection. Regenerating the ordered list of cities and manipulating it in this way every time I wanted to score a chromosome sounded computationally expensive; flipping booleans sounded cheaper. (I could probably get cheaper still using actual bit flags, but I was trying to keep the code both readable and writable at this stage. Optimizations will come later.)</p>
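<p>The flag-based parsing is elided from the listing above, so here is one way it could look. This is a sketch under assumptions, not the project&#8217;s actual <tt>getRouteFromChromosome</tt>: the class name <tt>FlagRouteDecoder</tt> and the helper <tt>takeUnused</tt> are invented for the example, a plain <tt>boolean[]</tt> stands in for the list of <tt>Boolean</tt> flags, the chromosome is assumed to be a space-separated string of genes, and there are assumed to be at least two cities.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of flag-based chromosome parsing, for illustration only.
final class FlagRouteDecoder {
    private final boolean[] citiesUsed;

    FlagRouteDecoder(int numCities) {
        citiesUsed = new boolean[numCities];
    }

    /** Turns a space-separated chromosome such as "1 1" into a list of city indices. */
    List<Integer> getRouteFromChromosome(String chromosomeString) {
        String[] genes = chromosomeString.trim().split(" ");
        if (genes.length != citiesUsed.length - 1) {
            throw new IllegalArgumentException("expected " + (citiesUsed.length - 1) + " genes");
        }

        // Reset the flags instead of rebuilding an ordered city list for every chromosome.
        Arrays.fill(citiesUsed, false);

        List<Integer> route = new ArrayList<>(citiesUsed.length);
        for (String gene : genes) {
            route.add(takeUnused(Integer.parseInt(gene)));
        }
        route.add(takeUnused(0)); // the single remaining city
        return route;
    }

    /** Finds the unused city at the given 0-based position among unused cities and marks it used. */
    private int takeUnused(int position) {
        int skip = position;
        for (int city = 0; city < citiesUsed.length; ++city) {
            if (citiesUsed[city]) {
                continue;
            }
            if (skip == 0) {
                citiesUsed[city] = true;
                return city;
            }
            --skip;
        }
        throw new IllegalArgumentException("gene out of range: " + position);
    }
}
```

<p>Each gene still costs a scan over the flags, so this version saves allocation and list shuffling rather than changing the overall amount of work per chromosome.</p>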
<p><strong>Initial Population</strong></p>
<p>The following code creates an initial population of chromosomes (encoded as Strings) and dumps them to a requested file:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">protected</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #000066; font-weight: bold;">void</span> createInitialPopulation<span style="color: #009900;">&#40;</span>FSDataOutputStream populationOutfile, <span style="color: #000000; font-weight: bold;">final</span> <span style="color: #000066; font-weight: bold;">int</span> populationSize, <span style="color: #000000; font-weight: bold;">final</span> <span style="color: #000066; font-weight: bold;">int</span> numCities<span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">throws</span> <span style="color: #003399;">IOException</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">int</span> i <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span> i <span style="color: #339933;">&lt;</span> populationSize<span style="color: #339933;">;</span> <span style="color: #339933;">++</span>i<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
        <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">int</span> j <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span> j <span style="color: #339933;">&lt;</span> <span style="color: #009900;">&#40;</span>numCities <span style="color: #339933;">-</span> <span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <span style="color: #339933;">++</span>j<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
            <span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>j <span style="color: #339933;">&gt;</span> <span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
                populationOutfile.<span style="color: #006633;">writeBytes</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot; &quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
            <span style="color: #009900;">&#125;</span>
            populationOutfile.<span style="color: #006633;">writeBytes</span><span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span>.<span style="color: #006633;">format</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;%d&quot;</span>, random.<span style="color: #006633;">nextInt</span><span style="color: #009900;">&#40;</span>numCities <span style="color: #339933;">-</span> j<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
        populationOutfile.<span style="color: #006633;">writeBytes</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
    populationOutfile.<span style="color: #006633;">close</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>At a later time, I&#8217;ll experiment with using Hadoop&#8217;s ArrayWritable or other serialization schemes for more efficient representation of the chromosomes; at this stage, it was important to me to keep the data human-readable.</p>
<p>There&#8217;s a mapper dedicated to scoring generation 0:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">class</span> ScoringMapper <span style="color: #000000; font-weight: bold;">extends</span> Mapper&lt;LongWritable, Text, Text, DoubleWritable&gt; <span style="color: #009900;">&#123;</span>
    <span style="color: #000000; font-weight: bold;">private</span> Text outKey <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Text<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">private</span> DoubleWritable outValue <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> DoubleWritable<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #000000; font-weight: bold;">protected</span> ChromosomeScorer scorer <span style="color: #339933;">=</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #339933;">;</span>
&nbsp;
    @Override
    <span style="color: #000000; font-weight: bold;">protected</span> <span style="color: #000066; font-weight: bold;">void</span> map<span style="color: #009900;">&#40;</span>LongWritable key, Text value, <span style="color: #003399;">Context</span> context<span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">throws</span> <span style="color: #003399;">IOException</span>, <span style="color: #003399;">InterruptedException</span> <span style="color: #009900;">&#123;</span>
        <span style="color: #003399;">String</span> incomingValue<span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> value.<span style="color: #006633;">toString</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>.<span style="color: #006633;">split</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #003399;">String</span> chromosome <span style="color: #339933;">=</span> incomingValue<span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
        <span style="color: #000066; font-weight: bold;">double</span> score <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0.0</span><span style="color: #339933;">;</span>
        <span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>incomingValue.<span style="color: #006633;">length</span> <span style="color: #339933;">&gt;</span> <span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
            <span style="color: #003399;">String</span> scoreString <span style="color: #339933;">=</span> incomingValue<span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
            <span style="color: #000000; font-weight: bold;">try</span> <span style="color: #009900;">&#123;</span>
                score <span style="color: #339933;">=</span> <span style="color: #003399;">Double</span>.<span style="color: #006633;">parseDouble</span><span style="color: #009900;">&#40;</span>scoreString<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
            <span style="color: #009900;">&#125;</span> <span style="color: #000000; font-weight: bold;">catch</span><span style="color: #009900;">&#40;</span><span style="color: #003399;">NumberFormatException</span> nfe<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
                <span style="color: #666666; font-style: italic;">// weird - but let's try re-generating the score from the chromosome alone</span>
                score <span style="color: #339933;">=</span> scorer.<span style="color: #006633;">score</span><span style="color: #009900;">&#40;</span>chromosome<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
            <span style="color: #009900;">&#125;</span>
        <span style="color: #009900;">&#125;</span> <span style="color: #000000; font-weight: bold;">else</span> <span style="color: #009900;">&#123;</span> <span style="color: #666666; font-style: italic;">// we only go through the scoring process if we don't have a score yet</span>
            score <span style="color: #339933;">=</span> scorer.<span style="color: #006633;">score</span><span style="color: #009900;">&#40;</span>chromosome<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
&nbsp;
        outValue.<span style="color: #006633;">set</span><span style="color: #009900;">&#40;</span>score<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        outKey.<span style="color: #006633;">set</span><span style="color: #009900;">&#40;</span>chromosome<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        context.<span style="color: #006633;">write</span><span style="color: #009900;">&#40;</span>outKey, outValue<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
&nbsp;
    <span style="color: #666666; font-style: italic;">// setup code elided</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p><strong>Subsequent Generations</strong></p>
<p>I managed to cram the rest of the operations for each generation into one mapper and one reducer &#8211; really, into one reducer, since the map function just randomly assigns chromosomes to groups:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">    @Override
    <span style="color: #000000; font-weight: bold;">protected</span> <span style="color: #000066; font-weight: bold;">void</span> map<span style="color: #009900;">&#40;</span>LongWritable key, Text value, <span style="color: #003399;">Context</span> context<span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">throws</span> <span style="color: #003399;">IOException</span>, <span style="color: #003399;">InterruptedException</span> <span style="color: #009900;">&#123;</span>
        outKey.<span style="color: #006633;">set</span><span style="color: #009900;">&#40;</span>random.<span style="color: #006633;">nextInt</span><span style="color: #009900;">&#40;</span>numBins<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        context.<span style="color: #006633;">write</span><span style="color: #009900;">&#40;</span>outKey, value<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <span style="color: #666666; font-style: italic;">// shuffle</span>
    <span style="color: #009900;">&#125;</span></pre></div></div>

<p>So why am I shuffling the chromosomes thus? Most methods of population <a href="http://en.wikipedia.org/wiki/Selection_(genetic_algorithm)">selection</a> involve having some view over the data; the naïve method of <a href="http://en.wikipedia.org/wiki/Truncation_selection">truncation selection</a> is an extreme example, where you need to sort the entire population, and keep some fraction of the highest-scoring individuals. This is hard to parallelize, and not especially scalable for large populations.</p>
<p>Other selection methods such as <a href="http://en.wikipedia.org/wiki/Fitness_proportionate_selection">fitness proportionate selection</a>, however, look like they&#8217;d retain their properties well if the total population were divided into large, random subpopulations. (I have yet to do a detailed mathematical treatment of this, but some combination of intuition and cocktail napkin calculation has convinced me of this well enough to forge ahead.) To this end, I&#8217;ve set up the system to randomly distribute the full population into a user-settable number of subpopulations; the size of these subpopulations can be tuned to fit within the memory allocated to a reducer instance. (Naturally, the ideal would be to have a subpopulation just large enough to be handled by one reducer instance.) Assuming that the subpopulations are sufficiently large, the random distribution should retain the aggregate characteristics (distribution of fitness scores, etc.) of the total population, and give similar results under fitness proportionate selection.</p>
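<p>For readers who haven&#8217;t met fitness proportionate (&#8220;roulette wheel&#8221;) selection before, here is a generic sketch of picking one individual from a subpopulation. It is not the <tt>selectSurvivor</tt> implementation used in this project; the class name <tt>RouletteWheel</tt> and its signature are assumptions made for the example, and scores are assumed to be non-negative (which holds for the worst-case-minus-actual scoring described earlier).</p>

```java
import java.util.Random;

// Generic roulette-wheel selection sketch, for illustration only.
final class RouletteWheel {
    private RouletteWheel() {}

    /**
     * Returns the index of one individual, chosen with probability
     * proportional to its score. Scores must be non-negative.
     */
    static int select(double[] scores, Random random) {
        if (scores.length == 0) {
            throw new IllegalArgumentException("empty subpopulation");
        }
        double total = 0.0;
        for (double score : scores) {
            if (score < 0.0) {
                throw new IllegalArgumentException("negative score: " + score);
            }
            total += score;
        }
        if (total == 0.0) {
            // Nobody has any fitness, so fall back to a uniform pick.
            return random.nextInt(scores.length);
        }

        // Spin the wheel: a point in [0, total) lands in exactly one individual's slice.
        double spin = random.nextDouble() * total;
        double cumulative = 0.0;
        int lastPositive = 0;
        for (int i = 0; i < scores.length; ++i) {
            if (scores[i] > 0.0) {
                lastPositive = i;
            }
            cumulative += scores[i];
            if (spin < cumulative) {
                return i;
            }
        }
        // Only reachable through floating-point rounding at the very top of the wheel.
        return lastPositive;
    }
}
```

<p>Because each draw depends only on the scores inside the array it is given, the same routine can run independently in every reducer on its own subpopulation, which is what makes the random subdivision attractive.</p>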
<p>After this subdivision of the population, a reducer instance receives a subpopulation, and is then responsible for selecting survivors from that population, breeding replacement members of the population from the survivors, and scoring the new population members.</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">    @Override
    <span style="color: #000000; font-weight: bold;">protected</span> <span style="color: #000066; font-weight: bold;">void</span> reduce<span style="color: #009900;">&#40;</span>VIntWritable key, Iterable&lt;Text&gt; values, <span style="color: #003399;">Context</span> context<span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">throws</span> <span style="color: #003399;">IOException</span>, <span style="color: #003399;">InterruptedException</span> <span style="color: #009900;">&#123;</span>
        <span style="color: #003399;">TreeSet</span> sortedChromosomes <span style="color: #339933;">=</span> getSortedChromosomeSet<span style="color: #009900;">&#40;</span>values<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        normalizeScores<span style="color: #009900;">&#40;</span>sortedChromosomes<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
        <span style="color: #000066; font-weight: bold;">int</span> survivorsWanted <span style="color: #339933;">=</span> <span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">int</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">double</span><span style="color: #009900;">&#41;</span> sortedChromosomes.<span style="color: #006633;">size</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">*</span> survivorProportion<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
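        <span style="color: #003399;">Set</span><span style="color: #339933;">&lt;</span>ScoredChromosome<span style="color: #339933;">&gt;</span> survivors <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">HashSet</span><span style="color: #339933;">&lt;</span>ScoredChromosome<span style="color: #339933;">&gt;</span><span style="color: #009900;">&#40;</span>survivorsWanted<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>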
&nbsp;
        <span style="color: #000000; font-weight: bold;">while</span> <span style="color: #009900;">&#40;</span>survivors.<span style="color: #006633;">size</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">&lt;</span> survivorsWanted<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
            survivors.<span style="color: #006633;">add</span><span style="color: #009900;">&#40;</span>selectSurvivor<span style="color: #009900;">&#40;</span>sortedChromosomes<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
&nbsp;
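        <span style="color: #003399;">ArrayList</span><span style="color: #339933;">&lt;</span>ScoredChromosome<span style="color: #339933;">&gt;</span> parentPool <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">ArrayList</span><span style="color: #339933;">&lt;</span>ScoredChromosome<span style="color: #339933;">&gt;</span><span style="color: #009900;">&#40;</span>survivors<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>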
        <span style="color: #000000; font-weight: bold;">while</span> <span style="color: #009900;">&#40;</span>survivors.<span style="color: #006633;">size</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">&lt;</span> desiredPopulationSize<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
            survivors.<span style="color: #006633;">add</span><span style="color: #009900;">&#40;</span>makeOffspring<span style="color: #009900;">&#40;</span>parentPool<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
&nbsp;
        <span style="color: #003399;">Iterator</span><span style="color: #339933;">&lt;</span>ScoredChromosome<span style="color: #339933;">&gt;</span> iter <span style="color: #339933;">=</span> survivors.<span style="color: #006633;">iterator</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #000000; font-weight: bold;">while</span> <span style="color: #009900;">&#40;</span>iter.<span style="color: #006633;">hasNext</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
            ScoredChromosome sc <span style="color: #339933;">=</span> iter.<span style="color: #006633;">next</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
            outKey.<span style="color: #006633;">set</span><span style="color: #009900;">&#40;</span>sc.<span style="color: #006633;">chromosome</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
            outValue.<span style="color: #006633;">set</span><span style="color: #009900;">&#40;</span>sc.<span style="color: #006633;">score</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
            context.<span style="color: #006633;">write</span><span style="color: #009900;">&#40;</span>outKey, outValue<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
    <span style="color: #009900;">&#125;</span></pre></div></div>

<p><strong>Selection:</strong> The first step in fitness proportionate selection is to sort the new members according to their score, ascending. After that, survivors are selected from the population randomly, weighted by their scores (the higher a chromosome&#8217;s score, the more likely it is to survive).</p>
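<p>For readers who haven&#8217;t met fitness proportionate (roulette-wheel) selection before, here is a minimal sketch in Ruby (used for brevity; it is not the Java in the repository). It assumes scores are positive and that higher is fitter. The method names and the <code>[chromosome, score]</code> pair layout are made up for the illustration.</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;"># Hypothetical roulette-wheel selection over [chromosome, score] pairs.
# Each chromosome owns a slice of the wheel proportional to its score.
def select_survivor(scored, rng = Random.new)
  total = scored.sum { |_, score| score }
  spin = rng.rand * total
  running = 0.0
  scored.each do |chromosome, score|
    running += score
    return chromosome if running > spin
  end
  scored.last.first # fallback for floating-point round-off
end

# Keep spinning until the wanted number of distinct survivors is reached.
# Assumes every score is positive, so every chromosome can be picked.
def select_survivors(scored, wanted, rng = Random.new)
  wanted = [wanted, scored.size].min
  survivors = {}
  until survivors.size >= wanted
    survivors[select_survivor(scored, rng)] = true
  end
  survivors.keys
end

scored = [["route-a", 1.0], ["route-b", 3.0], ["route-c", 6.0]]
puts select_survivors(scored, 2, Random.new(11)).inspect</pre></div></div>

<p>Using hash keys to collect survivors plays the same role as the <code>Set</code> in the reducer: a chromosome that is drawn twice is only kept once.</p>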
<p><strong>Repopulation:</strong> Surviving chromosomes are used to breed back up to our original population numbers. I used single crossover and random mutations (maximum of one mutation per chromosome per generation):</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">    <span style="color: #000000; font-weight: bold;">protected</span> ScoredChromosome makeOffspring<span style="color: #009900;">&#40;</span><span style="color: #003399;">ArrayList</span><span style="color: #339933;">&lt;</span>ScoredChromosome<span style="color: #339933;">&gt;</span> parentPool<span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">throws</span> <span style="color: #003399;">InterruptedException</span> <span style="color: #009900;">&#123;</span>
        <span style="color: #000066; font-weight: bold;">int</span> parent1Index <span style="color: #339933;">=</span> random.<span style="color: #006633;">nextInt</span><span style="color: #009900;">&#40;</span>parentPool.<span style="color: #006633;">size</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #000066; font-weight: bold;">int</span> parent2Index <span style="color: #339933;">=</span> parent1Index<span style="color: #339933;">;</span>
        <span style="color: #000000; font-weight: bold;">while</span> <span style="color: #009900;">&#40;</span>parent2Index <span style="color: #339933;">==</span> parent1Index<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
            parent2Index <span style="color: #339933;">=</span> random.<span style="color: #006633;">nextInt</span><span style="color: #009900;">&#40;</span>parentPool.<span style="color: #006633;">size</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
&nbsp;
        ScoredChromosome parent1 <span style="color: #339933;">=</span> parentPool.<span style="color: #006633;">get</span><span style="color: #009900;">&#40;</span>parent1Index<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        ScoredChromosome parent2 <span style="color: #339933;">=</span> parentPool.<span style="color: #006633;">get</span><span style="color: #009900;">&#40;</span>parent2Index<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
        ScoredChromosome offspring <span style="color: #339933;">=</span> crossover<span style="color: #009900;">&#40;</span>parent1, parent2<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span>random.<span style="color: #006633;">nextDouble</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">&lt;</span> mutationChance<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
            mutate<span style="color: #009900;">&#40;</span>offspring<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
&nbsp;
        offspring.<span style="color: #006633;">score</span> <span style="color: #339933;">=</span> scorer.<span style="color: #006633;">score</span><span style="color: #009900;">&#40;</span>offspring.<span style="color: #006633;">chromosome</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
        <span style="color: #000000; font-weight: bold;">return</span> offspring<span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span></pre></div></div>
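<p>The <code>crossover</code> and <code>mutate</code> calls aren&#8217;t shown above. One common way to do single-point crossover on a route, where every city must appear exactly once, is to keep the head of one parent and fill in the remaining cities in the order the other parent visits them. The Ruby sketch below shows that approach with a swap mutation; it is an assumption about how such operators can work, not a copy of the repository&#8217;s Java, and it expects routes of at least two cities.</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;"># Hypothetical single-point crossover for routes (permutations of cities).
# Takes the head of parent1, then appends the cities it is still missing
# in the order parent2 visits them, so the child is a valid route.
def crossover(parent1, parent2, rng = Random.new)
  cut = rng.rand(1...parent1.size)
  head = parent1.take(cut)
  head + (parent2 - head)
end

# Hypothetical swap mutation: exchanges two randomly chosen positions.
# If both picks are the same position, the route comes back unchanged.
def mutate(route, rng = Random.new)
  i = rng.rand(route.size)
  j = rng.rand(route.size)
  mutated = route.dup
  mutated[i], mutated[j] = mutated[j], mutated[i]
  mutated
end

parent1 = %w[A B C D E F]
parent2 = %w[F E D C B A]
child = crossover(parent1, parent2, Random.new(7))
puts child.join(" ")
puts mutate(child, Random.new(7)).join(" ")</pre></div></div>

<p>Both operators return a new array and leave their inputs alone, which keeps the parent pool intact while offspring are being bred.</p>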

<p><strong>Scoring:</strong> Finally, all unscored population members are scored in preparation for the next round.</p>
<p>This process is repeated for as many generations as are specified at the beginning of the run.  (Later, I plan to allow the number of generations to be unspecified, and have the algorithm terminate when the maximum score starts to settle/converge.)</p>
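<p>One simple way to decide that the maximum score has settled is to compare the mean of the best scores over the most recent generations with the mean over the window just before it. The Ruby sketch below does that; the method name, the window size, and the tolerance are all hypothetical and would need tuning against real runs.</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;"># Hypothetical convergence check: compares the mean of the best scores over
# the latest window of generations with the mean over the window before it.
def converged?(max_scores, window: 20, tolerance: 1e-3)
  return false unless max_scores.size >= 2 * window
  recent = max_scores.last(window)
  previous = max_scores[-2 * window, window]
  recent_mean = recent.sum / window.to_f
  previous_mean = previous.sum / window.to_f
  tolerance >= (recent_mean - previous_mean).abs
end

flat = Array.new(40, 23.0)
climbing = (1..40).map { |generation| generation * 0.5 }
puts converged?(flat)      # true: the two window means are identical
puts converged?(climbing)  # false: the recent window mean is 10.0 higher</pre></div></div>

<p>A driver loop could append each generation&#8217;s best score to the list and stop launching jobs once the check returns true.</p>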
<p><strong>Parameters</strong></p>
<ul>
<li>Number of cities</li>
<li>Population size</li>
<li>Proportion of survivors in each generation</li>
<li>Subpopulation size (see above for rationale for dividing up the population, and thoughts on tuning)</li>
<li>Chance of random mutation</li>
<li>Number of generations</li>
</ul>
<p><strong>Results</strong></p>
<p>I&#8217;ve thrown the data for min/max/mean scores for each generation of each run into a <a href="https://docs.google.com/spreadsheet/ccc?key=0AkBz3y6su8VvdHNkMGdFYnlSdmNhZ1hNOW5GTWwtb0E&#038;hl=en_US">Google spreadsheet</a>, and made some simple graphs.</p>
<p>The first run was done with 10,000-member population, 30% survival chance, 500 generations.  There&#8217;s an unmistakable fitness increase over time, and it gets pretty close to the theoretical maximum for my test city (~23.066).</p>
<p>The second run was done with a 100,000-member population and 50% survival chance, 500 generations.  It gets off to a slower start, but there&#8217;s a burst of improvement late in the 500-generation run and it gets pretty close to the optimum.</p>
<p>The difference in early-stage performance of the second run had me kicking myself for changing <i>two</i> inputs.  I switched back to a 30% survival chance but kept the larger population.  This run landed in the same neighborhood as the others, and looked like it still had room to settle.</p>
<p>All three runs share a pattern: a few generations of quick improvement, then a plateau, then an inflection point followed by another burst of quick improvement.  (<a href="http://en.wikipedia.org/wiki/Punctuated_equilibrium">Punctuated equilibrium</a>?)  All runs seem to still be on the climb. I suspect that a moving average over a few generations and/or some statistical tests for convergence would give a clearer picture of what&#8217;s happening with the max trend. The mean score was still definitely on the rise at the end of all three runs.</p>
<p><strong>Notes</strong></p>
<p>Test coverage could be improved, especially with the use of mocks.  I may move to <a href="https://ccp.cloudera.com/display/SUPPORT/Downloads">CDH3</a> and try <a href="http://www.cloudera.com/blog/2009/07/debugging-mapreduce-programs-with-mrunit/">MRUnit</a> to ease test design.  As it stands I&#8217;m taking most of the logic out of the mappers and reducers and just using the overridden map() and reduce() functions to coordinate the calls that do the actual work, largely to make it easier to test the logic in isolation.</p>
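<p>For the sake of speed, I could probably take out some of the square roots and do my scoring using the squares of the distances. One caveat: a tour&#8217;s length is a sum of leg distances, and ranking tours by the sum of squared legs is not guaranteed to match ranking them by the sum of the legs themselves, so this would change the fitness function rather than just rescale it.</p>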
<p>On my 13&#8243; MacBook Pro with Core Duo processor running 2 mappers and 2 reducers, a population of 10,000 took 45-60 seconds to cull, breed, and score. The time per generation didn&#8217;t change appreciably when I went from 10,000 chromosomes per generation to a population of 100,000, which suggests that most of my time is still spent in Hadoop overhead. So even on my little laptop, I could put up much bigger populations (or much bigger maps) and still run 500 generations in about 6 hours.</p>
<p>I haven&#8217;t bothered taking the simulation parameters in via the command line yet; this would be an obvious improvement, and probably the next thing I&#8217;ll do. I was a lot more focused on getting the GA code right and checking that it behaved as expected than on making a usable tool.  I&#8217;ve gotten the itch to generalize and grow this thing, though, so I&#8217;ll need to start making it usable &amp; configurable &amp; scriptable.</p>
<p>Clearly, this is not a general toolkit for GA, or even GA applied to NP-hard problems. (I don&#8217;t like to generalize without at least two cases.) I do see plenty of places where it is generalizable, though, and will start doing so as soon as I pick a second problem to attack. A long-term goal is to factor out a library/API for doing GA on Hadoop.</p>
<p>Genetic algorithms are not one single thing &#8211; there are multiple strategies for culling populations, doing crossover/breeding, etc. A longer-term design goal is to make each of those variable pieces of the algorithm into pluggable strategies selectable from the command line.</p>
<p>There&#8217;s optimization to do all over the place, e.g. with the multiple copies of data kept in the reducer. I chose the data structures I used for ease of implementing the algorithm &#8211; e.g., picking a Set to hold survivors so as to avoid picking up duplicate survivors &#8211; with the intent to get working software quickly, and <a href="http://www.irocon.com/blog/2006/09/07/optimize-last.aspx">optimize last</a>.</p>
<p><strong>The Code</strong></p>
<p>The code, such as it is, is in a <a href="https://github.com/bradheintz/TravSales">public repository on GitHub</a>.  More to come.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.kickasslabs.com/2011/10/10/traveling-salesman-attack/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hadoop Streaming for Rapid Prototyping of Distributed Algorithms</title>
		<link>http://www.kickasslabs.com/2009/01/04/hadoop-streaming-for-rapid-prototyping-of-distributed-algorithms/</link>
		<comments>http://www.kickasslabs.com/2009/01/04/hadoop-streaming-for-rapid-prototyping-of-distributed-algorithms/#comments</comments>
		<pubDate>Sun, 04 Jan 2009 22:44:06 +0000</pubDate>
		<dc:creator>Brad</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[distributed computing]]></category>
		<category><![CDATA[ga]]></category>
		<category><![CDATA[genetic algorithms]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hadoop streaming]]></category>
		<category><![CDATA[hadoop streaming tutorial]]></category>
		<category><![CDATA[hadoop tutorial]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://www.kickasslabs.com/?p=132</guid>
		<description><![CDATA[Note: This article assumes that you know a little about MapReduce, or that if you don&#8217;t, you might skim the enclosed links so you know what I&#8217;m talking about when I get to the examples, or check out the Hadoop &#8230; <a href="http://www.kickasslabs.com/2009/01/04/hadoop-streaming-for-rapid-prototyping-of-distributed-algorithms/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><i><b>Note:</b> This article assumes that you know a little about MapReduce, or that if you don&#8217;t, you might skim the enclosed links so you know what I&#8217;m talking about when I get to the examples, or check out the <a href="http://hadoop.apache.org/core/docs/current/mapred_tutorial.html" title="Hadoop tutorial" target="hadooptutorial">Hadoop Tutorial</a>.  It also assumes that you have Hadoop set up &#8211; either clustered or pseudo-clustered &#8211; if you&#8217;re going to run the examples.  Or you can just read along.</i></p>
<p><a href="http://hadoop.apache.org/" title="Hadoop" target="hadoop">Hadoop</a> is a framework (written in Java) that supports distributed computing &#8211; specifically Google&#8217;s <a href="http://labs.google.com/papers/mapreduce.html" title="MapReduce" target="mapreduce">MapReduce</a> algorithm.  It also comprises <a href="http://hadoop.apache.org/core/docs/current/hdfs_design.html" title="HDFS" target="hdfs">HDFS</a> (the Hadoop Distributed File System), which allows you to redundantly store large quantities of data across multiple disconnected disks as if they were a single storage unit.  I&#8217;ve used Hadoop at two jobs and at home, and it <i>rocks.</i></p>
<p>It comes with a problem, though, which you may spot in the first paragraph:  It&#8217;s written in <i>Java.</i>  Now, don&#8217;t get me wrong &#8211; Java&#8217;s a great language.  But developing software in Java with the most commonly used tools (Eclipse, Ant/Maven, &amp;c) is a monumental pain (and you&#8217;re hearing this from an old Visual C++ hand).  I&#8217;m the first to admit that I should shore up my skills with the Java tools, but even if I were better at it, getting a non-trivial Java project from zero to first runnable build is still about as complex as the invasion of Normandy, and the proliferation of XML config files is just inhumane.</p>
<p>Still, the performance and solidity of Java make it the right choice for a production Hadoop project.  But what if you just want to kick around an idea or test an algorithm?  Wouldn&#8217;t it be nice if you could do that in 2 hours instead of a day and a half?  Wouldn&#8217;t it be nicer still if you could do it in your language of choice?</p>
<p><span id="more-132"></span></p>
<p><a href="http://hadoop.apache.org/core/docs/r0.19.0/streaming.html" title="Hadoop Streaming" target="hadoopstreaming">Hadoop Streaming</a> has made me a very happy man.  Any language that can take data from <code>stdin</code> and give it to <code>stdout</code> can be used to make Hadoop MapReduce jobs.</p>
<p>So of course, I&#8217;m doing mine in Ruby.</p>
<p>As an example, I&#8217;ll transliterate the <strike>trivial</strike> canonical Hadoop word counting example into Ruby.  This example involves taking a large text, and counting the number of instances of each word it contains.  The mapper takes in rows of text, and emits key-value pairs where the key is a word and the value is the number of times that word has occurred in a given row of text.  It would look something like this:</p>
<p><b>word_count_mapper.rb:</b></p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;"><span style="color:#008000; font-style:italic;">#!/usr/bin/env ruby</span>
&nbsp;
STDIN.<span style="color:#9900CC;">each_line</span> <span style="color:#9966CC; font-weight:bold;">do</span> <span style="color:#006600; font-weight:bold;">|</span>line<span style="color:#006600; font-weight:bold;">|</span>
  word_count = <span style="color:#006600; font-weight:bold;">&#123;</span><span style="color:#006600; font-weight:bold;">&#125;</span>
  line.<span style="color:#CC0066; font-weight:bold;">split</span>.<span style="color:#9900CC;">each</span> <span style="color:#9966CC; font-weight:bold;">do</span> <span style="color:#006600; font-weight:bold;">|</span>word<span style="color:#006600; font-weight:bold;">|</span>
    word_count<span style="color:#006600; font-weight:bold;">&#91;</span>word<span style="color:#006600; font-weight:bold;">&#93;</span> <span style="color:#006600; font-weight:bold;">||</span>= <span style="color:#006666;">0</span>
    word_count<span style="color:#006600; font-weight:bold;">&#91;</span>word<span style="color:#006600; font-weight:bold;">&#93;</span> <span style="color:#006600; font-weight:bold;">+</span>= <span style="color:#006666;">1</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  word_count.<span style="color:#9900CC;">each</span> <span style="color:#9966CC; font-weight:bold;">do</span> <span style="color:#006600; font-weight:bold;">|</span>k,v<span style="color:#006600; font-weight:bold;">|</span>
    <span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;#{k}<span style="color:#000099;">\t</span>#{v}&quot;</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
<span style="color:#9966CC; font-weight:bold;">end</span></pre></div></div>

<p>Already, we&#8217;re at our first non-trivial design decision, and it&#8217;s due to a significant design difference between Hadoop Streaming and regular Hadoop jobs:  When you&#8217;re streaming, a single instance of your script will handle many pieces of input &#8211; it&#8217;s a bit like a filter-style Unix command (e.g., <code>grep</code>).  A regular Hadoop mapping job is more firmly seated in the functional paradigm &#8211; each call to your mapper gets one line of data, with no knowledge of others.  Because of this, <a href="http://www.raja-gopal.com/?p=42" title="Raja Gopal on Hadoop Streaming with Ruby" target="gopal">some tutorials</a> will suggest shortcuts such as keeping the hash-based accumulator outside the <code>STDIN</code> loop, and emitting all your rows of <code>[word, count]</code> pairs at the end of processing.  There are advantages to this, but I&#8217;m going to stick with my code above, for two reasons:</p>
<ol>
<li>If you maintain global state in your mapper, you&#8217;ll incur significant rework if you try to port the code to a more purely functional Java Hadoop mapper.</li>
<li>This could get damagingly memory-intensive on large data sets.  Going functional and streaming everything straight to HDFS will add some storage cost, but for large jobs I&#8217;m much more worried about RAM than disk.</li>
</ol>
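<p>For contrast, the accumulate-everything shortcut those tutorials describe might look roughly like the sketch below. The method name <code>global_count_rows</code> is hypothetical, and the sample lines stand in for what a real streaming mapper would read from <code>STDIN</code>.</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;"># Hypothetical "global accumulator" mapper logic: one hash spans every
# input line, and rows are only produced once all input has been read.
def global_count_rows(lines)
  counts = Hash.new(0)
  lines.each do |line|
    line.split.each { |word| counts[word] += 1 }
  end
  counts.map { |word, count| "#{word}\t#{count}" }
end

# In a real streaming mapper the lines would come from STDIN.each_line.
puts global_count_rows(["a b a", "b c"])</pre></div></div>

<p>It emits fewer rows than the per-line version, but the hash grows with the vocabulary of the whole input split, which is exactly the memory concern in point 2.</p>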
<p>The purpose of the reduce step is to bring together all the per-line counts for each word, and reduce them to a single, global count per word.  This is facilitated by the fact that Hadoop orders all the records by key before handing them to the reducer (by default, the key is everything before the first tab character &#8211; in this case, the word being counted).  My version of the reducer looks like this:</p>
<p><b>word_count_reducer.rb:</b></p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;"><span style="color:#008000; font-style:italic;">#!/usr/bin/env ruby</span>
&nbsp;
current_word = <span style="color:#0000FF; font-weight:bold;">nil</span>
current_count = <span style="color:#006666;">0</span>
STDIN.<span style="color:#9900CC;">each_line</span> <span style="color:#9966CC; font-weight:bold;">do</span> <span style="color:#006600; font-weight:bold;">|</span>line<span style="color:#006600; font-weight:bold;">|</span>
  word, count = line.<span style="color:#9900CC;">strip</span>.<span style="color:#CC0066; font-weight:bold;">split</span>
  <span style="color:#9966CC; font-weight:bold;">if</span> word != current_word
    <span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;#{current_word}<span style="color:#000099;">\t</span>#{current_count}&quot;</span> <span style="color:#9966CC; font-weight:bold;">unless</span> current_word.<span style="color:#0000FF; font-weight:bold;">nil</span>?
    current_word = word
    current_count = <span style="color:#006666;">0</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  current_count <span style="color:#006600; font-weight:bold;">+</span>= count.<span style="color:#9900CC;">to_i</span>
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
<span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;#{current_word}<span style="color:#000099;">\t</span>#{current_count}&quot;</span> <span style="color:#9966CC; font-weight:bold;">unless</span> current_word.<span style="color:#0000FF; font-weight:bold;">nil</span>?</pre></div></div>

<p>The whole business with <code>current_count</code> is necessitated by another Hadoop Streaming-vs.-Hadoop quirk:  In a regular Hadoop reducer, one call to the reducing function gets a collection of all rows associated with a particular key.  When you&#8217;re streaming, you&#8217;re getting only one row at a time, but you&#8217;re guaranteed that the rows will be ordered on keys, and that the rows for a particular key will not be split up across reducer instances.  Again, the temptation might be to place a single accumulator outside the main <code>STDIN</code> loop, but I&#8217;m going to stick with my strategy of lean runtime footprint and going straight to storage.  The price for that is keeping track of which key you&#8217;re on.</p>
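<p>To see the ordering guarantee at work without a cluster, the map, sort, and reduce steps can be simulated in a single Ruby process. This is only a sketch: <code>map_line</code> and <code>reduce_rows</code> are hypothetical method versions of the two scripts above, the sample lines are made up, and the <code>sort_by</code> call stands in for Hadoop&#8217;s sort between the two phases.</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;"># Per-line mapper logic, mirroring word_count_mapper.rb.
def map_line(line)
  counts = Hash.new(0)
  line.split.each { |word| counts[word] += 1 }
  counts.map { |word, count| "#{word}\t#{count}" }
end

# Streaming-style reducer logic, mirroring word_count_reducer.rb: rows
# arrive one at a time, and a change of key closes the running total.
def reduce_rows(sorted_rows)
  totals = []
  current_word = nil
  current_count = 0
  sorted_rows.each do |row|
    word, count = row.split("\t")
    if word != current_word
      totals.push([current_word, current_count]) unless current_word.nil?
      current_word = word
      current_count = 0
    end
    current_count += count.to_i
  end
  totals.push([current_word, current_count]) unless current_word.nil?
  totals
end

lines = ["the quick fox", "the lazy dog the end"]
# Stand-in for the shuffle: order every mapper row by its key.
shuffled = lines.flat_map { |line| map_line(line) }
                .sort_by { |row| row.split("\t").first }
reduce_rows(shuffled).each { |word, count| puts "#{word}\t#{count}" }</pre></div></div>

<p>Skipping the <code>sort_by</code> step shows why the guarantee matters: rows for the same word that are not adjacent would be emitted as separate partial totals.</p>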
<p>I have placed the mapper and reducer in a folder called <code>scripts</code> in my <code>$HADOOP_HOME</code> folder.  To run them against an existing file in HDFS, I do:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">bin<span style="color: #000000; font-weight: bold;">/</span>hadoop jar contrib<span style="color: #000000; font-weight: bold;">/</span>streaming<span style="color: #000000; font-weight: bold;">/</span>hadoop-0.19.0-streaming.jar \
  <span style="color: #660033;">-mapper</span> scripts<span style="color: #000000; font-weight: bold;">/</span>word_count_mapper.rb <span style="color: #660033;">-reducer</span> scripts<span style="color: #000000; font-weight: bold;">/</span>word_count_reducer.rb \
  <span style="color: #660033;">-file</span> <span style="color: #000000; font-weight: bold;">`</span><span style="color: #7a0874; font-weight: bold;">pwd</span><span style="color: #000000; font-weight: bold;">`/</span>scripts<span style="color: #000000; font-weight: bold;">/</span>word_count_mapper.rb \
  <span style="color: #660033;">-file</span> <span style="color: #000000; font-weight: bold;">`</span><span style="color: #7a0874; font-weight: bold;">pwd</span><span style="color: #000000; font-weight: bold;">`/</span>scripts<span style="color: #000000; font-weight: bold;">/</span>word_count_reducer.rb \
  <span style="color: #660033;">-input</span> texts<span style="color: #000000; font-weight: bold;">/</span>my_text <span style="color: #660033;">-output</span> word_counts</pre></div></div>

<p>&#8230;and results may be extracted from the <code>word_counts</code> folder in HDFS.</p>
<p>This is all there is to the trivial example.  There&#8217;s more, of course &#8211; I&#8217;m currently using this technology to prototype a much more ambitious project that I will eventually port to Java (more on that later).  While I&#8217;m finding that there is (at least subjectively) some performance overhead associated with streaming that I don&#8217;t see otherwise, it&#8217;s proving incredibly useful and speedy for testing out algorithms and designs.</p>
<p>Are you working with Hadoop Streaming?  Do you have questions?  Drop a line in the comments &#8211; I&#8217;d love to swap learnings.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.kickasslabs.com/2009/01/04/hadoop-streaming-for-rapid-prototyping-of-distributed-algorithms/feed/</wfw:commentRss>
		<slash:comments>20</slash:comments>
		</item>
	</channel>
</rss>

