<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Hadoop Streaming for Rapid Prototyping of Distributed Algorithms</title>
	<atom:link href="http://www.kickasslabs.com/2009/01/04/hadoop-streaming-for-rapid-prototyping-of-distributed-algorithms/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.kickasslabs.com/2009/01/04/hadoop-streaming-for-rapid-prototyping-of-distributed-algorithms/</link>
	<description>We &#9829; code.</description>
	<lastBuildDate>Thu, 26 Jan 2012 05:50:23 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>By: Brad</title>
		<link>http://www.kickasslabs.com/2009/01/04/hadoop-streaming-for-rapid-prototyping-of-distributed-algorithms/comment-page-1/#comment-670</link>
		<dc:creator>Brad</dc:creator>
		<pubDate>Thu, 26 Jan 2012 05:50:23 +0000</pubDate>
		<guid isPermaLink="false">http://www.kickasslabs.com/?p=132#comment-670</guid>
		<description>Zahide: You&#039;re correct about the reducer.  One reducer instance can receive many keys, and you must store some state to detect when your reducer gets a new key.  However, all values for a single key will pass through a single reducer instance.  (It wouldn&#039;t be MapReduce if that weren&#039;t the case.)

The times I&#039;ve tried it, Streaming was slower.  I was also using a 1.8 series Ruby, which is not the fastest of interpreted languages.  JRuby or another more performant scripting language will give you better results; I happened to use Ruby because I wanted to prototype an algorithm quickly.

The fact that Hadoop Streaming might not be as fast as writing your own Hadoop framework .jar files doesn&#039;t make it a bad thing - if you&#039;re doing something that doesn&#039;t require extreme performance, it might be appropriate to write MR jobs in the language of your choice, either to re-use working code or just to save on developer time (which tends to be more expensive than cluster time, for small and medium clusters).</description>
		<content:encoded><![CDATA[<p>Zahide: You&#8217;re correct about the reducer.  One reducer instance can receive many keys, and you must store some state to detect when your reducer gets a new key.  However, all values for a single key will pass through a single reducer instance.  (It wouldn&#8217;t be MapReduce if that weren&#8217;t the case.)</p>
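<p>A minimal sketch of that state tracking, written in Python rather than the Ruby used in the post. It assumes the streaming default of tab-separated key/value lines that arrive sorted by key, and it assumes the job simply sums integer values; the names <code>reduce_stream</code> and <code>main</code> are hypothetical.</p>

```python
import sys
from typing import Iterable, Iterator, Tuple


def reduce_stream(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Sum integer values per key from sorted "key<TAB>value" lines."""
    current_key = None  # the state that lets us notice a key change
    total = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, _, value = line.partition("\t")
        if key != current_key:
            # A new key has started: emit the finished key, then reset.
            if current_key is not None:
                yield current_key, total
            current_key = key
            total = 0
        total += int(value)
    # Emit the last key, which no later key change will flush.
    if current_key is not None:
        yield current_key, total


def main() -> None:
    # A real reducer script would call main() at the bottom of the file.
    for key, total in reduce_stream(sys.stdin):
        print(f"{key}\t{total}")
```

<p>Because the input is sorted, all lines for one key are adjacent, so remembering only the previous key is enough; the final flush after the loop is the step that is easiest to forget.</p>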
<p>The times I&#8217;ve tried it, Streaming was slower.  I was also using a 1.8 series Ruby, which is not the fastest of interpreted languages.  JRuby or another more performant scripting language will give you better results; I happened to use Ruby because I wanted to prototype an algorithm quickly.</p>
<p>The fact that Hadoop Streaming might not be as fast as writing your own Hadoop framework .jar files doesn&#8217;t make it a bad thing &#8211; if you&#8217;re doing something that doesn&#8217;t require extreme performance, it might be appropriate to write MR jobs in the language of your choice, either to re-use working code or just to save on developer time (which tends to be more expensive than cluster time, for small and medium clusters).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Zahide</title>
		<link>http://www.kickasslabs.com/2009/01/04/hadoop-streaming-for-rapid-prototyping-of-distributed-algorithms/comment-page-1/#comment-669</link>
		<dc:creator>Zahide</dc:creator>
		<pubDate>Thu, 26 Jan 2012 03:56:48 +0000</pubDate>
		<guid isPermaLink="false">http://www.kickasslabs.com/?p=132#comment-669</guid>
		<description>Thanks for the useful suggestions Brad.
I have two questions about streaming.
First, unlike the Java Reducer (in which every Reducer gets the values associated with a single key), in streaming one reducer might get values for multiple keys. That&#039;s something we need to take care of when we write the reducer code. Is it possible that the values for a single key might be spread over different reducers?

Second, are you saying that in most cases, streaming is NOT as efficient as MapReduce written in Java?

Thanks</description>
		<content:encoded><![CDATA[<p>Thanks for the useful suggestions Brad.<br />
I have two questions about streaming.<br />
First, unlike the Java Reducer (in which every Reducer gets the values associated with a single key), in streaming one reducer might get values for multiple keys. That&#8217;s something we need to take care of when we write the reducer code. Is it possible that the values for a single key might be spread over different reducers?</p>
<p>Second, are you saying that in most cases, streaming is NOT as efficient as MapReduce written in Java?</p>
<p>Thanks</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Brad</title>
		<link>http://www.kickasslabs.com/2009/01/04/hadoop-streaming-for-rapid-prototyping-of-distributed-algorithms/comment-page-1/#comment-454</link>
		<dc:creator>Brad</dc:creator>
		<pubDate>Thu, 08 Sep 2011 19:24:23 +0000</pubDate>
		<guid isPermaLink="false">http://www.kickasslabs.com/?p=132#comment-454</guid>
		<description>Pig/Hive is one option, but for raw performance, it&#039;s tough to beat a custom-written Hadoop job in Java.  This entails writing Java classes that inherit from Hadoop&#039;s Mapper and Reducer classes, and packaging them as a .jar that gets distributed to your cluster.

Naturally, speed is not the only concern - and you can always buy/rent more machines to get more speed.  A lot depends on your team&#039;s skills and what you&#039;re paying for developer time vs. machine time.</description>
		<content:encoded><![CDATA[<p>Pig/Hive is one option, but for raw performance, it&#8217;s tough to beat a custom-written Hadoop job in Java.  This entails writing Java classes that inherit from Hadoop&#8217;s Mapper and Reducer classes, and packaging them as a .jar that gets distributed to your cluster.</p>
<p>Naturally, speed is not the only concern &#8211; and you can always buy/rent more machines to get more speed.  A lot depends on your team&#8217;s skills and what you&#8217;re paying for developer time vs. machine time.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: jo</title>
		<link>http://www.kickasslabs.com/2009/01/04/hadoop-streaming-for-rapid-prototyping-of-distributed-algorithms/comment-page-1/#comment-447</link>
		<dc:creator>jo</dc:creator>
		<pubDate>Tue, 06 Sep 2011 14:30:05 +0000</pubDate>
		<guid isPermaLink="false">http://www.kickasslabs.com/?p=132#comment-447</guid>
		<description>Thanks, Brad; I will check out JRuby. Regarding #2, we have a Hadoop cluster with enough machines, so that shouldn&#039;t be a big problem. In point #3 (Switch languages), you mention &quot;regular Hadoop jobs&quot;; I hope you meant using something like Pig/Hive.</description>
		<content:encoded><![CDATA[<p>Thanks, Brad; I will check out JRuby. Regarding #2, we have a Hadoop cluster with enough machines, so that shouldn&#8217;t be a big problem. In point #3 (Switch languages), you mention &#8220;regular Hadoop jobs&#8221;; I hope you meant using something like Pig/Hive.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Brad</title>
		<link>http://www.kickasslabs.com/2009/01/04/hadoop-streaming-for-rapid-prototyping-of-distributed-algorithms/comment-page-1/#comment-438</link>
		<dc:creator>Brad</dc:creator>
		<pubDate>Sun, 04 Sep 2011 14:00:41 +0000</pubDate>
		<guid isPermaLink="false">http://www.kickasslabs.com/?p=132#comment-438</guid>
		<description>Jo -

There was a significant performance hit; part of it was Ruby, part of it was just the fact of leaving the JVM and pushing all this data across processes.

There are three things you might do about it:

1) If this is performing as well as you need it to, don&#039;t sweat it.  (But do keep an eye on your expected data throughput growth.)

2) Add more machines.  That&#039;s what the horizontal scalability is for, after all.

3) Switch languages.  You could move up to Ruby 1.9, or JRuby, or Java or C/C++ (though if you&#039;re going to do the latter, might as well go with regular Hadoop jobs).  If you&#039;re going to do this, do some profiling to determine that you&#039;re going after the correct bottleneck.  If your job chain is long - that is, if you have many scripts that would need replacing - pick one, profile it, swap it with an equivalent in the language you think will do better and profile that.

If you do wind up switching languages, I&#039;d recommend taking a deep look at JRuby.  The last time I played with JRuby in a Hadoop context was a couple of years ago, but I was seeing order-of-magnitude improvements in speed (when I put the -server flag in my #! line).

Good luck!</description>
		<content:encoded><![CDATA[<p>Jo -</p>
<p>There was a significant performance hit; part of it was Ruby, part of it was just the fact of leaving the JVM and pushing all this data across processes.</p>
<p>There are three things you might do about it:</p>
<p>1) If this is performing as well as you need it to, don&#8217;t sweat it.  (But do keep an eye on your expected data throughput growth.)</p>
<p>2) Add more machines.  That&#8217;s what the horizontal scalability is for, after all.</p>
<p>3) Switch languages.  You could move up to Ruby 1.9, or JRuby, or Java or C/C++ (though if you&#8217;re going to do the latter, might as well go with regular Hadoop jobs).  If you&#8217;re going to do this, do some profiling to determine that you&#8217;re going after the correct bottleneck.  If your job chain is long &#8211; that is, if you have many scripts that would need replacing &#8211; pick one, profile it, swap it with an equivalent in the language you think will do better and profile that.</p>
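<p>A rough sketch of that swap-and-compare step, in Python. It only measures wall-clock time for one script against a replacement on the same sample input, so it is a coarse first check rather than a real profiler. The helper names <code>time_command</code> and <code>compare_scripts</code> are hypothetical, and the commands and sample data are whatever your own job uses.</p>

```python
import subprocess
import time
from typing import Dict, List, Tuple


def time_command(cmd: List[str], sample: bytes) -> Tuple[float, bytes]:
    """Run cmd with sample on stdin; return (elapsed seconds, stdout)."""
    start = time.perf_counter()
    # check=True raises CalledProcessError if the script exits non-zero.
    result = subprocess.run(cmd, input=sample, stdout=subprocess.PIPE, check=True)
    elapsed = time.perf_counter() - start
    return elapsed, result.stdout


def compare_scripts(old_cmd: List[str], new_cmd: List[str], sample: bytes) -> Dict[str, object]:
    """Time two candidate scripts on the same input and compare their output."""
    old_seconds, old_out = time_command(old_cmd, sample)
    new_seconds, new_out = time_command(new_cmd, sample)
    return {
        "same_output": old_out == new_out,  # the replacement must be equivalent
        "old_seconds": old_seconds,
        "new_seconds": new_seconds,
    }
```

<p>The timings include interpreter startup, so use a sample large enough that startup cost does not dominate, and repeat the run a few times before trusting the numbers.</p>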
<p>If you do wind up switching languages, I&#8217;d recommend taking a deep look at JRuby.  The last time I played with JRuby in a Hadoop context was a couple of years ago, but I was seeing order-of-magnitude improvements in speed (when I put the -server flag in my #! line).</p>
<p>Good luck!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: jo</title>
		<link>http://www.kickasslabs.com/2009/01/04/hadoop-streaming-for-rapid-prototyping-of-distributed-algorithms/comment-page-1/#comment-434</link>
		<dc:creator>jo</dc:creator>
		<pubDate>Sun, 04 Sep 2011 03:33:33 +0000</pubDate>
		<guid isPermaLink="false">http://www.kickasslabs.com/?p=132#comment-434</guid>
		<description>Hi Brad,
Is there a real performance hit when using Hadoop Streaming? We&#039;re planning to use this in a production environment and wanted to know whether processing as a whole would be slower if MR is done in Ruby vs. Java.
Thanks,
jo</description>
		<content:encoded><![CDATA[<p>Hi Brad,<br />
Is there a real performance hit when using Hadoop Streaming? We&#8217;re planning to use this in a production environment and wanted to know whether processing as a whole would be slower if MR is done in Ruby vs. Java.<br />
Thanks,<br />
jo</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Brad</title>
		<link>http://www.kickasslabs.com/2009/01/04/hadoop-streaming-for-rapid-prototyping-of-distributed-algorithms/comment-page-1/#comment-354</link>
		<dc:creator>Brad</dc:creator>
		<pubDate>Tue, 22 Mar 2011 14:39:16 +0000</pubDate>
		<guid isPermaLink="false">http://www.kickasslabs.com/?p=132#comment-354</guid>
		<description>Vic - I&#039;m happy I could help.</description>
		<content:encoded><![CDATA[<p>Vic &#8211; I&#8217;m happy I could help.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: vic</title>
		<link>http://www.kickasslabs.com/2009/01/04/hadoop-streaming-for-rapid-prototyping-of-distributed-algorithms/comment-page-1/#comment-353</link>
		<dc:creator>vic</dc:creator>
		<pubDate>Tue, 22 Mar 2011 08:19:40 +0000</pubDate>
		<guid isPermaLink="false">http://www.kickasslabs.com/?p=132#comment-353</guid>
		<description>Thanks a lot, you just saved me many hours of hadoop reading :)</description>
		<content:encoded><![CDATA[<p>Thanks a lot, you just saved me many hours of hadoop reading <img src='http://www.kickasslabs.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: omiplirlChorn</title>
		<link>http://www.kickasslabs.com/2009/01/04/hadoop-streaming-for-rapid-prototyping-of-distributed-algorithms/comment-page-1/#comment-347</link>
		<dc:creator>omiplirlChorn</dc:creator>
		<pubDate>Mon, 30 Aug 2010 19:05:55 +0000</pubDate>
		<guid isPermaLink="false">http://www.kickasslabs.com/?p=132#comment-347</guid>
		<description>why not...</description>
		<content:encoded><![CDATA[<p>why not&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Carmelo</title>
		<link>http://www.kickasslabs.com/2009/01/04/hadoop-streaming-for-rapid-prototyping-of-distributed-algorithms/comment-page-1/#comment-201</link>
		<dc:creator>Carmelo</dc:creator>
		<pubDate>Wed, 10 Mar 2010 15:15:45 +0000</pubDate>
		<guid isPermaLink="false">http://www.kickasslabs.com/?p=132#comment-201</guid>
		<description>This is the main reason I read www.kickasslabs.com. Fascinating posts.</description>
		<content:encoded><![CDATA[<p>This is the main reason I read <a href="http://www.kickasslabs.com" rel="nofollow">http://www.kickasslabs.com</a>. Fascinating posts.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

