<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>pero on anything</title>
	<atom:link href="http://aprilmayjune.org/feed/" rel="self" type="application/rss+xml" />
	<link>http://aprilmayjune.org</link>
	<description></description>
	<lastBuildDate>Mon, 16 Apr 2012 23:12:34 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>Change Keyboard Shortcuts in Gnome-Shell (Gnome 3)</title>
		<link>http://aprilmayjune.org/2012/04/17/change-keyboard-shortcuts-in-gnome-shell-gnome-3/</link>
		<comments>http://aprilmayjune.org/2012/04/17/change-keyboard-shortcuts-in-gnome-shell-gnome-3/#comments</comments>
		<pubDate>Mon, 16 Apr 2012 23:12:34 +0000</pubDate>
		<dc:creator>pero</dc:creator>
				<category><![CDATA[linux]]></category>

		<guid isPermaLink="false">http://aprilmayjune.org/?p=109</guid>
		<description><![CDATA[I really love Gnome 3 and its Shell, but I almost went nuts on this. It really took me a while to finally figure this out. To make a really short blog post even shorter: Use dconf-editor (from package dconf-tools) and go to org.gnome.desktop.wm.keybindings. There you have it. Why would I want to change keyboard [...]]]></description>
			<content:encoded><![CDATA[<p>I really love Gnome 3 and its Shell, but I almost went nuts on this. It really took me a while to finally figure this out. To make a really short blog post even shorter:</p>
<p>Use <code>dconf-editor</code> (from package <code>dconf-tools</code>) and go to <code>org.gnome.desktop.wm.keybindings</code>. There you have it.</p>
<p>Why would I want to change keyboard shortcuts? Some of them (namely <code>Alt+F1</code>, <code>Alt+F7</code> and <code>Alt+F8</code>) clash with <a href="http://www.jetbrains.com/idea/">IntelliJ IDEA&#8217;s</a> shortcuts. That&#8217;s why! <img src='http://aprilmayjune.org/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://aprilmayjune.org/2012/04/17/change-keyboard-shortcuts-in-gnome-shell-gnome-3/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Back online</title>
		<link>http://aprilmayjune.org/2011/12/19/back-online/</link>
		<comments>http://aprilmayjune.org/2011/12/19/back-online/#comments</comments>
		<pubDate>Mon, 19 Dec 2011 21:48:30 +0000</pubDate>
		<dc:creator>pero</dc:creator>
				<category><![CDATA[this and that]]></category>

		<guid isPermaLink="false">http://aprilmayjune.org/?p=102</guid>
		<description><![CDATA[After a few sudden server deaths, this blog is back online. Which does not necessarily mean that I am back online writing new posts. That said, there is a lot on my mind that might be worth blogging about, so stay tuned.]]></description>
			<content:encoded><![CDATA[<p>After a few sudden server deaths, this blog is back online. Which does not necessarily mean that I am back online writing new posts. <img src='http://aprilmayjune.org/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  That said, there is a lot on my mind that might be worth blogging about, so stay tuned.</p>
]]></content:encoded>
			<wfw:commentRss>http://aprilmayjune.org/2011/12/19/back-online/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Automated performance degradation tests with JUnit4</title>
		<link>http://aprilmayjune.org/2010/09/11/automated-performance-degradation-tests-with-junit4/</link>
		<comments>http://aprilmayjune.org/2010/09/11/automated-performance-degradation-tests-with-junit4/#comments</comments>
		<pubDate>Sat, 11 Sep 2010 21:38:14 +0000</pubDate>
		<dc:creator>pero</dc:creator>
				<category><![CDATA[java]]></category>

		<guid isPermaLink="false">http://pero.blogs.aprilmayjune.org/2010/09/11/automated-performance-degradation-tests-with-junit4/</guid>
		<description><![CDATA[There are several extensions to JUnit that provide means to test performance like JUnitPerf or p-unit. But it is hard to formulate the right assertions. What if the test runs on a beefier machine or in another environment? Did performance degrade? I just want to answer a simple question: Did performance degrade? (And if so, when?) [...]]]></description>
			<content:encoded><![CDATA[<p>There are several extensions to <a href="http://www.junit.org/">JUnit</a> that provide means to test performance like <a href="http://www.clarkware.com/software/JUnitPerf.html">JUnitPerf</a> or <a href="http://p-unit.sourceforge.net/">p-unit</a>.</p>
<p>But it is hard to formulate the right assertions. What if the test runs on a beefier machine or in another environment?</p>
<h2>Did performance degrade?</h2>
<p>I just want to answer a simple question: <b>Did performance degrade?</b> (And if so, when?)</p>
<p>Lately I came across <a href="http://labs.carrotsearch.com/junit-benchmarks.html">junit-benchmarks</a>. This framework allows you to <a href="http://labs.carrotsearch.com/junit-benchmarks-tutorial.html#results-visualization">visualize the performance history</a> of select unit tests.<br />
This is great, but I want my tests to fail if performance degraded by some factor. A test run should fail if it performs <code>X</code>% worse than the run before or <code>Y</code>% worse than the average of the last <code>Z</code> runs.</p>
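<p>To make the rule concrete, here is a minimal sketch of such a check. It is not the code of junit-benchmarks or of the fork described below; the class <code>DegradationCheck</code>, its method names and the idea of passing the history as an array of run times are assumptions made for illustration.</p>

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of junit-benchmarks. */
final class DegradationCheck {

    /** True if current is more than maxDiff (0.2 means 20%) slower than baseline. */
    static boolean worseThan(double current, double baseline, double maxDiff) {
        return current > baseline * (1.0 + maxDiff);
    }

    /**
     * history holds the run times of earlier runs, oldest first.
     * maxDiffToLast is X, maxDiffToAverage is Y and averageOverRuns is Z from the text above.
     */
    static boolean degraded(double[] history, double current,
                            double maxDiffToLast, double maxDiffToAverage, int averageOverRuns) {
        if (history.length == 0) {
            return false; // first run: nothing to compare against
        }
        double last = history[history.length - 1];
        // Average over the last Z runs, or over all runs if fewer than Z exist.
        int from = Math.max(0, history.length - averageOverRuns);
        double average = Arrays.stream(history, from, history.length).average().orElse(last);
        return worseThan(current, last, maxDiffToLast)
                || worseThan(current, average, maxDiffToAverage);
    }
}
```

<p>With a history of three runs at 100 ms each, a run of 105 ms passes a 20%/10% check, while a run of 115 ms fails because it is more than 10% slower than the average.</p>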
<h2>Extending junit-benchmarks</h2>
<p>So I forked junit-benchmarks and created a <a href="http://github.com/optivo/junit-benchmarks">project over at github</a>.</p>
<p>Now you can assert that performance did not degrade:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">class</span> MyTest <span style="color: #009900;">&#123;</span>
&nbsp;
    <span style="color: #008000; font-style: italic; font-weight: bold;">/** This is needed to enable junit-benchmark. */</span>
    @Rule
    <span style="color: #000000; font-weight: bold;">public</span> MethodRule benchmarkRun <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> BenchmarkRule<span style="color: #009900;">&#40;</span>h2consumer<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #008000; font-style: italic; font-weight: bold;">/**
     * This test fails, if it performs 20% worse than the last test run or 10% worse than the average of the last 10 runs.
     */</span>
    @Test
    @BenchmarkOptions<span style="color: #009900;">&#40;</span>perfDiffToLastRun <span style="color: #339933;">=</span> 0.2d, perfDiffToAverage <span style="color: #339933;">=</span> 0.1d, perfAverageOverRuns <span style="color: #339933;">=</span> <span style="color: #cc66cc;">10</span><span style="color: #009900;">&#41;</span>
    <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000066; font-weight: bold;">void</span> test<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">throws</span> <span style="color: #003399;">Exception</span> <span style="color: #009900;">&#123;</span>
        <span style="color: #666666; font-style: italic;">// Do something ...</span>
    <span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>Note: Consumer H2 must be enabled. See <a href="http://labs.carrotsearch.com/junit-benchmarks-tutorial.html#persistent-benchmark-history">junit-benchmark documentation</a>.<br />
As for all junit-benchmark settings, you can set <a href="http://github.com/optivo/junit-benchmarks/blob/master/src/main/java/com/carrotsearch/junitbenchmarks/BenchmarkOptionsSystemProperties.java">defaults via JVM command line options</a>.</p>
<h2>But new features might cost performance</h2>
<p>The approach above works if performance assumptions never change. But they do. So, performance tests will fail after adding a new feature that decreases performance on purpose. What should we do? Delete all the performance history? You could do that. Another way is to tell junit-benchmark that it should ignore all performance data up to a certain test run:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #008000; font-style: italic; font-weight: bold;">/**
 * This test fails, if it performs 20% worse than the last test run or 10% worse than the average of the last 10 runs.
 * But it ignores all runs up to run 96.
 */</span>
@Test
@BenchmarkOptions<span style="color: #009900;">&#40;</span>perfDiffToLastRun <span style="color: #339933;">=</span> 0.2d, perfDiffToAverage <span style="color: #339933;">=</span> 0.1d, perfAverageOverRuns <span style="color: #339933;">=</span> <span style="color: #cc66cc;">10</span>, perfIgnoreUpToRun <span style="color: #339933;">=</span> <span style="color: #cc66cc;">96</span><span style="color: #009900;">&#41;</span>
<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000066; font-weight: bold;">void</span> test<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">throws</span> <span style="color: #003399;">Exception</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #666666; font-style: italic;">// Do something ...</span>
<span style="color: #009900;">&#125;</span></pre></div></div>
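<p>Conceptually, ignoring history up to a certain run is just a filter applied before the comparison. The sketch below illustrates that idea only; <code>HistoryFilter</code> and its parameters are hypothetical names, and it assumes that run ids are increasing integers and that the given run itself is ignored as well, which the fork may handle differently.</p>

```java
import java.util.stream.IntStream;

/** Hypothetical helper for illustration only; not part of junit-benchmarks. */
final class HistoryFilter {

    /**
     * runIds and times are parallel arrays: times[i] is the run time of run runIds[i].
     * Returns the times of all runs with an id greater than ignoreUpToRun
     * (assumption: the run with id ignoreUpToRun is itself ignored).
     */
    static double[] relevantTimes(int[] runIds, double[] times, int ignoreUpToRun) {
        return IntStream.range(0, runIds.length)
                .filter(i -> runIds[i] > ignoreUpToRun)
                .mapToDouble(i -> times[i])
                .toArray();
    }
}
```

<p>Only the remaining run times would then be used for the comparison with the last run and with the average.</p>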

<h2>Get the code</h2>
<p><a href="http://github.com/optivo/junit-benchmarks">http://github.com/optivo/junit-benchmarks</a></p>
]]></content:encoded>
			<wfw:commentRss>http://aprilmayjune.org/2010/09/11/automated-performance-degradation-tests-with-junit4/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Integrating MySQL and Hadoop &#8211; or &#8211; A different approach on using CSV files in MySQL</title>
		<link>http://aprilmayjune.org/2010/09/05/integrating-mysql-and-hadoop-or-a-different-approach-on-using-csv-files-in-mysql/</link>
		<comments>http://aprilmayjune.org/2010/09/05/integrating-mysql-and-hadoop-or-a-different-approach-on-using-csv-files-in-mysql/#comments</comments>
		<pubDate>Sun, 05 Sep 2010 21:46:12 +0000</pubDate>
		<dc:creator>pero</dc:creator>
				<category><![CDATA[hadoop]]></category>
		<category><![CDATA[java]]></category>

		<guid isPermaLink="false">http://pero.blogs.aprilmayjune.org/2010/09/05/integrating-mysql-and-hadoop-or-a-different-approach-on-using-csv-files-in-mysql/</guid>
		<description><![CDATA[We use both MySQL and Hadoop a lot. If you utilize each system to its strengths then this is a powerful combination. One problem we are constantly facing is to make data extracted from our Hadoop cluster available in MySQL. The problem Look at this simple example: Let&#8217;s say we have a table customer: CREATE [...]]]></description>
			<content:encoded><![CDATA[<p>We use both MySQL and Hadoop a lot. If you utilize each system to its strengths then this is a powerful combination. One problem we are constantly facing is to make data extracted from our Hadoop cluster available in MySQL.</p>
<h2>The problem</h2>
<p>Look at this simple example: Let&#8217;s say we have a table <code>customer</code>:</p>

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">CREATE</span> <span style="color: #993333; font-weight: bold;">TABLE</span> customer <span style="color: #66cc66;">&#40;</span>
&nbsp;
    id <span style="color: #993333; font-weight: bold;">INT</span> <span style="color: #993333; font-weight: bold;">UNSIGNED</span> <span style="color: #993333; font-weight: bold;">NOT</span> <span style="color: #993333; font-weight: bold;">NULL</span><span style="color: #66cc66;">,</span>
    firstname <span style="color: #993333; font-weight: bold;">VARCHAR</span><span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">100</span><span style="color: #66cc66;">&#41;</span> <span style="color: #993333; font-weight: bold;">NOT</span> <span style="color: #993333; font-weight: bold;">NULL</span><span style="color: #66cc66;">,</span>
    lastname <span style="color: #993333; font-weight: bold;">VARCHAR</span><span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">100</span><span style="color: #66cc66;">&#41;</span> <span style="color: #993333; font-weight: bold;">NOT</span> <span style="color: #993333; font-weight: bold;">NULL</span><span style="color: #66cc66;">,</span>
    city <span style="color: #993333; font-weight: bold;">VARCHAR</span><span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">100</span><span style="color: #66cc66;">&#41;</span> <span style="color: #993333; font-weight: bold;">NOT</span> <span style="color: #993333; font-weight: bold;">NULL</span><span style="color: #66cc66;">,</span>
&nbsp;
    <span style="color: #993333; font-weight: bold;">PRIMARY</span> <span style="color: #993333; font-weight: bold;">KEY</span><span style="color: #66cc66;">&#40;</span>id<span style="color: #66cc66;">&#41;</span>
<span style="color: #66cc66;">&#41;</span></pre></div></div>

<p>In addition to that we store orders customers made in Hadoop. An order includes: <code>customerId, date, itemId, price</code>. Note that these structures serve as a very simplified example.</p>
<p>Let&#8217;s say we want to find the first 50 customers that placed at least one order, sorted by firstname ascending. If both tables were in MySQL we could use a single SQL statement like:</p>

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">SELECT</span> <span style="color: #993333; font-weight: bold;">DISTINCT</span> c<span style="color: #66cc66;">.</span>id<span style="color: #66cc66;">,</span> c<span style="color: #66cc66;">.</span>firstname <span style="color: #993333; font-weight: bold;">FROM</span> customer c <span style="color: #993333; font-weight: bold;">JOIN</span> `order` o <span style="color: #993333; font-weight: bold;">ON</span> c<span style="color: #66cc66;">.</span>id <span style="color: #66cc66;">=</span> o<span style="color: #66cc66;">.</span>customerId <span style="color: #993333; font-weight: bold;">ORDER</span> <span style="color: #993333; font-weight: bold;">BY</span> c<span style="color: #66cc66;">.</span>firstname <span style="color: #993333; font-weight: bold;">ASC</span> <span style="color: #993333; font-weight: bold;">LIMIT</span> <span style="color: #cc66cc;">50</span></pre></div></div>

<p>Having the orders in Hadoop we have basically two options:</p>
<ol>
<li>We write a Map-Reduce job that reads all customers from MySQL and joins them with the orders stored in Hadoop&#8217;s HDFS. The output is sorted by firstname ascending. From the result we use only the first 50 entries.</li>
<li>We write a Map-Reduce job to extract all distinct <code>customerIds</code>, write them to a table in MySQL and use a <code>SELECT</code> with a <code>JOIN</code>.</li>
</ol>
<p>In most cases option 2 will be the better choice if we have a non-trivial number of rows in our <code>customer</code> table. And that&#8217;s for three reasons:</p>
<ol>
<li>MySQL is not optimized for streaming rows. As our Map-Reduce job would always have to read the whole table, we would stream a lot.</li>
<li>You cannot easily write something like a <code>LIMIT</code> clause in Map-Reduce. Even if you could, you&#8217;d likely have to read through all customer entries anyway. So the amount of data processed by the Map-Reduce job is significantly higher if you use approach 1.</li>
<li>If you just started to move to Hadoop, most of the data structures like categories, product information etc. are still kept in MySQL and most of the business logic relies on SQL. In most applications you would not move all your data to Hadoop anyway. So storing Hadoop&#8217;s result in MySQL simply integrates better with your existing application.</li>
</ol>
<p>So, storing Map-Reduce results in MySQL seems to be the better option most of the time. But you still have to write all <code>customerIds</code> extracted by the Map-Reduce job into a table. And <b>that</b> is a performance killer. Even if you use <a href="http://dev.mysql.com/doc/refman/5.0/en/memory-storage-engine.html">HEAP tables</a> it puts a lot of pressure on MySQL. Other options like <a href="http://dev.mysql.com/doc/refman/5.0/en/csv-storage-engine.html">CSV storage engine</a> are not feasible since they do not provide any keys. And joining without a key is never a good idea.</p>
<h2>Introducing: MySQL UDF csv_find() and csv_get()</h2>
<p>One of the big advantages of Map-Reduce is that it produces output sorted by whatever we want. So we could output sorted CSV files. And we could perform <a href="http://en.wikipedia.org/wiki/Binary_search_algorithm">binary search</a> on these sorted CSV files. Great!</p>
<p>I have written two MySQL <a href="http://dev.mysql.com/doc/refman/5.1/de/udf-aggr-calling.html">User Defined Functions (UDF)</a> that provide <code>find</code> and <code>get</code> functionality on sorted CSV files.</p>
<h3>How to use it?</h3>
<p>Taking our example from above we transfer the resulting CSV file from HDFS to the local filesystem of our MySQL server and write a query like this:</p>

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">SELECT</span> <span style="color: #66cc66;">*</span> <span style="color: #993333; font-weight: bold;">FROM</span> customer <span style="color: #993333; font-weight: bold;">WHERE</span> csv_find<span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">'/tmp/myHadoopResult.csv'</span><span style="color: #66cc66;">,</span> customer<span style="color: #66cc66;">.</span>id<span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">=</span> <span style="color: #cc66cc;">1</span> <span style="color: #993333; font-weight: bold;">ORDER</span> <span style="color: #993333; font-weight: bold;">BY</span> firstname <span style="color: #993333; font-weight: bold;">ASC</span> <span style="color: #993333; font-weight: bold;">LIMIT</span> <span style="color: #cc66cc;">50</span></pre></div></div>

<p>And this is <b>a lot</b> faster than inserting the Map-Reduce result into a table. It might even be faster than our original <code>SELECT</code> statement where we assumed both tables <code>customer</code> and <code>order</code> are in MySQL since we are not using a <code>JOIN</code> at all. More on performance later on.</p>
<h3>How does it work?</h3>
<p>On initialization of <code>csv_find</code> the CSV file will be loaded into memory using <a href="http://en.wikipedia.org/wiki/Mmap">mmap</a>. And since the first column of the CSV file is sorted in ascending order we can simply use binary search on each call to <code>csv_find</code>.</p>
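<p>The lookup can be sketched in a few lines. The real UDF is written in C and works on the mmap&#8217;ed file; the Java class below is only an illustration of the technique on a byte array held in memory. <code>SortedCsvLookup</code>, <code>findRow</code> and <code>get</code> are hypothetical names, the 0-based column index is an assumption, and <code>Arrays.compareUnsigned</code> requires Java 9 or later.</p>

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical sketch of binary search over a sorted, tab-separated file held in memory. */
final class SortedCsvLookup {

    private final byte[] data; // rows end with '\n', columns are separated by '\t'

    SortedCsvLookup(byte[] data) {
        this.data = data;
    }

    /** Returns the offset of the row whose first column equals key, or -1 if there is none. */
    int findRow(byte[] key) {
        int lo = 0;           // always the start of a row
        int hi = data.length; // always the start of a row or the end of the data
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            // mid may point into the middle of a row: walk back to the row start.
            int rowStart = mid;
            while (rowStart > lo && data[rowStart - 1] != '\n') rowStart--;
            int rowEnd = rowStart;
            while (rowEnd < hi && data[rowEnd] != '\n') rowEnd++;
            int colEnd = rowStart;
            while (colEnd < rowEnd && data[colEnd] != '\t') colEnd++;
            // Compare byte by byte, treating bytes as unsigned (binary order).
            int cmp = Arrays.compareUnsigned(data, rowStart, colEnd, key, 0, key.length);
            if (cmp == 0) return rowStart;
            if (cmp < 0) lo = Math.min(rowEnd + 1, hi); // row key is smaller: search right half
            else hi = rowStart;                         // row key is larger: search left half
        }
        return -1;
    }

    /** Returns the given column (0-based, an assumption) of the row with this key, or null. */
    String get(String key, int column) {
        int pos = findRow(key.getBytes(StandardCharsets.UTF_8));
        if (pos < 0) return null;
        int rowEnd = pos;
        while (rowEnd < data.length && data[rowEnd] != '\n') rowEnd++;
        String[] columns = new String(data, pos, rowEnd - pos, StandardCharsets.UTF_8).split("\t", -1);
        return column >= 0 && column < columns.length ? columns[column] : null;
    }
}
```

<p>Each lookup touches only a logarithmic number of rows, which is why no index or table is needed as long as the file is sorted by its first column.</p>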
<p>If you need to access other columns of a CSV use <code>csv_get(&lt;file expression>, &lt;key expression>, &lt;column expression>)</code>. Example:</p>

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">SELECT</span> customer<span style="color: #66cc66;">.</span>lastname<span style="color: #66cc66;">,</span> csv_get<span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">'purchases.csv'</span><span style="color: #66cc66;">,</span> customer<span style="color: #66cc66;">.</span>id<span style="color: #66cc66;">,</span> <span style="color: #cc66cc;">2</span><span style="color: #66cc66;">&#41;</span> <span style="color: #993333; font-weight: bold;">AS</span> price <span style="color: #993333; font-weight: bold;">FROM</span> customer</pre></div></div>

<p> assuming that column 2 contains the price of a product purchased.</p>
<h3>Prerequisites</h3>
<p>The following assumptions are made and must be met by your CSV files:</p>
<ul>
<li>Column delimiter is &#8216;\t&#8217; and row delimiter is &#8216;\n&#8217;. You can change this at compile time.</li>
<li>The first column must be sorted in UTF-8 binary ascending order. &#8220;binary&#8221; means that it has to be sorted by byte value and not by a specific collation. For example &#8216;ä&#8217; (<code>0xc3 0xa4</code>) comes after &#8216;z&#8217; (<code>0x7a</code>). In bash you would sort a file like this:

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;"><span style="color: #007800;">LC_ALL</span>=C <span style="color: #c20cb9; font-weight: bold;">sort</span> <span style="color: #000000; font-weight: bold;">&lt;</span> input.csv <span style="color: #000000; font-weight: bold;">&gt;</span> ordered.csv</pre></div></div>

<p>Remember that sorting comes for free in Map-Reduce.</p></li>
<li>No escaping is done. If you need it, you could do the following: First, escape everything in your CSV, say by replacing &#8216;\n&#8217; with &#8216;\\n&#8217; and then use <code>csv_find</code> or <code>csv_get</code> like this:

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;">csv_find<span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&lt;</span>file expression<span style="color: #66cc66;">&gt;,</span> <span style="color: #993333; font-weight: bold;">REPLACE</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&lt;</span>key expression<span style="color: #66cc66;">&gt;,</span> <span style="color: #ff0000;">'<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: #66cc66;">,</span> <span style="color: #ff0000;">'<span style="color: #000099; font-weight: bold;">\\</span>n'</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span></pre></div></div>

</li>
<li>Some MySQL APIs (at least JDBC) treat results of a UDF as binary data. You have to explicitly cast the return value of <code>csv_get</code> like this:

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">SELECT</span> <span style="color: #993333; font-weight: bold;">CAST</span><span style="color: #66cc66;">&#40;</span>csv_get<span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&lt;</span>file expression<span style="color: #66cc66;">&gt;,</span> <span style="color: #66cc66;">&lt;</span>key expression<span style="color: #66cc66;">&gt;,</span> <span style="color: #66cc66;">&lt;</span>column expression<span style="color: #66cc66;">&gt;</span><span style="color: #66cc66;">&#41;</span> <span style="color: #993333; font-weight: bold;">AS</span> <span style="color: #993333; font-weight: bold;">CHAR</span><span style="color: #66cc66;">&#41;</span></pre></div></div>

</li>
</ul>
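<p>If the file is produced by Java code rather than by <code>sort</code>, the keys have to be ordered the same way. The sketch below shows one way to do that; <code>Utf8ByteOrder</code> is a hypothetical helper and <code>Arrays.compareUnsigned</code> requires Java 9 or later. The bytes must be compared as unsigned values: Java&#8217;s <code>byte</code> is signed, so a plain signed comparison would put &#8216;ä&#8217; before &#8216;z&#8217;.</p>

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical helper: orders strings by the unsigned byte values of their UTF-8 encoding. */
final class Utf8ByteOrder {

    static final Comparator<String> COMPARATOR = (a, b) -> Arrays.compareUnsigned(
            a.getBytes(StandardCharsets.UTF_8),
            b.getBytes(StandardCharsets.UTF_8));

    /** Returns a sorted copy of the given keys; the input array is left untouched. */
    static String[] sorted(String... keys) {
        String[] copy = keys.clone();
        Arrays.sort(copy, COMPARATOR);
        return copy;
    }
}
```

<p>For example, <code>Utf8ByteOrder.sorted("\u00e4", "z", "a")</code> yields a, z, ä, whereas a collation-aware sort would typically place ä next to a.</p>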
<p>For more information take a look into the source code documentation.</p>
<h3>Usage patterns other than integrating with Hadoop</h3>
<p>We use <code>csv_find</code> and <code>csv_get</code> not only to integrate with Hadoop but to integrate multiple MySQL servers. To make data from one MySQL server available in another you could export it like this:</p>

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">SELECT</span> <span style="color: #66cc66;">*</span> <span style="color: #993333; font-weight: bold;">FROM</span> customer <span style="color: #993333; font-weight: bold;">WHERE</span>  <span style="color: #66cc66;">&lt;</span>some condition<span style="color: #66cc66;">&gt;</span> <span style="color: #993333; font-weight: bold;">ORDER</span> <span style="color: #993333; font-weight: bold;">BY</span> <span style="color: #993333; font-weight: bold;">BINARY</span> id <span style="color: #993333; font-weight: bold;">ASC</span> <span style="color: #993333; font-weight: bold;">INTO</span> <span style="color: #993333; font-weight: bold;">OUTFILE</span> <span style="color: #ff0000;">'/tmp/customer.csv'</span></pre></div></div>

<p>Then copy the file over to the other MySQL server (or use NFS). Of course you could use <a href="http://dev.mysql.com/doc/refman/5.0/en/federated-storage-engine.html">FEDERATED storage engine</a>. We decided not to because it has/had some <a href="http://bugs.mysql.com/bug.php?id=36728">glitches</a>.</p>
<p>Another useful application is to replace complicated <code>JOINs</code> or <code>SUBSELECTs</code>. MySQL is good at performing some <code>JOINs</code> but really poor at many others, especially <code>SUBSELECTs</code>.</p>
<h3>A brief performance evaluation</h3>
<p>First we create a test CSV file:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;"><span style="color: #666666; font-style: italic;">#&gt; for a in $(seq 1000000 2000000); do echo $a &gt;&gt; /tmp/random.csv; done</span></pre></div></div>

<p>Then we load it into a table:</p>

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;">mysql<span style="color: #66cc66;">&gt;</span> <span style="color: #993333; font-weight: bold;">CREATE</span> <span style="color: #993333; font-weight: bold;">TABLE</span> rand <span style="color: #66cc66;">&#40;</span>id <span style="color: #993333; font-weight: bold;">VARCHAR</span><span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">255</span><span style="color: #66cc66;">&#41;</span> <span style="color: #993333; font-weight: bold;">NOT</span> <span style="color: #993333; font-weight: bold;">NULL</span><span style="color: #66cc66;">,</span> <span style="color: #993333; font-weight: bold;">PRIMARY</span> <span style="color: #993333; font-weight: bold;">KEY</span><span style="color: #66cc66;">&#40;</span>id<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
Query OK<span style="color: #66cc66;">,</span> <span style="color: #cc66cc;">0</span> <span style="color: #993333; font-weight: bold;">ROWS</span> affected <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">0.00</span> sec<span style="color: #66cc66;">&#41;</span>
&nbsp;
mysql<span style="color: #66cc66;">&gt;</span> <span style="color: #993333; font-weight: bold;">LOAD</span> <span style="color: #993333; font-weight: bold;">DATA</span> <span style="color: #993333; font-weight: bold;">INFILE</span> <span style="color: #ff0000;">'/tmp/random.csv'</span> <span style="color: #993333; font-weight: bold;">INTO</span> <span style="color: #993333; font-weight: bold;">TABLE</span> rand;
Query OK<span style="color: #66cc66;">,</span> <span style="color: #cc66cc;">1000001</span> <span style="color: #993333; font-weight: bold;">ROWS</span> affected <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">5.60</span> sec<span style="color: #66cc66;">&#41;</span></pre></div></div>

<p>To test performance of <code>JOIN</code> vs. <code>csv_find</code> we create a second table containing the same rows:</p>

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;">mysql<span style="color: #66cc66;">&gt;</span> <span style="color: #993333; font-weight: bold;">CREATE</span> <span style="color: #993333; font-weight: bold;">TABLE</span> rand2 <span style="color: #66cc66;">&#40;</span>id <span style="color: #993333; font-weight: bold;">VARCHAR</span><span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">255</span><span style="color: #66cc66;">&#41;</span> <span style="color: #993333; font-weight: bold;">NOT</span> <span style="color: #993333; font-weight: bold;">NULL</span><span style="color: #66cc66;">,</span> <span style="color: #993333; font-weight: bold;">PRIMARY</span> <span style="color: #993333; font-weight: bold;">KEY</span><span style="color: #66cc66;">&#40;</span>id<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
Query OK<span style="color: #66cc66;">,</span> <span style="color: #cc66cc;">0</span> <span style="color: #993333; font-weight: bold;">ROWS</span> affected <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">0.01</span> sec<span style="color: #66cc66;">&#41;</span>
&nbsp;
mysql<span style="color: #66cc66;">&gt;</span> <span style="color: #993333; font-weight: bold;">LOAD</span> <span style="color: #993333; font-weight: bold;">DATA</span> <span style="color: #993333; font-weight: bold;">INFILE</span> <span style="color: #ff0000;">'/tmp/random.csv'</span> <span style="color: #993333; font-weight: bold;">INTO</span> <span style="color: #993333; font-weight: bold;">TABLE</span> rand2;
Query OK<span style="color: #66cc66;">,</span> <span style="color: #cc66cc;">1000001</span> <span style="color: #993333; font-weight: bold;">ROWS</span> affected <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">5.75</span> sec<span style="color: #66cc66;">&#41;</span></pre></div></div>

<p>We see that importing 1 million rows already took <b>5.75 seconds</b>.</p>
<p>Now let&#8217;s compare the actual <code>JOIN</code> and <code>csv_find</code>:</p>

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;">mysql<span style="color: #66cc66;">&gt;</span> <span style="color: #993333; font-weight: bold;">SELECT</span> <span style="color: #993333; font-weight: bold;">COUNT</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">*</span><span style="color: #66cc66;">&#41;</span> <span style="color: #993333; font-weight: bold;">FROM</span> rand <span style="color: #993333; font-weight: bold;">JOIN</span> rand2 <span style="color: #993333; font-weight: bold;">ON</span> rand<span style="color: #66cc66;">.</span>id <span style="color: #66cc66;">=</span> rand2<span style="color: #66cc66;">.</span>id;
<span style="color: #cc66cc;">1</span> <span style="color: #993333; font-weight: bold;">ROW</span> <span style="color: #993333; font-weight: bold;">IN</span> <span style="color: #993333; font-weight: bold;">SET</span> <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">2.37</span> sec<span style="color: #66cc66;">&#41;</span>
&nbsp;
mysql<span style="color: #66cc66;">&gt;</span> <span style="color: #993333; font-weight: bold;">SELECT</span> <span style="color: #993333; font-weight: bold;">COUNT</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">*</span><span style="color: #66cc66;">&#41;</span> <span style="color: #993333; font-weight: bold;">FROM</span> rand <span style="color: #993333; font-weight: bold;">WHERE</span> csv_find<span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">'/tmp/random.csv'</span><span style="color: #66cc66;">,</span> id<span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">=</span> <span style="color: #cc66cc;">1</span>;
<span style="color: #cc66cc;">1</span> <span style="color: #993333; font-weight: bold;">ROW</span> <span style="color: #993333; font-weight: bold;">IN</span> <span style="color: #993333; font-weight: bold;">SET</span> <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">1.83</span> sec<span style="color: #66cc66;">&#41;</span></pre></div></div>

<p>We see <b>1.83 seconds</b> for <code>csv_find</code> vs. <b>2.37 seconds</b> for a <code>JOIN</code>.</p>
<p>Taking the time spent in <code>LOAD DATA</code> into account, we even have <b>1.83 seconds</b> vs. <b>8.12 seconds</b> (5.75 + 2.37), meaning <b><code>csv_find</code> is about 4.4 times faster</b>.</p>
<p>Since most Map-Reduce jobs do not use <code>LOAD DATA</code> but a ton of <code>INSERT</code> statements the real performance might be even worse. Not to mention the load massive <code>INSERTs</code> put on the MySQL server.</p>
<p><code>rand2</code> is an InnoDB table. Let&#8217;s retry with a memory table:</p>

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;">mysql<span style="color: #66cc66;">&gt;</span> <span style="color: #993333; font-weight: bold;">SET</span> max_heap_table_size <span style="color: #66cc66;">=</span> <span style="color: #cc66cc;">64</span> <span style="color: #66cc66;">*</span> <span style="color: #cc66cc;">1024</span> <span style="color: #66cc66;">*</span> <span style="color: #cc66cc;">1024</span>;
Query OK<span style="color: #66cc66;">,</span> <span style="color: #cc66cc;">0</span> <span style="color: #993333; font-weight: bold;">ROWS</span> affected <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">0.00</span> sec<span style="color: #66cc66;">&#41;</span>
&nbsp;
mysql<span style="color: #66cc66;">&gt;</span> <span style="color: #993333; font-weight: bold;">CREATE</span> <span style="color: #993333; font-weight: bold;">TABLE</span> rand3 <span style="color: #66cc66;">&#40;</span>id <span style="color: #993333; font-weight: bold;">VARCHAR</span><span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">10</span><span style="color: #66cc66;">&#41;</span> <span style="color: #993333; font-weight: bold;">NOT</span> <span style="color: #993333; font-weight: bold;">NULL</span><span style="color: #66cc66;">,</span> <span style="color: #993333; font-weight: bold;">PRIMARY</span> <span style="color: #993333; font-weight: bold;">KEY</span><span style="color: #66cc66;">&#40;</span>id<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span> ENGINE<span style="color: #66cc66;">=</span>HEAP;
Query OK<span style="color: #66cc66;">,</span> <span style="color: #cc66cc;">0</span> <span style="color: #993333; font-weight: bold;">ROWS</span> affected <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">0.00</span> sec<span style="color: #66cc66;">&#41;</span>
&nbsp;
mysql<span style="color: #66cc66;">&gt;</span> <span style="color: #993333; font-weight: bold;">LOAD</span> <span style="color: #993333; font-weight: bold;">DATA</span> <span style="color: #993333; font-weight: bold;">INFILE</span> <span style="color: #ff0000;">'/tmp/random.csv'</span> <span style="color: #993333; font-weight: bold;">INTO</span> <span style="color: #993333; font-weight: bold;">TABLE</span> rand3;
Query OK<span style="color: #66cc66;">,</span> <span style="color: #cc66cc;">1000001</span> <span style="color: #993333; font-weight: bold;">ROWS</span> affected <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">1.94</span> sec<span style="color: #66cc66;">&#41;</span>
Records: <span style="color: #cc66cc;">1000001</span>  Deleted: <span style="color: #cc66cc;">0</span>  Skipped: <span style="color: #cc66cc;">0</span>  Warnings: <span style="color: #cc66cc;">0</span>
&nbsp;
mysql<span style="color: #66cc66;">&gt;</span> <span style="color: #993333; font-weight: bold;">SELECT</span> <span style="color: #993333; font-weight: bold;">COUNT</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">*</span><span style="color: #66cc66;">&#41;</span> <span style="color: #993333; font-weight: bold;">FROM</span> rand <span style="color: #993333; font-weight: bold;">JOIN</span> rand3 <span style="color: #993333; font-weight: bold;">ON</span> rand<span style="color: #66cc66;">.</span>id <span style="color: #66cc66;">=</span> rand3<span style="color: #66cc66;">.</span>id;
<span style="color: #cc66cc;">1</span> <span style="color: #993333; font-weight: bold;">ROW</span> <span style="color: #993333; font-weight: bold;">IN</span> <span style="color: #993333; font-weight: bold;">SET</span> <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">1.80</span> sec<span style="color: #66cc66;">&#41;</span></pre></div></div>

<p>As you can see, the execution time of both queries is nearly equal, but we still need <code>1.94 seconds</code> to load the data into the table. Thus <code>csv_find</code> is still <b>twice as fast</b> compared to a <code>JOIN</code> on a <code>HEAP</code> table.</p>
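<p>The "twice as fast" figure follows from adding the load time to the query time. A small Python sketch of that arithmetic, using only the timings quoted in this post (the constant and function names are just illustrative):</p>

```python
# Timings in seconds, copied from the MySQL sessions in this post.
CSV_FIND = 1.83           # SELECT ... WHERE csv_find(...) = 1
JOIN_HEAP = 1.80          # JOIN against the HEAP table rand3
LOAD_HEAP = 1.94          # LOAD DATA INFILE into rand3
JOIN_INNODB_TOTAL = 8.12  # JOIN against rand2 including its LOAD DATA time


def speedup(baseline, candidate):
    """Return how many times faster `candidate` is than `baseline`."""
    return baseline / candidate


heap_total = JOIN_HEAP + LOAD_HEAP
print(f"HEAP total:          {heap_total:.2f} s")
print(f"csv_find vs. HEAP:   {speedup(heap_total, CSV_FIND):.1f}x")
print(f"csv_find vs. InnoDB: {speedup(JOIN_INNODB_TOTAL, CSV_FIND):.1f}x")
```

<p>These are single runs on one machine, so treat the ratios as a rough indication rather than a benchmark.</p>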
<p>But, did you notice this statement?</p>

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">SET</span> max_heap_table_size <span style="color: #66cc66;">=</span> <span style="color: #cc66cc;">64</span> <span style="color: #66cc66;">*</span> <span style="color: #cc66cc;">1024</span> <span style="color: #66cc66;">*</span> <span style="color: #cc66cc;">1024</span></pre></div></div>

<p>We had to raise the maximum heap table size since the contents of our 7.7 MB test file would not fit into the default 16 megabytes.</p>
<p>In fact, the <b>HEAP table uses about 50 MB of RAM</b> compared to exactly <b>7.7 MB for csv_find</b>.</p>
<p>And because RAM is a limiting factor you cannot use <code>HEAP</code> all that often anyway. <code>csv_find</code> and <code>csv_get</code> allocate as much memory as the file size. You can limit the maximum allowed file size at compile time.</p>
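<p>As the package name suggests, the lookup is a binary search over the sorted file held in memory. The following Python sketch illustrates that idea only; it is not the UDF's C code. The names <code>load_sorted_keys</code> and <code>find_key</code> are made up for this example, and it assumes one key per line, sorted in byte order:</p>

```python
import bisect


def load_sorted_keys(path):
    """Read one key per line into memory; the file must already be sorted."""
    with open(path, "rb") as handle:
        return handle.read().splitlines()


def find_key(keys, needle):
    """Return 1 if `needle` is among the sorted `keys`, else 0."""
    pos = bisect.bisect_left(keys, needle)
    return 1 if pos != len(keys) and keys[pos] == needle else 0


# Tiny in-memory example instead of a real file.
keys = [b"apple", b"banana", b"cherry"]
print(find_key(keys, b"banana"))     # key is present
print(find_key(keys, b"blueberry"))  # key is absent
```

<p>Each lookup needs only about log2(n) comparisons, which is roughly 20 for a file with a million lines. Note that a Python list of byte strings uses more memory than the raw file, unlike the single buffer described above.</p>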
<h3>Where to download?</h3>
<p><a href="http://aprilmayjune.org/wp-content/uploads/2010/09/mysql_udf_csv_binary_search-0.1.tar.gz"><b>Download</b> mysql_udf_csv_binary_search-0.1.tar.gz</a>.</p>
<p>This package includes installation instructions (see <code>README</code>) as well as a comprehensive test suite containing both unit and integration tests (<code>make test</code>).</p>
<p>A note on Windows: Since I don&#8217;t use Windows there are no build instructions for this OS. I tried to write portable code but since I cannot test it, I don&#8217;t know if it is working. It would be great if someone out there could contribute a Windows version.</p>
<p>Code has been tested on MySQL version 5.1.41 as well as 5.0.83.</p>
<h2>Final words</h2>
<p>This package provides fast and simple integration of sorted CSV files coming from any source.</p>
<p>Comments and improvements are welcome.</p>
]]></content:encoded>
			<wfw:commentRss>http://aprilmayjune.org/2010/09/05/integrating-mysql-and-hadoop-or-a-different-approach-on-using-csv-files-in-mysql/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>MySQL Connector/J randomly hanging at com.mysql.jdbc.util.ReadAheadInputStream.fill</title>
		<link>http://aprilmayjune.org/2010/02/02/mysql-connectorj-randomly-hanging-at-commysqljdbcutilreadaheadinputstreamfill/</link>
		<comments>http://aprilmayjune.org/2010/02/02/mysql-connectorj-randomly-hanging-at-commysqljdbcutilreadaheadinputstreamfill/#comments</comments>
		<pubDate>Tue, 02 Feb 2010 19:12:52 +0000</pubDate>
		<dc:creator>pero</dc:creator>
				<category><![CDATA[java]]></category>
		<category><![CDATA[mysql]]></category>

		<guid isPermaLink="false">http://pero.blogs.aprilmayjune.org/2010/02/02/mysql-connectorj-randomly-hanging-at-commysqljdbcutilreadaheadinputstreamfill/</guid>
		<description><![CDATA[In the past months we struggled with large SELECT queries just get stuck at: java.net.SocketInputStream.socketRead0(Native Method) java.net.SocketInputStream.read(SocketInputStream.java:129) com.mysql.jdbc.util.ReadAheadInputStream.fill(ReadAheadInputStream.java:113) com.mysql.jdbc.util.ReadAheadInputStream.readFromUnderlyingStreamIfNecessary(ReadAheadInputStream.java:160) com.mysql.jdbc.util.ReadAheadInputStream.read(ReadAheadInputStream.java:188) - locked com.mysql.jdbc.util.ReadAheadInputStream@cb9a81c com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2494) com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2949) com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2938) com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3481) com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1959) com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2109) com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2642) - locked java.lang.Object@70cbccca com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2571) com.mysql.jdbc.StatementImpl.execute(StatementImpl.java:782) - locked java.lang.Object@70cbccca com.mysql.jdbc.StatementImpl.execute(StatementImpl.java:625) org.apache.commons.dbcp.DelegatingStatement.execute(DelegatingStatement.java:260) org.apache.commons.dbcp.DelegatingStatement.execute(DelegatingStatement.java:260) Whenever this happened we just restarted the Tomcat server and everything was fine again [...]]]></description>
			<content:encoded><![CDATA[<p>Over the past months we struggled with large <code>SELECT</code> queries that just got stuck at:</p>
<p><code><br />
java.net.SocketInputStream.socketRead0(Native Method)<br />
java.net.SocketInputStream.read(SocketInputStream.java:129)<br />
com.mysql.jdbc.util.ReadAheadInputStream.fill(ReadAheadInputStream.java:113)<br />
com.mysql.jdbc.util.ReadAheadInputStream.readFromUnderlyingStreamIfNecessary(ReadAheadInputStream.java:160)<br />
com.mysql.jdbc.util.ReadAheadInputStream.read(ReadAheadInputStream.java:188)<br />
   - locked com.mysql.jdbc.util.ReadAheadInputStream@cb9a81c<br />
com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2494)<br />
com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2949)<br />
com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2938)<br />
com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3481)<br />
com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1959)<br />
com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2109)<br />
com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2642)<br />
   - locked java.lang.Object@70cbccca<br />
com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2571)<br />
com.mysql.jdbc.StatementImpl.execute(StatementImpl.java:782)<br />
   - locked java.lang.Object@70cbccca<br />
com.mysql.jdbc.StatementImpl.execute(StatementImpl.java:625)<br />
org.apache.commons.dbcp.DelegatingStatement.execute(DelegatingStatement.java:260)<br />
org.apache.commons.dbcp.DelegatingStatement.execute(DelegatingStatement.java:260)<br />
</code></p>
<p>Whenever this happened we just restarted the Tomcat server and everything was fine again for some days or weeks. But today it struck us very hard so we finally took the time to hunt this down. It seems to be related to this <a href="http://bugs.mysql.com/bug.php?id=31353">bug report</a>. Some comments suggested using <code>SQL_NO_CACHE</code> with your queries.</p>
<p>A lot of people (including me) suggest disabling the <a href="http://dev.mysql.com/doc/refman/5.0/en/query-cache.html">MySQL query cache</a> since it may cause <a href="http://www.mysqlperformanceblog.com/2009/03/19/mysql-random-freezes-could-be-the-query-cache/">severe problems</a>. To disable the query cache at server startup, set the <code>query_cache_size</code> system variable to 0.</p>
<p>This is what we usually do, but one of our servers had query cache turned on. Disabling it solved this problem.</p>
]]></content:encoded>
			<wfw:commentRss>http://aprilmayjune.org/2010/02/02/mysql-connectorj-randomly-hanging-at-commysqljdbcutilreadaheadinputstreamfill/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Improve performance on small hadoop clusters</title>
		<link>http://aprilmayjune.org/2009/11/30/improve-performance-on-small-hadoop-clusters/</link>
		<comments>http://aprilmayjune.org/2009/11/30/improve-performance-on-small-hadoop-clusters/#comments</comments>
		<pubDate>Mon, 30 Nov 2009 17:25:13 +0000</pubDate>
		<dc:creator>pero</dc:creator>
				<category><![CDATA[hadoop]]></category>
		<category><![CDATA[java]]></category>

		<guid isPermaLink="false">http://pero.blogs.aprilmayjune.org/2009/11/30/improve-performance-on-small-hadoop-clusters/</guid>
		<description><![CDATA[Hadoop is designed to run on huge clusters containing several hundred machines. But some people just don&#8217;t need such a big cluster and are able to use the benefits of HDFS and MapReduce on a smaller scale. We managed to improve performance of our 10-node-test-cluster by almost 100% by adjusting the heartbeat intervals. Namenode and [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://hadoop.apache.org">Hadoop</a> is designed to run on huge clusters containing several hundred machines. But some people just don&#8217;t need such a big cluster and are able to use the benefits of HDFS and MapReduce on a smaller scale.</p>
<p>We managed to improve performance of our 10-node-test-cluster by almost 100% by adjusting the heartbeat intervals. Namenode and jobtracker use heartbeats to communicate with their workers (datanodes and tasktrackers).<br />
We concentrate on jobtracker heartbeats. To reliably manage huge clusters, the minimum interval is 3 seconds. For every 10 nodes the interval is increased by a second. If you have lots of fast-running map or reduce tasks, this implies a noticeable overhead.</p>
<p>What we did was to patch Hadoop and lower the minimum heartbeat interval to as low as 500ms and the increment to 10ms per node. This way we got our MapReduce jobs to run almost twice as fast. If you want to try it, you could take a look at <a href="http://github.com/optivo/hadoop-0.20.1">our github branch</a> (<a href="http://github.com/optivo/hadoop-0.20.1/commit/00cfd8a1a03d07698a42250777df57356e25b8b4">view commit</a>). Please note that the git branch contains our adapted version of Hadoop, so use it only for testing purposes.</p>
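<p>To get a feeling for the numbers, here is a small Python model of the interval as described in this post. It is a simplified, additive reading of the description (minimum plus an increment per group of nodes) and an assumption on my side, not Hadoop's actual formula; all function names are made up for this example:</p>

```python
def heartbeat_interval_ms(nodes, minimum_ms, step_ms, nodes_per_step):
    """Minimum interval plus `step_ms` for every full group of `nodes_per_step` nodes."""
    return minimum_ms + (nodes // nodes_per_step) * step_ms


def stock_interval_ms(nodes):
    # Values as described in this post: 3 s minimum, +1 s per 10 nodes.
    return heartbeat_interval_ms(nodes, 3000, 1000, 10)


def patched_interval_ms(nodes):
    # Patched values from this post: 500 ms minimum, +10 ms per node.
    return heartbeat_interval_ms(nodes, 500, 10, 1)


for nodes in (5, 10, 100):
    print(nodes, stock_interval_ms(nodes), patched_interval_ms(nodes))
```

<p>Whatever the exact formula is, the point stays the same: a task that finishes in a second or two spends most of its wall-clock time waiting for the next heartbeat when the interval is several seconds long.</p>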
<p>There is a fix (<a href="http://issues.apache.org/jira/browse/HADOOP-5784">HADOOP-5784</a>) in the upcoming version 0.21 which allows you to lower the heartbeat increment per node.</p>
]]></content:encoded>
			<wfw:commentRss>http://aprilmayjune.org/2009/11/30/improve-performance-on-small-hadoop-clusters/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>&#8220;Internet slow&#8221; on Ubuntu Karmic Koala (9.10)</title>
		<link>http://aprilmayjune.org/2009/11/08/internet-slow-on-ubuntu-karmic-koala-910/</link>
		<comments>http://aprilmayjune.org/2009/11/08/internet-slow-on-ubuntu-karmic-koala-910/#comments</comments>
		<pubDate>Sun, 08 Nov 2009 21:34:14 +0000</pubDate>
		<dc:creator>pero</dc:creator>
				<category><![CDATA[linux]]></category>

		<guid isPermaLink="false">http://pero.blogs.aprilmayjune.org/2009/11/08/internet-slow-on-ubuntu-karmic-koala-910/</guid>
		<description><![CDATA[&#8220;Internet slow&#8221; actually means &#8220;DNS slow&#8221;. After upgrading to Ubuntu 9.10 I experienced a strange and very annoying lag in DNS resolution. Running dig in a shell worked like a charm. But Firefox, Synaptic and everything else was hanging at DNS resolution. To make a long story short (you probably read a lot of forum [...]]]></description>
			<content:encoded><![CDATA[<p>&#8220;Internet slow&#8221; actually means &#8220;DNS slow&#8221;. After upgrading to Ubuntu 9.10 I experienced a strange and very annoying lag in DNS resolution. Running <code>dig</code> in a shell worked like a charm. But Firefox, Synaptic and everything else was hanging at DNS resolution.</p>
<p>To make a long story short (you probably read a lot of forum threads about this): Our Karmic Koala uses IPv6 for DNS queries and only if this fails it falls back to IPv4. A lot of home routers do not support IPv6 DNS queries. DOH!</p>
<p>Resolutions:</p>
<p>1. Firefox only: Disable IPv6 support by typing &#8220;about:config&#8221; into your location bar, then search for ipv6 and disable it by clicking on the line.</p>
<p>2. Disable IPv6 entirely: If you do not need IPv6 support (I don&#8217;t) you could disable it completely and everything is up to speed again. <a href="http://www.webupd8.org/2009/11/how-to-disable-ipv6-in-ubuntu-910.html">How do I do this?</a></p>
]]></content:encoded>
			<wfw:commentRss>http://aprilmayjune.org/2009/11/08/internet-slow-on-ubuntu-karmic-koala-910/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>collectd + drraw.cgi &#8211; zoom into your graphs like you used to with cacti</title>
		<link>http://aprilmayjune.org/2009/09/16/collectd-drrawcgi-zoom-into-your-graphs-like-you-used-to-with-cacti/</link>
		<comments>http://aprilmayjune.org/2009/09/16/collectd-drrawcgi-zoom-into-your-graphs-like-you-used-to-with-cacti/#comments</comments>
		<pubDate>Wed, 16 Sep 2009 18:08:07 +0000</pubDate>
		<dc:creator>pero</dc:creator>
				<category><![CDATA[linux]]></category>

		<guid isPermaLink="false">http://pero.blogs.aprilmayjune.org/2009/09/16/collectd-drrawcgi-zoom-into-your-graphs-like-you-used-to-with-cacti/</guid>
		<description><![CDATA[I fell in love with collectd and drraw.cgi (a front-end to collectd). This combination is great: Fast, simple and yet sufficient. But there was one thing I missed in drraw that I loved in cacti: Zooming. (This is what it looks like in cacti) So I went on and hacked it into drraw.cgi using jQuery. [...]]]></description>
			<content:encoded><![CDATA[<p>I fell in love with <a href="http://collectd.org/">collectd</a> and <a href="http://web.taranis.org/drraw/">drraw.cgi</a> (a front-end to collectd). This combination is great: Fast, simple and yet sufficient.</p>
<p>But there was one thing I missed in drraw that I loved in cacti: Zooming. (<a href="http://www.cacti.net/image.php?image_id=36">This</a> is what it looks like in cacti)</p>
<p>So I went on and hacked it into drraw.cgi using <a href="http://www.jquery.com">jQuery</a>. This is what it looks like:</p>
<p><a href='http://aprilmayjune.org/wp-content/uploads/2009/09/drraw_zoom.png' title='drraw_zoom.png'><img src='http://aprilmayjune.org/wp-content/uploads/2009/09/drraw_zoom.png' alt='drraw_zoom.png' /></a></p>
<p><a href="http://aprilmayjune.org/wp-content/uploads/2009/09/drraw.cgi.zoom_patch"><b>Download</b></a> the patch (6kb).</p>
<p><i>A note: For simplicity&#8217;s sake I just included the jQuery lib hosted at <a href="http://ajax.googleapis.com/ajax/libs/jquery/1.3.2/jquery.min.js">Google APIs</a>. If this is a problem for you, just download a copy, put it on your webserver and adjust the line in the patched drraw.cgi.</i></p>
<p>Have fun with it! Comments and improvements are always welcome!</p>
]]></content:encoded>
			<wfw:commentRss>http://aprilmayjune.org/2009/09/16/collectd-drrawcgi-zoom-into-your-graphs-like-you-used-to-with-cacti/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Linux: Executables on a Samba/CIFS Share</title>
		<link>http://aprilmayjune.org/2009/08/30/linux-executables-on-a-sambacifs-share/</link>
		<comments>http://aprilmayjune.org/2009/08/30/linux-executables-on-a-sambacifs-share/#comments</comments>
		<pubDate>Sun, 30 Aug 2009 19:57:07 +0000</pubDate>
		<dc:creator>pero</dc:creator>
				<category><![CDATA[linux]]></category>

		<guid isPermaLink="false">http://pero.blogs.aprilmayjune.org/2009/08/30/linux-executables-on-a-sambacifs-share/</guid>
		<description><![CDATA[Just a quick note: Don&#8217;t mount a cifs share with flag directio if you want to execute binaries that reside on that share. Otherwise you will get the following error: &#60;command&#62;: cannot execute binary file Took me 2 hours to find out.]]></description>
			<content:encoded><![CDATA[<p>Just a quick note: Don&#8217;t mount a cifs share with flag <code>directio</code> if you want to execute binaries that reside on that share.</p>
<p>Otherwise you will get the following error:<br />
<code><br />
&lt;command&gt;: cannot execute binary file<br />
</code></p>
<p>Took me 2 hours to find out.</p>
]]></content:encoded>
			<wfw:commentRss>http://aprilmayjune.org/2009/08/30/linux-executables-on-a-sambacifs-share/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Simulating indexes in Hadoop</title>
		<link>http://aprilmayjune.org/2009/06/06/simulating-indexes-in-hadoop/</link>
		<comments>http://aprilmayjune.org/2009/06/06/simulating-indexes-in-hadoop/#comments</comments>
		<pubDate>Sat, 06 Jun 2009 19:07:09 +0000</pubDate>
		<dc:creator>pero</dc:creator>
				<category><![CDATA[hadoop]]></category>
		<category><![CDATA[java]]></category>

		<guid isPermaLink="false">http://pero.blogs.aprilmayjune.org/2009/06/06/simulating-indexes-in-hadoop/</guid>
		<description><![CDATA[You should not try to use Hadoop as a &#8220;drop-in&#8221; replacement of your current (R)DBMS. That said it is still possible to utilize the power of cluster computing while circumventing its weaknesses when it comes to ad-hoc or real-time queries. We use Hadoop as an on-line system tightly integrated with our application and use it [...]]]></description>
			<content:encoded><![CDATA[<p>You should not try to use Hadoop as a &#8220;drop-in&#8221; replacement of your current (R)DBMS. That said, it is still possible to utilize the power of cluster computing while circumventing its weaknesses when it comes to ad-hoc or real-time queries. We use Hadoop as an on-line system tightly integrated with our application and use it for both long-running analytical queries and ad-hoc style queries.</p>
<p>In the mindset of a &#8220;traditional&#8221; database engineer one of the biggest concerns about Hadoop, or MapReduce in conjunction with a distributed file system in general, is the lack of indexes. Setting aside that the debate <a href="http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html">&#8220;(R)DBMS vs MapReduce&#8221;</a> is superfluous most of the time and sometimes almost turns religious, the absence of anything like an index is one of the biggest hurdles you face when migrating data from a traditional DBMS.<br />
Even though you will love the ability to view your data in any way you want without caring about its structure, at some point you feel that it is not right to always scan your 45TB of log files. (Even though it is soooo easy&#8230;)</p>
<h2>Brute force is easy. Brute force is bad.</h2>
<p>When we began migrating all those TBs of log-style data from our huge MySQL installations to Hadoop we did a lot of testing. We tested everything from Hadoop and MapReduce settings to different MapReduce abstractions like <a href="http://hadoop.apache.org/pig/">PIG</a>, <a href="http://www.cascading.org">Cascading</a>, <a href="http://hadoop.apache.org/hive/">Hive</a> and others. There was this huge mass of data grinning at us and waiting to be analysed in multiple ways, from &#8220;online&#8221; real-time access to &#8220;offline&#8221; decision making analysis. Due to our multiple views on the same data we came to this conclusion quite quickly: &#8220;Brute force is easy. Brute force is bad.&#8221; Yes, we can optimize our Hadoop installations and we can choose the very best query mechanism (actually we ended up writing our own), but it will not make things <i>noticeably</i> faster if we continue scanning all of our data all of the time.</p>
<h2>Partitions are (sometimes) the better indexes</h2>
<p>So, why are you using indexes (in the context of data retrieval)? I know why we did and do. It is all about primary key lookup and data clustering. Say you have the following table (MySQL):<br />
<code lang="sql"><br />
CREATE TABLE orders (<br />
    id INT NOT NULL,<br />
    product INT NOT NULL,<br />
    customer INT NOT NULL,<br />
    amount FLOAT NOT NULL,<br />
    orderDate DATE NOT NULL,<br />
<br />
    PRIMARY KEY(id),<br />
    INDEX idx_product_customer(product, customer),<br />
    INDEX idx_customer(customer)<br />
)<br />
</code><br />
Just a simple order log with a unique identifier (id) and a single associated product and customer. Since we want to view our data from different perspectives we added two additional indexes on product and customer. (In this example we need two indexes because MySQL can only use the <a href="http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html">leftmost prefix of an index</a>.)<br />
Dumping the whole table as a single CSV file into your Hadoop cluster would mean that you always have to use what an (R)DBMS calls a &#8220;full table scan&#8221;. It would be pretty much the same as removing all indexes from your MySQL table. Try to search for all products a customer ordered without the index <code>idx_customer</code>. (In fact Hadoop would perform this full table scan an order of magnitude faster.) It would be ridiculous to remove all indexes from your table, but that is actually what you did when you exported the whole table into a flat file!<br />
What you should do, and what we did with great success, is to split up your flat-file CSV and arrange the data so that you can decide beforehand which part of the data needs to be accessed. So let&#8217;s split up the data and simulate all of the indexes (besides the primary key, more on that later on). A file-system-layout could look like this:<br />
<code lang="sql"><br />
orders/<br />
    product_A/<br />
        customer_1.csv<br />
        customer_2.csv<br />
    product_B/<br />
        customer_1.csv<br />
        customer_3.csv<br />
</code><br />
So when searching for all orders <code>customer_1</code> placed, we just use the file pattern <code>orders/*/customer_1.csv</code>. Remember: HDFS and MapReduce&#8217;s inputs (like <a href="http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html"><code>FileInputFormat</code></a>) support <a href="http://en.wikipedia.org/wiki/Glob_%28programming%29">globbing</a>.</p>
<p>Now we actually simulated indexes by partitioning the data!</p>
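<p>To see how a pattern selects the "index entries", here is a small Python sketch that matches glob patterns against a hypothetical file listing mirroring the layout above. It uses Python's <code>fnmatch</code> as a stand-in and does not talk to HDFS; note that <code>fnmatch</code> lets <code>*</code> match across <code>/</code>, which makes no difference for these particular patterns and paths:</p>

```python
from fnmatch import fnmatchcase

# Hypothetical file listing that mirrors the layout shown above.
FILES = [
    "orders/product_A/customer_1.csv",
    "orders/product_A/customer_2.csv",
    "orders/product_B/customer_1.csv",
    "orders/product_B/customer_3.csv",
]


def select_inputs(files, pattern):
    """Return the files whose full path matches the glob pattern."""
    return [path for path in files if fnmatchcase(path, pattern)]


# "Index lookup" by customer: every product directory, one customer file.
print(select_inputs(FILES, "orders/*/customer_1.csv"))
# "Index lookup" by product: one directory, every customer file.
print(select_inputs(FILES, "orders/product_A/*.csv"))
```

<p>Only the selected files are handed to the job as input, so the rest of the data is never read.</p>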
<p>From here on you can go into more detail depending on your data structure. As an example you could add the date- and id-range to the file name like this:</p>
<p><code lang="sql"><br />
orders/product_A/customer_1.2009-06-04.2009-06-05.1000.2000.csv<br />
orders/product_A/customer_1.2009-06-06.2009-06-07.5000.7000.csv<br />
</code></p>
<p>This comes in handy if you keep adding data to your cluster.<br />
To make things even easier you could write your own <a href="http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html"><code>InputFormat</code></a> that encapsulates the building of the paths that match your query.</p>
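<p>The path-building logic such an <code>InputFormat</code> would encapsulate can be sketched in a few lines. The following Python example is only an illustration of the idea, not our implementation; <code>parse_name</code> and <code>overlaps_dates</code> are made-up names, and it assumes the file name layout <code>customer.firstDay.lastDay.firstId.lastId.csv</code> from the example above:</p>

```python
from datetime import date


def parse_name(file_name):
    """Split 'customer_1.2009-06-04.2009-06-05.1000.2000.csv' into its parts."""
    customer, first_day, last_day, first_id, last_id, _ext = file_name.split(".")
    return (customer, date.fromisoformat(first_day), date.fromisoformat(last_day),
            int(first_id), int(last_id))


def overlaps_dates(file_name, query_start, query_end):
    """True if the file's date range intersects [query_start, query_end]."""
    _customer, first_day, last_day, _first_id, _last_id = parse_name(file_name)
    return not (query_start > last_day or first_day > query_end)


NAMES = [
    "customer_1.2009-06-04.2009-06-05.1000.2000.csv",
    "customer_1.2009-06-06.2009-06-07.5000.7000.csv",
]

# Keep only the files that can contain orders placed on or after 2009-06-06.
wanted = [n for n in NAMES if overlaps_dates(n, date(2009, 6, 6), date(2009, 6, 30))]
print(wanted)
```

<p>Here only the second file is selected. The same check works for the id range, which gives you something like a range scan on the primary key.</p>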
<h3>The small file problem</h3>
<p>Since Hadoop has been designed to work on quite huge blocks of data it is not efficient when using <a href="http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/">a lot of small files</a>. To prevent the creation of millions of very small files take a closer look at your data. Say your average customer places 50 orders. It would be a waste of resources to store multiple files for a single customer, each filling only a few KBs. A possible solution: group customers together.<br />
<code lang="sql"><br />
orders/<br />
    product_A/<br />
        customer_1_to_1000.csv<br />
        customer_1001_to_2000.csv<br />
</code></p>
<p>You have to find the right balance between file-size and access pattern.</p>
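<p>Mapping a customer to its group file is a one-liner. A minimal Python sketch, assuming customer ids start at 1 and a fixed bucket size of 1000 as in the layout above (<code>bucket_file</code> is a made-up helper name):</p>

```python
def bucket_file(customer_id, bucket_size=1000):
    """Map a customer id (starting at 1) to the file holding its bucket."""
    first = ((customer_id - 1) // bucket_size) * bucket_size + 1
    last = first + bucket_size - 1
    return f"customer_{first}_to_{last}.csv"


for customer_id in (1, 1000, 1001):
    print(customer_id, bucket_file(customer_id))
```

<p>The bucket size is the knob to turn: bigger buckets mean fewer and larger files, but more unrelated rows to skip for each lookup.</p>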
<h3>Some final words</h3>
<p>To make it clear: Even though we have found a way to partition our data we have not gained the same flexibility as we have in any decent (R)DBMS (with enough disk space, processing power and &#8211; most of all &#8211; RAM!). Querying for all orders made by a single customer may still take 0.01s in an (R)DBMS vs. 10s (or more) in Hadoop.</p>
<p>Never try to simply replace your (R)DBMS with Hadoop! Eventually you will end up writing a blog post saying that MapReduce and Hadoop are hopelessly worse than your favourite (R)DBMS. Hadoop is not a database!</p>
<h2>Real-time lookups</h2>
<p>You can still accomplish real-time lookup performance using Hadoop. One thing you could do is to take a look at <a href="http://wiki.apache.org/hadoop/Hbase">HBase</a>, a Google BigTable implementation.<br />
Sometimes it is enough to use <a href="http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/MapFile.html"><code>MapFile</code>s</a>, which are essentially huge disk-based lookup tables (key/value files sorted by key, with an index).<br />
In our application we implemented primary key and secondary keys directly on CSV files using MapFiles and distribute the lookup and local in-memory-caches over several machines. To speed things up even more we use a <a href="http://www.danga.com/memcached/">memcached</a> cluster. (Eventually we will release all of this along with our MapReduce-abstraction as open-source once we feel it is mature and stable enough.)<br />
One way or the other: Data redundancy will most likely become your best friend in these situations.</p>
<p>Regardless of the techniques you are actually using, you still have to think about your data in another way. You always have to when moving from a traditional (R)DBMS to any other kind of data storage and retrieval system!</p>
]]></content:encoded>
			<wfw:commentRss>http://aprilmayjune.org/2009/06/06/simulating-indexes-in-hadoop/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
