
How shaving 0.001s from a function saved $400/mo on Amazon EC2

Update 2014-11-26: so this old post hit the HN front page. Feel free to join the discussion over there: https://news.ycombinator.com/item?id=8661387.

If premature optimisation is the root of all evil, then timely optimisation is clearly the sum of all things good.

Over at ExtractBot, my HTML utility API, things have been hotting up gradually over several months, to the extent that, at peak, it’s now running across 18 c1.medium instances on Amazon EC2. Each of those weighs in at 1.7GB of memory and 5 EC2 compute units (2 cores × 2.5 units).

At standard EC2 rates that would work out at around $2.52/hr (almost $2000/mo).

Amazon states that one EC2 compute unit is the “equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor”. So, with 18 instances at 5 units apiece, that’s like having 90 of them churning through HTML; and it takes a lot of HTML to keep them busy.

It’s not so much the number of requests that dictates CPU load with ExtractBot, but what the assemblies look like (think of an assembly as a factory conveyor belt of robots passing HTML snippets to each other). Most of our beta testers are fairly low volume right now, but one of them is a little different: over ~18 hours of each day they pump around 2.2M HTML pages into the system. In their specific assembly, each page runs through a single CSS robot, and the results (~10 per page) then get fed into a further 11 separate CSS robots along with a couple of Regex robots.

If we look at just the CSS robots for now, that’s around 244 million invocations over the course of the 18-hour run. Or, to put it in a way that’s easier to visualise: over 3,700 per second.

Normally, shaving 0.001s from a function would not exactly be top of my optimisation hit list, but after looking at where requests were spending most of their time it was obvious it would make a considerable difference. Saving 0.001s on 3,700 calls per second means saving a whopping 3.7 seconds of CPU time in every second of real time. To put that another way, we could effectively drop about four of our c1.medium instances, a saving on standard EC2 pricing of over $400/mo.
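For anyone who wants to sanity-check the arithmetic, here is a rough back-of-the-envelope sketch. The figures come straight from the numbers above; the per-instance hourly rate is simply the $2.52/hr quoted earlier divided across the 18 instances:

<?php

// Back-of-the-envelope sketch of the numbers quoted above.
$pages        = 2200000;          // HTML pages per ~18-hour run
$css_per_page = 1 + (10 * 11);    // 1 first-pass CSS robot + ~10 results x 11 CSS robots
$run_seconds  = 18 * 3600;

$css_calls     = $pages * $css_per_page;      // ~244 million CSS robot calls
$calls_per_sec = $css_calls / $run_seconds;   // ~3,700 per second

// 0.001s saved per call at ~3,700 calls/s
$cpu_saved_per_sec = $calls_per_sec * 0.001;  // ~3.7 CPU-seconds per real second

// roughly four c1.medium instances' worth, at ~$0.14/hr each
$monthly_saving = 4 * (2.52 / 18) * 24 * 30;  // ~$403/mo

printf("%.0f CSS calls (%.0f/s), %.1f CPU-s saved per second, ~$%.0f/mo\n",
    $css_calls, $calls_per_sec, $cpu_saved_per_sec, $monthly_saving);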

So, what does shaving 0.001s from a single function look like?

[Image: graph showing the drop in CPU usage after the optimisation]

Running a Bloom Filter as a Web Service

One of the fundamental problems to solve when building a web crawler of any non-trivial scale is how to determine whether a link has been seen before, i.e. already added to your URL frontier and/or crawled and parsed. At first glance you might think it’s fairly easy to solve; a MySQL table with a unique key on md5(URL)+crawl_id should do the trick, right?

Well, yes and no in equal measure. This approach is fine if your crawl is likely to be fairly small in scale; but then you aren’t really building a web crawler at all. What you’ll quickly find with a setup like this is that the MySQL ‘INSERT IGNORE’ (or whatever method you choose: SELECT-then-INSERT, etc.) rapidly becomes a bottleneck in your crawl flow.
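Purely for illustration, here is a minimal sketch of that naive approach using PDO; the table name, column names and credentials are invented for this example, not Crawler.io’s actual schema:

<?php

// Illustrative schema for the naive "unique key" approach:
//
//   CREATE TABLE seen_urls (
//       url_hash CHAR(32) NOT NULL,
//       crawl_id INT UNSIGNED NOT NULL,
//       PRIMARY KEY (url_hash, crawl_id)
//   );

$pdo  = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass');
$stmt = $pdo->prepare(
    'INSERT IGNORE INTO seen_urls (url_hash, crawl_id) VALUES (?, ?)'
);

function is_new_url(PDOStatement $stmt, $url, $crawl_id) {
    // rowCount() is 1 if the row was inserted (the URL is new for this
    // crawl), 0 if the unique key already existed (we've seen it before)
    $stmt->execute(array(md5($url), $crawl_id));
    return $stmt->rowCount() === 1;
}

// Every link discovered during the crawl ends up hitting the database,
// which is exactly why this becomes a bottleneck at scale.
var_dump(is_new_url($stmt, 'http://example.com/page-1', 42));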

What you really need is a Bloom Filter.

My very first PoC for Crawler.io used something very similar to the MySQL solution described above. The purpose at the time was not to build anything of scale, but simply to see if I could crawl a single domain quickly and extract the information required with minimal pain. This was fine while I was running on a single EC2 instance, but as my tests increased in size I quickly discovered that I needed a better way to keep track of URLs.

The problem I had was that all of the open source Bloom Filter implementations I came across were really designed for local access only. That mattered because, by this time, I had proved the concept and completely rewritten Crawler.io as a distributed crawler (multiple independent ‘crawl units’ running on multiple EC2 crawl nodes). Without going into unnecessary detail, a single crawl job gets split across these crawl units (and crawl nodes), which meant a distributed bloom filter was going to be essential to keep them in sync.

The way I handle crawl jobs within the system is paramount to the success of this bloom filter implementation, so it may not work well for everyone. But because each crawl job is restricted to a single domain and has a unique identifier, it’s very easy to shard my bloom filter caches and scale that particular part of the service out with little pain (a rough sketch of that sharding follows).
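To make that concrete, here is a minimal sketch of what sharding by crawl ID could look like. This is my own illustration of the idea, not Crawler.io’s actual code; the server addresses and helper name are made up:

<?php

// Illustrative only: map a crawl_id to one memcached server so that
// every crawl unit working on the same job talks to the same bloom
// filter cache.
$cache_servers = array(
    array('10.0.0.10', 11211),
    array('10.0.0.11', 11211),
    array('10.0.0.12', 11211),
);

function cache_for_crawl($crawl_id, array $servers) {
    // simple modulo shard: the same crawl_id always lands on the
    // same server, which is all the bloom filter needs
    $index = abs(crc32((string) $crawl_id)) % count($servers);

    $m = new Memcached();
    $m->addServer($servers[$index][0], $servers[$index][1]);
    return $m;
}

$cache = cache_for_crawl(42, $cache_servers);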

The code below is a simplified version of what I ended up using. It leverages php-bloom-filter and assumes the use of a static Cache class (memcached works great):

<?php

// assumes $_POST contains a JSON-encoded array of URLs to check ($links)
// and a unique crawl identifier ($crawl_id)
extract($_POST);

// rebuild this crawl's filter from the cache, or start a fresh one
if (!$b = unserialize(Cache::read('filter_'.$crawl_id))) {
    $b = new BloomFilter(100000, 0.001);
}

// return only the links we haven't seen before, adding them as we go
$return = array();
foreach (json_decode($links, true) as $link) {
    if (!$b->has($crawl_id.'_'.$link)) {
        $return[] = $link;
        $b->add($crawl_id.'_'.$link);
    }
}

// put the filter back into our cache
Cache::write('filter_'.$crawl_id, serialize($b));

echo json_encode($return);

As I said, this is NOT production code, far from it, but it should give you a good starting point. It can handle a surprising volume of queries and scales really well for my specific use case (single-domain crawls).
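For completeness, this is roughly what a crawl unit’s call to that endpoint might look like from the client side (a hedged sketch; the endpoint URL is just a placeholder):

<?php

// A crawl unit posts a batch of freshly discovered links along with its
// crawl_id, and gets back only the URLs it hasn't seen before.
$links = array(
    'http://example.com/',
    'http://example.com/about',
    'http://example.com/contact',
);

$ch = curl_init('http://bloom.internal.example/check');  // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, array(
    'links'    => json_encode($links),
    'crawl_id' => 42,
));

$unseen = json_decode(curl_exec($ch), true);
curl_close($ch);

// $unseen now contains only the URLs worth adding to the frontier
print_r($unseen);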

[Note: actually, this part of Crawler.io is now powered by a node.js/redis-backed implementation, but it wasn’t due to scaling or reliability reasons that I moved away from php-bloom-filter.]