Running a Bloom Filter as a Web Service

One of the fundamental problems you need to solve when building a web crawler of any non-trivial scale is how to determine whether a link has been seen before, i.e. already added to your URL frontier and/or crawled and parsed. At first glance it might look fairly easy to solve: a MySQL table with a unique key on MD5(URL)+crawl_id should do the trick, right?

Well, yes and no in equal measure. This approach is fine if your crawl is fairly small in scale; but then you aren’t really building a web crawler at all. What you’ll quickly discover with a setup like this is that the MySQL ‘INSERT IGNORE’, or whatever method you choose (SELECT then INSERT, etc.), rapidly becomes a bottleneck in your crawl flow.
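To make the bottleneck concrete, a minimal sketch of that naive approach might look something like the snippet below. The urls_seen table, connection details and column names are hypothetical; the point is that every single "have I seen this?" check costs a network round trip plus an index write.

<?php

// hypothetical sketch of the naive MySQL approach; assumes a urls_seen table
// with a unique key on (crawl_id, url_hash)
$pdo = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass');

$stmt = $pdo->prepare(
    'INSERT IGNORE INTO urls_seen (crawl_id, url_hash) VALUES (:crawl_id, :url_hash)'
);
$stmt->execute(array(
    ':crawl_id' => $crawl_id,
    ':url_hash' => md5($url),
));

// 1 row affected means the URL was new; 0 means the unique key already existed.
// Either way, the database has to be hit for every link you discover.
$is_new = ($stmt->rowCount() === 1);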

What you really need is a Bloom Filter.

My very first PoC for Crawler.io used something very similar to the MySQL solution described above. The purpose at the time was not to build anything of scale, but simply to see if I could crawl a single domain quickly and extract the information required with minimal pain. This was fine while I was running on a single EC2 instance, but as my tests increased in size I quickly discovered that I needed a better way to keep track of URLs.

The problem was that all of the open source Bloom Filter implementations I came across were really designed for local access only. By this time I had proved the concept and completely rewritten Crawler.io into a distributed crawler (multiple independent ‘crawl units’ running on multiple EC2 crawl nodes). Without going into unnecessary detail, a single crawl job gets split across these crawl units (and crawl nodes), which meant a distributed Bloom Filter was going to be essential to keep them in sync.

Now, the way I handle crawl jobs within the system is paramount to the success of this Bloom Filter implementation, so it may not work well for everyone; but because each crawl job is restricted to a single domain and carries a unique identifier, it’s very easy to shard my Bloom Filter caches and scale that particular part of the service out with little pain.
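To make that concrete, here is a rough sketch of what I mean by sharding. The host list is made up and the crc32 mapping is just one way to do it (it assumes the Memcached extension is available), but it shows how a crawl_id can be pinned to a single cache instance so every crawl unit working on that job hits the same filter.

<?php

// illustrative only: pin each crawl_id to exactly one memcached shard so that
// every crawl unit working on the same job reads and writes the same filter
$shards = array('10.0.0.10:11211', '10.0.0.11:11211', '10.0.0.12:11211');

function shard_for($crawl_id, array $shards) {
    // crc32 keeps a given crawl_id on the same shard for the life of the job
    return $shards[abs(crc32($crawl_id)) % count($shards)];
}

list($host, $port) = explode(':', shard_for($crawl_id, $shards));

$memcached = new Memcached();
$memcached->addServer($host, (int) $port);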

The code below is a simplified version of what I ended up using. It leverages php-bloom-filter and assumes the use of a static Cache class (memcached works great):

<?php

// assumes php-bloom-filter and the static Cache class are already loaded,
// and that $_POST contains a JSON-encoded array of URLs to check ($links)
// and a unique crawl identifier ($crawl_id)
// (extract() is used purely for brevity; don't do this with raw request data)
extract($_POST);

// pull this crawl's filter out of the cache, or start a fresh one
// (sized for 100,000 items with a 0.1% false-positive rate)
if (!$b = unserialize(Cache::read('filter_'.$crawl_id))) {
    $b = new BloomFilter(100000, 0.001);
}

// collect only the links the filter hasn't seen yet, and mark them as seen
$return = array();
foreach (json_decode($links, true) as $link) {
    if (!$b->has($crawl_id.'_'.$link)) {
        $return[] = $link;
        $b->add($crawl_id.'_'.$link);
    }
}

// put the filter back into our cache
Cache::write('filter_'.$crawl_id, serialize($b));

echo json_encode($return);

As I said, this is NOT production code (far from it), but it should give you a good starting point. It can handle a surprising volume of queries and scales really well for my specific use case (single-domain crawls).
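For completeness, here is a rough sketch of what the client side might look like from a crawl unit. The endpoint URL and the $discovered_links variable are made up, but the request and response shapes match the handler above.

<?php

// hypothetical client side: POST the freshly discovered links to the filter
// service and get back only the ones this crawl has not seen before
$payload = array(
    'crawl_id' => $crawl_id,
    'links'    => json_encode($discovered_links),
);

$ch = curl_init('http://filter.example.internal/check');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $payload);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);

// anything returned is safe to push onto the URL frontier
$unseen = json_decode($response, true);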

[Note: actually, this part of Crawler.io is now powered by a node.js/redis-backed implementation, but it wasn’t due to scaling or reliability reasons that I moved away from php-bloom-filter.]