How shaving 0.001s from a function saved $400/mo on Amazon EC2

Update 2014-11-26: so this old post hit the HN front page. Feel free to join the discussion over there: https://news.ycombinator.com/item?id=8661387.

If premature optimisation is the root of all evil, then timely optimisation is clearly the sum of all things good.

Over at ExtractBot, my HTML utility API, things have been hotting up gradually over several months; to the extent that, at peak, it’s now running across 18 c1.medium instances on Amazon EC2. Each of those weighs in at 1.7GB of memory and 5 compute units (2 cores x 2.5 units).

At standard EC2 rates that would work out at around $2.52/hr (almost $2000/mo).

Amazon states that one EC2 compute unit is the “equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor”. So that’s like having 90 of them churning through HTML; and it takes a lot of it to keep them busy.

It’s not so much the number of requests that dictates CPU load with ExtractBot, but more what the assemblies look like (think of an assembly as a factory conveyor belt of robots passing HTML snippets to each other). Now, most of our beta testers are fairly low volume right now, but one of them is a little different; over ~18 hours of each day they pump around 2.2M HTML pages into the system. In their specific assembly, each page runs through a single CSS robot and the results (~10 per page) then get fed into a further 11 separate CSS robots along with a couple of Regex robots.

If we look at just the CSS robots for now, that’s around 244 million runs over the course of the 18 hour run (2.2M first-pass runs, plus 2.2M pages x ~10 results x 11 robots). Or to put it in a way that’s easier to visualise – over 3,700 per second.

Normally, shaving 0.001s from a function would not exactly be top of my optimisation hit list, but after looking at where requests were spending most of their time it was obvious it would make a considerable difference. 0.001s on 3.7k loops means we could save a whopping 3.7 seconds of CPU time in every second of real time. To put that another way, we could effectively drop about four of our c1.medium instances, a saving on standard EC2 pricing of over $400/mo.
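For anyone who wants to check the sums, here is a rough sketch of the arithmetic in Python. The figures are the ones quoted in this post; the constant and function names are made up for illustration. It assumes the per-instance rate is simply the fleet rate divided by 18, assumes a 30-day month, and follows the post in treating each second of CPU time saved per second as roughly one instance.

```python
# Figures quoted in the post
PAGES_PER_RUN = 2_200_000        # HTML pages pumped in per daily run
RESULTS_PER_PAGE = 10            # approximate results from the first CSS robot
SECOND_STAGE_CSS_ROBOTS = 11     # CSS robots each result is fed into
RUN_HOURS = 18                   # length of the daily run
SAVING_PER_CALL_S = 0.001        # time shaved from each CSS robot call
FLEET_SIZE = 18                  # c1.medium instances at peak
FLEET_RATE_PER_HOUR = 2.52       # USD per hour for the whole fleet


def css_calls_per_run():
    """One first-stage call per page, plus 11 calls for each of ~10 results."""
    first_stage = PAGES_PER_RUN
    second_stage = PAGES_PER_RUN * RESULTS_PER_PAGE * SECOND_STAGE_CSS_ROBOTS
    return first_stage + second_stage


def css_calls_per_second():
    return css_calls_per_run() / (RUN_HOURS * 3600)


def cpu_seconds_saved_per_second():
    return css_calls_per_second() * SAVING_PER_CALL_S


def monthly_saving(instances_dropped, hours_per_month=24 * 30):
    """Assumes every instance costs an equal share of the fleet rate."""
    rate_per_instance = FLEET_RATE_PER_HOUR / FLEET_SIZE
    return instances_dropped * rate_per_instance * hours_per_month


if __name__ == "__main__":
    print(f"CSS robot calls per run: {css_calls_per_run():,}")
    print(f"Calls per second:        {css_calls_per_second():,.0f}")
    print(f"CPU seconds saved/sec:   {cpu_seconds_saved_per_second():.2f}")
    print(f"Saving for 4 instances:  ${monthly_saving(4):,.2f}/mo")
```

Rounding the CPU time saved up to four instances is what gives the "over $400/mo" figure.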

So, what does shaving 0.001s from a single function look like?

[Image: cpudrop_500px]

Jill Linda Milleare: May 31 1955 – April 27 2013

On April 27 2013 at approx 19:40, my dear Mum lost her 9-month battle with Pleural Mesothelioma.

I remember when she first found out her diagnosis; it was my birthday last year and I had taken the day off work to go with her to see the consultant. Although we didn’t talk about it that day, we all knew what might be on the horizon even at that very first moment. I know Mum was scared, but I kept as positive as I could.

For those that don’t know, Mesothelioma is – almost always – caused by asbestos (something confirmed in my Mum’s case last week by the Coroner). The thing is, she never worked with asbestos or even knowingly came into contact with it at any point in her life. We believe her exposure actually occurred many years ago when she was still at high school.

It’s a fickle beast, cancer. Some people smoke 40 cigarettes a day for a lifetime and live to 90 years old. Some people, back in the 60s and 70s, would have regularly returned home from work covered in asbestos fibres and yet never caught so much as a cough. Some people, however, can come into contact with airborne asbestos maybe just the once – 20, 30 or 40 years ago – and have a very different outcome.

Essentially, it simply boils down to this: some people are lucky and some people are not.

My Mum kept a diary during her illness; documenting every pill, every needle, every pain and every discomfort. One of the things she wrote was that she knew she’d “never make old bones”. It’s something my sister Leala used to say too – in different words, but something she would say all the same. Leala had a carefree slant on life and would always live for the moment. Given the outcome, it was a pretty good attitude to take. At 57, my Mum died young – my sister, though, didn’t even see her 30th birthday.

Last Thursday we finally said goodbye to my amazing Mum. At the (packed) Church service prior to her cremation my Uncle Mark and I stood up and said some nice and very true words about her – which was honestly one of the hardest things I’ve ever had to do. I’ve pasted my tribute below.

Rest peacefully, Mum. x

[Image: front-cover]
Jill Linda Milleare – 31/05/1955 – 27/04/2013

When Mum told me a few months ago that, if things didn’t go to plan with her treatment, she wanted me to speak at her service, I worried relentlessly over what I would say and whether I could ever do her justice.

Once she had passed away, somebody told me that whatever I said would be perfect, because talking about your Mum should be the easiest thing in the World; and as I sat down to write this tribute, those words couldn’t have been truer.

Mum was a truly wonderful person. She was a devoted mother to Leala, myself, Luke & Jamie and also a fiercely proud Gran to my little ones; George and Grace. She dearly loved being a grandparent, and she took to the role effortlessly; regularly exhibiting all of those same protective motherly instincts that I had witnessed growing up.

Unfortunately, there are some things in this life that even a doting Mother can’t shield her children from and Mum’s worst fears came true when my sister Leala was diagnosed with Cervical Cancer in 2007.

During Leala’s illness, Mum was there for her 24 hours a day and 7 days a week. Mum didn’t leave her side throughout her battle which allowed her to concentrate 100% on herself without being troubled by the finer details of her treatment or, later, the burden of knowing when her condition had sadly gone beyond being treatable. Mum, and Dad, took all of that on their own shoulders instead. Caring for Leala made Mum stronger than she’d ever been in her life – not through choice, of course, but by necessity. Throughout all of this though, Mum never complained, and she never faltered. Leala was the only priority.

A mother should never have to out-live her daughter and, predictably, Leala’s death in June 2009 affected Mum profoundly; it sucked some happiness away from her that could simply never be replenished. Leala fought a long, hard battle against her cancer; Mum’s fight was shorter but it was by no means any less brave.

Mum wrote a few months ago that, despite not wanting to leave any time soon, she was no longer actually scared of dying because she knew she’d be with Leala again. Something that’s obviously extremely comforting for us as a family to know.

Aside from being a strong parental figure, Mum was also a great friend. Several people have said to me that without the love and support Mum gave them at certain points in their lives they would have struggled dealing with whatever events they were going through – and that certainly says a lot about the kind of person Mum was. She would often go out of her way to make sure people were ok and she was always available as a shoulder to cry on or a source of advice. She was the kind of friend that I’m sure we all hope we have.

Of course – as Mark has already mentioned – anyone that knew Mum will also know how much pride she took in her appearance. I remember, as a kid, going to the hairdressers with her and wondering what on Earth they were possibly doing that could take over 3 hours when my haircuts were done in 4 minutes flat. Mum’s desire to look her best was most evident at the height of her illness. When she hadn’t even been out of bed – let alone the house – for days, she somehow still managed to convince her friend June to make the journey to Ingatestone so she could get her roots done. This is despite being told by her oncologist that she definitely wasn’t allowed to! Of course, she got home and was violently ill that day – but this lifelong devotion to looking her best had certainly already paid dividends. People would often look at me in disbelief when I told them Mum was nearing 60 years old and I’ve no doubt that her smile played a starring role in that too. I remember walking past the nurses’ station at St Helena Hospice late one night and overhearing two of the nurses having a conversation about how fantastic Mum’s teeth were.

Mum was a stickler for proper manners and, while I certainly didn’t think it at the time, in later life I’m now very grateful for the training I received at a young age. The reason I instinctively stand when somebody needs a seat, hold doors open and offer assistance when somebody appears in need is all because Mum taught me to do so. At the dinner table, I often find myself telling my own kids to stop waving their cutlery around like flags, as kids tend to do. Mum must have said those same words to me a thousand times growing up, so I guess it finally did sink in somewhere along the way!

Unsurprisingly, I don’t remember the first time I met Mum, but my earliest real memory of her is from when we were living at Riverside Walk in Wickford – I must have been about 5 or 6 years old at the time, so I guess it was around 1986 or 87. And like any good memory of that age it starts with me being naughty enough to receive some punishment. I honestly have no idea what I had done to deserve it – that part is long forgotten – but it was certainly bad enough to warrant a smack on the bottom, which was ok of course back in the 80s. Anyway, I was standing at the bottom of the stairs as my punishment was about to be meted out and as Mum drew her hand back and then swung it in the direction of my backside, I decided at the very last second to simply hop out of the way. Mum’s hand crashed into the banisters and she reeled back in agony. I’m not sure who was more horrified to be honest, Mum at realising the backfire or me at suddenly realising the likely repercussions of my actions!

I have no idea if Mum ever remembered that herself or not, but it’s a day that’s stuck with me for over 25 years and I’ve often recalled it vividly. It’s funny how memories work and the things that trigger them. Whenever I hear the song ‘Driving Home for Christmas’ for example, I get flashbacks of sitting in the back of our old Toyota Space Cruiser and Mum humming away at the steering wheel.

But it’s not just the old memories I’ll cherish – I’m lucky enough to have 30 years’ worth of them. More recently – just a few months ago in fact – Mum arranged for the twins, Adele, little George and me to go to London for the weekend so that George could experience The Lion King stage show; being that he was obsessed with watching the Disney film daily at the time. The memories from that weekend will stay with us all forever – we ate at London’s oldest restaurant and even managed a flying visit to the Natural History Museum. George was absolutely mesmerised by the show and Mum was the happiest I’d seen her in some time as she sat there watching his little beaming face throughout the performance.

A few days after Mum passed away, once we’d tried to explain as best we could to the children that their Gran had now gone to heaven, George turned to us and said that it was a bit like she was playing hide and seek in the sky with us. That’s a lovely way to think about her. We know she’s there somewhere, but we just can’t see her any more.

That follows on nicely to a quote I’d like to leave you all with. It’s from an American author called Helen Keller. She said, “The best and most beautiful things in the world cannot be seen or even touched. They must be felt with the heart.”

If, by the end of her life, Mum ended up touching even some of your hearts in one way or another – then her life was certainly a successful one.

Mum, we’re going to miss you terribly. When something good happens, it won’t feel quite as real because we can’t call and tell you all about it; and when something bad happens, you’ll no longer be there to comfort us.

But please know that we’ll always love you and you’ll forever be in our hearts. Thanks for being the best Mum we could ever have asked for.

Running a Bloom Filter as a Web Service

One of the fundamental problems that need to be solved when building a web crawler of any non-trivial scale is the question of how to determine if a link has been seen before and already added to your URL frontier and/or crawled and parsed. From an initial look at this you might think it’s fairly easy to solve; a MySQL table with a unique key on the md5sum(URL)+crawl_id should do the trick, right?

Well, yes and no in equal measure. This approach is fine if your crawl is likely to be fairly small in scale; but then, you aren’t really building a web crawler at all. The thing you’ll quickly find out with a setup like this is that the MySQL ‘INSERT IGNORE’ or whatever method you choose to use (SELECT/INSERT etc) is going to rapidly become a bottleneck in your crawl flow.

What you really need is a Bloom Filter.
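In case the idea is new: a Bloom filter is a fixed-size bit array plus a handful of hash functions. Adding an item sets a few bits; checking an item tests those same bits. It can say "definitely not seen" or "probably seen", which is a fair trade for a URL frontier. Here is a minimal sketch in Python, not the library I used. The class and its parameters (a capacity and a target false-positive rate) are assumptions for illustration, and it uses the standard sizing formulas with double hashing over a SHA-256 digest.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter; names and parameters are hypothetical."""

    def __init__(self, capacity, error_rate):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes
        self.size = max(1, math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2)))
        self.hash_count = max(1, round(self.size / capacity * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.hash_count)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def has(self, item):
        # False means definitely never added; True means probably added
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def filter_new_links(bloom, crawl_id, links):
    """Return the links not seen before for this crawl, recording each one."""
    fresh = []
    for link in links:
        key = f"{crawl_id}_{link}"
        if not bloom.has(key):
            fresh.append(link)
            bloom.add(key)
    return fresh
```

A false positive here means a URL occasionally gets skipped when it should have been crawled, so the error rate is worth choosing with that in mind.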

My very first PoC for Crawler.io used something very similar to the MySQL solution described above. The purpose at the time was not to build anything of scale, but simply to see if I could crawl a single domain quickly and extract the information required with minimal pain. This was fine while I was running on a single EC2 instance, but as my tests increased in size I quickly discovered that I needed a better way to keep track of URLs.

The problem I had was that all of the open source Bloom Filter implementations I came across were really designed for local access only. This was a problem for me as, by this time, I had proved the concept and then completely rewritten Crawler.io into a distributed crawler (multiple independent ‘crawl units’ running on multiple EC2 crawl nodes). Without going into unnecessary detail, a single crawl job gets split across these crawl units (and crawl nodes); which meant that a distributed bloom filter was going to be essential to ensure they stayed in sync.

Now, the way in which I handle crawl jobs within the system is central to the success of this bloom filter implementation, so it may not work well for everyone; but because each crawl job is restricted to a single domain and has a unique identifier, it is very easy to shard my bloom filter caches and scale that particular part of the service out with little pain.
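To illustrate why the unique identifier makes sharding easy: every filter belongs to exactly one crawl job, so the cache host can be picked from the crawl ID alone. This is a rough sketch in Python under my own assumptions, not the Crawler.io code. The host names and the shard_for function are made up, and it uses simple modulo hashing over a fixed list of hosts.

```python
import zlib

# Hypothetical cache hosts, for illustration only
CACHE_SHARDS = ["cache-1.example.internal", "cache-2.example.internal", "cache-3.example.internal"]


def shard_for(crawl_id, shards=CACHE_SHARDS):
    """Pick a cache host for a crawl job using a stable hash of its ID."""
    # crc32 gives the same value on every machine and every run,
    # unlike Python's built-in hash(), which is salted per process
    return shards[zlib.crc32(crawl_id.encode("utf-8")) % len(shards)]
```

The obvious catch with modulo hashing is that changing the number of hosts remaps most crawl IDs, which matters less when filters only need to live as long as a single crawl job.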

The code below is a simplified version of what I ended up using. It leverages php-bloom-filter and assumes the use of a static Cache class (memcached works great):

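<?php

// Expects two POST fields: 'links', a JSON-encoded array of URLs to check,
// and 'crawl_id', a unique crawl identifier.
// They are read explicitly because extract($_POST) would let a client
// overwrite any variable in scope.
$crawl_id = isset($_POST['crawl_id']) ? (string) $_POST['crawl_id'] : '';
$links = isset($_POST['links']) ? json_decode($_POST['links'], true) : null;

if ($crawl_id === '' || !is_array($links)) {
    header('HTTP/1.1 400 Bad Request');
    exit;
}

// fetch this crawl's filter from the cache, or start a new one
$cached = Cache::read('filter_'.$crawl_id);
$b = $cached ? unserialize($cached) : false;
if (!$b) {
    $b = new BloomFilter(100000, 0.001);
}

// collect the links we haven't seen before, adding each one to the filter
$return = array();
foreach ($links as $link) {
    if (!$b->has($crawl_id.'_'.$link)) {
        $return[] = $link;
        $b->add($crawl_id.'_'.$link);
    }
}

// put the filter back into our cache
Cache::write('filter_'.$crawl_id, serialize($b));

echo json_encode($return);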

As I said, this is NOT production code, far from it, but it should give you a good starting point. This can handle a surprising volume of queries and scales really well for my specific use case (single domain crawls).
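For completeness, this is roughly what a crawl unit calling the service could look like. It is a sketch in Python using only the standard library; the endpoint URL and function names are hypothetical. It assumes the service takes form-encoded 'crawl_id' and 'links' fields, with 'links' holding a JSON-encoded array, and replies with a JSON array of the links it had not seen before.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint, for illustration only
FILTER_URL = "http://bloom.example.internal/check.php"


def build_filter_request(crawl_id, links, url=FILTER_URL):
    """Build the POST request: form fields 'crawl_id' and 'links' (JSON array)."""
    body = urllib.parse.urlencode({
        "crawl_id": crawl_id,
        "links": json.dumps(links),
    }).encode("ascii")
    return urllib.request.Request(url, data=body, method="POST")


def fetch_unseen_links(crawl_id, links, url=FILTER_URL, timeout=5):
    """Send the links to the filter service and return the ones it reports as new."""
    request = build_filter_request(crawl_id, links, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Sending links in batches rather than one request per URL keeps the number of round trips down, which is most of the point of putting the filter behind HTTP in the first place.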

[Note: actually, this part of Crawler.io is now powered by a node.js/redis-backed implementation, but it wasn't due to scaling or reliability reasons that I moved away from php-bloom-filter.]

House of Fraser Product Ads #superfail

Earlier tonight our Dyson Ball decided to give up the ghost and fall apart – not bad really for over 4 years’ constant use; but it obviously meant it was time to buy a new hoover. Being the super SEO type that I am, I turned to none other than my trusty Google SERPs.

What I found was… interesting. I clicked on a House of Fraser Product Ad (the PPC ads that look like shopping results) and ended up at a dead end. It was a URL I hadn’t seen before, and I assumed it was an error in Google’s redirect/tracking URLs:


So I clicked back to the SERP, selected a different ad and ended up where I expected; and then did the same again a second time with the same result. Clicking on the HoF ad still sent me to the “native_url” 404.

So I did a new search, this time for [white bedding]. Sure enough, there was a HoF product ad over on the right hand side, and sure enough it again sent me to a similar dead tracking URL.

The question is – is this a Google issue (unlikely) or is this a fuck up by HoF or their agency instead? If it’s the latter then that’s a lot of wasted click spend before somebody rocks up on Monday morning to notice the error. I certainly wouldn’t want to be in THAT meeting!

Update (01/12/2012): it looks like House of Failure (or Google) have pulled all of their product ads for the time being. Either way, they’ve lost a lot of traffic this weekend.

Skin Heads for Charity

Today was the big day – a bunch of us guys at HP Group had a mass balding in aid of two very deserving charities. It was organised by Patrick Hegarty and he even roped in his wife (yeah, how brave was she?!) and son to join the ranks of the follically challenged.

[Image: beforeafter]

The two charities receiving funds are St Helena Hospice and Building Futures in Malawi. St Helena’s is very close to my heart as my sister passed away at the hospice back in 2009 – and they were absolutely fantastic both with her and the family.

All in all, 9 of us followed through with the deed (including five bald SEOs and one bald PPC’er), and as if on cue a Royal Air Force recruitment bus happened to park outside just in time for a photo opportunity:

[Image: Army Recruits]

And here’s a final bonus pic of me mid-shave getting a St George’s cross cut in (yes, very patriotic of me):

[Image: St George's Cross]