Facebook-bot behaving badly

Quick Summary: Facebookbot can be sporadically greedy, but will back off temporarily when sent an HTTP 429 (Too Many Requests) response.

You may know the scenario all too well: your website is running great; your real-time analytics show a nice (but not unusual) flow of visitors; and then someone calls out, “Hey, is anyone else seeing timeout errors on the site?” Crap. You flip over to your infrastructure monitoring tools. CPU utilization is fine; the cache servers are up; the load balancers are… dropping servers. At work, we started experiencing this pattern from time to time, and the post-mortem log analysis always showed the same thing: an unusual spike in requests from bots (on the order of 10-20x our normal traffic load).

Traffic from bots is a mixed blessing: they’re “God’s little spiders”, without which your site will be unfindable. But entertaining these visitors comes with an infrastructure cost. At a minimum, you’re paying for the extra bandwidth they eat; at worst, you’re paying for extra hardware (or a more complex architecture) to keep up with the traffic. For a small business, it’s hard to justify buying extra hardware just so the bots can crawl your archives faster, but you can’t afford site outages either. What do you do? This is traffic you’d like to keep; there’s just too much of it.

Our first step was to move from post-mortem to pre-cog. I wanted to see these scenarios playing out in real time so that we could experiment with different solutions. To do this, we wrote a bot-detecting middleware layer for our application servers (something easily accomplished with Django’s middleware hooks). Once we could identify the traffic, we used statsd and Graphite to collect data and visualize bot activity. We now had a way of observing human traffic patterns in comparison to the bots, and the resulting information was eye-opening.
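To give a rough idea of what the detection step can look like, here is a minimal sketch in plain Python rather than our actual Django middleware. The token list, the labels, and the function names (classify_bot, metric_name) are assumptions made up for illustration; a real deployment would keep a longer list and send the counter name to statsd.

```python
# Hypothetical sketch: map a User-Agent header to a bot label.
# The tokens and labels below are illustrative assumptions, not the
# original middleware's list.
KNOWN_BOT_TOKENS = (
    ("googlebot", "googlebot"),
    ("facebookexternalhit", "facebookbot"),
    ("bingbot", "other"),
    ("slurp", "other"),
)


def classify_bot(user_agent):
    """Return a bot label for a User-Agent string, or None for human traffic."""
    ua = (user_agent or "").lower()
    for token, label in KNOWN_BOT_TOKENS:
        if token in ua:
            return label
    return None


def metric_name(user_agent, status_code):
    """Build a statsd-style counter name such as 'requests.googlebot.200'."""
    label = classify_bot(user_agent) or "human"
    return "requests.%s.%d" % (label, status_code)
```

In a middleware, the User-Agent would come from the request headers, and the resulting name would be incremented once per response so that each bot gets its own line in the graphs.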

Let’s start with a view of normal, site-wide traffic:

Figure: normal site traffic in Graphite

In the graph above, the top, purple line plots total successful server requests (i.e., HTTP 200s, for the web folks). Below that we see Googlebot in blue, Facebookbot in green, and all other known bots in red. This isn’t what I’d call a high-traffic site, so you’ll notice that bots make up almost half of the server requests. [This is, by the way, one of the dangers of only using tools like Chartbeat to gauge traffic: you can gauge content impressions, but you’re not seeing the full server load.]

Now let’s look at some interesting behavior:

Figure: bad bot traffic in Graphite

In this graph, we have the same color-coding: purple plots valid HTTP 200 responses; blue plots Googlebot; green plots Facebookbot; and red is all other bots. During the few minutes represented at the far right-hand side of the graph, you might have called the website “sluggish”. The bot behavior during this time is rather interesting: even though the site is struggling to keep up with the increased requests from Facebookbot, the bot continues hammering the site. It’s like a kid repeatedly hitting “reload” in their web browser when they see an HTTP 500 error response. Googlebot, on the other hand, notices the site problems and backs down. Here’s a wider view of the same data that shows how Googlebot slowly ramps back up after the incident:

Figure: bad bot traffic in Graphite (wider view)

Very well done, Google engineers! Thank you for that considerate bot behavior.

With our suspicions confirmed, it was time to act. We could identify the traffic at the application layer, so we knew that we could respond to bots differently if needed. We added a throttling mechanism using memcache to count requests per minute, per bot, per server. [By counting requests per minute at the server level instead of site-wide, we didn’t have to worry about clock sync; and with round-robin load balancing, we got a “good enough” estimate of traffic. By including the bot name in the cache key, we could count each bot separately.]
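A minimal sketch of that kind of counter follows. It is not our production code: a plain dictionary stands in for memcache, and the class name, key format, and threshold are assumptions for illustration. With a real cache you would also set an expiry on each key so that old minute buckets clean themselves up.

```python
import time


class BotThrottle:
    """Count requests per bot per minute; an in-memory stand-in for memcache."""

    def __init__(self, limit_per_minute, clock=time.time):
        self.limit = limit_per_minute  # hypothetical threshold, tune per site
        self.clock = clock             # injectable so the sketch is testable
        self.counts = {}               # cache key -> request count

    def _key(self, bot_name):
        # Bot name plus the current minute bucket, so each bot is counted
        # separately and the count starts over every minute.
        minute = int(self.clock() // 60)
        return "throttle:%s:%d" % (bot_name, minute)

    def allow(self, bot_name):
        """Record one request; return False once the bot is over its limit."""
        key = self._key(bot_name)
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key] <= self.limit


def status_for(throttle, bot_name):
    """Return the HTTP status a middleware might send for this request."""
    if bot_name is None:  # human traffic is never throttled in this sketch
        return 200
    return 200 if throttle.allow(bot_name) else 429
```

Because each server keeps its own counts, the effective site-wide limit is roughly the per-server threshold multiplied by the number of servers behind the load balancer.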

On each bot request, the middleware checks that bot’s counter. If the bot has exceeded its threshold, request processing stops there, and an HTTP 429 (Too Many Requests) response is returned instead. Let’s see how they respond:

Figure: bot traffic response to HTTP 429

Here we see the total count of HTTP 200 responses in green; Googlebot in red; and Facebookbot in purple. At annotation ‘A’, we see Googlebot making a high number of requests. Annotation ‘B’ shows our server responding with an HTTP 429. After the HTTP 429 response, Googlebot backs down, waits, and then resumes at its normal rate. Very nice!

In the same chart (above), we also see Facebookbot making a high number of requests at annotation ‘C’. Annotation ‘D’ shows our HTTP 429 response. Following the response, Facebookbot switches to a spike-pause-repeat pattern. It’s odd, but at least it’s an acknowledgement, and the pauses are long enough for the web servers to breathe and handle the load.

While each bot may react differently, we’ve learned that the big offenders do pay attention to HTTP 429 responses, and will back down. With a little threshold tuning, this simple rate-limiting solution allows the bots to keep crawling content (as long as they’re not too greedy) without impacting site responsiveness for paying customers (and without spending more money on servers). That’s a win-win in my book.

jsmacro — an oddly named JavaScript preprocessor

For a while now I’ve wanted a JavaScript preprocessor to conditionally include debug and testing code when needed. It’s always registered as merely a “nice to have”, so I hadn’t sought one out. However, I had a little time over the weekend and wanted to play with the idea, so here it is: jsmacro (on GitHub).

[Note that before writing this I did seek out existing implementations, and found js-preprocess to be the most interesting; however, I needed something that would work as part of an existing build chain, so authoring the tool in Python instead of JavaScript made more sense.]

For now, jsmacro is poorly named, as I didn’t write the macro system that was in my head. Instead, it’s a basic preprocessor supporting only DEFINE and IF statements, which happened to be all I needed at the time. Usage works like this:

Input JavaScript


  //@define DEBUG 0

  var foo = function() {
    //@if DEBUG
    alert('This.');
    alert('That.');
    //@end

    print("Hi");
  };

Pass the above JavaScript through jsmacro from the command line like this: ./jsmacro.py -f infile.js > outfile.js (assuming the files are all in the same directory), and you get the following:

Output JavaScript


  var foo = function() {

    print("Hi");
  };

The tool has registered the variable ‘DEBUG’ as 0 (i.e., false), so the conditional include statements omit the alert() calls. If DEBUG had been set to 1 (i.e., true), the alert() statements would remain (though all jsmacro instructions would be removed either way.)

One of the tricky things about doing macros or preprocessing in JavaScript is that I wanted the code to be valid JavaScript before the tool is run (which is why C preprocessors won’t work), so the directives have to live inside comments. The idea is that you develop as you normally would, but wrap your debug and testing code in conditional jsmacro statements so that they are automatically removed as part of your build process.

There’s nothing fancy about the current implementation (it’s a crude state machine that scans line-by-line, top-to-bottom, looking for regex patterns and deciding whether to output the line or not). Crude as it may be though, it completely solved a problem for me, and hopefully it will help you out as well.
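To make the idea concrete, here is a small sketch of that kind of line-by-line state machine. This is not the jsmacro source: the function name, the regexes, and the decision to ignore nested blocks are simplifying assumptions for illustration.

```python
import re

# Directive patterns (illustrative assumptions, not jsmacro's own regexes).
DEFINE_RE = re.compile(r"^\s*//@define\s+(\w+)\s+(\d+)\s*$")
IF_RE = re.compile(r"^\s*//@if\s+(\w+)\s*$")
END_RE = re.compile(r"^\s*//@end\s*$")


def preprocess(source):
    """Strip //@ directives and drop lines inside false //@if blocks."""
    defines = {}     # name -> integer value from //@define
    emitting = True  # state: are we currently copying lines to the output?
    output = []
    for line in source.splitlines():
        define = DEFINE_RE.match(line)
        if define:
            defines[define.group(1)] = int(define.group(2))
            continue  # directives never reach the output
        cond = IF_RE.match(line)
        if cond:
            # Undefined names are treated as 0 (false) in this sketch.
            emitting = bool(defines.get(cond.group(1), 0))
            continue
        if END_RE.match(line):
            emitting = True
            continue
        if emitting:
            output.append(line)
    return "\n".join(output)
```

Since the only state is a single flag, this sketch can’t handle nested //@if blocks; a stack of flags would be the obvious next step.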

Ubiquity command to expand hyperlinks

Another simple Ubiquity command for the morning… This one, called ‘expandlinks’, finds all links on the current page and adds the link’s URL (as a hyperlink itself) next to each existing link label. This is particularly handy if you’re going to print an HTML page for later reference.


CmdUtils.CreateCommand({
  name: "expandlinks",
  homepage: "http://eriksmartt.com/blog/",
  author: { name: "Erik Smartt"},
  license: "MPL",
  preview: "Expands all hyperlinks, showing link locations.",
  execute: function() {
    var doc =  Application.activeWindow.activeTab.document;
    jQuery(doc.body).find("a").each(function(i) {
        jQuery(this).after(" &lt;<a href='" + this.href + "'>" + this.href + "</a>&gt;");
    });
  }
})

And yes, it will be much easier to subscribe to these commands once I gather them into a JS file for Ubiquity. For now, you can copy/paste into the command editor if you’re interested in trying it out.

Even simpler than my last Ubiquity examp…

Even simpler than my last Ubiquity example, this one came about from an actual project need to verify a custom character-length-based text truncation filter. Select the text in the browser, invoke Ubiquity, and type: charcount

CmdUtils.CreateCommand({
  name: "charcount",
  takes: {"text to count chars in": noun_arb_text},
  preview: function( pblock, argText ) {
    pblock.innerHTML = argText.text.length;
  }
})

Update: See comments below for Ubiquity 0.5 compatibility updates

Extending Mozilla Ubiquity — stock charts and Google Finance lookup

Mozilla Ubiquity was released this week, and the functionality was so inspiring that I couldn’t help playing with it. For those that haven’t checked it out yet, think “Quicksilver inside Firefox”… or perhaps, “a contextually-aware command-line for your web browser.” If that still doesn’t mean anything to you… well, you’ll have to watch the intro video ;-)

Extending Ubiquity’s vocabulary is done via JavaScript, and the developer docs are pretty straightforward.

The docs cover Hello World, so I figured that the next best intro test would be a way to look up stock charts and quotes. Here’s the result of a few minutes hacking on it:

CmdUtils.CreateCommand({
  name: "tik",
  takes: {"stock ticker symbol": noun_arb_text},
  preview: function( pblock, argText ) {
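    var charturl = "http://chart.finance.yahoo.com/c/1y/a/" + argText.text;
    // Assumed reconstruction: show the chart as an image in the preview pane.
    pblock.innerHTML = "<img src='" + charturl + "'/>";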
  },
  execute: function( argText ) {
    var windowManager = Components.classes["@mozilla.org/appshell/window-mediator;1"]
                      .getService(Components.interfaces.nsIWindowMediator);
    var browserWindow = windowManager.getMostRecentWindow("navigator:browser");
    var browser = browserWindow.getBrowser();
    var url = "http://finance.google.com/finance?q=" + argText.text;
    browser.loadOneTab(url, null, null, null, false, false);
  }
})

This command introduces a ‘tik’ keyword, which loads 1-year stock symbol charts (from Yahoo) into the preview pane, and allows click-through to open a new tab for the Google Finance page of said symbol. Note that the preview-pane doesn’t always resize correctly for the chart to fit (though you can generally make it happen by typing a space after the stock symbol.) I guess there’s still some work to do there.