Facebook-bot behaving badly

Quick Summary: Facebookbot can be sporadically greedy, but will back off temporarily when sent an HTTP 429 (Too Many Requests) response.

You may know the scenario all too well: Your website is running great; your real-time analytics show a nice (but not unusual) flow of visitors; and then someone calls out, “Hey, is anyone else seeing timeout errors on the site?” Crap. You flip over to your infrastructure monitoring tools. CPU utilization is fine; the cache servers are up; the load balancers are… dropping servers. At work, we started experiencing this pattern from time to time, and the post-mortem log analysis always showed the same thing: an unusual spike in requests from bots (on the order of 10-20x our normal traffic load).

Traffic from bots is a mixed blessing — they’re “God’s little spiders”, without which your site will be unfindable. But entertaining these visitors comes with an infrastructure cost. At a minimum, you’re paying for the extra bandwidth they eat; and in the worst case, extra hardware (or a more complex architecture) to keep up with the traffic. For a small business, it’s hard to justify buying extra hardware just so the bots can crawl your archives faster; but you can’t afford site outages either. What do you do? This is traffic you’d like to keep — there’s just too much of it.

Our first step was to move from post-mortem to pre-cog. I wanted to see these scenarios playing out in real-time so that we could experiment with different solutions. To do this, we wrote a bot-detecting middleware layer for our application servers (something easily accomplished with Django’s middleware hooks). Once we could identify the traffic, we used statsd and graphite to collect data and visualize bot activity. We now had a way of observing human traffic patterns in comparison to the bots — and the resulting information was eye-opening.
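A minimal sketch of what such a middleware could look like — the User-Agent patterns, metric names, and class names here are my assumptions (the post doesn’t include the actual code), and a no-op object stands in for a real statsd client:

```python
import re

class _Statsd(object):
    """Stand-in for a real statsd client (e.g. the `statsd` package)."""
    def incr(self, name):
        pass  # a real client would fire a UDP counter increment here

statsd = _Statsd()

# Illustrative crawler User-Agent fragments -- not an exhaustive list.
BOT_PATTERNS = (
    ('googlebot', re.compile(r'Googlebot', re.I)),
    ('facebookbot', re.compile(r'facebookexternalhit', re.I)),
    ('otherbot', re.compile(r'bot|crawler|spider', re.I)),
)

def identify_bot(user_agent):
    """Return a bot label for a User-Agent string, or None for (likely) humans."""
    for label, pattern in BOT_PATTERNS:
        if pattern.search(user_agent or ''):
            return label
    return None

class BotDetectionMiddleware(object):
    """Old-style Django middleware: tag bot requests and bump a counter."""
    def process_request(self, request):
        bot = identify_bot(request.META.get('HTTP_USER_AGENT', ''))
        request.bot_name = bot  # downstream code can branch on this
        if bot:
            statsd.incr('bots.%s.requests' % bot)  # graphed in Graphite
        return None  # continue normal request processing
```

With counters flowing into statsd per bot name, Graphite can plot each bot’s request rate against total traffic, which is what the graphs below show.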

Let’s start with a view of normal, site-wide traffic:

normal site traffic in graphite

In the graph above, the top, purple line plots “total, successful server requests” (e.g., 200s, for the web folks). Below that we see Googlebot in blue, Facebookbot in green, and all other known bots in red. This isn’t what I’d call a high-traffic site, so you’ll notice that bots make up almost half of the server requests. [This is, by the way, one of the dangers of only using tools like Chartbeat to gauge traffic — you can gauge content impressions, but you’re not seeing the full server load.]

Now let’s look at some interesting behavior:

bad bot traffic in graphite

In this graph, we have the same color-coding: purple plots valid HTTP 200 responses; blue plots Googlebot; green plots Facebookbot; and red is all other bots. During the few minutes represented at the far right-hand side of the graph, you might have called the website “sluggish”. The bot behavior during this time is rather interesting: even though the site is struggling to keep up with the increased requests from Facebookbot, the bot continues hammering the site. It’s like a kid repeatedly hitting “reload” in their web browser when they see an HTTP 500 error message. On the other hand, Googlebot notices the site problems and backs down. Here’s a wider view of the same data that shows how Googlebot slowly ramps back up after the incident:

bad bot traffic in graphite

Very well done, Google engineers! Thank you for that considerate bot behavior.

With our suspicions confirmed, it was time to act. We could identify the traffic at the application layer, so we knew we could respond to bots differently if needed. We added a throttling mechanism using memcache to count requests per minute, per bot, per server. [By counting requests/minute at the server level instead of site-wide, we didn’t have to worry about clock-sync; and with round-robin load balancing, we get a “good enough” estimate of traffic. By including the bot name in the cache key, we can count each bot separately.]

On each bot-request, the middleware checks the counter. If it has exceeded its threshold, the process is killed, and an HTTP 429 (Too Many Requests) response is returned instead. Let’s see how they respond:
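The counting scheme can be sketched like this. It’s an illustration under assumptions — the names and the 60-requests-per-minute threshold are mine, not from the post — and a plain dict stands in for memcache so the sketch is runnable; in production, memcached’s atomic `incr` plays the same role:

```python
import time

REQUESTS_PER_MINUTE = 60  # hypothetical threshold; tune per bot and traffic

class MinuteCounter(object):
    """Per-bot request counter keyed by (bot, minute).

    A dict stands in for memcache here. Because the key includes the
    current minute, each counter naturally "expires" as time moves on --
    which is what makes per-server counting work with no clock-sync.
    """
    def __init__(self):
        self._counts = {}

    def incr(self, bot, now=None):
        minute = int((time.time() if now is None else now) // 60)
        key = (bot, minute)  # bot name in the key: each bot counted separately
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key]

def over_threshold(counter, bot, now=None, limit=REQUESTS_PER_MINUTE):
    """True when this bot has exceeded its per-minute quota on this server."""
    return counter.incr(bot, now) > limit
```

In the middleware, a True result would short-circuit the view and return the HTTP 429 response (ideally with a Retry-After header) instead of rendering the page.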

Bot traffic response to HTTP 429

Here we see the total count of HTTP 200 responses in green; Googlebot in red; and Facebookbot in purple. At annotation ‘A’, we see Googlebot making a high number of requests. Annotation ‘B’ shows our server responding with an HTTP 429. After the HTTP 429 response, Googlebot backs down, waits, and then resumes at its normal rate. Very nice!

In the same chart (above), we also see Facebookbot making a high number of requests at annotation ‘C’. Annotation ‘D’ shows our HTTP 429 response. Following the response, Facebookbot switches to a spike-pause-repeat pattern. It’s odd, but at least it’s an acknowledgement, and the pauses are long enough for the web servers to breathe and handle the load.

While each bot may react differently, we’ve learned that the big offenders do pay attention to HTTP 429 responses, and will back down. With a little threshold tuning, this simple rate-limiting solution allows the bots to keep crawling content (as long as they’re not too greedy) without impacting site responsiveness for paying customers (and without spending more money on servers). That’s a win-win in my book.

“Mining of Massive Datasets”

My earlier work with Social Book Club, and current work with Kirkus Reviews, has me spending a fair amount of time exploring and developing recommendation systems. There are a variety of good books and papers on the subject, but I recently finished reading “Mining of Massive Datasets” (a free ebook that accompanies a Stanford CS course on Data Mining), and it was a surprisingly good read.

The book covers a number of topics that come up frequently in data mining: reworking algorithms into a map-reduce paradigm, finding similar items, mining streams of data, finding frequent items, clustering, and recommending items. Unlike many texts on the subject, you won’t find source code in this book; rather, it offers extensive explanations of multiple techniques and algorithms for each topic. This lends itself to a better understanding of the theory, so that you appreciate the trade-offs you might be making when implementing your own systems.

There are easier texts to get through, but if you’re getting started with recommendation or data-mining systems, and haven’t read this book, I’d encourage you to do so.

JavaScript on the Server, and conversations at TXJS

We’ve seen various attempts at using JavaScript on the server over the last decade, most of them fueled by Mozilla’s Rhino (Java) engine. However, with the release of Google’s V8 (C++) engine (and the networking-performance example set by Node.js), the conversation is gaining traction.

The motivation for a 100% JavaScript stack, per conversations at the Texas JavaScript Conference (TXJS) last weekend, is the desire to use a single programming language when developing web applications, rather than the mix of technologies we use today. It’s not so much that JavaScript is the best language for application development (contrary to the JS fanboys), but since it’s what we’re stuck with on the client side, it’s worth considering on the server side. With a single language, business logic can be reused on the client and the server (think form validation), and you avoid bugs caused by frequent language switching (e.g., using or forgetting semicolons, putting commas after the last item in an array, using the wrong comment delimiter, etc.)

The wrinkle in the 100% JavaScript argument is whether JavaScript is actually the language you want to write your back end in. The language lacks package management standards (though CommonJS is working to change that); it lacks the standard libraries and tools that the incumbents offer (i.e., no batteries included); many people who use it don’t actually know the language very well; and it suffers from the multitude of bad examples and advice freely available online.

There have been some interesting Node-based applications developed already (e.g., Hummingbird), and the JavaScript on App Engine efforts (e.g., AppEngineJS) will be interesting to watch as well. (I expect both to foster more mature development patterns for large applications written in JavaScript.) However, in the near term, the 100% JavaScript stack will likely remain as niche as the Erlang, Haskell, Lisp, etc. web frameworks (as interesting as they may be).

The question for you (Mr./Mrs. web developer/web-savvy business person) is whether JavaScript on the back end offers a competitive advantage. Can you execute on an idea faster/better/cheaper than your competition because of your technology stack?

“Coders at Work”

Coders at Work book cover

I finished reading “Coders at Work” last night. In it, author Peter Seibel interviews 15 legendary programmers, discussing how they got started with computers, how they learned to program, how they read and debug code, etc. The interviews cover a wide range of opinions and approaches, and offer a fascinating look at “computer science” history.

The format of the book is a little unusual, in that it’s entirely interview transcripts. No analysis. No author interpretation. Just recorded conversations. At first it’s a little surprising that one can publish a book like this; but then you get into the content and it’s wonderfully engaging. Analysis and interpretation would just get in the way of letting these folks talk. Reading direct quotes makes the content all the more exciting.

The book isn’t for everyone (obviously), but I rather enjoyed it. There are some great stories about the history of our profession, and many topics raised that inspired additional research. (I went out and found a number of research papers referenced in the interviews, and bookmarked a lot of content for further exploration.) There’s also a fair amount on the history of different programming languages, and since I have a fascination with programming languages, it was a great fit.

A few take-away themes and ideas:

  • While programming was no easy task in the early days, at least it was possible to fully understand the hardware and all the software running on it (as opposed to modern computers). The modern computing environment presents very different challenges to present-day programmers, especially those new to the field.
  • Even some of the best use print statements.
  • Passion and enthusiasm separate good programmers from great ones.
  • In academia, you have time to think about the “best” solution, without the deadlines imposed on commercial developers.
  • There’s certainly a component of “doing great work” that requires being in the right place at the right time — sometimes it’s just a matter of getting staffed on the right project.
  • There’s some negativity towards C/C++ in here, mostly due to its negative impact on compiler and high-level language development. (I.e., one school of thought is that you give people a high-level language and make the compiler smart. The other is that you give people a low-level language and let them do the work. Unfortunately, humans aren’t so good at hand-writing code optimized for concurrency, but once you have a language that lets them try, it’s hard to fund compiler research.)

Here are a few of the quotes I highlighted while reading:

“One of the most important things for having a successful project is having people that have enough experience that they build the right thing. And barring that, if it’s something that you haven’t built before, that you don’t know how to do, then the next best thing you can do is to be flexible enough that if you build the wrong thing you can adjust.” — Peter Norvig

“…there are user-interface things where you just don’t know until you build it. You think this interaction will be great but then you show it to the user and half the users just can’t get it.” — Peter Norvig

“I get so much of a thrill bringing things to life that it doesn’t even matter if it’s wrong at first. The point is, that as soon as it comes to life it starts telling you what it is.” — Dan Ingalls

“…a complex algorithm requires complex code. And I’d much rather have a simple algorithm and simple code…” — Ken Thompson

“If you can really work hard and get some little piece of a big program to run twice as fast, then you could have gotten the whole program to run twice as fast if you had just waited a year or two.” — Ken Thompson

“if they’d have asked, ‘How did you fix the bug?’ my answer would have been, ‘I couldn’t understand the code well enough to figure out what it was doing, so I rewrote it.'” — Bernie Cosell

“You have to supplement what your job is asking you to do. If your job requires that you do a Tcl thing, just learning enough Tcl to build the interface for the job is barely adequate. The right thing is, that weekend start hacking up some Tcl things so that by Monday morning you’re pretty well versed in the mechanics of it.” — Bernie Cosell

“…computer-program source code is for people, not for computers. Computers don’t care.” — Bernie Cosell

“if you rewrite a hundred lines of code, you may well have fixed the one bug and introduced six new ones.” — Bernie Cosell

“I had two convictions, which actually served me well: that programs ought to make sense and there are very, very few inherently hard problems. Anything that looks really hard or tricky is probably more the product of the programmer not fully understanding what they needed to do” — Bernie Cosell

“You never, ever fix the bug in the place where you find it. My rule is, ‘If you knew then what you know now about the fact that this piece of code is broken, how would you have organized this piece of the routine?'” — Bernie Cosell

“Part of what I call the artistry of the computer program is how easy it is for future people to be able to change it without breaking it.” — Bernie Cosell

Finished reading “Even Faster Web Sites”

book cover

I just finished reading “Even Faster Web Sites: Performance Best Practices for Web Developers”, by Steve Souders. It’s technical, and definitely for a limited audience, but it’s certainly relevant for web developers trying to squeeze a few extra milliseconds out of page render times with older browsers. (Yes, many of the techniques are just as applicable for modern browsers, but the performance competition between Firefox, Safari, and Chrome has the latest builds addressing, and solving, some of the common bottlenecks.)

What I liked best about the book were the tests and test results. Souders runs each browser through numerous test scenarios to demonstrate the (sometimes huge) impacts that small authoring decisions can make. (e.g., the surprising relationship between CSS files and inline JavaScript.) Souders also provides implementation details and decision trees for choosing and implementing as much asynchronous loading as possible.

All in all, it was a nice exploration of how different browser implementations approach page loading and painting, and how to exploit this knowledge for speed.

Book Review: “Practical Django Projects”


  • Targeted at developers wanting to learn Django by building example applications rather than (or in addition to) reading the docs and man pages
  • The reader builds three working applications by following along
  • The examples are based on up-to-date Django features (i.e., a 2008 build)
  • Lessons focused on using Django (not on Django’s inner workings)
  • Doesn’t waste time explaining Python and HTML (nor does it dive deep explaining the how/why of what you’re doing in the examples)
  • Introduces the reader to powerful Django features — covering a wide range of capability
  • Examples focus on designing for code reuse (and leading by example, by integrating with existing reusable apps and Python libraries)
  • Offers an alternative approach to learning, focused on relevant, practical examples


Practical Django Projects (Apress book description) was written by James Bennett, release manager and contributor to the Django Web Framework. It was published by Apress in 2008. This was Bennett’s first book.

Full disclosure: I was provided with a free, review-copy of the book by Apress.

The Book:

Practical Django Projects introduces the reader to the Django Web Framework by example. It takes the reader step-by-step through three example projects: a basic CMS, a blog application (called Coltrane, which powers the author’s personal blog), and a code-sharing/snippets site (called Cab, which powers http://www.djangosnippets.org/). The examples cover real-world problems (and integration tasks) that developers are likely to be interested in, and leave the reader with three working Django applications.

The lessons are spread across eleven chapters:

  1. Welcome to Django — a wonderfully short introduction that wastes no space explaining prerequisites (it assumes the reader knows Python)
  2. Your First Django Site: A Simple CMS — an introduction to the Django Admin and Flatpages
  3. Customizing the Simple CMS — customizing the Admin interface (adding TinyMCE) and developing a simple, reusable search feature
  4. A Django-Powered Weblog — defining the basic models, and using django-tagging and Generic Views
  5. Expanding the Weblog — adding del.icio.us-synced links, and custom categories
  6. Templates for the Weblog — more extensive use of Generic Views, template inheritance, and custom template tags
  7. Finishing the Weblog — using django.contrib.comments and model signals to develop a moderation system with email notification and Akismet integration; Using django.contrib.syndication to add RSS/Atom feeds
  8. A Social Code-Sharing Site — building the initial models, integrating with the pygments syntax highlighter, and writing custom model managers
  9. Form Processing in the Code-Sharing Application — great examples of using newforms (much better than The Definitive Guide to Django‘s chapter on form processing)
  10. Finishing the Code-Sharing Application — more custom template tags, this time used with bookmarking and rating features
  11. Writing Reusable Django Applications — a summary of Bennett’s philosophy on decoupling application features into reusable components (with references to the UNIX saying, “do one thing, and do it well”)

The examples focus on building applications the “Django way” — meaning that they heavily leverage Django features such as Generic Views, custom template tags, and the django.contrib package. Each section starts by outlining the features to be developed, then walking the reader through model definitions, URLs, template design, and the request-handler (view) code.

While working through the three example applications, Bennett teaches the reader how to decouple applications from projects, how to think about (and look for) opportunities for code reuse, and how to integrate with other reusable Django applications. The lessons aren’t so much “how does Django work”, but rather “how do you, as a developer, structure your projects to get the most out of the framework.” Depending on your level of comfort using Django and Python, the lessons will either be a breeze or ridiculously confusing. (I.e., there’s a lot of magic going on in the examples, and the book assumes that either you get it, you’re comfortable not knowing, or you’ll figure out the finer bits when you need them.)

The Core Message

Ultimately, the book isn’t so much about learning Django as it is about learning how to use Django properly (where properly is defined as the way the Django developers use Django). From this perspective, it’s quite successful. The reader is shown a number of patterns and concepts that can be applied to any Django project.

Bennett wraps up the book with a chapter on design philosophy, but I think the overall lesson of the book is best summarized on page 124, with the following quote:

…this is the hallmark of a well-built Django application. Installing it shouldn’t involve any more work than the following:

  1. Add it to INSTALLED_APPS and run syncdb.
  2. Add a new URL pattern to route to its default URLConf.
  3. Set up any needed templates.
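For a hypothetical reusable app called `snippets`, those three steps look roughly like this (using the 2008-era Django conventions the book is written against — the app name and URL prefix are illustrative):

```python
# settings.py -- step 1: install the app, then run `python manage.py syncdb`
INSTALLED_APPS = (
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'snippets',  # the hypothetical reusable app
)

# urls.py -- step 2: one new URL pattern routing to the app's default URLconf
from django.conf.urls.defaults import *

urlpatterns = patterns('',
    (r'^snippets/', include('snippets.urls')),
)

# step 3: supply any templates the app expects, e.g. templates/snippets/*.html
```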

This is the zen of pluggable Django applications. It’s the path Bennett wants to help you start down. The value of going down this path will depend on how often you’ll use Django in the future.


Overall, I think the book will be more valuable for someone just getting started with Django than for someone who’s been hacking at lower levels of the framework for a while. It’s a developer-focused, quick-start, “get you on the right foot” kind of book that I certainly would have appreciated more a few years ago. The big question, then, is whether this book is for you. The answer depends on a couple things, with the most important being how you like to learn. Do you prefer learning by example, or learning by reading the docs and building things on your own? If you prefer to have an expert guide you step-by-step, then this book is for you. You’ll still need to poke around in the Django documentation to really grok how it all works, but this book will get you up to speed quickly.

If you’ve read the docs, done the online tutorials, and are still interested in picking up some best practices on decoupling your code from your specific application (i.e., learning how Django supports code reuse), then this may still be a book for you. If you know you’ll be building a large application, the lessons in the book might help prevent you from writing a single, monolithic application, or at least give you some insight into how to organize and package your code. Down the road you’ll thank yourself.

For me personally, I was looking forward to this book before it came out. I think the online Django docs (as great as they are) can sometimes fall short on best practices. However, I’ve also been using the framework professionally for a number of years (to deploy personal, start-up, and enterprise-class web applications), and I’ve previously built and deployed a pluggable, multi-site, Django-based blog engine (with del.icio.us and Akismet integration, flexible moderation rules, etc.), so the idea of using a blog engine as the core example in the book was a bit disappointing. That said, I did enjoy seeing another developer’s approach to solving the same problem, and I picked up a few nice tips around some of the more recent Django features.

If you’re looking to build a reusable code library (and you should be, if you’re going to build more than one Django project) and ensure that you’re using Django efficiently, this book will help point you down the right path and have you thinking about decoupling your architecture from the start.