Facebook-bot behaving badly

Quick Summary: Facebookbot can be sporadically greedy, but will back off temporarily when sent an HTTP 429 (Too Many Requests) response.

You may know the scenario all too well: your website is running great; your real-time analytics show a nice (but not unusual) flow of visitors; and then someone calls out, “Hey, is anyone else seeing timeout errors on the site?” Crap. You flip over to your infrastructure monitoring tools. CPU utilization is fine; the cache servers are up; the load balancers are… dropping servers. At work, we started experiencing this pattern from time to time, and the post-mortem log analysis always showed the same thing: an unusual spike in requests from bots (on the order of 10-20x our normal traffic load).

Traffic from bots is a mixed blessing — they’re “God’s little spiders”, without which your site will be unfindable. But entertaining these visitors comes with an infrastructure cost. At a minimum, you’re paying for the extra bandwidth they eat; at worst, you’re paying for extra hardware (or a more complex architecture) to keep up with the traffic. For a small business, it’s hard to justify buying extra hardware just so the bots can crawl your archives faster; but you can’t afford site outages either. What do you do? This is traffic you’d like to keep — there’s just too much of it.

Our first step was to move from post-mortem to pre-cog. I wanted to see these scenarios playing out in real-time so that we could experiment with different solutions. To do this, we wrote a bot-detecting middleware layer for our application servers (something easily accomplished with Django’s middleware hooks.) Once we could identify the traffic, we used statsd and graphite to collect data and visualize bot activity. We now had a way of observing human-traffic patterns in comparison to the bots — and the resulting information was eye-opening.
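As a rough illustration of the detection step, here is a minimal sketch that classifies requests by their User-Agent header. It is not our production middleware: the signature patterns, the names `classify_user_agent` and `BotTaggingMiddleware`, and the `bot_name` attribute are all assumptions made for this example. The class uses the callable middleware shape found in newer Django releases, which lets the sketch run without importing Django.

```python
import re

# Hypothetical signature list; the real middleware's patterns are not shown in this post.
BOT_SIGNATURES = {
    "googlebot": re.compile(r"googlebot", re.IGNORECASE),
    "facebookbot": re.compile(r"facebookexternalhit|facebot", re.IGNORECASE),
}
# Catch-all for other crawlers that identify themselves in the usual ways.
OTHER_BOT = re.compile(r"bot|crawler|spider|slurp", re.IGNORECASE)


def classify_user_agent(user_agent):
    """Return a bot name, "other-bot", or None for presumed human traffic."""
    if not user_agent:
        return None
    for name, pattern in BOT_SIGNATURES.items():
        if pattern.search(user_agent):
            return name
    if OTHER_BOT.search(user_agent):
        return "other-bot"
    return None


class BotTaggingMiddleware:
    """Tag each request with the detected bot name (assumed attribute: bot_name)."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        user_agent = request.META.get("HTTP_USER_AGENT", "")
        request.bot_name = classify_user_agent(user_agent)
        # A stats counter keyed on request.bot_name could be incremented here.
        return self.get_response(request)
```

Once every request carries a bot name, one counter per name is enough for a stats collector to draw a separate line for each bot.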

Let’s start with a view of normal, site-wide traffic:

[Figure: normal site traffic in graphite]

In the graph above, the top, purple line plots “total, successful server requests” (e.g., 200s, for the web folks). Below that we see Googlebot in blue, Facebookbot in green, and all other known bots in red. This isn’t what I’d call a high-traffic site, so you’ll notice that bots make up almost half of the server requests. [This is, by the way, one of the dangers of only using tools like Chartbeat to gauge traffic — you can gauge content impressions, but you’re not seeing the full server load.]

Now let’s look at some interesting behavior:

[Figure: bad bot traffic in graphite]

In this graph, we have the same color-coding: purple plots valid HTTP 200 responses; blue plots Googlebot; green plots Facebookbot; and red is all other bots. During the few minutes represented in the far right-hand side of the graph, you might have called the website “sluggish”. The bot behavior during this time is rather interesting: even though the site is struggling to keep up with the increased requests from Facebookbot, the bot continues hammering the site. It’s like a kid repeatedly hitting “reload” in their web browser when they see an HTTP 500 error response. On the other hand, Googlebot notices the site problems and backs down. Here’s a wider view of the same data that shows how Googlebot slowly ramps back up after the incident:

[Figure: bad bot traffic in graphite (wider view)]

Very well done Google engineers! Thank you for that considerate bot behavior.

With our suspicions confirmed, it was time to act. We could identify the traffic at the application layer, so we knew that we could respond to bots differently if needed. We added a throttling mechanism using memcache to count requests per minute, per bot, per server. [By counting requests/minute at the server-level instead of site-wide, we didn’t have to worry about clock-sync; and with round-robin load balancing, we get a “good enough” estimate of traffic. By including the bot-name in the cache-key, we can count each bot separately.]
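The counting scheme can be sketched roughly as follows. This is an illustration under stated assumptions, not the code we deployed: `LocalCache` is a hypothetical in-memory stand-in that mimics a memcache client's add and incr operations, and the key format, the function name `over_limit`, and the limit used are made up for the example.

```python
import time


class LocalCache:
    """In-memory stand-in for a per-server memcache client (illustration only)."""

    def __init__(self):
        self._data = {}

    def add(self, key, value):
        """Store value only if the key is absent, like memcache's add."""
        if key in self._data:
            return False
        self._data[key] = value
        return True

    def incr(self, key):
        """Increment an existing counter and return the new value."""
        self._data[key] += 1
        return self._data[key]


def over_limit(cache, bot_name, limit, now=None):
    """Count one request for bot_name; True if this minute's count exceeds limit."""
    now = time.time() if now is None else now
    minute = int(now // 60)
    # The bot name keeps each bot's count separate; the minute number starts
    # a fresh counter every minute. A real cache entry would also get an
    # expiry so that old counters are cleaned up.
    key = "botcount:%s:%d" % (bot_name, minute)
    cache.add(key, 0)
    return cache.incr(key) > limit


if __name__ == "__main__":
    cache = LocalCache()
    for i in range(5):
        print(i + 1, over_limit(cache, "facebookbot", 3))
```

Because the minute number is part of the key, nothing has to be reset explicitly: requests in the next minute simply land on a new counter.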

On each bot-request, the middleware checks the counter. If it has exceeded its threshold, the process is killed, and an HTTP 429 (Too Many Requests) response is returned instead. Let’s see how they respond:

[Figure: bot traffic response to HTTP 429]

Here we see the total count of HTTP 200 responses in green; Googlebot in red; and Facebookbot in purple. At annotation ‘A’, we see Googlebot making a high number of requests. Annotation ‘B’ shows our server responding with an HTTP 429. After the HTTP 429 response, Googlebot backs down, waits, and then resumes at its normal rate. Very nice!

In the same chart (above), we also see Facebookbot making a high number of requests at annotation ‘C’. Annotation ‘D’ shows our HTTP 429 response. Following the response, Facebookbot switches to a spike-pause-repeat pattern. It’s odd, but at least it’s an acknowledgement, and the pauses are long enough for the web-servers to breathe and handle the load.

While each bot may react differently, we’ve learned that the big offenders do pay attention to HTTP 429 responses, and will back down. With a little threshold tuning, this simple rate-limiting solution allows the bots to keep crawling content (as long as they’re not too greedy) without impacting site responsiveness for paying customers (and without spending more money on servers.) That’s a win-win in my book.

Book Review: “Practical Django Projects”

Summary:

  • Targeted at developers wanting to learn Django by building example applications rather than (or in addition to) reading the docs and man pages
  • The reader builds three working applications by following along
  • The examples are based on up-to-date Django features (i.e., a 2008 build)
  • Lessons focus on using Django (not on Django’s inner workings)
  • Doesn’t waste time explaining Python and HTML (nor does it dive deep explaining the how/why of what you’re doing in the examples)
  • Introduces the reader to powerful Django features — covering a wide range of capability
  • Examples focus on designing for code reuse (and leading by example, by integrating with existing reusable apps and Python libraries)
  • Offers an alternative approach to learning, focused on relevant, practical examples

Background:

Practical Django Projects (Apress book description) was written by James Bennett, release manager and contributor to the Django Web Framework. It was published by Apress in 2008. This was Bennett’s first book.

Full disclosure: I was provided with a free, review-copy of the book by Apress.

The Book:

Practical Django Projects introduces the reader to the Django Web Framework by example. It takes the reader step-by-step through three example projects: a basic CMS, a blog application (called Coltrane, which powers the author’s personal blog), and a code-sharing/snippets site (called Cab, which powers http://www.djangosnippets.org/.) The examples cover real-world problems (and integration tasks) that developers are likely to be interested in, and leaves the reader with three working Django applications.

The lessons are spread across eleven chapters:

  1. Welcome to Django — a wonderfully short introduction that wastes no space explaining prerequisites (it assumes the reader knows Python)
  2. Your First Django Site: A Simple CMS — an introduction to the Django Admin and Flatpages
  3. Customizing the Simple CMS — customizing the Admin interface (adding TinyMCE) and developing a simple, reusable search feature
  4. A Django-Powered Weblog — defining the basic models, and using django-tagging and Generic Views
  5. Expanding the Weblog — adding del.icio.us-synced links, and custom categories
  6. Templates for the Weblog — more extensive use of Generic Views, template inheritance, and custom template tags
  7. Finishing the Weblog — using django.contrib.comments and model signals to develop a moderation system with email notification and Akismet integration; Using django.contrib.syndication to add RSS/Atom feeds
  8. A Social Code-Sharing Site — building the initial models, integrating with the pygments syntax highlighter, and writing custom model managers
  9. Form Processing in the Code-Sharing Application — great examples of using newforms (much better than The Definitive Guide to Django’s chapter on form processing)
  10. Finishing the Code-Sharing Application — more custom template tags, this time used with bookmarking and rating features
  11. Writing Reusable Django Applications — a summary of Bennett’s philosophy on decoupling application features into reusable components (with references to the UNIX saying, “do one thing, and do it well”)

The examples focus on building applications the “Django way” — meaning that they heavily leverage Django features such as Generic Views, custom template tags, and the django.contrib package. Each section starts by outlining the features to be developed, then walks the reader through model definitions, URLs, template design, and the request-handler (view) code.

While working through the three example applications, Bennett teaches the reader how to decouple applications from projects, how to think about (and look for) opportunities for code reuse, and how to integrate with other reusable Django applications. The lessons aren’t so much “how does Django work”, but rather “how do you, as a developer, structure your projects to get the most out of the framework.” Depending on your level of comfort using Django and Python, the lessons will either be a breeze, or ridiculously confusing. (I.e., there’s a lot of magic going on in the examples, and the book assumes that either you get it, you’re comfortable not knowing, or that you’ll figure out the finer bits when you need them.)

The Core Message

Ultimately, the book isn’t so much about learning Django, as it is about learning how to use Django properly (where properly is defined as the way in which the Django developers use Django.) From this perspective, it’s quite successful. The reader is shown a number of patterns and concepts that can be applied to any Django project.

Bennett wraps up the book with a chapter on design philosophy, but I think the overall lesson of the book is best summarized on page 124, with the following quote:

…this is the hallmark of a well-built Django application. Installing it shouldn’t involve any more work than the following:

  1. Add it to INSTALLED_APPS and run syncdb.
  2. Add a new URL pattern to route to its default URLConf.
  3. Set up any needed templates.

This is the zen of pluggable Django applications. It’s the path Bennett wants to help you start down. The value of going down this path will depend on how often you’ll use Django in the future.

Conclusion:

Overall, I think the book will be more valuable for someone just getting started with Django than for someone who’s been hacking lower-level with the framework for a while. It’s a developer-focused, quick-start, “get you on the right foot” kind of book that I certainly would have appreciated more a few years ago. The big question then, is whether this book is for you. The answer depends on a couple things, with the most important being how you like to learn. Do you prefer learning by example, or learning by reading the docs and building things on your own? If you prefer to have an expert guide you step-by-step, then this book is for you. You’ll still need to poke around in the Django documentation to really grok how it all works, but this book will get you up to speed quickly.

If you’ve read the docs, done the online tutorials, and are still interested in picking up some best-practices on decoupling your code from your specific application (i.e., learning how Django supports code reuse), then this may still be a book for you. If you know you’ll be building a large application, the lessons in the book might help prevent you from writing a single, monolithic application, or at least give you some insight into how to organize and package your code. Down the road you’ll thank yourself.

For me personally, I was actually looking forward to this book before it came out. I think the Django docs online (as great as they are) can sometimes lack in providing best practices. However, I’ve also been using the framework professionally for a number of years (to deploy personal, start-up, and enterprise-class web applications), and I’ve previously built and deployed a pluggable, multi-site, Django-based blog engine (with del.icio.us and Akismet integration, flexible moderation rules, etc.), so the idea of using a blog engine as the core example in the book was a bit disappointing. That said, I did enjoy seeing another developer’s approach to solving the same problem, and I picked up a few nice tips around some of the more recent Django features.

If you’re looking to build a reusable code library (and you should be, if you’re going to build more than one Django project) and ensure that you’re using Django efficiently, this book will help point you down the right path and have you thinking about decoupling your architecture from the start.

Reading “The Definitive Guide to Django”; Verdict: A solid learning reference for a beginning/intermediate Django user

Last week I received a review-copy of the new “The Definitive Guide to Django” book from Apress. I hadn’t planned on buying the book since it seemed a little too beginner-focused; but I agreed to give it an honest reading, so I happily dove in with an “it’s Python, of course I’m going to like it” attitude.

Background

The book was written by Adrian Holovaty and Jacob Kaplan-Moss, the creators and “Benevolent Dictators” of the Django Web Framework. It was Holovaty and Kaplan-Moss’ first book, and, I believe, meant to be the first Django book to market. The book was drafted online; open to peer-review and community feedback; and ultimately published under the GNU Free Documentation License.

From the get-go, the print edition had a few inherent market challenges to face: First, the entire book is available online, for free, at: <http://www.djangobook.com/>. Second, in many ways the book is a re-hash of the docs available at <http://www.djangoproject.com/documentation/>, which are also free. Third, the book covers Django 0.96, not SVN. (0.96 is technically the latest-snapshot release, but a lot has changed since 0.96.) And finally, the $45 MSRP could be seen as a little steep for what is effectively a printed copy of a free, online book.

The print experience

Diving in, the book takes the reader through the basic installation process, provides a brief background on how the framework came to be (and why you want one), then steps through the major features (i.e., the template system, ORM, URLconfs, generic views, etc.). It’s what you’d expect from a technical reference — no fluff, and straight to the details. There are plenty of code snippets to learn from, and the sidebar notes tend to be insightful.

Since it wasn’t new material for me, the book was a fairly quick read; but the experience of reading Django documentation in book-form was actually quite fascinating. There’s something about settling into a comfortable chair with a book, pen, and highlighter that you just can’t get with online documentation. Perhaps it was just a little more noticeable given the material. When I read the Django docs online, I tend to skim over them while trying to solve a problem. I use them as a reference more than a learning tool, and it’s usually while actively coding, thus my brain is partially distracted with whatever it is I’m building.

With a physical book, you can unplug, step away from the computer, and give the material your undivided attention. This isolation from distraction results in a much deeper understanding of the text. This is the real value of the printed book — it’s an opportunity to digest online documentation in an environment more conducive to learning and retention.

My general take-aways and observations

  • The book definitely has a beginner/intermediate feel to it, but only in the sense of a beginner Django user — not a beginner Web developer or Python programmer. I’m curious how well the book is received by folks who are beginners at Django and dynamic Web development since the text brings up a lot of complex topics in Web development that aren’t really explained. (E.g., database administration, server clustering, manipulating HTTP headers, etc.)
  • The breadth of the book is impressive, but in some ways, the book really feeds you through a firehose, so to speak. It throws a lot of new concepts at the reader and doesn’t always explain why you’d need to know them, or how you might use them in the real world. For someone deploying a site with Django, it will be good to know that all these features are available, but it might take a while before they need to use them (if ever).
  • The book does touch on some of the more advanced Django features (like extending the template system and writing custom middleware), which was nice, but some topics are reserved for the appendix and get limited coverage (e.g., model managers and ‘Q’ queries). Others, like the Sites Framework, are given good exposure, but not so much that the reader is left with a clear picture of when to use them and what their limitations are.
  • The forms processing chapter was a bit lighter than what I was hoping for — especially given that the current newforms documentation still trends toward “read the source code.” It provides enough to start using newforms if your form needs are pretty basic, but doesn’t address creating your own widgets, or any of the fun stuff you can do once you start dynamically generating and manipulating newforms objects.
  • It might have been nicer if the examples in the book were a little more tied together, perhaps all focused on building a single example project and showing how the various features are used in real-world applications. (The example of the book-publisher’s app was a recurring theme, but not so strongly that each chapter applied its new learnings to it.)
  • The Deploying Django: “Going Big” sub-section provides a nice infrastructure graphic for how high-traffic systems might be set up, but once you get to the point of being “big”, you need to architect for it, and that’s really outside of the scope of this book. For this section, it might have been nice to reference other resources on scaling infrastructure, and perhaps point out some of the ways that Django can be optimized for performance and horizontal scaling. (For example, one of the Django projects we put into production at work will happily support 1,200 requests/second, but the database layer and session middleware have been reworked a bit, and the content caching approach is a little different than the standard Django offering.)
  • On the more positive side, even as someone who’s been using Django for some time, I still learned a few new tricks, and I was reminded of a few features that I could be taking better advantage of. (And when you do this stuff professionally, every shortcut and productivity gain has monetary value — avoiding even a half-hour of debugging pays for the cost of this book.)
  • This book would make a fantastic read for a back-end developer joining a project that is already using Django. I normally tell new developers to go through the Python Tutorial at <http://python.org/doc/tut/> if they’re new to Python, then to complete the Django Tutorials at <http://www.djangoproject.com/documentation/> before trying to grok any in-progress Django project. Now I have a third reference (though I might still suggest that they walk through the tutorials first, so that they have some context when reading the book. Otherwise, there are just too many new concepts to do a straight read-through and still grasp it all.)

Summary

The market needed a good Django book, and this one delivered a solid reference for the framework. Arguably, it’s not really a “Beginner’s Guide to Django”, but hopefully it covers enough of the basics that future books can focus on best practices and more advanced techniques. (On a related note, there’s apparently an upcoming “Practical Django Projects” book, also from Apress, that will focus more on building “reusable Django applications from start to finish”. This might actually make for a better beginner’s book, depending on how it turns out. [Via The B-List: Speaking and writing].)

The million-dollar question then, is “Should you buy this book?” My answer ended up being a bit more positive than I expected, but there are two parts: First, if you’re a front-end developer only, you don’t need this book. You can just read Chapter 4: The Django Template System online, and then use the “Django Templates: Guide for HTML authors” section of the online docs as a reference. For back-end developers, the story is different. If you’re going to just “read it while you hack”, then you might as well just read it online; but if you’re serious about building applications with Django (especially if you’re new to it), then you should consider buying the book and investing the time to step away from the computer and really let yourself get into it. Unless you are an active contributor to Django (which I’m not, just to be clear), the odds are pretty good that you’ll learn something new, even if you’re already using Django today.

Django “lorem ipsum” generator (and a new contrib.webdesign module)

The Django Web Framework project just added a new contrib.webdesign module with an amazingly simple, but incredibly handy first feature: a lorem ipsum generator. The idea is that a project’s base templates can include generated lorem ipsum for testing layout and page flow, but inheriting templates can override the generated text once real content is available.

The lorem tag is used like this (via the contrib.webdesign docs):

  • {% lorem %} will output the common “lorem ipsum” paragraph.
  • {% lorem 3 p %} will output the common “lorem ipsum” paragraph and two random paragraphs each wrapped in HTML <p> tags.
  • {% lorem 2 w random %} will output two random Latin words.

In practice, you might do this:

templates/template.html:


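{% load webdesign %}
<html>
  <head>
    {# Block names must be unique within a template, so the page title gets its own block. #}
    <title>{% block page_title %}{% lorem 5 w %}{% endblock %}</title>
  </head>
  <body>
    <div class="article">
      <div class="article_title">{% block article_title %}{% lorem 5 w %}{% endblock %}</div>
      <div class="article_body">{% block article_body %}{% lorem 4 p %}{% endblock %}</div>
    </div>
  </body>
</html>

And then inherit when you’re ready:

templates/article.html:


{% extends "template.html" %}

{# An if tag wrapped around blocks has no effect in a child template, so test inside each block and fall back to the parent's lorem text. #}
{% block page_title %}{% if article %}{{ article.title }}{% else %}{{ block.super }}{% endif %}{% endblock %}
{% block article_title %}{% if article %}{{ article.title }}{% else %}{{ block.super }}{% endif %}{% endblock %}
{% block article_body %}{% if article %}{{ article.body }}{% else %}{{ block.super }}{% endif %}{% endblock %}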

Previously, I used to just paste lorem ipsum text directly into the main template (wrapped in block tags for overriding), but this new tag will let you skip the copy/paste routine. Very nice!

PyCon 2007 wrap-up

I’m back from PyCon 2007. It was a busy weekend, with 593 Pythonistas attending the conference. I took a fair amount of notes, but I’ve pulled out some highlights below:

From Ivan Krstic’s keynote on the One Laptop Per Child project:

  • Python is the language of the One Laptop Per Child (OLPC). Everything that can, will be done in Python… and there’s a “view source” button on the keyboard (view layout) so you can view (and edit) the source of your current running application.
  • The filesystem (which supports versioning) is called Yellow, and will be released within a week or so. The GUI is called Sugar, and is available on http://dev.laptop.org/ to play with. You can download the full image (or build the environment on Linux).
  • The OLPC supports 802.11s mesh networking.
  • The hand crank was removed for case durability. The OLPC’s are designed to last five years, but the torque from the hand-crank would have stressed the plastic case too much for it to last that long.
  • The first OLPC’s will start shipping in August of this year!
  • The OLPC hardware was getting ~1100 pystones before optimization. They are now up to ~2300 pystones (on a 366 MHz AMD Geode processor). (Note: This means they have better Python performance than Python for S60 is seeing on current S60 phones.)

From the Web Frameworks panel:

  • James Tauber, “Reinventing the wheel is great if your goal is to learn more about the wheel.”
  • Jonathan Ellis, “When you control the whole stack you can innovate faster.”

From Adele Goldberg’s keynote:

Public school education is so bad that real eLearning solutions can’t go to the schools — they need to be outside of schools so that you don’t have the traditional censorship that comes with public schools — and you don’t have the associations with the bad experiences kids have while at “school”.

From Jacob Kaplan-Moss’ talk, “Becoming an open-source developer: Lessons from the Django project”:

  1. Use good tools. “Open source is better because it’s better.”
  2. Avoid dogma. Don’t get stuck on what language something is implemented in.
  3. Work with (and hire) smart people. The model in open source is that if you’re smart, people listen to you. That’s rough if you’re not smart… But also means that it’s worthwhile to mention when you’re an expert on a topic.
  4. “Methodologies” suck. Ex., MVC is cool, but Django abuses it because it doesn’t fit so well with the web.
  5. DRY — Don’t Repeat Yourself. The one methodology to use.
  6. The business case for open source. You have to make one (to your company.)
    • Money. You’ll get recognized and sell services because of it. (Ex., Ellington wouldn’t be as successful without Django.)
    • Free labor. (Sad to think of this way, but true when you have an interesting project.)
    • Self-improvement. Knowing that peers will review your code makes you much more careful about the code you submit. This makes the code a lot better.
    • Geek cred — gaining credibility within the geek community makes it easier to hire great people.
    • Moral Argument — If you built a business on open source — it’s time to give something back.
    • Figure out where to draw the line — Django gave away the tools, but not all the apps.
  7. Selling open source to other companies. Microsoft’s FUD had been quite successful in some areas. Counter the “communist” argument with a “freedom” argument. Focus on the freedom of data — your data belongs to you; there is no vendor lock-in. Open vs. Lock-in is a better argument than Open vs. Closed.
  8. Create a community. This doesn’t just happen because you set up a mailing list. (Gave example of thanking people who post anti-Django blog posts and asking what they didn’t like.) Don’t say anything that would get you kicked out of a bar.
    • Avoid monsters (trolls, vampires, etc.) Detect them early, and ignore them.
    • Spam can’t be an afterthought. Collaborative tools require spam filtering from Day 1. You’ll get spam. Lots of it. Google Groups is pretty good about cutting out spam.
  9. Listen to the community. But smartly. Sometimes the vocal majority doesn’t represent the wishes of the whole community. Django’s magic-removal was a big risk, driven by the community. You also have to be willing to ask for help. Sometimes you don’t feel comfortable delegating tasks that you think suck — but not everyone has the same definition of “what sucks” — sometimes there’s someone who actually WANTS to do this task!
  10. Handling community contributions. You need a defined method for how you take contributions. It helped the Django project when they adopted a system for differentiating between patches that are controversial, and those that aren’t (i.e., simple bug fixes vs. design decisions). A ticket reviewer makes this decision.
  11. Learn to be comfortable saying ‘no’ — there are plenty of Python web frameworks, and maybe someone’s needs are better handled by another framework. “If everyone can check in features, you have PHP.”

From “The absolute minimum an open source developer must know about intellectual property”:

  • It’s a lawyer’s job to figure out what will go wrong with your plan. They are professional pessimists.
  • Only the “claims” in a patent are covered, not the stuff in the “specification.”
  • A header file is a “purely functional” expression, thus NOT-copyrightable.
  • If you don’t protect your Trademark, you lose it. This is why companies have to send cease-and-desist letters. The “get a first life” situation was important because Linden explicitly granted them a license to use the Second Life trademark in the parody, thus they were able to demonstrate that they were protecting their mark.
  • If you tell someone how to do the work (i.e., “work for hire”), then you own it.
  • An independent contractor owns their work unless the contract specifically assigns the rights to the company.
  • The person who made a patch owns the patch. By giving it to you, you get an implied license to use it, but because it’s implied, it’s fuzzy as to what you can do with it.

From Robert M. Lefkowitz’s keynote:

  • Only 2% of the population can read source code. (And free software doesn’t matter if no one can read it!)
  • Proprietary software values function. Free software people value the building of the “community of learning” around the software, even if it has fewer features.
  • The traditional view is that computer literacy is about one’s ability to use applications, rather than to program. If this is right, then what’s the point? Computers might as well be printing presses.
  • In literature, you read the greats (ex., Shakespeare), then try to write like them. So in computer literacy, who are the greats? If we were going to make every high school student memorize a program, what would it be?
  • Great programmers break the rules elegantly. Bad programmers break the rules without realizing it.