2010-08-16
Spam sucks. Some jerk comes to my website and wastes my bandwidth and storage with his ad for some “male enhancement” drug, which is actually mostly a sugar pill that some fat guy in Canada made in his basement. The least they could do was sell quality on my site since they aren’t sharing the profits.
Luckily, there are a variety of ways to minimize this nuisance using a slew of neat community-driven technologies and awesome WordPress plugins. Continue reading to check out the magic combination I find works best across the websites I run.
reCAPTCHA
CAPTCHAs, the distorted words hidden inside images that you often have to decipher on the web, can be horribly annoying. However, they are a fantastic first line of defense against spam on your website. They make it relatively easy to allow anonymous commenting while preventing bots (automated spamming software) from commenting or registering accounts on your site. Now, why reCAPTCHA specifically?

An example reCAPTCHA challenge, displaying "perceive ilrovie".
reCAPTCHA is a mature project that uses CAPTCHAs to digitize print material. So, for example, the New York Times gives Google (the owner of reCAPTCHA) its back catalog, which is scanned and analyzed by Optical Character Recognition (OCR) software. Most of the text is recognized with a high degree of certainty, but some isn’t. The words the machines fail to recognize are then relayed through reCAPTCHA for humans to solve. After a number of people provide the identical result, Google can trust that word to be “solved” and gives it back to the New York Times. So it’s two birds with one stone: we’re provided with hard-to-solve CAPTCHAs for our sites and someone gets their content digitized.
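The "a number of people provide the identical result" step is essentially a vote. Here is a minimal sketch of that idea in Python. The function name, the normalization, and the threshold of three matching answers are my own assumptions for illustration; I don't know the actual rules reCAPTCHA uses.

```python
from collections import Counter


def consensus(answers, min_votes=3):
    """Return the agreed transcription of an unknown word, or None.

    min_votes is an assumed threshold, not reCAPTCHA's real one.
    """
    # Ignore case and surrounding whitespace so "Perceive " and "perceive" match.
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if not normalized:
        return None
    word, votes = Counter(normalized).most_common(1)[0]
    return word if votes >= min_votes else None
```

With the answers "Perceive", "perceive ", "percieve", and "perceive", three of the four agree after normalization, so the word would count as solved. With only two differing answers, nothing is trusted yet.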
Now, that’s all well and good, but what if you don’t care about any of that? Beyond its purpose, as I mentioned, Google now owns the project, having acquired it last fall. This means reCAPTCHA scales well. It’s reliable and doesn’t suffer from outages or performance issues that could drag your site down. This is proven every day as the likes of Facebook, Twitter, and Craigslist rely on it.
It’s also well maintained, including a WordPress-specific plugin with multiple graphical styles, and is constantly evolving to better combat spam. On a bad day reCAPTCHA is 70% effective, but generally it can be anywhere from 90-95% effective. Yes, some bots will be able to get through it, which is why you don’t rely on it alone, as I go into later.
One final note on reCAPTCHA is that it includes “Mailhide,” a service that automatically identifies email addresses on your site, like ones users were dumb enough to leave in comments or forum posts, and replaces them with a link. Upon clicking the link, users are presented with a reCAPTCHA challenge; solve it and you get the address, fail and you don’t. The WordPress plugin even has granular control so you can automatically display email addresses to registered users but use Mailhide for anonymous visitors. It’s neat stuff, and goes a long way toward protecting your users from having their email addresses harvested by spammers.
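To make the Mailhide idea concrete, here is a rough Python sketch of the same technique: find addresses in a block of text and swap each one for a link, unless the viewer is a registered user. This is not the Mailhide API. The reveal URL, the `make_token` callback, and the masking format are all hypothetical stand-ins I made up for illustration.

```python
import re
from urllib.parse import quote

# A deliberately simple pattern; real-world address matching is messier.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical page that would show a CAPTCHA before revealing the address.
REVEAL_URL = "https://example.com/reveal?token="


def hide_emails(text, is_registered, make_token):
    """Replace email addresses with reveal links for anonymous visitors.

    make_token is a hypothetical callable that maps an address to an
    opaque token the reveal page can later resolve.
    """
    if is_registered:
        return text  # registered users see addresses as-is

    def replace(match):
        address = match.group(0)
        local, _, domain = address.partition("@")
        masked = local[:1] + "...@" + domain
        return '<a href="%s%s">%s</a>' % (REVEAL_URL, quote(make_token(address)), masked)

    return EMAIL_RE.sub(replace, text)
```

The point of the token is that the real address never appears in the page source, so a harvester scraping your HTML has nothing to collect.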
Mollom

A Mollom CAPTCHA challenge displaying "KBWIK," only presented when Mollom isn't sure if content is spam or not.
Mollom is a highly scalable Real-time Black List (RBL), meaning that as users submit comments and content to your site, each submission is compared on the spot to a continuously updated list of known spam. Keywords, links, usernames, email addresses, IP addresses: everything is analyzed and assigned a score. When a certain score is reached the content is marked as spam and blocked from your site. It’s quick, easy, and effective, so what’s the downside?
Well, Mollom is not free if you want to use it on a high-traffic site. You’ll also have to pay if you want advanced functionality such as removing their branding or checking content on SSL encrypted pages. It’s great for small and medium sites, but expect to pay for everything else.
Unfortunately, the WordPress plugin specifically lacks much beyond basic configuration. You can exempt roles such as Admins and Editors from analysis, check the score it has assigned your content, and view some basic logging and statistics. That’s about it, though Mollom has a secondary check which proves very effective.
Normally Mollom is very unintrusive: it either permits content or rejects it as spam, and normal users will almost never know it’s even running on your site. However, when Mollom isn’t sure, that is, when something has a score near the point of being labeled spam but not quite, Mollom will run a secondary check by presenting its own CAPTCHA. If the response is correct the content will post; if not, it’ll be rejected. This is an elegant setup since it works even for registered users.
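Here is a small Python sketch of that score-then-decide flow: accept, reject, or fall back to a CAPTCHA when the score lands in the gray zone. Everything in it is an assumption for illustration. The signals, weights, thresholds, and names are mine, not Mollom's, whose actual scoring isn't public as far as I know.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
SPAM_THRESHOLD = 0.8
UNSURE_THRESHOLD = 0.5


@dataclass
class Submission:
    body: str
    author_ip: str
    link_count: int


def spam_score(sub, bad_keywords, bad_ips):
    """Add up a few made-up signal weights into a score between 0 and 1."""
    score = 0.0
    if any(word in bad_keywords for word in sub.body.lower().split()):
        score += 0.5
    if sub.author_ip in bad_ips:
        score += 0.4
    score += min(sub.link_count, 3) * 0.1  # cap the link penalty
    return min(score, 1.0)


def decide(score):
    """Map a score to one of three outcomes."""
    if score >= SPAM_THRESHOLD:
        return "reject"
    if score >= UNSURE_THRESHOLD:
        return "captcha"  # not sure: ask the user to prove they are human
    return "accept"
```

The middle band is the interesting part. A comment that trips one signal but not the others gets a challenge instead of being silently dropped, which keeps false positives from costing you real comments.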
Unfortunately, the granular controls found in other platforms’ plugins, namely Drupal’s, simply aren’t in the WordPress plugin yet. There is no way to control what parts of your site Mollom runs on or how it runs on each part, for example. That said, Mollom is good at what it does. It’s both mature and scalable, used on sites like Adobe, Fox News, and Sony BMG. Neither reliability nor performance is an issue. Mollom itself claims 99.93% accuracy, which is only a tad higher than I find on my own sites, currently sitting at 96.9%. Still, reducing my workload by nearly 97%, especially since Mollom only kicks into effect after reCAPTCHA has failed, is pretty good in my books.
Akismet
Akismet is very similar in both form and function to Mollom. Normally I don’t like two competing technologies fighting each other to accomplish the same job. I would never recommend you run two anti-virus products on your computer or two firewalls, for example. However, Mollom and Akismet run sequentially, with Akismet only kicking in after reCAPTCHA and Mollom have failed, so I find this acceptable given that the additional performance overhead is minimal.

Akismet offers detailed and well rendered statistics about several facets of its operation.
Like Mollom, Akismet is a mature platform and free for personal use. Again like Mollom, you’ll have to pay in some circumstances, such as commercial use or if you pull in a large volume of traffic to your site. Akismet is also a bit more accurate than Mollom, currently holding 100% accuracy across my sites. In all fairness though, this is only after both reCAPTCHA and Mollom have filtered out the large majority of spam, so take this number with a grain of salt.
Akismet is also a very well-trusted platform, included by default with WordPress for a while now. It also integrates smoothly into WordPress, offering arguably better moderation tools than Mollom. The fact is the only reason I don’t use Akismet as my primary solution is because when both it and Mollom are operational WordPress delegates to Mollom first.
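The layered arrangement (reCAPTCHA, then Mollom, then Akismet) can be sketched as an ordered list of checks where each layer only sees what the previous one let through. In the Python below, the three checks are placeholder lambdas I invented, not calls to the real services. The `leak_rate` helper also assumes each layer's catch rate is independent of the others, which is a simplification.

```python
def first_blocking_layer(submission, layers):
    """Run checks in order; return the name of the first layer that flags
    the submission, or None if every layer lets it through."""
    for name, is_spam in layers:
        if is_spam(submission):
            return name
    return None


def leak_rate(catch_rates):
    """Fraction of spam that slips past every layer, assuming independence."""
    remaining = 1.0
    for rate in catch_rates:
        remaining *= 1.0 - rate
    return remaining


# Placeholder checks standing in for the real services.
EXAMPLE_LAYERS = [
    ("captcha", lambda s: not s["captcha_ok"]),
    ("filter_a", lambda s: "pills" in s["body"].lower()),
    ("filter_b", lambda s: s["links"] > 2),
]
```

Plugging in the figures from this post, a first layer that stops 90% and a second that stops 96.9% of what remains would let roughly 0.31% of spam through (`leak_rate([0.90, 0.969])`), which is why a third layer has so little left to do.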
Bad Behavior
Bad Behavior is a plugin I only recently implemented across my sites. The combination of reCAPTCHA, Mollom, and Akismet has brought my spam comments and content to nearly zero, but I found myself flooded with spam accounts being generated. The spam they posted would be caught, but I don’t like having a bunch of spammers sitting with access to my site, so I turned to Bad Behavior.

An example of Bad Behavior blocking access based off HTTP fingerprinting.
Bad Behavior relies upon two major checks. The first is an analysis of the incoming headers, IP, and other metadata when someone connects to your site. It can detect with some degree of accuracy whether you are a human, a known bot (like one working for Google, Yahoo, or Microsoft), or a spammer, then permit or deny access. On top of this header analysis and other checks for “bot-like activities,” Bad Behavior ties into the http:BL RBL. I know, a third RBL? What am I thinking!?
Well, the key point here is that Bad Behavior doesn’t analyze traffic at the point of content submission, but rather when a user first connects to your website. This means that instead of rejecting a spam comment, for example, it denies the bot access to your site altogether. This reduces your traffic load somewhat, but more importantly it prevents bots from opening accounts because they can’t reach your registration page.
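To show what a connection-time check looks like, as opposed to a submission-time one, here is a small Python sketch that inspects request headers and the client IP before any page is served. The rules are illustrative guesses on my part and are not Bad Behavior's actual rule set. The `blocklisted_ips` set is a stand-in for a real blacklist lookup such as http:BL.

```python
# Illustrative fragments only; NOT Bad Behavior's actual checks.
SUSPICIOUS_AGENT_FRAGMENTS = ("libwww-perl", "emailcollector")


def should_block(headers, ip, blocklisted_ips):
    """Return (blocked, reason) for an incoming request.

    headers is a dict of HTTP request headers; blocklisted_ips is a
    stand-in set for a real blacklist lookup.
    """
    lowered = {key.lower(): value for key, value in headers.items()}
    agent = lowered.get("user-agent", "").strip().lower()

    if ip in blocklisted_ips:
        return True, "ip on blocklist"
    if not agent:
        return True, "missing user-agent"
    if any(fragment in agent for fragment in SUSPICIOUS_AGENT_FRAGMENTS):
        return True, "suspicious user-agent"
    if "accept" not in lowered:
        # Browsers normally send an Accept header; many crude bots do not.
        return True, "missing accept header"
    return False, "ok"
```

Because this runs before the registration or comment form is ever rendered, a blocked bot never gets the chance to create an account in the first place.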
Moderation
Well, despite all the neat and helpful technology available, in the end you still need a human presence on your site to keep things in order. Banning troublemakers, deleting spam that makes it through the gauntlet above, and publishing false positives (content caught as spam which isn’t) are things only you can do. You can also predict trouble, such as viewing your logs for oddities or catching suspicious accounts with random usernames, throw-away email addresses (like Mailinator), and origins in China or Russia. The longer you run your site the more you’ll learn its patterns and notice what stands out as “just not right.” These are traits no technology can ever replace fully, so use everything above to lighten your workload, but always keep an eye out and your banhammer ready.
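A couple of those warning signs are easy to pre-screen so you know which accounts to look at first. The Python sketch below flags throw-away email domains and random-looking usernames. The vowel-ratio and digit-run heuristics and their cutoffs are rough assumptions of mine, and the output is meant as a to-review list for a human, not an automatic ban.

```python
import re

# Extend with whatever throw-away domains you see abused on your own site.
DISPOSABLE_DOMAINS = {"mailinator.com"}


def looks_random(username):
    """Crude guess at keyboard-mash usernames; the cutoffs are assumptions."""
    if re.search(r"\d{4,}", username):
        return True  # long digit runs, e.g. "user20481"
    letters = [c for c in username.lower() if c.isalpha()]
    if len(letters) < 8:
        return False  # too short to judge
    vowels = sum(c in "aeiou" for c in letters)
    return vowels / len(letters) < 0.2


def review_reasons(username, email):
    """Return a list of reasons this account deserves a manual look."""
    reasons = []
    domain = email.rsplit("@", 1)[-1].lower()
    if domain in DISPOSABLE_DOMAINS:
        reasons.append("throw-away email domain")
    if looks_random(username):
        reasons.append("random-looking username")
    return reasons
```

An empty list means nothing stood out, not that the account is clean, so this complements your own judgment rather than replacing it.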
This entry was initially on my former blog, LearnToHack.org, which I no longer maintain.