Sunday, May 30, 2004

Comment Spam

I can now add myself to the list of the comment-spammed. It seems blog.com, besides catching the eye of the news media, has also caught the eye of the spam forces. I'm counting a couple hundred spam comments on my blog already...

Adding my 0.02€ to the discussion by André: Blog.com requires two kinds of tools for dealing with comment spam: automated ones and manual ones. Manual tools are pretty straightforward: comments, which as of now are treated as something that comes in small numbers, must be treated as something that comes in large volumes. That means comments must be searchable and manageable in sets (set deletion, in particular, is essential). Automated tools are a whole different beast...
The objective of automated comment spam detection is one that is known to be a lost cause in the long run (or so we hope): differentiating between man and machine. We're trying to find out whether whoever/whatever posted a comment is a human or a spambot.
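The manual side (search, then delete the whole result set) could look roughly like the sketch below. It is only an illustration: the `Comment` fields and the `search` and `delete_set` helpers are names made up for this example, not anything Blog.com actually exposes.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    # Hypothetical minimal comment record, for illustration only.
    id: int
    author: str
    body: str


def search(comments, term):
    """Return comments whose author or body contains term (case-insensitive)."""
    needle = term.lower()
    return [c for c in comments
            if needle in c.author.lower() or needle in c.body.lower()]


def delete_set(comments, doomed):
    """Return the comments that remain after removing every comment in doomed."""
    doomed_ids = {c.id for c in doomed}
    return [c for c in comments if c.id not in doomed_ids]


comments = [
    Comment(1, "Ana", "Nice post!"),
    Comment(2, "bot17", "Cheap pills at example.com"),
    Comment(3, "bot42", "Visit example.com for cheap pills"),
]
matches = search(comments, "cheap pills")   # select the spam as a set
remaining = delete_set(comments, matches)   # remove the whole set in one go
```

The point is that the owner acts on a set of comments at once instead of clicking through a couple hundred of them one by one.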

As of today, I guess the distinction can be based on one of two options:
  1. Insert, into the comment post process, a task that is impossible or hard for a computer to do.
  2. Detect posting patterns that are markedly different between humans and computers.
Strategies based on option one have two large flaws:
  1. They create another hoop people need to go through. When looking at large numbers, this always has a cost, be it 0.5% of your comments or 50%. The cost may be negligible, but it's not easy to measure. We just know it's never zero.
  2. As years go by, and computers evolve, there will always be fewer and fewer tasks that are difficult for computers. The remaining difficult tasks may prove to be difficult for humans too, aggravating the previous factor.
Strategies based on option two are mostly circumventable. I'd even wager they are all circumventable, if the automatic analysis is required to produce no false positives. If they can be worked around, then they're always a Club Solution. They present diminishing returns as the blog hosting service gets big and spammers get more responsive to new pattern checking.
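To make the pattern idea concrete, here is a minimal sketch of one possible check: flagging a source that posts too many comments in a short time. The choice of posting rate as the pattern, the `RateChecker` name and the thresholds are all assumptions made for this example.

```python
from collections import defaultdict, deque


class RateChecker:
    """Flags a source that posts more than max_posts comments within window seconds."""

    def __init__(self, max_posts=3, window=60.0):
        self.max_posts = max_posts
        self.window = window
        self.history = defaultdict(deque)  # source -> timestamps of recent posts

    def looks_automated(self, source, timestamp):
        recent = self.history[source]
        # Forget posts that have fallen out of the sliding window.
        while recent and timestamp - recent[0] >= self.window:
            recent.popleft()
        recent.append(timestamp)
        return len(recent) > self.max_posts


checker = RateChecker(max_posts=3, window=60.0)
# Four posts in four seconds from one source: only the fourth gets flagged.
burst = [checker.looks_automated("10.0.0.7", t) for t in (0, 1, 2, 3)]
# The same four posts spread over five minutes: nothing gets flagged.
slow = [checker.looks_automated("10.0.0.8", t) for t in (0, 100, 200, 300)]
```

The `slow` case is the circumvention in miniature: a spambot that paces itself, or rotates sources, slips under the threshold, and lowering the threshold starts catching real people.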

Now what? I don't have a solid solution yet. My current bet is on a difficult-task strategy, accepting the losses as a cost introduced by spammers, not by the anti-spam system. This should hold until HAL becomes sentient.

For difficult-task strategies, the most widespread, and thus most familiar, method is using a code image: an image with a word or a number, which must be manually copied to a field. Humans are much better than automated OCR at reading degraded text, so the image may have lots of noise to raise the bar on the automated OCR difficulty.
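The bookkeeping around a code image can be sketched as below. This covers only generating the code and checking what was typed; drawing the code into a noisy image is left out. The signed-token approach, `SECRET_KEY`, `ALPHABET` and the function names are assumptions for this example, not a description of any existing system.

```python
import hashlib
import hmac
import secrets

SECRET_KEY = secrets.token_bytes(32)  # hypothetical per-server secret
ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"  # skips look-alikes: I, O, 0, 1


def _sign(code):
    """HMAC of the code, so the form can carry proof of the code without the code itself."""
    return hmac.new(SECRET_KEY, code.encode("utf-8"), hashlib.sha256).hexdigest()


def new_challenge(length=6):
    """Return (code, token): the code goes into the image, the token into a hidden field."""
    code = "".join(secrets.choice(ALPHABET) for _ in range(length))
    return code, _sign(code)


def check_answer(typed, token):
    """True when the typed text matches the code the token was issued for."""
    normalized = typed.strip().upper()
    return hmac.compare_digest(_sign(normalized), token)


code, token = new_challenge()
accepted = check_answer(code.lower(), token)  # case and stray spaces are forgiven
rejected = check_answer("WRONG!", token)
```

As written, a token stays valid forever, so one solved image could be replayed; a real deployment would presumably also need expiry or single-use tracking.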

There are accessibility issues: text-based browsers are unable to show the image, and the vision-impaired cannot read the code in it. The obvious workaround would be to add an equivalent sound code, where the code is voice-synthesized over a background of noise -- the exact same concept, applied to sound.
Posted by K at 19:21:23