Comment Spam
I can add myself to the list of the comment-spammed. It seems Blog.com, besides catching the eye of the news media, has also drawn in the spam forces. I'm counting a couple hundred spam comments on my blog already...
Adding my 0.02 to the discussion by André: Blog.com requires two kinds of tools for dealing with comment spam, automated and manual. Manual tools are pretty straightforward: comments, which as of now are treated as a low-volume entity, must be treated as a high-volume one. That means comments must be searchable and manageable in sets (set deletion, in particular, is essential). Automated tools are a whole different beast...
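To make the manual side concrete, here is a minimal sketch of "searchable, and manageable in sets". The `Comment` record, its fields and both helper functions are made up for illustration; they are not Blog.com's actual data model or API.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    # Hypothetical comment record; field names are illustrative only.
    id: int
    author: str
    body: str


def search(comments, term):
    """Return comments whose author or body contains term (case-insensitive)."""
    needle = term.lower()
    return [c for c in comments
            if needle in c.author.lower() or needle in c.body.lower()]


def delete_set(comments, doomed):
    """Return the comments that remain after deleting every comment in doomed."""
    doomed_ids = {c.id for c in doomed}
    return [c for c in comments if c.id not in doomed_ids]


comments = [
    Comment(1, "alice", "Nice post!"),
    Comment(2, "bot17", "Cheap pills at example.invalid"),
    Comment(3, "bot18", "cheap PILLS, best prices"),
]
spam = search(comments, "cheap pills")     # select the whole set at once
remaining = delete_set(comments, spam)     # and delete it in one operation
```

The point is only that selection and deletion work on sets, so cleaning up a couple hundred spam comments costs one search and one click rather than a couple hundred.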
The objective of automated comment spam detection is one that is known to be a lost cause in the long run (or so we hope): differentiating between man and machine. We're trying to find out whether whoever/whatever posted a comment is a human or a spambot.
As of today, I guess the distinction can be based on two factors:
- Insert, into the comment post process, a task that is impossible or hard for a computer to do.
- Detect posting patterns that are notoriously different between humans and computers.
Difficult tasks have two problems of their own:
- They create another hoop people need to go through. When looking at large numbers, this always has a cost, be it 0.5% of your comments or 50%. The cost may be negligible, but it's not easy to measure. We just know it's never zero.
- As years go by, and computers evolve, there will always be fewer and fewer tasks that are difficult for computers. The remaining difficult tasks may prove to be difficult for humans too, aggravating the previous factor.
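For the second factor, posting patterns, a rough sketch might look like the following. The two signals used here (a form submitted too quickly after it was served, and too many posts from one source within a sliding window) and all three thresholds are my own assumptions, not measured values.

```python
from collections import defaultdict, deque

MIN_SECONDS_TO_WRITE = 3.0   # assumed: humans rarely submit faster than this
MAX_POSTS_PER_WINDOW = 5     # assumed burst limit per source
WINDOW_SECONDS = 60.0        # assumed sliding window length


class PatternDetector:
    def __init__(self):
        # source (e.g. an IP address) -> timestamps of its recent posts
        self._recent = defaultdict(deque)

    def looks_automated(self, source, form_served_at, posted_at):
        """Return True if this post matches a machine-like pattern."""
        history = self._recent[source]
        # Forget posts that fell out of the sliding window.
        while history and posted_at - history[0] > WINDOW_SECONDS:
            history.popleft()
        history.append(posted_at)
        too_fast = posted_at - form_served_at < MIN_SECONDS_TO_WRITE
        too_many = len(history) > MAX_POSTS_PER_WINDOW
        return too_fast or too_many
```

A spambot can of course slow down and spread its posts over many sources, so a check like this only raises the cost of spamming; it does not settle the man-versus-machine question.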
Now what? I don't have a solid solution yet. My current bet is on a difficult-task strategy, accepting the losses as being introduced by spammers, not by the anti-spam system. This should hold until HAL becomes sentient.
For difficult-task strategies, the most widespread, and thus most user-friendly, method is a code image: an image with a word or a number, which must be manually copied to a field. Human OCR is much better than automated OCR, so the image may carry lots of noise to raise the bar for automated OCR.
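The server side of such a code image could be sketched roughly as below. This shows only issuing and checking the code; drawing the noisy image is left out. The stateless design (an HMAC of the code sent along in a hidden form field) and every name in it are assumptions of mine, one of several ways to do it.

```python
import hashlib
import hmac
import secrets

SECRET_KEY = secrets.token_bytes(32)  # per-server secret; assumed to stay private
ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"  # skips look-alikes such as O/0 and I/1


def _sign(code):
    """Keyed hash of the code, so the form can carry it without revealing the code."""
    return hmac.new(SECRET_KEY, code.encode("utf-8"), hashlib.sha256).hexdigest()


def new_challenge(length=6):
    """Return (code, token): draw code into the image, put token in a hidden field."""
    code = "".join(secrets.choice(ALPHABET) for _ in range(length))
    return code, _sign(code)


def check_answer(typed, token):
    """True if what the visitor typed matches the code the token was issued for."""
    normalized = typed.strip().upper()
    return hmac.compare_digest(_sign(normalized).encode("utf-8"),
                               token.encode("utf-8"))
```

As written, a solved code and token pair could be replayed for many comments, so a real deployment would also need an expiry time or a single-use check on the token.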
There are accessibility issues: text-based browsers are unable to show the image, and the vision-impaired cannot read the code in it. The obvious workaround would be to add an equivalent sound code, where the code is voice-synthesized over a background of noise -- the exact same concept, applied to sound.

