We have a proposal on the table to apply Akismet to some or all content submissions on the MDN wiki. Because that will be an expensive project to undertake, we should first run an experiment to learn whether Akismet can reliably distinguish good content from bad content on MDN.
Build a manually run or persistent application that will:
* Look for new revisions (perhaps using the revisions feed https://developer.mozilla.org/en-US/docs/feeds/atom/revisions)
* Post the body, title, and email address associated with new revisions to the Akismet API (http://akismet.com/development/api/#comment-check)
* Capture the response and store it along with the revision's URL
* Visually compare the outcome of Akismet scans to the manual triage underway (over the course of a week) to learn if Akismet can viably replace human triage
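The steps above could be sketched roughly as follows. This is a minimal illustration, not the tool used in the experiment: the revision dictionary keys (`url`, `title`, `body`, `email`, `ip`) and the helper names are hypothetical, and the endpoint and form field names (`blog`, `user_ip`, `comment_author_email`, `comment_content`) follow the Akismet comment-check documentation as I understand it, so verify them against the API page linked above before relying on this.

```python
from urllib import parse, request

# Assumed endpoint form: the API key is used as a subdomain.
AKISMET_URL = "https://{key}.rest.akismet.com/1.1/comment-check"


def build_payload(revision, site_url):
    """Map a revision (hypothetical dict shape) onto comment-check form fields."""
    payload = {
        "blog": site_url,
        "user_ip": revision.get("ip", ""),
        "comment_author_email": revision.get("email", ""),
        "comment_content": revision["title"] + "\n" + revision["body"],
    }
    # Drop empty values rather than sending blank fields.
    return {key: value for key, value in payload.items() if value}


def parse_verdict(response_body):
    """Akismet answers with the text 'true' (spam) or 'false' (ham)."""
    text = response_body.strip().lower()
    if text == "true":
        return "spam"
    if text == "false":
        return "ham"
    return "error"


def check_revision(revision, api_key, site_url):
    """POST one revision and return its URL alongside the verdict."""
    data = parse.urlencode(build_payload(revision, site_url)).encode("utf-8")
    req = request.Request(AKISMET_URL.format(key=api_key), data=data)
    with request.urlopen(req, timeout=10) as resp:
        verdict = parse_verdict(resp.read().decode("utf-8"))
    return {"url": revision["url"], "verdict": verdict}
```

Keeping the payload construction and the response parsing separate from the network call makes it possible to check both without an API key.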
The tool coming out of this experiment might itself be useful to human triagers.
I wrote a little application using some node libraries that posts MDN revisions to the Akismet API. It is a hack; please ignore the quality of the code and focus on the results.
I used this tool to post the contents and IP addresses of 100 random unbanned revisions (known ham) and 100 random banned revisions (known spam) from a recent MDN database export, and captured the results in a spreadsheet. I ran this test three times, each with a fresh random sample.
Across all 3 tests:
* 98%-99% of ham was correctly identified as ham
* 70%-83% of spam was correctly identified as spam
* The Akismet API accepts many fields that we could capture but don't currently store: for example, user agent, time of day, and language. If we included those fields, accuracy would probably go up, since Akismet relies on a host of criteria to identify spam.
* The "known ham" and "known spam" I used are not perfect. Some of the ham I sent might actually have been spam, and some of the spam I sent might not have been true spam but some other kind of objectionable content. Here again, the test probably understates the accuracy MDN would see in real use.
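For anyone repeating the test, the per-class percentages above can be computed from (known label, Akismet verdict) pairs along these lines. This is only a sketch: the function name is hypothetical and the four sample pairs are toy values for illustration, not data from the experiment.

```python
def accuracy_by_label(results):
    """Return the fraction of correct verdicts for each known label.

    `results` is an iterable of (known_label, akismet_verdict) pairs,
    where both values are either "ham" or "spam".
    """
    totals = {}
    correct = {}
    for known, verdict in results:
        totals[known] = totals.get(known, 0) + 1
        if known == verdict:
            correct[known] = correct.get(known, 0) + 1
    return {label: correct.get(label, 0) / totals[label] for label in totals}


# Toy input for illustration only (not the experiment's data).
sample = [("ham", "ham"), ("ham", "ham"), ("spam", "spam"), ("spam", "ham")]
rates = accuracy_by_label(sample)
print(rates)
```

Reporting ham and spam separately matters here because the two kinds of error have different costs: a misclassified ham revision blocks a legitimate contributor, while a missed spam revision still needs human triage.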
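I believe this experiment demonstrates that Akismet could dramatically improve our spam triage: even with the limited fields I sent, it caught 70%-83% of known spam while misclassifying only 1%-2% of known ham.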
Leaving this bug open a little while for discussion.
Whoa, thanks :hoosteeno! This is great to know.
I opened a meta bug for spam heuristics where further discussion can occur: bug 1168472.
I had no idea this experiment had already been run; I thought you were just getting ready to do it. Well done!
This does sound promising; it certainly reaffirms my feeling that we should go for it as soon as we can reasonably do so.