Closed Bug 1124358 Opened 9 years ago Closed 9 years ago

[spam] Experiment with Akismet

Categories

(developer.mozilla.org Graveyard :: General, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: hoosteeno, Assigned: hoosteeno)

References

Details

(Whiteboard: [patchwelcome][difficulty=expert])

We have a proposal on the table to apply Akismet to some/all content submissions on the MDN wiki. This will be an expensive project to undertake. Before committing to it, we should run an experiment to learn whether Akismet can help distinguish good content from bad on MDN.

Suggested experiment:

Build an application, run either manually or persistently, that will...
* Look for new revisions (perhaps using the revisions feed https://developer.mozilla.org/en-US/docs/feeds/atom/revisions) 
* Post the body, title, and email address associated with new revisions to the Akismet API (http://akismet.com/development/api/#comment-check)
* Capture the response and store it along with the revision's URL
* Visually compare the outcome of Akismet scans to the manual triage underway (over the course of a week) to learn if Akismet can viably replace human triage
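
The posting and capture steps above could be sketched roughly as follows. This is a minimal illustration, not the tool that was eventually written: the revision dictionary keys (`ip`, `url`, `email`, `title`, `body`, `user_agent`) and the helper names are assumptions, and folding the title into `comment_content` is just one possible choice since comment-check has no dedicated title field. The endpoint and the `true`/`false` response body follow Akismet's comment-check API.

```python
from urllib import parse, request

# Akismet comment-check endpoint; the API key forms the subdomain.
AKISMET_ENDPOINT = "https://{key}.rest.akismet.com/1.1/comment-check"


def build_payload(site_url, revision):
    """Map a revision (hypothetical dict layout) onto comment-check fields."""
    payload = {
        "blog": site_url,
        "user_ip": revision["ip"],
        "permalink": revision["url"],
        "comment_author_email": revision["email"],
        # Assumption: prepend the title to the body, since there is no title field.
        "comment_content": f"{revision['title']}\n\n{revision['body']}",
    }
    # Optional signal; only sent when we actually have it.
    if revision.get("user_agent"):
        payload["user_agent"] = revision["user_agent"]
    return payload


def parse_verdict(body):
    """Akismet answers 'true' for spam and 'false' for ham."""
    text = body.strip().lower()
    if text == "true":
        return "spam"
    if text == "false":
        return "ham"
    raise ValueError(f"unexpected Akismet response: {body!r}")


def check_revision(api_key, site_url, revision):
    """POST one revision and return (revision URL, verdict) for storage."""
    data = parse.urlencode(build_payload(site_url, revision)).encode("utf-8")
    url = AKISMET_ENDPOINT.format(key=api_key)
    with request.urlopen(request.Request(url, data=data), timeout=10) as resp:
        return revision["url"], parse_verdict(resp.read().decode("utf-8"))
```

Storing the returned pairs next to the human triage decision for the same URL would give the side-by-side comparison described in the last step.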

The tool coming out of this experiment might itself be useful to human triagers.
Severity: normal → enhancement
I wrote a little application using some node libraries that posts MDN revisions to the Akismet API[0]. It is a hack; please ignore the quality of the code and focus on the results.

I used this tool to post the contents and IP addresses of 100 random unbanned revisions (known ham) and 100 random banned revisions (known spam) from a recent MDN database export. I captured the results in a spreadsheet[1]. I ran this test three times, each with a fresh sample.

Across all 3 tests: 
* 98%-99% of ham was correctly identified as ham
* 70%-83% of spam was correctly identified as spam
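
To put those ranges in terms of triage workload, here is a small sketch that derives precision and overall accuracy from per-class counts. The function name is hypothetical, and pairing the low ends (98 ham, 70 spam) and the high ends (99 ham, 83 spam) into single runs is an assumption made for illustration only; the actual per-run figures are in the spreadsheet.

```python
def triage_rates(ham_correct, ham_total, spam_correct, spam_total):
    """Summarize a ham/spam test run from per-class correct counts."""
    false_positives = ham_total - ham_correct  # ham wrongly flagged as spam
    flagged = spam_correct + false_positives   # everything Akismet called spam
    return {
        "ham_accuracy": ham_correct / ham_total,
        "spam_recall": spam_correct / spam_total,
        "spam_precision": spam_correct / flagged if flagged else None,
        "overall_accuracy": (ham_correct + spam_correct) / (ham_total + spam_total),
    }


# Illustrative pairings of the reported range endpoints (an assumption).
low_end = triage_rates(98, 100, 70, 100)
high_end = triage_rates(99, 100, 83, 100)

for label, rates in (("low end", low_end), ("high end", high_end)):
    print(label, {name: round(value, 3) for name, value in rates.items()})
```

Under these assumed pairings, nearly everything Akismet flags would be real spam, while a meaningful share of spam would still need human review.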

Caveats:
* The Akismet API asks for many fields that we could capture but don't currently store, such as user agent, time of day, and language. If we included those fields, accuracy would probably go up, since Akismet relies on many signals to identify spam.
* The "known ham" and "known spam" I used are not perfect. Some of the ham I sent might actually have been spam. Some of the spam I sent might not have been true spam, but instead some other kind of objectionable content. Here too, the test probably understates the accuracy we would see on MDN's real traffic.

I believe this experiment shows that Akismet could substantially improve our spam triage: it caught most of the known spam while misclassifying only 1%-2% of the known ham.

Leaving this bug open a little while for discussion.

[0] https://github.com/hoosteeno/test_akismet/blob/master/test_akismet.js
[1] https://docs.google.com/spreadsheets/d/1PmvIp9nehcAREsQLzaAYqtQeNONQJ9taRqgeFcSVPjE/edit#gid=713193332
Assignee: nobody → hoosteeno
WANT.
Whoa, thanks :hoosteeno! This is great to know.
I opened a meta bug for spam heuristics where further discussion can occur: bug 1168472.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
See Also: → 1188029
I had no idea this experiment had already been run; I thought you were just getting ready to do it. Well done!

This does sound promising; it certainly reaffirms my feeling that we should go for it as soon as we can reasonably do so.
Product: developer.mozilla.org → developer.mozilla.org Graveyard