Bug 1124358 - [spam] Experiment with Akismet
Status: RESOLVED FIXED
[patchwelcome][difficulty=expert]
Product: Mozilla Developer Network
Classification: Other
Component: General
Version: unspecified
Platform: All All
Importance: -- enhancement
Target Milestone: ---
Assigned To: Justin Crawford [:hoosteeno] [:jcrawford]
Blocks: 1168472
Reported: 2015-01-21 12:29 PST by Justin Crawford [:hoosteeno] [:jcrawford]
Modified: 2015-07-27 12:38 PDT

Description Justin Crawford [:hoosteeno] [:jcrawford] 2015-01-21 12:29:56 PST
We have a proposal on the table to apply Akismet to some/all content submissions on the MDN wiki. This will be an expensive project to undertake. We should run an experiment to learn if Akismet can help identify good and bad content on MDN. 

Suggested experiment:

Build an application, either run manually or running persistently, that will...
* Look for new revisions (perhaps using the revisions feed https://developer.mozilla.org/en-US/docs/feeds/atom/revisions) 
* Post the body, title, and email address associated with new revisions to the Akismet API (http://akismet.com/development/api/#comment-check)
* Capture the response and store it along with the revision's URL
* Visually compare the outcome of Akismet scans to the manual triage underway (over the course of a week) to learn if Akismet can viably replace human triage
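
A minimal sketch of the "post to Akismet, store the response" steps above, in Python. It is not the tool used for this bug. The endpoint shape and the `blog`, `user_ip`, `user_agent`, `comment_author_email`, `comment_content`, and `permalink` parameters come from the Akismet comment-check API; the `revision` dictionary layout, the helper names, and the choice to send the title and body together as `comment_content` are assumptions made for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_comment_check(api_key, site_url, revision):
    """Return (url, body_bytes) for an Akismet comment-check request.

    `revision` is a hypothetical dict with keys: url, title, body, email, ip,
    and optionally user_agent.
    """
    url = f"https://{api_key}.rest.akismet.com/1.1/comment-check"
    fields = {
        "blog": site_url,
        "user_ip": revision["ip"],
        "comment_author_email": revision["email"],
        # Assumption: Akismet has no title field, so send title + body together.
        "comment_content": revision["title"] + "\n\n" + revision["body"],
        "permalink": revision["url"],
    }
    if revision.get("user_agent"):
        fields["user_agent"] = revision["user_agent"]
    return url, urlencode(fields).encode("utf-8")


def parse_verdict(response_text):
    """Map the comment-check response body to a label."""
    text = response_text.strip()
    if text == "true":
        return "spam"
    if text == "false":
        return "ham"
    return "unknown"  # e.g. an error message instead of a verdict


def check_revision(api_key, site_url, revision):
    """POST one revision to Akismet and return a record to store."""
    url, body = build_comment_check(api_key, site_url, revision)
    with urlopen(Request(url, data=body)) as resp:  # data= makes this a POST
        verdict = parse_verdict(resp.read().decode("utf-8"))
    return {"revision_url": revision["url"], "akismet": verdict}
```

The records returned by `check_revision` could be appended to a spreadsheet or CSV and compared against the manual triage decisions for the same revision URLs.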

The tool coming out of this experiment might itself be useful to human triagers.
Comment 1 Justin Crawford [:hoosteeno] [:jcrawford] 2015-05-18 14:53:49 PDT
I wrote a little application using some node libraries that posts MDN revisions to the Akismet API[0]. It is a hack; please ignore the quality of the code and focus on the results.

I used this tool to post the contents and IP addresses of 100 random unbanned revisions (known ham) and 100 random banned revisions (known spam) from a recent MDN database export. I captured the results in a spreadsheet[1]. I did this three different times.

Across all 3 tests: 
* 98%-99% of ham was correctly identified as ham
* 70%-83% of spam was correctly identified as spam
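
To make those percentages concrete: with 100 samples per class, they translate directly into counts of false positives and missed spam. The sketch below works that arithmetic through for the two extremes of the reported ranges. The function name and the pairing of the lowest ham figure with the lowest spam figure are assumptions for illustration; the spreadsheet may pair them differently per run.

```python
def rates(ham_ok, ham_total, spam_ok, spam_total):
    """Summarize one test run from counts of correctly classified items."""
    false_pos = ham_total - ham_ok      # ham that Akismet flagged as spam
    missed_spam = spam_total - spam_ok  # spam that Akismet let through
    flagged = spam_ok + false_pos       # everything Akismet called spam
    return {
        "ham_accuracy": ham_ok / ham_total,
        "spam_recall": spam_ok / spam_total,
        "spam_precision": spam_ok / flagged if flagged else None,
        "false_positives": false_pos,
        "missed_spam": missed_spam,
    }


# Bounding cases built from the reported ranges (100 ham, 100 spam per test).
low = rates(ham_ok=98, ham_total=100, spam_ok=70, spam_total=100)
high = rates(ham_ok=99, ham_total=100, spam_ok=83, spam_total=100)

for name, r in (("low", low), ("high", high)):
    print(name, r["false_positives"], r["missed_spam"],
          round(r["spam_precision"], 3))
```

Under these assumptions, each test flags only 1 or 2 good revisions out of 100 and misses between 17 and 30 of 100 spam revisions.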

Caveats:
* The Akismet API asks for many fields that we can capture, but don't currently store -- for example, user agent, time of day, language, etc. If we included those fields the accuracy would probably go up, since Akismet depends on a host of criteria to identify spam.
* The "known ham" and "known spam" I used are not perfect. Some of the ham I sent might actually have been spam. Some of the spam I sent might not have been true spam, but instead some other kind of objectionable content. Again in this instance, the test is probably less accurate than MDN's real results would be.

I believe this experiment demonstrates the power of Akismet to dramatically improve our spam triage.

Leaving this bug open a little while for discussion.

[0] https://github.com/hoosteeno/test_akismet/blob/master/test_akismet.js
[1] https://docs.google.com/spreadsheets/d/1PmvIp9nehcAREsQLzaAYqtQeNONQJ9taRqgeFcSVPjE/edit#gid=713193332
Comment 2 Eric Shepherd [:sheppy] 2015-05-18 15:08:43 PDT
WANT.
Comment 3 Luke Crouch [:groovecoder] 2015-05-22 07:54:26 PDT
Whoa, thanks :hoosteeno! This is great to know.
Comment 4 Justin Crawford [:hoosteeno] [:jcrawford] 2015-05-26 10:18:51 PDT
I opened a meta bug for spam heuristics where further discussion can occur: bug 1168472.
Comment 5 Eric Shepherd [:sheppy] 2015-07-27 12:38:11 PDT
I had no idea this experiment had already been run; I thought you were just getting ready to do it. Well done!

This does sound promising; it certainly reaffirms my feeling that we should go for it as soon as we can reasonably do so.
