[meta] MDN Spam Heuristics

RESOLVED DUPLICATE of bug 1188029

Status

Product: Mozilla Developer Network
Component: General
Priority: --
Severity: enhancement
Status: RESOLVED DUPLICATE of bug 1188029
Reported: 3 years ago
Last updated: 2 years ago

People

(Reporter: hoosteeno, Unassigned)

Tracking

(Blocks: 1 bug, {in-triage, productwanted})

Details

(Whiteboard: [specification][type:feature])

What problem would this feature solve?
======================================
1) It would limit the damage to content quality caused by spam articles
2) It would help protect MDN's search ranking
3) It would limit the amount of manual effort spent on spam triage 


Who has this problem?
=====================
All visitors to MDN

How do you know that the users identified above have this problem?
==================================================================
For the problems listed above:
1) No quality metrics currently exist.
2) In December 2014, the #1 most visited page on MDN was a spam page linking to a movie torrent site. This "spamdexing"[0] can cause major search engines to devalue the content of the site[1].
3) The spam triage team put together an analysis of spam triage impact[2]

[0] https://en.wikipedia.org/wiki/Spamdexing
[1] https://www.seroundtable.com/archives/021236.html
[2] https://docs.google.com/spreadsheets/d/1YijcYXfJTjqouOnj8zQULtPtxGvH-Lp9r6Tf2Ifqbwc/edit#gid=0

How are the users identified above solving this problem now?
============================================================
All users: disregarding spam pages
Triage team: manual discovery and handling

Do you have any suggestions for solving the problem? Please explain in detail.
==============================================================================
Yes! A spec, here:

https://docs.google.com/document/d/1ZgX6cGnrD2xiuiRjCNDS_LgRUFwTFNaTcqQrf-LOBeU/edit#

Is there anything else we should know?
======================================
Luke, can you review the spec, particularly the deliverables section? It needs technical review and an LOE (level of effort) estimate.
Flags: needinfo?(lcrouch)
Keywords: productwanted
Depends on: 1124358
Blocks: 1109994
Reviewed and provided the first estimates.
Flags: needinfo?(lcrouch)
Keywords: in-triage
A competing proposal for spam heuristics was presented at an earlier product council meeting[0]. It proposes to...
A) use a configurable regular expression to scan titles when they are changed or when pages are created
B) if the new title fails the regex test, the change fails with a form validation error
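A minimal sketch of that proposal, assuming a configurable pattern; the regex and names here are illustrative, not the actual MDN/Kuma configuration:

```python
import re

# Illustrative pattern only -- the real configurable regex is an
# assumption, not the actual site setting.
SPAM_TITLE_RE = re.compile(r"torrent|watch .* online|free .* download",
                           re.IGNORECASE)

def validate_title(title):
    """Fail the edit with a form-style validation error if the new or
    changed title matches the configured spam regex (proposal step B)."""
    if SPAM_TITLE_RE.search(title):
        raise ValueError("This page title is not permitted.")
    return title
```

Note that a title like "ufc 189" would pass any such pattern untouched, which is the weakness with this approach.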

Since that proposal, the variety of spam titles has increased. Spam has appeared under numerous kinds of titles that would defy a good regex. For example:

first Barcelona
open all windows
how they tell
battleground 2015
ask usa cuba
open high way
usa live
change effect
double usa
find search
ufc down
ufc 189
change Costa Rica
usa boston last
above share
Cuba en directo
Wayward Pines
need you again
want this found

These titles highlight one fact: any heuristic approach will always be incomplete. And I firmly believe that our own heuristics will always be more incomplete than those developed by a third-party business with thousands of customers that is entirely focused on this problem.

Considering this, I suggest we do not build our own heuristics. Instead, I suggest we combine the approach outlined in the regex pitch with the approach outlined in the Akismet pitch:

A) Run page edits through Akismet. Optionally, do this only for page edits from a particular cohort ("users with fewer than X edits", for example). 
B) If the edit fails the Akismet test, the submission fails form validation.
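A minimal sketch of that flow using only the standard library; the blog URL, `comment_type` value, and cohort threshold are assumptions, and the API key is a placeholder. Akismet's comment-check endpoint returns the literal string "true" for spam:

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

AKISMET_KEY = "YOUR_API_KEY"  # placeholder, not a real key

def should_check(author_edit_count, threshold=10):
    """Cohort gate (step A): only screen edits by users with fewer
    than `threshold` prior edits. The threshold is an assumption."""
    return author_edit_count < threshold

def build_payload(user_ip, author_name, content):
    """Form fields for Akismet's comment-check call; the `blog` and
    `comment_type` values here are illustrative."""
    return {
        "blog": "https://developer.mozilla.org/",
        "user_ip": user_ip,
        "comment_type": "wiki-edit",
        "comment_author": author_name,
        "comment_content": content,
    }

def edit_is_spam(user_ip, author_name, content):
    """POST the revision to Akismet; a response body of 'true' means
    spam, which would fail form validation (step B)."""
    url = "https://%s.rest.akismet.com/1.1/comment-check" % AKISMET_KEY
    data = urlencode(build_payload(user_ip, author_name, content)).encode()
    with urlopen(Request(url, data=data)) as resp:
        return resp.read().decode().strip() == "true"
```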

I will follow this message up with some additional Akismet tests. After that we may wish to close this bug and file a new bug, "implement Akismet edit blocking" or similar.

[0] https://docs.google.com/document/d/1VPvv2qrCc1QSTgpF-pigbDh0syJH5BHShvQhLHwPlqw/edit
I just ran 8 more Akismet experiments (in addition to the four I ran earlier this year).

Here's where we come out: 

A) Across all 12 experiments[3], Akismet found 79.59% of known spam, where "known" means the revision had one of these factors:
* The author was banned, or
* The document was deleted, or
* The title matched a regex

B) 77.82% of (a sample of) our deleted articles were identified by Akismet as spam.

C) 78.28% of our deleted articles were caught by our spam-filtering regex.
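For clarity, the percentages above are simple recall figures. The counts in this sketch are hypothetical, chosen only to reproduce the quoted 79.59%; the real tallies live in the experiment spreadsheet and are not reproduced here:

```python
def recall_pct(flagged, known_total):
    """Percent of known-spam revisions a filter flagged, rounded to
    two places as in the figures above."""
    return round(100.0 * flagged / known_total, 2)

# Hypothetical counts for illustration only:
print(recall_pct(39, 49))  # -> 79.59
```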

In other words, Akismet is as accurate now as regex title matching in finding spam among known spammy content. But the number of articles we delete that cannot be caught by a regex is increasing rapidly (see comment 3).

Considering that Akismet is as successful in finding spam in our deleted articles as the regex is, and that Akismet will also find spam that defies our best regexes, I think Akismet is the better way forward.

Closing this as a duplicate of bug 1188029.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1188029
Cool. Can you file the Akismet sec/privacy review bug blocking bug 1188033?
Flags: needinfo?(hoosteeno)