What problem would this feature solve?
======================================
1) It would limit the damage to content quality caused by spam articles
2) It would help protect MDN's search ranking
3) It would limit the amount of manual effort spent on spam triage

Who has this problem?
=====================
All visitors to MDN

How do you know that the users identified above have this problem?
==================================================================
For the problems listed above:
1) No quality metrics currently exist.
2) In December 2014, the #1 most visited page on MDN was a spam page linking to a movie torrent site. This "spamdexing" can cause major search engines to devalue the content of the site.
3) The spam triage team put together an analysis of spam triage impact.

https://en.wikipedia.org/wiki/Spamdexing
https://www.seroundtable.com/archives/021236.html
https://docs.google.com/spreadsheets/d/1YijcYXfJTjqouOnj8zQULtPtxGvH-Lp9r6Tf2Ifqbwc/edit#gid=0

How are the users identified above solving this problem now?
============================================================
All users: disregarding spam pages
Triage team: manual discovery and handling

Do you have any suggestions for solving the problem? Please explain in detail.
==============================================================================
Yes! A spec, here:
https://docs.google.com/document/d/1ZgX6cGnrD2xiuiRjCNDS_LgRUFwTFNaTcqQrf-LOBeU/edit#

Is there anything else we should know?
======================================
Luke, can you review the spec, particularly the deliverables section? It needs technical review and an LOE (level of effort) estimate.
Reviewed and provided the first estimates.
A competing proposal for spam heuristics was presented at an earlier product council meeting. It proposes to:

A) Use a configurable regular expression to scan titles when they are changed or when pages are created.
B) If the new title fails the regex test, the change fails with a form validation error.

Since that proposal, the variety of titles has increased. Spam has appeared under numerous different kinds of titles that would defy a good regex. For example:

first Barcelona open all windows
how they tell battleground 2015
ask usa cuba open
high way usa live
change effect double usa
find search ufc down
ufc 189 change Costa Rica
usa boston last above share
Cuba en directo
Wayward Pines need you again want this found

These highlight one fact: any heuristic approach will always be incomplete. And I firmly believe that our own heuristic approaches will always be more incomplete than the heuristics developed by a 3rd-party business with thousands of customers that is entirely focused on this problem.

Considering this, I suggest we do not build our own heuristics. Instead, I suggest we combine the approach outlined in the regex pitch with the approach outlined in the Akismet pitch:

A) Run page edits through Akismet. Optionally, do this only for page edits from a particular cohort ("users with fewer than X edits", for example).
B) If the edit fails the Akismet test, the submission fails form validation.

I will follow this message up with some additional Akismet tests. After that we may wish to close this bug and file a new bug, "implement Akismet edit blocking" or similar.

https://docs.google.com/document/d/1VPvv2qrCc1QSTgpF-pigbDh0syJH5BHShvQhLHwPlqw/edit
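For illustration, the combined approach (regex title screen for everyone, Akismet screen gated to a low-edit-count cohort, failure surfaced as a form validation error) could be sketched roughly like this in Python. The regex pattern, the threshold, and all function names here are illustrative placeholders, not taken from the spec; `akismet_check` stands in for a real Akismet client call (e.g. the comment-check endpoint).

```python
import re

# Hypothetical pattern of known-spammy title fragments (illustrative only).
SPAM_TITLE_RE = re.compile(r"(live\s+stream|watch\s+online|free\s+download)", re.I)

# Cohort gate: only send edits from users with fewer than X edits to Akismet.
NEW_USER_EDIT_THRESHOLD = 10

def is_probably_spam(title, body, user_edit_count, akismet_check):
    """Return True if the edit should be rejected with a form validation error.

    `akismet_check` is a callable standing in for an Akismet API client;
    it takes the submitted text and returns True when Akismet flags it as spam.
    """
    # A) Cheap regex screen on the title, applied to every edit.
    if SPAM_TITLE_RE.search(title):
        return True
    # B) Akismet screen, limited to the low-edit-count cohort.
    if user_edit_count < NEW_USER_EDIT_THRESHOLD:
        return akismet_check(title + "\n" + body)
    return False
```

In a Django app like Kuma, a check like this would most naturally live in the revision form's validation, raising a `ValidationError` when it returns True; the real Akismet comment-check endpoint also expects metadata such as the user's IP and user agent, which is omitted here for brevity.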
I just ran 8 more Akismet experiments (in addition to the four I ran earlier this year). Here's where we come out:

A) Across all 12 experiments, Akismet found 79.59% of known spam, where "known" means the revision had one of these factors:
   * The author was banned, or
   * The document was deleted, or
   * The title matched a regex
B) 77.82% of (a sample of) our deleted articles were identified by Akismet as spam.
C) 78.28% of our deleted articles were caught by our spam-filtering regex.

In other words, Akismet is now as accurate as regex title matching at finding spam among known spammy content. But the number of articles we delete that cannot be caught by a regex is increasing rapidly (see comment 3). Considering that Akismet is as successful at finding spam in our deleted articles as the regex is, and that Akismet will also find spam that defies our best regexes, I think Akismet is the better way forward.

Closing this as a duplicate of bug 1188029.
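The headline numbers above are recall figures: of the revisions already known to be spam, what fraction did each filter flag? A minimal sketch of that computation (the function name and the tuple representation of revisions are assumptions for illustration, not from the experiment scripts):

```python
def filter_recall(revisions):
    """Fraction of known-spam revisions that the filter flagged.

    `revisions` is a list of (known_spam, flagged) boolean pairs; a revision
    counts as known spam if its author was banned, its document was deleted,
    or its title matched the spam regex.
    """
    known = [r for r in revisions if r[0]]
    if not known:
        return 0.0
    caught = sum(1 for _, flagged in known if flagged)
    return caught / len(known)
```

Run over the 12 experiments' labeled revisions, this is the calculation behind the 79.59% Akismet figure and the 78.28% regex figure being compared.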
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1188029
Cool. Can you file the Akismet sec/privacy review bug blocking bug 1188033?