Closed Bug 1281270 Opened 9 years ago Closed 6 years ago

Prevent scrapers from accessing non-content URLs

Categories

(developer.mozilla.org Graveyard :: General, defect)

All
Other
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: jwhitlock, Unassigned)

References

Details

(Keywords: in-triage, Whiteboard: [specification][type:change])

What feature should be changed? Please provide the URL of the feature if possible.
==================================================================================
Add additional MDN URLs that scrapers and web crawlers should not access to
robots.txt, such as the Django admin.

What problems would this solve?
===============================
This further reduces the URLs that we have to consider for spam.

Who would use this?
===================
Scrapers and crawlers who respect robots.txt and support the optional wildcard
pattern matching.

What would users see?
=====================
Non-content MDN URLs would not appear in search results.

What would users do? What would happen as a result?
===================================================
They would not click non-content URLs.

Is there anything else we should know?
======================================
This task is already in progress:

* Endpoints were gathered in a (Mozilla-only) Google Doc [1].
* PR 3884 [2] contains the first round of changes.
* PR 3891 [3] contains the second round of changes.

[1] https://docs.google.com/spreadsheets/d/1X-YLmIg8vVvDWlShLF-361Cz2KcUBnlaPXBHLMaYEaQ
[2] https://github.com/mozilla/kuma/pull/3884
[3] https://github.com/mozilla/kuma/pull/3891
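As a concrete illustration, entries of the kind described above could look like the following robots.txt fragment. The paths here are hypothetical examples, not the actual rules landed on MDN (see the linked PRs for those); note that `*` wildcards inside paths are an extension honored by major crawlers rather than part of the original robots.txt convention:

```
# Hypothetical example of blocking non-content URLs
User-agent: *
Disallow: /admin/
Disallow: /*/dashboards/
```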
Commit pushed to master at https://github.com/mozilla/kuma

https://github.com/mozilla/kuma/commit/0f50d85ecd940c0c528b2ae01c6a1e771438b689
bug 1281270 - robots.txt edits: Part II (#3891)

* remove docs directory from robots rules with $ endings
* add slashes to robots.txt entries to block more specific pages
* add slashes around admin robots.txt entry
* fix admin pattern for robots.txt
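The trailing-slash and `$`-ending changes in this commit matter because of how wildcard-extended robots.txt rules match URL paths. A minimal sketch of that matching behavior follows; this is an illustration of how compliant crawlers are generally assumed to interpret the rules, not kuma code, and the rule strings are hypothetical:

```python
import re

def rule_to_regex(pattern):
    # Translate a robots.txt path pattern into an anchored regex:
    # '*' matches any character sequence, a trailing '$' anchors the end.
    regex = re.escape(pattern).replace(r'\*', '.*')
    if regex.endswith(r'\$'):
        regex = regex[:-2] + '$'
    return re.compile('^' + regex)

def is_blocked(path, disallow_rules):
    # A path is blocked if any Disallow rule matches it from the start.
    return any(rule_to_regex(rule).match(path) for rule in disallow_rules)

rules = ['/admin/', '/*/dashboards/*']       # hypothetical entries
print(is_blocked('/admin/login/', rules))    # True
print(is_blocked('/en-US/docs/Web/HTML', rules))  # False

# A trailing '$' limits the rule to the exact path, which is why the
# commit removed '$'-ended rules that were blocking too little:
print(is_blocked('/docs', ['/docs$']))       # True
print(is_blocked('/docs/Web', ['/docs$']))   # False
```

Without the trailing slash, a rule like `/admin` would also match unrelated paths such as `/administrivia`, which is the kind of over- and under-matching the slash and `$` edits address.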

The scrapers that do not respect robots.txt are more of a problem. The work in bug 1525719 is a better long-term solution, and should result in one domain that allows scraping of all URLs, and one domain that requires login and forbids all scraping.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INCOMPLETE
See Also: → 1525719
Product: developer.mozilla.org → developer.mozilla.org Graveyard