Bug 1281270
Opened 9 years ago · Closed 6 years ago
Prevent scrapers from accessing non-content URLs
Categories: developer.mozilla.org Graveyard :: General, defect
Status: RESOLVED INCOMPLETE
Reporter: jwhitlock; Assignee: Unassigned
Keywords: in-triage; Whiteboard: [specification][type:change]
What feature should be changed? Please provide the URL of the feature if possible.
==================================================================================
Add to robots.txt additional MDN URLs that scrapers and web crawlers should not access, such as the Django admin.
What problems would this solve?
===============================
This further reduces the set of URLs that we have to consider for spam.
Who would use this?
===================
Scrapers and crawlers that respect robots.txt and support the optional wildcard pattern matching.
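As an illustration (the paths below are hypothetical, not MDN's actual rules), a well-behaved crawler can check such rules with Python's standard-library parser. Note that `urllib.robotparser` implements only plain prefix matching from the original spec, not the optional wildcard (`*`/`$`) extension this bug relies on:

```python
from urllib import robotparser

# Illustrative rules only -- not MDN's actual robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /users/
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Content pages remain crawlable; non-content URLs are blocked.
print(parser.can_fetch("ExampleBot", "https://example.com/en-US/docs/Web"))  # True
print(parser.can_fetch("ExampleBot", "https://example.com/admin/login/"))    # False
```

Crawlers that do honor the wildcard extension (most major search engines) can be given stricter patterns, but prefix rules like the above degrade gracefully for parsers that cannot.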
What would users see?
=====================
Non-content MDN URLs would not appear in search results.
What would users do? What would happen as a result?
===================================================
They would not click non-content URLs.
Is there anything else we should know?
======================================
This task is already in progress:
* Endpoints were gathered in a (Mozilla-only) Google Doc [1].
* PR 3884 [2] contains the first round of changes
* PR 3891 [3] contains the second round of changes
[1] https://docs.google.com/spreadsheets/d/1X-YLmIg8vVvDWlShLF-361Cz2KcUBnlaPXBHLMaYEaQ
[2] https://github.com/mozilla/kuma/pull/3884
[3] https://github.com/mozilla/kuma/pull/3891
Comment 1•9 years ago
Commit pushed to master at https://github.com/mozilla/kuma
https://github.com/mozilla/kuma/commit/0f50d85ecd940c0c528b2ae01c6a1e771438b689
bug 1281270 - robots.txt edits: Part II (#3891)
* remove docs directory from robots rules with $endings
* add slashes to robots.txt entries to block more specific pages
* add slashes around admin robots.txt entry
* fix admin pattern for robots.txt
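For illustration only (the exact entries changed are in the PRs referenced above), the "add slashes" fixes amount to the difference between a bare prefix and a delimited path segment in robots.txt:

```
# Too broad: a bare prefix also matches /administrivia, /admin-guide, ...
Disallow: /admin

# Trailing slash restricts the rule to the admin app's own URLs:
Disallow: /admin/
```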
Reporter
Comment 2•6 years ago
The scrapers that do not respect robots.txt are more of a problem. The work in bug 1525719 is a better long-term solution, and should result in one domain that allows scraping of all URLs and one domain that requires login and forbids all scraping.
Updated•5 years ago
Product: developer.mozilla.org → developer.mozilla.org Graveyard