hg.mozilla.org's robots.txt file is too permissive
Categories
(Developer Services :: Mercurial: hg.mozilla.org, enhancement)
Tracking
(Not tracked)
People
(Reporter: glob, Assigned: sheehan)
References
Details
Attachments
(1 file)
Bug 1234971 allowed bots to crawl hgweb to make the list of repositories visible.
Unfortunately, this has repeatedly resulted in bots crawling every single link under a repository, effectively downloading the repository in the most inefficient way possible and generating levels of server load that have impacted service availability.
I propose changing robots.txt to an allowlist of pages that we know are safe to index.
e.g.:
/
/automation
/build
/ci
...
i.e. just the endpoints that list repositories, nothing deeper.
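As a sketch, such an allowlist-style robots.txt could look like the following (the endpoint names beyond those listed above, and the exact deployed rules, may differ; the `$` end-of-path anchor is standardized in RFC 9309 and honored by major crawlers, and keeps each rule from matching deeper paths):

```
User-agent: *
Allow: /$
Allow: /automation$
Allow: /build$
Allow: /ci$
Disallow: /
```

Under RFC 9309's longest-match semantics, the more specific Allow rules take precedence over the blanket Disallow, so the repository-listing pages remain indexable while everything under an individual repository is blocked.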
Comment 1•2 years ago
Crawling on hg.mozilla.org has become a more and more frequent
problem in recent memory. This commit changes the robots.txt
file to be fully restrictive, except for the main landing
page and other pages which only list the existing repositories
on hg.mozilla.org.
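A quick way to sanity-check rules of this shape is Python's standard-library `urllib.robotparser`. The robots.txt below is a hypothetical allowlist in the spirit of the change described above, not the actual deployed file (note that `urllib.robotparser` does not implement the RFC 9309 `$` end-anchor, so the bare `/` landing page is not handled in this sketch):

```python
from urllib import robotparser

# Hypothetical allowlist-style robots.txt; the deployed rules may differ.
# The bare landing page "/" would need the `$` end-anchor to match exactly,
# which urllib.robotparser does not support, so it is omitted here.
ROBOTS_TXT = """\
User-agent: *
Allow: /automation
Allow: /build
Allow: /ci
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Repository-listing endpoints stay crawlable...
print(rp.can_fetch("*", "/build"))
# ...while anything deeper inside a repository is blocked.
print(rp.can_fetch("*", "/mozilla-central/rev/abc"))
```

`urllib.robotparser` uses simple prefix matching with first-match-wins ordering, so the Allow rules must precede the blanket `Disallow: /` for this to behave as intended.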
Pushed by cosheehan@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/92956b1d1b4c
hgwsgi: disallow crawling hg.mozilla.org r=glob