Bug 1830926 Opened 2 years ago Closed 2 years ago

hg.mozilla.org's robots.txt file is too permissive

Categories

(Developer Services :: Mercurial: hg.mozilla.org, enhancement)

enhancement

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: glob, Assigned: sheehan)

References

Details

Attachments

(1 file)

Bug 1234971 allowed bots to crawl hgweb so that the list of repositories would be visible to search engines.

Sadly, this has resulted in multiple incidents of bots crawling every single link under a repository, essentially downloading the repository in the most inefficient way possible and generating levels of server load that have impacted service availability.

I propose changing robots.txt to an allow list of pages that we know are safe to index, e.g.:

/
/automation
/build
/ci
...

i.e. just the endpoints that list repositories, nothing deeper.
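A robots.txt implementing the proposal above might look roughly like the following. This is a sketch, not the file that actually landed: it uses only the paths listed above (the rest of the list is elided in this bug), and it relies on the `Allow` directive, which is widely honored by major crawlers but is an extension to the original robots.txt convention.

```
# Illustrative allow-list robots.txt (paths from the proposal above;
# the full list and exact rules are not specified in this bug).
User-agent: *
Allow: /automation
Allow: /build
Allow: /ci
Disallow: /
```

Note that because robots.txt rules are prefix matches, allowing only the `/` landing page itself (and nothing beneath it) would require the `$` end-anchor (`Allow: /$`), another non-standard extension that some parsers do not support.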

Assignee: nobody → sheehan

Crawling on hg.mozilla.org has become an increasingly frequent
problem in recent memory. This commit changes the robots.txt
file to be fully restrictive, except for the main landing
page and other pages which only list the existing repositories
on hg.mozilla.org.

Pushed by cosheehan@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/92956b1d1b4c
hgwsgi: disallow crawling hg.mozilla.org r=glob
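The effect of a deny-by-default allow list like this can be checked with Python's standard-library robots.txt parser. The rules and URLs below are illustrative assumptions, not the exact file that landed in version-control-tools.

```python
# Sketch: verifying allow-list behaviour with urllib.robotparser.
# The rule set and example URLs are hypothetical.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Allow: /automation
Allow: /build
Allow: /ci
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Repository-listing endpoints remain crawlable...
print(rp.can_fetch("ExampleBot", "https://hg.mozilla.org/automation"))  # True

# ...but deep links under a repository are blocked, so a crawler can no
# longer download a repository link-by-link.
print(rp.can_fetch("ExampleBot",
                   "https://hg.mozilla.org/mozilla-central/rev/abcdef"))  # False
```

Python's parser applies the first matching rule in file order, so the `Allow` lines must precede the catch-all `Disallow: /`.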

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
