Treeherder needs a robots.txt to prevent load from crawlers

RESOLVED FIXED

Status

P3
normal
4 years ago
4 years ago

People

(Reporter: emorley, Assigned: fubar)


We should probably add a robots.txt to stop additional load from search engines crawling treeherder URLs, since they'll be referenced all over the place.

We could add one to the UI repo root, but I guess that will only cover treeherder.m.o/ui/* unless we fiddle with the apache config and add a redirect from root?
I meant to add: search engines support JS, so when they hit the UI they do cause API load. They'll also likely stumble upon API URLs in bug comments, or worse, things like the dynamically generated Swagger docs (treeherder-dev.a.org/docs/; disabled on prod).
Blocks: 1080757
Summary: Treeherder needs a robots.txt → Treeherder needs a robots.txt to prevent load from crawlers
Component: Treeherder → Treeherder: Infrastructure
QA Contact: laura
No longer blocks: 1080757
(Assignee)

Comment 2

4 years ago
Going with the default no-bots-here robots.txt unless there's something you'd like excepted:

User-agent: *
Disallow: /

Deployed on stage and prod webheads.
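
For anyone wanting to sanity-check what those two lines do, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The helper name `is_allowed` and the sample paths are made up for illustration; only the robots.txt body comes from this bug.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt body quoted above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, not a single string.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical paths; any path is covered by "Disallow: /".
    for path in ("/", "/ui/", "/docs/"):
        print(path, is_allowed(ROBOTS_TXT, "Googlebot", path))
```

Note that this only tells well-behaved crawlers to stay away; it does nothing against clients that ignore robots.txt.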
Assignee: nobody → klibby
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
sgtm, thank you :-)

Is this currently managed via puppet? Is this something we could get checked into the repo instead? I'm a fan of having as few hidden/magical things as possible, and for most people, if it isn't in the repo, it's invisible to them :-)
(Assignee)

Comment 4

4 years ago
It is in puppet because of the apache config, which is managed by the webapp module. With the proxy happening, you'd have to have gunicorn handle it if you wanted it in the repo, I think.
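
As a rough idea of what "have gunicorn handle it" could look like, here is a sketch of a small WSGI wrapper that answers /robots.txt itself and passes everything else through. This is not Treeherder's actual code: `robots_middleware` and `placeholder_app` are hypothetical names, and the real app would be whatever WSGI callable gunicorn is already pointed at.

```python
# Same content as the file deployed via puppet.
ROBOTS_TXT = b"User-agent: *\nDisallow: /\n"


def robots_middleware(app):
    """Wrap a WSGI app so that /robots.txt is served from the repo."""

    def wrapped(environ, start_response):
        if environ.get("PATH_INFO") == "/robots.txt":
            headers = [
                ("Content-Type", "text/plain; charset=utf-8"),
                ("Content-Length", str(len(ROBOTS_TXT))),
            ]
            start_response("200 OK", headers)
            return [ROBOTS_TXT]
        # Anything else goes to the wrapped application unchanged.
        return app(environ, start_response)

    return wrapped


def placeholder_app(environ, start_response):
    """Stand-in for the real application (hypothetical)."""
    body = b"app response\n"
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body]


# gunicorn would be pointed at this callable, e.g. "module:application".
application = robots_middleware(placeholder_app)
```

The trade-off is that every robots.txt request then reaches a gunicorn worker rather than being answered by apache directly, in exchange for the file living in the repo.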
Depends on: 1118387