Bug 690447 (Closed) - Opened 13 years ago - Closed 5 years ago

[SEO] college.stage.mozilla.com is being indexed, is second result on google

Categories: Websites :: other.mozilla.org, defect
Hardware/OS: x86, macOS
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED INVALID
People: Reporter: cbeasley; Assignee: Unassigned

Details

Since we don't want to lose Google ranking, please add a rel=canonical pointing to the prod server until Google reindexes. That'll take ~1 week. Then add the proper robots.txt to disable indexing, as it should have been set up originally.

Screenshot of the Google ranking:
http://cl.ly/0h2a2z2q2n0R3F2d3y3t
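
For reference, the canonical hint is a one-liner in the <head> of each staging page; the href below is only a placeholder, since I'm not sure of the exact prod URL:

<!-- placeholder href; point each staging page at its prod equivalent -->
<link rel="canonical" href="http://www.example.com/the-prod-page/">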
Anthony - Is this a dup? I know you were doing work in this area.
Assignee: nobody → anthony
Summary: stage.college.mozilla.com is being indexed, is second result on google → [SEO] stage.college.mozilla.com is being indexed, is second result on google
I don't know where the code for this is located. It's not in the mozilla.org/.com codebase. We need to find out who works on this.
When we find that out, please add the info to

https://wiki.mozilla.org/Webdev:WhoWorksOnWhat
Hey Julie or Morgamic - Who is the webdev on this microsite? Anthony is looking for the codebase, and this site does not appear to have a technical owner.
Assignee: anthony → nobody
Component: www.mozilla.org/firefox → other.mozilla.org
QA Contact: www-mozilla-com → other-mozilla-org
Matt (mbasta) has been working on this. The code is hosted on his GitHub account, but we're not quite ready to make it public... 
https://github.com/mattbasta/moz-college-recruiting
CCing Jason and Jake from Webops: Is there a way to serve /robots.txt on all (new, or ideally even existing) staging sites via something in the Apache config? I know we can set up an Alias, but I'm unsure whether that works only for directories or also for individual files.

If it works for files, we should drop a default, deny-all robots.txt file somewhere and for all staging sites just hook up /robots.txt to that file. I can't think of a single instance of a staging site where we'd want it to be indexed by robots.
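
It does work for files, AFAIK - Alias maps URL-paths to filesystem paths, individual files included. A sketch of the hookup, with made-up paths and Apache 2.2-style access control:

# Shared deny-all robots.txt, aliased into each staging site's config.
Alias /robots.txt /data/www/shared/robots-staging.txt
<Directory /data/www/shared>
    Order allow,deny
    Allow from all
</Directory>

where /data/www/shared/robots-staging.txt contains the usual deny-all rules (User-agent: * / Disallow: /).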
Summary: [SEO] stage.college.mozilla.com is being indexed, is second result on google → [SEO] college.stage.mozilla.com is being indexed, is second result on google
Especially in this specific case, since we decided not to use the design or the content. Honestly, I had no idea it was actually indexed until Crystal pointed it out (thanks, Crystal!).
Nothing convenient or automatic comes to mind. Dev and stage sites are scattered across many different servers/clusters, and the Apache configs aren't especially unified in Puppet. A solution in one place doesn't necessarily apply elsewhere.

I can think of a few ways to do this, but they all ultimately come down to "it's policy to put this robots.txt file in place"... either by manually copying it into place, with an Apache Alias in every VirtualHost, or via some other mechanism. Another possibility would be to have Zeus (the load balancers) handle robots.txt directly... possible, but likely still a manual, per-site process.

One concern I have is that I'm a little leery of IT taking on a task like this. It feels rather close to IT managing the actual content of a site, and that makes me nervous. I suppose in the grand scheme it's not much different from managing a settings_local.py file, though.

I'm also a bit worried about how consistently we could apply it... it seems likely that if it's something that has to be done individually for each dev/stage site, it's bound to get forgotten sometimes.

There's also a small concern that during some types of dev->stage or stage->prod pushes, in some configurations, this might accidentally get pushed to prod. That would be bad, and whatever we come up with should be sure to account for this.

I don't have a solution for this, either on our side or webdev's. I'll bring in some more folks, and see if we can come up with something.
(In reply to Jake Maul [:jakem] from comment #8)
> One concern I have is that I'm a little leery of IT taking on a task like
> this. It feels rather close to IT managing the actual content of a site, and
> that makes me nervous. I suppose in the grand scheme it's not much different
> from managing a settings_local.py file, though.

Given that we pull from a single repository regardless of dev, stage, or prod, we would have to be the point of implementation here.

Can we hack this up within the dev and stage Apache confs themselves - redirect robots.txt for only those two sites to a generic robots.txt somewhere? That would avoid needing to dump the file into the tree itself.
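
Something like this in each dev/stage vhost, presumably (hostname and paths are illustrative):

<VirtualHost *:80>
    ServerName college.stage.mozilla.com
    DocumentRoot /data/www/college.stage.mozilla.com
    # Only the dev/stage confs carry this line, so the code tree stays
    # clean and prod never inherits a deny-all robots.txt.
    Alias /robots.txt /data/www/shared/robots-staging.txt
</VirtualHost>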
Wil Clouser informs me that AMO has a setting called ENGAGE_ROBOTS, and it appears that there are efforts to make this a standard part of either playdoh or funfactory, which would make it somewhat easier to enable this at the app level anywhere it's needed.

https://github.com/mozilla/playdoh/issues/53


Of course, this doesn't help for non-Django sites... but since most new sites are coming up that way (or are pre-packaged third-party apps), that might not be an issue. Anything else could potentially be handled as a one-off.

It would also be worthwhile to ask the SUMO and Input teams how they handle this. I'd much rather have a common solution than a bunch of different ones.
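
For anyone unfamiliar, the app-level hookup is roughly this - a sketch in 2011-era Django, not AMO's actual code, with illustrative names apart from the ENGAGE_ROBOTS setting itself:

# settings.py -- flipped to True only in prod's settings_local.py
ENGAGE_ROBOTS = False

# views.py
from django.conf import settings
from django.http import HttpResponse

def robots_txt(request):
    # Deny everything unless this deployment explicitly engages robots.
    if settings.ENGAGE_ROBOTS:
        body = "User-agent: *\nDisallow:\n"    # empty Disallow = allow all
    else:
        body = "User-agent: *\nDisallow: /\n"  # deny-all for dev/stage
    return HttpResponse(body, content_type="text/plain")

# urls.py would then map r'^robots\.txt$' to that view.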
SUMO does the exact same thing as AMO/playdoh, and uses an ENGAGE_ROBOTS setting. +1 to getting that into playdoh.
(In reply to Corey Shields [:cshields] from comment #9)
> Can we hack this up within the dev and stage apache confs themselves -
> redirect robots.txt for only those 2 sites to a generic robots.txt
> somewhere?  This would avoid needing to dump the file into the tree itself.

If that's easily doable, I suggest we start doing it for these sites and keep doing it when setting up new ones. I'm less concerned about old sites; those we can catch one by one.

(In reply to James Socol [:jsocol, :james] from comment #11)
> SUMO does the exact same thing as AMO/playdoh, and uses an ENGAGE_ROBOTS
> setting. +1 to getting that into playdoh.

Seems to me like it already was, but I guess not. For non-Django sites that doesn't work, though, as Jake says. Is college.stage a Django site?
(In reply to Fred Wenzel [:wenzel] from comment #12)
> Is college.stage a Django site?

It is not. It's PHP. There's no robots.txt in the repo, but someone already added one on the server. I think there's a format issue, though. AFAIK, the minimal deny-all robots.txt is:

User-agent: *
Disallow: /

Right now the / is missing, and AFAIK an empty Disallow: line means "allow everything", which is the opposite of what we want here. I might be wrong, though.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → INVALID