Bug 690447 (Closed) - Opened 13 years ago - Closed 5 years ago

[SEO] college.stage.mozilla.com is being indexed, is second result on google

Categories: Websites :: other.mozilla.org, defect
Hardware/OS: x86, macOS
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED INVALID
People: Reporter: cbeasley; Assignee: Unassigned

Details

Since we don't want to lose Google ranking, please add a rel=canonical pointing to the prod server until Google reindexes. That'll take ~1 week. Then add the proper robots.txt to disable indexing, as it should have been set up originally.

Screenshot of the Google ranking:
http://cl.ly/0h2a2z2q2n0R3F2d3y3t
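
For reference, the canonical hint is a one-liner in the <head> of each staging page; the href below is only a placeholder, since I'm not sure of the exact prod URL:

<!-- placeholder href; point each staging page at its prod equivalent -->
<link rel="canonical" href="http://www.example.com/the-prod-page/">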
Anthony - Is this a dup? I know you were doing work in this area.
Assignee: nobody → anthony
Summary: stage.college.mozilla.com is being indexed, is second result on google → [SEO] stage.college.mozilla.com is being indexed, is second result on google
I don't know where the code for this is located. It's not in the mozilla.org/.com codebase. We need to find out who works on this.
When we find that out, please add the info to

https://wiki.mozilla.org/Webdev:WhoWorksOnWhat
Hey Julie or Morgamic - Who is the webdev on this microsite? Anthony is looking for the codebase, and this site does not appear to have a technical owner.
Assignee: anthony → nobody
Component: www.mozilla.org/firefox → other.mozilla.org
QA Contact: www-mozilla-com → other-mozilla-org
Matt (mbasta) has been working on this. The code is hosted on his GitHub account, but we're not quite ready to make it public... 
https://github.com/mattbasta/moz-college-recruiting
CCing Jason and Jake from Webops: Is there a way to serve /robots.txt on all (new, or ideally even existing) staging sites via something in the Apache config? I know we can set up an Alias, but I'm unsure whether that works only for directories or also for individual files.

If it works for files, we should drop a default, deny-all robots.txt file somewhere and for all staging sites just hook up /robots.txt to that file. I can't think of a single instance of a staging site where we'd want it to be indexed by robots.
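
It does work for files, AFAIK - Alias maps URL-paths to filesystem paths, individual files included. A sketch of the hookup, with made-up paths and Apache 2.2-style access control:

# Shared deny-all robots.txt, aliased into each staging site's config.
Alias /robots.txt /data/www/shared/robots-staging.txt
<Directory /data/www/shared>
    Order allow,deny
    Allow from all
</Directory>

where /data/www/shared/robots-staging.txt contains the usual deny-all rules (User-agent: * / Disallow: /).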
Summary: [SEO] stage.college.mozilla.com is being indexed, is second result on google → [SEO] college.stage.mozilla.com is being indexed, is second result on google
Especially in this specific case, since we decided not to use the design or the content. Honestly, I had no idea it was actually indexed until Crystal pointed it out (thanks, Crystal!).
Nothing convenient or automatic comes to mind. Dev and stage sites are scattered across many different servers/clusters, and the Apache configs aren't especially unified in Puppet. A solution in one place doesn't necessarily apply elsewhere.

I can think of a few ways to do this, but they all ultimately come down to "it's policy to put this robots.txt file in place"... either by manually copying it into place, with an Apache Alias in every VirtualHost, or via some other mechanism. Another possibility would be to have Zeus (the load balancers) handle robots.txt directly... possible, but likely still a manual, per-site process.

One concern I have is that I'm a little leery of IT taking on a task like this. It feels rather close to IT managing the actual content of a site, and that makes me nervous. I suppose in the grand scheme it's not much different from managing a settings_local.py file, though.

I'm also a bit worried about how consistently we could apply it... it seems likely that if it's something that has to be done individually for each dev/stage site, it's bound to get forgotten sometimes.

There's also a small concern that during some types of dev->stage or stage->prod pushes, in some configurations, this might accidentally get pushed to prod. That would be bad, and whatever we come up with should be sure to account for this.

I don't have a solution for this, either on our side or webdev's. I'll bring in some more folks, and see if we can come up with something.
(In reply to Jake Maul [:jakem] from comment #8)
> One concern I have is that I'm a little leery of IT taking on a task like
> this. It feels rather close to IT managing the actual content of a site, and
> that makes me nervous. I suppose in the grand scheme it's not much different
> from managing a settings_local.py file, though.

Given that we pull from a single repository regardless of dev, stage, or prod, we would have to be the point of implementation here.

Can we hack this up within the dev and stage Apache confs themselves - redirect robots.txt for only those two sites to a generic robots.txt somewhere? That would avoid needing to dump the file into the tree itself.
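
Something like this in each dev/stage vhost, presumably (hostname and paths are illustrative):

<VirtualHost *:80>
    ServerName college.stage.mozilla.com
    DocumentRoot /data/www/college.stage.mozilla.com
    # Only the dev/stage confs carry this line, so the code tree stays
    # clean and prod never inherits a deny-all robots.txt.
    Alias /robots.txt /data/www/shared/robots-staging.txt
</VirtualHost>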
Wil Clouser informs me that AMO has a setting called ENGAGE_ROBOTS, and it appears that there are efforts to make this a standard part of either playdoh or funfactory, which would make it somewhat easier to enable this at the app level anywhere it's needed.

https://github.com/mozilla/playdoh/issues/53


Of course, this doesn't help for non-Django sites... but since most new sites are coming up that way (or are pre-packaged third-party apps), that might not be an issue. Anything else could potentially be handled as a one-off.

It would also be worthwhile to ask the SUMO and Input teams how they handle this. I'd much rather have a common solution than a bunch of different ones.
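
For anyone unfamiliar, the app-level hookup is roughly this - a sketch in 2011-era Django, not AMO's actual code, with illustrative names apart from the ENGAGE_ROBOTS setting itself:

# settings.py -- flipped to True only in prod's settings_local.py
ENGAGE_ROBOTS = False

# views.py
from django.conf import settings
from django.http import HttpResponse

def robots_txt(request):
    # Deny everything unless this deployment explicitly engages robots.
    if settings.ENGAGE_ROBOTS:
        body = "User-agent: *\nDisallow:\n"    # empty Disallow = allow all
    else:
        body = "User-agent: *\nDisallow: /\n"  # deny-all for dev/stage
    return HttpResponse(body, content_type="text/plain")

# urls.py would then map r'^robots\.txt$' to that view.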
SUMO does the exact same thing as AMO/playdoh, and uses an ENGAGE_ROBOTS setting. +1 to getting that into playdoh.
(In reply to Corey Shields [:cshields] from comment #9)
> Can we hack this up within the dev and stage apache confs themselves -
> redirect robots.txt for only those 2 sites to a generic robots.txt
> somewhere?  This would avoid needing to dump the file into the tree itself.

If that's easily doable, I suggest we start doing it for these sites and keep doing it when setting up new ones. I'm less concerned about old sites; those we can catch one by one.

(In reply to James Socol [:jsocol, :james] from comment #11)
> SUMO does the exact same thing as AMO/playdoh, and uses an ENGAGE_ROBOTS
> setting. +1 to getting that into playdoh.

Seems to me like it already was, but I guess not. For non-Django sites that doesn't work, though, as Jake says. Is college.stage a Django site?
(In reply to Fred Wenzel [:wenzel] from comment #12)
> Is college.stage a Django site?

It is not. It's PHP. There's no robots.txt in the repo, but someone already added one on the server. I think there's a format issue, though. AFAIK, the minimal deny-all robots.txt is:

User-agent: *
Disallow: /

Right now the / is missing, and AFAIK an empty Disallow: line means "allow everything", which is the opposite of what we want here. I might be wrong, though.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → INVALID