Closed Bug 383816 Opened 17 years ago Closed 17 years ago

Doctor's robots.txt isn't giving the intended result

Categories

(Webtools Graveyard :: Doctor, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nelson, Assigned: reed)

References

()

Details

Attachments

(1 file)

Google searches for Mozilla documentation should NEVER return pages on
doctor.mozilla.org, but they do, all too often.

The search feature on www.mozilla.org does a Google search on mozilla.org
servers.  When searching for Mozilla documentation with that search, I
very often get a results page where many of the Google links point to
doctor.mozilla.org.

For example, go to http://www.mozilla.org and in the search box enter

The output SHOULD include this page:
www.mozilla.org/projects/security/pki/nss/ref/ssl/sslerr.html

But often it does not.  Instead, it contains 
https://doctor.mozilla.org/doctor.cgi?file=mozilla-org/html/projects/security/pki/nss/ref/ssl/sslerr.html&action=display

Because Google tries to suppress duplicate pages, it suppresses the page
on www.mozilla.org (which is the page users should see) and instead
gives the URL for doctor.mozilla.org.  This is really frustrating.

Please, let's keep doctor.mozilla.org out of Google.
I suspect that a simple robots.txt file would solve the problem.
Erm, something's missing there.  It was supposed to say
"in the search box, enter SEC_ERROR_BAD_KEY and click Go."
It has one, added by bug 359129; it's just almost completely broken. Since /doctor.cgi is the root of *every* doctor URL,

User-agent: *
Allow: /doctor.cgi
Disallow: /

works out to allowing everything, as far as Google is concerned - if you allow them a file, you allow them any query string for that file.
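For instance (going by Google's documented longest-match rule; other crawlers may differ), a URL like

https://doctor.mozilla.org/doctor.cgi?file=mozilla-org/html/projects/security/pki/nss/ref/ssl/sslerr.html&action=display

matches Allow: /doctor.cgi, and that rule is longer and therefore wins over Disallow: /, so every page Doctor serves ends up crawlable.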

It's actually possible to block them from query strings, *if* we're only interested in Google: they treat

Disallow: /*?

as blocking any URL with a query string. However, they can still choose to include it, with no summary, if their algorithm decides for a particular query that the doctor URL is a better result (based on factors from pages linking to the doctor URL).
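For example, with

User-agent: *
Disallow: /*?

Google would skip crawling https://doctor.mozilla.org/doctor.cgi?file=foo (and any other URL carrying a ?), while https://doctor.mozilla.org/ itself would still be fetched. That's a sketch of Google's wildcard extension only; plain robots.txt has no * or ? matching.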

If the goal is to not have doctor URLs appear in Google results, rather than to not have Google crawl doctor, then robots.txt is the wrong tool: rather than keeping it out, we need to let it in, and include a noindex meta tag in every result that isn't the front page.
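Concretely, that would mean opening robots.txt back up, e.g.

User-agent: *
Disallow:

and having doctor.cgi emit, on every page except the front page, something like

<meta name="robots" content="noindex">

in the <head> (a sketch using the standard robots meta tag; exactly where Doctor's output would need to change isn't spelled out here).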
Assignee: server-ops → myk
Blocks: 359129
Component: Server Operations → Doctor
Product: mozilla.org → Webtools
QA Contact: justin → doctor
Summary: put robots.txt on https://doctor.mozilla.org → Doctor's robots.txt isn't giving the intended result
Why not have:

Disallow: /doctor.cgi
Since doctor.cgi is the index page, amusingly enough that works - Disallow: /doctor.cgi will actually allow doctor.cgi, as a request for /. However, it's very fragile - one link to https://doctor.mozilla.org/?file=foo and you're sunk, and it's crawled.

It's also not guaranteed to do what you want: if enough pages Google trusts say that /doctor.cgi?file=foo/bar is the best page evar about "foo bar", then even though it hasn't crawled it, Google will still include it in results for that search.

If doctor's robots.txt was there to avoid the database hit of crawling, I'd say the thing to do is add an index.cgi that only has the content you want indexed and a form that submits to doctor.cgi, and only Allow: /index.cgi. But as the tail end of http://www.google.com/support/webmasters/bin/answer.py?answer=35303 says, if your goal is to not have the URLs appear in results, you instead have to let Google crawl them, so that it can see that it's not supposed to index them.
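That crawl-load-only variant would be something like

User-agent: *
Allow: /index.cgi
Disallow: /

(index.cgi being the hypothetical front page described above), but again, it only stops Google fetching doctor.cgi pages; it doesn't keep the URLs out of the results.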
I signed up doctor.mozilla.org for the Google Webmaster Tools, and used their site preferences to request removal of the entire site.  I also changed the robots.txt file to Disallow: /
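So, keeping the existing User-agent line, the live file is now essentially just

User-agent: *
Disallow: /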
But that's only for our local copy; it does nothing for the one in CVS (in case other people are using it).
Attached patch: patch - v1
Remove doctor.cgi as an allowed file (already in production).
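Assuming the file in CVS still matches what's quoted above, the change amounts to

 User-agent: *
-Allow: /doctor.cgi
 Disallow: /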
Assignee: myk → reed
Status: NEW → ASSIGNED
Attachment #276312 - Flags: review?(timeless)
Attachment #276312 - Flags: review?(timeless) → review+
Checking in robots.txt;
/cvsroot/mozilla/webtools/doctor/robots.txt,v  <--  robots.txt
new revision: 1.2; previous revision: 1.1
done
Status: ASSIGNED → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Product: Webtools → Webtools Graveyard