Google searches for mozilla documentation should NEVER return pages on doctor.mozilla.org. But they do, all too often.

The search feature on www.mozilla.org does a google search on mozilla.org servers. When searching for mozilla documentation using that search, very often I get a page back where many of the google links point to doctor.mozilla.org.

For example, go to http://www.mozilla.org and in the search box enter

The output SHOULD include this page:
www.mozilla.org/projects/security/pki/nss/ref/ssl/sslerr.html
But often it does not. Instead, it contains
https://doctor.mozilla.org/doctor.cgi?file=mozilla-org/html/projects/security/pki/nss/ref/ssl/sslerr.html&action=display

Because google tries to suppress duplicate pages, it suppressed the page on www.mozilla.org (which is the page users should see), and instead it gives the url for doctor.mozilla.org.

This is really frustrating. Please, let's keep doctor.mozilla.org out of google. I suspect that a simple robots.txt file would solve the problem.
erm, something's missing there. It was supposed to say "in the search box, enter SEC_ERROR_BAD_KEY and click Go."
It has one, added by bug 359129, it's just almost completely broken. Since /doctor.cgi is the root of *every* doctor URL,

    User-agent: *
    Allow: /doctor.cgi
    Disallow: /

works out to allowing everything, as far as Google is concerned - if you allow them a file, you allow them any query string for that file.

It's actually possible to block them from query strings, *if* we're only interested in Google: they treat

    Disallow: /*?

as blocking any URL with a query string. However, they can still choose to include it, with no summary, if their algo decides for a particular query that the doctor URL is a better result (based on factors from pages linking to the doctor URL).

If the goal is to not have doctor URLs appear in Google results, rather than to not have Google crawl doctor, then robots.txt is the wrong tool: rather than keeping it out, we need to let it in, and include a noindex meta tag in every result that isn't the front page.
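To make the matching argument above concrete, here is a rough Python sketch of prefix and wildcard matching for robots.txt rules. It is a simplified model, not Google's implementation: the helper names (`rule_matches`, `is_allowed`) are made up for illustration, and the precedence rule (longest matching pattern wins, Allow wins ties) is an assumption based on how Google documents its handling today.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches path (including query string)."""
    # '*' matches any run of characters; a trailing '$' anchors the end of the URL.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # Patterns are matched from the start of the path, i.e. as prefixes.
    return re.match(regex, path) is not None

def is_allowed(rules, path: str) -> bool:
    """rules is a list of (directive, pattern) pairs for one user-agent group."""
    best = ("allow", "")  # no matching rule means the path may be crawled
    for directive, pattern in rules:
        if pattern and rule_matches(pattern, path):
            longer = len(pattern) > len(best[1])
            tie_for_allow = len(pattern) == len(best[1]) and directive == "allow"
            if longer or tie_for_allow:
                best = (directive, pattern)
    return best[0] == "allow"

# The rules currently in doctor's robots.txt, and the query-string block.
CURRENT_RULES = [("allow", "/doctor.cgi"), ("disallow", "/")]
QUERY_RULES = [("disallow", "/*?")]

page = ("/doctor.cgi?file=mozilla-org/html/projects/security/pki/nss/"
        "ref/ssl/sslerr.html&action=display")

print(is_allowed(CURRENT_RULES, page))         # True: the Allow prefix covers every query string
print(is_allowed(QUERY_RULES, page))           # False: any URL containing '?' is blocked
print(is_allowed(QUERY_RULES, "/doctor.cgi"))  # True: no query string, so no rule matches
```

Note that this only models crawling. As said above, a URL that is blocked from crawling can still show up in results without a summary.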
Why not just have: Disallow: /doctor.cgi
Since doctor.cgi is the index page, amusingly enough that works - Disallow: /doctor.cgi will actually allow doctor.cgi, as a request for /. However, it's very fragile - one link to https://doctor.mozilla.org/?file=foo and you're sunk, and it's crawled. It's also not guaranteed to do what you want: if enough pages it trusts say that /doctor.cgi?file=foo/bar is the best page evar about "foo bar" then even though it hasn't crawled it, Google will still include it in results for that search.

If doctor's robots.txt was there to avoid the database hit of crawling, I'd say the thing to do is add an index.cgi that only has the content you want indexed and a form that submits to doctor.cgi, and only Allow: /index.cgi, but as the tail end of http://www.google.com/support/webmasters/bin/answer.py?answer=35303 says, if your goal is to not have the URLs appear in results, you instead have to let Google crawl them, so that it can see that it's not supposed to index them.
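The fragility described above can be checked with Python's standard `urllib.robotparser`. This is only a sketch: that module does plain prefix matching and is not Google's crawler, the `can_crawl` helper and the `Googlebot` agent string are chosen for illustration, and the two-line robots.txt is the one proposed in the previous comment.

```python
from urllib.robotparser import RobotFileParser

def can_crawl(robots_lines, url, agent="Googlebot"):
    """Parse robots.txt lines and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)

# The robots.txt suggested in the previous comment.
PROPOSED = ["User-agent: *", "Disallow: /doctor.cgi"]
BASE = "https://doctor.mozilla.org"

print(can_crawl(PROPOSED, BASE + "/"))                     # True: front page reachable as /
print(can_crawl(PROPOSED, BASE + "/doctor.cgi?file=foo"))  # False: blocked by the prefix
print(can_crawl(PROPOSED, BASE + "/?file=foo"))            # True: same page via /, not blocked
```

The last line is the "one link and you're sunk" case: the rule only covers URLs that spell out /doctor.cgi.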
I signed up doctor.mozilla.org for the Google Webmaster Tools, and used their site preferences to request removal of the entire site. I also changed the robots.txt file to Disallow: /
But that's only for our local copy; it does nothing for the one in CVS (in case other people are using it).
Created attachment 276312
patch - v1

Remove doctor.cgi as an allowed file (already in production).
Assignee: myk → reed
Status: NEW → ASSIGNED
Attachment #276312 - Flags: review?(timeless)
Checking in robots.txt;
/cvsroot/mozilla/webtools/doctor/robots.txt,v  <--  robots.txt
new revision: 1.2; previous revision: 1.1
done
Status: ASSIGNED → RESOLVED
Last Resolved: 11 years ago
Resolution: --- → FIXED