Closed
Bug 383816
Opened 18 years ago
Closed 18 years ago
Doctor's robots.txt isn't giving the intended result
Categories
(Webtools Graveyard :: Doctor, defect)
Webtools Graveyard
Doctor
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: nelson, Assigned: reed)
References
Details
Attachments
(1 file)
345 bytes, patch (timeless: review+)
Google searches for mozilla documentation should NEVER return pages on
doctor.mozilla.org. But they do, all too often.
The search feature on www.mozilla.org does a google search on mozilla.org
servers. When searching for mozilla documentation using that search,
very often I get a page back where many of the google links point to
doctor.mozilla.org.
For example, go to http://www.mozilla.org and in the search box enter
The output SHOULD include this page:
www.mozilla.org/projects/security/pki/nss/ref/ssl/sslerr.html
But often it does not. Instead, it contains
https://doctor.mozilla.org/doctor.cgi?file=mozilla-org/html/projects/security/pki/nss/ref/ssl/sslerr.html&action=display
Because Google tries to suppress duplicate pages, it suppresses the page
on www.mozilla.org (which is the page users should see) and instead
gives the URL for doctor.mozilla.org. This is really frustrating.
Please, let's keep doctor.mozilla.org out of Google.
I suspect that a simple robots.txt file would solve the problem.
Reporter
Comment 1•18 years ago
erm, something's missing there. It was supposed to say
"in the search box, enter SEC_ERROR_BAD_KEY and click Go."
Comment 2•18 years ago
It has one, added by bug 359129; it's just almost completely broken. Since /doctor.cgi is the root of *every* Doctor URL,
User-agent: *
Allow: /doctor.cgi
Disallow: /
works out to allowing everything, as far as Google is concerned: if you allow them a file, you allow them any query string for that file.
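The effect described here can be reproduced with Python's stdlib `urllib.robotparser`. Python's parser follows the classic first-match rules rather than Google's exact longest-path algorithm, but for this particular file it reaches the same conclusion: every query string on /doctor.cgi slips through the Disallow.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt that bug 359129 added: allow the CGI, disallow everything else.
ROBOTS_TXT = """\
User-agent: *
Allow: /doctor.cgi
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())
rp.modified()  # mark the rules as loaded so can_fetch() trusts them

# Any query string on /doctor.cgi is still allowed, defeating the Disallow.
print(rp.can_fetch("*", "https://doctor.mozilla.org/doctor.cgi?file=foo"))  # True
# Everything not under /doctor.cgi is blocked.
print(rp.can_fetch("*", "https://doctor.mozilla.org/other.html"))  # False
```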
It's actually possible to block them from query strings, *if* we're only interested in Google: they treat
Disallow: /*?
as blocking any URL with a query string. However, they can still choose to include it, with no summary, if their algorithm decides for a particular query that the doctor URL is a better result (based on factors from pages linking to the doctor URL).
If the goal is to not have doctor URLs appear in Google results, rather than to not have Google crawl doctor, then robots.txt is the wrong tool: rather than keeping it out, we need to let it in, and include a noindex meta tag in every result that isn't the front page.
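A minimal sketch of the meta-tag approach suggested above (the function names here are illustrative, not Doctor's actual code): let Google crawl every page, but mark every page except the front page as not-to-be-indexed.

```python
# Sketch: emit a robots meta tag on every Doctor page except the front
# page, so Google can crawl the pages but will not index them.
# (Helper names and signatures are hypothetical.)

def robots_meta(is_front_page):
    """Return the <head> fragment controlling indexing for one page."""
    if is_front_page:
        return ""  # front page stays indexable
    return '<meta name="robots" content="noindex">'

def render_head(title, is_front_page=False):
    """Build a minimal <head> for a generated page."""
    return "<head><title>%s</title>%s</head>" % (title, robots_meta(is_front_page))

print(render_head("editing foo/bar"))             # includes the noindex tag
print(render_head("Doctor", is_front_page=True))  # no noindex tag
```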
Assignee: server-ops → myk
Blocks: 359129
Component: Server Operations → Doctor
Product: mozilla.org → Webtools
QA Contact: justin → doctor
Summary: put robots.txt on https://doctor.mozilla.org → Doctor's robots.txt isn't giving the intended result
Comment 3•18 years ago
Why not have:
Disallow: /doctor.cgi
Comment 4•18 years ago
Since doctor.cgi is the index page, amusingly enough that works - Disallow: /doctor.cgi will actually allow doctor.cgi, as a request for /. However, it's very fragile - one link to https://doctor.mozilla.org/?file=foo and you're sunk, and it's crawled.
It's also not guaranteed to do what you want: if enough pages it trusts say that /doctor.cgi?file=foo/bar is the best page evar about "foo bar", then even though it hasn't crawled it, Google will still include it in results for that search.
If doctor's robots.txt was there to avoid the database hit of crawling, I'd say the thing to do is add an index.cgi that only has the content you want indexed and a form that submits to doctor.cgi, and only Allow: /index.cgi. But as the tail end of http://www.google.com/support/webmasters/bin/answer.py?answer=35303 says, if your goal is to not have the URLs appear in results, you instead have to let Google crawl them, so that it can see that it's not supposed to index them.
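The fragility described in comment 4 can also be checked with Python's stdlib `urllib.robotparser` (a sketch; Python's prefix-match behavior agrees with the crawl-time behavior described here):

```python
from urllib.robotparser import RobotFileParser

# The one-line rule suggested in comment 3.
rp2 = RobotFileParser()
rp2.parse("""\
User-agent: *
Disallow: /doctor.cgi
""".splitlines())
rp2.modified()  # mark the rules as loaded so can_fetch() trusts them

# The index, requested as "/", is still crawlable...
print(rp2.can_fetch("*", "https://doctor.mozilla.org/"))                       # True
# ...and /doctor.cgi with any query string is blocked...
print(rp2.can_fetch("*", "https://doctor.mozilla.org/doctor.cgi?file=foo"))    # False
# ...but the loophole: the same query served from "/" is NOT blocked.
print(rp2.can_fetch("*", "https://doctor.mozilla.org/?file=foo"))              # True
```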
Comment 5•18 years ago
I signed up doctor.mozilla.org for the Google Webmaster Tools, and used their site preferences to request removal of the entire site. I also changed the robots.txt file to Disallow: /
Comment 6•18 years ago
But that's only for our local copy; it does nothing for the copy in CVS (in case other people are using it).
Assignee
Comment 7•18 years ago
Remove doctor.cgi as an allowed file (already in production).
Attachment #276312 - Flags: review?(timeless) → review+
Assignee
Comment 8•18 years ago
Checking in robots.txt;
/cvsroot/mozilla/webtools/doctor/robots.txt,v <-- robots.txt
new revision: 1.2; previous revision: 1.1
done
Status: ASSIGNED → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Updated•9 years ago
Product: Webtools → Webtools Graveyard