Closed Bug 383816 Opened 17 years ago Closed 17 years ago

Doctor's robots.txt isn't giving the intended result

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: nelson, Assigned: reed)

Attachments

(1 file)

patch - v1, 17 years ago, Reed Loden [:reed], 345 bytes, patch, timeless: review+

Nelson Bolyard (seldom reads bugmail)

Reporter

Description

17 years ago

Google searches for mozilla documentation should NEVER return pages on 
doctor.mozilla.org.  But they do, all too often.  

The search feature on www.mozilla.org does a google search on mozilla.org 
servers.  When searching for mozilla documentation using that search, 
very often I get a page back where many of the google links point to 
doctor.mozilla.org.  

For example, go to http://www.mozilla.org and in the search box enter

The output SHOULD include this page:
www.mozilla.org/projects/security/pki/nss/ref/ssl/sslerr.html

But often it does not.  Instead, it contains 
https://doctor.mozilla.org/doctor.cgi?file=mozilla-org/html/projects/security/pki/nss/ref/ssl/sslerr.html&action=display

Because Google tries to suppress duplicate pages, it suppresses the page
on www.mozilla.org (which is the page users should see), and instead it 
gives the URL for doctor.mozilla.org.  This is really frustrating.

Please, let's keep doctor.mozilla.org out of google.
I suspect that a simple robots.txt file would solve the problem.

Nelson Bolyard (seldom reads bugmail)

Reporter

Comment 1

17 years ago

erm, something's missing there.  It was supposed to say
"in the search box, enter SEC_ERROR_BAD_KEY and click Go."

Phil Ringnalda (:philor)

Comment 2

17 years ago

It has one, added by bug 359129, it's just almost completely broken. Since /doctor.cgi is the root of *every* doctor URL, 

User-agent: *
Allow: /doctor.cgi
Disallow: /

works out to allowing everything, as far as Google is concerned - if you allow them a file, you allow them any query string for that file.
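To see why the Allow line wins, here is a minimal sketch of longest-prefix rule matching, which is how Google documents its handling of conflicting Allow/Disallow rules. The `is_allowed` helper and the `/some-other-page.html` path are hypothetical, written only for this illustration; this is not Google's code, and other crawlers may resolve conflicts differently.

```python
def is_allowed(path, rules):
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of (directive, prefix) pairs. The longest matching
    prefix wins, a tie goes to Allow, and no matching rule means allowed.
    Simplified on purpose: no wildcards, no per-agent groups.
    """
    best_len, allowed = -1, True
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            is_allow = directive.lower() == "allow"
            if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
                best_len, allowed = len(prefix), is_allow
    return allowed


# The two rules quoted above.
RULES = [("Allow", "/doctor.cgi"), ("Disallow", "/")]

for path in (
    "/doctor.cgi",
    "/doctor.cgi?file=foo&action=display",
    "/some-other-page.html",  # hypothetical path, for contrast
):
    print(path, "->", "allowed" if is_allowed(path, RULES) else "blocked")
```

Every URL that starts with /doctor.cgi, query string included, matches the longer Allow prefix, so only paths outside doctor.cgi end up blocked.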

It's actually possible to block them from query strings, *if* we're only interested in Google: they treat

Disallow: /*?

as blocking any URL with a query string. However, they can still choose to include it, with no summary, if their algo decides for a particular query that the doctor URL is a better result (based on factors from pages linking to the doctor URL).
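As a rough illustration of that wildcard rule, the sketch below translates a robots.txt path pattern into a regular expression, assuming `*` matches any run of characters and a trailing `$` anchors the end of the URL. The helper names `pattern_to_regex` and `is_blocked` are made up for this example, and the sketch only covers matching, not the indexing behavior described above.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern with '*' and trailing '$'.

    '*' matches any run of characters; a trailing '$' anchors the end.
    Everything else is literal and is matched from the start of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path, disallow_pattern):
    """Return True if `path` matches the given Disallow pattern."""
    return pattern_to_regex(disallow_pattern).match(path) is not None


for path in ("/doctor.cgi", "/doctor.cgi?file=foo", "/?file=foo"):
    print(path, "->", "blocked" if is_blocked(path, "/*?") else "not matched")
```

Under this reading, `/*?` matches any path that contains a literal `?`, which is why it catches query strings on both /doctor.cgi and the bare / URL.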

If the goal is to not have doctor URLs appear in Google results, rather than to not have Google crawl doctor, then robots.txt is the wrong tool: rather than keeping it out, we need to let it in, and include a noindex meta tag in every result that isn't the front page.
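A sketch of that idea, in Python purely for illustration (this is not Doctor's actual code): decide per request whether the page head should carry a noindex tag. It assumes the front page is the request with no query parameters; the function name `robots_meta_for` is hypothetical.

```python
from urllib.parse import parse_qs

# Standard robots meta tag asking search engines not to index the page.
NOINDEX_TAG = '<meta name="robots" content="noindex">'


def robots_meta_for(query_string):
    """Return the tag to place in <head>, or '' for the front page.

    Assumption: a request with no query parameters is the front page,
    and every other request is a result page that should not be indexed.
    """
    params = parse_qs(query_string)
    return NOINDEX_TAG if params else ""


print(repr(robots_meta_for("")))
print(repr(robots_meta_for("file=mozilla-org/html/index.html&action=display")))
```

For this to work the crawler must be allowed to fetch the page, since it can only obey a tag it has actually read.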

Assignee: server-ops → myk

Blocks: 359129

URL: https://doctor.mozilla.org/robots.txt

Component: Server Operations → Doctor

Product: mozilla.org → Webtools

QA Contact: justin → doctor

Summary: put robots.txt on https://doctor.mozilla.org → Doctor's robots.txt isn't giving the intended result

Frédéric Buclin

Comment 3

17 years ago

Why not have:

Disallow: /doctor.cgi

Phil Ringnalda (:philor)

Comment 4

17 years ago

Since doctor.cgi is the index page, amusingly enough that works - Disallow: /doctor.cgi will actually allow doctor.cgi, as a request for /. However, it's very fragile - one link to https://doctor.mozilla.org/?file=foo and you're sunk, and it's crawled.
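The fragility is easy to reproduce with Python's standard `urllib.robotparser`. This is only a sketch: that parser applies the first matching rule rather than Google's logic, but with a single Disallow line the outcome is the same. The `User-agent: *` line and the `file=foo` URLs are assumptions for the example.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: the single rule suggested in comment 3.
parser = RobotFileParser()
parser.parse("""\
User-agent: *
Disallow: /doctor.cgi
""".splitlines())

BASE = "https://doctor.mozilla.org"

for path in ("/", "/doctor.cgi?file=foo", "/?file=foo"):
    verdict = "crawlable" if parser.can_fetch("*", BASE + path) else "blocked"
    print(path, "->", verdict)
```

The rule only covers paths that begin with /doctor.cgi, so the same content requested through /?file=foo is not covered by it.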

It's also not guaranteed to do what you want: if enough pages it trusts say that /doctor.cgi?file=foo/bar is the best page evar about "foo bar" then even though it hasn't crawled it, Google will still include it in results for that search. If doctor's robots.txt was there to avoid the database hit of crawling, I'd say the thing to do is add an index.cgi that only has the content you want indexed and a form that submits to doctor.cgi, and only Allow: /index.cgi, but as the tail end of http://www.google.com/support/webmasters/bin/answer.py?answer=35303 says, if your goal is to not have the URLs appear in results, you instead have to let Google crawl them, so that it can see that it's not supposed to index them.

Dave Miller [:justdave]

Comment 5

17 years ago

I signed up doctor.mozilla.org for the Google Webmaster Tools, and used their site preferences to request removal of the entire site.  I also changed the robots.txt file to Disallow: /

Dave Miller [:justdave]

Comment 6

17 years ago

But that's only for our local copy; it does nothing for the one in CVS (in case other people are using it).

Reed Loden [:reed]

Assignee

Comment 7

17 years ago

Attached patch: patch - v1

Remove doctor.cgi as an allowed file (already in production).

Assignee: myk → reed

Status: NEW → ASSIGNED

Attachment #276312 - Flags: review?(timeless)

timeless

Updated

17 years ago

Attachment #276312 - Flags: review?(timeless) → review+

Reed Loden [:reed]

Assignee

Comment 8

17 years ago

Checking in robots.txt;
/cvsroot/mozilla/webtools/doctor/robots.txt,v  <--  robots.txt
new revision: 1.2; previous revision: 1.1
done

Status: ASSIGNED → RESOLVED

Closed: 17 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

8 years ago

Product: Webtools → Webtools Graveyard

Categories

(Webtools Graveyard :: Doctor, defect)
