Closed Bug 127452 Opened 23 years ago Closed 20 years ago

Change robots.txt to allow search engines to index bugzilla.mozilla.org

Categories

(bugzilla.mozilla.org :: General, enhancement)

Tracking

RESOLVED DUPLICATE of bug 187870

People

(Reporter: andrea.monni, Assigned: endico)

Details

I am filing this bug based on a post by Nicolás Lichtmaier on n.p.m.general; the
original can be found here:

http://groups.google.com/groups?hl=en&selm=3C76B4AA.7000600%40technisys.com.ar

And here follows the text:

<Nicolás Post>
From: Nicolás Lichtmaier (nick@technisys.com.ar)
Subject: Bugzilla and search engines
Newsgroups: netscape.public.mozilla.general
Date: 2002-02-22 13:15:17 PST

Currently bugzilla.mozilla.org disallows indexing by search engines. The
robots.txt file has:

User-agent: *
Allow: /index.html
Disallow: /

Having the bug pages indexed by Google is of course a good thing:

* Google could be used to search whether a bug has already been reported, and
it will probably do a better job than the current search system.
* If someone searches for a company, Google will probably show an evangelism
bug reported against its website.
* If someone is searching for a web feature (e.g. border-collapse), Google
will be able to display Mozilla's bug about it.

So I propose to change robots.txt to only disallow the entries which cause
searches in Bugzilla, and to allow http://bugzilla.mozilla.org/show_bug.cgi at least.

Thoughts?
</Nicolás Post>
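
For reference, the change proposed above would presumably amount to adding one
Allow line to the existing file, something along these lines (a sketch only;
which other CGIs stay blocked is left to the catch-all rule, and Allow is a
non-standard extension that Googlebot does honor):

User-agent: *
Allow: /index.html
Allow: /show_bug.cgi
Disallow: /

The search-related CGIs (query.cgi, buglist.cgi, and so on) would remain
covered by the catch-all Disallow.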

I can only add that, obviously, I second this proposal, and that Nicolás is
the only one to blame for it! ;)

Andrea
Woa..! I was about to file this bug! =)
I read that discussion, and I also agree Google should be allowed in. The
arguments I remember being mentioned against were:

(a) You can do all kinds of searches on the Bugzilla query page.

Google will answer in half a second, whereas with Bugzilla I always try to limit
the search to one or two components and only search the summary, in order to
keep the search time reasonable. I never search in the description/comments.

So, no, in reality you can't search the description/comments with Bugzilla.

(b) Google may return too many results.

I think we all know that Google normally returns thousands of results for our
queries, but the order in which it displays them is amazing. If you search the
Bugzilla database with it, in most cases you will get what you're looking for on
the first result page.

(c) The index will be out of date.

No more than three months, I think. Most of my searches are for bugs older than
this. I understand that many Bugzilla users won't benefit from Google because of
its index being old, but a large class of users, like me, will benefit.

(d) The bots will be a large load for the server.

I am the administrator of a dynamic web site which totals around 1700 pages.
When Google reindexes the entire site, it makes one request every 3 to 5
minutes, and it takes a few days to finish reindexing.

My server _has_ gone up in flames a couple of times, due to bad robots, but
never because of Google. In order to avoid something like this happening, we
could specify, in robots.txt, that only Google is allowed.
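
A sketch of what a Google-only policy might look like, reusing the
show_bug.cgi rule from the proposal above (robots obey the most specific
User-agent record that matches them, so every other well-behaved bot would
fall through to the final block):

User-agent: Googlebot
Allow: /index.html
Allow: /show_bug.cgi
Disallow: /

User-agent: *
Disallow: /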

Since bmo has 200,000 bugs, it would take a year or more to reindex it at 3-5
minutes per request. I don't know what Google would do about that; it would
need roughly one request every three seconds to finish in a week. But at first
we should only let it in, and it would find only those bugs for which there
are references on the web and in the newsgroups, and those are probably far
fewer. We can then observe its behavior and decide whether to create pages
with links that will lead it to all bugs.
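
For reference, the arithmetic behind those figures: 200,000 bugs at one
request every 3-5 minutes is 600,000 to 1,000,000 minutes of crawling, i.e.
roughly 14 months to two years; covering 200,000 bugs in the 604,800 seconds
of a week works out to about one request every 3 seconds.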
See bug 145588 for what we're planning to do to implement Google-like search in
Bugzilla.
While fixing bug 145588 will make searching easier, I can see advantages to
letting Google index show_bug.cgi (and only show_bug.cgi). I think the biggest
advantage is that our bugs will be intertwined with the results of other
queries. I can't tell you how many times I've had a problem with some
Microsoft software, went to Google to find the answer, and found an article in
the MSKB. Even though I could have just gone to the MSKB first, I didn't think
of it; Google rescued me. A couple of other advantages that are similar in
nature were listed in comment 0.

As a side note, we don't have an index.html to be allowed anymore.
Well, we could allow index.cgi in the robots.txt file. The issue is that the
googlebot grabbing all 200,000 bugs would be a really heavy load on bmo. Maybe
post mod_perl we can look at it, but...

Also, without Last-Modified dates there would be bandwidth issues too (I
assume that the googlebot does conditional fetches).
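
For illustration, the kind of conditional fetch I'm assuming the googlebot
makes, and the cheap answer it could get if show_bug.cgi sent a Last-Modified
header (the values here are made up):

GET /show_bug.cgi?id=127452 HTTP/1.1
Host: bugzilla.mozilla.org
If-Modified-Since: Sat, 22 Feb 2003 13:15:17 GMT

HTTP/1.1 304 Not Modified

Without Last-Modified (or an ETag), every recrawl has to ship the full page
body again.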

Plus you'd have to get the list of bugs to Google somehow, and I don't think
Google indexes CGI requests - you'd need a static URL
(http://bmo/show_bug/12345) via mod_rewrite or similar.
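
A sketch of the sort of rewrite rule that could map such a static-looking URL
back onto the CGI, assuming Apache's mod_rewrite in the main server config
(the /show_bug/NNNNN path is just the one suggested above, not something bmo
serves today):

RewriteEngine On
RewriteRule ^/show_bug/([0-9]+)$ /show_bug.cgi?id=$1 [PT,L]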
There are also security concerns; I've discussed them somewhere.
This looks like a dup of/related to bug 187870

*** This bug has been marked as a duplicate of 187870 ***
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → DUPLICATE
Component: Bugzilla: Other b.m.o Issues → General
Product: mozilla.org → bugzilla.mozilla.org