User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6

Bugzilla should optionally be installable in such a way as to let Google and other major search engines spider bug pages. This can be done by (a) allowing bugs to be accessed with simple static-page-like URLs such as http://bugzilla.mozilla.org/bug/bug_151410, which can be as simple as using mod_rewrite; (b) making all internal references to bugs within Bugzilla pages use that syntax; and (c) allowing the major search engines to spider that particular virtual directory only, via the robots.txt file.

Presumably the rationale for preventing all robots from spidering the site is to avoid undue server load; however, allowing it, possibly with caching of accessed pages to cut down on server load, would greatly enhance the utility of Bugzilla when doing general-purpose searches. Spidering load could be cut down considerably by generating a Google sitemap feed, or one of the alternative site feeds, which can be as simple as an RSS or Atom feed: see https://www.google.com/webmasters/sitemaps/docs/en/other.html (an RSS feed might well be a useful resource in its own right). If necessary, spidering could be restricted to search engines which are sitemap-friendly.

Has anyone checked whether the cost of allowing spidering under these circumstances would exceed the savings on searching, given the use of external search engines instead?

Reproducible: Always

Steps to Reproduce:
1. Search for a bug on Bugzilla in Google, or any other search engine.
2. It isn't there.

Actual Results:
No bug reports are found by Google.

Expected Results:
Search engines allowed to spider the site.
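A minimal sketch of the mod_rewrite idea in (a), assuming Bugzilla is served from the document root and using the hypothetical /bug/bug_NNNN URL scheme above; the exact rule would depend on the installation's layout:

```apache
# Hypothetical rewrite: map a static-looking URL like /bug/bug_151410
# onto Bugzilla's real show_bug.cgi script.
RewriteEngine On
RewriteRule ^bug/bug_([0-9]+)$ /show_bug.cgi?id=$1 [PT,QSA]
```

With a rule like this in place, robots.txt can allow only the /bug prefix while keeping the CGI URLs disallowed.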
Current robots.txt:

User-agent: *
Allow: /index.cgi
Disallow: /

Suggested robots.txt:

User-agent: *
Allow: /index.cgi
Allow: /bug
Disallow: /
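The sitemap idea from comment 0 could pair with the suggested robots.txt: a minimal sitemap listing a few of the rewritten bug URLs might look like this (the host and bug IDs are purely illustrative; a real feed would be generated from the database):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> entry per public bug, regenerated as bugs change. -->
  <url>
    <loc>http://bugzilla.mozilla.org/bug/bug_151410</loc>
  </url>
  <url>
    <loc>http://bugzilla.mozilla.org/bug/bug_294295</loc>
  </url>
</urlset>
```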
This is a dup, I'm fairly sure, but I can't find what it's a dup of.
(In reply to comment #2)
> This is a dup, I'm fairly sure, but I can't find what it's a dup of.

Bug 294295 is the bug you are looking for.
OK, yeah, these are different bugs. One of them is just a discussion about things we could do on bmo, this one is about what we could do by default in Bugzilla in general.
Severity: normal → enhancement
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Linux → All
Hardware: PC → All
Summary: Bugzilla should by default let Google and other major search engines to spider bug pages → Bugzilla should by default let Google and other major search engines index bug pages
Version: unspecified → 2.21
AFAIK, search robots cause a lot of traffic on Bugzilla sites. It's a general problem in web applications.
IMHO this is not a Bugzilla bug, but rather a default setting that any sysadmin is free to change. They had to pick one, and they picked "no indexing", which sounds reasonable.

At Eclipse we allow spiders to crawl our Bugzilla site, but as per comment 5, I had to contact Google and have them relax their spider: they'd hammer away at the bugs at a fast rate, causing the DB to slow down considerably.

D.
If this is a bugzilla.mozilla.org bug, I think it will be a WONTFIX. In pre-Firefox (maybe Firebird/Phoenix) times we were able to search bmo using Google, but later that was stopped due to the load it caused on bmo.

What can we do to make Bugzilla searchable by Google?
1. Write a bot to retrieve the following link every day: https://bugzilla.mozilla.org/buglist.cgi?chfieldfrom=-2d&chfieldto=-1d
2. Cache each bug on the above list on a server which doesn't restrict search engines via robots.txt.
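The two steps above could be sketched as follows; the buglist.cgi URL matches the link in the comment, but the mirror layout and helper names are purely hypothetical:

```python
import os
import urllib.request

BMO = "https://bugzilla.mozilla.org"

def buglist_url(frm: str = "-2d", to: str = "-1d") -> str:
    """URL of the list of bugs changed in the given window (step 1)."""
    return f"{BMO}/buglist.cgi?chfieldfrom={frm}&chfieldto={to}"

def cache_path(bug_id: int, root: str = "mirror") -> str:
    """Where a crawlable copy of one bug page would be stored (step 2)."""
    return os.path.join(root, "bug", f"bug_{bug_id}.html")

def mirror_bug(bug_id: int, root: str = "mirror") -> str:
    """Fetch show_bug.cgi for one bug and save it under the mirror root."""
    path = cache_path(bug_id, root)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with urllib.request.urlopen(f"{BMO}/show_bug.cgi?id={bug_id}") as resp:
        with open(path, "wb") as out:
            out.write(resp.read())
    return path
```

A nightly cron job would fetch buglist_url(), parse out the bug IDs, and call mirror_bug() for each; the mirror host's robots.txt would then allow crawling of the cached copies.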
I hear Bugzilla 3.0 is faster, so performance may no longer be such an issue. justdave, would you be so kind as to run an experiment?

1. Reassign this bug to yourself.
2. Use the Google Webmaster Tools at https://www.google.com/webmasters/tools to reduce Google's crawl speed on b.m.o to "Slow".
3. Add the following to the top of http://bugzilla.mozilla.org/robots.txt:

User-agent: Googlebot
Allow: /

4. See whether:
a) the Mozilla community is overall happier now that b.m.o is indexed by Google (darn useful), or
b) unhappier because b.m.o is just a tiny bit more loaded.
5. If the truth is b), undo step 3.
By the way, Google crawls https:// URLs too, so it is no problem that b.m.o is behind HTTPS.

http://www.google.com/search?q=site:addons.mozilla.org
http://www.markcarey.com/googleguy-says/archives/discuss-google-crawls-https.html
Also, justdave, if you want to prevent Google and Wayback Machine caching, you can. Just add this META tag to the top of pages generated by show_bug.cgi:

<meta name="robots" content="noarchive">

http://webhostinggeeks.com/articles/web-development/6760.php describes exactly what the tag means.
If anyone is out there, I'll reiterate that this is not a Bugzilla bug, as Google will index any new Bugzilla install. I recommend RESOLVED->WORKSFORME
BMO has let Google index for quite some time now, and other installs can update their configurations to allow the same.

https://bugzilla.mozilla.org/robots.txt

dkl
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → WONTFIX
After implementing anything in robots.txt, you should check it with the Google Webmaster robots.txt tester, because that will show you what you have blocked for Google. I blocked https://www.miditech.co.in for a few days by mistake and lost ranking on several keywords, so be careful before making any changes to robots.txt.