User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6

Bugzilla should be optionally installable in such a way as to let Google and other major search engines spider bug pages.

This can be done by
(a) allowing bugs to be accessed with simple static-page-like URLs such as http://bugzilla.mozilla.org/bug/bug_151410. This can be as simple as using mod_rewrite.
(b) making all internal references to bugs within Bugzilla pages use that syntax
(c) allowing the major search engines to spider _that particular virtual directory only_ within Bugzilla in the robots.txt file

Presumably the rationale for preventing all robots from spidering the site is to prevent undue server load. However, allowing it, possibly with caching of accessed pages to cut down on server load, would greatly enhance the utility of Bugzilla when doing general-purpose searches.

It's possible that spidering load could be cut down considerably by generating a Google sitemap feed, or another alternative site feed, which can be as simple as an RSS or Atom feed: see https://www.google.com/webmasters/sitemaps/docs/en/other.html (An RSS feed might well be a useful resource in its own right.) If necessary, spidering could be restricted to search engines which are sitemap-friendly.

Has anyone checked whether the cost of allowing spidering under these circumstances would exceed the savings on searching, given the use of external search engines instead?

Reproducible: Always

Steps to Reproduce:
1. Search for a bug on Bugzilla in Google, or any other search engine
2. It isn't there

Actual Results:
No bug reports are found by Google.

Expected Results:
Search engines are allowed to spider the site.
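To make points (a) and the sitemap idea concrete, here is a rough Python sketch. It only illustrates the URL mapping that a mod_rewrite rule would perform and what a minimal sitemap for the static-style URLs could look like. The function names and the host bugzilla.example.org are made up for illustration, and the `/bug/bug_<id>` pattern is taken from the example URL above; nothing here is existing Bugzilla code.

```python
import re
from xml.etree import ElementTree as ET

# Pattern for the static-looking URLs proposed in (a), e.g. /bug/bug_151410.
STATIC_BUG = re.compile(r"^/bug/bug_(\d+)$")


def rewrite_static_bug_url(path):
    """Map /bug/bug_<id> to the show_bug.cgi URL, or return None if no match.

    This mirrors what a single rewrite rule in the web server would do.
    """
    match = STATIC_BUG.match(path)
    if match is None:
        return None
    return f"/show_bug.cgi?id={match.group(1)}"


def sitemap_xml(base_url, bug_ids):
    """Build a minimal sitemap listing the static-style URL of each bug."""
    urlset = ET.Element(
        "urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    )
    for bug_id in bug_ids:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = f"{base_url}/bug/bug_{bug_id}"
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(rewrite_static_bug_url("/bug/bug_151410"))
    # bugzilla.example.org is a placeholder host.
    print(sitemap_xml("https://bugzilla.example.org", [151410, 294295]))
```

A real sitemap would probably also carry a last-modified date per bug so that crawlers can skip unchanged pages, which is where the load saving would come from.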
Current robots.txt:

User-agent: *
Allow: /index.cgi
Disallow: /

Suggested robots.txt:

User-agent: *
Allow: /index.cgi
Allow: /bug
Disallow: /
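For anyone who wants to check what the two files above actually permit, here is a small sketch using Python's standard urllib.robotparser. It reflects how that parser reads the rules, which is not necessarily how any particular search engine does; the helper name `allowed` and the sample paths are just for illustration.

```python
from urllib.robotparser import RobotFileParser

CURRENT = """\
User-agent: *
Allow: /index.cgi
Disallow: /
"""

SUGGESTED = """\
User-agent: *
Allow: /index.cgi
Allow: /bug
Disallow: /
"""


def allowed(robots_txt, path, agent="Googlebot"):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for path in ("/index.cgi", "/bug/bug_151410", "/show_bug.cgi", "/buglist.cgi"):
        print(path, allowed(CURRENT, path), allowed(SUGGESTED, path))
```

Note that robots.txt rules are path prefixes, so `Allow: /bug` also matches `/buglist.cgi`. Writing `Allow: /bug/` would limit the rule to the virtual directory.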
This is a dup, I'm fairly sure, but I can't find what it's a dup of.
(In reply to comment #2)
> This is a dup, I'm fairly sure, but I can't find what it's a dup of.

Bug 294295 is the bug you are looking for.
OK, yeah, these are different bugs. One of them is just a discussion about things we could do on bmo, this one is about what we could do by default in Bugzilla in general.
Severity: normal → enhancement
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Linux → All
Hardware: PC → All
Summary: Bugzilla should by default let Google and other major search engines to spider bug pages → Bugzilla should by default let Google and other major search engines index bug pages
Version: unspecified → 2.21
AFAIK, search robots cause a lot of traffic on Bugzilla sites. It's a general problem for web applications.
IMHO this is not a Bugzilla bug, but rather a default setting that any SysAdmin is free to change. They had to pick one, and they picked "no indexing". Sounds reasonable. At Eclipse we allow spiders to crawl our Bugzilla site, but as per comment 5, I had to contact Google and have them relax their spider... They'd hammer away at the bugs at a fast rate, causing the DB to slow down considerably. D.
If this is a bugzilla.mozilla.org bug, I think it will be a WONTFIX. In pre-Firefox times (maybe Firebird/Phoenix) we were able to search bmo using Google, but that was later stopped because of the load it caused on bmo.

What can we do to make Bugzilla searchable by Google?
1. Write a bot to retrieve the following link every day: https://bugzilla.mozilla.org/buglist.cgi?chfieldfrom=-2d&chfieldto=-1d
2. Cache each bug listed on that page on a server which doesn't restrict search engines via robots.txt.
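The two steps above could start from something like the following Python sketch. It only builds the buglist URL and pulls bug ids out of a page that has already been downloaded; fetching and writing the cache are left out. The function names are made up, and the assumption that the buglist HTML links to bugs as `show_bug.cgi?id=<number>` should be checked against a real page.

```python
import re
from urllib.parse import urlencode, urljoin

BASE = "https://bugzilla.mozilla.org/"

# Assumption: each bug in the buglist HTML is linked as show_bug.cgi?id=<number>.
BUG_LINK = re.compile(r"show_bug\.cgi\?id=(\d+)")


def recent_changes_url(base=BASE, days_from=2, days_to=1):
    """Build the buglist URL for bugs changed between days_from and days_to days ago."""
    query = urlencode(
        {"chfieldfrom": f"-{days_from}d", "chfieldto": f"-{days_to}d"}
    )
    return urljoin(base, "buglist.cgi") + "?" + query


def bug_ids_from_buglist(html):
    """Return the sorted, de-duplicated bug ids linked from a buglist page."""
    return sorted({int(bug_id) for bug_id in BUG_LINK.findall(html)})


if __name__ == "__main__":
    print(recent_changes_url())
    sample = '<a href="show_bug.cgi?id=304569">304569</a>'
    print(bug_ids_from_buglist(sample))
```

A daily job would fetch `recent_changes_url()`, pass the response body to `bug_ids_from_buglist()`, and refresh the cached copy of each listed bug.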
I hear Bugzilla 3.0 is faster, so performance is no longer such an issue. justdave, would you be so kind as to do an experiment?
1. Reassign this bug to yourself.
2. Use the Google Webmaster Tools at https://www.google.com/webmasters/tools to reduce Google's crawl speed on b.m.o to "Slow".
3. Add the following to the top of http://bugzilla.mozilla.org/robots.txt:
   User-agent: Googlebot
   Allow: /
4. See whether:
   a) the Mozilla community is overall happier now that b.m.o is indexed by Google (darn useful), or
   b) unhappier because b.m.o is just a tiny bit more loaded.
5. If the truth is b), then undo step 3.
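As a sanity check of step 3, this sketch feeds a robots.txt with the Googlebot group placed above a catch-all group into Python's standard urllib.robotparser. The catch-all rules are copied from the current robots.txt quoted earlier in this bug, and "SomeOtherBot" is a made-up agent name. It shows how this one parser reads the file, which may differ from Googlebot's own behaviour.

```python
from urllib.robotparser import RobotFileParser

# Googlebot group added at the top, followed by the existing catch-all group.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Allow: /index.cgi
Disallow: /
"""


def make_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    parser = make_parser(ROBOTS_TXT)
    for agent in ("Googlebot", "SomeOtherBot"):
        print(agent, parser.can_fetch(agent, "/show_bug.cgi?id=304569"))
```

Under this parser only Googlebot gets access to bug pages, while every other robot keeps the existing restrictions.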
By the way, Google crawls https:// URLs too, so it is no problem that b.m.o is behind HTTPS. See:
http://www.google.com/search?q=site:addons.mozilla.org
http://www.markcarey.com/googleguy-says/archives/discuss-google-crawls-https.html
Also, justdave, if you want to prevent Google and the Wayback Machine from caching pages, you can. Just add this META tag to the <head> of pages generated by show_bug.cgi:
<meta name="robots" content="noarchive">
http://webhostinggeeks.com/articles/web-development/6760.php describes exactly what the tag means.
If anyone is out there, I'll reiterate that this is not a Bugzilla bug, as Google will index any new Bugzilla install. I recommend RESOLVED->WORKSFORME
BMO has let Google index for quite some time and other installs can update their configurations to allow the same. https://bugzilla.mozilla.org/robots.txt dkl
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → WONTFIX
After implementing any change in robots.txt, you should check it with the robots.txt tester in Google Webmaster Tools, because that shows what you have blocked for Google. I blocked https://www.miditech.co.in for a few days by mistake and lost ranking on several keywords, so be careful before making any changes in robots.txt.