Closed Bug 472243 Opened 17 years ago Closed 15 years ago

Create a sitemap for web crawlers

Categories

(Bugzilla :: Bugzilla-General, enhancement)

Type: enhancement
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED WONTFIX

People

(Reporter: mkanat, Unassigned)

Attachments

(1 file)

It would be nice to allow crawlers to access Bugzilla and to have a sitemap to guide them. Of course, this first requires that we make show_bug.cgi use the shadow DB for logged-out users--otherwise large installations would be overwhelmed by crawler traffic.

There's a decent-looking CPAN module for creating and managing sitemaps here: http://search.cpan.org/dist/WWW-Google-SiteMap/

I figure we could update the sitemap whenever a bug is filed, and only store the X most recently updated bugs in it, so that it doesn't get too large.
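
For illustration, here is a minimal sketch of how WWW::Google::SiteMap could be used for this. The SQL, the limit of 1000 bugs, and the output file name are assumptions for the example, not anything Bugzilla ships with:

  use strict;
  use warnings;
  use WWW::Google::SiteMap;
  use Bugzilla;

  # Hypothetical sketch: put the 1000 most recently updated public bugs
  # (bugs that are not in any group) into a gzipped sitemap.
  Bugzilla->switch_to_shadow_db();
  my $dbh     = Bugzilla->dbh;
  my $urlbase = Bugzilla->params->{'urlbase'};

  my $bugs = $dbh->selectall_arrayref(
      'SELECT bugs.bug_id, bugs.delta_ts
         FROM bugs LEFT JOIN bug_group_map
              ON bugs.bug_id = bug_group_map.bug_id
        WHERE bug_group_map.bug_id IS NULL
        ORDER BY bugs.delta_ts DESC
        LIMIT 1000');

  my $map = WWW::Google::SiteMap->new(file => 'sitemap.xml.gz');
  foreach my $row (@$bugs) {
      my ($bug_id, $delta_ts) = @$row;
      $map->add(WWW::Google::SiteMap::URL->new(
          loc     => $urlbase . 'show_bug.cgi?id=' . $bug_id,
          # delta_ts is 'YYYY-MM-DD HH:MM:SS'; lastmod wants a W3C date.
          lastmod => substr($delta_ts, 0, 10),
      ));
  }
  $map->write;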
Does anybody have experience with how often crawlers request sitemap.xml? I'm trying to figure out whether keeping and updating a static sitemap.xml is worth the effort, or whether it could be created on the fly.

The sitemap could probably be limited to open bugs plus bugs updated in the last N days (or the last N updated bug reports, as suggested), so it stays reasonable in size even for large databases. The most important thing seems to be keeping the lastmod attribute updated; a priority could perhaps also be assigned based on the status flag ( http://www.sitemaps.org/protocol.php ) -- see the sketch below.
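
To illustrate the priority idea, a small sketch extending the one above; the mapping from status to priority is an arbitrary assumption, not an established convention:

  # Hypothetical mapping: open bugs get a higher sitemap priority than
  # closed ones. The numbers are arbitrary illustrations.
  sub priority_for_status {
      my ($status) = @_;
      my %is_open = map { $_ => 1 } qw(UNCONFIRMED NEW ASSIGNED REOPENED);
      return $is_open{$status} ? 0.8 : 0.3;
  }

  # Inside the loop over bugs (assuming bug_status was also selected):
  $map->add(WWW::Google::SiteMap::URL->new(
      loc      => $urlbase . 'show_bug.cgi?id=' . $bug_id,
      lastmod  => substr($delta_ts, 0, 10),
      priority => priority_for_status($bug_status),
  ));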
I think that, given the potential size of the file, the best thing to do would probably be to have collectstats.pl update it nightly. Also, there should probably be a slight delay (perhaps 6 hours) before any bug becomes part of the sitemap, in case it was a security bug that was accidentally made public. (Such mistakes usually get fixed within six hours, which seems pretty reasonable, and a delay of 6 + 24 hours [the maximum it could be] before a bug appears in Google's search isn't that bad.)
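
A sketch of what the six-hour delay could look like in the query from the earlier sketch; the column names are from the standard Bugzilla schema, and the interval syntax shown is MySQL-specific:

  # Only include bugs filed more than six hours ago, so an accidentally
  # public security bug has a window to be hidden again before it can
  # appear in the sitemap.
  my $bugs = $dbh->selectall_arrayref(
      'SELECT bugs.bug_id, bugs.delta_ts
         FROM bugs LEFT JOIN bug_group_map
              ON bugs.bug_id = bug_group_map.bug_id
        WHERE bug_group_map.bug_id IS NULL
          AND bugs.creation_ts <= NOW() - INTERVAL 6 HOUR
        ORDER BY bugs.delta_ts DESC
        LIMIT 1000');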
Also, "the last N updated bug reports" is better than "the last N days", since the number of bugs touched in N days can fluctuate, while the last-N-updated list, in a large Bugzilla, will not (thus always giving search engines a reasonably sized and up-to-date list). It's also possible to use a sitemap index and index every single bug -- see the sketch below.
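
For the sitemap-index idea, a sketch using WWW::Google::SiteMap::Index from the same CPAN distribution; the chunk count, file names, and date variable are assumptions:

  use WWW::Google::SiteMap::Index;

  # Hypothetical sketch: a sitemap index pointing at per-chunk sitemaps,
  # so every bug can be listed without exceeding the protocol's limit of
  # 50,000 URLs per sitemap file.
  my $index = WWW::Google::SiteMap::Index->new(file => 'sitemap_index.xml.gz');
  foreach my $n (1 .. $num_chunks) {
      $index->add(WWW::Google::SiteMap::URL->new(
          loc     => $urlbase . "sitemap$n.xml.gz",
          lastmod => $today,    # e.g. '2009-01-05'
      ));
  }
  $index->write;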
Here is the script we currently use at Red Hat. It only indexes public bugs and regenerates once a day. We use delta_ts as the lastmod value. It has been working well for us for a while now.

Also, here is our current robots.txt config:

User-agent: *
Crawl-delay: 5
Disallow: /
Disallow: /show_bug.cgi*ctype=xml*
Allow: /index.cgi
Allow: /show_bug.cgi
Allow: /sitemap_index.xml
Allow: /sitemap*xml.gz
Sitemap: http://bugzilla.redhat.com/sitemap_index.xml
Okay, instead of implementing this in Bugzilla, I've made an extension: http://code.google.com/p/bugzilla-sitemap/

There aren't any downloads there yet, but there will be within the next few days. Until then, there are instructions on that page for where you can get it using bzr.

We can reopen this if at some point we decide that it's something we want in core, but I think it makes the most sense as an extension for now, because (a) that way older installations can get it, and (b) it's only applicable to a certain subset of Bugzillas--namely, publicly accessible ones with requirelogin turned off.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → WONTFIX