Closed
Bug 223988
Opened 21 years ago
Closed 13 years ago
Let search engines index bugs (perhaps by installing the Sitemap extension)
Categories
(bugzilla.mozilla.org :: Extensions, enhancement, P5)
RESOLVED
FIXED
People
(Reporter: myk, Unassigned)
Details
(Whiteboard: [bmo4.0-resolved])
Attachments
(3 files, 6 obsolete files)
983 bytes, patch | timeless: feedback- | Details | Diff | Splinter Review
17.60 KB, patch | glob: review+ | Details | Diff | Splinter Review
9.23 KB, patch | glob: review+ | Details | Diff | Splinter Review
We should create static versions of each non-secure bug's "show bug" page daily and let Google index them so people can take advantage of Google's search capabilities to search for bugs. Perhaps we can do funny tricks with redirects so people going to the static page get pointed at the dynamic one easily.

timeless points out that we don't want Google to index bugs that subsequently become secure and offers some suggestions for mitigating this problem:

<timely> i'd suggest publishing all resolved bugs
<timely> the odds of them transitioning to security are very slim
<timely> it's not as big of a win, but it's sort of useful
<timely> we could also consider publishing all stale bugs (say >7days stale)
<timely> the odds of those transitioning to security w/o some thought aren't very high
<timely> as soon as the bug loses its stale state you kill the google cachable doc and possibly replace it w/ something which makes google more likely to replace its cached doc w/ something else
Comment 1•21 years ago
I'm a bit worried by this. Please make sure you remove mail addresses in the published version and that all links back to bugzilla/bonsai/lxr in comments are removed.
the links comment is reasonable, but how do you handle email addresses? consider that this comment is posted by timeless @ bemail . org which doesn't deliver mail. But, there's a real name field which contains my real mail box. (in case it isn't obvious, i don't want to get spam to my real box just because we're protecting someone else's bugmail address.)

not being able to search for people's comments is problematic. Searching for bz and timeless can be helpful (if you ignore the fact that we comment in a large number of bugs). One approach would be to expose a user id instead. This of course only works if there's a relatively easy way for people to find those user ids.

And what do you do about things like: "well nobody@mozilla.org wrote: ..." or no-such-user@example.com (hopefully never an account in bmo)? We can convert the former into user-id. I suppose the latter should just get ('d or whatever we currently use.

Myk: do we want to googlebait? something like making the front page have a link to a page which lists all cached bugs.
Comment 3•20 years ago
> so people can take advantage of Google's search
> capabilities to search for bugs.
Bugzilla now has full-text indexing so you can do a Google-like search anyway.
Now that we have that, there seems to be a lot of hassle and not much gain in
attempting this. I'd call it a WONTFIX.
Gerv
Comment 4•19 years ago
(In reply to comment #3)
> > so people can take advantage of Google's search
> > capabilities to search for bugs.
> Bugzilla now has full-text indexing so you can do a Google-like search anyway.
> Now that we have that, there seems to be a lot of hassle and not much gain in
> attempting this. I'd call it a WONTFIX.

Does indeed seem like an idea that has passed its prime and doesn't appear to have interest from y'all at bmo. I'm surprised no one has marked it WONTFIX!
Reporter
Comment 5•19 years ago
Yeah, the effort is probably not worth it. But it *is* worth letting Google index our pages, even though we now have fulltext search, since they'll aggregate our pages with other results in useful ways, and they may work better for certain searches. But instead of creating static versions of the pages, we should just make Bugzilla run under mod_cgi/SpeedyCGI and then let Google index the live versions. Updating the summary accordingly.
Summary: create static "show bug" pages daily and let google index them → let Google index bugs once Bugzilla runs under mod_cgi/SpeedyCGI
Comment 6•19 years ago
You mean mod_perl I assume. We already run under mod_cgi ;)
Summary: let Google index bugs once Bugzilla runs under mod_cgi/SpeedyCGI → let Google index bugs once Bugzilla runs under mod_perl/SpeedyCGI
Reporter
Comment 7•19 years ago
(In reply to comment #6)
> You mean mod_perl I assume. We already run under mod_cgi ;)

Yah. :-)
Comment 8•18 years ago
*** Bug 330493 has been marked as a duplicate of this bug. ***
Updated•18 years ago
Assignee: myk → justdave
Priority: -- → P5
Updated•18 years ago
QA Contact: myk → reed
Comment 9•17 years ago
Doesn't Bugzilla run on mod_perl since the last upgrade to 3.0?
Comment 10•16 years ago
Rearranging this component to not have me as the default assignee, so that it doesn't appear like I'm intending to work on bugs that other people could be taking care of if they didn't think I was already doing them. If this bug is a software issue on b.m.o and you'd like to fix it, the modified source is now available for patching (see bug 373688). Reassigning to the new default owner.
Assignee: justdave → nobody
QA Contact: reed → other-bmo-issues
Comment 11•15 years ago
There have been several discussions on this outside of this bug, and timeless keeps saying we can't let it be cacheable for security reasons. I guess the question is no longer whether Bugzilla can handle being scraped (because it can now), but whether we want to allow it.
Updated•15 years ago
Summary: let Google index bugs once Bugzilla runs under mod_perl/SpeedyCGI → let Google index bugs
Comment 12•15 years ago
Here you go, here's a patch that solves that problem.
Assignee: nobody → mkanat
Status: NEW → ASSIGNED
Updated•14 years ago
Attachment #414540 - Flags: review?(justdave)
Updated•14 years ago
Attachment #414540 - Flags: feedback?(timeless)
Comment 13•14 years ago
http://www.google.com/intl/ru/remove.html

* To remove cached versions of a page from Google's index and prevent Google from caching the page in the future, you must add a noarchive meta tag to that page. The next time we crawl that site, we'll see the tag and remove the page. To prevent all search engines from showing a "Cached" link for your site, place this tag in the <HEAD> section of your page:

  <meta name="robots" content="noarchive">

  Note: Using a noarchive metatag removes only the "Cached" link for the page. Google will continue to index the page and display a snippet.

the snippet bit worries me... is there a way we can suppress this too?
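For reference, the combined form would look like the fragment below. This is a sketch only: whether bmo also wants to suppress snippets is exactly the open question in this comment, and the `nosnippet` directive is a Google-specific extension, not part of any attached patch.

```html
<!-- Hypothetical <head> fragment: "noarchive" suppresses the "Cached"
     link, and "nosnippet" additionally asks Google not to display a
     snippet for the page. -->
<head>
  <meta name="robots" content="noarchive, nosnippet">
</head>
```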
Comment 15•14 years ago
Comment on attachment 414540 [details] [diff] [review]
Allow robots to access show_bug and don't let them cache it

please incorporate the bit from biesi
Attachment #414540 - Flags: feedback?(timeless) → feedback-
Comment 16•14 years ago
note that in the past i've specifically asked for us to limit things to bugs which are old. i'd still like to see that. a bug which is >1 month old is less concerning to me than a bug which our triagers haven't had a chance to see.
Comment 18•14 years ago
Okay, instead of doing any of that, I made a Sitemap extension for Bugzilla. It's here: http://bzr.mozilla.org/bugzilla/extensions/sitemap/trunk Its home page is here: http://code.google.com/p/bugzilla-sitemap It has the capability to have the security protections that you need. It requires Bugzilla 3.6, though, of course, since it's an extension.
Updated•14 years ago
Attachment #414540 - Flags: review?(justdave)
Updated•14 years ago
Depends on: bmo-upgrade-3.6
Summary: let Google index bugs → let Google index bugs, using the Sitemap extension
Updated•14 years ago
Assignee: mkanat → justdave
Comment 19•14 years ago
There is a security question here, which is discussed over in bug 577732. Is there a risk that Google will index security bugs before they have been made secure, if they are not made that way from the start?

If I were an attacker with any vague interest in Firefox security bugs, I would set up a job to poll Bugzilla regularly and download a copy of each new bug and its attachments, and then check back a bit later to see if that bug had been made secure. This is not complicated to do, and would not, I suspect, be sufficient activity to raise alarm bells in IT.

Given that, I don't think we gain much from setting the delay before Google indexes a bug to anything greater than 24 hours. And we gain a lot from keeping it small; the search is more effective. So I'd argue for a 24h delay maximum.

However, we should also add the noarchive meta tag explained in comment #13 to the "you are not permitted to see this bug" error page which results when a secure bug is accessed by someone with insufficient permissions. It can't hurt.

Gerv
Comment 20•14 years ago
(In reply to comment #19)
> There is a security question here, which is discussed over in bug 577732. Is
> there a risk that Google will index security bugs before they have been made
> secure, if they are not made that way from the start?

I assume you mean bug 577731. As for the security question, see http://code.google.com/p/bugzilla-sitemap/#Security.
Comment 21•14 years ago
how to chang jit
Updated•14 years ago
Attachment #457513 - Attachment is obsolete: true
Updated•13 years ago
Summary: let Google index bugs, using the Sitemap extension → Let search engines index bugs (perhaps by installing the Sitemap extension)
Updated•13 years ago
Assignee: justdave → nobody
Comment 22•13 years ago
It would be nice to get this in for the bmo/4.0 upgrade but we may need to have some discussion about it first. Search::Sitemap requires Moose and that will bring in quite a few dependencies, and we need to gauge the impact that will cause.

Dave
Comment 23•13 years ago
(In reply to comment #22)
> It would be nice to get this in for bmo/4.0 upgrade but we may need to do some
> discussion about it first. Search::Sitemap requires Moose and that will bring in
> quite a few dependencies and we need to gauge the impact that will cause.

For what it's worth, it should work with the perl-Moose package that's in rpmforge.
Comment 24•13 years ago
Here is a revised Sitemap extension that removes the need for Search::Sitemap and also handles all bugs instead of the last 50k changed. It divides the bugs into separate gzipped sitemap files and creates a sitemap index that the crawler can use. It has a SITEMAP_AGE of 12 hours by default so that the full sitemap files are not regenerated each time.

Also, according to the sitemaps.org specification you do not necessarily need to ping the search engines when a new version is available; having the link to the sitemap index in the robots.txt file should be sufficient. In the past I have done this in a cron job, but here it will get triggered automatically, using extension hooks, when an engine accesses page.cgi?id=sitemap/sitemap.xml.

Dave
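For reference, a sitemap index of the kind described in this comment looks roughly like the following under the sitemaps.org protocol. The file names, host, and dates here are illustrative, not taken from the patch:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Index file pointing at the individual gzipped sitemap files;
     each referenced sitemap may list at most 50,000 URLs per the spec. -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://bugzilla.mozilla.org/sitemap1.xml.gz</loc>
    <lastmod>2011-03-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://bugzilla.mozilla.org/sitemap2.xml.gz</loc>
    <lastmod>2011-03-15</lastmod>
  </sitemap>
</sitemapindex>
```

Crawlers then discover the index through a `Sitemap:` line in robots.txt, which the sitemaps.org protocol defines for exactly this purpose.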
Attachment #519796 - Flags: review?(glob)
Comment 25•13 years ago
Why remove Search::Sitemap?
Comment 26•13 years ago
(In reply to comment #25)
> Why remove Search::Sitemap?

heavy dependencies for simple requirements.
Comment 27•13 years ago
(In reply to comment #26)
> heavy dependencies for simple requirements.

That's not really a good design argument. If you have rpmforge installed, the requirements are accomplished by "yum install perl-Search-Sitemap".
Comment 28•13 years ago
Comment on attachment 519796 [details] [diff] [review]
Revised Sitemap extension without need for Search::Sitemap (v1)

excellent :) as it's a fork of the sitemap extension, it should have a different name.

::: extensions/Sitemap/lib/Constants.pm
@@ +30,5 @@
> +);
> +
> +# This is the amount of time a sitemap index and it's files are considered
> +# valid before needing to be regenerated
> +use constant SITEMAP_AGE => 12;

you should note that the age is in hours. nit: s/it's/its/

@@ +34,5 @@
> +use constant SITEMAP_AGE => 12;
> +
> +# This is the largest number of entries that can be in a sitemap,
> +# per the sitemaps.org standard.
> +use constant SITEMAP_MAX => 50_000;

this comment isn't accurate anymore :)

@@ +39,5 @@
> +
> +# We only show bugs that are 12 hours old, because if somebody
> +# files a bug that's a security bug but doesn't protect it, we
> +# want to give them time to fix that.
> +use constant SITEMAP_DELAY => 12;

again clarify the units used.

@@ +41,5 @@
> +# files a bug that's a security bug but doesn't protect it, we
> +# want to give them time to fix that.
> +use constant SITEMAP_DELAY => 12;
> +
> +use constant SITEMAP_URL => 'page.cgi?id=sitemap/sitemap.xml';

this const is never used. perhaps it should be used by _fix_robots_txt(), and the entry removed from robots.txt

::: extensions/Sitemap/lib/Util.pm
@@ +52,5 @@
> +# and links to bugs.
> +sub generate_sitemap {
> +    # If file is less than SITEMAP_AGE hours old, then read in and send to caller.
> +    # If greater, then regenerate and send the new version.
> +    my $index_path = bz_locations()->{'cgi_path'} . "/sitemap_index.xml";

i'd prefer for the files to live in a sitemap subdir.. "/sitemap/index.xml", etc. although if the regeneration isn't triggered by cron, then it's probably a good idea to put the main index.xml into data/, to ensure robots can't request it without triggering a re-gen (with the files the index references in sitemap/).

@@ +101,5 @@
> +    $bug_sth->execute($hours_ago, $last_id);
> +
> +    while (my ($bug_id, $delta_ts) = $bug_sth->fetchrow_array()) {
> +        push(@$bugs, { bug_id => $bug_id, delta_ts => $delta_ts });
> +        $last_id = $bug_id;

sql_limit expects the number of rows to skip as the second parameter, so passing in the bug_id won't work.

@@ +107,5 @@
> +
> +    last if !@$bugs;
> +
> +    # We only need the product links in the first sitemap file
> +    $products = [] if $filecount > 1;

nit: tabs

@@ +114,5 @@
> +        $filecount++;
> +    }
> +
> +    # Generate index file
> +    return _generate_sitemap_index($filelist);

nit: files no longer referenced by the index.xml should be deleted.

@@ +165,5 @@
> +
> +    foreach my $product (@$products) {
> +        $sitemap .= "
> +    <url>
> +        <loc>" . $product_url . $product->name . "</loc>

you need to xml escape the product name.

@@ +193,5 @@
> +    $sitemap_fh->close() || die "Could not close new sitemap file: $!";
> +
> +    # Compress the file
> +    system("/bin/gzip", "-f", $sitemap_path) == 0
> +        || die "Error executing gzip -f $sitemap_path";

please use IO::Compress::Gzip instead.
Attachment #519796 - Flags: review?(glob) → review-
Comment 29•13 years ago
Comment on attachment 519796 [details] [diff] [review]
Revised Sitemap extension without need for Search::Sitemap (v1)

>+    <url>
>+        <loc>" . $product_url . $product->name . "</loc>

actually you have to URL encode the name.
Comment 30•13 years ago
Thanks glob. Here is a new patch to address your suggested changes. Dave
Assignee: nobody → dkl
Attachment #519796 - Attachment is obsolete: true
Attachment #519983 - Flags: review?(glob)
Comment 32•13 years ago
Comment on attachment 519983 [details] [diff] [review]
Revised Sitemap extension without need for Search::Sitemap (v2)

::: extensions/SiteIndex/Extension.pm
@@ +116,5 @@
> +    my $sitemap_url = correct_urlbase() . SITEMAP_URL;
> +    $new_contents =~ s/SITEMAP_URL/$sitemap_url/;
> +    $new_file = new IO::File("$cgi_path/robots.txt", 'w');
> +    print $new_file $new_contents;
> +    $new_file->close();

you need to catch errors here.

::: extensions/SiteIndex/lib/Util.pm
@@ +103,5 @@
> +
> +    while (my ($bug_id, $delta_ts) = $bug_sth->fetchrow_array()) {
> +        push(@$bugs, { bug_id => $bug_id, delta_ts => $delta_ts });
> +        $last_id = $bug_id;
> +    }

you still need to change $last_id. it should be a count of the rows to skip, not the last bug id.

@@ +166,5 @@
> +
> +    foreach my $product (@$products) {
> +        $sitemap .= "
> +    <url>
> +        <loc>" . $product_url . xml_quote($product->name) . "</loc>

this needs to be a url quote, not xml quote. if the product name contains spaces, the url will be wrong.

@@ +188,5 @@
> +END
> +
> +    # Write the compressed sitemap data to a file in the cgi root so that they can
> +    # be accessed by the search engines.
> +    gzip \$sitemap => bz_locations()->{'cgi_path'} . "/sitemap$filecount.xml.gz";

errors need to be caught and managed.
Attachment #519983 - Flags: review?(glob) → review-
Comment 33•13 years ago
New patch addressing glob's suggested changes. Max, as soon as this is considered ready, I will make a patch for your current sitemap extension in bzr.mozilla.org and send it your way. Thanks Dave
Attachment #519983 - Attachment is obsolete: true
Attachment #520296 - Flags: review?(glob)
Comment 34•13 years ago
Comment on attachment 520296 [details] [diff] [review]
Revised Sitemap extension without need for Search::Sitemap (v3)

r=glob
Attachment #520296 - Flags: review?(glob) → review+
Updated•13 years ago
Whiteboard: [bmo-post-4.0] → [bmo4.0-resolved]
Comment 35•13 years ago
Patch to the committed SiteMapIndex extension that writes the index and sitemap files to extensions/SiteMapIndex/web instead, since the Bugzilla root is not writable by the web server user.

Dave
Attachment #520734 - Flags: review?(glob)
Comment 36•13 years ago
Comment on attachment 520734 [details] [diff] [review]
Patch to SiteMapIndex extension to write sitemap files to ext's webdir (v1)

>+    # Allow webserver to write to the web dir to generate sitemap files
>+    if (!ON_WINDOWS) {
>+        my $mode = 0770;
>+        chmod $mode, $self->package_dir . "/web";
>+    }

it'll be better to use Bugzilla::Install::Filesystem::DIR_CGI_WRITE instead of 0700, for different server configurations. you also need to catch chmod errors.

>-    <loc>" . correct_urlbase() . "$filename</loc>
>+    <loc>" . correct_urlbase() . "$package_dir/web/$filename</loc>

hrm, i'm not sure about this, because $package_dir is the filesystem path built from %INC. i'm concerned it may result in an absolute instead of relative path. normally you'd just do correct_urlbase() . "extensions/SiteMapIndex/web/$filename"

it would be nice if you could do $self->web_path from an extension :)

neither of these are show stoppers for bmo, so r=glob. they will need to be addressed before upstreaming.
Attachment #520734 - Flags: review?(glob) → review+
Comment 37•13 years ago
Comment on attachment 520734 [details] [diff] [review]
Patch to SiteMapIndex extension to write sitemap files to ext's webdir (v1)

actually, the patch doesn't work because it doesn't create the web/ directory.
Attachment #520734 - Flags: review+ → review-
Comment 38•13 years ago
Thanks glob. Hopefully this is getting close now. I found a few other issues and also create/chown/chmod the web directory properly now. Dave
Attachment #520734 - Attachment is obsolete: true
Attachment #520990 - Flags: review?(glob)
Comment 39•13 years ago
(In reply to comment #36)
> it'll be better to use Bugzilla::Install::Filesystem::DIR_CGI_WRITE instead of
> 0700, for different server configurations.
> you also need to catch chmod errors.

Also, there should be a hook to extend the Filesystem definitions. Is there not one already?

> it would be nice if you could do $self->web_path from an extension :)

Agreed, could you file a bug for that?
Comment 40•13 years ago
By the way, I think I should warn you in advance, I won't accept a writable web_path upstream--instead I would suggest having a script that delivers content from your own data/ subdirectory.
Comment 41•13 years ago
(In reply to comment #39)
> Also, there should be a hook to extend the Filesystem definitions. Is there
> not one already?

Not that I could see, no.

> Agreed, could you file a bug for that?

Got a patch and am submitting the bug now.
Comment 42•13 years ago
(In reply to comment #41)
> (In reply to comment #39)
> > Also, there should be a hook to extend the Filesystem definitions. Is there
> > not one already?
>
> Not that I could see any, no.

Okay, I'd love a patch for that, then, too.
Comment 43•13 years ago
(In reply to comment #40)
> By the way, I think I should warn you in advance, I won't accept a writable
> web_path upstream--instead I would suggest having a script that delivers
> content from your own data/ subdirectory.

I had a hunch that this would be considered uncool in some situations, but it seemed to work best for the needs of this extension. The data dir is normally locked down pretty tight except for data/attachments and data/webdot. I wasn't sure if having extensions start to create their own directories there could cause other problems, such as conflicting with other extensions. So at the time I felt it best to keep it all contained in its own extension directory somehow to make sure it never conflicted.

But aside from that, I could see creating a data/sitemapindex directory with a custom .htaccess file allowing access, if you think that would be a better choice.

Dave
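A minimal version of that .htaccess idea might look like the fragment below. This is hypothetical, Apache 2.2-era syntax of the kind Bugzilla's generated .htaccess files used at the time; it is not part of any attached patch, and the real rules would depend on how data/'s existing protections are written:

```apache
# Hypothetical data/sitemapindex/.htaccess: data/ is normally denied to
# the web entirely, so re-allow only the generated sitemap files here.
<FilesMatch "\.xml(\.gz)?$">
    Order allow,deny
    Allow from all
</FilesMatch>
```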
Comment 44•13 years ago
Yeah, it's kosher for extensions to create a data subdirectory with their own name. In fact, I vaguely recall even adding a mechanism for that, a data_dir or something, but maybe not. You could add a .htaccess in that directory, or you could just have a script deliver the contents of the files when they are asked for (which would be more broadly compatible with anything we might do with data/ in the future).
Comment 45•13 years ago
Revised patch using the new $self->web_dir Bugzilla::Extension attribute plus other fixes. May work on a new version that uses data/sitemapindex/* for sitemap files per discussion w/ Max.
Attachment #520990 - Attachment is obsolete: true
Attachment #521044 - Flags: review?(glob)
Attachment #520990 - Flags: review?(glob)
Comment 46•13 years ago
Comment on attachment 521044 [details] [diff] [review]
Patch to SiteMapIndex extension to write sitemap files to ext's webdir (v3)

>+    # Create the sitemap directory if does not exist
>+    my $sitemap_path = $self->web_dir;
>+    unless (-d $sitemap_path) {
>+        print "Creating $sitemap_path directory...\n";
>+        mkdir $sitemap_path or die "mkdir $sitemap_path failed: $!";
>+    }

nit: might be easier to just add the directory with bzr :)

>-    <loc>" . correct_urlbase() . "$filename</loc>
>+    <loc>" . correct_urlbase() . "$web_dir/$filename</loc>

sorry, but $web_dir is still using the filesystem path. we need the extension's web_path, which would be built from the urlbase, and the virtual path to the extension's web directory. i'm expecting something like the following in Bugzilla::Extension:

sub web_path {
    my $self = shift;
    my ($class) = ref($self) =~ /^.+::(.+)/;
    return correct_urlbase() . "extensions/$class/web";
}

although this may have issues with cpan-installed extensions. for this bug, i'd be happy with:

>+    <loc>" . correct_urlbase() . "extensions/SiteMapIndex/$filename</loc>

everything else looks excellent :)
Attachment #521044 - Flags: review?(glob) → review-
Comment 47•13 years ago
Here is a new patch that instead places the sitemap files in the data/SiteMapIndex directory rather than extensions/SiteMapIndex/web. Max suggested data would be a better choice for having a writable directory for this. This patch requires the new install_filesystem hook patch to be applied first, which is currently under review: https://bugzilla.mozilla.org/show_bug.cgi?id=644334

Dave
Attachment #521044 - Attachment is obsolete: true
Attachment #521310 - Flags: review?(glob)
Comment 48•13 years ago
Comment on attachment 521310 [details] [diff] [review]
Patch to SiteMapIndex extension to write sitemap files data dir (v1)

nice :) r=glob
Attachment #521310 - Flags: review?(glob) → review+
Comment 49•13 years ago
bmo/4.0:

Committing to: bzr+ssh://dlawrence%40mozilla.com@bzr.mozilla.org/bmo/4.0
modified extensions/SiteMapIndex/Config.pm
modified extensions/SiteMapIndex/Extension.pm
modified extensions/SiteMapIndex/lib/Constants.pm
modified extensions/SiteMapIndex/lib/Util.pm
Committed revision 7587.
Comment 50•13 years ago
Woo! Could somebody submit a patch upstream and work with me on getting this into the main Sitemap extension?
Comment 51•13 years ago
(In reply to comment #50)
> Woo! Could somebody submit a patch upstream and work with me on getting this
> into the main Sitemap extension?

Will do. I went ahead and got it in there to make our April 16th cut-over date. It is undergoing a security review as well. I will work on a patch, hopefully tomorrow, as next week is Mozilla All-Hands.

Dave
Comment 52•13 years ago
Cool, sounds great. :-) Yeah, I have absolutely no qualms with you checking it into bmo as it is, I totally understand.
Comment 53•13 years ago
This is now in place and you can test it out on google using "site:bugzilla.mozilla.org".
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated•13 years ago
Assignee: dkl → nobody
Component: Bugzilla: Other b.m.o Issues → Extensions
Product: mozilla.org → bugzilla.mozilla.org
QA Contact: general → extensions
Version: other → Current
Updated•5 years ago
Component: Extensions: Other → Extensions