Closed Bug 223988 Opened 21 years ago Closed 13 years ago

Let search engines index bugs (perhaps by installing the Sitemap extension)

Categories

(bugzilla.mozilla.org :: Extensions, enhancement, P5)

Production
x86
Linux
enhancement

Tracking


RESOLVED FIXED

People

(Reporter: myk, Unassigned)

References


Details

(Whiteboard: [bmo4.0-resolved])

Attachments

(3 files, 6 obsolete files)

We should create static versions of each non-secure bug's "show bug" page daily
and let Google index them so people can take advantage of Google's search
capabilities to search for bugs.  Perhaps we can do funny tricks with redirects
so people going to the static page get pointed at the dynamic one easily.

timeless points out that we don't want Google to index bugs that subsequently
become secure and offers some suggestions for mitigating this problem:

<timely> i'd suggest publishing all resolved bugs
<timely> the odds of them transitioning to security are very slim
<timely> it's not as big of a win, but it's sort of useful
<timely> we could also consider publishing all stale bugs (say >7days stale)
<timely> the odds of those transitioning to security w/o some thought aren't very high
<timely> as soon as the bug loses its stale state you kill the google cachable doc and possibly replace it w/ something which makes google more likely to replace its cached doc w/ something else
I'm a bit worried by this.  Please make sure you remove mail addresses in the
published version and that all links back to bugzilla/bonsai/lxr in comments
are removed.
the links comment is reasonable, but how do you handle email addresses?

consider that this comment is posted by timeless @ bemail . org which doesn't
deliver mail. But, there's a real name field which contains my real mail box.
(in case it isn't obvious, i don't want to get spam to my real box just
because we're protecting someone else's bugmail address.)

not being able to search for people's comments is problematic. Searching for
bz and timeless can be helpful (if you ignore the fact that we comment in a
large number of bugs).

One approach would be to expose a user id instead. This of course only works if
there's a relatively easy way for people to find those user ids.

And what do you do about things like: "well nobody@mozilla.org wrote: ..."
or no-such-user@example.com (hopefully never an account in bmo)? We can convert
the former into user-id. I suppose the latter should just get &#40;'d or
whatever we currently use.

Myk: do we want to googlebait?
something like making the front page have a link to a page which lists all
cached bugs.
> so people can take advantage of Google's search
> capabilities to search for bugs.

Bugzilla now has full-text indexing so you can do a Google-like search anyway.
Now that we have that, there seems to be a lot of hassle and not much gain in
attempting this. I'd call it a WONTFIX.

Gerv
Seems like (In reply to comment #3)
> > so people can take advantage of Google's search
> > capabilities to search for bugs.
> 
> Bugzilla now has full-text indexing so you can do a Google-like search anyway.
> Now that we have that, there seems to be a lot of hassle and not much gain in
> attempting this. I'd call it a WONTFIX.
> 
> Gerv

Does indeed seem like an idea that has passed its prime and doesn't appear to
have interest from y'all at bmo.  I'm surprised no one has marked it wontfix!
Yeah, the effort is probably not worth it.  But it *is* worth letting Google
index our pages, even though we now have fulltext search, since they'll
aggregate our pages with other results in useful ways, and they may work better
for certain searches.  But instead of creating static versions of the pages, we
should just make Bugzilla run under mod_cgi/SpeedyCGI and then let Google index
the live versions.

Updating the summary accordingly.
Summary: create static "show bug" pages daily and let google index them → let Google index bugs once Bugzilla runs under mod_cgi/SpeedyCGI
You mean mod_perl I assume.  We already run under mod_cgi ;)
Summary: let Google index bugs once Bugzilla runs under mod_cgi/SpeedyCGI → let Google index bugs once Bugzilla runs under mod_perl/SpeedyCGI
(In reply to comment #6)
> You mean mod_perl I assume.  We already run under mod_cgi ;)

Yah. :-)
*** Bug 330493 has been marked as a duplicate of this bug. ***
Assignee: myk → justdave
Priority: -- → P5
QA Contact: myk → reed
Hasn't Bugzilla been running on mod_perl since the last upgrade to 3.0?
Rearranging this component so I'm not the default assignee, so that it doesn't look like I'm intending to work on bugs that other people could be taking care of if they didn't think I was already doing them.  If this bug is a software issue on b.m.o and you'd like to fix it, the modified source is now available for patching (see bug 373688).  Reassigning to the new default owner.
Assignee: justdave → nobody
QA Contact: reed → other-bmo-issues
There have been several discussions about this outside of this bug, and timeless keeps saying we can't let it be cacheable for security reasons.  I guess the question is no longer whether Bugzilla can handle being scraped (because it can now), but whether we want to allow it.
Summary: let Google index bugs once Bugzilla runs under mod_perl/SpeedyCGI → let Google index bugs
Here you go, here's a patch that solves that problem.
Assignee: nobody → mkanat
Status: NEW → ASSIGNED
Attachment #414540 - Flags: review?(justdave)
Attachment #414540 - Flags: feedback?(timeless)
http://www.google.com/intl/ru/remove.html

    * To remove cached versions of a page from Google's index and prevent Google from caching the page in the future, you must add a noarchive meta tag to that page. The next time we crawl that site, we'll see the tag and remove the page. 

To prevent all search engines from showing a "Cached" link for your site, place this tag in the <HEAD> section of your page:

<meta name="robots" content="noarchive">

Note: Using a noarchive metatag removes only the "Cached" link for the page. Google will continue to index the page and display a snippet.

the snippet bit worries me... is there a way we can suppress this too?
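For what it's worth, Google also documents a "nosnippet" robots directive; assuming it behaves as documented, a combined tag along these lines should suppress both the cached copy and the snippet (worth verifying against Google's current documentation):

<meta name="robots" content="noarchive, nosnippet">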
Comment on attachment 414540 [details] [diff] [review]
Allow robots to access show_bug and don't let them cache it

please incorporate the bit from biesi
Attachment #414540 - Flags: feedback?(timeless) → feedback-
note that in the past i've specifically asked for us to limit things to bugs which are old. i'd still like to see that. a bug which is >1 month old is less concerning to me than a bug which our triagers haven't had a chance to see.
  Okay, instead of doing any of that, I made a Sitemap extension for Bugzilla. It's here:

  http://bzr.mozilla.org/bugzilla/extensions/sitemap/trunk

  Its home page is here:

  http://code.google.com/p/bugzilla-sitemap

  It can provide the security protections that you need. It requires Bugzilla 3.6, though, of course, since it's an extension.
Attachment #414540 - Flags: review?(justdave)
Depends on: bmo-upgrade-3.6
Summary: let Google index bugs → let Google index bugs, using the Sitemap extension
Assignee: mkanat → justdave
Blocks: 577731
There is a security question here, which is discussed over in bug 577732. Is there a risk that Google will index security bugs before they have been made secure, if they are not made that way from the start?

If I were an attacker with any vague interest in Firefox security bugs, I would set up a job to poll Bugzilla regularly and download a copy of each new bug and its attachments, and then check back a bit later to see if that bug had been made secure. This is not complicated to do, and would not, I suspect, be sufficient activity to raise alarm bells in IT.

Given that, I don't think we gain much from setting the delay before Google indexes a bug to anything greater than 24 hours. And we gain a lot from keeping it small; the search is more effective. So I'd argue for a 24h delay maximum.

However, we should also add the noarchive meta tag explained in comment #13 to the "you are not permitted to see this bug" error page which results when a secure bug is accessed by someone with insufficient permissions. It can't hurt.

Gerv
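As a minimal sketch of that suggestion (the exact template behind the "you are not permitted to see this bug" error page isn't named here, so treat the placement as an assumption), the tag added to that page's <HEAD> would look like:

<meta name="robots" content="noindex, noarchive">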
(In reply to comment #19)
> There is a security question here, which is discussed over in bug 577732. Is
> there a risk that Google will index security bugs before they have been made
> secure, if they are not made that way from the start?

I assume you mean bug 577731.

As for the security question, see http://code.google.com/p/bugzilla-sitemap/#Security.
Attached patch chang (obsolete) — Splinter Review
how to chang jit
Attachment #457513 - Attachment is obsolete: true
Summary: let Google index bugs, using the Sitemap extension → Let search engines index bugs (perhaps by installing the Sitemap extension)
Assignee: justdave → nobody
It would be nice to get this in for the bmo/4.0 upgrade, but we may need to do some discussion about it first. Search::Sitemap requires Moose, which will bring in quite a few dependencies, and we need to gauge the impact that will cause.

Dave
(In reply to comment #22)
> It would be nice to get this in for the bmo/4.0 upgrade, but we may need to do
> some discussion about it first. Search::Sitemap requires Moose, which will
> bring in quite a few dependencies, and we need to gauge the impact that will
> cause.

  For what it's worth, it should work with the perl-Moose package that's in rpmforge.
Whiteboard: [bmo-post-4.0]
Here is a revised Sitemap extension that removes the need for Search::Sitemap and handles all bugs instead of just the last 50k changed. It divides the bugs into separate gzipped sitemap files and creates a sitemap index that the crawler can use. It has a SITEMAP_AGE of 12 hours by default so that the full sitemap files are not regenerated on every request. Also, according to the sitemaps.org specification, you do not necessarily need to ping the search engines when a new version is available; having the link to the sitemap index in the robots.txt file should be sufficient. In the past I have done this in a cron job, but here it is triggered automatically, via extension hooks, when an engine accesses page.cgi?id=sitemap/sitemap.xml.

Dave
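For reference, a sitemap index in the sitemaps.org format looks roughly like the following; the file names, host and dates are illustrative, not taken from the patch:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://bugzilla.mozilla.org/sitemap1.xml.gz</loc>
    <lastmod>2011-03-22T00:00:00Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://bugzilla.mozilla.org/sitemap2.xml.gz</loc>
    <lastmod>2011-03-22T00:00:00Z</lastmod>
  </sitemap>
</sitemapindex>

and the robots.txt side of it is a single line pointing crawlers at the index:

Sitemap: http://bugzilla.mozilla.org/page.cgi?id=sitemap/sitemap.xml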
Attachment #519796 - Flags: review?(glob)
  Why remove Search::Sitemap?
(In reply to comment #25)
>   Why remove Search::Sitemap?

heavy dependencies for simple requirements.
(In reply to comment #26)
> heavy dependencies for simple requirements.

  That's not a really good design argument. If you have rpmforge installed, the requirements are accomplished by "yum install perl-Search-Sitemap".
Comment on attachment 519796 [details] [diff] [review]
Revised Sitemap extension without need for Search::Sitemap (v1)

excellent :)

as it's a fork of the sitemap extension, it should have a different name.

::: extensions/Sitemap/lib/Constants.pm
@@ +30,5 @@
+);
+
+# This is the amount of time a sitemap index and it's files are considered
+# valid before needing to be regenerated
+use constant SITEMAP_AGE => 12;

you should note that the age is in hours.
nit: s/it's/its/

@@ +34,5 @@
+use constant SITEMAP_AGE => 12;
+
+# This is the largest number of entries that can be in a sitemap,
+# per the sitemaps.org standard.
+use constant SITEMAP_MAX => 50_000;

this comment isn't accurate anymore :)

@@ +39,5 @@
+
+# We only show bugs that are 12 hours old, because if somebody
+# files a bug that's a security bug but doesn't protect it, we
+# want to give them time to fix that.
+use constant SITEMAP_DELAY => 12;

again clarify the units used.

@@ +41,5 @@
+# files a bug that's a security bug but doesn't protect it, we
+# want to give them time to fix that.
+use constant SITEMAP_DELAY => 12;
+
+use constant SITEMAP_URL => 'page.cgi?id=sitemap/sitemap.xml';

this const is never used.  perhaps it should be used by _fix_robots_txt(), and
the entry removed from robots.txt

::: extensions/Sitemap/lib/Util.pm
@@ +52,5 @@
+# and links to bugs.
+sub generate_sitemap {
+    # If file is less than SITEMAP_AGE hours old, then read in and send to caller.
+    # If greater, then regenerate and send the new version.
+    my  $index_path = bz_locations()->{'cgi_path'} . "/sitemap_index.xml";

i'd prefer for the files to live in a sitemap subdir.. "/sitemap/index.xml",
etc.

although if the regeneration isn't triggered by cron, then it's probably a good
idea to put the main index.xml into data/, to ensure robots can't request it
without triggering a re-gen (with the files the index references in sitemap/).

@@ +101,5 @@
+        $bug_sth->execute($hours_ago, $last_id);
+
+        while (my ($bug_id, $delta_ts) = $bug_sth->fetchrow_array()) {
+            push(@$bugs, { bug_id => $bug_id, delta_ts => $delta_ts });
+            $last_id = $bug_id;

sql_limit expects the number of rows to skip as the second parameter, so
passing in the bug_id won't work.
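As a rough sketch of the offset-based paging being described (assuming Bugzilla's $dbh->sql_limit($limit, $offset) helper; the query text and variable names are illustrative):

my $offset = 0;
while (1) {
    my $bug_sth = $dbh->prepare(
        "SELECT bug_id, delta_ts FROM bugs WHERE delta_ts <= ?
          ORDER BY bug_id " . $dbh->sql_limit(SITEMAP_MAX, $offset));
    $bug_sth->execute($hours_ago);
    my $bugs = $bug_sth->fetchall_arrayref({});
    last unless @$bugs;
    # ... write this batch out as sitemap$filecount.xml.gz ...
    $offset += SITEMAP_MAX;
}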

@@ +107,5 @@
+
+        last if !@$bugs;
+
+    # We only need the product links in the first sitemap file
+    $products = [] if $filecount > 1;

nit: tabs

@@ +114,5 @@
+        $filecount++;
+    }
+
+    # Generate index file
+    return _generate_sitemap_index($filelist);

nit: files no longer referenced by the index.xml should be deleted.

@@ +165,5 @@
+
+    foreach my $product (@$products) {
+        $sitemap .= "
+  <url>
+    <loc>" . $product_url . $product->name . "</loc>

you need to xml escape the product name.

@@ +193,5 @@
+    $sitemap_fh->close() || die "Could not close new sitemap file: $!";
+
+    # Compress the file
+    system("/bin/gzip", "-f", $sitemap_path) == 0
+        || die "Error executing gzip -f $sitemap_path";

please use IO::Compress::Gzip instead.
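A minimal sketch of that replacement, assuming the in-memory $sitemap string from the patch (IO::Compress::Gzip exports gzip and $GzipError):

use IO::Compress::Gzip qw(gzip $GzipError);

# Compress the in-memory sitemap straight to the target file,
# dying with the module's error message on failure.
gzip \$sitemap => "$sitemap_path.gz"
    or die "Could not gzip sitemap to $sitemap_path.gz: $GzipError";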
Attachment #519796 - Flags: review?(glob) → review-
Comment on attachment 519796 [details] [diff] [review]
Revised Sitemap extension without need for Search::Sitemap (v1)

>+  <url>
>+    <loc>" . $product_url . $product->name . "</loc>

actually you have to URL encode the name.
Thanks glob. Here is a new patch to address your suggested changes.

Dave
Assignee: nobody → dkl
Attachment #519796 - Attachment is obsolete: true
Attachment #519983 - Flags: review?(glob)
  You don't have to fork this. Send me a diff.
OS: Linux → Windows CE
OS: Windows CE → Linux
Comment on attachment 519983 [details] [diff] [review]
Revised Sitemap extension without need for Search::Sitemap (v2)

::: extensions/SiteIndex/Extension.pm
@@ +116,5 @@
+    my $sitemap_url = correct_urlbase() . SITEMAP_URL;
+    $new_contents =~ s/SITEMAP_URL/$sitemap_url/;
+    $new_file = new IO::File("$cgi_path/robots.txt", 'w');
+    print $new_file $new_contents;
+    $new_file->close();

you need to catch errors here.
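A minimal sketch of the error handling being asked for, reusing the variables from the hunk above:

my $new_file = IO::File->new("$cgi_path/robots.txt", 'w')
    or die "Could not open $cgi_path/robots.txt for writing: $!";
print {$new_file} $new_contents
    or die "Could not write to $cgi_path/robots.txt: $!";
$new_file->close()
    or die "Could not close $cgi_path/robots.txt: $!";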

::: extensions/SiteIndex/lib/Util.pm
@@ +103,5 @@
+
+        while (my ($bug_id, $delta_ts) = $bug_sth->fetchrow_array()) {
+            push(@$bugs, { bug_id => $bug_id, delta_ts => $delta_ts });
+            $last_id = $bug_id;
+        }

you still need to change $last_id.
it should be a count of the rows to skip, not the last bug id.

@@ +166,5 @@
+
+    foreach my $product (@$products) {
+        $sitemap .= "
+  <url>
+    <loc>" . $product_url . xml_quote($product->name) . "</loc>

this needs to be a url quote, not xml quote.  if the product name contains
spaces, the url will be wrong.
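A sketch of the fix, assuming Bugzilla::Util's url_quote is available (URI::Escape's uri_escape would do the same job):

use Bugzilla::Util qw(url_quote);

$sitemap .= "
  <url>
    <loc>" . $product_url . url_quote($product->name) . "</loc>";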

@@ +188,5 @@
+END
+
+    # Write the compressed sitemap data to a file in the cgi root so that they can
+    # be accessed by the search engines.
+    gzip \$sitemap => bz_locations()->{'cgi_path'} . "/sitemap$filecount.xml.gz";

errors need to be caught and managed.
Attachment #519983 - Flags: review?(glob) → review-
New patch addressing glob's suggested changes. 

Max, as soon as this is considered ready, I will make a patch for your current sitemap extension in bzr.mozilla.org and send it your way.

Thanks
Dave
Attachment #519983 - Attachment is obsolete: true
Attachment #520296 - Flags: review?(glob)
Comment on attachment 520296 [details] [diff] [review]
Revised Sitemap extension without need for Search::Sitemap (v3)

r=glob
Attachment #520296 - Flags: review?(glob) → review+
Whiteboard: [bmo-post-4.0] → [bmo4.0-resolved]
Patch to the committed SiteMapIndex extension that writes the index and sitemap files to extensions/SiteMapIndex/web instead, since the Bugzilla root is not writable by the web server user.

Dave
Attachment #520734 - Flags: review?(glob)
Comment on attachment 520734 [details] [diff] [review]
Patch to SiteMapIndex extension to write sitemap files to ext's webdir (v1)

>+    # Allow webserver to write to the web dir to generate sitemap files
>+    if (!ON_WINDOWS) {
>+        my $mode = 0770;
>+        chmod $mode, $self->package_dir . "/web";
>+    }

it'll be better to use Bugzilla::Install::Filesystem::DIR_CGI_WRITE instead of 0770, for different server configurations.
you also need to catch chmod errors.
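A minimal sketch of what's being asked for, assuming ON_WINDOWS comes from Bugzilla::Constants as in the patch and that DIR_CGI_WRITE is reachable as a fully qualified constant:

if (!ON_WINDOWS) {
    my $web_dir = $self->package_dir . "/web";
    chmod(Bugzilla::Install::Filesystem::DIR_CGI_WRITE, $web_dir)
        or warn "Could not chmod $web_dir: $!";
}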


>-    <loc>" . correct_urlbase() . "$filename</loc>
>+    <loc>" . correct_urlbase() . "$package_dir/web/$filename</loc>

hrm, i'm not sure about this, because $package_dir is the filesystem path built from %INC.  i'm concerned it may result in an absolute instead of relative path.

normally you'd just do correct_urlbase() . "extensions/SiteMapIndex/web/$filename"
it would be nice if you could do $self->web_path from an extension :)



neither of these are show stoppers for bmo, so r=glob
they will need to be addressed before upstreaming.
Attachment #520734 - Flags: review?(glob) → review+
Comment on attachment 520734 [details] [diff] [review]
Patch to SiteMapIndex extension to write sitemap files to ext's webdir (v1)

actually, the patch doesn't work because it doesn't create the web/ directory.
Attachment #520734 - Flags: review+ → review-
Thanks glob. Hopefully this is getting close now. I found a few other issues, and the patch now creates/chowns/chmods the web directory properly.

Dave
Attachment #520734 - Attachment is obsolete: true
Attachment #520990 - Flags: review?(glob)
(In reply to comment #36)
> it'll be better to use Bugzilla::Install::Filesystem::DIR_CGI_WRITE instead of
> 0700, for different server configurations.
> you also need to catch chmod errors.

  Also, there should be a hook to extend the Filesystem definitions. Is there not one already?

> it would be nice if you could do $self->web_path from an extension :)

  Agreed, could you file a bug for that?
By the way, I think I should warn you in advance, I won't accept a writable web_path upstream--instead I would suggest having a script that delivers content from your own data/ subdirectory.
(In reply to comment #39)
>   Also, there should be a hook to extend the Filesystem definitions. Is there
> not one already?

Not that I could see any, no.
 
>   Agreed, could you file a bug for that?

Got a patch and submitting bug now.
(In reply to comment #41)
> (In reply to comment #39)
> >   Also, there should be a hook to extend the Filesystem definitions. Is there
> > not one already?
> 
> Not that I could see any, no.

  Okay, I'd love a patch for that, then, too.
(In reply to comment #40)
> By the way, I think I should warn you in advance, I won't accept a writable
> web_path upstream--instead I would suggest having a script that delivers
> content from your own data/ subdirectory.

I had a hunch that this would be considered uncool in some situations, but it seemed to work best for the needs of this extension. The data dir is normally locked down pretty tight except for data/attachments and data/webdot. I wasn't sure whether extensions starting to create their own directories there could cause other problems, such as conflicting with other extensions. So at the time I felt it best to keep it all contained in its own extension directory somehow, to make sure it never conflicted.

But aside from that, I could see creating a data/sitemapindex directory with a custom htaccess file allowing access if you think that would be a better choice.

Dave
Yeah, it's kosher for extensions to create a data subdirectory with their own name. In fact, I vaguely recall even adding a mechanism for that, a data_dir or something, but maybe not. You could add a .htaccess in that directory, or you could just have a script deliver the contents of the files when they are asked for (which would be more broadly compatible with anything we might do with data/ in the future).
Revised patch using the new $self->web_dir Bugzilla::Extension attribute plus other fixes. I may work on a new version that uses data/sitemapindex/* for the sitemap files, per discussion with Max.
Attachment #520990 - Attachment is obsolete: true
Attachment #521044 - Flags: review?(glob)
Attachment #520990 - Flags: review?(glob)
Depends on: 643904
Comment on attachment 521044 [details] [diff] [review]
Patch to SiteMapIndex extension to write sitemap files to ext's webdir (v3)

>+    # Create the sitemap directory if does not exist
>+    my $sitemap_path = $self->web_dir;
>+    unless (-d $sitemap_path) {
>+        print "Creating $sitemap_path directory...\n";
>+        mkdir $sitemap_path or die "mkdir $sitemap_path failed: $!";
>+    }

nit: might be easier to just add the directory with bzr :)

>-    <loc>" . correct_urlbase() . "$filename</loc>
>+    <loc>" . correct_urlbase() . "$web_dir/$filename</loc>

sorry, but $web_dir is still using the filesystem path.
we need the extension's web_path, which would be built from the urlbase, and the virtual path to the extension's web directory.

i'm expecting something like the following in Bugzilla::Extension:

sub web_path {
    my $self = shift;
    my ($class) = ref($self) =~ /^.+::(.+)/;
    return correct_urlbase() . "extensions/$class/web";
}

although this may have issues with cpan-installed extensions.

for this bug, i'd be happy with:

>+    <loc>" . correct_urlbase() . "extensions/SiteMapIndex/$filename</loc>


everything else looks excellent :)
Attachment #521044 - Flags: review?(glob) → review-
Here is a new patch that instead places the sitemap files in the data/SiteMapIndex directory and not extensions/SiteMapIndex/web. Max suggested data would be a better choice for having a writable directory for this.

This patch requires the new install_filesystem hook patch to be applied first, which is currently under review.

https://bugzilla.mozilla.org/show_bug.cgi?id=644334

Dave
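A rough sketch of what the hook usage might look like, assuming the interface from bug 644334 exposes a create_dirs map of path => permissions (the key names here are assumptions, not taken from the final patch):

sub install_filesystem {
    my ($self, $args) = @_;
    my $sitemap_dir = bz_locations()->{'datadir'} . '/SiteMapIndex';

    # Ask checksetup.pl to create the directory, writable by the web server.
    $args->{create_dirs}->{$sitemap_dir}
        = Bugzilla::Install::Filesystem::DIR_CGI_WRITE;
}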
Attachment #521044 - Attachment is obsolete: true
Attachment #521310 - Flags: review?(glob)
Depends on: 644334
Comment on attachment 521310 [details] [diff] [review]
Patch to SiteMapIndex extension to write sitemap files data dir (v1)

nice :)
r=glob
Attachment #521310 - Flags: review?(glob) → review+
bmo/4.0:
Committing to: bzr+ssh://dlawrence%40mozilla.com@bzr.mozilla.org/bmo/4.0
modified extensions/SiteMapIndex/Config.pm
modified extensions/SiteMapIndex/Extension.pm
modified extensions/SiteMapIndex/lib/Constants.pm
modified extensions/SiteMapIndex/lib/Util.pm
Committed revision 7587.
Woo! Could somebody submit a patch upstream and work with me on getting this into the main Sitemap extension?
(In reply to comment #50)
> Woo! Could somebody submit a patch upstream and work with me on getting this
> into the main Sitemap extension?

Will do. I went ahead and got it in there to make our April 16th cut-over date. It is undergoing a security review as well. I will work on a patch hopefully tomorrow, as next week is Mozilla All-Hands.

Dave
Cool, sounds great. :-) Yeah, I have absolutely no qualms with you checking it into bmo as it is, I totally understand.
This is now in place and you can test it out on google using "site:bugzilla.mozilla.org".
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Assignee: dkl → nobody
Component: Bugzilla: Other b.m.o Issues → Extensions
Product: mozilla.org → bugzilla.mozilla.org
QA Contact: general → extensions
Version: other → Current
Component: Extensions: Other → Extensions