Closed Bug 515644 Opened 15 years ago Closed 15 years ago

Versioning of the configuration data served by config.cgi (by ETag?)

Categories

(Bugzilla :: WebService, enhancement, P3)

enhancement

Tracking

()

RESOLVED FIXED
Bugzilla 3.6

People

(Reporter: thomas.ehrnhoefer, Assigned: Frank)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 5 obsolete files)

User-Agent:       Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)
Build Identifier: 3.4.1

Computing the repository configuration can, especially for large Bugzilla instances, be very expensive.
It would siginificantly lower the CPU usage on the server if the client requesting the configuraiton can pass a version identifier from the previous configuration he received, and the server checks that value before even trying to compute a new one.

As a first step, this could be done as a global value for the whole Bugzilla instance - whenever the configuration changes, the version is incremented.
For large instances there is a possible further enhancement, to version on a per product basis. But I think the global value would significantly reduce the load on the server by itself.

Reproducible: Always
Not sure there is a real benefit. Suggesting WONTFIX.
+1 for considering implementing this feature, initially with a simple global timestamp/hash.  Even a global indicator of change would have a significant impact on server load for large repositories (i.e. redhat's bugzilla). I recommend use of the existing HTTP ETag standard to implement this:

   http://en.wikipedia.org/wiki/HTTP_ETag
Blocks: bz-clients
  Yeah, an ETag sounds like a good idea, because that way we don't have to do a really complex calculation to determine the last-modified date of the info.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Priority: -- → P3
Summary: versioning of the repository configuration file (served by config.cgi) → Versioning of the configuration data served by config.cgi (by ETag?)
This is problematic. First, how do you check that the configuration changed? Unless each admin page increments some given value (where would it be stored?), I don't see how you can achieve that without doing all the work config.cgi already does. Second, data returned by config.cgi depends on your credentials, and so even if the configuration didn't change, the returned data may be different if you access config.cgi with different credentials and/or as another user. In that case, you have to redo all the work again.
Frederic, as you said, there are two things determining if the configuration has changed: the configuration itself and the permission scheme.
A rather simple first approach could just store a overall version number of the configuration not regarding permissions or users at all. This would not be very precise, but it will nevertheless eliminate a big bunch of unnecessary calls and computations. Additionally, a second versioning could be done for the permission scheme (rather on a per user base). If either one of those version numbers (probably stored in the DB) changed, then and only then the configuration will be downloaded.

A different approach (although I am not sure if this is in any way feasible) would be to store a per user version number that gets changed whenever a configuration part the user can see changed. This could be a very expensive operation though. I feel that the first approach is probably sufficient since even in a large instance the configuration probably wont change too often.

And on a side note but very much related:
Some smart people at eclipse.org also improved repository download time quite a lot by removing excessive white spaces as well as using compression. See 
398546: /config.cgi?ctype=rdf produces XML with about 30% whitespace and wastes bandwidth
https://bugzilla.mozilla.org/show_bug.cgi?id=398546
for details.
It's been in use now for 2 years, and it is said it's a 90% reduction in bandwith usage.
Here's how you do it:

  Every time you generate the config.rdf, you cache a copy in data/, and take an MD5 hash of it. You can send the MD5 hash as the ETag, and the creation date of the file as the Last-Modified. Then, next time you load config.rdf, you compare it to the file on disk, and if it has an identical MD5 hash, then you send a 304 if the client sent an If-Modified-Since header or whatever other headers are required for ETag.
This sounds right. My only though is that configurations are scoped per.  If the hash was only taken for the complete configuration this would work.  Alternatively, maybe if we use a timestamp for the etag that is stored when any aspect of the configuration changes, this could be sent in both the etag and last-modified header?
What do you mean by "the etag that is stored when any aspect of the configuration changes"?
I think Rob meant "use a timestamp (that is stored when any
aspect of the configuration changes) for the etag"
(In reply to comment #9)
> I think Rob meant "use a timestamp (that is stored when any
> aspect of the configuration changes) for the etag"

  Ah yeah, I think we're not going to do that.

  What's the primary concern here--that generating config.rdf is too hard on the Bugzilla server, or that it's too much data to download to the client?
(In reply to comment #10)
>   What's the primary concern here--that generating config.rdf is too ****
> the Bugzilla server, or that it's too much data to download to the client?

I think that depends on the server Bugzilla is running on and on it's bandwith. Eclipse.org mostly was concerned about the bandwith usage afaik, others (including our own instances) are more concerned about the CPU load on the server.

Why is the timestamp a no-go for you? Seems like that would be an easy way to determine if the configuration has changed and thus if computing and sending the new RDF is necessary or not. There is the possibility that even though the configuration changed a user's configuration RDF might be unaffected (due to scoping), but that's not a big deal I think.
The issue with the timestamp is that Bugzilla would need to be modified to store a "last time the configuration was changed" timestamp somewhere. This modification, while not complex, would be extensive. And we'd need to make sure that any future configuration changes set the flag.

Max's scheme, on the other hand, does not require changes anywhere other than config.cgi.

Max: how many cached copies of config.rdf (or config.json :-) would you store? Perhaps clean the directory out every 24 hours? Or every week?

One option would be to store them as
/data/config/<md5sum>.rdf
and then seeing if you send back a 304 means simply doing an open() call on "data/config/<sanitized-etag>.rdf" and seeing if you got data back.

Gerv
(In reply to comment #12)
> Max: how many cached copies of config.rdf (or config.json :-) would you store?
> Perhaps clean the directory out every 24 hours? Or every week?

  Well, it's honestly not that big in terms of disk space, so I think that we could store a copy for every user.

> One option would be to store them as
> /data/config/<md5sum>.rdf
> and then seeing if you send back a 304 means simply doing an open() call on
> "data/config/<sanitized-etag>.rdf" and seeing if you got data back.

  I think that's a great idea. Actually, we'd just have to do a "-f" check.
(In reply to comment #13)
> (In reply to comment #12)
> > Max: how many cached copies of config.rdf (or config.json :-) would you store?
> > Perhaps clean the directory out every 24 hours? Or every week?
> 
>   Well, it's honestly not that big in terms of disk space, so I think that we
> could store a copy for every user.

  Oh, and when doing the check, we could clean out any files that are older than an hour, I think.
(In reply to comment #6)
> Here's how you do it:
> 
>   Every time you generate the config.rdf, you cache a copy in data/, and take
> an MD5 hash of it. You can send the MD5 hash as the ETag, and the creation date
> of the file as the Last-Modified. Then, next time you load config.rdf, you
> compare it to the file on disk, ....

If I want to compare this I need the MD5 hash for the new version. The only way to get this is to produce the configuration only for calculate the value.

What do you think about the followings ways:

1) use an config param for the ETag and use the patch from https://bugzilla.mozilla.org/show_bug.cgi?id=504612 as an base of how this param get changed when an config element get changed.

2) do a delete of all files in the data/config dir. Assuming that the changed date is included in the stored configuration.

Thoughts?
(In reply to comment #15)
> If I want to compare this I need the MD5 hash for the new version.

  *You* don't have to compare it at all. We're going to be using standard HTTP functionality to return you a 304 if it hasn't changed.
Attached patch Patch, V1 (obsolete) — Splinter Review
This patch did not use a cache but has support for ETag and If-None-Match Header.

What do you think about this implementation.
Attachment #408288 - Flags: review?(mkanat)
Comment on attachment 408288 [details] [diff] [review]
Patch, V1

No, as I've said before, I don't want to keep track of every single thing in Bugzilla that could possibly modify the output of config.cgi.
Attachment #408288 - Flags: review?(mkanat) → review-
(In reply to comment #18)
> (From update of attachment 408288 [details] [diff] [review])
> No, as I've said before, I don't want to keep track of every single thing in
> Bugzilla that could possibly modify the output of config.cgi.

OK but then you must first create the whole output for config.cgi and have the whole CPU load for this.

Should I create a patch for this?
Attached patch Patch, V2 (obsolete) — Splinter Review
Now I only change the config.cgi!

This patch use MD5 as the ETag and needs no caching.
Attachment #408313 - Flags: review?(mkanat)
Comment on attachment 408313 [details] [diff] [review]
Patch, V2

Don't print headers. Use $cgi.

You need to re-work how config.cgi works so that items are always returned in a consistent order, otherwise the MD5 is meaningless.
Attachment #408313 - Flags: review?(mkanat) → review-
(In reply to comment #21)
> (From update of attachment 408313 [details] [diff] [review])
> Don't print headers. Use $cgi.
> 
> You need to re-work how config.cgi works so that items are always returned in a
> consistent order, otherwise the MD5 is meaningless.

I did not see what  items you mean.
get_legal_field_values, Bugzilla::Keyword->get_all, Bugzilla->active_custom_fields, $user->get_selectable_products, Bugzilla::Field->match({obsolete => 0}) all use "ORDER BY" and so the items are returned in an  consistent order.

Sorry I am not an Perl Expert so please tell me what I miss.
(In reply to comment #21)
> (From update of attachment 408313 [details] [diff] [review])
> Don't print headers. Use $cgi.

Sorry I did not use $cgi for print headers. 
What did you mean? 
Please tell me the line in my patch.
Anything that does keys %{ $hashref } will randomize the order. I seem to recall there's something in there that does that, or perhaps something that just calls ".keys" in the template.
That problem can be fixed by sorting the keys; in the template, that would mean replacing foo.keys with foo.keys.sort.

Frank: are you able to check both the template and the CGI for non-determinism and report back?

Thanks,

Gerv
(In reply to comment #25)
> That problem can be fixed by sorting the keys; in the template, that would mean
> replacing foo.keys with foo.keys.sort.
> 
> Frank: are you able to check both the template and the CGI for non-determinism
> and report back?

I did a check and did not found a problem in the files config.rdf.tmpl and config.js.tmpl.
There I only find things like
      [% FOREACH item = items %]
and this is save.

In the file config.cgi I  found:
1) Object->get_legal_field_values and this returns an a reference to a list of valid values
2) Bugzilla::Keyword->get_all and this use Object->_do_list_select which returns an arrayref.
3) $user->get_selectable_products returns an arrayref.
4) Bugzilla->active_custom_fields internal use  Bugzilla::Field->match which returns an arrayref.

So I can not find any hashref in config.cgi


Can someone verify this?
Sounds fine to me.

Regarding the "don't use print" thing, I think what Max meant was that we have a CGI object, $cgi, which has methods for printing headers and so on, and he would like you to use those rather than printing the headers directly. 

Something like this?
http://perldoc.perl.org/CGI.html#CREATING-A-STANDARD-HTTP-HEADER:

Gerv
Yeah, that's what I meant, thank you Gerv. :-) I've been so busy lately that I hardly have time to do reviews, but I really did want to respond to Frank as quickly as possible.
Attached patch patch, V3 (obsolete) — Splinter Review
Patch V2 was change in the way requested in comment#23
Attachment #408313 - Attachment is obsolete: true
Attachment #408706 - Flags: review?(mkanat)
Comment on attachment 408706 [details] [diff] [review]
patch, V3

>Index: config.cgi
>+    my $ctx = Digest::MD5->new;
>+    $ctx->add($output);
>+    my $digest = $ctx->b64digest;

  It's simpler to just use the function interface.

  Also, if the string utf8::is_utf8(), then you'll need to decode it--look at other places that md5_base64() is used (like contrib/recode.pl).

>+    my $if_none = $ENV{'HTTP_IF_NONE_MATCH'};

  I'd rather use $cgi->http() for that header. (Using $cgi->http() gives us the opportunity to override things if we ever have to, and centralizes calls to data.)

>+    if ($if_none eq $digest) {

  You have to check for "*" in If-None-Match:

  " As a special case, the value "*" matches any current entity of the resource. "

  Also, If-None-Match is a comma-separated and quoted list, so we need to check each item in the list, and remove quotes.

  (It's 14.26 here: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html )

  Oh, also, I didn't realize that the original script was using "print" to print headers! Thanks for fixing it up. :-)

>+       print $cgi->header('text/html','304 Not Modified'); 

  I think you can do $cgi->header(-status => 304) (which would be simpler) but I'm not certain.

>+         print "$output";

  Nit: No need to put it in quotes.


  Otherwise, looks generally pretty good! :-) Excellent simple solution! :-)
Attachment #408706 - Flags: review?(mkanat) → review-
Attached patch patch, V4 (obsolete) — Splinter Review
points from review of patch V3 are included.
Attachment #408288 - Attachment is obsolete: true
Attachment #408706 - Attachment is obsolete: true
Attachment #409200 - Flags: review?(mkanat)
Comment on attachment 409200 [details] [diff] [review]
patch, V4

>+    my $if_none_string = $cgi->http('HTTP_IF_NONE_MATCH') || "";

  Nit: I think 'If-None-Match' would be more consistent with how we use this (or are going to use this) elsewhere.

>+    my $found304= '';

  Nit: Space after =. Also, it's actually not necessary to initialize it.

>+    my @if_none = split(/,/,$if_none_string);

  This should be split(/[\s,]+/, $if_none_string);

>+        $if_none=~ s/^\s*\"?//g;
>+        $if_none=~ s/\"?\s*$//g;

  Which makes the \s in these unnecessary. Also, space before =~.

>+        $found304= $digest if ($if_none eq $digest || $if_none eq '*');

  Nit: You could do "last" here, too, if you found it. (Also: Space before =)

>+   if ($found304 ne '') {

  We can just do "if ($found304)" here.

>+        print $cgi->header(-type=>'text/html',
>+                           -ETag=>$found304,

  I'm not sure "*" is a valid ETag value, though, so should we be including it if it's "*"?

>+                           -status=>'304 Not Modified');

  Nit: Spaces before and after =>

>+        print $cgi->header (-ETag=>$digest, -type=> $format->{'ctype'});

  Nit: Spaces before and after =>.
Attachment #409200 - Flags: review?(mkanat) → review-
...but thank you for your patch, and I think it's pretty close to being ready :-)

Gerv
Attached patch patch, V5 (obsolete) — Splinter Review
next try
Attachment #409200 - Attachment is obsolete: true
Attachment #409541 - Flags: review?(mkanat)
Attachment #409541 - Flags: review?(mkanat) → review+
Comment on attachment 409541 [details] [diff] [review]
patch, V5

>Index: config.cgi
>+use Encode qw(encode decode is_utf8);

  You don't need to use that. You can use utf8::encode.

>@@ -109,12 +112,41 @@
>+    my $output = '';

  Nit: Technically, you don't need to initialize that variable, I think.

>+    my $if_none_match = $cgi->http('HTTP_IF_NONE_MATCH') || "";

  Nit: Let's use 'If-None-Match'.

>+        $if_none =~ s/^\"?//g;
>+        $if_none =~ s/\"?$//g;

  The question marks aren't necessary.

>+        if ($if_none eq $digest || $if_none eq '*') {
>+            # leave the loop after the first match
>+            $found304= $if_none;

  Nit: Space before the equals sign.


  Otherwise, this looks fine. The above can all be fixed on checkin.
Flags: approval+
Target Milestone: --- → Bugzilla 3.6
Assignee: webservice → Frank
Status: NEW → ASSIGNED
Checking in config.cgi;
/cvsroot/mozilla/webtools/bugzilla/config.cgi,v  <--  config.cgi
new revision: 1.32; previous revision: 1.31
done
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Keywords: relnote
Resolution: --- → FIXED
Attachment #409541 - Attachment is obsolete: true
Attachment #411228 - Flags: review+
Added to the release notes in bug 547466.
Keywords: relnote
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: