Closed Bug 80157 Opened 23 years ago Closed 21 years ago

patch that adds "regenerate" option to collectstats.pl

Categories

(Bugzilla :: Reporting/Charting, enhancement, P3)

2.11
enhancement

Tracking

()

RESOLVED FIXED
Bugzilla 2.18

People

(Reporter: rwalters, Assigned: gerv)

References

Details

Attachments

(1 file, 9 obsolete files)

I've added a "regenerate" command line option to collectstats.pl, similar to the 
option found in the "processmail" script.  It will regenerate all the 
data/mining files from scratch, based on the current bug_status of each bug and 
tracing back the changes to the bug_status field as recorded in the 
bugs_activity table.

I'm going to attach a patch that can be applied to rev 1.7 of collectstats.pl to 
add the enhancement.  Sorry I'm not operating off the tip but it's been a while 
since I did a cvs update.  Now it looks like for this to work in the latest 
Bugzilla some merging will have to be done.
Target Milestone: --- → Bugzilla 2.16
This is cool :-)

Based on visual inspection, I'll confirm for those who haven't looked at it that 
this snarfs the snapshot of what the bugs looked like on a given day based on the 
bugs starting status and applying changes listed in bugs_activity prior to that 
day.

I think it may be missing some fields in the generated files however, and/or may 
not be creating the files in the proper format.  CCing Gerv because he's messed 
with that code most recently.
Yep, this patch is from before the big changes which added a large number of 
other states to chart. From superficial inspection, the idea seems sound, 
though. 

rwalters@qualcomm.com - would you be up for fixing this patch up for the new, 
shiny, chart-lots-of-things world?

If not, I could give it a go, but can't promise to get to it for three weeks or 
so.

Gerv
Sure, I'll see what I can do.  Maybe I can make it run a little faster too.
I tried this patch on our local install, and running "./collectstats.pl 
regenerate" gives the error:

Can't locate object method "autoflush" via package "IO::Handle" 
at ./collectstats.pl line 233.
Stephen: look in your checksetup.pl file for the version checks, and add this 
line:

unless (have_vers("IO::Handle",0))        { push @missing,"IO::Handle" }

And report what output you get from it.

My system has autoflush, it reports 1.21.  I found reference to autoflush in the 
ChangeLog for version 1.16, however, I don't see anything in the ChangeLog 
describing when it was added.  They only kept track back to 1.10 though.
Initially the { push @missing,"IO::Handle" } gave a compilation error, 
as "@missing" didn't exist. When I removed that clause, I got the required 
output, including:

Checking for      IO::Handle (any)     ok: found v1.21

After looking at a couple of man pages, I found that adding the following line 
to the top of collectstats.pl made the error go away and made it do something 
useful:

use IO::Handle;

I notice that the regenerate option gives no output. Perhaps it could output 
something like:
Regenerating -All- ..................................
Regenerating TestProduct ..................................
etc.

with a . printed for each date, so that you know it is doing something?
I can confirm that when modified as stated above, the new regenerate option on 
our local bugzilla installation generates identical output for the date range 
over which we had data previously. It also generates sensible-looking data for 
the earlier dates.
Good idea, Stephan - take a look at the new patch I put in and see if you like.
Looks good to me...
This looks good at first glance; very good indeed. It's excellent also for those
of us who boot into Linux part time and so don't have a collectstats.pl cron
job; it generates all the missing data values. It works for me too, albeit on a
tiny DB. I hope to see many more patches of this quality from this man :-)

r=gerv if you fix the mismatched bracket in the comment around line 35. ;-)

Gerv
Keywords: patch, review
Depends on: 55161
See bug 55161, the column names for the stuff in the bugs_activity table has 
changed, this will need to take that into account.
Priority: -- → P3
-> Bugzilla product, Charting component, reassigning.

This sounds really cool. Is this somehow related to my 2001-02-14 00:18 comments
in bug 16009 "General summary charts" ?
Assignee: tara → gerv
Component: Bugzilla → Reporting/Charting
Product: Webtools → Bugzilla
Version: Bugzilla 2.11 → 2.11
Comment on attachment 39510 [details] [diff] [review]
another patch to -r1.20, this outputs progress indicators

r=gerv. Looking for second review from some other Bugzilla hacker.

Gerv
Attachment #39510 - Flags: review+
CCing Dave to find a reviewer. ;-)

Gerv
Comment on attachment 39510 [details] [diff] [review]
another patch to -r1.20, this outputs progress indicators

as noted in other comments, this doesn't work with 2.14+ because of the
activity table having some columns renamed.
Attachment #39510 - Attachment is obsolete: true
Attachment #39510 - Flags: review-
Attachment #34000 - Attachment is obsolete: true
Attachment #39369 - Attachment is obsolete: true
Comment on attachment 52690 [details] [diff] [review]
new version changes  bugs_activity.oldvalue to bugs_activity.removed to make it work with 2.14

it works great except for one small little problem...

the generated data files have the permissions set wrong, and the webserver
can't read them.  In fact, it completely disables charting because reports.cgi
can no longer read the -All- file.  running checksetup.pl does not fix it.
Attachment #52690 - Flags: review-
Oh, I did want to note that the regenerated data appears to be pretty darn close
to accurate.  I compared it with the "before" data, and with the exception of
the period of time when charting was broke for a month or two shortly after 2.12
was out, the data was all within one or two points of the old data.  I blame the
discrepancies on the fact that my cron job runs at 6am instead of midnight, and
we tend to do a lot of work between midnight and 6, so it all credits to the
previous day when the cron job generates it.
We are currently trying to wrap up Bugzilla 2.16.  We are now close enough to
release time that anything that wasn't already ranked at P1 isn't going to make
the cut.  Thus this is being retargetted at 2.18.  If you strongly disagree with
this retargetting, please comment, however, be aware that we only have about 2
weeks left to review and test anything at this point, and we intend to devote
this time to the remaining bugs that were designated as release blockers.
Target Milestone: Bugzilla 2.16 → Bugzilla 2.18
I tried the patch on my company's bugzilla installation and it worked
beautifully!! I ran it as the webserver user using sudo so I didn't have any
permission problems. Although it did create the mining directory as 777 which I
thought would kinda dangerous in that someone could delete all your mining
files, but now that you can regenerate them, it's not much of a lose. 
rwalters: are you able to address the couple of issues that have been raised here?

I apologise that this didn't make it in sooner.

Gerv
Gerv:

As far as the permissions issues reported, I thought I changed the code that 
creates the mining directory so that it uses mode 0770 instead of mode 0777 - I 
just checked and apparently it did not get in the patch (hmm, I wonder how that 
happened?)  Anyway, it is a simple fix - change "777" to "770" in two places in 
the "check_data_dir" subroutine.

In my installation I set the group sticky bit on the whole Bugzilla directory, 
and set the owning group to "webadmin" which consists of myself, httpd, and 
mail.  The group sticky bit should cause the mining directory to inherit that 
group ownership from the parent, and mode 0770 should allow only user httpd and 
group webadmin to read/write/modify/delete it.

As for any issues relating to Bugzilla versions newer than 2.13, I haven't 
updated my Bugzilla installation since then so I don't know of any problems.  
It looks like someone put in a patch to make this work with 2.14, and I scanned 
the comments, which seems to indicate any issues have been resolved.
Okay, I may have a shot at this if you want.

What permissions should the files have, and which owner/group shoudl they belong to?
Dave?

Gerv
since collectstats.pl is run either by the maintainer or a cron job, the files
do not need to be writable by the webserver, only readable.  This mean 750 for
the directory containing them if you have $webservergroup and 755 if you don't,
and 640 for the files with, and 644 without.
This code does the same as patch 52690, only *much* faster (4 min compared to
140 min with patch 52690 on a Bugzilla installation with about 30 projects and
1000 bugs.
The statistics have an additional column "total", which is the total bug count.
I think this is useful. 

I tested this against patch 52690 on Bugzilla 2.14 with our db mentioned above;
the only differences were the extra "totals" column and all my dates are off by
one day (I'm not sure whether it is my code which is incorrect or whether the
patch is incorrect).

One can probably rewrite the awk script in perl but I was too lazy to do it. 

The code keeps intermediate results in RAM. This might be a problem with very
large installations, but it is very fast.
Is that an application/zip, an application/x-gzip or what? :-)

Gerv
It is two scripts, stored as a .tar.gz file.
(Sorry, but I thought that the file name is stored 
too)
We should rewrite possible awk scripts into Perl. Not all Bugzilla installations
run on Unix, and you don't usually have awk installed on Windows.
I am assigning all the bugs I am not working on in the immediate future to
nobody@bugzilla.org. This means:

- I will be able to search for bugs assigned to me as a list of bugs I'm going
to fix (which is as it should be), and
- people won't falsely assume I might be about to fix a bug when I'm not.

Gerv
Assignee: gerv → nobody
Very nice patch (I just tried the Perl version, since I'm on Windows).

I updated the regenerate_stats sub's SQL to work on 2.17.x, and also added some 
(simple) code to benchmark the time it takes for each product. The output is 
now :

Regenerating -All- [100.0%] - 00:01:03
... (for each product)
Total time taken 00:02:09

But I don't think the code is clean enough to be posted here. Should I attach 
it here anyways, and let everyone ridicule my Perl (one of the best ways to 
learn, I know, but still painful :) or should I post it on the webtools mailing 
list and ask for comments there before attaching it here?

Or then again, should I just post the SQL changes here and let someone else 
change the actual file?
Sure, go ahead and post it here. (add it as an attachment).

A diff -u against cvs or against the original would work best, but if you don't
have diff, you can just upload the file and we'll figure it out. :)
Patch against the current collectstats.pl from CVS (as of Dec. 06 2002). The
SQL to regenerate the stats did not work with 2.17, since the product name has
been replaced by a foreign key in the bugs table. This complicates the SQL a
bit.

Also, when 'collectstats.pl regenerate' is run, the time taken for each product
and the total time will be printed.

Should make the previous patch (patch 52690) obsolete, but I don't have the
option ([no attachments can be made obsolete] above).

(justdave : diff is no problem, I work with CygWin because Win32 is strangely
lacking for any real development... :-)
Attachment #52690 - Attachment is obsolete: true
Attachment #108486 - Flags: review?(gerv)
Notice that my patch does not fix the primary reason why the other patch got a 
negative review, which was a permissions problem... I don't know what the final 
concensus on that was. My patch only fixes the SQL...
Comment on attachment 108486 [details] [diff] [review]
Patch based on the one in attachment 52690 [details] [diff] [review], with SQL updated to work with 2.17.x

This patch still looks OK; but I'd much rather review the final version, with
the permissions problem solved. 

I assume it works?

Gerv
Yep, it works for me anyways. And I have no idea what the permission problem 
was, since I'm on Windows. If I had a Linux machine to check, or if someone can 
tell me what the changes are, I can make them. But other than that, I think the 
original patch author might have to fix that part.
Shouldn't someone try to get this patch completed? It seems like this bug has 
been at a status quo (except for my SQL update) for over a year, for what 
basically is a permissions problem that shouldn't take too long to fix by 
someone who knows how...

If no one has the time, just tell me what has to be done, I'll do it and submit 
the patch, so that this bug can get reviewed and resolved once and for all! 
This is useful enough to be checked in, IMHO, so we should complete it before 
it gets to the point of needing another SQL code update...
it's in Gerv's review queue and he hasn't denied it yet...  hopefully he'll
catch up soon :)
jean_seb@hybride.com: I've been avoiding this review for ages, because this code
scares me ;-) However, that's rather unreasonable and, as it's an _option_,
no-one has to run it. So if you check the patch and make sure it's fine for the
current tip, I'll look at it and check it in.

Gerv
grev : I've checked against collectstats.pl taken from CVS this morning
(25/02/2003 ~10:50 Eastern) and the last patch applies with the first hunk
rejected, but it's very easy to merge it manually. Here is another version of
the patch that applies cleanly.

I also tried to fix the permissions problem mentioned in comment #20, with the
information in comment #28. I set the permissions on creation of the directory,
and each time data is written to a file, in case the file had to be created
before writing to it. I don't know how to check "if you have $webservergroup",
so I set the permissions to what Dave said they should be WITH $webservergroup.
Before reviewing, if someone can tell me how to check it, I can add it to the
patch and it would be pretty final.

Note that if there's anything wrong with the patch other than the SQL code and
the permissions fix, the original author of the patch (rwalters@qualcomm.com)
should probably fix it. The code scares me too :-)
Attachment #108486 - Attachment is obsolete: true
Comment on attachment 115509 [details] [diff] [review]
Patch that applies cleanly as of 2003/02/25 11:10 AM Eastern

Oops, sorry about the nasty tabs in the patch. If there are any changes to be
made, I'll remove them at the same time.
Attachment #115509 - Flags: review?(gerv)
The SQL queries this uses are incredibly slow, because they perform the function on the 
column, rather than the constant,  and thus, can't use any indexes.

For example, it does: 
   SendSQL ("select bug_id from bugs" .
               (($product ne '-All-') ? ", products" : "") .
                " where to_days(bugs.creation_ts)=" . ($day - 1) .
                (($product ne '-All-') ? " and bugs.product_id=products.id " .
                    " and products.name=" . SqlQuote ($product) : "") .
                " order by bug_id");

Notice it does "to_days(bugs.creation_ts) = ($day - 1)", which can't use any indexes 
because it has to apply to_days to *every* bugs.creation_ts value.
If you write this as "bugs.creation_ts < from_days($day - 1) and bugs.creation_ts >= 
from_days ($day - 2)", it will be able to use the bugs.creation_ts index that exists (since 
it'll turn the comparisons into comparisons against constants, and thus, just do a range 
select).
This makes the first 16% (which are what use this query) *much* faster, because it will 
turn from_days 

A similar problem exists in the other later queries like "... to_days 
(bugs_activity.bug_when) >= $day " instead of "bugs_activity.bug_when >= from_days 
($day - 1)", but modifying them doesn't speed up renegerate measurably.


Just thought i'd point this out
dberlin : (this may sound like an excuse, but it's actually an explanation) I 
didn't write the original SQL, I just changed it to work on Bugzilla 2.17.x, 
but I agree that what you mention represents a good speedup. Of course, since 
regenerate will theoretically only be run once (for example, as in my case, 
when an installation moves from not collecting stats to collecting them), it's 
not that important. Nevertheless, if you want to fix the code and submit 
another patch, that would be great.
Yes, dberlin: are you able to submit an improved patch with better queries?

Gerv
Sure, i can submit a patch against the patch (since it's only changing a few lines).

While regenerate is only run once, on my bugzilla installation (which is for the GCC project), regenerate takes >24 hours with the current queries.  Since this is a gnats->bugzilla conversion (using a heavily modified gnats2bz.pl script), and we haven't done the final cutover yet, whenever i reimport the gnats data (once or twice a week right now), i need to run regenerate.  

There is also some perl code that is doing unnecessary work somewhere (it really should be query bound, but it's not right now, ) that i'm working on tracking down.  It gets exponentially slower as it approaches 100%.

Once i fix that, i'll submit an updated patch. 
dberlin : Ouch, 24 hours! Thanks for your cooperation. I look forward to your 
update. 
dberlin: ping? :-)

Gerv
Attached patch Patch with redone queries (obsolete) — Splinter Review
Just to be clear, the reason we have to change to_days (creation_ts) = $day - 1
to creation_ts >= from_days($day-1) and creation_ts < from_days ($day-2) is
because there is no easy *AND* fast way to remove the time from the creation_ts
, and with the time part there, "03-11-03 12:54:11" != "03-11-03".
Thus, we check if it's greater than the beginning of the day and less than the
beginning of the next day.

I've excluded other speedups than the query changes, i'll submit them
seperately when they are completely finished.
*** Bug 197064 has been marked as a duplicate of this bug. ***
Attached patch Patch v.1 (obsolete) — Splinter Review
OK. Here's a patch, cleaned up to match our coding standards with a small
dabble of rearrangement. The command-line parameter is now the more-standard
"--regenerate". I've tested it locally, and it works. 

I'm attaching this patch so I can request a=.

Gerv
Attachment #76414 - Attachment is obsolete: true
Attachment #115509 - Attachment is obsolete: true
Attachment #116907 - Attachment is obsolete: true
Attached patch Patch v.2Splinter Review
And now the version without the debugging code.

Regarding permissions: I've used the looser of the two given by justdave in
comment 28, because I don't think we can easily check if webservergroup is set.


Gerv
Attachment #117339 - Attachment is obsolete: true
Requesting approval.

Gerv
Flags: approval?
Taking.

Gerv
Assignee: nobody → gerv
Attachment #108486 - Flags: review?(gerv) → review-
Attachment #115509 - Flags: review?(gerv) → review-
Regarding comment 54: Correct me if I'm wrong as I haven't actually looked at
the code/patch, but doesn't collectstats.pl require globals.pl which in turn
imports everything from localconfig into the global namespace?  If so, it should
be trivial to tell whether or not webservergroup is set.
Just so it doesn't get lost, i'll point out the pathological behavior in the
regenerate option that causes it to be still incredibly slow.

It currently works like so:

for all days:
  Initialize bug counts for each status to 0
  add all bugs from previous day to list of bugs to process (processlist).
  for all bugs in processlist:
     determine status of bug and update bugcount.

Obviously, because we never take bugs off the processlist (necessary because we
reset the bugcounts every day), this gets slower and slower as we move through
the days.

I've completely reworked it so that we take advantage of the fact that the only
time the counts change is when a bug is added, or it is resolved. Thus, we only
need to keep looking at the *unresolved* bugs, not all of them, and we don't
need to update the count on the unresolved bugs until they get resolved. (IE
once we've seen it once, until it's resolved, we don't care about it).
Unfortunately, my current version assumes that it's only resolved once.
Once it is resolved, we don't look at it again. 
When i get back to it, i'll make it do a quick query once a resolution occurs to
see if it ever gets reopened later on, and put it back on the processlist if
this is the case.

With changes to do this, it takes roughly 20-30 seconds to regenerate stats,
rather than ungodly large amounts of time.

I should also note that AFAICT, it's the perl code that's slow in all of this
(both before and after),  not the large amount of queries.  Mysql was waiting
most of the time, and wasn't eating much CPU at all.

dberlin: I noticed it got slower as you went, and better queries are always good
:-) But I'd much rather not do the cleanups I did again, so please base any
future patch on my version 2, above. 

When do you anticipate being able to make these changes?

Gerv
I don't see any reviewed patches ready to check in here, so nothing to approve...
Flags: approval?
I'd hate to see this patch get lost _again_. 

dberlin: will you be able to fix the dodgy cases in the next week, or should I
get it checked in and maybe you can fix it in a later patch?

Gerv
I doubt it, i've got finals coming up, and the big final gcc switchover to
attend to.
I'll fix the slowness later, so just go with what we have now
Requesting approval. This patch was basically written by someone else and tidied
up by me, so I've reviewed it as well as uploaded it :-) Also, it's been tested
by several people and is very self-contained (you have to pass an option to
collectstats.pl to enable the function.)

Gerv
Flags: approval?
Flags: approval? → approval+
Fixed. Apologies for the delay.

Checking in collectstats.pl;
/cvsroot/mozilla/webtools/bugzilla/collectstats.pl,v  <--  collectstats.pl
new revision: 1.29; previous revision: 1.28
done

Gerv
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Do we have a bug number for fixing the speed issues mentioned in comment #62?
No, we don't have a bug number for the speed issue: I assume dberlin will file
one when and if he gets around to making a patch for it.

Gerv
Flags: documentation2.18?
Flags: documentation?
OS: Linux → All
Hardware: PC → All
Blocks: 285466
Flags: documentation?
Flags: documentation2.18?
QA Contact: matty_is_a_geek → default-qa
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: