Closed Bug 72721 Opened 24 years ago Closed 24 years ago

duplicates.cgi performs poorly with lots of bugs

Categories

(Bugzilla :: Bugzilla-General, defect)

x86
All
defect
Not set
major

Tracking

()

RESOLVED FIXED
Bugzilla 2.12

People

(Reporter: gerv, Assigned: gerv)

References

()

Details

Attachments

(3 files)

On bugzilla.mozilla.org, duplicates.cgi sorts all 7000 dupes in a hash prior to display, which takes an unacceptably long time. I've made it so that it first weeds out all bugs with a dup_count below a certain value, which is a param. I've also added "Click to sort" column headings, refactored the code a bit and generally tidied it up. This should hopefully make duplicates.cgi "2.12-ready", but it'll need testing on there, as I can't test it with a DB that big. There's still no easy interface to changing the time difference for the "changed since" column, but that can wait. Suggestions as to how to do this are welcome. Patch to follow. Gerv
Keywords: patch
Target Milestone: --- → Bugzilla 2.12
OK, so I forgot the other necessary fix - to make it display only open bugs :-( I'll sort that out today, and attach another patch. Gerv
do you really want only open bugs? People using slightly moldy builds will want to see recently closed bugs too i belive. Not sure what the best way is to define 'recently closed'.
True. But you are right, there's no easy way to query for "resolution changed from "" to something in the last week", is there? Gerv
how about some variation of this select bug_id, oldvalue, newvalue from bugs_activity where fieldid=29 and (newvalue='RESOLVED') AND bug_when > "2001-03-19 00:00:00";
If you want to trawl the bug activity you might be able to do this but that might be slow. But I do think we should have at least ResolvedDaysAgo, VerifiedDaysAgo & ClosedDaysAgo available for the user to query on as well.
Blocks: 69054
Dawn: I have a fix for the last part now, but in my DB, the fieldid in bugs_activity of a change from nothing to RESOLVED is 8. Did you pull "29" out of your head, or is there something here I'm not understanding? Gerv
RESOLVED is a type of bug_status. in the fielddefs table, bug_status has fieldid 29 (on bmo at least) mysql> select fieldid name from fielddefs where name="bug_status"; +------+ | name | +------+ | 29 | +------+ 1 row in set (0.00 sec)
mysql> select fieldid name from fielddefs where name="bug_status"; +------+ | name | +------+ | 8 | +------+ 1 row in set (0.15 sec)
Just looked and you have to do something like: my $statusid = GetFieldID("bug_status");
OK, new patch coming. It does the following: - Changed DB_File to AnyDBM_File, for cross-platform portability reasons. This will work on vanilla Perl on any platform by using whatever DBM implementation is installed. This is bug 69054. This also requires no longer manually adding the ".db" extension to the files. - Changed the date generation routines to generate a date like 2001-01-01 rather than 20010101 because this matches the internal form in Bugzilla. This is for code-reuse reasons. (This means that old db files will not be recognised. This is good, because they may in the wrong format anyway.) - Made duplicates.cgi not show closed bugs, unless they were closed less than a param-to-the-form (show_res_days) number of days ago (default 7.) - Added a param "mostfreqthreshold", for performance reasons - any bugs with less than that number of dupes will never show up. Default is 2 - I suggest b.m.o set this to 8 or so. It now also uses the shadowdb. - Also added "click to sort" column headings to duplicates.cgi, and other general polish. - Fixed incorrect comment on param types in defparams.pl - Made reports.cgi say what the current milestone is when offering Most Doomed for it. Also formatting changes. Please review :-) Gerv
<sigh> No, no. It's all going pear-shaped. I'm currently using Dawn's SQL to generate a list of bugs to possibly remove from my list of dupes. However, it: a) Gets all bugs which have been resolved more than e.g. a week ago. If they've been reopened since, it still catches them and they are erroneously removed. b) "All events where a bug was resolved more than a week ago" is a very big list on b.m.o. Argh! I have a big list of bugs. I want to reduce it to all those which are currently open, or are currently closed and were closed less than e.g. a week ago. How do I do this in a sane amount of time and resources? I could do an individual SQL query to check the status of each bug, but this would mean e.g. 100 queries all hitting the bugs_activity table, which is very big indeed. Can anyone give me ideas? I want to get this patch done ASAP, because all the neat stuff I mentioned above also needs doing for 2.12, and it's hard to disentangle it all. Gerv
How up to date does the output from duplicates.cgi have to be? If you made collectstats.pl do the bulk of the processing and basically just output a list of bug numbers that should be displayed, then duplicates.cgi could simply take that list, do a few formating tricks, pull the current summaries and spit out the HTML. This would allow you to do a bunch of smaller queries in collectstats that (for resolution, etc.) and even the larger bulk sorts and stuff w/out worrying about how long it takes (within reason, of course).
collectstats.pl can't just output a list of bug numbers, unless we want to lose the "changed since" functionality (which says how many dupes a bug has acquired in e.g. the last week) which requires the databases the way they are. Because of this, we need to a) hold all the dupes in the on-disk file, and b)only calculate which ones to display after we know what the value is. I think the 2.12 solution is probably, sadly, just to display all the bugs with lots of dupes, resolved or not. Gerv
Haven't looked closely enough to look at duplicates.cgi to know it did all that. Basically my suggestion is to put as much processing as possible into collectstats.pl, but I guess you already did that. It seems like there's some SQL magic that can be done to make the query return only open bugs, but I'm not an SQL expert by any means.
I think there needs to be a fast way to get the "last-resolved", "last-verified", and maybe other dates for each bug. This will help here, and it will also help the QuickSearch facility which has the same problem: to show open and "recently fixed" bugs. What about the following solution: Add a "last_activity" table to the database. The columns would be: - bug id (with an index on this column) - activity_id (with an index on this column) - timestamp (with an index on this column) and optionally - who The activities would be hardcoded and described in a "interesting_activity" table: - id (with an index on this column) - desc (text, e.g. "status changes to RESOLVED") The interesting_activity table would be hardcoded, and for use by bugzilla developers only. (We could also do without it.) The code for updating the last_activity table would have to be placed appropriately at locations where the bug_activity table is currently updated. Whenever an interesting activity takes place, the last_activity table would be queried for an entry for this bug_id/activity_id combination, and if one already exists, the timestamp (and maybe "who") data is updated, otherwise a new tuple is inserted. This would be a flexible, extensible mechanism. If we wanted to restrict ourselved to interesting activities where a certain field changes, or a certain field changes to a certain new value, then the "interesting_activity" tables could be defined as - id - field_id - new_value (where NULL means "does not matter") and optionally - old_value (where NULL means "does not matter") - who (where NULL means "does not matter") This would allow for a generic implementation of the update mechanism for the "interesting_activity" table (assuming that the code for logging the bug_activity is nicely centralized in a single location). It would make it impossible to track compound activities that affect several fields, but I'm not sure if anyone needs them at all. Optionally, for each "value" column in "interesting_activity" there could be an additional column for the comparison type: exact (default), substr, regexp, ... But I'm not sure if this is 2.12 stuff... ;)
I am pushing this off to 2.16. The feature works, it's just slow.
Target Milestone: Bugzilla 2.12 → Bugzilla 2.16
Chris, It doesn't really work satisfactorily. Given that the patch only really affects duplicates.cgi (and so there's no danger of it breaking other Bugzilla function) why can't we just check it in? I don't want my work being displayed to the world in a half-finished state :-) Gerv
OK, new patch attached. This patch: 1) Fixes the AnyDBM_File problem (bug 69054) in collectstats.pl and duplicates.cgi 2) Adds one new param 3) Makes a load of other changes to duplicates.cgi 3) is irrelevant - the feature didn't really work before (no, it really didn't, I promise :-) so if it doesn't work now, what's the problem? 1), however, is really important for upgradeability reasons. People are _not_ going to be happy if we break all their stats in a few months time. I've changed the stats filenames so that this code won't try and read the older ones (though I doubt anyone other than b.m.o. is using them.) This change is good for other reasons, too (too complex to explain here.) So, please review and check in the small changes to collectstats.pl and defparams.pl. You can take my word for it that duplicates.cgi works, if it seems like too much effort to review. If you are going to review it, probably best to review the complete file, as the patch moves stuff around. Gerv
Target Milestone: Bugzilla 2.16 → Bugzilla 2.12
r= justdave Ooh, duplicates.cgi queries the shadow database now. Nice touch. :-) CPAN says AnyDBM_File is included with Perl 5.6. Can anyone verify if it's also included with Perl 5.004 (Which is currently Bugzilla's minimum required Perl version)? If so, this won't need to be release-noted.
I take that back, this bug does need a release-note :-) It adds a param for the number of dupes a bug needs to be included in the "most frequent" list. Anyway, it's checked in.
Status: NEW → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
my 5.004_04 installation has it but AnyDBM_File may have been installed later.
collectstats is creating files named dupes2001-04-06.dir dupes2001-04-06.pag but duplicates.cgi tries to open files ending in ".db" also, this is pretty suboptimal because as time passes its adding more and more files which we need to keep forever. In no time there will be hudnreds of files in the data directory. At the very least, get these files out of bugzilla/data and into their own directory (bugzilla/data/duplicates)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Why is nothing in life ever as simple as it appears? I'll look at this again :-) <looks> Actually, I think my "file exists" tests are broken. the AnyDBM_File calls are extension-agnostic (presumably because this varies across platforms), but I'm testing for them with the .db extension. Doh. I will fix this, and also make a new directory for them (data/duplicates ?). Gerv
OK. The new (simple, 10-line) patch against current CVS moves all duplicates files to data/duplicates (and checksetup.pl creates it if it doesn't exist.) It also changes the file tests to not rely on the files ending in .db and so _should_ work when they don't. However, I don't have a suitable system to test this on. I've resisted the temptation to make more improvements to duplicates.cgi :-) Gerv
Blocks: 76159
Final patch checked. Thanks Gerv.
Status: REOPENED → RESOLVED
Closed: 24 years ago24 years ago
Resolution: --- → FIXED
Moving closed bugs to Bugzilla product
Component: Bugzilla → Bugzilla-General
Product: Webtools → Bugzilla
Version: Bugzilla 2.11 → unspecified
QA Contact: matty_is_a_geek → default-qa
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: