Closed Bug 72721 Opened 23 years ago Closed 23 years ago

duplicates.cgi performs poorly with lots of bugs

Categories

(Bugzilla :: Bugzilla-General, defect)

x86
All
defect
Not set
major

Tracking

()

RESOLVED FIXED
Bugzilla 2.12

People

(Reporter: gerv, Assigned: gerv)

References

()

Details

Attachments

(3 files)

On bugzilla.mozilla.org, duplicates.cgi sorts all 7000 dupes in a hash prior to
display, which takes an unacceptably long time. I've made it so that it first
weeds out all bugs with a dup_count below a certain value, which is a param. 

I've also added "Click to sort" column headings, refactored the code a bit and
generally tidied it up. This should hopefully make duplicates.cgi "2.12-ready",
but it'll need testing on there, as I can't test it with a DB that big. 

There's still no easy interface to changing the time difference for the "changed
since" column, but that can wait. Suggestions as to how to do this are welcome.

Patch to follow.

Gerv
Keywords: patch
Target Milestone: --- → Bugzilla 2.12
OK, so I forgot the other necessary fix - to make it display only open bugs :-( 
I'll sort that out today, and attach another patch.

Gerv
do you really want only open bugs? People using slightly moldy builds will
want to see recently closed bugs too i belive. Not sure what the best way is
to define 'recently closed'.
True. But you are right, there's no easy way to query for "resolution changed
from "" to something in the last week", is there?

Gerv
how about some variation of this

select bug_id, oldvalue, newvalue from bugs_activity where fieldid=29 and
(newvalue='RESOLVED') AND bug_when > "2001-03-19 00:00:00";
If you want to trawl the bug activity you might be able to do this but that
might be slow.

But I do think we should have at least ResolvedDaysAgo, VerifiedDaysAgo &
ClosedDaysAgo available for the user to query on as well.
Blocks: 69054
Dawn: I have a fix for the last part now, but in my DB, the fieldid in
bugs_activity of a change from nothing to RESOLVED is 8. Did you pull "29" out
of your head, or is there something here I'm not understanding? 

Gerv
RESOLVED is a type of bug_status.

in the fielddefs table, bug_status has fieldid 29 (on bmo at least)

mysql> select fieldid name from fielddefs where name="bug_status";
+------+
| name |
+------+
|   29 |
+------+
1 row in set (0.00 sec)
mysql> select fieldid name from fielddefs where name="bug_status";
+------+
| name |
+------+
|    8 |
+------+
1 row in set (0.15 sec)
Just looked and you have to do something like:
  my $statusid = GetFieldID("bug_status");
OK, new patch coming. It does the following:

- Changed DB_File to AnyDBM_File, for cross-platform portability reasons. This
will work on vanilla Perl on any platform by using whatever DBM implementation
is installed. This is bug 69054. This also requires no longer manually adding
the ".db" extension to the files.

- Changed the date generation routines to generate a date like 2001-01-01
rather than 20010101 because this matches the internal form in Bugzilla. This is
for code-reuse reasons.

(This means that old db files will not be recognised. This is good, because
they may in the wrong format anyway.)

- Made duplicates.cgi not show closed bugs, unless they were closed less than
a param-to-the-form (show_res_days) number of days ago (default 7.) 

- Added a param "mostfreqthreshold", for performance reasons - any bugs with
less than that number of dupes will never show up. Default is 2 - I suggest
b.m.o set this to 8 or so. It now also uses the shadowdb.

- Also added "click to sort" column headings to duplicates.cgi, and other
general polish.

- Fixed incorrect comment on param types in defparams.pl

- Made reports.cgi say what the current milestone is when offering Most Doomed
for it. Also formatting changes.

Please review :-) 

Gerv
<sigh> No, no. It's all going pear-shaped.

I'm currently using Dawn's SQL to generate a list of bugs to possibly remove
from my list of dupes. However, it:
a) Gets all bugs which have been resolved more than e.g. a week ago. If they've
been reopened since, it still catches them and they are erroneously removed.
b) "All events where a bug was resolved more than a week ago" is a very big list
on b.m.o.

Argh! I have a big list of bugs. I want to reduce it to all those which are
currently open, or are currently closed and were closed less than e.g. a week
ago. How do I do this in a sane amount of time and resources? I could do an
individual SQL query to check the status of each bug, but this would mean e.g.
100 queries all hitting the bugs_activity table, which is very big indeed.

Can anyone give me ideas? I want to get this patch done ASAP, because all the
neat stuff I mentioned above also needs doing for 2.12, and it's hard to
disentangle it all.

Gerv
How up to date does the output from duplicates.cgi have to be?  If you made
collectstats.pl do the bulk of the processing and basically just output a list
of bug numbers that should be displayed, then duplicates.cgi could simply take
that list, do a few formating tricks, pull the current summaries and spit out
the HTML.

This would allow you to do a bunch of smaller queries in collectstats that (for
resolution, etc.) and even the larger bulk sorts and stuff w/out worrying about
how long it takes (within reason, of course).
collectstats.pl can't just output a list of bug numbers, unless we want to lose
the "changed since" functionality (which says how many dupes a bug has acquired
in e.g. the last week) which requires the databases the way they are. Because of
this, we need to a) hold all the dupes in the on-disk file, and b)only calculate
which ones to display after we know what the value is.

I think the 2.12 solution is probably, sadly, just to display all the bugs with
lots of dupes, resolved or not.

Gerv
Haven't looked closely enough to look at duplicates.cgi to know it did all that.
 Basically my suggestion is to put as much processing as possible into
collectstats.pl, but I guess you already did that.

It seems like there's some SQL magic that can be done to make the query return
only open bugs, but I'm not an SQL expert by any means.
I think there needs to be a fast way to get the "last-resolved",
"last-verified", and maybe other dates for each bug. This will help here, and it
will also help the QuickSearch facility which has the same problem: to show open
and "recently fixed" bugs. 
What about the following solution: Add a "last_activity" table to the database.
The columns would be:
- bug id      (with an index on this column)
- activity_id (with an index on this column)
- timestamp   (with an index on this column)
and optionally
- who 

The activities would be hardcoded and described in a "interesting_activity" table:
- id   (with an index on this column)
- desc (text, e.g. "status changes to RESOLVED")

The interesting_activity table would be hardcoded, and for use by bugzilla
developers only. (We could also do without it.)

The code for updating the last_activity table would have to be placed
appropriately at locations where the bug_activity table is currently updated.
Whenever an interesting activity takes place, the last_activity table would be
queried for an entry for this bug_id/activity_id combination, and if one already
exists, the timestamp (and maybe "who") data is updated, otherwise a new tuple
is inserted. This would be a flexible, extensible mechanism.

If we wanted to restrict ourselved to interesting activities where a certain
field changes, or a certain field changes to a certain new value, then the
"interesting_activity" tables could be defined as
- id
- field_id
- new_value (where NULL means "does not matter")
and optionally
- old_value (where NULL means "does not matter")
- who       (where NULL means "does not matter")

This would allow for a generic implementation of the update mechanism for the
"interesting_activity" table (assuming that the code for logging the
bug_activity is nicely centralized in a single location). It would make it
impossible to track compound activities that affect several fields, but I'm not
sure if anyone needs them at all.

Optionally, for each "value" column in "interesting_activity" there could be an
additional column for the comparison type: exact (default), substr, regexp, ...

But I'm not sure if this is 2.12 stuff... ;)
I am pushing this off to 2.16. The feature works, it's just slow. 
Target Milestone: Bugzilla 2.12 → Bugzilla 2.16
Chris,

It doesn't really work satisfactorily. Given that the patch only really affects 
duplicates.cgi (and so there's no danger of it breaking other Bugzilla function) 
why can't we just check it in? I don't want my work being displayed to the world 
in a half-finished state :-)

Gerv
OK, new patch attached. This patch:

1) Fixes the AnyDBM_File problem  (bug 69054) in collectstats.pl and
duplicates.cgi
2) Adds one new param
3) Makes a load of other changes to duplicates.cgi

3) is irrelevant - the feature didn't really work before (no, it really didn't,
I promise :-) so if it doesn't work now, what's the problem? 

1), however, is really important for upgradeability reasons. People are _not_
going to be happy if we break all their stats in a few months time. 

I've changed the stats filenames so that this code won't try and read the older
ones (though I doubt anyone other than b.m.o. is using them.) This change is
good for other reasons, too (too complex to explain here.)

So, please review and check in the small changes to collectstats.pl and
defparams.pl. You can take my word for it that duplicates.cgi works, if it seems
like too much effort to review. If you are going to review it, probably best to
review the complete file, as the patch moves stuff around.

Gerv
Target Milestone: Bugzilla 2.16 → Bugzilla 2.12
r= justdave

Ooh, duplicates.cgi queries the shadow database now.  Nice touch. :-)

CPAN says AnyDBM_File is included with Perl 5.6.  Can anyone verify if it's also 
included with Perl 5.004 (Which is currently Bugzilla's minimum required Perl 
version)?  If so, this won't need to be release-noted.
I take that back, this bug does need a release-note :-)

It adds a param for the number of dupes a bug needs to be included in the "most 
frequent" list.

Anyway, it's checked in.
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
my 5.004_04 installation has it but AnyDBM_File may have been
installed later.

collectstats is creating files named
dupes2001-04-06.dir
dupes2001-04-06.pag

but duplicates.cgi tries to open files ending in ".db"

also, this is pretty suboptimal because as time passes its adding more and 
more files which we need to keep forever. In no time there will be hudnreds 
of files in the data directory. At the very least, get these files out of
bugzilla/data and into their own directory (bugzilla/data/duplicates)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Why is nothing in life ever as simple as it appears?

I'll look at this again :-)

<looks> Actually, I think my "file exists" tests are broken. the AnyDBM_File
calls are extension-agnostic (presumably because this varies across platforms),
but I'm testing for them with the .db extension. Doh. I will fix this, and also
make a new directory for them (data/duplicates ?).          

Gerv
OK. The new (simple, 10-line) patch against current CVS moves all duplicates
files to data/duplicates (and checksetup.pl creates it if it doesn't exist.) It
also changes the file tests to not rely on the files ending in .db and so
_should_ work when they don't. However, I don't have a suitable system to test
this on.

I've resisted the temptation to make more improvements to duplicates.cgi :-)

Gerv
Blocks: 76159
Final patch checked.  Thanks Gerv.
Status: REOPENED → RESOLVED
Closed: 23 years ago23 years ago
Resolution: --- → FIXED
Moving closed bugs to Bugzilla product
Component: Bugzilla → Bugzilla-General
Product: Webtools → Bugzilla
Version: Bugzilla 2.11 → unspecified
QA Contact: matty_is_a_geek → default-qa
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: