Closed Bug 176599 Opened 22 years ago Closed 22 years ago

Improve performance of duplicates.cgi

Categories

(Bugzilla :: Reporting/Charting, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
Bugzilla 2.18

People

(Reporter: gerv, Assigned: gerv)

References

Details

(Keywords: perf)

Attachments

(3 files, 4 obsolete files)

Split off from bug 176002.

Moving the stats into the DB is desireable, but an effort. I think we should
immediately make the following changes:
- Incorporate the "openonly" functionality into the query.
- Incorporate the access check into the query
- Incorporate the "numwanted" functionality into the query. Because the access
check is also incorporated, we won't need to iterate.

This should make a significant difference. Then, we can work out what schema we
need to move the stats into the DB.

Gerv
Blocks: 176002
Blocks: 176570
Keywords: perf
Attached patch Patch v.1 (obsolete) — Splinter Review
Let MySQL do all the work.

bbaetz: you've been pushing for this; any time to review it? ;-) Seriously,
it's not that bug - some stuff copied from CanSeeBug, and a template
optimisation now that we pass in the same number of bugs that we display.

Gerv
Comment on attachment 104272 [details] [diff] [review]
Patch v.1

You should:

+    detaint_natural($maxrows);

|| die (...)

Also, dkl is rewriting the stuff in Search.pm to be ~ 10 times faster; you want
to use the new method instead. See bug 114696.
Attachment #104272 - Flags: review-
Attached patch Patch v.2 (obsolete) — Splinter Review
It's not clear that that bug will make it in before b.m.o upgrades - let's take
this now, and if we can improve it later, great.

Nits fixed.

Gerv
Attachment #104272 - Attachment is obsolete: true

Gerv: 

I'd suggest either making this depend on bug 114696 or use the same appproach. 
It is faster, works better for differrent DBs, and does not have to be heavuilty
rewritten to switch between AND and OR groups.

Joel: as I said in comment 3:

"It's not clear that that bug will make it in before b.m.o upgrades - let's take
this now, and if we can improve it later, great."

If it requires rewriting to fit in with bug 114696, then we can do that in
another round of patching. We do not need multi-DB support in the next week :-)
I do not want to miss the boat with this - now that the Bugzilla Helper's being
used on b.m.o., that page gets a lot of hits.

Gerv

The patch looks good (a few indentation nits, but nothing review-breaking),
seems to correctly hide bugs the user is unauthorized to see, and is much
faster, but it doesn't show the same bug set.  Here's an example of what I get
on a several-days old copy of b.m.o data.
Here's what duplicates.cgi produces on the same installation if I reverse the
patch.
Comment on attachment 104691 [details] [diff] [review]
Patch v.2

OK, got it.

You've got a LIMIT, but no ORDER BY, so you're going to be picking $maxrows
_random_ rows, then sorting, rather than sorting then limiting.

You can move the foundgroups stuff into the query, I guess, and then ORDER BY
if you're not sorting on dupes. See my discussion in bug 176002.
Attachment #104691 - Flags: review-
OK, my idea here is to put the limiting back in the template; duplicates.cgi's
cool bi-directional column sorting mechanism is based on having all the data
available in the template, and changing that requires quite significant surgery. 

We can still provide a performance boost by moving the access check and open
check into the query. Although it's not as much as we'd like ideally, we have
deadlines...

Anyone think that's a bad plan?

Gerv
Attached patch Patch v.3 (obsolete) — Splinter Review
Here's v.3, which moves the row limiting back to the template. Still a perf
improvement, though, surely.

Gerv
Attachment #104691 - Attachment is obsolete: true
Comment on attachment 105506 [details] [diff] [review]
Patch v.3

>Index: duplicates.cgi

>+    my $query = 
>+      "SELECT bugs.bug_id, components.name, bug_severity, op_sys, " .
>+        "target_milestone, short_desc, bug_status, resolution, " .
>+        "reporter, assigned_to, qa_contact, reporter_accessible, " .
>+        "cclist_accessible, cc.who IS NOT NULL, " .
>+        "COUNT(DISTINCT(bug_group_map.group_id)) as cntbugingroups, " .
>+        "COUNT(DISTINCT(user_group_map.group_id)) as cntuseringroups " .
>+      "FROM bugs, components " .
>+        "LEFT JOIN cc ON bugs.bug_id = cc.bug_id " .
>+        "             AND cc.who = $userid " .
>+        "LEFT JOIN bug_group_map ON bugs.bug_id = bug_group_map.bug_id " .
>+        "LEFT JOIN user_group_map ON " .
>+        "                   user_group_map.group_id = bug_group_map.group_id " .
>+        "                         AND user_group_map.isbless = 0 " .
>+        "                         AND user_group_map.user_id = $userid " .

Nit: indentation here is a bit ambiguous.


>+        # Don't add CLOSED, and don't add VERIFIED unless they are INVALID or 
>+        # WONTFIX. We want to see VERIFIED INVALID and WONTFIX because common 
>+        # "bugs" which aren't bugs end up in this state.
>+        $query .= "AND (bug_status != 'CLOSED') " .
>+                  "AND ((bug_status = 'VERIFIED' " .
>+                  "      AND resolution IN ('INVALID', 'WONTFIX')) " .
>+                  "     OR (bug_status != 'VERIFIED')) ";

Nit: indentation here is inconsistent.

With this patch applied the "changed since" column doesn't show up.

Also, I'm getting mixed results with performance tests on the code.  F.e.,
hitting the CGI twenty five times takes 2 minutes 27 seconds for the version
currently on b.m.o, 3 minutes 2 seconds for the trunk without the patch
applied, and 2 minutes 13 seconds for the trunk with the patch applied on a
test installation.  I'm not 100% confident in these results, but did
performance regress on the trunk recently?
Attachment #105506 - Flags: review-
myk - that slowdown is probably due to the groups stuff again. bitwise & is
always better than an sql join...
>+    my $query = 
>+      "SELECT bugs.bug_id, components.name, bug_severity, op_sys, " .
>+        "target_milestone, short_desc, bug_status, resolution, " .
>+        "reporter, assigned_to, qa_contact, reporter_accessible, " .
>+        "cclist_accessible, cc.who IS NOT NULL, " .
>+        "COUNT(DISTINCT(bug_group_map.group_id)) as cntbugingroups, " .
>+        "COUNT(DISTINCT(user_group_map.group_id)) as cntuseringroups " .
>+      "FROM bugs, components " .
>+        "LEFT JOIN cc ON bugs.bug_id = cc.bug_id " .
>+        "             AND cc.who = $userid " .
>+        "LEFT JOIN bug_group_map ON bugs.bug_id = bug_group_map.bug_id " .
>+        "LEFT JOIN user_group_map ON " .
>+        "                   user_group_map.group_id = bug_group_map.group_id " .
>+        "                         AND user_group_map.isbless = 0 " .
>+        "                         AND user_group_map.user_id = $userid " .

> Nit: indentation here is a bit ambiguous.

Well, all the ANDs are lined up with their ONs, and the only anomaly is that
user_group_map.group_id = bug_group_map.group_id is down one line because it
won't fit.

>+        # Don't add CLOSED, and don't add VERIFIED unless they are INVALID or 
>+        # WONTFIX. We want to see VERIFIED INVALID and WONTFIX because common 
>+        # "bugs" which aren't bugs end up in this state.
>+        $query .= "AND (bug_status != 'CLOSED') " .
>+                  "AND ((bug_status = 'VERIFIED' " .
>+                  "      AND resolution IN ('INVALID', 'WONTFIX')) " .
>+                  "     OR (bug_status != 'VERIFIED')) ";

> Nit: indentation here is inconsistent.

Things are lined up with their brackets, to show the structure - the OR is an OR
for the entire clause above, etc. etc. 

I'm happy to accept suggestions for better indentation, but that's the best I
could come up with.

> With this patch applied the "changed since" column doesn't show up.

Are you sure your test installation has the historical data? This works for me.

Gerv
extra thing - I'm in the process of redoing bug 114696 for a 10-20% speedup on
the group checking stuff. (Summary: groupby sucks). I don't know if we want to
wait for that before this, or to do this in towo stages. I guess it depends on
timing, at this stage.
That's up to myk. This patch is ready to go (perhaps with indentation tweaks)
and gives us a win. Or we can wait for bug 114696 - but we are running out of
time fast.

Gerv
Depends on: 179193
Attached patch v4 (obsolete) — Splinter Review
Bug 114696 has a perf improvement for the security stuff. I was going to merge
it into duplicates, but then I decided that it would be better to use
Bugzilla::Search instead.

This isn't finished - as was pointed out on IRC, I don't need teh extra stuff,
because I can represent the extra bit with boolean charts.
Attachment #105506 - Attachment is obsolete: true
-> 2.18; I'll play with this tomorrow
Target Milestone: --- → Bugzilla 2.18
Attached patch v5Splinter Review
Thanks to some help on #mozwebtools, I don't need the extracond stuff.

Note that I use anyexact, even though this is not directly available from the
query.cgi UI - see bug 179302

myk - can I get you to run some speed/correctness tests, please?
Attachment #105673 - Attachment is obsolete: true
Attachment #105728 - Flags: review?
Comment on attachment 105728 [details] [diff] [review]
v5

>Index: duplicates.cgi

>+    # Restrict to product if requested
>+    if ($::FORM{'product'}) {
>+        $params->param('product', $::FORM{'product'});
>+    }

Nit: simple conditionals like this are more readable if postfix:

$params->param('product', $::FORM{'product'}) if ($::FORM{'product'});


>Index: template/en/default/global/user-error.html.tmpl

>+  [% ELSIF error == "invalid_maxrow" %]
>+    [% title = "Invalid Max Rows" %]
>+    The maximum number of rows, '[% maxrows FILTER html %]', must be a positive
>+    integer.

Nit: quoting values is better accomplished with style (bold, italic,
strikethrough, monospaced font, different font color or background color,
etc.).

This patch produces identical output and provides a significant speed boost. 
In my tests on Bugzilla data, five runs of duplicates.cgi took ~27 seconds
without the patch and ~9.5 seconds with it.  Impressive.

r=myk
Attachment #105728 - Flags: review? → review+
a=justdave
Fixed
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
I should say that I find this syntax poor:

SELECT ... FROM bugs... LEFT JOIN longdescs longdescs_email_1 ON bugs.bug_id =
longdescs_email_1.bug_id AND longdescs_email_1.who IN(###)... WHERE ... AND
longdescs_email_1.who IS NOT NULL...

I think it would make more sense (cleaner, easier to read, more correct) as:

SELECT ... FROM bugs... LEFT JOIN longdescs longdescs_email_1 ON bugs.bug_id =
longdescs_email_1.bug_id... WHERE ... AND longdescs_email_1.who = ###...

The MySQL manual also discourages row selection criteria in JOIN statements in
the FROM clause.  Preliminary testing says the queries are equally performant. 
This is perhaps another bug, though.
The mysql manual discourages it because people assume that the two things are
equivalent.

I'm not particularly attached to either form, mind you, although I do think that
the current one is clearer. mysql optimises tehm to be the same, so tahts not an
issue.
QA Contact: matty_is_a_geek → default-qa
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: