Closed Bug 176002 Opened 22 years ago Closed 15 years ago

Move duplicate statistics into the db

Categories

(Bugzilla :: Reporting/Charting, enhancement, P2)


Tracking


RESOLVED FIXED
Bugzilla 3.6

People

(Reporter: bbaetz, Assigned: mkanat)

References

Details

(Keywords: perf, topperf, Whiteboard: [3.6 Focus])

Attachments

(1 file, 7 obsolete files)

I'm sure I filed a bug on this earlier, but I can't find it now, either in bz
or in my mail archives. Oh well. If someone does find it, it's probably better to
dupe that against this one, which actually has some thoughts.

We need to move the duplicate statistics into the db.  bmo currently has about
52,000 duplicate bugs (based on the reports.cgi graph). I don't know how big that
makes bmo's dbm file (since it only has an entry for every unique bug), but I'd
guess that it's at least 40,000 and probably closer to 45,000 entries.

When loading duplicates.cgi, this means that we read in these 45000 entries from
disk, copy them into memory to avoid modifying the tied version, then SELECT
these 45000-odd entries from the database based on the resolution/bug_status of
the target bug, and optionally the product.

For each of these, we run CanSeeBug on each of them. Individually. If that
passes, we then discard the closed ones (if requested) [why can't we do that in
the query, at least, or before CanSeeBug?], and then chuck the entire list to
the template, which then sorts these 45,000 elements.

This takes 10-20 seconds on bmo for each page load, which really sucks, although
I must say that I'm impressed that it only takes 10 seconds to run CanSeeBug
45000 times.

The fix for this is:

a) Move duplicate stats from the dbm files into the db. (This also fixes the fact
that my perl can no longer read the old dupe files from rh7.3 w/o running
db_upgrade, because the formats are not compatible.)
b) Change the query to do a SELECT with the appropriate WHERE clause, LIMIT
$numwanted. Run CanSeeBug on each of those, and if it fails, run with LIMIT
$numfailed OFFSET $oldnumwanted, and repeat until none fail, or we run out of
entries.
c) Push that data to the template.

(b) will be slower in the case where the user can't see most of the bugs, since
we'd run the query lots of times. We don't have to - on a test db of 100,000
bugs, ORDER BY + LIMIT 1000 takes 0.05 seconds, as opposed to 1.34s without the
limit, so we could just do the one query, and then stop caring after we have
enough passes from CanSeeBug. I don't know if we care about that, though - it's
unlikely to be an issue on a real installation.
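
A minimal sketch of approach (b), assuming a hypothetical dupe_stats table and
the existing CanSeeBug() check (variable names are illustrative, not from any
patch):

# Sketch only: fetch in batches and re-query when access checks fail.
my @visible;
my $offset = 0;
while (@visible < $numwanted) {
    my $need = $numwanted - @visible;
    my $ids = $dbh->selectcol_arrayref(
        "SELECT bug_id FROM dupe_stats
          ORDER BY dupe_count DESC
          LIMIT $need OFFSET $offset");
    last unless @$ids;    # ran out of entries
    $offset += @$ids;
    push(@visible, grep { CanSeeBug($_, $userid) } @$ids);
}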

I know that part of the reason we do some of this stuff this way is that in the
future gerv wanted us to sort/filter in the page dynamically, w/o requiring us
to reload the data. We can deal with this when we add that support, I think,
given this _massive_ perf issue.

The hard part is coming up with a db model which doesn't have 45,000 rows for
each day - if we do that, then we'll probably lose most of the perf advantage.

The simple way is to only keep the last week's results, or something.

The better idea is to only store a new row when something changes. To find the
top 100 bugs, we then do:

SELECT bug_id, dupe_count
  FROM dupe_stats a
 WHERE changed_date = (SELECT MAX(b.changed_date) FROM dupe_stats b
                        WHERE a.bug_id = b.bug_id)
 ORDER BY dupe_count DESC LIMIT 100

except we can't, because that's a subselect, and MySQL doesn't support those :)
The alternative is:

CREATE TEMPORARY TABLE tmp (
 ...
);
INSERT INTO tmp
  SELECT bug_id, MAX(dupe_count) AS cnt FROM dupe_stats GROUP BY bug_id;
SELECT bugs.bug_id, tmp.cnt FROM bugs, tmp
 WHERE bugs.bug_id = tmp.bug_id AND <other conds>
 ORDER BY tmp.cnt DESC LIMIT 100;
DROP TABLE tmp;

We can avoid the temp table only if we pull everything out of the db, and sort
in perl for the order-by-dupeCnt case. In that case, it's still more efficient
than the current scheme, but a temp table is probably faster anyway.

For order-by-changed-in-last-n-days, we just have two temp tables, the second
with an additional |date <= (now - 7 days)| added, and ORDER BY
(foo.dupe_count - bar.dupe_count). The alternative, for when we're not sorting
based on that, is to select the vals WHERE bug_id IN (whatever we want to
display), and then do the merge in perl. No idea which way is better/faster/etc.
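
A rough sketch of the two-temp-table variant, assuming the dupe_stats schema
described below (table and alias names are made up; the date arithmetic is
MySQL-flavored):

# Sketch only: compare today's counts with counts as of a week ago.
$dbh->do("CREATE TEMPORARY TABLE cur_counts AS
          SELECT bug_id, MAX(dupe_count) AS cnt
            FROM dupe_stats GROUP BY bug_id");
$dbh->do("CREATE TEMPORARY TABLE old_counts AS
          SELECT bug_id, MAX(dupe_count) AS cnt
            FROM dupe_stats
           WHERE last_changed <= NOW() - INTERVAL 7 DAY
           GROUP BY bug_id");
my $rows = $dbh->selectall_arrayref(
    "SELECT c.bug_id, c.cnt - COALESCE(o.cnt, 0) AS delta
       FROM cur_counts c LEFT JOIN old_counts o ON c.bug_id = o.bug_id
      ORDER BY delta DESC LIMIT 100");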

Anyway, that makes the db schema something like:

DATE last_changed,
INT bug_id,
INT dupe_count

with indexes on last_changed and bug_id.
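
In Bugzilla's Schema.pm style, that might look something like this (a
hypothetical sketch; no table ever landed under this name):

# Hypothetical dupe_stats definition:
dupe_stats => {
    FIELDS => [
        bug_id       => {TYPE => 'INT3', NOTNULL => 1},
        last_changed => {TYPE => 'DATETIME', NOTNULL => 1},
        dupe_count   => {TYPE => 'INT3', NOTNULL => 1},
    ],
    INDEXES => [
        dupe_stats_bug_id_idx       => ['bug_id'],
        dupe_stats_last_changed_idx => ['last_changed'],
    ],
},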

collectstats then grabs the current vals via the above query, grabs today's vals
by scanning the dupes table + consolidating, and for any non-equal ones, it
INSERTs into the table. There's a race condition with half-updated vals, but we
have that currently with the dbm stuff (although the window is smaller). That
can't be fixed until we have transactions, though.
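
As a sketch (hypothetical names: %today comes from scanning the duplicates
table, %current from the MAX(changed_date) query above):

# Sketch only: insert a new row whenever today's count differs.
my $sth = $dbh->prepare(
    "INSERT INTO dupe_stats (bug_id, last_changed, dupe_count)
     VALUES (?, NOW(), ?)");
foreach my $bug_id (keys %today) {
    next if defined $current{$bug_id}
            && $current{$bug_id} == $today{$bug_id};
    $sth->execute($bug_id, $today{$bug_id});
}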

Thoughts?
Blocks: bz-perf
Keywords: perf, topperf
Summary: Move duplicates into the db → Move duplicate statistics into the db
Target Milestone: --- → Bugzilla 2.18
Rather than fetching a large number of items from the DB and then running
CanSeeBug() on each of them, I'd suggest integrating the access check directly
into the original query the same way as is (being) done in Search.pm.  

bbaetz's original analysis is slightly dodgy. After reading the file into memory
(although today's file is almost certainly in cache anyway), we throw away all
dupes with a count < the threshold, which is 5 on b.m.o. This will reduce the
number of bugs considered from 50,000 to far fewer, probably about 5000. 

Moving the stats into the DB is desirable, but an effort. I think we should
immediately make the following changes:
- Incorporate the "openonly" functionality into the query.
- Incorporate the access check into the query
- Incorporate the "numwanted" functionality into the query. Because the access
check is also incorporated, we won't need to iterate.
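
A rough sketch of the first and third items (hypothetical variable names; the
access check, which would reuse Search.pm-style group SQL, is omitted here):

# Sketch only: fold "open bugs only" and the row limit into the SQL
# instead of filtering thousands of candidates in Perl.
my $open_sql = $openonly ? "AND bugs.resolution = ''" : "";
my $dup_ids = $dbh->selectcol_arrayref(
    "SELECT bug_id FROM bugs
      WHERE bug_id IN ($candidate_ids) $open_sql
      LIMIT $numwanted");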

This should make a significant difference. Then, we can work out what schema we
need to move the stats into the DB.

Gerv
Ah, I missed the prelimiting, and that does explain why it's so fast, relatively
speaking. Whilst the file may be in cache, we copy it, and I'm not sure whether
Solaris' copy-on-write code mixes with the internal refcounting Perl must end up
doing.

In any event, 5000 is still way too many queries to do. Moving the open part
into there will help, though. Maybe we should split off another bug for that,
and keep this one open? I can get pg doing the subselect I mentioned earlier on
100,000 duplicates in 0.8 seconds, and the second form (which we'd need for
mysql) in 1.1 seconds on my PIII-500. Can you beat that :)

Note that gerv's plan won't work for the case where we're sorting by '# changed
in last n days'. That's not the default, though, so it should still be a perf
increase in the general case.
Depends on: 176599
Priority: -- → P2
OK, so I'm having a look at this. Given the trouble we had migrating b.m.o., I'd
like to eliminate the pesky dbm files if at all possible.

However, bbaetz' suggested DB implementation method scares me (as someone not
incredibly well-versed in SQL-fu).

bbaetz: just to check, is that still your best idea?

Gerv
Umm. It's been almost a year; let me think about this a bit more.
Note that the data/duplicates directory on b.m.o is currently 3.0GB and growing
by 4.5MB a day, so this would be useful for space reasons alone (if MySQL is
more space efficient than DB, or if it can be made so).
If we add indexes, it would take a fair amount of space.

I also want subselects, so I can try to do this using my schema. If we just
import stuff in as-is, I don't know if we get much benefit. Someone should try
it and see....
Pushing out bugs that aren't blockers.  If someone's working on one of these, we
can move it back.
Target Milestone: Bugzilla 2.18 → Bugzilla 2.20
Severity: critical → enhancement
Bugzilla 2.20 feature set is now frozen as of 15 Sept 2004.  Anything flagged
enhancement that hasn't already landed is being pushed out.  If this bug is
otherwise ready to land, we'll handle it on a case-by-case basis, please set the
blocking2.20 flag to '?' if you think it qualifies.
Target Milestone: Bugzilla 2.20 → Bugzilla 2.22
QA Contact: mattyt-bugzilla → default-qa
Target Milestone: Bugzilla 2.22 → ---
OS: Linux → All
Hardware: PC → All
(In reply to comment #6)
> Note that the data/duplicates directory on b.m.o is currently 3.0GB and growing
> by 4.5MB a day, so this would be useful for space reasons alone (if MySQL is
> more space efficient than DB, or if it can be made so).

I'm facing the space problem now, too. I'm backing up our Bugzilla instance and I'm not sure how those data/duplicates/dupes* files were created, or if they are even necessary. The manual does not talk about them.

In the meantime, is it safe to delete them? At least the older ones?
You can delete them if you don't want historical duplicates statistics in duplicates.cgi. If you delete say half, you don't get comparative stats from those dates.

Gerv
(In reply to comment #11)
> You can delete them if you don't want historical duplicates statistics in
> duplicates.cgi. If you delete say half, you don't get comparative stats from
> those dates.

Could all of them be regenerated if needed using "collectstats.pl --regenerate"? Or would I be permanently losing valuable information?
No, collectstats.pl --regenerate does not regenerate this information.

Gerv
For recently added dups:

SELECT dupe_of, COUNT(dupe) AS last_added FROM duplicates INNER JOIN bugs_activity ON bugs_activity.bug_id = duplicates.dupe WHERE added = 'DUPLICATE' AND bug_when > '2007-01-01' GROUP BY dupe_of;

For total dupe count:

SELECT dupe_of, COUNT(dupe) FROM duplicates INNER JOIN bugs_activity ON bugs_activity.bug_id = duplicates.dupe WHERE added = 'DUPLICATE' GROUP BY dupe_of;

I don't see anything wrong with that. I also don't see any performance issue on landfill, although I'd be interested in how long those queries take on bmo.
Assignee: gerv → mkanat
Okay, both of those queries seem to be extremely fast. And if you want them even faster:

CREATE INDEX bugs_activity_added_idx ON bugs_activity (added(64));
Attached patch v1 (obsolete) — Splinter Review
I really just couldn't believe that this bug was still open, so I went ahead and fixed it.

This won't show negative changes anymore--that is, if a bug loses dups, it won't show that it lost dups. It will only show that dups were added. But I really don't think that's important data.
Attachment #257668 - Flags: review?(bugzilla-mozilla)
Attached patch v2 (obsolete) — Splinter Review
I also removed the data/duplicates directory, in this version of the patch.
Attachment #257668 - Attachment is obsolete: true
Attachment #257669 - Flags: review?(bugzilla-mozilla)
Attachment #257668 - Flags: review?(bugzilla-mozilla)
Oh, also note that I verified performance on an installation with 70,000 dupes and 1.2 million bugs_activity rows, and performance is very good.
Status: NEW → ASSIGNED
Attachment #257669 - Flags: review?(bugzilla-mozilla) → review?(LpSolit)
Comment on attachment 257669 [details] [diff] [review]
v2

>Index: duplicates.cgi

>+if ($sortvisible && scalar @buglist) {
>+    $buglist_and = "AND dupe_of IN (" . join(',', @buglist) . ")";
>+}

This doesn't work. If bug 1 is a dupe of bug 2 which itself is a dupe of bug 3, and your buglist contains bug 3 only, you want both 1 -> 2 and 2 -> 3 which will end as 2 dupes for bug 3 (instead of 1). This AND part will prevent indirect dupes.


>+    "SELECT dupe_of, COUNT(dupe)
>+       FROM duplicates INNER JOIN bugs_activity 
>+                       ON bugs_activity.bug_id = duplicates.dupe
>+      WHERE added = 'DUPLICATE' AND fieldid = $reso_field_id

I don't see why you join the bugs_activity table in this query. If there is an entry in the duplicates tables, then you already know which bugs are dupes. This is only useful if you want to limit the result based on time.


>+            $buglist_and
>+   GROUP BY dupe_of " . $dbh->sql_limit($maxrows), {Columns => [1,2]})};

First, you should use bz_group_by(), then $maxrows is the number of bugs to display, not the number of rows to extract from the DB. With my previous example above, you extract 2 rows from the DB, but only one bug will be displayed in the UI (bug 3). So you cannot use $maxrows here.


>+    "SELECT dupe, dupe_of FROM duplicates
>+      WHERE dupe IN (" . join(',', keys %total_dups) . ")", 
>+    {Columns => [1,2]})};
>+add_indirect_dups(\%total_dups, \%dupe_relation);

This looks wrong to me. You never loop over consecutive dupes as in my example above.


>+   GROUP BY dupe_of " . $dbh->sql_limit($maxrows), {Columns=>[1,2]},

Same comments as above about GROUP BY and $maxrows.


>-if (scalar(%count)) {
>+if (scalar %total_dups) {

You forgot to first restrict the hash to keys having values larger than the 'mostfreqthreshold' parameter.


>-$vars->{'dobefore'} = $dobefore;
>+$vars->{'dobefore'} = 1;

dobefore = 1 only if the data is limited in time (in opposition to displaying all the data).


Also, the days_ago() function is no longer used; you can remove it.
Attachment #257669 - Flags: review?(LpSolit) → review-
When a bug is marked as a dup, the cc: list for the dup should be consolidated with that for the surviving bug, so that those who reported/commented on it can track the bug fix.
Case in point: Several bugs were recently consolidated into 441175, but many of those who reported/confirmed various symptoms of the underlying problem (mishandling slow response times) are not on its cc: list. If that's too complicated, maybe an email to the cc: list announcing the new bug number?
FWIW, I believe I tested the code and consecutive dupes worked fine, but I will test again. That's the whole purpose of add_indirect_dups.
Target Milestone: --- → Bugzilla 4.0
Whiteboard: [3.6 Focus]
  Okay, I'm testing now and I have some responses! :-)

(In reply to comment #19)
> This doesn't work. If bug 1 is a dupe of bug 2 which itself is a dupe of bug 3,
> and your buglist contains bug 3 only, you want both 1 -> 2 and 2 -> 3 which
> will end as 2 dupes for bug 3 (instead of 1). This AND part will prevent
> indirect dupes.

  Actually, it does work. I just tested it. There is a problem when doing LIMIT on the list, though, which I will have to fix.

> This is only useful if you want to limit the result based on time.

  Yeah, you're right. I know I did this for some reason, but I can't remember why, right now. I'll investigate.

> First, you should use bz_group_by(), 

  No, don't need to, because we're actually already grouping by every non-aggregate column in the SELECT.

> then $maxrows is the number of bugs to
> display, not the number of rows to extract from the DB. With my previous
> example above, you extract 2 rows from the DB, but only one bug will be
> displayed in the UI (bug 3). So you cannot use $maxrows here.

  Yeah, that's right, I'll have to fix that.

> This looks wrong to me. You never loop over consecutive dupes as in my example
> above.

  It works. :-)

> >-if (scalar(%count)) {
> >+if (scalar %total_dups) {
> 
> You forgot to first restrict the hash to keys having values larger than the
> 'mostfreqthreshold' parameter.

  Indeed. I will have to fix that.

> >-$vars->{'dobefore'} = $dobefore;
> >+$vars->{'dobefore'} = 1;
> 
> dobefore = 1 only if the data is limited in time (in opposition to displaying
> all the data).

  The data is always limited in time.
Target Milestone: Bugzilla 4.0 → Bugzilla 3.6
Attached patch v3 (obsolete) — Splinter Review
Okay, here we go. This should fix all the actual bugs in the script. Let me know if you discover any more from testing.
Attachment #257669 - Attachment is obsolete: true
Attachment #395532 - Flags: review?(LpSolit)
Attachment #395532 - Flags: review?(LpSolit) → review-
Comment on attachment 395532 [details] [diff] [review]
v3

>Index: collectstats.pl

>-sub calculate_dupes {

>-    # Now we collapse the dupe tree by iterating over %count until
>-    # there is no further change.
>-    while ($changed == 1)
>-    {
>-        $changed = 0;
>-        foreach $key (keys(%count)) {
>-            # if this bug is actually itself a dupe, and has a count...
>-            if (defined($dupes{$key}) && $count{$key} > 0) {
>-                # add that count onto the bug it is a dupe of,
>-                # and zero the count; the check is to avoid
>-                # loops
>-                if ($count{$dupes{$key}} != 0) {
>-                    $count{$dupes{$key}} += $count{$key};
>-                    $count{$key} = 0;
>-                    $changed = 1;
>-                }
>-            }
>-        }
>-    }

See how the original code iterates, using while (...).



>Index: duplicates.cgi

>+sub add_indirect_dups {

>+    foreach my $add_from (keys %$dupes) {
>+        my $add_to     = $dupes->{$add_from};
>+        my $add_amount = delete $counts->{$add_from} || 0;
>+        $counts->{$add_to} += $add_amount;
>+    }

Now compare the original code with this subroutine. Here you loop only once and stop. Imagine the following situation:

$dupes = { 2 => 1, 3 => 2 } and keys %$dupes returns qw(2 3) (that's possible as keys() returns the list in no specific order). You will first put the number of dupes of bug 2 in bug 1, *then* the number of dupes of bug 3 in bug 2, meaning that you end with bug 2 and bug 1 having dupes, instead of bug 1 having all dupes on it. To avoid this problem, you need a while() loop and exit only when no changes occur anymore.
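
For illustration, one order-independent approach is to walk each bug's dup
chain to its final target before transferring the counts; this is essentially
what later versions of the patch end up doing with walk_dup_chain(). A sketch
only, not code from any attachment:

sub add_indirect_dups {
    my ($counts, $dupes) = @_;
    foreach my $from (keys %$dupes) {
        # Walk to the end of the dup chain, guarding against loops
        # in the data.
        my $to   = $dupes->{$from};
        my %seen = ($from => 1);
        while (defined $dupes->{$to} && !$seen{$to}) {
            $seen{$to} = 1;
            $to = $dupes->{$to};
        }
        my $amount = delete($counts->{$from}) || 0;
        $counts->{$to} += $amount;
    }
}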


> my $cgi = Bugzilla->cgi;
> my $template = Bugzilla->template;
> my $vars = {};

While you are on it, please remove the |if ($::ENV{'GATEWAY_INTERFACE'} eq "cmdline")| check as collectstats.pl no longer calls duplicates.cgi. Also, remove all mentions of collectstats.pl at lines 23 and 55.


>+detaint_natural($_) foreach @buglist;

If I pass non-integers, you will pass undefined values to the DB. You have to filter these values.


>+my %dupe_relation = @{$dbh->selectcol_arrayref(
>+    "SELECT dupe, dupe_of FROM duplicates
>+      WHERE dupe IN (" . join(',', keys %total_dups) . ")", 
>+    {Columns => [1,2]})};

dupe IN (...) looks wrong to me. You are looking for bugs from the previous list which are dupes. You miss indirect dupes.


>+my %since_dups = @{$dbh->selectcol_arrayref(
>+    "SELECT dupe_of, COUNT(dupe)
>+       FROM duplicates INNER JOIN bugs_activity 
>+                       ON bugs_activity.bug_id = duplicates.dupe 
>+      WHERE added = 'DUPLICATE' AND fieldid = ? AND " 
>+            . $dbh->sql_to_days('bug_when') . " >= (" 
>+            . $dbh->sql_to_days('NOW()') . " - ?)
>+   GROUP BY dupe_of", {Columns=>[1,2]},
>+    $reso_field_id, $changedsince)};

Here, you don't restrict the list to some specific list, which looks right to me (compared to the previous query).


>-$vars->{'dobefore'} = $dobefore;
>+$vars->{'dobefore'} = 1;

If dobefore is always 1, it's no longer a parameter. Drop it from duplicates.cgi and fix duplicates-table.html.tmpl accordingly.



>Index: Bugzilla/DB/Schema.pm

>-            added     => {TYPE => 'TINYTEXT'},
>+            added     => {TYPE => 'varchar(255)'},
>             removed   => {TYPE => 'TINYTEXT'},

Wouldn't it make sense to also update the "removed" column, for consistency?
(In reply to comment #24)
> Now compare the original code with this subroutine. Here you loop only once and
> stop. Imagine the following situation:
> 
> $dupes = { 2 => 1, 3 => 2 } and keys %$dupes returns qw(2 3) (that's possible
> as keys() returns the list in no specific order). You will first put the number
> of dupes of bug 2 in bug 1, *then* the number of dupes of bug 3 in bug 2,
> meaning that you end with bug 2 and bug 1 having dupes, instead of bug 1 having
> all dupes on it. To avoid this problem, you need a while() loop and exit only
> when no changes occur anymore.

  Mmm, I suspect you are incorrect. I am unable to produce a situation in which the code counts dupes incorrectly in actual practice. I think that my code works because of the particular SQL I used.

> While you are on it, please remove the |if ($::ENV{'GATEWAY_INTERFACE'} eq
> "cmdline")| check as collectstats.pl no longer calls duplicates.cgi. Also,
> remove all mentions of collectstats.pl at lines 23 and 55.

  We could do that pretty simply in another bug.

> >+detaint_natural($_) foreach @buglist;
> 
> If I pass non-integers, you will pass undefined values to the DB. You have to
> filter these values.

  The script doesn't accept non-integers, and NULL values don't match anything, so there's not a problem there.

> dupe IN (...) looks wrong to me. You are looking for bugs from the previous
> list which are dupes. You miss indirect dupes.

  I'm pretty sure it's not wrong. Try to produce a situation in which it counts dupes incorrectly.

> If dobefore is always 1, it's no longer a parameter. Drop it from
> duplicates.cgi and fix duplicates-table.html.tmpl accordingly.

  That could be done.

> >-            added     => {TYPE => 'TINYTEXT'},
> >+            added     => {TYPE => 'varchar(255)'},
> >             removed   => {TYPE => 'TINYTEXT'},
> 
> Wouldn't it make sense to also update the "removed" column, for consistency?

  Not necessary at this time for this bug.
(In reply to comment #25)
>   Mmm, I suspect you are incorrect. I am unable to produce a situation in which
> the code counts dupes incorrectly in actual practice. I think that my code
> works because of the particular SQL I used.

  Hmm, actually, I think the reason that you are incorrect is that $count already is populated by the direct dupes, and the other hash is just the indirect dupes, which always have a count of 1. In other words, the code doesn't work like the situation you described. I probably should add a comment as to exactly how and why it works.
Attached patch v4 (obsolete) — Splinter Review
Okay, you were right about the GATEWAY_INTERFACE stuff. That all has now gone away. And I removed dobefore and edited the template to work properly now.

As far as the add_indirect_dups stuff, I added a comment to it that hopefully will explain how it works and explain to you why all the other issues you pointed out aren't actually problems. (Unless you can find some actual situation in which the code doesn't work when tested.)
Attachment #395532 - Attachment is obsolete: true
Attachment #397817 - Flags: review?(LpSolit)
Comment on attachment 397817 [details] [diff] [review]
v4

>Index: duplicates.cgi

>+sub add_indirect_dups {
>+    my ($counts, $dupes) = @_;
>+
>+    foreach my $add_from (keys %$dupes) {
>+        my $add_to     = $dupes->{$add_from};
>+        my $add_amount = delete $counts->{$add_from} || 0;
>         print "Transferring $add_amount dupes from bug $add_from to bug $add_to ..."; # debug
>+        $counts->{$add_to} += $add_amount;
>         print " bug $add_to now has " . $counts->{$add_to} . " dupes<p>\n"; # debug
>+    }
>+}

You wanted real testing, so here you have it. What I said in my review is correct; see the coming screenshot. I added above the two debug lines I used to get the messages appearing in the screenshot, so you know where these messages come from.
Attachment #397817 - Flags: review?(LpSolit) → review-
As you can see, dupes on bug 1336 are transferred to bug 1337 *before* dupes of bug 1328 are transferred to bug 1336, so bug 1336 does not pass on all its indirect dupes correctly. On the right you can see the content of the duplicates table.
(In reply to comment #25)
>   The script doesn't accept non-integers, and NULL values don't match anything,
> so there's not a problem there.

It's a problem:

DBD::mysql::db selectcol_arrayref failed: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '1193,1113,1101,1001,1)
   GROUP BY dupe_of' at line 3 [for Statement "SELECT dupe_of, COUNT(dupe)
       FROM duplicates
      WHERE 1 = 1 AND dupe_of IN (1337,1033,,1193,1113,1101,1001,1)
   GROUP BY dupe_of"] at /var/www/html/bugzilla/duplicates.cgi line 107
 at /var/www/html/bugzilla/duplicates.cgi line 107

Note the ,, in the IN (...) part.
Attached patch v5 (obsolete) — Splinter Review
Ah, thank you for that. :-) I've fixed it by always adding the duplicates to the ultimate target bug, which I think should be faster than looping through all the keys over and over. (My concern was Bugzilla DBs with tens of thousands or hundreds of thousands of duplicates.)

Also, I realized that the no_dupe_stats errors are all obsolete now, and I removed them from user-error.html.tmpl. But it seems like the 012throwables.t test didn't catch that those errors were obsolete. Is that a bug in 012throwables?
Attachment #397817 - Attachment is obsolete: true
Attachment #397924 - Attachment is obsolete: true
Attachment #397966 - Flags: review?(LpSolit)
(In reply to comment #31)
> removed them from user-error.html.tmpl. But it seems like the 012throwables.t
> test didn't catch that those errors were obsolete. Is that a bug in
> 012throwables?

It catches them, but only as WARNINGs instead of ERRORs as it's less critical to have unused messages than missing ones. The -v flag shows them.
Comment on attachment 397966 [details] [diff] [review]
v5

>Index: collectstats.pl

>-use AnyDBM_File;
> use IO::Handle;

IO::Handle can go away too.



>Index: duplicates.cgi

SQL queries do not work as expected when $sortvisible = 1, because you limit searches for dupes to bugs visible in the list and so you are basically only looking at direct dupes. I think that's the very last point which still fails.
Attachment #397966 - Flags: review?(LpSolit) → review-
Attached patch v6 (obsolete) — Splinter Review
Okay, I fixed that problem. Thanks for the note about IO::Handle, too. Removed that.

The tests actually didn't show anything at all, not even a WARNING.
Attachment #397966 - Attachment is obsolete: true
Attachment #398319 - Flags: review?(LpSolit)
Comment on attachment 398319 [details] [diff] [review]
v6

>Index: duplicates.cgi

>+sub walk_dup_chain {
>+    my ($dups, $from_id) = @_;
>+    my $to_id = $dups->{$from_id};
>+    while (my $bug_id = $dups->{$to_id}) {
>+        last if $bug_id == $from_id; # avoid duplicate loops
>+        $to_id = $bug_id;
>+    }
>+    return $to_id;
>+}

You should write: $dups->{$from_id} = $to_id; right before "return $to_id" so that you don't need to go through the tree again the next time you get the same $from_id.
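
That is, applied to the sub above (a sketch of the suggestion, not the code as committed):

sub walk_dup_chain {
    my ($dups, $from_id) = @_;
    my $to_id = $dups->{$from_id};
    while (my $bug_id = $dups->{$to_id}) {
        last if $bug_id == $from_id; # avoid duplicate loops
        $to_id = $bug_id;
    }
    # Memoize the final target so later lookups for this bug are O(1).
    $dups->{$from_id} = $to_id;
    return $to_id;
}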


>+detaint_natural($_) foreach @buglist;
>+# If we got any non-numeric items, they will now be undef. Remove them from
>+# the list.
>+@buglist = grep($_, @buglist);

Nit: this could simply be written as:
  @buglist = grep(detaint_natural($_), @buglist);


>+my %total_dups = @{$dbh->selectcol_arrayref(
>+    "SELECT dupe_of, COUNT(dupe)
>+       FROM duplicates
>+      WHERE 1 = 1
>+   GROUP BY dupe_of", {Columns => [1,2]})};

WHERE 1 = 1 should go away.


>+my %dupe_relation = @{$dbh->selectcol_arrayref(
>+    "SELECT dupe, dupe_of FROM duplicates
>+      WHERE dupe IN (" . join(',', keys %total_dups) . ")", 

To let the DB server optimize the query, and because %total_dups now contains all bugs in the dupe_of column in all cases, you should write:

"SELECT dupe, dupe_of FROM duplicates WHERE dupe IN (SELECT dupe_of FROM duplicates)". I tested this SQL query on both MySQL and PostgreSQL successfully. Also, hardcoded "IN ()" should be replaced by sql_in() as the list may be quite large and would crash Oracle.


>+add_indirect_dups(\%since_dups, \%dupe_relation);

With my suggestion above about walk_dup_chain(), we would loop over %dupe_relation very quickly.


>+    if ($sortvisible and @buglist and !grep($_ eq $id, @buglist)) {

@buglist only contains integers. 'eq' should be '=='.


Your patch seems to work fine now. r=LpSolit with my comments above addressed. Note that I didn't do any perf testing, so I assume you did it and that there is no perf regression at all.
Attachment #398319 - Flags: review?(LpSolit) → review+
Holding approval till the patch is updated.
Flags: approval?
Attached patch v7Splinter Review
I did some experiments with the query planner, and you were indeed right about the SQL.

I did everything else you suggested, too, except combining the grep and the map--I think it's nicer to have the comment attached to exactly the part where it belongs.

As far as performance, I did test and I don't see any significant regressions at all on my test data set (including about 20,000 duplicate bugs). If there are any significant issues once we see this in production, since we're now in SQL, we can definitely profile the code and handle it.
Attachment #398319 - Attachment is obsolete: true
Attachment #398922 - Flags: review+
Comment on attachment 398922 [details] [diff] [review]
v7

>Index: duplicates.cgi

>+    if ($sortvisible and @buglist and !grep($_ eq $id, @buglist)) {

You forgot to s/eq/==/.


When you say "I don't see any significant regressions at all", do you mean there is some minor regression anyway, or that there is no regression at all but also no improvement? In other words, is the global effect neutral or is there some improvement/regression anyway?
(In reply to comment #38)
> You forgot to s/eq/==/.

  Oh. I made that change locally. I must have forgotten to do the diff afterward.

> When you say "I don't see any significant regressions at all", do you mean
> there is some minor regression anyway, or that there is no regression at all
> but also no improvement? In other words, is the global effect neutral or is
> there some improvement/regression anyway?

  I mean that I don't perceive any significant difference.
Flags: approval? → approval+
I did the == fix on checkin.

Checking in collectstats.pl;
/cvsroot/mozilla/webtools/bugzilla/collectstats.pl,v  <--  collectstats.pl
new revision: 1.71; previous revision: 1.70
done
Checking in duplicates.cgi;
/cvsroot/mozilla/webtools/bugzilla/duplicates.cgi,v  <--  duplicates.cgi
new revision: 1.64; previous revision: 1.63
done
Checking in Bugzilla/DB/Schema.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB/Schema.pm,v  <--  Schema.pm
new revision: 1.122; previous revision: 1.121
done
Checking in Bugzilla/Install/DB.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Install/DB.pm,v  <--  DB.pm
new revision: 1.70; previous revision: 1.69
done
Checking in Bugzilla/Install/Filesystem.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Install/Filesystem.pm,v  <--  Filesystem.pm
new revision: 1.37; previous revision: 1.36
done
Checking in template/en/default/global/user-error.html.tmpl;
/cvsroot/mozilla/webtools/bugzilla/template/en/default/global/user-error.html.tmpl,v  <--  user-error.html.tmpl
new revision: 1.286; previous revision: 1.285
done
Checking in template/en/default/reports/duplicates-table.html.tmpl;
/cvsroot/mozilla/webtools/bugzilla/template/en/default/reports/duplicates-table.html.tmpl,v  <--  duplicates-table.html.tmpl
new revision: 1.15; previous revision: 1.14
done
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Keywords: relnote
Resolution: --- → FIXED
Blocks: 517761
Blocks: 543459
Added to the release notes in bug 547466.
Keywords: relnote