Closed Bug 127200 Opened 23 years ago Closed 22 years ago

Query for CC/longdesc/OR takes long time

Categories

(Bugzilla :: Query/Bug List, defect, P2)

2.15
Other
Other

Tracking

()

RESOLVED FIXED
Bugzilla 2.18

People

(Reporter: tessarakt, Assigned: bugreport)

References

()

Details

(Keywords: perf)

Attachments

(1 file, 4 obsolete files)

Reproduction:

1. Go to Bugzilla page
2. Go to query page
3. Uncheck entries under "Status"
4. Enter "jens@unfaehig.de" under e-mail adress, exact match, matching field CC
5. Run query

The query takes a long time (such a long time that I haven't waited for it to
finish, aborted it after some minutes).

Maybe no index in the database for that field?
SELECT bugs.bug_id, bugs.groupset, substring(bugs.bug_severity, 1, 3),
  substring(bugs.priority, 1, 3), substring(bugs.rep_platform, 1, 3),
  map_assigned_to.login_name, substring(bugs.bug_status,1,4),
  substring(bugs.resolution,1,4), substring(bugs.short_desc, 1, 60)
FROM bugs, profiles map_assigned_to, profiles map_reporter
  LEFT JOIN profiles map_qa_contact ON bugs.qa_contact = map_qa_contact.userid
  LEFT JOIN cc cc_ ON bugs.bug_id = cc_.bug_id
  LEFT JOIN profiles map_cc_ ON cc_.who = map_cc_.userid
WHERE ((bugs.groupset & 0) = bugs.groupset )
  AND bugs.assigned_to = map_assigned_to.userid
  AND bugs.reporter = map_reporter.userid
  AND (map_cc_.login_name = 'jens@unfaehig.de')
...

+-----------------+--------+----------------------+---------+---------+------------------+------+---------------------------------------------+
| table           | type   | possible_keys        | key     | key_len |
ref              | rows | Extra                                       |
+-----------------+--------+----------------------+---------+---------+------------------+------+---------------------------------------------+
| bugs            | ALL    | assigned_to,reporter | NULL    |    NULL |
NULL             |    5 | where used; Using temporary; Using filesort |
| map_assigned_to | eq_ref | PRIMARY              | PRIMARY |       3 |
bugs.assigned_to |    1 |                                             |
| map_reporter    | index  | PRIMARY              | PRIMARY |       3 |
NULL             |    4 | where used; Using index                     |
| map_qa_contact  | eq_ref | PRIMARY              | PRIMARY |       3 |
bugs.qa_contact  |    1 | Using index                                 |
| cc_             | index  | bug_id               | bug_id  |       6 |
NULL             |    4 | Using index                                 |
| map_cc_         | eq_ref | PRIMARY              | PRIMARY |       3 |
cc_.who          |    1 | where used                                  |
+-----------------+--------+----------------------+---------+---------+------------------+------+---------------------------------------------+
Thats because mysql is examining every table in the bugs row. On a db with 442
bugs, and a simplified query

mysql> EXPLAIN SELECT bugs.bug_id FROM profiles map_assigned_to INNER JOIN bugs
ON bugs.assigned_to=map_assigned_to.userid LEFT JOIN cc ON bugs.bug_id =
cc.bug_id WHERE cc.who='1'\G
*************************** 1. row ***************************
        table: bugs
         type: ALL
possible_keys: assigned_to
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 442
        Extra: 
*************************** 2. row ***************************
        table: map_assigned_to
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 3
          ref: bugs.assigned_to
         rows: 1
        Extra: Using index
*************************** 3. row ***************************
        table: cc
         type: ref
possible_keys: bug_id
          key: bug_id
      key_len: 3
          ref: bugs.bug_id
         rows: 1
        Extra: where used; Using index
3 rows in set (0.00 sec)

If I remove the join to the assignee table (which isn't being used):

mysql> EXPLAIN SELECT bugs.bug_id FROM bugs LEFT JOIN cc ON bugs.bug_id =
cc.bug_id WHERE cc.who='1'\G
*************************** 1. row ***************************
        table: bugs
         type: index
possible_keys: NULL
          key: PRIMARY
      key_len: 3
          ref: NULL
         rows: 442
        Extra: Using index
*************************** 2. row ***************************
        table: cc
         type: ref
possible_keys: bug_id
          key: bug_id
      key_len: 3
          ref: bugs.bug_id
         rows: 1
        Extra: where used; Using index
2 rows in set (0.00 sec)

Also see bug 96101.
Wow, shouldn't it look up the CCs table first since it actually has only one key
to look up for that?
No, it can't, because its a left join, so it has to find the non-matching bits
too. This means that it has to sequencially scan the table

On a test db with 10000 bugs, 30000 cc, and 25000 profiles:

SELECT bugs.bug_id,map_assigned_to.login_name FROM profiles map_assigned_to,
bugs LEFT JOIN cc cc_ ON bugs.bug_id = cc_.bug_id LEFT JOIN profiles map_cc_ ON
cc_.who = map_cc_.userid WHERE bugs.assigned_to=map_assigned_to.userid AND
map_cc_.login_name='QI';

takes 3.62 seconds.

Changing the second LEFT JOIN to an INNER JOIN takes this to 1.23 seconds. Why
isn't this done? The only difference will be if there is someone on the cc list
who is not in the profiles table (actually, they'd have to be the only person in
this case). Thats invalid, and I don't think we should slow down queries because
of people who may have data corruption, so that should be a plain JOIN. (I note
that we do the same thing with the qa contact, but thats a trickier issue
because there it is optional. We probably need to rethink that)

Any objections to making that change for 2.16?

Changing the first LEFT JOIN to an INNER JOIN as well makes this take 0.00
seconds. For cc equals, then we could use an inner join. For cc not equals, do
you want to match bugs with no ccs? Tihs gets us back into the problems of bug
109528 of contains vs equals. We'll probably revisit that for 2.18, I guess.
select count(cc.bug_id) from cc left join profiles ON cc.who=profiles.userid
where profiles.userid=NULL;

returns 0 on bmo, and if bmo doesn't have corruption, then noone does, right?.
This is detected by sanity check anyway.

Taking for 2.16 to test with. I'm reasonably certain that the first left join
means that the bug row will still get through anyway, but I have to check,
hopefully later this week.

The line in question in the join at about line 368 of buglist.cgi
Assignee: endico → bbaetz
Severity: normal → major
Keywords: perf
Priority: -- → P2
Target Milestone: --- → Bugzilla 2.16
Attached patch patch (obsolete) — Splinter Review
OK, try this. This removes both inner joins, on the basis that "not contains"
doesn't pick up bugs with no ccs at the moment anyway.

This is, I think, provably identical - the condition in $f restricted the
column in the last table to be non null, so the left join is identical to the
inner join, since the only difference between a left join and an inner join is
that non matching rows in the join get NULL as entries.

The reason that "does not contain" fails is because INSTR(NULL) == NULL != 0,
and NULL fails all the = and != conditionals anyway.

We so need to clean this up, but not for 2.16
Keywords: patch, review
Comment on attachment 72194 [details] [diff] [review]
patch

This looks good.  Works and passes tests.  I haven't measured performance but
that it works should be a no-brainer.

I was a little suspicious, so I looked back at the history of this code.  This
appears to have been inserted as is (rather than changed from inner to left
outer), and there was a change to this code that was backed out, but this
doesn't seem to have the same problem.
Attachment #72194 - Flags: review+
I have a saved query called "Open bugs I care about", which searches for (among
other things) my email address in the CC: field.  The query doesn't work.  My
browser (Mozilla 0.99) waits for Bugzilla to return some result, but it never
gets it!  After about a minute of waiting, it appears that Mozilla just gives
up, and I'm left with a web page that just says, "Please stand by ...".

So the problem is not just that it takes a long time.  The problem is that it
takes too long.
Comment on attachment 72194 [details] [diff] [review]
patch

r= justdave
Attachment #72194 - Flags: review+
Checking in buglist.cgi;
/cvsroot/mozilla/webtools/bugzilla/buglist.cgi,v  <--  buglist.cgi
new revision: 1.162; previous revision: 1.161
done
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
This patch is invalid, as has been backed out. See bug 139759.

In particular see bug 139759 comment #6, which gives some possible solutions to
this problem.

reopening, and -> 2.18
Status: RESOLVED → REOPENED
Keywords: patch, review
Resolution: FIXED → ---
Target Milestone: Bugzilla 2.16 → Bugzilla 2.18
Comment on attachment 72194 [details] [diff] [review]
patch

patch was vetoed by bbeatz and backed out.  Marking needs-work on his behalf
Attachment #72194 - Flags: review-
*** Bug 144966 has been marked as a duplicate of this bug. ***
Bug 144966 is not a duplicate of this one, rather this is a duplicate/subset of
that one. (the problem is not just with CC but with most of the email search fields)

What about adding a box "Fast search (may fail on corrupt profiles)"?

Blocks 133173 ("Change "My Bugs" into a preset query") - the user can't query
his own bugs if it takes so long. (personally, with all the boxes by email
matched, I waited over 30 mins for results before giving up.)
Blocks: 133173
The actual problem is 'queries with OR in them are slow'. queries involving CC
tend to be the easiest way to run into this, but the comments (longdescs in the
schema) have the same problem.

The actual problem is a hard problem to solve, in general. Since bugzilla only
works on mysql, we don't know if changing databases would speed this up.

My fix for this, which possibly could have been extended to longdecs didn't
work. Its possible to fix this with subselects, except mysql doesn't support
those. We could simulate subselects with a separate query, and thats somethign
to look into.
Summary: Query for CC takes long time → Query for CC/longdesc/OR takes long time
I don't think this is related to OR, I don't see any ORs or sets in the
description of this problem.
Blocks: bz-perf
if Bug 144966 is not a dup of this one
shouldn't it be reopened ?

The actual problem is more likely this.....


When a user chooses to search for an email address in CC list or comments,
Search.pm takes the single matching email and turns it into (paraphrased) ....

SELECT ...otherstuff..... FROM bugs LEFT JOIN cc ON bugs.id = cc.bug_id LEFT
JOIN profiles AS map_cc ON cc.who = map_cc.id LEFT JOIN longdescs ON bugs.id =
longdescs.bug_id LEFT JOIN profiles AS map_comment ON longdescs.who =
map_comment.id LEFT_JOIN profiles AS map_reporter ON bugs.reporter =
map_reporter.id ....othestuff...  WHERE (map_comment.login_name =
'sameuser@mozilla.org' OR map_cc.login_name = 'sameuser@mozilla.org' OR
map_reporter - 'sameuser@mozilla.org') ..otherstuff..

This means that, just for this function, every bug expands to count(cc) *
count(comments) rows each of which is joined to a profile 3 times and then
matched to the form field 3 times.
 
In the usual case where a moderate list of users match, this can really be done
in a manner requiring only 1 row per bug for this function...

first look up userids of matched user..

SELECT ...otherstuff..... FROM bugs LEFT JOIN cc ON bugs.id = cc.bug_id AND
cc.who IN(useridlist) LEFT JOIN longdescs ON bugs.id = longdescs.bug_id AND
logdescs.who IN(useridlist) ...otherstuff... WHERE ((cc.who IS NOT NULL) OR
(longdescs.who IS NOT NULL) OR (bugs.reporter IN (useridlist))) .. otherstuff..


Implementation note...

In Search.pm.

Check the $email1 and $email2 fields for matching email addresses directly from
profiles.  During the initial processing of hard-coded stuff, load an array with
a list of matching userids ONLY if there are a reasonable number of matches (< 50?).

In the funcdefs, prior to  any "LEFT JOIN profiles" operation, check what is
about to be included in the query to see if it can be done with the list already
stored.


*** Bug 175425 has been marked as a duplicate of this bug. ***
Attached patch Rough patch (obsolete) — Splinter Review
This patch causes exact matches to be handled with far fewer joins.  I believe
that this is really the crux of the matter.  If this causes the email "is"
queries to become much faster than they used to be, then the patch should be
updated to handle email "contains" queries where the matching email returns a
moderate number of matches (like fewer than 50).
Attachment #72194 - Attachment is obsolete: true
Attached patch The patch (obsolete) — Splinter Review
This will speed up queries for users making comments or on CC lists
dramatically.
Attachment #103452 - Attachment is obsolete: true
taking on
Assignee: bbaetz → bugreport
Status: REOPENED → NEW
This seems OK to me (although you probably mean either (@list < 51) or LIMIT
50), although I'd prefer not to review it as I'm not familiar with this code.

What's the relationship between this bug and 133173? That seems wrong to me...

Gerv
If we hit the LIMIT, then we don't know if there were really more than we were
willing to check for this way, so we only use the mechanism if the number
returned is fewer than the LIMIT.

I'll ping around for a reviewer.  I think that everyone is unhappy with reviews
in this part of the code.

Blocks: 176570
There is no reason for bug 133173 to depend on this.  

This still does need a review, though.
No longer blocks: 133173
Status: NEW → ASSIGNED
Prior to settling on the patch in attachment 103459 [details] [diff] [review], MattyT suggested trying the
approach of creating a temporary table with the matching userids and joining
with that.  I will try that in the next few days, but it will need a performance
test from myk.

myk - If you are up for trying out a few key queries, I'll do this.  If you
would rather stick with the current patch, It's your call.  Please comment here.
OK... here's the 2 SQL statements that need to b timed on BMO.  Please post
result here.

create temporary table garbage  select userid from profiles where login_name 
like '%@%' ;
alter table garbage add index userid;

It looks like there is no common syntax for temporary tables populated from a
select.  mysql cannot handle pg's way and vice-versa.  

What say we go with the patch we have then?

Opinions from myk, justdave, & dkl especially requested.  (myk for BMO impact,
justdave for sybase impact, dkl for pg impact)
Comment on attachment 103459 [details] [diff] [review]
The patch

>+        &::SendSQL("SELECT userid FROM profiles WHERE INSTR(login_name, " .
>+            &::SqlQuote($email) . ") LIMIT 51");

Changing the 

INSTR(login_name, " . &::SqlQuote($email) . ") 

to

POSITION(" . &::SqlQuote($email) . " IN login_name) != 1

is better for cross db compatibility since this works in both MySQL and
PostgreSQL for me.

I have applied this patch to the PostgreSQL version with the above change and
was able to notice a 
significant increase in lookup when searching with cclist and/or qa_contact.
Sybase doesn't support either of those (not that it matters for this bug, since
the Sybase stuff is still off on the horizon).  Sybase's equivalent is:

CHARINDEX(" . &::SqlQuote($email) . ",login_name)
Attachment #103459 - Flags: review-
Comment on attachment 103459 [details] [diff] [review]
The patch

This patch is now obsolete in that it is broken with the new CGI pm checkin of
recent. Things like $M{$field} has becoming $params->param($field) etc. 

Per Dave's comment about Sybase, I think separating out some of the string
search stuff from the SQL into smaller utility functions would be better for
this. Whereas POSITION() works for Pg and MySQL it does not work for Sybase. We
could have a SqlString() function in the SQL that will place the proper syntax
string in it's place. This way DB specific conditionals could be isolated to
one function.
For 'starts with', LIKE 'Foo%' is better, because then the db can use a index.
You may need to escape the input string, somehoe. I presume theres a way to do
that, hopefully even a portable one.
Per comment 32:

how much of a performance benefit would it be then to just replace all 
email specific searching with the proper LIKE '%foo%' and for the case
with multiple words just split on the spaces and string along multple LIKE's
in an (field1 LIKE '%foo%' OR field2 LIKE '%foo%')? I think MySQL supports RLIKE
for case insensitive search but the others do not. For those cases you can do
the usual (LOWER(field1) LIKE '%" . lc($foo) . "%' OR ...). I have read/heard
though when LOWERing the field before comparison then you basically throwing the
index out the window so it should be discouraged.
Actually, we're getting way off the original point of this patch.
Even the most horrible way of looking up the email addresses ONCE and then
qualifying the CC and londescs with it is orders of magnitude faster than the
status quo.

Let's just keep this simple and land it.  If the pattern match is too
contoversial, we could narrow its effect to handle the exact match only and get
it in.
Attached patch Patch updated (obsolete) — Splinter Review
OK... Here it is back to INSTR and up-to-date
Attachment #103459 - Attachment is obsolete: true
Comment on attachment 104448 [details] [diff] [review]
Patch updated

Decided per earlier discussion to leave the INSTR() change for another later
bug. This patch has passed my functionality testing on both Mysql and
PostgreSQL and has shown to speed up specific searches on cc/commenter
especially in the area of PostgreSQL.

r=dkl
Attachment #104448 - Flags: review+
OK, I had a bit of difficulty understanding this patch, and so these comments
reflect changes I think should be made to make the code more understandable.

- ListEmails should do just that; the "IN (" and ")" should be added by the caller.
+ push(@supptables, "LEFT JOIN longdescs $table ON bugs.bug_id = $table.bug_id
AND $table.who IN($list)");
is more understandable than the current
+ push(@supptables, "LEFT JOIN longdescs $table ON bugs.bug_id = $table.bug_id
AND $table.who $list");

- ListEmails should be called ListIDsForEmail or similar - it doesn't list email
addresses, it lists IDs. The comment above it could also be updated to make this
more clear.

- Why do we do:
+                push(@clist, "$table.who",'greaterthan','0');
? longdescs.who should never be 0, because we don't allow anonymous comment
submission.

- What is $ok flagging as OK? I can't work out what it does. Please give it a
more descriptive name like $foocheckok, and test for not ok with if (!$foocheckok).

Thanks :-)

Gerv
OK, clarified, fixed naming, and added isnotnull for when I mean isnotnull.
Attachment #104448 - Attachment is obsolete: true
Comment on attachment 104825 [details] [diff] [review]
Revised with readability improvements

r=gerv.

Gerv
Attachment #104825 - Flags: review+
Checking in Bugzilla/Search.pm;                               
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Search.pm,v  <--  Search.pm
new revision: 1.23; previous revision: 1.22
done              
Status: ASSIGNED → RESOLVED
Closed: 22 years ago22 years ago
Resolution: --- → FIXED
QA Contact: matty_is_a_geek → default-qa
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: