Open Bug 829413 Opened 11 years ago Updated 2 months ago

Bugzilla::Bug->comments() shouldn't load all comments if 'after' or 'to' args are provided

Categories

(Bugzilla :: Bugzilla-General, enhancement)

enhancement
Not set
normal

Tracking

()

People

(Reporter: glob, Unassigned)

Details

(Keywords: perf)

Attachments

(1 obsolete file)

currently Bugzilla::Bug->comments() acts as follows:

  if (!self->{comments}) {
    load all comments, save in self->{comments}
  }
  @comments = self->{comments}
  if (params->{after}) {
    @comments = grep { comment->date > params->{after} } @comments
  }
  if (params->{to}) {
    @comments = grep { comment->date <= params->{to} } @comments
  }
  return @comments

the problem with this approach is even if the caller only wants a subset of comments, we always hit the database for all comments.  this could be quite expensive on bugs with a lot of comments.

as $self->{comments} lives for the duration of the current request only, it doesn't provide any value for most calls.  i don't see anywhere in our code where we use 'after' or 'to', outside the webservice modules.

i think if 'after' or 'to' parameters are provided, instead of loading all comments from the database, we should pass the date filters to the database and let it perform filtering.
With the current DB schema, there is no perf penalty with the current code. Here is the SQL query executed by Bugzilla::Bug->comments:

EXPLAIN SELECT comment_id, bug_id, who, bug_when, work_time, thetext,
               isprivate, already_wrapped, type, extra_data
          FROM longdescs
         WHERE bug_id = 29286
      ORDER BY bug_when, comment_id;

+------+-------------+-----------+------+----------------------+----------------------+---------+-------+------+-----------------------------+
| id   | select_type | table     | type | possible_keys        | key                  | key_len | ref   | rows | Extra                       |
+------+-------------+-----------+------+----------------------+----------------------+---------+-------+------+-----------------------------+
|    1 | SIMPLE      | longdescs | ref  | longdescs_bug_id_idx | longdescs_bug_id_idx | 3       | const |  180 | Using where; Using filesort |
+------+-------------+-----------+------+----------------------+----------------------+---------+-------+------+-----------------------------+

Adding |AND bug_when > 'YYYY-MM-DD'| to the WHERE part doesn't help. As you can see below, despite the longdescs_bug_when_idx index is a possible candidate to retrieve data, the DB server is still using the longdescs_bug_id_idx index and the 180 rows/comments belonging to this bug are still scanned to compare the date. So the DB server still has to retrieve all comments from this bug:

EXPLAIN SELECT comment_id, bug_id, who, bug_when, work_time, thetext,
               isprivate, already_wrapped, type, extra_data
          FROM longdescs
         WHERE bug_id = 29286
           AND bug_when > '2009-02-21'
      ORDER BY bug_when, comment_id;

+------+-------------+-----------+------+---------------------------------------------+----------------------+---------+-------+------+-----------------------------+
| id   | select_type | table     | type | possible_keys                               | key                  | key_len | ref   | rows | Extra                       |
+------+-------------+-----------+------+---------------------------------------------+----------------------+---------+-------+------+-----------------------------+
|    1 | SIMPLE      | longdescs | ref  | longdescs_bug_when_idx,longdescs_bug_id_idx | longdescs_bug_id_idx | 3       | const |  180 | Using where; Using filesort |
+------+-------------+-----------+------+---------------------------------------------+----------------------+---------+-------+------+-----------------------------+


Things become different if we add a new index foo(bug_id bug_when). The current longdescs_bug_when_idx(bug_when) index is of no use to get comments as seen above, but the new one gives interesting results. The first query, without the time range, gives:

+------+-------------+-----------+------+----------------------------------------+-------------------+---------+-------+------+-------------+
| id   | select_type | table     | type | possible_keys                          | key               | key_len | ref   | rows | Extra       |
+------+-------------+-----------+------+----------------------------------------+-------------------+---------+-------+------+-------------+
|    1 | SIMPLE      | longdescs | ref  | longdescs_bug_id_idx,longdescs_foo_idx | longdescs_foo_idx | 3       | const |  179 | Using where |
+------+-------------+-----------+------+----------------------------------------+-------------------+---------+-------+------+-------------+

As you can see, the DB server now chooses the new longdescs_foo_idx index and the Extra column shows that "Using filesort" is gone.

With the time range, we get:

+------+-------------+-----------+-------+---------------------------------------------------------------+-------------------+---------+------+------+-----------------------+
| id   | select_type | table     | type  | possible_keys                                                 | key               | key_len | ref  | rows | Extra                 |
+------+-------------+-----------+-------+---------------------------------------------------------------+-------------------+---------+------+------+-----------------------+
|    1 | SIMPLE      | longdescs | range | longdescs_bug_when_idx,longdescs_bug_id_idx,longdescs_foo_idx | longdescs_foo_idx | 11      | NULL |    1 | Using index condition |
+------+-------------+-----------+-------+---------------------------------------------------------------+-------------------+---------+------+------+-----------------------+

Now type = range instead of type = ref, but more interesting is that there is now only one row/comment retrieved instead of the 180 comments found before.

Now, I don't know if this translates into a real perf win. In my testing, the current code is already very fast: 0.01 second to collect 180 comments. Things can hardly be faster.
Severity: normal → enhancement
Keywords: perf
OS: Mac OS X → All
Hardware: x86 → All
thanks for looking at this so quickly :)

be aware you're only measuring the database query -- it takes time and memory to marshal the results to perl, to store them, and then to filter the array.
(In reply to Byron Jones ‹:glob› from comment #2)
> be aware you're only measuring the database query -- it takes time and
> memory to marshal the results to perl, to store them, and then to filter the
> array.

Sure, but to be really efficient, I showed that adding an index on (bug_id, bug_when) would let the DB server avoid an index scan and only return relevant rows. So not only would Perl have less rows to compute, but also the DB server would have less rows to analyze. So what I'm saying is that if we want to fix this bug in a very efficient way, we should both add an index and restrict the time range.

But we would have to be careful and check if not using the cache would be a perf penalty or not.
(In reply to Byron Jones ‹:glob› from comment #0)
> doesn't provide any value for most calls.  i don't see anywhere in our code
> where we use 'after' or 'to', outside the webservice modules.

We use 'after' and 'to' to generate bugmails, to get the new comment(s), see Bugzilla::BugMail::Send():

    my $comments = $bug->comments({ after => $start, to => $end });
Assignee: general → glob
Assignee: glob → general
Attachment #9383262 - Attachment is obsolete: true
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: