Closed Bug 733556 Opened 12 years ago Closed 12 years ago

tbpl query taking a long time and using up lots of connections

Categories

(Tree Management Graveyard :: TBPL, defect, P2)

x86
macOS
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: scabral, Assigned: rhelmer)

References

Details

Attachments

(2 files)

We had some problems today on the generic db, due to tons of these queries running (with different paramters):

SELECT id, buildbot_id AS _id, b
uildername, slave, result,
    unix_timestamp(starttime) AS starttime,
    unix_timestamp(endtime) AS endtime
  FROM runs
  WHERE branch = 'fx-team' AND revision = '5724c5ba0f84'
  AND buildername NOT IN (
    SELECT buildername
    FROM builders
    WHERE branch = 'fx-team' AND hidden = TRUE)

MySQL doesn't do subqueries well. This can be rewritten as a LEFT JOIN...off the top of my head, I *think* this is correct, but please double-check with sample data.

SELECT id, buildbot_id AS _id, buildername, slave, result,
    unix_timestamp(starttime) AS starttime,
    unix_timestamp(endtime) AS endtime
  FROM runs LEFT JOIN builders USING (buildername)
  WHERE runs.branch = 'fx-team' AND builders.revision = '5724c5ba0f84'
  AND builders.branch = 'fx-team' and builders.hidden=TRUE
  AND runs.buildername IS NULL;
Also, why is this running on the master, when it's a SELECT query? It should be runnning on the slave.
Severity: normal → critical
Priority: -- → P1
Production tbpl hasn't changed since February 8th, did something change under us, or is someone doing some ugly spidering, to make us cause noticeable load? Or were we causing the same sort of load all along, and it just got noticed?
And that query looks like it would be coming from http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/file/86204a11d671/php/getRevisionBuilds.php where it's been (for good or ill) since the dawn of tbpl's MySQL time, so if we're suddenly through the roof, I'd really expect it to be from an aggressive spidering.

(And yeah, I'm right there with you wondering why there's no https://tbpl.mozilla.org/robots.txt)
Killing queries longer than 5 mins:

 mysql -N tbpl_mozilla_org -e "select concat('kill ',ID,';') from information_schema.processlist where TIME>300 and DB='tbpl_mozilla_org' and INFO LIKE 'SELECT id, buildbot_id AS _id, buildername, slave, result%'" > killme.sql

mysql -f < killme.sql
I can't tell you where they're coming from, someone should check apache/app logs. According to mysql they're all coming from 10.8.70.207 = pp-zlb08.vlan70.phx.mozilla.com, which is the load balancer (and this is appropriate).
Severity: critical → blocker
Assignee: nobody → rhelmer
(In reply to Sheeri Cabral [:sheeri] from comment #0)
> We had some problems today on the generic db, due to tons of these queries
> running (with different paramters):
> 
> SELECT id, buildbot_id AS _id, b
> uildername, slave, result,
>     unix_timestamp(starttime) AS starttime,
>     unix_timestamp(endtime) AS endtime
>   FROM runs
>   WHERE branch = 'fx-team' AND revision = '5724c5ba0f84'
>   AND buildername NOT IN (
>     SELECT buildername
>     FROM builders
>     WHERE branch = 'fx-team' AND hidden = TRUE)
> 
> MySQL doesn't do subqueries well. This can be rewritten as a LEFT JOIN...off
> the top of my head, I *think* this is correct, but please double-check with
> sample data.
> 
> SELECT id, buildbot_id AS _id, buildername, slave, result,
>     unix_timestamp(starttime) AS starttime,
>     unix_timestamp(endtime) AS endtime
>   FROM runs LEFT JOIN builders USING (buildername)
>   WHERE runs.branch = 'fx-team' AND builders.revision = '5724c5ba0f84'
>   AND builders.branch = 'fx-team' and builders.hidden=TRUE
>   AND runs.buildername IS NULL;

This is really close, but I think what the first query is doing is looking for buildernames NOT IN the subselect, so something more like this (changed the branch and rev to be newer data so I could test locally):

SELECT runs.id, buildbot_id AS _id, buildername, slave, result,    unix_timestamp(starttime) AS starttime,
    unix_timestamp(endtime) AS endtime  FROM runs JOIN builders USING (buildername)
  WHERE runs.branch = 'mozilla-central' AND runs.revision = '78e56fd22f2a'
  AND builders.branch = 'mozilla-central' and builders.hidden != TRUE
  AND runs.buildername IS NOT NULL;
JOIN does an INNER JOIN and so it won't actually get stuff in runs that's not in builders.

But add in the word "LEFT" and I think you're golden:

SELECT runs.id, buildbot_id AS _id, buildername, slave, result,    unix_timestamp(starttime) AS starttime,
    unix_timestamp(endtime) AS endtime  FROM runs LEFT JOIN builders USING (buildername)
  WHERE runs.branch = 'mozilla-central' AND runs.revision = '78e56fd22f2a'
  AND builders.branch = 'mozilla-central' and builders.hidden != TRUE
  AND runs.buildername IS NOT NULL;
Seems to work locally, not sure who is around to review right now so casting a wide net :)
Attachment #603531 - Flags: review?(philringnalda)
Attachment #603531 - Flags: review?(mstange)
Attachment #603531 - Flags: review?(arpad.borsos)
Going to have to back out bug 718632 so we can get this tested on tbpl-dev; I will do that over in that bug.
Status: NEW → ASSIGNED
We closed the trees earlier today (bug 733592) because TBPL was unreachable, but it's working well enough at the moment that we have reopened them.
(In reply to Matt Brubeck (:mbrubeck) from comment #12)
> We closed the trees earlier today (bug 733592) because TBPL was unreachable,
> but it's working well enough at the moment that we have reopened them.

Thanks for the update; I would like to get this landed and on tbpl-dev (at least) so we have it in our back pocket if we hit problems again.
bug 718632 is backed out, this is now clear to land.
Went ahead and pushed this (based on sheeri's review), so we can get this on tbpl-dev and test behavior:

http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/rev/5e65c2529f94

Leaving review requests open, just in case I missed something.
Blocks: 733643
Reducing severity per comment 12, but went ahead and filed a bug to push this to prod (bug 733643).
Severity: blocker → normal
Please reopen if you see any issues, I'd like to avoid any further problems in prod so calling this fixed based on review and testing on tbpl-dev.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
We're back to getting 500 errors, with the query change deployed.
(In reply to Robert Helmer [:rhelmer] from comment #17)
> Please reopen if you see any issues, I'd like to avoid any further problems
> in prod so calling this fixed based on review and testing on tbpl-dev.

reopening because we're hitting issues again. Sheeri just restarted the db cluster.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Sheeri Cabral [:sheeri] from comment #6)
> I can't tell you where they're coming from, someone should check apache/app
> logs. According to mysql they're all coming from 10.8.70.207 =
> pp-zlb08.vlan70.phx.mozilla.com, which is the load balancer (and this is
> appropriate).

per irc at ~17:00 PST: cshields and fox2mike scanned the apache logs, and found nothing unusual.
And none of the queries from other users of the shared db, even though they are fewer in number and thus less obvious, could possibly be the result of the slowdown?

We haven't changed in a month, can everybody else using the db say that? The query that got the blame here hasn't changed in more like 5 months (until this bug changed it), can all the other queries say that?

I know it sounds like I'm trying to blame-shift, but step one for diagnosing "it was working fine yesterday, and now it's broken!" is always to ask "what changed between then and now?" For tbpl, what changed was adding another day worth of rows, which I'm willing to believe pushed it over some magic failure threshold, but, was that the only thing that changed having to do with the db server?
Depends on: 733667
From irc, sheeri added an index to the left join of tbpl. This made things much more responsive, and with tbpl working, we were able to reopen the trees at ~21:10PST.

From what I can piece together after watching this cause two tree closures today:

1) The tbpl code hasnt changed in months, so the sql queries being run by tbpl were working fine 
2) Dev checkin load seems about usual earlier today, and seems lighter now (typical for this time of evening).
3) One theory is that there's load coming from a non-tbpl application onto db server. Another theory is that the tbpl sql crossed some unknown threshold and now queries that used to be quick are now taking significantly more time. Either way, something is causing queries to run slower, which causes later queries to get backlogged, which eventually becomes visible to tbpl users as "Loading failed: error".

Leaving open for now so rhelmer and sheeri can post a RFO.
No longer depends on: 733667
philor: sorry, I mid-aired. Fixing dependency.
Depends on: 733667
Comment on attachment 603531 [details] [diff] [review]
better query for getRevisionBuilds

Review of attachment 603531 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good apart from that.
builders.name is unique, whereas buildername is not. I need to look into that again.

::: php/getRevisionBuilds.php
@@ +18,5 @@
>      unix_timestamp(starttime) AS starttime,
>      unix_timestamp(endtime) AS endtime
> +  FROM runs LEFT JOIN builders USING (buildername)
> +  WHERE runs.branch = :branch AND runs.revision = :revision
> +  AND runs.buildername IS NOT NULL

runs.buildername is defined as NOT NULL
Attachment #603531 - Flags: review?(arpad.borsos) → review+
(In reply to John O'Duinn [:joduinn] from comment #22)
> From irc, sheeri added an index to the left join of tbpl. This made things
> much more responsive, and with tbpl working, we were able to reopen the
> trees at ~21:10PST.

Which one was that? An EXPLAIN of the query would be helpful here. I was under the impression runs.(branch+revision) and builders.hidden was enough as index here.

We might as well need to focus on bug 703967. Less data means less work for the server :-)
Depends on: 703967
<sheeri>: the left join was better but w/out an index it was trying to do 300*8000 rows
<sheeri>: which is way better than a dependent subquery
<sheeri>: but with the index it's 300*1 rows 

Which was where I started wondering if maybe what changed between day before yesterday and yesterday to make us suddenly broken might really be just "we added another day worth of rows, and went over a magic slowness threshold."
philor - I agree that's probably it, but I think the problem is less the amount of rows, and more the fact that all those rows are being processed. Is all that necessary? (It's totally OK if it is, but I got the sense that maybe it wasn't necessary?)
Severity: normal → blocker
Here's an EXPLAIN of the original subquery:


mysql> explain SELECT id, buildbot_id AS _id, buildername, slave, result,     unix_timestamp(starttime) AS starttime,     unix_timestamp(endtime) AS endtime   FROM runs   WHERE branch = 'fx-team' AND revision = '5724c5ba0f84'   AND buildername NOT IN (     SELECT buildername     FROM builders     WHERE branch = 'fx-team' AND hidden = TRUE)\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: runs
         type: ref
possible_keys: runs_revision_branch_idx
          key: runs_revision_branch_idx
      key_len: 889
          ref: const,const
         rows: 1688
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: builders
         type: index_subquery
possible_keys: builders_hidden_idx,buildername
          key: buildername
      key_len: 768
          ref: func
         rows: 2
        Extra: Using where
2 rows in set (0.01 sec)

Here's an EXPLAIN before the index was in place:
mysql> explain  SELECT runs.id, buildbot_id AS _id, buildername, slave, result,     unix_timestamp(starttime) AS starttime,     unix_timestamp(endtime) AS endtime   FROM runs LEFT JOIN builders USING (buildername)   WHERE runs.branch = 'mozilla-inbound' AND runs.revision = '5bddb0ee5912'   AND runs.buildername IS NOT NULL   AND builders.hidden != TRUE\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: runs
         type: ref
possible_keys: runs_revision_branch_idx
          key: runs_revision_branch_idx
      key_len: 889
          ref: const,const
         rows: 327
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: builders
         type: ALL
possible_keys: builders_hidden_idx
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 8756
        Extra: Using where
2 rows in set (0.00 sec)

The builders_hidden_idx is an index on the field "hidden", but it's useless because 93% of the values of hidden are !=TRUE, so it's better to do a full table scan than to have the overhead of going to the index first, then the data row itself, for 8500 rows.

The table join is on buildername, so it'd make sense to have an index on buildername, which is the index I added, and here's the explain:

mysql> explain  SELECT runs.id, buildbot_id AS _id, buildername, slave, result,     unix_timestamp(starttime) AS starttime,     unix_timestamp(endtime) AS endtime   FROM runs LEFT JOIN builders USING (buildername)   WHERE runs.branch = 'mozilla-inbound' AND runs.revision = '5bddb0ee5912'   AND runs.buildername IS NOT NULL   AND builders.hidden != TRUE\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: runs
         type: ref
possible_keys: runs_revision_branch_idx
          key: runs_revision_branch_idx
      key_len: 889
          ref: const,const
         rows: 327
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: builders
         type: ref
possible_keys: builders_hidden_idx,buildername
          key: buildername
      key_len: 768
          ref: tbpl_mozilla_org.runs.buildername
         rows: 1
        Extra: Using where
2 rows in set (0.00 sec)
Thanks for that info. Can you check if buildername is unique? It should be as far as I know. Maybe we can add a unique index to our schema file.
(In reply to John O'Duinn [:joduinn] from comment #22)
> Leaving open for now so rhelmer and sheeri can post a RFO.

I'm not sure I really have enough access to production to give a good reason for outage (since we can't repro on dev), I would hazard a guess that it was related to overall DB load (coupled with us using a less efficient query+index) based on sheeri's investigation and fix.

On reducing the risk of this kind of outage - if the above is correct then since the DB is a shared resource and a big SPOF for TBPL, there are ways we could mitigate (having a caching layer, using connection pooling to limit possible concurrent connections, etc. - right now I am pretty sure HTTP connections and DB connections are 1:1 via the PHP app, but I could be wrong.)

I was planning to use bug 733667 to follow up on this, possibly morphing the summary.
philor - to address your question of "And none of the queries from other users of the shared db, even though they are fewer in number and thus less obvious, could possibly be the result of the slowdown?" - there were no other queries, other than replication, and that wasn't using undue resources.

Just wanted to make sure I put that information out there, when there were over 7,000 queries last night, they were all that tbpl query.
I bet part of the problem here is that once it starts to slow down, people seeing this think it is hung and not going to answer and reload the page which I bet starts a new query without cancelling the old one so that once it starts slowing down it just goes into a death spiral.
tbpl is auto refreshing anyway, so it's more some slowness compounds as more and more queries run, until we end up unresponsive.
Quoting from before, as it's still relevant, "I agree that's probably it, but I think the problem is less the amount of rows, and more the fact that all those rows are being processed. Is all that necessary? (It's totally OK if it is, but I got the sense that maybe it wasn't necessary?)"
FWIW I've set max_connections back to 1200 (it was up to 10,000 yesterday while we were debugging) and set a limit for the tbpl_user to 1000 connnections:

GRANT USAGE ON *.* TO 'tbpl_user'@'10.8.70.20_' IDENTIFIED BY PASSWORD 'ELIDED' WITH MAX_USER_CONNECTIONS 1000;
Severity: blocker → critical
Are we ready to call this bug closed? I want to make sure everyone on this bug got their questions answered before I do so.
Well I would like to
1) know if a UNIQUE index on builders.buildername is feasible and
2) add that index to our schema.sql
3) get my review comment addressed, as runs.buildername is defined as NOT NULL. Or is the AND IS NOT NULL needed because of the LEFT JOIN? Really, there should be a matching builder for every run.
1) UNIQUE index on builders.buildername is not doable in the moment as there are duplicates:
mysql> SELECT COUNT(buildername),COUNT(distinct buildername),count(id) FROM builders;
+--------------------+-----------------------------+-----------+
| COUNT(buildername) | COUNT(distinct buildername) | count(id) |
+--------------------+-----------------------------+-----------+
|               9054 |                        8535 |      9055 |
+--------------------+-----------------------------+-----------+
1 row in set (0.10 sec)

Here's a sample of a duplicate builder name:

mysql> select * from builders where buildername='Android alder build';
+------+---------------------+--------+---------------------+--------+
| id   | name                | branch | buildername         | hidden |
+------+---------------------+--------+---------------------+--------+
| 4270 | alder-linux-android | alder  | Android alder build |      0 |
| 7127 | alder-android       | alder  | Android alder build |      0 |
+------+---------------------+--------+---------------------+--------+

I'll attach a list of the over 500 duplicate buildernames.

2) See #1
3) The original subquery was saying where runs IS NOT IN the builders query, that's what we're getting here. (where there's not a matching run for a builder) The "IS NOT NULL" is needed because of the LEFT JOIN. Let's say the data is like this:

runs:
id buildername
1  a
2  b
3  c

builders:
hidden buildername
false  b
true   c
false  d

runs LEFT JOIN builders on (buildername) tries to match every row in runs with a row in builders, and if there isn't a match in builders, get a NULL value:

(id buildername hidden buildername)
1 a NULL NULL
2 b false b
3 c true c

(Note: there is no NULL NULL false d because it only tries matching rows that exist in runs).

So you want to add WHERE builds.[something] IS NULL.
(In reply to Sheeri Cabral [:sheeri] from comment #38)
> (Note: there is no NULL NULL false d because it only tries matching rows
> that exist in runs).

And thats exactly why runs.buildername is never NULL.
So is this a typo in the query which should actually say builders.buildername IS NOT NULL?
Ah! Yes, it should say builders.buildername IS NOT NULL. Thank you for catching that!
Attachment #603531 - Flags: review?(philringnalda)
Attachment #603531 - Flags: review?(mstange)
What is this bug waiting on, me to interpret Arpad's review + Sheeri's comment and submit another patch? :)
(In reply to Robert Helmer [:rhelmer] from comment #41)
> What is this bug waiting on, me to interpret Arpad's review + Sheeri's
> comment and submit another patch? :)

I see your issue here the review comments and Sheeri's comment seem to be supporting your change but in a very ambiguous way.
Hopefully this is more clear - Arpad has brought up a correction that I believe should be implemented in a patch.
Whiteboard: waiting on new patch
(In reply to Sheeri Cabral [:sheeri] from comment #43)
> Hopefully this is more clear - Arpad has brought up a correction that I
> believe should be implemented in a patch.

Dropping severity a little bit, will get to this soon.
Severity: critical → normal
Priority: P1 → P2
Robert - any ETA on this?
(In reply to Sheeri Cabral [:sheeri] from comment #45)
> Robert - any ETA on this?

Sorry completely fell off my radar, should be able to finish it up this week.
FWIW the purge has been completed, and indexes added (dependent bug).
Has the patch been committed/implemented yet?

There are some numbers on the tbpl table growth in the past month at https://bugzilla.mozilla.org/show_bug.cgi?id=779290#c2
(In reply to Sheeri Cabral [:sheeri] from comment #48)
> Has the patch been committed/implemented yet?

The attached patch landed in comment 15 & has been in production for 5-6 months.

However this leaves the followup:

(In reply to Arpad Borsos (Swatinem) from comment #39)
> (In reply to Sheeri Cabral [:sheeri] from comment #38)
> > (Note: there is no NULL NULL false d because it only tries matching rows
> > that exist in runs).
> 
> And thats exactly why runs.buildername is never NULL.
> So is this a typo in the query which should actually say
> builders.buildername IS NOT NULL?

(In reply to Sheeri Cabral [:sheeri] from comment #40)
> Ah! Yes, it should say builders.buildername IS NOT NULL. Thank you for
> catching that!

If I understand the context from the comments prior to these correctly, the attach hunk:

+  FROM runs LEFT JOIN builders USING (buildername)
+  WHERE runs.branch = :branch AND runs.revision = :revision
+  AND runs.buildername IS NOT NULL

needed to instead be:

+  AND builders.buildername IS NOT NULL

Right...?
Attachment #658534 - Flags: review?(scabral)
Attachment #658534 - Flags: review?(arpad.borsos)
Right. When you have

runs LEFT JOIN builders

that means that all the values in "runs" will try to be matched up with "builders". So any NULL values would be from "builders" because NULL means "did not match". There's no way to get a NULL value from runs.buildername, because that is defined as NOT NULL in the runs table, and the LEFT JOIN wouldn't produce any:

  `buildername` varchar(256) COLLATE utf8_unicode_ci NOT NULL,
Attachment #658534 - Flags: review?(arpad.borsos) → review+
Comment on attachment 658534 [details] [diff] [review]
builders.buildername followup

Fine when tested locally, so landed:
http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/rev/b4230ded8e13

On tbpl-dev, will close the bug when on production.
Attachment #658534 - Flags: review?(scabral)
Whiteboard: waiting on new patch
Depends on: 790559
In production.
Status: REOPENED → RESOLVED
Closed: 12 years ago12 years ago
Resolution: --- → FIXED
No longer blocks: 733643
Depends on: 733643
Product: Webtools → Tree Management
Product: Tree Management → Tree Management Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: