Figure out the cause of the DB load/timeouts in bug 1223174

Status: RESOLVED FIXED

Product: Tree Management
Component: Treeherder: Infrastructure
Priority: P1
Severity: normal
Opened: 2 years ago
Last updated: 2 years ago

People: (Reporter: emorley, Assigned: emorley)

(Assignee)

Description

2 years ago
Yesterday evening production Treeherder experienced heavy DB load, causing API requests to time out. The number of these timeouts (or perhaps the time it takes the workers to recover?) then caused the load balancer to decide the web nodes were bad and mark them as unavailable.

This bug is to figure out what happened and to try rolling out the prod push again, since it was reverted to see whether that fixed the issue.
(Assignee)

Comment 1

2 years ago
The rough timeline is as follows (all times in UK time, i.e. UTC):

2047: Will did a prod deploy after checking state and getting the ok from the other devs. Log: http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/production.1447102032

2126: Another prod deploy was performed since the IRC bot messages implied the first had failed (it hadn't, see the log above; this is an unfortunate known issue with the IRC bot). Log: http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/production.1447104360

2144: Another prod deploy was initiated - looking at the logs, this seems to be because the earlier deploys were actually deploying ed84987 (the old prod revision), so would have been no-ops. This deploy used 8a511b7, which was correct. Log: http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/production.1447105494

2145: The deploy script ran the DB migrations, which applied the perf.0003_performancesignature_lower_is_better migration. The |./manage.py migrate --noinput| command took 3.3 seconds to run.

2154: The deploy script finished pushing code to the prod nodes, draining, restarting processes, undraining, etc.

2210: The New Relic error rate spiked (see the links below), the ping monitor reported that the API was not responding, and the sheriffs mentioned having issues in #developers.

2312: Will started a deploy of ed84987 (the revision previously on prod) back to production. Log: http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/production.1447110767

2313: The New Relic error rate can be seen to drop, web transaction throughput starts to increase, etc.

2322: The deploy finished updating the nodes.


Will, did you run the management command `remove_single_test_summaries` at any point? If so, what time?
Flags: needinfo?(wlachance)
(Assignee)

Comment 2

2 years ago
New Relic links (filtered to the appropriate time range):
* Error rate: https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors?tw%5Bend%5D=1447112244&tw%5Bstart%5D=1447103703
* Web transactions: https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?tw%5Bend%5D=1447112244&tw%5Bstart%5D=1447103703
* Celery transactions: https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?show_browser=false&tw%5Bend%5D=1447112244&tw%5Bstart%5D=1447103703&type=other
* Database stats: https://rpm.newrelic.com/accounts/677903/applications/4180461/datastores?tw%5Bend%5D=1447112244&tw%5Bstart%5D=1447103703#/overview/All?value=total_call_time_per_minute
* Master DB server stats: https://rpm.newrelic.com/accounts/677903/servers/6106885?tw%5Bend%5D=1447112244&tw%5Bstart%5D=1447103703
* MySQL New Relic Plugin (only available on the other New Relic account): 
  - Overview: https://rpm.newrelic.com/accounts/263620/dashboard/3942659?tw%5Bend%5D=1447112244&tw%5Bstart%5D=1447103703
  - Query analysis: https://rpm.newrelic.com/accounts/263620/dashboard/3942659/page/2?tw%5Bend%5D=1447112244&tw%5Bstart%5D=1447103703
  - InnoDB metrics: https://rpm.newrelic.com/accounts/263620/dashboard/3942659/page/3?tw%5Bend%5D=1447112244&tw%5Bstart%5D=1447103703

The changes that were deployed:
https://github.com/mozilla/treeherder/compare/ed84987...8a511b7

Things that stand out:
* The perf migration that adds a column to the performancesignature table (https://github.com/mozilla/treeherder/commit/81a7d5148f2766fd5d4af0d6c1156163e29fcc3f#diff-7b6fa7ce1bb6db25d07f8271f2e4e7f9R14) - see the sketch after this list
* The management command, run manually, to remove single-test summaries from the performancesignature table (https://github.com/mozilla/treeherder/commit/06fbcd9b09f06f680af1d9a64190fc1e085c1cb9#diff-073e5306b556663c93b32263233c6bd2)
* The addition of an ORDER BY to the jobs list query (https://github.com/mozilla/treeherder/commit/761ae7f163d0550b124045125255cd0c3b7bd415)
* The shortening of the gunicorn timeout from 120s to 30s (https://github.com/mozilla/treeherder/commit/50cec5c1fe71a04c0f8da0bef87783bd9ad621c8)
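
For illustration, a column-adding migration like perf.0003_performancesignature_lower_is_better generally has the shape below. This is only a sketch - the field type, default and dependency name are assumptions on my part; the real contents are in the commit linked above.

# perf/migrations/0003_performancesignature_lower_is_better.py (illustrative sketch only)
from django.db import migrations, models

class Migration(migrations.Migration):

    dependencies = [
        # Placeholder; the real dependency is whatever the previous perf migration is named.
        ('perf', '0002_previous_migration'),
    ]

    operations = [
        migrations.AddField(
            model_name='performancesignature',
            name='lower_is_better',
            # Assumed field type/default; see the linked commit for the real definition.
            field=models.BooleanField(default=True),
        ),
    ]

On MySQL this becomes an ALTER TABLE ... ADD COLUMN, which can be slow on a large table - though the 3.3 second migrate time in comment 1 suggests the migration itself wasn't the problem.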

Observations so far:
* There is an error in the deploy script `[localhost] err: /root/bin/th_functions.sh: line 55: [: too many arguments` (likely unrelated, but we should fix it).
* The error rate went down *before* the deploy of the revert had fully completed. Looking at the log to see which nodes had been updated by that point will likely give clues (see http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/production.1447110767).
* We really should move more of our own internal tasks to access the DB directly rather than hitting our public API. Going via the API puts additional load on it, muddies the waters when comparing our own requests to those from third parties, and makes API issues more severe, since they then break our own ingestion too (bug 1211414).
* The job ETA table still features prominently in DB profiles; we really should fix this soon (bug 1196764).
* Perhaps when we move to Heroku and Amazon RDS, we should give stage the same data retention as prod: we didn't have issues on stage prior to the deploy, so there is clearly more of a difference between the environments than is ideal. (Though stage would still lack the API load generated by users of the UI, etc.)
Comment 3

2 years ago

(In reply to Ed Morley [:emorley] from comment #1)

> Will, did you run the management command `remove_single_test_summaries` at
> any point? If so, what time?

Pretty much immediately after the deploy.

I strongly suspect the problem is that I ran this command unthrottled, falsely assuming that because there were no problems on stage/Heroku there would be no issues on prod. This was pretty dumb of me, I'm sorry (I even added a rate-limiting argument to the command, which I then didn't use, argh).

I'd suggest re-deploying prod and just seeing how it does. I'll hold off on re-running that script for a while, as it isn't totally urgent (some perf results will look a bit wonky until it's re-run, but it's not a huge deal).
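
(For reference, a rate-limited cleanup command generally takes the shape below. This is only a sketch of the pattern, not what remove_single_test_summaries actually does - the import path, filter and option names are illustrative assumptions.)

# A hypothetical throttled cleanup command, deleting in small batches with a pause between them.
import time

from django.core.management.base import BaseCommand

from treeherder.perf.models import PerformanceSignature  # assumed import path


class Command(BaseCommand):
    help = "Remove single-test summary signatures in small, rate-limited batches"

    def add_arguments(self, parser):
        parser.add_argument('--batch-size', type=int, default=100)
        parser.add_argument('--sleep', type=float, default=2.0,
                            help='seconds to pause between batches')

    def handle(self, *args, **options):
        # Assumed filter; whatever identifies the summary rows to be removed.
        signatures = PerformanceSignature.objects.filter(test='')
        while True:
            batch = list(signatures.values_list('id', flat=True)[:options['batch_size']])
            if not batch:
                break
            PerformanceSignature.objects.filter(id__in=batch).delete()
            time.sleep(options['sleep'])  # throttle, so prod DB load stays bounded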
Flags: needinfo?(wlachance)
(Assignee)

Comment 4

2 years ago
We missed each other on IRC (it said your client got disconnected). I'm not 100% convinced it was all your script, so I'm going to push to prod again, but this time incrementally, so we can be sure what (if anything) causes problems - starting with 06fbcd9.

Log:
http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/production-06fbcd9.1447162366

Incidentally, even before this re-deploy there were failures in the log parser logs - presumably because the schema changes from yesterday are still in place? Should we add a default when adding new columns, to avoid errors when rolling back?

[2015-11-10 05:30:15,279: ERROR/Worker-109] Response body: {"detail":"(1364, \"Field 'lower_is_better' doesn't have a default value\")"}
(Full output: https://emorley.pastebin.mozilla.org/8851782)
(Assignee)

Comment 5

2 years ago
Looking back to yesterday evening, prior to the deploy, th-prod-web1's /var/log/gunicorn/treeherder_error.log-20151108 log contained:

[2015-11-03 17:31:54 +0000] [8839] [CRITICAL] WORKER TIMEOUT (pid:28063)
[2015-11-03 17:32:35 +0000] [8839] [CRITICAL] WORKER TIMEOUT (pid:28051)
[2015-11-03 17:33:11 +0000] [8839] [CRITICAL] WORKER TIMEOUT (pid:28106)
[2015-11-04 13:13:36 +0000] [8839] [CRITICAL] WORKER TIMEOUT (pid:7648)
[2015-11-04 13:14:20 +0000] [8839] [CRITICAL] WORKER TIMEOUT (pid:7664)

(That's the entire log file)

After the deploy /var/log/gunicorn/treeherder_error.log contains:

[2015-11-09 21:47:49 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29731)
[2015-11-09 21:47:49 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29727)
[2015-11-09 21:47:57 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29729)
[2015-11-09 21:48:03 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29728)
[2015-11-09 21:48:23 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29804)
[2015-11-09 21:48:26 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29803)
[2015-11-09 21:48:38 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29792)
[2015-11-09 21:48:51 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29830)
[2015-11-09 21:48:59 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29834)
[2015-11-09 21:49:15 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29838)
[2015-11-09 21:49:17 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29842)
[2015-11-09 21:49:22 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29846)
[2015-11-09 21:50:03 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29851)
[2015-11-09 21:50:08 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29903)
[2015-11-09 21:50:15 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29907)
[2015-11-09 21:50:29 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29888)
[2015-11-09 21:50:41 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:30005)
[2015-11-09 21:51:02 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29998)
[2015-11-09 21:51:04 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29917)
[2015-11-09 21:51:16 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:30021)
[2015-11-09 21:51:17 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:30031)
...
(and hundreds more)

That error message is emitted when a request hits gunicorn's timeout. We had just lowered the timeout from 120s to 30s, but requests shouldn't be taking anywhere near 30s in normal operation anyway, so the timeouts may be more effect than cause.
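
For context, the timeout in question is just a gunicorn setting. A minimal sketch of the relevant bit of config (values illustrative, not our actual prod config):

# gunicorn.conf.py - illustrative only
workers = 4    # illustrative
timeout = 30   # the value we just lowered from 120; when a sync worker exceeds it, the
               # master kills it and logs "[CRITICAL] WORKER TIMEOUT (pid:...)" as above

When a worker is killed like this its in-flight request is aborted, which presumably is part of what the ping monitor and the sheriffs were seeing as API failures.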
(Assignee)

Comment 6

2 years ago
No issues seen on the first deploy - running next incremental deploy, production-c6e4151:
http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/production-c6e4151.1447166046
(Assignee)

Updated

2 years ago
Depends on: 1223384
Comment 7

2 years ago

(In reply to Ed Morley [:emorley] from comment #2)
>
> * There is an error in the deploy script `[localhost] err:
> /root/bin/th_functions.sh: line 55: [: too many arguments` 
> (likely unrelated, but we should fix).

Weird! And it's only shown up the one time. That bit is just verifying the number of active connections to the web node after draining or undraining, so it's not critical, but I'll look into it.
 
[localhost] out: undraining 10.22.81.36:8080 in treeherder-https.
[localhost] out: Active connections on treeherder3.webapp.scl3.mozilla.com: 33
[localhost] err: /root/bin/th_functions.sh: line 55: [: too many arguments
(Assignee)

Comment 8

2 years ago
Hmm, one strange thing I've noticed (unrelated to this, but it still puts a lot of load on the DB) is that every 12 hours the throughput on treeherder.webapp.api.jobs:JobsViewSet.retrieve goes through the roof (from the normal ~5 requests per minute up to ~1500):

https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?tw%5Bend%5D=1447168139&tw%5Bstart%5D=1446908939&type=app#id=5b225765625472616e73616374696f6e2f46756e6374696f6e2f747265656865726465722e7765626170702e6170692e6a6f62733a4a6f6273566965775365742e7265747269657665222c22225d&sort_by=throughput

Clearer view of the latest occurrence:

https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?tw%5Bend%5D=1447168460&tw%5Bstart%5D=1447160423&type=app#id=5b225765625472616e73616374696f6e2f46756e6374696f6e2f747265656865726465722e7765626170702e6170692e6a6f62733a4a6f6273566965775365742e7265747269657665222c22225d&sort_by=throughput

...and the corresponding spike in overall DB throughput:

https://rpm.newrelic.com/accounts/677903/applications/4180461/datastores?tw%5Bend%5D=1447168460&tw%5Bstart%5D=1447160423#/overview/All?value=total_call_time_per_minute

-> it doubled our overall database calls per minute, from 20K to 40K!

Is someone scraping jobs one by one, every 12 hours?

Regardless of whether this made yesterday's issue any worse, it seems like we need:
(a) To impose GET rate limits (bug 1203556) - see the sketch below
(b) To educate people to use the right endpoints (rather than scraping jobs individually) and/or to add endpoints if the ones they need are missing
(c) To see whether there are any perf optimisations we can make for JobsViewSet.retrieve (bearing in mind the UI and most other consumers use JobsViewSet.list, so we may not have had cause to optimise retrieve previously)
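
For (a), a minimal sketch of what global throttling could look like using Django REST framework's built-in throttle classes (the rates are placeholders and would need tuning; we'd likely also want per-view scoping rather than a blanket default):

# settings.py - illustrative sketch only
REST_FRAMEWORK = {
    'DEFAULT_THROTTLE_CLASSES': [
        'rest_framework.throttling.AnonRateThrottle',
        'rest_framework.throttling.UserRateThrottle',
    ],
    'DEFAULT_THROTTLE_RATES': {
        'anon': '300/minute',  # placeholder rate
        'user': '600/minute',  # placeholder rate
    },
}

Note the throttle counters live in the Django cache, so this needs a cache backend shared across the web nodes (e.g. memcached) to be effective.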
(Assignee)

Comment 9

2 years ago
And now production-d0ac6d4 (which includes the ORDER BY change):
http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/production-d0ac6d4.1447169935
(Assignee)

Comment 10

2 years ago
And now production-8a511b7 (includes the 0003_performancesignature_lower_is_better migration):
http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/production-8a511b7.1447175273
(Assignee)

Comment 11

2 years ago
(In reply to Ed Morley [:emorley] from comment #4)
> Incidentally even before this re-deploy, there were failures in the log
> parser logs - guessing because the schema changes from yesterday are still
> in place? Should we add a default when adding new columns to avoid errors
> when rolling back?
> 
> [2015-11-10 05:30:15,279: ERROR/Worker-109] Response body: {"detail":"(1364,
> \"Field 'lower_is_better' doesn't have a default value\")"}
> (Full output: https://emorley.pastebin.mozilla.org/8851782)

This would appear to be due to the invalid/wontfixed:
https://code.djangoproject.com/ticket/2231

tl;dr: Django defaults are only enforced at the model level, strangely.
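
If we want to guard against this for future rollbacks, one option (just a sketch on my part - the table/column names below are assumptions and would need checking) is a follow-up migration that also sets the default at the database level:

# perf/migrations/000X_lower_is_better_db_default.py (hypothetical follow-up migration)
from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        ('perf', '0003_performancesignature_lower_is_better'),
    ]

    operations = [
        # Django only enforces field defaults at the model/ORM level, so also set one
        # in MySQL; then code that predates the column (e.g. the previous release after
        # a rollback) can still insert rows without hitting error 1364.
        migrations.RunSQL(
            "ALTER TABLE performance_signature ALTER COLUMN lower_is_better SET DEFAULT 1",
            reverse_sql="ALTER TABLE performance_signature ALTER COLUMN lower_is_better DROP DEFAULT",
        ),
    ]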
(Assignee)

Comment 12

2 years ago
(In reply to Ed Morley [:emorley] from comment #11)
> tl;dr django defaults are only enforced at the model level, strangely.

More reasoning in:
https://code.djangoproject.com/ticket/470#comment:2
(Assignee)

Comment 13

2 years ago
The newly filed bug 1223450 would help here - the slow transaction traces would then have the original query values present in New Relic, and we could run those queries against prod ourselves to see if they repro, etc. We'll have to see what New Relic says in their reply.
Depends on: 1223450
(Assignee)

Updated

2 years ago
Depends on: 1223496
(Assignee)

Comment 14

2 years ago
I ended up reverting the ORDER BY change, since it seems to be a contributing factor to DB load. The first and last deploy markers here are when it landed and was reverted:

https://rpm.newrelic.com/accounts/677903/applications/4180461/datastores?tw%5Bend%5D=1447189515&tw%5Bstart%5D=1447167915#/overview/All/drilldown?metric=Datastore%252Fstatement%252FMySQL%252Fjob%252Fselect&value=total_call_time_per_minute

Prod is now on 8a511b7 (what we tried to deploy last night originally) along with the revert cherry-picked.

There are still other issues we need to address (on its own the ORDER BY change shouldn't have been enough to impact us), so I'll leave this bug open for now.
(Assignee)

Updated

2 years ago
Status: ASSIGNED → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED
(Assignee)

Updated

2 years ago
Depends on: 1226565