Closed
Bug 1223335
Opened 9 years ago
Closed 9 years ago
Figure out the cause of the DB load/timeouts in bug 1223174
Categories
(Tree Management :: Treeherder: Infrastructure, defect, P1)
RESOLVED
FIXED
People
(Reporter: emorley, Assigned: emorley)
Yesterday evening production Treeherder was experiencing heavy DB load, causing API requests to time out. The number of these timeouts (or perhaps the time it took the workers to recover?) then caused the load balancer to consider the web nodes unhealthy and mark them as unavailable.
This bug is to figure out what happened and to roll out the prod push again, since it was reverted to see whether that fixed the issue.
Assignee
Comment 1•9 years ago
The rough timeline is as follows (all times below are UK time, UTC):
2047: Will did a prod deploy after checking state and getting the ok from the other devs. Log: http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/production.1447102032
2126: Another prod deploy was performed since the IRC bot messages implied the first had failed (it hadn't, see the log above; this is an unfortunate known issue with the IRC bot). Log: http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/production.1447104360
2144: Another prod deploy was initiated; looking at the logs, this seems to be because the first few were actually deploying ed84987 (the old prod revision) and so would have been no-ops. This deploy used 8a511b7, which was correct. Log: http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/production.1447105494
2145: The deploy script performed the DB migrate, which ran the perf.0003_performancesignature_lower_is_better migration. The |./manage.py migrate --noinput| command took 3.3 seconds to run.
2154: The deploy script had finished pushing code to the prod nodes/draining/restarting processes/undraining etc.
2210: The New Relic error rate spiked (see links below), the ping monitor reported that the API was not responding & the sheriffs mentioned having issues in #developers
2312: Will started the deploy of the former revision ed84987 to production (what was on prod previously). Log: http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/production.1447110767
2313: The New Relic error rate can be seen to drop, web transactions throughput starts to increase etc.
2322: The deploy completed updating nodes
Will, did you run the management command `remove_single_test_summaries` at any point? If so, what time?
Flags: needinfo?(wlachance)
Assignee
Comment 2•9 years ago
New Relic links (filtered to the appropriate time range):
* Error rate: https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors?tw%5Bend%5D=1447112244&tw%5Bstart%5D=1447103703
* Web transactions: https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?tw%5Bend%5D=1447112244&tw%5Bstart%5D=1447103703
* Celery transactions: https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?show_browser=false&tw%5Bend%5D=1447112244&tw%5Bstart%5D=1447103703&type=other
* Database stats: https://rpm.newrelic.com/accounts/677903/applications/4180461/datastores?tw%5Bend%5D=1447112244&tw%5Bstart%5D=1447103703#/overview/All?value=total_call_time_per_minute
* Master DB server stats: https://rpm.newrelic.com/accounts/677903/servers/6106885?tw%5Bend%5D=1447112244&tw%5Bstart%5D=1447103703
* MySQL New Relic Plugin (only available on the other New Relic account):
- Overview: https://rpm.newrelic.com/accounts/263620/dashboard/3942659?tw%5Bend%5D=1447112244&tw%5Bstart%5D=1447103703
- Query analysis: https://rpm.newrelic.com/accounts/263620/dashboard/3942659/page/2?tw%5Bend%5D=1447112244&tw%5Bstart%5D=1447103703
- InnoDB metrics: https://rpm.newrelic.com/accounts/263620/dashboard/3942659/page/3?tw%5Bend%5D=1447112244&tw%5Bstart%5D=1447103703
The changes that were deployed:
https://github.com/mozilla/treeherder/compare/ed84987...8a511b7
Things that stand out:
* The perf migration to add a column to the performancesignature table (https://github.com/mozilla/treeherder/commit/81a7d5148f2766fd5d4af0d6c1156163e29fcc3f#diff-7b6fa7ce1bb6db25d07f8271f2e4e7f9R14); a rough sketch of that migration follows this list
* The management command manually to remove single test summaries from the performancesignatures table (https://github.com/mozilla/treeherder/commit/06fbcd9b09f06f680af1d9a64190fc1e085c1cb9#diff-073e5306b556663c93b32263233c6bd2)
* Adding an ORDER BY to the jobs list query (https://github.com/mozilla/treeherder/commit/761ae7f163d0550b124045125255cd0c3b7bd415)
* Shortening the gunicorn timeout from 120s to 30s (https://github.com/mozilla/treeherder/commit/50cec5c1fe71a04c0f8da0bef87783bd9ad621c8)
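For reference, the perf migration in the first bullet boils down to an AddField along these lines (a minimal sketch reconstructed from the migration name and the later 'lower_is_better' error, not copied from the actual commit):

# treeherder/perf/migrations/0003_performancesignature_lower_is_better.py
# Hypothetical reconstruction; the real file is in the commit linked above.
from django.db import migrations, models


class Migration(migrations.Migration):

    # The real migration depends on the previous perf migration; omitted
    # here since this is only a sketch.
    dependencies = []

    operations = [
        migrations.AddField(
            model_name='performancesignature',
            name='lower_is_better',
            # Django applies this default in the ORM only; the MySQL column
            # itself gets no DEFAULT clause (see comments 11 and 12 below).
            field=models.BooleanField(default=True),
        ),
    ]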
Observations so far:
* There is an error in the deploy script `[localhost] err: /root/bin/th_functions.sh: line 55: [: too many arguments` (likely unrelated, but we should fix).
* The error rate went down *before* the deploy of the revert had fully completed. Looking at the log to see which nodes had been updated will likely give clues (see http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/production.1447110767).
* We really should move more of our own internal tasks to access the DB directly and not hit our public API. It puts additional load on the API, muddies the water when comparing our own requests to those from third parties, and also makes API issues more severe, since it breaks our own ingestion. (Bug 1211414).
* The job ETA table still features heavily in DB profiles; we really should fix this soon (bug 1196764).
* Perhaps when we move to Heroku and Amazon RDS we should give stage the same data retention as prod, since the difference between environments is clearly larger than is ideal: we didn't have issues on stage prior to the deploy. (Though stage would still lack the API load generated by users of the UI etc.)
Comment 3•9 years ago
(In reply to Ed Morley [:emorley] from comment #1)
> Will, did you run the management command `remove_single_test_summaries` at
> any point? If so, what time?
Pretty much immediately after the deploy.
I strongly suspect the problem is that I ran this command unthrottled, falsely assuming that because there were no problems on stage/Heroku there would be none on prod. This was pretty dumb of me, I'm sorry (I even added a rate-limiting argument to the command and then didn't use it, argh).
I'd suggest re-deploying prod and just seeing how it does. I'll hold off on re-running that script for a while, as it isn't totally urgent (some perf results will look a bit wonky until it's been re-run, but it's not a huge deal).
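For context, the kind of throttling that unused argument was presumably meant to provide looks roughly like this (a hypothetical helper, not the actual remove_single_test_summaries implementation):

import time

def delete_in_batches(queryset, batch_size=1000, sleep_seconds=1.0):
    """Delete the rows matched by `queryset` in bounded batches,
    pausing between batches so the DB can keep up with other load."""
    while True:
        # Work on a bounded slice of primary keys rather than issuing one
        # unbounded DELETE, so each statement stays short.
        ids = list(queryset.values_list('id', flat=True)[:batch_size])
        if not ids:
            return
        queryset.model.objects.filter(id__in=ids).delete()
        time.sleep(sleep_seconds)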
Flags: needinfo?(wlachance)
Assignee
Comment 4•9 years ago
We missed each other on IRC (it said your client had disconnected). I'm not 100% convinced it was all your script, so I'm going to push to prod again, but this time incrementally, starting with 06fbcd9, so we can be sure what (if anything) causes problems.
Log:
http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/production-06fbcd9.1447162366
Incidentally, even before this re-deploy there were failures in the log parser logs, presumably because the schema changes from yesterday are still in place? Should we add a default when adding new columns, to avoid errors when rolling back?
[2015-11-10 05:30:15,279: ERROR/Worker-109] Response body: {"detail":"(1364, \"Field 'lower_is_better' doesn't have a default value\")"}
(Full output: https://emorley.pastebin.mozilla.org/8851782)
Assignee
Comment 5•9 years ago
Looking back to yesterday evening, prior to the deploy, th-prod-web1's /var/log/gunicorn/treeherder_error.log-20151108 log contained:
[2015-11-03 17:31:54 +0000] [8839] [CRITICAL] WORKER TIMEOUT (pid:28063)
[2015-11-03 17:32:35 +0000] [8839] [CRITICAL] WORKER TIMEOUT (pid:28051)
[2015-11-03 17:33:11 +0000] [8839] [CRITICAL] WORKER TIMEOUT (pid:28106)
[2015-11-04 13:13:36 +0000] [8839] [CRITICAL] WORKER TIMEOUT (pid:7648)
[2015-11-04 13:14:20 +0000] [8839] [CRITICAL] WORKER TIMEOUT (pid:7664)
(That's the entire log file)
After the deploy /var/log/gunicorn/treeherder_error.log contains:
[2015-11-09 21:47:49 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29731)
[2015-11-09 21:47:49 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29727)
[2015-11-09 21:47:57 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29729)
[2015-11-09 21:48:03 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29728)
[2015-11-09 21:48:23 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29804)
[2015-11-09 21:48:26 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29803)
[2015-11-09 21:48:38 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29792)
[2015-11-09 21:48:51 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29830)
[2015-11-09 21:48:59 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29834)
[2015-11-09 21:49:15 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29838)
[2015-11-09 21:49:17 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29842)
[2015-11-09 21:49:22 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29846)
[2015-11-09 21:50:03 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29851)
[2015-11-09 21:50:08 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29903)
[2015-11-09 21:50:15 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29907)
[2015-11-09 21:50:29 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29888)
[2015-11-09 21:50:41 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:30005)
[2015-11-09 21:51:02 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29998)
[2015-11-09 21:51:04 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29917)
[2015-11-09 21:51:16 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:30021)
[2015-11-09 21:51:17 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:30031)
...
(and hundreds more)
That error message is emitted when a request hits gunicorn's timeout. We had just lowered the timeout from 120s to 30s, but requests shouldn't be taking anywhere near 30s in normal operation anyway, so the change is probably more effect than cause.
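For reference, the setting in question is gunicorn's worker timeout; in a Python config file it is just the following (values match the before/after mentioned above; the actual prod setup may pass --timeout on the command line instead):

# gunicorn config file, e.g. gunicorn -c gunicorn_config.py wsgi:application
# Workers silent for more than `timeout` seconds are killed and restarted,
# producing the CRITICAL "WORKER TIMEOUT" lines above.
timeout = 30            # the deploy lowered this from 120
graceful_timeout = 30   # how long in-flight requests get during a restart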
Assignee
Comment 6•9 years ago
No issues seen after the first deploy; running the next incremental deploy, production-c6e4151:
http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/production-c6e4151.1447166046
Comment 7•9 years ago
(In reply to Ed Morley [:emorley] from comment #2)
>
> * There is an error in the deploy script `[localhost] err:
> /root/bin/th_functions.sh: line 55: [: too many arguments`
> (likely unrelated, but we should fix).
Weird! And it's only shown up the one time. That bit is just verifying the number of active connections to the web node after draining or undraining, so it's not critical, but I'll look into it.
[localhost] out: undraining 10.22.81.36:8080 in treeherder-https.
[localhost] out: Active connections on treeherder3.webapp.scl3.mozilla.com: 33
[localhost] err: /root/bin/th_functions.sh: line 55: [: too many arguments
Assignee
Comment 8•9 years ago
Hmm, one strange thing I've noticed (unrelated to this, but it still puts a lot of load on the DB) is that every 12 hours the throughput on treeherder.webapp.api.jobs:JobsViewSet.retrieve goes through the roof (from the normal ~5 requests per minute up to ~1500):
https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?tw%5Bend%5D=1447168139&tw%5Bstart%5D=1446908939&type=app#id=5b225765625472616e73616374696f6e2f46756e6374696f6e2f747265656865726465722e7765626170702e6170692e6a6f62733a4a6f6273566965775365742e7265747269657665222c22225d&sort_by=throughput
Clearer view of the latest occurrence:
https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?tw%5Bend%5D=1447168460&tw%5Bstart%5D=1447160423&type=app#id=5b225765625472616e73616374696f6e2f46756e6374696f6e2f747265656865726465722e7765626170702e6170692e6a6f62733a4a6f6273566965775365742e7265747269657665222c22225d&sort_by=throughput
...and the corresponding spike in overall DB throughput:
https://rpm.newrelic.com/accounts/677903/applications/4180461/datastores?tw%5Bend%5D=1447168460&tw%5Bstart%5D=1447160423#/overview/All?value=total_call_time_per_minute
-> it increased our overall database calls per minute from 20K to 40K!
Is someone scraping jobs one by one, every 12 hours?
Regardless of whether this made yesterday's issue any worse, it seems like we need:
(a) To impose GET rate limits (bug 1203556); a sketch of what that might look like follows this list
(b) To educate people to use the right endpoints (rather than scraping jobs individually) and/or add endpoints if they need them
(c) To see if there are any perf optimisations we can make to JobsViewSet.retrieve (bearing in mind that the UI and most other things use JobsViewSet.list, so we may not have had cause to optimise it previously)
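For (a), Django REST framework's built-in throttling is the obvious mechanism; a minimal settings sketch (the scopes are DRF's defaults, but the rates here are made up, and the real numbers belong in bug 1203556):

# settings.py sketch, assuming we use DRF's built-in throttle classes.
REST_FRAMEWORK = {
    'DEFAULT_THROTTLE_CLASSES': (
        'rest_framework.throttling.AnonRateThrottle',
        'rest_framework.throttling.UserRateThrottle',
    ),
    'DEFAULT_THROTTLE_RATES': {
        # Illustrative numbers only: enough for the UI and well-behaved
        # clients, but low enough to stop a scraper issuing 1500 req/min.
        'anon': '300/minute',
        'user': '600/minute',
    },
}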
Assignee
Comment 9•9 years ago
And now production-d0ac6d4 (which includes the ORDER BY change):
http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/production-d0ac6d4.1447169935
Assignee
Comment 10•9 years ago
And now production-8a511b7 (includes the 0003_performancesignature_lower_is_better migration):
http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/production-8a511b7.1447175273
Assignee
Comment 11•9 years ago
(In reply to Ed Morley [:emorley] from comment #4)
> Incidentally even before this re-deploy, there were failures in the log
> parser logs - guessing because the schema changes from yesterday are still
> in place? Should we add a default when adding new columns to avoid errors
> when rolling back?
>
> [2015-11-10 05:30:15,279: ERROR/Worker-109] Response body: {"detail":"(1364,
> \"Field 'lower_is_better' doesn't have a default value\")"}
> (Full output: https://emorley.pastebin.mozilla.org/8851782)
This would appear to be due to the invalid/wontfixed Django ticket:
https://code.djangoproject.com/ticket/2231
tl;dr: Django field defaults are only enforced at the model/ORM level, not in the database schema, strangely.
Assignee
Comment 12•9 years ago
(In reply to Ed Morley [:emorley] from comment #11)
> tl;dr django defaults are only enforced at the model level, strangely.
More reasoning in:
https://code.djangoproject.com/ticket/470#comment:2
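One way to avoid that rollback breakage would be to put the default into the schema itself, for example with a RunSQL operation alongside the AddField. A minimal sketch, assuming MySQL and the boolean column above (the table name here is a guess at the model's db_table and should be checked against the actual schema):

# Sketch: add a database-level DEFAULT so that old application code (which
# doesn't know about the column) can still INSERT after a rollback.
from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        ('perf', '0003_performancesignature_lower_is_better'),
    ]

    operations = [
        migrations.RunSQL(
            sql="ALTER TABLE performance_signature "
                "ALTER COLUMN lower_is_better SET DEFAULT 1;",
            reverse_sql="ALTER TABLE performance_signature "
                        "ALTER COLUMN lower_is_better DROP DEFAULT;",
        ),
    ]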
Assignee
Comment 13•9 years ago
The newly filed bug 1223450 would help here: the slow transaction traces would then have the original query values present in New Relic, and we could run those queries against prod ourselves to see if they reproduce. We'll have to see what New Relic says in their reply.
Depends on: 1223450
Assignee
Comment 14•9 years ago
I ended up reverting the ORDER BY change, since it seems to be a contributing factor to DB load. The first and last deploy markers here are when it landed and was reverted:
https://rpm.newrelic.com/accounts/677903/applications/4180461/datastores?tw%5Bend%5D=1447189515&tw%5Bstart%5D=1447167915#/overview/All/drilldown?metric=Datastore%252Fstatement%252FMySQL%252Fjob%252Fselect&value=total_call_time_per_minute
Prod is now on 8a511b7 (what we tried to deploy last night originally) along with the revert cherry-picked.
There are still other issues we need to address (on its own the ORDER BY change shouldn't have been enough to impact us), so I'll leave this bug open for now.
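Before re-landing the ORDER BY it would be worth checking how MySQL actually executes it; a quick way to do that from a Django shell is sketched below (the query is only a stand-in for the real jobs-list query, and the exact table/columns should be taken from the endpoint's SQL):

# Sketch: check whether the ORDER BY forces a filesort over the scanned rows,
# which is the usual way an innocuous-looking ORDER BY turns into DB load.
from django.db import connection

with connection.cursor() as cursor:
    cursor.execute(
        "EXPLAIN SELECT * FROM job "
        "WHERE result_set_id = %s "
        "ORDER BY id DESC LIMIT 2000",
        [12345],  # placeholder result_set_id
    )
    for row in cursor.fetchall():
        print(row)
# If the Extra column shows "Using filesort" without a suitable index, the
# sort happens per request over all matching rows rather than via an index.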
Assignee
Updated•9 years ago
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED