Closed Bug 1223335 Opened 9 years ago Closed 9 years ago

Figure out the cause of the DB load/timeouts in bug 1223174

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

References

Details

Yesterday evening production Treeherder was experiencing heavy DB load, causing API requests to time out. The number of these timeouts (or perhaps the time it takes for the workers to recover?) then caused the load balancer to think the web nodes were bad and mark them as unavailable. This bug is to figure out what happened and to roll out the prod push again, since it was reverted to see whether that fixed the issue.
The rough timeline is as follows (all times to follow in UK time, UTC):

2047: Will did a prod deploy after checking state and getting the ok from the other devs.
Log: http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/production.1447102032

2126: Another prod deploy was performed, since the IRC bot messages implied the first had failed (it hadn't, see the log above; this is an unfortunate known issue with the IRC bot).
Log: http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/production.1447104360

2144: Another prod deploy was initiated - which, looking at the logs, seems to be because the first few were actually deploying ed84987, the old prod revision, so would have been no-ops. This deploy used 8a511b7, which was correct.
Log: http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/production.1447105494

2145: The deploy script performed the DB migrate, which ran the perf.0003_performancesignature_lower_is_better migration. The |./manage.py migrate --noinput| command took 3.3 seconds to run.

2154: The deploy script had finished pushing code to the prod nodes/draining/restarting processes/undraining etc.

2210: The New Relic error rate spiked (see links below), the ping monitor reported that the API was not responding & the sheriffs mentioned having issues in #developers.

2312: Will started the deploy of the former revision ed84987 to production (what was on prod previously).
Log: http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/production.1447110767

2313: The New Relic error rate can be seen to drop, web transactions throughput starts to increase etc.

2322: The deploy completed updating the nodes.

Will, did you run the management command `remove_single_test_summaries` at any point? If so, what time?
Flags: needinfo?(wlachance)
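(Aside: one way to vet a deploy-time migration like perf.0003_performancesignature_lower_is_better ahead of running |./manage.py migrate --noinput| against prod is to preview the SQL it will emit and the list of pending migrations. A minimal sketch using stock Django management commands, equivalent to running sqlmigrate/showmigrations from a checkout of the revision being deployed:)

from django.core.management import call_command

# Preview the exact SQL that the perf 0003 migration will run, without touching
# the database (equivalent to `./manage.py sqlmigrate perf 0003`).
call_command('sqlmigrate', 'perf', '0003')

# Show which perf migrations are already applied vs. still pending on the target
# database (equivalent to `./manage.py showmigrations perf`).
call_command('showmigrations', 'perf')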
New Relic links (filtered to the appropriate time range):

* Error rate: https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors?tw%5Bend%5D=1447112244&tw%5Bstart%5D=1447103703
* Web transactions: https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?tw%5Bend%5D=1447112244&tw%5Bstart%5D=1447103703
* Celery transactions: https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?show_browser=false&tw%5Bend%5D=1447112244&tw%5Bstart%5D=1447103703&type=other
* Database stats: https://rpm.newrelic.com/accounts/677903/applications/4180461/datastores?tw%5Bend%5D=1447112244&tw%5Bstart%5D=1447103703#/overview/All?value=total_call_time_per_minute
* Master DB server stats: https://rpm.newrelic.com/accounts/677903/servers/6106885?tw%5Bend%5D=1447112244&tw%5Bstart%5D=1447103703
* MySQL New Relic Plugin (only available on the other New Relic account):
  - Overview: https://rpm.newrelic.com/accounts/263620/dashboard/3942659?tw%5Bend%5D=1447112244&tw%5Bstart%5D=1447103703
  - Query analysis: https://rpm.newrelic.com/accounts/263620/dashboard/3942659/page/2?tw%5Bend%5D=1447112244&tw%5Bstart%5D=1447103703
  - InnoDB metrics: https://rpm.newrelic.com/accounts/263620/dashboard/3942659/page/3?tw%5Bend%5D=1447112244&tw%5Bstart%5D=1447103703

The changes that were deployed: https://github.com/mozilla/treeherder/compare/ed84987...8a511b7

Things that stand out:

* The perf migration to add a column to the performancesignature table (https://github.com/mozilla/treeherder/commit/81a7d5148f2766fd5d4af0d6c1156163e29fcc3f#diff-7b6fa7ce1bb6db25d07f8271f2e4e7f9R14)
* The management command to manually remove single test summaries from the performancesignature table (https://github.com/mozilla/treeherder/commit/06fbcd9b09f06f680af1d9a64190fc1e085c1cb9#diff-073e5306b556663c93b32263233c6bd2)
* Adding an ORDER BY to the jobs list query (https://github.com/mozilla/treeherder/commit/761ae7f163d0550b124045125255cd0c3b7bd415) - see the sketch after this list for why this can be expensive
* Shortening the gunicorn timeout from 120s to 30s (https://github.com/mozilla/treeherder/commit/50cec5c1fe71a04c0f8da0bef87783bd9ad621c8)

Observations so far:

* There is an error in the deploy script `[localhost] err: /root/bin/th_functions.sh: line 55: [: too many arguments` (likely unrelated, but we should fix it).
* The error rate went down *before* the deploy of the revert had fully completed. Looking at the log to see which nodes had been updated will likely give clues (see http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/production.1447110767).
* We really should move more of our own internal tasks to access the DB directly rather than hitting our public API. Going via the API puts additional load on it, muddies the water when comparing our own requests to those from third parties, and makes API issues more severe, since they break our own ingestion (bug 1211414).
* The job ETA table still features highly in DB profiles; we really should fix this soon (bug 1196764).
* Perhaps when we move to Heroku and Amazon RDS, we should give stage the same data retention as prod, since we clearly have more of a difference between environments than is ideal - we didn't have issues on stage prior to the deploy. (Though we'd still have the issue of generating API load from users of the UI etc.)
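To make the ORDER BY item above concrete, here is a rough sketch of the shape of such a change and why it can matter. The model and field names are illustrative stand-ins, not Treeherder's actual code; the real change is in the commit linked above:

from django.db import models


class Job(models.Model):
    # Hypothetical stand-in for the jobs table; field names are illustrative only
    # and not Treeherder's actual schema.
    result_set_id = models.IntegerField(db_index=True)
    state = models.CharField(max_length=25)
    last_modified = models.DateTimeField()

    class Meta:
        app_label = 'example'


def fetch_jobs(result_set_id, count=2000):
    # Without an explicit order_by, MySQL returns rows in whatever order is
    # cheapest for it (here, the result_set_id index order).
    unordered = Job.objects.filter(result_set_id=result_set_id)[:count]

    # The deployed change added an ordering along these lines. If the sort column
    # is not part of an index that also matches the WHERE clause, MySQL has to
    # filesort the entire matching range on every request, which is how a
    # small-looking change can noticeably raise DB load on a hot endpoint.
    ordered = (Job.objects
               .filter(result_set_id=result_set_id)
               .order_by('-last_modified')[:count])
    return unordered, ordered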
(In reply to Ed Morley [:emorley] from comment #1)
> Will, did you run the management command `remove_single_test_summaries` at
> any point? If so, what time?

Pretty much immediately after the deploy. I strongly suspect the problem is that I ran this command unthrottled, falsely thinking that because there were no problems on stage/Heroku there would be no issues on prod. This was pretty dumb of me, I'm sorry (I even added a rate-limiting argument to the command, which I then didn't use, argh).

I'd suggest re-deploying prod and just seeing how it does. I'll hold off on re-running that script for a while, as it isn't totally urgent (some perf results will look a bit wonky until it runs, but it's not a huge deal).
Flags: needinfo?(wlachance)
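(For reference, the batch-and-sleep pattern such a rate-limiting argument would presumably enable looks roughly like the sketch below. The option names and structure are illustrative, not the actual `remove_single_test_summaries` implementation:)

import time

from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Sketch of a rate-limited bulk cleanup (illustrative, not the real command)"

    def add_arguments(self, parser):
        # Hypothetical option names, shown only to illustrate the pattern.
        parser.add_argument('--batch-size', type=int, default=100)
        parser.add_argument('--sleep', type=float, default=1.0,
                            help='seconds to pause between batches, to cap DB load')

    def handle(self, *args, **options):
        while True:
            # The real selection/deletion logic lives in Treeherder and is not
            # reproduced here; these hooks just stand in for it.
            ids = self.get_candidate_ids(limit=options['batch_size'])
            if not ids:
                break
            self.delete_rows(ids)          # one small DELETE per batch
            time.sleep(options['sleep'])   # give the DB (and replication) breathing room

    def get_candidate_ids(self, limit):
        raise NotImplementedError('placeholder for the real selection query')

    def delete_rows(self, ids):
        raise NotImplementedError('placeholder for the real delete')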
We missed each other on IRC (it said your client got disconnected). I'm not 100% convinced it was all your script - so I'm going to push to prod, but this time incrementally, so we can be sure what causes problems (if anything) - starting with 06fbcd9.

Log: http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/production-06fbcd9.1447162366

Incidentally, even before this re-deploy there were failures in the log parser logs - guessing because the schema changes from yesterday are still in place? Should we add a default when adding new columns, to avoid errors when rolling back?

[2015-11-10 05:30:15,279: ERROR/Worker-109] Response body: {"detail":"(1364, \"Field 'lower_is_better' doesn't have a default value\")"}
(Full output: https://emorley.pastebin.mozilla.org/8851782)
Looking back to yesterday evening, prior to the deploy, th-prod-web1's /var/log/gunicorn/treeherder_error.log-20151108 log contained:

[2015-11-03 17:31:54 +0000] [8839] [CRITICAL] WORKER TIMEOUT (pid:28063)
[2015-11-03 17:32:35 +0000] [8839] [CRITICAL] WORKER TIMEOUT (pid:28051)
[2015-11-03 17:33:11 +0000] [8839] [CRITICAL] WORKER TIMEOUT (pid:28106)
[2015-11-04 13:13:36 +0000] [8839] [CRITICAL] WORKER TIMEOUT (pid:7648)
[2015-11-04 13:14:20 +0000] [8839] [CRITICAL] WORKER TIMEOUT (pid:7664)

(That's the entire log file.)

After the deploy, /var/log/gunicorn/treeherder_error.log contains:

[2015-11-09 21:47:49 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29731)
[2015-11-09 21:47:49 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29727)
[2015-11-09 21:47:57 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29729)
[2015-11-09 21:48:03 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29728)
[2015-11-09 21:48:23 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29804)
[2015-11-09 21:48:26 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29803)
[2015-11-09 21:48:38 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29792)
[2015-11-09 21:48:51 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29830)
[2015-11-09 21:48:59 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29834)
[2015-11-09 21:49:15 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29838)
[2015-11-09 21:49:17 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29842)
[2015-11-09 21:49:22 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29846)
[2015-11-09 21:50:03 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29851)
[2015-11-09 21:50:08 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29903)
[2015-11-09 21:50:15 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29907)
[2015-11-09 21:50:29 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29888)
[2015-11-09 21:50:41 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:30005)
[2015-11-09 21:51:02 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29998)
[2015-11-09 21:51:04 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:29917)
[2015-11-09 21:51:16 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:30021)
[2015-11-09 21:51:17 +0000] [29714] [CRITICAL] WORKER TIMEOUT (pid:30031)
... (and hundreds more)

That error message is given when we reach gunicorn's timeout value. We had just lowered the timeout, but requests shouldn't be taking 30s in normal cases anyway, so it may be more effect than cause.
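(For anyone unfamiliar: that message means the gunicorn master killed a worker that spent longer than the configured timeout on a single request. A minimal sketch of the relevant settings, assuming a Python gunicorn config file; the real deployment may set these via command-line flags instead:)

# gunicorn.conf.py (illustrative; values match the change described above)

# Kill and restart any synchronous worker that spends longer than this many
# seconds handling a single request. This was lowered from 120 to 30 here.
timeout = 30

# How long to wait for workers to finish in-flight requests on graceful restart,
# e.g. during the drain/undrain phase of a deploy.
graceful_timeout = 30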
No issues seen on the first deploy - running next incremental deploy, production-c6e4151: http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/production-c6e4151.1447166046
Depends on: 1223384
(In reply to Ed Morley [:emorley] from comment #2)
> * There is an error in the deploy script `[localhost] err:
> /root/bin/th_functions.sh: line 55: [: too many arguments`
> (likely unrelated, but we should fix).

Weird! And it's only shown up the one time. That bit is just verifying the number of active connections to the web node after draining or undraining, so it's not critical, but I'll look into it.

[localhost] out: undraining 10.22.81.36:8080 in treeherder-https.
[localhost] out: Active connections on treeherder3.webapp.scl3.mozilla.com: 33
[localhost] err: /root/bin/th_functions.sh: line 55: [: too many arguments
Hmm, one strange thing I've noticed (unrelated to this, but it still puts a lot of load on the DB) is that every 12 hours the throughput on treeherder.webapp.api.jobs:JobsViewSet.retrieve goes through the roof - from the normal ~5 requests per minute up to ~1500:
https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?tw%5Bend%5D=1447168139&tw%5Bstart%5D=1446908939&type=app#id=5b225765625472616e73616374696f6e2f46756e6374696f6e2f747265656865726465722e7765626170702e6170692e6a6f62733a4a6f6273566965775365742e7265747269657665222c22225d&sort_by=throughput

Clearer view of the latest occurrence:
https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?tw%5Bend%5D=1447168460&tw%5Bstart%5D=1447160423&type=app#id=5b225765625472616e73616374696f6e2f46756e6374696f6e2f747265656865726465722e7765626170702e6170692e6a6f62733a4a6f6273566965775365742e7265747269657665222c22225d&sort_by=throughput

...and the corresponding spike in overall DB throughput:
https://rpm.newrelic.com/accounts/677903/applications/4180461/datastores?tw%5Bend%5D=1447168460&tw%5Bstart%5D=1447160423#/overview/All?value=total_call_time_per_minute
-> it increased our overall database calls per minute from 20K to 40K!

Is someone scraping jobs one by one, every 12 hours?

Regardless of whether this made yesterday's issue any worse, it seems like we need:
(a) to impose GET rate limits (bug 1203556) - see the sketch after this list;
(b) to educate people to use the right endpoints (rather than scraping jobs individually) and/or add endpoints if they need them;
(c) to see if there are any perf optimisations we can make for JobsViewSet.retrieve (bearing in mind the UI and most other things use JobsViewSet.list, so we may not have had cause to optimise it previously).
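(A minimal sketch of what (a) could look like using Django REST Framework's built-in throttling; the rates are made-up examples, and the actual policy belongs to bug 1203556:)

# Django settings fragment (illustrative rates only)
REST_FRAMEWORK = {
    # Throttle anonymous and authenticated callers separately, keyed by IP/user.
    'DEFAULT_THROTTLE_CLASSES': (
        'rest_framework.throttling.AnonRateThrottle',
        'rest_framework.throttling.UserRateThrottle',
    ),
    'DEFAULT_THROTTLE_RATES': {
        'anon': '300/minute',   # example ceiling for anonymous GETs
        'user': '600/minute',   # example ceiling for authenticated clients
    },
}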
And now production-8a511b7 (includes the 0003_performancesignature_lower_is_better migration): http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/production-8a511b7.1447175273
(In reply to Ed Morley [:emorley] from comment #4)
> Incidentally even before this re-deploy, there were failures in the log
> parser logs - guessing because the schema changes from yesterday are still
> in place? Should we add a default when adding new columns to avoid errors
> when rolling back?
>
> [2015-11-10 05:30:15,279: ERROR/Worker-109] Response body: {"detail":"(1364,
> \"Field 'lower_is_better' doesn't have a default value\")"}
> (Full output: https://emorley.pastebin.mozilla.org/8851782)

This would appear to be due to the invalid/wontfixed https://code.djangoproject.com/ticket/2231 - tl;dr Django defaults are only enforced at the model level, strangely.
(In reply to Ed Morley [:emorley] from comment #11)
> tl;dr Django defaults are only enforced at the model level, strangely.

More reasoning in: https://code.djangoproject.com/ticket/470#comment:2
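(In concrete terms: a field declared with a Python-side default gets its value filled in by Django at save time, but the generated ALTER TABLE carries no database-level DEFAULT, so code from before the migration that INSERTs without the new column - as happens during a rollback - hits MySQL error 1364. A sketch of one possible workaround, setting the DB-level default explicitly in the migration; the field type, default value, dependency name and table name below are assumptions for illustration, not copied from the actual perf 0003 migration:)

from django.db import migrations, models


class Migration(migrations.Migration):
    # Illustrative only: adds the column and also gives it a database-level
    # default, so rows inserted by pre-migration code don't fail with error 1364.
    dependencies = [
        ('perf', '0002_placeholder'),  # hypothetical previous migration
    ]

    operations = [
        migrations.AddField(
            model_name='performancesignature',
            name='lower_is_better',
            field=models.BooleanField(default=True),  # enforced in Python only
        ),
        # Mirror the default at the MySQL level as well (table name assumed).
        migrations.RunSQL(
            sql="ALTER TABLE performance_signature "
                "ALTER COLUMN lower_is_better SET DEFAULT 1",
            reverse_sql="ALTER TABLE performance_signature "
                        "ALTER COLUMN lower_is_better DROP DEFAULT",
        ),
    ]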
The newly filed bug 1223450 would help here - the slow transaction traces would then have the original query parameter values present in New Relic, and we could run the queries against prod ourselves to see if they reproduce, etc. We'll have to see what New Relic says in their reply.
Depends on: 1223450
Depends on: 1223496
I ended up reverting the ORDER BY change, since it seems to have been a contributing factor to the DB load. The first and last deploy markers here are when it landed and when it was reverted:
https://rpm.newrelic.com/accounts/677903/applications/4180461/datastores?tw%5Bend%5D=1447189515&tw%5Bstart%5D=1447167915#/overview/All/drilldown?metric=Datastore%252Fstatement%252FMySQL%252Fjob%252Fselect&value=total_call_time_per_minute

Prod is now on 8a511b7 (what we originally tried to deploy last night) with the revert cherry-picked on top. There are still other issues we need to address (on its own the ORDER BY change shouldn't have been enough to impact us), so I'll leave this bug open for now.
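(Before re-landing the ORDER BY, it may be worth confirming from prod's query plan whether the sort is served from an index or falls back to a filesort over a large range. A minimal sketch using Django's raw cursor; the SQL below is a stand-in with illustrative table/column names, not the actual jobs-list query:)

from django.db import connection

# Illustrative query shape only; substitute the real SQL captured from New Relic
# or from Django's query logging.
SQL = (
    "EXPLAIN "
    "SELECT id, state, result FROM job "
    "WHERE result_set_id = %s "
    "ORDER BY last_modified DESC LIMIT 2000"
)


def explain_jobs_query(result_set_id):
    """Print MySQL's query plan; 'Using filesort' in the Extra column is the red flag."""
    with connection.cursor() as cursor:
        cursor.execute(SQL, [result_set_id])
        for row in cursor.fetchall():
            print(row)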
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Depends on: 1226565