Closed Bug 1386331 Opened 7 years ago Closed 7 years ago

Webheads returning 500/503s due to long running DB queries from unbounded job_details API calls

Categories

(Tree Management :: Treeherder: Infrastructure, enhancement, P1)
Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: camd, Unassigned)

References

(Depends on 1 open bug)

Details

Attachments

(2 files)

Looking at RDS, it seems like it could be a DB issue.
Summary: Response times have gone way down. Getting 503s → Response times have gone way down. Getting 500s
This may have been an AWS or Heroku issue: https://status.heroku.com/incidents/1241 (credit to dividehex for noticing this)
Attached image Heroku web metrics.png
5xx errors begin to clear up right around the time Heroku reports routing delays.
The 5xx errors seemed to stop around 17:23-17:24 UTC. Around the same time, the prod RDS instance went through a shutdown/restart:
treeherder-prod Instances  August 1, 2017 at 10:25:22 AM UTC-7  DB instance restarted
treeherder-prod Instances  August 1, 2017 at 10:25:15 AM UTC-7  DB instance shutdown
Heroku incident 1241 looks to be the closest thing to a smoking gun we can find so far. Treeherder has now recovered, however; I believe the trees can be reopened.
Update from Heroku at 11:26am PT: (US) Routing latency. We have isolated the cause of network degradation in the US region. The underlying request issues in the EU region seem to have been resolved.
This issue appears to be resolved now.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Attachment #8892544 - Attachment filename: Screenshot 2017-08-01 10.51.38 (2).png → Heroku web metrics
Attachment #8892544 - Attachment description: Screenshot 2017-08-01 10.51.38 (2).png → Heroku web metrics.png
Attachment #8892544 - Attachment filename: Heroku web metrics → Heroku web metrics.png
Cameron and I had a meeting about this earlier. I wasn't sure if we'd be able to track down the cause retrospectively, since with high-load issues like this it's often hard to separate cause and effect (e.g. is a query slow because the DB is under load, or is it the reason the DB is under load in the first place?). However, I managed to find some suspicious-looking queries via Papertrail, for example:

Aug 01 16:33:02 treeherder-prod heroku/router: at=error code=H12 desc="Request timeout" method=GET path="/api/jobdetail/?repository=mozilla-central&job_id__in=" host=treeherder.mozilla.org request_id=e31002ff-cfee-4174-abdc-ca0adc5fd959 fwd="REDACTED" dyno=web.3 connect=1ms service=30001ms status=503 bytes=0 protocol=https
Aug 01 17:04:07 treeherder-prod heroku/router: at=error code=H13 desc="Connection closed without response" method=GET path="/api/jobdetail/?repository=mozilla-central&job_id__in=" host=treeherder.mozilla.org request_id=9d50279c-9c97-4bd9-9ce1-0f6125ee1b13 fwd="REDACTED" dyno=web.3 connect=0ms service=29205ms status=503 bytes=0 protocol=https
Aug 01 17:08:33 treeherder-prod heroku/router: at=error code=H12 desc="Request timeout" method=GET path="/api/jobdetail/?repository=mozilla-central&job_id__in=" host=treeherder.mozilla.org request_id=bd983d1d-6381-4323-b3f0-83fadb47e36d fwd="REDACTED" dyno=web.3 connect=1ms service=30000ms status=503 bytes=0 protocol=https

Note that the value of the `job_id__in` parameter is empty. Making the same request myself resulted in a timeout and a query that was still running on the DB 10 minutes later, when I killed it.

The SQL being run: https://emorley.pastebin.mozilla.org/9028800
The EXPLAIN of that query: https://emorley.pastebin.mozilla.org/9028802
Notably: "USING TEMPORARY; USING filesort"

Checking the CloudWatch graphs on RDS shows that the queries likely took hours to complete and, most critically, created temporary tables for the join that consumed up to 70GB of disk, pushing the DB instance close to full disk usage (see attached screenshot).

Things to bear in mind when debugging issues like this in the future:
* If the DB is under load, always check the "client connections" page in MySQL Workbench (or `SHOW FULL PROCESSLIST` via the CLI) before doing anything else.
* If the DB is believed to be stuck, only restart/failover as a last resort, since that results in a cold cache, which makes performance worse. Instead, just kill any stuck queries directly.

Identified action items (in rough order of relevance):
* Fixing this specific extreme load case (bug 1387147), by:
  - fixing the UI to not trigger these broken requests (I tracked it down to: https://github.com/mozilla/treeherder/blob/2e341fa06f1b912dbdfb214e5f0ca86e272dcebd/ui/plugins/controller.js#L353-L356)
  - making the API reject an empty `job_id__in` parameter (see the sketch below).
* Making the job details API less expensive in general, by:
  - removing unnecessary joins (bug 1343239)
  - reducing the size of the table (bug 1342296)
* Avoiding similar issues of this type, by:
  - using MySQL 5.7's new "maximum query time" setting, now that we're soon switching to it (bug 1387258)
  - devising a more generic validation strategy for the API (bug 1280494)
* Tweaks to monitoring to provide more information when debugging:
  - figure out how to get the RDS instance to report to New Relic (bug 1201063)
  - see if there are any additional CloudWatch alerts that could catch this case (bug 1306597)
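For illustration only, here is a minimal sketch of the "reject an empty `job_id__in`" fix as a Django REST framework view; the view name, parameter handling and error message are assumptions, not Treeherder's actual implementation:

# Hedged sketch, not Treeherder's actual code: reject an empty
# job_id__in parameter before it reaches the ORM, so the query can
# never degenerate into an unbounded scan/join over job_detail.
from rest_framework import viewsets
from rest_framework.response import Response
from rest_framework.status import HTTP_400_BAD_REQUEST


class JobDetailViewSet(viewsets.ViewSet):  # hypothetical view name
    def list(self, request):
        job_id_in = request.query_params.get('job_id__in')
        if job_id_in is not None and not job_id_in.strip():
            # An empty id list would drop the filter entirely and run the
            # query against every row; fail fast with a 400 instead.
            return Response(
                {'detail': 'job_id__in must be a non-empty list of job ids'},
                status=HTTP_400_BAD_REQUEST,
            )
        # ... normal filtering and serialisation would continue here ...
        return Response([])

Validating at the view layer like this stops the bad request before it ever reaches MySQL, which is cheaper than relying on a server-side query timeout to clean up afterwards.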
Priority: -- → P1
Summary: Response times have gone way down. Getting 500s → Webheads returning 500/503s due to long running DB queries from unbounded job_details API calls
Just to note: please do not run `job_id__in=` queries to test until bug 1387147 is fixed, or someone will have to manually kill the DB queries.
I just had to kill another three stuck queries, of duration 2610, 2496 and 348 seconds respectively.
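For reference, a hedged sketch of how stuck queries like these could be found and killed from a script instead of via MySQL Workbench; the host, credentials and 300-second threshold below are placeholders, not Treeherder/RDS configuration:

# Hedged sketch: list queries running longer than a threshold and kill
# them. Assumes the mysqlclient (MySQLdb) driver and a DB user with
# PROCESS privileges; connection details are placeholders.
import MySQLdb

LONG_RUNNING_SECONDS = 300  # assumed threshold, tune as needed

conn = MySQLdb.connect(host='example-rds-host', user='admin',
                       passwd='...', db='treeherder')
cur = conn.cursor()
cur.execute(
    "SELECT id, time, info FROM information_schema.processlist "
    "WHERE command = 'Query' AND time > %s",
    (LONG_RUNNING_SECONDS,),
)
for query_id, duration, sql in cur.fetchall():
    # In practice, eyeball each query before killing it.
    print('Killing query %s (%ss): %.80s' % (query_id, duration, sql or ''))
    cur.execute('KILL %s' % int(query_id))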
Depends on: 1387487
Depends on: 1387508
To update those CCed here: I've tracked down the source of the malformed requests to the API, which is the screenshots compare tool at screenshots.mattn.ca. For more details see bug 1387147 comment 6. (Note that it's not the tool's fault per se; the API should be [and soon will be] resilient to this kind of thing.)
Depends on: 1387642