Closed Bug 1386331 (Opened 7 years ago, Closed 7 years ago)
Webheads returning 500/503s due to long running DB queries from unbounded job_details API calls
Categories: Tree Management :: Treeherder: Infrastructure, enhancement, P1
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: camd, Unassigned
References: Depends on 1 open bug
Attachments: 2 files
Description:
Looking at RDS, it seems like it could be a DB issue.
Updated • 7 years ago (reporter)
Summary: Response times have gone way down. Getting 503s → Response times have gone way down. Getting 500s
Comment 1 • 7 years ago (reporter)
This may have been an AWS or Heroku issue:
https://status.heroku.com/incidents/1241
(credit to dividehex for noticing this)
Comment 2 • 7 years ago
5xx errors begin to clear up right around the time Heroku reports routing delays.
Comment 3 • 7 years ago
The 5xx errors seemed to stop around 17:23-17:24 UTC. Around the same time, the prod RDS instance started a shutdown/restart:
treeherder-prod | Instances | August 1, 2017 at 10:25:22 AM UTC-7 | DB instance restarted
treeherder-prod | Instances | August 1, 2017 at 10:25:15 AM UTC-7 | DB instance shutdown
Comment 4 • 7 years ago (reporter)
The Heroku incident 1241 looks to be the closest thing to a smoking gun we can find so far. Treeherder has now recovered, however. I believe the trees can be reopened.
Comment 5 • 7 years ago (reporter)
Update from Heroku 11:26am PT:
Update: (US) Routing latency.
We have isolated the cause of network degradation in US region. The underlying request issues in EU region seems to have been resolved.
Comment 6 • 7 years ago (reporter)
This issue appears to be resolved now.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Updated • 7 years ago
Attachment #8892544 filename: Screenshot 2017-08-01 10.51.38 (2).png → Heroku web metrics
Updated • 7 years ago
Attachment #8892544 description: Screenshot 2017-08-01 10.51.38 (2).png → Heroku web metrics.png
Attachment #8892544 filename: Heroku web metrics → Heroku web metrics.png
Comment 7 • 7 years ago
Comment 8 • 7 years ago
Cameron and I had a meeting about this earlier.
I wasn't sure if we'd be able to track down the cause retrospectively, since with high-load issues like this it's often hard to separate cause and effect (e.g. is a query slow because the DB is being slow, or is the query the reason the DB is being slow in the first place?).
However, I managed to find some suspicious-looking queries via Papertrail, for example:
Aug 01 16:33:02 treeherder-prod heroku/router:
at=error code=H12 desc="Request timeout"
method=GET path="/api/jobdetail/?repository=mozilla-central&job_id__in="
host=treeherder.mozilla.org request_id=e31002ff-cfee-4174-abdc-ca0adc5fd959 fwd="REDACTED"
dyno=web.3 connect=1ms service=30001ms status=503 bytes=0 protocol=https
Aug 01 17:04:07 treeherder-prod heroku/router:
at=error code=H13 desc="Connection closed without response"
method=GET path="/api/jobdetail/?repository=mozilla-central&job_id__in="
host=treeherder.mozilla.org request_id=9d50279c-9c97-4bd9-9ce1-0f6125ee1b13 fwd="REDACTED"
dyno=web.3 connect=0ms service=29205ms status=503 bytes=0 protocol=https
Aug 01 17:08:33 treeherder-prod heroku/router:
at=error code=H12 desc="Request timeout"
method=GET path="/api/jobdetail/?repository=mozilla-central&job_id__in="
host=treeherder.mozilla.org request_id=bd983d1d-6381-4323-b3f0-83fadb47e36d fwd="REDACTED"
dyno=web.3 connect=1ms service=30000ms status=503 bytes=0 protocol=https
Note that the value for the `job_id__in` parameter is empty.
Making the same request myself results in a timeout and a query that was still running on the DB 10 minutes later, when I killed it.
The SQL being run is:
https://emorley.pastebin.mozilla.org/9028800
The EXPLAIN of that query:
https://emorley.pastebin.mozilla.org/9028802
Notably:
"USING TEMPORARY; USING filesort"
Checking the Cloudwatch graphs on RDS shows that the queries likely took hours to complete and, most critically, created temporary tables for the join that consumed up to 70GB of disk, pushing the DB instance close to full disk usage (see attached screenshot).
Things to bear in mind when debugging issues like this as they happen in the future:
* If the DB is under load, always check the "client connections" page in MySQL Workbench (or `SHOW FULL PROCESSLIST` via the CLI) before doing anything else.
* If the DB is believed to be stuck, only restart/failover as a last resort, since that results in a cold cache, which makes performance worse. Instead just kill any stuck queries directly (see the sketch below).
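As a concrete illustration of those two points, here is a minimal sketch (not something run during this incident) of checking the process list and killing only the stuck queries, using the mysqlclient (MySQLdb) driver. The connection details are placeholders, and the 5-minute threshold is an arbitrary example.

# Sketch only: find and kill long-running queries instead of restarting the instance.
# On RDS, CALL mysql.rds_kill(<id>) may be required instead of KILL, depending on
# the privileges of the user you connect as.
import MySQLdb

conn = MySQLdb.connect(host="example-rds-endpoint", user="admin",
                       passwd="secret", db="treeherder")
cur = conn.cursor()

# Equivalent to SHOW FULL PROCESSLIST, filtered to queries running for over 5 minutes.
cur.execute("""
    SELECT id, time, LEFT(info, 120)
    FROM information_schema.processlist
    WHERE command = 'Query' AND time > 300
    ORDER BY time DESC
""")

for thread_id, seconds, query in cur.fetchall():
    print("Killing thread %s (%ss): %s" % (thread_id, seconds, query))
    cur.execute("KILL %d" % int(thread_id))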
Identified action items (in rough order of relevance):
* Fixing this specific extreme load case (bug 1387147), by:
- Fixing the UI to not trigger these broken requests (I tracked it down to: https://github.com/mozilla/treeherder/blob/2e341fa06f1b912dbdfb214e5f0ca86e272dcebd/ui/plugins/controller.js#L353-L356)
- Making the API reject an empty `job_id__in` parameter (a sketch of this follows the list below).
* Making the job details API less expensive in general, by:
- removing unnecessary joins (bug 1343239)
- reducing the size of the table (bug 1342296)
* Avoiding similar issues of this type, by:
- now that we're soon switching to MySQL 5.7, using its new "maximum query time" setting (bug 1387258).
- devising a more generic validation strategy for the API (bug 1280494).
* Tweaks to monitoring to provide more information when debugging:
- figure out how to get the RDS instance to report to New Relic (bug 1201063).
- see if there are any additional cloudwatch alerts that could catch this case (bug 1306597).
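For the `job_id__in` validation item above, here is a minimal sketch of the kind of guard meant, assuming a Django REST framework view; the class name and response shape are placeholders, not the actual Treeherder patch:

# Sketch only: reject an empty job_id__in before it reaches the database.
from rest_framework import viewsets
from rest_framework.response import Response

class JobDetailViewSet(viewsets.ViewSet):
    def list(self, request):
        raw = request.query_params.get("job_id__in", "")
        job_ids = [j for j in raw.split(",") if j.strip()]
        if not job_ids:
            # Fail fast with a 400 rather than handing MySQL an unbounded filter
            # that joins the whole job_detail table and spills to temporary tables.
            return Response(
                {"detail": "job_id__in must contain at least one job id"},
                status=400,
            )
        # ...run the bounded job details query for job_ids here...
        return Response([])

Combined with the UI fix, this turns the hours-long unbounded query into an immediate, cheap 400 error.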
Updated • 7 years ago
Priority: -- → P1
Summary: Response times have gone way down. Getting 500s → Webheads returning 500/503s due to long running DB queries from unbounded job_details API calls
Comment 9 • 7 years ago
Just to note: please do not run `job_id__in=` queries to test until bug 1387147 is fixed, or someone will have to manually kill the DB queries.
Comment 10 • 7 years ago
I just had to kill another three stuck queries, of duration 2610, 2496 and 348 seconds respectively.
Comment 11 • 7 years ago
To update those CCed here: I've tracked down the source of the malformed requests to the API, which is the screenshots compare tool at screenshots.mattn.ca. For more details see bug 1387147 comment 6. (Note that it's not the tool's fault per se; the API should be [and soon will be] resilient to this kind of thing.)