Closed Bug 1287501 Opened 9 years ago Closed 9 years ago

gunicorn worker exhaustion & 100% DB CPU due to unbounded /jobs/ API queries

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: keeler, Unassigned)

References

Details

Attachments

(4 files)

STR: 1. navigate to treeherder.mozilla.org 2. hum the Jeopardy! theme song a few times 3. treeherder still hasn't completely loaded
18:28 <emorley> camd: the slow transactions have a last_modified__gt of "-infinity" is that expected? 18:28 <•emorley> eg https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/319e3c62-4d0b-11e6-9e2c-c81f66b8ceca_11418_17782 18:29 <•emorley> and https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?type=app#id=5b225765625472616e73616374696f6e2f46756e6374696f6e2f747265656865726465722e7765626170702e6170692e6a6f62733a4a6f6273566965775365742e6c697374222c22225d&app_trace_id=26ebba60-4d0b-11e6-84e9-b82a72d22a14_20455_23171 Running this on the prod webheads shows a couple of IPs from which most of the requests originate, though there is a long tail of others, so unclear if just a bug in the UI (eg after bug 1276030) $ grep "Infinity" /var/log/httpd/treeherder.mozilla.org/access_log | cut - d ' ' -f 1 | sort | uniq -c | sort -nr Queries are like: [18/Jul/2016:03:36:50 +0000] "GET /api/project/nss-try/jobs/?count=2000&last_modified__gt=-Infinity&offset=2000&result_set_id__in=277,276,275,274,273,272,271,270,269,268,267,266&return_type=list HTTP/1.1" 200 1008 "https://treeherder.mozilla.org/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:50.0) Gecko/20100101 Firefox/50.0" [18/Jul/2016:05:49:31 +0000] "GET /api/project/try/jobs/?count=2000&last_modified__gt=-Infinity&offset=10000&result_set_id__in=114809,114336,112738,112581,112482,112382,112147,112023,111762,111739&return_type=list HTTP/1.1" 200 202863 "https://treeherder.mozilla.org/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:50.0) Gecko/20100101 Firefox/50.0"
This feels like a tree-closing event to me, but I guess that's up to the people that actually need to use TH for a living :)
Flags: needinfo?(wkocher)
I don't really see how the frontend could ever be passing a value of -Infinity for date, barring a bug in Firefox. The value for setting/updating "lastUpdated" internally is very straightforward. I guess it might not hurt to put an assertion in that validates that we're not sending a crazy query here? Perhaps something like this: https://gist.github.com/wlach/722933270661b4a56024e325e2f06e3a
(In reply to Ryan VanderMeulen [:RyanVM] from comment #3) > This feels like a tree-closing event to me, but I guess that's up to the > people that actually need to use TH for a living :) I've closed trunk trees. Things actually seem to be getting slightly more responsive in the last 20 minutes or so, but I'll leave things closed for now.
Flags: needinfo?(wkocher)
Blocks: 1287455
It also sounds like the trees would be closed for bug 1286942 anyway, if it weren't for this.
Blocks: 1287455
No longer blocks: 1287455
Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/282a66b8f0fbb89658c153122fd3ac76b53dde15 Bug 1287501 - Restrict jobs query last_modified to within last 30 min… (#1702) * Bug 1287501 - Restrict jobs query last_modified to within last 30 minutes
Comment on attachment 8772175 [details] [review] [treeherder] wlach:1287501-ui > mozilla:master This serves two purposes: 1. Make sure we don't ask for stuff that is too old because of lack of job updates (e.g. watching a really old, completed try push). This complements Cam's server side fix for jobs. 2. Fail loudly and noisily if we're not passing a valid date for last_updated to the server -- this should never happen, but apparently it has for some reason I don't know. Yes, ideally we would fix the root cause here but I've been staring at the code in question for more than an hour and nothing is jumping out at me.
Attachment #8772175 - Flags: review?(cdawson)
Attachment #8772214 - Flags: review?(emorley)
Ed found a problem with my first fix attempt. So I made a new PR (r? to emorley now). I pushed that branch to both stage and prod tonight and all seems happier. I have NOT yet merged that last PR to master, however. I'll do that upon approval from ed tomorrow.
I've reopened the integration branches as things seem to be working much better now.
To summarise this issue: * For some reason there were many requests to Treeherder's /api/project/mozilla-inbound/jobs/ API that contained a completely unexpected (and invalid) parameter value of `last_modified__gt=-Infinity`. * This caused expensive (20-40s) DB queries against the jobs table, which maxed out the DB node CPU, plus meant at certain points the gunicorn workers on the webheads were all waiting for the DB to respond, meaning there were none left to serve further requests (which then leads to the Zeus load balancer to return HTTP 5XX). * These queries came from Firefox user agents rather than a script, but it's not clear why the Treeherder UI was suddenly generating these invalid date values, since we hadn't done a Treeherder deploy since last week, and even then it didn't change anything that seems relevant. It almost seemed like there was a Nightly regression that caused weirdness with date handling. Things to fix: 1) Make the API reject requests with invalid values for `last_modified__gt` (deployed already in comment 13) 2) Try to figure out why the UI was generating the invalid dates (Will's PR attached) 3) (Likely after moving to Heroku, so we can optimise for the dyno specs there) Investigate whether using a different gunicorn worker model (eg eventlet) would help avoid exhaustion. We can probably do this as part of bug 1175472. (Though with 100% DB CPU kind academic anyway.) An added complication was that the first PR in this bug (to fix #1) added: ``` datestr = filter.get("last_modified")[1] ``` However `filter` is an instance of our home-grown `UrlQueryFilter` class, whose defined `.get()` method actually unhelpfully also does a `.pop()`: https://github.com/mozilla/treeherder/blob/6a5fbb731090f1723a0625ead7335832e591e9ba/treeherder/webapp/api/utils.py#L70 So checking for the parameter value actually removed it from `filter.conditions` that is then passed to the query. So whilst the 1st PR successfully blocked queries containing an invalid parameter value, it also then stripped the last_modified filter from *all* requests, which meant we replaced the DB load from the rarely chronic cases, with less-severe-per-request but more frequent other cases. So we should also: 4) Make `UrlQueryFilter` less footgun-y. Checking for a parameter shouldn't remove it!
Priority: -- → P1
Summary: treeherder.mozilla.org is responding very very very slowly → gunicorn worker exhaustion & 100% DB CPU due to unbounded /jobs/ API queries
Comment on attachment 8772214 [details] [review] [treeherder] mozilla:fix-to-last-modified-fix > mozilla:master Added some comments to the PR. We really need tests for last_modified - it would have caught the regression in the first PR here much more easily than two of us having to notice the unfortunate (and non-obvious) bad UrlQueryFilter.get() behaviour by code inspection :-)
Attachment #8772214 - Flags: review?(emorley)
Morgan, I know you're working on date stuff, any chance you know of something that could've landed recently to cause this?
Flags: needinfo?(winter2718)
This is happening again now on production but not on staging.
The followup in comment 13 was landed not on the `production` git branch (as the 1st PR had), but on a branch named `prod-hotfix-7-18-2016`. As such, when Will wanted to deploy something on production (and needed to re-cherrypick the fixes here), the followup was accidentally reverted. Will is re-deploying the fix now. I think we need to: 1) Always be consistent in using the branch named `production` to avoid confusion (or delete the branch if using another temporarily) 2) Not deploy an unrelated cherrypick over a prod hotfix unless we absolutely have to (Will asked if it was ok, and I said yes, but I think we should be more wary in the future)
Oh and: 3) If there's any doubt as to the deployed revision (eg if it's known there may have been a hotfix applied), use any of: - whatsdeployed - https://treeherder.mozilla.org/revision.txt - IRC deploy bot message - chief logs at http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/ ...to check what the most recently deployed branch/revision was.
Depends on: 1287935
Comment on attachment 8772214 [details] [review] [treeherder] mozilla:fix-to-last-modified-fix > mozilla:master Hey man, here are the changes we discussed. Thanks. :)
Attachment #8772214 - Flags: review?(emorley)
Attachment #8772214 - Flags: review?(emorley) → review+
Ed: I'm actually going to add a few more tests that test basic less-than / greater-than accuracy. Also: this latest patch will remove the 30 minute limit. But if we hit something like this again, we may need to impose some kind of limiting policy based on params.
Depends on: 1288153
(In reply to Wes Kocher (:KWierso) from comment #19) > Morgan, I know you're working on date stuff, any chance you know of > something that could've landed recently to cause this? Not that I'm aware of. Placing a flag to see if I can get some data about what date strings are returning -INF.
Flags: needinfo?(winter2718)
Say Ed, any chance you could find one of the date strings that returned `-Infinity`? If that's being returned from `new Date(..)` it's a bug we'll want to fix in the JS Engine.
Flags: needinfo?(emorley)
We don't have the original string, only the "-Infinity" value that the Treeherder UI submitted to our API and was seen in the apache logs. The UI calculates the date string using: https://github.com/mozilla/treeherder/blob/6001f3caca1323d4d74c492e54b3f7e8b780ab7e/ui/js/models/resultsets_store.js#L1066-L1073
Flags: needinfo?(emorley)
Attachment #8772175 - Flags: review?(cdawson) → review+
Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/959540913d3ee5ea9c4b1cbb1978a13267984e64 Bug 1287501 - Fix /jobs/ endpoint to preserve `last_modified` param (#1704) A prior commit removed the ability to use "-Infinity" for the last_modified query param. However, the fix accidentally stripped the param entirely. This change ensures that the value is a valid date string. The range is not limited. This also adds some new tests to ensure the param of `last_modified` is working correctly when included.
Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/47b14a0205e5920dcfc729f2e064aa37c1f9eb5b Bug 1287501 - Attempt to make sure polling for new jobs is reasonable (#1703) * Validate that we're passing an actual date to the server (just in case...) * Update the last updated interval even if nothing has been returned
I believe we can mark this fixed now with that two commits above.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Comment on attachment 8774815 [details] [review] [treeherder] wlach:1287501-ui-2 > mozilla:master Yet another edge case for this code.
Attachment #8774815 - Flags: review?(cdawson)
Comment on attachment 8774815 [details] [review] [treeherder] wlach:1287501-ui-2 > mozilla:master Postponing r?, this patch doesn't work for some reason I don't understand yet.
Attachment #8774815 - Flags: review?(cdawson)
Comment on attachment 8774815 [details] [review] [treeherder] wlach:1287501-ui-2 > mozilla:master Ok, finally figured it out. When refetching *all* the jobs (which we do after sleeping for a long time), we would recalculate the lastUpdated time based on an array of arrays, which didn't work -- we would generate a value of -Infinity, because the last_updated property of an "array" is undefined: _.max(_.map([[]], function(job) { return new Date(job.last_modified + 'Z'); }))
Attachment #8774815 - Flags: review?(cdawson)
Good catch!
Comment on attachment 8774815 [details] [review] [treeherder] wlach:1287501-ui-2 > mozilla:master This looks generally good. I had a couple ideas that I think could make it a bit safer, but up to you if you feel like using them.
Attachment #8774815 - Flags: review?(cdawson) → review+
Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/801c33b43e010db756f41398b3154f4e546566af Bug 1287501 - More fixes to job polling (#1729) * Poll for pushes with no jobs associated with them correctly (we should just refetch everything in this case instead of looking for updates) * Handle refetching everything correctly (we were setting the last updated time to -Infinity because we were trying to create a data off of an undefined property of an array)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: