Closed
Bug 1287501
Opened 9 years ago
Closed 9 years ago
gunicorn worker exhaustion & 100% DB CPU due to unbounded /jobs/ API queries
Categories
(Tree Management :: Treeherder: Infrastructure, defect, P1)
Tree Management
Treeherder: Infrastructure
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: keeler, Unassigned)
References
Details
Attachments
(4 files)
STR:
1. navigate to treeherder.mozilla.org
2. hum the Jeopardy! theme song a few times
3. treeherder still hasn't completely loaded
Comment 1•9 years ago
|
||
18:28 <emorley> camd: the slow transactions have a last_modified__gt of "-infinity" is that expected?
18:28 <•emorley> eg https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/319e3c62-4d0b-11e6-9e2c-c81f66b8ceca_11418_17782
18:29 <•emorley> and https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?type=app#id=5b225765625472616e73616374696f6e2f46756e6374696f6e2f747265656865726465722e7765626170702e6170692e6a6f62733a4a6f6273566965775365742e6c697374222c22225d&app_trace_id=26ebba60-4d0b-11e6-84e9-b82a72d22a14_20455_23171
Running this on the prod webheads shows a couple of IPs from which most of the requests originate, though there is a long tail of others, so unclear if just a bug in the UI (eg after bug 1276030)
$ grep "Infinity" /var/log/httpd/treeherder.mozilla.org/access_log | cut - d ' ' -f 1 | sort | uniq -c | sort -nr
Queries are like:
[18/Jul/2016:03:36:50 +0000] "GET /api/project/nss-try/jobs/?count=2000&last_modified__gt=-Infinity&offset=2000&result_set_id__in=277,276,275,274,273,272,271,270,269,268,267,266&return_type=list HTTP/1.1" 200 1008 "https://treeherder.mozilla.org/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:50.0) Gecko/20100101 Firefox/50.0"
[18/Jul/2016:05:49:31 +0000] "GET /api/project/try/jobs/?count=2000&last_modified__gt=-Infinity&offset=10000&result_set_id__in=114809,114336,112738,112581,112482,112382,112147,112023,111762,111739&return_type=list HTTP/1.1" 200 202863 "https://treeherder.mozilla.org/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:50.0) Gecko/20100101 Firefox/50.0"
Comment 2•9 years ago
|
||
For example:
$ time curl "https://treeherder.mozilla.org/api/project/mozilla-inbound/jobs/?count=2000&last_modified__gt=-Infinity&offset=2000&result_set_id__in=34522,34521,34520,34519,34518,34517,34516,34515,34514,34513,34512,34511,34510,34509,34508,34507,34506,34505,34504,34503,34502,34501,34500,34499,34498,34497,34496,34495,34485,34484,34483,34482,34481,34480,34479,34478,34477,34476,34475,34474,34473,34469,344&return_type=list"
curl: (52) Empty reply from server
real 0m43.654s
user 0m0.000s
sys 0m0.030s
Comment 3•9 years ago
|
||
This feels like a tree-closing event to me, but I guess that's up to the people that actually need to use TH for a living :)
Flags: needinfo?(wkocher)
Comment 4•9 years ago
|
||
I don't really see how the frontend could ever be passing a value of -Infinity for date, barring a bug in Firefox. The value for setting/updating "lastUpdated" internally is very straightforward.
I guess it might not hurt to put an assertion in that validates that we're not sending a crazy query here? Perhaps something like this:
https://gist.github.com/wlach/722933270661b4a56024e325e2f06e3a
(In reply to Ryan VanderMeulen [:RyanVM] from comment #3)
> This feels like a tree-closing event to me, but I guess that's up to the
> people that actually need to use TH for a living :)
I've closed trunk trees.
Things actually seem to be getting slightly more responsive in the last 20 minutes or so, but I'll leave things closed for now.
Flags: needinfo?(wkocher)
Comment 6•9 years ago
|
||
| Comment hidden (offtopic) |
Comment 8•9 years ago
|
||
It also sounds like the trees would be closed for bug 1286942 anyway, if it weren't for this.
Comment 9•9 years ago
|
||
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/282a66b8f0fbb89658c153122fd3ac76b53dde15
Bug 1287501 - Restrict jobs query last_modified to within last 30 min… (#1702)
* Bug 1287501 - Restrict jobs query last_modified to within last 30 minutes
Comment 10•9 years ago
|
||
Comment 11•9 years ago
|
||
Comment on attachment 8772175 [details] [review]
[treeherder] wlach:1287501-ui > mozilla:master
This serves two purposes:
1. Make sure we don't ask for stuff that is too old because of lack of job updates (e.g. watching a really old, completed try push). This complements Cam's server side fix for jobs.
2. Fail loudly and noisily if we're not passing a valid date for last_updated to the server -- this should never happen, but apparently it has for some reason I don't know. Yes, ideally we would fix the root cause here but I've been staring at the code in question for more than an hour and nothing is jumping out at me.
Attachment #8772175 -
Flags: review?(cdawson)
Comment 12•9 years ago
|
||
Updated•9 years ago
|
Attachment #8772214 -
Flags: review?(emorley)
Comment 13•9 years ago
|
||
Ed found a problem with my first fix attempt. So I made a new PR (r? to emorley now). I pushed that branch to both stage and prod tonight and all seems happier. I have NOT yet merged that last PR to master, however. I'll do that upon approval from ed tomorrow.
I've reopened the integration branches as things seem to be working much better now.
| Comment hidden (offtopic) |
| Comment hidden (offtopic) |
Comment 17•9 years ago
|
||
To summarise this issue:
* For some reason there were many requests to Treeherder's /api/project/mozilla-inbound/jobs/ API that contained a completely unexpected (and invalid) parameter value of `last_modified__gt=-Infinity`.
* This caused expensive (20-40s) DB queries against the jobs table, which maxed out the DB node CPU, plus meant at certain points the gunicorn workers on the webheads were all waiting for the DB to respond, meaning there were none left to serve further requests (which then leads to the Zeus load balancer to return HTTP 5XX).
* These queries came from Firefox user agents rather than a script, but it's not clear why the Treeherder UI was suddenly generating these invalid date values, since we hadn't done a Treeherder deploy since last week, and even then it didn't change anything that seems relevant. It almost seemed like there was a Nightly regression that caused weirdness with date handling.
Things to fix:
1) Make the API reject requests with invalid values for `last_modified__gt` (deployed already in comment 13)
2) Try to figure out why the UI was generating the invalid dates (Will's PR attached)
3) (Likely after moving to Heroku, so we can optimise for the dyno specs there) Investigate whether using a different gunicorn worker model (eg eventlet) would help avoid exhaustion. We can probably do this as part of bug 1175472. (Though with 100% DB CPU kind academic anyway.)
An added complication was that the first PR in this bug (to fix #1) added:
```
datestr = filter.get("last_modified")[1]
```
However `filter` is an instance of our home-grown `UrlQueryFilter` class, whose defined `.get()` method actually unhelpfully also does a `.pop()`:
https://github.com/mozilla/treeherder/blob/6a5fbb731090f1723a0625ead7335832e591e9ba/treeherder/webapp/api/utils.py#L70
So checking for the parameter value actually removed it from `filter.conditions` that is then passed to the query. So whilst the 1st PR successfully blocked queries containing an invalid parameter value, it also then stripped the last_modified filter from *all* requests, which meant we replaced the DB load from the rarely chronic cases, with less-severe-per-request but more frequent other cases.
So we should also:
4) Make `UrlQueryFilter` less footgun-y. Checking for a parameter shouldn't remove it!
Priority: -- → P1
Summary: treeherder.mozilla.org is responding very very very slowly → gunicorn worker exhaustion & 100% DB CPU due to unbounded /jobs/ API queries
Comment 18•9 years ago
|
||
Comment on attachment 8772214 [details] [review]
[treeherder] mozilla:fix-to-last-modified-fix > mozilla:master
Added some comments to the PR.
We really need tests for last_modified - it would have caught the regression in the first PR here much more easily than two of us having to notice the unfortunate (and non-obvious) bad UrlQueryFilter.get() behaviour by code inspection :-)
Attachment #8772214 -
Flags: review?(emorley)
Morgan, I know you're working on date stuff, any chance you know of something that could've landed recently to cause this?
Flags: needinfo?(winter2718)
Comment 20•9 years ago
|
||
This is happening again now on production but not on staging.
Comment 21•9 years ago
|
||
The followup in comment 13 was landed not on the `production` git branch (as the 1st PR had), but on a branch named `prod-hotfix-7-18-2016`.
As such, when Will wanted to deploy something on production (and needed to re-cherrypick the fixes here), the followup was accidentally reverted.
Will is re-deploying the fix now.
I think we need to:
1) Always be consistent in using the branch named `production` to avoid confusion (or delete the branch if using another temporarily)
2) Not deploy an unrelated cherrypick over a prod hotfix unless we absolutely have to (Will asked if it was ok, and I said yes, but I think we should be more wary in the future)
Comment 22•9 years ago
|
||
Oh and:
3) If there's any doubt as to the deployed revision (eg if it's known there may have been a hotfix applied), use any of:
- whatsdeployed
- https://treeherder.mozilla.org/revision.txt
- IRC deploy bot message
- chief logs at http://treeherderadm.private.scl3.mozilla.com/chief/treeherder.prod/logs/
...to check what the most recently deployed branch/revision was.
Comment 23•9 years ago
|
||
Comment on attachment 8772214 [details] [review]
[treeherder] mozilla:fix-to-last-modified-fix > mozilla:master
Hey man, here are the changes we discussed. Thanks. :)
Attachment #8772214 -
Flags: review?(emorley)
Updated•9 years ago
|
Attachment #8772214 -
Flags: review?(emorley) → review+
Comment 24•9 years ago
|
||
Ed: I'm actually going to add a few more tests that test basic less-than / greater-than accuracy.
Also: this latest patch will remove the 30 minute limit. But if we hit something like this again, we may need to impose some kind of limiting policy based on params.
Comment 25•9 years ago
|
||
(In reply to Wes Kocher (:KWierso) from comment #19)
> Morgan, I know you're working on date stuff, any chance you know of
> something that could've landed recently to cause this?
Not that I'm aware of. Placing a flag to see if I can get some data about what date strings are returning -INF.
Flags: needinfo?(winter2718)
Comment 26•9 years ago
|
||
Say Ed, any chance you could find one of the date strings that returned `-Infinity`? If that's being returned from `new Date(..)` it's a bug we'll want to fix in the JS Engine.
Flags: needinfo?(emorley)
Comment 27•9 years ago
|
||
We don't have the original string, only the "-Infinity" value that the Treeherder UI submitted to our API and was seen in the apache logs. The UI calculates the date string using:
https://github.com/mozilla/treeherder/blob/6001f3caca1323d4d74c492e54b3f7e8b780ab7e/ui/js/models/resultsets_store.js#L1066-L1073
Flags: needinfo?(emorley)
Updated•9 years ago
|
Attachment #8772175 -
Flags: review?(cdawson) → review+
Comment 28•9 years ago
|
||
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/959540913d3ee5ea9c4b1cbb1978a13267984e64
Bug 1287501 - Fix /jobs/ endpoint to preserve `last_modified` param (#1704)
A prior commit removed the ability to use "-Infinity" for the last_modified
query param. However, the fix accidentally stripped the param entirely.
This change ensures that the value is a valid date string. The range is
not limited.
This also adds some new tests to ensure the param of `last_modified` is
working correctly when included.
Comment 29•9 years ago
|
||
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/47b14a0205e5920dcfc729f2e064aa37c1f9eb5b
Bug 1287501 - Attempt to make sure polling for new jobs is reasonable (#1703)
* Validate that we're passing an actual date to the server (just in case...)
* Update the last updated interval even if nothing has been returned
Comment 30•9 years ago
|
||
I believe we can mark this fixed now with that two commits above.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Comment 31•9 years ago
|
||
Comment 32•9 years ago
|
||
Comment on attachment 8774815 [details] [review]
[treeherder] wlach:1287501-ui-2 > mozilla:master
Yet another edge case for this code.
Attachment #8774815 -
Flags: review?(cdawson)
Comment 33•9 years ago
|
||
Comment on attachment 8774815 [details] [review]
[treeherder] wlach:1287501-ui-2 > mozilla:master
Postponing r?, this patch doesn't work for some reason I don't understand yet.
Attachment #8774815 -
Flags: review?(cdawson)
Comment 34•9 years ago
|
||
Comment on attachment 8774815 [details] [review]
[treeherder] wlach:1287501-ui-2 > mozilla:master
Ok, finally figured it out. When refetching *all* the jobs (which we do after sleeping for a long time), we would recalculate the lastUpdated time based on an array of arrays, which didn't work -- we would generate a value of -Infinity, because the last_updated property of an "array" is undefined:
_.max(_.map([[]], function(job) { return new Date(job.last_modified + 'Z'); }))
Attachment #8774815 -
Flags: review?(cdawson)
Comment 35•9 years ago
|
||
Good catch!
Comment 37•9 years ago
|
||
Comment on attachment 8774815 [details] [review]
[treeherder] wlach:1287501-ui-2 > mozilla:master
This looks generally good. I had a couple ideas that I think could make it a bit safer, but up to you if you feel like using them.
Attachment #8774815 -
Flags: review?(cdawson) → review+
Comment 38•9 years ago
|
||
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/801c33b43e010db756f41398b3154f4e546566af
Bug 1287501 - More fixes to job polling (#1729)
* Poll for pushes with no jobs associated with them correctly (we should
just refetch everything in this case instead of looking for updates)
* Handle refetching everything correctly (we were setting the last
updated time to -Infinity because we were trying to create a data
off of an undefined property of an array)
You need to log in
before you can comment on or make changes to this bug.
Description
•