Closed
Bug 1280039
Opened 9 years ago
Closed 8 years ago
Missing params from the job_details endpoint causes unbounded DB queries
Categories
(Tree Management :: Treeherder: API, defect, P1)
Tree Management
Treeherder: API
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: team73, Assigned: wlach)
References
(Blocks 1 open bug)
Details
Attachments
(3 files)
Hello,
We have to deal with some count queries against job_details table today which are querying whole table and causing load issues on the database. Please see the attachment for running process list at that moment.
Letting you know about the query which is potentially causing issues.
Regards,
Comment 1•9 years ago
|
||
So I managed to track these down to requests being made to eg:
treeherder.mozilla.org/api/jobdetail/?job_guid=727af17e-f12f-48d4-bdbf-7177e6949af5/0
This was from a local copy of the UI based on master, being run against prod using web-server.js.
Bug 1278597 changed the job guid URL parameter from `job__guid` to `job_guid`, and as such, the prod API doesn't recognise the master UI's request.
However if we're missing parameters, an HTTP 400 should be generated, rather than trying to return all job details records (it's supposed to be 2000 per page, but the whole table is being scanned to return the total records count in the response afaict?).
It's also happening on both stage and prod (ie it's still slow after the removal of ORDER BY in bug 1278597):
time curl https://treeherder.mozilla.org/api/jobdetail/
time curl https://treeherder.allizom.org/api/jobdetail/
time curl https://treeherder.allizom.org/api/jobdetail/?count=1
Component: Treeherder → Treeherder: API
Priority: -- → P1
Summary: Count queries against job_details table → Missing params from the job_details endpoint causes unbounded DB queries
Comment 2•9 years ago
|
||
This caused prod to be unresponsive for 5-10mins just now:
https://rpm.newrelic.com/accounts/677903/applications/4180461?tw%5Bend%5D=1466182530&tw%5Bstart%5D=1466180730
Reporter | ||
Comment 3•9 years ago
|
||
HI,
Currently we have the server OK, and no more sessions running this query.
top - 17:22:04 up 37 days, 3:29, 1 user, load average: 0.58, 1.53, 32.31
Tasks: 126 total, 1 running, 125 sleeping, 0 stopped, 0 zombie
Cpu(s): 17.3%us, 3.2%sy, 0.0%ni, 67.9%id, 10.9%wa, 0.0%hi, 0.7%si, 0.0%st
Mem: 10129816k total, 9998468k used, 131348k free, 134380k buffers
Swap: 2097148k total, 761572k used, 1335576k free, 1580452k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
861 mysql 20 0 9760m 7.6g 5568 S 7.9 78.5 20228:24 mysqld
933 root 20 0 0 0 0 S 2.0 0.0 175:05.60 jbd2/sdb1-8
1 root 20 0 19352 700 440 S 0.0 0.0 0:08.13 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.13 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 0:22.80 migration/0
goncalves@pythiancom
Comment 4•9 years ago
|
||
Yeah I killed the 30+ pending queries just now to make prod responsive again.
Reporter | ||
Comment 5•9 years ago
|
||
Hi Ed,
We confirmed that only this query was running. I kindly ask you to inform Sheeri Cabral or Matthew Pressman as I do not find they in IRC chat.
At this time server is good and MySQL responding OK.
Please be aware that if this query appears again, most probably a session spike will happens again. As several queries running full table scan in same table will cause a lot of contention.
Thanks in advance
Marcelo
goncalves@pythian.com
Reporter | ||
Comment 6•9 years ago
|
||
Here is the time frame of the spike:
08:50:01 AM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
09:00:01 AM 7 441 4.86 4.32 4.82
09:10:01 AM 9 435 7.58 6.46 5.59
09:20:01 AM 4 435 4.58 5.14 5.32
09:30:01 AM 8 442 5.60 5.24 5.24
09:40:01 AM 8 439 5.25 5.20 5.24
09:50:01 AM 9 435 5.30 5.31 5.27
10:00:01 AM 7 446 6.13 5.70 5.43
10:10:01 AM 8 438 6.53 6.25 5.81
10:20:01 AM 5 436 5.21 5.63 5.72
10:30:01 AM 2 441 0.64 2.26 4.19
10:40:01 AM 4 435 5.18 4.07 3.99
10:50:01 AM 3 432 5.54 5.44 4.77
11:00:02 AM 7 441 6.09 5.60 5.14
11:10:01 AM 5 435 5.25 5.65 5.42
11:20:01 AM 0 435 0.49 1.94 3.79
11:30:01 AM 1 435 2.07 1.31 2.50
11:40:01 AM 2 435 0.49 0.70 1.60
11:50:01 AM 1 432 0.62 0.80 1.23
12:00:01 PM 3 442 0.95 0.70 0.97
12:10:01 PM 5 436 0.65 0.89 0.94
12:20:01 PM 2 432 0.45 0.54 0.71
12:30:01 PM 4 441 1.01 0.75 0.75
12:40:01 PM 0 435 0.31 0.50 0.63
12:50:01 PM 1 436 0.30 0.59 0.65
01:00:02 PM 5 442 0.17 0.46 0.62
01:10:01 PM 3 432 0.60 0.58 0.59
01:20:01 PM 2 436 0.53 0.45 0.50
01:30:01 PM 2 439 1.50 0.78 0.60
01:40:01 PM 0 435 0.40 0.58 0.58
01:50:01 PM 0 435 0.76 0.99 0.78
02:00:01 PM 2 438 0.64 0.83 0.83
02:10:01 PM 0 432 0.57 1.06 1.02
02:20:01 PM 7 436 5.36 3.83 2.29
02:30:01 PM 8 441 7.72 6.77 4.53
02:40:01 PM 7 435 8.53 8.40 6.44
02:50:01 PM 10 435 9.72 10.19 8.64
03:00:01 PM 23 450 24.18 15.45 11.24
03:10:01 PM 0 435 0.59 6.71 10.17
03:20:01 PM 0 435 0.34 1.37 5.60
03:30:01 PM 0 443 1.63 1.17 3.47
03:40:01 PM 1 435 0.49 0.95 2.32
03:50:01 PM 1 432 0.67 0.90 1.65
04:00:01 PM 0 441 0.70 0.61 1.14
04:10:02 PM 83 439 72.75 44.64 20.84
04:20:02 PM 158 435 160.68 133.81 77.03
04:30:02 PM 142 445 139.72 138.60 107.40
04:40:02 PM 201 456 187.17 156.85 128.39
04:50:02 PM 248 469 252.23 237.67 184.21
05:00:01 PM 3 477 2.04 68.18 130.57
05:10:01 PM 3 472 2.54 10.96 69.35
05:20:01 PM 0 471 0.67 2.01 36.69
05:30:01 PM 5 477 1.30 1.40 19.78
Average: 11 439 9.79 9.53 9.34
[root@treeherder2.db.scl3 ~]#
I'm not sure where in the JobDetailsViewSet[1] you would even reject requests that have the missing params. Would you have to define a `list` view within it to wrap what's there and then get access to the request's query strings or something?
1. https://github.com/mozilla/treeherder/blob/master/treeherder/webapp/api/jobs.py#L311
Flags: needinfo?(wlachance)
Assignee | ||
Comment 8•9 years ago
|
||
Yeah, I think that'd be what you'd do: define a custom list parameter which demands a parameter if one is not set, then just passes through to the default implementation if not:
https://github.com/mozilla/treeherder/blob/7d62a4b38bb68c635c48a06da0cf527d89247ad7/treeherder/webapp/api/refdata.py#L65
I think we should also just remove ordering from the job detail view set:
https://github.com/mozilla/treeherder/blob/master/treeherder/webapp/api/jobs.py#L338
Do you want to take this on?
Flags: needinfo?(wlachance)
Assignee: nobody → wkocher
For posterity because I'm going to be distracted a lot today:
11:10 AM <KWierso> wlach: so for bug 1280039, would I leave the viewset alone and then add the `def list` down below https://github.com/mozilla/treeherder/blob/master/treeherder/webapp/api/jobs.py#L343 ?
11:10 AM <KWierso> or would the contents of the viewset need to go into that `def list`?
11:12 AM <wlach> KWierso: I think the former. You can call the parent with something like: return viewsets.ReadOnlyModelViewSet.list(self, ...) once you've validated that there are parameters
Assignee | ||
Comment 10•9 years ago
|
||
On further consideration, I think it would be better just to remove the OrderingFilter from the endpoint: there's no need to sort these sorts of results. I'll file a patch for this.
Assignee | ||
Comment 11•9 years ago
|
||
(In reply to William Lachance (:wlach) from comment #10)
> On further consideration, I think it would be better just to remove the
> OrderingFilter from the endpoint: there's no need to sort these sorts of
> results. I'll file a patch for this.
Hmm, though it's not OrderingFilter that's the problem. Using the django toolbar on my development instance, it seems like jobdetail is only doing this query (now that bug 1287140 is fixed):
SELECT `job_detail`.`id`, `job_detail`.`job_id`, `job_detail`.`title`, `job_detail`.`value`, `job_detail`.`url`, `job`.`id`, `job`.`repository_id`, `job`.`guid`, `job`.`project_specific_id`, `repository`.`id`, `repository`.`repository_group_id`, `repository`.`name`, `repository`.`dvcs_type`, `repository`.`url`, `repository`.`codebase`, `repository`.`description`, `repository`.`active_status`, `repository`.`performance_alerts_enabled` FROM `job_detail` INNER JOIN `job` ON ( `job_detail`.`job_id` = `job`.`id` ) INNER JOIN `repository` ON ( `job`.`repository_id` = `repository`.`id` ) LIMIT 2000;
This works fine on my development instance, and running the query directly against treeherder stage's db also works fine. But somehow the view is still hanging. Hmm...
Assignee | ||
Comment 12•9 years ago
|
||
Aha, the problem is the "count" query we use for pagination, which is super slow. We can fix this by switching to cursor based pagination, in case too many results are returned.
Comment 13•9 years ago
|
||
Assignee | ||
Comment 14•9 years ago
|
||
(In reply to William Lachance (:wlach) from comment #12)
> Aha, the problem is the "count" query we use for pagination, which is super
> slow. We can fix this by switching to cursor based pagination, in case too
> many results are returned.
... this was part of it. The other aspect is that "order by" is slow with left joins (which we need to do searching by repository and "project specific" job ids). We can get rid of these silly joins when we move the jobs table into the central db, but for now the only fundamental solution is the one we outlined earlier: rejecting input with no filter parameters.
Assignee | ||
Comment 15•9 years ago
|
||
Comment on attachment 8771492 [details] [review]
[treeherder] wlach:1280039 > mozilla:master
There are still issues here, but we might as well get this merged, as it simplifies stuff a bunch (and is an incremental step towards where we want to go).
Attachment #8771492 -
Flags: review?(emorley)
Updated•9 years ago
|
Attachment #8771492 -
Flags: review?(emorley) → review+
Comment 16•9 years ago
|
||
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/c9e458297f9de503a1083f3db47e87d5ce4374a1
Bug 1280039 - Use cursor pagination with /jobdetail/ endpoint (#1695)
The count(*) query that limit/offset pagination requires is super slow with
InnoDB. Using a cursor-based approach makes the behaviour more reasonable if
no parameters are specified with this endpoint (arguably we should require
a parameter if none specified, but we can leave that for another time)
Assignee | ||
Comment 17•9 years ago
|
||
Just to recap, this is still an issue: see comment 14. We can fix this either by requiring parameters or by waiting for bug 1178641 (which will make iterating through all of the jobdetail objects tractable)
Comment 19•8 years ago
|
||
Assignee | ||
Comment 20•8 years ago
|
||
Stealing this from Wes, hope that's ok :)
Assignee: wkocher → wlachance
Assignee | ||
Updated•8 years ago
|
Attachment #8793076 -
Flags: review?(emorley)
Updated•8 years ago
|
Attachment #8793076 -
Flags: review?(emorley) → review+
Comment 21•8 years ago
|
||
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/f31bb444434b1f6058aaf75db3f583a24aedc4e2
Bug 1280039 - Enforce the presence of a suitable filter on the jobdetail endpoint (#1866)
If not present, the server could be forced to process some really slow
queries.
Comment 22•8 years ago
|
||
23:32 <RyanVM> so, I can't retrigger any jobs at th emoment
23:32 <RyanVM> https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&revision=6be9a5c9c8cd25d2082376495b788d85ce3cca98&selectedJob=1627297
23:33 <RyanVM> Unable to send retrigger:Must also filter on repository if filtering on job id
Looks like retriggerJob() is passing job id and no repository, so it either need to use the guid, or pass the repo too:
https://github.com/mozilla/treeherder/blob/69b81c772fd9a9686ac11bcd87c9b212ad2abd58/ui/plugins/controller.js#L332-L333
Comment 23•8 years ago
|
||
I did some poking around, I can't get any results from that endpoint even if I DO pass the repository. Then it occurred to me that I was trying a TaskCluster job. So that should work differently anyway. It won't have a ``buildbot_request_id``
I've run out of time tonight, I have to leave for an appt. But it doesn't appear to me that this is handling the TC case. Sorry I couldn't do more tonight.
Assignee | ||
Comment 24•8 years ago
|
||
I think we can resolve this now.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•