1280039 - Missing params from the job_details endpoint causes unbounded DB queries

Reporter

Description

•

9 years ago

Attached file treeherder_processlist.sql — Details

Hello, We have to deal with some count queries against job_details table today which are querying whole table and causing load issues on the database. Please see the attachment for running process list at that moment. Letting you know about the query which is potentially causing issues. Regards,

Ed Morley [:emorley]

Comment 1

•

9 years ago

So I managed to track these down to requests being made to eg: treeherder.mozilla.org/api/jobdetail/?job_guid=727af17e-f12f-48d4-bdbf-7177e6949af5/0 This was from a local copy of the UI based on master, being run against prod using web-server.js. Bug 1278597 changed the job guid URL parameter from `job__guid` to `job_guid`, and as such, the prod API doesn't recognise the master UI's request. However if we're missing parameters, an HTTP 400 should be generated, rather than trying to return all job details records (it's supposed to be 2000 per page, but the whole table is being scanned to return the total records count in the response afaict?). It's also happening on both stage and prod (ie it's still slow after the removal of ORDER BY in bug 1278597): time curl https://treeherder.mozilla.org/api/jobdetail/ time curl https://treeherder.allizom.org/api/jobdetail/ time curl https://treeherder.allizom.org/api/jobdetail/?count=1

Component: Treeherder → Treeherder: API

Priority: -- → P1

Summary: Count queries against job_details table → Missing params from the job_details endpoint causes unbounded DB queries

Ed Morley [:emorley]

Updated

•

9 years ago

Blocks: 1280494

Ed Morley [:emorley]

Comment 2

•

9 years ago

This caused prod to be unresponsive for 5-10mins just now: https://rpm.newrelic.com/accounts/677903/applications/4180461?tw%5Bend%5D=1466182530&tw%5Bstart%5D=1466180730

Pythian Team73

Reporter

Comment 3

•

9 years ago

HI, Currently we have the server OK, and no more sessions running this query. top - 17:22:04 up 37 days, 3:29, 1 user, load average: 0.58, 1.53, 32.31 Tasks: 126 total, 1 running, 125 sleeping, 0 stopped, 0 zombie Cpu(s): 17.3%us, 3.2%sy, 0.0%ni, 67.9%id, 10.9%wa, 0.0%hi, 0.7%si, 0.0%st Mem: 10129816k total, 9998468k used, 131348k free, 134380k buffers Swap: 2097148k total, 761572k used, 1335576k free, 1580452k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 861 mysql 20 0 9760m 7.6g 5568 S 7.9 78.5 20228:24 mysqld 933 root 20 0 0 0 0 S 2.0 0.0 175:05.60 jbd2/sdb1-8 1 root 20 0 19352 700 440 S 0.0 0.0 0:08.13 init 2 root 20 0 0 0 0 S 0.0 0.0 0:00.13 kthreadd 3 root RT 0 0 0 0 S 0.0 0.0 0:22.80 migration/0 goncalves@pythiancom

Ed Morley [:emorley]

Comment 4

•

9 years ago

Yeah I killed the 30+ pending queries just now to make prod responsive again.

Pythian Team73

Reporter

Comment 5

•

9 years ago

Hi Ed, We confirmed that only this query was running. I kindly ask you to inform Sheeri Cabral or Matthew Pressman as I do not find they in IRC chat. At this time server is good and MySQL responding OK. Please be aware that if this query appears again, most probably a session spike will happens again. As several queries running full table scan in same table will cause a lot of contention. Thanks in advance Marcelo goncalves@pythian.com

Pythian Team73

Reporter

Comment 6

•

9 years ago

Here is the time frame of the spike: 08:50:01 AM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15 09:00:01 AM 7 441 4.86 4.32 4.82 09:10:01 AM 9 435 7.58 6.46 5.59 09:20:01 AM 4 435 4.58 5.14 5.32 09:30:01 AM 8 442 5.60 5.24 5.24 09:40:01 AM 8 439 5.25 5.20 5.24 09:50:01 AM 9 435 5.30 5.31 5.27 10:00:01 AM 7 446 6.13 5.70 5.43 10:10:01 AM 8 438 6.53 6.25 5.81 10:20:01 AM 5 436 5.21 5.63 5.72 10:30:01 AM 2 441 0.64 2.26 4.19 10:40:01 AM 4 435 5.18 4.07 3.99 10:50:01 AM 3 432 5.54 5.44 4.77 11:00:02 AM 7 441 6.09 5.60 5.14 11:10:01 AM 5 435 5.25 5.65 5.42 11:20:01 AM 0 435 0.49 1.94 3.79 11:30:01 AM 1 435 2.07 1.31 2.50 11:40:01 AM 2 435 0.49 0.70 1.60 11:50:01 AM 1 432 0.62 0.80 1.23 12:00:01 PM 3 442 0.95 0.70 0.97 12:10:01 PM 5 436 0.65 0.89 0.94 12:20:01 PM 2 432 0.45 0.54 0.71 12:30:01 PM 4 441 1.01 0.75 0.75 12:40:01 PM 0 435 0.31 0.50 0.63 12:50:01 PM 1 436 0.30 0.59 0.65 01:00:02 PM 5 442 0.17 0.46 0.62 01:10:01 PM 3 432 0.60 0.58 0.59 01:20:01 PM 2 436 0.53 0.45 0.50 01:30:01 PM 2 439 1.50 0.78 0.60 01:40:01 PM 0 435 0.40 0.58 0.58 01:50:01 PM 0 435 0.76 0.99 0.78 02:00:01 PM 2 438 0.64 0.83 0.83 02:10:01 PM 0 432 0.57 1.06 1.02 02:20:01 PM 7 436 5.36 3.83 2.29 02:30:01 PM 8 441 7.72 6.77 4.53 02:40:01 PM 7 435 8.53 8.40 6.44 02:50:01 PM 10 435 9.72 10.19 8.64 03:00:01 PM 23 450 24.18 15.45 11.24 03:10:01 PM 0 435 0.59 6.71 10.17 03:20:01 PM 0 435 0.34 1.37 5.60 03:30:01 PM 0 443 1.63 1.17 3.47 03:40:01 PM 1 435 0.49 0.95 2.32 03:50:01 PM 1 432 0.67 0.90 1.65 04:00:01 PM 0 441 0.70 0.61 1.14 04:10:02 PM 83 439 72.75 44.64 20.84 04:20:02 PM 158 435 160.68 133.81 77.03 04:30:02 PM 142 445 139.72 138.60 107.40 04:40:02 PM 201 456 187.17 156.85 128.39 04:50:02 PM 248 469 252.23 237.67 184.21 05:00:01 PM 3 477 2.04 68.18 130.57 05:10:01 PM 3 472 2.54 10.96 69.35 05:20:01 PM 0 471 0.67 2.01 36.69 05:30:01 PM 5 477 1.30 1.40 19.78 Average: 11 439 9.79 9.53 9.34 [root@treeherder2.db.scl3 ~]#

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 7

•

9 years ago

I'm not sure where in the JobDetailsViewSet[1] you would even reject requests that have the missing params. Would you have to define a `list` view within it to wrap what's there and then get access to the request's query strings or something? 1. https://github.com/mozilla/treeherder/blob/master/treeherder/webapp/api/jobs.py#L311

Flags: needinfo?(wlachance)

William Lachance (:wlach)

Assignee

Comment 8

•

9 years ago

Yeah, I think that'd be what you'd do: define a custom list parameter which demands a parameter if one is not set, then just passes through to the default implementation if not: https://github.com/mozilla/treeherder/blob/7d62a4b38bb68c635c48a06da0cf527d89247ad7/treeherder/webapp/api/refdata.py#L65 I think we should also just remove ordering from the job detail view set: https://github.com/mozilla/treeherder/blob/master/treeherder/webapp/api/jobs.py#L338 Do you want to take this on?

Flags: needinfo?(wlachance)

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Updated

•

9 years ago

Assignee: nobody → wkocher

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 9

•

9 years ago

For posterity because I'm going to be distracted a lot today: 11:10 AM <KWierso> wlach: so for bug 1280039, would I leave the viewset alone and then add the `def list` down below https://github.com/mozilla/treeherder/blob/master/treeherder/webapp/api/jobs.py#L343 ? 11:10 AM <KWierso> or would the contents of the viewset need to go into that `def list`? 11:12 AM <wlach> KWierso: I think the former. You can call the parent with something like: return viewsets.ReadOnlyModelViewSet.list(self, ...) once you've validated that there are parameters

William Lachance (:wlach)

Assignee

Comment 10

•

9 years ago

On further consideration, I think it would be better just to remove the OrderingFilter from the endpoint: there's no need to sort these sorts of results. I'll file a patch for this.

William Lachance (:wlach)

Assignee

Comment 11

•

9 years ago

(In reply to William Lachance (:wlach) from comment #10) > On further consideration, I think it would be better just to remove the > OrderingFilter from the endpoint: there's no need to sort these sorts of > results. I'll file a patch for this. Hmm, though it's not OrderingFilter that's the problem. Using the django toolbar on my development instance, it seems like jobdetail is only doing this query (now that bug 1287140 is fixed): SELECT `job_detail`.`id`, `job_detail`.`job_id`, `job_detail`.`title`, `job_detail`.`value`, `job_detail`.`url`, `job`.`id`, `job`.`repository_id`, `job`.`guid`, `job`.`project_specific_id`, `repository`.`id`, `repository`.`repository_group_id`, `repository`.`name`, `repository`.`dvcs_type`, `repository`.`url`, `repository`.`codebase`, `repository`.`description`, `repository`.`active_status`, `repository`.`performance_alerts_enabled` FROM `job_detail` INNER JOIN `job` ON ( `job_detail`.`job_id` = `job`.`id` ) INNER JOIN `repository` ON ( `job`.`repository_id` = `repository`.`id` ) LIMIT 2000; This works fine on my development instance, and running the query directly against treeherder stage's db also works fine. But somehow the view is still hanging. Hmm...

William Lachance (:wlach)

Assignee

Comment 12

•

9 years ago

Aha, the problem is the "count" query we use for pagination, which is super slow. We can fix this by switching to cursor based pagination, in case too many results are returned.

GitHub Autolander Bot

Comment 13

•

9 years ago

Attached file [treeherder] wlach:1280039 > mozilla:master — Details

William Lachance (:wlach)

Assignee

Comment 14

•

9 years ago

(In reply to William Lachance (:wlach) from comment #12) > Aha, the problem is the "count" query we use for pagination, which is super > slow. We can fix this by switching to cursor based pagination, in case too > many results are returned. ... this was part of it. The other aspect is that "order by" is slow with left joins (which we need to do searching by repository and "project specific" job ids). We can get rid of these silly joins when we move the jobs table into the central db, but for now the only fundamental solution is the one we outlined earlier: rejecting input with no filter parameters.

William Lachance (:wlach)

Assignee

Comment 15

•

9 years ago

Comment on attachment 8771492 [details] [review] [treeherder] wlach:1280039 > mozilla:master There are still issues here, but we might as well get this merged, as it simplifies stuff a bunch (and is an incremental step towards where we want to go).

Attachment #8771492 - Flags: review?(emorley)

Ed Morley [:emorley]

Updated

•

9 years ago

Attachment #8771492 - Flags: review?(emorley) → review+

Treeherder GitHub Bugbot

Comment 16

•

9 years ago

Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/c9e458297f9de503a1083f3db47e87d5ce4374a1 Bug 1280039 - Use cursor pagination with /jobdetail/ endpoint (#1695) The count(*) query that limit/offset pagination requires is super slow with InnoDB. Using a cursor-based approach makes the behaviour more reasonable if no parameters are specified with this endpoint (arguably we should require a parameter if none specified, but we can leave that for another time)

William Lachance (:wlach)

Assignee

Comment 17

•

9 years ago

Just to recap, this is still an issue: see comment 14. We can fix this either by requiring parameters or by waiting for bug 1178641 (which will make iterating through all of the jobdetail objects tractable)

GitHub Autolander Bot

Comment 19

•

8 years ago

Attached file [treeherder] wlach:1280039-2 > mozilla:master — Details

William Lachance (:wlach)

Assignee

Comment 20

•

8 years ago

Stealing this from Wes, hope that's ok :)

Assignee: wkocher → wlachance

William Lachance (:wlach)

Assignee

Updated

•

8 years ago

Attachment #8793076 - Flags: review?(emorley)

Ed Morley [:emorley]

Updated

•

8 years ago

Attachment #8793076 - Flags: review?(emorley) → review+

Treeherder GitHub Bugbot

Comment 21

•

8 years ago

Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/f31bb444434b1f6058aaf75db3f583a24aedc4e2 Bug 1280039 - Enforce the presence of a suitable filter on the jobdetail endpoint (#1866) If not present, the server could be forced to process some really slow queries.

Ed Morley [:emorley]

Comment 22

•

8 years ago

23:32 <RyanVM> so, I can't retrigger any jobs at th emoment 23:32 <RyanVM> https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&revision=6be9a5c9c8cd25d2082376495b788d85ce3cca98&selectedJob=1627297 23:33 <RyanVM> Unable to send retrigger:Must also filter on repository if filtering on job id Looks like retriggerJob() is passing job id and no repository, so it either need to use the guid, or pass the repo too: https://github.com/mozilla/treeherder/blob/69b81c772fd9a9686ac11bcd87c9b212ad2abd58/ui/plugins/controller.js#L332-L333

Cameron Dawson [:camd]

Comment 23

•

8 years ago

I did some poking around, I can't get any results from that endpoint even if I DO pass the repository. Then it occurred to me that I was trying a TaskCluster job. So that should work differently anyway. It won't have a ``buildbot_request_id`` I've run out of time tonight, I have to leave for an appt. But it doesn't appear to me that this is handling the TC case. Sorry I couldn't do more tonight.

Ryan VanderMeulen [:RyanVM]

Updated

•

8 years ago

Depends on: 1304253

William Lachance (:wlach)

Assignee

Comment 24

•

8 years ago

I think we can resolve this now.

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → FIXED

treeherder_processlist.sql 9 years ago Pythian Team73 41.52 KB, text/x-sql		Details
[treeherder] wlach:1280039 > mozilla:master 9 years ago GitHub Autolander Bot 47 bytes, text/x-github-pull-request	emorley : review+	Details \| Review
[treeherder] wlach:1280039-2 > mozilla:master 8 years ago GitHub Autolander Bot 47 bytes, text/x-github-pull-request	emorley : review+	Details \| Review