Missing params from the job_details endpoint causes unbounded DB queries

RESOLVED FIXED

Status

Tree Management
Treeherder: API
P1
normal
RESOLVED FIXED
2 years ago
a year ago

People

(Reporter: Pythian Team73, Assigned: wlach)

Tracking

(Blocks: 1 bug)

Details

Attachments

(3 attachments)

(Reporter)

Description

2 years ago
Created attachment 8762677 [details]
treeherder_processlist.sql

Hello,

We have to deal with some count queries against job_details table today which are querying whole table and causing load issues on the database. Please see the attachment for running process list at that moment.

Letting you know about the query which is potentially causing issues.


Regards,

Comment 1

2 years ago
So I managed to track these down to requests being made to eg:
treeherder.mozilla.org/api/jobdetail/?job_guid=727af17e-f12f-48d4-bdbf-7177e6949af5/0

This was from a local copy of the UI based on master, being run against prod using web-server.js.

Bug 1278597 changed the job guid URL parameter from `job__guid` to `job_guid`, and as such, the prod API doesn't recognise the master UI's request.

However if we're missing parameters, an HTTP 400 should be generated, rather than trying to return all job details records (it's supposed to be 2000 per page, but the whole table is being scanned to return the total records count in the response afaict?).

It's also happening on both stage and prod (ie it's still slow after the removal of ORDER BY in bug 1278597):

time curl https://treeherder.mozilla.org/api/jobdetail/
time curl https://treeherder.allizom.org/api/jobdetail/
time curl https://treeherder.allizom.org/api/jobdetail/?count=1
Component: Treeherder → Treeherder: API
Priority: -- → P1
Summary: Count queries against job_details table → Missing params from the job_details endpoint causes unbounded DB queries

Updated

2 years ago
Blocks: 1280494

Comment 2

2 years ago
This caused prod to be unresponsive for 5-10mins just now:
https://rpm.newrelic.com/accounts/677903/applications/4180461?tw%5Bend%5D=1466182530&tw%5Bstart%5D=1466180730
(Reporter)

Comment 3

2 years ago
HI,
Currently we have the server OK, and no more sessions running this query.

top - 17:22:04 up 37 days,  3:29,  1 user,  load average: 0.58, 1.53, 32.31
Tasks: 126 total,   1 running, 125 sleeping,   0 stopped,   0 zombie
Cpu(s): 17.3%us,  3.2%sy,  0.0%ni, 67.9%id, 10.9%wa,  0.0%hi,  0.7%si,  0.0%st
Mem:  10129816k total,  9998468k used,   131348k free,   134380k buffers
Swap:  2097148k total,   761572k used,  1335576k free,  1580452k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                              
  861 mysql     20   0 9760m 7.6g 5568 S  7.9 78.5  20228:24 mysqld                                                                                                               
  933 root      20   0     0    0    0 S  2.0  0.0 175:05.60 jbd2/sdb1-8                                                                                                          
    1 root      20   0 19352  700  440 S  0.0  0.0   0:08.13 init                                                                                                                 
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.13 kthreadd                                                                                                             
    3 root      RT   0     0    0    0 S  0.0  0.0   0:22.80 migration/0           

goncalves@pythiancom

Comment 4

2 years ago
Yeah I killed the 30+ pending queries just now to make prod responsive again.
(Reporter)

Comment 5

2 years ago
Hi Ed,

We confirmed that only this query was running. I kindly ask you to inform Sheeri Cabral or Matthew Pressman as I do not find they in IRC chat.
At this time server is good and MySQL responding OK.

Please be aware that if this query appears again, most probably a session spike will happens again. As several queries running full table scan in same table will cause a lot of contention.

Thanks in advance
Marcelo

goncalves@pythian.com
(Reporter)

Comment 6

2 years ago
Here is the time frame of the spike:

08:50:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
09:00:01 AM         7       441      4.86      4.32      4.82
09:10:01 AM         9       435      7.58      6.46      5.59
09:20:01 AM         4       435      4.58      5.14      5.32
09:30:01 AM         8       442      5.60      5.24      5.24
09:40:01 AM         8       439      5.25      5.20      5.24
09:50:01 AM         9       435      5.30      5.31      5.27
10:00:01 AM         7       446      6.13      5.70      5.43
10:10:01 AM         8       438      6.53      6.25      5.81
10:20:01 AM         5       436      5.21      5.63      5.72
10:30:01 AM         2       441      0.64      2.26      4.19
10:40:01 AM         4       435      5.18      4.07      3.99
10:50:01 AM         3       432      5.54      5.44      4.77
11:00:02 AM         7       441      6.09      5.60      5.14
11:10:01 AM         5       435      5.25      5.65      5.42
11:20:01 AM         0       435      0.49      1.94      3.79
11:30:01 AM         1       435      2.07      1.31      2.50
11:40:01 AM         2       435      0.49      0.70      1.60
11:50:01 AM         1       432      0.62      0.80      1.23
12:00:01 PM         3       442      0.95      0.70      0.97
12:10:01 PM         5       436      0.65      0.89      0.94
12:20:01 PM         2       432      0.45      0.54      0.71
12:30:01 PM         4       441      1.01      0.75      0.75
12:40:01 PM         0       435      0.31      0.50      0.63
12:50:01 PM         1       436      0.30      0.59      0.65
01:00:02 PM         5       442      0.17      0.46      0.62
01:10:01 PM         3       432      0.60      0.58      0.59
01:20:01 PM         2       436      0.53      0.45      0.50
01:30:01 PM         2       439      1.50      0.78      0.60
01:40:01 PM         0       435      0.40      0.58      0.58
01:50:01 PM         0       435      0.76      0.99      0.78
02:00:01 PM         2       438      0.64      0.83      0.83
02:10:01 PM         0       432      0.57      1.06      1.02
02:20:01 PM         7       436      5.36      3.83      2.29
02:30:01 PM         8       441      7.72      6.77      4.53
02:40:01 PM         7       435      8.53      8.40      6.44
02:50:01 PM        10       435      9.72     10.19      8.64
03:00:01 PM        23       450     24.18     15.45     11.24
03:10:01 PM         0       435      0.59      6.71     10.17
03:20:01 PM         0       435      0.34      1.37      5.60
03:30:01 PM         0       443      1.63      1.17      3.47
03:40:01 PM         1       435      0.49      0.95      2.32
03:50:01 PM         1       432      0.67      0.90      1.65
04:00:01 PM         0       441      0.70      0.61      1.14
04:10:02 PM        83       439     72.75     44.64     20.84
04:20:02 PM       158       435    160.68    133.81     77.03
04:30:02 PM       142       445    139.72    138.60    107.40
04:40:02 PM       201       456    187.17    156.85    128.39
04:50:02 PM       248       469    252.23    237.67    184.21
05:00:01 PM         3       477      2.04     68.18    130.57
05:10:01 PM         3       472      2.54     10.96     69.35
05:20:01 PM         0       471      0.67      2.01     36.69
05:30:01 PM         5       477      1.30      1.40     19.78
Average:           11       439      9.79      9.53      9.34
[root@treeherder2.db.scl3 ~]#
I'm not sure where in the JobDetailsViewSet[1] you would even reject requests that have the missing params. Would you have to define a `list` view within it to wrap what's there and then get access to the request's query strings or something?

1. https://github.com/mozilla/treeherder/blob/master/treeherder/webapp/api/jobs.py#L311
Flags: needinfo?(wlachance)
Yeah, I think that'd be what you'd do: define a custom list parameter which demands a parameter if one is not set, then just passes through to the default implementation if not:

https://github.com/mozilla/treeherder/blob/7d62a4b38bb68c635c48a06da0cf527d89247ad7/treeherder/webapp/api/refdata.py#L65

I think we should also just remove ordering from the job detail view set:

https://github.com/mozilla/treeherder/blob/master/treeherder/webapp/api/jobs.py#L338

Do you want to take this on?
Flags: needinfo?(wlachance)

Updated

2 years ago
Assignee: nobody → wkocher
For posterity because I'm going to be distracted a lot today:
11:10 AM <KWierso> wlach: so for bug 1280039, would I leave the viewset alone and then add the `def list` down below https://github.com/mozilla/treeherder/blob/master/treeherder/webapp/api/jobs.py#L343 ?
11:10 AM <KWierso> or would the contents of the viewset need to go into that `def list`?
11:12 AM <wlach> KWierso: I think the former. You can call the parent with something like: return viewsets.ReadOnlyModelViewSet.list(self, ...) once you've validated that there are parameters
On further consideration, I think it would be better just to remove the OrderingFilter from the endpoint: there's no need to sort these sorts of results. I'll file a patch for this.
(In reply to William Lachance (:wlach) from comment #10)
> On further consideration, I think it would be better just to remove the
> OrderingFilter from the endpoint: there's no need to sort these sorts of
> results. I'll file a patch for this.

Hmm, though it's not OrderingFilter that's the problem. Using the django toolbar on my development instance, it seems like jobdetail is only doing this query (now that bug 1287140 is fixed):

SELECT `job_detail`.`id`, `job_detail`.`job_id`, `job_detail`.`title`, `job_detail`.`value`, `job_detail`.`url`, `job`.`id`, `job`.`repository_id`, `job`.`guid`, `job`.`project_specific_id`, `repository`.`id`, `repository`.`repository_group_id`, `repository`.`name`, `repository`.`dvcs_type`, `repository`.`url`, `repository`.`codebase`, `repository`.`description`, `repository`.`active_status`, `repository`.`performance_alerts_enabled` FROM `job_detail` INNER JOIN `job` ON ( `job_detail`.`job_id` = `job`.`id` ) INNER JOIN `repository` ON ( `job`.`repository_id` = `repository`.`id` ) LIMIT 2000;

This works fine on my development instance, and running the query directly against treeherder stage's db also works fine. But somehow the view is still hanging. Hmm...
Aha, the problem is the "count" query we use for pagination, which is super slow. We can fix this by switching to cursor based pagination, in case too many results are returned.
Created attachment 8771492 [details] [review]
[treeherder] wlach:1280039 > mozilla:master
(In reply to William Lachance (:wlach) from comment #12)
> Aha, the problem is the "count" query we use for pagination, which is super
> slow. We can fix this by switching to cursor based pagination, in case too
> many results are returned.

... this was part of it. The other aspect is that "order by" is slow with left joins (which we need to do searching by repository and "project specific" job ids). We can get rid of these silly joins when we move the jobs table into the central db, but for now the only fundamental solution is the one we outlined earlier: rejecting input with no filter parameters.
Comment on attachment 8771492 [details] [review]
[treeherder] wlach:1280039 > mozilla:master

There are still issues here, but we might as well get this merged, as it simplifies stuff a bunch (and is an incremental step towards where we want to go).
Attachment #8771492 - Flags: review?(emorley)

Updated

a year ago
Attachment #8771492 - Flags: review?(emorley) → review+

Comment 16

a year ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/c9e458297f9de503a1083f3db47e87d5ce4374a1
Bug 1280039 - Use cursor pagination with /jobdetail/ endpoint (#1695)

The count(*) query that limit/offset pagination requires is super slow with
InnoDB. Using a cursor-based approach makes the behaviour more reasonable if
no parameters are specified with this endpoint (arguably we should require
a parameter if none specified, but we can leave that for another time)
Just to recap, this is still an issue: see comment 14. We can fix this either by requiring parameters or by waiting for bug 1178641 (which will make iterating through all of the jobdetail objects tractable)
Duplicate of this bug: 1304153
Created attachment 8793076 [details] [review]
[treeherder] wlach:1280039-2 > mozilla:master
Stealing this from Wes, hope that's ok :)
Assignee: wkocher → wlachance
Attachment #8793076 - Flags: review?(emorley)

Updated

a year ago
Attachment #8793076 - Flags: review?(emorley) → review+

Comment 21

a year ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/f31bb444434b1f6058aaf75db3f583a24aedc4e2
Bug 1280039 - Enforce the presence of a suitable filter on the jobdetail endpoint (#1866)

If not present, the server could be forced to process some really slow
queries.
23:32 <RyanVM> so, I can't retrigger any jobs at th emoment
23:32 <RyanVM> https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&revision=6be9a5c9c8cd25d2082376495b788d85ce3cca98&selectedJob=1627297
23:33 <RyanVM>  Unable to send retrigger:Must also filter on repository if filtering on job id 

Looks like retriggerJob() is passing job id and no repository, so it either need to use the guid, or pass the repo too:
https://github.com/mozilla/treeherder/blob/69b81c772fd9a9686ac11bcd87c9b212ad2abd58/ui/plugins/controller.js#L332-L333
I did some poking around, I can't get any results from that endpoint even if I DO pass the repository.  Then it occurred to me that I was trying a TaskCluster job.  So that should work differently anyway.  It won't have a ``buildbot_request_id``

I've run out of time tonight, I have to leave for an appt.  But it doesn't appear to me that this is handling the TC case.  Sorry I couldn't do more tonight.
Depends on: 1304253
I think we can resolve this now.
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.