1288028 - "Add New Jobs" often fails due to the runnable_jobs API hitting the 20s timeout ("Error fetching runnable jobs")

Reporter

Description

•

9 years ago

Since the completion of Bug 1284911, TaskCluster jobs have been enabled in "Add New Jobs". However, it takes considerable time to load the list of all jobs (~5 seconds). This is probably because downloading the file https://github.com/mozilla/treeherder/blob/master/treeherder/webapp/api/runnable_jobs.py#L29 is taking considerable amount of time. This file is typically about 2.5MB

Kalpesh Krishna [:martianwars]

Reporter

Updated

•

9 years ago

Depends on: 1284911

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 1

•

9 years ago

Wonder if there's some purges of jobs that haven't run in XX months (years?) that could reduce wait times for this and maybe the sheriffing panel's exclusion editor.

Armen [:armenzg]

Assignee

Comment 2

•

9 years ago

The main problem is with the design we have. When the UI hits the API we got and make all calculations when we should have that API ready with all the data prepopulated. I think the easiest way to deal with this problem is by building the data that runnable apis needs as soon as possible. We should also minimize the data that needs fetching. We can create a handler on pulse_actions which listens to two sources: 1) completed gecko decision tasks 2) merges to buildbot-config's production branch When a gecko decision task finishes, we create the data for TH to use for that revision and notify TH that the file for that push is ready to be fetched. When a merge to production happens, we create a file similar to allthething.json (less data; skip unwanted builders). We then send a message to TH to fetch the new file. The API in itself does not have to be on TH (AFAIK), however, it is fine this way from my POV>

Armen [:armenzg]

Assignee

Comment 3

•

9 years ago

martianwars also mentions that the refactoring of legacy test tasks will speed up this. I don't know how but he says so.

Armen [:armenzg]

Assignee

Comment 4

•

8 years ago

camd: I would like this to be tackled. It can either be me or someone from the community. Do you agreed with this plan? There's no rush to reply as I will be afk until Tuesday. To fix this we will have to: * rename the cerely task from 'fetch_allthethings' to 'update_runnable_jobs' [1] * move 'fetch_allthethings' into a treeherder/etl/tasks/runnable_jobs_tasks.py [2] * rename treeherder/etl/allthethings.py to treeherder/etl/runnable_jobs.py [3] * move the gecko decision task logic into treeherder/etl/runnable_jobs.py [4] * store taskcluster jobs information into the runnable_jobs table * add memcache to the endpoint [5] * if a decision task id is not given fetch the latest gecko decision task for that project [6] * add some tests under tests/webapp/api/ ** I added some runnable jobs fixtures in my PR which might need to be relocated when writing tests for this [1] https://github.com/mozilla/treeherder/blob/master/Procfile#L11 [2] https://github.com/mozilla/treeherder/blob/c88a9fb3fa5fd4d60dc6d3438fee5c65c9e91dae/treeherder/etl/tasks/buildapi_tasks.py#L39-L44 [3] https://github.com/mozilla/treeherder/blob/c88a9fb3fa5fd4d60dc6d3438fee5c65c9e91dae/treeherder/etl/allthethings.py [4] https://github.com/mozilla/treeherder/blob/master/treeherder/webapp/api/runnable_jobs.py#L18-L123 [5] https://docs.djangoproject.com/en/1.10/topics/cache/#memcached [6] https://github.com/armenzg/treeherder/blob/seta_rewrite/tests/seta/conftest.py#L70-L114 [6] https://github.com/armenzg/treeherder/blob/seta_rewrite/treeherder/seta/runnable_jobs.py#L37-L52

Flags: needinfo?(cdawson)

Cameron Dawson [:camd]

Comment 5

•

8 years ago

Dang, yeah. The call to TASKCLUSTER_TASKGRAPH_URL in the runnable_jobs endpoint is probably the slow part here. Though I don't know for sure, of course. But the rest of the processing in there for the taskcluster graph should be pretty speedy. The thing is that the calls to the runnable_jobs endpoint could be with any decision task id. Quite possibly a brand new one. And since those graphs could be quite different per push, we'd need to store each and every variation. That would be a LOT of info to store, going forward. And I'm not sure it's worth it, tbh. I wonder how much mileage we could get out of just memcaching each time we call TASKCLUSTER_TASKGRAPH_URL keyed off the decision task. That way the second person to call it would get it faster, at least. I think the first thing to do would be some profiling to see exactly where this is slow, then we could find an approach to address that. A unit test that feeds a sample task graph would tell you how much time is spent in those calculations.

Flags: needinfo?(cdawson)

Dustin J. Mitchell [:dustin] (he/him)

Comment 6

•

8 years ago

Bear in mind that task graphs are likely to quintuple in size -- I expect the final size will be around 3000 tasks, and that will continue to grow slowly.

Armen [:armenzg]

Assignee

Updated

•

8 years ago

Summary: "Add New Jobs" is very slow when we want to see the list of available jobs → The runnable_jobs API is very slow which makes the "Add New Jobs" action very slow

Armen [:armenzg]

Assignee

Comment 7

•

8 years ago

From emorley: > I'd propose moving everything inside RunnableJobsViewSet.list() > into a separate method, then just calling that method from both > the API viewset plus also wherever SETA needs it. This avoids the > HTTP roundtrip for what can just be a function call :-)

Ed Morley [:emorley]

Comment 8

•

8 years ago

I'm not sure how that will help here? The HTTP roundtrip being avoided is not the one to fetch the dependency graph but the additional one (where a treeherder worker dyno makes an HTTP call to the Treeherder API) the SETA PR is trying to add.

Ed Morley [:emorley]

Updated

•

7 years ago

Component: Treeherder → Treeherder: Job Triggering & Cancellation

Joel Maher ( :jmaher ) (UTC -8)

Comment 9

•

7 years ago

I have had a lot of failures (as in 90% failure rate) to add runnable jobs across many revisions and multiple tries on some of the same revisions. is there a chance we can get some cycles on this in the coming week? Is it possibly that we have a much larger runnable_jobs.json object which takes longer to fetch.

Ed Morley [:emorley]

Comment 10

•

7 years ago

Cameron/myself haven't really worked on this feature (or many of the others under this component), and we are really understaffed at the moment. Looking at the git log I see Rob/Armen have worked on parts of this - perhaps they might be able to take a look?

Ed Morley [:emorley]

Comment 12

•

7 years ago

This appears to have gotten worse over the last few months. Presumably due to there being a bigger task graph/more jobs/... Rob/Armen, I don't suppose you'd be able to take a look at this at some point? :-)

Flags: needinfo?(rwood)

Flags: needinfo?(armenzg)

Priority: -- → P1

Summary: The runnable_jobs API is very slow which makes the "Add New Jobs" action very slow → "Add New Jobs" often fails due to the runnable_jobs API hitting the 20s timeout

Sylvestre Ledru [:Sylvestre]

Updated

•

7 years ago

Summary: "Add New Jobs" often fails due to the runnable_jobs API hitting the 20s timeout → "Add New Jobs" often fails due to the runnable_jobs API hitting the 20s timeout ("Error fetching runnable jobs")

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 13

•

7 years ago

Not "last few months", but "last week" or "last two weeks". Is there a 20s deadline for fetching the jobs list?

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 14

•

7 years ago

Sorry, forget about the 20s question, should have read the summary. Is increasing the limit an option?

Ed Morley [:emorley]

Comment 15

•

7 years ago

I'd prefer not to, since it helps keep badly behaving API endpoints (such as this one) under control, to avoid web worker depletion.

Armen [:armenzg]

Assignee

Comment 16

•

7 years ago

I will look into this.

Assignee: nobody → armenzg

Flags: needinfo?(rwood)

Flags: needinfo?(armenzg)

Armen [:armenzg]

Assignee

Comment 17

•

7 years ago

From a preliminary look (and adding some more local logging) we spend most of the time fetching the graph: [2017-11-29 20:09:07,838] INFO [django.server:124] "GET /api/project/mozilla-inbound/runnable_jobs/?format=json&decision_task_id=Cqn_Ceb9Q4SVnIspx3VGDw HTTP/1.1" 200 201314 [2017-11-29 20:09:33,664] INFO [treeherder.etl.runnable_jobs:133] Validating [2017-11-29 20:09:33,666] INFO [treeherder.etl.runnable_jobs:135] Fetching https://public-artifacts.taskcluster.net/c1xuq3soTOyzeaz1_vd9Ww/0/public/full-task-graph.json [2017-11-29 20:09:42,434] INFO [treeherder.etl.runnable_jobs:137] Fetched [2017-11-29 20:09:42,436] INFO [treeherder.etl.runnable_jobs:146] About to transform [2017-11-29 20:09:42,580] INFO [treeherder.etl.runnable_jobs:172] Finished transforming [2017-11-29 20:09:43,342] INFO [django.server:124] "GET /api/project/mozilla-inbound/runnable_jobs/?format=json&decision_task_id=c1xuq3soTOyzeaz1_vd9Ww HTTP/1.1" 200 201315 See the file sizes (18475x bigger and that includes data added from Treeherder): > ls -l full-task-graph.json runnable_jobs.json > -rw-r--r--@ 1 armenzg staff 39461427 Nov 29 05:31 full-task-graph.json > -rw-r--r-- 1 armenzg staff 2136 Nov 29 15:18 runnable_jobs.json Not everything inside of full-task-graph.json is necessary to generate the data for the runnable_jobs API. We can modify the gecko decision task to generate a file with the minimum ammount of data necessary for runnable jobs. We can name it runnable_jobs.json. All the transformations we do are in here: https://github.com/mozilla/treeherder/blob/master/treeherder/etl/runnable_jobs.py#L143-L167 We need the following: * label * node['task']['extra']['treeherder'] * node['task']['metadata'] dustin, works for you? What code changes would be needed?

Flags: needinfo?(dustin)

Armen [:armenzg]

Assignee

Comment 18

•

7 years ago

We can also post-process the full-task-graph.json within the gecko decision task. Which code would need touching to do so?

Dustin J. Mitchell [:dustin] (he/him)

Comment 19

•

7 years ago

This sounds like a good idea. The task-graph generation code (generate.py) creates a number of "side" artifacts such as actions.json, and this would just be one more.

Flags: needinfo?(dustin)

Armen [:armenzg]

Assignee

Updated

•

7 years ago

Depends on: 1422133

GitHub Bugzilla PR Linker

Comment 20

•

7 years ago

Attached file Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3009 — Details

Armen [:armenzg]

Assignee

Comment 21

•

7 years ago

Comment on attachment 8933740 [details] [review] Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3009 Whoever can get to it.

Attachment #8933740 - Flags: review?(emorley)

Attachment #8933740 - Flags: review?(cdawson)

Ed Morley [:emorley]

Comment 22

•

7 years ago

Comment on attachment 8933740 [details] [review] Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3009 Thank you for looking into this! :-) The file is now 3MB rather than 39MB, which is a drastic improvement. However I notice it's still served uncompressed? I think we should also try and make sure it's gzipped (either on disk or as part of the upload process), to reduce the transfer size further. Also, looking at an example entry... "android-push-apk-breakpoint/opt": { "collection": { "opt": true }, "description": "PushApk breakpoint. Decides whether APK should be published onto Google Play Store ([Treeherder push](https://treeherder.mozilla.org/#/jobs?repo=try&revision=cff10641048c869f1783397f8531e56dd449cab8))", "groupName": "APK publishing", "groupSymbol": "pub", "platform": "Android", "symbol": "Br" }, ...is the description field used anywhere? Re the PR specifically, I think I would also rather the Treeherder side remove support for the full graph entirely, since it will reduce complexity (and the old graph mostly timed out anyway). If there are older branches for which the graph is small enough to not timeout, and breakage there is not ok, then it would be good to instead backport bug 1422133 :-)

Attachment #8933740 - Flags: review?(emorley)

Attachment #8933740 - Flags: review?(cdawson)

Armen [:armenzg]

Assignee

Comment 23

•

7 years ago

I don't think we use description. The UI for adding new jobs only shows some meta information when "hovering" over the jobs, however, I don't believe it uses any information from full-task-graph.json [1]. I looked for "GUnit tests" and it does not show up there. For older branches we're not likely to use add new jobs as we don't coalesce or use SETA. I will mention it on the mailing lists that we will be dropping the support and we uplift/backlift if needed. [1] https://index.taskcluster.net/v1/task/gecko.v2.mozilla-central.latest.firefox.decision/

Armen [:armenzg]

Assignee

Updated

•

7 years ago

Attachment #8933740 - Flags: review?(emorley)

Armen [:armenzg]

Assignee

Comment 24

•

7 years ago

Since we don't have fallback support for full-task-graph.json we will need to let bug 1422133 propagate to all integration trees. Once it lands I will send an email to few newsgroups. I mentioned it on the PR. The original json file is 420x larger than the gziped file (<100kb). The API takes 2-3 seconds to load. That's pretty good :)

Ed Morley [:emorley]

Comment 25

•

7 years ago

Comment on attachment 8933740 [details] [review] Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3009 I've left some comments on the PR - couple of tweaks and this should be good to go :-) The 39MB -> 100KB reduction in filesize is amazing!

Attachment #8933740 - Flags: review?(emorley) → feedback+

Ed Morley [:emorley]

Comment 26

•

7 years ago

Comment on attachment 8933740 [details] [review] Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3009 I've added a commit for a workaround to bug 1423215 (that uses requests' automatic compression to save needing to use gzip manually). With that, everything seems to work well on stage :-)

Attachment #8933740 - Flags: review+

Treeherder GitHub Bugbot

Comment 27

•

7 years ago

Commits pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/27489eefabfc3d4b14aca87330f6388167c9c79d Bug 1288028 - Reduce latency of runnable jobs API The runnable jobs API now fetches runnable-jobs.json if available and fallsback to full-task-graph.json. The new file is less than a tenth of the original file and contains the minimum amount of data required for the endpoint. Drop support for full-task-graph.json and 'job_type_description'. https://github.com/mozilla/treeherder/commit/9cb37712473b0691a03aacf37110c7b24ddfc6ab Bug 1288028 - Force decompression of runnable-jobs.json.gz Works around the uploaded `runnable-jobs.json.gz` file being served without a `Content-Encoding: gzip` header (bug 1423215).

Ed Morley [:emorley]

Updated

•

7 years ago

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → FIXED

Ed Morley [:emorley]

Comment 28

•

7 years ago

Here's a transaction trace on stage, showing the new timing breakdown for reference: https://rpm.newrelic.com/accounts/677903/applications/14179733/transactions?tw%5Bend%5D=1512494800&tw%5Bstart%5D=1512493000#id=5b225765625472616e73616374696f6e2f46756e6374696f6e2f747265656865726465722e7765626170702e6170692e72756e6e61626c655f6a6f62733a52756e6e61626c654a6f6273566965775365742e6c697374222c22225d&app_trace_id=18ca9b35-d9e1-11e7-9b30-0242ac110009_0_3265

Nobody; OK to take it and work on it

Updated

•

3 years ago

Component: Treeherder: Job Triggering & Cancellation → TreeHerder