Closed Bug 1288028 Opened 3 years ago Closed 2 years ago

"Add New Jobs" often fails due to the runnable_jobs API hitting the 20s timeout ("Error fetching runnable jobs")


(Tree Management :: Treeherder: Job Triggering & Cancellation, defect, P1)



(Not tracked)



(Reporter: martianwars, Assigned: armenzg, Mentored)




(1 file)

Since the completion of Bug 1284911, TaskCluster jobs have been enabled in "Add New Jobs". However, it takes considerable time to load the list of all jobs (~5 seconds).

This is probably because downloading the file is taking considerable amount of time. This file is typically about 2.5MB
Depends on: 1284911
Wonder if there's some purges of jobs that haven't run in XX months (years?) that could reduce wait times for this and maybe the sheriffing panel's exclusion editor.
The main problem is with the design we have. When the UI hits the API we got and make all calculations when we should have that API ready with all the data prepopulated.

I think the easiest way to deal with this problem is by building the data that runnable apis needs as soon as possible. We should also minimize the data that needs fetching.

We can create a handler on pulse_actions which listens to two sources:
1) completed gecko decision tasks
2) merges to buildbot-config's production branch

When a gecko decision task finishes, we create the data for TH to use for that revision and notify TH that the file for that push is ready to be fetched.

When a merge to production happens, we create a file similar to allthething.json (less data; skip unwanted builders). We then send a message to TH to fetch the new file.

The API in itself does not have to be on TH (AFAIK), however, it is fine this way from my POV>
martianwars also mentions that the refactoring of legacy test tasks will speed up this. I don't know how but he says so.
camd: I would like this to be tackled. It can either be me or someone from the community. Do you agreed with this plan?
There's no rush to reply as I will be afk until Tuesday.

To fix this we will have to:
* rename the cerely task from 'fetch_allthethings' to 'update_runnable_jobs' [1]
* move 'fetch_allthethings' into a treeherder/etl/tasks/ [2]
* rename treeherder/etl/ to treeherder/etl/ [3]
* move the gecko decision task logic into treeherder/etl/ [4]
* store taskcluster jobs information into the runnable_jobs table
* add memcache to the endpoint [5]
* if a decision task id is not given fetch the latest gecko decision task for that project [6]
* add some tests under tests/webapp/api/
** I added some runnable jobs fixtures in my PR which might need to be relocated when writing tests for this

Flags: needinfo?(cdawson)
Dang, yeah.  The call to TASKCLUSTER_TASKGRAPH_URL in the runnable_jobs endpoint is probably the slow part here.  Though I don't know for sure, of course.  But the rest of the processing in there for the taskcluster graph should be pretty speedy.

The thing is that the calls to the runnable_jobs endpoint could be with any decision task id.  Quite possibly a brand new one.  And since those graphs could be quite different per push, we'd need to store each and every variation.  That would be a LOT of info to store, going forward.  And I'm not sure it's worth it, tbh.

I wonder how much mileage we could get out of just memcaching each time we call TASKCLUSTER_TASKGRAPH_URL keyed off the decision task.  That way the second person to call it would get it faster, at least.

I think the first thing to do would be some profiling to see exactly where this is slow, then we could find an approach to address that.  A unit test that feeds a sample task graph would tell you how much time is spent in those calculations.
Flags: needinfo?(cdawson)
Bear in mind that task graphs are likely to quintuple in size -- I expect the final size will be around 3000 tasks, and that will continue to grow slowly.
Summary: "Add New Jobs" is very slow when we want to see the list of available jobs → The runnable_jobs API is very slow which makes the "Add New Jobs" action very slow
From emorley:
> I'd propose moving everything inside RunnableJobsViewSet.list()
> into a separate method, then just calling that method from both
> the API viewset plus also wherever SETA needs it. This avoids the
> HTTP roundtrip for what can just be a function call :-)
I'm not sure how that will help here?

The HTTP roundtrip being avoided is not the one to fetch the dependency graph but the additional one (where a treeherder worker dyno makes an HTTP call to the Treeherder API) the SETA PR is trying to add.
Component: Treeherder → Treeherder: Job Triggering & Cancellation
I have had a lot of failures (as in 90% failure rate) to add runnable jobs across many revisions and multiple tries on some of the same revisions.

is there a chance we can get some cycles on this in the coming week?  Is it possibly that we have a much larger runnable_jobs.json object which takes longer to fetch.
Cameron/myself haven't really worked on this feature (or many of the others under this component), and we are really understaffed at the moment. Looking at the git log I see Rob/Armen have worked on parts of this - perhaps they might be able to take a look?
Duplicate of this bug: 1420500
This appears to have gotten worse over the last few months. Presumably due to there being a bigger task graph/more jobs/...

Rob/Armen, I don't suppose you'd be able to take a look at this at some point? :-)
Flags: needinfo?(rwood)
Flags: needinfo?(armenzg)
Priority: -- → P1
Summary: The runnable_jobs API is very slow which makes the "Add New Jobs" action very slow → "Add New Jobs" often fails due to the runnable_jobs API hitting the 20s timeout
Summary: "Add New Jobs" often fails due to the runnable_jobs API hitting the 20s timeout → "Add New Jobs" often fails due to the runnable_jobs API hitting the 20s timeout ("Error fetching runnable jobs")
Not "last few months", but "last week" or "last two weeks". Is there a 20s deadline for fetching the jobs list?
Sorry, forget about the 20s question, should have read the summary. Is increasing the limit an option?
I'd prefer not to, since it helps keep badly behaving API endpoints (such as this one) under control, to avoid web worker depletion.
I will look into this.
Assignee: nobody → armenzg
Flags: needinfo?(rwood)
Flags: needinfo?(armenzg)
From a preliminary look (and adding some more local logging) we spend most of the time fetching the graph:
[2017-11-29 20:09:07,838] INFO [django.server:124] "GET /api/project/mozilla-inbound/runnable_jobs/?format=json&decision_task_id=Cqn_Ceb9Q4SVnIspx3VGDw HTTP/1.1" 200 201314
[2017-11-29 20:09:33,664] INFO [treeherder.etl.runnable_jobs:133] Validating
[2017-11-29 20:09:33,666] INFO [treeherder.etl.runnable_jobs:135] Fetching
[2017-11-29 20:09:42,434] INFO [treeherder.etl.runnable_jobs:137] Fetched
[2017-11-29 20:09:42,436] INFO [treeherder.etl.runnable_jobs:146] About to transform
[2017-11-29 20:09:42,580] INFO [treeherder.etl.runnable_jobs:172] Finished transforming
[2017-11-29 20:09:43,342] INFO [django.server:124] "GET /api/project/mozilla-inbound/runnable_jobs/?format=json&decision_task_id=c1xuq3soTOyzeaz1_vd9Ww HTTP/1.1" 200 201315

See the file sizes (18475x bigger and that includes data added from Treeherder):
> ls -l full-task-graph.json runnable_jobs.json
> -rw-r--r--@ 1 armenzg  staff  39461427 Nov 29 05:31 full-task-graph.json
> -rw-r--r--  1 armenzg  staff      2136 Nov 29 15:18 runnable_jobs.json

Not everything inside of full-task-graph.json is necessary to generate the data for the runnable_jobs API.
We can modify the gecko decision task to generate a file with the minimum ammount of data necessary for runnable jobs.
We can name it runnable_jobs.json.

All the transformations we do are in here:

We need the following:
 * label
 * node['task']['extra']['treeherder']
 * node['task']['metadata']

dustin, works for you? What code changes would be needed?
Flags: needinfo?(dustin)
We can also post-process the full-task-graph.json within the gecko decision task.

Which code would need touching to do so?
This sounds like a good idea.  The task-graph generation code ( creates a number of "side" artifacts such as actions.json, and this would just be one more.
Flags: needinfo?(dustin)
Depends on: 1422133
Comment on attachment 8933740 [details] [review]
Link to GitHub pull-request:

Whoever can get to it.
Attachment #8933740 - Flags: review?(emorley)
Attachment #8933740 - Flags: review?(cdawson)
Comment on attachment 8933740 [details] [review]
Link to GitHub pull-request:

Thank you for looking into this! :-)

The file is now 3MB rather than 39MB, which is a drastic improvement. However I notice it's still served uncompressed? I think we should also try and make sure it's gzipped (either on disk or as part of the upload process), to reduce the transfer size further.

Also, looking at an example entry...

  "android-push-apk-breakpoint/opt": {
    "collection": {
      "opt": true
    "description": "PushApk breakpoint. Decides whether APK should be published onto Google Play Store ([Treeherder push](",
    "groupName": "APK publishing",
    "groupSymbol": "pub",
    "platform": "Android",
    "symbol": "Br"
  }, the description field used anywhere?

Re the PR specifically, I think I would also rather the Treeherder side remove support for the full graph entirely, since it will reduce complexity (and the old graph mostly timed out anyway). If there are older branches for which the graph is small enough to not timeout, and breakage there is not ok, then it would be good to instead backport bug 1422133 :-)
Attachment #8933740 - Flags: review?(emorley)
Attachment #8933740 - Flags: review?(cdawson)
I don't think we use description.
The UI for adding new jobs only shows some meta information when "hovering" over the jobs, however, I don't believe it uses any information from full-task-graph.json [1]. I looked for "GUnit tests" and it does not show up there.

For older branches we're not likely to use add new jobs as we don't coalesce or use SETA.
I will mention it on the mailing lists that we will be dropping the support and we uplift/backlift if needed.

Attachment #8933740 - Flags: review?(emorley)
Since we don't have fallback support for full-task-graph.json we will need to let bug 1422133 propagate to all integration trees.
Once it lands I will send an email to few newsgroups.

I mentioned it on the PR. The original json file is 420x larger than the gziped file (<100kb).

The API takes 2-3 seconds to load. That's pretty good :)
Comment on attachment 8933740 [details] [review]
Link to GitHub pull-request:

I've left some comments on the PR - couple of tweaks and this should be good to go :-)

The 39MB -> 100KB reduction in filesize is amazing!
Attachment #8933740 - Flags: review?(emorley) → feedback+
Comment on attachment 8933740 [details] [review]
Link to GitHub pull-request:

I've added a commit for a workaround to bug 1423215 (that uses requests' automatic compression to save needing to use gzip manually).

With that, everything seems to work well on stage :-)
Attachment #8933740 - Flags: review+
Commits pushed to master at
Bug 1288028 - Reduce latency of runnable jobs API

The runnable jobs API now fetches runnable-jobs.json if available and fallsback to

The new file is less than a tenth of the original file and contains the minimum
amount of data required for the endpoint.

Drop support for full-task-graph.json and 'job_type_description'.
Bug 1288028 - Force decompression of runnable-jobs.json.gz

Works around the uploaded `runnable-jobs.json.gz` file being served
without a `Content-Encoding: gzip` header (bug 1423215).
Closed: 2 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.