Closed Bug 1254325 Opened 5 years ago Closed 5 years ago

Change the runnable_jobs API to include TaskCluster jobs as well

Categories

(Tree Management :: Treeherder: API, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: martianwars)

References

Details

Attachments

(4 files)

I want to define this project as a Google Summer of Code project [1]

This is a feature that developers are really liking and we need to make it work for TaskCluster.

This is the equivalent to bug 1194830

This depends on bug 1232005 which would give us the ability to know which tasks could have been scheduled.

[1] https://docs.google.com/document/d/1TDF-UKYb16wxvAtqdzCFauTvJXSx_fjf7i1f8F3l7bk/edit#heading=h.fsx45ke2f3gd
jmaher, armenzg,

Hi! I would like to start working on this! Went through the docs briefly. What sort of setup would I need? My treeherder is setup :)

martianwars
Flags: needinfo?(jmaher)
Flags: needinfo?(armenzg)
there are a few parts here:
1) getting the list of available jobs from taskcluster to schedule
2) having pulse actions take the selected job(s) from a treeherder requestion initiated in the UI and resolve scheduling via taskcluster api
3) working on taskcluster and the gecko decision task to accept requests coming in and making it happen.
4) worrying about dependencies (request a test job, make sure the build job exists or is scheduled)

if you are fluent in treeherder, I would look to pulse actions: https://github.com/mozilla/pulse_actions.  Likewise see the interface for buildbot (https://github.com/mozilla/mozilla_ci_tools) as an example.  And finally read up on Taskcluster and the gecko decision task.
Flags: needinfo?(jmaher)
Flags: needinfo?(armenzg)
Assignee: nobody → kalpeshk2011
So I guess I'll begin work on this. My first task here is to fetch all_tasks.json (you can see the file here https://tools.taskcluster.net/task-inspector/#PzmhiGZ8TMiLxeNf0IZtbg/0). I will be following the approach used for allthethings.json
A conversation between camd and me.
 
<martianwars> camd, wlach: a bit confused here. When exactly is this task triggered? https://github.com/mozilla/treeherder/blob/master/treeherder/etl/tasks/buildapi_tasks.py#L39
12:24 AM <•camd> martianwars: so that's a celery task
12:24 AM <•camd> it's scheduled on a "celery beat" 
12:24 AM <•camd> that's specified in the settings.py file
12:24 AM <•camd> I believe it's set to happen once/day right now
12:25 AM <martianwars> camd: yeah correct, it's set to happen once a day
12:26 AM <martianwars> camd: thanks!
12:29 AM <•camd> martianwars: you bet!  :)
12:32 AM <martianwars> camd: so the problem here is, all_tasks.json isn't on any public link. I would have to fetch the URL from the UI (which isn't hard to produce), send it to the Django API and use python's requests or something
12:32 AM <martianwars> camd: it's push specific in other words
12:33 AM <martianwars> camd: is this possible? if not, we can use the "latest" of the day
12:34 AM <•camd> martianwars: you shouldn't have to fetch it from the UI.  You have access to the whole db from the service.
12:34 AM <•camd> from the python side, that is
12:35 AM <•camd> But I think I see your point
12:35 AM <martianwars> camd: no, I'll just get the URL from the UI, send it to the python side, and python fetches it realtime I mean. Do you like this approach? IMO, it'll be too slow
12:35 AM <•camd> you need to fetch that file from one of the jobs in the push that's a taskcluster job
12:35 AM mcote|afk → mcote
12:35 AM <•camd> martianwars: oh I see.
12:35 AM <•camd> I'm not sure how slow that would be.  might not be a problem at all
12:36 AM <•camd> martianwars: that's worth trying as that might be simpler.  then you wouldn't have to do the fetch on the service side at all
12:37 AM <martianwars> camd: I'll just check up to see how different this file is push to push
12:39 AM <martianwars> camd: it seems to have revision data, so I think it's better to get it each time "Add New Jobs" is pressed or something
12:41 AM <•camd> martianwars: yeah, it kind of seems like you would have the api just use ANY taskcluster job in the push, and send that up
12:41 AM <•camd> then python would open the file and send the job info down with it.
12:42 AM <martianwars> camd: so should I fetch the file using "requests" in python?
12:42 AM <•camd> martianwars: well, I'd have to look closer at the code to determine the actual workflow.  But sounds like you're on a good track
12:43 AM <•camd> martianwars: yes, that's the right way
12:44 AM <martianwars> camd: alright thanks! Also, I can see all the pushes in local.treeherder.mozilla.org (identical to the actual treeherder.mozilla.org). Is that a problem?
12:44 AM <martianwars> I didnt need to open another terminal for this ^
12:45 AM <•camd> martianwars: if you are running "celery -A treeherder worker -B" then that's what you'd get, yeah.
12:45 AM <•camd> there's a celery beat task called "pushlog" that gets all the pushes locally, just like for production
12:45 AM camd → •camd|lunch
Flags: needinfo?(armenzg)
I read the conversation but it is still a bit confusing the approach.
camd: what do you think of what I write below?

If it helps, here are some links to the tasks and to the artifact:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=gecko&selectedJob=3866852&revision=c3f5e6079284
https://tools.taskcluster.net/task-inspector/#C4UJf0PhSleAkNDckZENvg/0

The following are just some ideas on how this would work in my head.

Steps:
* Users clicks add new jobs
* UI or backend grabs all_tasks.json which is an artifact of the gecko decision task
** Treeherder has the info about where this file is hosted

Steps:
* Push appears on Treeherder
* Gecko decision completes
* Treeherder notices this, fetches and caches the all_tasks.json information, thus, being ready for when the user clicks "add new jobs"
** OR it puts the data in some sort of API similar to runnable 

Steps:
* Push appears on Treeherder
* Gecko decision submits all_tasks.json information to Treeherder via pulse or an API, thus, being ready for when the user clicks "add new jobs"
** OR it puts the data in some sort of API similar to runnable
Flags: needinfo?(armenzg) → needinfo?(cdawson)
Hey Armen: yeah, those all sound like viable approaches.  TBH, IMO (need another TLA here...) going with the first one would be best.  We've done premature optimizations so many times that end up just not being necessary.  I would think we should just grab that data in real-time and see how the performance is.  If it's too slow, then we could try approach 2.  Approach 2 is similar to how Alice did things, which is fine.  But perhaps we could go with the simplest approach here and just optimize as needed.

But I don't feel TOO strongly about it. :)  If you prefer the pre-caching, then we can do that.
Flags: needinfo?(cdawson)
This is the set of jobs that I get (marked with yellow squares) :) Does it look good? Should there be any more jobs?
Attachment #8753534 - Flags: review?(armenzg)
This is the original set of jobs. So I guess this completes the first part of the PR, which was to allow TC runnable jobs to be shown in Treeherder. Now I need to study how the code works when we try to click on the yellow symbols.
Comment on attachment 8751911 [details] [review]
[treeherder] martiansideofthemoon:forceawakens > mozilla:master

camd, I hope this is okay :)
Attachment #8751911 - Flags: feedback?(cdawson)
Comment on attachment 8753534 [details]
Screenshot from 2016-05-18 02:31:09.png

That looks about right :)
Attachment #8753534 - Flags: review?(armenzg) → review+
Hi Kalpesh,
Here's some info on what comes after you can produce the pulse messages from Treeherder.

Right now, for Buildbot, we get the list of builders selected
and it is processed by "buildbot_graph_builder()"
https://github.com/mozilla/pulse_actions/blob/master/pulse_actions/handlers/treeherder_runnable.py#L91

In the case of Buildbot, there can be these in a request:
* requests for one or more test builders (downstream jobs)
  * The developer might have forgotten to select the required "build" for a specific test builder
  * The developer might have chosen a test job for which there is completed "build"; in that case pulse_actions requests the missing build to be scheduled
  * The developer might have chosen a test job for which there is a running "build", thus, we can't schedule the test job until it completes (right this is not possible for Buildbot even through the Buildbot bridge)
* requests for one or more build builders (upstream jobs)
  * There might be a running build; I don't know if we do or do not schedule it. I think we do
  * There might not be a build scheduled

In the case of TaskCluster, if we receive the tasks that the developer requests + associated build job we won't need to do any searching for the upstream job on pulse_actions.

What I'm trying to say is that Treherder should be able to know that if a selected job has the "required" field, it should then also send the associated task for the upstream build.

* task N requires task A
* developer only selects task N
* Treeherder determines that both tasks should be sent over pulse

The alternative is to make pulse_actions have to go and search for all_tasks.json to determine what other tasks does task N need (similar to what Buildbot does).
Comment on attachment 8751911 [details] [review]
[treeherder] martiansideofthemoon:forceawakens > mozilla:master

Looking good so far!  Please need-info me if you have more questions and I'm not around on IRC.  :)
Attachment #8751911 - Flags: feedback?(cdawson) → feedback+
:camd, :armenzg, :garndt, :dustin, 

I'd like to outline the approach I plan to follow with respect to Treeherder. Dustin has landed changes which changes all_tasks.json to full-tasks.json and adds a few changes in the JSON format. For example, you can see an artifact https://public-artifacts.taskcluster.net/SQas-oGSQaWoyFIPLeLBdg/0/public/full-task-graph.json. This file has a number of keys called TaskLabels each having a task JSON.

I intend to send these taskLabels along with the URL of the push in the pulse message. MozCI can download this file and get the corresponding task easily using these taskLabels.

What are your thoughts?
Flags: needinfo?(garndt)
Flags: needinfo?(cdawson)
Flags: needinfo?(armenzg)
I think that by doing it this way, mozci should be able to locate the label in the full-tasks.json and be able to know what to schedule.  Would be nice if there was a way to have the labels for these that mozci could resubmit a decision task for and things would just get scheduled without mozci needing to make a decision but that might be difficult or not possible yet.
Flags: needinfo?(garndt)
Comment on attachment 8751911 [details] [review]
[treeherder] martiansideofthemoon:forceawakens > mozilla:master

Hey Cameron,
This is working for me locally (I hope I've removed all the debugging code). I'm not sure how to see the Pulse messages, but I can definitely see status code 200 for both add_runnable_jobs and runnable_jobs APIs.
Attachment #8751911 - Flags: review?(cdawson)
WRT to pulse messages, are you publishing locally and you want to listen to that?
Or are you intending to listen to the production exchanges/topics?
From a Treeherder perspective, your approach sounds good to me, Kalpesh.  Now to check out the PR.  :)

You can see pulse messages with https://tools.taskcluster.net/pulse-inspector/ if you want to see what's going by.
Flags: needinfo?(cdawson)
Comment on attachment 8751911 [details] [review]
[treeherder] martiansideofthemoon:forceawakens > mozilla:master

This is excellent work.  :)  I've added a few comments here and there.  A few spots that need addressing.

I wasn't totally clear on the next steps.  Is there more work on the task cluster side that's needed?  I wouldn't want to merge and enable this until it will actually do the retriggering in task cluster, of course.

Once everything is set to happen on the TC side, then we can try rebasing this on master and pushing to stage to test it out prior to merging to master.

Great work!!  :)
Attachment #8751911 - Flags: review?(cdawson)
Cameron,
I think there is a bit of work needed on the mozCI side. But I would need an inflow of pulse messages before I can think of starting that work. The code to schedule BB jobs will not break due to these changes as far as I understand.
I can't see those messages in Pulse Inspector, since I am not sure about how to publish to the exchange exchange/treeherder/v1/resultset-runnable-job-actions using my local version of Treeherder. I have an account on PulseGuardian, what settings would I need to change? The "Posting Data" section here https://treeherder.readthedocs.io/pulseload.html doesn't seem to help as my pulse messages are generated programmatically.
Flags: needinfo?(cdawson)
(In reply to Kalpesh Krishna [:martianwars] from comment #14)
> :camd, :armenzg, :garndt, :dustin, 
> 
> I'd like to outline the approach I plan to follow with respect to
> Treeherder. Dustin has landed changes which changes all_tasks.json to
> full-tasks.json and adds a few changes in the JSON format. For example, you
> can see an artifact
> https://public-artifacts.taskcluster.net/SQas-oGSQaWoyFIPLeLBdg/0/public/
> full-task-graph.json. This file has a number of keys called TaskLabels each
> having a task JSON.
> 
> I intend to send these taskLabels along with the URL of the push in the
> pulse message. MozCI can download this file and get the corresponding task
> easily using these taskLabels.
> 
> What are your thoughts?

I have an unfinished comment on my other laptop. I don't want to delay my reply any longer.

What I recall saying is that unlike on Buildbot, you can use the index to reach the full-task-graph.json for each revision (I remember noticing that it would require changes to the decision task to have more than references to "latest").

You don't need to send the URL. You could only send the task ID. It would however save some fetches on pulse_actions.

Remember that dev might not select the upstream tasks needed for the downstream tasks (dependent tasks).
pulse_actions/mozci has to determine what is missing and perhaps having to select the missing task which was not selected.

Sending task labels on the message sounds like a good plan.

If you could post a message containing both Buildbot buildernames and TC task labels it would be great.
Flags: needinfo?(armenzg)
I would also gain value by adding UTC unix timestamp to the messages so I can measure how late we're to processing it.
Kalpesh-- This may take some experimentation, but this is the code that shows the settings you would set to use your own pulse user/exchange for testing:

https://github.com/mozilla/treeherder/blob/444d5e0745dd3bbaafa5575b2d4ebe4d821070ef/treeherder/model/tasks.py#L32-L36
Flags: needinfo?(cdawson)
Blocks: 1080265
Depends on: 1276963
dustin and I spoke about this project in the RRA.

For now, we would like to *only* focus on adding new jobs for Try.
This means that the pulse_actions credentials will only be able to schedule Try jobs.

On the TH PR, could we avoid allowing to add new jobs for TaskCluster in every other repo but try?

Once TH uses TaskCluster for authentication, I would like to see the scm_level being sent as part of the message so pulse_actions can determine if the user can add new jobs on that push.

Another approach would be that TH would not allonw users click "add new jobs" unless they have the right scm level (e.g. someone has try push level and can only add new jobs to try pushes).

camd: what do you think about adding UI restrictions on TH?


[1] From conversation with dustin
> assume we can ensure that only task graphs the user could have created (are at or below their SCM level) can be extended, > so a level-1 person can only modify level-1 pushes (try)
>   -> then this service provides no capabilities not already present by pushing to that tree
> 
> We'll accomplish this initially by only operating at level 1.  Then when treeherder's integration with taskcluster has
> improved (and thus we can determine the user's permissions and/or scm level reliably), we can improve that.
Flags: needinfo?(cdawson)
Summary: Give Treeherder the ability to add TaskCluster jobs to pushes → Give Treeherder the ability to add TaskCluster jobs to try pushes
I want to document the use cases I want to see tested:
1) only a build job is requested
2) only a test job is requested for a completed build
3) only a test job is requested for a non-completed build
4) only a test job is requested for a currently running build

We should *not* be scheduling a new build if there is already one completed or currently running.
We should *not* be creating the docker image tasks unless we really have to.
Armen-- Yeah, that's just fine about the UI restrictions to only try.  There's precedence there, actually.  I think we had code like that where you could only cancel jobs (or was it all jobs?) on try unless you're a sheriff.

I don't think you'd want to hide the menu item in that case, because you still want to be able to add BB jobs.  But you could certainly do that in the controller or model.  Just do the same check for sheriff'ness that is done for enabling the buttons in the Sheriff panel.
Flags: needinfo?(cdawson)
Thanks camd!

martianwars: do the changes that camd and I speak on comment 26 and comment 24 make sense to you?
Note from one of my emails:
I see in full-task-graph.json that build jobs depend on the docker build image task.
However, if I recall correctly, we don't always need to build the docker image and can go straight to running the build (since we can reuse the docker build image that was built on another push).

How can we know when we don't need to request building the docker build image?

----
dustin's reply is to use optimization which lives in the tree.

It is unlikely that our system will work well in pulse_actions unless we have access to the code in-tree.

There are other messy ways of accomplishing this, however, it might make more sense to start investing on the "action tasks" idea. We should talk more about it in London.
Comment on attachment 8751911 [details] [review]
[treeherder] martiansideofthemoon:forceawakens > mozilla:master

I've completed the small issues. I've added UTC timestamps in the Pulse messages. I've also restricted this to the try repository. @armenzg
Attachment #8751911 - Flags: review?(cdawson)
By restricting to try, I mean that TC jobs will only show up on pressing "Add New Jobs" in the try repository. For other repositories, you can still use "Add New Jobs" to schedule buildbot jobs.
I'm sorry I haven't reviewed this yet.  I will take a look at it tomorrow.
Hey Kalpesh-- I went through this again.  Generally looks very good.  But there's a problem when the decision task isn't present.  Left a few comments.  :)  Clearing the review flag for now.
Attachment #8751911 - Flags: review?(cdawson)
Depends on: 1281062
I've fixed the empty decision task problem and I've added a UI restriction to unblock mikeling's project here https://github.com/mozilla/treeherder/pull/1490/files#diff-9da180a71af78c18a9d5089198b6099fR335. Once this is merged, the API **could** return TC jobs, but the UI will not. Once action tasks are ready, we can fix this in a moment's work.
Also, timestamps will come along in the pulse message now.
Attachment #8751911 - Flags: review?(cdawson)
Hey Kalpesh: I had a couple bugs I really had to knock out today.  I will review this tomorrow for sure, though.
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/9d57fed7260ed7542b53aac8f469b4fe96338b58
Bug 1254325 - Adding support to schedule TaskCluster jobs (#1490)

* Bug 1254325 - Adding support to download all_tasks.json

* Adding support to send task ID of gecko decision via Pulse

* Fixing a few problems

* Adding UTC timestamp in pulse message

* Fixing nits, adding restriction to block ui changes

* Updating URL for full tasks file

* Renaming buildernames to requested_jobs
There's not much we can do about it now - but don't forget to either use the squash and merge feature or else ask the PR author to squash manually before merging :-)
Blocks: 1282906
(In reply to Ed Morley [:emorley] from comment #36)
> There's not much we can do about it now - but don't forget to either use the
> squash and merge feature or else ask the PR author to squash manually before
> merging :-)

Ah sorry this is one commit, but just with all the commit messages combined (I think these can be edited via the input field when using squash+merge). I did double check against the PR, but when using the squash+merge feature, the original PR still shows the multiple commits, not the single squashed one.
Attachment #8751911 - Flags: review?(cdawson) → review+
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/97211859070710e633ea1b71fcee60d9d272d305
Bug 1254325 - Fixing the small UI error for TaskCluster Add New Jobs (#1625)

* Bug 1254325 - Fixing the small UI error for TaskCluster Add New Jobs
Attachment #8766052 - Flags: review+
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Can we keep this bug open until we're fully live? or rename the bug and use it a different one?

For external viewers it might be confusing to see the bug marked as fixed.
It is fully live, unless I'm missing something?

If there is remaining work, I agree adjusting the bug summary to reflect what part was completed here would be good. (I'd prefer not to keep the bug open, it's harder to track what has and hasn't landed)
This bug's title proposal: Add to Treeherder's try runnable API TaskCluster tasks

New bug: Enable Treeherder's ability to add TaskCluster jobs to try pushes

martianwars: does this work for you?
Flags: needinfo?(kalpeshk2011)
I'd say the better approach would be,

This bug: Change the runnable_jobs API to include TaskCluster jobs as well.
New bug: ^ Like you said. Out there we will add a patch which allows us to see TC jobs in the UI.
Flags: needinfo?(kalpeshk2011)
Component: Treeherder → Treeherder: API
Summary: Give Treeherder the ability to add TaskCluster jobs to try pushes → Change the runnable_jobs API to include TaskCluster jobs as well
Filed bug 1284911 and CCed everyone who was in this bug.

I'm also porting some of the dependencies to that other bug.

Thanks you all for your hard work in here!
No longer blocks: 1080265
No longer depends on: 1281062
I've just spotted no tests landed with this change. I don't suppose you could file a new bug that depends on this one, so we remember to add some?
Flags: needinfo?(kalpeshk2011)
Blocks: 1285924
Flags: needinfo?(kalpeshk2011)
Blocks: 1284911
Depends on: 1288053
Duplicate of this bug: 1108154
You need to log in before you can comment on or make changes to this bug.