Closed Bug 1113281 Opened 9 years ago Closed 9 years ago

Treeherder should use Pulse to notify consumers when a "Job Retrigger or Cancel" has been requested.

Categories

(Tree Management :: Treeherder: API, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bc, Assigned: jlal)

References

Details

Attachments

(2 files)

Treeherder currently supports retriggering tests via the self-serve buildbot api. A Treeherder user can select a job, and from the job details panel request that the job be re-run. This is currently only supported for buildbot managed jobs.

Non-buildbot jobs also need to be able to respond to a retrigger request from the Treeherder UI. I suggest that Treeherder publish a message to Pulse in such a way that a non-buildbot test framework can listen using pulsebuildmonitor and initiate a retest of the job in response to the message.
We'd previously discussed other job types specifying the api endpoints for the retrigger request. I can't say off the top of my head which solution is preferable, but either way we should probably not implement both.

This overlaps with bug 1077053 and bug 1076687.
See Also: → 1076687
See Also: → 1077053
Component: Treeherder → Treeherder: API
Reading the linked bugs and skimming the TaskCluster documentation certainly gives me an appreciation for TaskCluster's scope and design. It also pointed out to me that I missed the need for handling Cancel requests.

Without attempting to influence TaskCluster's design or implementation, it seems to me that having Treeherder call directly into a non-buildbot test framework's api tightly couples Treeherder and the test framework requiring any new framework and Treeherder itself to deal with the same issues being discussed in the linked bugs.

I think having Treeherder produce messages for retriggers, cancels or other possible user driven actions that are used to control the operation of the test framework provides the benefit of allowing any test framework to listen to the messages and respond appropriately without having to coordinate adding endpoints for Treeherder to call into.

The non-buildbot test framework can populate the Treeherder Job Info with the information it needs to process messages and Treeherder can send the Job Info back to the framework via the message.

Authentication would have two levels:

1. Who can write the appropriate messages to the message queue.

   This could be tightly controlled and only accessible to a limited set of tools such as
   Treeherder.

2. Who can issue retriggers, cancels, ... from the Treeherder UI.

   This can be controlled by whatever means Treeherder decides is appropriate... LDAP, 
   Persona, ...
Summary: Treeherder should use Pulse to notify consumers when a "Job Retrigger" has been requested. → Treeherder should use Pulse to notify consumers when a "Job Retrigger or Cancel" has been requested.
jlal, if the ability to submit retriggers and cancels was authenticated in Treeherder and would this approach work for TaskCluster?
Flags: needinfo?(jlal)
Yes! Actually this would be trivial to implement if we had pulse messages ... The other approach I have considered recently is custom UI hooks but this would certainly be easier/better/scalable..

+1
Flags: needinfo?(jlal)
NI? Ed,

Short term we are nearing the point where we badly need retriggers for TC my current thoughts where to workaround this purely at a UI level in the least intrusive way possible this obviously does not help the aim of making treeherder the api to help you this kind of stuff however...

I like this approach if Ed (and perhaps mdoglio) like this I would be happy to implement it (baring someone else feeling excited about this) and hopefully this would be easy for other consumers (like autophone) to use.

The question here is "Do we feel comfortable enforcing treeherder level login to run cancel/retrigger opertions" I believe the answer is yes ?
Flags: needinfo?(emorley)
(In reply to James Lal [:lightsofapollo] from comment #5)
> The question here is "Do we feel comfortable enforcing treeherder level
> login to run cancel/retrigger opertions" I believe the answer is yes ?

Yeah that's fine - and is the direction a more complete service-side solution would take too :-)
Flags: needinfo?(emorley)
Blocks: 1080265
Taking this one... This is pretty easy now that we have some code that publishes to pulse... The piece I don't know offhand is checking for persona auth. 

The routes I am thinking now look like:

<prefix>.<action>.<project>.<revision_hash>
Status: NEW → ASSIGNED
about the routing key.

prefix: ?

action: would be something like retrigger or cancel?

project: is this the repository?

revision_hash: would you really want to create listeners for specific revision_hashes? Wouldn't this be better in the payload?

I think having an identifier for the test framework, say the treeherder group name, in the routing key is important so that frameworks can filter only for their own messages.

about the payload: Is there an api call we can make to treeherder to return information about the job identified by the revision_hash? Or should we have treeherder package that information directly in the payload to save the framework from having to look it up each time?
Flags: needinfo?(mdoglio)
Flags: needinfo?(jlal)
Prefix is something like treeherder-staging (see https://wiki.mozilla.org/Auto-tools/Projects/Pulse/Exchanges)

> project: is this the repository? 
Project as defined by treeherder (b2g-inbound / mozilla-central / etc...)

> revision_hash: would you really want to create listeners for specific revision_hashes? Wouldn't this be better in the payload?


After sleeping on this for a bit I came up with some alternative ideas...

<prefix>.<project>.<action>.< routing key >

In the artifact payload you specify something like this:

{
  "routing_key": "autophone",
  "payload": {
    // ... whatever you want to get mirrored back to you when cancel / retrigger is sent
  }
}

>about the payload: Is there an api call we can make to treeherder to return information about the job >identified by the revision_hash? Or should we have treeherder package that information directly in the >payload to save the framework from having to look it up each time?

We should likely include some generic information in all payloads such as job_guid / revision_hash / project
Flags: needinfo?(jlal)
> publishes to pulse... The piece I don't know offhand is checking for persona

The authorization check is done in the api endpoints code
https://github.com/mozilla/treeherder-service/blob/master/treeherder/webapp/api/jobs.py#L89-L130
Flags: needinfo?(mdoglio)
For the payload I agree with James, I would use project, job_guid and revision_hash. I would also add a url to the job detail to make it trivial to retrieve further info.
Depends on: 1133580
Blocks: 1133580
No longer depends on: 1133580
server side
Attachment #8565821 - Flags: review?(mdoglio)
Okay the approach I took is a bit different then the above but fairly close...

## On the server:

When retrigger/cancel (cancel also works with cancel all) an event is sent with the routing key:

<prefix>.<build_system_type>.<project>.<action>

The payload contains job id, job guid, project, action and _requester_ the important piece here is I currently assume that we are using browser id on auth ( I don't think these endpoints will work over oauth ) and we transmit the user data over for book keeping on the other side (an example is taskcluster would eventually like to keep track of who is spending what)

## In the UI

(cancel workflow is the same for the most part)

retrigger workflow:

 -> call new retrigger endpoint
 -> call buildapi if buildapi artifact is present

As a follow up item we should move the buildapi calls entirely into treeherder server side but that requires encoding ldap somewhere (which I would like to avoid initially)

---

Aside from the targeted changes I completely changed how we configure pulse (which let me add tests for the above stuff) so this will require some changes on the puppet side to enable this...
Ah note - I did _not_ include job details url but I did include job id into all the action payloads which lets you build the url yourself as well as run a few other actions.
Assignee: nobody → jlal
Comment on attachment 8565821 [details] [review]
https://github.com/mozilla/treeherder-service/pull/376

Consider this an r+ once a minor code duplication gets removed
Attachment #8565821 - Flags: review?(mdoglio) → review+
Commits pushed to master at https://github.com/mozilla/treeherder-service

https://github.com/mozilla/treeherder-service/commit/38e066918700a61dee1cafa0ee46e60b0e6aa6b8
Bug 1113281 - Add pulse messages when after cancel/retrigger r=maurodoglio

 - Rename configs to make it easier to run tests without pulse
 - Fix scoping bug (all messages used to be published under same exchange!)

https://github.com/mozilla/treeherder-service/commit/65f1c2243bf9f44e13234f975407e5890f7ed5fd
Merge pull request #376 from lightsofapollo/bug-1113281

Bug 1113281 - Send event of actions over pulse (retrigger/cancel)
Attachment #8565822 - Flags: review?(cdawson) → review+
Commits pushed to master at https://github.com/mozilla/treeherder-ui

https://github.com/mozilla/treeherder-ui/commit/f48584b985cc900f64b761d626fd1ff7047f5f95
Bug 1113281 - Call new retrigger endpoint and always allow both retrigger/cancel operations. r=camd

https://github.com/mozilla/treeherder-ui/commit/3103fad8825663c2372d9b2626391a122d7bf3de
Merge pull request #366 from lightsofapollo/bug-1113281

Bug 1113281 - Update retrigger logic to always hit new retrigger endpoint
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
James, would you mind opening another bug (or reopening this one) to improve the UX here? The error message needs to be more actionable & ideally for buildbot jobs we don't give an error if logged in (though eventually people will need to be logged in, once buildbot jobs are handled service-side too, so not sure if this is worth the bother). Thanks :-)

<bz> So I'm trying to retrigger a job on treeherder
<bz> I click the little "reload" icon
<bz> I get this message, in red: "Unable to send retrigger: Authentication credentials were not provided"
<bz> What's up with that?
<jgraham> I got that yesterday until I logged in with Persona
<bz> Um
<bz> So just for context, I can retrigger fine from tbpl
— bz tries signing in
<bz> ok
<bz> So it just won't prompt me to log in
<bz> I see
<bz> So.....
<bz> 1)  You have to be logged in to retrigger.
<bz> 2)  The error message if you're not says nothing about logging in.
<edmorley> bz: so we're moving to a model where the retrigger is done service-side, since we need to for taskcluster, and as an added bonus it will mean not having to auth twice in treeherder. It seems like bug 1113281 and friends affected buildbot _and_ taskcluster jobs. I suspect your retrigger request went through anyway, you're just getting the error from the new
<edmorley> route through the service, that works in parallel
Flags: needinfo?(jlal)
Blocks: 1136738
Blocks: 1137519
See https://bugzilla.mozilla.org/show_bug.cgi?id=1136738 should make the error message nicer at least.
Flags: needinfo?(jlal)
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/cfadebcf1f6d3eb8a45e7113e428af35c31df3db
Bug 1113281 - Call new retrigger endpoint and always allow both retrigger/cancel operations. r=camd

https://github.com/mozilla/treeherder/commit/6c739b77c5dbbda52457e775ce85b47772e29ca0
Merge pull request #366 from lightsofapollo/bug-1113281

Bug 1113281 - Update retrigger logic to always hit new retrigger endpoint
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: