Closed Bug 1156301 Opened 6 years ago Closed 5 years ago

be able to cancel buildbot builds through taskcluster

Categories

(Release Engineering :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

References

Details

Attachments

(2 files)

As things stand right now, Buildbot jobs that have a corresponding TaskCluster task can be cancelled in TaskCluster, which will put the task in an exception state, but the Buildbot Build will continue to run. The Buildbot bridge should be able to cancel the corresponding Build to avoid unnecessary work (and potential confusion when a "cancelled" job completes).

One complication here might be how to attach artifacts for the cancelled job. IIRC, Taskcluster immediately puts the task in an exception state, which means we can't attach any artifacts to it, which Buildbot (and the bridge) would want to do after the Build stops.
1) a task-exception message is published when a task is cancelled, identified with {reason: 'canceled'}.
So being notified shouldn't be an issues.

2) As for artifacts after task resolution see: bug 1148860.
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #1)
> 1) a task-exception message is published when a task is cancelled,
> identified with {reason: 'canceled'}.
> So being notified shouldn't be an issues.
> 
> 2) As for artifacts after task resolution see: bug 1148860.

Thanks Jonas!
Depends on: 1148860
Depends on: 1156810
I've got a patch in the works for this. I thought this would depend on bug 1156810, but it turns out we can work around it by just setting "X-Remote-User" on the client side.
Depends on: 1164878
Depends on: 1166307
I still need to finish testing my patch for the bridge, but this is certainly needed.
Attachment #8607586 - Flags: review?(dustin)
Attachment #8607586 - Flags: review?(dustin) → review+
Attachment #8607586 - Flags: checked-in+
Selena, I'm not sure if you're comfortable reviewing this since it deals with a lot of interactions with BuildAPI, but you're probably the best reviewer for the bridge parts! If you want someone else to validate the buildapi stuff just let me know.

I don't have too much to say up front about this patch - there's a lot of inline comments about how it works and the edge cases that I know it won't deal with just yet. The high level overview is that this patch gets the TCListener paying attention to cancellation events sent by Taskcluster (via the task-exception exchange), and cancels the relevant Buildbot Build or BuildRequest upon receiving one.

There's a little bit of hackiness around the branch names because "branch" means different things to different systems. Taskcluster generally talks about full URLs to repos, Buildbot generally uses the path part of the URL (eg: integration/mozilla-inbound), and BuildAPI uses the "short" name (eg: mozilla-inbound). So there's some translation that needs to happen at times.
Attachment #8610121 - Flags: review?(sdeckelmann)
Turns out that bug 1148860 doesn't block us here, but we should make some improvements after it's fixed.
No longer depends on: 1148860
No longer blocks: bbb
Comment on attachment 8610121 [details] [diff] [review]
support cancellation of buildbot builds/buildrequests through taskcluster

Review of attachment 8610121 [details] [diff] [review]:
-----------------------------------------------------------------

A couple nits on documentation and TODOs. 

One question I had was whether BuildAPI returns immedidately for DELETE calls.  It's a TODO rather than a refactor if so.

::: bbb/runner.py
@@ -76,5 @@
>  
>      signal(SIGTERM, handle_sigterm)
>  
>      log.info("Running %s service", args.service[0])
> -    # TODO: If we're not going to run with supervisor or something similar,

Did something change here?

::: bbb/servicebase.py
@@ +30,5 @@
> +        # background on this.
> +        url = "%s/%s" % (self.base_uri, url)
> +        log.debug("Making %s request to %s", method, url)
> +        r = requests.request(method, url, headers={"X-Remote-User": "buildbot-bridge"})
> +        # TODO: should we raise here? Maybe return the return code so the consumer

We are raising-- maybe revise this TODO. Do consumers want/need to explicitly know about a 200?

@@ +39,5 @@
> +            log.exception("Caught exception:")
> +
> +    def cancelBuild(self, branch, id_):
> +        url = "%s/build/%s" % (branch, id_)
> +        self._do_request("DELETE", url)

I'm assuming this returns immediately? If so, do we need to check that it worked?  If not, could this process get stalled for a while waiting for the task to complete?

::: bbb/services.py
@@ +158,5 @@
>              expires = arrow.now().replace(weeks=1).isoformat()
> +            try:
> +                createJsonArtifact(self.tc_queue, taskid, runid, "public/properties.json", properties, expires)
> +            except TaskclusterRestFailure:
> +                # Probably tried to create an artifact for a completed job. This can be reworked

Do you want a TODO marker here?

@@ +160,5 @@
> +                createJsonArtifact(self.tc_queue, taskid, runid, "public/properties.json", properties, expires)
> +            except TaskclusterRestFailure:
> +                # Probably tried to create an artifact for a completed job. This can be reworked
> +                # after bug 1148860 is fixed.
> +                log.exception("Caught exception when creating an artifact for %s, not retrying...", taskid)

If we think this is most likely a completed job, could we also put that into the log message?

@@ +207,5 @@
>                  # TODO: runid might be wrong for the rerun for a period of time because we don't update it
>                  # until the TCListener gets the task-pending event. Maybe we should update it here too/instead?
>                  self.tc_queue.rerunTask(taskid)
>              elif results == CANCELLED:
> +                # We could end up in this block for two different reasons:

Excellent explanation!
Attachment #8610121 - Flags: review?(sdeckelmann) → review+
(In reply to Selena Deckelmann :selenamarie :selena from comment #8)
> One question I had was whether BuildAPI returns immedidately for DELETE
> calls.  It's a TODO rather than a refactor if so.

Yeah, it returns right away and processes the request in another process.

> ::: bbb/runner.py
> @@ -76,5 @@
> >  
> >      signal(SIGTERM, handle_sigterm)
> >  
> >      log.info("Running %s service", args.service[0])
> > -    # TODO: If we're not going to run with supervisor or something similar,
> 
> Did something change here?

Naw, just removing this comment because we settled on supervisord, so it's not relevant anymore.

> ::: bbb/servicebase.py
> @@ +30,5 @@
> > +        # background on this.
> > +        url = "%s/%s" % (self.base_uri, url)
> > +        log.debug("Making %s request to %s", method, url)
> > +        r = requests.request(method, url, headers={"X-Remote-User": "buildbot-bridge"})
> > +        # TODO: should we raise here? Maybe return the return code so the consumer
> 
> We are raising-- maybe revise this TODO. Do consumers want/need to
> explicitly know about a 200?

Actually, I don't think we're raising because this code ends up catching the RequestException...
(In reply to Ben Hearsum [:bhearsum] from comment #9)
> (In reply to Selena Deckelmann :selenamarie :selena from comment #8)

> > ::: bbb/servicebase.py
> > @@ +30,5 @@
> > > +        # background on this.
> > > +        url = "%s/%s" % (self.base_uri, url)
> > > +        log.debug("Making %s request to %s", method, url)
> > > +        r = requests.request(method, url, headers={"X-Remote-User": "buildbot-bridge"})
> > > +        # TODO: should we raise here? Maybe return the return code so the consumer
> > 
> > We are raising-- maybe revise this TODO. Do consumers want/need to
> > explicitly know about a 200?
> 
> Actually, I don't think we're raising because this code ends up catching the
> RequestException...

Oh sorry. Thinko. Thanks. 

Nothing to add.  I haven't thought through whether that would be helpful for the consumer :/
Comment on attachment 8610121 [details] [diff] [review]
support cancellation of buildbot builds/buildrequests through taskcluster

I improved the comments/log message around the failure to create artifacts in https://github.com/bhearsum/bbb/commit/b94d4b678d8cf16a29cb4f577ced68000074d459. I'm going to bump the version number and get this deployed now!
Attachment #8610121 - Flags: checked-in+
Deployed in production!
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.