bugzilla.mozilla.org has resumed normal operation. Attachments prior to 2014 will be unavailable for a few days. This is tracked in Bug 1475801.
Please report any other irregularities here.

make sure buildbot bridge message acking makes sense

RESOLVED FIXED

Status

Release Engineering
General
RESOLVED FIXED
3 years ago
2 months ago

People

(Reporter: bhearsum, Assigned: bhearsum)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [bbb])

Attachments

(1 attachment, 1 obsolete attachment)

(Assignee)

Description

3 years ago
Right now, the Buildbot Bridge acks messages from both the Taskcluster and Buildbot queues even if an exception is raised. This probably makes sense for some cases (eg, if a bug in the bridge tries to do something that continually raises an exception, we don't want the message to stick around forever), but maybe not in others. This needs to be given more thought, and the logic adjusted.

It might even be best to delegate message acking to the BuildbotListener and TCListener classes, rather than having the ListenerService base do it.
(Assignee)

Comment 1

3 years ago
Selena had some good thoughts on this in bug 1135192:
> What happens when we get a partial failure between acking a pulse message,
> claiming a task and updating a row in mysql? Should that group of events be
> wrapped in try/finally blocks to clean up after itself if there's a partial
> failure?

I think she's right - this (and any other place where we writing changes to multiple systems) should be wrapped, and acks should probably be conditional on successfully completing everything. The only question might be whether or not all of the operations are undo-able....
(Assignee)

Comment 2

3 years ago
Created attachment 8609375 [details] [diff] [review]
improve message acking

The philosophy I took here was to ack messages if everything completes successfully, as soon as we do something that's not repeatable (eg, most Taskcluster requests), or if we hit an error in the data (eg, lots of the cases in handleFinished).

I also realized that the buildername checking I recently added to BuildbotListener could be done outside of the loops in handleStarted/Finished, so I moved them there.
Attachment #8609375 - Flags: review?(sdeckelmann)
Comment on attachment 8609375 [details] [diff] [review]
improve message acking

Review of attachment 8609375 [details] [diff] [review]:
-----------------------------------------------------------------

::: bbb/services.py
@@ +58,5 @@
> +        for allowed in self.allowed_builders:
> +            if re.match(allowed, buildername):
> +                log.debug("Builder %s matches an allowed pattern", buildername)
> +                break
> +        else:

Hah! So, if self.allowed_builders is empty do we want this to run? I'm guessing so. That's a funny config issue, but maybe documented? for/else is tricksy. 

What if we turned this into a function instead? 

like: 

def allow_builder(self, buildername):
  is_allowed = False
  for allowed in self.allowed_builders:
     if re.match(allowed, buildername):
        log.debug("Builder %s matches an allowed pattern", buildername)
        is_allowed = True
        break
   return is_allowed

if not self.allow_builder(buildername):
  log.debug("Builder %s does not match any pattern, ignoring it", buildername)
  msg.ack()
  return

@@ +134,5 @@
> +        for allowed in self.allowed_builders:
> +            if re.match(allowed, buildername):
> +                log.debug("Builder %s matches an allowed pattern", buildername)
> +                break
> +        else:

See previous comment about using a function for this guy.

@@ +165,5 @@
>              log.info("Buildbot results are %s", results)
>              if results == SUCCESS:
>                  log.info("Marking task %s as completed", taskid)
>                  self.tc_queue.reportCompleted(taskid, runid)
> +                msg.ack()

Should we be setting acked = True in these clauses? it appears that we'd try to double-ack once we make it through this conditional block.

@@ +201,4 @@
>                  self.bbb_db.deleteBuildRequest(brid)
>  
> +        # If everything went well and the message hasn't been acked, do it. This could
> +        # happen if the "WEIRD" conditions is hit in every iteration of the loop

Is this comment correct? From reading, it seems like we could hit this if results == SKIPPED also.

@@ +358,4 @@
>                  log.debug("Builder %s matches an allowed pattern", buildername)
>                  break
>          else:
>              log.info("Builder %s does not match any pattern, rejecting it", buildername)

See above about for/else :)
(Assignee)

Comment 4

3 years ago
(In reply to Selena Deckelmann :selenamarie :selena from comment #3)
> Comment on attachment 8609375 [details] [diff] [review]
> improve message acking
> 
> Review of attachment 8609375 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: bbb/services.py
> @@ +58,5 @@
> > +        for allowed in self.allowed_builders:
> > +            if re.match(allowed, buildername):
> > +                log.debug("Builder %s matches an allowed pattern", buildername)
> > +                break
> > +        else:
> 
> Hah! So, if self.allowed_builders is empty do we want this to run? I'm
> guessing so. That's a funny config issue, but maybe documented? for/else is
> tricksy. 

If allowed_builders is empty I think it make sense that we ignore _everything_. With that being the case I think this code is correct. However...

> What if we turned this into a function instead? 
> 
> like: 
> 
> def allow_builder(self, buildername):
>   is_allowed = False
>   for allowed in self.allowed_builders:
>      if re.match(allowed, buildername):
>         log.debug("Builder %s matches an allowed pattern", buildername)
>         is_allowed = True
>         break
>    return is_allowed
> 
> if not self.allow_builder(buildername):
>   log.debug("Builder %s does not match any pattern, ignoring it",
> buildername)
>   msg.ack()
>   return

...this is much more readable anyways, so it's worth doing.

> @@ +165,5 @@
> >              log.info("Buildbot results are %s", results)
> >              if results == SUCCESS:
> >                  log.info("Marking task %s as completed", taskid)
> >                  self.tc_queue.reportCompleted(taskid, runid)
> > +                msg.ack()
> 
> Should we be setting acked = True in these clauses? it appears that we'd try
> to double-ack once we make it through this conditional block.
> 
> @@ +201,4 @@
> >                  self.bbb_db.deleteBuildRequest(brid)
> >  
> > +        # If everything went well and the message hasn't been acked, do it. This could
> > +        # happen if the "WEIRD" conditions is hit in every iteration of the loop
> 
> Is this comment correct? From reading, it seems like we could hit this if
> results == SKIPPED also.

It's....partially correct :). I'll update it.
(Assignee)

Comment 5

3 years ago
Created attachment 8609514 [details] [diff] [review]
factor out allowed_builders parsing; fix checking of previous acks

I think this addresses all of your concerns. I implemented allow_builders a bit differently, using an early return instead of a break (for short functions this reads better to me). I fixed up the "acked" stuff too, and discovered that I can check msg.acknowledge rather than maintaining a local variable to track it.
Attachment #8609375 - Attachment is obsolete: true
Attachment #8609375 - Flags: review?(sdeckelmann)
Attachment #8609514 - Flags: review?(sdeckelmann)
Attachment #8609514 - Flags: review?(sdeckelmann) → review+
(Assignee)

Updated

3 years ago
Attachment #8609514 - Flags: checked-in+
(Assignee)

Updated

3 years ago
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
(Assignee)

Updated

3 years ago
Whiteboard: [bbb]
Component: General Automation → General
Product: Release Engineering → Release Engineering
You need to log in before you can comment on or make changes to this bug.