1135192 - (bbb) buildbot <-> taskcluster bridge

Reporter

Description

•

9 years ago

It will greatly simplify migrating things to taskcluster if we can have a bridge between the two systems.

The rough idea is to listen for new tasks for a certain workerType, e.g. "buildbot". The task must specify the proper buildername and any properties the builder expected. The bridge then injects this information into the buildbot scheduler database, and buildbot takes over from there.

The bridge also needs to handle:
- regularly re-claiming the tasks in taskcluster while they're running
- handling job completion by calling reportCompleted/reportFailed
- handling job retries by calling reportFailed/rerunTask
- handling job coalescing by calling reportCompleted/reportFailed/rerunTask on all the coalesced tasks (ew!)
- posting job artifacts for the resulting buildbot status and properties with createArtifact

James Lal [:lightsofapollo]

Comment 1

•

9 years ago

+ a number of people from a-team...

The benefits of scheduling go beyond just scheduling jobs in this world we would have real task ids for each buildbot job which means we can easily do things like generate artifacts on the fly to update treeherder etc...

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 2

•

9 years ago

Catlee and I were talking about this some more today, mostly in the context of moving release automation scheduling into taskcluster while keeping most of the jobs in buildbot. He's got an implementation that works on his laptop, but it needs more testing and some tweaks based on what we talked about today, including:
* Don't set TC task state to "running" until a buildbot slave claims a job
* Make sure Buildbot+TC retries interact well. Eg, we don't want them both causing retries. My suggestion here is to always set retries to 0 in TC and only handle them in Buildbot. If we see Buildbot RETRY or EXCEPTION states, should report that as a failure to TC (to avoid triggering it's built-in retry logic).

Release automation is probably a very good way to validate the bridge, because it is our most complicated set of scheduling logic. I'm thinking that if I'm able to get it working (including testing of job failures, retries, exceptions, and other error states/edge cases) from tagging through en-US builds, that we should be good to deploy this.

What do you think Chris, is that enough to give you confidence about running it in production?

Assignee: nobody → bhearsum

Flags: needinfo?(catlee)

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Blocks: 1150162

Chris AtLee [:catlee]

Reporter

Comment 3

•

9 years ago

I'd definitely like to see tests of release automation working with all kinds of different failure modes before we run it in production.

Another test I was thinking of doing was disabling buildbot scheduling on one project branch and driving everything through TC.

Flags: needinfo?(catlee)

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 4

•

9 years ago

I'm now at the point where I have bbb reliably working for simple test cases. One caveat is that our Buildbot masters seem to ignore db_poll_interval (perhaps due to the TimerService instances going out of scope quickly: http://mxr.mozilla.org/build/source/buildbot/master/buildbot/master.py#1043). This means that jobs created by the bridge will not be picked up until the master checks for new jobs for another reason (build completed, slave reconnects). This tripped me up in dev, but I don't think it will be an issue in production...at least not immediately.

Dustin J. Mitchell [:dustin] (he/him)

Comment 5

•

9 years ago

They're going out of scope but are still referenced by their parent

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 6

•

9 years ago

Oops, turns out I was misreading configuration - db_poll_interval actually wasn't set on my master. Fixing that gets things firing as expected.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 7

•

9 years ago

Jonas and I talked a lot about how buildbot and TC states match up today. Here's what we came up with:
* Buildbot's SUCCESS -> TC's reportCompleted. Both of these currently result in green on TH.
* Buildbot's WARNINGS or FAILURE -> TC's reportFailed on TH. WARNINGS and FAILURE are orange and red respectively on TH. TC doesn't have enough status' to let us report these different to it, so this is the best we can do for now. Jonas and James have some ideas about to make this possible, and we can update the bridge to make use of it when it exists.
* Buildbot's EXCEPTION -> TC's reportException("malformed-syntax"). Both of these go purple on TH. "malformed-exception" isn't exactly correct, but it's the only TC exception that doesn't automatically retry.
* Buildbot's RETRY -> reportException("worker-shutdown"). Both of these are blue on TH. "worker-shutdown" is essentially "buildslave lost". That's not always why we retry in buildbot, but it's close enough IMO. Getting the TH colour is more important anyways.
* Buildbot's CANCELLED -> cancelTask. Both of these are pink on TH. One crucial note here is it that job cancellation will ONLY work through Buildbot means (eg, buildapi). CANCELLED status will propagate to TC, but trying to cancel a buildbot-bridge job in TC will do nothing. We might be able to get the bridge to support that, but it's not something I want to tackle initially.

Ed, James suggested that I talk to you about the merging of the orange/red. Is that something that we need to find a solution for before we can use TC as the data source for jobs that run in Buildbot? Given that we've got quite a few B2G jobs in TC already makes me wonder what sort of workarounds we have for that already.

Flags: needinfo?(emorley)

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 8

•

9 years ago

(In reply to Ben Hearsum [:bhearsum] from comment #7)
> * Buildbot's RETRY -> reportException("worker-shutdown"). Both of these are
> blue on TH. "worker-shutdown" is essentially "buildslave lost". That's not
> always why we retry in buildbot, but it's close enough IMO. Getting the TH
> colour is more important anyways.

When I was implementing this I realized that we can't use worker-shutdown because it will end up with two runs. Buildbot's RETRY status will automatically start one, and then TC will create one. We'll have to change this to malformed-payload + rerunTask.

Ed Morley [:emorley]

Comment 9

•

9 years ago

(In reply to Ben Hearsum [:bhearsum] from comment #7)
> Ed, James suggested that I talk to you about the merging of the orange/red.
> Is that something that we need to find a solution for before we can use TC
> as the data source for jobs that run in Buildbot? Given that we've got quite
> a few B2G jobs in TC already makes me wonder what sort of workarounds we
> have for that already.

I'm fine with it, but I'd check with the sheriffs :-)

Flags: needinfo?(emorley)

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 10

•

9 years ago

Chris, what do you think about the new structure of the services on my refactor branch? https://github.com/bhearsum/bbb/tree/refactor

I've gotten rid of the 3 services blobbed into one class and did a simple inheritance structure for the services, with the databases getting their own classes that are instantiated by the base. Seems to have a good amount of code re-use without the confusion. I'd kindof like a new name for the Reclaimer since it needs to deal with reflecting Buildbot cancellations into TC now - I haven't been able to come up with anything though.

Flags: needinfo?(catlee)

Chris AtLee [:catlee]

Reporter

Comment 11

•

9 years ago

Looks basically sane to me!

Flags: needinfo?(catlee)

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 12

•

9 years ago

I finished work on my refactor branch today. It now has the same functionality as the old code, except in a (hopefully) more maintainable design, and has some tests.

Here's a braindump of a bunch of things that are left to do:
* Figure out where/how to deploy each component
* Add some tests for more complex behaviours
* Add support for for forwarding cancellations done in TC to Buildbot
* Make sure the current behaviour of "always ack messages, even if there's an exception" makes sense
* Re-enable SSL
* Make sure taskId is in build properties somewhere (needed for some jobs to create artifacts)
* Follow-up with jlal about how TC retriggers should be handled

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Depends on: 1154423

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Depends on: 1155343

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Alias: bbb

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Depends on: 1156296

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Depends on: 1156297

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Depends on: 1156301

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Depends on: 1156303

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Depends on: 1156305

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Depends on: 1156408

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Depends on: 1156773

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Depends on: 1157242

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Depends on: 1157310

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Blocks: 1158240

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Depends on: 1158243

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Depends on: 1158271

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Depends on: 1161554

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Depends on: 1161583

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 13

•

9 years ago

Attached patch initial version of buildbot bridge — Details — Splinter Review

Here's the initial version of the Buildbot Bridge for review. There's a couple of things that I talked about last week that aren't included here because I've left them on a different branch until I was able to test them more thoroughly. I'll be putting patches for individual issues in specific bugs after this initial review cycle is complete.

Attachment #8604072 - Flags: review?(sdeckelmann)

Attachment #8604072 - Flags: review?(jopsen)

Rail Aliiev [:rail]

Comment 14

•

9 years ago

Comment on attachment 8604072 [details] [diff] [review]
initial version of buildbot bridge

Review of attachment 8604072 [details] [diff] [review]:
-----------------------------------------------------------------

::: /tmp/empty/bbb/tcutils.py
@@ +41,5 @@
> +        raise IOError("couldn't upload artifact to s3")
> +
> +def makeTaskId():
> +    """Used in testing to generate task ids without talking to TaskCluster."""
> +    return b64encode(uuid4().bytes).replace("+", "-").replace("/", "-").rstrip("=")

I think this can be replaced by https://github.com/taskcluster/taskcluster-client.py/blob/5e1b1db4b3ecc84251afa781775ccacfbcbcc876/taskcluster/utils.py#L82-L85

Jonas Finnemann Jensen (:jonasfj)

Comment 15

•

9 years ago

Comment on attachment 8604072 [details] [diff] [review]
initial version of buildbot bridge

Review of attachment 8604072 [details] [diff] [review]:
-----------------------------------------------------------------

I strongly recommend fixing:
 A) reclaim timeout to not ever reclaim a before takenUntil < now() - 5min
    (We can still iterate each 60s, just skip some of the reclaims)
 B) define provisionerId in config, and embed it in your routing_key for pulse

With (A) and (B) addressed you have my r+ to ship :)

::: /tmp/empty/bbb/services.py
@@ +139,5 @@
> +                # TODO: can we use worker-shutdown instead of rerunTask here? We used malformed-payload
> +                # before because TCListener didn't know how to skip build request creation for
> +                # reruns....
> +                # using worker-shutdown would probably be better for treeherder, because
> +                # the buildbot and TC states would line up better.

This todo says it all.

I don't think it's critical to do. As long as the behaviour is well-defined in documentation.
When we at some point want to use TC treeherder integration, we can review this again.
As I understood we would like more TH states, we're interested in a th-state artifact, and haven't yet
figured out how we would slowly turn off TH ingestion from bb.

@@ +224,5 @@
> +                    log.warn("Too many buildbot builds? runId is %i but we have %i builds", t.runId, len(builds))
> +
> +                log.info("BuildRequest is in progress, reclaiming")
> +                try:
> +                    result = self.tc_queue.reclaimTask(t.taskId, int(t.runId))

Do not reclaim excessively.

I don't know how slow this loop is, obviously is synchronous, so I might take a while to run through it.
But if you run through the loop ever 60s, and one trip through the loop doesn't take more than 120s, then 
there is no reason reclaim on each iteration.

I recommend that you reclaim tasks **only** if (takenUntil < now() + 5min).

This should be addressed before we ship this.
Excessive reclaims costs resources and slows down this loop making the application less reliable.

> if int(t.takenUntil) < (arrow.utcnow().timestamp - 5 * 60):
>   try:
>     result = self.tc_queue.reclaimTask(t.taskId, int(t.runId))
>     ...

@@ +261,5 @@
> +            ),
> +            ListenerServiceEvent(
> +                queue_name="%s/task-exception" % pulse_queue_basename,
> +                exchange="%s/task-exception" % pulse_exchange_basename,
> +                routing_key="*.*.*.*.*.*.%s.#" % worker_type,

You should insert `provisionerId` in this routing key (and in the routing_key for task-pending too).
See: http://docs.taskcluster.net/queue/exchanges/#taskPending

WorkerTypes are only unique under a given provisionerId. Someone could create a buildbot-bridge workerType on aws-provisioner using docker-worker. This would obviously be a bad choice of name, but it's perfectly legal. Let keep the concept clean.

You only want tasks for the provisionerId you've assigned the buildbot-bridge, probably "buildbot-bridge".

Note, this will allow you to make a staging buildbot-bridge, by using a different provisionerId.

@@ +293,5 @@
> +        else:
> +            log.info("Builder %s does not match any pattern, rejecting it", buildername)
> +            # malformed-payload is the most accurate TC status for this situation
> +            # but we can't use reportException for cancelling pending tasks,
> +            # so this will show up as "cancelled" on TC.

Correct, you can't "reportException" for a pending task.
But if you want to, you can do claimTask(), and then reportException() with 'malformed-payload'.

Not a critical thing to do, but you would avoid the exception message with canceled reason, which you also listen for.
And the state would be more accurately reflected.

@@ +309,5 @@
> +        # for the same Task. In these cases, we don't want to do anything
> +        # except update our own runId. If we created a new BuildRequest for it
> +        # we'd end up with an extra Build.
> +        if our_task:
> +            self.bbb_db.updateRunId(our_task.buildrequestId, runid)

FYI, the first run of a task always have {runId: 0}, you could use this to assert correctness.
Ie. runId > 0 implies that you are in some form of retry/rerun.

(this is not important, just and fyi)

::: /tmp/empty/config.json.sample
@@ +13,5 @@
> +    "pulse_queue_basename": "queue/foo",
> +
> +    "tclistener": {
> +        "pulse_exchange_basename": "exchange/taskcluster-queue/v1",
> +        "worker_type": "buildbot-bridge",

you should probably also add provisionerId here.

Attachment #8604072 - Flags: review?(jopsen) → review+

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Depends on: 1164055

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Depends on: 1164056

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 16

•

9 years ago

Thanks for the review Jonas - I've got fixes for the critical things on branches already. Since you r+'ed this I'll deal with them in follow-ups (bugs 1164055 and 1164056).

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Depends on: 1164455

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Depends on: 1164527

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Depends on: 1164819

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Depends on: 1165408

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Depends on: 1165431

Selena Deckelmann :selenamarie :selena

Updated

•

9 years ago

Blocks: bb-to-tc

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Depends on: 1166250

Selena Deckelmann :selenamarie :selena

Comment 17

•

9 years ago

Comment on attachment 8604072 [details] [diff] [review]
initial version of buildbot bridge

Review of attachment 8604072 [details] [diff] [review]:
-----------------------------------------------------------------

Overall nits:

I don't know what PEP8 advises for blocks of SQL in quotes. My "strategy" for multi-line, inline SQL has been to: always put a newline after the first set of quotes, and indent the block of SQL to match the function call. I find that it makes it marginally easier to debug/copy&paste for testing. I'm not going to r- based on that - and I could contribute a patch to show what I mean.

Are we are operating in AUTOCOMMIT mode for MySQL? That's probably just an FYI for me as I'm new to this environment.

Two high level questions:

We've brought up 12factor patterns as a team -- and I'm wondering about supporting things like the authentication credentials as environment variables. Like if we deployed this with Deis, we'd need to use environment variables instead of config files. I won't block deployment of this on it, but maybe something to think about down the road.

What happens when we get a partial failure between acking a pulse message, claiming a task and updating a row in mysql? Should that group of events be wrapped in try/finally blocks to clean up after itself if there's a partial failure?

::: /tmp/empty/bbb/runner.py
@@ +12,5 @@
> +        loglevel=logging.INFO
> +    )
> +    parser.add_argument("-v", "--verbose", dest="loglevel", action="store_const", const=logging.DEBUG)
> +    parser.add_argument("-q", "--quiet", dest="loglevel", action="store_const", const=logging.WARN)
> +    parser.add_argument("-c", "--config", dest="config", required=True)

Nit: may wish to specify the default filename

::: /tmp/empty/bbb/servicebase.py
@@ +36,5 @@
> +
> +    @property
> +    def tasks(self):
> +        log.debug("Fetching all tasks")
> +        tasks = self.tasks_table.select().execute().fetchall()

Do we need to be concerned about ETOOMANYTASKS?

@@ +82,5 @@
> +    def __init__(self, uri):
> +        self.db = sa.create_engine(uri, pool_recycle=60)
> +
> +    def getBuildRequest(self, brid):
> +        return self.db.execute(sa.text("select * from buildrequests where id=:brid"), brid=brid).fetchone()

may wish to not 'select *' - getting unexpected columns can be sad-making. Would it be a terrible inconvenience to specify the columns?

@@ +87,5 @@
> +
> +    def getBuildRequests(self, buildnumber, buildername, claimed_by_name, claimed_by_incarnation):
> +        now = time.time()
> +        ret = self.db.execute(
> +            sa.text("""select buildrequests.id from buildrequests join builds

nit: I see capitalization in the rest of the query, so I recommend capitalizing SELECT, FROM and JOIN.

@@ +103,5 @@
> +        log.debug("getBuildRequests Query took %f seconds", time.time() - now)
> +        return ret
> +
> +    def getBuilds(self, brid):
> +        return self.db.execute(sa.text("select * from builds where brid=:brid"), brid=brid).fetchall()

Recommend replacing '*' with a list of columnns.

::: /tmp/empty/bbb/services.py
@@ +46,5 @@
> +        Taskcluster, which will move it into the "running" state there. We
> +        also update the BBB database with the claim time which triggers the
> +        Reflector to start reclaiming it periodically."""
> +        log.debug("Handling started event: %s", data)
> +        msg.ack()

So, if some part of this fails, do we want to roll it back? Like if the database update fails, do we "unack" it?

@@ +57,5 @@
> +            brid = brid[0]
> +            try:
> +                task = self.bbb_db.getTaskFromBuildRequest(brid)
> +            except TaskNotFound:
> +                log.debug("Task not found for brid %s, nothing to do.", brid)

Does this mean that the BBB is not aware of the task? Is this because we're listening for a mix of events, some of which the BBB doesn't care about?

@@ +82,5 @@
> +        # Get the request_ids from the properties
> +        try:
> +            properties = dict((key, (value, source)) for (key, value, source) in data["payload"]["build"]["properties"])
> +        except KeyError:
> +            log.error("Couldn't get job properties")

Is there anything about the job that we could log here to help with debugging?

@@ +96,5 @@
> +
> +        try:
> +            results = data["payload"]["build"]["results"]
> +        except KeyError:
> +            log.error("Couldn't find job results")

Anything about the job we could log here to help with debugging?

::: /tmp/empty/bbb/test/test_services.py
@@ +38,5 @@
> +        self.buildbot_db = self.bblistener.buildbot_db.db
> +
> +    def testHandleStartedOneBuildRequest(self):
> +        self.buildbot_db.execute(sa.text("""
> +INSERT INTO buildrequests

nit: indentation here would be nice.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 18

•

9 years ago

FYI: Because I was able to configure the Bridge to only touch jobs on certain branches, I ended up going ahead and deploying this before all reviews were done. There's been a bit of code change since, which has all gotten review. Given that, I think it would be better to address your comments in a follow-up patch if that's okay with you. From here on out it's going to be standard write/test/review/deploy though.

(In reply to Selena Deckelmann :selenamarie :selena from comment #17)
> Comment on attachment 8604072 [details] [diff] [review]
> initial version of buildbot bridge
> 
> Review of attachment 8604072 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Overall nits:
> 
> I don't know what PEP8 advises for blocks of SQL in quotes. My "strategy"
> for multi-line, inline SQL has been to: always put a newline after the first
> set of quotes, and indent the block of SQL to match the function call. I
> find that it makes it marginally easier to debug/copy&paste for testing. I'm
> not going to r- based on that - and I could contribute a patch to show what
> I mean.

This sounds fine to me, I may as well go ahead and address it.

> Are we are operating in AUTOCOMMIT mode for MySQL? That's probably just an
> FYI for me as I'm new to this environment.

To be honest, I'm not 100% certain. Based on http://docs.sqlalchemy.org/en/latest/core/connections.html#understanding-autocommit I'm guessing that SQLAlchemy disables MySQL's AUTOCOMMIT in favour of its own.

> We've brought up 12factor patterns as a team -- and I'm wondering about
> supporting things like the authentication credentials as environment
> variables. Like if we deployed this with Deis, we'd need to use environment
> variables instead of config files. I won't block deployment of this on it,
> but maybe something to think about down the road.

Good point. I more or less copied what we do in PuppetAgain for a bunch of other things deployed to similar machines. I don't think we have a model for a 12factor type deployment there yet (and probably not even the building blocks like Docker helpers.) We should do this, but as you say, doesn't block.
 
> What happens when we get a partial failure between acking a pulse message,
> claiming a task and updating a row in mysql? Should that group of events be
> wrapped in try/finally blocks to clean up after itself if there's a partial
> failure?

bug 1156296 talks about this a bit. I think you're probably write though - this (and any other place where we writing changes to multiple systems) should be wrapped, and acks should probably be conditional on successfully completing everything. The only question might be whether or not all of the operations are undo-able....

> ::: /tmp/empty/bbb/servicebase.py
> @@ +36,5 @@
> > +
> > +    @property
> > +    def tasks(self):
> > +        log.debug("Fetching all tasks")
> > +        tasks = self.tasks_table.select().execute().fetchall()
> 
> Do we need to be concerned about ETOOMANYTASKS?

Good point. So far, no - but I expect the number of active tasks will be much much higher in the future. I'll switch this be iterative instead.

> @@ +82,5 @@
> > +    def __init__(self, uri):
> > +        self.db = sa.create_engine(uri, pool_recycle=60)
> > +
> > +    def getBuildRequest(self, brid):
> > +        return self.db.execute(sa.text("select * from buildrequests where id=:brid"), brid=brid).fetchone()
> 
> may wish to not 'select *' - getting unexpected columns can be sad-making.
> Would it be a terrible inconvenience to specify the columns?

Not a problem at all. Will do here and everywhere.

> @@ +87,5 @@
> > +
> > +    def getBuildRequests(self, buildnumber, buildername, claimed_by_name, claimed_by_incarnation):
> > +        now = time.time()
> > +        ret = self.db.execute(
> > +            sa.text("""select buildrequests.id from buildrequests join builds
> 
> nit: I see capitalization in the rest of the query, so I recommend
> capitalizing SELECT, FROM and JOIN.

Will do.

> @@ +57,5 @@
> > +            brid = brid[0]
> > +            try:
> > +                task = self.bbb_db.getTaskFromBuildRequest(brid)
> > +            except TaskNotFound:
> > +                log.debug("Task not found for brid %s, nothing to do.", brid)
> 
> Does this mean that the BBB is not aware of the task? Is this because we're
> listening for a mix of events, some of which the BBB doesn't care about?

Pretty much, yes. There will definitely be a transition period where not all jobs are being scheduled through Taskcluster/bbb, so we have nothing to do for those. It might be worthwhile getting finer grained with this since I've added a buildername regex for jobs that we care about. We could continue doing this for jobs that don't match that, and spit out a better message (or raise an alarm) for ones that *do* much and we don't know anything about.

> @@ +82,5 @@
> > +        # Get the request_ids from the properties
> > +        try:
> > +            properties = dict((key, (value, source)) for (key, value, source) in data["payload"]["build"]["properties"])
> > +        except KeyError:
> > +            log.error("Couldn't get job properties")
> 
> Is there anything about the job that we could log here to help with
> debugging?

It's probably worth dumping the entire payload now that you mention it - will do.

> @@ +96,5 @@
> > +
> > +        try:
> > +            results = data["payload"]["build"]["results"]
> > +        except KeyError:
> > +            log.error("Couldn't find job results")
> 
> Anything about the job we could log here to help with debugging?

Yeah...I'm sure there's something.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 19

•

9 years ago

Selena - is this the sort of SQL indentation you mean?

     def getBuildRequests(self, buildnumber, buildername, claimed_by_name, claimed_by_incarnation):
         now = time.time()
         ret = self.db.execute(
             # TODO: Using complete=0 sucks a bit. If builds complete before we process
             # the build started event, this query doesn't work.
-            sa.text("""select buildrequests.id from buildrequests join builds
-                       ON buildrequests.id=builds.brid
-                       WHERE builds.number=:buildnumber
-                         AND buildrequests.complete=0
-                         AND buildrequests.buildername=:buildername
-                         AND buildrequests.claimed_by_name=:claimed_by_name
-                         AND buildrequests.claimed_by_incarnation=:claimed_by_incarnation"""),
+            sa.text("""
+                    SELECT buildrequests.id FROM buildrequests JOIN builds
+                    ON buildrequests.id=builds.brid
+                    WHERE builds.number=:buildnumber
+                    AND buildrequests.complete=0
+                    AND buildrequests.buildername=:buildername
+                    AND buildrequests.claimed_by_name=:claimed_by_name
+                    AND buildrequests.claimed_by_incarnation=:claimed_by_incarnation
+            """),
             buildnumber=buildnumber,
             buildername=buildername,
             claimed_by_name=claimed_by_name,
             claimed_by_incarnation=claimed_by_incarnation,
         ).fetchall()
         log.debug("getBuildRequests Query took %f seconds", time.time() - now)
         return ret

Flags: needinfo?(sdeckelmann)

Selena Deckelmann :selenamarie :selena

Comment 20

•

9 years ago

(In reply to Ben Hearsum [:bhearsum] from comment #19)
> Selena - is this the sort of SQL indentation you mean?

Nearly!  Here's my way of handling this:

sa.text("""
    SELECT 
        buildrequests.id 
    FROM 
        buildrequests 
            JOIN builds ON buildrequests.id=builds.brid
    WHERE 
        builds.number=:buildnumber
        AND buildrequests.complete=0
        AND buildrequests.buildername=:buildername
        AND buildrequests.claimed_by_name=:claimed_by_name
        AND buildrequests.claimed_by_incarnation=:claimed_by_incarnation
""")

YMMV on that level of separation. The main things I like to see are: 
* Every item in a SELECT on a separate line for ease of troubleshooting later ("Did I map blah[2] to the right column?")
* Every JOIN ... ON on a separate line for troubleshooting bad JOINs later ("Did I accidentally create a CROSS JOIN?")
* Every WHERE clause on a separate line

I've even been known to use "comma first", but I try not to force my aesthetic sense there on others. I know that style makes some people's skin crawl!

Flags: needinfo?(sdeckelmann)

Selena Deckelmann :selenamarie :selena

Comment 21

•

9 years ago

Comment on attachment 8604072 [details] [diff] [review]
initial version of buildbot bridge

Review of attachment 8604072 [details] [diff] [review]:
-----------------------------------------------------------------

Based on Ben's responses, I r+ and can help review other patches as needed!  Thanks so much. I am super excited to see this go into production.

Attachment #8604072 - Flags: review?(sdeckelmann) → review+

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 22

•

9 years ago

Comment on attachment 8604072 [details] [diff] [review]
initial version of buildbot bridge

I'll be addressing most of the review comments here and in bug 1156296.

Attachment #8604072 - Flags: checked-in+

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 23

•

9 years ago

Attached patch move allowed builders in config — Details — Splinter Review

Both the tclistener and bblistener will be using this list after my patch that addresses your comments - we need to update the config to reflect that.

Attachment #8608299 - Flags: review?(sdeckelmann)

Selena Deckelmann :selenamarie :selena

Updated

•

9 years ago

Attachment #8608299 - Flags: review?(sdeckelmann) → review+

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Attachment #8608299 - Flags: checked-in+

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 24

•

9 years ago

Attached patch improve queries; use allowed_builders in more places; enable ssl on pulse — Details — Splinter Review

This patch should address all of your comments other than the message acking ones (which will happen in bug 1156296). Specifically:
* Use allowed_builders in the BuildbotListener too
* Use a generator when iterating over Tasks for better performance. This turned out to be fairly complicated and require using a different MySQLdb Cursor - there's inline comments about why.
* Change all queries that used "*" to return specific columns. This ended up changing the nature of some queries for the better. Eg: getBuilds turned out to only be there to get the _count_ of the builds, so I changed it to a COUNT(*) query instead!
* Switch BuildbotDb to SQLAlchemy queries instead of raw SQL. This was more or less necessary after the Cursor change, because SQLAlchemy autocloses Cursors, and the new Cursor didn't allow multiple active queries at the same time.
* Change getBuildRequests query to a multi-table select instead of a join. AFAIK, this gets the same result and has the same performance characteristics. I chose to do it this way because it's easier to write in SQLAlchemy-fu than a JOIN. If I'm wrong about any of this, and the JOIN is still better I'm happy to change it.
* Add hook to BuildbotDb to let tests init the db before running.
* Address the nits

Attachment #8608790 - Flags: review?(sdeckelmann)

Selena Deckelmann :selenamarie :selena

Comment 25

•

9 years ago

Comment on attachment 8608790 [details] [diff] [review]
improve queries; use allowed_builders in more places; enable ssl on pulse

Review of attachment 8608790 [details] [diff] [review]:
-----------------------------------------------------------------

I haven't run this code, but looks sound. Thanks for addressing my concerns!

::: bbb/servicebase.py
@@ +27,5 @@
> +        # are returned. The SSCursor lets you stream data, but has the side effect
> +        # of not allowing multiple queries on the same Connection at the same time.
> +        # That's OK for us because each service is single threaded, but it means
> +        # we must ensure that "close" is called whenever we do queries. "fetchall"
> +        # does this automatically, so we always use it.

Thanks for the detailed comment.

::: bbb/services.py
@@ +68,5 @@
>              try:
>                  task = self.bbb_db.getTaskFromBuildRequest(brid)
>              except TaskNotFound:
> +                # TODO: will this still be weird after we have a reverse bridge?
> +                log.warning("WEIRD: Task not found for brid %s (%s), nothing to do.", brid, buildername)

lol. I'm pretty into calling out the weirdness. Nice idea.

Attachment #8608790 - Flags: review?(sdeckelmann) → review+

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 26

•

9 years ago

Comment on attachment 8608790 [details] [diff] [review]
improve queries; use allowed_builders in more places; enable ssl on pulse

I merged this to master, bumped the version number, and pushed.

Attachment #8608790 - Flags: checked-in+

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

Depends on: 1168099

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 27

•

9 years ago

Removing dependencies from bugs that don't initial production deployment of the Buildbot Bridge. I'll be stuffing "[bbb]" in each one's whiteboard for easy searchability in the future.

No longer depends on: 1156301, 1156305, 1157242, 1157310, 1158271, 1164455, 1164819

Whiteboard: [bbb]

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

9 years ago

No longer depends on: 1168099

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 28

•

9 years ago

All of the initial blockers for the bridge are fixed and deployed in production, so I'm closing this bug. There's still some things that should probably be fixed, but they can be done as needed. I've tagged all of these bugs with "[bbb]" in the whiteboard.

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

6 years ago

Component: General Automation → General

initial version of buildbot bridge 9 years ago bhearsum@mozilla.com (:bhearsum) 52.88 KB, patch	jonasfj : review+ selenamarie : review+ bhearsum : checked-in+	Details \| Diff \| Splinter Review
move allowed builders in config 9 years ago bhearsum@mozilla.com (:bhearsum) 1.06 KB, patch	selenamarie : review+ bhearsum : checked-in+	Details \| Diff \| Splinter Review
improve queries; use allowed_builders in more places; enable ssl on pulse 9 years ago bhearsum@mozilla.com (:bhearsum) 38.24 KB, patch	selenamarie : review+ bhearsum : checked-in+	Details \| Diff \| Splinter Review