Closed Bug 299909 Opened 19 years ago Closed 16 years ago

Replace Tinderbox client with Buildbot

Categories

(Release Engineering :: General, enhancement, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mkanat, Unassigned)

Details

This was a discussion that we had in #mozwebtools a while ago.

Basically, Tinderbox development is dead. There's always a chance that somebody
will come back and work on it, but right now it seems pretty dead. There are
certainly a few features that it could use that would make our lives easier,
such as a better client interface.

From what I understand, most of the features we'd like to see in Tinderbox are
already in Buildbot. It might lack one or two features that Tinderbox has, but I
think it would be easier to hack the few small features we'd like into
Buildbot than to locate a Tinderbox maintainer.

Of course, I'm not suggesting this change lightly, or even saying that it should
happen any time soon. :-) I'm just saying that instead of trying to find
developers for Tinderbox, we could just use Buildbot, which is alive and has
many of the features that we want.

As a start, somebody suggested that we could also write a buildbot client which
emailed Tinderbox. (Probably easier to write than a tinderclient, for many
cases, from my understanding.)
Comment appreciated -- though it's not true that no one is developing on
Tinderbox.  I know of at least two people (myself and Mike, both cc'd on this
bug).  Perhaps more directly, there's a level of interest on my part in moving
Tinderbox development forward.

In order to cover our bets I agree it makes sense to evaluate available
alternatives.  For those seeking a pointer, BuildBot can be found at
http://buildbot.sourceforge.net/.

I know Tinderbox1/2/3 are feature incomplete (I have my own list of features
that I want to see added).  mkanat, in your view what features are in BuildBot
that aren't in Tinderbox?
Assignee: justdave → chase
I am interested in working on Tinderbox 2/3 if there is a list of things to do - currently I'm patching it 
to add the features required for our environment.

I also use Buildbot for other projects and have commit rights to that project, so I can help with testing of
its features if that is needed.  Currently Buildbot is being refactored to allow it to work with different
branches in the same project and other "large project" features.

Buildbot could easily have a "step" created that generates a Tinderbox email.
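
A minimal sketch of what such a step might generate, using only the Python standard library. The `tinderbox:` header lines follow the Tinderbox 1 mail convention as I understand it and should be checked against the server's mail processor; the function name `tinderbox_mail` and both addresses are hypothetical.

```python
import time
from email.message import EmailMessage


def tinderbox_mail(tree, build_name, status, log_text, start_time=None,
                   sender="buildbot@example.com",
                   recipient="tinderbox-daemon@example.com"):
    """Build (but do not send) a Tinderbox-style status email.

    Assumption: the server reads "tinderbox: key: value" lines at the
    top of the body, terminated by "tinderbox: END", then the build log.
    """
    start_time = int(time.time()) if start_time is None else int(start_time)
    header = [
        f"tinderbox: tree: {tree}",
        f"tinderbox: builddate: {start_time}",
        f"tinderbox: status: {status}",
        f"tinderbox: build: {build_name}",
        "tinderbox: errorparser: unix",
        "tinderbox: END",
    ]
    msg = EmailMessage()
    msg["From"] = sender        # hypothetical address
    msg["To"] = recipient       # hypothetical address
    msg["Subject"] = f"build {status}: {build_name}"
    # Header block first, then a blank line, then the raw log.
    msg.set_content("\n".join(header) + "\n\n" + log_text)
    return msg


if __name__ == "__main__":
    print(tinderbox_mail("Firefox", "Linux example Depend", "success",
                         "make: nothing to be done\n"))
```

Sending the message is left out on purpose; a real step would hand it to whatever mail relay the master is configured to use.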

bear == Mike for those who do not know me :)
Canonical uses BuildBot for their Ubuntu build farm (or at least they did when I
was working there), so I've seen it in action.  At the time I think it was more
tuned to raw build-farm management and less to running tests on things, but it's
been a year since I've seen it, and a lot can happen in a year.
(In reply to comment #1)
> mkanat, in your view what features are in BuildBot that aren't in Tinderbox?

  I think the major feature is the ability to easily write flexible clients. With Tinderbox, writing a client that does something special requires a lot of perl hacking, but it's easier with buildbot.

  Also, unlike Tinderbox 1 at least, buildbot can communicate with the server in more ways than just email.
Component: Server Operations → Build & Release
Mass reassign of open bugs for chase@mozilla.org to build@mozilla-org.bugs.
Assignee: chase → build
I've been testing buildbot for firefox trunk (based on vlad's config), and it looks really good. However, I've identified the following issues so far:

* bonsai support does not work out of the box (I just sent the buildbot-devel list a patch to fix this)
* no branch support for bonsai (sent a patch for this too)
* no way to attach notes à la tinderbox (we use this a lot)
* server unavailable causes clients to stop building (maybe this is configurable?)

The last point is part of the design of buildbot AFAICT, and it's ok but may not 100% map to our needs in the default configuration. However, there are several good points:

* ability to stop/force build
* good design/strong ownership
* great user/dev docs 
* log is written to as build progresses
* builds are done in discrete steps
* errors are caught very early
* stdout/stderr is dealt with properly

The last 3 are big problems for tinderbox currently.
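
To illustrate the last three points, here is a rough sketch of running a build as discrete steps, stopping at the first failure and capturing each step's output streams separately. It is not Buildbot's implementation; `run_steps` and the example step names are hypothetical, and the commands are trivial Python one-liners so the sketch runs anywhere.

```python
import subprocess
import sys


def run_steps(steps):
    """Run (name, argv) steps in order and stop at the first failure.

    Returns a list of (name, returncode, stdout, stderr) tuples for the
    steps that actually ran.
    """
    results = []
    for name, argv in steps:
        # Each step is its own process, with stdout and stderr kept apart.
        proc = subprocess.run(argv, capture_output=True, text=True)
        results.append((name, proc.returncode, proc.stdout, proc.stderr))
        if proc.returncode != 0:
            break  # later steps never run against a broken tree
    return results


if __name__ == "__main__":
    steps = [
        ("checkout", [sys.executable, "-c", "print('checked out')"]),
        ("compile", [sys.executable, "-c",
                     "import sys; sys.stderr.write('error: boom\\n'); sys.exit(2)"]),
        ("test", [sys.executable, "-c", "print('never reached')"]),
    ]
    for name, code, out, err in run_steps(steps):
        print(name, code, out.strip(), err.strip())
```

Because the failing step is known by name, the report can point at "compile" rather than at a position somewhere inside one long log.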
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Mass re-assign of bugs that aren't on the build team radar, so bugs assigned to build@mozilla-org.bugs reflects reality.

If there is a bug you really think we need to be looking at, please *email* build@mozilla.org with a bug number and explanation.
Assignee: build → nobody
Status: ASSIGNED → NEW
Assignee: nobody → build
QA Contact: myk → mozpreed
Assignee: build → nobody
QA Contact: mozpreed → build
Depends on: 404413
I'm currently working on transitioning our tinderbox version 1 clients over to buildbot. Buildbot has a BonsaiPoller and TinderboxMailNotifier now, so it can build on demand based on Bonsai checkins and publish to the Tinderbox server just as the tinderbox client does, which makes it a pretty drop-in replacement.

Here are some blog posts I've done on the subject, with links to more info:
http://roberthelmer.com/blog/?p=21
http://roberthelmer.com/blog/?p=23

The first step is transitioning all of our builds over to the release automation system we've been using for Firefox and Thunderbird releases (which really just wraps tinderbox client). The next step is calling the build system directly from Buildbot, and making some enhancements to the build system to make this more palatable.

Take a look at the blog posts for more info on the hows and whys.

The goal with the last bit is to make it very obvious what buildbot is doing, so that it is easy for a developer to replicate those changes or suggest a different approach, because they're using the same build system. This most likely involves pushing some of the logic Tinderbox client is responsible for into the build system, and we'll probably just have makefile targets which call scripts (see tools/update-packaging for an example).

Tinderbox server does a bunch of stuff that Buildbot server doesn't do, and that it would probably be inappropriate for Buildbot to do. Tinderbox server has some advantages, so I'm not currently working on replacing it, but rather on enhancing it so we can get information out of it to use with other systems (see the JSON interfaces being added, for example). There is various work going on to create some kind of project dashboard to use instead of Tinderbox, which would have the kind of status, vital info, tree opening/closing tools etc. that Tinderbox server is currently used for.

So to summarize - replacing tinderbox client with buildbot, absolutely. Replacing tinderbox server with buildbot, no. We need something that provides everything Tinderbox server currently provides and more, which is out of the scope of what buildbot should be responsible for.
Status: NEW → ASSIGNED
Summary: Replace Tinderbox with Buildbot → Replace Tinderbox client with Buildbot
(In reply to comment #8)
> Buildbot has a BonsaiPoller and TinderboxMailNotifier now, so it can build
> on-demand based on Bonsai checkins and publish to Tinderbox server just
> as tinderbox client, so it's a pretty drop-in replacement.

I think doing on-demand builds for the normal tinderbox machines is a bad idea. We already have enough trouble as it is getting the current buildbot machines to build again when we need them to. The machines need to continually build, just like now, imho.
Also, buildbot needs to learn to clobber properly, which it doesn't do right now.
(In reply to comment #9)
> I think doing on-demand builds for the normal tinderbox machines is a bad idea.
> We already have enough trouble as it is getting the current buildbot machines
> to build again when we need them to. The machines need to continually build,
> just like now, imho.

Can you give more detail on the trouble? The release-automation-based ones won't have the "SIGKILL" problem (and I don't want to move to the more direct solution until that's fixed).

Continuous building wastes a lot of time, and ties up machines that could be doing something useful instead of rebuilding the same source code over and over. If it's just not possible to do in a stable manner I agree we can drop it, but I don't believe that's the case based on the testing I've done so far.

(In reply to comment #10)
> Also, buildbot needs to learn to clobber properly, which it doesn't do right
> now.

This isn't a problem with the release-automation-based ones, and again I don't want to move backwards, so we need to make buildbot more directly support this (working on this on the buildbot mailing list) before moving to the longer-term idea.

I've been watching and learning from the Talos and unit testing boxes, believe me :)

One other detail I should mention - this all has to be working in a staging environment before anything touches production, so if we hit any problems we have a safe place to reproduce and fix. There's no need to rush things into production before they're ready.

I think that the best way to describe the initial phase is that we're replacing the multi-tinderbox.pl script with Buildbot, and the later phase is working towards deeper integration with Buildbot and separation of Tinderbox logic into the build system.

The idea here is to make sure that all changes are and continue to be thoroughly tested before they hit production, and to start with something that looks and acts exactly like our current system (because it is really Tinderbox client under the covers) and to improve incrementally from there.
(In reply to comment #11)
> (In reply to comment #9)
> > I think doing on-demand builds for the normal tinderbox machines is a bad idea.
> > We already have enough trouble as it is getting the current buildbot machines
> > to build again when we need them to. The machines need to continually build,
> > just like now, imho.
> 
> Can you give more detail on the trouble? The release-automation-based ones
> won't have the "SIGKILL" problem (and I don't want to move to the more direct
> solution until that's fixed).

We have to commit something to get them to build. That's dysfunctional, at the least.

> Continuous building wastes a lot of time, and ties up machines that could be
> doing something useful instead of rebuilding the same source code over and
> over. If it's just not possible to do in a stable manner I agree we can drop
> it, but I don't believe that's the case based on the testing I've done so far.

Continuous building finds problems. If something doesn't test or compile correctly, it's possible that it is just a one-time fluke and will just fix itself the next time around. Rebuilding the same source over and over makes sure that there isn't a timing issue (which has happened before) with something. Plus, as our tinderboxes run perf tests, it's good to have their results over a longer period of time, instead of just when they build, so we can see if something caused a change or not. One cycle just isn't enough for most things, so I personally feel we're hurting ourselves more by only doing one cycle for things.

> (In reply to comment #10)
> > Also, buildbot needs to learn to clobber properly, which it doesn't do right
> > now.
> 
> This isn't a problem with the release-automation-based ones

They clobber source and objdirs? I ask because the current talos and unit test ones don't.
(In reply to comment #13)
> Continuous building finds problems. If something doesn't test or compile
> correctly, it's possible that it is just a one-time fluke and will just fix
> itself the next time around. Rebuilding the same source over and over makes
> sure that there isn't a timing issue (which has happened before) with
> something.

reed: you are aware of the ability in buildbot to trigger a new build at any time by clicking on the machine name and filling in the form? It would be trivial to add a clobber flag to that triggering process too.

Sure, it's not automatic, but maybe that would force us to resolve any build system dependencies that get us into those fluke states in the first place.
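
As a rough illustration of what a clobber flag on a forced build could do before the build starts, here is a sketch under stated assumptions: `prepare_tree`, its parameters, and the directory layout are all hypothetical, and this is not how Buildbot or the Tinderbox client implements clobbering.

```python
import shutil
from pathlib import Path


def prepare_tree(source_dir, obj_dir, clobber=False):
    """Ensure the source and object directories exist before a build.

    With clobber=True both directories are deleted first, so the build
    starts from a fresh checkout and an empty objdir. Returns the list
    of directories that were removed.
    """
    source_dir, obj_dir = Path(source_dir), Path(obj_dir)
    removed = []
    if clobber:
        # Remove the objdir first in case it lives inside the source tree.
        for directory in (obj_dir, source_dir):
            if directory.exists():
                shutil.rmtree(directory)
                removed.append(directory)
    source_dir.mkdir(parents=True, exist_ok=True)
    obj_dir.mkdir(parents=True, exist_ok=True)
    return removed
```

A depend build would call `prepare_tree(src, obj)` and keep whatever is already on disk, while a forced clobber build would pass `clobber=True`.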

> Plus, as our tinderboxes run perf tests, it's good to have their
> results over a longer period of time, instead of just when they build, so we
> can see if something caused a change or not. One cycle just isn't a enough for
> most things, so I personally feel we're hurting ourselves more by only doing
> one cycle for things.

I thought the plan of record was to move all the tests (modulo keepalive) off the build slaves to keep them free for building, and have the new Mac mini perf farm test the hell out of the builds. We're getting better numbers (from better tests) in talos these days anyway, no? It's easier to diagnose build failures on build slaves, and test failures on test machines, IMO.
(In reply to comment #14)
> reed: you are aware of the ability in buildbot to trigger a new build at any
> time by clicking on the machine name and filling in the form? It would be
> trivial to add a clobber flag to that triggering process too.

You are aware that the buildbot admin interface is restricted to MoCo-only people and is behind the MPT VPN? That makes it completely useless to me, as I'm neither a MoCo employee nor do I have access to the MPT VPN. Also, I'm sure MoCo employees don't want to have to load the MPT VPN in order to start a new build.

> I thought the plan of record was to move all the tests (modulo keepalive) off
> the build slaves to keep them free for building, and have the new Mac mini perf
> farm test the hell out the builds. We're getting better numbers (from better
> tests) in talos these days anyway, no? Easier to diagnose build failures on
> build slave, and test failures on test machines IMO.

Talos is currently useless to me, as it takes forever to give me results. When tons of people commit at once, you need a smaller window in order to figure out which check-in caused a problem. Without that, you'll spend HOURS trying to figure out which patch caused a regression.

Also, it's only on Windows on the main Firefox tree, so what about Mac and Linux? bl-bldlnx03 can usually tell if something is a perf win/loss in ~30 minutes compared to the ~105 minutes that Talos takes.
(In reply to comment #13)
> We have to commit something to get them to build. That's dysfunctional, at the
> least.
> 
> > Continuous building wastes a lot of time, and ties up machines that could be
> > doing something useful instead of rebuilding the same source code over and
> > over. If it's just not possible to do in a stable manner I agree we can drop
> > it, but I don't believe that's the case based on the testing I've done so far.
> 
> Continuous building finds problems. If something doesn't test or compile
> correctly, it's possible that it is just a one-time fluke and will just fix
> itself the next time around. Rebuilding the same source over and over makes
> sure that there isn't a timing issue (which has happened before) with
> something. Plus, as our tinderboxes run perf tests, it's good to have their
> results over a longer period of time, instead of just when they build, so we
> can see if something caused a change or not. One cycle just isn't a enough for
> most things, so I personally feel we're hurting ourselves more by only doing
> one cycle for things.


It's pretty easy to do continuous builds with Buildbot for some or all builders, so I think we can separate that from this bug (but let's continue elsewhere, in irc or newsgroups).


> > (In reply to comment #10)
> > > Also, buildbot needs to learn to clobber properly, which it doesn't do right
> > > now.
> > 
> > This isn't a problem with the release-automation-based ones
> 
> They clobber source and objdirs? because the current talos and unit test ones
> don't.


Yes. Like I said, release automation is calling into Tinderbox (build-seamonkey.pl), so the behavior is exactly the same as now. We're taking a different approach than Talos and the unit test boxes.

(In reply to comment #15)
> (In reply to comment #14)
> > reed: you are aware of the ability in buildbot to trigger a new build at any
> > time by clicking on the machine name and filling in the form? It would be
> > trivial to add a clobber flag to that triggering process too.
> 
> You are aware that the buildbot admin interface is restricted to MoCo-only
> people and is behind the MPT VPN? That makes it completely useless to me, as
> I'm neither a MoCo employee nor do I have access to the MPT VPN. Also, I'm sure
> MoCo employees don't want to have to load the MPT VPN in order to start a new
> build.


I think we should allow anyone with CVS access, but that still needs to be discussed. We should have some way for anyone with CVS access to force builds, IMHO, whether that's the Buildbot master interface or something else.


> Talos is currently useless to me, as it takes forever to give me results. When
> tons of people commit at once, you need a smaller window in order to figure out
> which check-in caused a problem. Without that, you'll spend HOURS trying to
> figure out which patch caused a regression.
> 
> Also, it's only on Windows on the main Firefox tree, so what about Mac and
> Linux? bl-bldlnx03 can usually tell if something is a perf win/loss in around
> ~30 minutes compared to the ~105 minutes that Talos takes.
 

There are things we can do about the above, but that's out of scope for what I'm currently working on. The tinderbox tests will stay there until Talos is ready, and if we have to hook the tinderbox tests up to Buildbot via the release automation then so be it.

As I said before, I don't want to move things backwards. It sounds like the only thing we're disagreeing about is continuous versus change-triggered builds. I don't think this bug is the right forum for discussing it, but Buildbot easily supports both, either for all builders or case-by-case, so the choice doesn't affect whether we move to Buildbot.
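
To make the "both, case-by-case" point concrete, here is a small sketch of a per-builder policy that decides whether to start a build. It is only a model of the decision being discussed, not Buildbot's scheduler API; `BuilderPolicy` and `should_start_build` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class BuilderPolicy:
    name: str
    continuous: bool  # True: rebuild even when nothing has changed


def should_start_build(policy, builder_idle, pending_changes, forced=False):
    """Decide whether a builder should start a new build now."""
    if not builder_idle:
        return False            # never interrupt a running build
    if forced:
        return True             # someone asked for a build explicitly
    if pending_changes > 0:
        return True             # change-triggered build
    return policy.continuous    # idle and nothing new: only continuous builders rebuild


if __name__ == "__main__":
    perf_box = BuilderPolicy("linux-perf", continuous=True)
    nightly_box = BuilderPolicy("win32-build", continuous=False)
    print(should_start_build(perf_box, builder_idle=True, pending_changes=0))
    print(should_start_build(nightly_box, builder_idle=True, pending_changes=0))
```

Under this model a perf-sensitive builder can keep cycling on the same source while other builders wait for checkins or a forced build.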
Depends on: 401936
No longer blocks: 291167
I am doing this incrementally in bugs such as bug 417147. I'm not sure how useful this overall "replace" bug is, though, because we're going to have bits of both for a long time.

Leaving open but returning to default owner, in case anyone else wants to do something more drastic :)
Assignee: rhelmer → nobody
Status: ASSIGNED → NEW
In triage, we think all the remaining work here is being tracked in dependent bugs, so we want to close this.

Please reopen if we've missed something.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering