Closed Bug 481886 Opened 15 years ago Closed 15 years ago

Tracking bug for buildbot 0.7.10p1 upgrade

Categories

(Release Engineering :: General, defect, P2)

x86
Linux
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: bhearsum)

Attachments

(3 files)

Buildbot 0.7.10p1 has lots of features that would be useful to us.  We should upgrade!
Blocks: 435472
We need to take a version of buildbot that fixes at least http://buildbot.net/trac/ticket/446, too.
Blocks: 484542
Blocks: 480145
We're going to try to do this early in Q2.
Assignee: catlee → bhearsum
Status: NEW → ASSIGNED
Priority: -- → P3
Some of the nice features in 0.7.10p1 include:
* ATOM/RSS
* Fixed 'ping' button
* Fixed reconfig (no more tracebacks)
* Graceful slave shutdown
* Configurable BuildRequest merging

There's also a few patches which landed post-0.7.10 we should consider including:
* Fixes for one_line_per_build (http://buildbot.net/trac/ticket/455)
* Mercurial step fixes (http://buildbot.net/trac/ticket/462 and http://buildbot.net/trac/ticket/277)

We don't strictly need to take the Mercurial fixes but it might be good to get that testing out of the way - it wouldn't surprise me if it breaks us a bit.
This is going to be a pretty easy import, by the looks of it. I'm going to be backing out the patch in bug 485584 since it hasn't solved our issue, and conflicts with some incoming changes. Other than that, there's a few conflicts: process/base.py, process/builder.py, slave/commands.py - all of which are trivial to resolve.

I still need to do a lot of testing in staging before we think about deploying this. It's going to be a bit of a pain to roll out, too, because we'll need to update all of the build slaves (we can probably omit Talos slaves from this since there's no big commands.py changes that affect them). It's probably going to require a fairly big downtime.

So, here's the plan:
* Test the new Buildbot in staging well, focusing on the Mercurial step
* Import 0.7.10p1 into production
* Schedule downtime, roll out across the farm.
There have been some major patches to the mail notifier stuff, and previously, I ran across problems with TinderboxNotifier, too. We should make sure those don't break, including the l10n-specific uses with WithProperties in tree names.
(In reply to comment #3)
> There's also a few patches which landed post-0.7.10 we should consider
> including:
> * Fixes for one_line_per_build (http://buildbot.net/trac/ticket/455)
> * Mercurial step fixes (http://buildbot.net/trac/ticket/462 and
> http://buildbot.net/trac/ticket/277)
> 
> We don't strictly need to take the Mercurial fixes but it might be good to get
> that testing out of the way - it wouldn't surprise me if it breaks us a bit.

Why not just import a clean 0.7.10p1, and omit those extra later fixes until they are included in 0.7.11, and we import a clean 0.7.11? 

It feels easier (and safer?) to import a clean 0.7.10p1, rather than pick and choose additional later changes, but I could be missing something.
Mantra #1, use 0.7.10p1 without patches, and you break the build. The builds won't get their .hg/hgrc set up with paths, which will break about:buildconfig, and make ident required for l10n builds.

Besides that, there's the technical detail that we need to patch slave/commands.py to include the custom slave-side step code we have. I wonder if it's worth forking that code into slave/mozcommands.py and importing it from commands.py, to make the distinction more apparent.
I wonder if it's worth it to stop using buildbot's built-in mercurial support completely.
There are good things coming up, in particular the clobber on switching from one repo-as-branch to another is pretty tough to mimic in pure shell scripts. That's in patches towards .11, too.

Basically, when you have a fx36x clone and you switch to a releases/mozilla-1.9.2 repo, the Mercurial step realizes that you're now pulling from somewhere else, and does a clobber. That's the same reason we're currently clobbering build/tools all the time: the current code doesn't compare the repo you pulled from with the repo you want to pull from.
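A minimal sketch of that comparison, to make the idea concrete. It is not the upstream Mercurial step: the function names are made up for illustration, and it assumes the checkout's .hg/hgrc is simple enough for Python's configparser to read and records the source repo under [paths] default.

```python
import configparser
import os


def recorded_default_path(checkout_dir):
    """Return the [paths] default entry from .hg/hgrc, or None if absent."""
    hgrc = os.path.join(checkout_dir, ".hg", "hgrc")
    # Interpolation is disabled so a '%' in a URL is not treated specially.
    parser = configparser.ConfigParser(interpolation=None)
    if not parser.read(hgrc):
        return None  # no hgrc file could be read
    return parser.get("paths", "default", fallback=None)


def needs_clobber(checkout_dir, wanted_repo):
    """True when an existing checkout was pulled from a different repo."""
    if not os.path.isdir(os.path.join(checkout_dir, ".hg")):
        return False  # nothing checked out yet, a fresh clone is enough
    current = recorded_default_path(checkout_dir)
    if current is None:
        # Assumption: if we cannot tell where the clone came from, clobber.
        return True
    # Ignore a trailing slash so equivalent URLs compare equal.
    return current.rstrip("/") != wanted_repo.rstrip("/")
```

With a check like this, a working directory would only be thrown away when the recorded source and the requested repo actually differ, rather than on every build.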
(In reply to comment #7)
> (In reply to comment #3)
> > There's also a few patches which landed post-0.7.10 we should consider
> > including:
> > * Fixes for one_line_per_build (http://buildbot.net/trac/ticket/455)
> > * Mercurial step fixes (http://buildbot.net/trac/ticket/462 and
> > http://buildbot.net/trac/ticket/277)
> > 
> > We don't strictly need to take the Mercurial fixes but it might be good to get
> > that testing out of the way - it wouldn't surprise me if it breaks us a bit.
> 
> Why not just import a clean 0.7.10p1, and omit those extra later fixes until
> they are included in 0.7.11, and we import a clean 0.7.11? 
> 
> It feels easier (and safer?) to import a clean 0.7.10p1, rather than
> pick-and-choose additional later changes, but I could be missing something.

We've been importing a release + some patches every time we import a new Buildbot - so it's nothing new.

Some of these changes we don't _have_ to take, but since I'm going to be doing the work to import 0.7.10p1 I figure we may as well take some patches that will benefit us. I really want to take the Mercurial ones, and now that Axel mentions it, the MailNotifier ones, so we can deal with whatever bustage there is at the same time.

In any case, as Axel mentions, 0.7.10p1 stock will break the build:

(In reply to comment #8)
> Mantra #1, use 0.7.10p1 without patches, and you break the build. The builds
> won't get their .hg/hgrc set up with paths, which will break about:buildconfig,
> and make ident required for l10n builds.
(In reply to comment #8)
> Besides that, there's the technical detail that we need to patch
> slave/commands.py to include the custom slave-side step code we have. I wonder
> if it's worth forking that code into slave/mozcommands.py and importing it from
> commands.py, to make the distinction more apparent.

I think we should avoid these as much as possible, mainly because of the huge PITA of deploying them initially plus the inevitable bugfixes. But we do have one custom command in here currently, and I think it's a great idea to move it out.
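A rough sketch of what that split could look like. Everything here is a stand-in: COMMAND_REGISTRY and register_command are not Buildbot's registry API, and ExampleMozillaCommand is a hypothetical placeholder for the custom command. The point is only that the custom code sits in its own module and the upstream file carries a single hook.

```python
# Stand-in for the upstream command table; NOT Buildbot's real registry API.
COMMAND_REGISTRY = {}


def register_command(name, factory):
    """Record a command factory under the name the master will ask for."""
    if name in COMMAND_REGISTRY:
        raise ValueError("command %r is already registered" % name)
    COMMAND_REGISTRY[name] = factory


# --- what would live in a separate mozcommands.py ---
class ExampleMozillaCommand:
    """Hypothetical custom command; the body is a placeholder."""

    def __init__(self, args):
        self.args = args

    def start(self):
        return "ran with %r" % (self.args,)


def register_mozilla_commands(register):
    """Single entry point that the upstream module calls once."""
    register("example-mozilla-command", ExampleMozillaCommand)


# --- the only line the upstream commands.py would need to carry ---
register_mozilla_commands(register_command)
```

Keeping the local delta against upstream down to one line should make it easier to spot during a merge whether the custom command survived.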
I've imported 0.7.10p1 and the following tickets into my user repository:
http://buildbot.net/trac/ticket/455
http://buildbot.net/trac/ticket/446
http://buildbot.net/trac/ticket/451
http://buildbot.net/trac/ticket/277
http://buildbot.net/trac/ticket/462

The repository is here: http://hg.mozilla.org/users/bhearsum_mozilla.com/buildbot. I plan to start testing this week, beginning with staging-master:moz2-master. Once I have all of that sorted out I'll move on to try and talos.
Turns out I forgot to 'hg addremove' after unpacking 0.7.10p1. I've fixed my repository to include all the new files.
Depends on: 487496
While testing 0.7.10p1 on the staging try server I encountered a problem with the MozillaPatchDownload step. I landed a fix upstream for it, and also in http://hg.mozilla.org/users/bhearsum_mozilla.com/buildbot.

Other than that, and the issue I filed bug 487496 for, everything has been fine. I still have to test the Talos buildbot though, and I wouldn't be surprised to find a thing or two that needs fixing.
Priority: P3 → P2
No longer blocks: 488262
Blocks: 488368
Blocks: 488273
Deployment on Linux:
* Log on as root
wget --no-check-certificate -Obuildbot-0.7.10p1.sh https://bugzilla.mozilla.org/attachment.cgi?id=374948
chmod +x buildbot-0.7.10p1.sh
./buildbot-0.7.10p1.sh

Deployment on Mac:
* Log on as cltbld
wget --no-check-certificate -Obuildbot-0.7.10p1.sh https://bugzilla.mozilla.org/attachment.cgi?id=374948
chmod +x buildbot-0.7.10p1.sh
sudo ./buildbot-0.7.10p1.sh

Deployment on Windows:
* Log on as Administrator
wget --no-check-certificate -Obuildbot-0.7.10p1.sh https://bugzilla.mozilla.org/attachment.cgi?id=374949
chmod +x buildbot-0.7.10p1.sh
./buildbot-0.7.10p1.sh
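For anyone scripting the rollout, the three recipes above can be captured in one table. This is only a sketch: the logins and attachment ids come from the steps above, while DEPLOY_PLAN and deploy_commands are names made up for illustration. The URL is quoted here so the shell does not try to expand the '?'.

```python
SCRIPT = "buildbot-0.7.10p1.sh"
BASE_URL = "https://bugzilla.mozilla.org/attachment.cgi?id=%d"

# Per-platform differences taken from the deployment steps in this bug.
DEPLOY_PLAN = {
    "linux": {"login": "root", "attachment": 374948, "sudo": False},
    "mac": {"login": "cltbld", "attachment": 374948, "sudo": True},
    "windows": {"login": "Administrator", "attachment": 374949, "sudo": False},
}


def deploy_commands(platform):
    """Return the shell commands to run on one slave, in order."""
    plan = DEPLOY_PLAN[platform]
    run = "./" + SCRIPT
    if plan["sudo"]:
        run = "sudo " + run
    url = BASE_URL % plan["attachment"]
    return [
        "wget --no-check-certificate -O%s '%s'" % (SCRIPT, url),
        "chmod +x " + SCRIPT,
        run,
    ]


if __name__ == "__main__":
    for name in ("linux", "mac", "windows"):
        print("%s (log on as %s):" % (name, DEPLOY_PLAN[name]["login"]))
        for command in deploy_commands(name):
            print("  " + command)
```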
After a few bumps in the road we've got this deployed. Major problems were:
* Talos losing the ability to override commands (fixed in bug 487496)
* Windows slaves failing due to http://buildbot.net/trac/ticket/456. We checked in this patch and updated the slaves to fix it.
* Many builds failing due to SetMozillaBuildProperties not existing on the slaves. This was the result of a bad merge during the initial import. To fix this, we re-added the command to commands.py and updated the slaves.
try-mac-slave06
moz2-darwin9-slave03

weren't updated because they're offline
moz2-darwin9-slave03 has been upgraded.

Holding off on try-mac-slave06 until we get the new buildbot code working on try slaves.
Depends on: 490850
All of the production-1.8 and production-1.9 master + slaves have been updated now. Still to do:
staging-1.9
1.9 unittest
staging-1.9 has been upgraded.
We're hitting what seems to be an ignorable traceback on the 1.9 masters, related to l10n:
	  File "/tools/buildbot/lib/python2.5/site-packages/buildbot/master.py", line 759, in <lambda>
	    d.addCallback(lambda res: self.loadConfig_Schedulers(schedulers))
	  File "/tools/buildbot/lib/python2.5/site-packages/buildbot/master.py", line 835, in loadConfig_Schedulers
	    d.addCallback(updateDownstreams)
	  File "/tools/twisted-2.4.0/lib/python2.5/site-packages/twisted/internet/defer.py", line 191, in addCallback
	    callbackKeywords=kw)
	  File "/tools/twisted-2.4.0/lib/python2.5/site-packages/twisted/internet/defer.py", line 182, in addCallbacks
	    self._runCallbacks()
	--- <exception caught here> ---
	  File "/tools/twisted-2.4.0/lib/python2.5/site-packages/twisted/internet/defer.py", line 307, in _runCallbacks
	    self.result = callback(self.result, *args, **kw)
	  File "/tools/buildbot/lib/python2.5/site-packages/buildbot/master.py", line 834, in updateDownstreams
	    s.checkUpstreamScheduler()
	  File "/tools/buildbot/lib/python2.5/site-packages/buildbot/scheduler.py", line 350, in checkUpstreamScheduler
	    for s in self.parent.allSchedulers():
	<type 'exceptions.AttributeError'>: 'NoneType' object has no attribute 'allSchedulers'
	

I've commented in the upstream ticket about it (http://buildbot.net/trac/ticket/35). It doesn't seem to be interfering with anything, though.
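The shape of the failure is easy to reproduce outside Buildbot. The sketch below uses a stand-in class, not the real scheduler, and only shows that calling allSchedulers() on a parent that is still None produces the same AttributeError. Why the parent is None at that point on the 1.9 masters is not established here.

```python
class FakeScheduler:
    """Stand-in for a scheduler that looks up its upstream via its parent."""

    def __init__(self, parent=None):
        # Assumption for illustration: the parent has not been set yet.
        self.parent = parent

    def checkUpstreamScheduler(self):
        # Same attribute access as scheduler.py line 350 in the traceback.
        for s in self.parent.allSchedulers():
            pass


try:
    FakeScheduler().checkUpstreamScheduler()
except AttributeError as exc:
    message = str(exc)
    print(message)
```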
The only thing left to do here is get the Try Server slaves upgraded to 0.7.10p1. This is blocked on figuring out how to avoid them breaking when the try repository grows too many heads.
Last week I worked with the maintainers of the Buildbot Mercurial code and they landed an upstream patch that will enable us to use 'hg clone --rev' on the try server. We'll need to pull in http://github.com/djmitche/buildbot/commit/483a6043ed2cab2436009eeb7465269b7a48e65f, and land the attached patch. We'll need a short downtime so we can upgrade the slaves at the same time as we land these.
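For reference, a sketch of the kind of command line this enables. It is not the code from the upstream commit or the attached patch: the helper name and the example URL, revision and directory are hypothetical. Cloning with --rev fetches only the requested revision and its ancestors, so the other heads in the try repository are not pulled.

```python
def clone_rev_command(repo_url, revision, dest):
    """Build the argv for cloning a single revision and its ancestors."""
    return ["hg", "clone", "--rev", revision, repo_url, dest]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    argv = clone_rev_command("http://hg.example.org/try", "0123456789ab", "build")
    print(" ".join(argv))
```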
Attachment #378326 - Flags: review?(catlee)
Attachment #378326 - Flags: review?(catlee) → review+
Comment on attachment 378326 [details] [diff] [review]
MozillaTryServerHgClone fixes for 0.7.10p1+

changeset:   299:1aa4bb2bdf4d
Attachment #378326 - Flags: checked-in+
I got the Try Server slaves upgraded today (yay).
This bug is ripe for the closing - all of our installations have been updated to 0.7.10p1, save 1.9 unittests (which is ok).
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering