Closed Bug 637347 Opened 13 years ago Closed 12 years ago

deploy Buildbot-0.8.4-pre-moz1 to OPSI and manual buildslaves

Categories

(Release Engineering :: General, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: coop)

Details

(Whiteboard: [buildbot][idleizer][buildduty])

Attachments

(4 files, 5 obsolete files)

In parallel with bug 631851.  This will bring a great bounty of improvements, including idleizer and better process handling.
Blocks: 637349
Depends on: 600736
Attached file install-buildbot.bat (obsolete)
Well, this was rather easy.  The attached script installs Buildbot in a virtualenv, using the vars at the top.  It uses the puppet server for the packages, which I think makes more sense than manually duplicating them elsewhere.

I'd like to roll this into an OPSI package, and then run it by hand on w7, w64, and w764 boxes via the usual tinyurl trick.  There are a few things blocking all that:

 - bug 600736 - tweaks for process killing on Windows
 - bug 665254 - fix idleizer
    (and test it!)
 - bug 650004 - more commits since 0.8.4-pre-moz1
 - upgrade to buildbot-0.8.4p1, when it's released (tonight? tomorrow?)

but I'd like to know, in the interim, if this looks insane.  If it doesn't look that bad, I may install this on some staging *build* slaves to see how well idleizer works.
Attachment #540202 - Flags: review?(mlarrain)
Bug 600736 has gotten too twisty-turny, so I'm fixing the process killing in bug 666019.
Depends on: 666019
No longer depends on: 600736
Attachment #540202 - Flags: review?(mlarrain)
With bug 666019 closed, this is ready to start deploying.  I'd like to deploy by hand on dev/pp systems, and then hand-modify runslave.py to see if the updated version works.  If so, I'll set up both with OPSI.
Comment on attachment 540202 [details]
install-buildbot.bat

SET python=d:\mozilla-build\python25\python.exe will not work on 64-bit systems, because we have removed the D:\ drive there and moved its contents to C:\.
Attachment #540202 - Flags: review-
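One way to avoid hard-coding a single drive letter is to probe a short list of candidate locations and use the first one that exists. The following is only a sketch in Python, not what the attached batch file does: the D:\ path comes from the reviewed script, while the C:\ path and the helper name find_python are assumptions for illustration.

```python
import os

# Candidate interpreter locations, checked in order.
# The D:\ path is the one from the reviewed script; the C:\ path is an
# assumption based on the review comment that D:\ contents moved to C:\.
CANDIDATES = [
    r"d:\mozilla-build\python25\python.exe",
    r"c:\mozilla-build\python25\python.exe",
]


def find_python(candidates=CANDIDATES, exists=os.path.isfile):
    """Return the first candidate path that exists, or None if none do.

    `exists` is injectable so the lookup can be exercised without the
    real filesystem layout.
    """
    for path in candidates:
        if exists(path):
            return path
    return None
```

On a slave without a D:\ drive, the first entry is simply skipped and the C:\ entry is returned if present.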
Attached file buildbot.bat (obsolete)
I'm going to run this on w32-ix-slave01 to see how it works.
Attachment #540202 - Attachment is obsolete: true
Attached file buildbot.bat (obsolete)
A better version that has run successfully on w32-ix-slave01.  I'll try it on an XP test slave now.
Attachment #548299 - Attachment is obsolete: true
That doesn't work on XP because Windows can't find PYTHON25.DLL after PYTHON.EXE is copied into C:\MOZILLA-BUILD\BUILDBOTVE\SCRIPTS.  Copying PYTHON25.DLL into that directory works, and doesn't hurt on w32 builders.
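The DLL workaround amounts to "copy the DLL next to the copied python.exe unless it is already there". A minimal Python sketch of that step follows; the function name and its arguments are hypothetical, and the real fix lives in the batch file, which is not reproduced here.

```python
import os
import shutil


def ensure_dll_beside_python(src_dir, scripts_dir, dll_name="python25.dll"):
    """Copy dll_name from src_dir into scripts_dir unless it already exists.

    Returns True if a copy was made, False if the DLL was already present.
    Both directory arguments are placeholders for the Python install
    directory and the virtualenv's Scripts directory.
    """
    src = os.path.join(src_dir, dll_name)
    dst = os.path.join(scripts_dir, dll_name)
    if os.path.exists(dst):
        return False
    shutil.copy2(src, dst)
    return True
```

Because the copy is skipped when the file exists, running it repeatedly is harmless, which matches the observation that the extra DLL doesn't hurt on w32 builders.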
Attached file buildbot.bat (obsolete)
success so far on w32-ix-slaveNN, talos-r3-xp-NNN, and talos-r3-w7-NNN.
Attachment #548301 - Attachment is obsolete: true
Attached file buildbot.bat (obsolete)
OK, I've had success with this on all platforms now (including t-r3-w764-NNN, which don't run runslave.py anyway, so it doesn't matter).  Matt, do you see anything outrageously stupid here?
Attachment #548321 - Attachment is obsolete: true
Attachment #548335 - Flags: review?(mlarrain)
Corresponding changes to runslave.py to find Buildbot in the virtualenv.

I'll deploy this and buildbot.bat on staging slaves, and enable idleizer on them, and see how they behave.  I'll list the affected slaves here.
Attachment #548338 - Flags: review?
Attachment #548338 - Flags: review? → review?(bhearsum)
Attachment #548338 - Flags: review?(bhearsum) → review+
Attachment #548338 - Flags: checked-in+
OK, bb084-pre-moz2 and the updated runslave.py are installed on

mw32-ix-slave01
mw32-ix-slave19
mw32-ix-slave21
talos-r3-w7-001
talos-r3-w7-002
talos-r3-w7-003
talos-r3-w7-010
talos-r3-xp-001
talos-r3-xp-002
talos-r3-xp-003
talos-r3-xp-010
w32-ix-slave01
w32-ix-slave34
win32-slave04
win32-slave10
win32-slave21

and idleizer is enabled via slavealloc.  Let's see how those behave.
oops, talos-r3-w7-001 was a lie (it's down awaiting a reimage)
So I'm bumping this back into the releng queue - I need a definite thumbs-up from releng that this update is good before I do it.  It will take a few (2?) solid days of VNC'ing to deploy this, and the rollback would be another 2 days, so it needs to be right the first time.

The main things I'm worried about are (a) problems killing processes and (b) idleizer failures that leave slaves wedged.
Assignee: dustin → nobody
Priority: P2 → P3
Whiteboard: [buildbot][idleizer]
(In reply to Dustin J. Mitchell [:dustin] from comment #14) 
> The main things I'm worried about are (a) problems killing processes and (b)
> idleizer failures that leave slaves wedged.

This has been running on staging Windows slaves for a while now and we haven't seen anything abnormal, but frankly we haven't really been looking.

I signed up today to specifically look at some Windows slaves in staging and verify (a) and (b) as much as I realistically can.

Dustin has offered to do the deployment once we're confirmed to be in a good state. I will likely help him out to expedite the deployment.
Assignee: nobody → coop
FYI, I'm on the hook for the 7.0b2 release today/tomorrow, so there may be a slight delay here.
The slaves seem to be restarting correctly, but when they do, we see the following exception on the master: 

2011-08-31 07:56:41-0700 [Broker,4,10.12.50.171] Peer will receive following PB traceback:
2011-08-31 07:56:41-0700 [Broker,4,10.12.50.171] Unhandled Error
        Traceback (most recent call last):
          File "/builds/buildbot/coop/tests-master/lib/python2.6/site-packages/twisted/spread/banana.py", line 153, in gotItem
            self.callExpressionReceived(item)
          File "/builds/buildbot/coop/tests-master/lib/python2.6/site-packages/twisted/spread/banana.py", line 116, in callExpressionReceived
            self.expressionReceived(obj)
          File "/builds/buildbot/coop/tests-master/lib/python2.6/site-packages/twisted/spread/pb.py", line 514, in expressionReceived
            method(*sexp[1:])
          File "/builds/buildbot/coop/tests-master/lib/python2.6/site-packages/twisted/spread/pb.py", line 826, in proto_message
            self._recvMessage(self.localObjectForID, requestID, objectID, message, answerRequired, netArgs, netKw)
        --- <exception caught here> ---
          File "/builds/buildbot/coop/tests-master/lib/python2.6/site-packages/twisted/spread/pb.py", line 840, in _recvMessage
            netResult = object.remoteMessageReceived(self, message, netArgs, netKw)
          File "/builds/buildbot/coop/tests-master/lib/python2.6/site-packages/twisted/spread/pb.py", line 223, in perspectiveMessageReceived
            method = getattr(self, "perspective_%s" % message)
        exceptions.AttributeError: BuildSlave instance has no attribute 'perspective_shutdown'

Is this expected?
that's usually a symptom of a too-old version of buildbot. 10.12.50.171 is talos-r3-w7-010
(In reply to Chris AtLee [:catlee] from comment #18)
> that's usually a symptom of a too-old version of buildbot. 10.12.50.171 is
> talos-r3-w7-010 

Too old, or just out-of-sync? We're actually trying to use a *newer* version here on the slaves.

I'm using talos-r3-w7-010 and talos-r3-xp-003 for testing, so getting log data from those slaves is expected, but both are reporting in to my test master (http://dev-master01.build.scl1.mozilla.com:8045/buildslaves) as running 0.8.4-pre-moz2.
It's expected - it's the master that's too old (and idleizer works around that).
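For anyone reading the traceback later: the master looks up a handler by name with getattr, so a master that predates the shutdown call has no perspective_shutdown attribute and the lookup raises AttributeError. The sketch below reproduces only that lookup pattern, using stand-in classes. The class names and the request_shutdown helper are hypothetical, the error is raised locally here rather than sent back over PB, and this is not a description of how idleizer actually implements its workaround.

```python
class OldMasterPerspective(object):
    """Stand-in for a master-side object that lacks perspective_shutdown."""

    def perspectiveMessageReceived(self, message):
        # Same name-based lookup as the pb.py line shown in the traceback.
        method = getattr(self, "perspective_%s" % message)
        return method()


class NewMasterPerspective(OldMasterPerspective):
    """Stand-in for a master that does understand the shutdown message."""

    def perspective_shutdown(self):
        return "shutting down"


def request_shutdown(perspective, fallback):
    """Ask for a graceful shutdown; use the fallback if it is unsupported."""
    try:
        return perspective.perspectiveMessageReceived("shutdown")
    except AttributeError:
        # The master is too old to know this message.
        return fallback()
```

With the old stand-in the fallback runs, and with the new one the handler is called, which is the general shape of "try the graceful path, then work around an old master".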
(In reply to Dustin J. Mitchell [:dustin] from comment #20)
> It's expected - it's the master that's too old (and idleizer works around
> that).

OK, then it looks like Idleizer is working as expected. I see the Windows slaves being rebooted every 7 hours when idle. I didn't see any hangs on my slaves, so either there weren't any hangs or Idleizer was able to reboot the machines anyway.

Dustin: I think we're ready to deploy this to production. When are you available for this, keeping in mind that we're still in chemspill mode for today and (possibly) tomorrow? Do you need/want some help to speed the process along?
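As a rough illustration of the "reboot after 7 hours idle" behavior observed above, an idle timer can be modeled as a timestamp that is reset on activity and compared against a threshold. This is a sketch under assumptions, not Idleizer's code: the IdleTimer class, its method names, and the injectable clock are all invented for the example.

```python
import time


class IdleTimer(object):
    """Track the time since the last activity and flag when it is too long."""

    def __init__(self, max_idle_seconds, clock=time.time):
        self.max_idle_seconds = max_idle_seconds
        self.clock = clock  # injectable so the timer can be tested
        self.last_activity = clock()

    def note_activity(self):
        """Record that the slave just did something (e.g. ran a build)."""
        self.last_activity = self.clock()

    def should_reboot(self):
        """Return True once the idle period reaches the threshold."""
        return self.clock() - self.last_activity >= self.max_idle_seconds
```

A caller would construct it with 7 * 3600 seconds, call note_activity() whenever work arrives, and poll should_reboot() periodically.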
I'll work on it and let you know if I need help.
Assignee: coop → dustin
Attached file buildbot.bat
this version also installs runslave.py.
Attachment #548335 - Attachment is obsolete: true
Attachment #548335 - Flags: review?(mlarrain)
OK!  Aside from the machines listed below, all Windows machines are now running 0.8.4-pre-moz2.  However, note that I have not *enabled* idleizer on any of them.

I need to find a time when these aren't running tests to install the new runslave.py via VNC:

talos-r3-w7-013
talos-r3-w7-014
talos-r3-w7-015
talos-r3-w7-019
talos-r3-w7-030
talos-r3-w7-031
talos-r3-w7-035
talos-r3-w7-039
talos-r3-w7-040
talos-r3-w7-042
talos-r3-w7-043
talos-r3-w7-044
talos-r3-w7-056
talos-r3-w7-060
talos-r3-w7-062

The following are inaccessible in various ways and will need releng TLC:

w32-ix-slave03 (bad password)
w32-ix-slave06
w32-ix-slave08
w32-ix-slave26
w32-ix-slave35
w32-ix-slave41
talos-r3-w7-033 (bad password)
talos-r3-w7-045
talos-r3-w7-053 (no mouse)
Also need TLC:

talos-r3-w7-001
talos-r3-xp-045

and as a reminder to myself, I need to
 * fix the ref machines
 * remove buildbot from OPSI
Update:
 * 0.8.4-pre-moz2 installed on all accessible machines, including ref

Need TLC from releng: (some of these are known to be down)

talos-r3-w7-001      x    ??
talos-r3-w7-033      x    bad pw
talos-r3-w7-045      x
talos-r3-w7-053      x    no mouse
talos-r3-w7-ref      x
talos-r3-xp-045      x    ??
w32-ix-slave03       x    bad pw
w32-ix-slave06       x
w32-ix-slave26       x
w32-ix-slave35       x
w64-ix-slave02       x
w64-ix-slave41       x
talos-r3-w764-NNN    x    not in slavealloc

TODO:
 * remove buildbot from OPSI
 * document on refimage pages
(In reply to Dustin J. Mitchell [:dustin] from comment #20)
> It's expected - it's the master that's too old (and idleizer works around
> that).

dustin/catlee: anything preventing us from upgrading buildbot on the masters sometime soon? We'll be seeing those exceptions in the email exception reports until we do.
Remove the packages from OPSI.

I think that the steps on the opsi master would be:
 hg up
 rm -rf ~cltbld/opsi-packages/buildbot-{tip,production}

anything else?
Attachment #557520 - Flags: review?(bhearsum)
(In reply to Chris Cooper [:coop] from comment #27)
> dustin/catlee: anything preventing us from upgrading buildbot on the masters
> sometime soon? We'll be seeing those exceptions in the email exception
> reports until we do.

That's a totally different, and huge, project, but should happen.
This differs from the other patch only in that it uses runas to wget runslave.py:
runas /user:administrator "%mozillabuild%\wget\wget -OC:\runslave.py http://hg.mozilla.org/build/puppet-manifests/raw-file/tip/modules/buildslave/files/runslave.py"

On Windows 7 this version would not work.
OK, aside from OPSI changes in attachment 557520 [details] [diff] [review], this is done - over to releng for the slave cleanup.  When I get an r+ I'll land the OPSI changes.
Assignee: dustin → nobody
Attachment #557520 - Flags: review?(bhearsum) → review+
Attachment #557520 - Flags: checked-in+
btw, I had to run
 opsi-package-manager -r buildbot-tip
 opsi-package-manager -r buildbot-production
too.
List of slaves that require attention is in comment #26.
Whiteboard: [buildbot][idleizer] → [buildbot][idleizer][buildduty]
(In reply to Chris Cooper [:coop] from comment #34)
> List of slaves that require attention is in comment #26.

Just to be clear: once the slaves in comment #26 are fixed up, this bug is done?
(In reply to Ben Hearsum [:bhearsum] from comment #35) 
> Just to be clear: once the slaves in comment #26 are fixed up, this bug is
> done?

Yes. If there are still slaves in that list that need IT intervention, we can file a follow-up bug.
I fixed up w32-ix-slave03 today.

Remaining list is:

talos-r3-w7-001      x    ??
talos-r3-w7-033      x    bad pw
talos-r3-w7-045      x
talos-r3-w7-053      x    no mouse
talos-r3-w7-ref      x
talos-r3-xp-045      x    ??
w32-ix-slave06       x
w32-ix-slave26       x
w32-ix-slave35       x
w64-ix-slave02       x
w64-ix-slave41       x
talos-r3-w764-NNN    x    not in slavealloc
Assignee: nobody → coop
Status: NEW → ASSIGNED
Priority: P3 → P2
Only slaves left are talos-r3-w7-ref (offline) and w64-ix-slave41 (Bug 683976).
Depends on: 683976
talos-r3-w7-ref was imaged earlier today, and should be accessible.   Did we miss getting this into the snapshot?
(In reply to Dustin J. Mitchell [:dustin] from comment #39)
> talos-r3-w7-ref was imaged earlier today, and should be accessible.   Did we
> miss getting this into the snapshot?

It's currently unpingable, so I can't tell. 

Many of the slaves in comment #37 were already done when I got them (but the bug wasn't updated - grrr), so I'm cautiously optimistic.
It should be pingable now - I forgot to renew its DHCP lease on the build network.
(In reply to Dustin J. Mitchell [:dustin] from comment #41)
> It should be pingable now - I forgot to renew its DHCP lease on the build
> network.

Yes, and buildbotve is already installed. \o/

Now just waiting on w64-ix-slave41.
Going to close this out.

Whenever Matt is done with w64-ix-slave41 in bug 683976, it will need to be re-imaged anyway, and the post-image steps will get buildbotve on the slave.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering