Closed
Bug 637347
Opened 14 years ago
Closed 13 years ago
deploy Buildbot-0.8.4-pre-moz1 to OPSI and manual buildslaves
Categories
(Release Engineering :: General, defect, P2)
Release Engineering
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: dustin, Assigned: coop)
References
()
Details
(Whiteboard: [buildbot][idleizer][buildduty])
Attachments
(4 files, 5 obsolete files)
981 bytes, patch | bhearsum: review+, dustin: checked-in+
2.43 KB, text/plain
6.93 KB, patch | bhearsum: review+, dustin: checked-in+
2.38 KB, text/plain
In parallel with bug 631851. This will bring a great bounty of improvements, including idleizer and better process handling.
Reporter | ||
Comment 1•14 years ago
|
||
Armen has identified the following related bugs:
https://bugzilla.mozilla.org/show_bug.cgi?id=558430
https://bugzilla.mozilla.org/show_bug.cgi?id=508672
https://bugzilla.mozilla.org/show_bug.cgi?id=590383
Reporter | ||
Comment 2•14 years ago
|
||
Well, this was rather easy. The attached script installs Buildbot in a virtualenv, using the vars at the top. It uses the puppet server for the packages, which I think makes more sense than manually duplicating them elsewhere.
I'd like to roll this into an OPSI package, and then run it by hand on w7, w64, and w764 boxes via the usual tinyurl trick. There are a few things blocking all that:
- bug 600736 - tweaks for process killing on Windows
- bug 665254 - fix idleizer
(and test it!)
- bug 650004 - more commits since 0.8.4-pre-moz1
- upgrade to buildbot-0.8.4p1, when it's released (tonight? tomorrow?)
but I'd like to know, in the interim, if this looks insane. If it doesn't look that bad, I may install this on some staging *build* slaves to see how well idleizer works.
Attachment #540202 -
Flags: review?(mlarrain)
Reporter | ||
Comment 3•14 years ago
|
||
Bug 600736 has gotten too twisty-turny, so I'm fixing the process killing in bug 666019.
Reporter | ||
Updated•14 years ago
|
Attachment #540202 -
Flags: review?(mlarrain)
Reporter | ||
Comment 4•14 years ago
|
||
With bug 666019 closed, this is ready to start deploying. I'd like to deploy by hand on dev/pp systems, and then hand-modify runslave.py to see if the updated version works. If so, I'll set up both with OPSI.
Comment 5•14 years ago
|
||
Comment on attachment 540202 [details]
install-buildbot.bat
SET python=d:\mozilla-build\python25\python.exe will not work on 64-bit systems, as we have removed the D:\ drive and moved its contents to C:\.
Attachment #540202 -
Flags: review-
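One possible way around the hard-coded drive letter, sketched in Python rather than batch: probe a list of candidate locations and use the first one that exists. Only the d:\ path is taken from the reviewed script; the c:\ path is an assumption based on the review comment, and find_python is a hypothetical helper, not part of the attachment.

```python
import os

# Candidate interpreter locations. Only the d:\ path appears in the reviewed
# script; the c:\ equivalent is an assumption drawn from the review comment.
CANDIDATES = [
    r"d:\mozilla-build\python25\python.exe",
    r"c:\mozilla-build\python25\python.exe",
]


def find_python(candidates=CANDIDATES, exists=os.path.isfile):
    """Return the first candidate path that exists, or None if none do."""
    for path in candidates:
        if exists(path):
            return path
    return None
```

The exists parameter is only there so the lookup can be exercised without touching the filesystem.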
Reporter | ||
Comment 6•14 years ago
|
||
I'm going to run this on w32-ix-slave01 to see how it works.
Attachment #540202 -
Attachment is obsolete: true
Reporter | ||
Comment 7•14 years ago
|
||
A better version that has run successfully on w32-ix-slave01. I'll try it on an XP test slave now.
Attachment #548299 -
Attachment is obsolete: true
Reporter | ||
Comment 8•14 years ago
|
||
That doesn't work on XP because Windows can't find PYTHON25.DLL after PYTHON.EXE is copied into C:\MOZILLA-BUILD\BUILDBOTVE\SCRIPTS. Copying PYTHON25.DLL into that directory works, and doesn't hurt on w32 builders.
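A minimal sketch of that workaround, assuming the DLL sits in some known source directory (the comment does not say where it is copied from, so python_dir is a placeholder, and copy_python_dll is a hypothetical helper rather than anything in the install script):

```python
import os
import shutil


def copy_python_dll(python_dir, scripts_dir, dll_name="python25.dll"):
    """Copy the interpreter DLL next to the virtualenv's python.exe.

    Returns the destination path, or None if the DLL is not in python_dir.
    """
    src = os.path.join(python_dir, dll_name)
    if not os.path.isfile(src):
        return None
    dst = os.path.join(scripts_dir, dll_name)
    shutil.copyfile(src, dst)
    return dst
```

Returning None instead of raising keeps the step harmless on machines where the DLL is not in the expected place.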
Reporter | ||
Comment 9•14 years ago
|
||
success so far on w32-ix-slaveNN, talos-r3-xp-NNN, and talos-r3-w7-NNN.
Attachment #548301 -
Attachment is obsolete: true
Reporter | ||
Comment 10•14 years ago
|
||
OK, I've had success with this on all platforms now (including t-r3-w764-NNN, which don't run runslave.py anyway, so it doesn't matter). Matt, do you see anything outrageously stupid here?
Attachment #548321 -
Attachment is obsolete: true
Attachment #548335 -
Flags: review?(mlarrain)
Reporter | ||
Comment 11•14 years ago
|
||
Corresponding changes to runslave.py to find Buildbot in the virtualenv.
I'll deploy this and buildbot.bat on staging slaves, and enable idleizer on them, and see how they behave. I'll list the affected slaves here.
Attachment #548338 -
Flags: review?
Reporter | ||
Updated•14 years ago
|
Attachment #548338 -
Flags: review? → review?(bhearsum)
Updated•14 years ago
|
Attachment #548338 -
Flags: review?(bhearsum) → review+
Reporter | ||
Updated•14 years ago
|
Attachment #548338 -
Flags: checked-in+
Reporter | ||
Comment 12•14 years ago
|
||
OK, bb084-pre-moz2 and the updated runslave.py are installed on
mw32-ix-slave01
mw32-ix-slave19
mw32-ix-slave21
talos-r3-w7-001
talos-r3-w7-002
talos-r3-w7-003
talos-r3-w7-010
talos-r3-xp-001
talos-r3-xp-002
talos-r3-xp-003
talos-r3-xp-010
w32-ix-slave01
w32-ix-slave34
win32-slave04
win32-slave10
win32-slave21
and idleizer is enabled via slavealloc. Let's see how those behave.
Reporter | ||
Comment 13•14 years ago
|
||
oops, talos-r3-w7-001 was a lie (it's down awaiting a reimage)
Reporter | ||
Comment 14•14 years ago
|
||
So I'm bumping this back into the releng queue - I need a definite thumbs-up from releng that this update is good before I do it. It will take a few (2?) solid days of VNC'ing to deploy this, and the rollback would be another 2 days, so it needs to be right the first time.
The main things I'm worried about are (a) problems killing processes and (b) idleizer failures that leave slaves wedged.
Assignee: dustin → nobody
Assignee | ||
Updated•13 years ago
|
Priority: P2 → P3
Whiteboard: [buildbot][idleizer]
Assignee | ||
Comment 15•13 years ago
|
||
(In reply to Dustin J. Mitchell [:dustin] from comment #14)
> The main things I'm worried about are (a) problems killing processes and (b)
> idleizer failures that leave slaves wedged.
This has been running on staging Windows slaves for a while now and we haven't seen anything abnormal, but frankly we haven't really been looking.
I signed up today to specifically look at some Windows slaves in staging and verify (a) and (b) as much as I realistically can.
Dustin has offered to do the deployment once we're confirmed to be in a good state. I will likely help him out to expedite the deployment.
Assignee: nobody → coop
Assignee | ||
Comment 16•13 years ago
|
||
FYI, I'm on the hook for the 7.0b2 release today/tomorrow, so there may be a slight delay here.
Assignee | ||
Comment 17•13 years ago
|
||
The slaves seem to be restarting correctly, but when they do, we see the following exception on the master:
2011-08-31 07:56:41-0700 [Broker,4,10.12.50.171] Peer will receive following PB traceback:
2011-08-31 07:56:41-0700 [Broker,4,10.12.50.171] Unhandled Error
Traceback (most recent call last):
  File "/builds/buildbot/coop/tests-master/lib/python2.6/site-packages/twisted/spread/banana.py", line 153, in gotItem
    self.callExpressionReceived(item)
  File "/builds/buildbot/coop/tests-master/lib/python2.6/site-packages/twisted/spread/banana.py", line 116, in callExpressionReceived
    self.expressionReceived(obj)
  File "/builds/buildbot/coop/tests-master/lib/python2.6/site-packages/twisted/spread/pb.py", line 514, in expressionReceived
    method(*sexp[1:])
  File "/builds/buildbot/coop/tests-master/lib/python2.6/site-packages/twisted/spread/pb.py", line 826, in proto_message
    self._recvMessage(self.localObjectForID, requestID, objectID, message, answerRequired, netArgs, netKw)
--- <exception caught here> ---
  File "/builds/buildbot/coop/tests-master/lib/python2.6/site-packages/twisted/spread/pb.py", line 840, in _recvMessage
    netResult = object.remoteMessageReceived(self, message, netArgs, netKw)
  File "/builds/buildbot/coop/tests-master/lib/python2.6/site-packages/twisted/spread/pb.py", line 223, in perspectiveMessageReceived
    method = getattr(self, "perspective_%s" % message)
exceptions.AttributeError: BuildSlave instance has no attribute 'perspective_shutdown'
Is this expected?
Comment 18•13 years ago
|
||
that's usually a symptom of a too-old version of buildbot. 10.12.50.171 is talos-r3-w7-010
Assignee | ||
Comment 19•13 years ago
|
||
(In reply to Chris AtLee [:catlee] from comment #18)
> that's usually a symptom of a too-old version of buildbot. 10.12.50.171 is
> talos-r3-w7-010
Too old, or just out-of-sync? We're actually trying to use a *newer* version here on the slaves.
I'm using talos-r3-w7-010 and talos-r3-xp-003 for testing, so getting log data from those slaves is expected, but both are reporting in to my test master (http://dev-master01.build.scl1.mozilla.com:8045/buildslaves) as running 0.8.4-pre-moz2.
Reporter | ||
Comment 20•13 years ago
|
||
It's expected - it's the master that's too old (and idleizer works around that).
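For anyone reading the traceback later: the master looks up remote calls by name with getattr(self, "perspective_%s" % message), so a master that predates the shutdown message raises AttributeError. Below is a toy model of that lookup with a caller that falls back when the method is missing. This is not Twisted or idleizer code. OldMaster, dispatch, and request_shutdown are hypothetical names, and in real PB the error happens on the remote side and comes back as a failure rather than a local exception.

```python
class OldMaster(object):
    """Stand-in for a master that predates the shutdown message."""

    def perspective_keepalive(self):
        return "ok"


def dispatch(perspective, message, *args):
    # Same lookup pattern as the getattr line quoted in the traceback.
    method = getattr(perspective, "perspective_%s" % message)
    return method(*args)


def request_shutdown(perspective, fallback):
    """Ask the master for a graceful shutdown; use fallback if unsupported."""
    try:
        return dispatch(perspective, "shutdown")
    except AttributeError:
        # The master does not know this message; act locally instead.
        return fallback()
```

The point is only that a missing perspective_ method is a recoverable condition for the caller, which matches the observation that the exception is noisy but harmless.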
Assignee | ||
Comment 21•13 years ago
|
||
(In reply to Dustin J. Mitchell [:dustin] from comment #20)
> It's expected - it's the master that's too old (and idleizer works around
> that).
OK, then it looks like Idleizer is working as expected. I see the Windows slaves being rebooted every 7 hours when idle. I didn't see any hangs on my slaves, so either there weren't any hangs or Idleizer was able to reboot the machines anyway.
Dustin: I think we're ready to deploy this to production. When are you available for this, keeping in mind that we're still in chemspill mode for today and (possibly) tomorrow? Do you need/want some help to speed the process along?
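The idle-reboot behaviour observed here (a reboot after 7 hours with no work) can be modelled with a simple activity timestamp. This is a sketch under that assumption only, not idleizer's actual implementation; IdleTracker and its method names are hypothetical.

```python
import time

IDLE_LIMIT = 7 * 60 * 60  # seconds; the 7-hour figure is from this comment


class IdleTracker(object):
    """Track time since the last build activity (hypothetical helper)."""

    def __init__(self, limit=IDLE_LIMIT, clock=time.time):
        self.limit = limit
        self.clock = clock
        self.last_activity = clock()

    def mark_activity(self):
        # Call whenever a build starts or finishes.
        self.last_activity = self.clock()

    def should_reboot(self):
        # True once the slave has been idle for at least the limit.
        return self.clock() - self.last_activity >= self.limit
```

Passing the clock in makes the threshold easy to check without waiting seven hours.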
Reporter | ||
Comment 22•13 years ago
|
||
I'll work on it and let you know if I need help.
Assignee: coop → dustin
Reporter | ||
Comment 23•13 years ago
|
||
this version also installs runslave.py.
Attachment #548335 -
Attachment is obsolete: true
Attachment #548335 -
Flags: review?(mlarrain)
Reporter | ||
Comment 24•13 years ago
|
||
OK! Aside from the machines listed below, all Windows machines are now running 0.8.4-pre-moz2. However, note that I have not *enabled* idleizer on any of them.
I need to find a time when these aren't running tests to install the new runslave.py via VNC:
talos-r3-w7-013
talos-r3-w7-014
talos-r3-w7-015
talos-r3-w7-019
talos-r3-w7-030
talos-r3-w7-031
talos-r3-w7-035
talos-r3-w7-039
talos-r3-w7-040
talos-r3-w7-042
talos-r3-w7-043
talos-r3-w7-044
talos-r3-w7-056
talos-r3-w7-060
talos-r3-w7-062
The following are inaccessible in various ways and will need releng TLC:
w32-ix-slave03 (bad password)
w32-ix-slave06
w32-ix-slave08
w32-ix-slave26
w32-ix-slave35
w32-ix-slave41
talos-r3-w7-033 (bad password)
talos-r3-w7-045
talos-r3-w7-053 (no mouse)
Reporter | ||
Comment 25•13 years ago
|
||
Also need TLC:
talos-r3-w7-001
talos-r3-xp-045
and as a reminder to myself, I need to
* fix the ref machines
* remove buildbot from OPSI
Reporter | ||
Comment 26•13 years ago
|
||
Update:
* 0.8.4-pre-moz2 installed on all accessible machines, including ref
Need TLC from releng: (some of these are known to be down)
talos-r3-w7-001 x ??
talos-r3-w7-033 x bad pw
talos-r3-w7-045 x
talos-r3-w7-053 x no mouse
talos-r3-w7-ref x
talos-r3-xp-045 x ??
w32-ix-slave03 x bad pw
w32-ix-slave06 x
w32-ix-slave26 x
w32-ix-slave35 x
w64-ix-slave02 x
w64-ix-slave41 x
talos-r3-w764-NNN x - not in slavealloc
TODO:
* remove buildbot from OPSI
* document on refimage pages
Assignee | ||
Comment 27•13 years ago
|
||
(In reply to Dustin J. Mitchell [:dustin] from comment #20)
> It's expected - it's the master that's too old (and idleizer works around
> that).
dustin/catlee: anything preventing us from upgrading buildbot on the masters sometime soon? We'll be seeing those exceptions in the email exception reports until we do.
Reporter | ||
Comment 28•13 years ago
|
||
refimage docs are updated:
https://wiki.mozilla.org/index.php?title=ReferencePlatforms%2FWin32&action=historysubmit&diff=345808&oldid=316970
https://wiki.mozilla.org/index.php?title=ReferencePlatforms%2FWin64&action=historysubmit&diff=345810&oldid=341519
https://wiki.mozilla.org/index.php?title=ReferencePlatforms%2FTest%2FWin7_64-bit&action=historysubmit&diff=345813&oldid=339225
https://wiki.mozilla.org/index.php?title=ReferencePlatforms%2FTest%2FWinXP&action=historysubmit&diff=345811&oldid=339232
https://wiki.mozilla.org/index.php?title=ReferencePlatforms%2FTest%2FWin7&action=historysubmit&diff=345814&oldid=341444
I'll file separate bugs to re-up the snapshots (there are a few outstanding already, IIRC).
Reporter | ||
Comment 29•13 years ago
|
||
Remove the packages from OPSI.
I think that the steps on the opsi master would be:
hg up
rm -rf ~cltbld/opsi-packages/buildbot-{tip,production}
anything else?
Attachment #557520 -
Flags: review?(bhearsum)
Reporter | ||
Comment 30•13 years ago
|
||
(In reply to Chris Cooper [:coop] from comment #27)
> dustin/catlee: anything preventing us from upgrading buildbot on the masters
> sometime soon? We'll be seeing those exceptions in the email exception
> reports until we do.
That's a totally different, and huge, project, but should happen.
Comment 31•13 years ago
|
||
This differs from the other patch only in that it uses runas to wget runslave.py:
runas /user:administrator "%mozillabuild%\wget\wget -OC:\runslave.py http://hg.mozilla.org/build/puppet-manifests/raw-file/tip/modules/buildslave/files/runslave.py"
On Windows 7 this version would not work.
Reporter | ||
Comment 32•13 years ago
|
||
OK, aside from OPSI changes in attachment 557520 [details] [diff] [review], this is done - over to releng for the slave cleanup. When I get an r+ I'll land the OPSI changes.
Assignee: dustin → nobody
Updated•13 years ago
|
Attachment #557520 -
Flags: review?(bhearsum) → review+
Reporter | ||
Updated•13 years ago
|
Attachment #557520 -
Flags: checked-in+
Reporter | ||
Comment 33•13 years ago
|
||
btw, I had to run
opsi-package-manager -r buildbot-tip
opsi-package-manager -r buildbot-production
too.
Assignee | ||
Comment 34•13 years ago
|
||
List of slaves that require attention is in comment #26.
Whiteboard: [buildbot][idleizer] → [buildbot][idleizer][buildduty]
Comment 35•13 years ago
|
||
(In reply to Chris Cooper [:coop] from comment #34)
> List of slaves that require attention is in comment #26.
Just to be clear: once the slaves in comment #26 are fixed up, this bug is done?
Assignee | ||
Comment 36•13 years ago
|
||
(In reply to Ben Hearsum [:bhearsum] from comment #35)
> Just to be clear: once the slaves in comment #26 are fixed up, this bug is
> done?
Yes. If there are still slaves in that list that need IT intervention, we can file a follow-up bug.
Assignee | ||
Comment 37•13 years ago
|
||
I fixed up w32-ix-slave03 today.
Remaining list is:
talos-r3-w7-001 x ??
talos-r3-w7-033 x bad pw
talos-r3-w7-045 x
talos-r3-w7-053 x no mouse
talos-r3-w7-ref x
talos-r3-xp-045 x ??
w32-ix-slave06 x
w32-ix-slave26 x
w32-ix-slave35 x
w64-ix-slave02 x
w64-ix-slave41 x
talos-r3-w764-NNN x not in slavealloc
Assignee | ||
Updated•13 years ago
|
Assignee: nobody → coop
Status: NEW → ASSIGNED
Priority: P3 → P2
Assignee | ||
Comment 38•13 years ago
|
||
Only slaves left are talos-r3-w7-ref (offline) and w64-ix-slave41 (Bug 683976).
Reporter | ||
Comment 39•13 years ago
|
||
talos-r3-w7-ref was imaged earlier today, and should be accessible. Did we miss getting this into the snapshot?
Assignee | ||
Comment 40•13 years ago
|
||
(In reply to Dustin J. Mitchell [:dustin] from comment #39)
> talos-r3-w7-ref was imaged earlier today, and should be accessible. Did we
> miss getting this into the snapshot?
It's currently unpingable, so I can't tell.
Many of the slaves in comment #37 were already done when I got them (but the bug wasn't updated - grrr), so I'm cautiously optimistic.
Reporter | ||
Comment 41•13 years ago
|
||
It should be pingable now - I forgot to renew its DHCP lease on the build network.
Assignee | ||
Comment 42•13 years ago
|
||
(In reply to Dustin J. Mitchell [:dustin] from comment #41)
> It should be pingable now - I forgot to renew its DHCP lease on the build
> network.
Yes, and buildbotve is already installed. \o/
Now just waiting on w64-ix-slave41.
Assignee | ||
Comment 43•13 years ago
|
||
Going to close this out.
Whenever Matt is done with w64-ix-slave41 in bug 683976, it will need to be re-imaged anyway, and the post-image steps will get buildbotve on the slave.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated•12 years ago
|
Product: mozilla.org → Release Engineering