Closed
Bug 521722
Opened 16 years ago
Closed 16 years ago
All but one moz2-win32-* build slave have no running python process
Categories
(Release Engineering :: General, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: philor, Assigned: nthomas)
Details
According to https://build.mozilla.org/mozilla-central-nightly/buildslaves there are 59 Windows slaves, of which 58 are disconnected. That's gone surprisingly well on a very slow Sunday, but since the waterfall says we're already a build or two behind on various trees, and Europe will be waking up and starting to push things soon, I'm closing mozilla-central and mozilla-1.9.2.
(I pinged IT's oncall, but apparently at some point since the days when I constantly filed bugs against them for tinderboxes, they've stopped being in charge of frontline support on them.)
Comment 1•16 years ago
fox2mike rebooted moz2-win32-slave26, and it seems to have recovered fine.
Updated•16 years ago
Summary: All but one moz2-win32-* build slave is disconnected → All but one moz2-win32-* build slave have no running python process
Comment 2•16 years ago
Sample nagios alert in #build:
< nagios> [78] moz2-win32-slave46.build:buildbot is CRITICAL: CRITICAL: python.exe: stopped (critical)
I've restarted moz2-win32-slave26 through to 36
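For anyone working from the channel log rather than the nagios UI, here is a rough sketch of pulling the affected slave names out of alerts like the one above. The regular expression is modelled only on that single sample line, and `stopped_slaves` is a hypothetical helper name, not part of nagios or buildbot; other alert shapes may need a different pattern.

```python
import re

# Pattern modelled on the one sample alert quoted above (an assumption:
# other alert formats in the channel may not match it).
ALERT_RE = re.compile(
    r"(?P<host>moz2-win32-slave\d+)\.build:buildbot is CRITICAL:.*python\.exe: stopped"
)


def stopped_slaves(log_lines):
    """Return the unique slave names whose alert reports python.exe stopped,
    ordered by slave number."""
    hosts = set()
    for line in log_lines:
        match = ALERT_RE.search(line)
        if match:
            hosts.add(match.group("host"))
    # Sort numerically on the trailing slave number so slave9 precedes slave10.
    return sorted(hosts, key=lambda h: int(re.search(r"\d+$", h).group()))
```

Feeding it the lines of an IRC log gives a list of hosts to reboot, with repeated alerts for the same slave collapsed to one entry.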
| Reporter | ||
Comment 3•16 years ago
Sadly, looks like only six of them actually came back up ready to work, and we're up to 12 pending builds since closed doesn't mean closed, it only means closed unless you really want to push.
| Assignee | ||
Comment 4•16 years ago
It looks like these machines are waiting for a Control-Alt-Delete to get the Windows logon prompt. They shouldn't be getting into that state at all, so this is possibly related to failing OPSI.
Rebooting does help, so I'll work on getting these machines back in action.
Assignee: nobody → nthomas
Priority: -- → P1
Comment 5•16 years ago
Thanks Nick. I was looking for you earlier on IRC.
| Assignee | ||
Comment 6•16 years ago
OK, I am strongly of the opinion that production-opsi was at fault here, so I've rebooted that VM, and then rebooted all the VMs that nagios reported were not running buildbot. Everything there is now happy, so I've reopened the Firefox and Firefox3.6 trees.
| Assignee | ||
Comment 7•16 years ago
(In reply to comment #6)
> OK, I am strongly of the opinion that production-opsi was at fault here,
The case against production-opsi
* A very large proportion of moz2-win32-slaveNN were affected, but not the staging slaves 03, 04, and 17 which use a different opsi server
* The munin monitoring detected several periods, lasting 45-60 minutes, where system and iowait load totaled 100%, which suggests some sort of I/O hang or failure state. I would guess it's in either the underlying VM or SMB.
* Looking at nagios, the slaves where buildbot failed most recently correspond to periods where production-opsi was in this state
FWIW, the system/iowait load-spikes started about midnight on Oct 9th, immediately following a break in the munin data.
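The third point above amounts to a simple interval check, which could be sketched roughly as follows. This assumes the most recent buildbot failure time per slave (from nagios) and the overloaded periods (from munin) have already been extracted by hand into Python values; `split_by_overlap` and both input shapes are hypothetical, not anything nagios or munin provide.

```python
from datetime import datetime


def in_any_window(timestamp, windows):
    """True if timestamp falls inside at least one (start, end) window."""
    return any(start <= timestamp <= end for start, end in windows)


def split_by_overlap(failures, windows):
    """Split slaves by whether their last failure fell in an overloaded period.

    failures: dict of slave name -> datetime of its most recent failure
              (hypothetical shape, transcribed from nagios).
    windows:  list of (start, end) datetimes where system + iowait load
              totaled 100% (hypothetical shape, transcribed from munin).
    Returns (inside, outside) as two sorted lists of slave names.
    """
    inside = sorted(h for h, t in failures.items() if in_any_window(t, windows))
    outside = sorted(h for h, t in failures.items() if not in_any_window(t, windows))
    return inside, outside
```

A long `inside` list and a short `outside` list would support the correlation; it does not by itself show which way the causation runs.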
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Comment 8•16 years ago
(In reply to comment #0)
> According to https://build.mozilla.org/mozilla-central-nightly/buildslaves
> there are 59 Windows slaves, of which 58 are disconnected. That's gone
> surprisingly well on a very slow Sunday, but since the waterfall says we're
> already a build or two behind on various trees, and Europe will be waking up
> and starting to push things soon, I'm closing mozilla-central and
> mozilla-1.9.2.
Just for the record, not all of the slaves should be connected to that master. Many of them are connected to our second production-master machine, which isn't proxied anywhere.
I've got a couple of threads going on the OPSI forum about these issues:
https://forum.opsi.org/viewtopic.php?f=8&t=994
https://forum.opsi.org/viewtopic.php?f=8&t=939&p=4991#p4991
Updated•12 years ago
Product: mozilla.org → Release Engineering