All but one moz2-win32-* build slave have no running python process

RESOLVED FIXED

Status

Release Engineering
General
P1
blocker
RESOLVED FIXED
9 years ago
5 years ago

People

(Reporter: philor, Assigned: nthomas)

Details

(Reporter)

Description

9 years ago
According to https://build.mozilla.org/mozilla-central-nightly/buildslaves there are 59 Windows slaves, of which 58 are disconnected. That's gone surprisingly well on a very slow Sunday, but since the waterfall says we're already a build or two behind on various trees, and Europe will be waking up and starting to push things soon, I'm closing mozilla-central and mozilla-1.9.2.

(I pinged IT's oncall, but apparently at some point since the days when I constantly filed bugs against them for tinderboxes, they stopped being in charge of frontline support for them.)
fox2mike rebooted moz2-win32-slave26, and it seems to have recovered fine.

Updated

9 years ago
Summary: All but one moz2-win32-* build slave is disconnected → All but one moz2-win32-* build slave have no running python process
Sample nagios alert in #build :

< nagios> [78] moz2-win32-slave46.build:buildbot is CRITICAL: CRITICAL: python.exe: stopped (critical)

I've restarted moz2-win32-slave26 through moz2-win32-slave36.
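
For anyone who wants to build the list of affected slaves from the channel log rather than by eye, here is a minimal sketch. It assumes every alert follows the format of the sample line above; the regular expression and the function name `stopped_buildbot_hosts` are illustrative and not part of any nagios or buildbot tooling.

```python
import re

# Matches the alert format of the sample line above. This is an assumption:
# other nagios setups may format their IRC messages differently.
ALERT_RE = re.compile(
    r"\[(?P<id>\d+)\]\s+(?P<host>[\w.-]+):(?P<service>\w+) is (?P<state>[A-Z]+):"
)


def stopped_buildbot_hosts(lines):
    """Return the sorted, de-duplicated hosts with a CRITICAL buildbot alert."""
    hosts = set()
    for line in lines:
        match = ALERT_RE.search(line)
        if not match:
            continue  # not an alert line in the expected format
        if match.group("service") == "buildbot" and match.group("state") == "CRITICAL":
            hosts.add(match.group("host"))
    return sorted(hosts)


if __name__ == "__main__":
    sample = (
        "< nagios> [78] moz2-win32-slave46.build:buildbot is CRITICAL: "
        "CRITICAL: python.exe: stopped (critical)"
    )
    for host in stopped_buildbot_hosts([sample]):
        print(host)
```

The sketch only collects CRITICAL alerts; it does not track later recoveries, so a host that came back on its own would still be listed.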
(Reporter)

Comment 3

9 years ago
Sadly, it looks like only six of them actually came back up ready to work, and we're up to 12 pending builds, since "closed" doesn't mean closed; it only means closed unless you really want to push.
(Assignee)

Comment 4

9 years ago
It looks like these machines are waiting for a Control-Alt-Delete to get the Windows logon prompt. They shouldn't be getting into that state at all, so this is possibly related to OPSI failing.

Rebooting does help, so I'll work on getting these machines back in action.
Assignee: nobody → nthomas
Priority: -- → P1
Thanks Nick. I was looking for you earlier on IRC.
(Assignee)

Comment 6

9 years ago
OK, I am strongly of the opinion that production-opsi was at fault here, so I've rebooted that VM. And then rebooted all the VMs that nagios reported were not running buildbot. Everything there is now happy, so I've reopened the Firefox and Firefox3.6 trees.
(Assignee)

Comment 7

9 years ago
(In reply to comment #6)
> OK, I am strongly of the opinion that production-opsi was at fault here, 

The case against production-opsi
* A very large proportion of moz2-win32-slaveNN were affected, but not the staging slaves 03, 04, and 17 which use a different opsi server
* The munin monitoring detected several periods, lasting 45-60 minutes, where system and iowait load together totaled 100%, which suggests some sort of I/O hang or failure state. I would guess either the underlying VM or SMB.
* Looking at nagios, the slaves where buildbot failed most recently correspond to periods where production-opsi was in this state

FWIW, the system/iowait load-spikes started about midnight on Oct 9th, immediately following a break in the munin data.
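
The correlation in the last bullet can be checked mechanically. The following is a rough sketch only: it assumes the overload periods have been read off the munin graphs as (start, end) pairs and the most recent buildbot failure time per slave has been read off nagios. The function names and all example values are hypothetical placeholders, not data from this incident.

```python
from datetime import datetime


def in_any_window(ts, windows):
    """True if ts falls inside at least one (start, end) window, inclusive."""
    return any(start <= ts <= end for start, end in windows)


def split_by_overload(failures, windows):
    """Partition {host: failure_time} into hosts that failed during an
    overload window and hosts that failed outside all of them."""
    during, outside = [], []
    for host, ts in sorted(failures.items()):
        (during if in_any_window(ts, windows) else outside).append(host)
    return during, outside


if __name__ == "__main__":
    # Placeholder values for illustration only; substitute the real
    # munin periods and nagios failure times.
    windows = [(datetime(2000, 1, 1, 0, 0), datetime(2000, 1, 1, 0, 50))]
    failures = {
        "slave-a": datetime(2000, 1, 1, 0, 20),
        "slave-b": datetime(2000, 1, 1, 3, 0),
    }
    during, outside = split_by_overload(failures, windows)
    print("failed during overload:", during)
    print("failed outside overload:", outside)
```

If most hosts land in the first list, that supports the case against production-opsi; a large second list would point elsewhere.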
Status: NEW → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → FIXED
(In reply to comment #0)
> According to https://build.mozilla.org/mozilla-central-nightly/buildslaves
> there are 59 Windows slaves, of which 58 are disconnected. That's gone
> surprisingly well on a very slow Sunday, but since the waterfall says we're
> already a build or two behind on various trees, and Europe will be waking up
> and starting to push things soon, I'm closing mozilla-central and
> mozilla-1.9.2.

Just for the record, not all of the slaves should be connected to that master. Many of them are connected to our second production-master machine, which isn't proxied anywhere.

I've got a couple of threads going on the OPSI forum about these issues:
https://forum.opsi.org/viewtopic.php?f=8&t=994
https://forum.opsi.org/viewtopic.php?f=8&t=939&p=4991#p4991
Product: mozilla.org → Release Engineering