According to https://build.mozilla.org/mozilla-central-nightly/buildslaves there are 59 Windows slaves, of which 58 are disconnected. That's gone surprisingly well on a very slow Sunday, but since the waterfall says we're already a build or two behind on various trees, and Europe will be waking up and starting to push things soon, I'm closing mozilla-central and mozilla-1.9.2. (I pinged IT's oncall, but apparently at some point since I used to constantly file bugs against them for tinderboxes, they've stopped being so in charge of doing frontline support on them.)
fox2mike rebooted moz2-win32-slave26, and it seems to have recovered fine.
Summary: All but one moz2-win32-* build slave is disconnected → All but one moz2-win32-* build slave has no running python process
Sample nagios alert in #build:

< nagios> moz2-win32-slave46.build:buildbot is CRITICAL: CRITICAL: python.exe: stopped (critical)

I've restarted moz2-win32-slave26 through 36.
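The alert above is essentially nagios asking whether a named process (python.exe, i.e. the buildbot slave) appears in the host's process table, and flagging CRITICAL when it doesn't. A minimal sketch of that kind of check, assuming tasklist-style listing output where the process name is the first column (the function name and sample data here are illustrative, not the actual nagios plugin):

```python
def check_process(process_list_output, process_name):
    """Return a nagios-style (status, message) tuple: "OK" if the named
    process appears in the process-listing output, "CRITICAL" otherwise.
    Illustrative sketch only, not the real nagios buildbot check."""
    # Take the first whitespace-separated field of each non-blank line
    # as the process name (tasklist-style output is assumed).
    names = [line.split()[0]
             for line in process_list_output.splitlines()
             if line.strip()]
    if process_name in names:
        return ("OK", "%s: running" % process_name)
    return ("CRITICAL", "%s: stopped (critical)" % process_name)

# Fabricated process listing with no python.exe, matching the failure mode
# in this bug:
sample = "smss.exe 400\ncsrss.exe 476\nexplorer.exe 1234"
print(check_process(sample, "python.exe"))
```

With that sample input the check reports CRITICAL, mirroring the "python.exe: stopped (critical)" message in the alert.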
Sadly, it looks like only six of them actually came back up ready to work, and we're up to 12 pending builds, since "closed" doesn't really mean closed; it only means closed unless you really want to push.
It looks like these machines are waiting for a Control-Alt-Delete to get the Windows logon prompt. They shouldn't be getting into that state at all, so this is possibly related to a failing OPSI. Rebooting does help, so I'll work on getting these machines back in action.
Assignee: nobody → nthomas
Priority: -- → P1
Thanks Nick. I was looking for you earlier on IRC.
OK, I am strongly of the opinion that production-opsi was at fault here, so I've rebooted that VM. And then rebooted all the VMs that nagios reported were not running buildbot. Everything there is now happy, so I've reopened the Firefox and Firefox3.6 trees.
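The "VMs that nagios reported were not running buildbot" are just the hosts named in alerts like the sample earlier in this bug. A quick sketch of turning a log of those alerts into a reboot list (the alert format is assumed from that sample, and the regex and function name are illustrative):

```python
import re

# Assumed alert format, based on the sample in this bug:
# < nagios> moz2-win32-slave46.build:buildbot is CRITICAL: ...
ALERT_RE = re.compile(r"(moz2-win32-slave\d+)\.build:buildbot is CRITICAL")

def hosts_to_reboot(irc_log):
    """Collect the unique slave names from nagios CRITICAL alerts,
    preserving first-seen order."""
    seen = []
    for match in ALERT_RE.finditer(irc_log):
        host = match.group(1)
        if host not in seen:
            seen.append(host)
    return seen

log = ("< nagios> moz2-win32-slave46.build:buildbot is CRITICAL: ...\n"
       "< nagios> moz2-win32-slave26.build:buildbot is CRITICAL: ...\n"
       "< nagios> moz2-win32-slave46.build:buildbot is CRITICAL: ...\n")
print(hosts_to_reboot(log))  # ['moz2-win32-slave46', 'moz2-win32-slave26']
```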
(In reply to comment #6)
> OK, I am strongly of the opinion that production-opsi was at fault here,

The case against production-opsi:
* A very large proportion of moz2-win32-slaveNN were affected, but not the staging slaves 03, 04, and 17, which use a different opsi server
* The munin monitoring detected several periods, lasting 45-60 minutes, where there was system and iowait load totaling 100%, which suggests some sort of I/O hang or fail state. Either the underlying VM or SMB, I would guess.
* Looking at nagios, the slaves where buildbot failed most recently correspond to periods where production-opsi was in this state

FWIW, the system/iowait load spikes started about midnight on Oct 9th, immediately following a break in the munin data.
Status: NEW → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → FIXED
(In reply to comment #0)
> According to https://build.mozilla.org/mozilla-central-nightly/buildslaves
> there are 59 Windows slaves, of which 58 are disconnected. That's gone
> surprisingly well on a very slow Sunday, but since the waterfall says we're
> already a build or two behind on various trees, and Europe will be waking up
> and starting to push things soon, I'm closing mozilla-central and
> mozilla-1.9.2.

Just for the record, not all of the slaves should be connected to that master. Many of them are connected to our second production-master machine, which isn't proxied anywhere.

I've got a couple of threads going on the OPSI forum about these issues:
https://forum.opsi.org/viewtopic.php?f=8&t=994
https://forum.opsi.org/viewtopic.php?f=8&t=939&p=4991#p4991
Product: mozilla.org → Release Engineering