Closed Bug 476677 Opened 16 years ago Closed 16 years ago

build slaves (win32 in particular) disconnecting at an alarming rate

Categories
(Release Engineering :: General, defect, P3)
Hardware: x86
OS: All
Type: defect

Tracking
(Not tracked)
Status: RESOLVED FIXED

People
(Reporter: bhearsum, Unassigned)

Not long ago we had nearly every single win32 production slave disconnect from its master, interrupting many builds.
Priority: -- → P1
bm-xserve18 has also been cycling, though less often than the windows slaves
Summary: win32 slaves disconnecting at an alarming rate → build slaves (win32 in particular) disconnecting at an alarming rate
I'm setting keepalive=None in buildbot.tac on staging slaves right now to test it as a possible fix...
OK, keepalive has been set to None on the following slaves:
moz2-linux-slave03, 04, 17
moz2-win32-slave03, 04, 21
moz2-darwin9-slave03, 04
moz2-linux64-slave01
I'll be watching these machines for disconnects. If it protects these slaves in the next round of failure we can probably roll this out.
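For context, a minimal sketch of what this change looks like in a Buildbot 0.7-era slave buildbot.tac (the host, port, basedir and password below are placeholders, not the real production values):

    from twisted.application import service
    from buildbot.slave.bot import BuildSlave   # 0.7.x slave class

    basedir = r'c:\builds\moz2_slave'                    # placeholder
    buildmaster_host = 'production-master.example.org'   # placeholder
    port = 9989                                          # placeholder
    slavename = 'moz2-win32-slave03'
    passwd = 'XXXXXXXX'                                  # placeholder
    keepalive = None   # was 600; None disables the periodic keepalive pings
    usepty = 1
    umask = None

    application = service.Application('buildslave')
    s = BuildSlave(buildmaster_host, port, slavename, passwd, basedir,
                   keepalive, usepty, umask=umask)
    s.setServiceParent(application)

With keepalive set to None the slave never starts the ping/timeout cycle, so a busy slave can no longer decide on its own that the master is unreachable.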
FWIW, talos has similar problems, mainly on windows:
qm-pxp-talos03
qm-plinux-trunk07
qm-pxp-fast04
qm-pleopard-trunk04
qm-mini-vista01
Haven't seen any disconnects _anywhere_ since about 11:10am PST. Hard to tell if this has helped; will keep watching.
qm-plinux-talos04 just cycled, and qm-pxp-talos02 cycled at noon PST
Hey, so, this bug is keeping the trunk closed, does it need to?
I don't think it needs to hold anything closed right now. I'm still trying to track down the source of the problem, but it appears to have subsided for now.
Priority: P1 → P2
OK, re-opened the tree, but asked people to comment here if it starts up again.
Priority: P2 → P1
OK, there were two rounds of disconnects last night affecting the main build pool, and possibly others. The staging slaves we were testing a fix on (see comment #2) were completely unaffected. catlee and I will be rolling this out on the rest of the build slaves shortly. This shouldn't require any downtime; we'll do our best not to interrupt any running builds.
Alright, we've deployed this on all but one of the build slaves. I'm waiting for a build to finish on moz2-win32-slave19, and then I can restart it for this change.

Coincidentally (and usefully) we had a situation today that normally causes slave disconnects. We experienced exactly zero disconnections during it, so I'm very confident this has fixed *that* problem.

There is still an underlying problem of why build slaves are unable to respond to keepalive pings, which could be related to load on ESX hosts or storage arrays, or be something completely different. I won't be investigating this in more depth right now, so I'm throwing this bug back in the pool.
Assignee: bhearsum → nobody
Status: ASSIGNED → NEW
Priority: P1 → P3
(In reply to comment #11)
> There is still an underlying problem of why build slaves are unable to respond
> to keepalive pings, which could be related to load on ESX hosts or storage
> arrays, or be something completely different.

Gozer hit the exact same problem with his slaves on MoMo infrastructure, so I doubt it's something specific to MoCo hosts/arrays. I guess it *might* be something that both MoMo and MoCo use, like a design bug in how buildbot slaves handle keepalive messages, for example?

From email with gozer, bhearsum, joduinn, and others earlier this week:

[snip]
Basically, like I said on the call, every 600 seconds the slave sends a ping request to the master, and if it doesn't get it back 30 seconds later, the slave assumes the master is busted and tries to fix itself by disconnecting/reconnecting to the master.

I am not 100% certain what's going on here (my python/twisted foo is too weak), but my theory is that a very busy slave just might not be fast enough to process the ping it gets back from the master within 30 seconds. Looking through the logs for 'nothing from master', I've found tons of such messages, sometimes with very large values (> 300 seconds).

Since then, I've disabled keepAliveInterval completely on most builders, and the problem has yet to reappear.
[snip]

> I won't be investigating this in more depth right now, so I'm throwing this bug
> back in the pool.

OK, but what's left to do here? Given the MoMo experience above, maybe the next step is to investigate whether there's a timing bug in how buildbot slaves handle keepalive messages while they are also busy doing work?
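To make the timing gozer describes concrete, here is a rough sketch of the slave-side behaviour. This is illustrative Python, not Buildbot's actual implementation; send_ping and seconds_since_last_reply are hypothetical stand-ins for the real Twisted plumbing:

    import time

    KEEPALIVE_INTERVAL = 600   # keepalive setting: seconds between pings
    KEEPALIVE_TIMEOUT = 30     # how long the slave waits for the reply

    def keepalive_loop(send_ping, seconds_since_last_reply):
        """Simplified model: every 600s the slave pings the master; if it
        hasn't processed a reply within 30s it assumes the connection is
        dead and reconnects."""
        while True:
            time.sleep(KEEPALIVE_INTERVAL)
            send_ping()
            time.sleep(KEEPALIVE_TIMEOUT)
            if seconds_since_last_reply() > KEEPALIVE_TIMEOUT:
                # On a heavily loaded slave the reply may have arrived but
                # not yet been processed, so this fires spuriously and the
                # slave disconnects/reconnects -- which is exactly what the
                # masters observe.
                raise ConnectionError("nothing from master, reconnecting")

With the keepalive disabled this check never runs, which matches what both MoMo and MoCo saw once the setting was turned off.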
Component: Release Engineering → Release Engineering: Future
OS: Mac OS X → All
Since Feb 4, 12:00pm, we've had the following disconnections on production-master:

2009-02-04 12:21:00 moz2-darwin9-slave02 disconnected for 60 seconds
2009-02-04 12:36:00 moz2-win32-slave18 disconnected for 0 seconds
2009-02-04 14:51:00 moz2-win32-slave14 disconnected for 0 seconds
2009-02-04 15:06:00 moz2-win32-slave10 disconnected for 60 seconds
2009-02-04 15:55:00 moz2-win32-slave07 disconnected for 0 seconds
2009-02-04 19:08:00 moz2-linux64-slave01 disconnected for 60 seconds
2009-02-04 23:28:00 bm-xserve19 disconnected for 0 seconds

bm-xserve19 looks like it was rebooted around 21:30 last night, and buildbot was started at 23:28. This disconnection seems unrelated to the keepalive problem. moz2-linux64-slave01 looks like it did lose its connection to the master. I haven't yet examined the windows machines.
(In reply to comment #15)
> 2009-02-04 12:21:00 moz2-darwin9-slave02 disconnected for 60 seconds
> 2009-02-04 12:36:00 moz2-win32-slave18 disconnected for 0 seconds

These were disconnected on purpose to make the keepalive change.
The current hypothesis is that buildbot calls fsync() after writing log files, which could block the whole OS until all dirty files are flushed to disk, resulting in a spike in load and unresponsiveness.
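As a rough illustration of the suspected pattern (this is not Buildbot's actual logging code, just a sketch of the write-then-fsync behaviour being hypothesized):

    import os

    def append_log_chunk(path, data):
        """Append a chunk of log data and fsync it. The fsync blocks until
        the kernel reports the data is on disk; under heavy I/O (shared
        storage, busy ESX host) that flush can take long enough to starve
        the keepalive handling described above."""
        fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
        try:
            os.write(fd, data)
            os.fsync(fd)   # the suspected stall point
        finally:
            os.close(fd)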
2009-02-05 08:46:00 moz2-win32-slave19 disconnected for 0 seconds
2009-02-06 04:31:00 moz2-linux-slave15 disconnected for 0 seconds
2009-02-06 19:06:00 moz2-darwin9-slave06 disconnected for 0 seconds
2009-02-07 05:01:00 moz2-linux-slave16 disconnected for 0 seconds
Today, we had random disconnects on moz2-win32-slave01, moz2-win32-slave06, moz2-win32-slave09, moz2-win32-slave12.
[17:04] * joduinn logged into each of those this afternoon using RDP and looked at the buildbot.tac file for umask settings

So comment #19 could be chalked up to user error; not sure if we can check for an existing buildbot process in the batch file (or check whether it's RDP or VNC).
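Not from the bug itself, but one way such a check could work is to look for an existing buildbot process before launching another. A rough sketch in Python (the WMIC query and the 'python%' process-name filter are assumptions about how the Win32 slaves are set up):

    import subprocess

    def buildbot_already_running():
        """Rough check for an already-running Buildbot slave on a Win32
        machine: list python processes via WMIC and look for 'buildbot'
        in the command line. Purely illustrative."""
        result = subprocess.run(
            ['wmic', 'process', 'where', "name like 'python%'",
             'get', 'commandline'],
            capture_output=True,
        )
        return b'buildbot' in result.stdout.lower()

    if __name__ == '__main__':
        if buildbot_already_running():
            raise SystemExit('buildbot already running; not starting another')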
sshd would also help. bug 485519
This isn't a problem anymore.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Moving closed Future bugs into Release Engineering in preparation for removing the Future component.
Component: Release Engineering: Future → Release Engineering
Product: mozilla.org → Release Engineering