Closed
Bug 476677
Opened 16 years ago
Closed 16 years ago
build slaves (win32 in particular) disconnecting at an alarming rate
Categories
(Release Engineering :: General, defect, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bhearsum, Unassigned)
Not long ago we had nearly every single win32 production slave disconnect from its master, interrupting many builds.
Reporter | Updated•16 years ago
Priority: -- → P1
Comment 1•16 years ago
bm-xserve18 has also been cycling, though less frequently than the windows slaves.
Reporter | Updated•16 years ago
Summary: win32 slaves disconnecting at an alarming rate → build slaves (win32 in particular) disconnecting at an alarming rate
Reporter | Comment 2•16 years ago
I'm setting keepalive=None in buildbot.tac on staging slaves right now to test it as a possible fix...
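For reference, a minimal sketch of what that change looks like in a buildbot 0.7-era slave buildbot.tac; the basedir, master hostname, slave name, and password below are placeholders, not real production values:

from twisted.application import service
from buildbot.slave.bot import BuildSlave

basedir = r'/builds/slave'                      # placeholder
buildmaster_host = 'production-master.example'  # placeholder
port = 9989
slavename = 'moz2-win32-slaveNN'                # placeholder
passwd = 'XXXXXXXX'                             # placeholder
keepalive = None   # was 600; None disables the slave-initiated keepalive pings
usepty = 1
umask = None

application = service.Application('buildslave')
s = BuildSlave(buildmaster_host, port, slavename, passwd, basedir,
               keepalive, usepty, umask=umask)
s.setServiceParent(application)

The slave's buildbot process has to be restarted to pick up the change, which is why it was rolled out between builds (see comment 11).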
Reporter | Comment 3•16 years ago
OK, keepalive has been set to None on the following slaves:
moz2-linux-slave03, 04, 17
moz2-win32-slave03, 04, 21
moz2-darwin9-slave03, 04
moz2-linux64-slave01
I'll be watching these machines for disconnects. If it protects these slaves in the next round of failure we can probably roll this out.
Comment 4•16 years ago
FWIW, talos has similar problems, mainly on windows:
qm-pxp-talos03
qm-plinux-trunk07
qm-pxp-fast04
qm-pleopard-trunk04
qm-mini-vista01
Reporter | Comment 5•16 years ago
Haven't seen any disconnects _anywhere_ since about 11:10am PST. Hard to tell yet if this has helped; I'll keep watching.
Comment 6•16 years ago
qm-plinux-talos04 just cycled, and qm-pxp-talos02 cycled at noon PST.
Comment 7•16 years ago
Hey, so, this bug is keeping the trunk closed; does it need to?
Reporter | Comment 8•16 years ago
I don't think it needs to hold anything closed right now. I'm still trying to track down the source of the problem, but it appears to have subsided for now.
Reporter | Updated•16 years ago
Priority: P1 → P2
Comment 9•16 years ago
OK, re-opened the tree, but asked people to comment here if it starts up again.
Priority: P2 → P1
Reporter | Comment 10•16 years ago
OK, there were two rounds of disconnects last night affecting the main build pool and possibly others. The staging slaves we were testing a fix on (see comment #2) were completely unaffected. catlee and I will be rolling this out to the rest of the build slaves shortly. This shouldn't require any downtime; we'll do our best not to interrupt any running builds.
Reporter | Comment 11•16 years ago
Alright, we've deployed this on all but one of the build slaves. I'm waiting for a build to finish on moz2-win32-slave19, and then I can restart it for this change.
Coincidentally (and usefully) we had a situation today that normally causes slave disconnects. We experienced exactly zero disconnections during this, so I'm very confident this has fixed *that* problem.
There is still an underlying problem of why build slaves are unable to respond to keepalive pings which could be related to load on esx hosts or storage arrays or be something completely different.
I won't be investigating this in more depth right now, so I'm throwing this bug back in the pool.
Assignee: bhearsum → nobody
Status: ASSIGNED → NEW
Priority: P1 → P3
Comment 12•16 years ago
(In reply to comment #11)
> There is still an underlying problem of why build slaves are unable to respond
> to keepalive pings which could be related to load on esx hosts or storage
> arrays or be something completely different.
Gozer hit the exact same problem with his slaves on MoMo infrastructure, so I doubt it's something specific to MoCo hosts/arrays. I guess it *might* be something that both MoMo and MoCo use; a design bug in how buildbot slaves handle keepalive messages, for example?
From email with gozer, bhearsum, joduinn, and others earlier this week:
[snip]
Basically, like I said in the call, every 600 seconds the slave sends a ping request to the master, and if it doesn't get a response back within 30 seconds, the slave assumes the master is busted and tries to fix itself by disconnecting from and reconnecting to the master.
I am not 100% certain what's going on here (my python/twisted foo is too weak), but my theory is that a very busy slave just might not be fast enough to process the ping reply it gets back from the master within 30 seconds.
From looking at logs for 'nothing from master', I've found tons of such
messages, sometimes with very large values > 300 seconds.
Since then, I've disabled keepAliveInterval completely on most builders, and the problem has yet to reappear.
[snip]
> I won't be investigating this in more depth right now, so I'm throwing this bug
> back in the pool.
OK, but what's left to do here? Given the MoMo experience above, maybe the next step is to investigate whether there's a timing bug in how buildbot slaves handle keepalive messages while they're also busy doing work?
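To make the timing gozer describes concrete, here is a minimal, hypothetical sketch of slave-side keepalive logic in Twisted style. This is illustrative only, not buildbot's actual implementation; the "keepalive" remote method name and the attribute names are assumptions:

from twisted.internet import reactor

class KeepaliveWatcher:
    """Ping the master every `interval` seconds; if no reply arrives
    within `timeout` seconds, assume the connection is dead."""

    def __init__(self, remote_master, interval=600, timeout=30):
        self.remote = remote_master  # Perspective Broker reference to the master
        self.interval = interval
        self.timeout = timeout
        self._timeout_call = None

    def start(self):
        reactor.callLater(self.interval, self.send_ping)

    def send_ping(self):
        # If the slave's reactor is starved (busy build, blocked I/O, loaded
        # ESX host), this timeout can fire even though the master replied in
        # time, because the reply is still sitting unprocessed in the queue.
        self._timeout_call = reactor.callLater(self.timeout, self.give_up)
        d = self.remote.callRemote("keepalive")  # method name is illustrative
        d.addCallback(self.got_reply)

    def got_reply(self, _result):
        if self._timeout_call is not None and self._timeout_call.active():
            self._timeout_call.cancel()
        reactor.callLater(self.interval, self.send_ping)

    def give_up(self):
        # The slave concludes the master is unreachable and drops the
        # connection, producing the disconnect/reconnect cycle seen here.
        self.remote.broker.transport.loseConnection()

Disabling the keepalive sidesteps that race entirely, at the cost of relying on TCP alone to notice genuinely dead connections.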
Component: Release Engineering → Release Engineering: Future
OS: Mac OS X → All
Comment 15•16 years ago
Since Feb 4, 12:00pm, we've had the following disconnections on production-master:
2009-02-04 12:21:00 moz2-darwin9-slave02 disconnected for 60 seconds
2009-02-04 12:36:00 moz2-win32-slave18 disconnected for 0 seconds
2009-02-04 14:51:00 moz2-win32-slave14 disconnected for 0 seconds
2009-02-04 15:06:00 moz2-win32-slave10 disconnected for 60 seconds
2009-02-04 15:55:00 moz2-win32-slave07 disconnected for 0 seconds
2009-02-04 19:08:00 moz2-linux64-slave01 disconnected for 60 seconds
2009-02-04 23:28:00 bm-xserve19 disconnected for 0 seconds
bm-xserve19 looks like it was rebooted around 21:30 last night, and buildbot was started at 23:28. This disconnection seems unrelated to the keepalive problem.
moz2-linux64-slave01 looks like it did lose its connection to the master.
I haven't yet examined the windows machines.
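A small sketch of how reports like the ones above can be tallied per slave; it assumes lines formatted exactly as shown (timestamp, slave name, "disconnected for N seconds"), and the input filename is hypothetical:

import re
from collections import defaultdict

# Matches lines such as:
# "2009-02-04 15:06:00 moz2-win32-slave10 disconnected for 60 seconds"
LINE = re.compile(r"^(\S+ \S+) (\S+) disconnected for (\d+) seconds")

def summarize(report_lines):
    """Group disconnect events by slave name."""
    per_slave = defaultdict(list)
    for line in report_lines:
        m = LINE.match(line.strip())
        if m:
            timestamp, slave, seconds = m.group(1), m.group(2), int(m.group(3))
            per_slave[slave].append((timestamp, seconds))
    return per_slave

if __name__ == "__main__":
    # "disconnects.txt" is a placeholder for wherever these reports are kept.
    for slave, events in sorted(summarize(open("disconnects.txt")).items()):
        print("%s: %d disconnects" % (slave, len(events)))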
Comment 16•16 years ago
(In reply to comment #15)
> 2009-02-04 12:21:00 moz2-darwin9-slave02 disconnected for 60 seconds
> 2009-02-04 12:36:00 moz2-win32-slave18 disconnected for 0 seconds
These were disconnected on purpose to make the keepalive change.
Comment 17•16 years ago
The current hypothesis is that buildbot calls fsync() after writing log files, which can block the whole OS until all dirty files are flushed to disk, resulting in a spike in load and unresponsiveness.
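A minimal sketch of the suspected pattern (not buildbot's actual logging code): appending a log chunk and forcing it to disk with fsync(), which can stall for a long time on a heavily loaded VM host or storage array:

import os
import time

def append_log_chunk(path, data, force_sync=True):
    """Append a chunk of build log output, optionally fsync()ing it."""
    with open(path, "ab") as f:
        f.write(data)
        f.flush()
        if force_sync:
            start = time.time()
            # Blocks until the data (and, on some filesystems, all dirty
            # data) actually hits disk.
            os.fsync(f.fileno())
            stall = time.time() - start
            if stall > 1.0:
                print("fsync stalled for %.1f seconds" % stall)

If such a stall overlaps a keepalive window, the 30-second timeout described in comment 12 can expire before the slave ever gets a chance to handle the master's reply.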
Comment 18•16 years ago
2009-02-05 08:46:00 moz2-win32-slave19 disconnected for 0 seconds
2009-02-06 04:31:00 moz2-linux-slave15 disconnected for 0 seconds
2009-02-06 19:06:00 moz2-darwin9-slave06 disconnected for 0 seconds
2009-02-07 05:01:00 moz2-linux-slave16 disconnected for 0 seconds
Comment 19•16 years ago
Today, we had random disconnects on moz2-win32-slave01, moz2-win32-slave06, moz2-win32-slave09, moz2-win32-slave12.
Comment 20•16 years ago
[17:04] * joduinn logged into each of those this afternoon using RDP and looked at the buildbot.tac file for umask settings
So comment #19 could be chalked up to user error; not sure if we can check for an existing buildbot process in the batch file (or check whether the session is RDP or VNC).
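One hypothetical way the startup batch file could guard against a second copy, assuming Windows with wmic available; the function and the check itself are illustrative, not an existing RelEng script:

import subprocess

def buildbot_already_running():
    """Return True if a python process already appears to be running buildbot."""
    # wmic prints full command lines, so a slave started from buildbot.tac
    # shows up here; tasklist alone would only show "python.exe".
    output = subprocess.Popen(
        ["wmic", "process", "where", "name='python.exe'", "get", "commandline"],
        stdout=subprocess.PIPE).communicate()[0]
    return b"buildbot" in output.lower()

if __name__ == "__main__":
    if buildbot_already_running():
        print("buildbot already running; not starting a second copy")
    else:
        print("no buildbot process found; safe to start one")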
Comment 21•16 years ago
sshd would also help; see bug 485519.
Reporter | Comment 23•16 years ago
This isn't a problem anymore.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Comment 24•15 years ago
Moving closed Future bugs into Release Engineering in preparation for removing the Future component.
Component: Release Engineering: Future → Release Engineering
Assignee | Updated•12 years ago
Product: mozilla.org → Release Engineering