build slaves (win32 in particular) disconnecting at an alarming rate

Status: RESOLVED FIXED
Priority: P3
Severity: normal
Opened: 10 years ago
Last modified: 5 years ago

People

(Reporter: bhearsum, Unassigned)

Tracking

Firefox Tracking Flags: (Not tracked)

Details
Not long ago we had nearly every single win32 production slave disconnect from its master, interrupting many builds.
(Reporter)

Updated

10 years ago
Priority: -- → P1

Comment 1

10 years ago
bm-xserve18 has also been cycling, though less frequently than the Windows slaves.
(Reporter)

Updated

10 years ago
Summary: win32 slaves disconnecting at an alarming rate → build slaves (win32 in particular) disconnecting at an alarming rate
(Reporter)

Comment 2

10 years ago
I'm setting keepalive=None in buildbot.tac on staging slaves right now to test it as a possible fix...
(Reporter)

Comment 3

10 years ago
OK, keepalive has been set to None on the following slaves:
moz2-linux-slave03, 04, 17
moz2-win32-slave03, 04, 21
moz2-darwin9-slave03, 04
moz2-linux64-slave01

I'll be watching these machines for disconnects. If it protects these slaves in the next round of failure we can probably roll this out.
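For reference, the keepalive setting lives in each slave's buildbot.tac. A hedged sketch of what such a file looks like in the buildbot 0.7-era layout (hostname, password, and path below are placeholders, not the production values):

```python
# buildbot.tac (slave side) -- illustrative sketch, not the exact production file.
# With keepalive=None the slave never initiates its own ping, so it can no
# longer decide the master is dead just because it processed a pong late.
from twisted.application import service
from buildbot.slave.bot import BuildSlave

basedir = r'c:\slave'                               # placeholder path
buildmaster_host = 'production-master.example.com'  # placeholder host
port = 9010
slavename = 'moz2-win32-slave03'
passwd = 'XXXXXXXX'                                 # placeholder
keepalive = None                                    # was 600 (seconds) before this change
usepty = 0

application = service.Application('buildslave')
s = BuildSlave(buildmaster_host, port, slavename, passwd, basedir,
               keepalive, usepty)
s.setServiceParent(application)
```

Changing this on a running slave requires a buildbot restart, which is why the rollout below waits for in-progress builds to finish.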

Comment 4

10 years ago
FWIW, talos has similar problems, mainly on windows:

qm-pxp-talos03
qm-plinux-trunk07
qm-pxp-fast04
qm-pleopard-trunk04
qm-mini-vista01
(Reporter)

Comment 5

10 years ago
Haven't seen any disconnects _anywhere_ since about 11:10am PST. Hard to tell yet whether this has helped; will keep watching.

Comment 6

10 years ago
qm-plinux-talos04 just cycled, and qm-pxp-talos02 cycled at noon PST

Comment 7

10 years ago
Hey, so, this bug is keeping the trunk closed, does it need to?
(Reporter)

Comment 8

10 years ago
I don't think it needs to hold anything closed right now. I'm still trying to track down the source of the problem, but it appears to have subsided for now.
(Reporter)

Updated

10 years ago
Priority: P1 → P2

Comment 9

10 years ago
OK, re-opened the tree, but asked people to comment here if it starts up again.
Priority: P2 → P1
(Reporter)

Comment 10

10 years ago
OK, there were two rounds of disconnects last night affecting the main build pool and possibly others. The staging slaves we were testing a fix on (see comment #2) were completely unaffected. catlee and I will be rolling this out on the rest of the build slaves shortly. This shouldn't require any downtime; we'll do our best not to interrupt any running builds.
(Reporter)

Comment 11

10 years ago
Alright, we've deployed this on all but one of the build slaves. I'm waiting for a build to finish on moz2-win32-slave19, and then I can restart it for this change.

Coincidentally (and usefully) we had a situation today that normally causes slave disconnects. We experienced exactly zero disconnections during this, so I'm very confident this has fixed *that* problem.

There is still an underlying question of why build slaves are unable to respond to keepalive pings, which could be related to load on ESX hosts or storage arrays, or could be something completely different.

I won't be investigating this in more depth right now, so I'm throwing this bug back in the pool.
Assignee: bhearsum → nobody
Status: ASSIGNED → NEW
Priority: P1 → P3
(In reply to comment #11)
> There is still an underlying problem of why build slaves are unable to respond
> to keepalive pings which could be related to load on esx hosts or storage
> arrays or be something completely different.
Gozer hit the exact same problem with his slaves on MoMo infrastructure, so I doubt it's something specific to MoCo hosts/arrays. I guess it *might* be something that both MoMo and MoCo use, like a design bug in how buildbot slaves handle keepalive messages, for example?


From email with gozer, bhearsum, joduinn, others earlier this week:
[snip]
Basically, like I said in the call, every 600 seconds, the slave sends a ping
request to the master, and if it doesn't get it back 30 seconds later, the
slave assumes the master is busted and tries to fix itself by disconnecting/reconnecting to the master.

I am not 100% certain what's going on (my python/twisted foo is too weak) here,
but my theory is that for a very busy slave, it just might not be fast enough
to process the ping back it gets from the master in 30 seconds.

From looking at logs for 'nothing from master', I've found tons of such
messages, sometimes with very large values > 300 seconds.

Since then, I've disabled keepAliveInterval completely on most builders,
and the problem has yet to reappear.
[snip]
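The mechanism gozer describes can be reduced to a tiny model. This is an illustrative sketch of the hypothesized timing bug, not buildbot's actual code; the constants come from the 600-second ping interval and 30-second timeout quoted above:

```python
KEEPALIVE_INTERVAL = 600   # seconds between slave-initiated pings (per the email above)
KEEPALIVE_TIMEOUT = 30     # seconds the slave allows before declaring the master dead

def should_reconnect(ping_sent_at, pong_processed_at,
                     timeout=KEEPALIVE_TIMEOUT):
    """Return True if the slave would tear down the connection.

    The master may answer promptly, but on a heavily loaded slave the
    reply sits in the event queue; what matters is when the slave gets
    around to *processing* it, not when it arrived on the wire.
    """
    return (pong_processed_at - ping_sent_at) > timeout

# An idle slave processes the pong almost immediately: no reconnect.
assert not should_reconnect(0.0, 1.5)

# A busy slave (e.g. mid-link on win32) drains its event queue 300s late,
# matching the ">300 seconds" values gozer saw in the logs: reconnect.
assert should_reconnect(0.0, 300.0)
```

Under this model, disabling the slave-side keepalive removes the false-positive "master is busted" verdict without touching whatever is making the slave slow in the first place.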


> I won't be investigating this in more depth right now, so I'm throwing this bug
> back in the pool.
OK, but what's left to do here? Given the MoMo experience above, maybe the next step is to investigate whether there's a timing bug in how buildbot slaves handle keepalive messages when also busy doing work?
Component: Release Engineering → Release Engineering: Future
OS: Mac OS X → All
Duplicate of this bug: 476651

Updated

10 years ago
Duplicate of this bug: 473586
Comment 15

10 years ago
Since Feb 4, 12:00pm, we've had the following disconnections on production-master:
2009-02-04 12:21:00 moz2-darwin9-slave02 disconnected for 60 seconds
2009-02-04 12:36:00 moz2-win32-slave18 disconnected for 0 seconds
2009-02-04 14:51:00 moz2-win32-slave14 disconnected for 0 seconds
2009-02-04 15:06:00 moz2-win32-slave10 disconnected for 60 seconds
2009-02-04 15:55:00 moz2-win32-slave07 disconnected for 0 seconds
2009-02-04 19:08:00 moz2-linux64-slave01 disconnected for 60 seconds
2009-02-04 23:28:00 bm-xserve19 disconnected for 0 seconds

bm-xserve19 looks like it was rebooted around 21:30 last night, and buildbot was started at 23:28. This disconnection seems unrelated to the keepalive problem.

moz2-linux64-slave01 looks like it did lose its connection to the master.

I haven't yet examined the windows machines.
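As an aside, summaries like the list in comment #15 are easy to tally with a throwaway parser. The line format below is assumed from the examples in this bug, nothing more:

```python
import re
from datetime import datetime

# Hypothetical parser for the disconnect summary lines quoted above;
# the format is inferred from the examples, not from any buildbot tool.
LINE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\S+) disconnected for (\d+) seconds")

def parse_disconnect(line):
    """Return (timestamp, slave, downtime_seconds), or None if it doesn't match."""
    m = LINE.match(line)
    if not m:
        return None
    when, slave, secs = m.groups()
    return datetime.strptime(when, "%Y-%m-%d %H:%M:%S"), slave, int(secs)

ts, slave, secs = parse_disconnect(
    "2009-02-04 15:06:00 moz2-win32-slave10 disconnected for 60 seconds")
assert slave == "moz2-win32-slave10" and secs == 60
```

Grouping the parsed tuples by slave or by hour would make it quicker to spot whether a round of disconnects clusters around a particular time, as the Feb 4 batch did.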
(In reply to comment #15)
> 2009-02-04 12:21:00 moz2-darwin9-slave02 disconnected for 60 seconds
> 2009-02-04 12:36:00 moz2-win32-slave18 disconnected for 0 seconds

These were disconnected on purpose to make the keepalive change.
Current hypothesis is that buildbot calls fsync() after writing log files, which could be blocking the whole O/S until all dirty files are flushed to disk, resulting in a spike in load, and unresponsiveness.
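That hypothesis is easy to picture: if every log chunk is followed by an fsync(), the writer stalls for as long as the kernel needs to flush dirty pages to (possibly overloaded) storage. A hedged sketch of such a write path follows; it mirrors the suspected pattern, not buildbot's actual log code, and the file name is a placeholder:

```python
import os
import tempfile

def append_log_chunk(path, chunk):
    """Append one chunk of build output and force it to stable storage.

    os.fsync() blocks until the data reaches disk; on a busy ESX host or
    a slow storage array that wait can be long enough for the process to
    miss the 30-second keepalive window discussed above.
    """
    with open(path, "ab") as f:
        f.write(chunk)
        f.flush()
        os.fsync(f.fileno())

# Minimal usage: two chunks appended to a scratch log file.
log = os.path.join(tempfile.mkdtemp(), "stdio.log")
append_log_chunk(log, b"compiling xul.dll...\n")
append_log_chunk(log, b"link complete\n")
with open(log, "rb") as f:
    assert f.read().count(b"\n") == 2
```

If the hypothesis were confirmed, batching syncs or dropping the fsync() entirely would trade durability of in-progress logs for responsiveness.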
2009-02-05 08:46:00 moz2-win32-slave19 disconnected for 0 seconds
2009-02-06 04:31:00 moz2-linux-slave15 disconnected for 0 seconds
2009-02-06 19:06:00 moz2-darwin9-slave06 disconnected for 0 seconds
2009-02-07 05:01:00 moz2-linux-slave16 disconnected for 0 seconds

Comment 19

10 years ago
Today, we had random disconnects on moz2-win32-slave01, moz2-win32-slave06, moz2-win32-slave09, moz2-win32-slave12.

Comment 20

10 years ago
[17:04]	* joduinn	logged into each of those this afternoon using RDP and looked at the buildbot.tac file for umask settings

So comment #19 could be chalked up to user error; not sure if we can check for an existing buildbot process in the batch file (or check whether the session is RDP or VNC).

Comment 21

10 years ago
sshd would also help; see bug 485519.
Duplicate of this bug: 470404
This isn't a problem anymore.
Status: NEW → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → FIXED
Moving closed Future bugs into Release Engineering in preparation for removing the Future component.
Component: Release Engineering: Future → Release Engineering
(Assignee)

Updated

5 years ago
Product: mozilla.org → Release Engineering