Closed Bug 476677 Opened 16 years ago Closed 16 years ago

build slaves (win32 in particular) disconnecting at an alarming rate

Categories
(Release Engineering :: General, defect, P3)
Hardware: x86
OS: All
Type: defect

Tracking
(Not tracked)
Status: RESOLVED FIXED

People
(Reporter: bhearsum, Unassigned)

Not long ago we had nearly every single win32 production slave disconnect from its master, interrupting many builds.
Priority: -- → P1
bm-xserve18 has also been cycling, though less often than the windows slaves
Summary: win32 slaves disconnecting at an alarming rate → build slaves (win32 in particular) disconnecting at an alarming rate
I'm setting keepalive=None in buildbot.tac on staging slaves right now to test it as a possible fix...
OK, keepalive has been set to None on the following slaves:
moz2-linux-slave03, 04, 17
moz2-win32-slave03, 04, 21
moz2-darwin9-slave03, 04
moz2-linux64-slave01
I'll be watching these machines for disconnects. If it protects these slaves in the next round of failure we can probably roll this out.
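For context, a minimal sketch of what this change looks like in a Buildbot 0.7-era slave buildbot.tac (the host, port, basedir and password below are placeholders, not the real production values):

    from twisted.application import service
    from buildbot.slave.bot import BuildSlave   # 0.7.x slave class

    basedir = r'c:\builds\moz2_slave'                    # placeholder
    buildmaster_host = 'production-master.example.org'   # placeholder
    port = 9989                                          # placeholder
    slavename = 'moz2-win32-slave03'
    passwd = 'XXXXXXXX'                                  # placeholder
    keepalive = None   # was 600; None disables the periodic keepalive pings
    usepty = 1
    umask = None

    application = service.Application('buildslave')
    s = BuildSlave(buildmaster_host, port, slavename, passwd, basedir,
                   keepalive, usepty, umask=umask)
    s.setServiceParent(application)

With keepalive set to None the slave never starts the ping/timeout cycle, so a busy slave can no longer decide on its own that the master is unreachable.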
FWIW, talos has similar problems, mainly on windows:
qm-pxp-talos03
qm-plinux-trunk07
qm-pxp-fast04
qm-pleopard-trunk04
qm-mini-vista01
Haven't seen any disconnects _anywhere_ since about 11:10am PST. Hard to tell if this has helped; will keep watching.
qm-plinux-talos04 just cycled, and qm-pxp-talos02 cycled at noon PST
Hey, so, this bug is keeping the trunk closed, does it need to?
I don't think it needs to hold anything closed right now. I'm still trying to track down the source of the problem, but it appears to have subsided for now.
Priority: P1 → P2
OK, re-opened the tree, but asked people to comment here if it starts up again.
Priority: P2 → P1
OK, there were two rounds of disconnects last night affecting the main build pool, and possibly others. The staging slaves we were testing a fix on (see comment #2) were completely unaffected. catlee and I will be rolling this out on the rest of the build slaves shortly. This shouldn't require any downtime; we'll do our best not to interrupt any running builds.
Alright, we've deployed this on all but one of the build slaves. I'm waiting for a build to finish on moz2-win32-slave19, and then I can restart it for this change.

Coincidentally (and usefully) we had a situation today that normally causes slave disconnects. We experienced exactly zero disconnections during it, so I'm very confident this has fixed *that* problem.

There is still an underlying problem of why build slaves are unable to respond to keepalive pings, which could be related to load on ESX hosts or storage arrays, or be something completely different. I won't be investigating this in more depth right now, so I'm throwing this bug back in the pool.
Assignee: bhearsum → nobody
Status: ASSIGNED → NEW
Priority: P1 → P3
(In reply to comment #11)
> There is still an underlying problem of why build slaves are unable to respond
> to keepalive pings, which could be related to load on ESX hosts or storage
> arrays, or be something completely different.

Gozer hit the exact same problem with his slaves on MoMo infrastructure, so I doubt it's something specific to MoCo hosts/arrays. I guess it *might* be something that both MoMo and MoCo use, like a design bug in how buildbot slaves handle keepalive messages, for example?

From email with gozer, bhearsum, joduinn, and others earlier this week:

[snip]
Basically, like I said on the call, every 600 seconds the slave sends a ping request to the master, and if it doesn't get it back 30 seconds later, the slave assumes the master is busted and tries to fix itself by disconnecting/reconnecting to the master.

I am not 100% certain what's going on here (my python/twisted foo is too weak), but my theory is that a very busy slave just might not be fast enough to process the ping it gets back from the master within 30 seconds. Looking through the logs for 'nothing from master', I've found tons of such messages, sometimes with very large values (> 300 seconds).

Since then, I've disabled keepAliveInterval completely on most builders, and the problem has yet to reappear.
[snip]

> I won't be investigating this in more depth right now, so I'm throwing this bug
> back in the pool.

OK, but what's left to do here? Given the MoMo experience above, maybe the next step is to investigate whether there's a timing bug in how buildbot slaves handle keepalive messages while they are also busy doing work?
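To make the timing gozer describes concrete, here is a rough sketch of the slave-side behaviour. This is illustrative Python, not Buildbot's actual implementation; send_ping and seconds_since_last_reply are hypothetical stand-ins for the real Twisted plumbing:

    import time

    KEEPALIVE_INTERVAL = 600   # keepalive setting: seconds between pings
    KEEPALIVE_TIMEOUT = 30     # how long the slave waits for the reply

    def keepalive_loop(send_ping, seconds_since_last_reply):
        """Simplified model: every 600s the slave pings the master; if it
        hasn't processed a reply within 30s it assumes the connection is
        dead and reconnects."""
        while True:
            time.sleep(KEEPALIVE_INTERVAL)
            send_ping()
            time.sleep(KEEPALIVE_TIMEOUT)
            if seconds_since_last_reply() > KEEPALIVE_TIMEOUT:
                # On a heavily loaded slave the reply may have arrived but
                # not yet been processed, so this fires spuriously and the
                # slave disconnects/reconnects -- which is exactly what the
                # masters observe.
                raise ConnectionError("nothing from master, reconnecting")

With the keepalive disabled this check never runs, which matches what both MoMo and MoCo saw once the setting was turned off.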
Component: Release Engineering → Release Engineering: Future
OS: Mac OS X → All
Since Feb 4, 12:00pm, we've had the following disconnections on production-master:

2009-02-04 12:21:00 moz2-darwin9-slave02 disconnected for 60 seconds
2009-02-04 12:36:00 moz2-win32-slave18 disconnected for 0 seconds
2009-02-04 14:51:00 moz2-win32-slave14 disconnected for 0 seconds
2009-02-04 15:06:00 moz2-win32-slave10 disconnected for 60 seconds
2009-02-04 15:55:00 moz2-win32-slave07 disconnected for 0 seconds
2009-02-04 19:08:00 moz2-linux64-slave01 disconnected for 60 seconds
2009-02-04 23:28:00 bm-xserve19 disconnected for 0 seconds

bm-xserve19 looks like it was rebooted around 21:30 last night, and buildbot was started at 23:28. This disconnection seems unrelated to the keepalive problem. moz2-linux64-slave01 looks like it did lose its connection to the master. I haven't yet examined the windows machines.
(In reply to comment #15)
> 2009-02-04 12:21:00 moz2-darwin9-slave02 disconnected for 60 seconds
> 2009-02-04 12:36:00 moz2-win32-slave18 disconnected for 0 seconds

These were disconnected on purpose to make the keepalive change.
The current hypothesis is that buildbot calls fsync() after writing log files, which could block the whole OS until all dirty files are flushed to disk, resulting in a spike in load and unresponsiveness.
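As a rough illustration of the suspected pattern (this is not Buildbot's actual logging code, just a sketch of the write-then-fsync behaviour being hypothesized):

    import os

    def append_log_chunk(path, data):
        """Append a chunk of log data and fsync it. The fsync blocks until
        the kernel reports the data is on disk; under heavy I/O (shared
        storage, busy ESX host) that flush can take long enough to starve
        the keepalive handling described above."""
        fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
        try:
            os.write(fd, data)
            os.fsync(fd)   # the suspected stall point
        finally:
            os.close(fd)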
2009-02-05 08:46:00 moz2-win32-slave19 disconnected for 0 seconds
2009-02-06 04:31:00 moz2-linux-slave15 disconnected for 0 seconds
2009-02-06 19:06:00 moz2-darwin9-slave06 disconnected for 0 seconds
2009-02-07 05:01:00 moz2-linux-slave16 disconnected for 0 seconds
Today, we had random disconnects on moz2-win32-slave01, moz2-win32-slave06, moz2-win32-slave09, moz2-win32-slave12.
[17:04] * joduinn logged into each of those this afternoon using RDP and looked at the buildbot.tac file for umask settings

So comment #19 could be chalked up to user error; not sure if we can check for an existing buildbot process in the batch file (or check whether it's RDP or VNC).
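Not from the bug itself, but one way such a check could work is to look for an existing buildbot process before launching another. A rough sketch in Python (the WMIC query and the 'python%' process-name filter are assumptions about how the Win32 slaves are set up):

    import subprocess

    def buildbot_already_running():
        """Rough check for an already-running Buildbot slave on a Win32
        machine: list python processes via WMIC and look for 'buildbot'
        in the command line. Purely illustrative."""
        result = subprocess.run(
            ['wmic', 'process', 'where', "name like 'python%'",
             'get', 'commandline'],
            capture_output=True,
        )
        return b'buildbot' in result.stdout.lower()

    if __name__ == '__main__':
        if buildbot_already_running():
            raise SystemExit('buildbot already running; not starting another')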
sshd would also help. bug 485519
This isn't a problem anymore.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Moving closed Future bugs into Release Engineering in preparation for removing the Future component.
Component: Release Engineering: Future → Release Engineering
Product: mozilla.org → Release Engineering