Closed
Bug 973419
Opened 11 years ago
Closed 11 years ago
What happened at 18:00 PT that disconnected Windows builds across all trees?
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: RyanVM, Unassigned)
Details
(Whiteboard: [buildduty][capacity][buildslaves])
For at least a month now (I want to say that I first started noticing it right around the time all of our other AWS issues started hitting the fan), Windows builds have been disconnecting far more frequently than they used to. I noticed tonight that there was a large disconnection event right around 18:00 PT +/- 1m. Any ideas what happened here?
As we've gone over before, this is particularly insidious because:
1) We don't auto-clobber after these disconnects, and build bustages can and *DO* happen on subsequent builds.
2) As some of the logs below show, these can happen after the build completes and checktests are already running, meaning we end up running a second round of tests on the same cset, which isn't great for our already over-taxed infrastructure.
Example logs:
https://tbpl.mozilla.org/php/getParsedLog.php?id=34767433&tree=Mozilla-Aurora
https://tbpl.mozilla.org/php/getParsedLog.php?id=34767494&tree=Mozilla-Beta
https://tbpl.mozilla.org/php/getParsedLog.php?id=34767435&tree=Mozilla-Beta
https://tbpl.mozilla.org/php/getParsedLog.php?id=34767440&tree=Mozilla-Esr24
https://tbpl.mozilla.org/php/getParsedLog.php?id=34767422&tree=Mozilla-Esr24
https://tbpl.mozilla.org/php/getParsedLog.php?id=34767484&tree=Mozilla-Esr24
https://tbpl.mozilla.org/php/getParsedLog.php?id=34767438&tree=Mozilla-B2g26-v1.2
https://tbpl.mozilla.org/php/getParsedLog.php?id=34767436&tree=Mozilla-B2g26-v1.2
https://tbpl.mozilla.org/php/getParsedLog.php?id=34767453&tree=Mozilla-B2g26-v1.2
https://tbpl.mozilla.org/php/getParsedLog.php?id=34767452&tree=Mozilla-B2g26-v1.2
https://tbpl.mozilla.org/php/getParsedLog.php?id=34767478&tree=Mozilla-B2g26-v1.2
https://tbpl.mozilla.org/php/getParsedLog.php?id=34767446&tree=Mozilla-B2g26-v1.2
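(Not from the bug: a minimal sketch of how one might confirm these logs all show the same disconnect signature. It assumes getParsedLog.php returns plain text and that disconnects carry the twisted "lost in a non-clean fashion" string quoted in comment 3; the URL list and marker are illustrative only.)

from urllib.request import urlopen

LOG_URLS = [
    "https://tbpl.mozilla.org/php/getParsedLog.php?id=34767433&tree=Mozilla-Aurora",
    "https://tbpl.mozilla.org/php/getParsedLog.php?id=34767494&tree=Mozilla-Beta",
    "https://tbpl.mozilla.org/php/getParsedLog.php?id=34767440&tree=Mozilla-Esr24",
]
# Marker string taken from the master log excerpt in comment 3 below.
MARKER = b"Connection to the other side was lost in a non-clean fashion"

for url in LOG_URLS:
    body = urlopen(url).read()
    log_id = url.split("id=")[1].split("&")[0]
    print(log_id, "lost-connection markers:", body.count(MARKER))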
Comment 1•11 years ago
Jobs on w64 slaves which ended with result (5) retry:
End date     End time   Slave             Master        Switch
2014-02-17 02:01:21 w64-ix-slave24 bm85-build1 sw-3a.scl1:26
2014-02-17 02:01:19 w64-ix-slave100 bm84-build1 switch1.r102-6:ge-0/0/43
2014-02-17 02:01:12 w64-ix-slave151 bm82-build1 sw-3a.scl1:15
2014-02-17 02:01:10 w64-ix-slave88 bm84-build1 switch1.r102-6:ge-0/0/31
2014-02-17 02:01:08 w64-ix-slave145 bm82-build1 sw-3a.scl1:19
2014-02-17 02:01:08 w64-ix-slave166 bm83-try1 sw-3a.scl1:20
2014-02-17 02:01:06 w64-ix-slave46 bm83-try1 sw-5a.scl1:42
2014-02-17 02:01:05 w64-ix-slave144 bm82-build1 sw-4a.scl1:31
2014-02-17 02:01:05 w64-ix-slave170 bm83-try1 sw-5a.scl1:10
2014-02-17 02:01:04 w64-ix-slave79 bm85-build1 (I got tired)
2014-02-17 02:00:21 w64-ix-slave147 bm82-build1
2014-02-17 02:00:18 w64-ix-slave157 bm82-build1
2014-02-17 02:00:14 w64-ix-slave150 bm82-build1
2014-02-17 02:00:13 w64-ix-slave48 bm83-try1
2014-02-17 02:00:12 w64-ix-slave70 bm83-try1
2014-02-17 02:00:12 w64-ix-slave47 bm83-try1
2014-02-17 02:00:10 w64-ix-slave146 bm82-build1
2014-02-17 02:00:09 w64-ix-slave167 bm83-try1
2014-02-17 02:00:08 w64-ix-slave72 bm83-try1
2014-02-17 02:00:07 w64-ix-slave155 bm82-build1
2014-02-17 02:00:03 w64-ix-slave156 bm82-build1
2014-02-17 02:00:03 w64-ix-slave06 bm86-build1
2014-02-17 02:00:00 w64-ix-slave77 bm85-build1
2014-02-17 02:00:00 w64-ix-slave10 bm86-build1
2014-02-17 01:59:36 w64-ix-slave152 bm82-build1
2014-02-17 01:59:32 w64-ix-slave58 bm83-try1
2014-02-17 01:59:31 w64-ix-slave112 bm84-build1
* 27 unique slaves, all in SCL1, multiple racks & switches
* four different masters, all in SCL3
This looks like an SCL1-SCL3 network glitch (ni? adam), or something common in the ESX hosting of the masters (cc gcox).
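(Editorial sketch, not part of the original comment: assuming the retry listing above is saved verbatim to a hypothetical retries.txt, this groups the jobs by master and by minute, which shows the disconnects form one burst spanning several masters rather than a per-slave problem.)

from collections import Counter

per_master = Counter()
per_minute = Counter()

with open("retries.txt") as f:          # hypothetical file holding the table above
    for line in f:
        parts = line.split()
        # data rows start with the job-end date, e.g. "2014-02-17"
        if len(parts) < 4 or not parts[0].startswith("2014-"):
            continue
        end_date, end_time, slave, master = parts[:4]
        per_master[master] += 1
        per_minute[end_time[:5]] += 1   # bucket by minute, e.g. "02:01"

print("jobs per master:", dict(per_master))
print("jobs per minute:", dict(per_minute))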
Component: General Automation → Buildduty
Flags: needinfo?(adam)
QA Contact: catlee → armenzg
Comment 2•11 years ago
Master 82 is on esx2, masters 83 and 85 share the same host (esx4), and 84 is on esx7.
No unexpected events registered on any of those hosts or guests (e.g. no vMotions or other interruptions).
Comment 3•11 years ago
Ok, ESX is in the clear.
Master log from bm82:
# two slaves re-appear without the master realizing they were gone
2014-02-16 17:59:33-0800 [Broker,34300,10.12.40.188] duplicate slave w64-ix-slave156; rejecting new slave and pinging old
2014-02-16 17:59:33-0800 [Broker,34300,10.12.40.188] old slave was connected from IPv4Address(TCP, '10.12.40.188', 49170)
2014-02-16 17:59:33-0800 [Broker,34300,10.12.40.188] new slave is from IPv4Address(TCP, '10.12.40.188', 60804)
2014-02-16 17:59:33-0800 [Broker,34301,10.12.40.187] duplicate slave w64-ix-slave155; rejecting new slave and pinging old
2014-02-16 17:59:33-0800 [Broker,34301,10.12.40.187] old slave was connected from IPv4Address(TCP, '10.12.40.187', 49170)
2014-02-16 17:59:33-0800 [Broker,34301,10.12.40.187] new slave is from IPv4Address(TCP, '10.12.40.187', 63634)
# w64-ix-slave152 drops and is noticed
2014-02-16 17:59:36-0800 [Broker,34234,10.12.40.184] BuildSlave.detached(w64-ix-slave152)
2014-02-16 17:59:36-0800 [Broker,34234,10.12.40.184] <Build WINNT 5.2 mozilla-esr24 leak test build>.lostRemote
2014-02-16 17:59:36-0800 [Broker,34234,10.12.40.184] stopping currentStep <buildbotcustom.steps.mock.MockCommand instance at 0x4ac6ff38>
2014-02-16 17:59:36-0800 [Broker,34234,10.12.40.184] addCompleteLog(interrupt)
2014-02-16 17:59:36-0800 [Broker,34234,10.12.40.184] RemoteCommand.interrupt <RemoteShellCommand '['python', 'e:/builds/moz2_slave/m-esr24-w32-d-0000000000000000/build/build/pymake/make.py', '-f', 'client.mk', 'build', u'MOZ_BUILD_DATE=20140216173550']'> [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
2014-02-16 17:59:36-0800 [Broker,34234,10.12.40.184] RemoteCommand.disconnect: lost slave
2014-02-16 17:59:36-0800 [Broker,34234,10.12.40.184] releaseLocks(<buildbotcustom.steps.mock.MockCommand instance at 0x4ac6ff38>): []
2014-02-16 17:59:36-0800 [Broker,34234,10.12.40.184] step 'compile' complete: retry
2014-02-16 17:59:36-0800 [Broker,34234,10.12.40.184] <Build WINNT 5.2 mozilla-esr24 leak test build>: build finished
Nothing smoking there AFAICT; cc dustin, who has worked on this sort of issue before.
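(Sketch added for illustration, not from the bug: one way to line the detach events up across masters is to pull the detach / duplicate-slave lines out of each master's twistd.log and compare timestamps. Log file paths are assumptions.)

import re
import sys

PATTERNS = ("BuildSlave.detached", "duplicate slave")
STAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

# usage: python detaches.py bm82-twistd.log bm83-twistd.log ...  (paths assumed)
for path in sys.argv[1:]:
    with open(path, errors="replace") as f:
        for line in f:
            if any(p in line for p in PATTERNS):
                m = STAMP.match(line)
                if m:
                    print(path, m.group(1), line.strip()[:120])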
Comment 4•11 years ago
Duplicate slaves are generally caused by something in the middle of the network that tries to track TCP session state and fails to do so, typically a firewall.
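(Illustrative only: if a stateful device is expiring idle sessions, the usual slave-side knob is the keepalive interval in the slave's buildbot.tac, set well below the firewall's idle timeout. A rough sketch of the relevant fragment, assuming buildbot-slave 0.8.x; host, port, and values are placeholders, and this is not presented as the fix that closed this bug.)

# Fragment of a slave's buildbot.tac (buildbot-slave 0.8.x style; values illustrative).
from twisted.application import service
from buildslave.bot import BuildSlave

basedir = r"e:\builds\moz2_slave"
application = service.Application("buildslave")

s = BuildSlave(
    "buildbot-master82.build.mozilla.org",  # buildmaster host (placeholder)
    9001,                                   # port (placeholder)
    "w64-ix-slave156",                      # slavename
    "pass",                                 # password (placeholder)
    basedir,
    keepalive=120,      # application-level keepalive every 2 minutes, below any
                        # plausible firewall idle timeout
    usepty=False,
    maxdelay=300,
)
s.setServiceParent(application)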
Comment 5•11 years ago
cltbld@W64-IX-SLAVE156 ~
$ tracert buildbot-master82.build.mozilla.org
Tracing route to buildbot-master82.srv.releng.scl3.mozilla.com [10.26.48.52]
over a maximum of 30 hops:
1 <1 ms <1 ms <1 ms 10.12.40.1
2 1 ms 1 ms 1 ms v-1042.border1.sjc2.mozilla.net [10.0.12.2]
3 2 ms 2 ms 2 ms xe-0-0-1.border2.scl3.mozilla.net [63.245.219.162]
4 1 ms 1 ms 1 ms v-1033.border1.scl3.mozilla.net [10.0.22.9]
5 2 ms 2 ms 2 ms v-1033.fw1.releng.scl3.mozilla.net [10.0.22.10]
6 2 ms 2 ms 2 ms buildbot-master82.srv.releng.scl3.mozilla.com [10.26.48.52]
Touching two scl3 firewalls which might have changed in the networking rebuild a few TCW's ago, plus one in sjc2 and scl3 releng firewall.
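(Another illustrative sketch, not from the bug: something like this, left running on a slave, would timestamp the path to its master so the next burst can be matched against a route or firewall change. The hostname is the one traced above; interval and filename are arbitrary.)

import subprocess
import time

HOST = "buildbot-master82.build.mozilla.org"   # master traced above

while True:
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    try:
        out = subprocess.check_output(
            ["tracert", "-d", "-w", "1000", HOST],  # Windows tracert, numeric, 1s per hop
            universal_newlines=True,
        )
    except subprocess.CalledProcessError as e:
        out = e.output or "tracert failed\n"
    with open("tracert-log.txt", "a") as f:
        f.write("=== %s ===\n%s\n" % (stamp, out))
    time.sleep(300)   # every 5 minutes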
Comment 6•11 years ago
(In reply to Nick Thomas [:nthomas] from comment #5)
> Touching two scl3 firewalls which might have changed in the networking
> rebuild a few TCW's ago, plus one in sjc2 and scl3 releng firewall.
Can someone decode this? I'm far from an expert here, but crossing 3 different firewalls (especially one in sjc2) seems undesirable.
Comment 7•11 years ago
Only fw1.releng.scl3 is a firewall. The border routers are not stateful, and thus wouldn't be responsible for session state issues. sjc2 is our POP that, among other things, links datacenters like scl1 and scl3. So what you're seeing here is traffic from scl1 to the POP, into scl3, across border1 and into the releng BU via its firewall, and then to the master.
Comment 8•11 years ago
Thanks for the correction.
Comment 9•11 years ago
No info forthcoming from NetOps, and we've added masters in SCL3 to avoid talking across the big bad internet. --> FIXED.
Status: NEW → RESOLVED
Closed: 11 years ago
Flags: needinfo?(adam)
Resolution: --- → FIXED
Updated•7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard