Bug 973419 (Closed) · Opened 11 years ago · Closed 11 years ago

What happened at 18:00 PT that disconnected Windows builds across all trees?

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Platform: x86, Windows 7
Type: task
Priority: Not set
Severity: critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: RyanVM, Unassigned)

Details

(Whiteboard: [buildduty][capacity][buildslaves])

For at least a month now (I want to say that I first started noticing it right around the time all of our other AWS issues started hitting the fan), Windows builds have been disconnecting far more frequently than they used to. I noticed tonight that there was a large disconnection event right around 18:00 PT +/- 1m. Any ideas what happened here?

As we've gone over before, this is particularly insidious because:
1) We don't auto-clobber after these disconnects, and build bustages can and *DO* happen on subsequent builds.
2) As some of the logs below show, these can happen after the build completes and checktests are already running, meaning we end up running a second round of tests on the same cset, which isn't great for our already over-taxed infrastructure.

Example logs:
https://tbpl.mozilla.org/php/getParsedLog.php?id=34767433&tree=Mozilla-Aurora
https://tbpl.mozilla.org/php/getParsedLog.php?id=34767494&tree=Mozilla-Beta
https://tbpl.mozilla.org/php/getParsedLog.php?id=34767435&tree=Mozilla-Beta
https://tbpl.mozilla.org/php/getParsedLog.php?id=34767440&tree=Mozilla-Esr24
https://tbpl.mozilla.org/php/getParsedLog.php?id=34767422&tree=Mozilla-Esr24
https://tbpl.mozilla.org/php/getParsedLog.php?id=34767484&tree=Mozilla-Esr24
https://tbpl.mozilla.org/php/getParsedLog.php?id=34767438&tree=Mozilla-B2g26-v1.2
https://tbpl.mozilla.org/php/getParsedLog.php?id=34767436&tree=Mozilla-B2g26-v1.2
https://tbpl.mozilla.org/php/getParsedLog.php?id=34767453&tree=Mozilla-B2g26-v1.2
https://tbpl.mozilla.org/php/getParsedLog.php?id=34767452&tree=Mozilla-B2g26-v1.2
https://tbpl.mozilla.org/php/getParsedLog.php?id=34767478&tree=Mozilla-B2g26-v1.2
https://tbpl.mozilla.org/php/getParsedLog.php?id=34767446&tree=Mozilla-B2g26-v1.2
Jobs on w64 slaves which ended with result (5) retry:

Job end day   end time   slave             master       switch
2014-02-17    02:01:21   w64-ix-slave24    bm85-build1  sw-3a.scl1:26
2014-02-17    02:01:19   w64-ix-slave100   bm84-build1  switch1.r102-6:ge-0/0/43
2014-02-17    02:01:12   w64-ix-slave151   bm82-build1  sw-3a.scl1:15
2014-02-17    02:01:10   w64-ix-slave88    bm84-build1  switch1.r102-6:ge-0/0/31
2014-02-17    02:01:08   w64-ix-slave145   bm82-build1  sw-3a.scl1:19
2014-02-17    02:01:08   w64-ix-slave166   bm83-try1    sw-3a.scl1:20
2014-02-17    02:01:06   w64-ix-slave46    bm83-try1    sw-5a.scl1:42
2014-02-17    02:01:05   w64-ix-slave144   bm82-build1  sw-4a.scl1:31
2014-02-17    02:01:05   w64-ix-slave170   bm83-try1    sw-5a.scl1:10
2014-02-17    02:01:04   w64-ix-slave79    bm85-build1  (I got tired)
2014-02-17    02:00:21   w64-ix-slave147   bm82-build1
2014-02-17    02:00:18   w64-ix-slave157   bm82-build1
2014-02-17    02:00:14   w64-ix-slave150   bm82-build1
2014-02-17    02:00:13   w64-ix-slave48    bm83-try1
2014-02-17    02:00:12   w64-ix-slave70    bm83-try1
2014-02-17    02:00:12   w64-ix-slave47    bm83-try1
2014-02-17    02:00:10   w64-ix-slave146   bm82-build1
2014-02-17    02:00:09   w64-ix-slave167   bm83-try1
2014-02-17    02:00:08   w64-ix-slave72    bm83-try1
2014-02-17    02:00:07   w64-ix-slave155   bm82-build1
2014-02-17    02:00:03   w64-ix-slave156   bm82-build1
2014-02-17    02:00:03   w64-ix-slave06    bm86-build1
2014-02-17    02:00:00   w64-ix-slave77    bm85-build1
2014-02-17    02:00:00   w64-ix-slave10    bm86-build1
2014-02-17    01:59:36   w64-ix-slave152   bm82-build1
2014-02-17    01:59:32   w64-ix-slave58    bm83-try1
2014-02-17    01:59:31   w64-ix-slave112   bm84-build1

* 27 unique slaves, all in SCL1, multiple racks & switches
* four different masters, all in SCL3

Which looks like an SCL1-SCL3 network glitch (ni? adam), or something common in the ESX hosting of the masters (cc gcox).
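For anyone who wants to reproduce the tally above from a future event, a throwaway sketch along these lines works; the file name rows.txt and the column order are assumptions, matching the list as pasted here.

from collections import Counter

# Throwaway sketch: paste the retry rows above ("end day  end time  slave
# master  switch") into rows.txt and tally how the events spread across
# slaves, masters and switches. File name and column order are assumed.
def summarize(path="rows.txt"):
    slaves, masters, switches = set(), Counter(), Counter()
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 4:
                continue  # skip the header or blank lines
            _day, _time, slave, master = parts[:4]
            slaves.add(slave)
            masters[master] += 1
            if len(parts) > 4:
                switches[" ".join(parts[4:])] += 1
    print(len(slaves), "unique slaves")
    print("per master:", dict(masters))
    print("per switch:", dict(switches))

if __name__ == "__main__":
    summarize()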
Component: General Automation → Buildduty
Flags: needinfo?(adam)
QA Contact: catlee → armenzg
Master 82 is esx2, masters 83 and 85 share the same host (esx4), and 84 is esx7. No unexpected events registered on any of those hosts or guests (e.g. no vMotions or other interruptions).
Ok, ESX is in the clear. Master log from bm82:

# two slaves re-appear without the master realizing they were gone
2014-02-16 17:59:33-0800 [Broker,34300,10.12.40.188] duplicate slave w64-ix-slave156; rejecting new slave and pinging old
2014-02-16 17:59:33-0800 [Broker,34300,10.12.40.188] old slave was connected from IPv4Address(TCP, '10.12.40.188', 49170)
2014-02-16 17:59:33-0800 [Broker,34300,10.12.40.188] new slave is from IPv4Address(TCP, '10.12.40.188', 60804)
2014-02-16 17:59:33-0800 [Broker,34301,10.12.40.187] duplicate slave w64-ix-slave155; rejecting new slave and pinging old
2014-02-16 17:59:33-0800 [Broker,34301,10.12.40.187] old slave was connected from IPv4Address(TCP, '10.12.40.187', 49170)
2014-02-16 17:59:33-0800 [Broker,34301,10.12.40.187] new slave is from IPv4Address(TCP, '10.12.40.187', 63634)

# w64-ix-slave152 drops and is noticed
2014-02-16 17:59:36-0800 [Broker,34234,10.12.40.184] BuildSlave.detached(w64-ix-slave152)
2014-02-16 17:59:36-0800 [Broker,34234,10.12.40.184] <Build WINNT 5.2 mozilla-esr24 leak test build>.lostRemote
2014-02-16 17:59:36-0800 [Broker,34234,10.12.40.184] stopping currentStep <buildbotcustom.steps.mock.MockCommand instance at 0x4ac6ff38>
2014-02-16 17:59:36-0800 [Broker,34234,10.12.40.184] addCompleteLog(interrupt)
2014-02-16 17:59:36-0800 [Broker,34234,10.12.40.184] RemoteCommand.interrupt <RemoteShellCommand '['python', 'e:/builds/moz2_slave/m-esr24-w32-d-0000000000000000/build/build/pymake/make.py', '-f', 'client.mk', 'build', u'MOZ_BUILD_DATE=20140216173550']'> [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion. ]
2014-02-16 17:59:36-0800 [Broker,34234,10.12.40.184] RemoteCommand.disconnect: lost slave
2014-02-16 17:59:36-0800 [Broker,34234,10.12.40.184] releaseLocks(<buildbotcustom.steps.mock.MockCommand instance at 0x4ac6ff38>): []
2014-02-16 17:59:36-0800 [Broker,34234,10.12.40.184] step 'compile' complete: retry
2014-02-16 17:59:36-0800 [Broker,34234,10.12.40.184] <Build WINNT 5.2 mozilla-esr24 leak test build>: build finished

Nothing smoking there AFAICT, cc dustin who has worked on this sort of issue before.
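To check whether the other masters saw the same one-minute burst, a quick sketch like the following buckets detach and duplicate-slave lines in a master's twistd.log by minute. The log path is an assumption; the patterns are taken from the excerpt above.

import re
from collections import Counter

# Quick sketch: count slave-detach / duplicate-slave events per minute in a
# buildbot master's twistd.log, to see whether the disconnects were one burst.
PATTERNS = ("duplicate slave", "BuildSlave.detached")
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2})")

def detach_minutes(path="twistd.log"):
    buckets = Counter()
    with open(path, errors="replace") as log:
        for line in log:
            if any(p in line for p in PATTERNS):
                match = TIMESTAMP.match(line)
                if match:
                    buckets[match.group(1)] += 1
    return buckets

if __name__ == "__main__":
    for minute, count in sorted(detach_minutes().items()):
        print(minute, count)

Running that against bm82-bm86 and comparing the minute counts would confirm (or rule out) a single shared event.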
Duplicate slaves are generally caused by something in the middle of the network that tries to track TCP session state and fails to do so - typically a firewall.
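For background: when a stateful firewall drops an idle session from its table, the slave reconnects from a new source port while the master still thinks the old connection is healthy, which is exactly the "duplicate slave; rejecting new slave and pinging old" pattern in the bm82 log above. One mitigation is to keep the session visibly alive: the slave's buildbot.tac normally carries an application-level keepalive setting, and at the OS level something like the sketch below (timing values are illustrative, not our actual config) enables TCP keepalives on a socket.

import socket

def enable_keepalive(sock, idle=300, interval=60, count=5):
    """Ask the kernel to send TCP keepalive probes on an otherwise idle
    connection, so a stateful firewall keeps the session in its table.
    Timing values are illustrative only."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific knobs; guarded because other platforms name them differently.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)

Either way the goal is the same: make sure some traffic crosses the firewall often enough that the session isn't evicted from its state table mid-build.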
cltbld@W64-IX-SLAVE156 ~
$ tracert buildbot-master82.build.mozilla.org

Tracing route to buildbot-master82.srv.releng.scl3.mozilla.com [10.26.48.52]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  10.12.40.1
  2     1 ms     1 ms     1 ms  v-1042.border1.sjc2.mozilla.net [10.0.12.2]
  3     2 ms     2 ms     2 ms  xe-0-0-1.border2.scl3.mozilla.net [63.245.219.162]
  4     1 ms     1 ms     1 ms  v-1033.border1.scl3.mozilla.net [10.0.22.9]
  5     2 ms     2 ms     2 ms  v-1033.fw1.releng.scl3.mozilla.net [10.0.22.10]
  6     2 ms     2 ms     2 ms  buildbot-master82.srv.releng.scl3.mozilla.com [10.26.48.52]

Touching two scl3 firewalls which might have changed in the networking rebuild a few TCW's ago, plus one in sjc2 and scl3 releng firewall.
(In reply to Nick Thomas [:nthomas] from comment #5)
> Touching two scl3 firewalls which might have changed in the networking
> rebuild a few TCW's ago, plus one in sjc2 and scl3 releng firewall.

Can someone decode this? I'm far from an expert here, but crossing 3 different firewalls (especially one in sjc2) seems undesirable.
Only fw1.releng.scl3 is a firewall. The border routers are not stateful, and thus wouldn't be responsible for session state issues. sjc2 is our POP that, among other things, links datacenters like scl1 and scl3. So what you're seeing here is traffic from scl1 to the POP, into scl3, across border1 and into the releng BU via its firewall, and then to the master.
Thanks for the correction.
No info forthcoming from NetOps, and added masters in SCL3 to avoid talking across the big bad internet. --> FIXED.
Status: NEW → RESOLVED
Closed: 11 years ago
Flags: needinfo?(adam)
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard