Closed
Bug 910818
Opened 12 years ago
Closed 12 years ago
Please investigate cause of network disconnects 2013-08-29 10:22-10:24 Pacific
Categories
(Infrastructure & Operations Graveyard :: NetOps, task)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: jhopkins, Assigned: cransom)
Details
Attachments
(1 file)
36.81 KB, image/png
We experienced numerous build slave disconnects (on existing network connections established before the outage) on 2013-08-29 between 10:22 and 10:24. This caused many builds to fail and need to be restarted. Can you please investigate the cause?
log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182526&tree=Mozilla-Inbound
source: w64-ix-slave117.winbuild.scl1.mozilla.com (10.12.40.149)
dest: buildbot-master63.srv.releng.use1.mozilla.com:9001 (10.134.48.196)
time: 2013-08-29 10:24:39.915172
log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182405&tree=Mozilla-Inbound
source: w64-ix-slave78.winbuild.scl1.mozilla.com (10.12.40.108)
dest: buildbot-master63.srv.releng.use1.mozilla.com:9001 (10.134.48.196)
time: 2013-08-29 10:22:32.052494
log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182419&tree=Mozilla-Inbound
source: w64-ix-slave123.winbuild.scl1.mozilla.com (10.12.40.155)
dest: buildbot-master63.srv.releng.use1.mozilla.com:9001 (10.134.48.196)
time: 2013-08-29 10:22:38.034932
log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182531&tree=Mozilla-Inbound
source: w64-ix-slave107.winbuild.scl1.mozilla.com (10.12.40.139)
dest: buildbot-master63.srv.releng.use1.mozilla.com:9001 (10.134.48.196)
time: 2013-08-29 10:24:45.846917
log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182399&tree=Mozilla-Inbound
source: w64-ix-slave116.winbuild.scl1.mozilla.com (10.12.40.148)
dest: buildbot-master66.srv.releng.usw2.mozilla.com:9001 (10.132.50.247)
time: 2013-08-29 10:22:33.478881
log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182384&tree=Mozilla-Inbound
source: w64-ix-slave120.winbuild.scl1.mozilla.com (10.12.40.152)
dest: buildbot-master66.srv.releng.usw2.mozilla.com:9001 (10.132.50.247)
time: 2013-08-29 10:22:26.527321
log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182407&tree=Mozilla-Inbound
source: w64-ix-slave108.winbuild.scl1.mozilla.com (10.12.40.140)
dest: buildbot-master58.srv.releng.usw2.mozilla.com:9001 (10.132.49.125)
time: 2013-08-29 10:22:35.332269
log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182398&tree=B2g-Inbound
source: w64-ix-slave118.winbuild.scl1.mozilla.com (10.12.40.150)
dest: buildbot-master65.srv.releng.usw2.mozilla.com:9001 (10.132.49.112)
time: 2013-08-29 10:22:38.532237
log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182515&tree=B2g-Inbound
source: w64-ix-slave111.winbuild.scl1.mozilla.com (10.12.40.143)
dest: buildbot-master61.srv.releng.use1.mozilla.com:9001 (10.134.49.62)
time: 2013-08-29 10:24:33.499145
log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182412&tree=B2g-Inbound
source: w64-ix-slave151.winbuild.scl1.mozilla.com (10.12.40.183)
dest: buildbot-master62.srv.releng.use1.mozilla.com:9001 (10.134.48.236)
time: 2013-08-29 10:22:34.410747
I noticed some cases of builds completing successfully on two of the buildbot masters above. In those cases, the build slaves were on Amazon (whereas all the failing slaves listed above are in scl1).
http://buildbot-master63.srv.releng.use1.mozilla.com:8001/builders/b2g_b2g-inbound_emulator_dep/builds/64
http://buildbot-master58.srv.releng.usw2.mozilla.com:8001/builders/Android%20no-ionmonkey%20mozilla-inbound%20build/builds/577
http://buildbot-master58.srv.releng.usw2.mozilla.com:8001/builders/Android%20Debug%20mozilla-inbound%20build/builds/552
Notes:
- all times are PDT
- comment 0 may not list all hosts affected by the event; these are simply the ones we noticed
Assignee
Updated•12 years ago
Assignee: network-operations → cransom
Assignee
Comment 2•12 years ago
I saw no events on the mozilla network and no VPN failures in scl1 or scl3. Given that your failure window was 2 minutes, which is shorter than our default timers for BGP failover of the VPN tunnel, it's possible there was internet churn for those 2 minutes or that the Amazon VPC VPN endpoints had problems. If you can make a target available (a static host that responds to ping) in each region or AZ for our smokeping instance, we can add it to monitoring and get a better idea of the failure radius for future issues. Please keep in mind that Amazon provides no SLA for VPC connectivity (VPN or their Direct Connect service), so slaves in scl1/scl3 connecting to masters in the cloud will likely be less stable than connections to masters housed in Mozilla infrastructure.
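For illustration, such targets could be wired into smokeping with a small Targets stanza along these lines; the host names below are hypothetical placeholders, not real hosts:
# fragment of a smokeping Targets file; host names are placeholders
+ AWS
menu = AWS
title = Amazon VPC regions
++ use1
menu = use1
title = releng use1 ping target
host = ping-target.releng.use1.example.com
++ usw2
menu = usw2
title = releng usw2 ping target
host = ping-target.releng.usw2.example.com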
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME
Comment 3•12 years ago
Thanks - we may have those hosts available for smokeping; see bug 896812 comment 18. I'll put a note there so :ashish can discuss it with you when he's back.
Comment 4•12 years ago
So far we have only seen this happening on Win64 build machines in SCL1.
Have we seen it happening on other platforms?
Any of them in SCL3 or MTV1?
I want to collect more info about this problem and see what we can do about it.
It would be great if we changed the way buildbot works to minimize the number of long-lived network connections it keeps open.
Do we know whether it was a DNS issue or a dropped TCP connection?
Comment 5•12 years ago
DNS won't cause disconnects - the lookup happens before the connection is established.
Reporter
Comment 6•12 years ago
What we saw sounds a lot like https://en.wikipedia.org/wiki/Congestive_collapse#Congestive_collapse
Assignee
Comment 7•12 years ago
Attached are the uplink graphs for scl1. We are well below the maximum throughput capability of the firewall and a long way from saturating the network uplinks; this is a pretty good graph of what network saturation/congestion doesn't look like. One thing you can monitor if you want to detect network back pressure is the TCP stack statistics on the sender and receiver. If you see TCP retransmits incrementing heavily (in the standard SNMP MIBs, .1.3.6.1.2.1.6.12 or tcpRetransSegs), that indicates congestion and packet loss.
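For a quick local check without SNMP, the same counters are visible in each host's own TCP statistics; a rough sketch (the exact wording of the output varies by OS version):
# Linux: running total of retransmitted segments since boot
netstat -s | grep -i retrans
# Windows: look for "Segments Retransmitted" under TCP Statistics
netstat -s -p tcp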
Comment 8•12 years ago
> One thing you can monitor if you want to detect network back pressure is the TCP stack
> statistics on the sender and receiver. If you see TCP retransmits incrementing heavily
> (in the standard SNMP MIBs, .1.3.6.1.2.1.6.12 or tcpRetransSegs), that indicates
> congestion and packet loss
Do you know by any chance how we can do this on Windows? Or even on Linux, so we could read up on the topic? I'm clueless.
From IRC:
11:54 dustin: the "connection lost" is usually in response to ECONNRESET from the OS socket layer
11:54 Callek: armenzg_lunch: we're talking windows, as far back as XP, all bets are off :-p
11:54 dustin: but that only happens when the socket layer figures out that something is wrong
11:55 dustin: a read() can hang forever waiting for an incoming packet from a slave, and if that packet never arrives, no error is generated (on the side doing the read())
12:55 armenzg: so Windows so far?
12:56 armenzg: dustin: does that mean that Windows' socket layer is more fragile?
12:57 dustin: yes, or more specifically has different timeouts, behaviors, etc.
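One way to mitigate the hang-forever case dustin describes is TCP keepalive, so the OS eventually notices a dead peer. A minimal sketch of where those timers live, assuming the sockets in question enable SO_KEEPALIVE (keepalive has no effect otherwise):
# Linux: inspect (and optionally lower) the system keepalive timers
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
# Windows: keepalive interval is the KeepAliveTime value (milliseconds); the value may be
# absent if the default of 2 hours is in effect
reg query HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v KeepAliveTime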
Assignee
Comment 9•12 years ago
It took me a bit to find a Windows machine I could put the SNMP agent on, as all of our Windows ninjas are out, but I found an XP VM and enabled SNMP.
0ogre:~% snmpwalk -v2c -c public 198.18.1.48 tcpRetransSegs.0
TCP-MIB::tcpRetransSegs.0 = Counter32: 0
The same method works for Linux. There are more useful methods, such as collectd plugins, but SNMP should be universal. Just take note: the only way the counter stays at 0 is if the machine is completely idle. TCP retransmissions are common, and there are many reasons why network traffic may need to be retransmitted. Small increments are normal; large ones (>1000 in a minute) would be worth looking into, but it depends on the environment.
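To put a number on "in a minute", a throwaway polling loop along these lines would show the per-minute delta; HOST and COMMUNITY below are placeholders:
# hypothetical example: print the per-minute increase in tcpRetransSegs for one host
HOST=10.12.40.149
COMMUNITY=public
prev=$(snmpget -v2c -c "$COMMUNITY" -Oqv "$HOST" TCP-MIB::tcpRetransSegs.0)
while sleep 60; do
  cur=$(snmpget -v2c -c "$COMMUNITY" -Oqv "$HOST" TCP-MIB::tcpRetransSegs.0)
  echo "$(date) retransmits in the last minute: $((cur - prev))"
  prev=$cur
done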
Updated•2 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard