Bug 918677 (Closed) · Opened 11 years ago · Closed 11 years ago

spike in infamous disconnects ([Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.) last night on september 19 from 8pm on

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Hardware: x86
OS: All
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 940356

People

(Reporter: cbook, Unassigned)

References

Details

Looking at the logs, something seems to have happened last night between 19:00 and 21:00 PST: a lot of traffic/marking happened in the two bugs above because many disconnects occurred across trees at that time. The conversation I found was:

21:11 < philor> great, 40 disconnects
21:19 < philor> where by 40 I mean more like 150

Do we know what happened last night? (I checked some of the slaves and they seem to be located in SCL3.) Just wondering if something here needs our attention :)

See bug 845994 and bug 710942, with the latest starring happening during that time. Sample logs:

https://tbpl.mozilla.org/php/getParsedLog.php?id=28132812&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=28107098&tree=Mozilla-Aurora
https://tbpl.mozilla.org/php/getParsedLog.php?id=28132774&tree=Mozilla-Beta

So it was on various trees.
Summary: spike in spike in infamous disconnects last night on september 19 from 8pm on → spike in infamous disconnects ([Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.) last night on september 19 from 8pm on
created a sample list from the bugs above at https://etherpad.mozilla.org/slave-disconnect:

slave: talos-r4-lion-012
slave: talos-r4-lion-016
slave: talos-mtnlion-r5-039
slave: talos-mtnlion-r5-017
slave: talos-mtnlion-r5-071
slave: talos-mtnlion-r5-073
slave: talos-mtnlion-r5-074
slave: talos-mtnlion-r5-082
slave: talos-mtnlion-r5-070
slave: talos-mtnlion-r5-013
slave: talos-mtnlion-r5-072
slave: talos-mtnlion-r5-088
slave: talos-mtnlion-r5-060
slave: talos-mtnlion-r5-005
slave: talos-mtnlion-r5-019
slave: talos-r4-lion-001
slave: bld-lion-r5-045
slave: bld-lion-r5-045
slave: talos-linux64-ix-024
slave: talos-r4-lion-078
slave: talos-mtnlion-r5-051
slave: talos-mtnlion-r5-022
slave: talos-r4-lion-050
slave: bld-lion-r5-086
slave: talos-mtnlion-r5-084
slave: talos-r4-snow-075
slave: talos-r4-lion-074
slave: talos-r4-snow-059
slave: w64-ix-slave154
slave: w64-ix-slave70
slave: w64-ix-slave129
slave: t-w864-ix-095
slave: t-w732-ix-130
slave: talos-r4-lion-067
slave: t-w864-ix-061
slave: t-w864-ix-111
slave: t-w864-ix-124
slave: t-w864-ix-107
slave: t-w864-ix-031
slave: t-w732-ix-022
slave: t-w732-ix-035
slave: t-w732-ix-087
slave: t-w732-ix-104
slave: t-xp32-ix-008
slave: t-xp32-ix-061
slave: t-xp32-ix-040
slave: t-xp32-ix-067
slave: t-xp32-ix-076
slave: w64-ix-slave151
Tomcat put together a nice list of almost 50 slaves that hit this. I did some analysis and found that nearly all of them were on use1 masters. The breakdown is as follows:

buildbot-master77.srv.releng.use1.mozilla.com - 16
buildbot-master69.srv.releng.use1.mozilla.com - 8
buildbot-master71.srv.releng.use1.mozilla.com - 6
buildbot-master76.srv.releng.use1.mozilla.com - 6
buildbot-master75.srv.releng.use1.mozilla.com - 2
buildbot-master61.srv.releng.use1.mozilla.com - 2
buildbot-master63.srv.releng.use1.mozilla.com - 2
buildbot-master70.srv.releng.use1.mozilla.com - 2
buildbot-master62.srv.releng.use1.mozilla.com - 1
buildbot-master67.srv.releng.use1.mozilla.com - 1
buildbot-master58.srv.releng.usw2.mozilla.com - 1
buildbot-master65.srv.releng.usw2.mozilla.com - 1
buildbot-master79.srv.releng.usw2.mozilla.com - 1

Given that, I'm not exactly sure what to think. The fact that most of them are on use1 leads me to think that the problem lies somewhere between our ISPs and Amazon. But having a few usw2 masters mixed in doesn't support that theory (unless those are just coincidence).

Casey, you pulled up some graphs the last time this happened (smokeping ones IIRC). Could you do that again for both use1 and usw2? Let's say from 6pm PT yesterday to this morning if you need a time range.
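For the record, a tally like this is easy to reproduce from the etherpad list. A throwaway Python sketch (the input file name and the "slave master" line format are assumptions for illustration, not the actual tooling used):

from collections import Counter

# Each input line is assumed to pair a slave with the master it was attached
# to, e.g. "w64-ix-slave151 buildbot-master62.srv.releng.use1.mozilla.com"
counts = Counter()
with open("slave-masters.txt") as f:
    for line in f:
        parts = line.split()
        if len(parts) == 2:
            counts[parts[1]] += 1

for master, n in counts.most_common():
    print("%s - %d" % (master, n))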
Dustin, I don't understand much about networks, and specifically our setup. Do you see anything wrong with this traceroute? Or even slow?

Dustin, would you please be able to get us some network logs of buildbot-rw-vip.db.scl3.mozilla.com between 20:45 and 21:15? I could be asking nonsense, so please let me know. Do we have ganglia access for this?

Dustin, what info do we have wrt our links around that time? Sorry if I'm asking nonsense, but I would like to understand what happened during that time before closing this.

[cltbld@buildbot-master62.srv.releng.use1.mozilla.com build1]$ ping v-1034.border2.scl3.mozilla.net
PING v-1034.border2.scl3.mozilla.net (10.0.22.13) 56(84) bytes of data.
64 bytes from v-1034.border2.scl3.mozilla.net (10.0.22.13): icmp_seq=1 ttl=62 time=83.2 ms
64 bytes from v-1034.border2.scl3.mozilla.net (10.0.22.13): icmp_seq=2 ttl=62 time=86.6 ms

[cltbld@buildbot-master62.srv.releng.use1.mozilla.com build1]$ traceroute buildbot-rw-vip.db.scl3.mozilla.com
traceroute to buildbot-rw-vip.db.scl3.mozilla.com (10.22.70.57), 30 hops max, 60 byte packets
 1  169.254.255.21 (169.254.255.21)  0.704 ms  0.685 ms  0.659 ms
 2  72.21.209.193 (72.21.209.193)  0.642 ms  0.615 ms  0.594 ms
 3  169.254.255.74 (169.254.255.74)  85.225 ms  85.205 ms  85.174 ms
 4  v-1034.border2.scl3.mozilla.net (10.0.22.13)  82.694 ms  82.665 ms  82.643 ms
 5-30  * * *  (no response from any remaining hop)

[cltbld@buildbot-master62.srv.releng.use1.mozilla.com build1]$ traceroute 10.12.40.169
traceroute to 10.12.40.169 (10.12.40.169), 30 hops max, 60 byte packets
 1  169.254.255.17 (169.254.255.17)  0.695 ms  0.684 ms  0.661 ms
 2  72.21.209.193 (72.21.209.193)  0.613 ms  0.573 ms  0.564 ms
 3  169.254.255.74 (169.254.255.74)  77.779 ms  77.768 ms  77.743 ms
 4  v-1033.border1.scl3.mozilla.net (10.0.22.9)  84.852 ms  84.827 ms  84.794 ms
 5  v-1050.border1.pao1.mozilla.net (10.0.12.6)  85.377 ms  85.012 ms  85.337 ms
 6-30  * * *  (no response from any remaining hop)

It seems we have a couple of connections that could be bad (not certain):
- host on scl3 <-> buildbot-master on AWS
- buildbot-rw-vip on scl3 <-> buildbot-master on AWS

I have this from w64-ix-slave151; it seems that it took 4 minutes for it to be able to connect properly. I have copied the buildbot-master62 log to /builds/buildbot/build1/twistd.log.4 in case anyone wants to have a look at it.
A lot of these:
- killing new slave on IPv4Address(TCP, '10.12.40.26', 50172)
- duplicate slave w64-ix-slave137; rejecting new slave and pinging old
- didn't hear back from slave in 30 seconds; killing existing connection
- disconnecting old slave w64-ix-slave155 now
- waiting for slave to finish disconnecting

###### buildbot-master62 #######
2013-09-19 21:02:51-0700 [-] prioritizeBuilders: 15.94s finished prioritization
2013-09-19 21:02:51-0700 [-] Pulse <0x55ae2170>: Processed 2 events (0 heartbeats) in 0.00 seconds
2013-09-19 21:02:51-0700 [-] expiring old connection
2013-09-19 21:02:51-0700 [-] adbapi closing: MySQLdb
2013-09-19 21:02:51-0700 [-] adbapi connecting: MySQLdb {'use_unicode': True, 'passwd': 'blah', 'charset': 'utf8', 'db': 'buildbot_schedulers', 'host': 'buildbot-rw-vip.db.scl3.mozilla.com', 'user': 'buildbot2'}
2013-09-19 21:02:51-0700 [-] expiring old connection
2013-09-19 21:02:51-0700 [-] adbapi closing: MySQLdb
2013-09-19 21:02:51-0700 [-] adbapi connecting: MySQLdb {'use_unicode': True, 'passwd': 'blah', 'charset': 'utf8', 'db': 'buildbot_schedulers', 'host': 'buildbot-rw-vip.db.scl3.mozilla.com', 'user': 'buildbot2'}
2013-09-19 21:02:52-0700 [-] expiring old connection
2013-09-19 21:02:52-0700 [-] adbapi closing: MySQLdb
2013-09-19 21:02:52-0700 [-] adbapi connecting: MySQLdb {'use_unicode': True, 'passwd': 'blah', 'charset': 'utf8', 'db': 'buildbot_schedulers', 'host': 'buildbot-rw-vip.db.scl3.mozilla.com', 'user': 'buildbot2'}
...
2013-09-19 21:05:46-0700 [-] adbapi closing: MySQLdb
2013-09-19 21:05:46-0700 [-] adbapi connecting: MySQLdb {'use_unicode': True, 'passwd': 'blah', 'charset': 'utf8', 'db': 'buildbot_schedulers', 'host': 'buildbot-rw-vip.db.scl3.mozilla.com', 'user': 'buildbot2'}
...
2013-09-19 21:05:54-0700 [-] killing new slave on IPv4Address(TCP, '10.12.40.26', 50161)
2013-09-19 21:05:54-0700 [-] killing new slave on IPv4Address(TCP, '10.12.40.102', 50178)
2013-09-19 21:05:54-0700 [-] killing new slave on IPv4Address(TCP, '10.12.40.183', 55468)
2013-09-19 21:05:54-0700 [-] killing new slave on IPv4Address(TCP, '10.12.40.169', 64388)
2013-09-19 21:05:54-0700 [-] killing new slave on IPv4Address(TCP, '10.12.40.187', 50686)
2013-09-19 21:05:55-0700 [-] expiring old connection
2013-09-19 21:05:55-0700 [-] adbapi closing: MySQLdb
...
2013-09-19 21:05:58-0700 [Broker,16673,10.12.40.169] duplicate slave w64-ix-slave137; rejecting new slave and pinging old
2013-09-19 21:05:58-0700 [Broker,16673,10.12.40.169] old slave was connected from IPv4Address(TCP, '10.12.40.169', 49172)
2013-09-19 21:05:58-0700 [Broker,16673,10.12.40.169] new slave is from IPv4Address(TCP, '10.12.40.169', 64393)
...
2013-09-19 21:06:01-0700 [Broker,16670,10.12.40.102] duplicate slave w64-ix-slave72; rejecting new slave and pinging old
2013-09-19 21:06:01-0700 [Broker,16670,10.12.40.102] old slave was connected from IPv4Address(TCP, '10.12.40.102', 49176)
2013-09-19 21:06:01-0700 [Broker,16670,10.12.40.102] new slave is from IPv4Address(TCP, '10.12.40.102', 50183)
2013-09-19 21:06:01-0700 [Broker,16671,10.12.40.183] duplicate slave w64-ix-slave151; rejecting new slave and pinging old
2013-09-19 21:06:01-0700 [Broker,16671,10.12.40.183] old slave was connected from IPv4Address(TCP, '10.12.40.183', 49172)
2013-09-19 21:06:01-0700 [Broker,16671,10.12.40.183] new slave is from IPv4Address(TCP, '10.12.40.183', 55473)
2013-09-19 21:06:01-0700 [Broker,16672,10.12.40.26] duplicate slave w64-ix-slave06; rejecting new slave and pinging old
2013-09-19 21:06:01-0700 [Broker,16672,10.12.40.26] old slave was connected from IPv4Address(TCP, '10.12.40.26', 49172)
2013-09-19 21:06:01-0700 [Broker,16672,10.12.40.26] new slave is from IPv4Address(TCP, '10.12.40.26', 50166)
2013-09-19 21:06:01-0700 [Broker,16674,10.12.40.187] duplicate slave w64-ix-slave155; rejecting new slave and pinging old
2013-09-19 21:06:01-0700 [Broker,16674,10.12.40.187] old slave was connected from IPv4Address(TCP, '10.12.40.187', 49172)
2013-09-19 21:06:01-0700 [Broker,16674,10.12.40.187] new slave is from IPv4Address(TCP, '10.12.40.187', 50693)
...
2013-09-19 21:06:03-0700 [-] killing new slave on IPv4Address(TCP, '10.12.40.169', 64393)
2013-09-19 21:06:06-0700 [-] killing new slave on IPv4Address(TCP, '10.12.40.102', 50183)
2013-09-19 21:06:06-0700 [-] killing new slave on IPv4Address(TCP, '10.12.40.183', 55473)
2013-09-19 21:06:06-0700 [-] killing new slave on IPv4Address(TCP, '10.12.40.26', 50166)
2013-09-19 21:06:06-0700 [-] killing new slave on IPv4Address(TCP, '10.12.40.187', 50693)
2013-09-19 21:06:06-0700 [Broker,16675,10.12.40.169] duplicate slave w64-ix-slave137; rejecting new slave and pinging old
2013-09-19 21:06:06-0700 [Broker,16675,10.12.40.169] old slave was connected from IPv4Address(TCP, '10.12.40.169', 49172)
2013-09-19 21:06:06-0700 [Broker,16675,10.12.40.169] new slave is from IPv4Address(TCP, '10.12.40.169', 64396)
2013-09-19 21:06:09-0700 [-] expiring old connection
...
2013-09-19 21:06:28-0700 [Broker,16619,10.12.40.26] Unhandled error in Deferred:
2013-09-19 21:06:28-0700 [Broker,16619,10.12.40.26] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionDone'>: Connection was closed cleanly.
        ]
2013-09-19 21:06:28-0700 [Broker,16619,10.12.40.26] Unhandled error in Deferred:
2013-09-19 21:06:28-0700 [Broker,16619,10.12.40.26] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionDone'>: Connection was closed cleanly.
        ]
2013-09-19 21:06:28-0700 [Broker,16619,10.12.40.26] Unhandled error in Deferred:
2013-09-19 21:06:28-0700 [Broker,16619,10.12.40.26] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionDone'>: Connection was closed cleanly.
        ]
2013-09-19 21:06:28-0700 [Broker,16619,10.12.40.26] Unhandled error in Deferred:
2013-09-19 21:06:28-0700 [Broker,16619,10.12.40.26] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionDone'>: Connection was closed cleanly.
        ]
...
2013-09-19 21:06:28-0700 [-] didn't hear back from slave in 30 seconds; killing existing connection
2013-09-19 21:06:28-0700 [-] disconnecting old slave w64-ix-slave72 now
2013-09-19 21:06:28-0700 [-] waiting for slave to finish disconnecting
2013-09-19 21:06:28-0700 [-] didn't hear back from slave in 30 seconds; killing existing connection
2013-09-19 21:06:28-0700 [-] disconnecting old slave w64-ix-slave151 now
2013-09-19 21:06:28-0700 [-] waiting for slave to finish disconnecting
2013-09-19 21:06:28-0700 [-] didn't hear back from slave in 30 seconds; killing existing connection
2013-09-19 21:06:28-0700 [-] disconnecting old slave w64-ix-slave137 now
2013-09-19 21:06:28-0700 [-] waiting for slave to finish disconnecting
2013-09-19 21:06:28-0700 [-] didn't hear back from slave in 30 seconds; killing existing connection
2013-09-19 21:06:28-0700 [-] disconnecting old slave w64-ix-slave155 now
2013-09-19 21:06:28-0700 [-] waiting for slave to finish disconnecting
2013-09-19 21:06:28-0700 [-] killing new slave on IPv4Address(TCP, '10.12.40.26', 50172)
2013-09-19 21:06:28-0700 [-] killing new slave on IPv4Address(TCP, '10.12.40.169', 64399)
###### buildbot-master62 #######

###### w64-ix-slave151 #######
2013-09-19 21:03:31-0700 [-] command finished with signal None, exit code 0, elapsedTime: 3188.283000
2013-09-19 21:03:31-0700 [-] SlaveBuilder.commandComplete <buildslave.commands.shell.SlaveShellCommand instance at 0x02F21DC8>
2013-09-19 21:03:50-0700 [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate
2013-09-19 21:03:50-0700 [Broker,client] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.]
2013-09-19 21:03:50-0700 [Broker,client] SlaveBuilder._ackFailed: sendComplete
2013-09-19 21:03:50-0700 [Broker,client] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
        ]
2013-09-19 21:03:50-0700 [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate
2013-09-19 21:03:50-0700 [Broker,client] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
        ]
2013-09-19 21:03:50-0700 [Broker,client] lost remote
2013-09-19 21:03:50-0700 [Broker,client] lost remote
...
2013-09-19 21:03:50-0700 [Broker,client] lost remote
2013-09-19 21:03:50-0700 [Broker,client] lost remote
2013-09-19 21:03:51-0700 [Broker,client] Lost connection to buildbot-master62.srv.releng.use1.mozilla.com:9001
2013-09-19 21:03:51-0700 [Broker,client] <twisted.internet.tcp.Connector instance at 0x02B12AA8> will retry in 2 seconds
2013-09-19 21:03:51-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x02C98490>
2013-09-19 21:03:53-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x02C98490>
2013-09-19 21:03:53-0700 [-] Connecting to buildbot-master62.srv.releng.use1.mozilla.com:9001
2013-09-19 21:04:14-0700 [Uninitialized] Connection to buildbot-master62.srv.releng.use1.mozilla.com:9001 failed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.TCPTimedOutError'>: TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
        ]
2013-09-19 21:04:14-0700 [Uninitialized] <twisted.internet.tcp.Connector instance at 0x02B12AA8> will retry in 8 seconds
2013-09-19 21:04:14-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x02C98490>
2013-09-19 21:04:22-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x02C98490>
2013-09-19 21:04:22-0700 [-] Connecting to buildbot-master62.srv.releng.use1.mozilla.com:9001
2013-09-19 21:04:43-0700 [Uninitialized] Connection to buildbot-master62.srv.releng.use1.mozilla.com:9001 failed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.TCPTimedOutError'>: TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
        ]
2013-09-19 21:04:43-0700 [Uninitialized] <twisted.internet.tcp.Connector instance at 0x02B12AA8> will retry in 25 seconds
2013-09-19 21:04:43-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x02C98490>
2013-09-19 21:05:09-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x02C98490>
2013-09-19 21:05:09-0700 [-] Connecting to buildbot-master62.srv.releng.use1.mozilla.com:9001
2013-09-19 21:05:21-0700 [Broker,client] Connected to buildbot-master62.srv.releng.use1.mozilla.com:9001; slave is ready
2013-09-19 21:05:26-0700 [Broker,client] Lost connection to buildbot-master62.srv.releng.use1.mozilla.com:9001
2013-09-19 21:05:26-0700 [Broker,client] <twisted.internet.tcp.Connector instance at 0x02B12AA8> will retry in 2 seconds
2013-09-19 21:05:26-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x02C98490>
2013-09-19 21:05:28-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x02C98490>
2013-09-19 21:05:28-0700 [-] Connecting to buildbot-master62.srv.releng.use1.mozilla.com:9001
2013-09-19 21:05:32-0700 [Broker,client] Connected to buildbot-master62.srv.releng.use1.mozilla.com:9001; slave is ready
2013-09-19 21:05:37-0700 [Broker,client] Lost connection to buildbot-master62.srv.releng.use1.mozilla.com:9001
2013-09-19 21:05:37-0700 [Broker,client] <twisted.internet.tcp.Connector instance at 0x02B12AA8> will retry in 2 seconds
2013-09-19 21:05:37-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x02C98490>
2013-09-19 21:05:40-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x02C98490>
2013-09-19 21:05:40-0700 [-] Connecting to buildbot-master62.srv.releng.use1.mozilla.com:9001
2013-09-19 21:05:44-0700 [Broker,client] Connected to buildbot-master62.srv.releng.use1.mozilla.com:9001; slave is ready
2013-09-19 21:05:49-0700 [Broker,client] Lost connection to buildbot-master62.srv.releng.use1.mozilla.com:9001
2013-09-19 21:05:49-0700 [Broker,client] <twisted.internet.tcp.Connector instance at 0x02B12AA8> will retry in 3 seconds
2013-09-19 21:05:49-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x02C98490>
2013-09-19 21:05:53-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x02C98490>
2013-09-19 21:05:53-0700 [-] Connecting to buildbot-master62.srv.releng.use1.mozilla.com:9001
2013-09-19 21:07:28-0700 [Broker,client] message from master: attached
Group: mozilla-corporation-confidential
(In reply to Ben Hearsum [:bhearsum] from comment #2)
> Casey, you pulled up some graphs the last time this happened (smokeping ones
> IIRC). Could you do that again for both use1 and usw2? Let's say from 6pm PT
> yesterday to this morning if you need a time range.

Sorry, meant to set needinfo...
Flags: needinfo?(cransom)
09/20 17:41:41 fw1.releng.console.scl3.mozilla.net %-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 169.254.255.73 (External AS 7224) changed state from OpenConfirm to Established (event RecvKeepAlive)
09/20 17:41:13 fw1.releng.console.scl3.mozilla.net %-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 169.254.255.73 (External AS 7224) changed state from Established to Idle (event RecvNotify)
09/20 04:04:57 fw1.releng.console.scl3.mozilla.net %-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 169.254.255.73 (External AS 7224) changed state from OpenConfirm to Established (event RecvKeepAlive)
09/20 04:04:52 fw1.releng.console.scl3.mozilla.net %-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 169.254.255.77 (External AS 7224) changed state from OpenConfirm to Established (event RecvKeepAlive)
09/20 04:04:20 fw1.releng.console.scl3.mozilla.net %-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 169.254.255.73 (External AS 7224) changed state from Established to Idle (event HoldTime)
09/20 04:04:19 fw1.releng.console.scl3.mozilla.net %-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 169.254.255.77 (External AS 7224) changed state from Established to Idle (event HoldTime)
09/19 21:53:25 fw1.releng.console.scl3.mozilla.net %-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 169.254.249.29 (External AS 7224) changed state from OpenConfirm to Established (event RecvKeepAlive)
09/19 21:53:23 fw1.releng.console.scl3.mozilla.net %-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 169.254.249.29 (External AS 7224) changed state from Established to Idle (event RecvNotify)

These timestamps are UTC. After shifting them 7 hours, the break you are seeing was an Amazon VPN failure. Both VPNs received an idle timeout at 04:04:19 (3 minutes with no activity, after which BGP considers the sessions dead), and at 04:04:52 they were re-established. There is absolutely nothing netops can do about these, because the failure isn't on our side and we are already wired for all the redundancy Amazon can supply.
Flags: needinfo?(cransom)
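The 7-hour shift above is just UTC to PDT; for anyone double-checking the correlation, a quick sketch of the conversion (the year and log format are assumptions):

from datetime import datetime, timedelta

stamp = "09/20 04:04:19"  # firewall log time, UTC
utc = datetime.strptime("2013/" + stamp, "%Y/%m/%d %H:%M:%S")
pdt = utc - timedelta(hours=7)  # PDT is UTC-7
print(pdt.strftime("%m/%d %H:%M:%S"))  # 09/19 21:04:19, right in the disconnect window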
The old-slave/new-slave stuff generally indicates that the slave and master have different ideas of current connections: the master thinks the "old" connection is still live, while the slave thinks that connection is closed and is automatically initiating a new one. In this situation, the master tests the old connection before admitting the new.

In the past, we've seen this caused by stateful firewalls losing connection state. The slave sends some TCP packets, and the firewall, having no memory of the connection they are a part of, drops them. The slave retransmits a few times, and finally closes the connection in error. During this time, the master has not seen any packets and continues waiting patiently to hear back from the slave -- there is no TCP timer to run out on the master[1]. The slave automatically reconnects with the normal TCP connection-setup traffic, which the firewall allows, and we see the symptoms described in the previous paragraph. As you've suggested, it may be that Windows slaves time out their connections more quickly than Linux or Darwin slaves[2]. I'm happy to further explain this via Vidyo if you'd like.

It's also possible for extended packet loss to cause this same behavior, if the slave gives up on retransmissions during the outage. This appears to be the more likely scenario, given Casey's information. A BGP session loss like this one is based on communication failures over the public Internet, in this case between scl3 and use1. The Internet is a wild and crazy place, with routes shifting all the time, and is neither within Mozilla's nor Amazon's network.

I suspect that the next step for pursuing this is to talk to Amazon and see if they can provide any more detail (e.g., logs) on what they are seeing, or any other advice or diagnostics they can provide. Using usw2 instead might help, too - I haven't heard about issues with masters there, and inventory shows quite a few.

[1] Aside from TCP keepalives, which trigger at 2 hours; shortening those would not fix this problem, but only make the timeouts occur more quickly.
[2] http://stackoverflow.com/questions/5227520/how-many-times-will-tcp-retransmit (Linux: about 15m)
    http://support.microsoft.com/kb/170359 (Windows: about 5m)
    Note that these timeouts depend on the measured RTT, and can vary significantly.
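To make footnote [1] concrete, these are the per-socket knobs involved. A minimal sketch of tuning TCP keepalive on a client socket, assuming Linux (the option names are Linux-specific and the values are purely illustrative; per the footnote, shortening them would only make the dead connection get noticed sooner, not prevent the disconnect):

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)     # enable keepalive probes
# Linux kernel defaults give the ~2-hour behavior described above.
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 300)  # idle seconds before the first probe
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60)  # seconds between probes
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # failed probes before the kernel drops the connection
s.connect(("buildbot-master62.srv.releng.use1.mozilla.com", 9001))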
(In reply to Dustin J. Mitchell [:dustin] from comment #6)

Thanks for all your insight into this, it's much appreciated.

For some reason it reminded me that we also used to have problems with Buildbot masters hanging for long periods of time (in the past this was often due to /waterfall requests or very large logs). I wonder if the masters are getting close to that state again, and are unable to respond to keepalives or other updates from slaves in a timely manner. I know that we've correlated some of these failures to network issues, but I wonder if there's more than one cause here.
We should be able to monitor things like that with collectd - either using the existing OS-level metrics, or by feeding some metrics from Buildbot itself into collectd.
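As a sketch of the second option: collectd's unixsock plugin accepts plain-text PUTVAL commands over a local socket, so a master could push Buildbot-derived gauges into it. A hedged example (the socket path is collectd's usual default, but the metric name and surrounding wiring are made up for illustration):

import socket

def putval(identifier, value, interval=60, path="/var/run/collectd-unixsock"):
    # collectd's unixsock plugin protocol:
    #   PUTVAL <host>/<plugin>/<type> [interval=<seconds>] N:<value>
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(path)
    s.sendall(("PUTVAL %s interval=%d N:%s\n" % (identifier, interval, value)).encode())
    reply = s.recv(1024)  # collectd answers with a status line
    s.close()
    return reply

# e.g. a hypothetical gauge for slaves currently attached to this master
putval("buildbot-master62/buildbot/gauge-connected_slaves", 42)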
Moving this bug to Platform Support, as it is a project to improve the master<->slave connections.
Component: Buildduty → Platform Support
QA Contact: armenzg → coop
We had some more disconnects just now, and RyanVM confirms the time corresponds to a BGP flap:

09/27 20:16:00 %-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 169.254.249.25 (External AS 7224) changed state from Established to Idle (event HoldTime)
09/27 20:16:32 %-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 169.254.249.25 (External AS 7224) changed state from OpenConfirm to Established (event RecvKeepAlive)
So that was a failure of one of the two links, which would have triggered a failover. The time scales are smaller than TCP timeouts, so the more likely explanation here is that a stateful router somewhere is losing session state when the failover occurs. On the Juniper firewall on our end, this is not the case - the TCP tuples before and after the failover are the same, with just the next-hop (not part of the session state) changing. However, it may be a different story on the Amazon end. Catlee's going to get in touch with Amazon to see about that.
Depends on: 925285
Depends on: 867593
Note: this is spiking a lot across all trees at the moment (in case we need to grab debug info/logs).
Amazon's response:

"""
In the case of BGP connection failover, existing TCP connections are unfortunately not going to be transparently handled using the secondary tunnel. The only work-around I can think of would be to support appropriate retry procedures in the application itself or use asynchronous communication methods. Please accept my apologies, we are doing all that's possible to provide highly available and stable VPN connections but if the actual communication happens over Internet, intermittent connectivity issues cannot be entirely avoided despite of the protocol being used.
"""

So yeah, the redundant connections don't act like we thought they did. This argues pretty strongly for having slaves connect to local masters as much as possible for now.
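The retry behavior Amazon suggests is essentially what the slave logs above already show: Twisted's reconnecting factory backing off between attempts. A minimal standalone sketch of that pattern (this is not Buildbot's actual factory, just the generic Twisted mechanism it builds on; the host/port are taken from the logs above):

from twisted.internet import reactor
from twisted.internet.protocol import Protocol, ReconnectingClientFactory

class SlaveProtocol(Protocol):
    def connectionMade(self):
        # once we are attached again, reset the exponential backoff delay
        self.factory.resetDelay()

class SlaveFactory(ReconnectingClientFactory):
    protocol = SlaveProtocol
    maxDelay = 300  # cap the backoff at 5 minutes

# on connection loss, this factory produces the "will retry in N seconds" /
# "Starting factory" lines seen in the slave log excerpts above
reactor.connectTCP("buildbot-master62.srv.releng.use1.mozilla.com", 9001, SlaveFactory())
reactor.run()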
Does this bug need to be MoCo-only? I don't see anything here that really needs to be, so I'd like to open it up.
Group: mozilla-corporation-confidential
Removing bug 867593 as a blocker, since the only master left to move is a dev instance. Arzhel is working on the fix over in bug 940356.
Status: NEW → RESOLVED
Closed: 11 years ago
No longer depends on: 867593
Resolution: --- → DUPLICATE
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard