Closed
Bug 918677
Opened 11 years ago
Closed 11 years ago
spike in infamous disconnects ([Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.) last night on september 19 from 8pm on
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
DUPLICATE
of bug 940356
People
(Reporter: cbook, Unassigned)
References
Details
Looking at the logs, something seems to have happened last night between 19:00 and 21:00 PST: a lot of traffic/marking happened in the two bugs above, because a lot of disconnects happened at that time across trees.
The conversation I found was:
21:11 < philor> great, 40 disconnects
21:19 < philor> where by 40 I mean more like 150
Do we know what happened last night? (I checked some of the slaves and they seem to be located in SCL3.) Just wondering if something here needs our attention :)
See bug 845994 and bug 710942, with the latest starting during that time.
Sample logs are
https://tbpl.mozilla.org/php/getParsedLog.php?id=28132812&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=28107098&tree=Mozilla-Aurora
https://tbpl.mozilla.org/php/getParsedLog.php?id=28132774&tree=Mozilla-Beta
So it happened on various trees.
Updated•11 years ago
Summary: spike in spike in infamous disconnects last night on september 19 from 8pm on → spike in infamous disconnects ([Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.) last night on september 19 from 8pm on
Comment 1•11 years ago
created a sample list from the bugs above on https://etherpad.mozilla.org/slave-disconnect
slave: talos-r4-lion-012
slave: talos-r4-lion-016
slave: talos-mtnlion-r5-039
slave: talos-mtnlion-r5-017
slave: talos-mtnlion-r5-071
slave: talos-mtnlion-r5-073
slave: talos-mtnlion-r5-074
slave: talos-mtnlion-r5-082
slave: talos-mtnlion-r5-070
slave: talos-mtnlion-r5-013
slave: talos-mtnlion-r5-072
slave: talos-mtnlion-r5-088
slave: talos-mtnlion-r5-060
slave: talos-mtnlion-r5-005
slave: talos-mtnlion-r5-019
slave: talos-r4-lion-001
slave: bld-lion-r5-045
slave: bld-lion-r5-045
slave: talos-linux64-ix-024
slave: talos-r4-lion-078
slave: talos-mtnlion-r5-051
slave: talos-mtnlion-r5-022
slave: talos-r4-lion-050
slave: bld-lion-r5-086
slave: talos-mtnlion-r5-084
slave: talos-r4-snow-075
slave: talos-r4-lion-074
slave: talos-r4-snow-059
slave: w64-ix-slave154
slave: w64-ix-slave70
slave: w64-ix-slave129
slave: t-w864-ix-095
slave: t-w732-ix-130
slave: talos-r4-lion-067
slave: t-w864-ix-061
slave: t-w864-ix-111
slave: t-w864-ix-124
slave: t-w864-ix-107
slave: t-w864-ix-031
slave: t-w732-ix-022
slave: t-w732-ix-035
slave: t-w732-ix-087
slave: t-w732-ix-104
t-xp32-ix-008
t-xp32-ix-061
slave: t-xp32-ix-040
slave: t-xp32-ix-067
slave: t-xp32-ix-076
slave: w64-ix-slave151
Comment 2•11 years ago
Tomcat put together a nice list of almost 50 slaves that hit this. I did some analysis and found that nearly all of them were on use1 masters. The breakdown is as follows:
buildbot-master77.srv.releng.use1.mozilla.com - 16
buildbot-master69.srv.releng.use1.mozilla.com - 8
buildbot-master71.srv.releng.use1.mozilla.com - 6
buildbot-master76.srv.releng.use1.mozilla.com - 6
buildbot-master75.srv.releng.use1.mozilla.com - 2
buildbot-master61.srv.releng.use1.mozilla.com - 2
buildbot-master63.srv.releng.use1.mozilla.com - 2
buildbot-master70.srv.releng.use1.mozilla.com - 2
buildbot-master62.srv.releng.use1.mozilla.com - 1
buildbot-master67.srv.releng.use1.mozilla.com - 1
buildbot-master58.srv.releng.usw2.mozilla.com - 1
buildbot-master65.srv.releng.usw2.mozilla.com - 1
buildbot-master79.srv.releng.usw2.mozilla.com - 1
Given that, I'm not exactly sure what to think. The fact that most of them are on use1 leads me to think the problem lies somewhere between our ISPs and Amazon, but having a few usw2 masters mixed in doesn't support that theory (unless those are just a coincidence).
Casey, you pulled up some graphs the last time this happened (smokeping ones IIRC). Could you do that again for both use1 and usw2? Let's say from 6pm PT yesterday to this morning if you need a time range.
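For reference, a per-master tally like the breakdown above can be generated from (slave, master) observations with a short script. This is a sketch with made-up sample data, not the actual disconnect records from the etherpad:

```python
# Count disconnects per master, then split the totals by AWS region to
# test the "mostly use1" theory. The observations below are invented
# examples in the same naming scheme as the hosts in this bug.
from collections import Counter

observations = [
    ("w64-ix-slave151", "buildbot-master77.srv.releng.use1.mozilla.com"),
    ("w64-ix-slave137", "buildbot-master77.srv.releng.use1.mozilla.com"),
    ("talos-mtnlion-r5-039", "buildbot-master69.srv.releng.use1.mozilla.com"),
    ("t-w864-ix-095", "buildbot-master58.srv.releng.usw2.mozilla.com"),
]

# Disconnect count per master hostname.
per_master = Counter(master for _slave, master in observations)

# Region totals: elements() repeats each master once per disconnect.
per_region = Counter(
    "use1" if ".use1." in master else "usw2" for master in per_master.elements()
)

for master, n in per_master.most_common():
    print(master, "-", n)
print(dict(per_region))
```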
Comment 3•11 years ago
Dustin, I don't understand much about networks, and specifically our setup. Do you see anything wrong with this traceroute? Or even slow?
Dustin, would you be able to get us some network logs of buildbot-rw-vip.db.scl3.mozilla.com between 20:45 and 21:15? I could be asking nonsense, so please let me know.
Do we have Ganglia access for this?
Dustin, what info do we have with respect to our links around that time?
Sorry if I'm asking nonsense, but I would like to understand what happened during that time before closing this.
[cltbld@buildbot-master62.srv.releng.use1.mozilla.com build1]$ ping v-1034.border2.scl3.mozilla.net
PING v-1034.border2.scl3.mozilla.net (10.0.22.13) 56(84) bytes of data.
64 bytes from v-1034.border2.scl3.mozilla.net (10.0.22.13): icmp_seq=1 ttl=62 time=83.2 ms
64 bytes from v-1034.border2.scl3.mozilla.net (10.0.22.13): icmp_seq=2 ttl=62 time=86.6 ms
[cltbld@buildbot-master62.srv.releng.use1.mozilla.com build1]$ traceroute buildbot-rw-vip.db.scl3.mozilla.com
traceroute to buildbot-rw-vip.db.scl3.mozilla.com (10.22.70.57), 30 hops max, 60 byte packets
1 169.254.255.21 (169.254.255.21) 0.704 ms 0.685 ms 0.659 ms
2 72.21.209.193 (72.21.209.193) 0.642 ms 0.615 ms 0.594 ms
3 169.254.255.74 (169.254.255.74) 85.225 ms 85.205 ms 85.174 ms
4 v-1034.border2.scl3.mozilla.net (10.0.22.13) 82.694 ms 82.665 ms 82.643 ms
5 * * *
6 * * *
7 * * *
8 * * *
9 * * *
10 * * *
11 * * *
12 * * *
13 * * *
14 * * *
15 * * *
16 * * *
17 * * *
18 * * *
19 * * *
20 * * *
21 * * *
22 * * *
23 * * *
24 * * *
25 * * *
26 * * *
27 * * *
28 * * *
29 * * *
30 * * *
[cltbld@buildbot-master62.srv.releng.use1.mozilla.com build1]$ traceroute 10.12.40.169
traceroute to 10.12.40.169 (10.12.40.169), 30 hops max, 60 byte packets
1 169.254.255.17 (169.254.255.17) 0.695 ms 0.684 ms 0.661 ms
2 72.21.209.193 (72.21.209.193) 0.613 ms 0.573 ms 0.564 ms
3 169.254.255.74 (169.254.255.74) 77.779 ms 77.768 ms 77.743 ms
4 v-1033.border1.scl3.mozilla.net (10.0.22.9) 84.852 ms 84.827 ms 84.794 ms
5 v-1050.border1.pao1.mozilla.net (10.0.12.6) 85.377 ms 85.012 ms 85.337 ms
6 * * *
7 * * *
8 * * *
9 * * *
10 * * *
11 * * *
12 * * *
13 * * *
14 * * *
15 * * *
16 * * *
17 * * *
18 * * *
19 * * *
20 * * *
21 * * *
22 * * *
23 * * *
24 * * *
25 * * *
26 * * *
27 * * *
28 * * *
29 * * *
30 * * *
It seems we have a couple of connections that could be bad (not certain):
host on scl3 <-> buildbot-master on AWS
buildbot-rw-vip on scl3 <-> buildbot-master on AWS
I have this from w64-ix-slave151.
It seems that it took 4 minutes to be able to connect properly.
I have copied the buildbot-master62 log to /builds/buildbot/build1/twistd.log.4 in case anyone wants to have a look at it.
A lot of these:
- killing new slave on IPv4Address(TCP, '10.12.40.26', 50172)
- duplicate slave w64-ix-slave137; rejecting new slave and pinging old
- didn't hear back from slave in 30 seconds; killing existing connection
- disconnecting old slave w64-ix-slave155 now
- waiting for slave to finish disconnecting
###### buildbot-master62 #######
2013-09-19 21:02:51-0700 [-] prioritizeBuilders: 15.94s finished prioritization
2013-09-19 21:02:51-0700 [-] Pulse <0x55ae2170>: Processed 2 events (0 heartbeats) in 0.00 seconds
2013-09-19 21:02:51-0700 [-] expiring old connection
2013-09-19 21:02:51-0700 [-] adbapi closing: MySQLdb
2013-09-19 21:02:51-0700 [-] adbapi connecting: MySQLdb {'use_unicode': True, 'passwd': 'blah', 'charset': 'utf8', 'db': 'buildbot_schedulers', 'host': 'buildbot-rw-vip.db.scl3.mozilla.com', 'user': 'buildbot2'}
2013-09-19 21:02:51-0700 [-] expiring old connection
2013-09-19 21:02:51-0700 [-] adbapi closing: MySQLdb
2013-09-19 21:02:51-0700 [-] adbapi connecting: MySQLdb {'use_unicode': True, 'passwd': 'blah', 'charset': 'utf8', 'db': 'buildbot_schedulers', 'host': 'buildbot-rw-vip.db.scl3.mozilla.com', 'user': 'buildbot2'}
2013-09-19 21:02:52-0700 [-] expiring old connection
2013-09-19 21:02:52-0700 [-] adbapi closing: MySQLdb
2013-09-19 21:02:52-0700 [-] adbapi connecting: MySQLdb {'use_unicode': True, 'passwd': 'blah', 'charset': 'utf8', 'db': 'buildbot_schedulers', 'host': 'buildbot-rw-vip.db.scl3.mozilla.com', 'user': 'buildbot2'}
...
2013-09-19 21:05:46-0700 [-] adbapi closing: MySQLdb
2013-09-19 21:05:46-0700 [-] adbapi connecting: MySQLdb {'use_unicode': True, 'passwd': 'blah', 'charset': 'utf8', 'db': 'buildbot_schedulers', 'host': 'buildbot-rw-vip.db.scl3.mozilla.com', 'user': 'buildbot2'}
...
2013-09-19 21:05:54-0700 [-] killing new slave on IPv4Address(TCP, '10.12.40.26', 50161)
2013-09-19 21:05:54-0700 [-] killing new slave on IPv4Address(TCP, '10.12.40.102', 50178)
2013-09-19 21:05:54-0700 [-] killing new slave on IPv4Address(TCP, '10.12.40.183', 55468)
2013-09-19 21:05:54-0700 [-] killing new slave on IPv4Address(TCP, '10.12.40.169', 64388)
2013-09-19 21:05:54-0700 [-] killing new slave on IPv4Address(TCP, '10.12.40.187', 50686)
2013-09-19 21:05:55-0700 [-] expiring old connection
2013-09-19 21:05:55-0700 [-] adbapi closing: MySQLdb
...
2013-09-19 21:05:58-0700 [Broker,16673,10.12.40.169] duplicate slave w64-ix-slave137; rejecting new slave and pinging old
2013-09-19 21:05:58-0700 [Broker,16673,10.12.40.169] old slave was connected from IPv4Address(TCP, '10.12.40.169', 49172)
2013-09-19 21:05:58-0700 [Broker,16673,10.12.40.169] new slave is from IPv4Address(TCP, '10.12.40.169', 64393)
...
2013-09-19 21:06:01-0700 [Broker,16670,10.12.40.102] duplicate slave w64-ix-slave72; rejecting new slave and pinging old
2013-09-19 21:06:01-0700 [Broker,16670,10.12.40.102] old slave was connected from IPv4Address(TCP, '10.12.40.102', 49176)
2013-09-19 21:06:01-0700 [Broker,16670,10.12.40.102] new slave is from IPv4Address(TCP, '10.12.40.102', 50183)
2013-09-19 21:06:01-0700 [Broker,16671,10.12.40.183] duplicate slave w64-ix-slave151; rejecting new slave and pinging old
2013-09-19 21:06:01-0700 [Broker,16671,10.12.40.183] old slave was connected from IPv4Address(TCP, '10.12.40.183', 49172)
2013-09-19 21:06:01-0700 [Broker,16671,10.12.40.183] new slave is from IPv4Address(TCP, '10.12.40.183', 55473)
2013-09-19 21:06:01-0700 [Broker,16672,10.12.40.26] duplicate slave w64-ix-slave06; rejecting new slave and pinging old
2013-09-19 21:06:01-0700 [Broker,16672,10.12.40.26] old slave was connected from IPv4Address(TCP, '10.12.40.26', 49172)
2013-09-19 21:06:01-0700 [Broker,16672,10.12.40.26] new slave is from IPv4Address(TCP, '10.12.40.26', 50166)
2013-09-19 21:06:01-0700 [Broker,16674,10.12.40.187] duplicate slave w64-ix-slave155; rejecting new slave and pinging old
2013-09-19 21:06:01-0700 [Broker,16674,10.12.40.187] old slave was connected from IPv4Address(TCP, '10.12.40.187', 49172)
2013-09-19 21:06:01-0700 [Broker,16674,10.12.40.187] new slave is from IPv4Address(TCP, '10.12.40.187', 50693)
...
2013-09-19 21:06:03-0700 [-] killing new slave on IPv4Address(TCP, '10.12.40.169', 64393)
2013-09-19 21:06:06-0700 [-] killing new slave on IPv4Address(TCP, '10.12.40.102', 50183)
2013-09-19 21:06:06-0700 [-] killing new slave on IPv4Address(TCP, '10.12.40.183', 55473)
2013-09-19 21:06:06-0700 [-] killing new slave on IPv4Address(TCP, '10.12.40.26', 50166)
2013-09-19 21:06:06-0700 [-] killing new slave on IPv4Address(TCP, '10.12.40.187', 50693)
2013-09-19 21:06:06-0700 [Broker,16675,10.12.40.169] duplicate slave w64-ix-slave137; rejecting new slave and pinging old
2013-09-19 21:06:06-0700 [Broker,16675,10.12.40.169] old slave was connected from IPv4Address(TCP, '10.12.40.169', 49172)
2013-09-19 21:06:06-0700 [Broker,16675,10.12.40.169] new slave is from IPv4Address(TCP, '10.12.40.169', 64396)
2013-09-19 21:06:09-0700 [-] expiring old connection
...
2013-09-19 21:06:28-0700 [Broker,16619,10.12.40.26] Unhandled error in Deferred:
2013-09-19 21:06:28-0700 [Broker,16619,10.12.40.26] Unhandled Error
Traceback (most recent call last):
Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionDone'>: Connection was closed cleanly.
]
2013-09-19 21:06:28-0700 [Broker,16619,10.12.40.26] Unhandled error in Deferred:
2013-09-19 21:06:28-0700 [Broker,16619,10.12.40.26] Unhandled Error
Traceback (most recent call last):
Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionDone'>: Connection was closed cleanly.
]
2013-09-19 21:06:28-0700 [Broker,16619,10.12.40.26] Unhandled error in Deferred:
2013-09-19 21:06:28-0700 [Broker,16619,10.12.40.26] Unhandled Error
Traceback (most recent call last):
Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionDone'>: Connection was closed cleanly.
]
2013-09-19 21:06:28-0700 [Broker,16619,10.12.40.26] Unhandled error in Deferred:
2013-09-19 21:06:28-0700 [Broker,16619,10.12.40.26] Unhandled Error
Traceback (most recent call last):
Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionDone'>: Connection was closed cleanly.
]
...
2013-09-19 21:06:28-0700 [-] didn't hear back from slave in 30 seconds; killing existing connection
2013-09-19 21:06:28-0700 [-] disconnecting old slave w64-ix-slave72 now
2013-09-19 21:06:28-0700 [-] waiting for slave to finish disconnecting
2013-09-19 21:06:28-0700 [-] didn't hear back from slave in 30 seconds; killing existing connection
2013-09-19 21:06:28-0700 [-] disconnecting old slave w64-ix-slave151 now
2013-09-19 21:06:28-0700 [-] waiting for slave to finish disconnecting
2013-09-19 21:06:28-0700 [-] didn't hear back from slave in 30 seconds; killing existing connection
2013-09-19 21:06:28-0700 [-] disconnecting old slave w64-ix-slave137 now
2013-09-19 21:06:28-0700 [-] waiting for slave to finish disconnecting
2013-09-19 21:06:28-0700 [-] didn't hear back from slave in 30 seconds; killing existing connection
2013-09-19 21:06:28-0700 [-] disconnecting old slave w64-ix-slave155 now
2013-09-19 21:06:28-0700 [-] waiting for slave to finish disconnecting
2013-09-19 21:06:28-0700 [-] killing new slave on IPv4Address(TCP, '10.12.40.26', 50172)
2013-09-19 21:06:28-0700 [-] killing new slave on IPv4Address(TCP, '10.12.40.169', 64399)
###### buildbot-master62 #######
###### w64-ix-slave151 #######
2013-09-19 21:03:31-0700 [-] command finished with signal None, exit code 0, elapsedTime: 3188.283000
2013-09-19 21:03:31-0700 [-] SlaveBuilder.commandComplete <buildslave.commands.shell.SlaveShellCommand instance at 0x02F21DC8>
2013-09-19 21:03:50-0700 [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate
2013-09-19 21:03:50-0700 [Broker,client] Unhandled Error
Traceback (most recent call last):
Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.]
2013-09-19 21:03:50-0700 [Broker,client] SlaveBuilder._ackFailed: sendComplete
2013-09-19 21:03:50-0700 [Broker,client] Unhandled Error
Traceback (most recent call last):
Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
2013-09-19 21:03:50-0700 [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate
2013-09-19 21:03:50-0700 [Broker,client] Unhandled Error
Traceback (most recent call last):
Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
2013-09-19 21:03:50-0700 [Broker,client] lost remote
2013-09-19 21:03:50-0700 [Broker,client] lost remote
...
2013-09-19 21:03:50-0700 [Broker,client] lost remote
2013-09-19 21:03:50-0700 [Broker,client] lost remote
2013-09-19 21:03:51-0700 [Broker,client] Lost connection to buildbot-master62.srv.releng.use1.mozilla.com:9001
2013-09-19 21:03:51-0700 [Broker,client] <twisted.internet.tcp.Connector instance at 0x02B12AA8> will retry in 2 seconds
2013-09-19 21:03:51-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x02C98490>
2013-09-19 21:03:53-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x02C98490>
2013-09-19 21:03:53-0700 [-] Connecting to buildbot-master62.srv.releng.use1.mozilla.com:9001
2013-09-19 21:04:14-0700 [Uninitialized] Connection to buildbot-master62.srv.releng.use1.mozilla.com:9001 failed: [Failure instance: Traceback
(failure with no frames): <class 'twisted.internet.error.TCPTimedOutError'>: TCP connection timed out: 10060: A connection attempt failed becau
se the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.. ]
2013-09-19 21:04:14-0700 [Uninitialized] <twisted.internet.tcp.Connector instance at 0x02B12AA8> will retry in 8 seconds
2013-09-19 21:04:14-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x02C98490>
2013-09-19 21:04:22-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x02C98490>
2013-09-19 21:04:22-0700 [-] Connecting to buildbot-master62.srv.releng.use1.mozilla.com:9001
2013-09-19 21:04:43-0700 [Uninitialized] Connection to buildbot-master62.srv.releng.use1.mozilla.com:9001 failed: [Failure instance: Traceback
(failure with no frames): <class 'twisted.internet.error.TCPTimedOutError'>: TCP connection timed out: 10060: A connection attempt failed becau
se the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
]
2013-09-19 21:04:43-0700 [Uninitialized] <twisted.internet.tcp.Connector instance at 0x02B12AA8> will retry in 25 seconds
2013-09-19 21:04:43-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x02C98490>
2013-09-19 21:05:09-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x02C98490>
2013-09-19 21:05:09-0700 [-] Connecting to buildbot-master62.srv.releng.use1.mozilla.com:9001
2013-09-19 21:05:21-0700 [Broker,client] Connected to buildbot-master62.srv.releng.use1.mozilla.com:9001; slave is ready
2013-09-19 21:05:26-0700 [Broker,client] Lost connection to buildbot-master62.srv.releng.use1.mozilla.com:9001
2013-09-19 21:05:26-0700 [Broker,client] <twisted.internet.tcp.Connector instance at 0x02B12AA8> will retry in 2 seconds
2013-09-19 21:05:26-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x02C98490>
2013-09-19 21:05:28-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x02C98490>
2013-09-19 21:05:28-0700 [-] Connecting to buildbot-master62.srv.releng.use1.mozilla.com:9001
2013-09-19 21:05:32-0700 [Broker,client] Connected to buildbot-master62.srv.releng.use1.mozilla.com:9001; slave is ready
2013-09-19 21:05:37-0700 [Broker,client] Lost connection to buildbot-master62.srv.releng.use1.mozilla.com:9001
2013-09-19 21:05:37-0700 [Broker,client] <twisted.internet.tcp.Connector instance at 0x02B12AA8> will retry in 2 seconds
2013-09-19 21:05:37-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x02C98490>
2013-09-19 21:05:40-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x02C98490>
2013-09-19 21:05:40-0700 [-] Connecting to buildbot-master62.srv.releng.use1.mozilla.com:9001
2013-09-19 21:05:44-0700 [Broker,client] Connected to buildbot-master62.srv.releng.use1.mozilla.com:9001; slave is ready
2013-09-19 21:05:49-0700 [Broker,client] Lost connection to buildbot-master62.srv.releng.use1.mozilla.com:9001
2013-09-19 21:05:49-0700 [Broker,client] <twisted.internet.tcp.Connector instance at 0x02B12AA8> will retry in 3 seconds
2013-09-19 21:05:49-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x02C98490>
2013-09-19 21:05:53-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x02C98490>
2013-09-19 21:05:53-0700 [-] Connecting to buildbot-master62.srv.releng.use1.mozilla.com:9001
2013-09-19 21:07:28-0700 [Broker,client] message from master: attached
Group: mozilla-corporation-confidential
Comment 4•11 years ago
(In reply to Ben Hearsum [:bhearsum] from comment #2)
> Casey, you pulled up some graphs the last time this happened (smokeping ones
> IIRC). Could you do that again for both use1 and usw2? Let's say from 6pm PT
> yesterday to this morning if you need a time range.
Sorry, meant to set needinfo...
Flags: needinfo?(cransom)
Comment 5•11 years ago
09/20 17:41:41 fw1.releng.console.scl3.mozilla.net %-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 169.254.255.73 (External AS 7224) changed state from OpenConfirm to Established (event RecvKeepAlive)
09/20 17:41:13 fw1.releng.console.scl3.mozilla.net %-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 169.254.255.73 (External AS 7224) changed state from Established to Idle (event RecvNotify)
09/20 04:04:57 fw1.releng.console.scl3.mozilla.net %-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 169.254.255.73 (External AS 7224) changed state from OpenConfirm to Established (event RecvKeepAlive)
09/20 04:04:52 fw1.releng.console.scl3.mozilla.net %-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 169.254.255.77 (External AS 7224) changed state from OpenConfirm to Established (event RecvKeepAlive)
09/20 04:04:20 fw1.releng.console.scl3.mozilla.net %-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 169.254.255.73 (External AS 7224) changed state from Established to Idle (event HoldTime)
09/20 04:04:19 fw1.releng.console.scl3.mozilla.net %-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 169.254.255.77 (External AS 7224) changed state from Established to Idle (event HoldTime)
09/19 21:53:25 fw1.releng.console.scl3.mozilla.net %-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 169.254.249.29 (External AS 7224) changed state from OpenConfirm to Established (event RecvKeepAlive)
09/19 21:53:23 fw1.releng.console.scl3.mozilla.net %-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 169.254.249.29 (External AS 7224) changed state from Established to Idle (event RecvNotify)
These timestamps are UTC. After shifting them by 7 hours, the break you are seeing was an Amazon VPN failure. Both VPNs hit an idle timeout at 04:04:19 (3 minutes without any activity, after which BGP considers the sessions dead), and at 04:04:52 they were re-established. There is absolutely nothing netops can do about these, because the failure isn't on our side and we are already wired for all the redundancy Amazon can supply.
Updated•11 years ago
Flags: needinfo?(cransom)
Comment 6•11 years ago
The old-slave/new-slave stuff generally indicates that the slave and master have different ideas of current connections: the master thinks the "old" connection is still live, while the slave thinks that connection is closed and is automatically initiating a new one. In this situation, the master tests the old connection before admitting the new.
In the past, we've seen this caused by stateful firewalls losing connection state. The slave sends some TCP packets, and the firewall, having no memory of the connection they are a part of, drops them. The slave retransmits a few times, and finally closes the connection in error. During this time, the master has not seen any packets and continues waiting patiently to hear back from the slave -- there is no TCP timer to run out on the master[1]. The slave automatically reconnects with the normal TCP connection-setup traffic, which the firewall allows, and we see the symptoms described in the previous paragraph. As you've suggested, it may be that Windows slaves time out their connections more quickly than Linux or Darwin slaves[2]. I'm happy to explain this further via Vidyo if you'd like.
It's also possible for extended packet loss to cause this same behavior, if the slave gives up on retransmissions during the outage. This appears to be the more likely scenario, given Casey's information.
A BGP session loss like this one is based on communication failures over the public Internet, in this case between scl3 and use1. The Internet is a wild and crazy place, with routes shifting all the time, and is neither within Mozilla's nor Amazon's network.
I suspect that the next step for pursuing this is to talk to Amazon and see if they can provide any more detail (e.g., logs) on what they are seeing, or any other advice or diagnostics they can provide.
Using usw2 instead might help, too - I haven't heard about issues with masters there, and inventory shows quite a few.
[1] Aside from TCP keepalives, which trigger at 2 hours; shortening those would not fix this problem, but only make the timeouts occur more quickly.
[2] http://stackoverflow.com/questions/5227520/how-many-times-will-tcp-retransmit
(Linux: about 15m)
http://support.microsoft.com/kb/170359
(Windows: about 5m)
Note that these timeouts depend on the measured RTT, and can vary significantly.
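To illustrate footnote [1]: on Linux, keepalive behavior is tunable per socket, but the kernel default fires the first probe only after about 2 hours of idle time. A generic sketch of how those probes are configured — the parameter values here are arbitrary illustrations, not Buildbot's configuration:

```python
import socket

def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Send the first keepalive probe after `idle` seconds of silence,
    then every `interval` seconds, declaring the peer dead after `count`
    unanswered probes (roughly idle + interval * count seconds total)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tunables below are Linux-specific; without them, the
    # system default (~2 hours before the first probe) applies.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock

s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
s.close()
```

As the footnote says, shortening these would only surface the disconnects sooner, not prevent them.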
Comment 7•11 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #6)
Thanks for all your insight into this, it's much appreciated
For some reason this reminded me that we also used to have problems with Buildbot masters hanging for long periods of time (in the past this was often due to /waterfall requests or very large logs). I wonder if the masters are getting close to that state again, unable to respond to keepalives or other updates from slaves in a timely manner?
I know that we've correlated some of these failures to network issues, but I wonder if there's more than one cause here.
Comment 8•11 years ago
We should be able to monitor things like that with collectd - either using the existing OS-level metrics, or by feeding some metrics from Buildbot itself into collectd.
Comment 9•11 years ago
Moving this bug to Platform Support as it is a project to improve the master<->slaves connections.
Component: Buildduty → Platform Support
QA Contact: armenzg → coop
Comment 10•11 years ago
We had some more disconnects just now, and RyanVM confirms the time corresponds to a BGP flap:
09/27 20:16:00 %-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 169.254.249.25 (External AS 7224) changed state from Established to Idle (event HoldTime)
09/27 20:16:32 %-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 169.254.249.25 (External AS 7224) changed state from OpenConfirm to Established (event RecvKeepAlive)
Comment 11•11 years ago
So that was a failure of one of the two links, which would have triggered a failover. The time scales are smaller than TCP timeouts, so the more likely explanation here is that a stateful router somewhere is losing session state when the failover occurs. On the Juniper firewall on our end, this is not the case - the TCP tuples before and after the failover are the same, with just the next-hop (not part of the session state) changing. However, it may be a different story on the Amazon end. Catlee's going to get in touch with Amazon to see about that.
Comment 12•11 years ago
Note: this is spiking a lot across all trees at the moment (in case we need to debug / grab logs).
Comment 15•11 years ago
Amazon's response:
"""
In the case of BGP connection failover, existing TCP connections are unfortunately not going to be transparently handled using the secondary tunnel. The only work-around I can think of would be to support appropriate retry procedures in the application itself or use asynchronous communication methods. Please accept my apologies, we are doing all that's possible to provide highly available and stable VPN connections but if the actual communication happens over Internet, intermittent connectivity issues cannot be entirely avoided despite of the protocol being used.
"""
So yeah, the redundant connections don't act like we thought they did. This argues pretty strongly for having slaves connect to local masters as much as possible for now.
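Amazon's suggested work-around — retrying in the application — is essentially what the slave already does: the w64-ix-slave151 log above shows reconnect delays of 2, then 8, then 25 seconds. A generic sketch of such a capped exponential-backoff schedule; the initial delay, multiplier, and cap are illustrative assumptions, not Buildbot's actual reconnect parameters:

```python
import itertools

def backoff_delays(initial=2.0, factor=3.0, maximum=300.0):
    """Yield reconnect delays that grow geometrically up to a cap, so a
    flapping link isn't hammered with connection attempts."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, maximum)

# First five delays of the schedule above.
print(list(itertools.islice(backoff_delays(), 5)))
# -> [2.0, 6.0, 18.0, 54.0, 162.0]
```

The cap matters: without it, a long outage would push the retry interval so high that recovery after the VPN re-establishes could take many minutes.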
Comment 17•11 years ago
Does this bug need to be MoCo-only? I don't see anything that requires it, so I would like to open it up.
Updated•11 years ago
Group: mozilla-corporation-confidential
Comment 18•11 years ago
Removing bug 867593 as a blocker, since the only master left to move is a dev instance.
Arzhel is working on the fix over in bug 940356.
Updated•7 years ago
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard