Closed
Bug 817597
Opened 12 years ago
Closed 12 years ago
Please remove TCP timeouts (if that is the issue)
Categories
(Infrastructure & Operations Graveyard :: NetOps, task, P1)
Infrastructure & Operations Graveyard
NetOps
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: armenzg, Assigned: cransom)
References
Details
(Whiteboard: [reit-b2g])
I had lots of buildslaves on the foopies for the b2g project that are supposed to be connected to my buildbot master, but I have noticed that they were unable to take any jobs this morning.

foopy40.p1.releng.scl1.mozilla.com <--> dev-master01.build.scl1.mozilla.com

Can you please have a look at whether the TCP timeouts have been disabled?

Once we move this to production we will also need the TCP timeouts disabled when trying to reach the buildbot masters:
buildbot-master##.build.scl1.mozilla.com:8201

Not sure if that last paragraph is relevant, but just in case.

Thanks,
Armen

I have looked at the logs and I have seen this:

2012-12-01 20:36:25-0800 [Broker,client] <twisted.internet.tcp.Connector instance at 0xdbca28> will retry in 2 seconds
2012-12-01 20:36:25-0800 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x1271e60>
2012-12-01 20:36:28-0800 [-] Starting factory <buildslave.bot.BotFactory instance at 0x1271e60>
2012-12-01 20:36:28-0800 [-] Connecting to dev-master01.build.scl1.mozilla.com:9042
2012-12-01 20:36:28-0800 [Broker,client] Connected to dev-master01.build.scl1.mozilla.com:9042; slave is ready
2012-12-01 20:36:33-0800 [Broker,client] Lost connection to dev-master01.build.scl1.mozilla.com:9042
2012-12-01 20:36:33-0800 [Broker,client] <twisted.internet.tcp.Connector instance at 0xdbca28> will retry in 2 seconds
2012-12-01 20:36:33-0800 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x1271e60>
2012-12-01 20:36:35-0800 [-] Starting factory <buildslave.bot.BotFactory instance at 0x1271e60>
2012-12-01 20:36:35-0800 [-] Connecting to dev-master01.build.scl1.mozilla.com:9042
2012-12-01 20:36:35-0800 [Broker,client] Connected to dev-master01.build.scl1.mozilla.com:9042; slave is ready
2012-12-01 20:36:40-0800 [Broker,client] Lost connection to dev-master01.build.scl1.mozilla.com:9042
2012-12-01 20:36:40-0800 [Broker,client] <twisted.internet.tcp.Connector instance at 0xdbca28> will retry in 2 seconds
2012-12-01 20:36:40-0800 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x1271e60>
2012-12-01 20:36:43-0800 [-] Starting factory <buildslave.bot.BotFactory instance at 0x1271e60>
2012-12-01 20:36:43-0800 [-] Connecting to dev-master01.build.scl1.mozilla.com:9042
2012-12-01 20:36:43-0800 [Broker,client] Connected to dev-master01.build.scl1.mozilla.com:9042; slave is ready
2012-12-01 20:36:48-0800 [Broker,client] Lost connection to dev-master01.build.scl1.mozilla.com:9042
2012-12-01 20:36:48-0800 [Broker,client] <twisted.internet.tcp.Connector instance at 0xdbca28> will retry in 2 seconds
2012-12-01 20:36:48-0800 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x1271e60>
2012-12-01 20:36:50-0800 [-] Starting factory <buildslave.bot.BotFactory instance at 0x1271e60>
2012-12-01 20:36:50-0800 [-] Connecting to dev-master01.build.scl1.mozilla.com:9042

and from the master side I can see these:

2012-12-03 04:26:59-0800 [Broker,160313,10.12.128.18] Unhandled Error
    Traceback (most recent call last):
    Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
    ]
2012-12-03 04:26:59-0800 [Broker,160313,10.12.128.18] Unhandled error in Deferred:
2012-12-03 04:26:59-0800 [Broker,160313,10.12.128.18] Unhandled Error
    Traceback (most recent call last):
    Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
    ]
2012-12-03 04:27:00-0800 [-] killing new slave on IPv4Address(TCP, '10.12.128.18', 51891)
2012-12-03 04:27:00-0800 [-] killing new slave on IPv4Address(TCP, '10.12.128.18', 51892)
2012-12-03 04:27:00-0800 [-] killing new slave on IPv4Address(TCP, '10.12.128.18', 51893)
2012-12-03 04:27:01-0800 [-] killing new slave on IPv4Address(TCP, '10.12.128.18', 51894)
2012-12-03 04:27:01-0800 [Broker,167225,10.12.128.18] duplicate slave panda-0107; rejecting new slave and pinging old
2012-12-03 04:27:01-0800 [Broker,167225,10.12.128.18] old slave was connected from IPv4Address(TCP, '10.12.128.18', 50153)
2012-12-03 04:27:01-0800 [Broker,167225,10.12.128.18] new slave is from IPv4Address(TCP, '10.12.128.18', 51897)
2012-12-03 04:27:01-0800 [Broker,167226,10.12.128.18] Got slaveinfo from 'panda-0104'
2012-12-03 04:27:01-0800 [Broker,167226,10.12.128.18] bot attached
2012-12-03 04:27:01-0800 [-] Sorted 38 builders in 0.02s
2012-12-03 04:27:01-0800 [Broker,167227,10.12.128.18] Got slaveinfo from 'panda-0098'
2012-12-03 04:27:01-0800 [Broker,167227,10.12.128.18] bot attached
2012-12-03 04:27:01-0800 [-] Sorted 38 builders in 0.02s
2012-12-03 04:27:01-0800 [Broker,167228,10.12.128.18] Got slaveinfo from 'panda-0097'
2012-12-03 04:27:01-0800 [Broker,167228,10.12.128.18] bot attached
2012-12-03 04:27:01-0800 [-] Sorted 38 builders in 0.02s
2012-12-03 04:27:02-0800 [-] killing new slave on IPv4Address(TCP, '10.12.128.18', 51895)
2012-12-03 04:27:02-0800 [Broker,167229,10.12.128.18] duplicate slave panda-0099; rejecting new slave and pinging old
2012-12-03 04:27:02-0800 [Broker,167229,10.12.128.18] old slave was connected from IPv4Address(TCP, '10.12.128.18', 50146)
2012-12-03 04:27:02-0800 [Broker,167229,10.12.128.18] new slave is from IPv4Address(TCP, '10.12.128.18', 51901)
2012-12-03 04:27:02-0800 [Broker,167230,10.12.128.18] duplicate slave panda-0095; rejecting new slave and pinging old
2012-12-03 04:27:02-0800 [Broker,167230,10.12.128.18] old slave was connected from IPv4Address(TCP, '10.12.128.18', 50150)
2012-12-03 04:27:02-0800 [Broker,167230,10.12.128.18] new slave is from IPv4Address(TCP, '10.12.128.18', 51902)
2012-12-03 04:27:02-0800 [Broker,167231,10.12.128.18] duplicate slave panda-0103; rejecting new slave and pinging old
2012-12-03 04:27:02-0800 [Broker,167231,10.12.128.18] old slave was connected from IPv4Address(TCP, '10.12.128.18', 50148)
2012-12-03 04:27:02-0800 [Broker,167231,10.12.128.18] new slave is from IPv4Address(TCP, '10.12.128.18', 51903)
2012-12-03 04:27:02-0800 [Broker,160314,10.12.128.18] BuildSlave.detached(panda-0105)
2012-12-03 04:27:02-0800 [Broker,160314,10.12.128.18] Unhandled error in Deferred:
2012-12-03 04:27:02-0800 [Broker,160314,10.12.128.18] Unhandled Error
    Traceback (most recent call last):
    Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
    ]
2012-12-03 04:27:02-0800 [Broker,160314,10.12.128.18] Unhandled error in Deferred:
2012-12-03 04:27:02-0800 [Broker,160314,10.12.128.18] Unhandled Error
    Traceback (most recent call last):
    Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
    ]
Updated•12 years ago
Priority: -- → P1
Comment 1•12 years ago
To clarify, this is the ages-old problem of Buildbot using long-lived TCP connections that are idle for significant lengths of time. I *suspect* that the new rules for the mobile pods don't account for these - it's easy to forget when adding new rules. If that's not what's happening, then we'll need to dig more deeply.
Assignee
Comment 2•12 years ago
This is likely. We worked on a bug a long time ago that was supposed to mitigate these by fixing at least one half of the TCP keepalives; did that ever get rolled out to production?
Updated•12 years ago
Whiteboard: [reit-b2g]
Comment 3•12 years ago
(In reply to casey ransom [:casey] from comment #2)
> This is likely. We worked on a bug a long time ago that was supposed to
> mitigate these by fixing at least one half of the tcp keepalives, did that
> ever get rolled to production?

If you mean bug 781860, no - that is still open. However, the slaves all do have keepalive set, so the slave->master leg is already protected.

What's the next step for netops here? Can we get a rough ETA please?
Comment 4•12 years ago
The buildbot application is defined as:

term 91 protocol tcp destination-port 9101-9106 inactivity-timeout 86400;
term 92 protocol tcp destination-port 9201-9206 inactivity-timeout 86400;
term 93 protocol tcp destination-port 9301-9306 inactivity-timeout 86400;
term 90 protocol tcp destination-port 9001-9006 inactivity-timeout 86400;

This is for the flow from pods to the trust zone, which includes vlans 40, 48, and 75. Can someone define the flow that is not working, or confirm the above is what is being used?
Comment 5•12 years ago
From bug 805821, this is the flow in question:

* -> dev-master01.build.scl1.mozilla.com:tcp/{1024..65535} (development master, more arbitrary ports)

which doesn't resemble any of the terms in comment 4.
Comment 6•12 years ago
Having all ephemeral ports with a long timeout is potentially problematic. Thousands of hosts hitting the session table will run the risk of clobbering the FW. Unfortunately I wasn't aware this was a requirement, so I couldn't raise a red flag at the time. The default TCP inactivity timeout is 30 minutes. Can you quantify how long you will really need? 24 hours is quite a long time, but we can perhaps increase it and watch the table. The issue is that if we start filling up the table, the only option to prevent an outage would be to clear sessions from the stable, which in and of itself could cause outages.
Comment 7•12 years ago
s/stable/table

Anyhow, fixing the application is the (hopefully) obvious solution, which we have, so what is blocking us from applying it? Or am I missing something that isn't noted in a bug?
Comment 8•12 years ago
This isn't different from any other configuration within the releng network -- they're any:any everywhere else, but they're still sessions. So there's not actually anything new here. We could limit the range to {1024..32768} or even {1024..9999}, but that won't change the number of sessions. Fixing the application is certainly the right solution here, though. Given that dev-master01 is a dev/pp/staging master, it shouldn't be problematic to apply the patch there, and if successful, on the production masters.
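For reference, the application-side fix being discussed (TCP keepalives that fire well inside the firewall's 30-minute inactivity timeout) can be sketched in plain Python sockets. This is a minimal illustration, not the actual buildslave patch; the timer values are placeholders chosen only to stay under 30 minutes:

```python
import socket

def keepalive_socket(idle=300, interval=60, count=5):
    """Create a TCP socket that sends keepalive probes after `idle` seconds
    of inactivity, so the firewall sees traffic before its timeout expires."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific per-socket overrides of the kernel's 2-hour defaults.
    if hasattr(socket, "TCP_KEEPIDLE"):
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return s

sock = keepalive_socket()
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
sock.close()
```

Twisted exposes the same SO_KEEPALIVE toggle via `ITCPTransport.setTcpKeepAlive`, but the idle-time override above is what actually gets the probe interval under the firewall timeout on Linux.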
Comment 9•12 years ago
My concern is the addition of thousands of new hosts whose sessions are squatting in the table. The timeout is set to 24 hours which was a value picked in absence of knowing how long the application really needed. If we can tighten that value a bit I'd be more comfortable increasing the limit to that value.
Comment 10•12 years ago
24 hours reduces the difference enough that what falls through can be worked around - slaves are seldom idle for >24h. There's not a maximum.
Comment 11•12 years ago
Who is managing the implementation of the patch? Hal? What can I do to help get this done?
Assignee
Updated•12 years ago
Assignee: network-operations → cransom
Assignee
Comment 12•12 years ago
I added a specific policy to extend the TCP inactivity timeout for all connections from the pods to dev-master01 while :hwine and :dustin work out getting keepalive settings/patches applied to the appropriate devices. New connections will inherit this setting.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment 13•12 years ago
Armen, if you want to be doubly sure, you can add a sysctl similar to that in bug 781860 comment 1 on the foopies, using PuppetAgain. The buildslave code is already setting SO_KEEPALIVE, but the default keepalive duration is 2h, which is longer than the default firewall timeout.
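The 2h figure is the stock Linux default for net.ipv4.tcp_keepalive_time (7200 seconds), which the sysctl change would lower. A quick way to check what a host is actually using, sketched here for illustration (Linux-only; this is a read-only check, not the Puppet change itself):

```python
def read_keepalive_sysctl(name):
    # Linux exposes the TCP keepalive tunables under /proc/sys/net/ipv4/.
    with open("/proc/sys/net/ipv4/" + name) as f:
        return int(f.read().strip())

for name in ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes"):
    print(name, "=", read_keepalive_sysctl(name))
```

If tcp_keepalive_time still reads 7200, the per-socket SO_KEEPALIVE on the slaves won't probe until two hours after the firewall has already dropped the session.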
Reporter
Comment 14•12 years ago
Thanks a lot!
Updated•11 years ago
Product: mozilla.org → Infrastructure & Operations
Updated•2 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard