Bug 817597 (Closed) - Opened 12 years ago, Closed 12 years ago

Please remove TCP timeouts (if that is the issue)

Categories

(Infrastructure & Operations Graveyard :: NetOps, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: cransom)

References

Details

(Whiteboard: [reit-b2g])

I have lots of buildslaves on the foopies for the b2g project that are supposed to be connected to my buildbot master, but I noticed this morning that they were not able to take any jobs.

foopy40.p1.releng.scl1.mozilla.com <--> dev-master01.build.scl1.mozilla.com

Can you please check whether the TCP timeouts have been disabled for this connection?

Once we move this to production, we will also need the TCP timeouts disabled when reaching the buildbot masters:
buildbot-master##.build.scl1.mozilla.com:8201

I'm not sure whether that last paragraph is relevant, but I'm adding it just in case.

Thanks,
Armen

I have looked at the logs and see this:
2012-12-01 20:36:25-0800 [Broker,client] <twisted.internet.tcp.Connector instance at 0xdbca28> will retry in 2 seconds
2012-12-01 20:36:25-0800 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x1271e60>
2012-12-01 20:36:28-0800 [-] Starting factory <buildslave.bot.BotFactory instance at 0x1271e60>
2012-12-01 20:36:28-0800 [-] Connecting to dev-master01.build.scl1.mozilla.com:9042
2012-12-01 20:36:28-0800 [Broker,client] Connected to dev-master01.build.scl1.mozilla.com:9042; slave is ready
2012-12-01 20:36:33-0800 [Broker,client] Lost connection to dev-master01.build.scl1.mozilla.com:9042
2012-12-01 20:36:33-0800 [Broker,client] <twisted.internet.tcp.Connector instance at 0xdbca28> will retry in 2 seconds
2012-12-01 20:36:33-0800 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x1271e60>
2012-12-01 20:36:35-0800 [-] Starting factory <buildslave.bot.BotFactory instance at 0x1271e60>
2012-12-01 20:36:35-0800 [-] Connecting to dev-master01.build.scl1.mozilla.com:9042
2012-12-01 20:36:35-0800 [Broker,client] Connected to dev-master01.build.scl1.mozilla.com:9042; slave is ready
2012-12-01 20:36:40-0800 [Broker,client] Lost connection to dev-master01.build.scl1.mozilla.com:9042
2012-12-01 20:36:40-0800 [Broker,client] <twisted.internet.tcp.Connector instance at 0xdbca28> will retry in 2 seconds
2012-12-01 20:36:40-0800 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x1271e60>
2012-12-01 20:36:43-0800 [-] Starting factory <buildslave.bot.BotFactory instance at 0x1271e60>
2012-12-01 20:36:43-0800 [-] Connecting to dev-master01.build.scl1.mozilla.com:9042
2012-12-01 20:36:43-0800 [Broker,client] Connected to dev-master01.build.scl1.mozilla.com:9042; slave is ready
2012-12-01 20:36:48-0800 [Broker,client] Lost connection to dev-master01.build.scl1.mozilla.com:9042
2012-12-01 20:36:48-0800 [Broker,client] <twisted.internet.tcp.Connector instance at 0xdbca28> will retry in 2 seconds
2012-12-01 20:36:48-0800 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x1271e60>
2012-12-01 20:36:50-0800 [-] Starting factory <buildslave.bot.BotFactory instance at 0x1271e60>
2012-12-01 20:36:50-0800 [-] Connecting to dev-master01.build.scl1.mozilla.com:9042

and from the master side I can see these:
2012-12-03 04:26:59-0800 [Broker,160313,10.12.128.18] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
        ]

2012-12-03 04:26:59-0800 [Broker,160313,10.12.128.18] Unhandled error in Deferred:
2012-12-03 04:26:59-0800 [Broker,160313,10.12.128.18] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
        ]

2012-12-03 04:27:00-0800 [-] killing new slave on IPv4Address(TCP, '10.12.128.18', 51891)
2012-12-03 04:27:00-0800 [-] killing new slave on IPv4Address(TCP, '10.12.128.18', 51892)
2012-12-03 04:27:00-0800 [-] killing new slave on IPv4Address(TCP, '10.12.128.18', 51893)
2012-12-03 04:27:01-0800 [-] killing new slave on IPv4Address(TCP, '10.12.128.18', 51894)
2012-12-03 04:27:01-0800 [Broker,167225,10.12.128.18] duplicate slave panda-0107; rejecting new slave and pinging old
2012-12-03 04:27:01-0800 [Broker,167225,10.12.128.18] old slave was connected from IPv4Address(TCP, '10.12.128.18', 50153)
2012-12-03 04:27:01-0800 [Broker,167225,10.12.128.18] new slave is from IPv4Address(TCP, '10.12.128.18', 51897)
2012-12-03 04:27:01-0800 [Broker,167226,10.12.128.18] Got slaveinfo from 'panda-0104'
2012-12-03 04:27:01-0800 [Broker,167226,10.12.128.18] bot attached
2012-12-03 04:27:01-0800 [-] Sorted 38 builders in 0.02s
2012-12-03 04:27:01-0800 [Broker,167227,10.12.128.18] Got slaveinfo from 'panda-0098'
2012-12-03 04:27:01-0800 [Broker,167227,10.12.128.18] bot attached
2012-12-03 04:27:01-0800 [-] Sorted 38 builders in 0.02s
2012-12-03 04:27:01-0800 [Broker,167228,10.12.128.18] Got slaveinfo from 'panda-0097'
2012-12-03 04:27:01-0800 [Broker,167228,10.12.128.18] bot attached
2012-12-03 04:27:01-0800 [-] Sorted 38 builders in 0.02s
2012-12-03 04:27:02-0800 [-] killing new slave on IPv4Address(TCP, '10.12.128.18', 51895)
2012-12-03 04:27:02-0800 [Broker,167229,10.12.128.18] duplicate slave panda-0099; rejecting new slave and pinging old
2012-12-03 04:27:02-0800 [Broker,167229,10.12.128.18] old slave was connected from IPv4Address(TCP, '10.12.128.18', 50146)
2012-12-03 04:27:02-0800 [Broker,167229,10.12.128.18] new slave is from IPv4Address(TCP, '10.12.128.18', 51901)
2012-12-03 04:27:02-0800 [Broker,167230,10.12.128.18] duplicate slave panda-0095; rejecting new slave and pinging old
2012-12-03 04:27:02-0800 [Broker,167230,10.12.128.18] old slave was connected from IPv4Address(TCP, '10.12.128.18', 50150)
2012-12-03 04:27:02-0800 [Broker,167230,10.12.128.18] new slave is from IPv4Address(TCP, '10.12.128.18', 51902)
2012-12-03 04:27:02-0800 [Broker,167231,10.12.128.18] duplicate slave panda-0103; rejecting new slave and pinging old
2012-12-03 04:27:02-0800 [Broker,167231,10.12.128.18] old slave was connected from IPv4Address(TCP, '10.12.128.18', 50148)
2012-12-03 04:27:02-0800 [Broker,167231,10.12.128.18] new slave is from IPv4Address(TCP, '10.12.128.18', 51903)
2012-12-03 04:27:02-0800 [Broker,160314,10.12.128.18] BuildSlave.detached(panda-0105)
2012-12-03 04:27:02-0800 [Broker,160314,10.12.128.18] Unhandled error in Deferred:
2012-12-03 04:27:02-0800 [Broker,160314,10.12.128.18] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
        ]

2012-12-03 04:27:02-0800 [Broker,160314,10.12.128.18] Unhandled error in Deferred:
2012-12-03 04:27:02-0800 [Broker,160314,10.12.128.18] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
        ]
Blocks: 816237
Priority: -- → P1
To clarify, this is the ages-old problem of Buildbot using long-lived TCP connections that are idle for significant lengths of time.  I *suspect* that the new rules for the mobile pods don't account for these - it's easy to forget when adding new rules.

If that's not what's happening, then we'll need to dig more deeply.
This is likely.  We worked on a bug a long time ago that was supposed to mitigate these by fixing at least one half of the TCP keepalives; did that ever get rolled out to production?
Whiteboard: [reit-b2g]
(In reply to casey ransom [:casey] from comment #2)
> This is likely.  We worked on a bug a long time ago that was supposed to
> mitigate these by fixing at least one half of the tcp keepalives, did that
> ever get rolled to production?

If you mean bug 781860, no: that is still open. However, the slaves all do have keepalive set, so the slave->master leg is already protected.
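
For reference, here is roughly what "keepalive set" amounts to at the socket level; this is a minimal Python sketch, not the actual buildslave code, and the kernel-default numbers mentioned are Linux-specific:

import socket

# Minimal illustration only; not the actual buildslave code.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# With only SO_KEEPALIVE enabled, probe timing comes from the kernel
# defaults (net.ipv4.tcp_keepalive_time = 7200s on Linux), so an idle
# connection goes unprobed for 2 hours, which is longer than typical
# firewall idle timeouts.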

What's the next step for netops here? Can we get a rough ETA, please?
The buildbot application is defined as:

term 91 protocol tcp destination-port 9101-9106 inactivity-timeout 86400;
term 92 protocol tcp destination-port 9201-9206 inactivity-timeout 86400;
term 93 protocol tcp destination-port 9301-9306 inactivity-timeout 86400;
term 90 protocol tcp destination-port 9001-9006 inactivity-timeout 86400;

This is for the flow from the pods to the trust zone, which includes VLANs 40, 48, and 75.

Can someone define the flow that is not working or confirm the above is what is being used?
From bug 805821, this is the flow in question:

* -> dev-master01.build.scl1.mozilla.com:tcp/{1024..65535} 
  (development master, more arbitrary ports)

which doesn't resemble any of the terms in comment 4.
Having all ephemeral ports on a long timeout is potentially problematic: thousands of hosts hitting the session table run the risk of clobbering the firewall.  Unfortunately, I wasn't aware this was a requirement, so I couldn't raise a red flag at the time.

The default TCP inactivity timeout is 30 minutes.  Can you quantify how long you really need?  24 hours is quite a long time, but we can perhaps increase the timeout and watch the table.  The issue is that if we start filling up the table, the only option to prevent an outage would be to clear sessions from it, which could itself cause outages.

Anyhow, fixing the application is the (hopefully) obvious solution, which we have, so what is blocking us from applying it?  Or am I missing something that isn't noted in a bug?
This isn't different from any other configuration within the releng network -- they're any:any everywhere else, but they're still sessions.  So there's not actually anything new here.  We could limit the range to {1024..32768} or even {1024..9999}, but that won't change the number of sessions.

Fixing the application is certainly the right solution here, though.  Given that dev-master01 is a dev/pp/staging master, it shouldn't be problematic to apply the patch there, and if successful, on the production masters.
My concern is the addition of thousands of new hosts whose sessions are squatting in the table.  The timeout is set to 24 hours, which was a value picked in the absence of knowing how long the application really needs.  If we can tighten that value a bit, I'd be more comfortable increasing the limit to that value.
24 hours reduces the difference enough that what falls through can be worked around; slaves are seldom idle for >24h.  There's no hard maximum, though.
Who is managing the implementation of the patch?   Hal?  What can I do to help get this done?
Assignee: network-operations → cransom
I added a specific policy to extend the TCP inactivity timeout for all connections from the pods to dev-master01, while :hwine and :dustin work out getting keepalive settings/patches applied to the appropriate devices.  New connections will inherit this setting.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Armen, if you want to be doubly sure, you can add a sysctl similar to the one in 781860#c1 on the foopies, using PuppetAgain.  The buildslave code is already setting SO_KEEPALIVE, but that keepalive duration defaults to 2h, which is longer than the default firewall timeout.
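
The same effect can also be had per-socket rather than system-wide, if patching the slave turns out to be easier than touching the sysctl; a minimal sketch, with Linux-only constants and illustrative values (these are not the numbers from 781860#c1):

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# Illustrative values: start probing after 20 idle minutes (below the
# 30-minute firewall default), re-probe every 75 seconds, and give up
# after 9 failed probes.
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 1200)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 75)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 9)

Either way, the point is just to put some traffic on the connection more often than the firewall's inactivity timeout.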
Blocks: 817838
Thanks a lot!
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard