Bug 817597 (Closed) - Opened 12 years ago, Closed 12 years ago

Please remove TCP timeouts (if that is the issue)

Categories

(Infrastructure & Operations Graveyard :: NetOps, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: cransom)

References

Details

(Whiteboard: [reit-b2g])

I have lots of buildslaves on the foopies for the b2g project that are supposed to be connected to my buildbot master, but I noticed this morning that they were not able to take any jobs.

foopy40.p1.releng.scl1.mozilla.com <--> dev-master01.build.scl1.mozilla.com

Can you please check whether the TCP timeouts have been disabled for this connection?

Once we move this to production, we will also need the TCP timeouts disabled when reaching the buildbot masters:
buildbot-master##.build.scl1.mozilla.com:8201

I'm not sure whether that last paragraph is relevant, but I'm adding it just in case.

Thanks,
Armen

I have looked at the logs and see this:
2012-12-01 20:36:25-0800 [Broker,client] <twisted.internet.tcp.Connector instance at 0xdbca28> will retry in 2 seconds
2012-12-01 20:36:25-0800 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x1271e60>
2012-12-01 20:36:28-0800 [-] Starting factory <buildslave.bot.BotFactory instance at 0x1271e60>
2012-12-01 20:36:28-0800 [-] Connecting to dev-master01.build.scl1.mozilla.com:9042
2012-12-01 20:36:28-0800 [Broker,client] Connected to dev-master01.build.scl1.mozilla.com:9042; slave is ready
2012-12-01 20:36:33-0800 [Broker,client] Lost connection to dev-master01.build.scl1.mozilla.com:9042
2012-12-01 20:36:33-0800 [Broker,client] <twisted.internet.tcp.Connector instance at 0xdbca28> will retry in 2 seconds
2012-12-01 20:36:33-0800 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x1271e60>
2012-12-01 20:36:35-0800 [-] Starting factory <buildslave.bot.BotFactory instance at 0x1271e60>
2012-12-01 20:36:35-0800 [-] Connecting to dev-master01.build.scl1.mozilla.com:9042
2012-12-01 20:36:35-0800 [Broker,client] Connected to dev-master01.build.scl1.mozilla.com:9042; slave is ready
2012-12-01 20:36:40-0800 [Broker,client] Lost connection to dev-master01.build.scl1.mozilla.com:9042
2012-12-01 20:36:40-0800 [Broker,client] <twisted.internet.tcp.Connector instance at 0xdbca28> will retry in 2 seconds
2012-12-01 20:36:40-0800 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x1271e60>
2012-12-01 20:36:43-0800 [-] Starting factory <buildslave.bot.BotFactory instance at 0x1271e60>
2012-12-01 20:36:43-0800 [-] Connecting to dev-master01.build.scl1.mozilla.com:9042
2012-12-01 20:36:43-0800 [Broker,client] Connected to dev-master01.build.scl1.mozilla.com:9042; slave is ready
2012-12-01 20:36:48-0800 [Broker,client] Lost connection to dev-master01.build.scl1.mozilla.com:9042
2012-12-01 20:36:48-0800 [Broker,client] <twisted.internet.tcp.Connector instance at 0xdbca28> will retry in 2 seconds
2012-12-01 20:36:48-0800 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x1271e60>
2012-12-01 20:36:50-0800 [-] Starting factory <buildslave.bot.BotFactory instance at 0x1271e60>
2012-12-01 20:36:50-0800 [-] Connecting to dev-master01.build.scl1.mozilla.com:9042

and from the master side I can see these:
2012-12-03 04:26:59-0800 [Broker,160313,10.12.128.18] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
        ]

2012-12-03 04:26:59-0800 [Broker,160313,10.12.128.18] Unhandled error in Deferred:
2012-12-03 04:26:59-0800 [Broker,160313,10.12.128.18] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
        ]

2012-12-03 04:27:00-0800 [-] killing new slave on IPv4Address(TCP, '10.12.128.18', 51891)
2012-12-03 04:27:00-0800 [-] killing new slave on IPv4Address(TCP, '10.12.128.18', 51892)
2012-12-03 04:27:00-0800 [-] killing new slave on IPv4Address(TCP, '10.12.128.18', 51893)
2012-12-03 04:27:01-0800 [-] killing new slave on IPv4Address(TCP, '10.12.128.18', 51894)
2012-12-03 04:27:01-0800 [Broker,167225,10.12.128.18] duplicate slave panda-0107; rejecting new slave and pinging old
2012-12-03 04:27:01-0800 [Broker,167225,10.12.128.18] old slave was connected from IPv4Address(TCP, '10.12.128.18', 50153)
2012-12-03 04:27:01-0800 [Broker,167225,10.12.128.18] new slave is from IPv4Address(TCP, '10.12.128.18', 51897)
2012-12-03 04:27:01-0800 [Broker,167226,10.12.128.18] Got slaveinfo from 'panda-0104'
2012-12-03 04:27:01-0800 [Broker,167226,10.12.128.18] bot attached
2012-12-03 04:27:01-0800 [-] Sorted 38 builders in 0.02s
2012-12-03 04:27:01-0800 [Broker,167227,10.12.128.18] Got slaveinfo from 'panda-0098'
2012-12-03 04:27:01-0800 [Broker,167227,10.12.128.18] bot attached
2012-12-03 04:27:01-0800 [-] Sorted 38 builders in 0.02s
2012-12-03 04:27:01-0800 [Broker,167228,10.12.128.18] Got slaveinfo from 'panda-0097'
2012-12-03 04:27:01-0800 [Broker,167228,10.12.128.18] bot attached
2012-12-03 04:27:01-0800 [-] Sorted 38 builders in 0.02s
2012-12-03 04:27:02-0800 [-] killing new slave on IPv4Address(TCP, '10.12.128.18', 51895)
2012-12-03 04:27:02-0800 [Broker,167229,10.12.128.18] duplicate slave panda-0099; rejecting new slave and pinging old
2012-12-03 04:27:02-0800 [Broker,167229,10.12.128.18] old slave was connected from IPv4Address(TCP, '10.12.128.18', 50146)
2012-12-03 04:27:02-0800 [Broker,167229,10.12.128.18] new slave is from IPv4Address(TCP, '10.12.128.18', 51901)
2012-12-03 04:27:02-0800 [Broker,167230,10.12.128.18] duplicate slave panda-0095; rejecting new slave and pinging old
2012-12-03 04:27:02-0800 [Broker,167230,10.12.128.18] old slave was connected from IPv4Address(TCP, '10.12.128.18', 50150)
2012-12-03 04:27:02-0800 [Broker,167230,10.12.128.18] new slave is from IPv4Address(TCP, '10.12.128.18', 51902)
2012-12-03 04:27:02-0800 [Broker,167231,10.12.128.18] duplicate slave panda-0103; rejecting new slave and pinging old
2012-12-03 04:27:02-0800 [Broker,167231,10.12.128.18] old slave was connected from IPv4Address(TCP, '10.12.128.18', 50148)
2012-12-03 04:27:02-0800 [Broker,167231,10.12.128.18] new slave is from IPv4Address(TCP, '10.12.128.18', 51903)
2012-12-03 04:27:02-0800 [Broker,160314,10.12.128.18] BuildSlave.detached(panda-0105)
2012-12-03 04:27:02-0800 [Broker,160314,10.12.128.18] Unhandled error in Deferred:
2012-12-03 04:27:02-0800 [Broker,160314,10.12.128.18] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
        ]

2012-12-03 04:27:02-0800 [Broker,160314,10.12.128.18] Unhandled error in Deferred:
2012-12-03 04:27:02-0800 [Broker,160314,10.12.128.18] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
        ]
Blocks: 816237
Priority: -- → P1
To clarify, this is the ages-old problem of Buildbot using long-lived TCP connections that are idle for significant lengths of time.  I *suspect* that the new rules for the mobile pods don't account for these - it's easy to forget when adding new rules.

If that's not what's happening, then we'll need to dig more deeply.
This is likely.  We worked on a bug a long time ago that was supposed to mitigate these by fixing at least one half of the TCP keepalives; did that ever get rolled out to production?
Whiteboard: [reit-b2g]
(In reply to casey ransom [:casey] from comment #2)
> This is likely.  We worked on a bug a long time ago that was supposed to
> mitigate these by fixing at least one half of the tcp keepalives, did that
> ever get rolled to production?

If you mean bug 781860, no: that is still open. However, the slaves all do have keepalive set, so the slave->master leg is already protected.
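
For reference, here is roughly what "keepalive set" amounts to at the socket level; this is a minimal Python sketch, not the actual buildslave code, and the kernel-default numbers mentioned are Linux-specific:

import socket

# Minimal illustration only; not the actual buildslave code.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# With only SO_KEEPALIVE enabled, probe timing comes from the kernel
# defaults (net.ipv4.tcp_keepalive_time = 7200s on Linux), so an idle
# connection goes unprobed for 2 hours, which is longer than typical
# firewall idle timeouts.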

What's the next step for netops here? Can we get a rough ETA, please?
The buildbot application is defined as:

term 91 protocol tcp destination-port 9101-9106 inactivity-timeout 86400;
term 92 protocol tcp destination-port 9201-9206 inactivity-timeout 86400;
term 93 protocol tcp destination-port 9301-9306 inactivity-timeout 86400;
term 90 protocol tcp destination-port 9001-9006 inactivity-timeout 86400;

This is for the flow from the pods to the trust zone, which includes VLANs 40, 48, and 75.

Can someone define the flow that is not working or confirm the above is what is being used?
From bug 805821, this is the flow in question:

* -> dev-master01.build.scl1.mozilla.com:tcp/{1024..65535} 
  (development master, more arbitrary ports)

which doesn't resemble any of the terms in comment 4.
Having all ephemeral ports on a long timeout is potentially problematic: thousands of hosts hitting the session table run the risk of clobbering the firewall.  Unfortunately, I wasn't aware this was a requirement, so I couldn't raise a red flag at the time.

The default TCP inactivity timeout is 30 minutes.  Can you quantify how long you really need?  24 hours is quite a long time, but we can perhaps increase the timeout and watch the table.  The issue is that if we start filling up the table, the only option to prevent an outage would be to clear sessions from it, which could itself cause outages.

Anyhow, fixing the application is the (hopefully) obvious solution, which we have, so what is blocking us from applying it?  Or am I missing something that isn't noted in a bug?
This isn't different from any other configuration within the releng network -- they're any:any everywhere else, but they're still sessions.  So there's not actually anything new here.  We could limit the range to {1024..32768} or even {1024..9999}, but that won't change the number of sessions.

Fixing the application is certainly the right solution here, though.  Given that dev-master01 is a dev/pp/staging master, it shouldn't be problematic to apply the patch there, and if successful, on the production masters.
My concern is the addition of thousands of new hosts whose sessions are squatting in the table.  The timeout is set to 24 hours, which was a value picked in the absence of knowing how long the application really needs.  If we can tighten that value a bit, I'd be more comfortable increasing the limit to that value.
24 hours reduces the difference enough that what falls through can be worked around; slaves are seldom idle for >24h.  There's no hard maximum, though.
Who is managing the implementation of the patch?   Hal?  What can I do to help get this done?
Assignee: network-operations → cransom
I added a specific policy to extend the TCP inactivity timeout for all connections from the pods to dev-master01, while :hwine and :dustin work out getting keepalive settings/patches applied to the appropriate devices.  New connections will inherit this setting.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Armen, if you want to be doubly sure, you can add a sysctl similar to the one in 781860#c1 on the foopies, using PuppetAgain.  The buildslave code is already setting SO_KEEPALIVE, but that keepalive duration defaults to 2h, which is longer than the default firewall timeout.
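
The same effect can also be had per-socket rather than system-wide, if patching the slave turns out to be easier than touching the sysctl; a minimal sketch, with Linux-only constants and illustrative values (these are not the numbers from 781860#c1):

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# Illustrative values: start probing after 20 idle minutes (below the
# 30-minute firewall default), re-probe every 75 seconds, and give up
# after 9 failed probes.
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 1200)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 75)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 9)

Either way, the point is just to put some traffic on the connection more often than the firewall's inactivity timeout.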
Blocks: 817838
Thanks a lot!
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard