buildslave (in castro) <-> master (in mpt) connections get dropped

RESOLVED FIXED

Status

RESOLVED FIXED
9 years ago
4 years ago

People

(Reporter: bhearsum, Assigned: dmoore)

We're seeing this on new ix machines (mv-moz2-linux-ix-slave* and mw32-ix-slave*) as well as all of the talos-r3-* machines, all of which are located in Castro.

As a specific example, talos-r3-leopard-017 has been connected to talos-master.mozilla.org:9012 for a couple of hours, but the master thinks it's disconnected.
mv-moz2-linux-ix-slave09 is currently in this state.

Buildbot is running on the slave, and nothing in its twistd.log indicates that it has been disconnected.

staging-master claims that it's disconnected and is not giving it any new jobs.

[root@mv-moz2-linux-ix-slave09 ~]# netstat -atpn | grep 9010
tcp        0      0 10.250.49.157:48161         10.2.71.208:9010            ESTABLISHED 2383/python         

[root@staging-master ~]# netstat -atpn | grep 10.250.49.157
(no results)
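The two netstat outputs above describe a half-open connection: the slave still holds an ESTABLISHED socket that the master has no record of. A minimal Python sketch of that comparison follows, assuming netstat's usual `-atpn` column layout. The function names are made up for illustration and are not part of buildbot.

```python
def established(netstat_output):
    """Return (local, remote) address pairs for ESTABLISHED TCP sockets."""
    pairs = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: proto, recv-q, send-q, local, remote, state, pid/program
        if len(fields) >= 6 and fields[0].startswith("tcp") and fields[5] == "ESTABLISHED":
            pairs.append((fields[3], fields[4]))
    return pairs


def half_open(slave_output, master_output):
    """Connections the slave believes are up but the master has no record of."""
    # On the master, the same connection appears with local and remote swapped.
    master_pairs = set(established(master_output))
    return [(local, remote) for local, remote in established(slave_output)
            if (remote, local) not in master_pairs]


# The slave's netstat line from this bug; the grep on staging-master returned nothing.
slave_view = "tcp        0      0 10.250.49.157:48161         10.2.71.208:9010            ESTABLISHED 2383/python"
master_view = ""

print(half_open(slave_view, master_view))
```

Any pair this reports is a candidate for the state described here: the slave sits idle waiting for jobs on a connection that something in between has already dropped.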

For reference, I just found this comment from a while back, when we had a similar (possibly the same) issue with build<->build connections: https://bugzilla.mozilla.org/show_bug.cgi?id=515348#c7

> Ben/Nick,
> 
> Our internal firewall will terminate TCP connections which have been idle for
> longer than 60 minutes. I suspect this is the issue, and I have removed this
> timeout for build<->build traffic. Please let me know if you encounter another
> interruption and we'll investigate further.
(Assignee)

Comment 3

9 years ago
We've identified and removed the timeout point between MPT and MV. It's been in place since we opened the Castro office, so I'm not sure why this hasn't been an issue for you until now.

For the record, this kind of behavior (a long-running TCP connection with no keepalive traffic) is anomalous and creates a fair amount of administrative overhead for us. We have to manually go into the network equipment and clear out old connections to avoid filling up our session tables. It also prevents us from opening a truly remote build location in a geographically detached datacenter.

Ideally, we'd like to eventually see one of two types of behavior:

1) Long-running connections that periodically (every 10 minutes?) send a small keepalive packet to prevent sessions from being torn down.

2) Abandoning nailed-up TCP connections in favor of periodic polling.
Status: ASSIGNED → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → FIXED
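
As a rough illustration of option 1 in comment 3, kernel-level TCP keepalive can be switched on per socket. This is a sketch under assumptions, not how buildbot does it: the buildbot keepalive discussed in this bug appears to be an application-level ping, which is a different mechanism. The 600-second idle time mirrors the "every 10 minutes?" suggestion, the other values are arbitrary, and the TCP_KEEP* options are platform-dependent, so they are guarded.

```python
import socket


def enable_keepalive(sock, idle_seconds=600, interval_seconds=60, probes=5):
    """Ask the kernel to send keepalive probes on an otherwise idle TCP socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning knobs below exist on Linux but not on every platform.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between probes once probing has started.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_seconds)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes before the kernel declares the connection dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
```

These probes are answered by the peer's TCP stack rather than by the application, so they should not add work to a busy master process. Whether a given firewall counts them as activity depends on the firewall.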

Comment 4

9 years ago
I imagine we haven't noticed until recently because we didn't have Nagios.

Hm, could we create a perma-queue of no-op (or "hostname") jobs, lower priority than any real job, that get assigned to idle slaves?

The long term answer may be pods with buildbot masters in the same LAN as the slaves.
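
The no-op queue idea can be sketched as a toy scheduler. This is an illustration only, not buildbot's scheduler: the class, the job names, and the 600-second threshold are all assumptions made for the example.

```python
from collections import deque

NOOP_JOB = "hostname"      # the no-op job suggested above
NOOP_AFTER_SECONDS = 600   # assumed threshold, well under the 60-minute firewall timeout


class JobQueue:
    """Toy scheduler: real jobs first, a no-op only for slaves idle too long."""

    def __init__(self):
        self._real_jobs = deque()

    def add(self, job):
        self._real_jobs.append(job)

    def next_job(self, slave_idle_seconds):
        if self._real_jobs:
            return self._real_jobs.popleft()   # real work always wins
        if slave_idle_seconds >= NOOP_AFTER_SECONDS:
            return NOOP_JOB                    # generate traffic on the idle connection
        return None                            # nothing to do yet


queue = JobQueue()
queue.add("linux-build")
```

With this shape a no-op never delays real work. The cost is that each no-op still passes through the master, which matters if master load is already the bottleneck.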

(In reply to comment #3)

We currently have keepalive disabled on the buildbot slaves because it has caused problems in the past (eg bug 476677). That bug is a little confused on details but we still have problems with busy buildbot masters, so would probably have disconnect problems if we re-enabled keepalive. Hopefully that will change in the medium term when we can redo the setup (ie buildbot pods where the number of slaves is matched to keeping master load happy). 

Certainly possible that other details have changed since then (ESX loading, buildbot version etc etc). Any thoughts bhearsum & catlee ?

(In reply to comment #5)
> (In reply to comment #3)
> 
> We currently have keepalive disabled on the buildbot slaves because it has
> caused problems in the past (eg bug 476677). That bug is a little confused on
> details but we still have problems with busy buildbot masters, so would
> probably have disconnect problems if we re-enabled keepalive. Hopefully that
> will change in the medium term when we can redo the setup (ie buildbot pods
> where the number of slaves is matched to keeping master load happy). 
> 
> Certainly possible that other details have changed since then (ESX loading,
> buildbot version etc etc). Any thoughts bhearsum & catlee ?

Yeah, I think once we can split out our slaves in such a way that we don't kill the masters we can re-enable the keepalive ping without worry.


Also, I think the reason we haven't seen this before is because up until now, the only machines in Castro were Talos machines, which typically have much less idle time, so it's harder to get into this state.

I just scanned over all the machines that are up, and all of them are connected, except for a couple of Linux machines with a grub problem. Thanks Derek!
Product: mozilla.org → mozilla.org Graveyard