Closed Bug 764537 Opened 12 years ago Closed 12 years ago

cal-vm-win32-1.community.scl3.mozilla.com and others have connectivity issues with their master.

Categories

(Infrastructure & Operations Graveyard :: NetOps, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Fallen, Assigned: cransom)

References

Details

A few of our slaves are having frequent buildbot "Slave Lost" issues. I'm starting to suspect this might be a network issue, since it happens on both our Windows machines and the Mac box on loan. They are all on the community network behind jump1.

I have not had this issue with our linux machines, which are on the internal vm network.

I would appreciate it if someone could investigate.

The master they are connecting to:
momo-cal-master-01.vm.labs.scl3.mozilla.com  (calendar-master.mozillalabs.com)

Machines with issues connecting:
cal-vm-win32-1.community.scl3.mozilla.com
cal-vm-win32-tbox.community.scl3.mozilla.com
bm-xserve07.build.mozilla.org

Machines without issues:
momo-vm-cal-linux-01.vm.labs.scl3.mozilla.com
momo-vm-cal-linux64-01.vm.labs.scl3.mozilla.com
Summary: cal-vm-win32-1.community.scl3.mozilla.com and others have connectivity issues with its master. → cal-vm-win32-1.community.scl3.mozilla.com and others have connectivity issues with their master.
To clarify, they don't have issues *connecting*; rather, I often get a red build due to a lost slave. See, for example:

http://calendar-master.mozillalabs.com/builders/WINNT%205.2%20comm-central%20lightning%20nightly/builds/275/steps/shell_7/logs/stdio
Can you give example flows that are failing?  Are they failing after 30 minutes of inactivity (the currently configured default) or something else?  Are there any logs indicating reasons for the dropped session?
FWIW, for NetOps' interest in solving this:

cal-vm-win32-1.community.scl3.mozilla.com
cal-vm-win32-tbox.community.scl3.mozilla.com
bm-xserve07.build.mozilla.org

All of them (AIUI) connect via port 80 as a workaround for flow issues (historic; cf. :gozer). Port 80 is a port forward to port 9030, IIRC from what gozer told me. I don't know at what layer that redirect happens.

My theory is that something is specifically limiting the connection timer/timeout on port-80 requests (expecting, of course, that the connection is a short, finite web request rather than a long-lived socket).

If we make sure that a flow is open for port 9030 [the same use case as the already-documented (in mana) port 9010 that SeaMonkey uses], Calendar should be able to switch to it easily and possibly avoid some of these issues.

(momo-vm-cal-linux-01.vm.labs.scl3.mozilla.com, at least, *is* using port 9030, which might be part of why it is not having issues.)
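
For reference, a minimal sketch of what the slave-side change might look like, assuming these slaves run from a standard generated buildbot.tac (the base directory, slave name, and password below are placeholders, not values from the actual hosts). The functional change is simply pointing the slave at port 9030 instead of the port-80 forward; the keepalive comment anticipates the idle-timeout findings later in this bug, the fix for which was ultimately applied on the network side rather than on the slaves.

# buildbot.tac -- hypothetical slave-side sketch, not the actual file from
# cal-vm-win32-1; basedir, slavename, and passwd are placeholders.
from twisted.application import service
from buildslave.bot import BuildSlave

basedir = r'C:\builds\slave'                 # placeholder base directory
application = service.Application('buildslave')

buildmaster_host = 'calendar-master.mozillalabs.com'
port = 9030       # talk to the master port directly instead of the port-80 forward
slavename = 'cal-vm-win32-1'                 # placeholder slave name
passwd = 'XXXXXXXX'                          # placeholder password
keepalive = 240   # the bug notes 900 s here; a ping interval below the smallest
                  # idle timeout on the path (the load balancer's turned out to be
                  # 300 s) would also keep otherwise quiet sessions alive
usepty = False

s = BuildSlave(buildmaster_host, port, slavename, passwd, basedir,
               keepalive, usepty)
s.setServiceParent(application)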
We'd be happy to open a flow to 63.245.223.165 9030/tcp if that is what needs to happen, and it sounds like that is the more correct thing to do. Looking through our configs, I don't see anything we're doing for the 80->9030 redirect, so it is likely something internal between the hosts in question.
(In reply to Ravi Pina [:ravi] from comment #2)
> Can you give example flows that are failing? 
The flow here is:

cal-vm-win32-1 -> calendar-master.mozillalabs.com:80 

> Are they failing after 30
> minutes of inactivity (the currently configured default) or something else? 

According to the buildbot job here:

https://calendar-master.mozillalabs.com/builders/WINNT%205.2%20comm-central%20lightning%20nightly/builds/275/

the build took 6 minutes and 7 seconds before it failed due to a lost slave. The keepalive is set to 900 seconds on that buildbot client.

> Are there any logs indicating reasons for the dropped session?

Yes, sorry. The link above was missing https and contains a message explaining why the slave was lost (connection closed in an unclean fashion). There is no more specific error log that I know of.

https://calendar-master.mozillalabs.com/builders/WINNT%205.2%20comm-central%20lightning%20nightly/builds/275/steps/shell_7/logs/stdio
(In reply to Ravi Pina [:ravi] from comment #4)
> We'd be happy to open to 63.245.223.165 9030/tcp if that is what needs to
> happen, and it sounds like it is the more correct thing to do.  Looking
> through our configs I don't see anything that we're doing for the 80->9030
> redirect so it is likely something internal between the hosts in question.

Let's give it a try!
Let me know when 63.245.223.165 is listening on 9030/tcp and I'll get flows opened.
(In reply to Ravi Pina [:ravi] from comment #7)
> Let me know when 63.245.223.165 is listening on 9030/tcp and I'll get flows
> opened.

Internally (on the VM itself) it already is, since the Linux hosts already talk to it on 9030 directly. We would just need to change the configs on the other hosts that use 80/tcp before we close that flow.
Right now, the main problem is that the calendar master (and the 2 linux VMs) were hosted at momo, then moved into the labs cluster, for convenience.

The ideal solution would be to move the 3 calendar boxes in labs into the community network. This way, they would all be on the same network, simplifying things quite a bit.
(In reply to Philippe M. Chiasson (:gozer) from comment #9)
> Right now, the main problem is that the calendar master (and the 2 linux
> VMs) were hosted at momo, then moved into the labs cluster, for convenience.
> 
> The ideal solution would be to move the 3 calendar boxes in labs into the
> community network. this way, they would all be on the same network,
> simplifying things quite a bit.

I don't really mind how it's done, but I really would love to see this connectivity issue go away.

I don't know what is easier/faster for you, but I would appreciate it if you could either move all related machines into the community network or open the network flows.

(In reply to Ravi Pina [:ravi] from comment #7)
> Let me know when 63.245.223.165 is listening on 9030/tcp and I'll get flows
> opened.

As Callek mentioned, this is already the case.
Another note, in case someone wants to analyse what is happening with the current setup: the failure seems to occur with build steps that take more than 5 minutes. The client.py checkout step got a slave-lost message after exactly 5 minutes and 6 seconds (a quick way to reproduce this outside of buildbot is sketched after the link):


https://calendar-master.mozillalabs.com/builders/WINNT%205.2%20comm-central%20lightning%20nightly/builds/287/steps/shell_7
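A rough stand-alone probe for narrowing this down, a sketch only: the host and port are the ones named in this bug, while the idle period is an assumption chosen to sit just past the observed five-minute window. It opens a connection along the same path the slaves use, stays silent, and then checks whether something in between has torn the session down.

#!/usr/bin/env python
# Idle-timeout probe -- a sketch, not part of the buildbot setup.
import socket
import time

HOST = 'calendar-master.mozillalabs.com'   # master named in this bug
PORT = 80                                  # the port-80 workaround path
IDLE_SECONDS = 6 * 60                      # just past the ~5-minute failures

sock = socket.create_connection((HOST, PORT), timeout=30)
print('connected; staying idle for %d seconds' % IDLE_SECONDS)
time.sleep(IDLE_SECONDS)

try:
    # A harmless HTTP request; we only care whether the path is still alive.
    sock.sendall(('HEAD / HTTP/1.0\r\nHost: %s\r\n\r\n' % HOST).encode('ascii'))
    data = sock.recv(4096)
    if data:
        print('connection survived the idle period')
    else:
        print('peer closed the connection (EOF) during the idle period')
except socket.error as exc:
    print('send/recv failed after the idle period: %s' % exc)
finally:
    sock.close()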
What's the status of this bug? I really need this working ASAP; we have been missing nightly builds since mid-May.
Severity: normal → major
Severity: major → normal
Priority: -- → P1
I found a valid session from cal-vm-win32-1 to calendar-master on the firewall and tracked its state here:
cransom@fw1.scl3> show security flow session session-identifier 40219019    
Flow Sessions on FPC2 PIC0:

Session ID: 40219019, Status: Normal, State: Active
Flag: 0x8000040
Policy name: http/651
Source NAT pool: Null, Application: junos-http/6
Maximum timeout: 1800, Current timeout: 1500
Session State: Valid
Start time: 9577668, Duration: 300
   In: 63.245.223.105/4058 --> 63.245.223.165/80;tcp, 
    Interface: reth0.20, 
    Session token: 0x15, Flag: 0x0x21
    Route: 0xb69d3c4, Gateway: 63.245.223.105, Tunnel: 0
    Port sequence: 0, FIN sequence: 0, 
    FIN state: 0, 
    Pkts: 14, Bytes: 6796
   Out: 63.245.223.165/80 --> 63.245.223.105/4058;tcp, 
    Interface: reth0.21, 
    Session token: 0x8, Flag: 0x0x20
    Route: 0x987cbc4, Gateway: 63.245.223.165, Tunnel: 0
    Port sequence: 0, FIN sequence: 0, 
    FIN state: 0, 
    Pkts: 13, Bytes: 1594
Total sessions: 1

{primary:node0}
cransom@fw1.scl3> show security flow session session-identifier 40219019    
Flow Sessions on FPC2 PIC0:

Session ID: 40219019, Status: Normal, State: Active
Flag: 0x88000040
Policy name: http/651
Source NAT pool: Null, Application: junos-http/6
Maximum timeout: 150, Current timeout: 2
Session State: Valid
Start time: 9577668, Duration: 301
   In: 63.245.223.105/4058 --> 63.245.223.165/80;tcp, 
    Interface: reth0.20, 
    Session token: 0x15, Flag: 0x0x21
    Route: 0xb69d3c4, Gateway: 63.245.223.105, Tunnel: 0
    Port sequence: 0, FIN sequence: 0, 
    FIN state: 2, 
    Pkts: 15, Bytes: 6836
   Out: 63.245.223.165/80 --> 63.245.223.105/4058;tcp, 
    Interface: reth0.21, 
    Session token: 0x8, Flag: 0x0x20
    Route: 0x987cbc4, Gateway: 63.245.223.165, Tunnel: 0
    Port sequence: 0, FIN sequence: 0, 
    FIN state: 2, 
    Pkts: 15, Bytes: 1674
Total sessions: 1


The moment the session hit the 5-minute mark, the firewall saw TCP FIN packets between the client and server saying the connection was over. As far as I can see, these are not being triggered by the firewall. I'll see if I can get access to a machine and tcpdump to see which side is shutting down the connection first.
Assignee: network-operations → cransom
Status: NEW → ASSIGNED
This is from the load balancer that is servicing calendar-master:
08:42:59.241117 IP 63.245.223.165.http > 63.245.223.105.4302: F 1537154458:1537154458(0) ack 2320241894 win 22976
08:42:59.241527 IP 63.245.223.105.4302 > 63.245.223.165.http: . ack 1 win 63174
08:42:59.241770 IP 63.245.223.105.4302 > 63.245.223.165.http: F 1:1(0) ack 1 win 63174
08:42:59.241782 IP 63.245.223.165.http > 63.245.223.105.4302: . ack 2 win 22976

Would :gozer be the best one to check out the keepalive settings there? It's the load balancer that is terminating the connection after 5 minutes of idle time.
I got into the load balancer and disabled the keepalive timeout for calendar-master (it was set to 300 seconds, a default). I'll monitor to see if sessions live beyond 300 seconds.
It made it to 30 minutes with no activity, at which point it fell prey to the firewall timing out the session. I've increased that timeout to 8 hours, which hopefully is more than enough for this.
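
(For completeness, and not part of the fix applied here: the endpoint-side counterpart to raising intermediary timeouts is to make quiet connections generate their own traffic, e.g. via OS-level TCP keepalive with an interval shorter than the smallest idle timeout on the path. A minimal Python sketch follows; it assumes Linux-style socket constants for the tuning knobs, and the function name and defaults are illustrative only.)

import socket

def enable_tcp_keepalive(sock, idle=120, interval=60, count=5):
    """Turn on OS-level TCP keepalives so an otherwise quiet connection
    still generates traffic more often than any intermediary idle timeout.
    The per-socket tuning constants below are Linux-specific; other
    platforms silently fall back to their system-wide defaults."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, 'TCP_KEEPIDLE'):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, 'TCP_KEEPINTVL'):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, 'TCP_KEEPCNT'):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)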
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard