Closed Bug 764537 Opened 12 years ago Closed 12 years ago

cal-vm-win32-1.community.scl3.mozilla.com and others have connectivity issues with their master.

Categories

(Infrastructure & Operations Graveyard :: NetOps, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Fallen, Assigned: cransom)

References

Details

A few of our slaves are having frequent buildbot "Slave Lost" issues. I'm starting to suspect this might be a network issue, since it happens on both our Windows machines and the Mac box on loan. They are all on the community network behind jump1.

I have not had this issue with our linux machines, which are on the internal vm network.

I would appreciate it if someone could investigate.

The master they are connecting to:
momo-cal-master-01.vm.labs.scl3.mozilla.com  (calendar-master.mozillalabs.com)

Machines with issues connecting:
cal-vm-win32-1.community.scl3.mozilla.com
cal-vm-win32-tbox.community.scl3.mozilla.com
bm-xserve07.build.mozilla.org

Machines without issues:
momo-vm-cal-linux-01.vm.labs.scl3.mozilla.com
momo-vm-cal-linux64-01.vm.labs.scl3.mozilla.com
Summary: cal-vm-win32-1.community.scl3.mozilla.com and others have connectivity issues with its master. → cal-vm-win32-1.community.scl3.mozilla.com and others have connectivity issues with their master.
To clarify, they don't have issues *connecting*; rather, I often get a red build due to a lost slave. See, for example:

http://calendar-master.mozillalabs.com/builders/WINNT%205.2%20comm-central%20lightning%20nightly/builds/275/steps/shell_7/logs/stdio
Can you give example flows that are failing?  Are they failing after 30 minutes of inactivity (the currently configured default) or something else?  Are there any logs indicating reasons for the dropped session?
FWIW, for NetOps' interest in solving this:

cal-vm-win32-1.community.scl3.mozilla.com
cal-vm-win32-tbox.community.scl3.mozilla.com
bm-xserve07.build.mozilla.org

All of them (AIUI) connect via port 80 as a workaround for flow issues (historic; cf. :gozer). Port 80 is a port forward to port 9030, IIRC from what gozer told me. I don't know at what layer that redirect happens.

My theory is that something is specifically limiting the connection timer/timeout on port-80 requests (expecting, of course, that the connection is a short, finite web request rather than a long-lived socket).

If we make sure that a flow is open for port 9030 [the same use case as the already-documented (in mana) port 9010 that SeaMonkey uses], Calendar should be able to switch to it easily and possibly avoid some of these issues.

(momo-vm-cal-linux-01.vm.labs.scl3.mozilla.com, at least, *is* using port 9030, which might be part of why it is not having issues.)
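
For reference, a minimal sketch of what the slave-side change might look like, assuming these slaves run from a standard generated buildbot.tac (the base directory, slave name, and password below are placeholders, not values from the actual hosts). The functional change is simply pointing the slave at port 9030 instead of the port-80 forward; the keepalive comment anticipates the idle-timeout findings later in this bug, the fix for which was ultimately applied on the network side rather than on the slaves.

# buildbot.tac -- hypothetical slave-side sketch, not the actual file from
# cal-vm-win32-1; basedir, slavename, and passwd are placeholders.
from twisted.application import service
from buildslave.bot import BuildSlave

basedir = r'C:\builds\slave'                 # placeholder base directory
application = service.Application('buildslave')

buildmaster_host = 'calendar-master.mozillalabs.com'
port = 9030       # talk to the master port directly instead of the port-80 forward
slavename = 'cal-vm-win32-1'                 # placeholder slave name
passwd = 'XXXXXXXX'                          # placeholder password
keepalive = 240   # the bug notes 900 s here; a ping interval below the smallest
                  # idle timeout on the path (the load balancer's turned out to be
                  # 300 s) would also keep otherwise quiet sessions alive
usepty = False

s = BuildSlave(buildmaster_host, port, slavename, passwd, basedir,
               keepalive, usepty)
s.setServiceParent(application)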
We'd be happy to open a flow to 63.245.223.165 9030/tcp if that is what needs to happen, and it sounds like that is the more correct thing to do. Looking through our configs, I don't see anything we're doing for the 80->9030 redirect, so it is likely something internal between the hosts in question.
(In reply to Ravi Pina [:ravi] from comment #2)
> Can you give example flows that are failing? 
The flow here is:

cal-vm-win32-1 -> calendar-master.mozillalabs.com:80 

> Are they failing after 30
> minutes of inactivity (the currently configured default) or something else? 

According to the buildbot job here:

https://calendar-master.mozillalabs.com/builders/WINNT%205.2%20comm-central%20lightning%20nightly/builds/275/

the build took 6 minutes and 7 seconds before it failed due to a lost slave. The keepalive is set to 900 seconds on that buildbot client.

> Are there any logs indicating reasons for the dropped session?

Yes, sorry. The link above was missing https and contains a message explaining why the slave was lost (connection closed in an unclean fashion). There is no more specific error log that I know of.

https://calendar-master.mozillalabs.com/builders/WINNT%205.2%20comm-central%20lightning%20nightly/builds/275/steps/shell_7/logs/stdio
(In reply to Ravi Pina [:ravi] from comment #4)
> We'd be happy to open to 63.245.223.165 9030/tcp if that is what needs to
> happen, and it sounds like it is the more correct thing to do.  Looking
> through our configs I don't see anything that we're doing for the 80->9030
> redirect so it is likely something internal between the hosts in question.

Let's give it a try!
Let me know when 63.245.223.165 is listening on 9030/tcp and I'll get flows opened.
(In reply to Ravi Pina [:ravi] from comment #7)
> Let me know when 63.245.223.165 is listening on 9030/tcp and I'll get flows
> opened.

Internally (on the VM itself) it already is, since the Linux hosts already talk to it on 9030 directly. We would just need to change the configs on the other hosts that use 80/tcp before we close that flow.
Right now, the main problem is that the calendar master (and the 2 linux VMs) were hosted at momo, then moved into the labs cluster, for convenience.

The ideal solution would be to move the 3 calendar boxes in labs into the community network. This way, they would all be on the same network, simplifying things quite a bit.
(In reply to Philippe M. Chiasson (:gozer) from comment #9)
> Right now, the main problem is that the calendar master (and the 2 linux
> VMs) were hosted at momo, then moved into the labs cluster, for convenience.
> 
> The ideal solution would be to move the 3 calendar boxes in labs into the
> community network. this way, they would all be on the same network,
> simplifying things quite a bit.

I don't really mind how it's done, but I really would love to see this connectivity issue go away.

I don't know what is easier/faster for you, but I would appreciate it if you could either move all related machines into the community network or open the network flows.

(In reply to Ravi Pina [:ravi] from comment #7)
> Let me know when 63.245.223.165 is listening on 9030/tcp and I'll get flows
> opened.

As Callek mentioned, this is already the case.
Another note, in case someone wants to analyse what is happening with the current setup: the failure seems to occur with build steps that take more than 5 minutes. The client.py checkout step got a slave-lost message after exactly 5 minutes and 6 seconds (a quick way to reproduce this outside of buildbot is sketched after the link):


https://calendar-master.mozillalabs.com/builders/WINNT%205.2%20comm-central%20lightning%20nightly/builds/287/steps/shell_7
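A rough stand-alone probe for narrowing this down, a sketch only: the host and port are the ones named in this bug, while the idle period is an assumption chosen to sit just past the observed five-minute window. It opens a connection along the same path the slaves use, stays silent, and then checks whether something in between has torn the session down.

#!/usr/bin/env python
# Idle-timeout probe -- a sketch, not part of the buildbot setup.
import socket
import time

HOST = 'calendar-master.mozillalabs.com'   # master named in this bug
PORT = 80                                  # the port-80 workaround path
IDLE_SECONDS = 6 * 60                      # just past the ~5-minute failures

sock = socket.create_connection((HOST, PORT), timeout=30)
print('connected; staying idle for %d seconds' % IDLE_SECONDS)
time.sleep(IDLE_SECONDS)

try:
    # A harmless HTTP request; we only care whether the path is still alive.
    sock.sendall(('HEAD / HTTP/1.0\r\nHost: %s\r\n\r\n' % HOST).encode('ascii'))
    data = sock.recv(4096)
    if data:
        print('connection survived the idle period')
    else:
        print('peer closed the connection (EOF) during the idle period')
except socket.error as exc:
    print('send/recv failed after the idle period: %s' % exc)
finally:
    sock.close()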
What's the status of this bug? I really need this working ASAP; we have been missing nightly builds since mid-May.
Severity: normal → major
Severity: major → normal
Priority: -- → P1
I found a valid session from cal-vm-win32-1 to calendar-master on the firewall and tracked its state here:
cransom@fw1.scl3> show security flow session session-identifier 40219019    
Flow Sessions on FPC2 PIC0:

Session ID: 40219019, Status: Normal, State: Active
Flag: 0x8000040
Policy name: http/651
Source NAT pool: Null, Application: junos-http/6
Maximum timeout: 1800, Current timeout: 1500
Session State: Valid
Start time: 9577668, Duration: 300
   In: 63.245.223.105/4058 --> 63.245.223.165/80;tcp, 
    Interface: reth0.20, 
    Session token: 0x15, Flag: 0x0x21
    Route: 0xb69d3c4, Gateway: 63.245.223.105, Tunnel: 0
    Port sequence: 0, FIN sequence: 0, 
    FIN state: 0, 
    Pkts: 14, Bytes: 6796
   Out: 63.245.223.165/80 --> 63.245.223.105/4058;tcp, 
    Interface: reth0.21, 
    Session token: 0x8, Flag: 0x0x20
    Route: 0x987cbc4, Gateway: 63.245.223.165, Tunnel: 0
    Port sequence: 0, FIN sequence: 0, 
    FIN state: 0, 
    Pkts: 13, Bytes: 1594
Total sessions: 1

{primary:node0}
cransom@fw1.scl3> show security flow session session-identifier 40219019    
Flow Sessions on FPC2 PIC0:

Session ID: 40219019, Status: Normal, State: Active
Flag: 0x88000040
Policy name: http/651
Source NAT pool: Null, Application: junos-http/6
Maximum timeout: 150, Current timeout: 2
Session State: Valid
Start time: 9577668, Duration: 301
   In: 63.245.223.105/4058 --> 63.245.223.165/80;tcp, 
    Interface: reth0.20, 
    Session token: 0x15, Flag: 0x0x21
    Route: 0xb69d3c4, Gateway: 63.245.223.105, Tunnel: 0
    Port sequence: 0, FIN sequence: 0, 
    FIN state: 2, 
    Pkts: 15, Bytes: 6836
   Out: 63.245.223.165/80 --> 63.245.223.105/4058;tcp, 
    Interface: reth0.21, 
    Session token: 0x8, Flag: 0x0x20
    Route: 0x987cbc4, Gateway: 63.245.223.165, Tunnel: 0
    Port sequence: 0, FIN sequence: 0, 
    FIN state: 2, 
    Pkts: 15, Bytes: 1674
Total sessions: 1


The moment the session hit the 5-minute mark, the firewall saw TCP FIN packets between the client and server saying the connection was over. As far as I can see, these are not being triggered by the firewall. I'll see if I can get access to a machine and tcpdump to see which side is shutting down the connection first.
Assignee: network-operations → cransom
Status: NEW → ASSIGNED
This is from the load balancer that is servicing calendar-master:
08:42:59.241117 IP 63.245.223.165.http > 63.245.223.105.4302: F 1537154458:1537154458(0) ack 2320241894 win 22976
08:42:59.241527 IP 63.245.223.105.4302 > 63.245.223.165.http: . ack 1 win 63174
08:42:59.241770 IP 63.245.223.105.4302 > 63.245.223.165.http: F 1:1(0) ack 1 win 63174
08:42:59.241782 IP 63.245.223.165.http > 63.245.223.105.4302: . ack 2 win 22976

Would :gozer be the best one to check out the keepalive settings there? It's the load balancer that is terminating the connection after 5 minutes of idle time.
I got into the load balancer and disabled the keepalive timeout for calendar-master (it was set to 300 seconds, a default). I'll monitor to see if sessions live beyond 300 seconds.
It made it to 30 minutes with no activity, at which point it fell prey to the firewall timing out the session. I've increased that timeout to 8 hours, which hopefully is more than enough for this.
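
(For completeness, and not part of the fix applied here: the endpoint-side counterpart to raising intermediary timeouts is to make quiet connections generate their own traffic, e.g. via OS-level TCP keepalive with an interval shorter than the smallest idle timeout on the path. A minimal Python sketch follows; it assumes Linux-style socket constants for the tuning knobs, and the function name and defaults are illustrative only.)

import socket

def enable_tcp_keepalive(sock, idle=120, interval=60, count=5):
    """Turn on OS-level TCP keepalives so an otherwise quiet connection
    still generates traffic more often than any intermediary idle timeout.
    The per-socket tuning constants below are Linux-specific; other
    platforms silently fall back to their system-wide defaults."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, 'TCP_KEEPIDLE'):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, 'TCP_KEEPINTVL'):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, 'TCP_KEEPCNT'):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)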
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard