zlb11.ops.phx1.mozilla.com possible SYN Flood.

RESOLVED INCOMPLETE

Status

Infrastructure & Operations
WebOps: Other
4 years ago
4 years ago

People

(Reporter: MOC Nagios API, Unassigned)

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/75] [id=nagios1.private.phx1.mozilla.com:346576])

Description

4 years ago
Automated alert report from nagios1.private.phx1.mozilla.com:

Hostname: zlb11.ops.phx1.mozilla.com
State:    DOWN
Output:   PING CRITICAL - Packet loss = 100%

Comment 1

4 years ago
Automated alert recovery:

Hostname: zlb11.ops.phx1.mozilla.com
State:    UP
Output:   PING OK - Packet loss = 75%, RTA = 17.45 ms
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
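
The recovery above reports "PING OK" at 75% packet loss, which makes sense only if the check's loss thresholds sit above 75%. The sketch below shows how a threshold-based ping check could arrive at that state. The function name `classify_ping` and the default thresholds (warning at 80% loss or 3000 ms, critical at 100% loss or 5000 ms) are assumptions for illustration; the configuration actually used on nagios1 is not shown in this bug.

```python
from typing import Optional, Tuple


def classify_ping(
    loss_pct: float,
    rta_ms: Optional[float],
    warn: Tuple[float, float] = (3000.0, 80.0),   # (RTA ms, loss %), assumed values
    crit: Tuple[float, float] = (5000.0, 100.0),  # (RTA ms, loss %), assumed values
) -> str:
    """Classify a ping result against (RTA, packet loss) thresholds.

    rta_ms is None when no replies came back at all.
    """
    warn_rta, warn_loss = warn
    crit_rta, crit_loss = crit
    if rta_ms is None or loss_pct >= crit_loss or rta_ms >= crit_rta:
        return "CRITICAL"
    if loss_pct >= warn_loss or rta_ms >= warn_rta:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    print(classify_ping(100.0, None))   # the initial alert: no replies
    print(classify_ping(75.0, 17.45))   # the recovery: heavy loss, still under threshold
```

Under these assumed thresholds, a host dropping three of every four packets still counts as up, so a recovery message alone does not indicate a healthy host.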

Comment 2

4 years ago
Tracking bug

Updated

4 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Updated

4 years ago
Summary: zlb11.ops.phx1.mozilla.com is DOWN: PING CRITICAL - Packet loss = 100% → zlb11.ops.phx1.mozilla.com possible SYN Flood.
zlb11 zeus service stopped, puppet disabled, chkconfig zeus disabled.

Each incident begins with the following syslog message:

Jul 23 17:02:58 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:03:59 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:11:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:12:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:13:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:14:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:15:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:16:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:17:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:24:11 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:25:11 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:26:11 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:27:11 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:28:11 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:29:11 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:30:11 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.

And then the host loses all networking (ICMP, TCP, Zeus uni/multicast) for around a minute.
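
To line these messages up against the outages, it can help to group them into bursts. The following is a minimal sketch, not a tool used in this bug: `parse_line`, `group_bursts`, the 90-second gap, and the placeholder year (syslog timestamps carry none) are all assumptions for illustration.

```python
import re
from datetime import datetime, timedelta

# Matches the kernel SYN-flood warning as it appears in the syslog excerpt.
SYN_FLOOD = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel: "
    r"possible SYN flooding on port (?P<port>\d+)\."
)


def parse_line(line, year=2000):
    """Return (timestamp, host, port), or None if the line is not a SYN-flood warning.

    The year is a placeholder because classic syslog timestamps omit it.
    """
    m = SYN_FLOOD.match(line)
    if m is None:
        return None
    ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
    return ts, m.group("host"), int(m.group("port"))


def group_bursts(timestamps, max_gap=timedelta(seconds=90)):
    """Group timestamps into bursts; a gap longer than max_gap starts a new burst."""
    bursts = []
    for ts in sorted(timestamps):
        if bursts and ts - bursts[-1][-1] <= max_gap:
            bursts[-1].append(ts)
        else:
            bursts.append([ts])
    return bursts
```

Run over the sixteen lines above with the 90-second gap, this yields three bursts of 2, 7, and 7 messages, starting at 17:02:58, 17:11:57, and 17:24:11. The near-exact one-minute spacing inside each burst suggests the kernel rate-limits the warning, so the message count is probably a poor measure of how many SYNs actually arrived.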
Triaging out of the MOC queue
Assignee: nobody → server-ops-webops
Component: Server Operations: MOC → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
QA Contact: dmoore → nmaul

Updated

4 years ago
Whiteboard: [id=nagios1.private.phx1.mozilla.com:346576] → [kanban:https://kanbanize.com/ctrl_board/4/596] [id=nagios1.private.phx1.mozilla.com:346576]

Comment 5

4 years ago
Jake shut this device down the other day. The ZLB was not removed from the pool first, which prevented anyone from making any configuration changes to the cluster. I forcibly removed zlb11 from the pool, which unblocked others' work. This means two things:

1) When the device is booted back up it will be confused and unable to communicate with the cluster. It will assume that the other nodes have died and will attempt to take all of the VIPs for itself. This will obviously cause ARP issues.

2) This node will need to rejoin the pool. This is a simple matter of going to the configuration page and checking the box to join a new pool (forgetting its current configuration and state).

These steps will need to be taken quickly, and at minimum after hours, possibly during a maintenance window, as they will cause service blips.

In the future, when removing failing nodes from a cluster, the node should be removed from the pool from within the management console (web portal). This can be accomplished from any node. This will ensure that the cluster configurations do not get wedged in this way, especially as it is trivial to re-add a node to a pool.

Cheers everybody
We chkconfig'd zeus off when shutting it down, because the constant network packet loss it was causing while up made it impossible to remove cleanly. So the server should at least be able to start up without doing any harm. It should also be possible to join the server to the cluster from the command line *before* starting Zeus, using the installer script, so that it grabs the current config before spinning anything up.
Blocks: 1043831

Updated

4 years ago
Blocks: 1080721

Updated

4 years ago
Blocks: 1080714, 1080715

Comment 7

4 years ago
At the request of cyliang, I am removing the nagios checks for this hostname.

Comment 8

4 years ago
Here is a Nagios check that was acked: bug 1081465.

Updated

4 years ago
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/596] [id=nagios1.private.phx1.mozilla.com:346576] → [kanban:https://webops.kanbanize.com/ctrl_board/2/75] [id=nagios1.private.phx1.mozilla.com:346576]

Comment 9

4 years ago
There are no plans to bring this node back into service, so I'm going to close this bug as INCOMPLETE.
Status: REOPENED → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → INCOMPLETE