Closed Bug 1043058 Opened 10 years ago Closed 9 years ago

zlb11.ops.phx1.mozilla.com possible SYN Flood.

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: nagiosapi, Unassigned)

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/75] [id=nagios1.private.phx1.mozilla.com:346576])

Automated alert report from nagios1.private.phx1.mozilla.com:

Hostname: zlb11.ops.phx1.mozilla.com
State:    DOWN
Output:   PING CRITICAL - Packet loss = 100%
Automated alert recovery:

Hostname: zlb11.ops.phx1.mozilla.com
State:    UP
Output:   PING OK - Packet loss = 75%, RTA = 17.45 ms
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Tracking bug
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Summary: zlb11.ops.phx1.mozilla.com is DOWN: PING CRITICAL - Packet loss = 100% → zlb11.ops.phx1.mozilla.com possible SYN Flood.
zlb11 zeus service stopped, puppet disabled, chkconfig zeus disabled.

Each incident begins with the following syslog message:

Jul 23 17:02:58 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:03:59 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:11:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:12:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:13:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:14:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:15:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:16:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:17:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:24:11 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:25:11 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:26:11 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:27:11 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:28:11 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:29:11 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:30:11 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.

And then the host loses all networking (ICMP, TCP, Zeus uni/multicast) for around a minute.
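
The log excerpt above appears to fall into separate bursts of roughly one message per minute, with quiet gaps of several minutes between them. As a rough aid for lining those bursts up against the Nagios DOWN/UP alerts, here is a minimal Python sketch that groups the kernel messages by time gap. The function names, the 120-second gap threshold, and the placeholder year (syslog timestamps carry none) are assumptions made for illustration, not part of any tooling mentioned in this bug.

```python
import re
from datetime import datetime, timedelta

# Matches the kernel message format quoted in this bug.
SYN_FLOOD = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) kernel: "
    r"possible SYN flooding on port (?P<port>\d+)\. Sending cookies\."
)


def parse_events(lines, year=2000):
    """Return timestamps of SYN-flood messages.

    `year` is a placeholder: classic syslog timestamps omit the year.
    """
    events = []
    for line in lines:
        match = SYN_FLOOD.match(line.strip())
        if match:
            stamp = " ".join(match.group("stamp").split())
            events.append(
                datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
            )
    return events


def group_bursts(events, max_gap=timedelta(seconds=120)):
    """Group timestamps into bursts; a gap above `max_gap` starts a new one.

    The 120-second default is an assumed threshold, chosen only because
    the messages in this bug arrive about once a minute within a burst.
    """
    bursts = []
    for ts in sorted(events):
        if bursts and ts - bursts[-1][-1] <= max_gap:
            bursts[-1].append(ts)
        else:
            bursts.append([ts])
    return bursts


# Four of the lines quoted above, spanning the first quiet gap.
SAMPLE = [
    "Jul 23 17:02:58 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.",
    "Jul 23 17:03:59 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.",
    "Jul 23 17:11:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.",
    "Jul 23 17:12:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.",
]

if __name__ == "__main__":
    for burst in group_bursts(parse_events(SAMPLE)):
        print(burst[0].time(), "to", burst[-1].time(), "messages:", len(burst))
```

Feeding it the full excerpt instead of `SAMPLE` should make the start and end of each burst easy to compare with the times at which the host dropped off the network.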
Triaging out of the MOC queue
Assignee: nobody → server-ops-webops
Component: Server Operations: MOC → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
QA Contact: dmoore → nmaul
Whiteboard: [id=nagios1.private.phx1.mozilla.com:346576] → [kanban:https://kanbanize.com/ctrl_board/4/596] [id=nagios1.private.phx1.mozilla.com:346576]
Jake shut this device down the other day. The ZLB was not removed from the pool first, which prevented anyone from making configuration changes to the cluster. I forcibly removed zlb11 from the pool, which unblocked others' work. This means two things:

1) When the device is booted back up, it will be confused and unable to communicate with the cluster. It will assume that the other nodes have died and will attempt to take all of the VIPs for itself. This will obviously cause ARP issues.

2) This node will need to rejoin the pool. This is a simple matter of going to the configuration page and checking the box to join a new pool (forgetting its current configuration and state).

These steps will need to be taken quickly and at least after hours, but possibly during a maintenance window as it will cause service blips.

In the future, when removing failing nodes from a cluster, the node should be removed from the pool from within the management console (web portal). This can be accomplished from any node. This will ensure that the cluster configurations do not get wedged in this way, especially as it is trivial to re-add a node to a pool.

Cheers everybody
We chkconfig'd zeus off when shutting it down, because it was impossible to remove it cleanly due to the constant network packet loss it was causing when it was up. So at least the server should be able to be started up without doing any harm, and it should permit joining the server to the cluster from the command line *before* starting it up, using the installer script, so that it grabs the current config before spinning anything up.
At the request of cyliang, I am removing the nagios checks for this hostname.
Here is a nagios check that was acked: bug 1081465.
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/596] [id=nagios1.private.phx1.mozilla.com:346576] → [kanban:https://webops.kanbanize.com/ctrl_board/2/75] [id=nagios1.private.phx1.mozilla.com:346576]
There are no plans to bring this node back into service, so I'm going to close this bug as INCOMPLETE.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 9 years ago
Resolution: --- → INCOMPLETE
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard