Closed
Bug 1043058
Opened 10 years ago
Closed 9 years ago
zlb11.ops.phx1.mozilla.com possible SYN Flood.
Categories
(Infrastructure & Operations Graveyard :: WebOps: Other, task)
Tracking
(Not tracked)
RESOLVED
INCOMPLETE
People
(Reporter: nagiosapi, Unassigned)
References
()
Details
(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/75] [id=nagios1.private.phx1.mozilla.com:346576])
Automated alert report from nagios1.private.phx1.mozilla.com: Hostname: zlb11.ops.phx1.mozilla.com State: DOWN Output: PING CRITICAL - Packet loss = 100%
Reporter
Comment 1•10 years ago
Automated alert recovery: Hostname: zlb11.ops.phx1.mozilla.com State: UP Output: PING OK - Packet loss = 75%, RTA = 17.45 ms
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 2•10 years ago
Tracking bug
Updated•10 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•10 years ago
Summary: zlb11.ops.phx1.mozilla.com is DOWN: PING CRITICAL - Packet loss = 100% → zlb11.ops.phx1.mozilla.com possible SYN Flood.
zlb11 zeus service stopped, puppet disabled, chkconfig zeus disabled.

Each incident begins with the following syslog message:

Jul 23 17:02:58 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:03:59 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:11:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:12:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:13:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:14:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:15:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:16:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:17:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:24:11 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:25:11 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:26:11 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:27:11 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:28:11 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:29:11 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:30:11 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.

And then the host loses all networking (ICMP, TCP, Zeus uni/multicast) for around a minute.
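For anyone correlating these kernel messages with the Nagios outages, here is a minimal Python sketch that groups the "possible SYN flooding" lines into bursts. It is not tooling that was used on this bug: the function names, the 120-second gap threshold, and the year (classic syslog timestamps carry none) are all assumptions made for illustration.

```python
import re
from datetime import datetime, timedelta

# Matches the kernel message format quoted in this bug.
SYN_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) kernel: "
    r"possible SYN flooding on port (?P<port>\d+)\. Sending cookies\."
)

# A short excerpt of the lines from the comment above.
SAMPLE = [
    "Jul 23 17:02:58 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.",
    "Jul 23 17:03:59 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.",
    "Jul 23 17:11:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.",
    "Jul 23 17:12:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.",
]


def parse_times(lines, year=2014):
    """Return a datetime for every matching line.

    The year is an assumption: the syslog timestamps do not include one.
    """
    times = []
    for line in lines:
        match = SYN_RE.match(line.strip())
        if match:
            stamp = f"{year} {match['ts']}"
            times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return times


def group_bursts(times, max_gap=timedelta(seconds=120)):
    """Group timestamps into bursts separated by more than max_gap.

    The 120-second threshold is an arbitrary choice for this sketch.
    """
    bursts = []
    for t in sorted(times):
        if bursts and t - bursts[-1][-1] <= max_gap:
            bursts[-1].append(t)
        else:
            bursts.append([t])
    return bursts


if __name__ == "__main__":
    for burst in group_bursts(parse_times(SAMPLE)):
        print(burst[0].time(), "to", burst[-1].time(), f"({len(burst)} messages)")
```

Comparing the start of each burst against the Nagios DOWN timestamps would show whether every network dropout is preceded by one, as described above.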
Comment 4•10 years ago
Triaging out of the MOC queue
Assignee: nobody → server-ops-webops
Component: Server Operations: MOC → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
QA Contact: dmoore → nmaul
Whiteboard: [id=nagios1.private.phx1.mozilla.com:346576] → [kanban:https://kanbanize.com/ctrl_board/4/596] [id=nagios1.private.phx1.mozilla.com:346576]
Comment 5•10 years ago
Jake shut this device down the other day. The ZLB was not removed from the pool first. This was preventing anyone from making any configuration changes to the cluster. I forcibly removed zlb11 from the pool, which unblocked others' work.

This means two things:

1) When the device is booted back up it will be confused and unable to communicate with the cluster. It will assume that the other nodes have died and it will attempt to take all of the VIPs for itself. This will obviously cause ARP issues.

2) This node will need to rejoin the pool. This is a simple matter of going to the configuration page and checking the box to join a new pool (forgetting its current configuration and state).

These steps will need to be taken quickly and at least after hours, but possibly during a maintenance window, as they will cause service blips.

In the future, when removing failing nodes from a cluster, the node should be removed from the pool from within the management console (web portal). This can be accomplished from any node. This will ensure that the cluster configurations do not get wedged in this way, especially as it is trivial to re-add a node to a pool.

Cheers everybody
We chkconfig'd zeus off when shutting it down, because it was impossible to remove it cleanly due to the constant network packet loss it was causing when it was up. So at least the server should be able to be started up without doing any harm, and it should permit joining the server to the cluster from the command line *before* starting it up, using the installer script, so that it grabs the current config before spinning anything up.
Comment 7•10 years ago
At the request of cyliang, I am removing the nagios checks for this hostname.
Comment 8•10 years ago
Here is a nagios check that was acked: bug 1081465.
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/596] [id=nagios1.private.phx1.mozilla.com:346576] → [kanban:https://webops.kanbanize.com/ctrl_board/2/75] [id=nagios1.private.phx1.mozilla.com:346576]
Comment 9•9 years ago
There are no plans to bring this node back into service, so I'm going to close this bug as INCOMPLETE.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 9 years ago
Resolution: --- → INCOMPLETE
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard