Closed Bug 1043058 Opened 10 years ago Closed 9 years ago

zlb11.ops.phx1.mozilla.com possible SYN Flood.

Tracking

(Not tracked)

Status:

RESOLVED INCOMPLETE

People

(Reporter: nagiosapi, Unassigned)

References

(URL)

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/75] [id=nagios1.private.phx1.mozilla.com:346576])

MOC Nagios API

Reporter

Description

10 years ago

Automated alert report from nagios1.private.phx1.mozilla.com:

Hostname: zlb11.ops.phx1.mozilla.com
State:    DOWN
Output:   PING CRITICAL - Packet loss = 100%

MOC Nagios API

Reporter

Comment 1

10 years ago

Automated alert recovery:

Hostname: zlb11.ops.phx1.mozilla.com
State:    UP
Output:   PING OK - Packet loss = 75%, RTA = 17.45 ms

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

Rick Bryce [:rbryce]

Comment 2

10 years ago

Tracking bug

Rick Bryce [:rbryce]

Updated

10 years ago

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Rick Bryce [:rbryce]

Updated

10 years ago

Summary: zlb11.ops.phx1.mozilla.com is DOWN: PING CRITICAL - Packet loss = 100% → zlb11.ops.phx1.mozilla.com possible SYN Flood.

:Atoll

Comment 3

10 years ago

zlb11 zeus service stopped, puppet disabled, chkconfig zeus disabled.

Each incident begins with the following syslog message:

Jul 23 17:02:58 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:03:59 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:11:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:12:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:13:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:14:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:15:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:16:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:17:57 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:24:11 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:25:11 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:26:11 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:27:11 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:28:11 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:29:11 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.
Jul 23 17:30:11 zlb11.ops.phx1.mozilla.com kernel: possible SYN flooding on port 80. Sending cookies.

And then the host loses all networking (ICMP, TCP, Zeus uni/multicast) for around a minute.
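The timestamps above fall into distinct runs separated by gaps of several minutes. A minimal sketch of how such kernel messages could be grouped into incidents follows. The function name `group_bursts`, the 90-second gap threshold, and the placeholder year are assumptions made for illustration (syslog lines carry no year), not anything used in the original investigation.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
# "Jul 23 17:02:58 host kernel: possible SYN flooding on port 80. Sending cookies."
SYN_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) kernel: "
    r"possible SYN flooding on port (?P<port>\d+)\."
)


def group_bursts(lines, year=2000, max_gap=timedelta(seconds=90)):
    """Group SYN-flood kernel messages into bursts.

    A new burst starts whenever the gap since the previous matching
    message exceeds max_gap. `year` is a placeholder because the
    syslog timestamp format omits it. Non-matching lines are skipped.
    """
    bursts = []
    for line in lines:
        m = SYN_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        if bursts and ts - bursts[-1][-1] <= max_gap:
            bursts[-1].append(ts)   # continues the current burst
        else:
            bursts.append([ts])     # gap too large: start a new burst
    return bursts


if __name__ == "__main__":
    host = "zlb11.ops.phx1.mozilla.com"
    msg = "kernel: possible SYN flooding on port 80. Sending cookies."
    sample = [f"Jul 23 {t} {host} {msg}"
              for t in ("17:02:58", "17:03:59", "17:11:57", "17:12:57")]
    for burst in group_bursts(sample):
        print(f"{burst[0]:%H:%M:%S} -> {burst[-1]:%H:%M:%S} "
              f"({len(burst)} messages)")
```

With the 90-second threshold, the sixteen log lines above would split into three bursts (starting at 17:02:58, 17:11:57 and 17:24:11). A different threshold would group them differently.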

Ashish Vijayaram [:ashish]

Comment 4

10 years ago

Triaging out of the MOC queue

Assignee: nobody → server-ops-webops

Component: Server Operations: MOC → WebOps: Other

Product: mozilla.org → Infrastructure & Operations

QA Contact: dmoore → nmaul

:kanban

Updated

10 years ago

Whiteboard: [id=nagios1.private.phx1.mozilla.com:346576] → [kanban:https://kanbanize.com/ctrl_board/4/596] [id=nagios1.private.phx1.mozilla.com:346576]

Jason Crowe [:jd]

Comment 5

10 years ago

Jake shut this device down the other day, but the ZLB was not removed from the pool first. This was preventing anyone from making any configuration changes to the cluster. I forcibly removed zlb11 from the pool, which unblocked others' work. This means two things:

1) When the device is booted back up, it will be confused and unable to communicate with the cluster. It will assume that the other nodes have died and will attempt to take all of the VIPs for itself. This will obviously cause ARP issues.

2) This node will need to rejoin the pool. This is a simple matter of going to the configuration page and checking the box to join a new pool (forgetting its current configuration and state).

These steps will need to be taken quickly, and at least after hours, but possibly during a maintenance window, as they will cause service blips.

In the future, when removing failing nodes from a cluster, the node should be removed from the pool from within the management console (web portal). This can be accomplished from any node. This will ensure that the cluster configurations do not get wedged in this way, especially as it is trivial to re-add a node to a pool.

Cheers everybody

:Atoll

Comment 6

10 years ago

We chkconfig'd zeus off when shutting it down, because it was impossible to remove it cleanly due to the constant network packet loss it was causing when it was up. So at least the server should be able to be started up without doing any harm, and it should permit joining the server to the cluster from the command line *before* starting it up, using the installer script, so that it grabs the current config before spinning anything up.

Ashish Vijayaram [:ashish]

Updated

10 years ago

Blocks: 1043831

Rick Bryce [:rbryce]

Comment 7

10 years ago

At the request of cyliang, I am removing the nagios checks for this hostname.

david garvey:dgarvey

Comment 8

10 years ago

Here is a nagios check that was acked: bug 1081465.

:kanban

Updated

9 years ago

Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/596] [id=nagios1.private.phx1.mozilla.com:346576] → [kanban:https://webops.kanbanize.com/ctrl_board/2/75] [id=nagios1.private.phx1.mozilla.com:346576]

Jake Maul [:jakem]

Comment 9

9 years ago

There are no plans to bring this node back into service, so I'm going to close this bug as INCOMPLETE.

Status: REOPENED → RESOLVED

Closed: 10 years ago → 9 years ago

Resolution: --- → INCOMPLETE

BMO Automation

Updated

5 years ago

Product: Infrastructure & Operations → Infrastructure & Operations Graveyard


Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)
