Closed
Bug 713685
Opened 13 years ago
Closed 13 years ago
dec 27 zeus outage
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bkero, Assigned: cshields)
References
Details
Description • 13 years ago (Reporter)
pp-zlb01 is unresponsive and not forwarding or load-balancing traffic as it should be. This in turn is causing pm-amo-zlb01 to page, since traffic forwarded to pp-zlb01 is failing:
09:46 < nagios-phx1> [114] sp-collector03.phx1:Zeus - Port 80 is CRITICAL: ERROR: No response from remote host pp-zlb01.phx.mozilla.net
09:46 < nagios-phx1> [116] sp-collector01.phx1:Zeus - Port 81 is CRITICAL: ERROR: No response from remote host pp-zlb01.phx.mozilla.net
10:33 < nagios-sjc1> [89] pm-zlb-amo01.nms:TRAP is CRITICAL: pm-zlb-amo01.nms - SERIOUS pools/snippets.mozilla.com - 443 - phx pooldied Pool has no back-end nodes responding
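The alerts above come from Nagios network checks against the Zeus nodes. As a rough illustration of what such a port check does (this is a sketch, not the actual plugin), a minimal Python version follows; it uses the standard Nagios plugin exit-code convention (0 = OK, 2 = CRITICAL), and the host and port simply mirror the alerts above:

#!/usr/bin/env python3
# Minimal sketch of a Nagios-style TCP port check, in the spirit of the
# "Zeus - Port 80" alerts above. Exit codes follow the Nagios plugin
# convention: 0 = OK, 2 = CRITICAL. Host/port mirror the alerts and are
# illustrative, not the real plugin's configuration.
import socket
import sys

HOST = "pp-zlb01.phx.mozilla.net"
PORT = 80
TIMEOUT = 10  # seconds before we declare the host unresponsive

try:
    with socket.create_connection((HOST, PORT), timeout=TIMEOUT):
        print(f"OK: {HOST} responds on port {PORT}")
        sys.exit(0)
except OSError as exc:
    print(f"CRITICAL: ERROR: No response from remote host {HOST} ({exc})")
    sys.exit(2)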
Comment 3 • 13 years ago (Assignee)
Quick update:
The load balancers are adding 0.0.0.0 entries to their group_ip address mappings until the servers run out of memory. (This is a new issue that appears to be unrelated to the outages we have had in the past.)
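This kind of unbounded growth could be caught before memory runs out by sampling the size of the mapping table over time. A rough Python sketch of that idea, assuming the mappings can be dumped to a text file with one entry per line; the dump path, threshold, and interval are hypothetical, and Zeus's real diagnostics interface may differ:

#!/usr/bin/env python3
# Hypothetical watchdog for the failure mode described above: count bogus
# 0.0.0.0 entries in a periodic dump of the group_ip address mappings and
# complain while the count keeps growing. Dump path, threshold, and
# interval are assumptions for illustration.
import time

DUMP_PATH = "/var/tmp/group_ip_mappings.txt"  # hypothetical, one mapping per line
THRESHOLD = 10_000  # alert once this many bogus entries have accumulated
INTERVAL = 60       # seconds between samples

def count_bogus_entries() -> int:
    with open(DUMP_PATH) as fh:
        return sum(1 for line in fh if line.startswith("0.0.0.0"))

previous = 0
while True:
    current = count_bogus_entries()
    if current > previous:
        print(f"0.0.0.0 group_ip mappings grew: {previous} -> {current}")
    if current >= THRESHOLD:
        print(f"ALERT: {current} bogus mappings; heading for memory exhaustion")
    previous = current
    time.sleep(INTERVAL)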
Comment 4 • 13 years ago (Assignee)
Adding an 8th and 9th Zeus node to the cluster earlier today (to eventually migrate all traffic to 10g nodes) triggered a bug in Zeus that appears once a cluster reaches 8 or more nodes. Disabling and removing the new nodes, which was the first thing we tried, did nothing to help.
We still need to move forward with replacing our 1g Zeus nodes with 10g nodes, so instead of adding capacity we will have to replace capacity. To do this we will temporarily block VAMO in an hour, relieving enough load to take a couple of Zeus nodes down and swap in 10g nodes without impacting every other site in phx.
These upgrades are necessary to have the capacity to handle VAMO once we start pushing the Firefox 9 updates.
Will post again here when we are "all clear" for VAMO.
Updated • 13 years ago (Assignee)
Assignee: server-ops → cshields
Comment 5 • 13 years ago (Assignee)
Current status:
We are on 3 10g LB nodes, and VAMO bits are flowing again. The problems in the migration have all been cleared up, and work is moving forward on adding 3 more 10g nodes.
We are in the clear, but soon we will need to give back the first node of the original 3 (it was borrowed), leaving us with 5. That reclaiming process may cause a brief 5-10 second outage and a handful of tracebacks as its IPs move to another LB node.
I'll poke releng; we are okay for updates to flow again.
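Clients that hit a traffic IP during that 5-10 second window would see connection failures; a short retry loop with a small delay is enough to ride it out. A minimal Python 3 sketch using only the standard library (the URL is a placeholder, not a specific endpoint from this bug):

#!/usr/bin/env python3
# Minimal retry-with-backoff sketch for riding out a short (5-10 second)
# window while a traffic IP moves between load balancer nodes. The URL is
# a placeholder.
import time
import urllib.request

URL = "https://example.mozilla.org/health"  # placeholder endpoint

def fetch_with_retry(url: str, attempts: int = 5, delay: float = 3.0) -> bytes:
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read()
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            if attempt == attempts:
                raise
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)  # 5 attempts x 3 s comfortably spans the window

body = fetch_with_retry(URL)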
Comment 6 • 13 years ago (Assignee)
(Sorry all, I had this written out last night and didn't hit save.)
As we stand right now, production DB traffic has been moved to its own pair of load balancers that handles no frontend traffic. The frontend load-balancing cluster is now 6 10g nodes. We still need to pull one of those out; that is planned for Thursday night.
Summary: pp-zlb01 outage → dec 27 zeus outage
Comment 7 • 13 years ago (Assignee)
Today's update:
We rode through peak traffic with no problems, frontend or backend. Peak traffic for VAMO also included 10% unthrottling of the Firefox 9 updates. Soon that will be unthrottled to 100%, which I'm confident we are prepared for; that is tracked in bug 713964.
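For background, "throttling" here means serving the new update to only a percentage of update checks. A common way to implement such a percentage deterministically is to hash a stable client ID into 100 buckets; the sketch below illustrates that general technique only and is not Mozilla's actual AUS implementation:

#!/usr/bin/env python3
# Sketch of percentage-based update throttling: offer the update to a
# stable N% slice of clients by hashing a per-client ID into 100 buckets.
# This illustrates the general technique; it is not Mozilla's AUS code.
import hashlib

def offered_update(client_id: str, throttle_percent: int) -> bool:
    """Deterministically offer the update to ~throttle_percent% of clients."""
    bucket = int(hashlib.sha256(client_id.encode()).hexdigest(), 16) % 100
    return bucket < throttle_percent

# At 10%, roughly one client in ten is offered the update; at 100%, everyone.
clients = [f"client-{i}" for i in range(1000)]
print(sum(offered_update(c, 10) for c in clients), "of", len(clients), "offered at 10%")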
Comment 8 • 13 years ago (Assignee)
We have continued making improvements in Zeus and have since moved VAMO away from the other sites in this Zeus cluster. Closing this out.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated • 10 years ago
Product: mozilla.org → mozilla.org Graveyard