Closed Bug 804658 Opened 12 years ago Closed 12 years ago

Brief outage of services on genericrhel6 prod pool

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task, P1)

x86
macOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bburton, Assigned: bburton)

References

Details

(Whiteboard: [service interrupt])

A failover of a component in the SeaMicro chassis that currently houses generic1-5.webapp.phx1 caused a brief outage of services hosted on the genericrhel6 cluster, including:

* wiki.mozilla.org
* blog.mozilla.org
* tbpl.mozilla.org
* etc

Will post additional details shortly.
Assignee: server-ops-webops → bburton
Priority: -- → P1
Whiteboard: [service interrupt]
Two factors contributed to this outage:

1. Our blade server was set to draining for maintenance work and was never put back into rotation.
2. All five of our SeaMicro nodes were in the single chassis that had a component failover.

To rectify this, we're taking the following actions:

1. We've re-enabled the blade and confirmed it's good to serve prod traffic.
2. We're going to move generic4-5 to another SeaMicro chassis so that a single chassis failure can't cause an outage.

We'll update this bug once #2 is complete and mark it RESOLVED FIXED.
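
For illustration, the single-chassis exposure described above can be checked mechanically. This is only a rough sketch: the chassis labels (`seamicro-a`, `seamicro-b`, `blade`), the node name `blade1`, and the function `chassis_report` are hypothetical and not taken from our inventory.

```python
from collections import defaultdict

def chassis_report(placement, in_rotation):
    """Group in-rotation nodes by chassis.

    Returns (groups, single_point), where single_point is True when every
    node currently serving traffic sits in one chassis (or none serve at all).
    """
    groups = defaultdict(list)
    for node, chassis in placement.items():
        if node in in_rotation:
            groups[chassis].append(node)
    return dict(groups), len(groups) <= 1

# Hypothetical layout at the time of the outage: all five generic nodes in
# one chassis, and the blade draining (so not in rotation).
before = {f"generic{i}": "seamicro-a" for i in range(1, 6)}
before["blade1"] = "blade"
serving_before = {f"generic{i}" for i in range(1, 6)}

# Hypothetical layout after both remediation steps.
after = dict(before, generic4="seamicro-b", generic5="seamicro-b")
serving_after = set(after)

print(chassis_report(before, serving_before)[1])  # single point of failure?
print(chassis_report(after, serving_after)[1])
```

With the assumed layouts, the first call reports a single point of failure and the second does not.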
Depends on: 804669
We used to have a Nagios alert that would flag any Zeus backend nodes not marked as active. Would this have caught that?
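
As a rough sketch of what such a check could look like (not the alert we used to run), the logic below follows the standard Nagios plugin return codes (0 OK, 1 WARNING, 2 CRITICAL). How node states are fetched from the load balancer is left out; the `node_states` mapping, the `min_active` threshold, the node name `blade1`, and the function name `check_pool` are all assumptions for illustration.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2

def check_pool(node_states, min_active=2):
    """Return (code, message) for a pool given a {node: state} mapping.

    Any node whose state is not "active" (e.g. "draining", "disabled")
    raises at least a WARNING; too few active nodes is CRITICAL.
    """
    inactive = sorted(n for n, s in node_states.items() if s != "active")
    active_count = len(node_states) - len(inactive)
    if active_count < min_active:
        return CRITICAL, (
            f"CRITICAL: only {active_count} active node(s); "
            f"not active: {', '.join(inactive)}"
        )
    if inactive:
        return WARNING, f"WARNING: not active: {', '.join(inactive)}"
    return OK, f"OK: all {active_count} nodes active"

# Hypothetical pool state matching the situation in comment 1:
# five generic nodes active, the blade left draining after maintenance.
pool = {f"generic{i}": "active" for i in range(1, 6)}
pool["blade1"] = "draining"

code, message = check_pool(pool)
print(code, message)
```

Under these assumptions, a blade left in draining would surface as a WARNING even while the pool as a whole still looks healthy.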
(In reply to Brandon Burton [:solarce] from comment #1)
>
> 2. We're going to move generic4-5 to another seamicro chassis to prevent
> chassis failure causing an outage

This is now complete per bug 804669.
Per :cturra's work, RF!
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard