Closed Bug 804658 Opened 12 years ago Closed 12 years ago

Brief outage of services on genericrhel6 prod pool

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: bburton, Assigned: bburton)

References

Details

(Whiteboard: [service interrupt])

Brandon Burton [:solarce]

Assignee

Description

•

12 years ago

Failover of a component in the seamicro chassis that currently houses generic1-5.webapp.phx1 caused a brief outage of services hosted on the genericrhel6 cluster, including

* wiki.mozilla.org
* blog.mozilla.org
* tbpl.mozilla.org
* etc

Will post additional details shortly

Brandon Burton [:solarce]

Assignee

Updated

•

12 years ago

Assignee: server-ops-webops → bburton

Priority: -- → P1

Whiteboard: [service interrupt]

Brandon Burton [:solarce]

Assignee

Comment 1

•

12 years ago

Two factors played into this outage

1. Our blade server was set to draining due to maintenance work and was never put back into rotation
2. All five of our seamicro nodes were in the single chassis that had a component failover

To rectify this we're taking the following actions

1. We've re-enabled the blade and confirmed it's good to serve prod traffic
2. We're going to move generic4-5 to another seamicro chassis so that a single chassis failure can't take out the whole pool

We'll update this bug once #2 is complete and RF it
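
As an illustration of the second factor above, here is a minimal sketch of a check that flags a pool whose nodes all sit in one chassis. The node-to-chassis mapping and the chassis labels (`chassis-a`, `chassis-b`) are hypothetical; nothing in this bug describes how that inventory is actually stored.

```python
from collections import Counter


def chassis_spread(node_to_chassis):
    """Count how many pool nodes live in each chassis."""
    return dict(Counter(node_to_chassis.values()))


def single_chassis_risk(node_to_chassis):
    """Return True if every node in the pool shares one chassis (or the pool is empty)."""
    return len(set(node_to_chassis.values())) <= 1


# Hypothetical layout before the move: all five nodes in one chassis.
before = {"generic%d" % i: "chassis-a" for i in range(1, 6)}

# Hypothetical layout after moving generic4-5 to a second chassis.
after = dict(before, generic4="chassis-b", generic5="chassis-b")
```

With these assumed layouts, `single_chassis_risk(before)` is true and `single_chassis_risk(after)` is false, which is the property the move in action #2 is meant to establish.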

Chris Turra [:cturra]

Updated

•

12 years ago

Depends on: 804669

matthew zeier [:mrz]

Comment 2

•

12 years ago

We used to have a nagios alert that would flag any zeus backend nodes not marked as active. Would this have caught that?
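
For reference, the kind of check described here could look roughly like the sketch below. It is not the original alert: the state names (`active`, `draining`) and the node names are assumptions, and how the states are fetched from the load balancer is left out entirely. Only the exit codes follow the standard Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL).

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_pool_nodes(node_states):
    """Return (exit_code, message) for a mapping of node name -> state.

    Any node whose state is not "active" (an assumed state name) is reported.
    """
    if not node_states:
        return CRITICAL, "CRITICAL: pool has no nodes"
    inactive = sorted(name for name, state in node_states.items() if state != "active")
    if inactive:
        return WARNING, "WARNING: nodes not active: " + ", ".join(inactive)
    return OK, "OK: all %d nodes active" % len(node_states)
```

Fed a pool where a hypothetical `blade1` is still `draining`, this returns a WARNING that names the node, which is the condition that went unnoticed before the outage.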

Chris Turra [:cturra]

Comment 3

•

12 years ago

(In reply to Brandon Burton [:solarce] from comment #1)
>
> 2. We're going to move generic4-5 to another seamicro chassis to prevent
> chassis failure causing an outage

This is now complete per bug 804669.

Brandon Burton [:solarce]

Assignee

Comment 4

•

12 years ago

Per :cturra's work, RF!

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

11 years ago

Component: Server Operations: Web Operations → WebOps: Other

Product: mozilla.org → Infrastructure & Operations

BMO Automation

Updated

•

5 years ago

Product: Infrastructure & Operations → Infrastructure & Operations Graveyard


Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task, P1)

