Status

Product: mozilla.org Graveyard
Component: Server Operations
Priority: --
Severity: major
Status: RESOLVED FIXED
Reported: 6 years ago
Modified: 3 years ago

People

(Reporter: ericz, Assigned: ericz)

Tracking

Details

(Whiteboard: [buildduty][outage])

(Assignee)

Description

6 years ago
One or more racks went offline in scl3 for a short period.
(Assignee)

Comment 1

6 years ago
A switch was added to rack 102-1, which appears to have caused this hiccup.  Normally doing so does not cause any problems.  Another rack had a switch added at the same time, so we may see another burp within the next few minutes.
Assignee: server-ops → eziegenhorn
Severity: normal → major
Group: infra

Comment 2

6 years ago
This impacted HG (bug 758022), but I've already confirmed it's not a problem and closed that bug.

RelEng also saw some MySQL Nagios alerts, but things auto-recovered.
Whiteboard: [buildduty][outage]

Comment 3

6 years ago
Root cause was discovered and confirmed by Cisco as a known issue.

During the fairly routine process of bringing a new c7000 chassis online, the software auto-upgrade feature of the embedded Cisco blade switches caused high CPU load on the switches in existing production chassis. While CPU load was high, servers would have experienced significant packet loss (but not a complete outage).

The impacted servers were located in scl3 rack 102-1, which includes nagios and hg. On Cisco's advice, we have modified our operations procedures to prevent this from occurring in the future.

Closing this bug, as the event was limited and maintenance work has stopped.
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
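
For illustration: the failure mode above is heavy packet loss rather than a hard outage, which simple up/down checks can miss. Below is a minimal probe sketch in Python; the host names and loss threshold are made-up assumptions, not values from this incident.

    import re
    import subprocess

    # Hypothetical hosts and threshold -- assumptions for this sketch.
    HOSTS = ["hg1.example.net", "nagios1.example.net"]
    LOSS_THRESHOLD = 5.0  # percent

    def packet_loss(host, count=20):
        """Return percent packet loss to host, using the system ping."""
        out = subprocess.run(["ping", "-c", str(count), "-q", host],
                             capture_output=True, text=True).stdout
        m = re.search(r"([\d.]+)% packet loss", out)
        return float(m.group(1)) if m else 100.0

    for host in HOSTS:
        loss = packet_loss(host)
        if loss > LOSS_THRESHOLD:
            print(f"{host}: {loss:.0f}% loss -- degraded but possibly not down")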

Comment 4

6 years ago
At around 15:30, some hg repack jobs in the beta release failed with:

abort: HTTP Error 500: Internal Server Error

Repacks are being restarted.
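
For illustration, a minimal sketch of restart logic that retries on a transient HTTP 500. The repack command shown is a stand-in (the real invocation isn't given in this bug) and the retry counts are arbitrary.

    import subprocess
    import time

    # Stand-in command; the real repack invocation is not shown in this bug.
    REPACK_CMD = ["hg", "pull", "https://hg.example.org/releases/mozilla-beta"]

    def run_with_retries(cmd, attempts=3, delay=30):
        """Run cmd, retrying when it fails with a transient HTTP 500."""
        for attempt in range(1, attempts + 1):
            result = subprocess.run(cmd, capture_output=True, text=True)
            if result.returncode == 0:
                return result
            transient = "HTTP Error 500" in result.stderr
            if attempt == attempts or not transient:
                result.check_returncode()  # raises CalledProcessError
            time.sleep(delay * attempt)   # back off a little more each time

    run_with_retries(REPACK_CMD)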

Comment 5

6 years ago
At 15:28, hgweb problems hit the developers who were in the middle of trying to unmuck the earlier Zeus/MySQL glitch:

[15:28]  <ehsan-extremelybusy> bear: fwiw, I can't access hg again
[15:28]  <ehsan-extremelybusy> this time hgweb
[15:29]  <bear> ehsan - give it 5 min
[15:29]  <bear> dcops is fixing the issue from before 
[15:29]  <ehsan-extremelybusy> ok
[15:30]  <bear> if it's still happening in 5 then poke me again

Comment 6

6 years ago
So - is this still an issue or not?

Comment 7

6 years ago
No, the issues were short-lived.  The above is just noting, for the many folks watching, which items were associated with this event.

Hmm, I can see how I left that hanging - apologies.

Comment 8

6 years ago
The network issue here exposed an "all eggs in one basket" issue.  All the hg servers are in the same physical HP c7000 chassis and rack.  Had hg been split between chassis/switch stacks, this would have been a non-event.

Bug 758094 filed to track.
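
For illustration, a minimal sketch of the kind of placement audit bug 758094 implies, flagging a service whose hosts all share one rack or one chassis. The inventory values are made up for this sketch.

    # Made-up inventory: host -> (rack, chassis). Real data would come from
    # an asset database; these names are assumptions, not Mozilla's layout.
    INVENTORY = {
        "hg1": ("scl3-102-1", "c7000-a"),
        "hg2": ("scl3-102-1", "c7000-b"),
        "hg3": ("scl3-102-1", "c7000-c"),
    }

    def single_points_of_failure(inventory):
        """Flag any rack or chassis shared by every host of a service."""
        racks = {rack for rack, _ in inventory.values()}
        chassis = {ch for _, ch in inventory.values()}
        spofs = []
        if len(racks) == 1:
            spofs.append(f"all hosts share rack {next(iter(racks))}")
        if len(chassis) == 1:
            spofs.append(f"all hosts share chassis {next(iter(chassis))}")
        return spofs

    for warning in single_points_of_failure(INVENTORY):
        print("SPOF:", warning)

With this made-up layout, the check passes on chassis diversity but flags the shared rack.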

Comment 9

6 years ago
(In reply to matthew zeier [:mrz] from comment #8)
> The network issue here exposed an "all eggs in one basket" issue.  All the
> hg servers are in the same physical HP c7000 chassis and rack.  Had hg been
> split between chassis/switch stacks, this would have been a non-event.

This is incorrect: all of the hg servers are NOT in the same physical HP chassis.  They are, however, in the same rack, split across three physical chassis.  So, just to clear this up: we intentionally spread them out across chassis to prevent chassis issues from causing a problem.
Product: mozilla.org → mozilla.org Graveyard