Closed Bug 758035 Opened 12 years ago Closed 12 years ago

Network outage in scl3

Categories

(mozilla.org Graveyard :: Server Operations, task)

Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ericz, Assigned: ericz)

References

Details

(Whiteboard: [buildduty][outage])

One or more racks went offline in scl3 for a short period.
A switch was added to rack 102-1, which appears to have caused this hiccup. Adding a switch normally does not cause any problems. Another rack had a switch added at the same time, so we may see another burp within the next few minutes.
Assignee: server-ops → eziegenhorn
Severity: normal → major
Group: infra
This impacted hg (bug 758022), but I've already confirmed it's not a problem and closed that bug.

Releng also saw some MySQL Nagios alerts, but things auto-recovered.
Whiteboard: [buildduty][outage]
Root cause was discovered and confirmed by Cisco as a known issue.

During the fairly routine process of bringing a new c7000 chassis online, the software auto-upgrade feature of the embedded Cisco blade switches resulted in high CPU load for existing production chassis. During high CPU load, servers would have experienced significant packet loss (but not a complete outage).

The impacted servers were located in scl3 rack 102-1, which includes nagios and hg. At Cisco's advice, we have modified our operations procedures to prevent this from occurring in the future.

Closing this bug, as the event was limited and maintenance work has stopped.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
At around 15:30 we had some hg repack jobs in the beta release fail with:

abort: HTTP Error 500: Internal Server Error

Repacks are being restarted.
At 15:28, hgweb problems hit the developers who were in the middle of trying to unmuck the earlier zeus/mysql glitch:

[15:28]  <ehsan-extremelybusy> bear: fwiw, I can't access hg again
[15:28]  <ehsan-extremelybusy> this time hgweb
[15:29]  <bear> ehsan - give it 5 min
[15:29]  <bear> dcops is fixing the issue from before 
[15:29]  <ehsan-extremelybusy> ok
[15:30]  <bear> if it's still happening in 5 then poke me again
Blocks: 744594
So - is this still an issue or not?
No, the issues were short-lived. The above just notes, for the many folks watching, which items were associated with this event.

Hmm, I can see how I left it hanging. Apologies.
The network issue here exposed an "all eggs in one basket" issue.  All the hg servers are in the same physical HP c7000 chassis and rack.  Had hg been split between chassis/switch stacks, this would have been a non-event.

Bug 758094 filed to track.
(In reply to matthew zeier [:mrz] from comment #8)
> The network issue here exposed an "all eggs in one basket" issue.  All the
> hg servers are in the same physical HP c7000 chassis and rack.  Had hg been
> split between chassis/switch stacks, this would have been a non-event.

This is incorrect. All of the hg servers are NOT in the same physical HP chassis. They are, however, in the same rack, split across 3 physical chassis. So, just to clear this up: we intentionally spread them out across chassis to prevent chassis issues from causing a problem.
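
For illustration only, here is a minimal sketch of how a placement check for this kind of "same rack, different chassis" situation might look. The host names, chassis labels, and the `shared_failure_domains` helper are all hypothetical and do not reflect the real scl3 inventory or any existing tooling. It simply reports which failure-domain levels every host of a service has in common.

```python
# Hypothetical inventory: host -> (rack, chassis).
# Host and chassis names are made up for illustration; only the rack
# label 102-1 comes from this bug.
INVENTORY = {
    "hg-a": ("102-1", "chassis-1"),
    "hg-b": ("102-1", "chassis-2"),
    "hg-c": ("102-1", "chassis-3"),
}


def shared_failure_domains(inventory):
    """Return the levels ("rack", "chassis") that all hosts share."""
    racks = {rack for rack, _ in inventory.values()}
    # A chassis is identified by (rack, chassis) so that identical
    # chassis labels in different racks are not confused.
    chassis = {(rack, ch) for rack, ch in inventory.values()}

    shared = []
    if len(racks) == 1:
        shared.append("rack")
    if len(chassis) == 1:
        shared.append("chassis")
    return shared


if __name__ == "__main__":
    print(shared_failure_domains(INVENTORY))
```

With the sample data above, the hosts are spread across chassis but still share a rack, which is the situation described in this comment.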
Product: mozilla.org → mozilla.org Graveyard