Closed
Bug 758035
Opened 12 years ago
Closed 12 years ago
Network outage in scl3
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: ericz, Assigned: ericz)
References
Details
(Whiteboard: [buildduty][outage])
One or more racks went offline in scl3 for a short period.
Comment 1•12 years ago
A switch was added to rack 102-1, which appears to have caused this hiccup. Normally doing so does not cause any problems. Another rack had a switch added at the same time, so we may see another burp within the next few minutes.
Assignee: server-ops → eziegenhorn
Severity: normal → major
Updated•12 years ago
Group: infra
Comment 2•12 years ago
This impacted hg (bug 758022), but I've already confirmed it's not a problem and closed that bug. Releng also saw some MySQL Nagios alerts, but things auto-recovered.
Whiteboard: [buildduty][outage]
Comment 3•12 years ago
Root cause was discovered and confirmed by Cisco as a known issue. During the fairly routine process of bringing a new c7000 chassis online, the software auto-upgrade feature of the embedded Cisco blade switches resulted in high CPU load for existing production chassis. During high CPU load, servers would have experienced significant packet loss (but not a complete outage). The impacted servers were located in scl3 rack 102-1, which includes nagios and hg. At Cisco's advice, we have modified our operations procedures to prevent this from occurring in the future. Closing this bug, as the event was limited and maintenance work has stopped.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment 4•12 years ago
At around 15:30 we had some hg repack jobs fail in the beta release with "abort: HTTP Error 500: Internal Server Error". The repacks are being restarted.
Comment 5•12 years ago
At 15:28 hgweb impacted the developers who were in the middle of trying to unmuck the earlier zeus/mysql glitch:
[15:28] <ehsan-extremelybusy> bear: fwiw, I can't access hg again
[15:28] <ehsan-extremelybusy> this time hgweb
[15:29] <bear> ehsan - give it 5 min
[15:29] <bear> dcops is fixing the issue from before
[15:29] <ehsan-extremelybusy> ok
[15:30] <bear> if it's still happening in 5 then poke me again
Comment 6•12 years ago
So - is this still an issue or not?
Comment 7•12 years ago
No, the issues were short-lived. The comments above just note, for the many folks watching, which items were associated with this event. Hmm, I can see how I left it hanging; apologies.
Comment 8•12 years ago
The network issue here exposed an "all eggs in one basket" issue. All the hg servers are in the same physical HP c7000 chassis and rack. Had hg been split between chassis/switch stacks, this would have been a non-event. Bug 758094 filed to track.
Comment 9•12 years ago
(In reply to matthew zeier [:mrz] from comment #8)
> The network issue here exposed an "all eggs in one basket" issue. All the
> hg servers are in the same physical HP c7000 chassis and rack. Had hg been
> split between chassis/switch stacks, this would have been a non-event.

This is incorrect. All of the hg servers are NOT in the same physical HP chassis. They are, however, in the same rack, split across 3 physical chassis. So, just to clear this up: we intentionally spread them out across chassis to prevent chassis issues from causing a problem.
Updated•9 years ago
Product: mozilla.org → mozilla.org Graveyard