Closed
Bug 773682
Opened 13 years ago
Closed 12 years ago
kvms in scl3 and mtv1 went down
Categories
(mozilla.org Graveyard :: Server Operations, task)
mozilla.org Graveyard
Server Operations
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: ashish, Assigned: cshields)
References
Details
(Whiteboard: [outage])
09:16:42 < nagios-rele> | [73] kvm1.private.releng.scl3 is DOWN: PING CRITICAL - Packet loss = 100%
Aj and arr are working on moving VMs to a different node.
Reporter
Updated•13 years ago
Summary: kvm1.private.releng.scl3 went down → kvms in scl3 and mtv1 went down
Comment 1•13 years ago
General update:
The servers are up.
Comment 2•13 years ago
Amy is working on repairing problems on the hosts themselves; the interface issue is resolved.
Group: infra
Comment 4•13 years ago
This issue impacted any kvm nodes that were controlled by the sysadmins' puppet. Errors in the bond0 (and, in at least one case, br0) config files took down the bonded network interface on machines as they ran puppet.
A fix was checked in to puppet, and hosts that had dropped off the network were fixed manually and brought back online.
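For anyone poking at an affected host, a quick way to see the state of the bond and bridge involved (a sketch only; it assumes the standard RHEL-style tooling on these hosts):
# is the bond up at all?
ip link show bond0
# the kernel's view of the bond mode and its slave interfaces
cat /proc/net/bonding/bond0
# br0 is the bridge that carries VM traffic over bond0 on the kvm hosts
brctl show br0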
Once the machines were reachable again, I did the following for the releng kvm nodes:
- If the master node went down, logged into a different node and ran:
  gnt-cluster master-failover
- Reactivated the sets of degraded disks:
  for i in `gnt-instance list | awk '{print $1}' | grep -v Instance`; do gnt-instance activate-disks $i; done
- Rebuilt the secondary disks for each node, in case they fell out of sync while the two kvm nodes were disconnected:
  for i in `gnt-instance list | awk '{print $1}' | grep -v Instance`; do gnt-instance replace-disks -s $i; done
The latter is still running and will take some time (probably 5+ hours) as it cycles through all of the VMs. I have downtimed the drbd checks on the releng kvm servers in scl3 and mtv1 for 5 hours for now.
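A rough way to confirm the DRBD pairs are healthy again once that loop finishes (a sketch, assuming the standard Ganeti CLI on the kvm nodes; the instance name is a placeholder):
# overall cluster sanity check; flags degraded or mismatched disks
gnt-cluster verify
# per-instance disk details; no disk should be reported as degraded
gnt-instance info <instance-name>
# on the node itself, DRBD resources should show Connected / UpToDate
cat /proc/drbd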
Comment 5•13 years ago
How do we avoid this in the future?
Comment 6•13 years ago
notes:
- trees closed at 0938
- trees opened at 1051
Whiteboard: [buildduty][outage]
Updated•13 years ago
Assignee: server-ops → afernandez
Comment 7•13 years ago
Also, this impacted more than just releng. Metrics also lost machines, and I'm not sure if anyone in metrics/IT has reconstructed the disks on those VMs yet.
Comment 8•13 years ago
As far as I know, it only affected kvm1.generic.metrics.scl3 and no other alerts came in.
Perhaps :cyliang could verify if everything looks ok on their end?
Comment 9•13 years ago
The secondary rebuilds of all releng kvm vms in mtv1 and scl3 have completed successfully.
Comment 10•13 years ago
(In reply to Aki Sasaki [:aki] from comment #5)
> How do we avoid this in the future?
Who can best answer this?
Assignee
Comment 11•13 years ago
(In reply to John O'Duinn [:joduinn] from comment #10)
> (In reply to Aki Sasaki [:aki] from comment #5)
> > How do we avoid this in the future?
>
> Who can best answer this?
I can, but I don't have the answer yet.
Updated•13 years ago
Assignee: afernandez → cshields
Updated•13 years ago
Whiteboard: [buildduty][outage] → [outage]
Assignee
Comment 12•12 years ago
(In reply to Aki Sasaki [:aki] from comment #5)
> How do we avoid this in the future?
More puppet testing in a stage environment, which is something we can't easily resource right now. This was caused by a one-off puppet problem - not one we can easily build to prevent specifically in the future.
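For what it's worth, even short of a full stage environment, a cheap canary check on one kvm node would likely catch this class of error before it propagated (a sketch only; the manifest path is a placeholder):
# syntax-check the changed manifest before it lands
puppet parser validate modules/network/manifests/init.pp
# dry run on a single canary host: reports what would change without rewriting the interface configs
puppet agent --test --noop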
KVM has been stabilized, closing this out.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated•10 years ago
Product: mozilla.org → mozilla.org Graveyard