Closed Bug 773682 Opened 12 years ago Closed 12 years ago

kvms in scl3 and mtv1 went down

Categories

(mozilla.org Graveyard :: Server Operations, task)

task
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ashish, Assigned: cshields)

References

Details

(Whiteboard: [outage])

09:16:42 < nagios-rele> | [73] kvm1.private.releng.scl3 is DOWN: PING CRITICAL - Packet loss = 100%

Aj and arr are working on moving VMs to a different node
Summary: kvm1.private.releng.scl3 went down → kvms in scl3 and mtv1 went down
Blocks: 773692
General update:

The servers are up.
Amy is working on repairing problems on the hosts themselves.  The interface issue is resolved.
Group: infra
This issue impacted any kvm nodes that were controlled by the sysadmins' puppet.  Errors in the bond0 (and, in at least one case, br0) config files took down the bonded network interface on those machines as they ran puppet.

A fix was checked in to puppet, and hosts that had dropped off the network were fixed manually and brought back online.
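
The bug doesn't record the exact manual steps, but a minimal sketch of that kind of recovery (assuming RHEL-style network scripts and console access, since the affected hosts were unreachable over the network) would be roughly:

puppet agent --test        # pull the corrected bond0/br0 config now that the fix is in
ifdown bond0; ifup bond0   # re-read the repaired config and bring the bond back up
ip addr show bond0         # confirm the interface is up with the expected address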

Once the machines were reachable again, I did the following for the releng kvm nodes:

If the master node went down, I logged into a different node and ran:

gnt-cluster master-failover
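
(Not part of the original steps, but for context: gnt-cluster getmaster reports which node currently holds the master role, so it makes an easy sanity check around the failover.)

gnt-cluster getmaster        # which node the cluster currently considers master
gnt-cluster master-failover  # run on the node that should take over as master
gnt-cluster getmaster        # should now report the new master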

reactivated the sets of degraded disks:

for i in `gnt-instance list | awk '{print $1}' | grep -v Instance`; do gnt-instance activate-disks $i; done
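
(Side note, not from this bug: Ganeti can also find and reactivate degraded DRBD disks cluster-wide in one pass, which avoids looping by hand:)

gnt-cluster verify-disks       # checks every instance and reactivates degraded disks
gnt-instance info <instance>   # per-instance detail, including the DRBD devices behind each disk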

rebuilt the secondary disks for each instance in case they fell out of sync while the two kvm nodes were disconnected:

for i in `gnt-instance list| awk '{print $1}'|grep -v Instance`; do gnt-instance replace-disks -s $i; done
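
(The bug doesn't say how the rebuilds were monitored; two standard options, the second assuming DRBD-backed disks, are:)

gnt-job list           # the replace-disks jobs show up here on the master node
watch cat /proc/drbd   # on a kvm node, shows per-device resync progress and speed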

The latter is still running and will take some time (probably 5+ hours) as it cycles through all of the vms.  I have downtimed the drbd checks on the releng kvm servers in scl3 and mtv1 for 5 hours for now.
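
(For reference only - the downtime was presumably set through the Nagios UI. The same thing can be done through the external command file; the command-file path and service name below are assumptions, not taken from this bug:)

now=$(date +%s)
# 5-hour (18000 s) fixed downtime for the drbd check on one host; repeat per kvm node
printf "[%s] SCHEDULE_SVC_DOWNTIME;kvm1.private.releng.scl3;drbd;%s;%s;1;0;18000;arr;bug 773682\n" \
  "$now" "$now" "$((now + 18000))" > /var/spool/nagios/cmd/nagios.cmd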
How do we avoid this in the future?
notes:
 - trees closed at 0938
 - trees opened at 1051
Whiteboard: [buildduty][outage]
Assignee: server-ops → afernandez
Also, this impacted more than just releng.  Metrics also lost machines, and I'm not sure if anyone in metrics/IT has reconstructed the disks on those vms yet.
As far as I know, it only affected kvm1.generic.metrics.scl3 and no other alerts came in.

Perhaps :cyliang could verify if everything looks ok on their end?
The secondary rebuilds of all releng kvm vms in mtv1 and scl3 have completed successfully.
(In reply to Aki Sasaki [:aki] from comment #5)
> How do we avoid this in the future?

Who can best answer this?
(In reply to John O'Duinn [:joduinn] from comment #10)
> (In reply to Aki Sasaki [:aki] from comment #5)
> > How do we avoid this in the future?
> 
> Who can best answer this?

I can, but I don't have the answer yet.
Assignee: afernandez → cshields
Whiteboard: [buildduty][outage] → [outage]
(In reply to Aki Sasaki [:aki] from comment #5)
> How do we avoid this in the future?

More puppet testing in a staging environment.  That's something we can't easily resource right now.  This was caused by a one-off puppet problem - not one we can easily build specific safeguards against in the future.

KVM has been stabilized, closing this out.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard