Closed Bug 773682 Opened 12 years ago Closed 12 years ago

kvms in scl3 and mtv1 went down

Categories

(mozilla.org Graveyard :: Server Operations, task)

task
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ashish, Assigned: cshields)

References

Details

(Whiteboard: [outage])

09:16:42 < nagios-rele> | [73] kvm1.private.releng.scl3 is DOWN: PING CRITICAL - Packet loss = 100%

Aj and arr are working on moving VMs to a different node
Summary: kvm1.private.releng.scl3 went down → kvms in scl3 and mtv1 went down
Blocks: 773692
General update:

The servers are up.
Amy is working on repairing problems on the hosts themselves.  The interface issue is resolved.
Group: infra
This issue impacted any kvm nodes that were controlled by the sysadmins' puppet.  Errors in the bond0 (and, in at least one case, br0) config files took down the bonded network interface on those machines as they ran puppet.

A fix was checked in to puppet, and hosts that had dropped off the network were fixed manually and brought back online.
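
The bug doesn't record the exact manual steps, but a minimal sketch of that kind of recovery (assuming RHEL-style network scripts and console access, since the affected hosts were unreachable over the network) would be roughly:

puppet agent --test        # pull the corrected bond0/br0 config now that the fix is in
ifdown bond0; ifup bond0   # re-read the repaired config and bring the bond back up
ip addr show bond0         # confirm the interface is up with the expected address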

Once the machines were reachable again, I did the following for the releng kvm nodes:

If the master node went down, I logged into a different node and ran:

gnt-cluster master-failover
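
(Not part of the original steps, but for context: gnt-cluster getmaster reports which node currently holds the master role, so it makes an easy sanity check around the failover.)

gnt-cluster getmaster        # which node the cluster currently considers master
gnt-cluster master-failover  # run on the node that should take over as master
gnt-cluster getmaster        # should now report the new master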

reactivated the sets of degraded disks:

for i in `gnt-instance list | awk '{print $1}' | grep -v Instance`; do gnt-instance activate-disks $i; done
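
(Side note, not from this bug: Ganeti can also find and reactivate degraded DRBD disks cluster-wide in one pass, which avoids looping by hand:)

gnt-cluster verify-disks       # checks every instance and reactivates degraded disks
gnt-instance info <instance>   # per-instance detail, including the DRBD devices behind each disk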

rebuilt the secondary disks for each instance in case they fell out of sync while the two kvm nodes were disconnected:

for i in `gnt-instance list| awk '{print $1}'|grep -v Instance`; do gnt-instance replace-disks -s $i; done
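
(The bug doesn't say how the rebuilds were monitored; two standard options, the second assuming DRBD-backed disks, are:)

gnt-job list           # the replace-disks jobs show up here on the master node
watch cat /proc/drbd   # on a kvm node, shows per-device resync progress and speed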

The latter is still running and will take some time (probably 5+ hours) as it cycles through all of the vms.  I have downtimed the drbd checks on the releng kvm servers in scl3 and mtv1 for 5 hours for now.
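
(For reference only - the downtime was presumably set through the Nagios UI. The same thing can be done through the external command file; the command-file path and service name below are assumptions, not taken from this bug:)

now=$(date +%s)
# 5-hour (18000 s) fixed downtime for the drbd check on one host; repeat per kvm node
printf "[%s] SCHEDULE_SVC_DOWNTIME;kvm1.private.releng.scl3;drbd;%s;%s;1;0;18000;arr;bug 773682\n" \
  "$now" "$now" "$((now + 18000))" > /var/spool/nagios/cmd/nagios.cmd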
How do we avoid this in the future?
notes:
 - trees closed at 0938
 - trees opened at 1051
Whiteboard: [buildduty][outage]
Assignee: server-ops → afernandez
Also, this impacted more than just releng.  Metrics also lost machines, and I'm not sure if anyone in metrics/IT has reconstructed the disks on those vms yet.
As far as I know, it only affected kvm1.generic.metrics.scl3 and no other alerts came in.

Perhaps :cyliang could verify if everything looks ok on their end?
The secondary rebuilds of all releng kvm vms in mtv1 and scl3 have completed successfully.
(In reply to Aki Sasaki [:aki] from comment #5)
> How do we avoid this in the future?

Who can best answer this?
(In reply to John O'Duinn [:joduinn] from comment #10)
> (In reply to Aki Sasaki [:aki] from comment #5)
> > How do we avoid this in the future?
> 
> Who can best answer this?

I can, but I don't have the answer yet.
Assignee: afernandez → cshields
Whiteboard: [buildduty][outage] → [outage]
(In reply to Aki Sasaki [:aki] from comment #5)
> How do we avoid this in the future?

More puppet testing in a staging environment.  That's something we can't easily resource right now.  This was caused by a one-off puppet problem - not one we can easily build specific safeguards against in the future.

KVM has been stabilized, closing this out.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard