797869 - issue with cluster impacting some releng machines

arich

Reporter

Description

12 years ago

nagios started alerting about some vmware guests having load and then connectivity (NRPE) issues around 7:46.  Other guests were also impacted.  gcox is working on the issue now.

Greg Cox [:gcox]

Assignee

Comment 1

12 years ago

This is a co-bug with bug 797653.

Did one storage vmotion of people1, since it was on a different datastore than the 500g vmdk that we were trying to add to it.  That took a long time but showed no ill effects other than slowness (possibly related: esx1 was griping about exclusive locks in vmkwarning.log).

When the guest landed on the new datastore, I did a refresh in the datastore browser, and the world went white as vsphere client went nonresponsive.  nagios alarms kicked in (~0801 PDT) as guests (infra, releng, etc) started having timeout issues.  The datastore browser didn't refresh on the first try.  When vsphere client came alive again, esx12 showed disconnected, and 4 other hosts were in warning.

After people1 booted, I went hands-off, and the world calmed down and self-healed.

Peter Radcliffe [:pir]

Comment 2

12 years ago

Hosts that were not VMs but had netapp NFS mounts also went into high-load states or became partially unresponsive. They recovered along with the guests.

Peter Radcliffe [:pir]

Comment 3

12 years ago

Example of hosts that are not VMs: ftp[1234].dmz.scl3

Dan Parsons [:lerxst]

Comment 4

12 years ago

We're working with NetApp Support on this issue.

Dan Parsons [:lerxst]

Comment 5

12 years ago

NetApp case # 2003576078

Dan Parsons [:lerxst]

Comment 6

12 years ago

We gave NetApp support a bunch of information; they're processing it and will get back to us soon. In the meantime, we upgraded the NetApp referenced in this bug from 8.1 to 8.1.1, and it already seems to be running much better. After we hear back from NetApp and review the system behavior over the next few days, we should know whether this problem is fixed for good.

Dustin J. Mitchell [:dustin] (he/him)

Comment 7

12 years ago

This caused SCSI errors on production-opsi and staging-opsi, which were cleared by a reboot of those hosts.

Is there a quick way to determine what other hosts might be affected?

Peter Radcliffe [:pir]

Comment 8

12 years ago

I'm wondering if this might be related to the problems we've been seeing with ftp5. The high load when http is running sounds similar, but the only relation here is NFS to the netapp; the ftp boxes are seamicros, not VMs.

During the problems yesterday all the ftp machines hit high load, but otherwise we've only seen issues with ftp5, which is... confusing at best.

 https://bugzilla.mozilla.org/show_bug.cgi?id=792848

Peter Radcliffe [:pir]

Comment 9

12 years ago

litmus2.dmz.scl3.mozilla.com, a VM in scl3, just became unreachable by everything except ping, and the console returned nothing when I connected.
I had to power cycle it in vsphere.

Greg Cox [:gcox]

Assignee

Comment 10

12 years ago

The litmus2 outage was not an ESX/filer problem.  oom-killer kicked in at that time.  This one looks particularly bad:

[root@litmus2 log]# cat messages | grep 'Oct  5' | grep oom-killer | grep init
Oct  5 04:58:43 litmus2 kernel: init invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0

As for affected hosts, there's no quick way to find them.  Kernels, drivers, and processes all have different tolerances, and it wasn't as if data STOPPED; some still flowed, so depending on tolerances and luck, some hosts will have survived while others drowned.
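
The grep pipeline shown above for litmus2 could be turned into a small reusable check. What follows is only a sketch: it assumes syslog lines shaped exactly like the sample in this comment (no bracketed kernel timestamps), and the names `OOM_LINE`, `oom_invocations`, and `SAMPLE` are hypothetical. The second sample line is made-up filler to show that non-matching lines are skipped. It does not answer which hosts were hurt by the storage slowdown; it only separates oom-killer events from other failures in one host's log.

```python
import re

# Assumed line shape, taken from the litmus2 sample in this bug:
# "Oct  5 04:58:43 litmus2 kernel: init invoked oom-killer: gfp_mask=..."
OOM_LINE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+kernel:\s+(?P<proc>\S+) invoked oom-killer"
)


def oom_invocations(lines, month=None, day=None):
    """Return (time, host, process) for each oom-killer invocation.

    month and day are optional filters, e.g. month="Oct", day=5.
    """
    hits = []
    for line in lines:
        match = OOM_LINE.match(line)
        if match is None:
            continue  # not an oom-killer invocation line
        if month is not None and match.group("month") != month:
            continue
        if day is not None and int(match.group("day")) != day:
            continue
        hits.append((match.group("time"), match.group("host"), match.group("proc")))
    return hits


SAMPLE = [
    # Real line quoted in this bug.
    "Oct  5 04:58:43 litmus2 kernel: init invoked oom-killer: "
    "gfp_mask=0x201d2, order=0, oomkilladj=0",
    # Hypothetical filler line, not from the bug.
    "Oct  5 05:00:01 litmus2 crond[1234]: (root) CMD (true)",
]

if __name__ == "__main__":
    for time, host, proc in oom_invocations(SAMPLE, month="Oct", day=5):
        print(host, time, proc)
```

To scan a real log, pass an open file instead of `SAMPLE`, for example `oom_invocations(open("/var/log/messages"), month="Oct", day=5)`; the path is the usual location on older Red Hat style systems and may differ elsewhere.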

Dustin J. Mitchell [:dustin] (he/him)

Comment 11

12 years ago

OK, thanks for the detail.  I figure if something's broken it's either not important or we'd have noticed by now :)

Greg Cox [:gcox]

Assignee

Comment 12

12 years ago

Our case with NetApp pointed out a few things that are highly unlikely to have been the cause of this incident, yet support can't/won't look past them as they triage.  Most notable is the misalignment of our VMDKs.  We're tracking the cleanup in bug 798042 (scl3), bug 802131 (phx1), and bug 802132 (hci) and have already gotten rid of some of the worst offenders.

The other is that, by the time we got support looped in, we were already recovered, and they were looking at performance statistics after the fact.  Bug 802120 is for me to write up how to fire off a perfstat, so that we can have one gathering data if we get into these situations in the future.

As such, we've pretty much hit the end of the line without a really satisfying answer, but I don't think there's more to come from this.
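
For readers unfamiliar with the VMDK misalignment mentioned in this comment: a guest partition is generally considered aligned when its starting byte offset is a whole multiple of the underlying storage block size. The sketch below illustrates that arithmetic only. The 512-byte sector size and 4096-byte block size are assumptions, not values stated in this bug, and `is_aligned` is a hypothetical helper, not a NetApp or VMware tool.

```python
SECTOR_BYTES = 512   # assumed logical sector size inside the guest
BLOCK_BYTES = 4096   # assumed block size on the backing storage


def is_aligned(start_sector, sector_bytes=SECTOR_BYTES, block_bytes=BLOCK_BYTES):
    """True if the partition's starting byte offset falls on a block boundary."""
    return (start_sector * sector_bytes) % block_bytes == 0


if __name__ == "__main__":
    # 63 is the classic legacy MBR starting sector; 2048 is a common newer default.
    for start in (63, 2048):
        print(start, is_aligned(start))
```

Under these assumptions, a partition starting at sector 63 is misaligned (63 × 512 = 32256 bytes, not a multiple of 4096), while one starting at sector 2048 is aligned (2048 × 512 = 1048576 bytes).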

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → INCOMPLETE

Nobody; OK to take it and work on it

Updated

10 years ago

Product: mozilla.org → Infrastructure & Operations

Categories

(Infrastructure & Operations :: Virtualization, task)

Tracking

(Not tracked)

People

(Reporter: arich, Assigned: gcox)

Security

(public)
