issue with cluster impacting some releng machines

RESOLVED INCOMPLETE

Status

Infrastructure & Operations
Virtualization
RESOLVED INCOMPLETE
6 years ago
4 years ago

People

(Reporter: arr, Assigned: gcox)

(Reporter)

Description

6 years ago
Around 7:46, nagios started alerting about some vmware guests showing high load and then connectivity (NRPE) issues.  Other guests were also impacted.  gcox is working on the issue now.
(Assignee)

Comment 1

6 years ago
This is a co-bug with bug 797653.

Did one storage vmotion of people1, since it was on a different datastore than the 500g vmdk that we were trying to add to it.  That took a long time but showed no ill effects other than slowness (possibly related: esx1 was griping about exclusive locks in vmkwarning.log).

When the guest landed on the new datastore, I did a refresh in the datastore browser, and the world went white as vsphere client went nonresponsive.  nagios alarms kicked in (~0801 PDT) as guests (infra, releng, etc) started having timeout issues.  The datastore browser didn't refresh on the first try.  When vsphere client came alive again, esx12 showed disconnected, and 4 other hosts were in warning.

After people1 booted, I went hands-off, and the world calmed down and self-healed.
Hosts that were not VMs but had netapp NFS mounts went into high-load states or went partially unresponsive, also. They recovered along with the guests.
Example of hosts that are not VMs: ftp[1234].dmz.scl3
We're working with NetApp Support on this issue
NetApp case # 2003576078
We gave NetApp support a bunch of information; they're processing it and will get back to us soon. In the meantime, we upgraded the NetApp referenced in this bug from 8.1 to 8.1.1, and it already seems to be running much better. After we hear back from NetApp and review the system behavior over the next few days, we should know whether this problem is fixed for good.
This caused SCSI errors on production-opsi and staging-opsi, which were cleared by a reboot of those hosts.

Is there a quick way to determine what other hosts might be affected?
I'm wondering if this might be related to the problems we've been seeing with ftp5. The high load when http is running sounds similar, but the only relation here is NFS to the netapp; the ftp boxes are seamicros, not VMs.

During the problems yesterday all the ftp machines hit high load, but otherwise we've only seen issues with ftp5, which is... confusing at best.

 https://bugzilla.mozilla.org/show_bug.cgi?id=792848
litmus2.dmz.scl3.mozilla.com, a VM in scl3, just went unreachable to everything except ping, and when I connected to the console it returned nothing.
Had to power cycle it in vsphere.
(Assignee)

Comment 10

6 years ago
The litmus2 problem was not ESX/filer related; oom-killer kicked in at that time.  This one looks particularly bad:

[root@litmus2 log]# cat messages | grep 'Oct  5' | grep oom-killer | grep init
Oct  5 04:58:43 litmus2 kernel: init invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
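For anyone who wants the same filter in a script rather than a grep chain, here is a minimal sketch in Python. It assumes the classic syslog layout shown in the line above; the function name oom_events and the second sample line are made up for illustration and are not from litmus2.

```python
import re

# Matches syslog lines such as:
#   Oct  5 04:58:43 litmus2 kernel: init invoked oom-killer: gfp_mask=0x201d2, ...
OOM_RE = re.compile(
    r"^(?P<date>\w{3}\s+\d+) (?P<time>\d{2}:\d{2}:\d{2}) (?P<host>\S+) kernel: "
    r"(?P<proc>\S+) invoked oom-killer"
)

def oom_events(lines, day=None):
    """Return (date, time, host, process) for each oom-killer line.

    If day is given (e.g. "Oct  5"), only events from that day are returned.
    """
    wanted = " ".join(day.split()) if day is not None else None
    events = []
    for line in lines:
        m = OOM_RE.match(line)
        if not m:
            continue
        # Normalize the padded syslog date ("Oct  5") to single spaces ("Oct 5").
        date = " ".join(m.group("date").split())
        if wanted is not None and date != wanted:
            continue
        events.append((date, m.group("time"), m.group("host"), m.group("proc")))
    return events

sample = [
    # Real line quoted in this bug.
    "Oct  5 04:58:43 litmus2 kernel: init invoked oom-killer: "
    "gfp_mask=0x201d2, order=0, oomkilladj=0",
    # Placeholder line (hypothetical) to show that non-matching lines are skipped.
    "Oct  5 04:58:44 litmus2 kernel: placeholder unrelated message",
]

print(oom_events(sample, day="Oct  5"))
```

Dropping the `grep init` step and reporting the invoking process instead makes it easier to see whether init was the only process that triggered the oom-killer.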

As to hosts affected, there's not a quick way to find them.  kernels, drivers, and processes all have different tolerances, and it wasn't like data STOPPED; some still flowed, meaning that, depending on tolerances and luck, some hosts will have survived while others drowned.
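With those caveats, one rough first pass (a sketch only, not something that was run for this bug) is to look for the Linux kernel's "nfs: server ... not responding" message in each host's syslog. The function name, the host names, the filer name, and the log text below are all hypothetical, and a host with no such message is not proven unaffected.

```python
import re

# The Linux NFS client logs lines like:
#   nfs: server <name> not responding, still trying
NFS_STALL_RE = re.compile(r"nfs: server (?P<server>\S+) not responding")

def hosts_with_nfs_stalls(logs_by_host):
    """Map each host to the sorted NFS servers it reported as not responding.

    logs_by_host is a dict of hostname -> syslog text already collected
    from that host (how it is collected is out of scope here).
    """
    hits = {}
    for host, text in logs_by_host.items():
        servers = sorted({m.group("server") for m in NFS_STALL_RE.finditer(text)})
        if servers:
            hits[host] = servers
    return hits

# Hypothetical input: host names, filer name, and log lines are invented.
example_logs = {
    "hostA": "kernel: nfs: server filerX not responding, still trying\n"
             "kernel: nfs: server filerX OK\n",
    "hostB": "kernel: placeholder unrelated message\n",
}

print(hosts_with_nfs_stalls(example_logs))
```

This only catches hosts whose NFS client timed out long enough to log it, so it narrows the list to check by hand rather than answering the question outright.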
OK, thanks for the detail.  I figure if something's broken it's either not important or we'd have noticed by now :)
(Assignee)

Comment 12

6 years ago
Our case with NetApp pointed out a few things that are severely unlikely to have been the cause of this incident, yet that support can't/won't look past as they triage.  The most notable is the misalignment of our VMDKs.  We're tracking the cleanup in bug 798042 (scl3), bug 802131 (phx1), and bug 802132 (hci) and have already gotten rid of some of the worst offenders.

The other is that, by the time we got support looped in, we were already recovered, and they were looking at performance statistics after the fact.  Bug 802120 is for me to write up how to fire off a perfstat, so that we can have one gathering data if we get into these situations in the future.

As such, we've pretty much hit the end of the line without a really satisfying answer, but I don't think there's more to come from this.
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → INCOMPLETE
Product: mozilla.org → Infrastructure & Operations