Several Windows 8 VMs are freezing (0 MB memory usage - 100% CPU) and are not accessible anymore

RESOLVED FIXED

Status

Mozilla QA
Infrastructure
--
critical
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: cosmin, Assigned: whimboo)

Attachments

(4 attachments, 1 obsolete attachment)

(Reporter)

Description

3 years ago
Node went offline and I can't access it via VPN; it now needs to be restarted.
The box is back. It was hanging with 100% cpu load. You might want to check tomorrow what could have been the cause of it. 

Given that the Jenkins link is broken on the desktop, Andreea will reconnect the box to Jenkins now.
Flags: needinfo?(andreea.matei)
Node is back online.
Flags: needinfo?(andreea.matei)
Thanks.
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
I had to restart the node once more because it was not responding. Cosmin, please investigate this problem and let us know what you can find. I will attach some screenshots in a bit.
Status: RESOLVED → REOPENED
Flags: needinfo?(cosmin.malutan)
Resolution: FIXED → ---
Created attachment 8513274 [details]
screenshot (cpu %)
Created attachment 8513275 [details]
screenshot (cpu mhz)
Created attachment 8513276 [details]
screenshot (memory usage)
(Reporter)

Comment 8

3 years ago
Created attachment 8513289 [details]
EventView.txt

Here is the log from Event Viewer; I'll check this further in a bit.
(Reporter)

Comment 9

3 years ago
I couldn't find what caused this, but the security updates might fix this.
Otherwise, the closest issue I found to this is the one below, which would require disabling the VMCI driver.
https://communities.vmware.com/message/2379497#2379497
Flags: needinfo?(cosmin.malutan)
Duplicate of this bug: 1091050
The same seems to have happened today to mm-win-8-64-2. I will check that now.
Summary: Node mm-win-81-64-3 from scl3 can't be accessed via VPN → Several Windows 8 nodes are freezing and are not accessible anymore
I do not think that there is anything I can actually do here. Something is going totally crazy with our VMs. Memory usage goes down to 0 B while CPU is at 100%. It looks very much like an ESX issue.

Greg or Chris, can one of you help us here?
Assignee: hskupin → server-ops-virtualization
Severity: normal → critical
Status: REOPENED → NEW
Component: Infrastructure → Server Operations: Virtualization
Flags: needinfo?(gcox)
Flags: needinfo?(cknowles)
Product: Mozilla QA → mozilla.org
QA Contact: hskupin → cshields
Version: unspecified → other
Whiteboard: [qa-automation-blocked]
Summary: Several Windows 8 nodes are freezing and are not accessible anymore → Several Windows 8 VMs are freezing (0 MB memory usage - 100% CPU) and are not accessible anymore
Can I get a complete list of the affected VMs so I can research? That VMCI issue is likely fixed in the latest VMtools, but I don't know what you're running until I have that list of affected guests.
Alright, did some poking around.

I'm assuming that the list of potential victims is:

mm-win-8-64-{1..4} and 
mm-win-8-32-{1..4}

Only mm-win-8-64-2 is currently showing the issue - it's completely locked, there's nothing I can see in there - feel free to reboot it, as I can't do any diagnosis on a locked up box.

Looking at the performance, I see something mildly interesting.  ALL of the above vms show a LARGE network spike accompanied by a smaller but still significant CPU and memory spike at ~10:40AM (Eastern) - looks like most of the mm VMs gulped whatever this was down and continued about their day - but mm-win-8-64-2 didn't recover - it went into that high cpu/no ram pattern, and is completely locked.
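The "high CPU / no RAM" pattern described above could be checked for automatically in exported performance samples. The following is only a minimal sketch under stated assumptions: the `Sample` structure, the thresholds, and the numbers are hypothetical and are not taken from the actual ESX graphs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    """One performance sample for a VM (hypothetical structure)."""
    minute: int          # minutes since the start of the observation window
    cpu_percent: float   # CPU usage as reported by the hypervisor
    memory_mb: float     # guest memory usage as reported by the hypervisor


def find_lockup_start(samples: List[Sample],
                      cpu_floor: float = 99.0,
                      mem_ceiling_mb: float = 1.0,
                      min_consecutive: int = 3) -> Optional[int]:
    """Return the minute at which a sustained 'high CPU / no RAM' run begins.

    A run only counts once `min_consecutive` samples in a row match, so a
    single spike that the VM recovers from is ignored. Returns None if no
    such run exists.
    """
    run_start = None
    run_length = 0
    for s in samples:
        if s.cpu_percent >= cpu_floor and s.memory_mb <= mem_ceiling_mb:
            if run_length == 0:
                run_start = s.minute
            run_length += 1
            if run_length >= min_consecutive:
                return run_start
        else:
            run_length = 0
            run_start = None
    return None


# Made-up numbers for illustration only: a spike the VM recovers from,
# then the same history followed by the locked-up pattern.
recovered = [Sample(0, 12, 2048), Sample(5, 85, 3100), Sample(10, 15, 2100)]
locked = recovered + [Sample(15, 100, 0), Sample(20, 100, 0), Sample(25, 100, 0)]

print(find_lockup_start(recovered))  # None
print(find_lockup_start(locked))     # 15
```

Requiring several consecutive matching samples is a design choice to separate VMs that absorbed the spike from the one that stayed locked.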

From an ESX standpoint, this doesn't appear to be anything systemic.  The VMs are on several different hosts, and a spot check of the environment doesn't show a network (or CPU/RAM) spike on any other vms at that time, and things appear to be healthy outside this group.

About the only ESX-ish advice I can give you is that the VMtools are *slightly* out of date.  (version 9354 is current, 9344 appears to be the version on there)  That said, this totally shouldn't be doing that to you. 
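A check like the one above (installed build versus current build) can be sketched in a few lines. This is only an illustration: the guest names and the per-guest build numbers are hypothetical placeholders, and only the build number 9354 is taken from the comment.

```python
from typing import Dict, List, Optional

CURRENT_TOOLS_BUILD = 9354  # build quoted as current in the comment above


def outdated_guests(installed: Dict[str, Optional[int]],
                    current: int = CURRENT_TOOLS_BUILD) -> List[str]:
    """Return guests whose tools build is missing (None) or older than `current`."""
    return sorted(
        name for name, build in installed.items()
        if build is None or build < current
    )


# Hypothetical inventory; None stands for "tools not installed or not reported".
inventory = {
    "example-vm-1": 9344,
    "example-vm-2": None,
    "example-vm-3": 9354,
}

print(outdated_guests(inventory))  # ['example-vm-1', 'example-vm-2']
```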

That's about all I can see, let me know if I'm looking in the wrong direction, or if I can clear anything up for you.
Flags: needinfo?(gcox)
Flags: needinfo?(cknowles)
And because I'm a glutton for punishment, I went looking some more.

mm-win-81-{32,64}-{1..4}

two interesting things - 
-32-4 showed no spike, looked like normal windows activity for the last 24 hours.  
-64-3 showed the same aberrant behavior, but not associated with this spike, ended around 3AM eastern - and when it ended it did so with a spike to memory and network usage on this vm only - which might be indicative of nothing more than returning to function - and indeed appears to be when the VM was rebooted.

Other than that, the others all showed the network spike at the same time that -8-64-2 fell over.  

Again, not finding anything that points to ESX or its hardware as being involved.  Just wanted to update with anything further.  Let me know how I can assist further.
Thank you, Chris, for looking into that! Looks like I will have to dig into this problem and figure out what's wrong. What is interesting is why ESX tells us that the machine has 0 MB of RAM in use. How can that be? Is there a disconnect between ESX and the VM? I might indeed wanna upgrade VMware tools for those two affected machines first and check how it goes.
Assignee: server-ops-virtualization → nobody
Component: Server Operations: Virtualization → Infrastructure
Product: mozilla.org → Mozilla QA
QA Contact: cshields → hskupin
Version: other → unspecified
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Interestingly the vSphere client shows me that VMware tools is not installed on mm-win-8-64-2? How can this be? I installed it now, and will continue to watch that machine. I will also check all the other hosts about it.
After installing VMware tools on mm-win-8-64-2 I brought the machine back. I checked the Event logs but nothing in there looked suspicious. I will leave the resource monitor running on that box, which might give us an idea of what's going on.

I will check mm-win-81-64-3 now, and apply the same changes if necessary.
mm-win-81-64-3 is back with upgraded VMware tools. Let's see how both machines will behave today.

Chris, I thought that you can get VMware tools installed/upgraded during a client reboot/startup. Is that not the case? Or would we have to set a special option for that in all of our VMs? Would that be wise to do?
Flags: needinfo?(cknowles)
So several questions happened overnight ...

64-2 - while locked up, I don't fully trust the stats coming out - Though, frankly if it had been hardware we simply wouldn't know if the CPU was at 100% while the memory was at 0% because we can't look inside while it's locked... only due to its being a VM do we have any information.

Similarly, the Tools are a process running on the system - when it's completely locked up, ESX will report the tools as not running, because it can't make contact with the tools to query its state.  

Also, it's not that you can upgrade *while* rebooting, but that for windows, a reboot is required for full installation.  Sadly with windows the best thing is to do the install - you can *try* the following mana page, which should work without a reboot, but as in most things windows, if in danger or in doubt, throw in a reboot.  https://mana.mozilla.org/wiki/display/SYSADMIN/VMware+Tools+Installs#Advanced%20Trickery
Flags: needinfo?(cknowles)
Talked with Chris on IRC and we have to wait here. I hope it happens again so we can analyze the problem, and if it does, at a time when I'm around. So let's cross our fingers.
Duplicate of this bug: 1092982
I also upgraded VMware tools on win-81-32 and brought it back online. I will most likely upgrade all VMware tools later today for at least all the Windows nodes on staging.
I have updated all Windows machines in staging with the newest version of VMware tools now. Let's see how they behave over the next few days. If all is fine I will also update the tools for our production machines.
Earlier today you'd asked for changelogs for the vmware tools.

Sadly there doesn't appear to be *one* document for all changes to the tools.  The tools are usually wrapped in with an ESXi upgrade - So, here's what I found.

First, the list of all recent versions of tools (first column) with the update ESXi version (second column)
http://packages.vmware.com/tools/versions
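For looking up which release a given tools build belongs to, a listing like this could be parsed into a lookup table. The sketch below assumes only what is stated above (first column is the tools version, second column is the release); the handling of `#` comment lines is an assumption, and the sample rows are placeholders rather than content copied from the real listing.

```python
from typing import Dict


def parse_tools_versions(text: str) -> Dict[int, str]:
    """Map tools build number (column 1) to release label (column 2)."""
    mapping: Dict[int, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and (assumed) comment lines.
        if not line or line.startswith("#"):
            continue
        columns = line.split()
        # Ignore rows that do not start with a numeric build number.
        if len(columns) < 2 or not columns[0].isdigit():
            continue
        mapping[int(columns[0])] = columns[1]
    return mapping


# Placeholder rows for illustration only, not real data from the listing.
sample = """\
# build  release
9354  example-release-b
9344  example-release-a
"""

versions = parse_tools_versions(sample)
print(versions[9354])  # example-release-b
```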

Then search for "vmware tools release notes" on google - for example, version 9354 is from 5.5 Update 2.

The changelog for this latest version is https://www.vmware.com/support/vsphere5/doc/vsphere-esxi-55u2-release-notes.html#resolvedissuesvmwaretools

Not the easiest/best way to find info, but they really expect you to come at them with specific error messages it seems.
Created attachment 8516533 [details]
screenshots ubuntu 13.10_64

It looks like we have the same problem on Ubuntu machines (http://mm-ci-master.qa.scl3.mozilla.com:8080/computer/mm-ub-1310-64-1/)
Given that a process with 100% CPU load is not the same as a VM that does not respond at all, I would really like to get the new Ubuntu issue moved out into a separate bug. Thanks.
Attachment #8516533 - Attachment is obsolete: true
Unable to connect to mm-win-8-32-2 machine
Fixed that machine and also updated VMware Tools. It is really suspicious that all of this is starting now. We have never seen such things before in the last two years.

Chris, have any changes been made to the ESX cluster or software lately that could have caused this?
Flags: needinfo?(cknowles)
There was a UCS update about 2 months ago, and standard security patches are applied monthly, but nothing in the last 2 weeks - and certainly nothing that should cause this sort of thing.
We have many other windows and ubuntu VMs that aren't experiencing any degradations like those reported here.  Also, just to rule out hosts, there are several other mm-* VMs on this particular host that appear to be functioning well.  

Have any increased logging steps been taken on the clients to see what they're doing just prior to the failure?
Flags: needinfo?(cknowles)
As we discussed earlier, this only happens on Windows machines, and there seems to be no way to increase any kind of logging. Everything is halted and nothing gets added to the Event log.

Anyway, meanwhile I really would like to go ahead and upgrade the VMware tools for all the Windows machines. I might wanna do it after the next release tests.
It looks like every Windows node below Windows 8 does not have VMware tools installed at all! I'm doing that now for all of them.
All Windows machines got VMware tools upgraded or installed. While doing that I found that there is a checkbox inside the VM settings to allow an auto-upgrade of VMware tools during startup. That was the reason why some machines already had the latest version of it installed. I will update all VMs now and enable that checkbox. So whenever a VM gets restarted it will check for a newer release of VMware tools automatically.
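An audit of which VMs still lack that setting could look roughly like the sketch below. It assumes the checkbox corresponds to the tools upgrade policy value `upgradeAtPowerCycle` (as opposed to `manual`), which is how the vSphere API appears to expose it; the inventory dictionary and VM names are hypothetical, and a real audit would have to read the policy from vSphere rather than from a hard-coded mapping.

```python
from typing import Dict, List

# Assumed policy value behind the "upgrade tools during startup" checkbox.
AUTO_UPGRADE = "upgradeAtPowerCycle"


def vms_missing_auto_upgrade(policies: Dict[str, str]) -> List[str]:
    """Return the VMs whose tools upgrade policy is not the automatic one."""
    return sorted(
        name for name, policy in policies.items()
        if policy != AUTO_UPGRADE
    )


# Hypothetical inventory for illustration only.
policies = {
    "example-vm-1": "manual",
    "example-vm-2": "upgradeAtPowerCycle",
    "example-vm-3": "manual",
}

print(vms_missing_auto_upgrade(policies))  # ['example-vm-1', 'example-vm-3']
```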
Whiteboard: [qa-automation-blocked]
And per an IRC conversation, updated all the QA windows templates to have this setting, so that new machines going forward will have this.
All VM settings have been updated, so machines should stay on the latest version.

I'm not sure if the changes made in this bug will prevent the misbehavior we have seen. At least I'm not sure what else I could do right now. I will close this bug as fixed for now. But if machines start failing again, please re-open it and I will dig further.
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Not sure if this is the case here, but I was checking all windows nodes for the java autoupdate settings and the jenkins icon on the desktop, and I couldn't reconnect to this one: mm-win-81-64-4.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Looking at the CPU/Memory graphs, it doesn't show the same 100% CPU, 0% memory issue as the original complaint, but I'll leave final determination for this to :whimboo
Indeed, filed bug 1097755 for this, thanks!
Status: REOPENED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED