Closed
Bug 1090282
Opened 10 years ago
Closed 10 years ago
Several Windows 8 VMs are freezing (0 MB memory usage - 100% CPU) and are not accessible anymore
Categories
(Mozilla QA Graveyard :: Infrastructure, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: cosmin-malutan, Assigned: whimboo)
References
Details
Attachments
(4 files, 1 obsolete file)
The node went offline and I can't access it via VPN; it needs to be restarted.
Assignee | ||
Comment 1•10 years ago
|
||
The box is back. It was hanging with 100% cpu load. You might want to check tomorrow what could have been the cause of it.
Given that the Jenkins link is broken on the desktop, Andreea will reconnect the box to Jenkins now.
Flags: needinfo?(andreea.matei)
Assignee | ||
Comment 3•10 years ago
|
||
Thanks.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 4•10 years ago
|
||
I had to restart the node once more because it was not responding. Cosmin, please investigate this problem and let us know what you can find. I will attach some screenshots in a bit.
Status: RESOLVED → REOPENED
Flags: needinfo?(cosmin.malutan)
Resolution: FIXED → ---
Assignee | ||
Comment 5•10 years ago
|
||
Assignee | ||
Comment 6•10 years ago
|
||
Assignee | ||
Comment 7•10 years ago
|
||
Reporter | ||
Comment 8•10 years ago
|
||
Here is the log from Event Viewer; I'll check this further in a bit.
Reporter | ||
Comment 9•10 years ago
|
||
I couldn't find what caused this, but the security updates might fix this.
Otherwise, the closest issue I found to this is the one below, which would require disabling the VMCI driver.
https://communities.vmware.com/message/2379497#2379497
Flags: needinfo?(cosmin.malutan)
Assignee | ||
Comment 11•10 years ago
|
||
The same seems to have happened today to mm-win-8-64-2. I will check that now.
Assignee | ||
Updated•10 years ago
|
Summary: Node mm-win-81-64-3 from scl3 can't be accessed via VPN → Several Windows 8 nodes are freezing and are not accessible anymore
Assignee | ||
Comment 12•10 years ago
|
||
I do not think that there is anything I can actually do here. Something is going totally crazy with our VMs. Memory usage goes down to 0B while CPU is at 100%. This looks very much like an ESX issue.
Greg or Chris, can one of you help us here?
Assignee: hskupin → server-ops-virtualization
Severity: normal → critical
Status: REOPENED → NEW
Component: Infrastructure → Server Operations: Virtualization
Flags: needinfo?(gcox)
Flags: needinfo?(cknowles)
Product: Mozilla QA → mozilla.org
QA Contact: hskupin → cshields
Version: unspecified → other
Assignee | ||
Updated•10 years ago
|
Whiteboard: [qa-automation-blocked]
Assignee | ||
Updated•10 years ago
|
Summary: Several Windows 8 nodes are freezing and are not accessible anymore → Several Windows 8 VMs are freezing (0 MB memory usage - 100% CPU) and are not accessible anymore
Comment 13•10 years ago
|
||
Can I get a complete list of the affected VMs so I can research? That VMCI issue is likely fixed in the latest VMtools, but I don't know what you're running until I have that list of affected guests.
Comment 14•10 years ago
|
||
Alright, did some poking around.
I'm assuming that the list of potential victims is:
mm-win-8-64-{1..4} and
mm-win-8-32-{1..4}
Only mm-win-8-64-2 is currently showing the issue - it's completely locked, there's nothing I can see in there - feel free to reboot it, as I can't do any diagnosis on a locked up box.
Looking at the performance, I see something mildly interesting. ALL of the above vms show a LARGE network spike accompanied by a smaller but still significant CPU and memory spike at ~10:40AM (Eastern) - looks like most of the mm VMs gulped whatever this was down and continued about their day - but mm-win-8-64-2 didn't recover - it went into that high cpu/no ram pattern, and is completely locked.
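A minimal sketch of how the "high CPU / no RAM" pattern described above could be flagged from exported performance samples. The `Sample` type, the thresholds, and the requirement of several consecutive readings are assumptions made for illustration; none of them come from vSphere or from the affected VMs.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Sample:
    """One performance reading for a VM (hypothetical structure)."""
    vm: str
    cpu_percent: float
    mem_mb: float


def locked_up_vms(samples: Iterable[Sample],
                  cpu_threshold: float = 99.0,
                  mem_threshold_mb: float = 1.0,
                  min_consecutive: int = 3) -> List[str]:
    """Return VMs whose latest readings all show pegged CPU and ~0 MB memory.

    A VM is flagged only if its last `min_consecutive` samples match, so a
    short spike that the guest recovers from is not reported.
    """
    by_vm: Dict[str, List[Sample]] = {}
    for s in samples:  # samples are assumed to be in chronological order
        by_vm.setdefault(s.vm, []).append(s)

    flagged = []
    for vm, readings in by_vm.items():
        tail = readings[-min_consecutive:]
        if len(tail) == min_consecutive and all(
            r.cpu_percent >= cpu_threshold and r.mem_mb <= mem_threshold_mb
            for r in tail
        ):
            flagged.append(vm)
    return sorted(flagged)
```

Requiring several consecutive matching readings is a design choice, so that VMs which absorbed the spike and carried on are not reported alongside the one that stayed locked.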
From an ESX standpoint, this doesn't appear to be anything systemic. The VMs are on several different hosts, and a spot check of the environment doesn't show a network (or CPU/RAM) spike on any other vms at that time, and things appear to be healthy outside this group.
About the only ESX-ish advice I can give you is that the VMtools are *slightly* out of date. (version 9354 is current, 9344 appears to be the version on there) That said, this totally shouldn't be doing that to you.
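As a rough sketch, the version check above could be applied across a list of guests like this. The `tools_status` helper and the input mapping are hypothetical; only the build number 9354 is taken from this comment, and `None` is assumed to stand for a guest where the tools are missing or not reporting.

```python
from typing import Dict, List, Optional, Tuple


def tools_status(installed: Dict[str, Optional[int]],
                 current: int = 9354) -> List[Tuple[str, str]]:
    """Classify each guest by its reported VMware Tools version.

    `installed` maps a VM name to the tools version number, or None when
    the tools are not installed or cannot be queried.
    """
    report = []
    for vm in sorted(installed):
        version = installed[vm]
        if version is None:
            status = "not installed or not reporting"
        elif version < current:
            status = f"outdated ({version} < {current})"
        else:
            status = "current"
        report.append((vm, status))
    return report
```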
That's about all I can see, let me know if I'm looking in the wrong direction, or if I can clear anything up for you.
Flags: needinfo?(gcox)
Flags: needinfo?(cknowles)
Comment 15•10 years ago
|
||
And because I'm a glutton for punishment, I went looking some more.
mm-win-81-{32,64}-{1..4}
Two interesting things:
-32-4 showed no spike, looked like normal windows activity for the last 24 hours.
-64-3 showed the same aberrant behavior, but not associated with this spike, ended around 3AM eastern - and when it ended it did so with a spike to memory and network usage on this vm only - which might be indicative of nothing more than returning to function - and indeed appears to be when the VM was rebooted.
Other than that, the others all showed the network spike at the same time that -8-64-2 fell over.
Again, not finding anything that points to ESX or its hardware as being involved. Just wanted to update with anything further. Let me know how I can assist further.
Assignee | ||
Comment 16•10 years ago
|
||
Thank you Chris for looking into that! Looks like I will have to dig into this problem and figure out what's wrong. What is interesting is why ESX tells us that the machine has 0MB of RAM in use. How can that be? Is there a disconnect between ESX and the VM? I might indeed wanna upgrade VMware tools for those two affected machines first and check how it goes.
Assignee: server-ops-virtualization → nobody
Component: Server Operations: Virtualization → Infrastructure
Product: mozilla.org → Mozilla QA
QA Contact: cshields → hskupin
Version: other → unspecified
Assignee | ||
Updated•10 years ago
|
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Assignee | ||
Comment 17•10 years ago
|
||
Interestingly, the vSphere client shows me that VMware tools is not installed on mm-win-8-64-2. How can this be? I installed it now, and will continue to watch that machine. I will also check all the other hosts for this.
Assignee | ||
Comment 18•10 years ago
|
||
After installing VMware tools on mm-win-8-64-2 I brought back the machine. I checked the Event logs but nothing in there showed something suspicious. I will leave the resource monitor running on that box which might give us an idea what's going on.
I will check mm-win-81-64-3 now, and apply the same changes if necessary.
Assignee | ||
Comment 19•10 years ago
|
||
mm-win-81-64-3 is back with upgraded VMware tools. Let's see how both machines will behave today.
Chris, I thought that you can get VMware tools installed/upgraded during a client reboot/startup. Is that not the case? Or would we have to set a special option for that in all of our VMs? Would that be wise to do?
Flags: needinfo?(cknowles)
Comment 20•10 years ago
|
||
So several questions happened overnight ...
64-2: while it is locked up, I don't fully trust the stats coming out. Though, frankly, if it had been hardware we simply wouldn't know that the CPU was at 100% while the memory was at 0%, because we can't look inside while it's locked. Only because it's a VM do we have any information.
Similarly, the Tools are a process running on the system - when it's completely locked up, ESX will report the tools as not running, because it can't make contact with the tools to query its state.
Also, it's not that you can upgrade *while* rebooting, but that for windows, a reboot is required for full installation. Sadly with windows the best thing is to do the install - you can *try* the following mana page, which should work without a reboot, but as in most things windows, if in danger or in doubt, throw in a reboot. https://mana.mozilla.org/wiki/display/SYSADMIN/VMware+Tools+Installs#Advanced%20Trickery
Flags: needinfo?(cknowles)
Assignee | ||
Comment 21•10 years ago
|
||
Talked with Chris on IRC and we have to wait here. I hope it will happen again so that we can analyze the problem, ideally at a time when I'm around. So let's cross our fingers.
Assignee | ||
Comment 23•10 years ago
|
||
I also upgraded VMware tools on win-81-32 and brought it back online. I will most likely upgrade all VMware tools later today for at least all the Windows nodes on staging.
Assignee | ||
Comment 24•10 years ago
|
||
I have updated all Windows machines in staging with the newest version of VMware tools now. Let's see how they behave over the next few days. If all is fine I will also update the tools for our production machines.
Comment 25•10 years ago
|
||
Earlier today you'd asked for changelogs for the vmware tools.
Sadly there doesn't appear to be *one* document for all changes to the tools. The tools are usually wrapped in with an ESXi upgrade - So, here's what I found.
First, the list of all recent versions of tools (first column) with the corresponding ESXi version (second column)
http://packages.vmware.com/tools/versions
Then search for "vmware tools release notes" on Google - for example, version 9354 is from 5.5 Update 2.
The changelog for this latest version is https://www.vmware.com/support/vsphere5/doc/vsphere-esxi-55u2-release-notes.html#resolvedissuesvmwaretools
Not the easiest/best way to find info, but they really expect you to come at them with specific error messages it seems.
Comment 26•10 years ago
|
||
It looks like we have the same problem on Ubuntu machines (http://mm-ci-master.qa.scl3.mozilla.com:8080/computer/mm-ub-1310-64-1/)
Assignee | ||
Comment 27•10 years ago
|
||
Given that a process at 100% CPU load is not the same as a VM that does not respond at all, I would really like to get the new Ubuntu issue moved out into a separate bug. Thanks.
Assignee | ||
Updated•10 years ago
|
Attachment #8516533 -
Attachment is obsolete: true
Comment 28•10 years ago
|
||
Unable to connect to mm-win-8-32-2 machine
Assignee | ||
Comment 29•10 years ago
|
||
Fixed that machine and also updated VMware Tools. It is really suspicious that all of this is starting just now. We have never seen such things in the last two years.
Chris, have any changes been made to the ESX cluster or software lately which could have caused this?
Flags: needinfo?(cknowles)
Comment 30•10 years ago
|
||
There was a UCS update about 2 months ago, and standard security patches are applied monthly, but nothing in the last 2 weeks - and certainly nothing that should cause this sort of thing.
We have many other windows and ubuntu VMs that aren't experiencing any degradations like those reported here. Also, just to rule out hosts, there are several other mm-* VMs on this particular host that appear to be functioning well.
Have any increased logging steps been taken on the clients to see what they're doing just prior to the failure?
Flags: needinfo?(cknowles)
Assignee | ||
Comment 31•10 years ago
|
||
As we discussed earlier, this only happens to Windows machines, and there seems to be no way to increase any kind of logging. Everything is halted and nothing gets added to the Event log.
Anyway, meanwhile I really would like to go ahead and upgrade the VMware tools for all the Windows machines. I might wanna do it after the next release tests.
Assignee | ||
Comment 32•10 years ago
|
||
It looks like every Windows node below Windows 8 does not have VMware tools installed at all! I'm doing that now for all of them.
Assignee | ||
Comment 33•10 years ago
|
||
All Windows machines got VMware tools upgraded or installed. While doing that I found that there is a checkbox inside the VM settings to allow an auto-upgrade of VMware tools during startup. That was the reason why some machines already had the latest version of it installed. I will update all VMs now and enable that checkbox. So whenever a VM gets restarted it will check for a newer release of VMware tools automatically.
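A small sketch of how the VMs still needing that checkbox could be listed. It assumes the per-VM setting has already been exported into a plain mapping; to my knowledge the checkbox corresponds to the vSphere API's `toolsUpgradePolicy` values `manual` and `upgradeAtPowerCycle`, but treat that mapping, the helper name, and the input format as assumptions.

```python
from typing import Dict, List

# Policy value assumed to match the "check and upgrade Tools during power
# cycling" checkbox in the VM settings.
UPGRADE_AT_POWER_CYCLE = "upgradeAtPowerCycle"


def vms_needing_policy_change(policies: Dict[str, str],
                              wanted: str = UPGRADE_AT_POWER_CYCLE) -> List[str]:
    """Return the names of VMs whose tools upgrade policy is not `wanted`.

    `policies` maps a VM name to its current policy string (hypothetical
    export format).
    """
    return sorted(vm for vm, policy in policies.items() if policy != wanted)
```

Running this over the exported settings before and after the change would give a quick confirmation that no VM was missed.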
Whiteboard: [qa-automation-blocked]
Comment 34•10 years ago
|
||
And per an IRC conversation, updated all the QA windows templates to have this setting, so that new machines going forward will have this.
Assignee | ||
Comment 35•10 years ago
|
||
All VM settings have been updated, so machines should stay on the latest version.
I'm not sure whether the changes made on this bug will prevent the misbehavior we have seen. At least I'm not sure what else I could do right now. I will close this bug as fixed for now, but if machines start failing again, please reopen it and I will dig further.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Comment 36•10 years ago
|
||
Not sure if this is the case here, but I was checking all Windows nodes for the Java auto-update settings and the Jenkins icon on the desktop, and I couldn't reconnect to this one: mm-win-81-64-4.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 37•10 years ago
|
||
Looking at the CPU/Memory graphs, it doesn't show the same 100% CPU, 0% memory issue as the original complaint, but I'll leave the final determination to :whimboo
Comment 38•10 years ago
|
||
Indeed, filed bug 1097755 for this, thanks!
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Updated•6 years ago
|
Product: Mozilla QA → Mozilla QA Graveyard