Closed Bug 1090282 Opened 10 years ago Closed 10 years ago

Several Windows 8 VMs are freezing (0 MB memory usage - 100% CPU) and are not accessible anymore

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: cosmin-malutan, Assigned: whimboo)

Attachments

(4 files, 1 obsolete file)

screenshot (cpu %) 10 years ago Henrik Skupin [:whimboo][⌚️UTC+1] 43.01 KB, image/jpeg		Details
screenshot (cpu mhz) 10 years ago Henrik Skupin [:whimboo][⌚️UTC+1] 46.14 KB, image/jpeg		Details
screenshot (memory usage) 10 years ago Henrik Skupin [:whimboo][⌚️UTC+1] 51.19 KB, image/jpeg		Details
EventView.txt 10 years ago Cosmin Malutan, [:cosmin-malutan] 9.40 KB, text/plain		Details
screenshots ubuntu 13.10_64 10 years ago Mihaela Velimiroviciu (:mihaelav) 244.33 KB, application/zip		Details

Cosmin Malutan, [:cosmin-malutan]

Reporter

Description

•

10 years ago

Node went offline and I can't access it via VPN; it now needs to be restarted.

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 1

•

10 years ago

The box is back. It was hanging with 100% cpu load. You might want to check tomorrow what could have been the cause of it. 

Given that the Jenkins link is broken on the desktop, Andreea will reconnect the box to Jenkins now.

Flags: needinfo?(andreea.matei)

Andreea Matei [:AndreeaMatei]

Comment 2

•

10 years ago

Node is back online.

Flags: needinfo?(andreea.matei)

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 3

•

10 years ago

Thanks.

Status: ASSIGNED → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 4

•

10 years ago

I had to restart the node once more because it was not responding. Cosmin please investigate this problem and let us know what you can find. I will attach some screenshots in a bit.

Status: RESOLVED → REOPENED

Flags: needinfo?(cosmin.malutan)

Resolution: FIXED → ---

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 5

•

10 years ago

Attached image screenshot (cpu %) — Details

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 6

•

10 years ago

Attached image screenshot (cpu mhz) — Details

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 7

•

10 years ago

Attached image screenshot (memory usage) — Details

Cosmin Malutan, [:cosmin-malutan]

Reporter

Comment 8

•

10 years ago

Attached file EventView.txt — Details

Here is the log from Event Viewer; I'll check this further in a bit.

Cosmin Malutan, [:cosmin-malutan]

Reporter

Comment 9

•

10 years ago

I couldn't find what caused this, but the security updates might fix this.
Otherwise the closest issue I found to this is the one below, which would require disabling the VMCI driver.
https://communities.vmware.com/message/2379497#2379497

Flags: needinfo?(cosmin.malutan)

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 11

•

10 years ago

The same seems to have happened today to mm-win-8-64-2. I will check that now.

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Updated

•

10 years ago

Summary: Node mm-win-81-64-3 from scl3 can't be accessed via VPN → Several Windows 8 nodes are freezing and are not accessible anymore

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 12

•

10 years ago

I do not think that there is anything I can actually do here. Something is going totally crazy with our VMs. Memory usage goes down to 0 B while CPU is at 100%. This looks very much like an ESX issue.

Greg or Chris, can one of you help us here?

Assignee: hskupin → server-ops-virtualization

Severity: normal → critical

Status: REOPENED → NEW

Component: Infrastructure → Server Operations: Virtualization

Flags: needinfo?(gcox)

Flags: needinfo?(cknowles)

Product: Mozilla QA → mozilla.org

QA Contact: hskupin → cshields

Version: unspecified → other

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Updated

•

10 years ago

Whiteboard: [qa-automation-blocked]

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Updated

•

10 years ago

Summary: Several Windows 8 nodes are freezing and are not accessible anymore → Several Windows 8 VMs are freezing (0 MB memory usage - 100% CPU) and are not accessible anymore

Chris Knowles [:cknowles]

Comment 13

•

10 years ago

Can I get a complete list of the affected VMs so I can research? That VMCI issue is likely fixed in the latest VMtools, but I don't know what you're running until I have that list of affected guests.

Chris Knowles [:cknowles]

Comment 14

•

10 years ago

Alright, did some poking around.

I'm assuming that the list of potential victims is:

mm-win-8-64-{1..4} and 
mm-win-8-32-{1..4}

Only mm-win-8-64-2 is currently showing the issue - it's completely locked, there's nothing I can see in there - feel free to reboot it, as I can't do any diagnosis on a locked up box.

Looking at the performance, I see something mildly interesting.  ALL of the above vms show a LARGE network spike accompanied by a smaller but still significant CPU and memory spike at ~10:40AM (Eastern) - looks like most of the mm VMs gulped whatever this was down and continued about their day - but mm-win-8-64-2 didn't recover - it went into that high cpu/no ram pattern, and is completely locked.

From an ESX standpoint, this doesn't appear to be anything systemic.  The VMs are on several different hosts, and a spot check of the environment doesn't show a network (or CPU/RAM) spike on any other vms at that time, and things appear to be healthy outside this group.

About the only ESX-ish advice I can give you is that the VMtools are *slightly* out of date.  (version 9354 is current, 9344 appears to be the version on there)  That said, this totally shouldn't be doing that to you. 

That's about all I can see, let me know if I'm looking in the wrong direction, or if I can clear anything up for you.
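
The "high CPU / no RAM" pattern described above could be checked for mechanically. The following is only a minimal sketch: the `Sample` type, the thresholds, and the sample values are hypothetical, and it assumes the per-VM CPU and memory figures have already been exported from the performance graphs by some other means.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One hypothetical performance reading for a VM."""
    vm: str
    cpu_percent: float
    memory_mb: float


def frozen_vms(samples, cpu_threshold=99.0, memory_threshold_mb=1.0, min_consecutive=3):
    """Return VMs whose latest readings all show ~100% CPU and ~0 MB memory.

    The thresholds and the number of consecutive readings are assumptions,
    not values taken from ESX.
    """
    by_vm = {}
    for sample in samples:
        by_vm.setdefault(sample.vm, []).append(sample)

    flagged = []
    for vm, series in by_vm.items():
        tail = series[-min_consecutive:]
        if len(tail) == min_consecutive and all(
            s.cpu_percent >= cpu_threshold and s.memory_mb <= memory_threshold_mb
            for s in tail
        ):
            flagged.append(vm)
    return sorted(flagged)


# Made-up readings for illustration only.
readings = (
    [Sample("mm-win-8-64-2", 100.0, 0.0)] * 3
    + [Sample("mm-win-8-64-1", 35.0, 2048.0)] * 3
)
print(frozen_vms(readings))
```

Requiring several consecutive readings avoids flagging a VM for a single short CPU spike, which most of the mm VMs recovered from.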

Flags: needinfo?(gcox)

Flags: needinfo?(cknowles)

Chris Knowles [:cknowles]

Comment 15

•

10 years ago

And because I'm a glutton for punishment, I went looking some more.

mm-win-81-{32,64}-{1-4}

two interesting things - 
-32-4 showed no spike, looked like normal windows activity for the last 24 hours.  
-64-3 showed the same aberrant behavior, but not associated with this spike, ended around 3AM eastern - and when it ended it did so with a spike to memory and network usage on this vm only - which might be indicative of nothing more than returning to function - and indeed appears to be when the VM was rebooted.

Other than that, the others all showed the network spike at the same time that -8-64-2 fell over.  

Again, not finding anything that points to ESX or its hardware as being involved.  Just wanted to update with anything further.  Let me know how I can assist further.
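
For reference, the shorthand node patterns in the last two comments (mm-win-8-... and mm-win-81-..., 32 and 64 bit, 1 to 4) expand to sixteen hostnames. A small sketch of that expansion, where the function name and its defaults are illustrative only:

```python
from itertools import product


def expand_nodes(versions=("8", "81"), arches=("64", "32"), indices=range(1, 5)):
    """Expand the mm-win-{version}-{arch}-{index} shorthand into hostnames."""
    return [
        f"mm-win-{version}-{arch}-{index}"
        for version, arch, index in product(versions, arches, indices)
    ]


nodes = expand_nodes()
print(len(nodes))
print(nodes[:4])
```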

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 16

•

10 years ago

Thank you Chris for looking into that! Looks like I will have to dig into this problem and figure out what's wrong. What is interesting is why ESX tells us that the machine has 0 MB of RAM in use. How can that be? Is there a disconnect between ESX and the VM? I might indeed wanna upgrade VMware tools for those two affected machines first and check how it goes.

Assignee: server-ops-virtualization → nobody

Component: Server Operations: Virtualization → Infrastructure

Product: mozilla.org → Mozilla QA

QA Contact: cshields → hskupin

Version: other → unspecified

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Updated

•

10 years ago

Assignee: nobody → hskupin

Status: NEW → ASSIGNED

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 17

•

10 years ago

Interestingly, the vSphere client shows me that VMware tools is not installed on mm-win-8-64-2. How can this be? I have installed it now and will continue to watch that machine. I will also check all the other hosts for it.

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 18

•

10 years ago

After installing VMware tools on mm-win-8-64-2 I brought the machine back. I checked the Event logs but nothing in there looked suspicious. I will leave the resource monitor running on that box, which might give us an idea of what's going on.

I will check mm-win-81-64-3 now, and apply the same changes if necessary.

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 19

•

10 years ago

mm-win-81-64-3 is back with upgraded VMware tools. Let's see how both machines behave today.

Chris, I thought that you can get VMware tools installed/upgraded during a client reboot/startup. Is that not the case? Or would we have to set a special option for that in all of our VMs? Would that be wise to do?

Flags: needinfo?(cknowles)

Chris Knowles [:cknowles]

Comment 20

•

10 years ago

So several questions happened overnight ...

64-2 - while locked up, I don't fully trust the stats coming out - Though, frankly if it had been hardware we simply wouldn't know if the CPU was at 100% while the memory was at 0% because we can't look inside while it's locked... only due to its being a VM do we have any information.

Similarly, the Tools are a process running on the system - when it's completely locked up, ESX will report the tools as not running, because it can't make contact with the tools to query its state.  

Also, it's not that you can upgrade *while* rebooting, but that for windows, a reboot is required for full installation.  Sadly with windows the best thing is to do the install - you can *try* the following mana page, which should work without a reboot, but as in most things windows, if in danger or in doubt, throw in a reboot.  https://mana.mozilla.org/wiki/display/SYSADMIN/VMware+Tools+Installs#Advanced%20Trickery

Flags: needinfo?(cknowles)

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 21

•

10 years ago

Talked with Chris on IRC and we have to wait here. I hope it happens again so we can analyze the problem, ideally at a time when I'm around. So let's cross our fingers.

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 23

•

10 years ago

I also upgraded VMware tools on win-81-32 and brought it back online. I will most likely upgrade all VMware tools later today for at least all the Windows nodes on staging.

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 24

•

10 years ago

I have now updated all Windows machines in staging with the newest version of VMware tools. Let's see how they behave over the next few days. If all is fine I will also update the tools for our production machines.

Chris Knowles [:cknowles]

Comment 25

•

10 years ago

Earlier today you'd asked for changelogs for the vmware tools.

Sadly there doesn't appear to be *one* document for all changes to the tools.  The tools are usually wrapped in with an ESXi upgrade - So, here's what I found.

First, the list of all recent versions of tools (first column) with the update ESXi version (second column)
http://packages.vmware.com/tools/versions

Then search for "vmware tools release notes" on google - for example, version 9354 is from 5.5 Update 2.

The changelog for this latest version is https://www.vmware.com/support/vsphere5/doc/vsphere-esxi-55u2-release-notes.html#resolvedissuesvmwaretools

Not the easiest/best way to find info, but they really expect you to come at them with specific error messages it seems.
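
Looking up which ESXi update a given tools version belongs to can be scripted once the versions listing has been downloaded. This is a rough sketch that assumes a whitespace-separated layout with the tools version in the first column and the ESXi release in the second, as described above; the excerpt below is made up for illustration (only the 9354 / 5.5 Update 2 pairing comes from this bug), and the real file may differ in detail.

```python
def parse_tools_versions(text):
    """Map tools version (first column) to ESXi release (second column).

    Blank lines and lines starting with '#' are skipped; any further
    columns are ignored. The column layout is an assumption.
    """
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        columns = line.split()
        if len(columns) < 2 or not columns[0].isdigit():
            continue
        mapping[int(columns[0])] = columns[1]
    return mapping


# Hypothetical excerpt; the second entry is a placeholder, not real data.
EXAMPLE_LISTING = """
# tools-version  esx-release
9354  esx/5.5u2
9344  esx/placeholder-older-release
"""

versions = parse_tools_versions(EXAMPLE_LISTING)
print(versions[9354])
```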

Mihaela Velimiroviciu (:mihaelav)

Comment 26

•

10 years ago

Attached file screenshots ubuntu 13.10_64 (obsolete) — Details

It looks like we have the same problem on Ubuntu machines (http://mm-ci-master.qa.scl3.mozilla.com:8080/computer/mm-ub-1310-64-1/)

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 27

•

10 years ago

Given that a process with 100% CPU load is not the same as a VM that does not respond at all, I would really like to get the new Ubuntu issue moved out into a separate bug. Thanks.

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Updated

•

10 years ago

Attachment #8516533 - Attachment is obsolete: true

Mihaela Velimiroviciu (:mihaelav)

Comment 28

•

10 years ago

Unable to connect to mm-win-8-32-2 machine

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 29

•

10 years ago

Fixed that machine and also updated VMware Tools. It is really suspicious that all of this is starting now. We have never seen such things in the last two years.

Chris, have any changes been made to the ESX cluster or software lately that could have caused this?

Flags: needinfo?(cknowles)

Chris Knowles [:cknowles]

Comment 30

•

10 years ago

There was a UCS update about 2 months ago, and standard security patches are applied monthly, but nothing in the last 2 weeks - and certainly nothing that should cause this sort of thing.
We have many other windows and ubuntu VMs that aren't experiencing any degradations like those reported here.  Also, just to rule out hosts, there are several other mm-* VMs on this particular host that appear to be functioning well.  

Have any increased logging steps been taken on the clients to see what they're doing just prior to the failure?

Flags: needinfo?(cknowles)

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 31

•

10 years ago

As we discussed earlier, this only happens on Windows machines, and there seems to be no way to increase any kind of logging. Everything is halted and nothing gets added to the Event log.

Anyway, meanwhile I really would like to go ahead and upgrade the VMware tools for all the Windows machines. I might wanna do it after the next release tests.

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 32

•

10 years ago

It looks like none of the Windows nodes below Windows 8 have VMware tools installed at all! I'm doing that now for all of them.

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 33

•

10 years ago

All Windows machines got VMware tools upgraded or installed. While doing that I found that there is a checkbox inside the VM settings to allow an auto-upgrade of VMware tools during startup. That was the reason why some machines already had the latest version of it installed. I will update all VMs now and enable that checkbox. So whenever a VM gets restarted it will check for a newer release of VMware tools automatically.
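
The checkbox mentioned above is, as far as I know, stored in a VM's .vmx file as `tools.upgrade.policy = "upgradeAtPowerCycle"` (the default being `manual`). Under that assumption, a read-only audit of an exported .vmx file might look like the sketch below; the helper names and the sample file contents are hypothetical.

```python
def parse_vmx(text):
    """Parse simple `key = "value"` lines from .vmx-style text into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Keys are lowercased so the lookup does not depend on spelling case.
        settings[key.strip().lower()] = value.strip().strip('"')
    return settings


def tools_auto_upgrade_enabled(vmx_text):
    """Return True if the tools upgrade policy is set to upgrade on power cycle."""
    policy = parse_vmx(vmx_text).get("tools.upgrade.policy", "manual")
    return policy.lower() == "upgradeatpowercycle"


# Hypothetical .vmx excerpt for illustration.
EXAMPLE_VMX = '''
displayName = "mm-win-8-64-2"
tools.upgrade.policy = "upgradeAtPowerCycle"
'''

print(tools_auto_upgrade_enabled(EXAMPLE_VMX))
```

The sketch only reads the setting; changing it is better done through the VM settings dialog, as was done here.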

Whiteboard: [qa-automation-blocked]

Chris Knowles [:cknowles]

Comment 34

•

10 years ago

And per an IRC conversation, updated all the QA windows templates to have this setting, so that new machines going forward will have this.

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 35

•

10 years ago

All VM settings have been updated, so the machines should stay on the latest version.

I'm not sure whether the changes made on this bug will prevent the misbehavior we have seen, but I don't know what else I could do right now. I will close this bug as fixed for now. If machines start failing again, please re-open it and I will dig further.

Status: ASSIGNED → RESOLVED

Closed: 10 years ago → 10 years ago

Resolution: --- → FIXED

Andreea Matei [:AndreeaMatei]

Comment 36

•

10 years ago

Not sure if this is the same issue, but while checking all Windows nodes for the Java auto-update settings and the Jenkins icon on the desktop, I couldn't reconnect to this one: mm-win-81-64-4.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Chris Knowles [:cknowles]

Comment 37

•

10 years ago

Looking at the CPU/Memory graphs, it doesn't show the same 100% CPU, 0% memory issue described in the original complaint, but I'll leave the final determination to :whimboo

Andreea Matei [:AndreeaMatei]

Comment 38

•

10 years ago

Indeed, filed bug 1097755 for this, thanks!

Status: REOPENED → RESOLVED

Closed: 10 years ago → 10 years ago

Resolution: --- → FIXED

BMO Automation

Updated

•

6 years ago

Product: Mozilla QA → Mozilla QA Graveyard
