Closed Bug 916746 Opened 6 years ago Closed 6 years ago

System crash affecting all mm-win-81-64-* nodes

Categories

(Infrastructure & Operations :: Virtualization, task, critical)

Hardware: x86_64
OS: Windows 8.1
Type: task
Priority: Not set
Severity: critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: davehunt, Assigned: whimboo)

Details

(Keywords: crash, Whiteboard: [qa-automation-blocked])

The Windows 8.1 node (mm-win-81-64-2) is crashing, causing the machine to reboot and to lose connection with the Mozmill CI.

The following is from the Application log in Event Viewer:

Fault bucket 0x109_7, type 0
Event Name: BlueScreen
Response: http://wer.microsoft.com/responses/resredir.aspx?sid=10&Bucket=0x109_7&State=1&ID=db2416d2-59a8-432f-81b8-66a020b69edf
Cab Id: db2416d2-59a8-432f-81b8-66a020b69edf

Problem signature:
P1: 109
P2: a3a01f5891f1e508
P3: b3b72bdee471e5e0
P4: c0000103
P5: 7
P6: 6_3_9431
P7: 0_0
P8: 256_1
P9: 
P10: 

Attached files:
C:\Windows\Minidump\091613-19828-01.dmp
C:\Users\mozauto\AppData\Local\Temp\WER-30281-0.sysdata.xml
C:\Windows\MEMORY.DMP
C:\Users\mozauto\AppData\Local\Temp\WER7C54.tmp.WERInternalMetadata.xml

These files may be available here:
C:\ProgramData\Microsoft\Windows\WER\ReportArchive\Kernel_109_4de57b2f902d8667562aecbd8d3ba413a08ef134_00000000_cab_0b289fe9

Analysis symbol: 
Rechecking for solution: 0
Report Id: 091613-19828-01
Report Status: 0
Hashed bucket: 

If we can't resolve this, then we may need to recreate the VM from the template.
As far as I can see, the crash happens in the kernel module, specifically in csrss.exe. That sounds kinda bad. I searched for reports of similar behavior but haven't found anything yet.

I would say let's get it recreated from the template. Given that the other machines are not affected yet, we might solve it without having to spend too much time on it.
Severity: normal → critical
Flags: needinfo?(afernandez)
Perhaps updating the Win8.1 VMs to RTM would alleviate this and other issues we're having? I believe it's available to MSDN subscribers right now.
(In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #2)
> Perhaps updating the Win8.1 VMs to RTM would alleviate this and other issues
> we're having? I believe it's available to MSDN subscribers right now.

Adrian, do you have an idea if we have the ability to do that currently? If that is the case it could be a solution. Not sure how far behind the previews currently are.
Still not sure what's causing this, but I feel it could be UltraVNC. The OS crashes occur frequently when running endurance tests, so I have uninstalled UltraVNC on mm-win-81-64-1 to test whether they stop.
Assignee: nobody → hskupin
Summary: Crash affecting mm-win-81-64-2 → Crash affecting mm-win-81-64-1 and mm-win-81-64-2
Whiteboard: [qa-automation-blocked]
We do have the ability to download RTM. Seems a few were released yesterday.

Which version do you want?
URL: https://msdn.microsoft.com/en-US/subscriptions/securedownloads/hh442898#searchTerm=&ProductFamilyId=524&Languages=en&PageSize=10&PageIndex=0&FileId=0

Windows 8.1 Pro N VL (x86)
Windows 8.1 Pro N VL (x64)

or perhaps both?
Flags: needinfo?(afernandez)
I don't have the credentials to sign into the page. So I cannot tell which versions are available. But in general we want both, the 32bit and 64bit version. They should replace the current set of Win8.1 nodes.

Something weird I have noticed is that those crashes / restarts only happen when you are NOT connected to the machine via RDP or VNC. Whenever I'm connected, that behavior stops. I will do an extensive check today and keep the RDP connection active for a couple of hours. I hope that the final versions are of better quality, but well, it's MS.
Oh, and if we go with new VMs now, we should get a new bug filed for their creation. It will then block this crasher bug. Thanks Adrian!
Every time this crash / restart happens, the Software Protection service has just run. The two look closely correlated.
I have disabled the service by setting HKLM\SYSTEM\CurrentControlSet\Services\Start from 2 to 4. Let's see if that helps on the 81-64-1 machine.
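For reference, here is a minimal sketch of what that registry change amounts to. The `Start` value lives under a per-service key, and the path quoted above does not name the service. The key name `sppsvc` used below is therefore an assumption (it is the usual short name of the Software Protection service). `build_reg_command` is a hypothetical helper that only formats a `reg add` command line; it does not touch the registry.

```python
# Service start types as stored in the REG_DWORD value "Start".
START_TYPES = {
    0: "boot",
    1: "system",
    2: "automatic",
    3: "manual",
    4: "disabled",
}

SERVICES_ROOT = "HKLM\\SYSTEM\\CurrentControlSet\\Services"


def build_reg_command(service: str, start: int) -> str:
    """Return a `reg add` command line that sets a service's Start value.

    Hypothetical helper: it only builds the string, it does not run it.
    """
    if start not in START_TYPES:
        raise ValueError(f"unknown start type: {start}")
    key = SERVICES_ROOT + "\\" + service
    return f'reg add "{key}" /v Start /t REG_DWORD /d {start} /f'


if __name__ == "__main__":
    # Assumption: "sppsvc" is the key name of the Software Protection service.
    print(START_TYPES[2], "->", START_TYPES[4])
    print(build_reg_command("sppsvc", 4))
```

Going from 2 to 4 switches the service from automatic start to disabled. The change takes effect on the next boot.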
Status: NEW → ASSIGNED
It looks like disabling real-time protection in Windows Defender (bug 918676) helped here. I will continue to watch.
Depends on: 918676
As of yesterday all three machines experienced a lot of critical failures and rebooted a couple of times with:

The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.
Summary: Crash affecting mm-win-81-64-1 and mm-win-81-64-2 → Crash affecting all mm-win-81-64-* nodes
We will re-investigate this bug once the final 8.1 VMs are up. That task is tracked on bug 919618.
Depends on: 919618
Whiteboard: [qa-automation-blocked] → [qa-automation-blocked][blocked by bug 919618]
New machines are up and running. The crash should hopefully be gone now. If it reappears we will reopen the bug.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
OS: Windows 8 → Windows 8.1
Resolution: --- → FIXED
Sad face... We got the first system level crash again for the mm-win-81-64-1 machine:

The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.

Adrian, any idea what this could be?
Status: RESOLVED → REOPENED
Keywords: crash
Resolution: FIXED → ---
Flags: needinfo?(afernandez)
There is absolutely nothing I can do here. The other system, which crashed just now while I was simply navigating in the Event Viewer, is mm-win-81-64-3. This situation is not tolerable and we have to get this fixed as quickly as possible. I have no idea what's going on here. :/
Status: REOPENED → NEW
Component: Infrastructure → Server Operations: Virtualization
Product: Mozilla QA → mozilla.org
QA Contact: dparsons
Summary: Crash affecting all mm-win-81-64-* nodes → System crash affecting all mm-win-81-64-* nodes
Version: unspecified → other
Here is the list of system crashes for mm-win-81-64-3:

Critical	10/17/2013 11:23:06 PM	Kernel-Power	41	(63)
Critical	10/17/2013 11:14:44 PM	Kernel-Power	41	(63)
Critical	10/17/2013 9:57:57 PM	Kernel-Power	41	(63)
Critical	10/17/2013 8:18:06 PM	Kernel-Power	41	(63)
Critical	10/17/2013 7:41:42 PM	Kernel-Power	41	(63)
Critical	10/17/2013 7:08:01 PM	Kernel-Power	41	(63)
Critical	10/17/2013 6:02:33 PM	Kernel-Power	41	(63)

After the first system crash and restart the node was no longer connected to our CI. So anything related to Java and Jenkins is most likely unrelated.
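To make the timing argument concrete, here is a small sketch that pulls the timestamps out of the Kernel-Power list above and prints the gaps between crashes. The function names are made up for illustration, and the timestamp format is assumed to be the US-style one shown in the Event Viewer export.

```python
import re
from datetime import datetime

# Kernel-Power 41 entries for mm-win-81-64-3, copied from the list above.
EVENT_LOG = """
Critical 10/17/2013 11:23:06 PM Kernel-Power 41 (63)
Critical 10/17/2013 11:14:44 PM Kernel-Power 41 (63)
Critical 10/17/2013 9:57:57 PM Kernel-Power 41 (63)
Critical 10/17/2013 8:18:06 PM Kernel-Power 41 (63)
Critical 10/17/2013 7:41:42 PM Kernel-Power 41 (63)
Critical 10/17/2013 7:08:01 PM Kernel-Power 41 (63)
Critical 10/17/2013 6:02:33 PM Kernel-Power 41 (63)
"""

TIMESTAMP = re.compile(r"\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M")


def crash_times(log):
    """Return all crash timestamps found in the log, oldest first."""
    found = TIMESTAMP.findall(log)
    return sorted(datetime.strptime(s, "%m/%d/%Y %I:%M:%S %p") for s in found)


def gaps(times):
    """Return the time between each crash and the next one."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    times = crash_times(EVENT_LOG)
    print(f"{len(times)} crashes, {len(times) - 1} of them after the first one")
    for gap in gaps(times):
        print(gap)
```

Every crash after the first one happened while the node was disconnected from the CI, so the gaps are between crashes on an otherwise idle machine.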
The computer has rebooted from a bugcheck.  The bugcheck was: 0x00000109 (0xa3a01f5891f63635, 0xb3b72bdee4763734, 0x00000000c0000103, 0x0000000000000007). A dump was saved in: C:\Windows\MEMORY.DMP. Report Id: 101713-14546-01.

So I found some details here:
http://msdn.microsoft.com/en-us/library/windows/hardware/ff557228%28v=vs.85%29.aspx

Going by the last bugcheck parameter, it is:
0x00000007 	A critical MSR modification
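As a rough sketch of how that event message breaks down: the snippet below splits it into the bugcheck code and its four parameters and looks up the last one. The helper names are hypothetical. The lookup table contains only the one value quoted in this bug, not the full list from the MSDN page. The name CRITICAL_STRUCTURE_CORRUPTION for bug check 0x109 comes from Microsoft's bug check reference.

```python
import re

# Event message as logged on mm-win-81-64-3 (quoted above).
MESSAGE = (
    "The computer has rebooted from a bugcheck.  The bugcheck was: "
    "0x00000109 (0xa3a01f5891f63635, 0xb3b72bdee4763734, "
    "0x00000000c0000103, 0x0000000000000007)."
)

BUGCHECK_NAMES = {0x109: "CRITICAL_STRUCTURE_CORRUPTION"}

# Parameter 4 of bug check 0x109 is the type of corrupted region.
# Only the value seen in this bug is listed here.
CORRUPTED_REGION_TYPES = {0x7: "A critical MSR modification"}

BUGCHECK_RE = re.compile(r"The bugcheck was: (0x[0-9a-fA-F]+) \(([^)]*)\)")


def parse_bugcheck(message):
    """Return (code, [param1..param4]) parsed from an event log message."""
    match = BUGCHECK_RE.search(message)
    if match is None:
        raise ValueError("no bugcheck found in message")
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    return code, params


def describe(code, params):
    """Return a one-line summary of the bugcheck."""
    name = BUGCHECK_NAMES.get(code, "unknown")
    region = CORRUPTED_REGION_TYPES.get(params[3], "not in this table")
    return f"{code:#x} {name}: {region}"


if __name__ == "__main__":
    code, params = parse_bugcheck(MESSAGE)
    print(describe(code, params))
```

The code and the last two parameters match the P1, P4 and P5 values in the problem signature from the original report, so this looks like the same class of crash as on mm-win-81-64-2.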

Given that it is a VM I would say we are not affected by memory issues. Especially because the other 64bit nodes are also affected. So I assume it's highly correlated to a driver issue again.

Adrian or someone else, can you please check that? It might be related to the VMware tools running in our VMs. Are those up to date or do we have to update them for Windows 8.1?
Whiteboard: [qa-automation-blocked][blocked by bug 919618] → [qa-automation-blocked]
Could it be the NIC which is selected for those VMs again? I can remember that we already had similar issues a long time ago, and in that case you had to change the NIC in ESX.
For now I updated the host to store a full memory dump to disk. Once it crashes again, we might want to gather more information by following the steps described here:

http://www.chicagotech.net/netforums/viewtopic.php?p=1574&sid=9aab7d1b06857538f7fff848b3baa0e7#p1574
I do not see that the VMware tools are running inside mm-win-81-64-3 right now. It does say that the installed version is the latest, however. I'm unable to identify any issues on the ESX side that could be causing this problem, however Windows 8 support for ESX is very new. 

The NIC type is VMXNET 3, which I've never seen stability issues with, however again, Windows 8 support is very new. 

I wish I had more to offer you but I can't see anything I can do on my end.
Dan, this is Windows 8.1 and not Windows 8. Over the last months we never had issues with the Windows 8 64bit machines. I will try to find the bug in which Adrian had to change the NIC.
Oh, I found it on bug 885599 comment 13. It looks like he already tried to use another NIC but that didn't stop the crash. So what could be the cause? Do we have to install the VMware tools, given that those are not running yet?
:whimboo, without much more info it would be very hard to track down the exact cause, especially since we are dealing with a recently released OS.

However, as a process of elimination, you could turn off vmware tools and see if that's what's causing the issue.

Seems the VMs only go down when something is running on them and in this case would be the software suite you run on top of it (jenkins/java etc).

If there was a clear procedure that recreates the issue, then it would be easier to troubleshoot and perhaps fix, as we could then actually test our potential "fix".

I'll see if I have some spare cycles and deploy another VM so that I could run some benchmarks and see if any load causes the issue OR if it's a specific load and/or process/software.

As of right now, I see that all the 8.1 VMs are up so hopefully this is not delaying any work unless the jenkins jobs were stopped.
Flags: needinfo?(afernandez)
(In reply to Adrian Fernandez [:Aj] from comment #23)
> :whimboo, without much more info it would be very hard to track down the
> exact cause, especially since we are dealing with a recently released OS.

Well, compared to the crashes we have already seen while setting up the preview release VMs there is nothing more I can add here. Something system level is causing this problem. Our tools don't install any drivers which could interact with the system in such a bad way.

> However, as a process of elimination, you could turn off vmware tools and
> see if that's what's causing the issue.

Can't you do this via the vSphere tool? Not sure how I could turn it off from inside a VM.

> Seems the VMs only go down when something is running on them and in this
> case would be the software suite you run on top of it (jenkins/java etc).

No, that's not the case. Please see the above comment of mine and the crash reports. Only for the first crash in that table Jenkins was running. But a node doesn't get auto-connected after a restart. So any further crashes happened while the machine was without any load.

> If there was a clear procedure that recreates the issue, then it would be
> easier to troubleshoot and perhaps fix, as we could then actually test our
> potential "fix".

Right, but before we have a fix we will have to figure out what's wrong. So let's take baby steps and start with VMware tools, as you have suggested.

> As of right now, I see that all the 8.1 VMs are up so hopefully this is not
> delaying any work unless the jenkins jobs were stopped.

The VMs are up because they restart automatically. But 2 of them have not been re-connected to Jenkins, and I stopped tests on those VMs early yesterday morning (European time). So any crashes that happened afterward were not caused by Jenkins.
At least for the few FQDN hostnames you've listed in this bug, the VMware tools are _not_ running. VMware Tools have never caused a crash like this, but I have seen them prevent crashes. So if anything, I suggest you _start_ them, and yes, that is something that is done from inside the guest OS, not from vSphere Client.

Finally, again, Windows 8.1 support is "beta" at best, and I'm sorry we are unable to do better for you, but again, beta.
(In reply to Dan Parsons [:lerxst] from comment #25)
> At least for the few FQDN hostnames you've listed in this bug, the VMware
> tools are _not_ running. VMware Tools have never caused a crash like this,
> but I have seen them prevent crashes. So if anything, I suggest you _start_
> them, and yes, that is something that is done from inside the guest OS, not
> from vSphere Client.

Both machines on which I have seen the crashes so far, 64-1 and 64-3, have the tools installed and a tray icon is visible. So I assume those are running. For 64-3 I have selected Exit from its context menu, so I'm not sure whether the service has been stopped or only the tray app has been closed.

> Finally, again, Windows 8.1 support is "beta" at best, and I'm sorry we are
> unable to do better for you, but again, beta.

Beta in terms of ESX support? If yes, have other users reported similar problems in VMware's support forum?
No one is virtualizing Windows 8.1 at scale yet because problems like this are common. This has been the case for every new OS release from every vendor, not just Windows, but even the latest Fedora has always been shaky under virtualization.

The main part of VMware tools isn't the tray service but the drivers it loads, particularly the networking driver. Without the networking driver, you have to use the virtual NIC as an Intel e1000 chipset, and that has higher overhead costs, and those costs are at the level that we do not support it as a long-term solution due to the significantly higher impact it has on cluster resources.
So what would have to be done? I'm kinda lost at the moment, sorry.
We are in uncharted territory. That means a couple of things. One is that most solutions will be novel and invented by you or me. Another is that the amount of support this endeavor takes is disproportionate to 99% of the other things I have to support. If I had infinite time, and access to both the ESXi and Windows 8.1 source code, I'd patch this for you.

Can your needs be met with physical hardware?
I'm fine with everything which would allow us to run tests on Windows 8.1 64bit. Even if it is temporary and we know that we have to go back. I think it should also be fine for Anthony given that QA really wants to run our tests on that platform. Let us also get feedback from him.
Flags: needinfo?(anthony.s.hughes)
Update on this?
(In reply to Dan Parsons [:lerxst] from comment #31)
> Update on this?

I need to discuss with my peers what the best way forward is (ie. deploying temporary hardware vs just waiting for ESX support). I will get back to you asap.
Flags: needinfo?(anthony.s.hughes)
Dan, the following support article from the VMware website lists a similar issue with Windows 8.1. I wonder if that could be related to the problem we are facing here:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2060019
:whimboo, this fix looks like it's worth testing. Can you give me the name of a specific VM of yours to try it on? If it works well I can apply it to all of them.
(In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #32)
> (In reply to Dan Parsons [:lerxst] from comment #31)
> > Update on this?
> 
> I need to discuss with my peers what the best way forward is (ie. deploying
> temporary hardware vs just waiting for ESX support). I will get back to you
> asap.

After some discussion it's been decided that we can live without Win 8.1 64-bit automation until VMWare ESX supports it. Assuming we aren't waiting more than a month or two we should be okay with manual spotchecking and dogfooding builds, in addition to Win 8.1 32-bit automation. We don't feel it's worth the cost and effort to stand up hardware that will only be used for a very short time.

Thank you
(In reply to Dan Parsons [:lerxst] from comment #34)
> :whimboo, this fix looks like it's worth testing. Can you give me the name
> of a specific VM of yours to try it on? If it works well I can apply it to
> all of them.

Dan, please update the settings for mm-win-81-64-1. That's the machine which runs all the tests and I would like to see how it works. mm-win-81-64-3 also crashes a lot, in case the fix can be applied quickly. Thanks.

If that works we would be fine. Otherwise let's wait and I should file a bug on VMware's support website.
I've applied the settings from the KB page to the two VMs you mentioned. They're booting back up now. Let me know if you notice any difference in stability.
I brought both nodes online and will run some endurance tests to stress the OS. Let's see if that works.
We haven't had any crashes yesterday or today. I have re-enabled pulse triggered tests for the 8.1 64bit nodes. If we don't see crashes by the end of the week, we might also apply this change to the other VMs and document it. Let's keep our fingers crossed that it really fixed it.
I ran the ondemand functional and betatest update tests on 25rc2 with Win 8.1 64-bit and did not encounter any unexpected errors or crashes.
As of now we got a single disconnect of mm-win-81-64-2, which crashed the same way. But that machine hasn't been modified with the above workaround. -1 and -3 are pretty stable. Dan, can we get the other two machines updated? Please let me know via IRC so we can schedule a good time. Thanks.
Please explicitly list the VMs you want updated. And how does 1PM Pacific time sound? (Friday)
Could we do it now if possible? I'm most likely not around at your proposed time.
As per IRC, just did mm-win-81-64-4 and mm-win-81-64-2.
Both machines are back online. We will observe them over the weekend. If no further crashes occur I will close the bug on Monday. Thanks Dan for quickly jumping in.
Status: NEW → ASSIGNED
Well don't close out the bug yet, please update if everything is OK and then I'll update the templates.
Adrian, I think that all is going well now. There has not been a single crash over the last couple of days, even with a high load of tests executed on those 64bit hosts. I'm really confident that this crash is gone now and that I found the right solution for it.

Please get the templates updated and we should be fine then to get this bug closed. Thanks.
Templates updated and documentation updated as well (https://mana.mozilla.org/wiki/display/websites/QA+Automation+ESX+Service).
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations