Closed Bug 1225094 Opened 9 years ago Closed 8 years ago

Crash in [@ intel_aes_gcmINIT ]

Categories

(NSS :: Libraries, defect)

3.21
x86
Windows
defect
Not set
normal

Tracking

(firefox45 wontfix, firefox47 wontfix, firefox48 wontfix, firefox49 wontfix, firefox-esr45 wontfix, firefox50 wontfix, firefox52 wontfix)

RESOLVED WONTFIX
Tracking Status
firefox45 --- wontfix
firefox47 --- wontfix
firefox48 --- wontfix
firefox49 --- wontfix
firefox-esr45 --- wontfix
firefox50 --- wontfix
firefox52 --- wontfix

People

(Reporter: btot, Assigned: n.nethercote)

References

Details

(Keywords: crash)

Crash Data

Attachments

(2 files)

User Agent: Mozilla/5.0 (Windows NT 5.1; rv:45.0) Gecko/20100101 Firefox/45.0 Build ID: 20151115030440 Graphics: NVIDIA GeForce 210 Steps to reproduce: Open a new tab via Ctrl+T or the "new tab" button on the toolbar. Here are the crash reports, for about 30 minutes: bp-7ddb5190-462f-4952-bfc6-22ea32151116 dd597c47-9a52-4c8d-bada-8c3953cbe831 e5388451-0c38-4497-bad5-0d3d232389aa bp-440e611a-9b84-44f4-af8c-be8f22151116 Actual results: The browser crashes, intermittent Expected results: Browser should not crash.
Took a look at this and with mconley's suggestion, adding ttaubert to the bug, who is potentially doubly familiar to this :-) With newtab and with NSS! This looks like an NSS-related issue that happens fairly infrequently. However, it might still be worth a look. Possibilities: On newtab pageload, we either make a request over HTTPS to: * the tiles servers * a page url to capture a thumbnail
Hmm, the crash is in the ASM code for intel_aes_gcmINIT: http://hg.mozilla.org/mozilla-central/annotate/51fa3e0d4f7b/security/nss/lib/freebl/intel-gcm-x86-masm.asm#l132 Is there any good way to reproduce it? You'd surely need an Intel CPU with AES instructions but other than that? Looks like it's Windows-only as well: https://crash-stats.mozilla.com/report/list?product=Firefox&signature=intel_aes_gcmINIT
Looks like I can't move this to the NSS component.
OS: Windows XP → Windows
The crash reason is EXCEPTION_ILLEGAL_INSTRUCTION, so is it possible that the CPU doesn't support |vzeroupper|? Looks like the CPU needs Advanced Vector Extensions [1]. "AuthenticAMD family 21 model 2 stepping 0" (as taken from the first crash report) is codename "Zambezi", and from the AMD Bulldozer Family. According to [2] the Bulldozer CPUs should support those extensions. Maybe I'm on the wrong path here, maybe not. I'm not sure who the best person to investigate this is? [1] https://en.wikipedia.org/wiki/Advanced_Vector_Extensions [2] https://en.wikipedia.org/wiki/Bulldozer_%28microarchitecture%29
Assignee: nobody → nobody
Component: New Tab Page → Libraries
Product: Firefox → NSS
Summary: Firefox crashes when opening a new tab → Firefox 45.0a1 Crash [@ intel_aes_gcmINIT ]
Version: 45 Branch → 3.21
Tim: I think you are on the right track. We need to review the code that sets has_intel_avx to 1: http://mxr.mozilla.org/nss/ident?i=has_intel_avx But according to the Wikipedia articles you cited, the AMD Bulldozer CPU should have AVX. So I don't know why we get the illegal instruction exception there.
Brindusa, can you tell use what machine you run Firefox on? The manufacturer and maybe the exact model?
Flags: needinfo?(brindusa.tot)
Sorry for late response. I am using an AMD FX - 8320 Eight-Core processor, 8GB Ram and Graphics: NVIDIA GeForce 210.
Flags: needinfo?(brindusa.tot)
If you are still here, can you post the output of: https://technet.microsoft.com/en-ca/sysinternals/cc835722.aspx
I filed bug 1263495 for this same issue. The reason I didn't see this bug was that the crash signature field was filled out incorrectly, which meant this bug didn't show up on crash-stats.mozilla.org. The field should read "[@ intel_aes_gcmINIT ]". We should dup one of these bugs to the other one.
Crash Signature: [intel_aes_gcmINIT] → [@ intel_aes_gcmINIT ]
See Also: → 1263495
Attached patch Zero %ecx before calling CPUID — — Splinter Review
I haven't tested this code, because I'm not sure how to test NSS changes across multiple platforms. I also don't know if modifying freebl is any different to modifying other parts of NSS. mt, are you able to help test this?
Attachment #8753148 - Flags: review?(martin.thomson)
Assignee: nobody → n.nethercote
Status: NEW → ASSIGNED
While at it, I would suggest in freebl_cpuid to get rid of the pushad/popad and change wcpuid to cpuid (the instruction was added to MSVC long long ago I think).
BTW, newer version of MSVC also automatically zero ecx automatically in the __cpuid intrinsic.
I think I will stick with the smallest possible changes to implement this fix. This code is way outside of my zone of expertise :)
Comment on attachment 8753148 [details] [diff] [review] Zero %ecx before calling CPUID Review of attachment 8753148 [details] [diff] [review]: ----------------------------------------------------------------- I see no reason not to land this. Did you check that we don't need the code in lib/freebl/mpi/mpcpucache_x86.s ? xorl %ecx, %ecx seems to be necessary there as well, but it looks like unused code. https://hg.mozilla.org/projects/nss/rev/a9af87b6c9a7
Attachment #8753148 - Flags: review?(martin.thomson)
Attachment #8753148 - Flags: review+
Attachment #8753148 - Flags: checked-in+
As for testing. I don't have the resources :) I can only verify that it works on our build cluster, for which I have to check it in. I'll keep an eye on it.
Might as well do this one too, just in case.
Attachment #8753604 - Flags: review?(martin.thomson)
mt, is there precedent for cherry-picking NSS changes into Firefox? It'd be nice to see if the patches in bug 1225094 fix crashes without having to wait for an NSS uplift. And if they do fix the crashes, backporting them to Aurora/Beta would be worthwhile.
It's not recommended. The preferred way is to just pick up the latest version of NSS in those branches, as cherry picking puts downstream users who package NSS separately in a bind. It can happen if the fix is important enough and the changes to NSS branch are risky.
Comment on attachment 8753604 [details] [diff] [review] Zero %ecx before calling CPUID in mpcpucache_x86.s Review of attachment 8753604 [details] [diff] [review]: ----------------------------------------------------------------- I don't know this stuff, but should this be xorl rather than xor?
> I don't know this stuff, but should this be xorl rather than xor? Yes, it should be. Are you able to make the change before landing? Thank you.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → 3.25
I am not sure this will actually fix the problem.
Nick, this will be in NSS 3.25, I can backport these patches to NSS 3.24 but we would have to discuss whether we wanted a point release. And then we have to convince the release managers that they want to take the release. I will leave that part to you. Which release would you like this to be in? (I too would like to see if this makes a difference.)
Flags: needinfo?(n.nethercote)
Attachment #8753604 - Flags: review?(martin.thomson)
Comment on attachment 8753148 [details] [diff] [review] Zero %ecx before calling CPUID Review of attachment 8753148 [details] [diff] [review]: ----------------------------------------------------------------- ::: security/nss/lib/freebl/mpi/mpcpucache.c @@ +73,5 @@ > void freebl_cpuid(unsigned long op, unsigned long *eax, > unsigned long *ebx, unsigned long *ecx, > unsigned long *edx) > { > +/* Some older processors don't fill the ecx register with cpuid, so clobber it I'm very curious about this bug. How did you know about this problem? I can't find a webpage that describes this problem by web searches. I guess we don't need to zero edx because the feature flags returned in edx are older features?
> I'm very curious about this bug. How did you know about > this problem? I can't find a webpage that describes this > problem by web searches. The first I heard about it was from bug 1096651. I don't know how/where glandium learned about it.
Flags: needinfo?(n.nethercote)
I'm going to leave this open until we have a clear picture of whether clearing %ecx fixes the problem.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Martin Thomson [:mt:] from comment #24) > Nick, this will be in NSS 3.25, I can backport these patches to NSS 3.24 but > we would have to discuss whether we wanted a point release. And then we > have to convince the release managers that they want to take the release. I > will leave that part to you. So... - This crash doesn't make the top 300 crashes for Firefox 46.0 in the past 7 days. - However, it is #19 for Firefox 49.0 in the past 7 days. I'm not sure why it's ranked so much higher in Nightly than Release. - It also shows up for Thunderbird users quite a bit. - I hope the two patches will fix the crash, but it's not certain. I don't have a good sense of whether these numbers would typically warrant an NSS point release and update. Ritu, perhaps you do? Thank you.
Flags: needinfo?(rkothari)
(In reply to Nicholas Nethercote [:njn] from comment #26) > > The first I heard about it was from bug 1096651. I don't know how/where > glandium learned about it. Hi Nicholas: thanks a lot for the info. According to bug 1096651 comment 0, the old CPU in question was manufactured by Cyrix and has a clock speed of 225 MHz. This patch has essentially no risk, so it is worth a try, but I doubt many Firefox 49.0 users use such an old and slow CPU. So, either other more modern CPUs also have this bug, or this bug is caused by something else.
(In reply to Nicholas Nethercote [:njn] from comment #28) > (In reply to Martin Thomson [:mt:] from comment #24) > > I don't have a good sense of whether these numbers would typically warrant > an NSS point release and update. Ritu, perhaps you do? Thank you. Hi Nicholas, the current NSS - release mappings from https://kuix.de/mozilla/versions/ indicates that Fx47 is at NSS 3.23 RTM. Looking at the # of occurrences of this on Fx47 (< 25 total in a week), it does not warrant another NSS update especially since we are 2 weeks away from going live. For Fx46, the total # is also less than 125 per week, so definitely not a dot release driver by itself. Lizzard fyi.
Flags: needinfo?(rkothari) → needinfo?(lhenry)
We can keep an eye on the crash in Nightly. Currently, there were 19 crashes in the last week with this signature, on 9 installs, all on XP. That doesn't worry me too much.
Flags: needinfo?(lhenry)
> For Fx46, the total # is also less than 125 per week, so definitely not a > dot release driver by itself. Thank you for confirming.
Anyway, I just asked Brindusa Tot to run Sysinternals CoreInfo and that shows OSXSAVE in CPUID is enabled.
Below is the output after running CoreInfo: Coreinfo v3.31 - Dump information on system CPU and memory topology Copyright (C) 2008-2014 Mark Russinovich Sysinternals - www.sysinternals.com AMD FX(tm)-8320 Eight-Core Processor x86 Family 21 Model 2 Stepping 0, AuthenticAMD HTT * Multicore HYPERVISOR - Hypervisor is present VMX - Supports Intel hardware-assisted virtualization SVM * Supports AMD hardware-assisted virtualization X64 * Supports 64-bit mode SMX - Supports Intel trusted execution SKINIT * Supports AMD SKINIT NX * Supports no-execute page protection SMEP - Supports Supervisor Mode Execution Prevention SMAP - Supports Supervisor Mode Access Prevention PAGE1GB * Supports 1 GB large pages PAE * Supports > 32-bit physical addresses PAT * Supports Page Attribute Table PSE * Supports 4 MB pages PSE36 * Supports > 32-bit address 4 MB pages PGE * Supports global bit in page tables SS - Supports bus snooping for cache operations VME * Supports Virtual-8086 mode RDWRFSGSBASE - Supports direct GS/FS base access FPU * Implements i387 floating point instructions MMX * Supports MMX instruction set MMXEXT * Implements AMD MMX extensions 3DNOW - Supports 3DNow! instructions 3DNOWEXT - Supports 3DNow! extension instructions SSE * Supports Streaming SIMD Extensions SSE2 * Supports Streaming SIMD Extensions 2 SSE3 * Supports Streaming SIMD Extensions 3 SSSE3 * Supports Supplemental SIMD Extensions 3 SSE4a * Supports Streaming SIMDR Extensions 4a SSE4.1 * Supports Streaming SIMD Extensions 4.1 SSE4.2 * Supports Streaming SIMD Extensions 4.2 AES * Supports AES extensions AVX * Supports AVX intruction extensions FMA * Supports FMA extensions using YMM state MSR * Implements RDMSR/WRMSR instructions MTRR * Supports Memory Type Range Registers XSAVE * Supports XSAVE/XRSTOR instructions OSXSAVE * Supports XSETBV/XGETBV instructions RDRAND - Supports RDRAND instruction RDSEED - Supports RDSEED instruction CMOV * Supports CMOVcc instruction CLFSH * Supports CLFLUSH instruction CX8 * Supports compare and exchange 8-byte instructions CX16 * Supports CMPXCHG16B instruction BMI1 * Supports bit manipulation extensions 1 BMI2 - Supports bit manipulation extensions 2 ADX - Supports ADCX/ADOX instructions DCA - Supports prefetch from memory-mapped device F16C * Supports half-precision instruction FXSR * Supports FXSAVE/FXSTOR instructions FFXSR * Supports optimized FXSAVE/FSRSTOR instruction MONITOR * Supports MONITOR and MWAIT instructions MOVBE - Supports MOVBE instruction ERMSB - Supports Enhanced REP MOVSB/STOSB PCLMULDQ * Supports PCLMULDQ instruction POPCNT * Supports POPCNT instruction LZCNT * Supports LZCNT instruction SEP * Supports fast system call instructions LAHF-SAHF * Supports LAHF/SAHF instructions in 64-bit mode HLE - Supports Hardware Lock Elision instructions RTM - Supports Restricted Transactional Memory instruction DE * Supports I/O breakpoints including CR4.DE DTES64 - Can write history of 64-bit branch addresses DS - Implements memory-resident debug buffer DS-CPL - Supports Debug Store feature with CPL PCID - Supports PCIDs and settable CR4.PCIDE INVPCID - Supports INVPCID instruction PDCM - Supports Performance Capabilities MSR RDTSCP * Supports RDTSCP instruction TSC * Supports RDTSC instruction TSC-DEADLINE - Local APIC supports one-shot deadline timer TSC-INVARIANT * TSC runs at constant rate xTPR - Supports disabling task priority messages EIST - Supports Enhanced Intel Speedstep ACPI - Implements MSR for power management TM - Implements thermal monitor circuitry TM2 - Implements Thermal Monitor 2 control APIC * Implements software-accessible local APIC x2APIC - Supports x2APIC CNXT-ID - L1 data cache mode adaptive or BIOS MCE * Supports Machine Check, INT18 and CR4.MCE MCA * Implements Machine Check Architecture PBE - Supports use of FERR#/PBE# pin PSN - Implements 96-bit processor serial number PREFETCHW * Supports PREFETCHW instruction Maximum implemented CPUID leaves: 0000000D (Basic), 8000001E (Extended). Logical to Physical Processor Map: *------- Physical Processor 0 -*------ Physical Processor 1 --*----- Physical Processor 2 ---*---- Physical Processor 3 ----*--- Physical Processor 4 -----*-- Physical Processor 5 ------*- Physical Processor 6 -------* Physical Processor 7 Logical Processor to Socket Map: ******** Socket 0 Logical Processor to NUMA Node Map: ******** NUMA Node 0 Logical Processor to Cache Map: *------- Data Cache 0, Level 1, 16 KB, Assoc 4, LineSize 64 *------- Instruction Cache 0, Level 1, 64 KB, Assoc 2, LineSize 64 *------- Unified Cache 0, Level 2, 2 MB, Assoc 16, LineSize 64 -*------ Data Cache 1, Level 1, 16 KB, Assoc 4, LineSize 64 -*------ Instruction Cache 1, Level 1, 64 KB, Assoc 2, LineSize 64 -*------ Unified Cache 1, Level 2, 2 MB, Assoc 16, LineSize 64 --*----- Data Cache 2, Level 1, 16 KB, Assoc 4, LineSize 64 --*----- Instruction Cache 2, Level 1, 64 KB, Assoc 2, LineSize 64 --*----- Unified Cache 2, Level 2, 2 MB, Assoc 16, LineSize 64 ---*---- Data Cache 3, Level 1, 16 KB, Assoc 4, LineSize 64 ---*---- Instruction Cache 3, Level 1, 64 KB, Assoc 2, LineSize 64 ---*---- Unified Cache 3, Level 2, 2 MB, Assoc 16, LineSize 64 ----*--- Data Cache 4, Level 1, 16 KB, Assoc 4, LineSize 64 ----*--- Instruction Cache 4, Level 1, 64 KB, Assoc 2, LineSize 64 ----*--- Unified Cache 4, Level 2, 2 MB, Assoc 16, LineSize 64 -----*-- Data Cache 5, Level 1, 16 KB, Assoc 4, LineSize 64 -----*-- Instruction Cache 5, Level 1, 64 KB, Assoc 2, LineSize 64 -----*-- Unified Cache 5, Level 2, 2 MB, Assoc 16, LineSize 64 ------*- Data Cache 6, Level 1, 16 KB, Assoc 4, LineSize 64 ------*- Instruction Cache 6, Level 1, 64 KB, Assoc 2, LineSize 64 ------*- Unified Cache 6, Level 2, 2 MB, Assoc 16, LineSize 64 -------* Data Cache 7, Level 1, 16 KB, Assoc 4, LineSize 64 -------* Instruction Cache 7, Level 1, 64 KB, Assoc 2, LineSize 64 -------* Unified Cache 7, Level 2, 2 MB, Assoc 16, LineSize 64
Just filed bug 1283585 with the new AES-GCM code.
I also ran into this crash today several times on Windows XP 32-bit while testing Firefox 48beta7. Don't have any steps to reproduce unfortunately, it just happened all of a sudden. I also modified the title since this is not specific to 45 branch. Setup: Operating System: Windows XP Professional (5.1, Build 2600) Service Pack 3 (2600.xpsp_sp3_qfe.130704-0421) Language: English (Regional Setting: English) System Manufacturer: Gigabyte Technology Co., Ltd. System Model: GA-78LMT-USB3 BIOS: Award Modular BIOS v6.00PG Processor: AMD FX(tm)-8320 Eight-Core Processor, MMX (8 CPUs), ~3.5GHz Memory: 3326MB RAM Page File: 3523MB used, 1681MB available Windows Dir: C:\WINDOWS DirectX Version: DirectX 9.0c (4.09.0000.0904) DX Setup Parameters: Not found DxDiag Version: 5.03.2600.5512 32bit Unicode Card name: AMD Radeon HD 6450 Crash reports: bp-2452f2fc-47ac-4979-9137-8c12e2160712 bp-0274bd69-36c5-419d-8595-b11962160712 bp-0ba074fb-8258-4c12-b309-b440c2160712 bp-132956e3-57fa-444e-88b7-edf802160712
Summary: Firefox 45.0a1 Crash [@ intel_aes_gcmINIT ] → Crash in [@ intel_aes_gcmINIT ]
Crash volume for signature 'intel_aes_gcmINIT': - nightly (version 50): 29 crashes from 2016-06-06. - aurora (version 49): 0 crash from 2016-06-07. - beta (version 48): 127 crashes from 2016-06-06. - release (version 47): 691 crashes from 2016-05-31. - esr (version 45): 25 crashes from 2016-04-07. Crash volume on the last weeks: Week N-1 Week N-2 Week N-3 Week N-4 Week N-5 Week N-6 Week N-7 - nightly 2 3 0 0 8 3 5 - aurora 0 0 0 0 0 0 0 - beta 10 19 25 29 23 13 7 - release 119 100 105 114 85 81 46 - esr 0 1 0 15 4 4 0 Affected platform: Windows
Crash volume for signature 'intel_aes_gcmINIT': - nightly (version 51): 0 crashes from 2016-08-01. - aurora (version 50): 0 crashes from 2016-08-01. - beta (version 49): 59 crashes from 2016-08-02. - release (version 48): 138 crashes from 2016-07-25. - esr (version 45): 31 crashes from 2016-05-02. Crash volume on the last weeks (Week N is from 08-22 to 08-28): W. N-1 W. N-2 W. N-3 - nightly 0 0 0 - aurora 0 0 0 - beta 19 3 4 - release 52 41 18 - esr 3 0 1 Affected platform: Windows Crash rank on the last 7 days: Browser Content Plugin - nightly - aurora - beta #406 - release #532 - esr #2273
An update: these patches went in with NSS 3.25, which went into Firefox 49 (https://hg.mozilla.org/integration/mozilla-inbound/rev/a2f23b6058a2, bug 1277255). Looking at crashes from the past 7 days, I see 7 in 49.0 and another 18 or so in 49 betas. This suggests that the patches didn't fix the problem :(
This used to be more frequent with AMD CPUs, but it's the other way around now. Intel 125/180 AMD 56/180
Has anybody checked in a raw dump if the instruction that we're trying to execute is actually vzeroupper?
(In reply to Marco Castelluccio [:marco] from comment #41) > Has anybody checked in a raw dump if the instruction that we're trying to > execute > is actually vzeroupper? I've checked this crash (https://crash-stats.mozilla.com/report/index/54a2ecaa-3bbf-488b-970a-d69de2160930#tab-rawdump) and the instruction is actually vzeroupper. d2cd905: 5b pop %ebx d2cd906: c3 ret d2cd907: 8d a4 24 00 00 00 00 lea 0x0(%esp),%esp d2cd90e: 8b ff mov %edi,%edi d2cd910: 8b 44 24 04 mov 0x4(%esp),%eax d2cd914: 8b 4c 24 08 mov 0x8(%esp),%ecx d2cd918: 8b 54 24 0c mov 0xc(%esp),%edx d2cd91c: c5 f8 77 vzeroupper ^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^ d2cd91f: c5 fa 6f 01 vmovdqu (%ecx),%xmm0 d2cd923: 8d 49 10 lea 0x10(%ecx),%ecx d2cd926: 4a dec %edx d2cd927: c4 e2 79 dc 01 vaesenc (%ecx),%xmm0,%xmm0 d2cd92c: 8d 49 10 lea 0x10(%ecx),%ecx d2cd92f: 4a dec %edx d2cd930: 75 f5 jne 0xd2cd927 d2cd932: c4 e2 79 dd 01 vaesenclast (%ecx),%xmm0,%xmm0 d2cd937: c4 e2 79 00 05 90 23 vpshufb 0xd2e2390,%xmm0,%xmm0
According to https://msdn.microsoft.com/en-us/library/ff545910.aspx#avx_registers, "In Windows 7 with Service Pack 1 (SP1), Windows Server 2008 R2, and newer versions of Windows, both x86 and x64 versions of the operating system preserve the AVX registers across thread (and process) switches.". I don't see how not preserving the state of the AVX registers would cause a EXCEPTION_ILLEGAL_INSTRUCTION, but it could definitely cause hard to diagnose problems. The crash occurs, even if with smaller frequency (out of 184, 19 with Windows 8.1, 1 with Windows 8, 1 with Windows 7 SP1), with versions of Windows newer than what is stated on that page.
Looks like there's a way to disable AVX on Windows, it was implemented by Microsoft as a workaround for some issue (the command is bcdedit /set xsavedisable 1, see https://support.microsoft.com/en-us/kb/2568088). There's also a way to detect if AVX is enabled on Windows: https://msdn.microsoft.com/en-us/library/windows/desktop/hh134240(v=vs.85).aspx.
(In reply to Marco Castelluccio [:marco] from comment #43) > According to > https://msdn.microsoft.com/en-us/library/ff545910.aspx#avx_registers, "In > Windows 7 with Service Pack 1 (SP1), Windows Server 2008 R2, and newer > versions of Windows, both x86 and x64 versions of the operating system > preserve the AVX registers across thread (and process) switches.". > > I don't see how not preserving the state of the AVX registers would cause a > EXCEPTION_ILLEGAL_INSTRUCTION, but it could definitely cause hard to > diagnose problems. > The crash occurs, even if with smaller frequency (out of 184, 19 with > Windows 8.1, 1 with Windows 8, 1 with Windows 7 SP1), with versions of > Windows newer than what is stated on that page. The way it works is that the CPU blocks execution of the AVX instructions until the OS sets OSXSAVE in CR4 and XSETBV is used to set the correct bits.
(In reply to Yuhong Bao from comment #45) > (In reply to Marco Castelluccio [:marco] from comment #43) > > According to > > https://msdn.microsoft.com/en-us/library/ff545910.aspx#avx_registers, "In > > Windows 7 with Service Pack 1 (SP1), Windows Server 2008 R2, and newer > > versions of Windows, both x86 and x64 versions of the operating system > > preserve the AVX registers across thread (and process) switches.". > > > > I don't see how not preserving the state of the AVX registers would cause a > > EXCEPTION_ILLEGAL_INSTRUCTION, but it could definitely cause hard to > > diagnose problems. > > The crash occurs, even if with smaller frequency (out of 184, 19 with > > Windows 8.1, 1 with Windows 8, 1 with Windows 7 SP1), with versions of > > Windows newer than what is stated on that page. > > The way it works is that the CPU blocks execution of the AVX instructions > until the OS sets OSXSAVE in CR4 and XSETBV is used to set the correct bits. If the CPU blocks the execution, than it would explain why we're getting an EXCEPTION_ILLEGAL_INSTRUCTION. Most OSes where we're crashing do not support AVX, so there must be a bug in our code to detect it. We have three buckets of OSes, out of 205 crashes: - Versions older than Windows 7 SP1 (no support for AVX): 176 - Versions newer than Windows 7 SP1, 32-bit (where AVX is disabled, "the VEX instructions can only be used when running in 64-bit mode" from https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions): 27 - Versions newer than Windows 7 SP1, 64-bit (where AVX should be enabled, unless forcibly disabled by the user): 2 There are some snippets of code to detect AVX on https://software.intel.com/en-us/blogs/2011/04/14/is-avx-enabled/ and http://masm32.com/board/index.php?topic=3191.0.
(In reply to Marco Castelluccio [:marco] from comment #46) > We have three buckets of OSes, out of 205 crashes: > - Versions newer than Windows 7 SP1, 32-bit (where AVX is disabled, "the VEX > instructions can only be used when running in 64-bit mode" from > https://software.intel.com/en-us/articles/introduction-to-intel-advanced- > vector-extensions): 27 > - Versions newer than Windows 7 SP1, 64-bit (where AVX should be enabled, > unless forcibly disabled by the user): 2 N.B.: By 32-bit and 64-bit here I meant the OS. Firefox is always 32-bit with this crash.
The function which is checking the OS support is check_xcr0_ymm: https://dxr.mozilla.org/mozilla-central/source/security/nss/lib/freebl/rijndael.c#1005. It's using xgetbv in two different ways (for 32-bit builds manually with an assembly snippet, for 64-bit builds with the _xgetbv function). Perhaps we can revert the patch that introduced the assembly snippet (it was needed to support VS 2010 RTM: https://bugzilla.mozilla.org/show_bug.cgi?id=979703#c23). I don't see why the assembly snippet would be wrong, but perhaps directly using _xgetbv from intrin.h would be safer (we're already doing it with libyuv: https://dxr.mozilla.org/mozilla-central/source/media/libyuv/source/cpu_id.cc#102).
Crash volume for signature 'intel_aes_gcmINIT': - nightly (version 52): 18 crashes from 2016-09-19. - aurora (version 51): 0 crashes from 2016-09-19. - beta (version 50): 86 crashes from 2016-09-20. - release (version 49): 359 crashes from 2016-09-05. - esr (version 45): 14 crashes from 2016-07-25. Crash volume on the last weeks (Week N is from 10-17 to 10-23): W. N-1 W. N-2 W. N-3 W. N-4 - nightly 17 0 1 0 - aurora 0 0 0 0 - beta 14 41 20 6 - release 98 100 99 29 - esr 0 0 5 0 Affected platform: Windows Crash rank on the last 7 days: Browser Content Plugin - nightly #55 - aurora - beta #900 #3115 - release #522 - esr
Firefox crash during startup after motherboard replacement https://support.mozilla.org/en-US/questions/1144850#answer-931402 This problem was solved by doing a full XP repair; the partial repair I did first was not sufficient.
owner I spoke to soon, it is re-occurring, here's the latest: https://crash-stats.mozilla.com/report/index/a7e4d91d-b28e-4c54-a8cd-a77492161031
Anything further on this crash? I'm running AMD FX 4300 CPU on Windows XP, XP is loaded through the Windows 10 boot loader menu (dual boot scenario) I know that XP does not support AVX instructions. Is this relevant ?
Which mobo are you using?
It is the ASRock 960GC-GS FX Let me mention something. In various Windows forums I've described this crash scenario but the response is that it couldn't possibly be, it must be a coincidence. I've only ever seen this crash when I've booted XP from the Windows 10 boot loader (normal dual boot situation) On a hunch, I restored XP's original boot loader so I could boot XP natively. In this situation, the crash has never occurred. I'd be pleased to help or run any tests to get to the bottom of this problem - Thanks
Thanks. This might mean that it is a bug in BOOTMGR.
There are various documents on the web that talk about Windows 8/10 usage of hibernate/ Fast Start/Hybrid sleep causing problems in dual boot scenarios. But disabling that lot did not affect the situation. I thought I was onto something for a minute. Firefox is not the only program that is crashing. The avast anti virus ( now uninstalled and replaced) created many dumps in it's appdata folder, it was unusable. Also FFmpeg.dll (video codec/tool)as used in application Tmpgenc 2.5 also crashed exception illegal operation. All these only crashed when XP was booted via the Windows 10 boot menu. FF does not crash constantly. It's usually within a minute of starting it, but not every time. I'm running it today after restoring the Windows 10 loader and nothing has occurred (yet)
Perhaps similar occurrences in XP from other users could be checked to see if the Windows 10 (possibly Windows 7 & 8 also ?) boot loader scenario is evident. I know that when problems are reported, this detail may not have been included
It does appear that CR4.OSFXSR is set early in newer BOOTMGR.
And CR4.OSXSAVE too.
Thanks for the update. It's a little above my head, I'm not a windows programmer. But I'm assuming these values in the control register are significant to the problem. Let me know if there is anything further I can do to help
I saw the brief email discussion ... interesting. Has it been determined if the AMD CPU is to blame, or if there is a bug in the Windows 10 Bootmgr/Winload where some bits are set incorrectly when booting to NTLDR ?
BTW, Geoff Chappell is a consultant you can pay, if Mozilla is interested.
I have done the debugging and confirmed this is a bug in BOOTMGR failing to clear XCR0 and CR4.OSXSAVE when chaining to NTLDR.
Doing this debugging in a VM took quite a bit of work including getting a checked build of NTLDR to confirm.
Thank you for the excellent investigation, Yuhong Bao. Since this isn't a Firefox bug, the crash rate is low, and it seems hard to do anything about it, I suggest that we close this bug report as WONTFIX.
Status: REOPENED → RESOLVED
Closed: 9 years ago8 years ago
Resolution: --- → WONTFIX
The bug should still be reported to Microsoft though.
Yuhong Bao I want to thank you myself for this work you did. As was mentioned by Nicholas, the incidence of those experiencing this bug must be quite low. (I wonder what % of Windows 10 users are also booting to XP?) For myself, right now, I can find a work-around, boot XP from a floppy drive, for example. How does one approach Microsoft with this kind of info? Secondly, would they even be interested in fixing it? One thing I didn't really understand. When the PC goes through POST, what is the content of those registers? Are those bits off by default ? (I assume they must be).
(In reply to davexnet01 from comment #67) > One thing I didn't really understand. When the PC goes through POST, what > is the content of those registers? Are those bits off by default ? (I assume > they must be). Yes, they should be.
(In reply to davexnet01 from comment #67) > Yuhong Bao I want to thank you myself for this work you did. > As was mentioned by Nicholas, the incidence of those experiencing this bug > must be quite low. I just checked crash-stats.mozilla.com. This crash signature accounts for 255 out of 3,242,628 crash reports submitted in the past 7 days. I.e. not that many.
If we do anything, it could be to blacklist the processor in the AVX test. I think that I'd be happy to take code that did that.
You would be better off using IsProcessorFeaturePresent(PF_XSAVE_ENABLED)
Regarding Martin's comment above, is he implying that some errata in the AMD FX is involved in this problem? Would the problem not arise if an Intel CPU was used in an otherwise similar scenario ?
For posterity reasons only here is my configuration of the machine where I encounter this crash: One HDD 500GB with the following OSs each on it's own partition: - Windows XP 32bit - Windows Vista 64bit - Windows 7 32bit - Windows 7 64bit - Windows 8.1 32bit - Windows 8.1 64bit One SSD 256GB with the following OSs each on it's own partition: - Windows 7 64bit - Windows 10 64bit - Ubuntu 16.04 32bit One HDD 500GB with the following OSs each on it's own partition: - Ubuntu 12.04 32bit - Ubuntu 16.04 64bit - Ubuntu 14.04 32bit Software used to make this configuration was EasyBCD. The CPU, Memory, BIOS etc. can be seen in comment 36.
FYI, I just finished a version for intel-gcm.s in bug 1283585
Good news! In Win10 1703, bootmgr!ImgPcatStartLegacyLoader now call bootmgr!ArchRestoreProcessorFeatures (to get rid of XSAVE bits)! This does not deal with CR4.OSFXSR, but it is a step forward.
See Also: → 1403353
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: