Closed Bug 1225094 Opened 5 years ago Closed 4 years ago

Crash in [@ intel_aes_gcmINIT ]

Categories

(NSS :: Libraries, defect)

3.21
x86
Windows
defect
Not set
normal

Tracking

(firefox45 affected, firefox47 affected, firefox48 affected, firefox49 affected, firefox-esr45 affected, firefox50 affected, firefox52 affected)

RESOLVED WONTFIX
Tracking Status
firefox45 --- affected
firefox47 --- affected
firefox48 --- affected
firefox49 --- affected
firefox-esr45 --- affected
firefox50 --- affected
firefox52 --- affected

People

(Reporter: brindusat, Assigned: njn)

References

Details

(Keywords: crash)

Crash Data

Attachments

(2 files)

User Agent: Mozilla/5.0 (Windows NT 5.1; rv:45.0) Gecko/20100101 Firefox/45.0
Build ID:  20151115030440
Graphics: NVIDIA GeForce 210 

Steps to reproduce:
Open a new tab via Ctrl+T or the "new tab" button on the toolbar.
Here are the crash reports, for about 30 minutes:
bp-7ddb5190-462f-4952-bfc6-22ea32151116
dd597c47-9a52-4c8d-bada-8c3953cbe831
e5388451-0c38-4497-bad5-0d3d232389aa
bp-440e611a-9b84-44f4-af8c-be8f22151116


Actual results:
The browser crashes, intermittent


Expected results:
Browser should not crash.
Took a look at this and with mconley's suggestion, adding ttaubert to the bug, who is potentially doubly familiar to this :-) With newtab and with NSS!

This looks like an NSS-related issue that happens fairly infrequently.
However, it might still be worth a look.

Possibilities:

On newtab pageload, we either make a request over HTTPS to:
* the tiles servers
* a page url to capture a thumbnail
Hmm, the crash is in the ASM code for intel_aes_gcmINIT:

http://hg.mozilla.org/mozilla-central/annotate/51fa3e0d4f7b/security/nss/lib/freebl/intel-gcm-x86-masm.asm#l132

Is there any good way to reproduce it? You'd surely need an Intel CPU with AES instructions but other than that? Looks like it's Windows-only as well:

https://crash-stats.mozilla.com/report/list?product=Firefox&signature=intel_aes_gcmINIT
Looks like I can't move this to the NSS component.
OS: Windows XP → Windows
The crash reason is EXCEPTION_ILLEGAL_INSTRUCTION, so is it possible that the CPU doesn't support |vzeroupper|? Looks like the CPU needs Advanced Vector Extensions [1].

"AuthenticAMD family 21 model 2 stepping 0" (as taken from the first crash report) is codename "Zambezi", and from the AMD Bulldozer Family. According to [2] the Bulldozer CPUs should support those extensions.

Maybe I'm on the wrong path here, maybe not. I'm not sure who the best person to investigate this is?

[1] https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
[2] https://en.wikipedia.org/wiki/Bulldozer_%28microarchitecture%29
Assignee: nobody → nobody
Component: New Tab Page → Libraries
Product: Firefox → NSS
Summary: Firefox crashes when opening a new tab → Firefox 45.0a1 Crash [@ intel_aes_gcmINIT ]
Version: 45 Branch → 3.21
Tim: I think you are on the right track. We need to review the
code that sets has_intel_avx to 1:
http://mxr.mozilla.org/nss/ident?i=has_intel_avx

But according to the Wikipedia articles you cited, the AMD Bulldozer
CPU should have AVX. So I don't know why we get the illegal instruction
exception there.
Brindusa, can you tell use what machine you run Firefox on? The manufacturer and maybe the exact model?
Flags: needinfo?(brindusa.tot)
Sorry for late response. 
I am using an AMD FX - 8320 Eight-Core processor, 8GB Ram and Graphics: NVIDIA GeForce 210.
Flags: needinfo?(brindusa.tot)
If you are still here, can you post the output of: https://technet.microsoft.com/en-ca/sysinternals/cc835722.aspx
I filed bug 1263495 for this same issue. The reason I didn't see this bug was that the crash signature field was filled out incorrectly, which meant this bug didn't show up on crash-stats.mozilla.org. The field should read "[@ intel_aes_gcmINIT ]". We should dup one of these bugs to the other one.
Crash Signature: [intel_aes_gcmINIT] → [@ intel_aes_gcmINIT ]
See Also: → 1263495
Duplicate of this bug: 1263495
I haven't tested this code, because I'm not sure how to test NSS changes across
multiple platforms. I also don't know if modifying freebl is any different to
modifying other parts of NSS.

mt, are you able to help test this?
Attachment #8753148 - Flags: review?(martin.thomson)
Assignee: nobody → n.nethercote
Status: NEW → ASSIGNED
While at it, I would suggest in freebl_cpuid to get rid of the pushad/popad and change wcpuid to cpuid (the instruction was added to MSVC long long ago I think).
BTW, newer version of MSVC also automatically zero ecx automatically in the __cpuid intrinsic.
I think I will stick with the smallest possible changes to implement this fix. This code is way outside of my zone of expertise :)
Comment on attachment 8753148 [details] [diff] [review]
Zero %ecx before calling CPUID

Review of attachment 8753148 [details] [diff] [review]:
-----------------------------------------------------------------

I see no reason not to land this.   Did you check that we don't need the code in lib/freebl/mpi/mpcpucache_x86.s ? xorl %ecx, %ecx seems to be necessary there as well, but it looks like unused code.

https://hg.mozilla.org/projects/nss/rev/a9af87b6c9a7
Attachment #8753148 - Flags: review?(martin.thomson)
Attachment #8753148 - Flags: review+
Attachment #8753148 - Flags: checked-in+
As for testing.  I don't have the resources :)  I can only verify that it works on our build cluster, for which I have to check it in.  I'll keep an eye on it.
Might as well do this one too, just in case.
Attachment #8753604 - Flags: review?(martin.thomson)
mt, is there precedent for cherry-picking NSS changes into Firefox? It'd be nice to see if the patches in bug 1225094 fix crashes without having to wait for an NSS uplift. And if they do fix the crashes, backporting them to Aurora/Beta would be worthwhile.
It's not recommended. The preferred way is to just pick up the latest version of NSS in those branches, as cherry picking puts downstream users who package NSS separately in a bind. 

It can happen if the fix is important enough and the changes to NSS branch are risky.
Comment on attachment 8753604 [details] [diff] [review]
Zero %ecx before calling CPUID in mpcpucache_x86.s

Review of attachment 8753604 [details] [diff] [review]:
-----------------------------------------------------------------

I don't know this stuff, but should this be xorl rather than xor?
> I don't know this stuff, but should this be xorl rather than xor?

Yes, it should be. Are you able to make the change before landing? Thank you.
https://hg.mozilla.org/projects/nss/rev/c95c137a1829
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → 3.25
I am not sure this will actually fix the problem.
Nick, this will be in NSS 3.25, I can backport these patches to NSS 3.24 but we would have to discuss whether we wanted a point release.  And then we have to convince the release managers that they want to take the release.  I will leave that part to you.

Which release would you like this to be in?  (I too would like to see if this makes a difference.)
Flags: needinfo?(n.nethercote)
Attachment #8753604 - Flags: review?(martin.thomson)
Comment on attachment 8753148 [details] [diff] [review]
Zero %ecx before calling CPUID

Review of attachment 8753148 [details] [diff] [review]:
-----------------------------------------------------------------

::: security/nss/lib/freebl/mpi/mpcpucache.c
@@ +73,5 @@
>  void freebl_cpuid(unsigned long op, unsigned long *eax, 
>  	                 unsigned long *ebx, unsigned long *ecx, 
>                           unsigned long *edx)
>  {
> +/* Some older processors don't fill the ecx register with cpuid, so clobber it

I'm very curious about this bug. How did you know about
this problem? I can't find a webpage that describes this
problem by web searches.

I guess we don't need to zero edx because the feature
flags returned in edx are older features?
> I'm very curious about this bug. How did you know about
> this problem? I can't find a webpage that describes this
> problem by web searches.

The first I heard about it was from bug 1096651. I don't know how/where glandium learned about it.
Flags: needinfo?(n.nethercote)
I'm going to leave this open until we have a clear picture of whether clearing %ecx fixes the problem.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Martin Thomson [:mt:] from comment #24)
> Nick, this will be in NSS 3.25, I can backport these patches to NSS 3.24 but
> we would have to discuss whether we wanted a point release.  And then we
> have to convince the release managers that they want to take the release.  I
> will leave that part to you.

So...
- This crash doesn't make the top 300 crashes for Firefox 46.0 in the past 7 days.
- However, it is #19 for Firefox 49.0 in the past 7 days. I'm not sure why it's ranked so much higher in Nightly than Release.
- It also shows up for Thunderbird users quite a bit.
- I hope the two patches will fix the crash, but it's not certain.

I don't have a good sense of whether these numbers would typically warrant an NSS point release and update. Ritu, perhaps you do? Thank you.
Flags: needinfo?(rkothari)
(In reply to Nicholas Nethercote [:njn] from comment #26)
> 
> The first I heard about it was from bug 1096651. I don't know how/where
> glandium learned about it.

Hi Nicholas: thanks a lot for the info. According to bug 1096651 comment 0,
the old CPU in question was manufactured by Cyrix and has a clock speed of
225 MHz. This patch has essentially no risk, so it is worth a try, but I
doubt many Firefox 49.0 users use such an old and slow CPU. So, either other
more modern CPUs also have this bug, or this bug is caused by something else.
(In reply to Nicholas Nethercote [:njn] from comment #28)
> (In reply to Martin Thomson [:mt:] from comment #24)
> 
> I don't have a good sense of whether these numbers would typically warrant
> an NSS point release and update. Ritu, perhaps you do? Thank you.

Hi Nicholas, the current NSS - release mappings from https://kuix.de/mozilla/versions/ indicates that Fx47 is at NSS 3.23 RTM. Looking at the # of occurrences of this on Fx47 (< 25 total in a week), it does not warrant another NSS update especially since we are 2 weeks away from going live.

For Fx46, the total # is also less than 125 per week, so definitely not a dot release driver by itself. Lizzard fyi.
Flags: needinfo?(rkothari) → needinfo?(lhenry)
We can keep an eye on the crash in Nightly. Currently, there were 19 crashes in the last week with this signature, on 9 installs, all on XP. That doesn't worry me too much.
Flags: needinfo?(lhenry)
> For Fx46, the total # is also less than 125 per week, so definitely not a
> dot release driver by itself.

Thank you for confirming.
Anyway, I just asked Brindusa Tot to run Sysinternals CoreInfo and that shows OSXSAVE in CPUID is enabled.
Below is the output after running CoreInfo:

Coreinfo v3.31 - Dump information on system CPU and memory topology Copyright (C) 2008-2014 Mark Russinovich Sysinternals - www.sysinternals.com

AMD FX(tm)-8320 Eight-Core Processor
x86 Family 21 Model 2 Stepping 0, AuthenticAMD
HTT             *       Multicore
HYPERVISOR      -       Hypervisor is present
VMX             -       Supports Intel hardware-assisted virtualization
SVM             *       Supports AMD hardware-assisted virtualization
X64             *       Supports 64-bit mode

SMX             -       Supports Intel trusted execution
SKINIT          *       Supports AMD SKINIT

NX              *       Supports no-execute page protection
SMEP            -       Supports Supervisor Mode Execution Prevention
SMAP            -       Supports Supervisor Mode Access Prevention
PAGE1GB         *       Supports 1 GB large pages
PAE             *       Supports > 32-bit physical addresses
PAT             *       Supports Page Attribute Table
PSE             *       Supports 4 MB pages
PSE36           *       Supports > 32-bit address 4 MB pages
PGE             *       Supports global bit in page tables
SS              -       Supports bus snooping for cache operations
VME             *       Supports Virtual-8086 mode
RDWRFSGSBASE    -       Supports direct GS/FS base access

FPU             *       Implements i387 floating point instructions
MMX             *       Supports MMX instruction set
MMXEXT          *       Implements AMD MMX extensions
3DNOW           -       Supports 3DNow! instructions
3DNOWEXT        -       Supports 3DNow! extension instructions
SSE             *       Supports Streaming SIMD Extensions
SSE2            *       Supports Streaming SIMD Extensions 2
SSE3            *       Supports Streaming SIMD Extensions 3
SSSE3           *       Supports Supplemental SIMD Extensions 3
SSE4a           *       Supports Streaming SIMDR Extensions 4a
SSE4.1          *       Supports Streaming SIMD Extensions 4.1
SSE4.2          *       Supports Streaming SIMD Extensions 4.2

AES             *       Supports AES extensions
AVX             *       Supports AVX intruction extensions
FMA             *       Supports FMA extensions using YMM state
MSR             *       Implements RDMSR/WRMSR instructions
MTRR            *       Supports Memory Type Range Registers
XSAVE           *       Supports XSAVE/XRSTOR instructions
OSXSAVE         *       Supports XSETBV/XGETBV instructions
RDRAND          -       Supports RDRAND instruction
RDSEED          -       Supports RDSEED instruction

CMOV            *       Supports CMOVcc instruction
CLFSH           *       Supports CLFLUSH instruction
CX8             *       Supports compare and exchange 8-byte instructions
CX16            *       Supports CMPXCHG16B instruction
BMI1            *       Supports bit manipulation extensions 1
BMI2            -       Supports bit manipulation extensions 2
ADX             -       Supports ADCX/ADOX instructions
DCA             -       Supports prefetch from memory-mapped device
F16C            *       Supports half-precision instruction
FXSR            *       Supports FXSAVE/FXSTOR instructions
FFXSR           *       Supports optimized FXSAVE/FSRSTOR instruction
MONITOR         *       Supports MONITOR and MWAIT instructions
MOVBE           -       Supports MOVBE instruction
ERMSB           -       Supports Enhanced REP MOVSB/STOSB
PCLMULDQ        *       Supports PCLMULDQ instruction
POPCNT          *       Supports POPCNT instruction
LZCNT           *       Supports LZCNT instruction
SEP             *       Supports fast system call instructions
LAHF-SAHF       *       Supports LAHF/SAHF instructions in 64-bit mode
HLE             -       Supports Hardware Lock Elision instructions
RTM             -       Supports Restricted Transactional Memory instruction

DE              *       Supports I/O breakpoints including CR4.DE
DTES64          -       Can write history of 64-bit branch addresses
DS              -       Implements memory-resident debug buffer
DS-CPL          -       Supports Debug Store feature with CPL
PCID            -       Supports PCIDs and settable CR4.PCIDE
INVPCID         -       Supports INVPCID instruction
PDCM            -       Supports Performance Capabilities MSR
RDTSCP          *       Supports RDTSCP instruction
TSC             *       Supports RDTSC instruction
TSC-DEADLINE    -       Local APIC supports one-shot deadline timer
TSC-INVARIANT   *       TSC runs at constant rate
xTPR            -       Supports disabling task priority messages

EIST            -       Supports Enhanced Intel Speedstep
ACPI            -       Implements MSR for power management
TM              -       Implements thermal monitor circuitry
TM2             -       Implements Thermal Monitor 2 control
APIC            *       Implements software-accessible local APIC
x2APIC          -       Supports x2APIC

CNXT-ID         -       L1 data cache mode adaptive or BIOS

MCE             *       Supports Machine Check, INT18 and CR4.MCE
MCA             *       Implements Machine Check Architecture
PBE             -       Supports use of FERR#/PBE# pin

PSN             -       Implements 96-bit processor serial number

PREFETCHW       *       Supports PREFETCHW instruction

Maximum implemented CPUID leaves: 0000000D (Basic), 8000001E (Extended).

Logical to Physical Processor Map:
*-------  Physical Processor 0
-*------  Physical Processor 1
--*-----  Physical Processor 2
---*----  Physical Processor 3
----*---  Physical Processor 4
-----*--  Physical Processor 5
------*-  Physical Processor 6
-------*  Physical Processor 7

Logical Processor to Socket Map:
********  Socket 0

Logical Processor to NUMA Node Map:
********  NUMA Node 0

Logical Processor to Cache Map:
*-------  Data Cache          0, Level 1,   16 KB, Assoc   4, LineSize  64
*-------  Instruction Cache   0, Level 1,   64 KB, Assoc   2, LineSize  64
*-------  Unified Cache       0, Level 2,    2 MB, Assoc  16, LineSize  64
-*------  Data Cache          1, Level 1,   16 KB, Assoc   4, LineSize  64
-*------  Instruction Cache   1, Level 1,   64 KB, Assoc   2, LineSize  64
-*------  Unified Cache       1, Level 2,    2 MB, Assoc  16, LineSize  64
--*-----  Data Cache          2, Level 1,   16 KB, Assoc   4, LineSize  64
--*-----  Instruction Cache   2, Level 1,   64 KB, Assoc   2, LineSize  64
--*-----  Unified Cache       2, Level 2,    2 MB, Assoc  16, LineSize  64
---*----  Data Cache          3, Level 1,   16 KB, Assoc   4, LineSize  64
---*----  Instruction Cache   3, Level 1,   64 KB, Assoc   2, LineSize  64
---*----  Unified Cache       3, Level 2,    2 MB, Assoc  16, LineSize  64
----*---  Data Cache          4, Level 1,   16 KB, Assoc   4, LineSize  64
----*---  Instruction Cache   4, Level 1,   64 KB, Assoc   2, LineSize  64
----*---  Unified Cache       4, Level 2,    2 MB, Assoc  16, LineSize  64
-----*--  Data Cache          5, Level 1,   16 KB, Assoc   4, LineSize  64
-----*--  Instruction Cache   5, Level 1,   64 KB, Assoc   2, LineSize  64
-----*--  Unified Cache       5, Level 2,    2 MB, Assoc  16, LineSize  64
------*-  Data Cache          6, Level 1,   16 KB, Assoc   4, LineSize  64
------*-  Instruction Cache   6, Level 1,   64 KB, Assoc   2, LineSize  64
------*-  Unified Cache       6, Level 2,    2 MB, Assoc  16, LineSize  64
-------*  Data Cache          7, Level 1,   16 KB, Assoc   4, LineSize  64
-------*  Instruction Cache   7, Level 1,   64 KB, Assoc   2, LineSize  64
-------*  Unified Cache       7, Level 2,    2 MB, Assoc  16, LineSize  64
Just filed bug 1283585 with the new AES-GCM code.
I also ran into this crash today several times on Windows XP 32-bit while testing Firefox 48beta7. Don't have any steps to reproduce unfortunately, it just happened all of a sudden. I also modified the title since this is not specific to 45 branch.

Setup:
Operating System: Windows XP Professional (5.1, Build 2600) Service Pack 3 (2600.xpsp_sp3_qfe.130704-0421)
Language: English (Regional Setting: English)
System Manufacturer: Gigabyte Technology Co., Ltd.
System Model: GA-78LMT-USB3
BIOS: Award Modular BIOS v6.00PG
Processor: AMD FX(tm)-8320 Eight-Core Processor, MMX (8 CPUs), ~3.5GHz
Memory: 3326MB RAM
Page File: 3523MB used, 1681MB available
Windows Dir: C:\WINDOWS
DirectX Version: DirectX 9.0c (4.09.0000.0904)
DX Setup Parameters: Not found
DxDiag Version: 5.03.2600.5512 32bit Unicode
Card name: AMD Radeon HD 6450

Crash reports:
bp-2452f2fc-47ac-4979-9137-8c12e2160712
bp-0274bd69-36c5-419d-8595-b11962160712
bp-0ba074fb-8258-4c12-b309-b440c2160712
bp-132956e3-57fa-444e-88b7-edf802160712
Summary: Firefox 45.0a1 Crash [@ intel_aes_gcmINIT ] → Crash in [@ intel_aes_gcmINIT ]
Crash volume for signature 'intel_aes_gcmINIT':
 - nightly (version 50): 29 crashes from 2016-06-06.
 - aurora  (version 49): 0 crash from 2016-06-07.
 - beta    (version 48): 127 crashes from 2016-06-06.
 - release (version 47): 691 crashes from 2016-05-31.
 - esr     (version 45): 25 crashes from 2016-04-07.

Crash volume on the last weeks:
             Week N-1   Week N-2   Week N-3   Week N-4   Week N-5   Week N-6   Week N-7
 - nightly          2          3          0          0          8          3          5
 - aurora           0          0          0          0          0          0          0
 - beta            10         19         25         29         23         13          7
 - release        119        100        105        114         85         81         46
 - esr              0          1          0         15          4          4          0

Affected platform: Windows
Crash volume for signature 'intel_aes_gcmINIT':
 - nightly (version 51): 0 crashes from 2016-08-01.
 - aurora  (version 50): 0 crashes from 2016-08-01.
 - beta    (version 49): 59 crashes from 2016-08-02.
 - release (version 48): 138 crashes from 2016-07-25.
 - esr     (version 45): 31 crashes from 2016-05-02.

Crash volume on the last weeks (Week N is from 08-22 to 08-28):
            W. N-1  W. N-2  W. N-3
 - nightly       0       0       0
 - aurora        0       0       0
 - beta         19       3       4
 - release      52      41      18
 - esr           3       0       1

Affected platform: Windows

Crash rank on the last 7 days:
           Browser     Content   Plugin
 - nightly
 - aurora
 - beta    #406
 - release #532
 - esr     #2273
An update: these patches went in with NSS 3.25, which went into Firefox 49 (https://hg.mozilla.org/integration/mozilla-inbound/rev/a2f23b6058a2, bug 1277255). Looking at crashes from the past 7 days, I see 7 in 49.0 and another 18 or so in 49 betas. This suggests that the patches didn't fix the problem :(
This used to be more frequent with AMD CPUs, but it's the other way around now.

Intel 125/180
AMD 56/180
Has anybody checked in a raw dump if the instruction that we're trying to execute
is actually vzeroupper?
(In reply to Marco Castelluccio [:marco] from comment #41)
> Has anybody checked in a raw dump if the instruction that we're trying to
> execute
> is actually vzeroupper?

I've checked this crash (https://crash-stats.mozilla.com/report/index/54a2ecaa-3bbf-488b-970a-d69de2160930#tab-rawdump) and the instruction is actually vzeroupper.

 d2cd905:	5b                   	pop    %ebx
 d2cd906:	c3                   	ret    
 d2cd907:	8d a4 24 00 00 00 00 	lea    0x0(%esp),%esp
 d2cd90e:	8b ff                	mov    %edi,%edi
 d2cd910:	8b 44 24 04          	mov    0x4(%esp),%eax
 d2cd914:	8b 4c 24 08          	mov    0x8(%esp),%ecx
 d2cd918:	8b 54 24 0c          	mov    0xc(%esp),%edx
 d2cd91c:	c5 f8 77             	vzeroupper 
^^^^^^^^^	^^^^^^^^^^^^^^^^^^^^^	^^^^^^^^^^^^
 d2cd91f:	c5 fa 6f 01          	vmovdqu (%ecx),%xmm0
 d2cd923:	8d 49 10             	lea    0x10(%ecx),%ecx
 d2cd926:	4a                   	dec    %edx
 d2cd927:	c4 e2 79 dc 01       	vaesenc (%ecx),%xmm0,%xmm0
 d2cd92c:	8d 49 10             	lea    0x10(%ecx),%ecx
 d2cd92f:	4a                   	dec    %edx
 d2cd930:	75 f5                	jne    0xd2cd927
 d2cd932:	c4 e2 79 dd 01       	vaesenclast (%ecx),%xmm0,%xmm0
 d2cd937:	c4 e2 79 00 05 90 23 	vpshufb 0xd2e2390,%xmm0,%xmm0
According to https://msdn.microsoft.com/en-us/library/ff545910.aspx#avx_registers, "In Windows 7 with Service Pack 1 (SP1), Windows Server 2008 R2, and newer versions of Windows, both x86 and x64 versions of the operating system preserve the AVX registers across thread (and process) switches.".

I don't see how not preserving the state of the AVX registers would cause a EXCEPTION_ILLEGAL_INSTRUCTION, but it could definitely cause hard to diagnose problems.
The crash occurs, even if with smaller frequency (out of 184, 19 with Windows 8.1, 1 with Windows 8, 1 with Windows 7 SP1), with versions of Windows newer than what is stated on that page.
Looks like there's a way to disable AVX on Windows, it was implemented by Microsoft as a workaround for some issue (the command is bcdedit /set xsavedisable 1, see https://support.microsoft.com/en-us/kb/2568088).

There's also a way to detect if AVX is enabled on Windows: https://msdn.microsoft.com/en-us/library/windows/desktop/hh134240(v=vs.85).aspx.
(In reply to Marco Castelluccio [:marco] from comment #43)
> According to
> https://msdn.microsoft.com/en-us/library/ff545910.aspx#avx_registers, "In
> Windows 7 with Service Pack 1 (SP1), Windows Server 2008 R2, and newer
> versions of Windows, both x86 and x64 versions of the operating system
> preserve the AVX registers across thread (and process) switches.".
> 
> I don't see how not preserving the state of the AVX registers would cause a
> EXCEPTION_ILLEGAL_INSTRUCTION, but it could definitely cause hard to
> diagnose problems.
> The crash occurs, even if with smaller frequency (out of 184, 19 with
> Windows 8.1, 1 with Windows 8, 1 with Windows 7 SP1), with versions of
> Windows newer than what is stated on that page.

The way it works is that the CPU blocks execution of the AVX instructions until the OS sets OSXSAVE in CR4 and XSETBV is used to set the correct bits.
(In reply to Yuhong Bao from comment #45)
> (In reply to Marco Castelluccio [:marco] from comment #43)
> > According to
> > https://msdn.microsoft.com/en-us/library/ff545910.aspx#avx_registers, "In
> > Windows 7 with Service Pack 1 (SP1), Windows Server 2008 R2, and newer
> > versions of Windows, both x86 and x64 versions of the operating system
> > preserve the AVX registers across thread (and process) switches.".
> > 
> > I don't see how not preserving the state of the AVX registers would cause a
> > EXCEPTION_ILLEGAL_INSTRUCTION, but it could definitely cause hard to
> > diagnose problems.
> > The crash occurs, even if with smaller frequency (out of 184, 19 with
> > Windows 8.1, 1 with Windows 8, 1 with Windows 7 SP1), with versions of
> > Windows newer than what is stated on that page.
> 
> The way it works is that the CPU blocks execution of the AVX instructions
> until the OS sets OSXSAVE in CR4 and XSETBV is used to set the correct bits.

If the CPU blocks the execution, than it would explain why we're getting
an EXCEPTION_ILLEGAL_INSTRUCTION.

Most OSes where we're crashing do not support AVX, so there must be
a bug in our code to detect it.

We have three buckets of OSes, out of 205 crashes:
- Versions older than Windows 7 SP1 (no support for AVX): 176
- Versions newer than Windows 7 SP1, 32-bit (where AVX is disabled, "the VEX instructions can only be used when running in 64-bit mode" from https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions): 27
- Versions newer than Windows 7 SP1, 64-bit (where AVX should be enabled, unless forcibly disabled by the user): 2

There are some snippets of code to detect AVX on https://software.intel.com/en-us/blogs/2011/04/14/is-avx-enabled/ and http://masm32.com/board/index.php?topic=3191.0.
(In reply to Marco Castelluccio [:marco] from comment #46)
> We have three buckets of OSes, out of 205 crashes:
> - Versions newer than Windows 7 SP1, 32-bit (where AVX is disabled, "the VEX
> instructions can only be used when running in 64-bit mode" from
> https://software.intel.com/en-us/articles/introduction-to-intel-advanced-
> vector-extensions): 27
> - Versions newer than Windows 7 SP1, 64-bit (where AVX should be enabled,
> unless forcibly disabled by the user): 2

N.B.: By 32-bit and 64-bit here I meant the OS. Firefox is always 32-bit with
this crash.
The function which is checking the OS support is check_xcr0_ymm: https://dxr.mozilla.org/mozilla-central/source/security/nss/lib/freebl/rijndael.c#1005.

It's using xgetbv in two different ways (for 32-bit builds manually with an
assembly snippet, for 64-bit builds with the _xgetbv function).
Perhaps we can revert the patch that introduced the assembly snippet (it was
needed to support VS 2010 RTM: https://bugzilla.mozilla.org/show_bug.cgi?id=979703#c23).

I don't see why the assembly snippet would be wrong, but perhaps directly using
_xgetbv from intrin.h would be safer (we're already doing it with libyuv:
https://dxr.mozilla.org/mozilla-central/source/media/libyuv/source/cpu_id.cc#102).
Crash volume for signature 'intel_aes_gcmINIT':
 - nightly (version 52): 18 crashes from 2016-09-19.
 - aurora  (version 51): 0 crashes from 2016-09-19.
 - beta    (version 50): 86 crashes from 2016-09-20.
 - release (version 49): 359 crashes from 2016-09-05.
 - esr     (version 45): 14 crashes from 2016-07-25.

Crash volume on the last weeks (Week N is from 10-17 to 10-23):
            W. N-1  W. N-2  W. N-3  W. N-4
 - nightly      17       0       1       0
 - aurora        0       0       0       0
 - beta         14      41      20       6
 - release      98     100      99      29
 - esr           0       0       5       0

Affected platform: Windows

Crash rank on the last 7 days:
           Browser   Content     Plugin
 - nightly #55
 - aurora
 - beta    #900      #3115
 - release #522
 - esr
Firefox crash during startup after motherboard replacement
https://support.mozilla.org/en-US/questions/1144850#answer-931402

This problem was solved by doing a full XP repair; the partial repair I did first was not sufficient.
owner
I spoke to soon, it is re-occurring, here's the latest:
https://crash-stats.mozilla.com/report/index/a7e4d91d-b28e-4c54-a8cd-a77492161031
Anything further on this crash?
I'm running AMD FX 4300 CPU on Windows XP, XP is loaded through the Windows 10
boot loader menu (dual boot scenario)
I know that XP does not support AVX instructions.  Is this relevant ?
Which mobo are you using?
It is the ASRock 960GC-GS FX

Let me mention something.  In various Windows forums I've described this crash scenario but
the response is that it couldn't possibly be, it must be a coincidence.

I've only ever seen this crash when I've booted XP from the Windows 10 boot loader (normal dual boot situation)

On a hunch, I restored XP's original boot loader so I could boot XP natively. In this situation,
the crash has never occurred.

I'd be pleased to help or run any tests to get to the bottom of this problem - Thanks
Thanks. This might mean that it is a bug in BOOTMGR.
There are various documents on the web that talk about Windows 8/10 usage of hibernate/
Fast Start/Hybrid sleep causing problems in dual boot scenarios.  But disabling that lot 
did not affect the situation.  I thought I was onto something for a minute.

Firefox is not the only program that is crashing.  The avast anti virus ( now uninstalled and replaced)
created many dumps in it's appdata folder, it was unusable.  Also FFmpeg.dll (video codec/tool)as used
in application Tmpgenc 2.5 also crashed exception illegal operation.  All these only crashed when
XP was booted via the Windows 10 boot menu.

FF does not crash constantly.  It's usually within a minute of starting it,
but not every time.  I'm running it today after restoring the Windows 10 loader and nothing has 
occurred (yet)
Perhaps similar occurrences in XP from other users could be checked to see if the Windows 10
(possibly Windows 7 & 8 also ?) boot loader scenario is evident.

I know that when problems are reported, this detail may not have been included
It does appear that CR4.OSFXSR is set early in newer BOOTMGR.
And CR4.OSXSAVE too.
Thanks for the update.  It's a little above my head, I'm not a windows programmer.
But I'm assuming these values in the control register are significant to the problem.

Let me know if there is anything further I can do to help
I saw the brief email discussion ... interesting.

Has it been determined if the AMD CPU is to blame, or if there is a bug in the Windows 10
Bootmgr/Winload where some bits are set incorrectly when booting to NTLDR ?
BTW, Geoff Chappell is a consultant you can pay, if Mozilla is interested.
I have done the debugging and confirmed this is a bug in BOOTMGR failing to clear XCR0 and CR4.OSXSAVE when chaining to NTLDR.
Doing this debugging in a VM took quite a bit of work including getting a checked build of NTLDR to confirm.
Thank you for the excellent investigation, Yuhong Bao. Since this isn't a Firefox bug, the crash rate is low, and it seems hard to do anything about it, I suggest that we close this bug report as WONTFIX.
Status: REOPENED → RESOLVED
Closed: 5 years ago4 years ago
Resolution: --- → WONTFIX
The bug should still be reported to Microsoft though.
Yuhong Bao I want to thank you myself for this work you did.
As was mentioned by Nicholas, the incidence of those experiencing this bug must be quite low.
(I wonder what % of Windows 10 users are also booting to XP?)

For myself, right now, I can find a work-around, boot XP from a floppy drive, for example.

How does one approach Microsoft with this kind of info?  Secondly, would they even be interested
in fixing it?

One thing I didn't really understand.  When the PC goes through POST, what is the content of those registers? Are those bits off by default ? (I assume they must be).
(In reply to davexnet01 from comment #67)
> One thing I didn't really understand.  When the PC goes through POST, what
> is the content of those registers? Are those bits off by default ? (I assume
> they must be).
Yes, they should be.
(In reply to davexnet01 from comment #67)
> Yuhong Bao I want to thank you myself for this work you did.
> As was mentioned by Nicholas, the incidence of those experiencing this bug
> must be quite low.

I just checked crash-stats.mozilla.com. This crash signature accounts for 255 out of 
3,242,628 crash reports submitted in the past 7 days. I.e. not that many.
If we do anything, it could be to blacklist the processor in the AVX test.  I think that I'd be happy to take code that did that.
You would be better off using IsProcessorFeaturePresent(PF_XSAVE_ENABLED)
Regarding Martin's comment above, is he implying that some errata in the AMD FX
is involved in this problem? Would the problem not arise if an Intel CPU
was used in an otherwise similar scenario ?
For posterity reasons only here is my configuration of the machine where I encounter this crash:

One HDD 500GB with the following OSs each on it's own partition:
- Windows XP 32bit
- Windows Vista 64bit
- Windows 7 32bit
- Windows 7 64bit
- Windows 8.1 32bit
- Windows 8.1 64bit

One SSD 256GB with the following OSs each on it's own partition:
- Windows 7 64bit
- Windows 10 64bit
- Ubuntu 16.04 32bit

One HDD 500GB with the following OSs each on it's own partition:
- Ubuntu 12.04 32bit
- Ubuntu 16.04 64bit
- Ubuntu 14.04 32bit

Software used to make this configuration was EasyBCD.

The CPU, Memory, BIOS etc. can be seen in comment 36.
FYI, I just finished a version for intel-gcm.s in bug 1283585
Good news! In Win10 1703, bootmgr!ImgPcatStartLegacyLoader now call bootmgr!ArchRestoreProcessorFeatures (to get rid of XSAVE bits)! This does not deal with CR4.OSFXSR, but it is a step forward.
See Also: → 1403353
You need to log in before you can comment on or make changes to this bug.