Crashes on AMD Zen 1 (family 23 model 1 stepping 1)
Categories
(Core :: Graphics, defect)
Tracking
()
Tracking | Status | |
---|---|---|
relnote-firefox | --- | 106+ |
firefox-esr102 | --- | unaffected |
firefox106 | + | fixed |
People
(Reporter: pascalc, Unassigned, NeedInfo)
References
(Blocks 1 open bug)
Details
(Keywords: crash, regression)
Crash Data
Crash report: https://crash-stats.mozilla.org/report/index/9b145d1b-1934-45b3-a92e-374190221019
Reason: EXCEPTION_ACCESS_VIOLATION_READ
Top 10 frames of crashing thread:
0 xul.dll mozilla::nsDisplayItem::GetClippedBounds const layout/painting/nsDisplayList.cpp:2692
0 xul.dll mozilla::nsDisplayList::GetClippedBoundsWithRespectToASR const layout/painting/nsDisplayList.cpp:2104
1 xul.dll mozilla::nsDisplayContainer::UpdateBounds layout/painting/nsDisplayList.cpp:2773
1 xul.dll mozilla::nsDisplayContainer::nsDisplayContainer layout/painting/nsDisplayList.cpp:2703
1 xul.dll mozilla::MakeDisplayItemWithIndex layout/painting/nsDisplayList.h:1999
1 xul.dll mozilla::MakeDisplayItem layout/painting/nsDisplayList.h:2045
1 xul.dll WrapInWrapList layout/generic/nsIFrame.cpp:3862
1 xul.dll nsIFrame::BuildDisplayListForChild layout/generic/nsIFrame.cpp:4329
2 xul.dll DisplayLine layout/generic/nsBlockFrame.cpp:7045
2 xul.dll nsBlockFrame::BuildDisplayList layout/generic/nsBlockFrame.cpp:7200
New crash signature in 106.0
Comment 1•2 years ago
Tim, it seems like DisplayList is involved here. Could you please have a look?
Comment 2•2 years ago
This seems like it's CPU specific and probably not graphics related
Comment 5•2 years ago
I had a look at a minidump for mozilla::LinkedListIterator<T>::operator* and the crashing instruction is: mov dword ptr [rsp+44h], eax. This instruction comes after a mov dword ptr [rsp+64h], ecx and a mov dword ptr [rsp+38h], r9
Reporter | ||
Comment 6•2 years ago
Gabriele, could you help us diagnose and confirm that we are hitting a CPU bug in 106.0? If this is the case, issuing a dot release should be a viable solution for us; otherwise we would need more investigation to fix this at the code level. Thanks!
Comment 7•2 years ago
(100.0% in signature vs 25.87% overall) CPU Info = family 23 model 1 stepping 1 [100.0% vs 31.58% if cpu_arch = amd64]
This correlation seems so strong it's almost certainly a hardware issue. But the 25% overall for an older AMD CPU seems...unexpected?
Comment 8•2 years ago
At least one user on SUMO referenced Twitch as a site with such crashes: https://support.mozilla.org/questions/1393585
If there is any mitigation for this -- such as modifying the value of a preference or denying a site permission -- that would be helpful to know for support purposes.
Comment 9•2 years ago
If there is any mitigation for this -- such as modifying the value of a preference or denying a site permission -- that would be helpful to know for support purposes.
Jeff, is there a pref that avoids this codepath? It sounds like Firefox is effectively unusable on affected machines, and I'm not sure what options we have here except to re-jiggle the code and hope that fixes it?
Comment 10•2 years ago
Nope. Most of the crashes are in our regular paint path.
Comment 11•2 years ago
(In reply to Jeff Muizelaar [:jrmuizel] from comment #5)
I had a look at a minidump for mozilla::LinkedListIterator<T>::operator* and the crashing instruction is: mov dword ptr [rsp+44h], eax. This instruction comes after a mov dword ptr [rsp+64h], ecx and a mov dword ptr [rsp+38h], r9
This is not static. So far I've confirmed three crash locations in four minidumps for this signature:
00007FFD274060A5 45 31 E4 xor r12d,r12d
00007FFD274060A8 31 F6 xor esi,esi
00007FFD274060AA 31 C9 xor ecx,ecx
00007FFD274060AC 31 C0 xor eax,eax
00007FFD274060AE 4C 89 4C 24 38 mov qword ptr [rsp+38h],r9
00007FFD274060B3 89 4C 24 64 mov dword ptr [rsp+64h],ecx
00007FFD274060B7 89 44 24 44 mov dword ptr [rsp+44h],eax // CRASH: 0xFFF...FFF
00007FFD274060BB 44 89 64 24 40 mov dword ptr [rsp+40h],r12d
00007FFD274060C0 89 74 24 54 mov dword ptr [rsp+54h],esi // CRASH: 0xFFF...FFF
00007FFD274060C4 48 89 5C 24 48 mov qword ptr [rsp+48h],rbx
00007FFD274060C9 4C 8B 7B 08 mov r15,qword ptr [rbx+8] // CRASH: 0x000...008
00007FFD274060CD 49 8B 07 mov rax,qword ptr [r15]
00007FFD274060D0 48 8B 80 B0 00 00 00 mov rax,qword ptr [rax+0B0h]
In all cases, the crash was reported as an attempt to access (EDIT: specifically, read) an inaccessible memory location.
- In the case of the first two locations, with address 0xFFF...FFF, this is not consistent with the register values reported in the minidump (that is, rsp has something that looks reasonable). (EDIT: Also, it's not consistent with the opcodes, which are writing rather than reading.)
- In the last case, with address 0x000...008, rbx is reported to have a value of 0, so the crash is what would be expected... except of course that rbx probably shouldn't have a value of 0 there, due to the test up at ...6094.
The crash is probably not localized to the point to which it is attributed in the crash dump. Crashes are distributed across several functions in a way which seems consistent with arising <mumble>ns after something else happens. Here is a more extensive crash-stats link, showing that crashes are not limited to a small handful of functions.
I haven't looked at many of the crashes yet (so far, only about 10, with a lot of selection-bias-induced correlation); but those that I have checked seem to have MakeDisplayItemWithIndex as their lowest common ancestor in the call stack, and nsIFrame::BuildDisplayListForChild as their highest shared ancestor above that.
Comment 12•2 years ago
Family 23 model 1, that's first-gen Ryzen, right? We already had issues in the past and I looked at the uncorrected errata, see bug 1687914 comment 8. The revision guide I pointed to in the comment has never been updated past that version, so even though more recent microcode bundles have been released for those processors it's unlikely that the errata have been fixed.
Comment 13•2 years ago
(In reply to Gabriele Svelto [:gsvelto] from comment #12)
Family 23 model 1, that's first-gen Ryzen, right? We already had issues in the past and I looked at the uncorrected errata, see bug 1687914 comment 8. The revision guide I pointed to in the comment has never been updated past that version, so even though more recent microcode bundles have been released for those processors it's unlikely that the errata have been fixed.
Family 23 (which AMD calls 17h) model 1 is the AMD Threadripper 1900X as far as I can tell, but all models of this generation appear to be affected (e.g. Ryzen 5 1400 and others); the identifying part of the product name is the 1xxx naming.
Comment 14•2 years ago
(In reply to Ashley Hale from comment #13)
Family 23 (which AMD calls 17h) model 1 is AMD Threadripper 1900X as far as I can tell
No, it stands for the whole family using that core (see this table).
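For readers unsure how "family 23" and "17h" relate: the family, model and stepping shown in crash reports are decoded from the EAX value returned by CPUID leaf 1. The following is a minimal sketch of that decoding. The names `CpuSignature` and `DecodeCpuSignature` are hypothetical and are not taken from the Firefox source, and the raw value mentioned afterwards is only an illustrative input.

```cpp
#include <cstdint>

// Hypothetical helper type: the three fields crash reports show as
// "family 23 model 1 stepping 1".
struct CpuSignature {
  uint32_t family;
  uint32_t model;
  uint32_t stepping;
};

// Decode the raw EAX value of CPUID leaf 1.
// The extended family is added only when the base family is 0xF, and the
// extended model supplies the high nibble of the model when the base family
// is 0x6 or 0xF.
inline CpuSignature DecodeCpuSignature(uint32_t eax) {
  const uint32_t stepping = eax & 0xF;
  const uint32_t baseModel = (eax >> 4) & 0xF;
  const uint32_t baseFamily = (eax >> 8) & 0xF;
  const uint32_t extModel = (eax >> 16) & 0xF;
  const uint32_t extFamily = (eax >> 20) & 0xFF;

  uint32_t family = baseFamily;
  if (baseFamily == 0xF) {
    family += extFamily;
  }
  uint32_t model = baseModel;
  if (baseFamily == 0x6 || baseFamily == 0xF) {
    model |= extModel << 4;
  }
  return CpuSignature{family, model, stepping};
}

// True for the signature that the crash reports in this bug correlate with.
inline bool IsFamily23Model1Stepping1(uint32_t eax) {
  const CpuSignature sig = DecodeCpuSignature(eax);
  return sig.family == 23 && sig.model == 1 && sig.stepping == 1;
}
```

With an input of 0x00800F11, the base family is 0xF and the extended family is 0x8, so the decoded family is 15 + 8 = 23 (17h), with model 1 and stepping 1.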
Comment 15•2 years ago
I've looked at a whole bunch of crashes and I really think it's the same errata I thought we hit in bug 1687914 (1021 and 1091 in AMD's errata). Both issues cause a bug in the store-to-load forwarding logic so a load delivers stale data instead of the contents of a previous store. Now looking at the crashes we see fundamentally two different types: one hitting address 0xffffffffffffffff and one hitting an address near NULL. In the first case that's because the CPU tried to load from a non-canonical address and the OS reported a general protection fault (hence the lack of an actual crash address). However, disassembly of the crashes disproves this: the processor was loading a register containing a valid pointer, hence it must have tried to load some other data instead, stale data probably. The second type of crash is more subtle but still due to the same cause: we have a load from a pointer after a NULL check... but disassembling the crashes shows that the register that was just tested contained NULL! So the preceding test instruction operated on stale non-NULL data, which let control flow go past the check and on to the crash.
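As a side note on the first crash type, here is a minimal sketch of what "non-canonical" means. It assumes 48-bit virtual addresses (4-level paging), and `IsCanonical48` is a hypothetical name used only for this illustration.

```cpp
#include <cstdint>

// Assumption: 48-bit virtual addresses. An address is canonical when
// bits 63..47 are all copies of bit 47, that is, all zeros or all ones.
inline bool IsCanonical48(uint64_t addr) {
  const uint64_t upper = addr >> 47;  // bit 47 plus bits 48..63 (17 bits)
  return upper == 0 || upper == 0x1FFFF;
}
```

Under this assumption, 0x0000800000000000 is non-canonical, while 0xffffffffffffffff itself passes the check. That fits the comment's point that the reported value stands in for a missing crash address and is not the address the CPU actually faulted on.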
Comment 16•2 years ago
I read https://www.amd.com/system/files/TechDocs/55449_Fam_17h_M_00h-0Fh_Rev_Guide.pdf while following along with this incident; if values other than 0xffffffffffffffff are being hit then I agree. For it to be only 0xffffffffffffffff seemed too anomalous for me to consider those errata in my reading, but having other values show up makes sense.
Comment 17•2 years ago
Fixed by the 106.0.1 rebuild.
Comment 18•2 years ago
(In reply to Ashley Hale from comment #16)
I read https://www.amd.com/system/files/TechDocs/55449_Fam_17h_M_00h-0Fh_Rev_Guide.pdf while following along with this incident; if values other than 0xffffffffffffffff are being hit then I agree. For it to be only 0xffffffffffffffff seemed too anomalous for me to consider those errata in my reading, but having other values show up makes sense.
One more detail of interest: there are two errata cited (1021, 1091), and while 1021 affects the entire Zen/Zen+/Zen2 family (but not Zen3 and later), erratum 1091 only affects the original Ryzen 1xxx series. If this bug is selective about which CPU series it affects, it is more likely to be erratum 1091, and we should be able to avoid that erratum completely by properly aligning 64-bit values. There's already a penalty for unaligned access, so we have performance reasons to avoid that case as well as stability reasons on this chip.
Comment 19•2 years ago
Hmm... do we know which kind of unaligned access we might be performing? I don't see any fancy packing or so in the code surrounding the crashing address, but I might've missed it.
Comment 20•2 years ago
One more detail of interest: there are two errata cited (1021, 1091), and while 1021 affects the entire Zen/Zen+/Zen2 family (but not Zen3 and later), erratum 1091 only affects the original Ryzen 1xxx series. If this bug is selective about which CPU series it affects
From my looking, it's really only Family 23 model 1 stepping 1 in the crash reports. But if I read the errata guide, Zen+ is also affected by 1021, which would be Family 23 Model 8, yet we don't seem to be seeing this either. So I'm not sure what to conclude from this.
Comment 21•2 years ago
The specific issue identified would be limited to Family 23 model 1 stepping 1 only; other Zen1/Zen1+ CPUs are not affected.
There is a fix available in AMD-provided system updates for the affected CPUs, which would require that the affected platforms actually install it. One surprising item is that this issue only now showed up with Firefox.
Comment 22•2 years ago
My interest in this is purely in identifying whether we have any coding patterns that are particularly vulnerable to a data race on Family 23 model 1 stepping 1 CPUs, as we could see this affect a future release.
To cite an example of a hot code path that would be very vulnerable to errata that cause stale loads: there is at least one place in the code where we use std::push_heap to insert items in a list while also reading it from another thread. This rapidly hits the addresses with store and load operations in several different orderings (which makes it more likely to hit CPU errata). Presumably there are x86 lock prefix instructions occurring around this time, and not necessarily on the writing thread. If erratum 1021 caused a stale load on the writing thread, it would corrupt the list being inserted into (not to be confused with the expected behavior of stale values seen by the reading thread; x86 lockless programming is fun like that).
Whereas for erratum 1091, the most likely similar data structure I can imagine is just a map implementation using unaligned structs. If alignment is enabled then struct {int32 key;void *value;} takes 16 bytes on x86_64, but if not enabled it would be 12 bytes and half the elements of a vector of these structs would hold unaligned pointers, making it very possible to get stale data for loads of pointers crossing a 4K boundary in a data race condition.
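The size and alignment arithmetic in the last paragraph can be checked with a short sketch. Everything here is hypothetical: `AlignedEntry`, `PackedEntry` and the helper functions are illustrative names, not structures from the Firefox source, and the offsets assume the array starts on a 4 KiB page boundary on a 64-bit target.

```cpp
#include <cstddef>
#include <cstdint>

// Default layout: the compiler pads `key` so that `value` is 8-byte aligned.
struct AlignedEntry {
  int32_t key;
  void* value;
};

// Packed layout: no padding, so `value` sits at offset 4 in each element.
#pragma pack(push, 1)
struct PackedEntry {
  int32_t key;
  void* value;
};
#pragma pack(pop)

static_assert(sizeof(void*) == 8, "sketch assumes a 64-bit target");
static_assert(sizeof(AlignedEntry) == 16, "4-byte key, 4 bytes padding, 8-byte pointer");
static_assert(sizeof(PackedEntry) == 12, "4-byte key followed directly by the pointer");

// Byte offset of the `value` field of element `index` in a packed array,
// measured from the (assumed page-aligned) start of the array.
inline uint64_t PackedValueOffset(uint64_t index) {
  return index * sizeof(PackedEntry) + offsetof(PackedEntry, value);
}

// True when an 8-byte access at `offset` is naturally aligned.
inline bool IsAligned8(uint64_t offset) { return offset % 8 == 0; }

// True when an 8-byte access starting at `offset` spans two 4 KiB pages.
inline bool Crosses4K(uint64_t offset) {
  return (offset / 4096) != ((offset + 7) / 4096);
}
```

In the packed layout the pointer offsets are 4, 16, 28, 40, and so on, so every other element holds a misaligned pointer. Under the page-aligned assumption, element 682 has its pointer at offset 8188, which straddles a 4 KiB boundary.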
Comment 23•2 years ago
Not sure if it'll help, but I've created a list of Firefox 106.0 crashes on my computer.
There are 29 crashes on the list, but one of them seems not to be related to this issue: an EXCEPTION_ACCESS_VIOLATION_WRITE in Rust code. The crashes are all over the place, including one in __security_check_cookie(), but all have nsIFrame::BuildDisplayListForChild(mozilla::nsDisplayListBuilder*, nsIFrame*, mozilla::nsDisplayListSet const&, mozilla::EnumSet<nsIFrame::DisplayChildFlag, unsigned int>); at least once in frames <=2.
Most of these crashes occurred while loading an issue on GitHub. If you're interested I have a 434 MiB minidump from a crash with WinDbg attached.
Comment 24•2 years ago
krzysdz, can you still reproduce these crashes on demand using 106.0? Can you check what version of the AMD chipset drivers you have?
Comment 25•2 years ago
I've just installed 106.0 and can reproduce crashes (bp-a2e3a85d-aa30-4a57-b43a-0def70221027 is from a fresh 106.0 install).
AMD Chipset Drivers version: 3.10.08.506 (2021-10-21) (screenshot)
Comment 26•2 years ago
krzysdz, can you try installing the latest AMD chipset drivers from www.amd.com/support and check if you can still reproduce the problem after that?
Comment 27•2 years ago
(In reply to Paul Blinzer from comment #21)
One surprising item is that this issue only now showed up with Firefox.
We have already encountered similar issues in the past, just not on the release channel. See comment 15.
Comment 28•2 years ago
Crashes still occur after updating chipset drivers to the latest version (4.09.23.507) and rebooting. I'm also running an "old" BIOS version - 5603 for ASUS Prime X370-Pro, but it was the last update until Ryzen 5xxx support was introduced and default settings changed for Windows 11.
Pre-update verification: 39a368de-3315-4b23-9493-5f7320221027
New drivers crash #1 (GitHub): no report form (screenshot)
New drivers crash #2 (GitHub): ed4d447b-015e-47a2-8881-f73d90221027 (screenshot)
New drivers crash #3 (Mozilla Crash Reports): ae240229-281f-47f4-bd9f-aa9c30221027
New drivers crash #4 (Mozilla Crash Reports): 5b4addb4-06c5-48a9-a85e-3bbb10221027
New drivers crash #5 (GitHub): WinDbg screenshot
Full WinDbg message (can't be seen on the screenshot):
(588.25c0): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
xul!mozilla::nsDisplayItem::GetClippedBounds [inlined in xul!mozilla::nsDisplayList::GetClippedBoundsWithRespectToASR+0x7d]:
00007ff8`76be60cd 498b07 mov rax,qword ptr [r15] ds:0000028b`26d5d7d0={xul!mozilla::nsDisplayCompositorHitTestInfo::`vftable' (00007ff8`7a6f06d0)}
Comment 29•2 years ago
krzysdz, AMD has released a new chipset driver that should fix this: https://drivers.amd.com/drivers/amd_chipset_software_4.11.15.342.exe
Can you try installing that and check if you can still reproduce the problem?
Comment 30•2 years ago
Hmm, it seems like that link needs an AMD referrer to work. You'll need to find it (4.11.15.342) from here: https://www.amd.com/en/support