1796126 - Crashes on AMD Zen 1 (family 23 model 1 stepping 1)

Pascal Chevrel:pascalc

Reporter

Description

•

2 years ago

Crash report: https://crash-stats.mozilla.org/report/index/9b145d1b-1934-45b3-a92e-374190221019

Reason: EXCEPTION_ACCESS_VIOLATION_READ

Top 10 frames of crashing thread:

0 xul.dll mozilla::nsDisplayItem::GetClippedBounds const layout/painting/nsDisplayList.cpp:2692
0 xul.dll mozilla::nsDisplayList::GetClippedBoundsWithRespectToASR const layout/painting/nsDisplayList.cpp:2104
1 xul.dll mozilla::nsDisplayContainer::UpdateBounds layout/painting/nsDisplayList.cpp:2773
1 xul.dll mozilla::nsDisplayContainer::nsDisplayContainer layout/painting/nsDisplayList.cpp:2703
1 xul.dll mozilla::MakeDisplayItemWithIndex layout/painting/nsDisplayList.h:1999
1 xul.dll mozilla::MakeDisplayItem layout/painting/nsDisplayList.h:2045
1 xul.dll WrapInWrapList layout/generic/nsIFrame.cpp:3862
1 xul.dll nsIFrame::BuildDisplayListForChild layout/generic/nsIFrame.cpp:4329
2 xul.dll DisplayLine layout/generic/nsBlockFrame.cpp:7045
2 xul.dll nsBlockFrame::BuildDisplayList layout/generic/nsBlockFrame.cpp:7200

New crash signature in 106.0

Pascal Chevrel:pascalc

Reporter

Updated

•

2 years ago

status-firefox106: --- → affected

status-firefox107: --- → affected

Comment 1

•

2 years ago

Tim, it seems like DisplayList is involved here. Could you please have a look?

Flags: needinfo?(tnikkel)

Jeff Muizelaar [:jrmuizel]

Comment 2

•

2 years ago

This seems like it's CPU specific and probably not graphics related

Summary: Crash in [@ mozilla::nsDisplayItem::GetClippedBounds] → Crashes on AMD Zen (family 23 model 1 stepping 1)

Jeff Muizelaar [:jrmuizel]

Updated

•

2 years ago

Crash Signature: [@ mozilla::nsDisplayItem::GetClippedBounds] → [@ mozilla::nsDisplayItem::GetClippedBounds] [@ nsRect::SaturatingUnionEdges ]

Jeff Muizelaar [:jrmuizel]

Updated

•

2 years ago

Crash Signature: [@ mozilla::nsDisplayItem::GetClippedBounds] [@ nsRect::SaturatingUnionEdges ] → [@ mozilla::nsDisplayItem::GetClippedBounds] [@ nsRect::SaturatingUnionEdges ] [@ nsRect::SaturatingUnion]

Jeff Muizelaar [:jrmuizel]

Updated

•

2 years ago

Crash Signature: [@ mozilla::nsDisplayItem::GetClippedBounds] [@ nsRect::SaturatingUnionEdges ] [@ nsRect::SaturatingUnion] → [@ mozilla::nsDisplayItem::GetClippedBounds] [@ nsRect::SaturatingUnionEdges ] [@ nsRect::SaturatingUnion] [@ mozilla::nsDisplayContainer::UpdateBounds]

Jeff Muizelaar [:jrmuizel]

Updated

•

2 years ago

Crash Signature: [@ mozilla::nsDisplayItem::GetClippedBounds] [@ nsRect::SaturatingUnionEdges ] [@ nsRect::SaturatingUnion] [@ mozilla::nsDisplayContainer::UpdateBounds] → [@ mozilla::nsDisplayItem::GetClippedBounds] [@ nsRect::SaturatingUnionEdges ] [@ nsRect::SaturatingUnion] [@ mozilla::nsDisplayContainer::UpdateBounds] [@ mozilla::LinkedListIterator<T>::operator*]

Jeff Muizelaar [:jrmuizel]

Comment 5

•

2 years ago

I had a look at a minidump for mozilla::LinkedListIterator<T>::operator* and the crashing instruction is: mov dword ptr [rsp+44h], eax. This instruction comes after a mov dword ptr [rsp+64h], ecx and a mov qword ptr [rsp+38h], r9.

Pascal Chevrel:pascalc

Reporter

Comment 6

•

2 years ago

Gabriele, could you help us diagnose and confirm that we are hitting a CPU bug in 106.0? If this is the case, issuing a dot release should be a viable solution for us, otherwise we would need more investigation to fix this at the code level. Thanks!

status-firefox-esr102: --- → unaffected

tracking-firefox106: --- → +

Flags: needinfo?(gsvelto)

Gian-Carlo Pascutto [:gcp]

Comment 7

•

2 years ago

(100.0% in signature vs 25.87% overall) CPU Info = family 23 model 1 stepping 1 [100.0% vs 31.58% if cpu_arch = amd64]

This correlation seems so strong it's almost certainly a hardware issue. But the 25% overall for an older AMD CPU seems...unexpected?

jscher2000

Comment 8

•

2 years ago

At least one user on SUMO referenced Twitch as a site with such crashes: https://support.mozilla.org/questions/1393585

If there is any mitigation for this -- such as modifying the value of a preference or denying a site permission -- that would be helpful to know for support purposes.

Gian-Carlo Pascutto [:gcp]

Comment 9

•

2 years ago

If there is any mitigation for this -- such as modifying the value of a preference or denying a site permission -- that would be helpful to know for support purposes.

Jeff, is there a pref that avoids this codepath? It sounds like Firefox is effectively unusable on affected machines, and I'm not sure what options we have here except to re-jiggle the code and hope that fixes it?

Flags: needinfo?(jmuizelaar)

Jeff Muizelaar [:jrmuizel]

Comment 10

•

2 years ago

Nope. Most of the crashes are in our regular paint path.

Flags: needinfo?(jmuizelaar)

Jeff Muizelaar [:jrmuizel]

Updated

•

2 years ago

Flags: needinfo?(tnikkel)

Ray Kraesig [:rkraesig]

Comment 11

•

2 years ago

•

Edited

(In reply to Jeff Muizelaar [:jrmuizel] from comment #5)

I had a look at a minidump for mozilla::LinkedListIterator<T>::operator* and the crashing instruction is: mov dword ptr [rsp+44h], eax. This instruction comes after a mov dword ptr [rsp+64h], ecx and a mov qword ptr [rsp+38h], r9.

This is not static. So far I've confirmed three crash locations in four minidumps for this signature:

00007FFD274060A5 45 31 E4             xor         r12d,r12d  
00007FFD274060A8 31 F6                xor         esi,esi  
00007FFD274060AA 31 C9                xor         ecx,ecx  
00007FFD274060AC 31 C0                xor         eax,eax  
00007FFD274060AE 4C 89 4C 24 38       mov         qword ptr [rsp+38h],r9  
00007FFD274060B3 89 4C 24 64          mov         dword ptr [rsp+64h],ecx  
00007FFD274060B7 89 44 24 44          mov         dword ptr [rsp+44h],eax      // CRASH: 0xFFF...FFF
00007FFD274060BB 44 89 64 24 40       mov         dword ptr [rsp+40h],r12d  
00007FFD274060C0 89 74 24 54          mov         dword ptr [rsp+54h],esi      // CRASH: 0xFFF...FFF
00007FFD274060C4 48 89 5C 24 48       mov         qword ptr [rsp+48h],rbx  
00007FFD274060C9 4C 8B 7B 08          mov         r15,qword ptr [rbx+8]        // CRASH: 0x000...008
00007FFD274060CD 49 8B 07             mov         rax,qword ptr [r15]  
00007FFD274060D0 48 8B 80 B0 00 00 00 mov         rax,qword ptr [rax+0B0h]

In all cases, the crash was reported as an attempt to ~~access~~ (EDIT: specifically, read) an inaccessible memory location.

In the case of the first two locations, with address 0xFFF...FFF, this is not consistent with the register values reported in the minidump (that is, rsp has something that looks reasonable). (EDIT: Also, it's not consistent with the opcodes, which are writing rather than reading.)
In the last case, with address 0x000...008, rbx is reported to have a value of 0, so the crash is what would be expected... except of course that rbx probably shouldn't have a value of 0 there, due to the test up at ...6094.

The crash is probably not localized to the point to which it is attributed in the crash dump. Crashes are distributed across several functions in a way which seems consistent with arising <mumble>ns after something else happens. Here is a more extensive crash-stats link, showing that crashes are not limited to a small handful of functions.

I haven't looked at many of the crashes yet (so far, only about 10, with a lot of selection-bias-induced correlation); but those that I have checked seem to have MakeDisplayItemWithIndex as their lowest common ancestor in the call stack, and nsIFrame::BuildDisplayListForChild as their highest shared ancestor above that.

Gabriele Svelto [:gsvelto]

Comment 12

•

2 years ago

Family 23 model 1, that's first-gen Ryzen, right? We already had issues in the past and I looked at the uncorrected errata; see bug 1687914 comment 8. The revision guide I pointed to in that comment has never been updated past that version, so even though more recent microcode bundles have been released for those processors, it's unlikely that the errata have been fixed.

Flags: needinfo?(gsvelto)

Jeff Muizelaar [:jrmuizel]

Updated

•

2 years ago

Summary: Crashes on AMD Zen (family 23 model 1 stepping 1) → Crashes on AMD Zen 1 (family 23 model 1 stepping 1)

Ashley Hale [:ahale]

Comment 13

•

2 years ago

(In reply to Gabriele Svelto [:gsvelto] from comment #12)

Family 23 model 1, that's first-gen Ryzen, right? We already had issues in the past and I looked at the uncorrected errata; see bug 1687914 comment 8. The revision guide I pointed to in that comment has never been updated past that version, so even though more recent microcode bundles have been released for those processors, it's unlikely that the errata have been fixed.

Family 23 (which AMD calls 17h) model 1 is AMD Threadripper 1900X as far as I can tell, but all models of this generation appear to be affected (e.g. Ryzen 5 1400 and others); the identifying part of the product name is the 1xxx numbering.

Gabriele Svelto [:gsvelto]

Comment 14

•

2 years ago

(In reply to Ashley Hale from comment #13)

Family 23 (which AMD calls 17h) model 1 is AMD Threadripper 1900X as far as I can tell

No, it stands for the whole family using that core (see this table).
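
For readers following the family/model/stepping discussion, here is a minimal sketch of how those three numbers are derived from the CPUID leaf 1 EAX value. The helper name `decode_cpuid_signature` is hypothetical, and the sample value 0x00800F11 is an assumption (it is the signature commonly reported for first-generation Ryzen parts); it is not taken from the crash reports in this bug.

```python
def decode_cpuid_signature(eax: int) -> tuple[int, int, int]:
    """Split a CPUID leaf-1 EAX value into (family, model, stepping)."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    if base_family == 0xF:
        # The extended fields only apply when the base family is saturated.
        family = base_family + ext_family
        model = (ext_model << 4) | base_model
    else:
        family = base_family
        model = base_model
    return family, model, stepping


# Assumed sample signature, not read from a real machine here.
family, model, stepping = decode_cpuid_signature(0x00800F11)
print(f"family {family} ({family:#x}) model {model} stepping {stepping}")
# prints: family 23 (0x17) model 1 stepping 1
```

This is why the crash reports say "family 23" while AMD's documents say "17h": they are the same number in decimal and hexadecimal.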

Gabriele Svelto [:gsvelto]

Comment 15

•

2 years ago

I've looked at a whole bunch of crashes and I really think it's the same errata I thought we hit in bug 1687914 (1021 and 1091 in AMD's errata). Both issues cause a bug in the store-to-load forwarding logic, so a load delivers stale data instead of the contents of a previous store. Looking at the crashes, we see fundamentally two different types: one hitting address 0xffffffffffffffff and one hitting an address near NULL. The first type looks as if the CPU tried to load from a non-canonical address and the OS reported a general protection fault (hence the lack of an actual crash address). However, disassembly of the crashes contradicts that reading: the register being dereferenced contained a valid pointer, hence the processor must have used some other data instead, probably stale data. The second type of crash is more subtle but still due to the same cause: we have a load through a pointer after a NULL check... but disassembling the crashes shows that the register that was just tested contained NULL! So the preceding test instruction operated on stale non-NULL data, which let control flow go past the check and led to the crash.
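
As a rough illustration of the "non-canonical address" point above, the sketch below checks whether a 64-bit virtual address is canonical. It assumes 48-bit virtual addressing (the usual x86-64 configuration for this CPU generation), and `is_canonical` is a hypothetical helper name, not something from the Firefox tree.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """Return True if a 64-bit virtual address is canonical.

    Canonical means bits 63 down to va_bits-1 are all copies of the
    same bit, i.e. the address is a sign extension of its low va_bits.
    """
    addr &= (1 << 64) - 1
    upper = addr >> (va_bits - 1)             # bits 63 .. va_bits-1
    all_ones = (1 << (64 - va_bits + 1)) - 1  # every one of those bits set
    return upper == 0 or upper == all_ones


# Address taken from the WinDbg output later in this bug: canonical.
print(is_canonical(0x0000028B26D5D7D0))  # True
# A near-NULL address is canonical too; it faults because it is unmapped.
print(is_canonical(0x8))                 # True
# A garbage value with mixed upper bits is not canonical.
print(is_canonical(0x4141414141414141))  # False
```

Note that 0xffffffffffffffff is itself canonical under this check. As far as I understand it, that value in a crash report is a placeholder reported when a general protection fault carries no faulting address, which matches the "lack of an actual crash address" remark.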

Ashley Hale [:ahale]

Comment 16

•

2 years ago

I read https://www.amd.com/system/files/TechDocs/55449_Fam_17h_M_00h-0Fh_Rev_Guide.pdf while following along with this incident; if values other than 0xffffffffffffffff are being hit, then I agree. For it to be only 0xffffffffffffffff seemed too anomalous for me to consider those errata in my reading, but having other values show up makes sense.

Comment 17

•

2 years ago

Fixed by the 106.0.1 rebuild.

Status: NEW → RESOLVED

Closed: 2 years ago

status-firefox106: affected → fixed

status-firefox107: affected → ---

Resolution: --- → FIXED

Comment 18

•

2 years ago

(In reply to Ashley Hale from comment #16)

I read https://www.amd.com/system/files/TechDocs/55449_Fam_17h_M_00h-0Fh_Rev_Guide.pdf while following along with this incident; if values other than 0xffffffffffffffff are being hit, then I agree. For it to be only 0xffffffffffffffff seemed too anomalous for me to consider those errata in my reading, but having other values show up makes sense.

One more detail of interest: there are two errata cited (1021, 1091), and while 1021 affects the entire Zen/Zen+/Zen2 family (but not Zen3 and later), erratum 1091 only affects the original Ryzen 1xxx series. If this bug is selective about which CPU series it affects, it would more likely be erratum 1091, and we should be able to completely avoid that erratum by properly aligning 64-bit values. There's a penalty to unaligned access already, so we have performance reasons to avoid that case as well as stability on this chip.

Comment 19

•

2 years ago

Hmm... do we know which kind of unaligned access we might be performing? I don't see any fancy packing or so in the code surrounding the crashing address, but I might've missed it.

Gian-Carlo Pascutto [:gcp]

Comment 20

•

2 years ago

•

Edited

One more detail of interest: there are two errata cited (1021, 1091), and while 1021 affects the entire Zen/Zen+/Zen2 family (but not Zen3 and later), erratum 1091 only affects the original Ryzen 1xxx series. If this bug is selective about which CPU series it affects

From my looking, it's really only Family 23 model 1 stepping 1 in the crash reports. But if I read the errata guide, Zen+ is also affected by 1021, which would be Family 23 Model 8, yet we don't seem to be seeing this either. So I'm not sure what to conclude from this.

Paul Blinzer

Comment 21

•

2 years ago

The specific issue identified here would be limited to Family 23 model 1 stepping 1 only; other Zen1/Zen1+ CPUs are not affected.

There is a fix available in AMD-provided system updates for the affected CPUs, which presumes that the affected platforms actually install it. One item that is surprising is that this issue only now showed up with Firefox.

Ashley Hale [:ahale]

Comment 22

•

2 years ago

•

Edited

My interest in this is purely in identifying if we have any coding patterns that are particularly vulnerable to a data race on Family 23 model 1 stepping 1 CPUs, as we could see this affect a future release.

To cite an example of a hot code path that would be very vulnerable to errata that cause stale loads: there is at least one place in the code where we use std::push_heap to insert items in a list while also reading it from another thread. This rapidly hits the same addresses with store and load operations in several different orderings, which makes it more likely to hit CPU errata. Presumably there are x86 lock-prefixed instructions occurring around this time, and not necessarily on the writing thread. If erratum 1021 caused a stale load on the writing thread, it would corrupt the list being inserted into (not to be confused with the expected behavior of stale values seen by the reading thread; x86 lockless programming is fun like that).

Whereas for erratum 1091, the most likely similar data structure I can imagine is just a map implementation using unaligned structs. If alignment is enabled then struct {int32 key; void *value;} takes 16 bytes on x86_64, but if not enabled it would be 12 bytes and half the elements of a vector of these structs would hold unaligned pointers, making it very possible to get stale data for loads of pointers crossing a 4K boundary in a data race condition.
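
The 16-byte versus 12-byte arithmetic above can be checked with a small sketch. It models only byte offsets, assumes 8-byte pointers and an array that starts on a 4K page boundary, and uses hypothetical helper names (`pointer_offsets`, `is_misaligned`, `crosses_page`); it does not reproduce the erratum itself.

```python
PAGE = 4096
PTR_SIZE = 8  # pointer width on x86_64


def pointer_offsets(count: int, packed: bool) -> list[int]:
    """Byte offsets of `value` in an array of {int32 key; void *value;}."""
    key_size = 4
    # With natural alignment the pointer is padded up to an 8-byte boundary.
    value_offset = key_size if packed else 8
    stride = value_offset + PTR_SIZE  # 12 bytes packed, 16 bytes aligned
    return [i * stride + value_offset for i in range(count)]


def is_misaligned(offset: int) -> bool:
    return offset % PTR_SIZE != 0


def crosses_page(offset: int) -> bool:
    # An 8-byte access crosses a 4K boundary when its first and last
    # bytes fall on different pages.
    return offset // PAGE != (offset + PTR_SIZE - 1) // PAGE


for packed in (False, True):
    offs = pointer_offsets(1024, packed)
    misaligned = sum(is_misaligned(o) for o in offs)
    crossing = [i for i, o in enumerate(offs) if crosses_page(o)]
    print(packed, misaligned, crossing)
# prints:
# False 0 []
# True 512 [682]
```

Under these assumptions, the packed layout leaves exactly half of the pointers misaligned, and element 682 of the first 1024 has a pointer that straddles a 4K boundary, while the naturally aligned layout has neither problem.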

krzysdz

Comment 23

•

2 years ago

Not sure if it'll help, but I've created a list of Firefox 106.0 crashes on my computer.
There are 29 crashes on the list, but one of them seems not to be related to this issue - EXCEPTION_ACCESS_VIOLATION_WRITE in Rust code. The crashes are all over the place, including one in __security_check_cookie(), but all have nsIFrame::BuildDisplayListForChild(mozilla::nsDisplayListBuilder*, nsIFrame*, mozilla::nsDisplayListSet const&, mozilla::EnumSet<nsIFrame::DisplayChildFlag, unsigned int>); at least once in frames <=2.

Most of these crashes occurred while loading an issue on GitHub. If you're interested I have a 434 MiB minidump from a crash with WinDbg attached.

Jeff Muizelaar [:jrmuizel]

Comment 24

•

2 years ago

krzysdz, can you still reproduce these crashes on demand using 106.0? Can you check what version of the AMD chipset drivers you have?

Flags: needinfo?(krzysdz)

krzysdz

Comment 25

•

2 years ago

I've just installed 106.0 and can reproduce crashes (bp-a2e3a85d-aa30-4a57-b43a-0def70221027 is from a fresh 106.0 install).

AMD Chipset Drivers version: 3.10.08.506 (2021-10-21) (screenshot)

Flags: needinfo?(krzysdz)

Jeff Muizelaar [:jrmuizel]

Comment 26

•

2 years ago

krzysdz, can you try installing the latest AMD chipset drivers from www.amd.com/support and check if you can still reproduce the problem after that?

Flags: needinfo?(krzysdz)

Gabriele Svelto [:gsvelto]

Comment 27

•

2 years ago

(In reply to Paul Blinzer from comment #21)

One item that is surprising is that this issue only now showed up with Firefox.

We have already encountered similar issues in the past, just not on the release channel. See comment 15.

krzysdz

Comment 28

•

2 years ago

Crashes still occur after updating chipset drivers to the latest version (4.09.23.507) and rebooting. I'm also running an "old" BIOS version - 5603 for ASUS Prime X370-Pro, but it was the last update until Ryzen 5xxx support was introduced and default settings changed for Windows 11.

Pre-update verification: 39a368de-3315-4b23-9493-5f7320221027
New drivers crash #1 (GitHub): no report form (screenshot)
New drivers crash #2 (GitHub): ed4d447b-015e-47a2-8881-f73d90221027 (screenshot)
New drivers crash #3 (Mozilla Crash Reports): ae240229-281f-47f4-bd9f-aa9c30221027
New drivers crash #4 (Mozilla Crash Reports): 5b4addb4-06c5-48a9-a85e-3bbb10221027
New drivers crash #5 (GitHub): WinDbg screenshot

Full WinDbg message (can't be seen on the screenshot):

(588.25c0): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
xul!mozilla::nsDisplayItem::GetClippedBounds [inlined in xul!mozilla::nsDisplayList::GetClippedBoundsWithRespectToASR+0x7d]:
00007ff8`76be60cd 498b07          mov     rax,qword ptr [r15] ds:0000028b`26d5d7d0={xul!mozilla::nsDisplayCompositorHitTestInfo::`vftable' (00007ff8`7a6f06d0)}

Flags: needinfo?(krzysdz)

Dianna Smith [:diannaS]

Updated

•

2 years ago

relnote-firefox: --- → 106+

Jeff Muizelaar [:jrmuizel]

Comment 29

•

2 years ago

krzysdz, AMD has released a new chipset driver that should fix this: https://drivers.amd.com/drivers/amd_chipset_software_4.11.15.342.exe

Can you try installing that and check if you can still reproduce the problem?

Flags: needinfo?(krzysdz)

Jeff Muizelaar [:jrmuizel]

Comment 30

•

2 years ago

Hmm, it seems like that link needs an AMD referrer to work. You'll need to find it (4.11.15.342) from here: https://www.amd.com/en/support

Ryan VanderMeulen [:RyanVM]

Updated

•

2 years ago

Updated

•

10 months ago

Blocks: cpu-bugs