Open Bug 1950764 Opened 5 months ago Updated 12 days ago

Crash in [@ zlib_rs::deflate::State::d_code] on Raptor Lake CPUs

Categories

(Core :: General, defect)

Unspecified
Windows 10
defect

Tracking


Tracking Status
firefox137 - disabled
firefox138 --- disabled

People

(Reporter: mccr8, Unassigned)

References

(Blocks 2 open bugs, Regression)

Details

(Keywords: crash, regression)

Crash Data

Crash report: https://crash-stats.mozilla.org/report/index/113ebfb1-9197-44e2-88ce-322b20250223

MOZ_CRASH Reason:

index out of bounds: the len is 512 but the index is 567

Top 10 frames:

0  xul.dll  MOZ_Crash(char const*, int, char const*)  mfbt/Assertions.h:382
0  xul.dll  RustMozCrash(char const*, int, char const*)  mozglue/static/rust/wrappers.cpp:18
1  xul.dll  mozglue_static::panic_hook(std::panic::PanicHookInfo*)  mozglue/static/rust/lib.rs:102
2  xul.dll  core::ops::function::Fn::call<void (*)(ref$<std::panic::PanicHookInfo>), tupl...  /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/core/src/ops/function.rs:250
3  xul.dll  alloc::boxed::impl$30::call()  library/alloc/src/boxed.rs:2007
3  xul.dll  std::panicking::rust_panic_with_hook()  library/std/src/panicking.rs:836
4  xul.dll  std::panicking::begin_panic_handler::closure$0()  library/std/src/panicking.rs:701
5  xul.dll  std::sys::backtrace::__rust_end_short_backtrace<std::panicking::begin_panic_h...  library/std/src/sys/backtrace.rs:168
6  xul.dll  std::panicking::begin_panic_handler()  library/std/src/panicking.rs:692
7  xul.dll  core::panicking::panic_fmt()  library/core/src/panicking.rs:75

Looks like we're hitting a bounds check.

Flags: needinfo?(mh+mozilla)

Do we have a reproducer for any of these?

Flags: needinfo?(mh+mozilla) → needinfo?(continuation)

Oh wait, every single crash is coming from a machine with CPU ID family 6 model 183 stepping 1. That's our good friend Raptor Lake (Raptor Cove S / Gracemont E to be precise).
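For reference, the family/model/stepping triple above comes from the signature in CPUID leaf 1's EAX register. A minimal decoding sketch (not Firefox code; the function name and the raw EAX value 0xB0671 are reconstructed from family 6 / model 183 / stepping 1 per the Intel SDM encoding):

```rust
/// Decode family/model/stepping from CPUID leaf 1's EAX value.
/// Family 6, model 183 (0xB7), stepping 1 is Raptor Lake.
fn decode_signature(eax: u32) -> (u32, u32, u32) {
    let stepping = eax & 0xF;
    let base_model = (eax >> 4) & 0xF;
    let base_family = (eax >> 8) & 0xF;
    let ext_model = (eax >> 16) & 0xF;
    let ext_family = (eax >> 20) & 0xFF;

    // Per the Intel SDM: extended family is added only when family == 0xF;
    // extended model extends the model when family is 0x6 or 0xF.
    let family = if base_family == 0xF {
        base_family + ext_family
    } else {
        base_family
    };
    let model = if base_family == 0x6 || base_family == 0xF {
        (ext_model << 4) | base_model
    } else {
        base_model
    };
    (family, model, stepping)
}

fn main() {
    // 0xB0671 is the raw signature these fields imply (an assumption,
    // reconstructed from the family/model/stepping reported above).
    assert_eq!(decode_signature(0xB0671), (6, 183, 1));
    println!("family 6, model 183 (0xB7), stepping 1 = Raptor Lake");
}
```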

Summary: Crash in [@ zlib_rs::deflate::State::d_code] → Crash in [@ zlib_rs::deflate::State::d_code] on Raptor Lake CPUs

Additionally, the most recent microcode version we see in these crashes is 0x129. The latest microcode version for this Raptor Lake stepping is 0x12c, and we have no crashes with that version.

Good catch! I didn't even think to look for a CPU issue given the volume of this crash on Nightly.

I have no information about how to reproduce this beyond what gsvelto said.

Flags: needinfo?(continuation)

Small correction: we also have crashes on file from microcode version 0x12b, which is almost the latest. So unless the bug has been p̶a̶p̶e̶r̶e̶d̶ ̶o̶v̶e̶r̶ fixed in 0x12c, it affects all microcode versions. Microcode 0x12c was released two weeks ago, so not many machines will be running it yet.

Hi Gabriele! Can you help set the severity for this issue? Is there anything we can do, in addition to waiting for the resolution of https://github.com/trifectatechfoundation/zlib-rs/issues/306? Thank you.

Flags: needinfo?(gsvelto)

From a severity perspective this is really bad: it manifests as either a content-process or full browser crash, and it happens often on Nightly, so I'd say S2. The problem is that it's entirely out of our hands unless we understand what instruction sequence triggers it, and even if we identify it, avoiding it might be impossible. It would also be a lot of work. We should definitely reach out to our contacts at Intel given the volume of this crash.

Flags: needinfo?(gsvelto)

To be clear, this does not affect anything other than Nightly. The code is not set to ride the trains yet.

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 desktop browser crashes on nightly

For more information, please visit BugBot documentation.

Keywords: topcrash
Severity: -- → S2

Given that this is on Raptor Lake, it sounds like the assumption is that this is the over-voltage bug? Assuming that's the case, then as noted there, "there is no fix to the issue if it already affects a CPU, and any damage to the CPU is permanent". So if the CPU had already been damaged, it may not matter if the microcode is a fixed version. (Leaving aside that many of the crashes appear to be on pre-0x129 versions.)

Does Mozilla have a standard procedure for handling likely Raptor Lake CPU bugs? Reaching out to Intel may be a good idea, but also, given the number of crashes occurring with this CPU, might there be some reasonable way to encourage people to take advantage of the warranty extension on their CPUs?

Trying to avoid getting nerd-sniped by the possibility of narrowing down the issue further, since doing so would effectively be trying to reverse-engineer the CPU bug. But some general questions for someone who has access to the crash data: Does the "index out of bounds" panic for these crashes have a consistent index (e.g. the 567 mentioned in this issue) or are the indexes all over the place (between 513 and 767)? Do the backtraces always go through compress_block_dynamic_trees and emit_dist, rather than compress_block_static_trees and emit_dist_static, or is it a mix of both?

The index is not always 567; e.g.: 527, 719, 615, 566, 747.
7 out of 7 of the crashes I checked went through compress_block_dynamic_trees.

(In reply to Josh Triplett from comment #11)

Given that this is on Raptor Lake, it sounds like the assumption is that this is the over-voltage bug? Assuming that's the case, then as noted there, "there is no fix to the issue if it already affects a CPU, and any damage to the CPU is permanent". So if the CPU had already been damaged, it may not matter if the microcode is a fixed version. (Leaving aside that many of the crashes appear to be on pre-0x129 versions.)

This crash is very consistent, while the overvoltage issue was random in nature, so it's more likely to be one of the existing errata or a new one.

Does Mozilla have a standard procedure for handling likely Raptor Lake CPU bugs? Reaching out to Intel may be a good idea, but also, given the number of crashes occurring with this CPU, might there be some reasonable way to encourage people to take advantage of the warranty extension on their CPUs?

I have reached out to Intel but haven't received an answer yet; we'll see what happens. That being said, I've re-checked their errata and microcode releases and noticed that they shipped a new microcode version in February (0x12c) which apparently fixes two of the known issues that cause loads to deliver wrong data (RPL050 and RPL060). We currently have no crashes on file from users with that microcode version installed, but it's too early to tell: it will take a while before that version gets rolled out to our users on both Windows and Linux.

Trying to avoid getting nerd-sniped by the possibility of narrowing down the issue further, since doing so would effectively be trying to reverse-engineer the CPU bug. But some general questions for someone who has access to the crash data: Does the "index out of bounds" panic for these crashes have a consistent index (e.g. the 567 mentioned in this issue) or are the indexes all over the place (between 513 and 767)? Do the backtraces always go through compress_block_dynamic_trees and emit_dist, rather than compress_block_static_trees and emit_dist_static, or is it a mix of both?

It never goes through the latter; the stack is always the same as the one in the crash in comment 0.

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash

Since the crash volume is low (less than 15 per week), the severity is downgraded to S3. Feel free to change it back if you think the bug is still critical.

For more information, please visit BugBot documentation.

Severity: S2 → S3

gsvelto - any update from Intel? Any crashes with 0x12c in the 2 months since last check?

Flags: needinfo?(gsvelto)

I haven't heard anything from Intel. There have been only two crashes with microcode 0x12c in the past two months; all the other crashes are on older microcodes. Given that those two crashes remain, I don't feel confident saying the problem was fixed, but it might have been mitigated to the point that it's no longer very frequent. I also can't rule out that we simply no longer generate the instruction sequence that crashes that particular CPU.

Flags: needinfo?(gsvelto)

Gabriele pointed out the underlying CPU bug has been identified: https://fgiesen.wordpress.com/2025/05/21/oodle-2-9-14-and-intel-13th-14th-gen-cpus/

If the volume here goes too high we may be able to work around it.

I'm going to deploy libz-rs-sys on early beta in bug 1968103; we'll see how the crash volume evolves.

I have checked the reports again and I have bad news: there's a new microcode around (version 0x12f) and the crash volume on it is significant. It's hard to be sure whether this is a regression on Intel's part, but it does look like one. Either way, with this crash rate on beta we'd probably see a very significant volume of crashes on the release channel.

FYI I've reached out to Intel again, let's see what happens.

Adding an extra note here about the nature of the crash because it might be useful: I traced the root cause of the crash to this bit of code:

https://searchfox.org/mozilla-central/rev/02545fb16ddbc8dae7788c6f52be2c1504b50345/third_party/rust/zlib-rs/src/deflate.rs#1143-1149

By looking at the register contents in the crash reports I can tell that the value being loaded into the dist variable is wrong. The uppermost bit of that 16-bit variable is set, but it never should be; the value looks like it couldn't possibly have come from the buffer it was supposed to be loaded from. Since this is an LTO/PGO build, however, this code is highly inlined and specialized, so I couldn't trace it back to the bit of assembly where the load actually happens; I'm only seeing the value after the fact. Given that it's a 16-bit value, I suspect we might be getting the wrong part of a wider load, but this is speculation on my part that I haven't been able to verify yet.
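The "top bit set" observation lines up neatly with the panic indexes. zlib's distance-code lookup (which zlib-rs mirrors) indexes a 512-entry table as `dist` for small distances and `256 + (dist >> 7)` for larger ones. A simplified sketch (the function name is mine; this is not the exact zlib-rs source) showing why any 16-bit dist with bit 15 set must land in the 512..=767 range, matching the observed panics:

```rust
/// Simplified version of zlib's d_code table-index computation:
/// small distances index the 512-entry DIST_CODE table directly,
/// larger ones index 256 + (dist >> 7).
fn d_code_index(dist: u16) -> usize {
    if dist < 256 {
        dist as usize
    } else {
        256 + (dist >> 7) as usize
    }
}

fn main() {
    // A legitimate dist fits in 15 bits, so the index stays below 512:
    assert!(d_code_index(32767) < 512);

    // If a corrupted load sets bit 15, the index overshoots the table:
    assert_eq!(d_code_index(0x8000), 512); // smallest possible overshoot
    assert_eq!(d_code_index(0xFFFF), 767); // largest possible overshoot
    assert_eq!(d_code_index(311 << 7), 567); // the index from comment 0
    // ...which matches the observed out-of-bounds indexes of 513..767.
}
```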
