Crash in [@ zlib_rs::deflate::State::d_code] on Raptor Lake CPUs
Categories
(Core :: General, defect)
Tracking
()
People
(Reporter: mccr8, Unassigned)
References
(Blocks 2 open bugs, Regression)
Details
(Keywords: crash, regression)
Crash Data
Crash report: https://crash-stats.mozilla.org/report/index/113ebfb1-9197-44e2-88ce-322b20250223
MOZ_CRASH Reason:
index out of bounds: the len is 512 but the index is 567
Top 10 frames:
0 xul.dll MOZ_Crash(char const*, int, char const*) mfbt/Assertions.h:382
0 xul.dll RustMozCrash(char const*, int, char const*) mozglue/static/rust/wrappers.cpp:18
1 xul.dll mozglue_static::panic_hook(std::panic::PanicHookInfo*) mozglue/static/rust/lib.rs:102
2 xul.dll core::ops::function::Fn::call<void (*)(ref$<std::panic::PanicHookInfo>), tupl... /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/core/src/ops/function.rs:250
3 xul.dll alloc::boxed::impl$30::call() library/alloc/src/boxed.rs:2007
3 xul.dll std::panicking::rust_panic_with_hook() library/std/src/panicking.rs:836
4 xul.dll std::panicking::begin_panic_handler::closure$0() library/std/src/panicking.rs:701
5 xul.dll std::sys::backtrace::__rust_end_short_backtrace<std::panicking::begin_panic_h... library/std/src/sys/backtrace.rs:168
6 xul.dll std::panicking::begin_panic_handler() library/std/src/panicking.rs:692
7 xul.dll core::panicking::panic_fmt() library/core/src/panicking.rs:75
Looks like we're hitting a bounds check.
Comment 1•5 months ago
Do we have a reproducer for any of these?
Comment 2•5 months ago
Several crashes are on YouTube, here's a few links:
- https://www.youtube.com/watch?v=wDWV-EwReDE
- https://www.youtube.com/watch?v=IJGbkjuRO8E
- https://music.youtube.com/playlist?list=OLAK5uy_kY8dtv44revO4EYCIUOAgYfrWo9a_DQu0
Other crashes are on Discord, Twitch but also under https://www.bing.com/images/
Comment 3•5 months ago
Oh wait, every single crash is coming from a machine with CPU ID family 6 model 183 stepping 1. That's our good friend Raptor Lake (Raptor Cove S / Gracemont E to be precise).
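For anyone cross-checking the CPU annotations in crash reports, here is a minimal sketch of how a raw CPUID leaf 1 EAX value decodes into family, model and stepping. The `CpuSignature` type and `decode_signature` function are hypothetical names for illustration, and the input `0x000B0671` is chosen only because it decodes to family 6 model 183 stepping 1; it is not taken from the crash data.

```rust
/// Family, model and stepping decoded from a CPUID leaf 1 EAX value.
/// Hypothetical helper type, not part of any crash-reporting code.
#[derive(Debug, PartialEq)]
struct CpuSignature {
    family: u32,
    model: u32,
    stepping: u32,
}

fn decode_signature(eax: u32) -> CpuSignature {
    let stepping = eax & 0xF;
    let base_model = (eax >> 4) & 0xF;
    let base_family = (eax >> 8) & 0xF;
    let ext_model = (eax >> 16) & 0xF;
    let ext_family = (eax >> 20) & 0xFF;

    // The extended family is added only when the base family is 0xF.
    let family = if base_family == 0xF {
        base_family + ext_family
    } else {
        base_family
    };
    // The extended model supplies the high nibble for families 6 and 0xF.
    let model = if base_family == 0x6 || base_family == 0xF {
        (ext_model << 4) | base_model
    } else {
        base_model
    };

    CpuSignature { family, model, stepping }
}

fn main() {
    // Illustrative input only: encodes family 6, model 0xB7 (183), stepping 1.
    let sig = decode_signature(0x000B_0671);
    println!("{sig:?}");
}
```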
Comment 4•5 months ago
Additionally the crash with the most recent microcode has version 0x129, the latest microcode version for this instance of Raptor Lake is 0x12c and we have no crashes with that version.
Comment 5•5 months ago
Good catch! I didn't even think to look for a CPU issue at this volume of a crash on Nightly.
I have no information about how to reproduce this beyond what gsvelto said.
Comment 6•5 months ago
Small correction, we also have crashes on file from microcode version 0x12b which is almost the latest. So unless the bug has been p̶a̶p̶e̶r̶e̶d̶ ̶o̶v̶e̶r̶ fixed in 0x12c then it affects all microcode versions. Microcode 0x12c was released two weeks ago so not many machines will be running it.
Comment 7•5 months ago
Hi Gabriele! Can you help set the severity for this issue? Is there anything we can do, in addition to waiting for the resolution of https://github.com/trifectatechfoundation/zlib-rs/issues/306? Thank you.
Comment 8•5 months ago
From a severity perspective this is really bad, because it manifests itself as either a content or full browser crash happening often on nightly, so I'd say S2. The problem is that it's entirely out of our hands unless we understand what instruction sequence triggers it, and even if we identify it, avoiding it might be impossible. It's also going to be a lot of work. We should definitely reach out to our contacts at Intel given the volume of this crash.
Comment 9•5 months ago
To be clear, this does not affect anything other than nightly. The code is not set to ride the trains yet.
Comment 10•5 months ago
The bug is linked to a topcrash signature, which matches the following criterion:
- Top 10 desktop browser crashes on nightly
For more information, please visit BugBot documentation.
Comment 11•5 months ago
Given that this is on Raptor Lake, it sounds like the assumption is that this is the over-voltage bug? Assuming that's the case, then as noted there, "there is no fix to the issue if it already affects a CPU, and any damage to the CPU is permanent". So if the CPU had already been damaged, it may not matter if the microcode is a fixed version. (Leaving aside that many of the crashes appear to be on pre-0x129 versions.)
Does Mozilla have a standard procedure for handling likely Raptor Lake CPU bugs? Reaching out to Intel may be a good idea, but also, given the number of crashes occurring with this CPU, might there be some reasonable way to encourage people to take advantage of the warranty extension on their CPUs?
Trying to avoid getting nerd-sniped by the possibility of narrowing down the issue further, since doing so would effectively be trying to reverse-engineer the CPU bug. But some general questions for someone who has access to the crash data: Does the "index out of bounds" panic for these crashes have a consistent index (e.g. the 567 mentioned in this issue) or are the indexes all over the place (between 513 and 767)? Do the backtraces always go through compress_block_dynamic_trees and emit_dist, rather than compress_block_static_trees and emit_dist_static, or is it a mix of both?
Comment 12•5 months ago
The index is not always 567. e.g: 527, 719, 615, 566, 747
7 out of 7 of the crashes I checked went through compress_block_dynamic_trees
Comment 13•5 months ago
(In reply to Josh Triplett from comment #11)
Given that this is on Raptor Lake, it sounds like the assumption is that this is the over-voltage bug? Assuming that's the case, then as noted there, "there is no fix to the issue if it already affects a CPU, and any damage to the CPU is permanent". So if the CPU had already been damaged, it may not matter if the microcode is a fixed version. (Leaving aside that many of the crashes appear to be on pre-0x129 versions.)
This crash is very consistent while the overvoltage issue was randomish in nature, so it's more likely to be one of the existing errata or a new one.
Does Mozilla have a standard procedure for handling likely Raptor Lake CPU bugs? Reaching out to Intel may be a good idea, but also, given the number of crashes occurring with this CPU, might there be some reasonable way to encourage people to take advantage of the warranty extension on their CPUs?
I have reached out to Intel but haven't received an answer yet, we'll see what happens. That being said, I've re-checked their errata and microcode releases and noticed that they shipped a new microcode version in February (0x12c) which apparently fixes two of the known issues that cause loads to deliver wrong data (RPL050 and RPL060). We currently don't have crashes on file for users with that microcode version installed, but it's too early to tell: it will take a while before that version gets rolled out to our users on both Windows and Linux.
Trying to avoid getting nerd-sniped by the possibility of narrowing down the issue further, since doing so would effectively be trying to reverse-engineer the CPU bug. But some general questions for someone who has access to the crash data: Does the "index out of bounds" panic for these crashes have a consistent index (e.g. the 567 mentioned in this issue) or are the indexes all over the place (between 513 and 767)? Do the backtraces always go through compress_block_dynamic_trees and emit_dist, rather than compress_block_static_trees and emit_dist_static, or is it a mix of both?
It never goes through the latter; the stack is always the same as the one you see in the crash in comment 0.
Comment 14•4 months ago
Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.
For more information, please visit BugBot documentation.
Comment 15•3 months ago
Since the crash volume is low (less than 15 per week), the severity is downgraded to S3. Feel free to change it back if you think the bug is still critical.
For more information, please visit BugBot documentation.
Comment 16•3 months ago
gsvelto - any update from Intel? Any crashes with 0x12c in the 2 months since last check?
Comment 17•3 months ago
I haven't heard anything from Intel. There have been only two crashes with microcode 0x12c in the past two months, all the other crashes are on older microcodes. Given that those two crashes remain I don't feel confident in saying that the problem was fixed, but it might have been mitigated to the point that it's not very frequent anymore. I also can't rule out that we don't generate the sequence of instructions that crashes that particular CPU.
Comment 18•3 months ago
Gabriele pointed out the underlying CPU bug has been identified: https://fgiesen.wordpress.com/2025/05/21/oodle-2-9-14-and-intel-13th-14th-gen-cpus/
If the volume here goes too high we may be able to work around it.
Comment 19•3 months ago
I'm going to deploy libz-rs-sys on early beta in bug 1968103, we'll see how the crash volume evolves.
Comment 20•1 month ago
I have checked the reports again and I have bad news: it seems that there's a new microcode around (version 0x12f) and the crash volume on it is significant. It's hard to be sure if this is a regression on Intel's part but it does look like it is. Either way, with this crash rate in beta we'd probably have a very significant volume of crashes in the release channel.
Comment 21•1 month ago
FYI I've reached out to Intel again, let's see what happens.
Comment 22•12 days ago
Adding an extra note here about the nature of the crash because it might be useful: I traced the root cause of the crash to this bit of code:
By looking at the register contents in the crash reports I can tell that the value that's being loaded in the dist variable is wrong. The uppermost bit of that 16-bit variable is set but it should never be. The value itself looks like it couldn't have possibly come from the buffer it was supposed to be loaded from. Since this is an LTO/PGO build however, this code is highly inlined and specialized, so I couldn't trace it back to the bit of assembly where the load is actually happening, I'm just seeing the value after the fact. Given it's a 16-bit value I suspect that we might be getting the wrong part of a wider load, but this is just speculation on my part which I haven't been able to verify yet.
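To illustrate why a dist value with its top bit set ends up as an out-of-bounds index, here is a rough sketch modeled on the d_code lookup in classic zlib, where distances of 256 and above index the table at 256 + (dist >> 7). This is an assumption about the shape of the code, not the actual zlib-rs source; `d_code_index` and `min_dist_for_index` are hypothetical helper names. Under that assumption, every index reported in this bug (567 from comment 0, plus 527, 719, 615, 566 and 747 from comment 12) requires dist to be at least 0x8000.

```rust
/// Length of the distance-code table, matching "the len is 512" in the panic message.
const DIST_CODE_LEN: usize = 512;

/// Table index computed by a zlib-style d_code lookup.
/// Assumed shape, modeled on classic zlib; not copied from zlib-rs.
fn d_code_index(dist: u16) -> usize {
    if dist < 256 {
        dist as usize
    } else {
        256 + ((dist as usize) >> 7)
    }
}

/// Smallest dist that maps to the given index (for indices of 258 and above).
/// Hypothetical helper used to reason backwards from the panic message.
fn min_dist_for_index(index: usize) -> u16 {
    ((index - 256) << 7) as u16
}

fn main() {
    // Any dist with the top bit clear stays inside the 512-entry table.
    assert!((0..0x8000u16).all(|d| d_code_index(d) < DIST_CODE_LEN));

    // Indices reported in this bug.
    for index in [567usize, 527, 719, 615, 566, 747] {
        let dist = min_dist_for_index(index);
        assert_eq!(d_code_index(dist), index);
        println!(
            "index {index}: dist >= {dist:#06x}, top bit set: {}",
            (dist & 0x8000) != 0
        );
    }
}
```

If the real lookup has this shape, the bounds check is only reachable when the loaded value is corrupted, which is consistent with the register contents described above.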