Open Bug 1950764 Opened 5 months ago Updated 12 days ago

Crash in [@ zlib_rs::deflate::State::d_code] on Raptor Lake CPUs

Categories

(Core :: General, defect)

Unspecified
Windows 10
defect

Tracking


Tracking Status
firefox137 - disabled
firefox138 --- disabled

People

(Reporter: mccr8, Unassigned)

References

(Blocks 2 open bugs, Regression)

Details

(Keywords: crash, regression)

Crash Data

Crash report: https://crash-stats.mozilla.org/report/index/113ebfb1-9197-44e2-88ce-322b20250223

MOZ_CRASH Reason:

index out of bounds: the len is 512 but the index is 567

Top 10 frames:

0  xul.dll  MOZ_Crash(char const*, int, char const*)  mfbt/Assertions.h:382
0  xul.dll  RustMozCrash(char const*, int, char const*)  mozglue/static/rust/wrappers.cpp:18
1  xul.dll  mozglue_static::panic_hook(std::panic::PanicHookInfo*)  mozglue/static/rust/lib.rs:102
2  xul.dll  core::ops::function::Fn::call<void (*)(ref$<std::panic::PanicHookInfo>), tupl...  /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/core/src/ops/function.rs:250
3  xul.dll  alloc::boxed::impl$30::call()  library/alloc/src/boxed.rs:2007
3  xul.dll  std::panicking::rust_panic_with_hook()  library/std/src/panicking.rs:836
4  xul.dll  std::panicking::begin_panic_handler::closure$0()  library/std/src/panicking.rs:701
5  xul.dll  std::sys::backtrace::__rust_end_short_backtrace<std::panicking::begin_panic_h...  library/std/src/sys/backtrace.rs:168
6  xul.dll  std::panicking::begin_panic_handler()  library/std/src/panicking.rs:692
7  xul.dll  core::panicking::panic_fmt()  library/core/src/panicking.rs:75

Looks like we're hitting a bounds check.

Flags: needinfo?(mh+mozilla)

Do we have a reproducer for any of these?

Flags: needinfo?(mh+mozilla) → needinfo?(continuation)

Oh wait, every single crash is coming from a machine with CPU ID family 6 model 183 stepping 1. That's our good friend Raptor Lake (Raptor Cove S / Gracemont E to be precise).
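For reference, the family/model/stepping triple above comes from the signature in CPUID leaf 1's EAX register. A minimal decoding sketch (not Firefox code; the function name and the raw EAX value 0xB0671 are reconstructed from family 6 / model 183 / stepping 1 per the Intel SDM encoding):

```rust
/// Decode family/model/stepping from CPUID leaf 1's EAX value.
/// Family 6, model 183 (0xB7), stepping 1 is Raptor Lake.
fn decode_signature(eax: u32) -> (u32, u32, u32) {
    let stepping = eax & 0xF;
    let base_model = (eax >> 4) & 0xF;
    let base_family = (eax >> 8) & 0xF;
    let ext_model = (eax >> 16) & 0xF;
    let ext_family = (eax >> 20) & 0xFF;

    // Per the Intel SDM: extended family is added only when family == 0xF;
    // extended model extends the model when family is 0x6 or 0xF.
    let family = if base_family == 0xF {
        base_family + ext_family
    } else {
        base_family
    };
    let model = if base_family == 0x6 || base_family == 0xF {
        (ext_model << 4) | base_model
    } else {
        base_model
    };
    (family, model, stepping)
}

fn main() {
    // 0xB0671 is the raw signature these fields imply (an assumption,
    // reconstructed from the family/model/stepping reported above).
    assert_eq!(decode_signature(0xB0671), (6, 183, 1));
    println!("family 6, model 183 (0xB7), stepping 1 = Raptor Lake");
}
```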

Summary: Crash in [@ zlib_rs::deflate::State::d_code] → Crash in [@ zlib_rs::deflate::State::d_code] on Raptor Lake CPUs

Additionally, the most recent microcode version we see in these crashes is 0x129. The latest microcode version for this Raptor Lake stepping is 0x12c, and we have no crashes with that version.

Good catch! I didn't even think to look for a CPU issue given the volume of this crash on Nightly.

I have no information about how to reproduce this beyond what gsvelto said.

Flags: needinfo?(continuation)

Small correction: we also have crashes on file from microcode version 0x12b, which is almost the latest. So unless the bug has been p̶a̶p̶e̶r̶e̶d̶ ̶o̶v̶e̶r̶ fixed in 0x12c, it affects all microcode versions. Microcode 0x12c was released two weeks ago, so not many machines will be running it yet.

Hi Gabriele! Can you help set the severity for this issue? Is there anything we can do, in addition to waiting for the resolution of https://github.com/trifectatechfoundation/zlib-rs/issues/306? Thank you.

Flags: needinfo?(gsvelto)

From a severity perspective this is really bad: it manifests as either a content-process or full browser crash, and it happens often on Nightly, so I'd say S2. The problem is that it's entirely out of our hands unless we understand what instruction sequence triggers it, and even if we identify it, avoiding it might be impossible. It would also be a lot of work. We should definitely reach out to our contacts at Intel given the volume of this crash.

Flags: needinfo?(gsvelto)

To be clear, this does not affect anything other than Nightly. The code is not set to ride the trains yet.

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 desktop browser crashes on nightly

For more information, please visit BugBot documentation.

Keywords: topcrash
Severity: -- → S2

Given that this is on Raptor Lake, it sounds like the assumption is that this is the over-voltage bug? Assuming that's the case, then as noted there, "there is no fix to the issue if it already affects a CPU, and any damage to the CPU is permanent". So if the CPU had already been damaged, it may not matter if the microcode is a fixed version. (Leaving aside that many of the crashes appear to be on pre-0x129 versions.)

Does Mozilla have a standard procedure for handling likely Raptor Lake CPU bugs? Reaching out to Intel may be a good idea, but also, given the number of crashes occurring with this CPU, might there be some reasonable way to encourage people to take advantage of the warranty extension on their CPUs?

Trying to avoid getting nerd-sniped by the possibility of narrowing down the issue further, since doing so would effectively be trying to reverse-engineer the CPU bug. But some general questions for someone who has access to the crash data: Does the "index out of bounds" panic for these crashes have a consistent index (e.g. the 567 mentioned in this issue) or are the indexes all over the place (between 513 and 767)? Do the backtraces always go through compress_block_dynamic_trees and emit_dist, rather than compress_block_static_trees and emit_dist_static, or is it a mix of both?

The index is not always 567; e.g.: 527, 719, 615, 566, 747.
7 out of 7 of the crashes I checked went through compress_block_dynamic_trees.

(In reply to Josh Triplett from comment #11)

Given that this is on Raptor Lake, it sounds like the assumption is that this is the over-voltage bug? Assuming that's the case, then as noted there, "there is no fix to the issue if it already affects a CPU, and any damage to the CPU is permanent". So if the CPU had already been damaged, it may not matter if the microcode is a fixed version. (Leaving aside that many of the crashes appear to be on pre-0x129 versions.)

This crash is very consistent, while the overvoltage issue was random in nature, so it's more likely to be one of the existing errata or a new one.

Does Mozilla have a standard procedure for handling likely Raptor Lake CPU bugs? Reaching out to Intel may be a good idea, but also, given the number of crashes occurring with this CPU, might there be some reasonable way to encourage people to take advantage of the warranty extension on their CPUs?

I have reached out to Intel but haven't received an answer yet; we'll see what happens. That being said, I've re-checked their errata and microcode releases and noticed that they shipped a new microcode version in February (0x12c) which apparently fixes two of the known issues that cause loads to deliver wrong data (RPL050 and RPL060). We currently have no crashes on file from users with that microcode version installed, but it's too early to tell: it will take a while before that version gets rolled out to our users on both Windows and Linux.

Trying to avoid getting nerd-sniped by the possibility of narrowing down the issue further, since doing so would effectively be trying to reverse-engineer the CPU bug. But some general questions for someone who has access to the crash data: Does the "index out of bounds" panic for these crashes have a consistent index (e.g. the 567 mentioned in this issue) or are the indexes all over the place (between 513 and 767)? Do the backtraces always go through compress_block_dynamic_trees and emit_dist, rather than compress_block_static_trees and emit_dist_static, or is it a mix of both?

It never goes through the latter; the stack is always the same as the one in the crash in comment 0.

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash

Since the crash volume is low (less than 15 per week), the severity is downgraded to S3. Feel free to change it back if you think the bug is still critical.

For more information, please visit BugBot documentation.

Severity: S2 → S3

gsvelto - any update from Intel? Any crashes with 0x12c in the 2 months since last check?

Flags: needinfo?(gsvelto)

I haven't heard anything from Intel. There have been only two crashes with microcode 0x12c in the past two months; all the other crashes are on older microcodes. Given that those two crashes remain, I don't feel confident saying the problem was fixed, but it might have been mitigated to the point that it's no longer very frequent. I also can't rule out that we simply no longer generate the instruction sequence that crashes that particular CPU.

Flags: needinfo?(gsvelto)

Gabriele pointed out the underlying CPU bug has been identified: https://fgiesen.wordpress.com/2025/05/21/oodle-2-9-14-and-intel-13th-14th-gen-cpus/

If the volume here goes too high we may be able to work around it.

I'm going to deploy libz-rs-sys on early beta in bug 1968103; we'll see how the crash volume evolves.

I have checked the reports again and I have bad news: there's a new microcode around (version 0x12f) and the crash volume on it is significant. It's hard to be sure whether this is a regression on Intel's part, but it does look like one. Either way, with this crash rate on beta we'd probably see a very significant volume of crashes on the release channel.

FYI I've reached out to Intel again, let's see what happens.

Adding an extra note here about the nature of the crash because it might be useful: I traced the root cause of the crash to this bit of code:

https://searchfox.org/mozilla-central/rev/02545fb16ddbc8dae7788c6f52be2c1504b50345/third_party/rust/zlib-rs/src/deflate.rs#1143-1149

By looking at the register contents in the crash reports I can tell that the value being loaded into the dist variable is wrong. The uppermost bit of that 16-bit variable is set, but it never should be; the value looks like it couldn't possibly have come from the buffer it was supposed to be loaded from. Since this is an LTO/PGO build, however, this code is highly inlined and specialized, so I couldn't trace it back to the bit of assembly where the load actually happens; I'm only seeing the value after the fact. Given that it's a 16-bit value, I suspect we might be getting the wrong part of a wider load, but this is speculation on my part that I haven't been able to verify yet.
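The "top bit set" observation lines up neatly with the panic indexes. zlib's distance-code lookup (which zlib-rs mirrors) indexes a 512-entry table as `dist` for small distances and `256 + (dist >> 7)` for larger ones. A simplified sketch (the function name is mine; this is not the exact zlib-rs source) showing why any 16-bit dist with bit 15 set must land in the 512..=767 range, matching the observed panics:

```rust
/// Simplified version of zlib's d_code table-index computation:
/// small distances index the 512-entry DIST_CODE table directly,
/// larger ones index 256 + (dist >> 7).
fn d_code_index(dist: u16) -> usize {
    if dist < 256 {
        dist as usize
    } else {
        256 + (dist >> 7) as usize
    }
}

fn main() {
    // A legitimate dist fits in 15 bits, so the index stays below 512:
    assert!(d_code_index(32767) < 512);

    // If a corrupted load sets bit 15, the index overshoots the table:
    assert_eq!(d_code_index(0x8000), 512); // smallest possible overshoot
    assert_eq!(d_code_index(0xFFFF), 767); // largest possible overshoot
    assert_eq!(d_code_index(311 << 7), 567); // the index from comment 0
    // ...which matches the observed out-of-bounds indexes of 513..767.
}
```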
