Open Bug 1898999 Opened 8 months ago Updated 7 months ago

Crash in [@ mozilla::ScrollContainerFrame::InInitialReflow]

Categories

(Core :: Layout, defect)

Other
Windows 11
defect

Tracking

()

People

(Reporter: release-mgmt-account-bot, Unassigned)

References

(Blocks 2 open bugs)

Details

(Keywords: crash, stalled)

Crash Data

Crash report: https://crash-stats.mozilla.org/report/index/62677645-01a4-461e-9a6e-5ef250240524

Reason: EXCEPTION_ACCESS_VIOLATION_READ

Top 10 frames of crashing thread:

0  xul.dll  mozilla::ScrollContainerFrame::InInitialReflow const  layout/generic/ScrollContainerFrame.cpp:1043
0  xul.dll  mozilla::ScrollContainerFrame::Reflow  layout/generic/ScrollContainerFrame.cpp:1628
1  xul.dll  nsAbsoluteContainingBlock::ReflowAbsoluteFrame  layout/generic/nsAbsoluteContainingBlock.cpp:811
1  xul.dll  nsAbsoluteContainingBlock::Reflow  layout/generic/nsAbsoluteContainingBlock.cpp:219
2  xul.dll  nsBlockFrame::Reflow  layout/generic/nsBlockFrame.cpp:1759
3  xul.dll  nsContainerFrame::ReflowChild  layout/generic/nsContainerFrame.cpp:885
3  xul.dll  mozilla::ScrollContainerFrame::ReflowScrolledFrame  layout/generic/ScrollContainerFrame.cpp:915
4  xul.dll  mozilla::ScrollContainerFrame::ReflowContents  layout/generic/ScrollContainerFrame.cpp:1050
4  xul.dll  mozilla::ScrollContainerFrame::Reflow  layout/generic/ScrollContainerFrame.cpp:1518
5  xul.dll  nsAbsoluteContainingBlock::ReflowAbsoluteFrame  layout/generic/nsAbsoluteContainingBlock.cpp:811

By querying Nightly crashes reported within the last 2 months, here are some insights about the signature:

  • First crash report: 2024-05-24
  • Process type: Content
  • Is startup crash: No
  • Has user comments: No
  • Is null crash: No

By analyzing the backtrace, the regression may have been introduced by a patch [1] to fix Bug 1897752.

[1] https://hg.mozilla.org/mozilla-central/rev?node=df44b0eea88f

:emilio, since you are the author of the potential regressor, could you please take a look?

Flags: needinfo?(emilio)

That seems fairly unlikely. If anything, it's probably a signature change from TYLin's rename to ScrollContainerFrame. That said, I don't understand how we can crash there, because the this pointer is valid a few lines above.

Flags: needinfo?(emilio)
No longer regressed by: 1897752

(In reply to Emilio Cobos Álvarez (:emilio) from comment #1)

That seems fairly unlikely. If this is something, it's probably a signature change from TYLin's rename to ScrollContainerFrame.

Yeah, we have some small amount of crash volume for nsHTMLScrollFrame::InInitialReflow, and this is probably just that, under a new name:
https://crash-stats.mozilla.org/signature/?signature=nsHTMLScrollFrame%3A%3AInInitialReflow&date=%3E%3D2023-11-28T14%3A58%3A00.000Z&date=%3C2024-05-28T14%3A58%3A00.000Z&_sort=-date

Crash Signature: [@ mozilla::ScrollContainerFrame::InInitialReflow] → [@ mozilla::ScrollContainerFrame::InInitialReflow] [@ nsHTMLScrollFrame::InInitialReflow]

Resetting affected/unaffected flags since they're not meaningful (this isn't known to be a regression).

The oldest crash at this point is bp-17ac3b7b-57ff-4105-8ba1-f19c00231201 which is in Firefox 120.0.1, from nearly 6 months ago (as far back as we track crashes)

This bug has been marked as a regression. Setting status flag for Nightly to affected.

Looking at the minidump from comment 0, it looks like some higher-order bits of our this pointer were somehow zeroed out.

Specifically:

  • At stack level 1 in the backtrace, nsAbsoluteContainingBlock::ReflowAbsoluteFrame, we have aKidFrame being 0x00000143b7ae7b40, which we call a method on: aKidFrame->Reflow(...).
  • Drilling down one level, the this pointer should be that same pointer-value, but it's not quite -- Visual Studio shows this as being 0x00000000b7ae7b40 there, which has the high order bits (0x143) zeroed out for some reason.

This feels likely to be bad hardware (or our stack memory has been stomped on somehow).

Two other recent minidumps seem to show the same pattern (almost certainly from the same user as comment 0, too -- identical hardware and extension list):
bp-e1cd48ba-4879-4b87-b7f7-d663c0240525 (Nightly 128)

  • Stack level 1 has aKidFrame being 0x00000281d0795960
  • Stack level 0 has this being 0x00000000d0795960, with the high 0x281 bits having been zeroed out.

bp-320f9bbf-0a94-45e0-a74e-26cac0240525 (Nightly 128)

  • Stack level 1 has aKidFrame being 0x0000019166b25100
  • Stack level 0 has this being 0x0000000066b25100, with the high 0x191 bits having been zeroed out.

And this one from release (probably a different user, because different graphics card) shows the same zeroing pattern but to a larger extent:
bp-aa0aaf8f-fa4a-4864-975c-4ebe90240515 (Firefox 125.0.3)

  • Stack level 1 has aKidFrame being 0x00000130af363f98
  • Stack level 0 has this being 0x0000000000000000 <NULL> with the whole value having been zeroed out.
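The zeroing pattern in these four minidumps can be checked mechanically. Below is a minimal sketch, assuming only the pointer values quoted above; the helper LooksTruncatedTo32Bits is hypothetical and not part of the Firefox tree. It tests whether the callee's this value equals the caller's aKidFrame value with the upper 32 bits cleared.

```cpp
#include <cstdint>

// Hypothetical helper (not from mozilla-central): returns true when `callee`
// is exactly `caller` with its upper 32 bits cleared, and `caller` actually
// had some upper bits set to lose.
constexpr bool LooksTruncatedTo32Bits(std::uint64_t caller,
                                      std::uint64_t callee) {
  return (caller >> 32) != 0 && callee == (caller & 0xFFFFFFFFull);
}

// Pointer pairs copied from the minidumps described in this comment:
// (aKidFrame at stack level 1, this at stack level 0).
static_assert(LooksTruncatedTo32Bits(0x00000143b7ae7b40ull,
                                     0x00000000b7ae7b40ull),
              "comment 0: high 0x143 bits lost");
static_assert(LooksTruncatedTo32Bits(0x00000281d0795960ull,
                                     0x00000000d0795960ull),
              "bp-e1cd48ba: high 0x281 bits lost");
static_assert(LooksTruncatedTo32Bits(0x0000019166b25100ull,
                                     0x0000000066b25100ull),
              "bp-320f9bbf: high 0x191 bits lost");

// The release crash is different: the whole value is zero, so it is not a
// plain truncation to the low 32 bits (the low bits 0xaf363f98 are gone too).
static_assert(!LooksTruncatedTo32Bits(0x00000130af363f98ull, 0x0ull),
              "bp-aa0aaf8f: fully nulled, not just truncated");
```

In the three Nightly dumps the low 32 bits survive intact, which is consistent with the upper half of a 64-bit register or stack slot being cleared; the release dump loses all 64 bits.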

The odd thing is that we're crashing towards the end of the ScrollContainerFrame::Reflow() implementation. That leads me to believe that the this pointer is probably fine at the beginning of that method and is somehow getting clobbered partway through, shortly before we hit this call to InInitialReflow, where we crash with the this pointer partly or fully nulled out:
https://searchfox.org/mozilla-central/rev/f60bb10a5fe6936f9e9f9e8a90d52c18a0ffd818/layout/generic/ScrollContainerFrame.cpp#1628

All 7 of the crashes here (3 with the new ScrollContainerFrame signature, 4 with the old nsHTMLScrollFrame signature) have the same CPU count and CPU info values:

CPU Count:  32
CPU Info:  family 6 model 183 stepping 1

Perhaps that's a sign that this is a CPU bug?

(They do have varying CPUMicrocodeVersion fields; not sure to what extent that matters.)
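For readers comparing these fields against a raw CPUID value, here is a minimal sketch of how family/model/stepping numbers are conventionally derived from CPUID leaf 1 EAX on x86. It assumes the CPU Info line follows that display family/model convention, which this bug does not confirm. The function names and the example signature 0x000B0671 are illustrative assumptions, chosen so that the decode reproduces the values quoted above.

```cpp
#include <cstdint>

// Field layout of CPUID leaf 1 EAX (x86): stepping [3:0], model [7:4],
// family [11:8], extended model [19:16], extended family [27:20].
// All names below are hypothetical helpers for illustration only.
constexpr unsigned BaseFamily(std::uint32_t eax) { return (eax >> 8) & 0xF; }
constexpr unsigned BaseModel(std::uint32_t eax) { return (eax >> 4) & 0xF; }
constexpr unsigned Stepping(std::uint32_t eax) { return eax & 0xF; }

// The extended family is added only when the base family is 0xF.
constexpr unsigned DisplayFamily(std::uint32_t eax) {
  return BaseFamily(eax) == 0xF ? BaseFamily(eax) + ((eax >> 20) & 0xFF)
                                : BaseFamily(eax);
}

// The extended model is prepended only when the base family is 0x6 or 0xF.
constexpr unsigned DisplayModel(std::uint32_t eax) {
  return (BaseFamily(eax) == 0x6 || BaseFamily(eax) == 0xF)
             ? ((((eax >> 16) & 0xF) << 4) | BaseModel(eax))
             : BaseModel(eax);
}

// Assumed example input, picked to match "family 6 model 183 stepping 1".
constexpr std::uint32_t kExampleSignature = 0x000B0671;

static_assert(DisplayFamily(kExampleSignature) == 6, "family 6");
static_assert(DisplayModel(kExampleSignature) == 183, "model 183 (0xB7)");
static_assert(Stepping(kExampleSignature) == 1, "stepping 1");
```

Model 183 is 0xB7 in hexadecimal, which is the form usually seen in CPU model tables.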

Yeah, Raptor Lake has known issues.

Blocks: cpu-bugs

Yes, bug 1897573 and bug 1871892 are both on the same CPU.

Keywords: stalled

The severity field is not set for this bug.
:emilio, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(emilio)

Given it seems like a CPU bug, S3 seems about right.

Severity: -- → S3
Flags: needinfo?(emilio)