Closed Bug 1887703 Opened 1 year ago Closed 8 months ago

Crash in [@ gfxWindowsPlatform::RecordContentDeviceFailure]

Categories

(Core :: Graphics, defect, P2)

Other
Windows
defect

Tracking

RESOLVED FIXED
138 Branch
Tracking Status
firefox-esr128 --- wontfix
firefox126 --- wontfix
firefox133 --- wontfix
firefox134 --- wontfix
firefox135 --- wontfix
firefox138 --- fixed

People

(Reporter: release-mgmt-account-bot, Assigned: bradwerth)

References

(Blocks 1 open bug)

Details

(Keywords: crash)

Crash Data

Attachments

(3 files)

Crash report: https://crash-stats.mozilla.org/report/index/f99b2a7e-72fc-45f4-aa08-fae570240322

MOZ_CRASH Reason: MOZ_RELEASE_ASSERT(isSome())

Top 10 frames of crashing thread:

0  xul.dll  gfxWindowsPlatform::RecordContentDeviceFailure  mfbt/Maybe.h:783
0  xul.dll  mozilla::gfx::DeviceManagerDx::CreateContentDevice  gfx/thebes/DeviceManagerDx.cpp:1021
1  xul.dll  mozilla::gfx::DeviceManagerDx::CreateContentDevicesLocked  gfx/thebes/DeviceManagerDx.cpp:643
2  xul.dll  mozilla::gfx::DeviceManagerDx::CreateContentDevices  gfx/thebes/DeviceManagerDx.cpp:629
3  xul.dll  mozilla::RDDParent::RecvInitVideoBridge  dom/media/ipc/RDDParent.cpp:216
4  xul.dll  mozilla::PRDDParent::OnMessageReceived  ipc/ipdl/PRDDParent.cpp:824
5  xul.dll  mozilla::ipc::MessageChannel::DispatchAsyncMessage  ipc/glue/MessageChannel.cpp:1818
5  xul.dll  mozilla::ipc::MessageChannel::DispatchMessage  ipc/glue/MessageChannel.cpp:1737
5  xul.dll  mozilla::ipc::MessageChannel::RunMessage  ipc/glue/MessageChannel.cpp:1530
5  xul.dll  mozilla::ipc::MessageChannel::MessageTask::Run  ipc/glue/MessageChannel.cpp:1628

By querying Nightly crashes reported within the last 2 months, here are some insights about the signature:

  • First crash report: 2024-03-22
  • Process type: RDD
  • Is startup crash: No
  • Has user comments: No
  • Is null crash: No
Component: General → Graphics

The severity field is not set for this bug.
:bhood, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(bhood)

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 desktop browser crashes on nightly

Keywords: topcrash
Flags: needinfo?(jmuizelaar)

The critical log shows things going wrong:

[0][GFX1-]: Fallback WR to SW-WR + D3D11 (t=212.35)
[1][GFX1-]: Failed to connect GPU process (t=212.35)
[2][GFX1-]: [D3D11] failed to get compositor device. (t=212.66)
[3][GFX1-]: Failed to initialize CompositorD3D11 for SWGL: FEATURE_FAILURE_D3D11_NO_DEVICE (t=212.66)
[4][GFX1-]: [D3D11] failed to get compositor device. (t=273.13)
[5][GFX1-]: Failed to initialize CompositorD3D11 for SWGL: FEATURE_FAILURE_D3D11_NO_DEVICE (t=273.13)
[6][GFX1-]: [D3D11] failed to get compositor device. (t=377.35)
[7][GFX1-]: Failed to initialize CompositorD3D11 for SWGL: FEATURE_FAILURE_D3D11_NO_DEVICE (t=377.35)
[8][GFX1-]: RenderCompositorSWGL failed mapping default framebuffer, no dt (t=378.6)
[9][GFX1-]: [D3D11] failed to get compositor device. (t=385.31)
[10][GFX1-]: Failed to initialize CompositorD3D11 for SWGL: FEATURE_FAILURE_D3D11_NO_DEVICE (t=385.31)

Theory: the weird call stack (a release assert on mContentDevice = device) may have something to do with the guarantee that a lock is held before accessing mContentDevice. The call stacks don't seem to show a stack where that lock isn't held, but the function definitions don't enforce the lock (by taking a const MutexAutoLock& aProofOfLock param). Might be an improvement to add those protections.
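A minimal sketch of the proof-of-lock idea from the theory above, assuming std::mutex and std::lock_guard as stand-ins for Mozilla's Mutex and MutexAutoLock. DeviceManagerSketch, FakeDevice and the *Locked helpers are hypothetical names used only for illustration; this is not the real DeviceManagerDx code.

```cpp
#include <mutex>

// Hypothetical stand-in for a device handle; not the real D3D11 device type.
struct FakeDevice {
  int id;
};

// Hypothetical manager that only loosely mirrors the shape of DeviceManagerDx.
class DeviceManagerSketch {
 public:
  using ProofOfLock = std::lock_guard<std::mutex>;

  // Public entry points acquire the lock once, then delegate.
  void SetContentDevice(FakeDevice* aDevice) {
    ProofOfLock lock(mMutex);
    SetContentDeviceLocked(aDevice, lock);
  }

  FakeDevice* GetContentDevice() {
    ProofOfLock lock(mMutex);
    return GetContentDeviceLocked(lock);
  }

 private:
  // The unnamed reference parameter forces every caller to have a live lock
  // object in scope. It does not prove that the object guards mMutex
  // specifically; it only makes forgetting the lock a compile error.
  void SetContentDeviceLocked(FakeDevice* aDevice, const ProofOfLock&) {
    mContentDevice = aDevice;
  }

  FakeDevice* GetContentDeviceLocked(const ProofOfLock&) const {
    return mContentDevice;
  }

  std::mutex mMutex;
  FakeDevice* mContentDevice = nullptr;
};
```

With this shape, a caller that tries to reach the *Locked helpers without constructing a lock first fails to compile, which is the protection the theory suggests adding.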

Much better theory: mDeviceStatus is a Maybe, and it is unconditionally dereferenced at the beginning of DeviceManagerDx::CreateContentDevice. I'll build a patch to fix this.
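To illustrate the second theory, here is a hedged sketch that uses std::optional as a stand-in for mozilla::Maybe. ContentDeviceSketch, DeviceStatusSketch and CreateResult are hypothetical names, and the early return is only one possible way to handle an empty status; it is not the actual patch.

```cpp
#include <optional>

// Hypothetical stand-in for the device status record.
struct DeviceStatusSketch {
  bool isWARP;
};

// Hypothetical result type, used here only to make the outcome observable.
enum class CreateResult { NoStatus, UsedWARP, UsedHardware };

class ContentDeviceSketch {
 public:
  void SetStatus(DeviceStatusSketch aStatus) { mDeviceStatus = aStatus; }
  void ClearStatus() { mDeviceStatus.reset(); }

  // Guarded shape: check for a value before dereferencing. Dereferencing an
  // empty std::optional is undefined behavior; per the crash reason in this
  // bug, mozilla::Maybe instead hits MOZ_RELEASE_ASSERT(isSome()).
  CreateResult CreateContentDevice() const {
    if (!mDeviceStatus.has_value()) {
      return CreateResult::NoStatus;  // bail out instead of dereferencing
    }
    return mDeviceStatus->isWARP ? CreateResult::UsedWARP
                                 : CreateResult::UsedHardware;
  }

 private:
  std::optional<DeviceStatusSketch> mDeviceStatus;
};
```

Calling CreateContentDevice() before any status has been set returns CreateResult::NoStatus in this sketch rather than aborting the process.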

Assignee: nobody → bwerth

If this is the source of the crash, this assert will change the crash
signature, but that will pinpoint the cause. In such a case, applying
D208682 should fix the crash.

Depends on: 1893567
Flags: needinfo?(jmuizelaar)

Marking leave-open while we see if the landing of D208694 affects the crash signature.

Severity: -- → S2
Flags: needinfo?(bhood)
Keywords: leave-open
Priority: -- → P2

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

Keywords: topcrash

(In reply to Brad Werth [:bradwerth] from comment #8)

Marking leave-open while we see if the landing of D208694 affects the crash signature.

It doesn't appear that mDeviceStatus is NULL in this call stack, or D208694 would have changed the signature. So there's something else going on. I'm putting this back into triage (and taking myself off the Bug) so we can re-discuss.

Assignee: bwerth → nobody
Blocks: gfx-triage
Flags: needinfo?(jmuizelaar)

It looks like we still have bad debug info:
e.g.

0 	xul.dll 	gfxWindowsPlatform::RecordContentDeviceFailure(mozilla::gfx::TelemetryDeviceCode) 	mfbt/Maybe.h:953 	inlined

This looks like another instance of the problem I ran into here: https://github.com/mstange/samply/issues/5

Markus wrote some analysis here: https://github.com/mstange/samply/issues/5#issuecomment-1101746227

Serge, do you have time to take a look at what might be going on here?

Flags: needinfo?(jmuizelaar) → needinfo?(sguelton)
No longer depends on: 1893567

Jeff, I'd be happy to help, but I'm unsure how to reproduce this. It would be very helpful if you could give me directions for pinpointing the wrong debug info in a fresh build; I could start working on it from there. I'll still give it a try this morning, though.

Flags: needinfo?(sguelton)
Flags: needinfo?(jmuizelaar)
No longer blocks: gfx-triage
Severity: S2 → S3
Flags: needinfo?(jmuizelaar)
Flags: needinfo?(jmuizelaar)

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

This is actually over 60% of our RDD crashes on release, or 10% of our total crash volume. I'm triaging it back to S2 based on that, but feel free to disagree as long as you include the data from the previous sentence in your assessment :-)

Severity: S3 → S2
Keywords: topcrash

(In reply to Gian-Carlo Pascutto [:gcp] from comment #15)

This is actually over 60% of our RDD crashes on release, or 10% of our total crash volume. I'm triaging it back to S2 based on that, but feel free to disagree as long as you include the data from the previous sentence in your assessment :-)

Where are you seeing that? Looking at crash stats, I barely see any crashes for this signature. 79 in the last week, and 43% of those are Nightly. Is that from crash pings or something?

Flags: needinfo?(gpascutto)

Is that from crash pings or something?

Yes. Crash stats are never very reliable because they're a biased sample, but they're useless for anything that isn't the main process or a content process, because almost no-one is opting in to sending those crash reports. So if you get any crash reports, odds are it's a topcrasher already.

See the announcement in moz.dev.platform about the crash ping dashboards.

Flags: needinfo?(gpascutto)

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

Keywords: topcrash

Jeff, are you able to help Serge per his request in comment #14?

Flags: needinfo?(jgilbert)

Thanks to Andrew's analysis in Comment 19, I think I can build a patch to fix this.

Assignee: nobody → bwerth

This adds a method IsWARPLocked to cover the cases where the lock is
already held, and the DeviceManagerDx methods are preferred over the
direct access to the Device (to handle this case correctly).
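The locked/unlocked split described in this patch summary might look roughly like the sketch below. It assumes std::mutex in place of Mozilla's Mutex and std::optional in place of mozilla::Maybe. WarpManagerSketch, WarpStatusSketch and ResetStatus are hypothetical names, and returning false when no status is present is an assumption of this sketch rather than something confirmed by the patch.

```cpp
#include <mutex>
#include <optional>

// Hypothetical stand-in for the device status record.
struct WarpStatusSketch {
  bool isWARP;
};

class WarpManagerSketch {
 public:
  // For callers that do NOT already hold mMutex.
  bool IsWARP() {
    std::lock_guard<std::mutex> lock(mMutex);
    return IsWARPLocked(lock);
  }

  // Takes the lock once and then uses the *Locked helper. Calling IsWARP()
  // from here would lock a non-recursive std::mutex twice on the same
  // thread, which is undefined behavior.
  bool ResetStatus(bool aIsWARP) {
    std::lock_guard<std::mutex> lock(mMutex);
    bool wasWARP = IsWARPLocked(lock);
    mDeviceStatus = WarpStatusSketch{aIsWARP};
    return wasWARP;
  }

 private:
  // For callers that already hold the lock. A missing status is treated as
  // "not WARP" (assumption of this sketch) instead of being dereferenced.
  bool IsWARPLocked(const std::lock_guard<std::mutex>&) const {
    return mDeviceStatus.has_value() && mDeviceStatus->isWARP;
  }

  std::mutex mMutex;
  std::optional<WarpStatusSketch> mDeviceStatus;
};
```

Routing every read through IsWARP() or IsWARPLocked() keeps the empty-status check in one place, which matches the stated preference for manager methods over direct access to the device.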

Flags: needinfo?(jmuizelaar)
Flags: needinfo?(jgilbert)
Pushed by bwerth@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/7682f855507a
Part 3: Protect more cases where mDeviceStatus is dereferenced, and avoid double-locks. r=aosmond

Flagging as stalled while we observe the effects of the patch. Please remove the keyword if work begins again on the report.

Keywords: stalled

Can't tell yet if this is completely sorted out -- I'm not great at interpreting the crash reports to see if the patch has been applied. But no crashes in Nightly since April 10.

Since the crash volume is low (less than 15 per week), the severity is downgraded to S3. Feel free to change it back if you think the bug is still critical.

Severity: S2 → S3

I believe this is now fixed. No crashes in Nightly for the past month. If I'm mistaken here, I believe this bug will be automatically re-opened, and I am happy to keep working on it.

Status: NEW → RESOLVED
Closed: 8 months ago
Resolution: --- → FIXED

A bot will reopen intermittent failures if they reoccur and get flagged, but not crash-stats bugs like this. That being said, it sounds perfectly fine to close this.

Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit BugBot documentation.

Keywords: stalled