Closed Bug 1887703 Opened 1 year ago Closed 8 months ago

Crash in [@ gfxWindowsPlatform::RecordContentDeviceFailure]

Tracking


Status:

RESOLVED FIXED

Milestone:

138 Branch

Tracking Flags:

Tracking

Status

firefox-esr128

---

wontfix

firefox126

---

wontfix

firefox133

---

wontfix

firefox134

---

wontfix

firefox135

---

wontfix

firefox138

---

fixed

People

(Reporter: release-mgmt-account-bot, Assigned: bradwerth)

References

(Blocks 1 open bug)

Details

(Keywords: crash)

Crash Data

Attachments

(3 files)

Bug 1887703: Make DeviceManagerDx always check mDeviceStatus before dereferencing it. 1 year ago Brad Werth [:bradwerth] 48 bytes, text/x-phabricator-request
Bug 1887703: Upgrade an assert in DeviceManagerDx::CreateContentDevicesLocked to MOZ_RELEASE_ASSERT. 1 year ago Brad Werth [:bradwerth] 48 bytes, text/x-phabricator-request
Bug 1887703 Part 3: Protect more cases where mDeviceStatus is dereferenced, and avoid double-locks. 11 months ago Brad Werth [:bradwerth] 48 bytes, text/x-phabricator-request

BugBot [:suhaib / :marco/ :calixte]

Reporter

Description

•

1 year ago

Crash report: https://crash-stats.mozilla.org/report/index/f99b2a7e-72fc-45f4-aa08-fae570240322

MOZ_CRASH Reason: MOZ_RELEASE_ASSERT(isSome())

Top 10 frames of crashing thread:

0  xul.dll  gfxWindowsPlatform::RecordContentDeviceFailure  mfbt/Maybe.h:783
0  xul.dll  mozilla::gfx::DeviceManagerDx::CreateContentDevice  gfx/thebes/DeviceManagerDx.cpp:1021
1  xul.dll  mozilla::gfx::DeviceManagerDx::CreateContentDevicesLocked  gfx/thebes/DeviceManagerDx.cpp:643
2  xul.dll  mozilla::gfx::DeviceManagerDx::CreateContentDevices  gfx/thebes/DeviceManagerDx.cpp:629
3  xul.dll  mozilla::RDDParent::RecvInitVideoBridge  dom/media/ipc/RDDParent.cpp:216
4  xul.dll  mozilla::PRDDParent::OnMessageReceived  ipc/ipdl/PRDDParent.cpp:824
5  xul.dll  mozilla::ipc::MessageChannel::DispatchAsyncMessage  ipc/glue/MessageChannel.cpp:1818
5  xul.dll  mozilla::ipc::MessageChannel::DispatchMessage  ipc/glue/MessageChannel.cpp:1737
5  xul.dll  mozilla::ipc::MessageChannel::RunMessage  ipc/glue/MessageChannel.cpp:1530
5  xul.dll  mozilla::ipc::MessageChannel::MessageTask::Run  ipc/glue/MessageChannel.cpp:1628

By querying Nightly crashes reported within the last 2 months, here are some insights about the signature:

First crash report: 2024-03-22
Process type: RDD
Is startup crash: No
Has user comments: No
Is null crash: No

Andrew McCreight [:mccr8]

Updated

•

1 year ago

Component: General → Graphics

BugBot [:suhaib / :marco/ :calixte]

Reporter

Comment 1

•

1 year ago

The severity field is not set for this bug.
:bhood, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(bhood)

BugBot [:suhaib / :marco/ :calixte]

Reporter

Comment 2

•

1 year ago

The bug is linked to a topcrash signature, which matches the following criterion:

Top 10 desktop browser crashes on nightly

For more information, please visit BugBot documentation.

Keywords: topcrash

Jeff Muizelaar [:jrmuizel]

Updated

•

1 year ago

Flags: needinfo?(jmuizelaar)

Jeff Muizelaar [:jrmuizel]

Comment 3

•

1 year ago

The critical log shows things going wrong:

[0][GFX1-]: Fallback WR to SW-WR + D3D11 (t=212.35)
[1][GFX1-]: Failed to connect GPU process (t=212.35)
[2][GFX1-]: [D3D11] failed to get compositor device. (t=212.66)
[3][GFX1-]: Failed to initialize CompositorD3D11 for SWGL: FEATURE_FAILURE_D3D11_NO_DEVICE (t=212.66)
[4][GFX1-]: [D3D11] failed to get compositor device. (t=273.13)
[5][GFX1-]: Failed to initialize CompositorD3D11 for SWGL: FEATURE_FAILURE_D3D11_NO_DEVICE (t=273.13)
[6][GFX1-]: [D3D11] failed to get compositor device. (t=377.35)
[7][GFX1-]: Failed to initialize CompositorD3D11 for SWGL: FEATURE_FAILURE_D3D11_NO_DEVICE (t=377.35)
[8][GFX1-]: RenderCompositorSWGL failed mapping default framebuffer, no dt (t=378.6)
[9][GFX1-]: [D3D11] failed to get compositor device. (t=385.31)
[10][GFX1-]: Failed to initialize CompositorD3D11 for SWGL: FEATURE_FAILURE_D3D11_NO_DEVICE (t=385.31)

Brad Werth [:bradwerth]

Assignee

Comment 4

•

1 year ago

Theory: the weird call stack (a release assert on mContentDevice = device) may have something to do with the guarantee that a lock is held before accessing mContentDevice. The call stacks don't seem to show a stack where that lock isn't held, but the function definitions don't enforce the lock (by taking a const MutexAutoLock& aProofOfLock param). Might be an improvement to add those protections.
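For illustration, the proof-of-lock idea can be sketched in standard C++. This is not the DeviceManagerDx code: `DeviceHolder`, `SetDevice`, `SetDeviceLocked`, and `GetDevice` are hypothetical names, and `std::mutex` / `std::lock_guard` stand in for Mozilla's `Mutex` / `MutexAutoLock` as an assumption.

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Hypothetical class guarding one member with a mutex (illustration only).
class DeviceHolder {
 public:
  using ProofOfLock = std::lock_guard<std::mutex>;

  // Public entry point: takes the lock, then delegates to the Locked variant.
  void SetDevice(const std::string& aName) {
    ProofOfLock lock(mMutex);
    SetDeviceLocked(aName, lock);
  }

  std::string GetDevice() {
    ProofOfLock lock(mMutex);
    return mDevice;
  }

 private:
  // The extra parameter is never read. Its only job is to make "a lock is
  // held" part of the signature: a caller has to own a lock object to call it.
  void SetDeviceLocked(const std::string& aName,
                       const ProofOfLock& aProofOfLock) {
    (void)aProofOfLock;
    assert(!aName.empty());  // debug-only precondition for this sketch
    mDevice = aName;
  }

  std::mutex mMutex;
  std::string mDevice;
};
```

The parameter only shows that some lock of that type exists at the call site. It does not prove that the lock belongs to the right mutex, so this is a compile-time nudge rather than a guarantee.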

Brad Werth [:bradwerth]

Assignee

Comment 5

•

1 year ago

Much better theory: mDeviceStatus is a Maybe, and it is unconditionally dereferenced at the beginning of DeviceManagerDx::CreateContentDevice. I'll build a patch to fix this.
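A minimal sketch of the pattern described here, using `std::optional` as a stand-in for `mozilla::Maybe`. The `DeviceStatus` struct, both function names, and the choice to return `false` for a missing status are assumptions for illustration, not what the patch does.

```cpp
#include <cassert>
#include <optional>

// Hypothetical, simplified stand-in for the value that mDeviceStatus wraps.
struct DeviceStatus {
  bool isWARP = false;
};

// Unchecked access. With std::optional, value() throws
// std::bad_optional_access when empty. In this bug's crash report the
// equivalent failure on a Maybe is MOZ_RELEASE_ASSERT(isSome()).
bool IsWarpUnchecked(const std::optional<DeviceStatus>& aStatus) {
  return aStatus.value().isWARP;
}

// Checked access: test for a value first and pick a fallback when it is
// missing. Returning false here is an arbitrary choice for the sketch.
bool IsWarpChecked(const std::optional<DeviceStatus>& aStatus) {
  if (!aStatus.has_value()) {
    return false;
  }
  return aStatus->isWARP;
}
```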

Assignee: nobody → bwerth

Brad Werth [:bradwerth]

Assignee

Comment 6

•

1 year ago

Attached file Bug 1887703: Make DeviceManagerDx always check mDeviceStatus before dereferencing it. — Details

Brad Werth [:bradwerth]

Assignee

Comment 7

•

1 year ago

Attached file Bug 1887703: Upgrade an assert in DeviceManagerDx::CreateContentDevicesLocked to MOZ_RELEASE_ASSERT. — Details

If this is the source of the crash, this assert will change the crash
signature, but that will pinpoint the cause. In such a case, applying
D208682 should fix the crash.
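To illustrate why upgrading the assert would move the crash signature, here is a sketch in standard C++. `RELEASE_ASSERT` is a hypothetical macro standing in for `MOZ_RELEASE_ASSERT`, and `ReadStatusInCaller` is a made-up function loosely in the role of `CreateContentDevicesLocked`.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <optional>

// Hypothetical always-on assertion (stand-in for MOZ_RELEASE_ASSERT).
#define RELEASE_ASSERT(cond)                                      \
  do {                                                            \
    if (!(cond)) {                                                \
      std::fprintf(stderr, "release assert failed: %s (%s:%d)\n", \
                   #cond, __FILE__, __LINE__);                    \
      std::abort();                                               \
    }                                                             \
  } while (0)

// Hypothetical caller that checks the optional before any helper uses it.
int ReadStatusInCaller(const std::optional<int>& aStatus) {
  // assert() compiles to nothing when NDEBUG is defined, so a release build
  // would continue past this line with an empty optional.
  assert(aStatus.has_value());
  // This check runs in every build type and fails here, in the caller, so
  // the reported location is this function rather than the optional's code.
  RELEASE_ASSERT(aStatus.has_value());
  return *aStatus;
}
```

If the empty value were the cause, crashes would start reporting the caller's file and line. An unchanged signature would suggest the failure comes from somewhere else.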

Jeff Muizelaar [:jrmuizel]

Updated

•

1 year ago

Depends on: 1893567

Flags: needinfo?(jmuizelaar)

Brad Werth [:bradwerth]

Assignee

Comment 8

•

1 year ago

Marking leave-open while we see if the landing of D208694 affects the crash signature.

Severity: -- → S2

Flags: needinfo?(bhood)

Keywords: leave-open

Priority: -- → P2

Sandor Molnar[:smolnar]

Comment 9

•

1 year ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/b89431ee0f24

BugBot [:suhaib / :marco/ :calixte]

Reporter

Comment 10

•

1 year ago

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash

Brad Werth [:bradwerth]

Assignee

Comment 11

•

1 year ago

(In reply to Brad Werth [:bradwerth] from comment #8)

Marking leave-open while we see if the landing of D208694 affects the crash signature.

It doesn't appear that mDeviceStatus is empty (Nothing) in this call stack, or D208694 would have changed the signature. So there's something else going on. I'm putting this back into triage (and taking myself off the Bug) so we can re-discuss.

Assignee: bwerth → nobody

Blocks: gfx-triage

Jim Mathies [:jimm]

Comment 12

•

1 year ago

This is occurring in the RDD process quite a bit.

https://mozilla.github.io/process-top-crashes/rdd_release.html
https://sql.telemetry.mozilla.org/queries/78917/source?p_channel=release#196128

Jeff Muizelaar [:jrmuizel]

Updated

•

1 year ago

Flags: needinfo?(jmuizelaar)

Jeff Muizelaar [:jrmuizel]

Comment 13

•

1 year ago

It looks like we still have bad debug info:
e.g.

0 	xul.dll 	gfxWindowsPlatform::RecordContentDeviceFailure(mozilla::gfx::TelemetryDeviceCode) 	mfbt/Maybe.h:953 	inlined

This looks like another instance of the problem I ran into here: https://github.com/mstange/samply/issues/5

Markus wrote some analysis here: https://github.com/mstange/samply/issues/5#issuecomment-1101746227

Serge, do you have time to take a look at what might be going on here?

Flags: needinfo?(jmuizelaar) → needinfo?(sguelton)

Jeff Muizelaar [:jrmuizel]

Updated

•

1 year ago

No longer depends on: 1893567

[:sergesanspaille]

Comment 14

•

1 year ago

Jeff, I'd be happy to help, but I'm unsure how to reproduce this. It would be very helpful if you could give me directions for pinpointing the wrong debug info in a fresh build; I could start working on it from there. I'll still give it a try this morning, though.

Flags: needinfo?(sguelton)

Bob Hood [:bhood]

Updated

•

1 year ago

Flags: needinfo?(jmuizelaar)

Jeff Muizelaar [:jrmuizel]

Updated

•

1 year ago

No longer blocks: gfx-triage

Severity: S2 → S3

Flags: needinfo?(jmuizelaar)

Jeff Muizelaar [:jrmuizel]

Updated

•

1 year ago

Flags: needinfo?(jmuizelaar)

Gian-Carlo Pascutto [:gcp]

Comment 15

•

1 year ago

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

This is actually over 60% of our RDD crashes on release, or 10% of our total crash volume. I'm triaging it back to S2 based on that, but feel free to disagree as long as you include the data from the previous sentence in your assessment :-)

Severity: S3 → S2

Keywords: topcrash

Sylvestre Ledru [:Sylvestre]

Updated

•

1 year ago

status-firefox126: affected → wontfix

status-firefox133: --- → affected

status-firefox134: --- → affected

status-firefox135: --- → affected

status-firefox-esr128: --- → affected

Andrew McCreight [:mccr8]

Comment 16

•

1 year ago

(In reply to Gian-Carlo Pascutto [:gcp] from comment #15)

This is actually over 60% of our RDD crashes on release, or 10% of our total crash volume. I'm triaging it back to S2 based on that, but feel free to disagree as long as you include the data from the previous sentence in your assessment :-)

Where are you seeing that? Looking at crash stats, I barely see any crashes for this signature. 79 in the last week, and 43% of those are Nightly. Is that from crash pings or something?

Flags: needinfo?(gpascutto)

Gian-Carlo Pascutto [:gcp]

Comment 17

•

1 year ago

•

Edited

Is that from crash pings or something?

Yes. Crash stats are never very reliable because they're a biased sample, but they're useless for anything that isn't the main process or a content process, because almost no-one is opting in to sending those crash reports. So if you get any crash reports, odds are it's a topcrasher already.

See the announcement in moz.dev.platform about the crash ping dashboards.

Flags: needinfo?(gpascutto)

BugBot [:suhaib / :marco/ :calixte]

Reporter

Comment 18

•

1 year ago

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash

Andrew Osmond [:aosmond] (he/him)

Comment 19

•

1 year ago

I think during init we hit this path, where we reset mDeviceStatus but might successfully get a new pointer for mAdapter:
https://searchfox.org/mozilla-central/rev/0b189f017bc9d48b62012205c5b9f8a8b560497b/gfx/thebes/DeviceManagerDx.cpp#717

Then we crash here:
https://searchfox.org/mozilla-central/rev/0b189f017bc9d48b62012205c5b9f8a8b560497b/gfx/thebes/DeviceManagerDx.cpp#990

Bob Hood [:bhood]

Comment 20

•

1 year ago

Jeff, are you able to help Serge per his request in comment #14?

Kelsey Gilbert [:jgilbert]

Updated

•

1 year ago

Flags: needinfo?(jgilbert)

Brad Werth [:bradwerth]

Assignee

Comment 21

•

11 months ago

Thanks to Andrew's analysis in Comment 19, I think I can build a patch to fix this.

Assignee: nobody → bwerth

Brad Werth [:bradwerth]

Assignee

Comment 22

•

11 months ago

Attached file Bug 1887703 Part 3: Protect more cases where mDeviceStatus is dereferenced, and avoid double-locks. — Details

This adds a method IsWARPLocked to cover the cases where the lock is
already held, and the DeviceManagerDx methods are preferred over the
direct access to the Device (to handle this case correctly).
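A rough sketch of the locked/unlocked split described above, in standard C++. Only the name `IsWARPLocked` comes from the patch description. The class `DeviceManagerSketch`, the methods `Describe` and `SetStatus`, the `DeviceStatus` struct, and the use of `std::mutex` and `std::optional` are assumptions for illustration.

```cpp
#include <cassert>
#include <mutex>
#include <optional>
#include <string>

// Hypothetical, simplified device status.
struct DeviceStatus {
  bool isWARP = false;
};

class DeviceManagerSketch {
 public:
  // For callers that do not hold the lock.
  bool IsWARP() {
    std::lock_guard<std::mutex> lock(mMutex);
    return IsWARPLocked();
  }

  // Already holds the lock, so it must use the Locked helper. Calling
  // IsWARP() here would lock the same non-recursive std::mutex twice on one
  // thread, which is undefined behavior (in practice, often a deadlock).
  std::string Describe() {
    std::lock_guard<std::mutex> lock(mMutex);
    return IsWARPLocked() ? "warp" : "not-warp-or-unknown";
  }

  void SetStatus(std::optional<DeviceStatus> aStatus) {
    std::lock_guard<std::mutex> lock(mMutex);
    mDeviceStatus = aStatus;
  }

 private:
  // Caller must hold mMutex. The empty case is handled in one place, so no
  // call site dereferences the optional directly.
  bool IsWARPLocked() const {
    return mDeviceStatus.has_value() && mDeviceStatus->isWARP;
  }

  std::mutex mMutex;
  std::optional<DeviceStatus> mDeviceStatus;
};
```

Routing every read through the manager's methods keeps the has-value check in a single function, which is presumably the appeal of preferring them over direct access.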

Andrew Osmond [:aosmond] (he/him)

Updated

•

11 months ago

Flags: needinfo?(jmuizelaar)

Flags: needinfo?(jgilbert)

Pulsebot

Comment 23

•

11 months ago

Pushed by bwerth@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/7682f855507a Part 3: Protect more cases where mDeviceStatus is dereferenced, and avoid double-locks. r=aosmond

Cosmin Sabou [:CosminS]

Comment 24

•

11 months ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/7682f855507a

Bob Hood [:bhood]

Comment 25

•

11 months ago

Flagging as stalled while we observe the effects of the patch. Please remove the keyword if work begins again on the report.

Keywords: stalled

Brad Werth [:bradwerth]

Assignee

Comment 26

•

9 months ago

Can't tell yet if this is completely sorted out -- I'm not great at interpreting the crash reports to see if the patch has been applied. But no crashes in Nightly since April 10.

BugBot [:suhaib / :marco/ :calixte]

Reporter

Comment 27

•

9 months ago

Since the crash volume is low (less than 15 per week), the severity is downgraded to S3. Feel free to change it back if you think the bug is still critical.

For more information, please visit BugBot documentation.

Severity: S2 → S3

Brad Werth [:bradwerth]

Assignee

Comment 28

•

8 months ago

I believe this is now fixed. No crashes in Nightly for the past month. If I'm mistaken here, I believe this Bug will be automatically re-opened, and I am happy to keep working on it.

Status: NEW → RESOLVED

Closed: 8 months ago

Resolution: --- → FIXED

Andrew McCreight [:mccr8]

Comment 29

•

8 months ago

A bot will reopen intermittent failures if they reoccur and get flagged, but not crash-stats bugs like this. That being said, it sounds perfectly fine to close this.

BugBot (nomail) [:suhaib / :marco/ :calixte]

Updated

•

8 months ago

Keywords: leave-open

BugBot (nomail) [:suhaib / :marco/ :calixte]

Comment 30

•

8 months ago

Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit BugBot documentation.

Keywords: stalled

Ryan VanderMeulen [:RyanVM]

Updated

•

8 months ago

status-firefox133: affected → wontfix

status-firefox134: affected → wontfix

status-firefox135: affected → wontfix

status-firefox138: --- → fixed

status-firefox-esr128: affected → wontfix

Target Milestone: --- → 138 Branch
