Crash in [@ gfxWindowsPlatform::RecordContentDeviceFailure]
Categories: Core :: Graphics, defect, P2
People: Reporter: release-mgmt-account-bot; Assigned: bradwerth
References: Blocks 1 open bug
Details: Keywords: crash
Attachments: 3 files
Crash report: https://crash-stats.mozilla.org/report/index/f99b2a7e-72fc-45f4-aa08-fae570240322
MOZ_CRASH Reason: MOZ_RELEASE_ASSERT(isSome())
Top 10 frames of crashing thread:
0 xul.dll gfxWindowsPlatform::RecordContentDeviceFailure mfbt/Maybe.h:783
0 xul.dll mozilla::gfx::DeviceManagerDx::CreateContentDevice gfx/thebes/DeviceManagerDx.cpp:1021
1 xul.dll mozilla::gfx::DeviceManagerDx::CreateContentDevicesLocked gfx/thebes/DeviceManagerDx.cpp:643
2 xul.dll mozilla::gfx::DeviceManagerDx::CreateContentDevices gfx/thebes/DeviceManagerDx.cpp:629
3 xul.dll mozilla::RDDParent::RecvInitVideoBridge dom/media/ipc/RDDParent.cpp:216
4 xul.dll mozilla::PRDDParent::OnMessageReceived ipc/ipdl/PRDDParent.cpp:824
5 xul.dll mozilla::ipc::MessageChannel::DispatchAsyncMessage ipc/glue/MessageChannel.cpp:1818
5 xul.dll mozilla::ipc::MessageChannel::DispatchMessage ipc/glue/MessageChannel.cpp:1737
5 xul.dll mozilla::ipc::MessageChannel::RunMessage ipc/glue/MessageChannel.cpp:1530
5 xul.dll mozilla::ipc::MessageChannel::MessageTask::Run ipc/glue/MessageChannel.cpp:1628
By querying Nightly crashes reported within the last 2 months, here are some insights about the signature:
- First crash report: 2024-03-22
- Process type: RDD
- Is startup crash: No
- Has user comments: No
- Is null crash: No
Comment 1 (Reporter)•1 year ago
The severity field is not set for this bug.
:bhood, could you have a look please?
For more information, please visit BugBot documentation.
Comment 2 (Reporter)•1 year ago
The bug is linked to a topcrash signature, which matches the following criterion:
- Top 10 desktop browser crashes on nightly
For more information, please visit BugBot documentation.
Comment 3•1 year ago
The critical log shows things going wrong:
[0][GFX1-]: Fallback WR to SW-WR + D3D11 (t=212.35)
[1][GFX1-]: Failed to connect GPU process (t=212.35)
[2][GFX1-]: [D3D11] failed to get compositor device. (t=212.66)
[3][GFX1-]: Failed to initialize CompositorD3D11 for SWGL: FEATURE_FAILURE_D3D11_NO_DEVICE (t=212.66)
[4][GFX1-]: [D3D11] failed to get compositor device. (t=273.13)
[5][GFX1-]: Failed to initialize CompositorD3D11 for SWGL: FEATURE_FAILURE_D3D11_NO_DEVICE (t=273.13)
[6][GFX1-]: [D3D11] failed to get compositor device. (t=377.35)
[7][GFX1-]: Failed to initialize CompositorD3D11 for SWGL: FEATURE_FAILURE_D3D11_NO_DEVICE (t=377.35)
[8][GFX1-]: RenderCompositorSWGL failed mapping default framebuffer, no dt (t=378.6)
[9][GFX1-]: [D3D11] failed to get compositor device. (t=385.31)
[10][GFX1-]: Failed to initialize CompositorD3D11 for SWGL: FEATURE_FAILURE_D3D11_NO_DEVICE (t=385.31)
Comment 4 (Assignee)•1 year ago
Theory: the odd call stack (a release assert on mContentDevice = device) may be related to the expectation that a lock is held before mContentDevice is accessed. The call stacks don't show a path where that lock isn't held, but the function definitions don't enforce it (for example, by taking a const MutexAutoLock& aProofOfLock parameter). Adding that enforcement would be an improvement.
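A minimal sketch of the proof-of-lock pattern this suggests -- the class and member names here are illustrative, not the actual DeviceManagerDx code:

#include "mozilla/Mutex.h"

class Device;  // stand-in for the real device type

class DeviceManager {
 public:
  DeviceManager() : mMutex("DeviceManager::mMutex"), mContentDevice(nullptr) {}

  void ReplaceContentDevice(Device* aDevice) {
    mozilla::MutexAutoLock lock(mMutex);
    SetContentDeviceLocked(aDevice, lock);
  }

 private:
  // Requiring the live lock guard as a parameter makes it impossible to
  // compile a call path that touches mContentDevice without holding mMutex.
  void SetContentDeviceLocked(Device* aDevice,
                              const mozilla::MutexAutoLock& aProofOfLock) {
    mContentDevice = aDevice;
  }

  mozilla::Mutex mMutex;
  Device* mContentDevice;  // guarded by mMutex
};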
Comment 5 (Assignee)•1 year ago
Much better theory: mDeviceStatus is a Maybe, and it is unconditionally dereferenced at the beginning of DeviceManagerDx::CreateContentDevice. I'll build a patch to fix this.
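For context, a minimal sketch of this failure mode and the obvious guard, using a stand-in status type rather than the real one:

#include "mozilla/Maybe.h"

struct DeviceStatus {  // stand-in for the real status type
  bool mIsWARP = false;
};

static mozilla::Maybe<DeviceStatus> sDeviceStatus;  // may be Nothing during init

static bool CreateContentDevice() {
  // Dereferencing an empty Maybe via operator-> or ref() is what fires
  // MOZ_RELEASE_ASSERT(isSome()) -- the assert in this crash signature.
  if (sDeviceStatus.isNothing()) {
    return false;  // bail out gracefully instead of crashing
  }
  return !sDeviceStatus->mIsWARP;
}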
Comment 6 (Assignee)•1 year ago
Comment 7 (Assignee)•1 year ago
If this is the source of the crash, this assert will change the crash signature, but that will pinpoint the cause. In such a case, applying D208682 should fix the crash.
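Presumably the diagnostic assert is something along these lines (a sketch, not the actual patch):

#include "mozilla/Assertions.h"
#include "mozilla/Maybe.h"

// Asserting earlier, with a distinct message, moves the crash signature off
// the generic Maybe assert and onto a line that names the suspected cause.
void CheckDeviceStatus(const mozilla::Maybe<int>& aDeviceStatus) {
  MOZ_RELEASE_ASSERT(aDeviceStatus.isSome(),
                     "CreateContentDevice called with empty device status");
}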
Comment 8 (Assignee)•1 year ago
Marking leave-open while we see if the landing of D208694 affects the crash signature.
Comment 9 (bugherder)•1 year ago
Comment 10 (Reporter)•1 year ago
Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.
For more information, please visit BugBot documentation.
Comment 11 (Assignee)•1 year ago
(In reply to Brad Werth [:bradwerth] from comment #8)
> Marking leave-open while we see if the landing of D208694 affects the crash signature.

It doesn't appear that mDeviceStatus is NULL in this call stack, or D208694 would have changed the signature. So there's something else going on. I'm putting this back into triage (and taking myself off the bug) so we can re-discuss.
Comment 12•1 year ago
This is occurring in the RDD process quite a bit.
https://mozilla.github.io/process-top-crashes/rdd_release.html
https://sql.telemetry.mozilla.org/queries/78917/source?p_channel=release#196128
Comment 13•1 year ago
It looks like we still have bad debug info:
e.g.
0 xul.dll gfxWindowsPlatform::RecordContentDeviceFailure(mozilla::gfx::TelemetryDeviceCode) mfbt/Maybe.h:953 inlined
This looks like another instance of the problem I ran into here: https://github.com/mstange/samply/issues/5
Markus wrote some analysis here: https://github.com/mstange/samply/issues/5#issuecomment-1101746227
Serge, do you have time to take a look at what might be going on here?
Comment 14•1 year ago
Jeff, I'd be happy to help, but I'm pretty unsure how to reproduce this. It would be very helpful if you could provide some direction for pinpointing the wrong debug info from a fresh build; I could start working on it from there. I'll still give it a try this morning, though.
Comment 15•1 year ago
> Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

This is actually over 60% of our RDD crashes on release, or 10% of our total crash volume. I'm triaging it back to S2 based on that, but feel free to disagree as long as you include the data from the previous sentence in your assessment :-)
Comment 16•1 year ago
(In reply to Gian-Carlo Pascutto [:gcp] from comment #15)
> This is actually over 60% of our RDD crashes on release, or 10% of our total crash volume. I'm triaging it back to S2 based on that, but feel free to disagree as long as you include the data from the previous sentence in your assessment :-)

Where are you seeing that? Looking at crash stats, I barely see any crashes for this signature: 79 in the last week, and 43% of those are Nightly. Is that from crash pings or something?
Comment 17•1 year ago
> Is that from crash pings or something?

Yes. Crash stats are never very reliable because they're a biased sample, but they're useless for anything that isn't the main process or a content process, because almost no one opts in to sending those crash reports. So if you get any crash reports, odds are it's a topcrasher already.
See the announcement in moz.dev.platform about the crash ping dashboards.
Comment 18 (Reporter)•1 year ago
Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.
For more information, please visit BugBot documentation.
Comment 19•1 year ago
I think during init we hit this path, where we reset mDeviceStatus but might successfully get a new pointer for mAdapter:
https://searchfox.org/mozilla-central/rev/0b189f017bc9d48b62012205c5b9f8a8b560497b/gfx/thebes/DeviceManagerDx.cpp#717
Then we crash here:
https://searchfox.org/mozilla-central/rev/0b189f017bc9d48b62012205c5b9f8a8b560497b/gfx/thebes/DeviceManagerDx.cpp#990
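A simplified sketch of that sequence, with hypothetical names (see the searchfox links above for the real code):

#include "mozilla/Maybe.h"

struct Adapter {};
struct DeviceStatus {};

static mozilla::Maybe<DeviceStatus> sDeviceStatus;
static Adapter* sAdapter = nullptr;

// The reset path clears the status, but a retry can still populate the adapter.
static void ResetAndRetry() {
  sDeviceStatus = mozilla::Nothing();
  static Adapter sTheAdapter;
  sAdapter = &sTheAdapter;  // success here leaves the two fields inconsistent
}

static void CreateContentDevice() {
  if (!sAdapter) {
    return;  // only the adapter is guarded...
  }
  // ...so this dereference hits MOZ_RELEASE_ASSERT(isSome()) whenever
  // sDeviceStatus was cleared while sAdapter survived.
  DeviceStatus status = *sDeviceStatus;
  (void)status;
}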
Comment 20•1 year ago
Jeff, are you able to help Serge per his request in comment #14?
Comment 21 (Assignee)•11 months ago
Thanks to Andrew's analysis in Comment 19, I think I can build a patch to fix this.
Comment 22 (Assignee)•11 months ago
This adds an IsWARPLocked method to cover the cases where the lock is already held, and prefers the DeviceManagerDx methods over direct access to the device, so that this case is handled correctly.
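A sketch of the locked/unlocked accessor pair this describes; the member names are illustrative rather than the actual DeviceManagerDx fields:

#include "mozilla/Maybe.h"
#include "mozilla/Mutex.h"

class DeviceManager {
 public:
  DeviceManager() : mMutex("DeviceManager::mMutex") {}

  // Public entry point: takes the lock itself.
  bool IsWARP() {
    mozilla::MutexAutoLock lock(mMutex);
    return IsWARPLocked(lock);
  }

 private:
  // For callers already inside a locked region; the guard parameter proves
  // at compile time that mMutex is held.
  bool IsWARPLocked(const mozilla::MutexAutoLock& aProofOfLock) {
    // Treat "no status yet" as not-WARP instead of dereferencing Nothing.
    return mIsWARP.isSome() && *mIsWARP;
  }

  mozilla::Mutex mMutex;
  mozilla::Maybe<bool> mIsWARP;  // guarded by mMutex
};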
Comment 23•11 months ago
Comment 24 (bugherder)•11 months ago
Comment 25•11 months ago
Flagging as stalled while we observe the effects of the patch. Please remove the keyword if work begins again on the report.
Comment 26 (Assignee)•9 months ago
Can't tell yet if this is completely sorted out -- I'm not great at interpreting the crash reports to see if the patch has been applied. But no crashes in Nightly since April 10.
Comment 27 (Reporter)•9 months ago
Since the crash volume is low (less than 15 per week), the severity is downgraded to S3. Feel free to change it back if you think the bug is still critical.
For more information, please visit BugBot documentation.
Comment 28 (Assignee)•8 months ago
I believe this is now fixed. No crashes in Nightly for the past month. If I'm mistaken here, I believe this bug will be automatically re-opened, and I am happy to keep working on it.
Comment 29•8 months ago
A bot will reopen intermittent failures if they reoccur and get flagged, but not crash-stats bugs like this. That being said, it sounds perfectly fine to close this.
Comment 30•8 months ago
Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit BugBot documentation.