Closed Bug 1710879 Opened 5 months ago Closed 2 months ago

Fission crash in [@ mozilla::dom::BrowserBridgeHost::GetLayersId]

Categories

(Core :: Layout, defect, P3)

Unspecified
All
defect

Tracking

()

RESOLVED FIXED
93 Branch
Fission Milestone MVP
Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 --- disabled
firefox88 --- disabled
firefox89 --- disabled
firefox90 --- disabled
firefox91 --- disabled
firefox92 --- wontfix
firefox93 --- fixed

People

(Reporter: cpeterson, Assigned: emilio)

Details

(Keywords: crash, Whiteboard: fission-soft-blocker)

Crash Data

Attachments

(2 files, 1 obsolete file)

Parent process crash on macOS and Windows.

Maybe Fission related since 100% of crash reports (11 out of 11) have Fission enabled and the crashes have only been reported in Nightly (86, 87, 88) and Beta (89.0b8, where we are running a Fission Beta experiment).

Crash report: https://crash-stats.mozilla.org/report/index/342930c5-8783-44a8-ad5a-8c72b0210507

Reason: EXC_BAD_ACCESS / KERN_INVALID_ADDRESS

Top 4 frames of crashing thread:

0 XUL mozilla::dom::BrowserBridgeHost::GetLayersId const dom/ipc/BrowserBridgeHost.cpp:30
1 XUL nsSubDocumentFrame::BuildDisplayList layout/generic/nsSubDocumentFrame.cpp:372
2 XUL nsBlockFrame::BuildDisplayList layout/generic/nsBlockFrame.cpp:7064
3 XUL nsIFrame::BuildDisplayListForChild layout/generic/nsIFrame.cpp:4030

100% crash on Fission. Give this S2 for now. Perhaps need to be P1 if the milestone is close.

Severity: -- → S2

The stack trace is missing some frames because many functions inlined between nsSubDocumentFrame::BuildDisplayList indirectly calling GetRemoteBrowser() and BrowserBridgeHost::GetLayersId(). mBridge is null here:

https://searchfox.org/mozilla-central/rev/5359952d8b0be3e706e8c943c2bef2674723b8a9/dom/ipc/BrowserBridgeHost.cpp#30

We should null out bridge host before nulling out remote browser?

needinfo'ing Nika to add more comments so someone else can write the patch.

Fission Milestone: ? → M8
Flags: needinfo?(nika)
Priority: -- → P3

I think the risk here is caused by the mBridge field of BrowserBridgeHost being cleared out earlier than the mRoot field of BrowserHost is cleared out, and before the nsFrameLoader::mRemoteBrowser field is cleared. The mRemoteBrowser field is cleared in nsFrameLoader::DestroyComplete(): https://searchfox.org/mozilla-central/rev/0e8b28fb355afd2fcc69d34e8ed66bbabf59a59a/dom/base/nsFrameLoader.cpp#2061-2065, which is calle dfrom the BrowserParent::ActorDestroy callback when the browser is fully destroyed here: https://searchfox.org/mozilla-central/rev/0e8b28fb355afd2fcc69d34e8ed66bbabf59a59a/dom/ipc/BrowserParent.cpp#714. This full destroy path is initially triggered by the nsFrameLoader calling DestroyStart here: https://searchfox.org/mozilla-central/rev/0e8b28fb355afd2fcc69d34e8ed66bbabf59a59a/dom/base/nsFrameLoader.cpp#2007, which ends up asynchronously performing the rest of the process. In the BrowserBridgeHost case, BrowserBridgeHost::DestroyStart() is implemented by directly calling DestroyComplete, meaning that we can clear out the mBridge field before we have cleared out the reference to the host.

To fix this issue, we should probably make BrowserBridgeChild act somewhat more like BrowserChild, calling nsFrameLoader::DestroyComplete from ActorDestroy, and not clearing out mRoot in the host until the final step is complete.

Flags: needinfo?(nika)
Assignee: nobody → emilio
Flags: needinfo?(emilio)
Flags: needinfo?(emilio)

[layout S1/S2 bug triage]: toggling ni=emilio just to be sure this is on your radar (once you're back from PTO). Looks like there's a patch but it's pending some revision. Thanks!

Flags: needinfo?(emilio)

This is probably simpler than messing up with the destruction order, as
callers already need to deal with this case, wdyt?

Flags: needinfo?(emilio)

This bug is a hard blocker for Fission M8.

Whiteboard: fission-hard-blocker

Change the status for beta to have the same as nightly and release.
For more information, please visit auto_nag documentation.

This crash's volume is so low (only 1 report in Beta 91 and 1 report in Nightly 92) that we don't need to bother uplifting a fix to Beta 91.

Flags: needinfo?(emilio)

Try run for the later, which I think is more in line with what Nika wanted: https://treeherder.mozilla.org/jobs?repo=try&revision=d007e75078f757d707c95175a52c88aec032b009

Flags: needinfo?(emilio)
Attachment #9224039 - Attachment is obsolete: true

Setting status-firefox92=wontfix because I don't think we need to uplift this fix to Beta 92. This crash's volume is very low and the fix looks a little risky.

Whiteboard: fission-hard-blocker → fission-soft-blocker
Pushed by ealvarez@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/7fb67d32d491
Don't clear BrowserBridgeHost::mBridge until BrowserBridgeChild has been completely destroyed. r=nika
Status: NEW → RESOLVED
Closed: 2 months ago
Resolution: --- → FIXED
Target Milestone: --- → 93 Branch
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: 93 Branch → ---
Flags: needinfo?(emilio)

So this got backed out because of crashes like bug 1725558.

They seem nullptr crashes with a null frameloader, which I believe can happen now that destruction is async if we never finish initializing the bridge (e.g., if we take any of the early returns in places like RecvMakeFrameRemote). Before, we'd shut down the bridge sync, which would prevent the messages from arriving.

Should I just null-check mFrameLoader in the BrowserBridgeChild::Recv* functions? Add a mDestroying boolean and early return? Or is there a better approach?

Flags: needinfo?(emilio) → needinfo?(nika)

Deferring this bug from Fission Milestone M8 to MVP. The crash volume is low and the fix caused some new crash regressions, so we probably wouldn't uplift the new fix to Beta 92 this late in the beta cycle.

Fission Milestone: M8 → MVP

(In reply to Emilio Cobos Álvarez (:emilio) from comment #16)

They seem nullptr crashes with a null frameloader, which I believe can happen now that destruction is async if we never finish initializing the bridge (e.g., if we take any of the early returns in places like RecvMakeFrameRemote). Before, we'd shut down the bridge sync, which would prevent the messages from arriving.

Ah, that makes sense. The change to make the deleteBridge MakeScopeExit call not synchronously destroy the bridge could cause issues. I should've probably caught that in review :-)

Could we perhaps make Send__delete__ a both: message, so that we can still synchronously destroy the bridge in the case where setup failed, but we do the async shutdown if it's succeeded, and was bound to a nsFrameLoader? That would give us the async teardown behaviour which the frameloader expects, but also allows us to only allow the actor to receive messages after it's been properly connected. I don't think the BrowserBridgeParent side will mind much either, so it'll just simplify things on the BrowserBridgeChild side.

Should I just null-check mFrameLoader in the BrowserBridgeChild::Recv* functions? Add a mDestroying boolean and early return? Or is there a better approach?

Null-checking mFrameLoader would also be a reasonable approach, I suppose. I don't love how future methods could easily be missed there and be potential future crashes though.

Flags: needinfo?(nika) → needinfo?(emilio)

Yeah, I did the both: think. Thanks!

Flags: needinfo?(emilio)
Pushed by ealvarez@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/fc7f4b236ce4
Don't clear BrowserBridgeHost::mBridge until BrowserBridgeChild has been completely destroyed. r=nika
Status: REOPENED → RESOLVED
Closed: 2 months ago2 months ago
Resolution: --- → FIXED
Target Milestone: --- → 93 Branch
You need to log in before you can comment on or make changes to this bug.