Fission crash in [@ IPCError-content | RecvConstructBrowser missing initial frame BrowsingContext]
Categories
(Core :: DOM: Content Processes, defect, P1)
Tracking
()
People
(Reporter: aryx, Assigned: nika)
References
Details
(Keywords: crash, Whiteboard: fission-hard-blocker)
Crash Data
Attachments
(7 files, 2 obsolete files)
48 bytes,
text/x-phabricator-request
|
RyanVM
:
approval-mozilla-beta+
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
RyanVM
:
approval-mozilla-beta+
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
RyanVM
:
approval-mozilla-beta+
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
RyanVM
:
approval-mozilla-beta+
|
Details | Review |
~70 crashes from ~50 installations with 92.0 betas, >50% on Windows 7 with ~90% on x86. The signature got added in bug 1722378.
Maybe Fission related. (DOMFissionEnabled=1)
Crash report: https://crash-stats.mozilla.org/report/index/8d2cb23d-5443-4977-8391-86bed0210813
MOZ_CRASH Reason: MOZ_CRASH(Content child abort due to IPC error)
Top 10 frames of crashing thread:
0 xul.dll mozilla::dom::ContentChild::ProcessingError dom/ipc/ContentChild.cpp:2237
1 xul.dll static mozilla::ipc::IPCResult::Fail ipc/glue/ProtocolUtils.cpp:63
2 xul.dll mozilla::dom::ContentChild::RecvConstructBrowser dom/ipc/ContentChild.cpp:1747
3 xul.dll mozilla::dom::PContentChild::OnMessageReceived ipc/ipdl/PContentChild.cpp:8479
4 xul.dll mozilla::ipc::MessageChannel::DispatchMessage ipc/glue/MessageChannel.cpp:1978
5 xul.dll mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal xpcom/threads/TaskController.cpp:805
6 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1148
7 xul.dll mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:85
8 xul.dll MessageLoop::RunHandler ipc/chromium/src/base/message_loop.cc:324
9 xul.dll MessageLoop::Run ipc/chromium/src/base/message_loop.cc:306
Updated•3 years ago
|
Comment 1•3 years ago
•
|
||
Of the new crash reasons Nika added in bug 1722378, "missing initial frame BrowsingContext" is the only one we're hitting:
All but one of these crashes are happening in Beta 92. 100% are Windows. 86% are 32-bit x86 builds (while "only" 67% of all Windows Beta users are 32-bit). Perhaps this crash is related to OOMs or CPU usage?
Updated•3 years ago
|
Assignee | ||
Comment 2•3 years ago
|
||
After some investigation, I was able to find a theoretical codepath
which could lead to the "missing initial frame browsing context" error:
- Two iframes are created for the same origin, and begin process
switching. - The first iframe finishes process switching, but for some reason
(e.g. being in shutdown) the call toLaunchSubprocessResolve
errors. - The second callback is called and also calls LaunchSubprocessResolve,
which this time returnstrue
due to it previously having been
called. - The BrowserParent is created in the new content process despite
InitInternal()
never having been finished, and therefore the
ContentParent never becoming subscribed to the BrowsingContextGroup.
To fix this, I made 2 changes:
- Abort from process switching if the target process which we're going
to be creating a BrowserParent inIsDead()
, and - Track the return value from
LaunchSubprocessResolve
, so we return
false
if it is called a second time after a failed content process
launch.
I'm not confident that this is the cause of the crashes, as I was unable
to reproduce the issue.
Assignee | ||
Comment 3•3 years ago
|
||
After some investigation, I was able to find a theoretical codepath
which could lead to the "missing initial frame browsing context" error:
- Two iframes are created for the same origin, and begin process
switching. - The first iframe finishes process switching, but for some reason
(e.g. being in shutdown) the call toLaunchSubprocessResolve
errors. - The second callback is called and also calls LaunchSubprocessResolve,
which this time returnstrue
due to it previously having been
called. - The BrowserParent is created in the new content process despite
InitInternal()
never having been finished, and therefore the
ContentParent never becoming subscribed to the BrowsingContextGroup.
To fix this, I made 2 changes:
- Abort from process switching if the target process which we're going
to be creating a BrowserParent inIsDead()
, and - Track the return value from
LaunchSubprocessResolve
, so we return
false
if it is called a second time after a failed content process
launch.
I'm not confident that this is the cause of the crashes, as I was unable
to reproduce the issue.
Updated•3 years ago
|
Updated•3 years ago
|
Assignee | ||
Comment 4•3 years ago
|
||
Comment on attachment 9237847 [details]
Bug 1725572 - Avoid process-switching to a dead process, r=kmag
Beta/Release Uplift Approval Request
- User impact if declined: Potential cause of somewhat high-frequency Fission-only crashes. This is occurring on beta due to our Fission beta experiment. We hope to run a Fission experiment on release next cycle.
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): The changes should largely only impact the Fission codepath, and generally only impact situations where the browser would crash anyway if the checks failed. There is a small change which may also impact non-fission codepaths due to making the changes more thorough, but I expect it to not have an impact on users. If we are concerned, it could be left out of the uplift.
- String changes made/needed: None
Comment 5•3 years ago
|
||
bugherder |
Comment 6•3 years ago
|
||
Comment on attachment 9237847 [details]
Bug 1725572 - Avoid process-switching to a dead process, r=kmag
Approved for 92.0b9, thanks.
Comment 7•3 years ago
|
||
bugherder uplift |
Comment 8•3 years ago
|
||
This doesn't seem to be helping with the crash rate on 92, unfortunately :(
Assignee | ||
Comment 9•3 years ago
|
||
Well, shoot. Was really hoping I'd found the cause. Reopening...
Assignee | ||
Comment 10•3 years ago
|
||
Going to post a patch which I am not confident will fix the issue, but will be very easy to uplift if it does. Marking the bug as leave-open as I think we'll want to do follow-up after that.
Assignee | ||
Comment 11•3 years ago
|
||
I am not confident that this will fix the underlying issue causing this crash,
but given how small of a change it is, I figure it's worth trying to land
quickly to see if the crash rate drops with it.
Comment 12•3 years ago
|
||
Pushed by nlayzell@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/a166505a2a27 Wait for parent ack when discarding BC from child, r=kmag
Comment 13•3 years ago
|
||
bugherder |
Comment 14•3 years ago
|
||
We're still seeing crash reports in Beta 93 and Release 92.
Updated•3 years ago
|
Comment 15•3 years ago
|
||
The crash rate for 92 on Release is pretty low, so I think we can leave things alone there. It looks like the crash rate on Beta93 maybe holding a bit lower than it was for Beta92 too, so that's nice.
Assignee | ||
Comment 16•3 years ago
|
||
Assignee | ||
Comment 17•3 years ago
|
||
Depends on D125895
Comment 18•3 years ago
|
||
We have about 750K Fission users on Beta 93 and about 420K Fission users on Release 92, so it is very surprising that we see thousands of crashes on Beta 92 and 93 but only 23 on Release 92.
Updated•3 years ago
|
Comment 19•3 years ago
|
||
Pushed by nlayzell@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/6bb38c7309e5 Part 1: Support move-only closures in ScopeExit, r=glandium https://hg.mozilla.org/integration/autoland/rev/9213b6bda67a Part 2: Keep BrowsingContext alive until every process has acknowledged the discard, r=farre
Comment 20•3 years ago
|
||
bugherder |
Updated•3 years ago
|
Comment 21•3 years ago
|
||
FYI, if the release volume goes up 100x when we do the Fission rollout at 100% instead of 1%, this will be a top 10 (possibly top 5) content process crash based on the Fx92 numbers. Are there more diagnostics we can try here?
Comment 22•3 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #21)
FYI, if the release volume goes up 100x when we do the Fission rollout at 100% instead of 1%, this will be a top 10 (possibly top 5) content process crash based on the Fx92 numbers. Are there more diagnostics we can try here?
Some data backing up that forecast:
According to [1], Firefox 92.0.* submitted 47 crash reports from 46 installations. Multiplying by 100x would be 4700, placing this crash signature at #3 for Firefox 93, where #2 had 13,349 reports and #3 had 3969 [2].
[2] https://crash-stats.mozilla.org/topcrashers/?product=Firefox&version=93.0
Comment 23•3 years ago
|
||
Things are looking a little better in Beta 94: Beta 94 b1–b7 has had 822 reports so far, about half of Beta 93 b1–b7's 1432 reports and Beta 92 b1–b7's 1797 reports. Fission was about 35% of Beta 92–94.
Assignee | ||
Comment 24•3 years ago
|
||
This patch adds support for ManagedEndpoint instances to be dropped &
gracefully destroyed. Before this change, a ManagedEndpoint which was
dropped without being bound would not clean up its' peer actor, meaning
that messages to and from that actor would be discarded.
This is done by adding a new actor destroy reason for dropping a
ManagedEndpoint.
Assignee | ||
Comment 25•3 years ago
|
||
The about:framecrashed
URI was not tagged as URI_CAN_LOAD_IN_CHILD so was
firing principal assertions when being loaded. This patch adds that flag and
changes the test to no longer suppress the assertions.
Depends on D128776
Assignee | ||
Comment 26•3 years ago
|
||
If the BrowsingContext is missing in ConstructBrowser, previously we would fail
out, crashing the content process. With this new change, we instead exit
successfully, destroying the ManagedEndpoint
instances due to the earlier
changes in this bug, and displaying the subframe as crashed.
As we don't know how to reproduce the case which caused this crash, this change
instead triggers the failure using a custom pref, which can be used to request
that attempts to create a BrowserParent for a specific BrowsingContext fail in
the same way as it would for a missing BrowsingContext, allowing us to test the
behaviour.
Depends on D128777
Comment 27•3 years ago
|
||
Nika says this change is a little risky, but should only change Fission behavior.
Hopefully this fix can land Monday or Tuesday and, after 1–2 days on Nightly, be uplifted to Thursday's Beta 94.0b9.
Assignee | ||
Comment 28•3 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #21)
FYI, if the release volume goes up 100x when we do the Fission rollout at 100% instead of 1%, this will be a top 10 (possibly top 5) content process crash based on the Fx92 numbers. Are there more diagnostics we can try here?
The new patch I wrote last week and forgot to post above does a final solution of deleting the crash and trying to recover from the error instead of fixing the cause. I'm a bit worried about the potential fallout from the situation which it could cause, but I made sure to include a simple artificial test of the situation.
This should definitely fix the crash (as it deletes it), and I'm hopeful it won't break anything else...
Comment 29•3 years ago
|
||
Pushed by nlayzell@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/6e2afe3fff8c Part 1: Support automatic cleanup for ManagedEndpoint instances, r=handyman https://hg.mozilla.org/integration/autoland/rev/115bcf563f7d Part 2: Stop disabling principal vetting for browser_crash_oopiframe.js, r=mconley https://hg.mozilla.org/integration/autoland/rev/453ef7bf2fe9 Part 3: Recover from a missing subframe BrowsingContext in ConstructBrowser, r=smaug
Comment 30•3 years ago
|
||
About 94% of these crash reports have no URL, but the 6% that do have a URL are all https://mail.yahoo.com/ URLs (except two reports with other URLs). So Yahoo! Mail or some other Yahoo! site is probably hitting a race condition loading or unloading a mail.yahoo.com iframe?
Over the last month, we received only 5 crash reports with this signature, but we see now that there were about 2400 crash pings with the same crash reason: MOZ_CRASH(Content child abort due to IPC error)
. If Nika's fix works, we will hopefully see fewer MOZ_CRASH(Content child abort due to IPC error)
crash pings in Nightly over the next day or two. This would give us more confidence uplifting Nika's fix to Thursday's Beta 94.0b9.
Assignee | ||
Updated•3 years ago
|
Comment 31•3 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/6e2afe3fff8c
https://hg.mozilla.org/mozilla-central/rev/115bcf563f7d
https://hg.mozilla.org/mozilla-central/rev/453ef7bf2fe9
Assignee | ||
Comment 32•3 years ago
|
||
Comment on attachment 9246494 [details]
Bug 1725572 - Part 3: Recover from a missing subframe BrowsingContext in ConstructBrowser, r=smaug
Beta/Release Uplift Approval Request
- User impact if declined: Crashes. The true rate of these crashes appears to be higher than originally anticipated due to the crash reports, as it appears as though there are many times more crash pings for these crashes than we originally expected.
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce: No known STR
- List of other uplifts needed: None
- Risk to taking this patch: Medium
- Why is the change risky/not risky? (and alternatives if risky): The first part in this patch is a fairly large change to IPC which is required to make the recovery function correctly. Taking that change is the biggest risk.
As the new codepaths only really kick in for the error case, it should only impact Fission (as that is the only error case).
The error-case-only nature and lack of STR makes this very difficult to test, however an artificial test to trigger the same codepath has been written.
- String changes made/needed: None
Assignee | ||
Updated•3 years ago
|
Comment 33•3 years ago
|
||
Comment on attachment 9246492 [details]
Bug 1725572 - Part 1: Support automatic cleanup for ManagedEndpoint instances, r=#ipc-reviewers
A bit risky this late in the cycle, but shipping Fission with this bug is pretty risky too given the projected crash volume. Approved for 94.0b8 so we can see what affect it has on a larger audience.
Updated•3 years ago
|
Updated•3 years ago
|
Comment 34•3 years ago
|
||
bugherder uplift |
https://hg.mozilla.org/releases/mozilla-beta/rev/910b791efee6
https://hg.mozilla.org/releases/mozilla-beta/rev/4635222216a5
https://hg.mozilla.org/releases/mozilla-beta/rev/8f29979c57e2
Comment hidden (typo) |
Updated•2 years ago
|
Description
•