Closed Bug 1725572 Opened 3 years ago Closed 3 years ago

Fission crash in [@ IPCError-content | RecvConstructBrowser missing initial frame BrowsingContext]

Categories

(Core :: DOM: Content Processes, defect, P1)

All
Windows
defect

Tracking

()

RESOLVED FIXED
95 Branch
Fission Milestone MVP
Tracking Status
firefox-esr78 --- disabled
firefox-esr91 --- disabled
firefox91 --- disabled
firefox92 + wontfix
firefox93 + wontfix
firefox94 + fixed
firefox95 + fixed

People

(Reporter: aryx, Assigned: nika)

References

Details

(Keywords: crash, Whiteboard: fission-hard-blocker)

Crash Data

Attachments

(7 files, 2 obsolete files)

48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review

~70 crashes from ~50 installations with 92.0 betas, >50% on Windows 7 with ~90% on x86. The signature got added in bug 1722378.

Maybe Fission related. (DOMFissionEnabled=1)

Crash report: https://crash-stats.mozilla.org/report/index/8d2cb23d-5443-4977-8391-86bed0210813

MOZ_CRASH Reason: MOZ_CRASH(Content child abort due to IPC error)

Top 10 frames of crashing thread:

0 xul.dll mozilla::dom::ContentChild::ProcessingError dom/ipc/ContentChild.cpp:2237
1 xul.dll static mozilla::ipc::IPCResult::Fail ipc/glue/ProtocolUtils.cpp:63
2 xul.dll mozilla::dom::ContentChild::RecvConstructBrowser dom/ipc/ContentChild.cpp:1747
3 xul.dll mozilla::dom::PContentChild::OnMessageReceived ipc/ipdl/PContentChild.cpp:8479
4 xul.dll mozilla::ipc::MessageChannel::DispatchMessage ipc/glue/MessageChannel.cpp:1978
5 xul.dll mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal xpcom/threads/TaskController.cpp:805
6 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1148
7 xul.dll mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:85
8 xul.dll MessageLoop::RunHandler ipc/chromium/src/base/message_loop.cc:324
9 xul.dll MessageLoop::Run ipc/chromium/src/base/message_loop.cc:306
Severity: -- → S2
Fission Milestone: --- → ?

After some investigation, I was able to find a theoretical codepath
which could lead to the "missing initial frame browsing context" error:

  1. Two iframes are created for the same origin, and begin process
    switching.
  2. The first iframe finishes process switching, but for some reason
    (e.g. being in shutdown) the call to LaunchSubprocessResolve
    errors.
  3. The second callback is called and also calls LaunchSubprocessResolve,
    which this time returns true due to it previously having been
    called.
  4. The BrowserParent is created in the new content process despite
    InitInternal() never having been finished, and therefore the
    ContentParent never becoming subscribed to the BrowsingContextGroup.

To fix this, I made 2 changes:

  1. Abort from process switching if the target process which we're going
    to be creating a BrowserParent in IsDead(), and
  2. Track the return value from LaunchSubprocessResolve, so we return
    false if it is called a second time after a failed content process
    launch.

I'm not confident that this is the cause of the crashes, as I was unable
to reproduce the issue.

After some investigation, I was able to find a theoretical codepath
which could lead to the "missing initial frame browsing context" error:

  1. Two iframes are created for the same origin, and begin process
    switching.
  2. The first iframe finishes process switching, but for some reason
    (e.g. being in shutdown) the call to LaunchSubprocessResolve
    errors.
  3. The second callback is called and also calls LaunchSubprocessResolve,
    which this time returns true due to it previously having been
    called.
  4. The BrowserParent is created in the new content process despite
    InitInternal() never having been finished, and therefore the
    ContentParent never becoming subscribed to the BrowsingContextGroup.

To fix this, I made 2 changes:

  1. Abort from process switching if the target process which we're going
    to be creating a BrowserParent in IsDead(), and
  2. Track the return value from LaunchSubprocessResolve, so we return
    false if it is called a second time after a failed content process
    launch.

I'm not confident that this is the cause of the crashes, as I was unable
to reproduce the issue.

Attachment #9238001 - Attachment description: Bug 1725572 - Avoid process-switching to a dead process, r=kmag → Bug 1725572 - Avoid process-switching to a dead process,
Attachment #9238001 - Attachment is obsolete: true

Comment on attachment 9237847 [details]
Bug 1725572 - Avoid process-switching to a dead process, r=kmag

Beta/Release Uplift Approval Request

  • User impact if declined: Potential cause of somewhat high-frequency Fission-only crashes. This is occurring on beta due to our Fission beta experiment. We hope to run a Fission experiment on release next cycle.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The changes should largely only impact the Fission codepath, and generally only impact situations where the browser would crash anyway if the checks failed. There is a small change which may also impact non-fission codepaths due to making the changes more thorough, but I expect it to not have an impact on users. If we are concerned, it could be left out of the uplift.
  • String changes made/needed: None
Attachment #9237847 - Flags: approval-mozilla-beta?
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 93 Branch

Comment on attachment 9237847 [details]
Bug 1725572 - Avoid process-switching to a dead process, r=kmag

Approved for 92.0b9, thanks.

Attachment #9237847 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

This doesn't seem to be helping with the crash rate on 92, unfortunately :(

Flags: needinfo?(nika)

Well, shoot. Was really hoping I'd found the cause. Reopening...

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Going to post a patch which I am not confident will fix the issue, but will be very easy to uplift if it does. Marking the bug as leave-open as I think we'll want to do follow-up after that.

Flags: needinfo?(nika)
Keywords: leave-open

I am not confident that this will fix the underlying issue causing this crash,
but given how small of a change it is, I figure it's worth trying to land
quickly to see if the crash rate drops with it.

Pushed by nlayzell@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a166505a2a27
Wait for parent ack when discarding BC from child, r=kmag

We're still seeing crash reports in Beta 93 and Release 92.

Fission Milestone: M8 → MVP
Summary: Crash in [@ IPCError-content | RecvConstructBrowser missing initial frame BrowsingContext] → Fission crash in [@ IPCError-content | RecvConstructBrowser missing initial frame BrowsingContext]
Whiteboard: fission-soft-blocker → fission-hard-blocker

The crash rate for 92 on Release is pretty low, so I think we can leave things alone there. It looks like the crash rate on Beta93 maybe holding a bit lower than it was for Beta92 too, so that's nice.

We have about 750K Fission users on Beta 93 and about 420K Fission users on Release 92, so it is very surprising that we see thousands of crashes on Beta 92 and 93 but only 23 on Release 92.

Attachment #9241683 - Attachment description: Bug 1725572 - Part 2: Keep BrowsingContext alive until every process has acknowledged the discard, r=kmag → Bug 1725572 - Part 2: Keep BrowsingContext alive until every process has acknowledged the discard, r=farre
Pushed by nlayzell@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6bb38c7309e5
Part 1: Support move-only closures in ScopeExit, r=glandium
https://hg.mozilla.org/integration/autoland/rev/9213b6bda67a
Part 2: Keep BrowsingContext alive until every process has acknowledged the discard, r=farre
Status: REOPENED → NEW
Target Milestone: 93 Branch → ---
Regressions: 1732911

FYI, if the release volume goes up 100x when we do the Fission rollout at 100% instead of 1%, this will be a top 10 (possibly top 5) content process crash based on the Fx92 numbers. Are there more diagnostics we can try here?

Flags: needinfo?(nika)

(In reply to Ryan VanderMeulen [:RyanVM] from comment #21)

FYI, if the release volume goes up 100x when we do the Fission rollout at 100% instead of 1%, this will be a top 10 (possibly top 5) content process crash based on the Fx92 numbers. Are there more diagnostics we can try here?

Some data backing up that forecast:

According to [1], Firefox 92.0.* submitted 47 crash reports from 46 installations. Multiplying by 100x would be 4700, placing this crash signature at #3 for Firefox 93, where #2 had 13,349 reports and #3 had 3969 [2].

[1] https://crash-stats.mozilla.org/signature/?signature=IPCError-content%20%7C%20RecvConstructBrowser%20missing%20initial%20frame%20BrowsingContext&date=%3E%3D2021-04-18T17%3A35%3A00.000Z&date=%3C2021-10-18T17%3A35%3A00.000Z

[2] https://crash-stats.mozilla.org/topcrashers/?product=Firefox&version=93.0

Things are looking a little better in Beta 94: Beta 94 b1–b7 has had 822 reports so far, about half of Beta 93 b1–b7's 1432 reports and Beta 92 b1–b7's 1797 reports. Fission was about 35% of Beta 92–94.

This patch adds support for ManagedEndpoint instances to be dropped &
gracefully destroyed. Before this change, a ManagedEndpoint which was
dropped without being bound would not clean up its' peer actor, meaning
that messages to and from that actor would be discarded.

This is done by adding a new actor destroy reason for dropping a
ManagedEndpoint.

The about:framecrashed URI was not tagged as URI_CAN_LOAD_IN_CHILD so was
firing principal assertions when being loaded. This patch adds that flag and
changes the test to no longer suppress the assertions.

Depends on D128776

If the BrowsingContext is missing in ConstructBrowser, previously we would fail
out, crashing the content process. With this new change, we instead exit
successfully, destroying the ManagedEndpoint instances due to the earlier
changes in this bug, and displaying the subframe as crashed.

As we don't know how to reproduce the case which caused this crash, this change
instead triggers the failure using a custom pref, which can be used to request
that attempts to create a BrowserParent for a specific BrowsingContext fail in
the same way as it would for a missing BrowsingContext, allowing us to test the
behaviour.

Depends on D128777

Nika says this change is a little risky, but should only change Fission behavior.

Hopefully this fix can land Monday or Tuesday and, after 1–2 days on Nightly, be uplifted to Thursday's Beta 94.0b9.

(In reply to Ryan VanderMeulen [:RyanVM] from comment #21)

FYI, if the release volume goes up 100x when we do the Fission rollout at 100% instead of 1%, this will be a top 10 (possibly top 5) content process crash based on the Fx92 numbers. Are there more diagnostics we can try here?

The new patch I wrote last week and forgot to post above does a final solution of deleting the crash and trying to recover from the error instead of fixing the cause. I'm a bit worried about the potential fallout from the situation which it could cause, but I made sure to include a simple artificial test of the situation.

This should definitely fix the crash (as it deletes it), and I'm hopeful it won't break anything else...

Flags: needinfo?(nika)
Pushed by nlayzell@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6e2afe3fff8c
Part 1: Support automatic cleanup for ManagedEndpoint instances, r=handyman
https://hg.mozilla.org/integration/autoland/rev/115bcf563f7d
Part 2: Stop disabling principal vetting for browser_crash_oopiframe.js, r=mconley
https://hg.mozilla.org/integration/autoland/rev/453ef7bf2fe9
Part 3: Recover from a missing subframe BrowsingContext in ConstructBrowser, r=smaug

About 94% of these crash reports have no URL, but the 6% that do have a URL are all https://mail.yahoo.com/ URLs (except two reports with other URLs). So Yahoo! Mail or some other Yahoo! site is probably hitting a race condition loading or unloading a mail.yahoo.com iframe?

Over the last month, we received only 5 crash reports with this signature, but we see now that there were about 2400 crash pings with the same crash reason: MOZ_CRASH(Content child abort due to IPC error). If Nika's fix works, we will hopefully see fewer MOZ_CRASH(Content child abort due to IPC error) crash pings in Nightly over the next day or two. This would give us more confidence uplifting Nika's fix to Thursday's Beta 94.0b9.

Status: NEW → RESOLVED
Closed: 3 years ago3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 95 Branch

Comment on attachment 9246494 [details]
Bug 1725572 - Part 3: Recover from a missing subframe BrowsingContext in ConstructBrowser, r=smaug

Beta/Release Uplift Approval Request

  • User impact if declined: Crashes. The true rate of these crashes appears to be higher than originally anticipated due to the crash reports, as it appears as though there are many times more crash pings for these crashes than we originally expected.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce: No known STR
  • List of other uplifts needed: None
  • Risk to taking this patch: Medium
  • Why is the change risky/not risky? (and alternatives if risky): The first part in this patch is a fairly large change to IPC which is required to make the recovery function correctly. Taking that change is the biggest risk.

As the new codepaths only really kick in for the error case, it should only impact Fission (as that is the only error case).

The error-case-only nature and lack of STR makes this very difficult to test, however an artificial test to trigger the same codepath has been written.

  • String changes made/needed: None
Attachment #9246494 - Flags: approval-mozilla-beta?
Attachment #9246492 - Flags: approval-mozilla-beta?
Attachment #9246493 - Flags: approval-mozilla-beta?

Comment on attachment 9246492 [details]
Bug 1725572 - Part 1: Support automatic cleanup for ManagedEndpoint instances, r=#ipc-reviewers

A bit risky this late in the cycle, but shipping Fission with this bug is pretty risky too given the projected crash volume. Approved for 94.0b8 so we can see what affect it has on a larger audience.

Attachment #9246492 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9246493 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9246494 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9255433 - Attachment is obsolete: true
Duplicate of this bug: 1579442
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: