Closed Bug 1530168 Opened 6 years ago Closed 6 years ago

Crash in [@ mozilla::dom::ClientSource::SetController]

Categories

(WebExtensions :: General, defect, P2)

67 Branch
defect

Tracking

(firefox-esr60 unaffected, firefox65 unaffected, firefox66 unaffected, firefox67 fix-optional, firefox68 fix-optional)

RESOLVED DUPLICATE of bug 1535699
Tracking Status
firefox-esr60 --- unaffected
firefox65 --- unaffected
firefox66 --- unaffected
firefox67 --- fix-optional
firefox68 --- fix-optional

People

(Reporter: philipp, Assigned: asuth)

References

Details

(Keywords: crash, regression)

[Tracking Requested - why for this release]:

This bug is for crash report bp-5ee98431-9bfe-4fbe-aee6-183a60190224.

Top 10 frames of crashing thread:

0 xul.dll mozilla::dom::ClientSource::SetController dom/clients/manager/ClientSource.cpp:361
1 xul.dll nsGlobalWindowInner::EnsureClientSource dom/base/nsGlobalWindowInner.cpp:1820
2 xul.dll nsGlobalWindowOuter::SetNewDocument dom/base/nsGlobalWindowOuter.cpp:2207
3 xul.dll nsresult nsDocumentViewer::InitInternal layout/base/nsDocumentViewer.cpp:963
4 xul.dll nsresult nsDocumentViewer::Init layout/base/nsDocumentViewer.cpp:711
5 xul.dll nsresult nsDocShell::Embed docshell/base/nsDocShell.cpp:6234
6 xul.dll nsresult nsDSURIContentListener::DoContent docshell/base/nsDSURIContentListener.cpp:182
7 xul.dll bool nsDocumentOpenInfo::TryContentListener uriloader/base/nsURILoader.cpp:750
8 xul.dll nsresult nsDocumentOpenInfo::OnStartRequest uriloader/base/nsURILoader.cpp:300
9 xul.dll void mozilla::net::HttpChannelChild::DoOnStartRequest netwerk/protocol/http/HttpChannelChild.cpp:664

this crash signature is spiking up on 67.0a1 starting with build 20190215014208 and then again since build 20190220040540 - this is corresponding exactly with the landing, back-out & re-landing of bug 1525125, which is presumably triggering this regression.

All the crash reports come with MOZ_RELEASE_ASSERT(ClientMatchPrincipalInfo(mClientInfo.PrincipalInfo(), aServiceWorker.PrincipalInfo())).

The crashes also seem to be correlated to a handful of addons:
(27.62% in signature vs 00.76% overall) https://addons.mozilla.org/firefox/addon/groupspeeddial/
(21.90% in signature vs 00.51% overall) https://addons.mozilla.org/firefox/addon/new-tab-speed-dial/
(23.81% in signature vs 05.28% overall) https://www.nightdev.com/betterttv/
(20.00% in signature vs 02.47% overall) https://addons.mozilla.org/firefox/addon/stylish/

Flags: needinfo?(mixedpuppy)

Looking into this.

Some notes from my investigation.

Bug 1525125 is basically preventing the use of an extension url (as the newtab or homepage) if the extension doesn't have permission to private browsing. Code that creates a new document would enter the modified code paths to get the url that will be loaded. The patch shouldn't be having any effect on what happens later during the load of a newtab or homepage.

The first two (speed dial addons) both override newtab.
The second two have no override for homepage or newtab.

I looked at the meta data in a dozen or so of the crashes.

Some have one or more of the correlated extensions, some do not.

In the userPrefs part of the metadata, some have newtab and/or homepage set (cant see to what), some don't.

I'm at a bit of a loss as to how this could happen from my changes, but if it correlates that close to my change...maybe bkelly or asuth have some idea how.

Flags: needinfo?(mixedpuppy)
Flags: needinfo?(bugmail)
Flags: needinfo?(ben)
Flags: needinfo?(mixedpuppy)

This code does some non-standard principal comparing in order to be OMT:

https://searchfox.org/mozilla-central/source/dom/clients/manager/ClientPrincipalUtils.cpp#15

Perhaps something in there is not correct for new private browser or other changes.

I guess it might be hard to figure that out without know what principals are involved in the crashes.

Flags: needinfo?(ben)

@bkelly, thanks for the response! I forgot you'd moved on.

Flags: needinfo?(mixedpuppy)

argh!

Flags: needinfo?(mixedpuppy)

I looked at this in consultation with :mixedpuppy on IRC. There was no obvious smoking gun from the crash stacks available. I think we're both agreed that it's confusing how the referenced bug 1525125 changes would correlate, outside of changing browser startup timing by excluding extensions from the homepage list.

Given the correlation with webextensions, possibilities:

  1. This is the Strict-Transport-Security problem again in a new form.
  2. If we assume WebExtensions are involved somehow, the mismatch is due to the document having an expanded principal over [extension, HTTPS origin] and the ServiceWorker having a principal of just the HTTPS origin. The comparison function :bkelly references in comment 3 will return false in the event the principals are of different types. I'm not quite sure how that would happen, though.

Two viable next steps:

  1. Expand on the site of the release assert to generate different crashes based on what's being observed, illuminating what the principal types are at play and if something like a scheme mismatch is happening due to STS upgrade.
  2. Ensure we have test coverage for WebExtensions manipulating pages with no-fetch ServiceWorkers. (That is a ServiceWorker that doesn't register a fetch handler. In that case documents will still be controlled, but we will go to the network, which is the phenomenon we're seeing in crashes here.) In particular, creating iframes that inherit controllers (via about:blank) and/or that do crazy things with the webRequest webext API to meddle with the network response. (In particular, can that be used to violate channel invariants about the origin in play?)
Flags: needinfo?(bugmail)

(In reply to Andrew Sutherland [:asuth] from comment #6)

  1. If we assume WebExtensions are involved somehow, the mismatch is due to the document having an expanded principal over [extension, HTTPS origin] and the ServiceWorker having a principal of just the HTTPS origin.

Documents aren't allowed to have expanded principals. We have assertions to guarantee that, and a hack to automatically downgrade them in release builds.

David, could we get an assignee for this bug please? The volume on Nightly and the just released Deved Beta1 seem to indicate that it may spike next week when it hits beta. Thanks

Escalating this as a NI to Kris, because so far we've been unable to figure this out.

Flags: needinfo?(ddurst) → needinfo?(kmaglione+bmo)

I think we may have identified the source of the problem in the Serviceworker e10s refactor work. We're still investigating the problem, but in short:

  • The "process switch" mechanism migrates a channel from process A to process B when crossing origins. Code in SessionStore.jsm[1] makes this decision.
  • As a result, the HttpChannelParent/Child pair is migrated to a new process. At least for normal HTTP(S) redirects, this occurs during nsHttpChannel::OnStartRequest when the headers are first received, with the migration happening before the channel resumes processing and actually will process the HTTP(S) redirect.
  • The ClientChannelHelper is attached in the child (by the docshell) and listens to redirects, where it decides to clear the controller from the loadinfo[2] when redirecting off of the original document.
  • There appears to be a race related to the migrated HttpChannelChild in the target process being attached to the docshell and having the ClientChannelHelper be able to hear the processing of the redirect. (The channel's resumption in the parent is not strictly coupled to the child channel being linked to a docshell via a call to resumeRedirectedLoad[3].)

We expect to have this figured out in the next few days and will move this bug to the ServiceWorkers component and/or dupe as appropriate.

That said, if the webextensions team can shed any insight on how they think this might affect process swap timing or how webextensions might complicate the process swap mechanisms, it would be appreciated, as we'd like to make sure we have test coverage for this fix and are unlikely to be able to generate webextension test cases on our own.

1: https://searchfox.org/mozilla-central/rev/aae527894a97ee3bbe0c2cfce9c67c59e8b8fcb9/browser/components/sessionstore/SessionStore.jsm#2317
2: https://searchfox.org/mozilla-central/rev/aae527894a97ee3bbe0c2cfce9c67c59e8b8fcb9/dom/clients/manager/ClientChannelHelper.cpp#142
3: https://searchfox.org/mozilla-central/rev/aae527894a97ee3bbe0c2cfce9c67c59e8b8fcb9/browser/components/sessionstore/ContentRestore.jsm#188

(In reply to Andrew Sutherland [:asuth] from comment #10)

That said, if the webextensions team can shed any insight on how they think this might affect process swap timing or how webextensions might complicate the process swap mechanisms, it would be appreciated, as we'd like to make sure we have test coverage for this fix and are unlikely to be able to generate webextension test cases on our own.

I wouldn't expect it to affect process swap timing much, but loading a homepage from an extension definitely affects which process it loads into. Extension pages always load into the extension content process. Depending on the initial process of the browser, that may require a process switch just to load the page. Then a lot of extensions immediately redirect their home/new tab page to a web URL (since we don't allow them to override either page with a web URL directly), which requires another process switch.

That said, I have a hard time seeing how bug 1525125 would affect this in any meaningful way, since the only affect it should have is preventing those overrides from having an effect on private windows.

Flags: needinfo?(kmaglione+bmo)

(In reply to Kris Maglione [:kmag] from comment #11)

Then a lot of extensions immediately redirect their home/new tab page to a web URL (since we don't allow them to override either page with a web URL directly), which requires another process switch.

minor correction: Home page can be set to an external url (newtab cannot) by an extension. It is part of the reason for the approach in Bug 1532165.

Assignee: nobody → perry
Flags: needinfo?(mixedpuppy)
Priority: -- → P2

Should this move to some other product/component?

Flags: needinfo?(perry)

I made a duplicate bug in Service Workers (bug 1535699) which I believe is the cause of the crash, so this one here shouldn't be needed anymore.

Flags: needinfo?(perry)
Assignee: perry → bugmail

Duping across to the ServiceWorkers bug 1535699 where we'll fix this.

This is so extremely correlated with webextensions because process swap only triggers when there's a remoteType mismatch, which happens when trying to open http/https links in a webextensions process. Normally network accesses would end up going straight to the parent and trigger a process swap, but serviceworker child intercept attempts to fire in the webextensions' content process.

Note that this is indicative of the potentially concerning issue that when there's a ServiceWorker that DOES intercept page loads, it is likely the ServiceWorker and its controlled page do get to exist in the webextension process. I've filed bug 1536826 on that issue and fixing it.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → DUPLICATE
Crash Signature: [@ mozilla::dom::ClientSource::SetController]
You need to log in before you can comment on or make changes to this bug.