Properly clear the client (i.e. window) and controller (i.e. Service Worker) on a load info when performing a "process switch"/"process swap"

REOPENED
Assigned to

Status

()

defect
P1
critical
REOPENED
2 months ago
4 days ago

People

(Reporter: perry, Assigned: perry)

Tracking

(Depends on 1 bug, {crash, regression, topcrash})

unspecified
mozilla68
Points:
---
Dependency tree / graph
Bug Flags:
qe-verify +

Firefox Tracking Flags

(firefox-esr60 unaffected, firefox66 wontfix, firefox67blocking verified, firefox68+ affected)

Details

(Whiteboard: [qa-triaged], crash signature)

Attachments

(3 attachments)

Assignee

Description

2 months ago

This is possibly the same thing as bug 1530168.

The cause of the crash seems to be when a Service Worker registered under origin A attempts to intercept a request to origin A, but the Service Worker doesn't actually handle fetch events, and the request to origin A ultimately redirects to origin B. Sometimes (i.e. intermittently), the load info for the cross-origin redirect (i.e. the redirect channel to origin B), incorrectly inherits the controller that was marked on the request to origin A.

Assignee

Updated

2 months ago
Blocks: 1530168, 1510368
Assignee

Updated

2 months ago
Assignee: nobody → perry
Assignee

Updated

2 months ago
Priority: -- → P2
Assignee

Updated

2 months ago
Assignee: perry → bugmail

Debugger detective work by :perry appears to indicate that when the process swap occurs, the "redirect mode" is getting corrupted from being MANUAL to FOLLOW, at least in the HttpChannelChild, which is causing the ClientChannelHelper to get confused. We believe this is due to FOLLOW being the default for all channels with SetRedirectMode being called to propagate the value on redirect. However, it appears likely that although an attempt is made to propagate this value in the parent process, that the value is not being propagated in the child process.

When doing pair debugging after this, we determined that in all likelihood the reason Perry was seeing this intermittently in his Service Worker parent-intercept patches was an initial simplification to randomly select the process to spawn the ServiceWorker in and it was selecting the extensions process, which necessitated a process swap for any opened pages. (Because extensions have their own process, the RemoteWorkerService is created in that process as well, and the RemoteWorker random process selection operates over all known parents without applying any remotetype pool clustering, so it's possible to spawn a ServiceWorker in a process it should not be in.) Perry is addressing the process selection logic in his patch-set for ServiceWorkerPrivate remoting and so I'm taking over this patch.

It's certain that bug 1530168 which was highly correlated to webextensions is experiencing this. In short the scenario there is that:

  • ServiceWorker interception happens in the child initially, so initial channel opening happens in the child process.
  • Interception gets reset and so a channel connection is opened.
  • Process swap logic triggers.
  • The bug with redirect mode not propagating happens. And no other logic cares about redirect mode being lost due to process swap. Process swap only happens for top-level tabs and for navigation, which means it's not exposed to the fetch() API. Process swap also only happens when there's a remoteType mismatch; we don't currently process swap based on origin because we're not ready for that.
Duplicate of this bug: 1530168
Crash Signature: [@ mozilla::dom::ClientSource::SetController]

+500 crashes with the recently released 67 beta 3, marking as blocking for the release.

Priority: P2 → P1
Severity: normal → major

Andrew, did you make any progress on finding a solution for this top crasher? Thanks

Flags: needinfo?(bugmail)
Severity: major → critical

Andrew Sutherland told me that he's still working on tracking down the root cause here. Comment 1 is still valid.

Could we get a progress update please? Thanks

Keywords: topcrash

Andrew Sutherland told me he's going to have a patch for this up for review Soon^TM

Any news of a patch? We are building beta 14, so that leaves only 2 betas before RC week. Thanks

Flags: needinfo?(overholt)

Formulating a test to reproduce this has been more involved than expected, should have it and the fix soon. (This is a serious invariant violation and we really are lacking test coverage in this area. Our previous ability to trigger the bug was due to ServiceWorker remoted WorkerPrivate logic accidentally choosing the wrong process, which is not a realistic means of triggering the failure, especially because we don't want to enable that specific scenario ever.)

Flags: needinfo?(overholt)
Flags: needinfo?(bugmail)

Andrew, we are building beta 15 today, the last beta is on Friday and then this is RC week. Will we get a patch to at least mitigate some of crashes before we ship 67? Thanks

Flags: needinfo?(bugmail)
Assignee

Updated

19 days ago
Assignee: bugmail → perry
Assignee

Updated

19 days ago
Flags: needinfo?(bugmail)
Assignee

Comment 12

19 days ago

Comment on attachment 9062293 [details]
Bug 1535699 - Propagate redirect mode across cross-process redirects. r?asuth

Beta/Release Uplift Approval Request

  • User impact if declined: Intermittent crashes.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Medium
  • Why is the change risky/not risky? (and alternatives if risky): Possibly risky because it touches network request logic, but the changes are correctly propagating requests' state (to my understanding).
  • String changes made/needed:
Attachment #9062293 - Flags: approval-mozilla-beta?

Comment on attachment 9062293 [details]
Bug 1535699 - Propagate redirect mode across cross-process redirects. r?asuth

Patch is r+, the try build looks good so far and it fixes our top crasher for 67. It's not landed yet on mozilla-central but given that we need to go to build beta 16 now and that this is our last beta build that would allow us to evaluate the impact on crashes, I am taking it for uplift. Thanks!

Attachment #9062293 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Assignee

Updated

19 days ago
Keywords: checkin-needed

Comment 14

19 days ago

Pushed by bugmail@asutherland.org:
https://hg.mozilla.org/integration/autoland/rev/e12876ee3835
Propagate redirect mode across cross-process redirects. r=asuth

Keywords: checkin-needed
Assignee

Comment 15

18 days ago

Beta/Release Uplift Approval Request

  • User impact if declined: Intermittent crashes.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: Yes
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Medium
  • Why is the change risky/not risky? (and alternatives if risky): Possibly risky because it touches network request logic, but the changes are correctly propagating requests' state (to my understanding).
  • String changes made/needed:
Attachment #9062368 - Flags: approval-mozilla-beta?
Comment on attachment 9062368 [details] [diff] [review]
beta-patch.diff

Approved updated patch applying to beta branch correctly.
Attachment #9062368 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Comment on attachment 9062293 [details]
Bug 1535699 - Propagate redirect mode across cross-process redirects. r?asuth

unsetting approval on older uplift patch

Attachment #9062293 - Flags: approval-mozilla-beta+ → approval-mozilla-beta-

Comment 19

18 days ago
bugherder
Status: NEW → RESOLVED
Last Resolved: 18 days ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68

Comment 20

18 days ago

crash reports for this signature are continuing to come in in the early stages of b16 - the crash stack and crash reason seems to have remained the same like they were before the patch.

Flags: needinfo?(perry)
Assignee

Comment 21

15 days ago

(In reply to [:philipp] from comment #20)

crash reports for this signature are continuing to come in in the early stages of b16 - the crash stack and crash reason seems to have remained the same like they were before the patch.

I think we'll need to add some diagnostic assertions. The cause of the crash is likely in another process, and the reported stack isn't providing enough information. I'm working on this now and keeping the needinfo open until then.

Clearly not fixed, reopening.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Duplicate of this bug: 1547524

(Adjusting the subject because there are actually 2 problems going on, one of which we fixed, but which are very similar in nature.)

Restating my current understanding of what's happening:

  • :valentin is pursuing a fix that corrects a bug in nsHttpChannel where currently cache redirects are processed before performing process swap. https://treeherder.mozilla.org/#/jobs?repo=try&revision=219297ea3a0e045122f6abc668a14ceca9d8d2fd is the most recent fix attempt for this, but there was an assertion failure at https://treeherder.mozilla.org/#/jobs?repo=try&revision=219297ea3a0e045122f6abc668a14ceca9d8d2fd&selectedJob=245570125
    • This is the most correct thing, as the ServiceWorker child intercept logic and ClientChannelHelper logic really can't handle where the process swap is happening right now.
    • The fix will likely need to bake on trunk for a little bit before we'd consider uplift.
  • :perry is pursuing adapting the test from this bug to also be a test for this specific manifestation of the nsHttpChannel problem here.
  • The stop-gap solution is to disable the "browser.tabs.remote.useHTTPResponseProcessSelection" preference which was enabled in bug 1528360 initially targeted at Firefox 67.
    • By disabling this preference, cross process swaps will use a less-fancy mechanism based on sessionstore and which means that POST requests that get process swapped will lose their data and break.
    • We can turn this back on in the Necko dot release when we uplift valentin's fix.
    • I'll prepare the review request for this now. (Which is just flipping the pref and adding a comment.)
Summary: Properly clear the controller (i.e. Service Worker) on a load info for cross-origin redirect channels → Properly clear the client (i.e. window) and controller (i.e. Service Worker) on a load info when performing a "process switch"/"process swap"

Comment on attachment 9064111 [details]
Bug 1535699 - Disable httpResponseProcessSelection on Fx67 for now. r=nika a=pascalc

Beta/Release Uplift Approval Request

  • User impact if declined: Crashes in the situation where:
  • An HTTP(S) link is being opened in a non-"web" process. Usually this would mean a "file" process, a WebExtension "extension" process, or potentially the about:newtab fancy "privileged" process type.
  • The link being opened is to a cached redirect.
  • The target of the redirect matches an existing ServiceWorker registration.

Note that :perry is working on refining an automated test for this situation.

Also note that this has already been discussed with :nika and :pascalc. I've asked :nika for review as a sanity-checking step.

Also also note: Although I prepared this patch locally against beta, the phabricator UI seems to suggest it's against mozilla-central and the UI also seems to suggest lando will land it against the wrong place. Our phabricator and lando docs don't seem to cover this. I'll file a bug now, because it seems like they should.

  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: Yes
  • If yes, steps to reproduce: So, I've manually verified the pref flip avoids the crash scenario, but if others want to reproduce, see https://bugzilla.mozilla.org/show_bug.cgi?id=1547524#c0. I've uploaded the provided HTML snippet from there as an attachment, but per the STR, the file must be saved to disk and opened via file URI. So, like, on my system it might be file:///tmp/save-me-open-me-via-file-protocol.html
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This is un-flipping a pref flip that has not yet shipped to release. (Bug 1528360 was that pref flip, currently riding the trains to a Fx67 release if we don't stop it.)
  • String changes made/needed: no
Attachment #9064111 - Flags: approval-mozilla-beta?
Flags: qe-verify+

(I filed bug 1550847 for the lando docs stuff.)

Comment on attachment 9064111 [details]
Bug 1535699 - Disable httpResponseProcessSelection on Fx67 for now. r=nika a=pascalc

Approved for our beta branch before our first merge, thanks!

Attachment #9064111 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Assignee

Updated

11 days ago
Flags: needinfo?(perry)
Comment on attachment 9062368 [details] [diff] [review]
beta-patch.diff

Clearing approval flags to get this off the uplift queries.
Attachment #9062368 - Flags: approval-mozilla-beta+
Attachment #9064111 - Flags: approval-mozilla-beta+

Comment on attachment 9064111 [details]
Bug 1535699 - Disable httpResponseProcessSelection on Fx67 for now. r=nika a=pascalc

Beta/Release Uplift Approval Request

Flags: needinfo?(bugmail)
Attachment #9064111 - Flags: approval-mozilla-release?

Comment on attachment 9064111 [details]
Bug 1535699 - Disable httpResponseProcessSelection on Fx67 for now. r=nika a=pascalc

Approved for 67 RC, thanks

Attachment #9064111 - Flags: approval-mozilla-release? → approval-mozilla-release+
Whiteboard: [qa-triaged]

Comment 37

6 days ago

Verified as fixed on Firefox 67.0 on Windows 10 x 64, Windows 7 x32, Mac OS X 10.14 and on Ubuntu 18.04 x64 following the steps from bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1547524#c0.

Comment 38

4 days ago

Build ID: 20190516215225
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0

Verified as fixed on Firefox 67.0 RC build2 on Windows 10 x 64, Windows 7 x32, Mac OS X 10.14 and on Ubuntu 18.04 x64.

You need to log in before you can comment on or make changes to this bug.