Reset interception (with ServiceWorker bypass) on ServiceWorker interception failures
Categories
(Core :: DOM: Service Workers, defect, P2)
People
(Reporter: marcosc, Assigned: edenchuang)
References
(Blocks 1 open bug)
Details
Attachments
(7 files)
- 505.40 KB, image/png
- 9.49 MB, video/mp4
- 5.27 KB, patch
- 48 bytes, text/x-phabricator-request (pascalc: approval-mozilla-beta+)
- 48 bytes, text/x-phabricator-request (pascalc: approval-mozilla-beta+)
- 48 bytes, text/x-phabricator-request (pascalc: approval-mozilla-beta+)
- 48 bytes, text/x-phabricator-request (pascalc: approval-mozilla-beta+)
Comment 16•6 years ago
I'm not sure if the SW has to be implemented out of spec to trigger the problem.
If I've understood correctly, interactions between protective add-ons like EFF's Privacy Badger and bug 1499523 may be triggering NS_ERROR_CORRUPTED_CONTENT errors too.
In that case, navigating to a site from a search engine produces the "Corrupted Content" screen of death, but navigating to the site by opening an empty new tab and pasting in the URL does work. Pasting the same URL in the existing tab fails.
Unfortunately, even experienced users are unlikely to think of doing that. When I reported this to some people, they assumed I didn't know what I was talking about, because everyone expects that pasting a URL into a tab has the same behaviour regardless of whether the tab has been used before, and that clicking a cross-site link also has the same behaviour, if there is no scripting or referrer funny business.
Even if the SW code is out of spec, I think never reaching the 24hr cleanup is quite serious because it permanently breaks a site, which the site provider can't do anything about after the client is "infected". Even if the general problem is difficult to fix, perhaps enforcing the 24hr cleanup should have priority?
Comment 17•6 years ago
Reloading the page by bypassing the cache always works whenever I encounter this issue, which seems to be more frequent recently. Why doesn't "Try Again" just do the same thing that "Ctrl+Shift+R" (Ctrl+F5) does? It would certainly make for a better user experience.
Comment 18•6 years ago
(In reply to Steeve McCauley from comment #17)
Reloading the page by by-passing cache always works whenever I encounter this issue, which seems to be more frequent recently. Why don't "Try Again" just do the same thing that "Ctrl+Shift+R" (Ctrl+F5) does? It would certainly make for a better user experience.
Sounds like a possible workaround, :asuth?
Comment 19•6 years ago
(In reply to Steeve McCauley from comment #17)
Reloading the page by by-passing cache always works whenever I encounter this issue, which seems to be more frequent recently. Why don't "Try Again" just do the same thing that "Ctrl+Shift+R" (Ctrl+F5) does? It would certainly make for a better user experience.
Yes, if we're detecting corrupted content, the 2 steps would be:
- Trigger a reload, bypassing serviceworkers
- Do the marking to always bypass until updated.
We can definitely start with just responding to corrupted content with the automatic bypass-reload, which is simpler than the 2nd part.
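The two-step recovery described in this comment can be sketched as a small state machine. Everything below is an illustrative assumption — the function names, the `state` shape, and the flag value are invented and are not real Gecko APIs:

```javascript
// Illustrative sketch of the two-step plan; all names and the flag value
// here are assumptions, not actual Gecko APIs.
const LOAD_BYPASS_SERVICE_WORKER = 1 << 0; // hypothetical flag bit

// On corrupted content: (1) reload with the bypass flag, and (2) mark the
// scope so later loads also bypass until the worker is updated.
function handleCorruptedContent(scope, state) {
  state.bypassUntilUpdated.add(scope);
  return { action: "reload", flags: LOAD_BYPASS_SERVICE_WORKER };
}

// Subsequent navigations consult the marking.
function loadFlagsFor(scope, state) {
  return state.bypassUntilUpdated.has(scope) ? LOAD_BYPASS_SERVICE_WORKER : 0;
}

// A successful ServiceWorker update clears the marking.
function onWorkerUpdated(scope, state) {
  state.bypassUntilUpdated.delete(scope);
}
```

The key design point is that the second step is what prevents a reload loop: the first bypass-reload alone would leave the broken worker in control of the next navigation.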
Comment 20•4 years ago
The revised plan for a mitigation fix that hopefully ensures we avoid infinite loops where the pages want to upgrade themselves to using (broken) ServiceWorkers is:
- Due to issues of complexity in implementation and testing in other WIPs in this problem space that involve overhauling the ServiceWorkerRegistrar storage, all state tracking for this patch series is in-memory only. Subsequent patches will layer on top of these changes.
- Whenever we encounter a ServiceWorker fault, we will either reset interception (if we haven't delivered an OnStartRequest yet, as is the case in the recent NS_ERROR_DOM_ABORT_ERR errors) which implicitly redirects with LOAD_BYPASS_SERVICE_WORKER or reload the document (if we have delivered an OnStartRequest and therefore are looking at corrupted content problems) with such flags. In both cases the page will not technically be controlled (as opposed to the no-fetch optimization where the page is technically controlled). We will also perform state tracking that potentially triggers heuristics as described in the next bullet.
- These are the expected places where UX-breaking scenarios can be experienced and how we will handle them:
FetchEventOpChild::CancelInterception
: This is the mechanism by which the current NS_ERROR_DOM_ABORT_ERR errors are reported. This covers all of our current known failure modes (which have changed since the landing of parent intercept, as there's now a more clear-cut life-cycle with explicit IPC actors versus everything being mixed into HttpChannelChild). We'll treat this like ResetInterception was called.InterceptedHttpChannel::OnStopRequest
: The nsInputStreamPump will call this method in the event it is trying to read from the stream and Available() returns an error code, with the pump propagating the error. We're not going to handle this initially, but in general we'd expect the SW misbehavior case (as opposed to crashing) to involve JS streams, which are trickier in other ways too, like that they'd like stall forever rather than abort. Expected errors:- Process crash: NS_ERROR_ABORT because that's what IPCStreamDestination::ActorDestroyed invokes.
- Anything else: The IPCStream will just propagate its closure error code.
- This helps provide the following mitigations:
- The user gets to see the page from the network/HTTP cache instead of a blank page. The meta-concern here is that the page could then try to install a ServiceWorker and pivot to it via navigation or Clients.claim(), but the ServiceWorker could still be broken.
- By ensuring the page isn't controlled, this helps ensure that install/update jobs will be able to make it all the way to the activated state without getting stuck in waiting, which is the complicated (persistence-involving) problem in bug 1668743.
- Additional state to be tracked:
- Each ServiceWorkerInfo will track a fault count.
- When the fault count reaches a configurable threshold, we will uninstall the registration and consider clearing the origin.
- Uninstalling the registration acts as a mitigation for:
- ServiceWorkers with broken logic, for example the Walmart ServiceWorker that attempted to use a dynamic importScripts based on the date, which would never work after installation.
- Lifecycle activation problems in bug 1668743 where the presence of the controlled page (despite the lack of a successful load!) creates a potentially recurring problem. This is actually the same thing we hope that the bypass above will address, which is why we won't take this action immediately, because this should actually be something we can avoid. However, in the event the bypasses aren't sufficient, removing the registration entirely will avoid the possibility of there being a SW getting stuck in the waiting state.
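The fault-count bookkeeping on ServiceWorkerInfo might look like the following sketch. The plan says the threshold is configurable; the value of 3 and all names here are invented for illustration:

```javascript
// Sketch of per-ServiceWorkerInfo fault counting. The threshold is
// "configurable" per the plan; 3 here is purely an assumption.
const FAULT_THRESHOLD = 3;

// Records one fault and returns the escalation actions to take, if any.
function recordFault(info) {
  info.faultCount = (info.faultCount || 0) + 1;
  if (info.faultCount >= FAULT_THRESHOLD) {
    return ["uninstall-registration", "consider-clear-origin"];
  }
  return []; // below threshold: only the per-load bypass mitigation applies
}
```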
- Each RegistrationDataPerPrincipal will track whether we have considered clearing the origin for reasons of quota usage.
- When we consider clearing the origin for reasons of quota usage, we:
- Do nothing if we've already initiated a request for this as tracked by our state. There are costs to this and we want to avoid any pathological loops. We set the flag before advancing to the next steps.
- Ask nsIQuotaManagerService.getUsageForPrincipal and nsIQuotaManagerService.estimate to tell us 1) the origin usage, 2) the group usage, and 3) the group limit. Because the group is how QM currently bills things, we potentially need to clear the group, not just the origin.
- In the event the group usage has less than H KiB of headroom, we will presume that the ServiceWorker is running up against its quota limits, as experienced with Slack in bug 1678795, and we will mitigate by clearing QuotaManager data for at least the origin. (This is a simplified version of the bumping-up-against-the-quota-limit detection heuristics I previously proposed, which would involve some new QuotaManager development and API surface.)
- The question is then whether to clear the whole group via deleteDataFromHost or from just the origin via deleteDataFromPrincipal. Either way, we will only clear the QM storage by specifying flags of CLEAR_DOM_QUOTA.
- Our decision here is to subtract the origin usage off of the group usage and to see whether that would be enough to make our H KiB of headroom criteria happy. If it's not, then we will clear the group. I was previously thinking about also having some concept of percentage of the group limit as a heuristic, but the reality is that right now Slack is our worst-case and we want to make sure there's at least 400 MiB based on our current investigations. In the future the bump-against-headroom mechanism can be more clever about seeing how much space we had to start with, etc.
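The origin-versus-group decision above can be sketched as a pure function. The 400 MiB figure comes from the Slack investigation mentioned in the plan; the function name and parameter shapes are invented:

```javascript
// Sketch of the quota-clearing heuristic; the function name is invented.
// All sizes are in bytes. The required headroom (~400 MiB) is from the
// worst-case Slack investigation described in the plan.
const HEADROOM_BYTES = 400 * 1024 * 1024;

// Returns "none", "origin" (deleteDataFromPrincipal), or "group"
// (deleteDataFromHost); either clearing would use CLEAR_DOM_QUOTA flags.
function quotaMitigation(originUsage, groupUsage, groupLimit, alreadyCleared) {
  if (alreadyCleared) return "none"; // avoid pathological clearing loops
  if (groupLimit - groupUsage >= HEADROOM_BYTES) return "none"; // not quota-bound
  // Would clearing only the origin restore enough group headroom?
  const headroomAfterOrigin = groupLimit - (groupUsage - originUsage);
  return headroomAfterOrigin >= HEADROOM_BYTES ? "origin" : "group";
}
```

The design choice here matches the text: clear the smallest scope (the origin) when that alone satisfies the headroom criterion, and only fall back to clearing the whole group when it would not.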
Patch stack for this is going to look something like:
- Fault injection testing mechanism w/test that has a SW that wants to be bypassed.
- Change in InterceptedHttpChannel to reset interception, making the above test pass. This is a band-aid on its own.
- State tracking of faults on ServiceWorkerInfo with registration removal.
- State tracking of origin clearing on RegistrationDataPerPrincipal with origin clearing behavior.
- Telemetry on faults and mitigation actions.
Comment 21•4 years ago
This test was disabled for parent-intercept by bug 1470266 when parent
intercept was still in-progress. Since the landing of worker remoting
and related cleanups, I think we're generally good here and I was able
to locally get this to pass under "--verify". Also, this test is
actively relevant to the fixes we're making here on bug 1503072.
Comment 22•4 years ago
Depends on D111844
Comment 23•4 years ago
:Eden, needinfo-ing you to make sure you're pointed at the plan for the fix in comment 20 and have had a chance to look at the test WIP at https://bugzilla.mozilla.org/attachment.cgi?id=9215466 before I put up (most of, hopefully) the stack later today after end-of-business in Berlin for hopeful review tomorrow.
Comment 24•4 years ago
ServiceWorkerCleanUp's host based cleaning has not correctly been
checking whether the provided host is a root domain of the ServiceWorker
registration, instead performing an exact match (which is what
removeFromPrincipal does).
This check makes the logic consistent with other deleteByHost
implementations and is actually using the same check logic.
Depends on D111844
Comment 25•4 years ago
Depends on D111845
Comment 26•4 years ago
Testing and mitigations
In https://phabricator.services.mozilla.com/D111993 (not yet landed/fully reviewed) I had speculatively disabled the mitigations for all service-workers web platform tests (WPTs) as a least-bad option versus experiencing false success runs due to the mitigations. I took this action after encountering a failure in /service-workers/service-worker/update-recovery.https.html, because it expects a broken ServiceWorker to stay broken and not be bypassed. :valentin appropriately brought this up in review, and after going through the stages of hand-waving/rationalizing, my revised plan here is:
- Add a boolean preference dom.serviceWorkers.mitigations.testing which indicates that our test explicitly expects to depend on mitigations.
- When running tests (and/or in DEBUG builds?), in a case where we would fire any of our mitigations, if the above testing pref isn't set, then we hard crash with a message that explicitly indicates what's going on. The rationale is that:
- Tests should not be depending on these mitigations unless they intend to.
- We want it to be very obvious when a mitigation is activated by a test. We don't want to have a newly synchronized WPT be disabled because of mysterious failures and then have to spend several hours realizing the mitigation got involved.
- Switch to only disabling mitigations on update-recovery.https.html. This means all other WPTs will run with mitigations enabled, but they will trigger an immediate crash if they are used. The crash/assertion should make it immediately clear in any filed bugs what is going on.
I'm going to do this in a separate patch that sits immediately on top of https://phabricator.services.mozilla.com/D111993.
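The gating policy in this plan could look roughly like the following. The pref name is taken from the comment; the `prefs` map, `isTestingRun` flag, and modeling the hard crash as a thrown error are illustrative stand-ins:

```javascript
// Sketch of the test-gating policy. The pref name is from the plan above;
// `prefs`, `isTestingRun`, and the crash-as-throw are invented stand-ins.
function fireMitigationGated(prefs, isTestingRun, fireMitigation) {
  if (isTestingRun && !prefs["dom.serviceWorkers.mitigations.testing"]) {
    // Hard "crash" with an explicit message so a test can never silently
    // pass because a mitigation papered over a real failure.
    throw new Error(
      "ServiceWorker mitigation fired during a test without " +
        "dom.serviceWorkers.mitigations.testing set");
  }
  return fireMitigation();
}
```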
Comment 27•4 years ago
Recent status updates:
- There was a problem with the new test being flaky: a typo where I used waitForStop instead of waitForStateStop, which was addressed on Friday.
- There's another test intermittent that I'm still digging into where the bypassed page still appears to be marked as controlled when it should not be. Given that our clearing of the controller from the load info in the parent on redirect should be quite deterministic, this would suggest something in the interaction with DocumentChannel/ClientChannelHelper that I'm also going to look at statically.
In terms of landing, the original intent for the core mitigation was to land this on Thursday despite the code freeze (given the severity of the impact of the underlying issues), and then it was likely still viable on Friday during the code freeze, but it's definitely not appropriate to land this today on the last day of the code freeze. So the plan is to land this mitigation as soon as both of these are true: the code freeze ends after beta branching, and the still-controlled problem is fixed. (Having the page bypassed but still controlled could result in the core HTML of a document being loaded from the network, but every single one of its subresources being broken, which is not a good result.)
I'll then request uplift for the set of mitigations that have landed and baked as they land/bake.
Comment 28•4 years ago
Re-assigning to Eden to:
- Identify why the page is sometimes still being marked as controlled and stop this from happening.
- This primarily occurs under test-verify mode, but some try pushes have had it happen under the normal run as well. This continued to happen even after Eden's recent landing of the fixes in bug 1584007.
- This failure mode is critical to address because if the page still ends up marked as controlled after the bypass attempt, it's very possible that every single subresource load will break.
- Implement the additional layers of data clearing mitigations discussed in comment 20 after addressing and landing the above (which is the current set of reviewed patches). Specifically:
- Speculative patch 3 discussed at the bottom of comment 20 to remove the registration. (The test was written with the expectation of this enhancement, but will need changes for this.)
- Speculative patch 4 discussed at the bottom of comment 20 to clear the origin. (The test was written with the expectation of this enhancement, but will need changes for this.)
I'm going to refresh the patches via moz-phab now to make sure that they are up-to-date as of my last set of try run pushes. Rebasing will likely need to occur in response to child-intercept removal changes. The try pushes with the commit stack at https://hg.mozilla.org/try/rev/1e9ed7fc1e4f570dad52d282b163182f9ee7bb3c were:
Comment 30•4 years ago
Backed out for causing bc failures on browser_navigation_fetch_fault_handling.js.
Assignee
Comment 31•4 years ago
Patch D111992 was not pushed together with the rest of the stack, which caused the test failure.
Revised the patch dependency and pushed again.
Assignee
Comment 33•4 years ago
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #28)
Re-assigning to Eden to:
- Identify why the page is sometimes still being marked as controlled and stop this from happening.
- This primarily occurs under test-verify mode, but some try pushes have had it happen under the normal run as well. This continued to happen even after Eden's recent landing of the fixes in bug 1584007.
- This failure mode is critical to address because if the page still ends up marked as controlled after the bypass attempt, it's very possible that every single subresource load will break.
The root cause is that the client controller setting may propagate later than nsGlobalWindowInner::EnsureClientSource() runs.
When interception starts, ServiceWorkerManager::StartControllingClient(loadInfo::ReservedInfo) is called in ServiceWorkerManager::DispatchFetchEvent() to make sure the reserved ClientSource will be controlled once it is created.
In this test case, we expect the load to be reset as not controlled. To achieve that, nsILoadInfo::ClearController() is called on the redirected channel. When nsGlobalWindowInner::EnsureClientSource() executes, it decides whether a new ClientSource should be created by comparing the controller settings of the existing reserved client source and the channel's loadInfo.
However, propagating the ClientSource controller setting involves IPC and thread switching, so the propagation might happen after nsGlobalWindowInner::EnsureClientSource(). In that case, the ClientSource ends up unexpectedly controlled.
A possible fix is to also reset the ReservedClientInfo during the channel redirection. Since we already know the original ReservedClientSource will be controlled in this situation, we can simply drop the original ReservedClientInfo during the redirection so it never gets connected to the nsGlobalWindowInner.
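The race and the proposed fix can be modeled with a toy sketch. Every name below is invented; the real logic lives in the channel redirect path and nsGlobalWindowInner::EnsureClientSource():

```javascript
// Toy model of the race/fix; all names here are invented stand-ins.
// Before the fix, only the controller was cleared on redirect, so the stale
// reserved ClientSource could still be marked controlled by late IPC.
function onChannelRedirect(loadInfo) {
  loadInfo.controller = null;          // pre-existing: ClearController()
  loadInfo.reservedClientInfo = null;  // the fix: drop the reserved client
}

// Models EnsureClientSource(): with no reserved client info, a fresh,
// never-controlled ClientSource is created, so late controller propagation
// to the old reserved client can no longer reach the window.
function ensureClientSource(loadInfo) {
  if (!loadInfo.reservedClientInfo) {
    return { controlled: false, fresh: true };
  }
  return { controlled: loadInfo.reservedClientInfo.controlled, fresh: false };
}
```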
Comment 34•4 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/c148099d56d7
https://hg.mozilla.org/mozilla-central/rev/c158c8a31e05
https://hg.mozilla.org/mozilla-central/rev/2d6f6e0ee8cd
https://hg.mozilla.org/mozilla-central/rev/3ffffa5d8387
Comment 35•4 years ago
Comment on attachment 9215688 [details]
Bug 1503072 - Add mitigation to bypass SW on navigation fault. r=#dom-workers
Beta/Release Uplift Approval Request
- User impact if declined: Potentially broken sites as experienced in bug 1678795 and the related family of bugs.
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): This has baked for a full week on nightly and the patches were written in such a way that they only come into play on errors that caused total breakage. The automated test was also able to successfully help us reproduce the worst-case scenario error we were concerned about (which unfortunately delayed being able to land this patch).
- String changes made/needed: n/a
Comment 36•4 years ago
Comment on attachment 9215688 [details]
Bug 1503072 - Add mitigation to bypass SW on navigation fault. r=#dom-workers
Those are sizeable patches, but given your evaluation of low risk and the fact that they mitigate cryptic errors on major sites like Gmail and Slack that we have been suffering from for many cycles, let's take it in 91 beta 5.
Comment 37•4 years ago
bugherder uplift
https://hg.mozilla.org/releases/mozilla-beta/rev/fb2c219ddeb3
https://hg.mozilla.org/releases/mozilla-beta/rev/f75a07e4beb9
https://hg.mozilla.org/releases/mozilla-beta/rev/d76343839ad7
https://hg.mozilla.org/releases/mozilla-beta/rev/b105e48a2396