Closed Bug 1503072 Opened 7 years ago Closed 4 years ago

Reset interception (with ServiceWorker bypass) on ServiceWorker interception failures

Categories

(Core :: DOM: Service Workers, defect, P2)


Tracking


RESOLVED FIXED
92 Branch
Tracking Status
firefox91 --- fixed
firefox92 --- fixed

People

(Reporter: marcosc, Assigned: edenchuang)

References

(Blocks 1 open bug)

Details

Attachments

(7 files)

Attached image PNG image.png
When the "corrupted content error" is displayed to an end user, I wonder if we can be a bit more helpful with recovery? Right now, the only option is "Try Again"... but I wonder if we can do better, like navigating to the canonical URL and bypassing the service worker? This is currently affecting one of Australia's largest retailers (Harvey Norman) in Firefox only. I'm currently coordinating with them and they are trying to reproduce. Naturally, they are concerned that Firefox users are seeing this error and are not able to recover. I'll note that the "gray screen of death" is akin to XML/XHTML's "yellow screen of death" - that's not great for the Web, so we should find a general solution for that.
Priority: -- → P3
This was kindly prepared by the folks at HM to show that "Try Again" doesn't work. Only "hard refresh" works, but most users don't know what that is.
(In reply to Marcos Caceres [:marcosc] from comment #1)
> Created attachment 9020983 [details]
> Video showing user experience of "try again"...
> This was kindly prepared by the folks at HM to show that "Try Again" doesn't work.
> Only "hard refresh" works, but most users don't know what that is.

Thanks for posting the bug, Marcos. Really appreciate it. Just to add a few more details: we've rolled out a temporary fix on Harvey Norman, Domayne and Joyce Mayne that disables our Workbox implementation while we refactor.

The main issue that I'm hoping we can address through this ticket is the way that Firefox recovers from a "Corrupted Content Error" page. As per the attached video, the only way for a user to recover from a "Corrupted Content Error" is by doing a hard refresh. In a lot of cases, a Firefox user will not know about hard refresh and will be stuck on the "Corrupted Content Error" page.

What exacerbates the problem is that, in our case at least, the broken service worker that caused the initial issue didn't go away after the normal 24-hour period that you'd expect from a service worker. The attached video shows a user going back to Harvey Norman, with a broken service worker installed, after the initial visit several weeks back (initial visit around Oct 16th; video recorded on 30th October). Hitting the "Try Again" button doesn't do anything.

As a side note, I wonder whether you're tracking "Corrupted Content Error" instances. I wonder whether there's a number of users out there stuck on the "Grey Screen of Death".
Yes, Marcos, I definitely agree that the error page is bad UX and Firefox should do more to help sites recover from this problem, as this is a case where there was no practical action for a user to take to correct the problem but the user intent is for the site to work. We could certainly trigger SW bypass; a question is whether we should also offer to clear the origin's storage. (It might also be possible to just clear the registration as an intermediate step, since the SW semantics don't have an explicit uninstall step.)

I would also like to better understand the lower-level details of what happened here so that we can understand if there are specific bugs in Firefox's implementation where we did not conform to the spec, if the spec's update procedures are still inclined to cause problems, or if the Firefox handling of corrupted content just needs to take more aggressive data-clearing actions.

Based on Marcos' image, here's what I understand/theorize to be happening:

  • The ServiceWorker attempts a dynamic import of a script that was not installed during the "install" phase of the ServiceWorker, as described in the very similar Workbox issue at https://github.com/GoogleChrome/workbox/issues/1740
    • Firefox began enforcing this in 61 in bug 1448141.
    • Firefox refined its handling in 62 in bug 1465670.
  • The Soft Update algorithm triggered by functional event dispatch did not correct the problem. There could be a lot of factors here.

I'd like to better understand how the ServiceWorker registration was configured and what HTTP cache directives the SW was initially served with, as well as what the cache directives were after the SW was changed. Specifically:

  • What was the registration's updateViaCache value set to?
  • What Cache-Control and related headers were used before/after?
  • Did the contents of the SW top-level script and/or dependent scripts change? (That is, the update algorithm does byte-comparisons to determine if things changed.)
  • Were there possibly any script errors in any updated scripts? Firefox is currently good about reporting errors in the devtools console when things happen, but attempting to attach the debugger is problematic, so the console errors would be the useful thing here.
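For reference, updateViaCache (default "imports") governs how the spec's Update algorithm interacts with the HTTP cache. The sketch below is a simplified, illustrative rendering of that decision, not Gecko's actual code; the function name and parameter shape are invented for this example:

```javascript
// Sketch of the spec's cache-mode decision for service worker update fetches.
// Simplified: the real algorithm also handles null last-check times, etc.
function cacheModeForUpdate({ updateViaCache, isMainScript, secondsSinceLastUpdateCheck }) {
  // A registration not checked for more than 24 hours always hits the network.
  if (secondsSinceLastUpdateCheck > 86400) return "no-cache";
  if (isMainScript) {
    // Only updateViaCache: "all" lets the main script come from the HTTP cache.
    return updateViaCache === "all" ? "default" : "no-cache";
  }
  // Imported scripts honor the HTTP cache unless updateViaCache is "none".
  return updateViaCache === "none" ? "no-cache" : "default";
}
```

Under this model, a registration that never set updateViaCache (so "imports") still re-fetches its main script from the network on every update check, but its importScripts() dependencies can be served from the HTTP cache — which matters with a cache-control: max-age=604800 header like the one above.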
Assignee: nobody → perry
Priority: P3 → P2
Hey Andrew,

Sorry for the delay, and thanks for replying. You're right about the core issue being the way Workbox dynamically imports its dependencies. When we initially deployed our Workbox implementation, pre Firefox 60, we weren't aware of the upcoming changes to the way dynamic Workbox imports are handled, so it was a bit of a surprise when we ran into issues.

In regards to your questions:

  • What was the registration's updateViaCache value set to?
We weren't aware of updateViaCache when we were building our SW/Workbox implementation, so we didn't use it.

  • What Cache-Control and related headers were used before/after?
I'll have to confirm the exact details, but as far as I know, we originally disabled caching on our service worker bundle so that changes would be picked up straight away, and we haven't changed the Cache-Control since we started using our service worker. When a browser requests a service worker bundle, we skip caching and return the latest deployed service worker bundle. The response headers for our service worker bundle currently look like this:

accept-ranges: bytes
access-control-allow-origin: *
cache-control: max-age=604800
content-encoding: gzip
content-length: 2414
content-type: application/javascript
date: Wed, 07 Nov 2018 21:49:03 GMT
expires: Wed, 14 Nov 2018 21:49:03 GMT
last-modified: Wed, 07 Nov 2018 19:02:13 GMT
service-worker-allowed: /
status: 200
strict-transport-security: max-age=63072000; includeSubDomains
vary: Accept-Encoding,User-Agent
x-cache: MISS-w
x-cdn: Incapsula
x-iinfo: 10-4290040-4289774 PNNN RT(1541627343011 0) q(0 0 0 -1) r(0 0) U5

  • Did the contents of the SW top-level script and/or dependent scripts change?
When we ran into the "Corrupted Content" error, we first tried using our GTM service worker toggle. Then we tried various versions of an actual service worker bundle, including one that would unregister any/all service workers. What we found was that when Firefox users would hit the "Corrupted Content" error page, Firefox wouldn't go far enough in the page-load process to pick up any changes to either the HTML page and/or the service worker bundle. So no matter what we tried on our end to fix the broken service worker, it wouldn't work, as the browser was stuck on the "Corrupted Content" error page.

  • Were there possibly any script errors in any updated scripts?
There was an issue with the service worker to begin with, and that issue was the way that we were using Workbox dependencies. What we expected to happen was that when we deployed a new service worker bundle with fixed code (e.g. Workbox removed) into production, the new bundle would replace the old one, fixing the "Corrupted Content" error page.

Thanks for the help Andrew/Marcos.
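A failsafe "unregister any/all service workers" bundle of the kind described above might look like the sketch below. This is illustrative only (not the actual Harvey Norman worker); the global scope is passed in as a parameter purely so the logic can be exercised outside a real ServiceWorkerGlobalScope, where one would simply call installKillSwitch(self):

```javascript
// Hypothetical kill-switch service worker: unregisters its own registration
// and re-navigates controlled windows so they reload uncontrolled from the
// network. `sw` stands in for the ServiceWorkerGlobalScope (`self`).
function installKillSwitch(sw) {
  sw.addEventListener("install", (event) => {
    // Skip the waiting phase so the broken active worker is replaced ASAP.
    event.waitUntil(sw.skipWaiting());
  });
  sw.addEventListener("activate", (event) => {
    event.waitUntil(
      sw.registration
        .unregister()
        .then(() => sw.clients.matchAll({ type: "window" }))
        // Re-navigating each client detaches it from the removed registration.
        .then((clients) => Promise.all(clients.map((c) => c.navigate(c.url))))
    );
  });
}
```

Of course, as the comment above notes, this only helps if the browser actually fetches the new worker script — which was exactly what wasn't happening from the "Corrupted Content" error page.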
Andrew, do you think we could put out some kind of telemetry probe to find out how often users see this page? That might be really useful data. Would that be something we ask the data folks to set up for us?
(In reply to Marcos Caceres [:marcosc] (triage duty) from comment #5)
> Andrew, do you think we could put out some kind of telemetry probe to find out how often users see this page? That might be really useful data. Would that be something we ask the data folks to set up for us?

The data folks would help out with the data review, but we add any telemetry probes ourselves. The docs at https://bugzilla.mozilla.org/show_bug.cgi?id=1503072 are the starting point, and :ttung has been having a lot of fun adding Telemetry probes to QuotaManager/IndexedDB recently!

I would naively have thought that we already gather error-page counts elsewhere in the product, but I'm not seeing any obvious probes in the probe dictionary at https://telemetry.mozilla.org/probe-dictionary/. We would probably add a specialized probe just for ServiceWorkers, since it seems like the error can come from other places too per https://searchfox.org/mozilla-central/search?q=symbol:_ZL26NS_ERROR_CORRUPTED_CONTENT&redirect=false
Marcin, thank you for the detailed answers to my questions. Your responses are invaluable and will allow us to move forward on this. If it would be possible to share the unregistering failsafe ServiceWorker, it would be very useful as we try and prepare a realistic unit test. If you want to share it privately rather than attaching it to the bug, you can reach us at asuth@mozilla.com and perry@mozilla.com and we can try and distill the relevant algorithmic details. Thanks!
Hey Andrew,

No worries, more than happy to help. I'll see whether I can come up with some code, but in the meantime, here's what I'd try:

1. Get Firefox into a "Corrupted Content" error state by, say, using a Workbox dependency outside of the install process. (There's probably a better, more reproducible way of doing this.)
2. Once in the "Corrupted Content" error state, see whether you can correct the error state by:
  • Deploying a fixed service worker (we tried this first)
  • Unregistering the service worker (then we tried this)

In both cases, for us, we saw that Firefox never downloaded any resources. So it never got to the point of downloading a new service worker or reading the service worker unregister code on a page.
(In reply to Marcin from comment #8)
> In both cases, for us, we saw that FF never downloaded any resources. So it never got to the point of downloading a new service worker OR reading the service worker un-register code on a page.

This is also very useful as we seek to re-create. Can you elaborate on how you determined Firefox didn't download any resources? Server logs, wireshark, Firefox devtools, Firefox WebExtension, other? Thanks!
Migrating my comment from https://bugzilla.mozilla.org/show_bug.cgi?id=1495275#c8 to here since I think we're going to address the meta issue on this bug:

The only time the corrupted content UI is useful on a navigation fetch is for a developer debugging the SW. For normal users it makes sense to fall back to the normal network path with LOAD_BYPASS_SERVICE_WORKER in effect.

The bigger implementation question is what we do about the existing registration. In the case of bug 1503072, the SW itself was buggy, and so there was some period of time where immediately unregistering the ServiceWorker would lead to a cycle of:

  • Navigation FetchEvent dispatched to ServiceWorker; ServiceWorker explodes.
  • Our mitigation fires: SW uninstalled, normal network fetch occurs.
  • The page loaded from the network re-installs the broken ServiceWorker. If the SW claims the client, the page may immediately break.
  • (User closes page; cycle repeats next time they open a page matching the registration scope.)

There's additionally some potential weirdness with this approach thanks to the prefix-based scope matching of ServiceWorkers. If a site has a valid SW at "/" and a broken SW at "/maps" and we unregister "/maps", "/" will start handling maps URLs again.

Which is to say that perhaps the thing we should do, instead of naively unregistering, is:

  • Add internal-only flags on the registration that allow us to mark it broken.
  • Set the broken flag on any registration that generates a corrupted content error on a navigation fetch, and immediately terminate the SW if it has no (other) controlled clients.
  • If a registration has "broken" set on it, we always set the LOAD_BYPASS_SERVICE_WORKER flag. We also adjust the 24-hour soft-update logic to allow checks more frequently so that a fixed SW will be detected more promptly. We should also consider treating the updateViaCache mode as if it was "imports" if it was set to "all", and perhaps "none" if it was set to "imports", or maybe always just "none".
  • We clear the broken flag on a registration when the update algorithm decides to progress to installation and the new ServiceWorker activates. The rationale here is that we avoid wasteful CPU and bandwidth usage by requiring that the SW has actually changed, as well as avoiding clients.claim() allowing the broken SW to break the page.
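The prefix-matching concern can be illustrated with a toy model of the spec's registration-matching rule, where the longest matching scope wins (the function name and shape here are illustrative, not spec or Gecko identifiers):

```javascript
// Toy model: a client URL is matched against registrations by
// longest-prefix scope; returns the winning scope, or null.
function matchScope(scopes, urlPath) {
  let best = null;
  for (const scope of scopes) {
    if (urlPath.startsWith(scope) && (best === null || scope.length > best.length)) {
      best = scope;
    }
  }
  return best;
}
```

With scopes ["/", "/maps"], the path "/maps/route" matches "/maps"; after unregistering "/maps", the same path falls back to "/", so the root SW silently starts handling maps URLs — which is exactly why naive unregistration is risky.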
It's looking like a "broken" service worker that intercepts but incorrectly responds to a navigation fetch blocks any waiting service workers from taking control (i.e. an updated version of a particular service worker). So that's probably why the broken service worker persists even after the 24-hour update check. *Something* is preventing the broken service worker from giving up control of the client/page until the faulty fetch event handler resolves, and I don't think the fetch handler should be running at all when there is another waiting worker.
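For context, a simplified model of the spec's activation gate helps explain why a waiting worker can stay stuck while the broken worker still controls a client. This is an illustrative sketch; the names are invented, not spec or Gecko identifiers:

```javascript
// Simplified model of the spec's "Try Activate" check: a waiting worker may
// only become active once no controlled clients remain, or once it has
// called skipWaiting().
function canActivateWaiting(registration) {
  return registration.waitingWorker !== null &&
    (registration.controlledClientCount === 0 ||
     registration.waitingWorker.skipWaitingFlag === true);
}
```

Under this model, as long as the error page's client still counts as controlled by the broken worker (and the updated worker didn't call skipWaiting()), the updated worker waits indefinitely — matching the observed behavior.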
I'm going to take this over while Perry focuses on ServiceWorkerPrivate remoting; :perry will put up his existing test for me to steal.
Assignee: perry → bugmail
Status: NEW → ASSIGNED
Flags: needinfo?(perry)
IIRC, some debugger output suggested that the KeepAliveToken is being dropped too late, preventing the SW from shutting down in time, which then prevented the updated SW from taking control.
688345 - Log information for diagnosing NS_ERROR_CORRUPTED_CONTENT errors into web console <https://bugzilla.mozilla.org/show_bug.cgi?id=688345>

I'm not sure if the SW has to be implemented out of spec to trigger the problem.

If I've understood correctly, interactions between protective add-ons like EFF's Privacy Badger and bug 1499523 may be triggering NS_ERROR_CORRUPTED_CONTENT errors too.

In that case, navigating to a site from a search engine produces the "Corrupted Content" screen of death, but navigating to the site by opening an empty new tab and pasting in the URL does work. Pasting the same URL in the existing tab fails.

Unfortunately, even experienced users are unlikely to think of doing that. When I reported this to some people, they assumed I didn't know what I was talking about, because everyone expects that pasting a URL into a tab has the same behaviour regardless of whether the tab has been used before, and that clicking a cross-site link also has the same behaviour, if there is no scripting or referrer funny business.

Even if the SW code is out of spec, I think never reaching the 24hr cleanup is quite serious because it permanently breaks a site, which the site provider can't do anything about after the client is "infected". Even if the general problem is difficult to fix, perhaps enforcing the 24hr cleanup should have priority?

Type: enhancement → defect

Reloading the page while bypassing the cache always works whenever I encounter this issue, which seems to be more frequent recently. Why doesn't "Try Again" just do the same thing that "Ctrl+Shift+R" (Ctrl+F5) does? It would certainly make for a better user experience.

(In reply to Steeve McCauley from comment #17)

> Reloading the page by by-passing cache always works whenever I encounter this issue, which seems to be more frequent recently. Why don't "Try Again" just do the same thing that "Ctrl+Shift+R" (Ctrl+F5) does? It would certainly make for a better user experience.

Sounds like a possible workaround, :asuth?

Flags: needinfo?(bugmail)
Assignee: bugmail → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(bugmail)

(In reply to Steeve McCauley from comment #17)

> Reloading the page by by-passing cache always works whenever I encounter this issue, which seems to be more frequent recently. Why don't "Try Again" just do the same thing that "Ctrl+Shift+R" (Ctrl+F5) does? It would certainly make for a better user experience.

Yes, if we're detecting corrupted content, the 2 steps would be:

  1. Trigger a reload, bypassing serviceworkers
  2. Do the marking to always bypass until updated.

We can definitely start with just responding to corrupted content with the automatic bypass-reload, which is simpler than the 2nd part.
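The two steps above could be modeled with per-registration bookkeeping along these lines. This is a sketch of the idea only; the class and method names are illustrative, not actual Gecko identifiers:

```javascript
// Illustrative "mark broken, bypass until a fixed SW activates" state.
class RegistrationMitigationState {
  constructor() {
    this.broken = false;
  }
  // Step 1 trigger: a navigation interception produced corrupted content.
  noteCorruptedContent() {
    this.broken = true;
  }
  // Step 2: channel setup consults this to decide whether to add
  // LOAD_BYPASS_SERVICE_WORKER to the (re)load.
  shouldBypass() {
    return this.broken;
  }
  // Cleared only when the update algorithm installs and activates a
  // changed worker, so an unchanged broken SW keeps being bypassed.
  noteUpdatedWorkerActivated() {
    this.broken = false;
  }
}
```

The design point in the comment above is captured by the last method: the flag is cleared on a successful update to a changed worker, not merely on the next load, so the site stays usable until the developer actually ships a fix.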

Assignee: nobody → bugmail
Status: NEW → ASSIGNED
Summary: Corrupted content error should allow for better recovery → Reset interception (with ServiceWorker bypass) on ServiceWorker interception failures

The revised plan for a mitigation fix that hopefully ensures we avoid infinite loops where the pages want to upgrade themselves to using (broken) ServiceWorkers is:

  • Due to issues of complexity in implementation and testing in other WIPs in this problem space that involve overhauling the ServiceWorkerRegistrar storage, all state tracking for this patch series is in-memory only. Subsequent patches will layer on top of these changes.
  • Whenever we encounter a ServiceWorker fault, we will either reset interception (if we haven't delivered an OnStartRequest yet, as is the case in the recent NS_ERROR_DOM_ABORT_ERR errors) which implicitly redirects with LOAD_BYPASS_SERVICE_WORKER or reload the document (if we have delivered an OnStartRequest and therefore are looking at corrupted content problems) with such flags. In both cases the page will not technically be controlled (as opposed to the no-fetch optimization where the page is technically controlled). We will also perform state tracking that potentially triggers heuristics as described in the next bullet.
    • These are the expected places where UX-breaking scenarios can be experienced and how we will handle them:
      • FetchEventOpChild::CancelInterception: This is the mechanism by which the current NS_ERROR_DOM_ABORT_ERR errors are reported. This covers all of our current known failure modes (which have changed since the landing of parent intercept, as there's now a more clear-cut life-cycle with explicit IPC actors versus everything being mixed into HttpChannelChild). We'll treat this like ResetInterception was called.
      • InterceptedHttpChannel::OnStopRequest: The nsInputStreamPump will call this method in the event it is trying to read from the stream and Available() returns an error code, with the pump propagating the error. We're not going to handle this initially, but in general we'd expect the SW misbehavior case (as opposed to crashing) to involve JS streams, which are trickier in other ways too, like that they'd likely stall forever rather than abort. Expected errors:
    • This helps provide the following mitigations:
      • The user gets to see the page from the network/HTTP cache instead of a blank page. The meta-concern here is that the page could then try and install a ServiceWorker and pivot to it via navigation or Clients.claim(), but the ServiceWorker could still be broken.
      • By ensuring the page isn't controlled, this helps ensure that install/update jobs will be able to make it all the way to the activated state without getting stuck in waiting, which is the complicated (persistence-involving) problem in bug 1668743.
  • Additional state to be tracked:
    • Each ServiceWorkerInfo will track a fault count.
      • When the fault count reaches a configurable threshold, we will uninstall the registration and consider clearing the origin.
        • Uninstalling the registration acts as a mitigation for:
          • ServiceWorkers with broken logic, for example the walmart ServiceWorker that attempted to use a dynamic importScripts based on the date which would never work after installation.
          • Lifecycle activation problems in bug 1668743 where the presence of the controlled page (despite the lack of a successful load!) creates a potentially recurring problem. This is actually the same thing we hope that the bypass above will address, which is why we won't take this action immediately, because this should actually be something we can avoid. However, in the event the bypasses aren't sufficient, removing the registration entirely will avoid the possibility of there being a SW getting stuck in the waiting state.
    • Each RegistrationDataPerPrincipal will track whether we have considered clearing the origin for reasons of quota usage.
      • When we consider clearing the origin for reasons of quota usage, we:
        • Do nothing if we've already initiated a request for this as tracked by our state. There are costs to this and we want to avoid any pathological loops. We set the flag before advancing to the next steps.
        • Ask nsIQuotaManagerService.getUsageForPrincipal and nsIQuotaManagerService.estimate to tell us 1) the origin usage, 2) the group usage, and 3) the group limit. Because the group is how QM currently bills things, we potentially need to clear the group, not just the origin.
        • In the event the group usage has less than H KiB of headroom, we will presume that the ServiceWorker is running up against its quota limits, as experienced with Slack in bug 1678795, and we will mitigate by clearing QuotaManager data for at least the origin. (This is a simplified version of the bumping-up-against-the-quota-limit detection heuristics I previously proposed, which would involve some new QuotaManager development and API surface.)
        • The question is then whether to clear the whole group via deleteDataFromHost or from just the origin via deleteDataFromPrincipal. Either way, we will only clear the QM storage by specifying flags of CLEAR_DOM_QUOTA.
          • Our decision here is to subtract the origin usage off of the group usage and to see whether that would be enough to make our H KiB of headroom criteria happy. If it's not, then we will clear the group. I was previously thinking about also having some concept of percentage of the group limit as a heuristic, but the reality is that right now Slack is our worst-case and we want to make sure there's at least 400 MiB based on our current investigations. In the future the bump-against-headroom mechanism can be more clever about seeing how much space we had to start with, etc.
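The headroom heuristic above amounts to the following arithmetic. This is a sketch; H (the headroom threshold), the field names, and the return values are illustrative, not the eventual implementation:

```javascript
// Decide what to clear when a SW keeps faulting: nothing if the group still
// has headroom, just the origin if clearing it would restore enough headroom,
// otherwise the whole group. All values are in bytes.
function decideQuotaClearing({ originUsage, groupUsage, groupLimit, headroom }) {
  if (groupLimit - groupUsage >= headroom) {
    return "none"; // not quota-pressured; the faults have some other cause
  }
  // Subtract the origin's usage from the group usage and check whether that
  // alone would satisfy the headroom criterion.
  const headroomAfterOriginClear = groupLimit - (groupUsage - originUsage);
  return headroomAfterOriginClear >= headroom ? "origin" : "group";
}
```

For example, with a 1 GiB group limit, 950 MiB group usage, and a 200 MiB headroom target, clearing a 100 MiB origin would still leave only 150 MiB free, so the whole group gets cleared; if the origin itself held 800 MiB, clearing just the origin suffices.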

Patch stack for this is going to look something like:

  1. Fault injection testing mechanism w/test that has a SW that wants to be bypassed.
  2. Change in InterceptedHttpChannel to reset interception, making the above test pass. This is a band-aid on its own.
  3. State tracking of faults on ServiceWorkerInfo with registration removal.
  4. State tracking of origin clearing on RegistrationDataPerPrincipal with origin clearing behavior.
  5. Telemetry on faults and mitigation actions.

This test was disabled for parent-intercept by bug 1470266 when parent
intercept was still in-progress. Since the landing of worker remoting
and related cleanups, I think we're generally good here and I was able
to locally get this to pass under "--verify". Also, this test is
actively relevant to the fixes we're making here on bug 1503072.

:Eden, needinfo-ing you to make sure you're pointed at the plan for the fix in comment 20 and have had a chance to look at the test WIP at https://bugzilla.mozilla.org/attachment.cgi?id=9215466 before I put up (most of, hopefully) the stack later today after end-of-business in Berlin for hopeful review tomorrow.

Flags: needinfo?(echuang)
Attachment #9215465 - Attachment description: WIP: Bug 1503072 - re-enable browser_storage_recover.js r=dom-workers → Bug 1503072 - re-enable browser_storage_recover.js r=#dom-workers

ServiceWorkerCleanUp's host based cleaning has not correctly been
checking whether the provided host is a root domain of the ServiceWorker
registration, instead performing an exact match (which is what
removeFromPrincipal does).

This check makes the logic consistent with other deleteByHost
implementations and is actually using the same check logic.

Depends on D111844

Attachment #9215466 - Attachment description: WIP: Bug 1503072 - Test → Bug 1503072 - Navigation fault interception test support. r=#dom-workers

Testing and mitigations

In https://phabricator.services.mozilla.com/D111993 (not yet landed/fully reviewed) I had speculatively disabled the mitigations for all service-workers web platform tests (WPTs) as a least-bad-option versus experiencing false success runs due to the mitigations. I took this action after encountering a failure in /service-workers/service-worker/update-recovery.https.html because it expects a broken ServiceWorker to stay broken and not be bypassed. :valentin appropriately brought this up in review and after going through the stages of hand-waving/rationalizing, my revised plan here is:

  • Add a boolean preference dom.serviceWorkers.mitigations.testing which indicates that our test explicitly expects to be depending on mitigations.
  • When running tests (and/or in DEBUG builds?), in a case where we would fire any of our mitigations, if the above testing pref isn't set, then we hard crash with a message that explicitly indicates what's going on. The rationale is that:
    • Tests should not be depending on these mitigations unless they intend to.
    • We want it to be very obvious when a mitigation is activated by a test. We don't want to have a newly synchronized WPT be disabled because of mysterious failures and then have to spend several hours realizing the mitigation got involved.
  • Switch to only disabling mitigations on update-recovery.https.html. This means all other WPTs will run with mitigations enabled, but they will trigger an immediate crash if they are used. The crash/assertion should make it immediately clear in any filed bugs what is going on.

I'm going to do this in a separate patch that sits immediately on top of https://phabricator.services.mozilla.com/D111993.

Recent status updates:

  • There was a problem with the new test being flaky, caused by a typo where I used waitForStop instead of waitForStateStop; this was addressed on Friday.
  • There's another test intermittent that I'm still digging into where the bypassed page still appears to be marked as controlled when it should not be. Given that our clearing of the controller from the load info in the parent on redirect should be quite deterministic, this would suggest something in the interaction with DocumentChannel/ClientChannelHelper that I'm also going to look at statically.

In terms of landing, the original intent for the core mitigation was to land this on Thursday despite the code freeze (given the severity of the impact of the underlying issues), and then it was likely still viable on Friday during the code freeze, but it's definitely not appropriate to land this today on the last day of the code freeze. So the plan is to land this mitigation as soon as both of these are true: the code freeze ends after beta branching, and the still-controlled problem is fixed. (Having the page bypassed but still controlled could result in the core HTML of a document being loaded from the network but every single one of its subresources being broken, which is not a good result.)

I'll then request uplift for the set of mitigations that have landed and baked as they land/bake.

Flags: needinfo?(echuang)

Re-assigning to Eden to:

  1. Identify why the page is sometimes still being marked as controlled and stop this from happening.
  • This primarily occurs under test-verify mode, but some try pushes have had it happen under the normal run as well. This continued to happen even after Eden's recent landing of the fixes in bug 1584007.
  • This failure mode is critical to address because if the page still ends up marked as controlled after the bypass attempt, it's very possible that every single subresource load will break.
  2. Implement the additional layers of data clearing mitigations discussed in comment 20 after addressing and landing the above (which is the current set of reviewed patches). Specifically:
  • Speculative patch 3 discussed at the bottom of comment 20 to remove the registration. (The test was written with the expectation of this enhancement, but will need changes for this.)
  • Speculative patch 4 discussed at the bottom of comment 20 to clear the origin. (The test was written with the expectation of this enhancement, but will need changes for this.)

I'm going to refresh the patches via moz-phab now to make sure that they are up-to-date as of my last set of try run pushes. Rebasing will likely need to occur in response to child-intercept removal changes. The try pushes with the commit stack at https://hg.mozilla.org/try/rev/1e9ed7fc1e4f570dad52d282b163182f9ee7bb3c were:

Assignee: bugmail → echuang
Pushed by echuang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b7550232d4f1 re-enable browser_storage_recover.js r=dom-worker-reviewers,edenchuang
https://hg.mozilla.org/integration/autoland/rev/c04927c15fd8 Navigation fault interception test support. r=dom-worker-reviewers,edenchuang
https://hg.mozilla.org/integration/autoland/rev/f8c6503512f5 Add mitigation to bypass SW on navigation fault. r=dom-worker-reviewers,necko-reviewers,valentin,edenchuang

Backed out for causing bc failures on browser_navigation_fetch_fault_handling.js.


Flags: needinfo?(echuang)

Patch D111992 was not pushed together with the rest of the stack, which caused the test failure.
Revised the patch dependency and pushed again.

Flags: needinfo?(echuang)
Pushed by echuang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c148099d56d7 re-enable browser_storage_recover.js r=dom-worker-reviewers,edenchuang
https://hg.mozilla.org/integration/autoland/rev/c158c8a31e05 Correct ServiceWorker host-based cleanup. r=dom-worker-reviewers,edenchuang
https://hg.mozilla.org/integration/autoland/rev/2d6f6e0ee8cd Navigation fault interception test support. r=dom-worker-reviewers,edenchuang
https://hg.mozilla.org/integration/autoland/rev/3ffffa5d8387 Add mitigation to bypass SW on navigation fault. r=dom-worker-reviewers,necko-reviewers,valentin,edenchuang

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #28)

> Re-assigning to Eden to:
>
> 1. Identify why the page is sometimes still being marked as controlled and stop this from happening.
>   • This primarily occurs under test-verify mode, but some try pushes have had it happen under the normal run as well. This continued to happen even after Eden's recent landing of the fixes in bug 1584007.
>   • This failure mode is critical to address because if the page still ends up marked as controlled after the bypass attempt, it's very possible that every single subresource load will break.

The root cause is that the client's controller-setting propagation might happen later than nsGlobalWindowInner::EnsureClientSource().

When interception starts, ServiceWorkerManager::StartControllingClient(loadInfo::ReservedInfo) is called in ServiceWorkerManager::DispatchFetchEvent() to make sure the reserved ClientSource will be controlled once it is created.

In this test case, we expect the load to be reset as not controlled. To achieve that, nsILoadInfo::ClearController() is called for the redirected channel. When executing nsGlobalWindowInner::EnsureClientSource(), it can decide whether a new ClientSource should be created by comparing the controller setting between the existing reserved ClientSource and the channel's loadInfo. However, ClientSource controller-setting propagation involves IPC and thread switching, so the controller propagation might happen after nsGlobalWindowInner::EnsureClientSource(). In that case, the ClientSource ends up unexpectedly controlled.

A possible fix is to also reset the ReservedClientInfo during the channel redirection. Since we already know the original reserved ClientSource will be controlled in this situation, we can just drop the original ReservedClientInfo during the channel redirection, such that it has no chance to connect to the nsGlobalWindowInner.
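The proposed fix can be sketched as follows. The object shape here is purely illustrative (in Gecko this state lives on nsILoadInfo and is manipulated from C++ during the redirect):

```javascript
// Illustrative: on the bypass redirect, clear both the controller and the
// reserved client info. Clearing only the controller leaves a race where
// the reserved client's "controlled" flag, still propagating over IPC,
// arrives after EnsureClientSource() and re-attaches a controlled client.
function resetInterceptionState(loadInfo) {
  loadInfo.controller = null;         // mirrors nsILoadInfo::ClearController()
  loadInfo.reservedClientInfo = null; // forces a fresh ClientSource later
}
```

With both fields cleared, the window has no stale reserved client to adopt, so EnsureClientSource() must create a new, uncontrolled ClientSource regardless of when the IPC propagation lands.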

Comment on attachment 9215688 [details]
Bug 1503072 - Add mitigation to bypass SW on navigation fault. r=#dom-workers

Beta/Release Uplift Approval Request

  • User impact if declined: Potentially broken sites as experienced in bug 1678795 and the related family of bugs.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This has baked for a full week on nightly and the patches were written in such a way that they only come into play on errors that caused total breakage. The automated test was also able to successfully help us reproduce the worst-case scenario error we were concerned about (which unfortunately delayed being able to land this patch).
  • String changes made/needed: n/a
Attachment #9215688 - Flags: approval-mozilla-beta?
Attachment #9215465 - Flags: approval-mozilla-beta?
Attachment #9215466 - Flags: approval-mozilla-beta?
Attachment #9215687 - Flags: approval-mozilla-beta?

Comment on attachment 9215688 [details]
Bug 1503072 - Add mitigation to bypass SW on navigation fault. r=#dom-workers

These are sizeable patches, but given your evaluation of low risk and the fact that they mitigate cryptic errors on major sites like Gmail and Slack that we have been suffering from for many cycles, let's take it in 91 beta 5.

Attachment #9215688 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9215465 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9215466 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9215687 - Flags: approval-mozilla-beta? → approval-mozilla-beta+