Open Bug 1668743 Opened 4 years ago Updated 6 months ago

Pinned tabs that use ServiceWorkers and do not skipWaiting on update may fail to load at next startup due to shutdown race involving SW activation

Categories

(Core :: DOM: Service Workers, defect, P2)

defect

Tracking

()

ASSIGNED
Tracking Status
firefox83 --- wontfix

People

(Reporter: Dexter, Assigned: asuth)

References

(Depends on 1 open bug)

Details

(Whiteboard: dom-lws-bugdash-triage)

Attachments

(4 files)

Attached image immagine.png

Since a few Nightlies (I'm now on 83.0a1 (2020-10-01) (64-bit), on Windows 10 x64), the last pinned tab I have does not get properly restored when the browser starts. This does not happen every single time, but often enough to be very annoying!

The favicon is correctly restored (I see the correct pinned favicon!), but when clicking on the tab it shows no URL in the awesomebar and the page content is empty (like about:blank). If I try to right click on the page, no context menu shows.

It's been happening frequently for me too, but in my case it's not the last tab, it's always a pinned Slack tab.

Summary: Sometimes the last pinned tab is not restored → Sometimes a pinned tab is not restored

There is a good chance that the underlying issue may be related to service worker and fetch intercept (e.g. similarly to Bug 1656310), Bug 1662925 should fix those issue. I'm just adding both Bug 1656310 and Bug 1662925 as "see also" for this issue, until we are sure that this is related to the same issue.

Alessio,
would you mind to double-check (e.g. in about:serviceworkers or in about:debugging) if the pinned tabs that are not loading all have a service worker registered that is also intercepting fetch requests?

Flags: needinfo?(alessio.placitelli)
See Also: → 1662925, 1656310

Do you have Fission enabled (check about:support)?

See Also: → 1668920
See Also: → 1668668

(In reply to Neha Kochar [:neha] from comment #3)

Do you have Fission enabled (check about:support)?

Not Alessio, but we were chatting about it on Slack. I'm experiencing the same problem (I'd say for about a week, not at every restart), and Fission is disabled for both of us (fission.autostart is false, I don't have anything about Fission in about:support).

Flags: needinfo?(alessio.placitelli)

See also reports about trouble loading twitter on reddit in #firefox

Flags: needinfo?(bugmail)

(In reply to Neha Kochar [:neha] from comment #3)

Do you have Fission enabled (check about:support)?

Yes, I can confirm what :flod said. fission.autostart is false in my case as well.

This also happens with and without Fission in my case, but seems to get much worse when Fission is enabled.

This was likely due to bug 1662925 which has been fixed on Nightly since the October 8th nightly. The bug impacted page loads that happened when the target process type wasn't already running, which would be the case for non-fission-enabled Firefox when restoring pinned tabs at startup (as well as fission tabs all the time, and especially at startup). Can you confirm if this problem has disappeared for you?

Flags: needinfo?(bugmail) → needinfo?(alessio.placitelli)

I haven't seen the problem in days, so yes, this seems to be fixed.

Status: NEW → RESOLVED
Closed: 4 years ago
Flags: needinfo?(alessio.placitelli)
Resolution: --- → DUPLICATE

This just happened again on my machine, latest Nightly (https://hg.mozilla.org/mozilla-central/rev/c8b4cf6696dd7663a99c914d89e91b427f070800).

Comment 10 is what's showing up in my about:serviceworkers.

Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---

My pinned Spotify tab also isn't restored regularly after reopening Nightly.

Thank you both for the data-points and clarifying that there is indeed an installed ServiceWorker for spotify. I'm currently adding instrumentation/logging to about:serviceworkers to help diagnose this.

I'm assuming that the general phenomenon is the same? Specifically:

  1. Location bar/awesomebar remains empty (resulting in the display of the placeholder text like "Search with Google or enter address")
  2. The page remains blank. (And if you bring up devtools and type in document.location you see "Location about:blank"

Related questions if this happens again or you remember:

  • Is anything interesting showing up in the devtools console for the page?
  • When you restore the session, was the initially selected tab: The ServiceWorker-using page that failed to load, a system page like about:newtab (which also results in an empty location bar), a website, a different ServiceWorker-using page, etc.?

Thanks!

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #13)

I'm assuming that the general phenomenon is the same? Specifically:

  1. Location bar/awesomebar remains empty (resulting in the display of the placeholder text like "Search with Google or enter address")
  2. The page remains blank. (And if you bring up devtools and type in document.location you see "Location about:blank"

I've seen this as well.

Related questions if this happens again or you remember:

  • Is anything interesting showing up in the devtools console for the page?

Not sure, sorry :/

  • When you restore the session, was the initially selected tab: The ServiceWorker-using page that failed to load, a system page like about:newtab (which also results in an empty location bar), a website, a different ServiceWorker-using page, etc.?

It wasn't one of the failing tab, but I can't tell if this was a normal website or a serviceworker-using page.

I'll take more care to these information next time.

Depends on: 1506892

I see this on outlook.com as well. Same issue, tab favicon restored, URL and page is blank.

I just got the problem again and can answer this:

Is anything interesting showing up in the devtools console for the page?

Nothing, completely blank.

Not sure if this is related: my previous reload didn't work, instead Firefox stopped and I had to start it again manually.

I hit this several times a day for slack / gmail tabs, for both of which I have service workers installed. Let me know if there is anything I can do to help debug.

Are people still experiencing this/variations of this after bug 1671962 was fixed in Firefox 84 (now beta)? If so, can people confirm:

  • The platform in use (windows / mac / linux)
  • The current state of your fission.autostart preference
  • Are the browser restarts only a result of applying updates to Firefox, or does the problem also happen when manually quitting and restarting Firefox?

My preliminary attempts to reproduce the scenarios haven't encountered problems since that fix. I'm planning to do more testing on Windows 10 and to try to better emulate the nightly update scenario in terms of impact on QuotaManager regardless, but want to make sure I understand what the current experienced situations are. Thanks!

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #18)

Are people still experiencing this/variations of this after bug 1671962 was fixed in Firefox 84 (now beta)?

Yes. Every single day without fail (on Nightly).

If so, can people confirm:

  • The platform in use (windows / mac / linux)

Windows.

  • The current state of your fission.autostart preference

false

  • Are the browser restarts only a result of applying updates to Firefox, or does the problem also happen when manually quitting and restarting Firefox?

Only a result of applying updates to Firefox as far as I can tell. I don't recall seeing it when I first start the browser.

Thanks, :birtles. I failed to reproduce across a variety of reproduction cases on Windows (10) using about:restartRequired and the MOZ_PURGE_CACHES environment variable to attempt to reproduce the overall scenario of updating a nightly. I started from a very heavy profile and reduced the number of tabs/windows restored and number of pinned tabs to attempt to get the pinned tab loads to happen earlier. Then I added a number of SW-using pinned tabs back.

I'm going to try and finish out the workflow required to easily share logs from a failure case. Specifically, click-to-download into an ndjson (newline-delimited JSON) file that can then be sanity-checked for privacy concerns and then emailed or otherwise shared with me. My end goal is to provide for per-origin log saving but that may not happen in the first pass. I'll spin up try builds for those interested but will also be moving forward with review after splitting the patch up.

I managed to reproduce on linux for spotify in a pinned tab right after posting that comment! Woo!

Assignee: nobody → bugmail
Status: REOPENED → ASSIGNED

It appears that BrowsingContext::LoadURI really likes to Cancel the intercepted channels after a short period of time.

It appears this is happening because CanonicalBrowsingContext::AttemptSpeculativeLoadInParent / DocumentLoadListener::SpeculativeLoadInParent creates an nsHttpChannel which is then shortly superseded by NeckoParent::RecvPDocumentChannelConstructor coming from the content process.

This can reveal a lifecycle management problem for the SW where it can be terminated which triggers an abort of the second request. This is the SW bug, but it would be interesting to know if there's a way to help the speculative channel be used instead of being destroyed. It seems possible that the unaddressed shortcut taken in https://bugzilla.mozilla.org/show_bug.cgi?id=1589749#c13 could be at the root of what's going on.

Status:

  • Comment 21 and comment 22 scenario.
    • Simple fix/mitigation identified. The most straightforward aspect of the problem is that we don't renew the keepalive token until the remote worker's created is notified in ServiceWorkerPrivateImpl::CreationSucceeded. Obtaining the keep-alive token at spawn time will entirely moot the problem.
    • Working on a test to emulate the situation created by the speculative load being canceled, but not depending on that to keep happening.
    • Working on analyzing the job queue/termination logic's interaction is sound or how to make it sound. The [fetch, terminate, fetch] sequence that's implied here in this failure mode would want to net out to [spawn, fetch, terminate, spawn, fetch] with a barrier at terminate by dropping the ServiceWorkerPrivateImpl and re-creating it or similar. This is made slightly more complex by there being 2 levels of waiting for the worker to be spawned (need to wait for a process versus don't need to wait for a process, with comment 22 being not waiting for a process). This will inform the test as well as the fission situation is potentially different and wants fidelity.
  • Investigating another potential source of weirdness where some of these sites (ex: spotify) are triggering a soft update check that seems to go to the network and result in an update of the ServiceWorker every time Firefox is restarted. While our soft update checks are delayed by 1 second, the update checks happen very quickly and can result in a new install very quickly as well, especially when there's a lot going on in the browser. The good news is this isn't happening for all ServiceWorkers, the pre-existing SW seems healthy, and the site doesn't seem to be invoking skipWaiting, so hopefully it's just bad/broken behavior on the part of the site.
    • I'll probably shelve this for the time being. I added enough logging to help get to the point and ran enough runs that I'm feeling good that it is just the site doing something weird (but not in a way that would end up breaking the user experience).

So it appears that the NS_BINDING_ABORTED we're seeing from a canceled speculative load in fact is unrelated to the problem. The keepalive renewal arguably happening later than it should in the process ends up not actually mattering because the cancellation of the InterceptedHttpChannel doesn't directly impact the FetchEventOpChild that is created; the FetchEventOpChild only perceives the channel as canceled when it goes to provide its synthesized response.

Further augmenting of the logging shows that the ServiceWorker wasn't actually being terminated but in fact the RemoteWorker creation was failing. My lucky pernosco trace luckily had an instance of this situation and current analysis shows that the Cache API lookup is failing to locate the main script of the ServiceWorker which is not really a thing that should be able to happen. However, given that I'm seeing the spotify ServiceWorker decide to update itself at the start of every session (visible in the about:serviceworkers excerpt attached here), it seems very believable that it's related. More investigation is required, but the good news is that at first glance "vary" processing doesn't seem to be involved, as the initial query is returning 0 rows and vary processing only starts happening after that.

Quick Context:

  • The data for a ServiceWorker exists in 2 places:
    1. The registration defined in PROFILE/serviceworker.txt and which names the chrome-namespaced Cache API storage where the ServiceWorker script and all of its script dependencies live.
    2. The aforementioned Cache API storage.
  • Although we have a runtime concept of there existing multiple ServiceWorker instances for a registration (evaluating, installing, waiting, active), only one registration exists in the serviceworker.txt file at a time, and that is the installed worker.

Analysis

In the end, I believe the problem is a data coherency race as the shutdown process frees up the "waiting" updated ServiceWorker to advance to being "active" as the browser shuts down. Because the spotify ServiceWorker does not skipWaiting(), whenever there's an update, its new ServiceWorker gets installed but does not become "active" because the pinned spotify tab remains controlled by the existing "active" ServiceWorker. This newly "waiting" SW only gets to advance to "active" when there are no more pages being controlled by the worker. This happens at shutdown when the pinned tab gets torn down.

What happens at this point is extra bad, transactionally speaking:

  • ServiceWorkerRegistrationInfo::TransitionWaitingToActive:
  • This creates the following possible failure modes, with us dubbing the old ServiceWorker "old" and the new one "new".
    • Pinned tab broken at startup, "new" storage leaked-ish: "old" Cache API storage deleted, "old" registration still reflected in serviceworker.txt, "new" Cache API storage orphaned. Note that this should self-correct as the functional event should have already initiated a soft update job which will be compelled to succeed. If it doesn't succeed, we will have marked the update job as meriting the registration being purged from serviceworker.txt.
    • Pinned tab works at startup, "new" storage leaked-ish: "old" Cache API storage still there, "old" registration still reflected in serviceworker.txt, "new" Cache API storage orphaned.
  • In my testing profile, I note that I have 27 distinct SW cache script entries in my Cache API for open.spotify.com, of which 25 must inherently be leaks.
  • Bug 1665197 tracks GC-ing the leaked script data.

Fix Plans and Spec Issues

https://github.com/w3c/ServiceWorker/issues/1372 explicitly raises the scenario of shutdown during the activation process and how it impacts the SW script itself.

Our design goals are:

  • Avoid I/O at shutdown for performance reasons. (And in general it would be nice to support a crash-only mode of operation where we never do anything at shutdown.)
  • Ensure that a pinned tab that uses a ServiceWorker will actually get the chance to update. There's an inherent tension here between performance (fast shutdown, minimal latency on restart) and actually updating.
  • Not leaking ServiceWorker Cache API installations. We don't want to have to GC ServiceWorker Cache API storages constantly.

My tentative plan is:

  • Add additional data for the registrations in serviceworker.txt such that we are able to update the registration when a newly installed ServiceWorker reaches the waiting state. Specifically, we'd add data such that:
    • We can express that the registration was in the waiting state or activating state. It might be best to pose this as the number of attempted activations with waiting being a value of 0. This enables supporting a bounded number of retries (with the attempt number preceding the activation event being sent in case it triggers a browser crash or hang that ends up as a browser crash).
    • We can express the Cache API id of the "old" ServiceWorker. This would be cleared only after we confirm that it's been deleted and the deletion has hit the disk.
    • Other schema changes we want:
      • Space for marking the registration as broken.
      • Expando space for the WebExtension infrastructure. (If we don't just add JSON for everything here, this might want its own JSON.)
  • We would cease to advance ServiceWorkers to the active state once we've reached "quit-application".
  • At startup, when reading registrations:
    • Any registration that had been in the waiting state would be put into a special state such that the next functional event dispatched at the ServiceWorker will begin its process of activation. This will trigger explicit handling in "Handle Fetch" and "Fire Functional Event" to defer the events until activation completes which can result in some additional latency, but the spec issue already resulted in guidance being added that activation logic should endeavor to be minimal.
      • We could also use the above logic for retry attempts. If concerned about this causing an infinite browser crash loop, it would make sense to ensure the retry attempt count was incremented and flushed to disk prior to actually triggering the retry. The higher level logic that adds UI confirmation to session restore might moot the need for this, as would the need to trigger a fetch to trigger the problem. (And arguably anything the activate method might do, normal operation might also do. Although, push notifications I suppose do pose a separate problem for that... and I'm not sure we've seen problems like that?)
    • Any "old" Cache API script storage would be purged. We're always opting to move forward to the newly installed SW, so there's no point in keeping the old data around. Once purged, the registration would be updated to remove this old entry.

A somewhat orthogonal fix is:

  • The error case that's cancelling interception likely does want to actually be performing a resetInterception() with LOAD_BYPASS_SERVICE_WORKER. This is something that I've already been planning to do via other bugs, but that has also entailed marking a registration as broken which would be the durable (until SW update) source of LOAD_BYPASS_SERVICE_WORKER for all future loads too.
Component: Session Restore → DOM: Service Workers
Product: Firefox → Core
Summary: Sometimes a pinned tab is not restored → Pinned tabs that use ServiceWorkers and do not skipWaiting on update may fail to load at next startup due to shutdown race involving SW activation

I think I got the problem for a non-pinned website, but that was the foreground tab when Firefox restarted. This was the Firefox Pofiler website, which is also using a service worker, so I believe this makes sense, but I wanted to highlight this case too.

Severity: -- → S3
Priority: -- → P3

(In reply to Julien Wajsberg [:julienw] from comment #26)

I think I got the problem for a non-pinned website, but that was the foreground tab when Firefox restarted. This was the Firefox Pofiler website, which is also using a service worker, so I believe this makes sense, but I wanted to highlight this case too.

Yes, this makes sense. This problem could also occur for other non-pinned tab scenarios as well. The specific problem is for pages controlled by a ServiceWorker where there's an update pending at shutdown when the browser automatically closes all the tabs. So if you opened a profiler page and leave it open until shutdown, but only switch to that tab a half hour into your next browsing session, you could still witness the same problem.

I've been juggling a few other bugs and waiting to see if there was any interest in altering the spec around this; there didn't seem to be so I'll be going implementing the plan from comment 25 and characterized a little more concisely in the spec issue.

Priority: P3 → P2
See Also: → 1678795
See Also: → 1633648

Hey Andrew,

is there any short term plan to fix this? It makes some major modern web applications unusable after Firefox restarts :(

Flags: needinfo?(bugmail)

Yes, I'm pursuing the proposed fix in comment 25.

Flags: needinfo?(bugmail)
Whiteboard: dom-lws-bugdash-triage
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: