Pinned tabs that use ServiceWorkers and do not skipWaiting on update may fail to load at next startup due to shutdown race involving SW activation
Categories
(Core :: DOM: Service Workers, defect, P2)
Tracking
()
| Tracking | Status | |
|---|---|---|
| firefox83 | --- | wontfix |
People
(Reporter: Dexter, Assigned: asuth)
References
(Depends on 2 open bugs, Blocks 1 open bug)
Details
(Whiteboard: dom-lws-bugdash-triage)
Attachments
(4 files)
Since a few Nightlies (I'm now on 83.0a1 (2020-10-01) (64-bit), on Windows 10 x64), the last pinned tab I have does not get properly restored when the browser starts. This does not happen every single time, but often enough to be very annoying!
The favicon is correctly restored (I see the correct pinned favicon!), but when clicking on the tab it shows no URL in the awesomebar and the page content is empty (like about:blank). If I try to right click on the page, no context menu shows.
Comment 1•5 years ago
|
||
It's been happening frequently for me too, but in my case it's not the last tab, it's always a pinned Slack tab.
| Reporter | ||
Updated•5 years ago
|
Comment 2•5 years ago
|
||
There is a good chance that the underlying issue may be related to service worker and fetch intercept (e.g. similarly to Bug 1656310), Bug 1662925 should fix those issue. I'm just adding both Bug 1656310 and Bug 1662925 as "see also" for this issue, until we are sure that this is related to the same issue.
Alessio,
would you mind to double-check (e.g. in about:serviceworkers or in about:debugging) if the pinned tabs that are not loading all have a service worker registered that is also intercepting fetch requests?
Comment 3•5 years ago
|
||
Do you have Fission enabled (check about:support)?
Comment 4•5 years ago
•
|
||
(In reply to Neha Kochar [:neha] from comment #3)
Do you have Fission enabled (check about:support)?
Not Alessio, but we were chatting about it on Slack. I'm experiencing the same problem (I'd say for about a week, not at every restart), and Fission is disabled for both of us (fission.autostart is false, I don't have anything about Fission in about:support).
Comment 5•5 years ago
|
||
See also reports about trouble loading twitter on reddit in #firefox
| Reporter | ||
Comment 6•5 years ago
|
||
(In reply to Neha Kochar [:neha] from comment #3)
Do you have Fission enabled (check about:support)?
Yes, I can confirm what :flod said. fission.autostart is false in my case as well.
This also happens with and without Fission in my case, but seems to get much worse when Fission is enabled.
| Assignee | ||
Comment 8•5 years ago
|
||
This was likely due to bug 1662925 which has been fixed on Nightly since the October 8th nightly. The bug impacted page loads that happened when the target process type wasn't already running, which would be the case for non-fission-enabled Firefox when restoring pinned tabs at startup (as well as fission tabs all the time, and especially at startup). Can you confirm if this problem has disappeared for you?
| Reporter | ||
Comment 9•5 years ago
|
||
I haven't seen the problem in days, so yes, this seems to be fixed.
| Reporter | ||
Comment 10•5 years ago
|
||
| Reporter | ||
Comment 11•5 years ago
|
||
This just happened again on my machine, latest Nightly (https://hg.mozilla.org/mozilla-central/rev/c8b4cf6696dd7663a99c914d89e91b427f070800).
Comment 10 is what's showing up in my about:serviceworkers.
Comment 12•5 years ago
|
||
My pinned Spotify tab also isn't restored regularly after reopening Nightly.
| Assignee | ||
Comment 13•5 years ago
|
||
Thank you both for the data-points and clarifying that there is indeed an installed ServiceWorker for spotify. I'm currently adding instrumentation/logging to about:serviceworkers to help diagnose this.
I'm assuming that the general phenomenon is the same? Specifically:
- Location bar/awesomebar remains empty (resulting in the display of the placeholder text like "Search with Google or enter address")
- The page remains blank. (And if you bring up devtools and type in
document.locationyou see "Location about:blank"
Related questions if this happens again or you remember:
- Is anything interesting showing up in the devtools console for the page?
- When you restore the session, was the initially selected tab: The ServiceWorker-using page that failed to load, a system page like
about:newtab(which also results in an empty location bar), a website, a different ServiceWorker-using page, etc.?
Thanks!
Comment 14•5 years ago
|
||
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #13)
I'm assuming that the general phenomenon is the same? Specifically:
- Location bar/awesomebar remains empty (resulting in the display of the placeholder text like "Search with Google or enter address")
- The page remains blank. (And if you bring up devtools and type in
document.locationyou see "Location about:blank"
I've seen this as well.
Related questions if this happens again or you remember:
- Is anything interesting showing up in the devtools console for the page?
Not sure, sorry :/
- When you restore the session, was the initially selected tab: The ServiceWorker-using page that failed to load, a system page like
about:newtab(which also results in an empty location bar), a website, a different ServiceWorker-using page, etc.?
It wasn't one of the failing tab, but I can't tell if this was a normal website or a serviceworker-using page.
I'll take more care to these information next time.
Comment 15•5 years ago
|
||
I see this on outlook.com as well. Same issue, tab favicon restored, URL and page is blank.
Comment 16•5 years ago
|
||
I just got the problem again and can answer this:
Is anything interesting showing up in the devtools console for the page?
Nothing, completely blank.
Not sure if this is related: my previous reload didn't work, instead Firefox stopped and I had to start it again manually.
Comment 17•5 years ago
|
||
I hit this several times a day for slack / gmail tabs, for both of which I have service workers installed. Let me know if there is anything I can do to help debug.
| Assignee | ||
Comment 18•5 years ago
|
||
Are people still experiencing this/variations of this after bug 1671962 was fixed in Firefox 84 (now beta)? If so, can people confirm:
- The platform in use (windows / mac / linux)
- The current state of your
fission.autostartpreference - Are the browser restarts only a result of applying updates to Firefox, or does the problem also happen when manually quitting and restarting Firefox?
My preliminary attempts to reproduce the scenarios haven't encountered problems since that fix. I'm planning to do more testing on Windows 10 and to try to better emulate the nightly update scenario in terms of impact on QuotaManager regardless, but want to make sure I understand what the current experienced situations are. Thanks!
Comment 19•5 years ago
|
||
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #18)
Are people still experiencing this/variations of this after bug 1671962 was fixed in Firefox 84 (now beta)?
Yes. Every single day without fail (on Nightly).
If so, can people confirm:
- The platform in use (windows / mac / linux)
Windows.
- The current state of your
fission.autostartpreference
false
- Are the browser restarts only a result of applying updates to Firefox, or does the problem also happen when manually quitting and restarting Firefox?
Only a result of applying updates to Firefox as far as I can tell. I don't recall seeing it when I first start the browser.
| Assignee | ||
Comment 20•5 years ago
•
|
||
Thanks, :birtles. I failed to reproduce across a variety of reproduction cases on Windows (10) using about:restartRequired and the MOZ_PURGE_CACHES environment variable to attempt to reproduce the overall scenario of updating a nightly. I started from a very heavy profile and reduced the number of tabs/windows restored and number of pinned tabs to attempt to get the pinned tab loads to happen earlier. Then I added a number of SW-using pinned tabs back.
I'm going to try and finish out the workflow required to easily share logs from a failure case. Specifically, click-to-download into an ndjson (newline-delimited JSON) file that can then be sanity-checked for privacy concerns and then emailed or otherwise shared with me. My end goal is to provide for per-origin log saving but that may not happen in the first pass. I'll spin up try builds for those interested but will also be moving forward with review after splitting the patch up.
| Assignee | ||
Comment 21•5 years ago
|
||
I managed to reproduce on linux for spotify in a pinned tab right after posting that comment! Woo!
| Assignee | ||
Comment 22•5 years ago
|
||
It appears that BrowsingContext::LoadURI really likes to Cancel the intercepted channels after a short period of time.
It appears this is happening because CanonicalBrowsingContext::AttemptSpeculativeLoadInParent / DocumentLoadListener::SpeculativeLoadInParent creates an nsHttpChannel which is then shortly superseded by NeckoParent::RecvPDocumentChannelConstructor coming from the content process.
This can reveal a lifecycle management problem for the SW where it can be terminated which triggers an abort of the second request. This is the SW bug, but it would be interesting to know if there's a way to help the speculative channel be used instead of being destroyed. It seems possible that the unaddressed shortcut taken in https://bugzilla.mozilla.org/show_bug.cgi?id=1589749#c13 could be at the root of what's going on.
| Assignee | ||
Comment 23•5 years ago
|
||
Status:
- Comment 21 and comment 22 scenario.
- Simple fix/mitigation identified. The most straightforward aspect of the problem is that we don't renew the keepalive token until the remote worker's created is notified in ServiceWorkerPrivateImpl::CreationSucceeded. Obtaining the keep-alive token at spawn time will entirely moot the problem.
- Working on a test to emulate the situation created by the speculative load being canceled, but not depending on that to keep happening.
- Working on analyzing the job queue/termination logic's interaction is sound or how to make it sound. The [fetch, terminate, fetch] sequence that's implied here in this failure mode would want to net out to [spawn, fetch, terminate, spawn, fetch] with a barrier at terminate by dropping the ServiceWorkerPrivateImpl and re-creating it or similar. This is made slightly more complex by there being 2 levels of waiting for the worker to be spawned (need to wait for a process versus don't need to wait for a process, with comment 22 being not waiting for a process). This will inform the test as well as the fission situation is potentially different and wants fidelity.
- Investigating another potential source of weirdness where some of these sites (ex: spotify) are triggering a soft update check that seems to go to the network and result in an update of the ServiceWorker every time Firefox is restarted. While our soft update checks are delayed by 1 second, the update checks happen very quickly and can result in a new install very quickly as well, especially when there's a lot going on in the browser. The good news is this isn't happening for all ServiceWorkers, the pre-existing SW seems healthy, and the site doesn't seem to be invoking skipWaiting, so hopefully it's just bad/broken behavior on the part of the site.
- I'll probably shelve this for the time being. I added enough logging to help get to the point and ran enough runs that I'm feeling good that it is just the site doing something weird (but not in a way that would end up breaking the user experience).
| Assignee | ||
Comment 24•5 years ago
|
||
So it appears that the NS_BINDING_ABORTED we're seeing from a canceled speculative load in fact is unrelated to the problem. The keepalive renewal arguably happening later than it should in the process ends up not actually mattering because the cancellation of the InterceptedHttpChannel doesn't directly impact the FetchEventOpChild that is created; the FetchEventOpChild only perceives the channel as canceled when it goes to provide its synthesized response.
Further augmenting of the logging shows that the ServiceWorker wasn't actually being terminated but in fact the RemoteWorker creation was failing. My lucky pernosco trace luckily had an instance of this situation and current analysis shows that the Cache API lookup is failing to locate the main script of the ServiceWorker which is not really a thing that should be able to happen. However, given that I'm seeing the spotify ServiceWorker decide to update itself at the start of every session (visible in the about:serviceworkers excerpt attached here), it seems very believable that it's related. More investigation is required, but the good news is that at first glance "vary" processing doesn't seem to be involved, as the initial query is returning 0 rows and vary processing only starts happening after that.
| Assignee | ||
Comment 25•5 years ago
|
||
Quick Context:
- The data for a ServiceWorker exists in 2 places:
- The registration defined in PROFILE/serviceworker.txt and which names the chrome-namespaced Cache API storage where the ServiceWorker script and all of its script dependencies live.
- The aforementioned Cache API storage.
- Although we have a runtime concept of there existing multiple ServiceWorker instances for a registration (evaluating, installing, waiting, active), only one registration exists in the serviceworker.txt file at a time, and that is the installed worker.
Analysis
In the end, I believe the problem is a data coherency race as the shutdown process frees up the "waiting" updated ServiceWorker to advance to being "active" as the browser shuts down. Because the spotify ServiceWorker does not skipWaiting(), whenever there's an update, its new ServiceWorker gets installed but does not become "active" because the pinned spotify tab remains controlled by the existing "active" ServiceWorker. This newly "waiting" SW only gets to advance to "active" when there are no more pages being controlled by the worker. This happens at shutdown when the pinned tab gets torn down.
What happens at this point is extra bad, transactionally speaking:
- ServiceWorkerRegistrationInfo::TransitionWaitingToActive:
- Marks the existing active SW as redundant. This immediately initiates an async purge of its cache.
- Dispatches an "activate" lifecycle event at the SW and when that completes (it can waitUntil()!), we call ServiceWorkerRegistrationInfo::FinishActivate which is what causes us to update serviceworker.txt to reference the newly "active" ServiceWorker.
- This creates the following possible failure modes, with us dubbing the old ServiceWorker "old" and the new one "new".
- Pinned tab broken at startup, "new" storage leaked-ish: "old" Cache API storage deleted, "old" registration still reflected in serviceworker.txt, "new" Cache API storage orphaned. Note that this should self-correct as the functional event should have already initiated a soft update job which will be compelled to succeed. If it doesn't succeed, we will have marked the update job as meriting the registration being purged from serviceworker.txt.
- Pinned tab works at startup, "new" storage leaked-ish: "old" Cache API storage still there, "old" registration still reflected in serviceworker.txt, "new" Cache API storage orphaned.
- In my testing profile, I note that I have 27 distinct SW cache script entries in my Cache API for open.spotify.com, of which 25 must inherently be leaks.
- Bug 1665197 tracks GC-ing the leaked script data.
Fix Plans and Spec Issues
https://github.com/w3c/ServiceWorker/issues/1372 explicitly raises the scenario of shutdown during the activation process and how it impacts the SW script itself.
Our design goals are:
- Avoid I/O at shutdown for performance reasons. (And in general it would be nice to support a crash-only mode of operation where we never do anything at shutdown.)
- Ensure that a pinned tab that uses a ServiceWorker will actually get the chance to update. There's an inherent tension here between performance (fast shutdown, minimal latency on restart) and actually updating.
- Not leaking ServiceWorker Cache API installations. We don't want to have to GC ServiceWorker Cache API storages constantly.
My tentative plan is:
- Add additional data for the registrations in serviceworker.txt such that we are able to update the registration when a newly installed ServiceWorker reaches the waiting state. Specifically, we'd add data such that:
- We can express that the registration was in the waiting state or activating state. It might be best to pose this as the number of attempted activations with waiting being a value of 0. This enables supporting a bounded number of retries (with the attempt number preceding the activation event being sent in case it triggers a browser crash or hang that ends up as a browser crash).
- We can express the Cache API id of the "old" ServiceWorker. This would be cleared only after we confirm that it's been deleted and the deletion has hit the disk.
- Other schema changes we want:
- Space for marking the registration as broken.
- Expando space for the WebExtension infrastructure. (If we don't just add JSON for everything here, this might want its own JSON.)
- We would cease to advance ServiceWorkers to the active state once we've reached "quit-application".
- At startup, when reading registrations:
- Any registration that had been in the waiting state would be put into a special state such that the next functional event dispatched at the ServiceWorker will begin its process of activation. This will trigger explicit handling in "Handle Fetch" and "Fire Functional Event" to defer the events until activation completes which can result in some additional latency, but the spec issue already resulted in guidance being added that activation logic should endeavor to be minimal.
- We could also use the above logic for retry attempts. If concerned about this causing an infinite browser crash loop, it would make sense to ensure the retry attempt count was incremented and flushed to disk prior to actually triggering the retry. The higher level logic that adds UI confirmation to session restore might moot the need for this, as would the need to trigger a fetch to trigger the problem. (And arguably anything the activate method might do, normal operation might also do. Although, push notifications I suppose do pose a separate problem for that... and I'm not sure we've seen problems like that?)
- Any "old" Cache API script storage would be purged. We're always opting to move forward to the newly installed SW, so there's no point in keeping the old data around. Once purged, the registration would be updated to remove this old entry.
- Any registration that had been in the waiting state would be put into a special state such that the next functional event dispatched at the ServiceWorker will begin its process of activation. This will trigger explicit handling in "Handle Fetch" and "Fire Functional Event" to defer the events until activation completes which can result in some additional latency, but the spec issue already resulted in guidance being added that activation logic should endeavor to be minimal.
A somewhat orthogonal fix is:
- The error case that's cancelling interception likely does want to actually be performing a resetInterception() with LOAD_BYPASS_SERVICE_WORKER. This is something that I've already been planning to do via other bugs, but that has also entailed marking a registration as broken which would be the durable (until SW update) source of LOAD_BYPASS_SERVICE_WORKER for all future loads too.
Comment 26•5 years ago
|
||
I think I got the problem for a non-pinned website, but that was the foreground tab when Firefox restarted. This was the Firefox Pofiler website, which is also using a service worker, so I believe this makes sense, but I wanted to highlight this case too.
Updated•5 years ago
|
| Assignee | ||
Comment 27•5 years ago
|
||
(In reply to Julien Wajsberg [:julienw] from comment #26)
I think I got the problem for a non-pinned website, but that was the foreground tab when Firefox restarted. This was the Firefox Pofiler website, which is also using a service worker, so I believe this makes sense, but I wanted to highlight this case too.
Yes, this makes sense. This problem could also occur for other non-pinned tab scenarios as well. The specific problem is for pages controlled by a ServiceWorker where there's an update pending at shutdown when the browser automatically closes all the tabs. So if you opened a profiler page and leave it open until shutdown, but only switch to that tab a half hour into your next browsing session, you could still witness the same problem.
I've been juggling a few other bugs and waiting to see if there was any interest in altering the spec around this; there didn't seem to be so I'll be going implementing the plan from comment 25 and characterized a little more concisely in the spec issue.
| Assignee | ||
Updated•5 years ago
|
| Reporter | ||
Comment 28•5 years ago
|
||
Hey Andrew,
is there any short term plan to fix this? It makes some major modern web applications unusable after Firefox restarts :(
| Assignee | ||
Comment 29•5 years ago
|
||
Yes, I'm pursuing the proposed fix in comment 25.
Updated•4 years ago
|
Updated•1 year ago
|
Comment 30•4 months ago
•
|
||
https://phabricator.services.mozilla.com/D268092 - the current WIP patch which is trying to implement what is written in the comment 25
schema changes already landed (https://phabricator.services.mozilla.com/D253144)
Comment 31•4 months ago
|
||
So, I am able to detect a broken SW, but stumbled upon this:
trigger explicit handling in "Handle Fetch" and "Fire Functional Event" to defer the events until activation completes
are there any code examples for this? :)
And thanks in advance
Comment 32•4 months ago
|
||
This looks like what I was looking for:
//void ServiceWorkerManager::DispatchFetchEvent(nsIInterceptedChannel* aChannel,
ErrorResult& aRv)
RefPtr<ServiceWorkerRegistrationInfo> registration;
nsresult rv = GetClientRegistration(loadInfo->GetClientInfo().ref(),
getter_AddRefs(registration));
if (NS_WARN_IF(NS_FAILED(rv))) {
aRv.Throw(rv);
return;
}
if (!registration->GetActive() && registration->GetWaiting() &&
registration->GetNumberOfAttemptedActivations() == 0) {
registration->TryToActivateAsync(
ServiceWorkerLifetimeExtension(FullLifetimeExtension{}),
[this, aChannel, principal = registration->Principal(),
scope = registration->Scope()]() {
RefPtr<ServiceWorkerRegistrationInfo> reg2 =
GetRegistration(principal, scope);
if (reg2 && reg2->GetActive()) {
ErrorResult rv2;
DispatchFetchEvent(aChannel, rv2);
} else {
aChannel->ResetInterception(false);
}
});
return; // defer this fetch until callback runs
}
Comment 33•3 months ago
•
|
||
Here is the wip patch: https://phabricator.services.mozilla.com/D268092
Comment 34•3 months ago
•
|
||
Eden, Andrew
I think it would be a good idea to review the current plan and check if something is missing or if anything can be simplified.
For example, we have a blocker here: https://bugzilla.mozilla.org/show_bug.cgi?id=1426401, which depends on https://bugzilla.mozilla.org/show_bug.cgi?id=1231208 (Service worker e10s redesign)
Comment 35•3 months ago
|
||
(Okay, I'll prepare a doc for a discussion. Thanks to everyone)
Comment 36•3 months ago
|
||
Recently I've been seeing the symptoms in the original post on macOS for non-pinned tabs that I leave open. I have the browser set to resume the last session. Sometimes, the previously-selected tab reloads blank, or one of the other existing tabs might suddenly become a blank new tab when I click on it. This doesn't seem to be happening in Windows or Linux, so I wonder if there's been any macOS-specific regression.
Comment 37•3 months ago
|
||
Comment 38•3 months ago
|
||
Regarding to the comment 25. Not everything is clear for me. And I am asking for the clarification.
Especially these areas:
update the registration when a newly installed ServiceWorker reaches the waiting state
and
At startup, when reading registrations:
Any registration that had been in the waiting state would be put into a special state
How I see/understand that
- when a newly installed ServiceWorker reaches the waiting state, we update the registration (store on disk, with additional fields: activation attempts and oldCacheName)
- at startup, when reading registrations, we put that waiting worker in a newly created state (like "preactivation"). And trigger activation on a fly whenever worker access registration
- remove all cache data stored in oldCachename
- cease to advance ServiceWorkers to the active state on "quit-application".
- if activations attempts counter is above the limit, we remove registration and cache completely and read the next item
Updated•3 months ago
|
| Assignee | ||
Comment 39•3 months ago
|
||
Unfortunately the somewhat more contextualizing comment I was working on got eaten in a Firefox nightly restart (possibly interacting with Bugzilla's comment-saving), so responding somewhat more tersely here:
(In reply to Artur Iunusov from comment #38)
Regarding to the comment 25. Not everything is clear for me. And I am asking for the clarification.
Especially these areas:update the registration when a newly installed ServiceWorker reaches the waiting state
and
At startup, when reading registrations:
Any registration that had been in the waiting state would be put into a special stateHow I see/understand that
- when a newly installed ServiceWorker reaches the waiting state, we update the registration (store on disk, with additional fields: activation attempts and oldCacheName)
Yes, I believe this is what I meant by the first comment above.
- at startup, when reading registrations, we put that waiting worker in a newly created state (like "preactivation"). And trigger activation on a fly whenever worker access registration
I think the approach from the patch https://phabricator.services.mozilla.com/D276105 broadly works here. The key idea here is that we would not want the ServiceWorker to start running until we already had a reason to run it, such as an interception or page load. That patch correctly made sure not to dispatch any event when putting the SW in the waiting state.
- remove all cache data stored in oldCachename
Yes, noting that it's possible it might already have been removed.
- cease to advance ServiceWorkers to the active state on "quit-application".
Yes. Thanks to Jens we now have some nice easy ways to check this. Apparently quit-application maps to AppShutdownConfirmed so this check ends up amounting to a check of AppShutdown::IsInOrBeyond(ShutdownPhase::AppShutdownConfirmed) when we might otherwise try to activate a ServiceWorker. If that returns true, we don't want to try and advance the state.
- if activations attempts counter is above the limit, we remove registration and cache completely and read the next item
Yes. This would also want to go hand-in-hand with making sure we are incrementing the value and writing it out. Broadly, there are 2 "normal" situations related to this we care about:
- Interrupted by shutdown. We dispatched an activation event at a SW but then shutdown happened and the the "activate" ExtendableEvent from the activate algorithm didn't get a chance to complete (in the sense of all of the
waitUntilpromises resolved), then we do want to re-dispatch activate next time we try to run. - Activation timed out. We dispatched an activation event but the "activate" ExtendableEvent never resolved before we shut down the SW because it took too long.
In either case, I think we actually expect to see a rejection of the activate event that we dispatched, and we can count those as a failure.
In terms of bumping the value on the registration and writing it out, we could either do that when we start to dispatch the event or when we get the rejection. Since the shutdown case is one where we want to try and avoid performing additional I/O, that's a good argument for bumping the activation count before we dispatch the event. There's also an "abnormal" situation we might want to worry about; the case where we dispatch an activation event at the SW and the SW does something that causes us to crash or deadlock which eventually ends up as a crash. That could also be a good argument for bumping the value before activation and starting the process of writing the file out.
Since the timeout case is something that could potentially happen without Firefox restarting, that's potentially a case where we actually want to do this as part of the decision to retry performing activation. This would suggest we don't want to perform any logic about the activation count when reading things off disk in ServiceWorkerRegistrar or loading the registration, but instead when we'd go to actually perform the activation so that there's only one place for the logic.
This is an idea that came from the Chrome/Blink team mentioning they did something interesting here, so it could also be good to see what their implementation currently does from the source at https://source.chromium.org/ since it could also make sense to align with what they're doing.
Description
•