Closed Bug 1588152 Opened 1 year ago Closed 4 months ago

Crash in [@ AsyncShutdownTimeout | profile-change-teardown | ServiceWorkerShutdownBlocker: shutting down Service Workers]

Categories

(Core :: DOM: Service Workers, defect, P2)

Unspecified
All
defect

Tracking

()

RESOLVED FIXED
84 Branch
Tracking Status
firefox-esr68 --- unaffected
firefox-esr78 --- wontfix
firefox69 --- unaffected
firefox70 --- unaffected
firefox71 --- disabled
firefox72 --- disabled
firefox73 + disabled
firefox74 + fixed
firefox77 --- wontfix
firefox78 --- wontfix
firefox79 --- wontfix
firefox80 + verified
firefox81 - wontfix
firefox82 --- wontfix
firefox83 --- wontfix
firefox84 --- fixed

People

(Reporter: mccr8, Assigned: ytausky)

References

(Blocks 1 open bug, Regression)

Details

(4 keywords)

Crash Data

Attachments

(6 files)

This bug is for crash report bp-cce6f3bd-77ea-4418-a051-4541e0191011.

Top 10 frames of crashing thread:

0 mozglue.dll mozalloc_abort memory/mozalloc/mozalloc_abort.cpp:33
1 xul.dll static void Abort xpcom/base/nsDebugImpl.cpp:439
2 xul.dll NS_DebugBreak xpcom/base/nsDebugImpl.cpp
3 xul.dll nsresult nsDebugImpl::Abort xpcom/base/nsDebugImpl.cpp:133
4 xul.dll XPTC__InvokebyIndex 
5  @0x1d136ce151f 
6 xul.dll trunc 
7 xul.dll trunc 
8 xul.dll static bool XPCWrappedNative::CallMethod js/xpconnect/src/XPCWrappedNative.cpp:1149
9 xul.dll XPC_WN_CallMethod js/xpconnect/src/XPCWrappedNativeJSOps.cpp:946

This looks like a new service worker related shutdown hang on the 10/9 build.

Bug 1456995 landed on October 9.

Flags: needinfo?(perry)
Regressed by: 1456995
Priority: -- → P1

Update: I am working on a fix.

Assignee: nobody → perry
Flags: needinfo?(perry)

This remains off on beta/release 71.

[Parent 3079, Main Thread] ###!!! ABORT: file /builds/worker/workspace/build/src/dom/serviceworkers/ServiceWorkerShutdownBlocker.cpp, line 82
[Parent 3079, Main Thread] ###!!! ABORT: file /builds/worker/workspace/build/src/dom/serviceworkers/ServiceWorkerShutdownBlocker.cpp, line 82
ExceptionHandler::GenerateDump cloned child 28682
ExceptionHandler::SendContinueSignalToChild sent continue signal to child
ExceptionHandler::WaitForContinueSignal waiting for continue signal...
Exiting due to channel error.

This is what is barfing on the terminal for this crash.

OS: Windows 10 → All

I hit this or a similar issue in https://bugzilla.mozilla.org/show_bug.cgi?id=1474608#c17.

The message I got is:
"AsyncShutdown timeout in profile-change-teardown Conditions: [{"name":"ServiceWorkerShutdownBlocker: shutting down Service Workers","state":{"pendingPromises":0,"acceptingPromises":true},"filename":"/builds/worker/workspace/build/src/dom/serviceworkers/ServiceWorkerShutdownBlocker.cpp","lineNumber":82,"stack":"Service Workers shutdown"}] At least one completion condition failed to complete within a reasonable amount of"

I'm not sure if my case can actually occur for general Firefox users, because SWM is usually initialized while loading documents, but my case is:

Some services (e.g. clear data) can initialize SWM through the lazy getter. This becomes an issue if SWM is initialized during the "profile-change-teardown" stage: the shutdown blocker can only be removed after SWM stops accepting promises, and when SWM comes up that late it never gets the chance to stop accepting promises and thus never removes the blocker. Consequently, this introduces the shutdown hang.
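
To illustrate the ordering problem, here is a minimal stand-alone C++ sketch with hypothetical names (not the real ServiceWorkerManager or nsIAsyncShutdown classes): a blocker registered after the teardown notification has already fired is never removed.

#include <functional>
#include <iostream>
#include <vector>

struct ShutdownPhase {
  std::vector<std::function<void()>> blockers;
  bool Finished() const { return blockers.empty(); }
};

struct Manager {
  explicit Manager(ShutdownPhase& aPhase) : mPhase(aPhase) {
    // A blocker is added unconditionally at initialization time.
    mPhase.blockers.push_back([] { std::cout << "still blocking\n"; });
  }
  void OnProfileChangeTeardown() {
    // Stop accepting promises, then drop the blocker.  If the manager is
    // created *after* this notification already fired, this never runs.
    mPhase.blockers.clear();
  }
  ShutdownPhase& mPhase;
};

int main() {
  ShutdownPhase teardown;
  // The "profile-change-teardown" notification has already been broadcast...
  // ...and only now does some service lazily create the manager:
  Manager late(teardown);
  std::cout << (teardown.Finished() ? "shutdown proceeds\n"
                                    : "shutdown hangs on the blocker\n");
  return 0;
}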

This check was put in yesterday, which I believe fixes the issue of SWM being initialized during "profile-change-teardown". However, all of the crash reports I've looked at say the shutdown blocker's state is "acceptingPromises": false, so I believe whatever is going on here is due to something different.

It looks like there could be a race condition where GetTerminationPromise()
is called at the same time on two threads on an empty MozPromiseHolder, and one
of the MozPromises created by MozPromiseHolder::Ensure is leaked (and its
chained callbacks never called).
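
A minimal sketch of that kind of race, in standard C++ with a hypothetical PromiseHolder type (not the actual MozPromiseHolder API). The eventual fix restricts all access to a single thread rather than adding locking:

#include <memory>
#include <thread>

struct Promise {};

struct PromiseHolder {
  // Deliberately not thread-safe, mirroring the assumption that the holder is
  // only ever touched from one thread (the Worker Launcher thread).
  std::shared_ptr<Promise> Ensure() {
    if (!mPromise) {
      mPromise = std::make_shared<Promise>();  // racy check-then-set
    }
    return mPromise;
  }
  std::shared_ptr<Promise> mPromise;
};

int main() {
  PromiseHolder holder;
  std::shared_ptr<Promise> a, b;
  // Two threads call GetTerminationPromise()-style code concurrently.
  std::thread t1([&] { a = holder.Ensure(); });
  std::thread t2([&] { b = holder.Ensure(); });
  t1.join();
  t2.join();
  // Under an unlucky interleaving, a and b are different Promise objects; the
  // one the holder did not keep is effectively leaked, and callbacks chained
  // on it never run.
  return a == b ? 0 : 1;
}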

Additionally, I don't think the ParentStatusProtected() > Running check is
necessary because for ServiceWorkers, the global can't call close(), so the
parent status is at least "Canceling", making the CloseWorkerOnMainThread call
unnecessary (which just calls WorkerPrivate::Cancel). The parent process also
will always send a termination operation as the last operation, so we can just
rely on that to make sure the worker is fully terminated.

(In reply to Perry Jiang [:perry] from comment #6)

This check was put in yesterday, which I believe fixes the issue of SWM being initialized during "profile-change-teardown". However, all of the crash reports I've looked at say the shutdown blocker's state is "acceptingPromises": false, so I believe whatever is going on here is due to something different.

I rebased my patches onto the latest central and ran the test (security/manager/ssl/tests/unit/test_sss_sanitizeOnShutdown.js) again. It seems that the issue hasn't been fixed yet. I suspect that in this case ~ServiceWorkerManager() won't be executed before we hit the shutdown timeout.

Again, I'm not sure whether this can happen for general users, because the ServiceWorkerManager is first initialized by ServiceWorkerCleanUp.jsm, and ServiceWorkerCleanUp is called by the cleardata service (removing storage while shutting down). Also, that test mimics the shutdown process by sending "profile-change-teardown" manually.

I will try to fix that in bug 1474608.
The proper fix for my case that I can think of is: we might want to somehow check the shutdown stage during initialization, and if we're already at "profile-change-teardown", then we probably don't want to add a shutdown blocker. However, since this probably won't happen in the real world, a temporary fix is to manually initialize SWM in that test, so that the blocker can be removed during "profile-change-teardown".
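
A rough sketch of that check, with stand-in names (the real code would consult the browser's actual shutdown-phase state rather than a global):

#include <iostream>

enum class Phase { Running, ProfileChangeTeardown, XpcomShutdown };

// Stand-ins for whatever the real service would consult and call.
Phase gPhase = Phase::ProfileChangeTeardown;
bool gBlockerAdded = false;

void MaybeAddShutdownBlocker() {
  if (gPhase >= Phase::ProfileChangeTeardown) {
    // Too late: the notification that would remove the blocker has already
    // fired, so adding one now could hang shutdown forever.
    return;
  }
  gBlockerAdded = true;
}

int main() {
  MaybeAddShutdownBlocker();
  std::cout << (gBlockerAdded ? "blocker added\n" : "too late, skipped\n");
  return 0;
}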

Pushed by pjiang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1738a6aeb485
restrict RemoteWorkerChild::mTerminationPromise to Worker Launcher thread r=asuth
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla73

Hello Perry! Are there any STR for QA to reproduce/verify this? Thanks!

Flags: needinfo?(perry)

Hi Alexandru, unfortunately I do not think there's a way to reliably verify this.

Flags: needinfo?(perry)

Hi Perry! I still hit this with Firefox 73.0b2 (20200107212705) on Windows 10 x64 when going to https://static-misc-2.glitch.me/blocking-sw/ and restarting the browser. Also, sometimes just visiting https://static-misc-2.glitch.me/blocking-sw/ and letting it run for 10 to 15 minutes without doing anything crashes. Here is the crash report.
Any thoughts on this? Thanks!

Flags: needinfo?(perry)

(In reply to Alexandru Trif, QA [:atrif] from comment #13)

Hi Perry! I still hit this with Firefox 73.0b2 (20200107212705) on Windows 10 x64 when going to https://static-misc-2.glitch.me/blocking-sw/ and restarting the browser. Also, sometimes just visiting https://static-misc-2.glitch.me/blocking-sw/ and letting it run for 10 to 15 minutes without doing anything crashes. Here is the crash report.
Any thoughts on this? Thanks!

The good news is that changes to prevent that from happening just landed in bug 1588838 (I verified with https://static-misc-2.glitch.me/blocking-sw/), but they will need to be uplifted in order to affect 73.

Flags: needinfo?(perry)

ni? myself to check later how much bug 1588838's patch helped.

Flags: needinfo?(perry)

(In reply to Perry Jiang [:perry] from comment #14)

(In reply to Alexandru Trif, QA [:atrif] from comment #13)

Hi Perry! I still hit this with Firefox 73.0b2 (20200107212705) on Windows 10 x64 when going to https://static-misc-2.glitch.me/blocking-sw/ and restarting the browser. Also, sometimes just visiting https://static-misc-2.glitch.me/blocking-sw/ and letting it run for 10 to 15 minutes without doing anything crashes. Here is the crash report.
Any thoughts on this? Thanks!

The good news is that changes to prevent that from happening just landed in bug 1588838 (I verified with https://static-misc-2.glitch.me/blocking-sw/), but they will need to be uplifted in order to affect 73.

Yep, it seems that I can no longer repro the crash by restarting Firefox while the "pending" or "rejected in 60 seconds" message is active, on Firefox 73.0b3 (20200110003145) on Windows 10 x64, macOS 10.15 and Ubuntu 18.04. It seems that bug 1588838 helped here, at least in this particular case. If there are other ways to verify this, please let me know. However, from crash stats it seems that builds with the fix from bug 1588838 still crash.

See Also: → 1608716
Status: RESOLVED → REOPENED
Flags: needinfo?(perry)
Resolution: FIXED → ---
Status: REOPENED → NEW
Target Milestone: mozilla73 → ---

Hi Perry, do we have any further insights into this bug? With the All Hands next week and 73 RC the week after that, we're running out of time to fix this for Fx73. Given the crash rate, I'm worried about shipping SW-e10s with this issue too.

Flags: needinfo?(perry)

I'm working on adding some diagnostics to see what exactly is stalling and causing the timeout.

Flags: needinfo?(perry)
See Also: → 1611522
Pushed by pjiang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4f230654c4cc
propagate content process launch failure r=dom-workers-and-storage-reviewers,sg,asuth

We can optimize for the case where the target content process that a worker
will be put in is still starting and we know that we will be terminating
the worker. Instead of starting up the worker and immediately terminating it,
we can just never start it up. RemoteWorkerManager will use the value of
RemoteWorkerController::IsTerminated to determine this.
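
A hedged sketch of the shape of that optimization, using hypothetical names (the real code lives in RemoteWorkerManager and RemoteWorkerController):

#include <iostream>

struct RemoteWorkerControllerSketch {
  bool mTerminated = false;
  bool IsTerminated() const { return mTerminated; }
};

void MaybeSpawnWorker(const RemoteWorkerControllerSketch& aController) {
  if (aController.IsTerminated()) {
    // The worker is already known to be doomed; spawning it only to tear it
    // down again would waste the (still in-flight) content process launch.
    std::cout << "skipping spawn of already-terminated worker\n";
    return;
  }
  std::cout << "spawning worker\n";
}

int main() {
  RemoteWorkerControllerSketch controller;
  controller.mTerminated = true;  // e.g. termination arrived during launch
  MaybeSpawnWorker(controller);
  return 0;
}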

Pushed by bugmail@asutherland.org:
https://hg.mozilla.org/integration/autoland/rev/a307dba8bddd
don't spawn ServiceWorker if it's known it'll be immediately terminated r=dom-workers-and-storage-reviewers,asuth
Flags: needinfo?(perry)
Flags: needinfo?(bugmail)
Flags: needinfo?(perry)
See Also: → 1612690

I suspect https://bugzilla.mozilla.org/show_bug.cgi?id=1612690 may be contributing to this. I'll land some of the changes from that bug and see what happens.

Pushed by pjiang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/82ed8800ab5d
move SW shutdown reporting on IPDL thread closer to when the message leaves the thread r=asuth
Attachment #9127861 - Attachment description: Bug 1588152 - use timer to unblock ServiceWorker shutdown WIP → Bug 1588152 - allow unclean ServiceWorker shutdown on beta/release builds
Attachment #9127861 - Attachment description: Bug 1588152 - allow unclean ServiceWorker shutdown on beta/release builds → Bug 1588152 - allow unclean ServiceWorker shutdown on beta/release builds r=asuth
Pushed by pjiang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c7eb14fec4cf
allow unclean ServiceWorker shutdown on beta/release builds r=asuth
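
For context, the general shape of the "allow unclean ServiceWorker shutdown on beta/release builds" patch is roughly the following. This is a sketch only, using standard C++ futures instead of the real MozPromise/timer machinery:

#include <chrono>
#include <future>
#include <iostream>

int main() {
  // Stand-in for the promise that resolves once all ServiceWorkers have
  // terminated.
  std::promise<void> allWorkersTerminated;
  std::future<void> done = allWorkersTerminated.get_future();

  // On beta/release, don't wait forever: give up after a grace period and
  // remove the shutdown blocker anyway.
  const auto grace = std::chrono::seconds(2);
  if (done.wait_for(grace) == std::future_status::ready) {
    std::cout << "clean shutdown: all workers terminated\n";
  } else {
    // Accept an unclean (but not data-corrupting) shutdown rather than an
    // AsyncShutdownTimeout crash.
    std::cout << "timeout: unblocking shutdown anyway\n";
  }
  return 0;
}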

Comment on attachment 9127861 [details]
Bug 1588152 - allow unclean ServiceWorker shutdown on beta/release builds r=asuth

Beta/Release Uplift Approval Request

  • User impact if declined: There will be shutdown hangs due to Service Workers taking too long or stalling when terminating. This will only happen when the browser closes, so it is probably better to have an unclean shutdown (not in the sense that any data will be corrupted) rather than a crash on beta/release.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Medium
  • Why is the change risky/not risky? (and alternatives if risky): Conceptually it's not risky because it just unblocks browser shutdown after a reasonable period of time. I would say it's medium risk because the bug has not been reproducible locally, on treeherder, or by QA, so the specific situation(s) causing the crash have not been tested with the patch.
  • String changes made/needed:
Attachment #9127861 - Flags: approval-mozilla-beta?

Comment on attachment 9127861 [details]
Bug 1588152 - allow unclean ServiceWorker shutdown on beta/release builds r=asuth

Fixes an important crasher on beta caused by sw-e10s, uplift approved for 74.0b8 given that we need better stability of sw-e10s to ship it, thanks.

Attachment #9127861 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

ni? to look at why there are still crashes

Flags: needinfo?(bugmail) → needinfo?(perry)
Flags: needinfo?(perry)
Priority: P1 → P3
Assignee: perry → nobody

[Tracking Requested - why for this release]:
the volume of this signature is noticeably increasing in Firefox 80.0b

Jens, can someone take another look at this one (see comment 42)?

Flags: needinfo?(jstutte)

Changing the priority to P1 as the bug is tracked by a release manager for the current beta.
See "What Do You Triage" for more information.

Priority: P3 → P1
Flags: needinfo?(bugmail)

This spiked up on August 6th in betas 3 and 4. If you look at beta 5 and up, the volume is down; not to where it was before 80 hit beta, but the spike has definitely subsided. In betas 3 and 4 the signature was substantially elevated on Windows 7 machines, with a high crash rate and a relatively low user count. In subsequent betas the crash rate on Windows 7 reverted to a rate on par with other Windows versions. So it would seem that something might have landed in beta 5 or 6. Unfortunately I don't see any suspects in the push logs, but I'm posting here in case someone else notices something.

Generally though, the spike we had in betas 3 and 4 seems to have been addressed in some way.

beta 3: 2e5115ca18805d9d5176423769cbf6fa8b58d77f
https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=2e5115ca18805d9d5176423769cbf6fa8b58d77f&tochange=2f7d9a147be6a405b200b05cabf764103f437c88
buildid: 20200803045446

beta 4: 2f7d9a147be6a405b200b05cabf764103f437c88
https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=2f7d9a147be6a405b200b05cabf764103f437c88&tochange=3649c4dd0f883b3487fa307b1fd036349a09209e
buildid: 20200804180257

beta 5: 3649c4dd0f883b3487fa307b1fd036349a09209e
https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=2f7d9a147be6a405b200b05cabf764103f437c88&tochange=3649c4dd0f883b3487fa307b1fd036349a09209e
buildid: 20200806203447

beta 6: 6fa6624abc9cee1610dc4ececdd23115edcddd22
https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=3649c4dd0f883b3487fa307b1fd036349a09209e&tochange=6fa6624abc9cee1610dc4ececdd23115edcddd22
buildid: 20200807195315

80.0b7 has quite a bunch of crashes again. I don't think this is addressed.

Talked with Jens; I'm going to try to dig into this again to understand the uptick in beta hangs. Leaving my needinfo up so :jcristau gets a notification when I put up a proper response.

Flags: needinfo?(jstutte)

Looks like the backout in bug 1656261 resolved this for 80.0rc2. Volume is way down.

Volume is low but not zero. For now, I am going to lower the severity.

Severity: critical → S3
Flags: needinfo?(bugmail)
Priority: P1 → P2

With many, many, MANY thanks to :whimboo providing the pernosco trace https://pernos.co/debug/Uq1o5n_s5wLvmwUfY60jHg/index.html in https://bugzilla.mozilla.org/show_bug.cgi?id=1651542#c91 it appears that the situation here is deadlock between our use of an nsIAsyncShutdownClient on "profile-change-teardown" and our (dynamically arrived at) use of a normal observer on "profile-change-teardown". If the async shutdown service gets notified first, it results in deadlock.

The basic scenario is:

  • For parent intercept, we wanted to change the point at which we started shutdown of ServiceWorkerManager via MaybeStartShutdown. Originally this was xpcom-shutdown, but that was way too late for parent-intercept mode with remote workers, so we moved it up to "profile-change-teardown", although I think there was some fuzziness around which observer was most optimal. The observer was a pre-existing mechanism that we were reusing, and we would leave it using xpcom-shutdown in non-parent-intercept mode.
  • We also added a shutdown blocker because that's the right mechanism.
  • https://phabricator.services.mozilla.com/D26166#1213324 is my restatement of this situation.

At a meta level, I think the general situation was that we were concerned about changing the child intercept behavior and although both of these changes landed in the same patch, the evolution of the implementation obscured things somewhat, with the GetStartShutdownTopic() indirection obscuring that we'd landed on using the same observer notification for both. (I vaguely recall in a longer video discussion having them be different phases but then reality aligned them to be the same, but no red flags went up.)

From a fix perspective, we no longer support child intercept so it's straightforward for us to eliminate the use of observer notifications entirely.

Assignee: nobody → bugmail
Blocks: 1651542
Status: NEW → ASSIGNED

Under parent intercept, we had both an nsIAsyncShutdownClient and a manual
observer notification used by ServiceWorkerManager for shutdown. (Under
child intercept, the observer notification used a different phase, but we no
longer support child intercept.) If the observer notification didn't fire
prior to the shutdown blocker, deadlock would result at shutdown.

This patch addresses the problem by changing us to only use the shutdown blocker
for profile-change-teardown, with it taking on the responsibility of calling
the relevant ServiceWorkerManager shutdown logic.

We do leave a finish shutdown observer notification in place. It's added
dynamically during shutdown (as long as we can get the observer service)
which I believe should allow the observer to run prior to the QuotaManager
profile-before-change-qm receiving its notification due to the reverse
observer notification. But that isn't required to avoid deadlocking.

As part of these changes, I also removed the configurable logic around shutdown
that existed to support child intercept. Note that this is not a comprehensive
cleanup, just a normalization of shutdown to require/assume parent intercept.
The most notable aspect of this change is explicitly bailing out of
ServiceWorkerManager initialization if we are not in the parent process.
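
In outline, the fix changes which component drives the teardown. A simplified stand-alone sketch with hypothetical names (the real classes are ServiceWorkerManager and ServiceWorkerShutdownBlocker):

#include <iostream>

struct ServiceWorkerManagerSketch {
  bool mShutdownStarted = false;
  void MaybeStartShutdown() { mShutdownStarted = true; }
};

struct ShutdownBlockerSketch {
  ServiceWorkerManagerSketch& mSWM;
  void BlockShutdown() {
    // The blocker no longer assumes that a separate "profile-change-teardown"
    // observer already kicked off the teardown; it starts the teardown itself
    // and (in the real code) then waits on the workers' termination promises
    // before removing itself.
    mSWM.MaybeStartShutdown();
    std::cout << "teardown driven by the blocker itself\n";
  }
};

int main() {
  ServiceWorkerManagerSketch swm;
  ShutdownBlockerSketch blocker{swm};
  // Regardless of whether any observer ran first, shutdown can no longer
  // deadlock on the ordering between the blocker and an observer.
  blocker.BlockShutdown();
  return 0;
}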

Assignee: bugmail → ytausky
Pushed by bugmail@asutherland.org:
https://hg.mozilla.org/integration/autoland/rev/d799c0698b88
Eliminate redundant profile-change-teardown observer. r=asuth
Regressions: 1674564

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #50)

With many, many, MANY thanks to :whimboo providing the pernosco trace https://pernos.co/debug/Uq1o5n_s5wLvmwUfY60jHg/index.html in https://bugzilla.mozilla.org/show_bug.cgi?id=1651542#c91 it appears that the situation here is deadlock between our use of an nsIAsyncShutdownClient on "profile-change-teardown" and our (dynamically arrived at) use of a normal observer on "profile-change-teardown". If the async shutdown service gets notified first, it results in deadlock.

After the fix landed I triggered a new Puppeteer job, and it looks all fine now. No more hangs / crashes occur during shutdown:
https://treeherder.mozilla.org/jobs?repo=try&revision=7ea49095586ab0c730817c861894ac6ed26cde1b&searchStr=pup

Andrew, while having this fix landed it still looks like we have crashes with the latest Nightly. Mind having a look?

Flags: needinfo?(bugmail)

That's great news, thank you! That means the deadlock situation would seem to be addressed.

The new crash reports' states show that we're now actually waiting for ServiceWorkers that haven't shut down. Aggregating on the async timeout reason, we see the following scenarios:

  • {"shutdownStates":"content process main thread, ","pendingPromises":1,"acceptingPromises":false}
  • {"shutdownStates":"parent process IPDL background thread, ","pendingPromises":1,"acceptingPromises":false}
  • {"shutdownStates":"parent process IPDL background thread, parent process IPDL background thread, ", "pendingPromises":2,"acceptingPromises":false}

This requires more analysis than I can perform in the next few days. I wonder if it makes sense to mark this bug fixed and spin off a new bug with the same crash signature to look at these specific situations. Yaron?

Flags: needinfo?(bugmail) → needinfo?(ytausky)

I opened bug 1675068 for this.

Status: ASSIGNED → RESOLVED
Closed: 1 year ago → 4 months ago
Flags: needinfo?(ytausky)
Resolution: --- → FIXED

Shall we get the fix for the deadlock uplifted to 83 beta?

Flags: needinfo?(ytausky)
Target Milestone: --- → 84 Branch

We can try. I'll request an uplift.

Flags: needinfo?(ytausky)

Comment on attachment 9183919 [details]
Bug 1588152 - Eliminate redundant profile-change-teardown observer. r=#dom-workers-and-storage

Beta/Release Uplift Approval Request

  • User impact if declined: This is a shutdown hang, so the user impact is not that significant.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: Bug 1674564
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This change removes a redundant observer subscription that was causing a deadlock, without introducing new machinery.
  • String changes made/needed:
Attachment #9183919 - Flags: approval-mozilla-beta?

Comment on attachment 9183919 [details]
Bug 1588152 - Eliminate redundant profile-change-teardown observer. r=#dom-workers-and-storage

Given that we shipped a dozen releases with this bug, that it is an S3, and that we are at the end of betas, I prefer that we let it ride the trains, thanks.

Attachment #9183919 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
Regressions: 1680723