Open Bug 1588152 Opened 3 months ago Updated 4 days ago

Crash in [@ AsyncShutdownTimeout | profile-change-teardown | ServiceWorkerShutdownBlocker: shutting down Service Workers]

Categories

(Core :: DOM: Service Workers, defect, P1, critical)

Unspecified
All
defect

Tracking

()

Tracking Status
firefox-esr68 --- unaffected
firefox69 --- unaffected
firefox70 --- unaffected
firefox71 --- disabled
firefox72 --- disabled
firefox73 + affected
firefox74 + affected

People

(Reporter: mccr8, Assigned: perry)

References

(Regression)

Details

(Keywords: crash, leave-open, regression)

Crash Data

Attachments

(1 file)

This bug is for crash report bp-cce6f3bd-77ea-4418-a051-4541e0191011.

Top 10 frames of crashing thread:

0 mozglue.dll mozalloc_abort memory/mozalloc/mozalloc_abort.cpp:33
1 xul.dll static void Abort xpcom/base/nsDebugImpl.cpp:439
2 xul.dll NS_DebugBreak xpcom/base/nsDebugImpl.cpp
3 xul.dll nsresult nsDebugImpl::Abort xpcom/base/nsDebugImpl.cpp:133
4 xul.dll XPTC__InvokebyIndex 
5  @0x1d136ce151f 
6 xul.dll trunc 
7 xul.dll trunc 
8 xul.dll static bool XPCWrappedNative::CallMethod js/xpconnect/src/XPCWrappedNative.cpp:1149
9 xul.dll XPC_WN_CallMethod js/xpconnect/src/XPCWrappedNativeJSOps.cpp:946

This looks like a new service worker related shutdown hang on the 10/9 build.

Bug 1456995 landed on October 9.

Flags: needinfo?(perry)
Regressed by: 1456995
Priority: -- → P1

Update: I am working on a fix.

Assignee: nobody → perry
Flags: needinfo?(perry)

This remains off on beta/release 71.

[Parent 3079, Main Thread] ###!!! ABORT: file /builds/worker/workspace/build/src/dom/serviceworkers/ServiceWorkerShutdownBlocker.cpp, line 82
[Parent 3079, Main Thread] ###!!! ABORT: file /builds/worker/workspace/build/src/dom/serviceworkers/ServiceWorkerShutdownBlocker.cpp, line 82
ExceptionHandler::GenerateDump cloned child 28682
ExceptionHandler::SendContinueSignalToChild sent continue signal to child
ExceptionHandler::WaitForContinueSignal waiting for continue signal...
Exiting due to channel error.

I what is barfing on the terminal for this crash.

OS: Windows 10 → All

I hit this or a similar issue in https://bugzilla.mozilla.org/show_bug.cgi?id=1474608#c17.

The message I got is:
"AsyncShutdown timeout in profile-change-teardown Conditions: [{"name":"ServiceWorkerShutdownBlocker: shutting down Service Workers","state":{"pendingPromises":0,"acceptingPromises":true},"filename":"/builds/worker/workspace/build/src/dom/serviceworkers/ServiceWorkerShutdownBlocker.cpp","lineNumber":82,"stack":"Service Workers shutdown"}] At least one completion condition failed to complete within a reasonable amount of"

I'm not sure if mine case can acutal occur in general Firefox users because SWM is usually initialized while loading documents, but my case is:

Some services (e.g. clear data) would initialize SWM through the lazy getter. It becomes an issue if the SWM is initialized during the "profile-change-teardown" stage because the shutdown block can only be removed after stopping accepting the promise, then it won't have a chance to stop accepting promise and thus remove the blocker. Consequently, it introduces the shutdown hang.

This check was put in yesterday which I believe fixes the issue when SWM is initialized during "profile-change-teardown". All of the crash reports I've looked at say the shutdown blocker's state is "acceptingPromises": false, though, so I believe whatever is going on here is due to something different.

It looks like there could be a race condition where GetTerminationPromise()
is called at the same time on two threads on an empty MozPromiseHolder, and one
of the MozPromises created by MozPromiseHolder::Ensure if leaked (and its
chained callbacks never called).

Additionally, I don't think the ParentStatusProtected() > Running check is
necessary because for ServiceWorkers, the global can't call close(), so the
parent status is at least "Canceling", making the CloseWorkerOnMainThread call
unnecessary (which just calls WorkerPrivate::Cancel). The parent process also
will always send a termination operation as the last operation, so we can just
rely on that to make sure the worker is fully terminated.

(In reply to Perry Jiang [:perry] from comment #6)

This check was put in yesterday which I believe fixes the issue when SWM is initialized during "profile-change-teardown". All of the crash reports I've looked at say the shutdown blocker's state is "acceptingPromises": false, though, so I believe whatever is going on here is due to something different.

I rebased my patches to the latest central and ran the test (security/manager/ssl/tests/unit/test_sss_sanitizeOnShutdown.js) again. It seems that the issue hasn't been fixed yet. I suspect in that case ~ServiceWorkerManager() won't be executed before it hit the Shutdown timed out.

Again, I'm not sure whether it can happen in general users because the ServiceWorkerManager is firstly initialized by ServiceWorkerCleanUp.jsm and ServiceWorkerCleanUp is called by cleardata service (removing storage while shutting down). Also, that test mimics the shutdown process by sending the "profile-change-teardown" manually.

I will try to fix that in bug 1474608.
The proper fix for my case I can think of is that:
It seems that we might want to somehow check the shutdown stage during the initialization. If it's at the "profile-change-teardown", then we probably don't want to add a shutdown blocker. Or, However, since this probably won't happen in the real world, a temporary fix is to maunally initialize SWM in that test. So that, the blocker can be removed during "profile-change-teardown".

Pushed by pjiang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1738a6aeb485
restrict RemoteWorkerChild::mTerminationPromise to Worker Launcher thread r=asuth
Status: NEW → RESOLVED
Closed: Last month
Resolution: --- → FIXED
Target Milestone: --- → mozilla73

Hello Perry! Are there any STR for QA to reproduce/ verify this? Thank's!

Flags: needinfo?(perry)

Hi Alexandru, unfortunately I do not think there's a way to reliably verify this.

Flags: needinfo?(perry)

Hi Perry! I still hit this with Firefox 73.0b2 (20200107212705) on Windows 10x64 when going to https://static-misc-2.glitch.me/blocking-sw/ and restart the browser. Also sometimes just visiting https://static-misc-2.glitch.me/blocking-sw/ and let it run for like 10 to 15 minutes without doing anything crashes. Here is the crash report.
Any thoughts on this? Thank's!

Flags: needinfo?(perry)

(In reply to Alexandru Trif, QA [:atrif] from comment #13)

Hi Perry! I still hit this with Firefox 73.0b2 (20200107212705) on Windows 10x64 when going to https://static-misc-2.glitch.me/blocking-sw/ and restart the browser. Also sometimes just visiting https://static-misc-2.glitch.me/blocking-sw/ and let it run for like 10 to 15 minutes without doing anything crashes. Here is the crash report.
Any thoughts on this? Thank's!

The good news is that changes to prevent that from happening just got landed in bug 1588838 (I verified with https://static-misc-2.glitch.me/blocking-sw/), but it will need to be uplifted in order to affect 73.

Flags: needinfo?(perry)

ni? myself to check later how much bug 1588838's patch helped.

Flags: needinfo?(perry)

(In reply to Perry Jiang [:perry] from comment #14)

(In reply to Alexandru Trif, QA [:atrif] from comment #13)

Hi Perry! I still hit this with Firefox 73.0b2 (20200107212705) on Windows 10x64 when going to https://static-misc-2.glitch.me/blocking-sw/ and restart the browser. Also sometimes just visiting https://static-misc-2.glitch.me/blocking-sw/ and let it run for like 10 to 15 minutes without doing anything crashes. Here is the crash report.
Any thoughts on this? Thank's!

The good news is that changes to prevent that from happening just got landed in bug 1588838 (I verified with https://static-misc-2.glitch.me/blocking-sw/), but it will need to be uplifted in order to affect 73.

Yep, it seems that I can no longer repro the crash restarting Firefox while "pending" or "rejected in 60 seconds" message is active on Firefox 73.0b3 (20200110003145) on Windows 10x64, macOS 10.15 and Ubuntu 18.04. It seems that bug 1588838 helped here at least on this particular case. If there are other ways to verify this please let me know. From crash stats, it seems that builds with the fix from bug 1588838 still crashes.

See Also: → 1608716
Status: RESOLVED → REOPENED
Flags: needinfo?(perry)
Resolution: FIXED → ---
Duplicate of this bug: 1608716
Status: REOPENED → NEW
Target Milestone: mozilla73 → ---
Keywords: leave-open
You need to log in before you can comment on or make changes to this bug.