Open Bug 1602393 Opened 3 years ago Updated 20 days ago

Crash in [@ AsyncShutdownTimeout | profile-before-change | ServiceWorkerRegistrar: Flushing data]

Categories

(Core :: DOM: Service Workers, defect, P3)

Unspecified
Linux
defect

Tracking

()

People

(Reporter: whimboo, Unassigned)

References

Details

(Keywords: crash, leave-open)

Crash Data

Attachments

(4 files)

We got this report from a user running Selenium tests with geckodriver inside a docker image, and by running Firefox in headless mode (https://github.com/mozilla/geckodriver/issues/1663):

This bug is for crash report bp-804666b6-3adb-4e4b-a607-25c610191209.

Top 10 frames of crashing thread:

0 firefox-bin mozalloc_abort memory/mozalloc/mozalloc_abort.cpp:33
1 libxul.so Abort xpcom/base/nsDebugImpl.cpp:439
2 libxul.so NS_DebugBreak xpcom/string/nsSubstring.cpp
3 libxul.so nsDebugImpl::Abort xpcom/base/nsDebugImpl.cpp:133
4 libxul.so NS_InvokeByIndex 
5 libxul.so XPCWrappedNative::CallMethod js/xpconnect/src/XPCWrappedNative.cpp:1149
6 libxul.so XPC_WN_CallMethod js/xpconnect/src/XPCWrappedNativeJSOps.cpp:943
7 libxul.so js::InternalCallOrConstruct js/src/vm/Interpreter.cpp:539
8 libxul.so Interpret js/src/vm/Interpreter.cpp:594
9 libxul.so js::RunScript js/src/vm/Interpreter.cpp:424

It happens for him while navigating to http://google.com, specifically between the DOMContentLoaded and pageshow events.

Interesting are the app notes in the bug report:

Ubuntu 18.04.2 LTSFP(D00-L1000-W00000000-T000) GLXtest process failed (exited with status 1): Unable to open a connection to the X server

WR? WR- OMTP? OMTP+4 xpcom_runtime_abort(###!!! ABORT: file /builds/worker/workspace/build/src/dom/serviceworkers/ServiceWorkerRegistrar.cpp, line 1111)

The crash rate seems rather high, but it doesn't happen on Nightly. Maybe it's fixed by the new ServiceWorker?

Priority: -- → P2

When did that land? I could ask the reporter to check those specific nightly builds (before and after the landing).

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #2)

When did that land? I could ask the reporter to check those specific nightly builds (before and after the landing).

I meant the pref (dom.serviceWorkers.parent_intercept). There are lots of changes protected by that pref. It's turned on by default on Nightly only and it's tracked in https://bugzilla.mozilla.org/show_bug.cgi?id=1588154.

As the reporter just replied the headless problem is fixed for him in Firefox 71 while it crashes in 70. So maybe we should just remove this use-case here and keep this bug more general?

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #4)

As the reporter just replied the headless problem is fixed for him in Firefox 71 while it crashes in 70. So maybe we should just remove this use-case here and keep this bug more general?

Sounds good to me!

There seems to have multiple causes because we still have crashes on 72.0b3 (AsyncShutdownTimeout | profile-before-change | ServiceWorkerRegistrar: Flushing data). And, apparently, it's another issue since the reporter replied it's fixed for him in 71.

From the latest crash reports, it seems we have added a shutdown blocker that has never been removed until it gets into our way.
:asuth, can you clarify next steps here?

Flags: needinfo?(bugmail)

Looking closer, I see we add the blocker with a string literal:

rv = GetShutdownPhase()->AddBlocker(
      this, NS_LITERAL_STRING(__FILE__), __LINE__,
      NS_LITERAL_STRING("ServiceWorkerRegistrar: Flushing data"));

but the only place we have the same literal is the function

NS_IMETHODIMP
ServiceWorkerRegistrar::GetName(nsAString& aName) {
  aName = NS_LITERAL_STRING("ServiceWorkerRegistrar: Flushing data");
  return NS_OK;
}

which seems to have some overloads. I have no evidence, that this causes the crash, but in any case we should use the GetName function to add the blocker instead of literals anyway, I'd say.

Flags: needinfo?(perry)
Flags: needinfo?(bugmail)

As discussed on Zoom, AddBlocker's 4th parameter and nsIAsyncShutdownClient::GetName are only used to add crash metadata, so they are unlikely to be related to the crash itself, but the string duplication should be removed.

Flags: needinfo?(perry)
Assignee: nobody → perry
Status: NEW → ASSIGNED
Pushed by pjiang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/237b6fc723ea
remove ServiceWorkerRegistrar string duplication r=dom-workers-and-storage-reviewers,asuth

As far as I can tell, ServiceWorkerRegistrar shutdown hang could only occur if
writing data to disk stalls or the shutdown message from the main thread to
background thread doesn't complete. (Although I still cannot piece together the
exact possible sequence of events that would lead to these hangs.)

Pushed by pjiang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/43921ad138e3
add debugging info to ServiceWorkerRegistrar shutdown blocking r=asuth

The crash reports that include the changes from D78222 show "shuttingDown":false in the metadata (e.g. https://crash-stats.mozilla.org/report/index/074c3341-1168-4921-9fe2-980530200708#tab-metadata). This most definitely means that ServiceWorkerRegistrar::Shutdown was never called by BackgroundParentImpl::RecvShutdownServiceWorkerRegistrar. The best explanation right now for why this hang is happening is that the IPDL background thread is too busy to issue the IPDL call before the shutdown blocker timeout/timer expires.

Assignee: perry → nobody
Status: ASSIGNED → NEW

The leave-open keyword is there and there is no activity for 6 months.
:jstutte, maybe it's time to close this bug?

Flags: needinfo?(jstutte)

This is supposed to be fixed by rolling out LSNG, see bug 1711922.

Flags: needinfo?(jstutte)
See Also: → 1711922

The leave-open keyword is there and there is no activity for 6 months.
:jstutte, maybe it's time to close this bug?

Flags: needinfo?(jstutte)

There are still quite some instances here. Looking at ServiceWorkerRegistrar::MaybeScheduleShutdownCompleted I read:

  if (mRunnableDispatched || !mShuttingDown) {
    return;
  }

  RefPtr<Runnable> runnable =
      NewRunnableMethod("dom::ServiceWorkerRegistrar::ShutdownCompleted", this,
                        &ServiceWorkerRegistrar::ShutdownCompleted);
  MOZ_ALWAYS_SUCCEEDS(NS_DispatchToMainThread(runnable.forget()));

So we dispatch the ServiceWorkerRegistrar::ShutdownCompleted only if there is no mRunnableDispatched. This was introduced by bug 1452373 apparently to protect serviceworker.txt from multiple writes.

IIUC this can lead to a situation where we just never unblock our shutdown blocker though we did all expected work. I am also wondering if the way we protect serviceworker.txt from parallel writes could not be improved a bit, as just dropping events seems bad in general.

Flags: needinfo?(jstutte) → needinfo?(bugmail)

Hmm, looking at the annotations of the crashes I see the AsyncShutdownTimeout:

    {
        "name": "ServiceWorkerRegistrar: Flushing data",
        "state": {
            "saveDataRunnableDispatched": false,
            "shuttingDown": false
        },
        "filename": "/builds/worker/checkouts/gecko/dom/serviceworkers/ServiceWorkerRegistrar.cpp",
        "lineNumber": 1240,
        "stack": "ServiceWorkerRegistrar: Flushing data"
    }

with no sign of shutdown at all (and no saveDataRunnableDispatched, too). So mRunnableDispatched seems not involved here. There might be a minimal chance that child->SendShutdownServiceWorkerRegistrar() failed and we never entered shutdown. We are coming directly from ServiceWorkerRegistrar::BlockShutdown and I assume we can expect the async shutdown blocker to have called BlockShutdown if we see ourselves as a blocker in the annotation.

Assignee: nobody → jstutte
Status: NEW → ASSIGNED
Flags: needinfo?(bugmail)
Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6d7b934ff5aa
Rename mRunnableDispatched to mSaveDataRunnableDispatched and crash on shutdown message dispatch failure in ServiceWorkerRegistrar r=dom-worker-reviewers,jesup
Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/099bc1f51e2f
Fake ShutdownComplete() for all error conditions of ServiceWorkerRegistrar::ProfileStopped. r=dom-worker-reviewers,jesup

Unassigning as I do not have other plans here for now. The last patch here will land in release 102, such that we won't know before if it helped.

Assignee: jstutte → nobody
Status: ASSIGNED → NEW
Severity: critical → S2

As of today this accounts for only 0,12% of all AsyncShutdownTimeouts. In general it seems that we track even much more frequent shutdown hangs as S3.

Severity: S2 → S3
Priority: P2 → P3
You need to log in before you can comment on or make changes to this bug.