Closed Bug 1594572 Opened 1 year ago Closed 6 months ago

RuntimeService::Cleanup hangs content processes when a worker is registered a few milliseconds before shutdown

Categories: Core :: DOM: Workers, defect, P2

Tracking: RESOLVED FIXED in mozilla77; Fission Milestone M7
Tracking Status: firefox77 --- fixed

People: (Reporter: kmag, Assigned: kmag)

References: (Blocks 4 open bugs)

Details

Attachments: (1 file)

We started running into this for WebExtension reader mode tests with Fission enabled. It's fairly easy to reproduce locally in debug builds with:

./mach mochitest --enable-fission --verify browser/components/extensions/test/browser/browser_ext_tabs_readerMode.js

It may take a couple of runs, but eventually a content process will hang during shutdown at https://searchfox.org/mozilla-central/rev/d061ba55ac76f41129618d638f4ef674303ec103/dom/workers/RuntimeService.cpp#1844 waiting for mDomainMap to empty, with the sole remaining worker thread sitting idle in its main event loop. It appears to happen whenever RuntimeService::RegisterWorker is called a few milliseconds before RuntimeService::Shutdown.
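The shape of the hang can be sketched in a greatly simplified, self-contained model (hypothetical names; the real RuntimeService::Cleanup spins the main thread's XPCOM event loop rather than waiting on a condition variable, and here a deadline stands in so the sketch terminates instead of hanging):

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <map>
#include <mutex>
#include <string>

// Simplified model of the shutdown wait in RuntimeService::Cleanup.
struct MockRuntimeService {
  std::mutex mLock;
  std::condition_variable mCondVar;
  std::map<std::string, int> mDomainMap;  // domain -> live worker count

  void RegisterWorker(const std::string& domain) {
    std::lock_guard<std::mutex> lock(mLock);
    ++mDomainMap[domain];
  }

  void UnregisterWorker(const std::string& domain) {
    std::lock_guard<std::mutex> lock(mLock);
    if (--mDomainMap[domain] == 0) {
      mDomainMap.erase(domain);
    }
    mCondVar.notify_all();
  }

  // Returns true if all workers went away before the deadline, false if we
  // "hung": the bug is that a worker registered milliseconds before shutdown
  // never unregisters, so mDomainMap never empties.
  bool Cleanup(std::chrono::milliseconds deadline) {
    std::unique_lock<std::mutex> lock(mLock);
    return mCondVar.wait_for(lock, deadline,
                             [this] { return mDomainMap.empty(); });
  }
};

bool SimulateCleanShutdown() {
  MockRuntimeService rs;
  rs.RegisterWorker("example.com");
  rs.UnregisterWorker("example.com");  // worker ran and shut down normally
  return rs.Cleanup(std::chrono::milliseconds(50));
}

bool SimulateBuggyShutdown() {
  MockRuntimeService rs;
  rs.RegisterWorker("example.com");  // registered just before shutdown...
  // ...but its primary runnable bailed out and never unregistered it.
  return rs.Cleanup(std::chrono::milliseconds(50));
}
```

In the real code there is no deadline on this path, which is why the content process sits frozen until the terminator watchdog kills it.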

If you happen to have an rr trace that you could pernosco-submit, or if the pernosco bot spammed you with a pernosco link from a try push, or if there's a self-serve link that can trigger that, providing a pernosco link/self-serve link would be appreciated to save some cycles. (Thanks!)

I think this is likely related to bug 988872, if it's not the same. Even though this is currently happening in content processes, it seems possible that chrome workers could create the same problem in the parent process, so I'm also blocking the worker shutdown bug, bug 1435343.

Blocks: 1435343
Priority: -- → P2
See Also: → 988872

"PERF" key word?

(In reply to Worcester12345 from comment #2)

"PERF" key word?

I don't think that's applicable in this case, as it sounds like this is a race condition at shutdown, and I don't think the perf team would normally investigate or fix this; the workers & storage team would.

The context here is that Fission mochitests are hitting frequent shutdown hangs (bug 1358898). One of the directories where this is happening is browser/components/extensions/test/browser/, so maybe it is the same thing as the issue Kris described in comment 0. With Fission, we no longer preserve content processes during tests, so we have many more content processes, which presumably leads to many more opportunities for shutdown races.

Sorry, I don't have either an rr trace or a Pernosco trace. Andrew McCreight came across this, and I tracked down the cause by attaching gdb to a hung content process and then adding debug logs to confirm the scenarios it happens in.

Andrew may have a link to a try run that can be used with Pernosco, though.

Flags: needinfo?(continuation)

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #1)

If you happen to have an rr trace that you could pernosco-submit, or if the pernosco bot spammed you with a pernosco link from a try push, or if there's a self-serve link that can trigger that, providing a pernosco link/self-serve link would be appreciated to save some cycles. (Thanks!)

I don't know how to do any of that, but here's a try run with the failure (which may not be the exact scenario Kris describes):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=cd52be5a68483db6fca8ad34f53f15e2447165b1

Two of the four failures have the hang I'm talking about, which manifests as application crashed [@ mozilla::(anonymous namespace)::RunWatchdog(void*)].

Flags: needinfo?(continuation)

Hm, so that's weird, because the Watchdog has a workers-specific crash that produces output different from what's happening in those failures.

Specifically, the Watchdog makes an explicit call to RuntimeService::CrashIfHanging, which at the top checks whether mDomainMap is empty. If it's empty, it returns early to allow the crash to carry the string "Shutdown too long, probably frozen, causing a crash." that we're seeing here. If it's not empty, the crash message will start with "Workers Hanging".
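That branching can be sketched as follows (illustrative only: the function name approximates the real RuntimeService::CrashIfHanging, and the "Workers Hanging" message format here is invented; the real annotation includes per-worker detail):

```cpp
#include <cassert>
#include <map>
#include <string>

// Sketch of the check asuth describes: if no workers remain registered,
// bail out early so the generic terminator message is what gets reported;
// otherwise the crash string identifies the hanging workers.
std::string CrashAnnotation(const std::map<std::string, int>& domainMap) {
  if (domainMap.empty()) {
    // Early return: the generic terminator crash string is used.
    return "Shutdown too long, probably frozen, causing a crash.";
  }
  // Workers still registered: the crash string starts with "Workers Hanging".
  return "Workers Hanging - " + std::to_string(domainMap.size()) +
         " domain(s) with live workers";
}
```

So a crash whose message is the generic "Shutdown too long" string implies mDomainMap was empty in *that* process at watchdog time, which is why the try failures above don't directly implicate this bug.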

The crashes at https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=274151077&repo=try&lineNumber=36610 and https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=274154705&repo=try&lineNumber=36743 indeed don't seem to have this signature. Instead the problems seem to be at ~BrowserProcessSubThread in both: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=274151077&repo=try&lineNumber=36659 and https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=274154705&repo=try&lineNumber=36792 ?

Do you have other try failures that more directly implicate this bug?

Flags: needinfo?(continuation)

For what it's worth, when it hung locally, I never actually got a crash dump for the "Hit MOZ_CRASH(Shutdown too long, probably frozen, causing a crash.) at /builds/worker/workspace/build/src/toolkit/components/terminator/nsTerminator.cpp:213" message, for some reason, but when it happened, I attached to the remaining frozen content process and found it frozen in worker cleanup, as I described, every time. It's possible that the worker shutdown hang in one process caused the RunWatchdog hang in another.

I re-submitted the try job myself[1] at https://treeherder.mozilla.org/#/jobs?repo=try&revision=93ccac454ff90b3dcb699fda7ca4e1d8f7063720 to try and convince the pernosco bot to reproduce it under pernosco for me, but the bot didn't bite.

1: I transplanted the query with "mach try fuzzy -q 'thing from commit'" but apparently that didn't include the full resolution of the filtering step down to just the 16th job. If anyone knows how to more easily directly reproduce a try push when the fuzzy picker is involved, I'd appreciate it. I was assuming that "mach try" wouldn't like it if I tried to reuse the exact try commit that includes the build settings because it likes to create such a commit on its own.

Is there some documentation of this requesting-a-Pernosco-run thing? I have never used it before.

So, originally the way to get pernosco to do things for you was to be in a trial where the pernosco bot would watch your try pushes, automatically notice oranges, and try to reproduce them for you. I think that bot may still exist (and maybe you need khuey/roc to add you to it if you want that), but it turns out it has been superseded by the self-service thing described at https://github.com/Pernosco/pernosco/wiki/Mozilla-Tips#reproducing-arbitrary-taskcluster-job-failures

I just tried to trigger a self-serve reproduction on your try push.

There's also info at the link about submitting your own pernosco.

And if you want info on using pernosco, check out https://pernos.co/about/overview/. And if you'd like a sample run, there's a sessionstorage fission failure cited at https://bugzilla.mozilla.org/show_bug.cgi?id=1593246#c4.

Hm, so, it looks like I got 5 reproductions of the same problem? I promise I only pushed the button twice! The reproductions are associated with output logs, and there are actually only 2 distinct log URLs, so perhaps the size of this trace caused redundant pernosco traces to be derived from the 2 underlying reproduction requests? It seems like the reproductions may be pretty large, at least in terms of how long it's taking pernosco to load them into working memory (or whatever the initial "Loading pernosco data..." interstitial is), so perhaps let's stick to the first pernosco URL for now.

It sounds like you managed to reproduce this in Pernosco, so I'm clearing my needinfo. Thanks for taking a look.

Flags: needinfo?(continuation)
Fission Milestone: --- → M4.1
See Also: → 1597693

toolkit/components/reader/test/browser.ini is another place where I am planning to reuse Fission content processes because of a shutdown hang, which could be the same as this one.

Andrew, what's the next step for this bug? Do these hangs need to block our Fission M4.1 mochitest milestone?

Flags: needinfo?(continuation)

(In reply to Chris Peterson [:cpeterson] from comment #16)

Andrew, what's the next step for this bug? Do these hangs need to block our Fission M4.1 mochitest milestone?

A worker person would need to investigate this.

I think it doesn't need to block M4.1. I turned on process reuse in the directories where the hangs were happening in testing, and it doesn't seem to have come up much since then. I'm not sure exactly what milestone would be appropriate for this. Possibly this is a problem in the wild, but we won't know without looking at hang telemetry.

Flags: needinfo?(continuation)

(In reply to Andrew McCreight [:mccr8] from comment #17)

I think it doesn't need to block M4.1. I turned on process reuse in the directories where the hangs were happening in testing, and it doesn't seem to have come up much since then. I'm not sure exactly what milestone would be appropriate for this. Possibly this is a problem in the wild, but we won't know without looking at hang telemetry.

In that case, let's move this bug to Fission M7 (ride to Beta) so we can look at hang telemetry from Fission Nightly users.

Fission Milestone: M4.1 → M7
See Also: → 1494592
Blocks: 1633342
No longer blocks: 1633342
Blocks: 1633342
No longer blocks: 1435343

One way I'm able to reproduce this locally is:

  1. Go into toolkit/components/reader/test/browser.ini and delete the part where we set dom.ipc.keepProcessesAlive.webIsolated.perOrigin to 1.
  2. ./mach mochitest --headless --enable-fission --verify toolkit/components/reader/test/browser_bug1453818_samesite_cookie.js

browser_bug1124271_readerModePinnedTab.js and browser_readerMode_with_anchor.js also reproduced the issue for me, but not the other tests in that directory, for whatever reason. When the tests failed, it was in any of the first three modes that verify runs in.

Hi Andrew, it would be interesting to see if the cause has to do with what baku said here. Are you able to rr this and create a pernosco session?

Flags: needinfo?(continuation)

As Kris said in comment 0, and as I confirmed yesterday, RuntimeService::Cleanup() is effectively spinning the event loop during shutdown.

Flags: needinfo?(continuation)
Assignee: nobody → kmaglione+bmo

If we try to start a worker too close to shutdown, the main runnable for its
thread runs after shutdown has begun and bails out early while trying to
create a BackgroundChild for its thread. Unlike bail-outs in
WorkerPrivate::DoRunLoop, though, nothing winds up scheduling the deletion of
the worker after it returns, and afterwards the worker is stuck forever in a
pending state. Attempts to shut it down from the main thread just dispatch an
impotent notification to be processed by the worker's main event loop (which,
of course, will never happen).

This patch fixes the behavior of WorkerThreadPrimaryRunnable to always call
ScheduleDeletion when it exits, rather than only doing so when it succeeds
in entering the worker's main loop.
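The essence of the fix can be sketched with a scope guard that makes "always call ScheduleDeletion on exit" structural (names mirror the bug report, but this is an illustrative model, not Gecko's actual implementation):

```cpp
#include <cassert>
#include <functional>

// Minimal scope guard: runs its callback when the enclosing scope exits,
// on every path (normal return or early bail-out).
class ScopeGuard {
  std::function<void()> mFunc;
 public:
  explicit ScopeGuard(std::function<void()> func) : mFunc(std::move(func)) {}
  ~ScopeGuard() { mFunc(); }
  ScopeGuard(const ScopeGuard&) = delete;
  ScopeGuard& operator=(const ScopeGuard&) = delete;
};

struct MockWorker {
  bool deletionScheduled = false;
  bool backgroundChildOk = false;  // false models "shutdown already begun"

  void ScheduleDeletion() { deletionScheduled = true; }

  // Models WorkerThreadPrimaryRunnable::Run. Before the fix, the early
  // return below skipped ScheduleDeletion, leaving the worker pending
  // forever; the guard ensures deletion is scheduled on every exit path.
  void RunPrimaryRunnable() {
    ScopeGuard guard([this] { ScheduleDeletion(); });
    if (!backgroundChildOk) {
      return;  // bail-out path that previously leaked the worker
    }
    // ... enter the worker's main event loop (WorkerPrivate::DoRunLoop) ...
  }
};
```

With deletion scheduled even on the bail-out path, the worker unregisters itself and mDomainMap can empty, so RuntimeService::Cleanup no longer waits forever.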

With that patch, my STR in comment 19 doesn't hang any more (I just tried the one test).

Pushed by maglione.k@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/d995d3771ad5
Schedule worker deletion when primary runnable fails. r=asuth

(In reply to Andrew McCreight [:mccr8] from comment #21)

As Kris said in comment 0, and as I confirmed yesterday, RuntimeService::Cleanup() is effectively spinning the event loop during shutdown.

Thanks :mccr8 for reminding me, I was mentally far behind the status of this bug and my comment was just a reflex (reproduces - pernosco!), sorry.

:asuth, is this supposed to help also with bug 1633342 or even bug 1435343 ?

Flags: needinfo?(bugmail)

(Oh, and thank you all here for the work and patch!)

Status: NEW → RESOLVED
Closed: 6 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla77
Regressions: 1634259
Regressions: 1635394
Flags: needinfo?(bugmail)