Closed Bug 1594572 Opened 1 year ago Closed 6 months ago

RuntimeService::Cleanup hangs content processes when a worker is registered a few milliseconds before shutdown

Categories: Core :: DOM: Workers, defect, P2

Tracking: RESOLVED FIXED in mozilla77; Fission Milestone M7
Tracking Status: firefox77 --- fixed

People: (Reporter: kmag, Assigned: kmag)

References: (Blocks 4 open bugs)

Details

Attachments: (1 file)

We started running into this for WebExtension reader mode tests with Fission enabled. It's fairly easy to reproduce locally in debug builds with:

./mach mochitest --enable-fission --verify browser/components/extensions/test/browser/browser_ext_tabs_readerMode.js

It may take a couple of runs, but eventually a content process will hang during shutdown at https://searchfox.org/mozilla-central/rev/d061ba55ac76f41129618d638f4ef674303ec103/dom/workers/RuntimeService.cpp#1844 waiting for mDomainMap to empty, with the sole remaining worker thread sitting idle in its main event loop. It appears to happen whenever RuntimeService::RegisterWorker is called a few milliseconds before RuntimeService::Shutdown.
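The shape of the hang can be sketched in a greatly simplified, self-contained model (hypothetical names; the real RuntimeService::Cleanup spins the main thread's XPCOM event loop rather than waiting on a condition variable, and here a deadline stands in so the sketch terminates instead of hanging):

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <map>
#include <mutex>
#include <string>

// Simplified model of the shutdown wait in RuntimeService::Cleanup.
struct MockRuntimeService {
  std::mutex mLock;
  std::condition_variable mCondVar;
  std::map<std::string, int> mDomainMap;  // domain -> live worker count

  void RegisterWorker(const std::string& domain) {
    std::lock_guard<std::mutex> lock(mLock);
    ++mDomainMap[domain];
  }

  void UnregisterWorker(const std::string& domain) {
    std::lock_guard<std::mutex> lock(mLock);
    if (--mDomainMap[domain] == 0) {
      mDomainMap.erase(domain);
    }
    mCondVar.notify_all();
  }

  // Returns true if all workers went away before the deadline, false if we
  // "hung": the bug is that a worker registered milliseconds before shutdown
  // never unregisters, so mDomainMap never empties.
  bool Cleanup(std::chrono::milliseconds deadline) {
    std::unique_lock<std::mutex> lock(mLock);
    return mCondVar.wait_for(lock, deadline,
                             [this] { return mDomainMap.empty(); });
  }
};

bool SimulateCleanShutdown() {
  MockRuntimeService rs;
  rs.RegisterWorker("example.com");
  rs.UnregisterWorker("example.com");  // worker ran and shut down normally
  return rs.Cleanup(std::chrono::milliseconds(50));
}

bool SimulateBuggyShutdown() {
  MockRuntimeService rs;
  rs.RegisterWorker("example.com");  // registered just before shutdown...
  // ...but its primary runnable bailed out and never unregistered it.
  return rs.Cleanup(std::chrono::milliseconds(50));
}
```

In the real code there is no deadline on this path, which is why the content process sits frozen until the terminator watchdog kills it.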

If you happen to have an rr trace that you could pernosco-submit, or if the pernosco bot spammed you with a pernosco link from a try push, or if there's a self-serve link that can trigger that, providing a pernosco link/self-serve link would be appreciated to save some cycles. (Thanks!)

I think this is likely related to bug 988872, if it's not the same. Even though this is currently happening in content processes, it seems possible that chrome workers could create the same problem in the parent process, so I'm also blocking the worker shutdown bug, bug 1435343.

Blocks: 1435343
Priority: -- → P2
See Also: → 988872

"PERF" key word?

(In reply to Worcester12345 from comment #2)

"PERF" key word?

I don't think that's applicable in this case, as it sounds like this is a race condition at shutdown, and I don't think the perf team would normally investigate or fix this; the workers & storage team would.

The context here is that Fission mochitests are hitting frequent shutdown hangs (bug 1358898). One of the directories where this is happening is browser/components/extensions/test/browser/, so maybe it is the same thing as the issue Kris described in comment 0. With Fission, we no longer preserve content processes during tests, so we have many more content processes, which presumably leads to many more opportunities for shutdown races.

Sorry, I don't have either an rr trace or a Pernosco trace. Andrew McCreight came across this, and I tracked down the cause by attaching gdb to a hung content process and then adding debug logs to confirm the scenarios it happens in.

Andrew may have a link to a try run that can be used with Pernosco, though.

Flags: needinfo?(continuation)

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #1)

If you happen to have an rr trace that you could pernosco-submit, or if the pernosco bot spammed you with a pernosco link from a try push, or if there's a self-serve link that can trigger that, providing a pernosco link/self-serve link would be appreciated to save some cycles. (Thanks!)

I don't know how to do any of that, but here's a try run with the failure (which may not be the exact scenario Kris describes):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=cd52be5a68483db6fca8ad34f53f15e2447165b1

Two of the four failures have the hang I'm talking about, which manifests as application crashed [@ mozilla::(anonymous namespace)::RunWatchdog(void*)].

Flags: needinfo?(continuation)

Hm, so that's weird, because the Watchdog has a workers-specific crash that produces output different from what's happening in those failures.

Specifically, the Watchdog makes an explicit call to RuntimeService::CrashIfHanging, which at the top checks whether mDomainMap is empty. If it's empty, it returns early to allow the crash to carry the string "Shutdown too long, probably frozen, causing a crash." that we're seeing here. If it's not empty, the crash message will start with "Workers Hanging".
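That branching can be sketched as follows (illustrative only: the function name approximates the real RuntimeService::CrashIfHanging, and the "Workers Hanging" message format here is invented; the real annotation includes per-worker detail):

```cpp
#include <cassert>
#include <map>
#include <string>

// Sketch of the check asuth describes: if no workers remain registered,
// bail out early so the generic terminator message is what gets reported;
// otherwise the crash string identifies the hanging workers.
std::string CrashAnnotation(const std::map<std::string, int>& domainMap) {
  if (domainMap.empty()) {
    // Early return: the generic terminator crash string is used.
    return "Shutdown too long, probably frozen, causing a crash.";
  }
  // Workers still registered: the crash string starts with "Workers Hanging".
  return "Workers Hanging - " + std::to_string(domainMap.size()) +
         " domain(s) with live workers";
}
```

So a crash whose message is the generic "Shutdown too long" string implies mDomainMap was empty in *that* process at watchdog time, which is why the try failures above don't directly implicate this bug.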

The crashes at https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=274151077&repo=try&lineNumber=36610 and https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=274154705&repo=try&lineNumber=36743 indeed don't seem to have this signature. Instead the problems seem to be at ~BrowserProcessSubThread in both: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=274151077&repo=try&lineNumber=36659 and https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=274154705&repo=try&lineNumber=36792 ?

Do you have other try failures that more directly implicate this bug?

Flags: needinfo?(continuation)

For what it's worth, when it hung locally, I never actually got a crash dump for the "Hit MOZ_CRASH(Shutdown too long, probably frozen, causing a crash.) at /builds/worker/workspace/build/src/toolkit/components/terminator/nsTerminator.cpp:213" message, for some reason, but when it happened, I attached to the remaining frozen content process and found it frozen in worker cleanup, as I described, every time. It's possible that the worker shutdown hang in one process caused the RunWatchdog hang in another.

I re-submitted the try job myself[1] at https://treeherder.mozilla.org/#/jobs?repo=try&revision=93ccac454ff90b3dcb699fda7ca4e1d8f7063720 to try and convince the pernosco bot to reproduce it under pernosco for me, but the bot didn't bite.

1: I transplanted the query with "mach try fuzzy -q 'thing from commit'" but apparently that didn't include the full resolution of the filtering step down to just the 16th job. If anyone knows how to more easily directly reproduce a try push when the fuzzy picker is involved, I'd appreciate it. I was assuming that "mach try" wouldn't like it if I tried to reuse the exact try commit that includes the build settings because it likes to create such a commit on its own.

Is there some documentation of this requesting-a-Pernosco-run thing? I have never used it before.

So, originally the way to get pernosco to do things for you was to be in a trial where the pernosco bot would watch your try pushes, automatically notice oranges, and try to reproduce them for you. I think that bot may still exist (and maybe you need khuey/roc to add you to it if you want that), but it turns out it has been superseded by the self-service thing described at https://github.com/Pernosco/pernosco/wiki/Mozilla-Tips#reproducing-arbitrary-taskcluster-job-failures

I just tried to trigger a self-serve reproduction on your try push.

There's also info at the link about submitting your own pernosco.

And if you want info on using pernosco, check out https://pernos.co/about/overview/. And if you'd like a sample run, there's a sessionstorage fission failure cited at https://bugzilla.mozilla.org/show_bug.cgi?id=1593246#c4.

Hm, so, it looks like I got 5 reproductions of the same problem? I promise I only pushed the button twice! The reproductions are associated with output logs, and there are actually only 2 distinct log URLs, so perhaps the size of this trace caused redundant pernosco traces to be derived from the 2 underlying reproduction requests? It seems like the reproductions may be pretty large, at least in terms of how long it's taking pernosco to load them into working memory (or whatever the initial "Loading pernosco data..." interstitial is), so perhaps let's stick to the first pernosco URL for now.

It sounds like you managed to reproduce this in Pernosco, so I'm clearing my needinfo. Thanks for taking a look.

Flags: needinfo?(continuation)
Fission Milestone: --- → M4.1
See Also: → 1597693

toolkit/components/reader/test/browser.ini is another place where I am planning to reuse Fission content processes because of a shutdown hang, which could be the same as this one.

Andrew, what's the next step for this bug? Do these hangs need to block our Fission M4.1 mochitest milestone?

Flags: needinfo?(continuation)

(In reply to Chris Peterson [:cpeterson] from comment #16)

Andrew, what's the next step for this bug? Do these hangs need to block our Fission M4.1 mochitest milestone?

A worker person would need to investigate this.

I think it doesn't need to block M4.1. I turned on process reuse in the directories where the hangs were happening in testing, and it doesn't seem to have come up much since then. I'm not sure exactly what milestone would be appropriate for this. Possibly this is a problem in the wild, but we won't know without looking at hang telemetry.

Flags: needinfo?(continuation)

(In reply to Andrew McCreight [:mccr8] from comment #17)

I think it doesn't need to block M4.1. I turned on process reuse in the directories where the hangs were happening in testing, and it doesn't seem to have come up much since then. I'm not sure exactly what milestone would be appropriate for this. Possibly this is a problem in the wild, but we won't know without looking at hang telemetry.

In that case, let's move this bug to Fission M7 (ride to Beta) so we can look at hang telemetry from Fission Nightly users.

Fission Milestone: M4.1 → M7
See Also: → 1494592
Blocks: 1633342
No longer blocks: 1633342
Blocks: 1633342
No longer blocks: 1435343

One way I'm able to reproduce this locally is:

  1. Go into toolkit/components/reader/test/browser.ini and delete the part where we set dom.ipc.keepProcessesAlive.webIsolated.perOrigin to 1.
  2. ./mach mochitest --headless --enable-fission --verify toolkit/components/reader/test/browser_bug1453818_samesite_cookie.js

browser_bug1124271_readerModePinnedTab.js and browser_readerMode_with_anchor.js also reproduced the issue for me, but not the other tests in that directory, for whatever reason. When the tests failed, it was in any of the first three modes that verify runs in.

Hi Andrew, it would be interesting to see if the cause has to do with what baku said here. Are you able to rr this and create a pernosco session?

Flags: needinfo?(continuation)

As Kris said in comment 0, and as I confirmed yesterday, RuntimeService::Cleanup() is effectively spinning the event loop during shutdown.

Flags: needinfo?(continuation)
Assignee: nobody → kmaglione+bmo

If we try to start a worker too close to shutdown, the main runnable for its
thread runs after shutdown has begun and bails out early while trying to
create a BackgroundChild for its thread. Unlike bail-outs in
WorkerPrivate::DoRunLoop, though, nothing winds up scheduling the deletion of
the worker after it returns, and afterwards the worker is stuck forever in a
pending state. Attempts to shut it down from the main thread just dispatch an
impotent notification to be processed by the worker's main event loop (which,
of course, will never happen).

This patch fixes the behavior of WorkerThreadPrimaryRunnable to always call
ScheduleDeletion when it exits, rather than only doing so when it succeeds
in entering the worker's main loop.
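The essence of the fix can be sketched with a scope guard that makes "always call ScheduleDeletion on exit" structural (names mirror the bug report, but this is an illustrative model, not Gecko's actual implementation):

```cpp
#include <cassert>
#include <functional>

// Minimal scope guard: runs its callback when the enclosing scope exits,
// on every path (normal return or early bail-out).
class ScopeGuard {
  std::function<void()> mFunc;
 public:
  explicit ScopeGuard(std::function<void()> func) : mFunc(std::move(func)) {}
  ~ScopeGuard() { mFunc(); }
  ScopeGuard(const ScopeGuard&) = delete;
  ScopeGuard& operator=(const ScopeGuard&) = delete;
};

struct MockWorker {
  bool deletionScheduled = false;
  bool backgroundChildOk = false;  // false models "shutdown already begun"

  void ScheduleDeletion() { deletionScheduled = true; }

  // Models WorkerThreadPrimaryRunnable::Run. Before the fix, the early
  // return below skipped ScheduleDeletion, leaving the worker pending
  // forever; the guard ensures deletion is scheduled on every exit path.
  void RunPrimaryRunnable() {
    ScopeGuard guard([this] { ScheduleDeletion(); });
    if (!backgroundChildOk) {
      return;  // bail-out path that previously leaked the worker
    }
    // ... enter the worker's main event loop (WorkerPrivate::DoRunLoop) ...
  }
};
```

With deletion scheduled even on the bail-out path, the worker unregisters itself and mDomainMap can empty, so RuntimeService::Cleanup no longer waits forever.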

With that patch, my STR in comment 19 doesn't hang any more (I just tried the one test).

Pushed by maglione.k@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/d995d3771ad5
Schedule worker deletion when primary runnable fails. r=asuth

(In reply to Andrew McCreight [:mccr8] from comment #21)

As Kris said in comment 0, and as I confirmed yesterday, RuntimeService::Cleanup() is effectively spinning the event loop during shutdown.

Thanks :mccr8 for reminding me, I was mentally far behind the status of this bug and my comment was just a reflex (reproduces - pernosco!), sorry.

:asuth, is this supposed to help also with bug 1633342 or even bug 1435343 ?

Flags: needinfo?(bugmail)

(Oh, and thank you all here for the work and patch!)

Status: NEW → RESOLVED
Closed: 6 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla77
Regressions: 1634259
Regressions: 1635394
Flags: needinfo?(bugmail)