Investigate the Worker/ServiceWorker shutdown crashes: RuntimeService
Categories
(Core :: DOM: Workers, task)
Tracking
()
Tracking | Status | |
---|---|---|
firefox83 | --- | fixed |
People
(Reporter: jstutte, Assigned: jstutte)
References
(Blocks 1 open bug)
Details
Attachments
(2 files)
47 bytes,
text/x-phabricator-request
|
Details | Review | |
3.20 KB,
text/plain
|
chutten
:
data-review+
|
Details |
The remaining shutdown hang MOZ_CRASH messages show, that apparently only WorkerDebuggeeRunnable cases remain.
This bug intends to investigate their nature in order to make them more actionable.
Assignee | ||
Comment 1•4 years ago
•
|
||
The crash messages are generated by RuntimeService::CrashIfHanging
.
RuntimeService
is a singleton that contains a map of domain names to WorkerPrivate*
lists and allows operations on registered workers.
How workers' shutdown is triggered
The shutdown triggers either Shutdown or Cleanup through an observer:
if (!strcmp(aTopic, NS_XPCOM_SHUTDOWN_OBSERVER_ID)) {
Shutdown();
return NS_OK;
}
if (!strcmp(aTopic, NS_XPCOM_SHUTDOWN_THREADS_OBSERVER_ID)) {
Cleanup();
return NS_OK;
}
The difference is that Shutdown
just sends cancel to all top level workers only, Cleanup
spins the event loop until all threads have joined.
Why and how the MOZ_CRASH messages are composed
On the main thread we are in the Cleanup
event loop when the watchdog triggers, waiting apparently for some worker threads to join.
CrashIfHanging
then iterates over the domain map and retrieves statistics for each WorkerPrivate*
list through Update
. To do so it dispatches a CrashIfHangingRunnable
to the worker and if the dispatch succeeds it waits for its result (forever?). This runnable either writes the crash information or the string "Canceled"
to mMsg
.
The worker's crash information is written by DumpCrashInformation
which appends a WorkerRef's name only for each workerref that is preventing the worker's shutdown.
A shutdown hang is attributed to Workers if:
a) the (single) shutdown timeout has been reached by the RunWatchdog
b) the shutdown steps were completed (sShutdownNotified == true)
c) there is a worker associated to any domain which is still able to receive runnables (and to respond!)
So actually the worker is not "hanging", it has just not been closed yet.
From the way the message is constructed, I assume the suspected cause to not having closed the worker yet is a living worker reference with workerRef->IsPreventingShutdown()
set to true. There seems to be at least one case, where the list of printed workerRef
names is empty, though, indicating that there is no workerRef
preventing shutdown for this worker. However, in the vast majority of cases we have the mSender
worker ref reported.
:asuth suspects the root cause in some late execution of chrome javascript, in particular osfile.
Questions
Q1: According to :asuth, only workers in the parent process can/should cause a shutdown hang. This would mean that either the map of domain names in the parent process does not contain domains handled by content processes or that the dispatch function somehow decides to not dispatch the CrashIfHangingRunnable to workers into content processes? Or do we just expect content processes to be already dead at this stage?
Q2: Is this really worth a crash (and/or a blocking event loop on the main thread)? I would assume that any critical task is (can be) associated to a shutdown step and thus after sShutdownNotified == true
no one should complain to just exit(0)?
Suggestions
S1: Transform (worker) shutdown hangs from crashes to normal telemetry and exit(0).
S2: Include a hint if (and which) chrome JS was run by the hanging worker.
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 2•4 years ago
|
||
Q3: Isn't it risky to block the main thread just to compose the worker shutdown hang messages? What if we never receive a response from that worker? Couldn't this cause real process hangs we will never get reports for?
Assignee | ||
Updated•4 years ago
|
Comment 3•4 years ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #1)
Q1: According to :asuth, only workers in the parent process can/should cause a shutdown hang. This would mean that either the map of domain names in the parent process does not contain domains handled by content processes or that the dispatch function somehow decides to not dispatch the CrashIfHangingRunnable to workers into content processes? Or do we just expect content processes to be already dead at this stage?
The map is local to the current process. We also expect content processes to have already been terminated at profile-before-change (noting that we do expect them to go through XPCOM shutdown locally in DEBUG builds, but for them to just be exit(0)'d or terminated by the content process under opt builds).
Q2: Is this really worth a crash (and/or a blocking event loop on the main thread)? I would assume that any critical task is (can be) associated to a shutdown step and thus after
sShutdownNotified == true
no one should complain to just exit(0)?
It's not really worth the crash. There was a historical platform engineering attempt in general to help verify correctness at shutdown, but obviously that shouldn't be happening at the expense of users. That said, I think the nsThreadManager shutdown would hang without Workers actively making sure to shut down its threads based on how the WorkerThreadPrimaryRunnable works (it spins its own event loop that only ever yields control when its worker logic says it's allowed to shutdown.)
Having the telemetry would be sufficient, and if we're doing exit(0) that should help sidestep the nsThreadManager not being able to shut things down.
Comment 4•4 years ago
•
|
||
(In reply to Jens Stutte [:jstutte] from comment #2)
Q3: Isn't it risky to block the main thread just to compose the worker shutdown hang messages? What if we never receive a response from that worker? Couldn't this cause real process hangs we will never get reports for?
Sortof. The expected failure mode is that the Worker is responsive but it has effectively leaked a StrongWorkerRef somehow. The WorkerControlRunnnables will always interrupt running JS, which means that we're not concerned about JS content code being in a tight while loop. It's also the case that a failure mode where the thread is in an infinite loop will show in the crash stats in the thread backtrace for the worker.
Comment 5•4 years ago
|
||
Summarizing my video call discussion with Jens about this:
- Jens' analysis that there's a potential for Firefox permanently hanging and us not having additional backstops to report the hang is absolutely correct.
- Noting that in terms of severity, we don't expect this to happen, but we have no way to have metrics on this other than support interactions with users where they'd have to restart their computer to get Firefox to run again, so our intuition could be wrong.
- Note that our concern is not actually deadlock over the monitor, just that the code never runs and the condvar never notifies, producing some variant of livelock.
- The fix discussed is to change from waiting forever for the runnable to run or be canceled is to use a bounded wait. We proposed:
- Using MonitorAutoLock/Monitor::Wait with the timeout argument instead of the eternal Wait() in CrashIfHangingRunnable::DispatchAndWait
- Augment the runnable to use an boolean to track whether or not it has effectively "resolved" with an answer so that we only write to
mMsg
exactly once. While the call to Wait(timeout) indicates what happened with its CVStatus return, it's necessary to make sure that WorkerRun/Cancel check this value before potentially mutatingmMsg
in the event they actually do run after the timed wait has given up. - This can probably take the form of just adding a non-atomic bool that is accessed with the MonitorAutoLock held. The logic in WorkerRun/Cancel would accordingly need to move the mMsg-writing logic into the section of the functions protected by the held monitor.
Assignee | ||
Comment 6•4 years ago
|
||
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Comment hidden (obsolete) |
Comment hidden (obsolete) |
Comment hidden (obsolete) |
Comment hidden (obsolete) |
Assignee | ||
Comment 11•4 years ago
|
||
Assignee | ||
Updated•4 years ago
|
Comment 12•4 years ago
|
||
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #5)
...
- Noting that in terms of severity, we don't expect this to happen, but we have no way to have metrics on this other than support interactions with users where they'd have to restart their computer to get Firefox to run again, so our intuition could be wrong.
...
I think I've run into this. It has locked up the entire computer.
Assignee | ||
Comment 13•4 years ago
|
||
(In reply to Worcester12345 from comment #12)
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #5)
...
- Noting that in terms of severity, we don't expect this to happen, but we have no way to have metrics on this other than support interactions with users where they'd have to restart their computer to get Firefox to run again, so our intuition could be wrong.
I think I've run into this. It has locked up the entire computer.
Part of the goal here is to be sure that we catch all of the hangs as reports. We are going to monitor this closely: if the number of crashes raises perceivable after landing this patch in release, then there were probably many people that experienced infinite hangs caused by this exact problem (but I would not exclude 100% that there might be other possible causes, of course).
Comment 14•4 years ago
|
||
Comment on attachment 9180121 [details]
DataReviewRequest1664386.txt
DATA COLLECTION REVIEW RESPONSE:
Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?
Yes.
Is there a control mechanism that allows the user to turn the data collection on and off?
Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.
If the request is for permanent data collection, is there someone who will monitor the data over time?
Yes, :jstutte is responsible.
Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?
Category 2, Interaction.
Is the data collection request for default-on or default-off?
Default on for all channels.
Does the instrumentation include the addition of any new identifiers?
No.
Is the data collection covered by the existing Firefox privacy notice?
Yes.
Does there need to be a check-in in the future to determine whether to renew the data?
No. This collection is permanent.
Result: datareview+
Comment 15•4 years ago
|
||
Pushed by bugmail@asutherland.org: https://hg.mozilla.org/integration/autoland/rev/186ca9e93e1d Avoid any possibility of infinite process hangs due to workers hanging on shutdown and add chrome worker information to reports r=dom-workers-and-storage-reviewers,asuth
Comment 16•4 years ago
|
||
bugherder |
Description
•