Closed Bug 1664386 Opened 4 years ago Closed 4 years ago

Investigate the Worker/ServiceWorker shutdown crashes: RuntimeService

Categories

(Core :: DOM: Workers, task)

task

Tracking


RESOLVED FIXED
83 Branch
Tracking Status
firefox83 --- fixed

People

(Reporter: jstutte, Assigned: jstutte)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

The remaining shutdown hang MOZ_CRASH messages show that apparently only WorkerDebuggeeRunnable cases remain.

This bug intends to investigate their nature in order to make them more actionable.

The crash messages are generated by RuntimeService::CrashIfHanging.

RuntimeService is a singleton that contains a map of domain names to WorkerPrivate* lists and allows operations on registered workers.
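
As a rough illustration of the shape of that registry (the class layout and member names below are hypothetical stand-ins, not the actual Gecko declarations):

  #include <map>
  #include <string>
  #include <vector>

  class WorkerPrivate;  // one instance per worker (top-level or nested)

  // Hypothetical sketch of the per-domain bookkeeping, not the real code.
  struct WorkerDomainInfo {
    std::vector<WorkerPrivate*> mTopLevelWorkers;
    std::vector<WorkerPrivate*> mChildWorkers;
  };

  class RuntimeService {
   public:
    static RuntimeService* GetService();  // singleton accessor

   private:
    // Keyed by domain; Shutdown and CrashIfHanging iterate this map.
    std::map<std::string, WorkerDomainInfo> mDomainMap;
  };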

How workers' shutdown is triggered

Process shutdown triggers either Shutdown or Cleanup through an observer:

  // RuntimeService observer (excerpt):
  if (!strcmp(aTopic, NS_XPCOM_SHUTDOWN_OBSERVER_ID)) {
    Shutdown();
    return NS_OK;
  }
  if (!strcmp(aTopic, NS_XPCOM_SHUTDOWN_THREADS_OBSERVER_ID)) {
    Cleanup();
    return NS_OK;
  }

The difference is that Shutdown only sends a cancel request to all top-level workers, while Cleanup spins the event loop until all worker threads have joined.
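
A minimal sketch of that difference, using standard C++ threads in place of the Gecko machinery (all names here are illustrative assumptions):

  #include <thread>
  #include <vector>

  // Illustrative stand-ins only; the real code cancels via runnables and
  // spins the main-thread event loop instead of calling join() directly.
  struct WorkerSketch {
    std::thread mThread;
    void Cancel() { /* ask the worker's event loop to wind down (async) */ }
  };

  struct RuntimeServiceSketch {
    std::vector<WorkerSketch> mWorkers;

    // "xpcom-shutdown": request cancellation of every top-level worker, then return.
    void Shutdown() {
      for (auto& w : mWorkers) {
        w.Cancel();
      }
    }

    // "xpcom-shutdown-threads": block until every worker thread has terminated.
    void Cleanup() {
      for (auto& w : mWorkers) {
        if (w.mThread.joinable()) {
          w.mThread.join();
        }
      }
    }
  };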

Why and how the MOZ_CRASH messages are composed

On the main thread we are in the Cleanup event loop when the watchdog triggers, apparently waiting for some worker threads to join.

CrashIfHanging then iterates over the domain map and retrieves statistics for each WorkerPrivate* list through Update. To do so, it dispatches a CrashIfHangingRunnable to the worker and, if the dispatch succeeds, waits for its result (forever?). This runnable either writes the crash information or the string "Canceled" to mMsg.
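
The dispatch-and-wait pattern described here looks roughly like the following (a sketch in standard C++; the real code uses Gecko's Monitor inside a WorkerControlRunnable, and the names below are simplified assumptions):

  #include <condition_variable>
  #include <mutex>
  #include <string>

  // Sketch of the unbounded dispatch-and-wait described above; illustrative
  // names, not the real CrashIfHangingRunnable.
  class CrashIfHangingRunnableSketch {
   public:
    // Main thread, after a successful dispatch to the worker: block until the
    // worker (or the cancellation path) provides mMsg -- potentially forever.
    void WaitForResult() {
      std::unique_lock<std::mutex> lock(mMutex);
      mCondVar.wait(lock, [this] { return mHasResult; });
    }

    // Worker thread when the runnable runs, or the cancellation path: store
    // the crash information or "Canceled" and wake the waiter.
    void SetResult(std::string aMsg) {
      {
        std::lock_guard<std::mutex> lock(mMutex);
        mMsg = std::move(aMsg);
        mHasResult = true;
      }
      mCondVar.notify_one();
    }

   private:
    std::mutex mMutex;
    std::condition_variable mCondVar;
    bool mHasResult = false;
    std::string mMsg;
  };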

The worker's crash information is written by DumpCrashInformation, which appends the name of each WorkerRef that is preventing the worker's shutdown.

A shutdown hang is attributed to Workers if:

a) the (single) shutdown timeout has been reached by the RunWatchdog
b) the shutdown steps were completed (sShutdownNotified == true)
c) there is a worker associated with any domain that is still able to receive runnables (and to respond!)

So the worker is not actually "hanging"; it has just not been closed yet.

From the way the message is constructed, I assume the suspected cause for the worker not having been closed yet is a live WorkerRef with workerRef->IsPreventingShutdown() set to true. There seems to be at least one case where the list of printed WorkerRef names is empty, though, indicating that no WorkerRef is preventing shutdown for that worker. In the vast majority of cases, however, the mSender WorkerRef is reported.

:asuth suspects the root cause to be some late execution of chrome JavaScript, in particular osfile.

Questions

Q1: According to :asuth, only workers in the parent process can/should cause a shutdown hang. This would mean that either the map of domain names in the parent process does not contain domains handled by content processes, or that the dispatch function somehow decides not to dispatch the CrashIfHangingRunnable to workers in content processes? Or do we just expect content processes to already be dead at this stage?

Q2: Is this really worth a crash (and/or a blocking event loop on the main thread)? I would assume that any critical task is (or can be) associated with a shutdown step, and thus after sShutdownNotified == true no one should complain if we just exit(0)?

Suggestions

S1: Transform (worker) shutdown hangs from crashes to normal telemetry and exit(0).

S2: Include a hint about whether (and which) chrome JS was run by the hanging worker.

Flags: needinfo?(bugmail)

Q3: Isn't it risky to block the main thread just to compose the worker shutdown hang messages? What if we never receive a response from that worker? Couldn't this cause real process hangs we will never get reports for?

(In reply to Jens Stutte [:jstutte] from comment #1)

Q1: According to :asuth, only workers in the parent process can/should cause a shutdown hang. This would mean that either the map of domain names in the parent process does not contain domains handled by content processes, or that the dispatch function somehow decides not to dispatch the CrashIfHangingRunnable to workers in content processes? Or do we just expect content processes to already be dead at this stage?

The map is local to the current process. We also expect content processes to have already been terminated at profile-before-change (noting that we do expect them to go through XPCOM shutdown locally in DEBUG builds, but for them to just be exit(0)'d or terminated by the content process under opt builds).

Q2: Is this really worth a crash (and/or a blocking event loop on the main thread)? I would assume that any critical task is (or can be) associated with a shutdown step, and thus after sShutdownNotified == true no one should complain if we just exit(0)?

It's not really worth the crash. There was a historical platform engineering attempt in general to help verify correctness at shutdown, but obviously that shouldn't be happening at the expense of users. That said, I think the nsThreadManager shutdown would hang without Workers actively making sure to shut down its threads, based on how the WorkerThreadPrimaryRunnable works (it spins its own event loop that only ever yields control when its worker logic says it's allowed to shut down).

Having the telemetry would be sufficient, and if we're doing exit(0) that should help sidestep the nsThreadManager not being able to shut things down.

Flags: needinfo?(bugmail)

(In reply to Jens Stutte [:jstutte] from comment #2)

Q3: Isn't it risky to block the main thread just to compose the worker shutdown hang messages? What if we never receive a response from that worker? Couldn't this cause real process hangs we will never get reports for?

Sort of. The expected failure mode is that the Worker is responsive but has effectively leaked a StrongWorkerRef somehow. WorkerControlRunnables will always interrupt running JS, which means that we're not concerned about JS content code being stuck in a tight while loop. It's also the case that a failure mode where the thread is in an infinite loop will show up in the crash stats in the thread backtrace for the worker.

Summarizing my video call discussion with Jens about this:

  • Jens' analysis is absolutely correct: there's a potential for Firefox hanging permanently with no additional backstop to report the hang.
    • Noting that in terms of severity, we don't expect this to happen, but we have no way to have metrics on this other than support interactions with users where they'd have to restart their computer to get Firefox to run again, so our intuition could be wrong.
    • Note that our concern is not actually deadlock over the monitor, just that the code never runs and the condvar never notifies, producing some variant of livelock.
  • The fix discussed is to change from waiting forever for the runnable to run or be canceled to using a bounded wait (see the sketch after this list). We proposed:
    • Using MonitorAutoLock/Monitor::Wait with the timeout argument instead of the eternal Wait() in CrashIfHangingRunnable::DispatchAndWait
    • Augment the runnable with a boolean to track whether or not it has effectively "resolved" with an answer, so that we write to mMsg exactly once. While the call to Wait(timeout) indicates what happened via its CVStatus return value, it's necessary to make sure that WorkerRun/Cancel check this value before potentially mutating mMsg in the event they actually do run after the timed wait has given up.
    • This can probably take the form of just adding a non-atomic bool that is accessed with the MonitorAutoLock held. WorkerRun/Cancel would accordingly need to move their mMsg-writing logic into the section of the functions protected by the held monitor.
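
A minimal sketch of that proposal, again using standard C++ primitives rather than the actual MonitorAutoLock/Monitor::Wait code (the timeout value, class and member names are illustrative assumptions):

  #include <chrono>
  #include <condition_variable>
  #include <mutex>
  #include <string>

  // Sketch of the bounded wait plus "resolved exactly once" flag; illustrative
  // names, not the real CrashIfHangingRunnable.
  class BoundedCrashIfHangingRunnableSketch {
   public:
    // Main thread: wait at most aTimeout for the worker to answer.
    // Returns true if the worker responded in time.
    bool WaitForResult(std::chrono::milliseconds aTimeout) {
      std::unique_lock<std::mutex> lock(mMutex);
      if (!mCondVar.wait_for(lock, aTimeout, [this] { return mResolved; })) {
        // Timed out: resolve with a placeholder so a late WorkerRun/Cancel
        // cannot overwrite mMsg afterwards.
        mResolved = true;
        mMsg = "Timed out waiting for worker";
        return false;
      }
      return true;
    }

    // Worker thread (WorkerRun) or cancellation path (Cancel): only the first
    // caller to arrive while !mResolved gets to write mMsg.
    void MaybeSetResult(std::string aMsg) {
      {
        std::lock_guard<std::mutex> lock(mMutex);
        if (mResolved) {
          return;  // the main thread already gave up, or another path answered
        }
        mMsg = std::move(aMsg);
        mResolved = true;
      }
      mCondVar.notify_one();
    }

   private:
    std::mutex mMutex;
    std::condition_variable mCondVar;
    bool mResolved = false;  // plain bool suffices: always accessed under mMutex
    std::string mMsg;
  };

The key point of the design is that the timed-out main thread and a late-running WorkerRun/Cancel race for the same resolved flag under the same lock, so whichever arrives first wins and mMsg is written exactly once.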
Assignee: nobody → jstutte
Status: NEW → ASSIGNED
Attachment #9179216 - Attachment description: Bug 1664386: Avoid any possibility of infinite process hangs due to workers hanging on shutdown → Bug 1664386: Avoid any possibility of infinite process hangs due to workers hanging on shutdown and add chrome worker information to reports
Attachment #9179216 - Attachment description: Bug 1664386: Avoid any possibility of infinite process hangs due to workers hanging on shutdown and add chrome worker information to reports → Bug 1664386: Avoid any possibility of infinite process hangs due to workers hanging on shutdown
Attachment #9179216 - Attachment description: Bug 1664386: Avoid any possibility of infinite process hangs due to workers hanging on shutdown → Bug 1664386: Avoid any possibility of infinite process hangs due to workers hanging on shutdown and add chrome worker information to reports
Attachment #9180121 - Flags: data-review?(jstutte)
Attachment #9180121 - Flags: data-review?(jstutte) → data-review?(chutten)

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #5)
...

  • Noting that in terms of severity, we don't expect this to happen, but we have no way to have metrics on this other than support interactions with users where they'd have to restart their computer to get Firefox to run again, so our intuition could be wrong.
    ...

I think I've run into this. It has locked up the entire computer.

(In reply to Worcester12345 from comment #12)

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #5)
...

  • Noting that in terms of severity, we don't expect this to happen, but we have no way to have metrics on this other than support interactions with users where they'd have to restart their computer to get Firefox to run again, so our intuition could be wrong.

I think I've run into this. It has locked up the entire computer.

Part of the goal here is to be sure that we catch all of the hangs as reports. We are going to monitor this closely: if the number of crashes rises perceivably after landing this patch in release, then there were probably many people who experienced infinite hangs caused by this exact problem (though I would not 100% exclude other possible causes, of course).

Comment on attachment 9180121 [details]
DataReviewRequest1664386.txt

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes.

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes, :jstutte is responsible.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 2, Interaction.

Is the data collection request for default-on or default-off?

Default on for all channels.

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does there need to be a check-in in the future to determine whether to renew the data?

No. This collection is permanent.


Result: datareview+

Attachment #9180121 - Flags: data-review?(chutten) → data-review+
Pushed by bugmail@asutherland.org:
https://hg.mozilla.org/integration/autoland/rev/186ca9e93e1d
Avoid any possibility of infinite process hangs due to workers hanging on shutdown and add chrome worker information to reports r=dom-workers-and-storage-reviewers,asuth
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 83 Branch