Bug 1664386 Comment 1 Edit History

Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.

The crash messages are generated by [`RuntimeService::CrashIfHanging`](https://searchfox.org/mozilla-central/rev/8a0745cd346f0cfb89ae71690babbf7bff706113/dom/workers/RuntimeService.cpp#1647).

[`RuntimeService`](https://searchfox.org/mozilla-central/rev/8a0745cd346f0cfb89ae71690babbf7bff706113/dom/workers/RuntimeService.h#34) is a singleton that contains a map of domain names to `WorkerPrivate*` lists and allows operations on registered workers.

**How workers' shutdown is triggered**

The shutdown triggers either [Shutdown](https://searchfox.org/mozilla-central/rev/eb9d5c97927aea75f0c8e38bbc5b5d288099e687/dom/workers/RuntimeService.cpp#1538) or [Cleanup](https://searchfox.org/mozilla-central/rev/eb9d5c97927aea75f0c8e38bbc5b5d288099e687/dom/workers/RuntimeService.cpp#2084) through an observer:

```
  if (!strcmp(aTopic, NS_XPCOM_SHUTDOWN_OBSERVER_ID)) {
    Shutdown();
    return NS_OK;
  }
  if (!strcmp(aTopic, NS_XPCOM_SHUTDOWN_THREADS_OBSERVER_ID)) {
    Cleanup();
    return NS_OK;
  }
```
The difference is that `Shutdown` merely [sends cancel to top-level workers only](https://searchfox.org/mozilla-central/source/dom/workers/RuntimeService.cpp#1563-1568), whereas `Cleanup` [spins the event loop until all threads have joined](https://searchfox.org/mozilla-central/rev/eb9d5c97927aea75f0c8e38bbc5b5d288099e687/dom/workers/RuntimeService.cpp#1683-1684).
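To make the two phases concrete, here is a minimal, thread-free sketch. `ToyWorker`, `ToyRuntimeService`, `turnsToFinish` and the `maxTurns` bound are hypothetical names invented for illustration and are not part of Gecko; the real `Cleanup` spins the main-thread event loop instead of counting turns.

```cpp
#include <vector>

// Hypothetical stand-in for a top-level worker (not the real WorkerPrivate).
struct ToyWorker {
  explicit ToyWorker(int aTurnsToFinish) : turnsToFinish(aTurnsToFinish) {}

  // One simulated event loop turn: a worker only makes progress towards
  // joining once it has been canceled.
  void RunOneTurn() {
    if (!canceled || joined) {
      return;
    }
    if (turnsToFinish > 0) {
      --turnsToFinish;
    }
    if (turnsToFinish == 0) {
      joined = true;
    }
  }

  bool canceled = false;
  bool joined = false;
  int turnsToFinish;  // assumed number of turns needed after cancel
};

class ToyRuntimeService {
 public:
  void Register(int aTurnsToFinish) { mWorkers.emplace_back(aTurnsToFinish); }

  // Phase 1: send cancel to every worker and return without waiting.
  void Shutdown() {
    for (auto& worker : mWorkers) {
      worker.canceled = true;
    }
  }

  // Phase 2: spin until all workers have joined. The maxTurns bound is an
  // assumption of this sketch so that a hang shows up as `false`.
  bool Cleanup(int maxTurns) {
    for (int turn = 0; turn < maxTurns; ++turn) {
      if (AllJoined()) {
        return true;
      }
      for (auto& worker : mWorkers) {
        worker.RunOneTurn();
      }
    }
    return AllJoined();
  }

  bool AllJoined() const {
    for (const auto& worker : mWorkers) {
      if (!worker.joined) {
        return false;
      }
    }
    return true;
  }

 private:
  std::vector<ToyWorker> mWorkers;
};
```

In this model, a worker that takes longer to finish than the spin is allowed to run is exactly the situation in which the watchdog would fire while the main thread is still inside `Cleanup`.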

**Why and how the MOZ_CRASH messages are composed**

On the main thread, we are in the `Cleanup` event loop when the watchdog triggers, apparently waiting for some worker threads to join.

`CrashIfHanging` then iterates over the domain map and retrieves statistics for each `WorkerPrivate*` list through [`Update`](https://searchfox.org/mozilla-central/rev/8a0745cd346f0cfb89ae71690babbf7bff706113/dom/workers/RuntimeService.cpp#1626). To do so it dispatches a `CrashIfHangingRunnable` to the worker and if the dispatch succeeds it [waits for its result](https://searchfox.org/mozilla-central/rev/8a0745cd346f0cfb89ae71690babbf7bff706113/dom/workers/RuntimeService.cpp#1599-1610) (forever?). This runnable either [writes the crash information](https://searchfox.org/mozilla-central/rev/8a0745cd346f0cfb89ae71690babbf7bff706113/dom/workers/RuntimeService.cpp#1583) or the string `"Canceled"` to mMsg.

The worker's crash information is written by [`DumpCrashInformation`](https://searchfox.org/mozilla-central/rev/8a0745cd346f0cfb89ae71690babbf7bff706113/dom/workers/RuntimeService.cpp#1583), which appends a `WorkerRef`'s name only if that `WorkerRef` is preventing the worker's shutdown.
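The filtering described above can be sketched as follows. This is an illustration only: `ToyWorkerRef`, `DumpCrashInformationSketch` and the `|` separator are assumptions made for the example, not the actual Gecko types or message format.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a WorkerRef: a name plus a flag that says
// whether the ref is currently preventing the worker's shutdown.
struct ToyWorkerRef {
  std::string name;
  bool preventingShutdown;

  bool IsPreventingShutdown() const { return preventingShutdown; }
};

// Appends the name of every ref that is preventing shutdown and skips all
// others. The "|" separator is an assumption of this sketch.
std::string DumpCrashInformationSketch(const std::vector<ToyWorkerRef>& refs) {
  std::string msg;
  for (const auto& ref : refs) {
    if (ref.IsPreventingShutdown()) {
      msg += "|";
      msg += ref.name;
    }
  }
  return msg;
}
```

Note that an empty result is possible in this model: a worker without any ref that prevents shutdown contributes no names at all.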

A shutdown hang is attributed to Workers if:

a) the (single) shutdown timeout has been reached by the RunWatchdog
b) the shutdown steps were completed (sShutdownNotified == true)
c) there is a worker associated to any domain which is still able to receive runnables (and to respond!)

So actually the worker is not "hanging"; it has just not been closed yet.
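The three conditions a) to c) can be read as a single predicate. The sketch below is a simplification under stated assumptions: `ToyShutdownState` and `IsAttributedToWorkers` are hypothetical names, and only `shutdownNotified` mirrors something named in the comment (`sShutdownNotified`).

```cpp
#include <cstddef>

// Hypothetical snapshot of the state the watchdog looks at.
struct ToyShutdownState {
  bool watchdogTimeoutReached;    // (a) the shutdown timeout has expired
  bool shutdownNotified;          // (b) mirrors sShutdownNotified == true
  std::size_t responsiveWorkers;  // (c) workers still accepting runnables
};

// A hang is attributed to workers only if all three conditions hold.
bool IsAttributedToWorkers(const ToyShutdownState& state) {
  return state.watchdogTimeoutReached && state.shutdownNotified &&
         state.responsiveWorkers > 0;
}
```

Condition c) is the notable one: the predicate requires a worker that still responds, which is why "not closed yet" describes the situation better than "hanging".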

From the way the message is constructed, I assume the suspected reason the worker has not been closed yet is a living worker reference for which `workerRef->IsPreventingShutdown()` returns true. There seems to be at least one case, though, where the list of printed `workerRef` names is empty, indicating that no `workerRef` is preventing shutdown for this worker. However, in the vast majority of cases the [`mSender`](https://searchfox.org/mozilla-central/rev/eb9d5c97927aea75f0c8e38bbc5b5d288099e687/dom/workers/WorkerRunnable.h#491) worker ref is reported.

:asuth suspects that the root cause is some late execution of chrome JavaScript, in particular [osfile](https://searchfox.org/mozilla-central/source/toolkit/components/osfile).

**Questions**

Q1: According to :asuth, only workers in the parent process can/should cause a shutdown hang. Does this mean that either the map of domain names in the parent process does not contain domains handled by content processes, or that the dispatch function somehow decides not to dispatch the `CrashIfHangingRunnable` to workers in content processes?

Q2: Why do we react to `NS_XPCOM_SHUTDOWN_OBSERVER_ID` with `Shutdown` (without blocking) and to `NS_XPCOM_SHUTDOWN_THREADS_OBSERVER_ID` with the blocking `Cleanup`? It seems from [`ShutdownXPCOM`](https://hg.mozilla.org/releases/mozilla-release/annotate/2c869ada52702e6a02b2fe73b4d81d6be6d515f0/xpcom/build/XPCOMInit.cpp#l621) that both are called.

Q3: Can the reaction to `NS_XPCOM_SHUTDOWN_OBSERVER_ID` race with the one to `NS_XPCOM_SHUTDOWN_THREADS_OBSERVER_ID`? That is, the worker receives cancel the first time and, while it still needs to close down, receives a second cancel message that it will never process?

Q4: Is this really worth a crash (and/or a blocking event loop on the main thread)? I would assume that any critical task is (or can be) associated with a shutdown step, and thus after `sShutdownNotified == true` no one should object if we just `exit(0)`?

**Suggestions**

S1: Turn (worker) shutdown hangs from crashes into normal telemetry followed by `exit(0)`.

S2: Include a hint if (and which) chrome JS was run by the hanging worker.

The crash messages are generated by [`RuntimeService::CrashIfHanging`](https://searchfox.org/mozilla-central/rev/8a0745cd346f0cfb89ae71690babbf7bff706113/dom/workers/RuntimeService.cpp#1647).

[`RuntimeService`](https://searchfox.org/mozilla-central/rev/8a0745cd346f0cfb89ae71690babbf7bff706113/dom/workers/RuntimeService.h#34) is a singleton that contains a map of domain names to `WorkerPrivate*` lists and allows operations on registered workers.

**How workers' shutdown is triggered**

The shutdown triggers either [Shutdown](https://searchfox.org/mozilla-central/rev/eb9d5c97927aea75f0c8e38bbc5b5d288099e687/dom/workers/RuntimeService.cpp#1538) or [Cleanup](https://searchfox.org/mozilla-central/rev/eb9d5c97927aea75f0c8e38bbc5b5d288099e687/dom/workers/RuntimeService.cpp#2084) through an observer:

```
  if (!strcmp(aTopic, NS_XPCOM_SHUTDOWN_OBSERVER_ID)) {
    Shutdown();
    return NS_OK;
  }
  if (!strcmp(aTopic, NS_XPCOM_SHUTDOWN_THREADS_OBSERVER_ID)) {
    Cleanup();
    return NS_OK;
  }
```
The difference is that `Shutdown` merely [sends cancel to top-level workers only](https://searchfox.org/mozilla-central/source/dom/workers/RuntimeService.cpp#1563-1568), whereas `Cleanup` [spins the event loop until all threads have joined](https://searchfox.org/mozilla-central/rev/eb9d5c97927aea75f0c8e38bbc5b5d288099e687/dom/workers/RuntimeService.cpp#1683-1684).

**Why and how the MOZ_CRASH messages are composed**

On the main thread, we are in the `Cleanup` event loop when the watchdog triggers, apparently waiting for some worker threads to join.

`CrashIfHanging` then iterates over the domain map and retrieves statistics for each `WorkerPrivate*` list through [`Update`](https://searchfox.org/mozilla-central/rev/8a0745cd346f0cfb89ae71690babbf7bff706113/dom/workers/RuntimeService.cpp#1626). To do so it dispatches a `CrashIfHangingRunnable` to the worker and if the dispatch succeeds it [waits for its result](https://searchfox.org/mozilla-central/rev/8a0745cd346f0cfb89ae71690babbf7bff706113/dom/workers/RuntimeService.cpp#1599-1610) (forever?). This runnable either [writes the crash information](https://searchfox.org/mozilla-central/rev/8a0745cd346f0cfb89ae71690babbf7bff706113/dom/workers/RuntimeService.cpp#1583) or the string `"Canceled"` to mMsg.

The worker's crash information is written by [`DumpCrashInformation`](https://searchfox.org/mozilla-central/rev/8a0745cd346f0cfb89ae71690babbf7bff706113/dom/workers/RuntimeService.cpp#1583), which appends a `WorkerRef`'s name only if that `WorkerRef` is preventing the worker's shutdown.

A shutdown hang is attributed to Workers if:

a) the (single) shutdown timeout has been reached by the RunWatchdog
b) the shutdown steps were completed (sShutdownNotified == true)
c) there is a worker associated to any domain which is still able to receive runnables (and to respond!)

So actually the worker is not "hanging"; it has just not been closed yet.

From the way the message is constructed, I assume the suspected reason the worker has not been closed yet is a living worker reference for which `workerRef->IsPreventingShutdown()` returns true. There seems to be at least one case, though, where the list of printed `workerRef` names is empty, indicating that no `workerRef` is preventing shutdown for this worker. However, in the vast majority of cases the [`mSender`](https://searchfox.org/mozilla-central/rev/eb9d5c97927aea75f0c8e38bbc5b5d288099e687/dom/workers/WorkerRunnable.h#491) worker ref is reported.

:asuth suspects that the root cause is some late execution of chrome JavaScript, in particular [osfile](https://searchfox.org/mozilla-central/source/toolkit/components/osfile).

**Questions**

Q1: According to :asuth, only workers in the parent process can/should cause a shutdown hang. Does this mean that either the map of domain names in the parent process does not contain domains handled by content processes, or that the dispatch function somehow decides not to dispatch the `CrashIfHangingRunnable` to workers in content processes?

Q2: Is this really worth a crash (and/or a blocking event loop on the main thread)? I would assume that any critical task is (or can be) associated with a shutdown step, and thus after `sShutdownNotified == true` no one should object if we just `exit(0)`?

**Suggestions**

S1: Turn (worker) shutdown hangs from crashes into normal telemetry followed by `exit(0)`.

S2: Include a hint if (and which) chrome JS was run by the hanging worker.

The crash messages are generated by [`RuntimeService::CrashIfHanging`](https://searchfox.org/mozilla-central/rev/8a0745cd346f0cfb89ae71690babbf7bff706113/dom/workers/RuntimeService.cpp#1647).

[`RuntimeService`](https://searchfox.org/mozilla-central/rev/8a0745cd346f0cfb89ae71690babbf7bff706113/dom/workers/RuntimeService.h#34) is a singleton that contains a map of domain names to `WorkerPrivate*` lists and allows operations on registered workers.

**How workers' shutdown is triggered**

The shutdown triggers either [Shutdown](https://searchfox.org/mozilla-central/rev/eb9d5c97927aea75f0c8e38bbc5b5d288099e687/dom/workers/RuntimeService.cpp#1538) or [Cleanup](https://searchfox.org/mozilla-central/rev/eb9d5c97927aea75f0c8e38bbc5b5d288099e687/dom/workers/RuntimeService.cpp#2084) through an observer:

```
  if (!strcmp(aTopic, NS_XPCOM_SHUTDOWN_OBSERVER_ID)) {
    Shutdown();
    return NS_OK;
  }
  if (!strcmp(aTopic, NS_XPCOM_SHUTDOWN_THREADS_OBSERVER_ID)) {
    Cleanup();
    return NS_OK;
  }
```
The difference is that `Shutdown` merely [sends cancel to top-level workers only](https://searchfox.org/mozilla-central/source/dom/workers/RuntimeService.cpp#1563-1568), whereas `Cleanup` [spins the event loop until all threads have joined](https://searchfox.org/mozilla-central/rev/eb9d5c97927aea75f0c8e38bbc5b5d288099e687/dom/workers/RuntimeService.cpp#1683-1684).

**Why and how the MOZ_CRASH messages are composed**

On the main thread, we are in the `Cleanup` event loop when the watchdog triggers, apparently waiting for some worker threads to join.

`CrashIfHanging` then iterates over the domain map and retrieves statistics for each `WorkerPrivate*` list through [`Update`](https://searchfox.org/mozilla-central/rev/8a0745cd346f0cfb89ae71690babbf7bff706113/dom/workers/RuntimeService.cpp#1626). To do so it dispatches a `CrashIfHangingRunnable` to the worker and if the dispatch succeeds it [waits for its result](https://searchfox.org/mozilla-central/rev/8a0745cd346f0cfb89ae71690babbf7bff706113/dom/workers/RuntimeService.cpp#1599-1610) (forever?). This runnable either [writes the crash information](https://searchfox.org/mozilla-central/rev/8a0745cd346f0cfb89ae71690babbf7bff706113/dom/workers/RuntimeService.cpp#1583) or the string `"Canceled"` to mMsg.

The worker's crash information is written by [`DumpCrashInformation`](https://searchfox.org/mozilla-central/rev/8a0745cd346f0cfb89ae71690babbf7bff706113/dom/workers/RuntimeService.cpp#1583), which appends a `WorkerRef`'s name only if that `WorkerRef` is preventing the worker's shutdown.

A shutdown hang is attributed to Workers if:

a) the (single) shutdown timeout has been reached by the RunWatchdog
b) the shutdown steps were completed (sShutdownNotified == true)
c) there is a worker associated to any domain which is still able to receive runnables (and to respond!)

So actually the worker is not "hanging"; it has just not been closed yet.

From the way the message is constructed, I assume the suspected reason the worker has not been closed yet is a living worker reference for which `workerRef->IsPreventingShutdown()` returns true. There seems to be at least one case, though, where the list of printed `workerRef` names is empty, indicating that no `workerRef` is preventing shutdown for this worker. However, in the vast majority of cases the [`mSender`](https://searchfox.org/mozilla-central/rev/eb9d5c97927aea75f0c8e38bbc5b5d288099e687/dom/workers/WorkerRunnable.h#491) worker ref is reported.

:asuth suspects that the root cause is some late execution of chrome JavaScript, in particular [osfile](https://searchfox.org/mozilla-central/source/toolkit/components/osfile).

**Questions**

Q1: According to :asuth, only workers in the parent process can/should cause a shutdown hang. Does this mean that either the map of domain names in the parent process does not contain domains handled by content processes, or that the dispatch function somehow decides not to dispatch the `CrashIfHangingRunnable` to workers in content processes? Or do we just expect content processes to be already dead at this stage?

Q2: Is this really worth a crash (and/or a blocking event loop on the main thread)? I would assume that any critical task is (or can be) associated with a shutdown step, and thus after `sShutdownNotified == true` no one should object if we just `exit(0)`?

**Suggestions**

S1: Turn (worker) shutdown hangs from crashes into normal telemetry followed by `exit(0)`.

S2: Include a hint if (and which) chrome JS was run by the hanging worker.

The crash messages are generated by [`RuntimeService::CrashIfHanging`](https://searchfox.org/mozilla-central/rev/8a0745cd346f0cfb89ae71690babbf7bff706113/dom/workers/RuntimeService.cpp#1647).

[`RuntimeService`](https://searchfox.org/mozilla-central/rev/8a0745cd346f0cfb89ae71690babbf7bff706113/dom/workers/RuntimeService.h#34) is a singleton that contains a map of domain names to `WorkerPrivate*` lists and allows operations on registered workers.

**How workers' shutdown is triggered**

The shutdown triggers either [Shutdown](https://searchfox.org/mozilla-central/rev/eb9d5c97927aea75f0c8e38bbc5b5d288099e687/dom/workers/RuntimeService.cpp#1538) or [Cleanup](https://searchfox.org/mozilla-central/rev/eb9d5c97927aea75f0c8e38bbc5b5d288099e687/dom/workers/RuntimeService.cpp#2084) through an observer:

```
  if (!strcmp(aTopic, NS_XPCOM_SHUTDOWN_OBSERVER_ID)) {
    Shutdown();
    return NS_OK;
  }
  if (!strcmp(aTopic, NS_XPCOM_SHUTDOWN_THREADS_OBSERVER_ID)) {
    Cleanup();
    return NS_OK;
  }
```
The difference is that `Shutdown` merely [sends cancel to top-level workers only](https://searchfox.org/mozilla-central/source/dom/workers/RuntimeService.cpp#1563-1568), whereas `Cleanup` [spins the event loop until all threads have joined](https://searchfox.org/mozilla-central/rev/eb9d5c97927aea75f0c8e38bbc5b5d288099e687/dom/workers/RuntimeService.cpp#1683-1684).

**Why and how the MOZ_CRASH messages are composed**

On the main thread, we are in the `Cleanup` event loop when the watchdog triggers, apparently waiting for some worker threads to join.

`CrashIfHanging` then iterates over the domain map and retrieves statistics for each `WorkerPrivate*` list through [`Update`](https://searchfox.org/mozilla-central/rev/8a0745cd346f0cfb89ae71690babbf7bff706113/dom/workers/RuntimeService.cpp#1626). To do so it dispatches a `CrashIfHangingRunnable` to the worker and if the dispatch succeeds it [waits for its result](https://searchfox.org/mozilla-central/rev/8a0745cd346f0cfb89ae71690babbf7bff706113/dom/workers/RuntimeService.cpp#1599-1610) (forever?). This runnable either [writes the crash information](https://searchfox.org/mozilla-central/rev/8a0745cd346f0cfb89ae71690babbf7bff706113/dom/workers/RuntimeService.cpp#1583) or the string `"Canceled"` to `mMsg`.

The worker's crash information is written by [`DumpCrashInformation`](https://searchfox.org/mozilla-central/rev/8a0745cd346f0cfb89ae71690babbf7bff706113/dom/workers/RuntimeService.cpp#1583), which appends a `WorkerRef`'s name only if that `WorkerRef` is preventing the worker's shutdown.

A shutdown hang is attributed to Workers if:

a) the (single) shutdown timeout has been reached by the RunWatchdog
b) the shutdown steps were completed (sShutdownNotified == true)
c) there is a worker associated to any domain which is still able to receive runnables (and to respond!)

So actually the worker is not "hanging"; it has just not been closed yet.

From the way the message is constructed, I assume the suspected reason the worker has not been closed yet is a living worker reference for which `workerRef->IsPreventingShutdown()` returns true. There seems to be at least one case, though, where the list of printed `workerRef` names is empty, indicating that no `workerRef` is preventing shutdown for this worker. However, in the vast majority of cases the [`mSender`](https://searchfox.org/mozilla-central/rev/eb9d5c97927aea75f0c8e38bbc5b5d288099e687/dom/workers/WorkerRunnable.h#491) worker ref is reported.

:asuth suspects that the root cause is some late execution of chrome JavaScript, in particular [osfile](https://searchfox.org/mozilla-central/source/toolkit/components/osfile).

**Questions**

Q1: According to :asuth, only workers in the parent process can/should cause a shutdown hang. Does this mean that either the map of domain names in the parent process does not contain domains handled by content processes, or that the dispatch function somehow decides not to dispatch the `CrashIfHangingRunnable` to workers in content processes? Or do we just expect content processes to be already dead at this stage?

Q2: Is this really worth a crash (and/or a blocking event loop on the main thread)? I would assume that any critical task is (or can be) associated with a shutdown step, and thus after `sShutdownNotified == true` no one should object if we just `exit(0)`?

**Suggestions**

S1: Turn (worker) shutdown hangs from crashes into normal telemetry followed by `exit(0)`.

S2: Include a hint if (and which) chrome JS was run by the hanging worker.