Bug 1775784 Comment 18 Edit History

Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.

tl;dr: Worker shutdown is happening at Canceling not Killing as far as the remote worker mechanism is concerned for the purposes of deciding when to stop keeping the process alive, which leads to a race between worker shutdown and the process being torn down, and this leads to the telemetry not being reported because the observer shutdown notification has already happened by the time the worker gets around to reporting its usage.  By fixing this by moving the remote worker notification to the transition to Killing (via helper lambda passed into the WorkerPrivate constructor) we can also likely address another set of related races.

My longer key notes here from [slack discussion](https://mozilla.slack.com/archives/GTYQJ2U8M/p1661917727296469?thread_ts=1661875326.792659&cid=GTYQJ2U8M):

So I understand that for things to work:

    The worker has to reach the killing stage and report its use counters.  (I guess we must do this for reasons related to lock contention?  Arguably doing it once at worker shutdown does avoid a lot of pathological possibilities for multiple workers doing the exact same thing!)
    The child has to send telemetry up to the parent.  It will do this in "content-child-shutdown".

In the pernosco trace at https://pernos.co/debug/KzErTzqDyGm8hBQz1FdPCg/index.html#f{m[AkVx,ChMJ_,t[AbE,Cmc_,f{e[AkVx,ChMC_,s{af4gCXQAA,bARo,uHk90Sw,oHlBm3Q___/ at that timestamp we can see that the "content-child-shutdown" telemetry IPC sending is happening prior to the worker reaching the killing state and so the telemetry never gets sent.The reason the child is able to start shutdown for SharedWorkers or ServiceWorkers is that RemoteWorkerParent::ActorDestroy notifies the ContentParent which can then start the parent shutdown.Presumably we can address this by keeping the RemoteWorkerParent alive slightly longer, and in doing so we'll probably eliminate a whole bunch of intermittents, which could be very nice!  We have a somewhat nice diagram that suggests that RemoteWorkerChild::ShutdownOnWorker is responsible for sending the close message that causes problems, and indeed that's what happened in the above pernosco trace.  This WeakWorkerRef is triggered on the transition to Canceling which is strictly before Killing.Unfortunately, it's a bit awkward to be notified of the transition to Killing right now.  We won't transition to killing if we have active worker refs.  Also it would be a wacky concept to be able to have WorkerRefs that notify on killing because that's the point where everything should have stopped and you're explicitly not allowed to do anything.  Maybe the most practical thing is to add an optional lambda argument to WorkerPrivate::Constructor like RemoteWorkerChild::ExecWorkerOnMainThread calls and call that lambda in WorkerPrivate::ClearSelfAndParentEventTargetRef.  This is potentially a bit extreme but we know the worker is basically entirely dead at that point.  As the logic that creates and dispatches the worker finished runnable says, the worker literally can't do anything after dispatching that runnable!  I don't see any real harm in making sure the worker is totally, totally done.  (Well, we will be less efficient, but from our team's perspectives, a LOT of edge-cases will go away, so it's a win for us!)
tl;dr: Worker shutdown is happening at Canceling not Killing as far as the remote worker mechanism is concerned for the purposes of deciding when to stop keeping the process alive, which leads to a race between worker shutdown and the process being torn down, and this leads to the telemetry not being reported because the observer shutdown notification has already happened by the time the worker gets around to reporting its usage.  By fixing this by moving the remote worker notification to the transition to Killing (via helper lambda passed into the WorkerPrivate constructor) we can also likely address another set of related races.

My longer key notes here from [slack discussion](https://mozilla.slack.com/archives/GTYQJ2U8M/p1661917727296469?thread_ts=1661875326.792659&cid=GTYQJ2U8M):


So I understand that for things to work:


- The worker has to reach the killing stage and report its use counters.  (I guess we must do this for reasons related to lock contention?  Arguably doing it once at worker shutdown does avoid a lot of pathological possibilities for multiple workers doing the exact same thing!)
- The child has to send telemetry up to the parent.  It will do this in "content-child-shutdown".


In the pernosco trace at https://pernos.co/debug/KzErTzqDyGm8hBQz1FdPCg/index.html#f{m[AkVx,ChMJ_,t[AbE,Cmc_,f{e[AkVx,ChMC_,s{af4gCXQAA,bARo,uHk90Sw,oHlBm3Q___/ at that timestamp we can see that the "content-child-shutdown" telemetry IPC sending is happening prior to the worker reaching the killing state and so the telemetry never gets sent.The reason the child is able to start shutdown for SharedWorkers or ServiceWorkers is that RemoteWorkerParent::ActorDestroy notifies the ContentParent which can then start the parent shutdown.

Presumably we can address this by keeping the RemoteWorkerParent alive slightly longer, and in doing so we'll probably eliminate a whole bunch of intermittents, which could be very nice!  We have a somewhat nice diagram that suggests that RemoteWorkerChild::ShutdownOnWorker is responsible for sending the close message that causes problems, and indeed that's what happened in the above pernosco trace.  This WeakWorkerRef is triggered on the transition to Canceling which is strictly before Killing.

Unfortunately, it's a bit awkward to be notified of the transition to Killing right now.  We won't transition to killing if we have active worker refs.  Also it would be a wacky concept to be able to have WorkerRefs that notify on killing because that's the point where everything should have stopped and you're explicitly not allowed to do anything.  Maybe the most practical thing is to add an optional lambda argument to WorkerPrivate::Constructor like RemoteWorkerChild::ExecWorkerOnMainThread calls and call that lambda in WorkerPrivate::ClearSelfAndParentEventTargetRef.  This is potentially a bit extreme but we know the worker is basically entirely dead at that point.  As the logic that creates and dispatches the worker finished runnable says, the worker literally can't do anything after dispatching that runnable!  I don't see any real harm in making sure the worker is totally, totally done.  (Well, we will be less efficient, but from our team's perspectives, a LOT of edge-cases will go away, so it's a win for us!)

Back to Bug 1775784 Comment 18