Bug 1756172 Comment 10 Edit History

Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.

Trying to translate this into a more local reasoning about the lifetime-relation between `WorkerPrivate` and `WorkerGlobalScopeBase`:

The "thing that's crashing" here is an access to the `mWorkerPrivate` of `WorkerGlobalScopeBase` after it has already been nulled out. I might be simplifying things, but IMHO we have only two options here regarding the lifetime-relation between `WorkerPrivate` and `WorkerGlobalScopeBase` derived objects:

1. We expect `WorkerGlobalScopeBase` derived objects to never outlive their `mWorkerPrivate`. 
This seems to be kind of the current state, only that evidently there are violations to this. If this is what we want, we should just continue to enforce this (for example by adding more asserts before any access to `mWorkerPrivate`) and adjust any violations (actually we already catch them now by having nulled out the `mWorkerPrivate`). Apparently one violation can be caused by the `mWorkerPrivate->ClearMainEventQueue(WorkerPrivate::WorkerRan);` done after we nulled out the global's `mWorkerPrivate` and thus keeping this pointer alive until the latest possible moment in the `WorkerPrivate`'s live seems advisable to me, as proposed by the patch here.

2. We expect the global's `mWorkerPrivate` to possibly go away at any time, introducing a further (most definite) way of telling the global "this worker is dead". 
This would imply to null-check `mWorkerPrivate` before accessing it at runtime [in all places](https://searchfox.org/mozilla-central/search?q=symbol:F_%3CT_mozilla%3A%3Adom%3A%3AWorkerGlobalScopeBase%3E_mWorkerPrivate&redirect=false) and to act accordingly. 

IIUC the state of the worker is already well modeled by its internal state machine, so until the global can access its `mWorkerPrivate` it knows the worker's state and acts accordingly and having `mWorkerPrivate` set to `nullptr` does not add new information. It just guards us from accessing a potentially dead object. And until today we kind of expected that 1. is true, so I would expect to be there only few violations we can find and should handle.

> This is one of those cases where the thing that's crashing is indicating a potentially concerning lifetime problem but maybe we also want the thing that's catching this by virtue of crashing to assert under fuzzing but silently not crash or do bad things in production.

Not sure if I second this. I'd rather prefer to first have a clear decision between 1. and 2. and then adjust the code accordingly. To crash only while fuzzing reads like "we want 1. but we are lazy to find all violations". Now that can be the right call if we expect those violations to be very hard to find.

> In terms of the things of what's catching this: In the pernosco trace, we don't actually care about the not-taken call to `WorkerPrivateSaysForbidScript` because the global was already marked as dying (`nsIGlobalObject::StartDying`) previously during the transitions to Canceling and Killing. So if we ever got to `CallbackObject::CallSetup::CallSetup`'s invocation of `IsScriptForbidden` we'd say it's not okay. But we're not getting there because it's the call to `nsIGlobalObject::HasJSGlobal` during `AutoEntryScript` construction for the Promise that's crashing earlier. 

Should I read this as "on the way to check if we should actually execute script here we died earlier due to the missing `mWorkerPrivate`" ? Such a check should definitely just fail but not crash in the state we are in here, IIUC. I assume (but I did not check) that if the `mWorkerPrivate` was still set in this situation, everything would have just been fine. And the latest possible access to it seems to be triggered by the final `ClearMainEventQueue`, so ensuring that we still can access our-self during that clear still seems advisable to me. It might not solve other violations, but we can be kind of sure that the worker thread will not try to use the global and its `mWorkerPrivate` anymore only after that point, and that any other access must come from a different thread then.

Long story short, IMHO we should:

- decide that we want 1. 
- repair this specific violation, potentially with attachment 9268485 [details]
- enforce the invariant from 1. further (adding some asserts), though this is kind of optional and mostly documents our assumption everywhere (setting `mWorkerPrivate` to `nullptr` technically is sufficient to catch them).
Trying to translate this into a more local reasoning about the lifetime-relation between `WorkerPrivate` and `WorkerGlobalScopeBase`:

The "thing that's crashing" here is an access to the `mWorkerPrivate` of `WorkerGlobalScopeBase` after it has already been nulled out. I might be simplifying things, but IMHO we have only two options here regarding the lifetime-relation between `WorkerPrivate` and `WorkerGlobalScopeBase` derived objects:

1. We expect `WorkerGlobalScopeBase` derived objects to never outlive their `mWorkerPrivate`. 
This seems to be kind of the current state, only that evidently there are violations to this. If this is what we want, we should just continue to enforce this (for example by adding more asserts before any access to `mWorkerPrivate`) and adjust any violations (actually we already catch them now by having nulled out the `mWorkerPrivate`). Apparently one violation can be caused by the `mWorkerPrivate->ClearMainEventQueue(WorkerPrivate::WorkerRan);` done after we nulled out the global's `mWorkerPrivate` and thus keeping this pointer alive until the latest possible moment in the `WorkerPrivate`'s live seems advisable to me, as proposed by the patch here.

2. We expect the global's `mWorkerPrivate` to possibly go away at any time, introducing a further (most definite) way of telling the global "this worker is dead". 
This would imply to null-check `mWorkerPrivate` before accessing it at runtime [in all places](https://searchfox.org/mozilla-central/search?q=symbol:F_%3CT_mozilla%3A%3Adom%3A%3AWorkerGlobalScopeBase%3E_mWorkerPrivate&redirect=false) and to act accordingly. 

IIUC the state of the worker is already well modeled by its internal state machine, so until the global can access its `mWorkerPrivate` it knows the worker's state and acts accordingly and having `mWorkerPrivate` set to `nullptr` does not add new information. It just guards us from accessing a potentially dead object. And until today we kind of expected that 1. is true, so I would expect to be there only few violations we can find and should handle.

> This is one of those cases where the thing that's crashing is indicating a potentially concerning lifetime problem but maybe we also want the thing that's catching this by virtue of crashing to assert under fuzzing but silently not crash or do bad things in production.

Not sure if I second this. I'd rather prefer to first have a clear decision between 1. and 2. and then adjust the code accordingly. To crash only while fuzzing reads like "we want 1. but we are lazy to find all violations". Now that can be the right call if we expect those violations to be very hard to find.

> In terms of the things of what's catching this: In the pernosco trace, we don't actually care about the not-taken call to `WorkerPrivateSaysForbidScript` because the global was already marked as dying (`nsIGlobalObject::StartDying`) previously during the transitions to Canceling and Killing. So if we ever got to `CallbackObject::CallSetup::CallSetup`'s invocation of `IsScriptForbidden` we'd say it's not okay. But we're not getting there because it's the call to `nsIGlobalObject::HasJSGlobal` during `AutoEntryScript` construction for the Promise that's crashing earlier. 

Should I read this as "on the way to check if we should actually execute script here we died earlier due to the missing `mWorkerPrivate`" ? Such a check should definitely just fail but not crash in the state we are in here, IIUC. I assume (but I did not check) that if the `mWorkerPrivate` was still set in this situation, everything would have just been fine. And the latest possible access to it seems to be triggered by the final `ClearMainEventQueue`, so ensuring that we still can access our-self during that clear still seems advisable to me. It might not solve other violations, but we can be kind of sure that the worker thread will not try to use the global and its `mWorkerPrivate` anymore **only after that point**, and that any other access must come from a different thread then.

Long story short, IMHO we should:

- decide that we want 1. 
- repair this specific violation, potentially with attachment 9268485 [details]
- enforce the invariant from 1. further (adding some asserts), though this is kind of optional and mostly documents our assumption everywhere (setting `mWorkerPrivate` to `nullptr` technically is sufficient to catch them).

Back to Bug 1756172 Comment 10