Bug 1854076 Comment 9 Edit History

Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.

The problem here seems to be more subtle than an ownership problem; the confusing state machine around entangling and UpdateMustKeepAlive which maintains a self-induced strong-ref through at least entangling means that the state machine is largely correct.  And changing the reference held by the MessagePortChild to a strong reference inherently would interfere with cycle collection[1].  That said, the raw pointer is inherently dangerous and the destructor and cycle collection unlink logic does not go far enough to ensure that the dangerous raw pointer is cleaned up.

By inspection, I suspect the issue here is that if we fail to create the WorkerRef we directly advanced the state machine to eStateDisentangled and call UpdateMustKeepAlive but without taking any action to ensure the actor is detached/cleaned up.  UpdateMustKeepAlive will drop our strong ref, and setting the state to disentangled defeats any actions that might be taken to clean up the actor properly.

I don't believe this is possible for the normal Worker.postMessage, but this is possible due to an unfortunate timing of events related to our implementation of RTCRtpScriptTransform which this POC uses.  [This is the WorkerRun implementation of EventWithOptionsRunnable that was forked from MessageEventRunnable for RTCRtpScriptTransform](https://searchfox.org/mozilla-central/rev/6602bdf9fff5020fbc8e248c963ddddf09a77b1b/dom/workers/EventWithOptionsRunnable.cpp#132-162).  [This is the WorkerRun implementation of MessageEventRunnable](https://searchfox.org/mozilla-central/rev/6602bdf9fff5020fbc8e248c963ddddf09a77b1b/dom/workers/MessageEventRunnable.cpp#109-148).  They do not look the same.  They should look the same.  [This is what is missing from EventWithOptionsRunnable](https://searchfox.org/mozilla-central/rev/6602bdf9fff5020fbc8e248c963ddddf09a77b1b/dom/workers/MessageEventRunnable.cpp#140-144) that would have prevented the MessagePort from being deserialized when the WorkerRef could not be created.  The problem is the code was forked just prior to the landing of the changes to MessageEventRunnable.

So there's a couple fixes here for tomorrow:
- Bring EventWithOptionsRunnable up-to-date.
- Add failsafes to MessagePort so that we drop the actor no matter what in our destructor, asserting very hard if we have to do that since it means there was a state machine problem.
   - Ensure the actor can handle the MessagePort not existing.  (It may already do this.)
- Correct the logic for when we can't create the WorkerRef to shut down the actor correctly/safely.  We may also want to diagnostic assert a bit in this case because it should no longer be possible at this point to attempt to create a MessagePort after we've reached cancellation thanks to Eden's other cleanups.

In terms of follow-ups:
- We likely need to revisit how we're handling cycle collection for the MessagePort.  Right now we keep it alive entirely if it's entangled, but we really should be switching to doing ` KeepAliveIfHasListenersFor(nsGkAtoms::onmessage);` like we do for BroadcastChannel once we've reached the entangled stage.  (But we absolutely do need to keep the extra reference until the entanglement state machine is completed based on how it's currently structured.)

In terms of specific regressing versions, this specific thing we're seeing will have started in v117 with the landing of bug 1631263 and will not be present in ESR.  It's not clear if a different variation would be possible in versions prior to that.  The cancellation refactor bug 1800659 landed in v116 and would have eliminated prior variants.  In v115 and earlier cancellation would be likely to make it very hard to create the problem, although not impossible, so I expect an ESR uplift of everything but changes to EventWithOptionsRunnable would be appropriate.
The problem here seems to be more subtle than an ownership problem; the confusing state machine around entangling and UpdateMustKeepAlive which maintains a self-induced strong-ref through at least entangling means that the state machine is largely correct.  And changing the reference held by the MessagePortChild to a strong reference inherently would interfere with cycle collection[1].  That said, the raw pointer is inherently dangerous and the destructor and cycle collection unlink logic does not go far enough to ensure that the dangerous raw pointer is cleaned up.

By inspection, I suspect the issue here is that if we fail to create the WorkerRef we directly advanced the state machine to eStateDisentangled and call UpdateMustKeepAlive but without taking any action to ensure the actor is detached/cleaned up.  UpdateMustKeepAlive will drop our strong ref, and setting the state to disentangled defeats any actions that might be taken to clean up the actor properly.

I don't believe this is possible for the normal Worker.postMessage, but this is possible due to an unfortunate timing of events related to our implementation of RTCRtpScriptTransform which this POC uses.  [This is the WorkerRun implementation of EventWithOptionsRunnable that was forked from MessageEventRunnable for RTCRtpScriptTransform](https://searchfox.org/mozilla-central/rev/6602bdf9fff5020fbc8e248c963ddddf09a77b1b/dom/workers/EventWithOptionsRunnable.cpp#132-162).  [This is the WorkerRun implementation of MessageEventRunnable](https://searchfox.org/mozilla-central/rev/6602bdf9fff5020fbc8e248c963ddddf09a77b1b/dom/workers/MessageEventRunnable.cpp#109-148).  They do not look the same.  They should look the same.  [This is what is missing from EventWithOptionsRunnable](https://searchfox.org/mozilla-central/rev/6602bdf9fff5020fbc8e248c963ddddf09a77b1b/dom/workers/MessageEventRunnable.cpp#140-144) that would have prevented the MessagePort from being deserialized when the WorkerRef could not be created.  The problem is the code was forked just prior to the landing of the changes to MessageEventRunnable.

So there's a couple fixes here for tomorrow:
- Bring EventWithOptionsRunnable up-to-date.
- Add failsafes to MessagePort so that we drop the actor no matter what in our destructor, asserting very hard if we have to do that since it means there was a state machine problem.
   - Ensure the actor can handle the MessagePort not existing.  (It may already do this.)
- Correct the logic for when we can't create the WorkerRef to shut down the actor correctly/safely.  We may also want to diagnostic assert a bit in this case because it should no longer be possible at this point to attempt to create a MessagePort after we've reached cancellation thanks to Eden's other cleanups.

In terms of follow-ups:
- We likely need to revisit how we're handling cycle collection for the MessagePort.  Right now we keep it alive entirely if it's entangled, but we really should be switching to doing ` KeepAliveIfHasListenersFor(nsGkAtoms::onmessage);` like we do for BroadcastChannel once we've reached the entangled stage.  (But we absolutely do need to keep the extra reference until the entanglement state machine is completed based on how it's currently structured.)

In terms of specific regressing versions, this specific thing we're seeing will have started in v117 with the landing of bug 1631263 and will not be present in ESR.  It's not clear if a different variation would be possible in versions prior to that.  The cancellation refactor bug 1800659 landed in v116 and would have eliminated prior variants.  In v115 and earlier cancellation would be likely to make it very hard to create the problem, although not impossible, so I expect an ESR uplift of everything but changes to EventWithOptionsRunnable would be appropriate.

1: As noted above, the current cycle collection is not particularly cycle-collecty, so in a follow-up we should probably improve that.

Back to Bug 1854076 Comment 9