Assert that StrongWorkerRefs are not lazily released during final GC/CC
Categories
(Core :: DOM: Workers, task)
Tracking
()
Tracking | Status | |
---|---|---|
firefox116 | --- | fixed |
People
(Reporter: jstutte, Assigned: jstutte)
References
Details
Attachments
(1 file, 1 obsolete file)
During the investigations for bug 1775241 we noted a scenario where a StrongWorkerRef
was blocking a worker shutdown because due to a logic edge case it was released only during a GC/CC run - which never occurred given that the worker did not receive any more events.
Every object holding a WorkerRef should really have a straight, deterministic line from the WorkerRef's callback being invoked to the WorkerRef being released which is supported by strong-references that can't form a cycle. One might argue that once we have transitioned to Canceling that it should be a MOZ_DIAGNOSTIC_ASSERT-level crash if a WorkerRef is dropped during cycle collection because it implies that the WorkerRef contract was violated.
Assignee | ||
Updated•2 years ago
|
Assignee | ||
Comment 1•2 years ago
|
||
Updated•2 years ago
|
Assignee | ||
Comment 2•2 years ago
|
||
Depends on D150942
Updated•1 year ago
|
Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/4629c85e137b Assert that StrongWorkerRefs are not lazily released during final GC/CC. r=dom-worker-reviewers,smaug,asuth
Comment 4•1 year ago
•
|
||
Backed out causing multiple worker related failures
Assignee | ||
Comment 5•1 year ago
•
|
||
(In reply to Norisz Fay [:noriszfay] from comment #4)
Backed out causing multiple worker related failures
Is "caused" by the new diagnostic assert from this patch. It seems the fix for bug 1538754 might cover this.
Is a different assertion and I am kind of convinced it should be unrelated to this patch, as if ever this patch makes us GCCC more often than before, so if its own assertion is not triggered I see less probability for the global to stay alive, not more.
Assignee | ||
Comment 6•1 year ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #5)
Is a different assertion and I am kind of convinced it should be unrelated to this patch, as if ever this patch makes us GCCC more often than before, so if its own assertion is not triggered I see less probability for the global to stay alive, not more.
Hmm, there might be potential for a CC killing things that held a worker reference meant to prevent the worker from shutting down.
Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/b5217784536c Assert that StrongWorkerRefs are not lazily released during final GC/CC. r=dom-worker-reviewers,smaug,asuth
Comment 8•1 year ago
|
||
Backed out for causing assertion failures at WorkerPrivate.h.
Backout link: https://hg.mozilla.org/integration/autoland/rev/b528ceb6d1767d43a9ccce5c78c620cae8a30a24
Failure log: https://treeherder.mozilla.org/logviewer?job_id=410968588&repo=autoland&lineNumber=5698
Assignee | ||
Comment 9•1 year ago
|
||
(In reply to Atila Butkovits from comment #8)
Backed out for causing assertion failures at WorkerPrivate.h.
Backout link: https://hg.mozilla.org/integration/autoland/rev/b528ceb6d1767d43a9ccce5c78c620cae8a30a24
Failure log: https://treeherder.mozilla.org/logviewer?job_id=410968588&repo=autoland&lineNumber=5698
:asuth, that brings me back to bug 1770630 - any idea?
Comment 10•1 year ago
|
||
I think we definitely want to change the || (!aSyncLoopTarget && ParentStatus() > Running)
conditional check in WorkerPrivate::DispatchLockHeld to > Canceling
. Code has been somewhat assuming that wouldn't fire, but it's clearly firing here and messing stuff up and I think we're generally in a better place now that I don't think that's as likely to open us up to terrifying new permutations. r=asuth to make that change.
Assignee | ||
Comment 11•1 year ago
|
||
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #10)
I think we definitely want to change the
|| (!aSyncLoopTarget && ParentStatus() > Running)
conditional check in WorkerPrivate::DispatchLockHeld to> Canceling
. Code has been somewhat assuming that wouldn't fire, but it's clearly firing here and messing stuff up and I think we're generally in a better place now that I don't think that's as likely to open us up to terrifying new permutations. r=asuth to make that change.
OK, I rebased on top of current central and added this change. Let's try.
Assignee | ||
Comment 12•1 year ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #11)
OK, I rebased on top of current central and added this change. Let's try.
Even though it sounds right to allow dispatching while we potentially hold a worker alive through a strong worker ref, it seems like this was not helping. It looks like the fix from bug 1538754 was not sufficient to avoid that FetchStreamReader
releases its worker ref only during GC/CC. Pernosco session triggered, will post back a link here if successful.
Assignee | ||
Comment 13•1 year ago
|
||
Assignee | ||
Comment 14•1 year ago
|
||
The problem here seems to be, that the mAsyncWaitWorkerRef
keeps the worker alive, but nothing seems to keep the FetchStreamReader
alive which holds it and that has been allocated directly by Javascript, such that it gets cycle collected early.
In the meantime we close mPipe
here but we do not null out mAsyncWaitWorkerRef
. If the pipe object is capable of living longer than the FetchStreamReader
instance, shouldn't it carry away the mAsyncWaitWorkerRef
rather than keeping it here alongside the worker ref? And if not: Should we just null out mAsyncWaitWorkerRef
here, too?
Comment 15•1 year ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #14)
The problem here seems to be, that the
mAsyncWaitWorkerRef
keeps the worker alive, but nothing seems to keep theFetchStreamReader
alive which holds it and that has been allocated directly by Javascript, such that it gets cycle collected early.In the meantime we close
mPipe
here but we do not null outmAsyncWaitWorkerRef
. If the pipe object is capable of living longer than theFetchStreamReader
instance, shouldn't it carry away themAsyncWaitWorkerRef
rather than keeping it here alongside the worker ref? And if not: Should we just null outmAsyncWaitWorkerRef
here, too?
My intention with the mAsyncWaitWorkerRef
was that it would always be set so long as there is an outstanding AsyncWait
happening, and that it would be cleared when the AsyncWait
completes. We're intentionally keeping it around after closing mPipe
, as theoretically closing the pipe should cause the outstanding AsyncWait
callback to be invoked one last time to notify us of closure, which in turn will clear the reference here: https://searchfox.org/mozilla-central/rev/c41e3f43302c65b7ecdd85e0999c6138a7f2cbea/dom/fetch/FetchStreamReader.cpp#243,251.
I looked through the pernosco trace, and it appears that what happens is that the runnable to call OnOutputStreamReady()
is created as I expected from the nsPipeOutputStream
, and dispatched to the worker as normal. Unfortunately, as the worker is shutting down, this runnable is then cancelled, as all runnables dispatched to worker threads are required to be cancellable (which is the reason the runnables used for this asyncwait notification are cancelable). This means the OnOutputStreamREady
callback is never invoked, and the reference to the callback is dropped.
The solution here isn't to null out mAsyncWaitWorkerRef
earlier like you suggest, as we do need to keep the worker alive so we can deliver the async wait notification to it properly, unfortunately I had incorrectly assumed that a WorkerRef
would prevent runnables dispatched to that worker from being Cancel()
-ed.
I'm having a bit of trouble coming up with a great solution here, as we effectively to make sure that the WorkerRef has the same lifecycle as the callback, but the callback might be dropped in the edge-case that the worker is shutting down, meaning we'll never get the callback. The most obvious way forward right now to me is to make the nsIOutputStreamCallback
be a separate object which the FetchStreamReader
only holds a WeakPtr
to, storing the WorkerRef
in it. That way while we wouldn't get the callback we were supposed to, we will at least get the WorkerRef
destroyed. This feels like a bit of a hack though, so I'm not the biggest fan.
That being said, according to :asuth, we're actively removing the need for cancellation in bug 1800659, which would also fix the issue by making it so that we don't cancel the event, and thus the runnable is delivered, and we successfully clear everything. Depending on the timeline for that, it might be worth waiting for?
Comment 16•1 year ago
|
||
I think it makes sense to prioritize bug 1800659. :nika and I briefly discussed mitigations like making sure we can actually dispatch to the worker so that references can be dropped on the worker, but I think the work required to do that is comparable to removing cancellation and ClearMainEventQueue but with (the mitigation) being much harder to reason about because it increases the state space and potential edge cases greatly.
I think this patch is nicely demonstrating the real correctness problems that cancellation is causing and which are likely manifesting as shutdown hangs. As work usage increases, in particular with more sophisticated APIs, I think we can expect a general increase in the manifestations of the problem if not addressed.
Assignee | ||
Comment 17•1 year ago
|
||
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #16)
I think it makes sense to prioritize bug 1800659. :nika and I briefly discussed mitigations like making sure we can actually dispatch to the worker so that references can be dropped on the worker, but I think the work required to do that is comparable to removing cancellation and ClearMainEventQueue but with (the mitigation) being much harder to reason about because it increases the state space and potential edge cases greatly.
OK, I'd propose then to extract the piece that moves the global alive check until after the JS context is destroyed to a separate patch in order to unblock bug 1825589 (and other potential false alarm cases).
Comment 18•1 year ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #17)
OK, I'd propose then to extract the piece that moves the global alive check until after the JS context is destroyed to a separate patch in order to unblock bug 1825589 (and other potential false alarm cases).
Sounds good!
Assignee | ||
Updated•1 year ago
|
Comment 19•10 months ago
|
||
Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/1347ad62d6c5 Assert that StrongWorkerRefs are not lazily released during final GC/CC. r=dom-worker-reviewers,smaug,asuth
Comment 20•10 months ago
|
||
bugherder |
Description
•