Closed Bug 1777921 Opened 2 years ago Closed 10 months ago

Assert that StrongWorkerRefs are not lazily released during final GC/CC

Categories

(Core :: DOM: Workers, task)


Tracking


RESOLVED FIXED
116 Branch
Tracking Status
firefox116 --- fixed

People

(Reporter: jstutte, Assigned: jstutte)

References

Details

Attachments

(1 file, 1 obsolete file)

During the investigations for bug 1775241 we noticed a scenario where a StrongWorkerRef was blocking a worker shutdown: due to a logic edge case it was released only during a GC/CC run, which never occurred because the worker did not receive any more events.

Every object holding a WorkerRef should really have a straight, deterministic line from the WorkerRef's callback being invoked to the WorkerRef being released, supported by strong references that cannot form a cycle. One might argue that once we have transitioned to Canceling, it should be a MOZ_DIAGNOSTIC_ASSERT-level crash if a WorkerRef is dropped during cycle collection, because that implies the WorkerRef contract was violated.
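For illustration only, a minimal self-contained sketch of that contract (the names below, such as WorkerState and inFinalGCCC, are invented, and plain assert() stands in for MOZ_DIAGNOSTIC_ASSERT; this is not the landed patch):

```cpp
// Hypothetical sketch of the contract, not the actual Gecko code.
#include <cassert>

struct WorkerState {
  bool canceling = false;    // the worker has transitioned to Canceling
  bool inFinalGCCC = false;  // the final GC/CC pass is currently running
};

class StrongWorkerRefSketch {
 public:
  explicit StrongWorkerRefSketch(WorkerState* aWorker) : mWorker(aWorker) {}

  ~StrongWorkerRefSketch() {
    // After Canceling, holders must drop their refs deterministically from
    // their notification callback; being released only by the final GC/CC
    // means the WorkerRef contract was violated.
    assert(!(mWorker->canceling && mWorker->inFinalGCCC) &&
           "StrongWorkerRef released lazily during final GC/CC");
  }

 private:
  WorkerState* mWorker;
};
```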

See Also: → 1775241
Summary: Assert that StrongWorkerRefs are not lazily released during GC/CC → Assert that StrongWorkerRefs are not lazily released during final GC/CC
Assignee: nobody → jstutte
Status: NEW → ASSIGNED
Attachment #9302853 - Attachment is obsolete: true
Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4629c85e137b
Assert that StrongWorkerRefs are not lazily released during final GC/CC. r=dom-worker-reviewers,smaug,asuth
Regressions: 1823731
Flags: needinfo?(jstutte)

(In reply to Norisz Fay [:noriszfay] from comment #4)

Backed out for causing multiple worker-related failures

Backout link

Push with failures - c2
Failure log - c2

Is "caused" by the new diagnostic assert from this patch. It seems the fix for bug 1538754 might cover this.

Push with failure - Wc
Failure log - Wc

This is a different assertion and I am fairly convinced it is unrelated to this patch: if anything, this patch makes us GC/CC more often than before, so if its own assertion is not triggered, the global is even less likely to stay alive, not more.

Flags: needinfo?(jstutte)
See Also: → 1538754

(In reply to Jens Stutte [:jstutte] from comment #5)

Failure log - Wc

This is a different assertion and I am fairly convinced it is unrelated to this patch: if anything, this patch makes us GC/CC more often than before, so if its own assertion is not triggered, the global is even less likely to stay alive, not more.

Hmm, there might be potential for a CC killing things that held a worker reference meant to prevent the worker from shutting down.

Depends on: 1823855
No longer depends on: 1823855
Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b5217784536c
Assert that StrongWorkerRefs are not lazily released during final GC/CC. r=dom-worker-reviewers,smaug,asuth

I think we definitely want to change the || (!aSyncLoopTarget && ParentStatus() > Running) conditional check in WorkerPrivate::DispatchLockHeld to > Canceling. Code has been somewhat assuming that check wouldn't fire, but it's clearly firing here and messing things up, and I think we're generally in a better place now, so I don't think that change is as likely to open us up to terrifying new permutations. r=asuth to make that change.
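For clarity, a paraphrased before/after of the check being discussed; the surrounding checks and the exact failure path are simplified here and are not the tree code:

```cpp
// Before: dispatching starts failing as soon as the parent status passes
// Running, even though WorkerRef holders may still expect runnables.
//   if (... || (!aSyncLoopTarget && ParentStatus() > Running)) {
//     /* reject the dispatch */
//   }

// After: dispatching keeps working until the status passes Canceling, so
// WorkerRef holders can still have runnables delivered that let them
// release their refs deterministically.
//   if (... || (!aSyncLoopTarget && ParentStatus() > Canceling)) {
//     /* reject the dispatch */
//   }
```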

Flags: needinfo?(bugmail)

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #10)

I think we definitely want to change the || (!aSyncLoopTarget && ParentStatus() > Running) conditional check in WorkerPrivate::DispatchLockHeld to > Canceling. Code has been somewhat assuming that check wouldn't fire, but it's clearly firing here and messing things up, and I think we're generally in a better place now, so I don't think that change is as likely to open us up to terrifying new permutations. r=asuth to make that change.

OK, I rebased on top of current central and added this change. Let's try.

(In reply to Jens Stutte [:jstutte] from comment #11)

OK, I rebased on top of current central and added this change. Let's try.

Even though it sounds right to allow dispatching while we potentially hold a worker alive through a strong worker ref, it seems like this was not helping. It looks like the fix from bug 1538754 was not sufficient to prevent FetchStreamReader from releasing its worker ref only during GC/CC. Pernosco session triggered; I will post a link here if it is successful.

The problem here seems to be that mAsyncWaitWorkerRef keeps the worker alive, but nothing keeps alive the FetchStreamReader that holds it; since the FetchStreamReader has been allocated directly by JavaScript, it gets cycle collected early.

In the meantime we close mPipe here, but we do not null out mAsyncWaitWorkerRef. If the pipe object is capable of living longer than the FetchStreamReader instance, shouldn't it carry the mAsyncWaitWorkerRef away with it rather than leaving it here on the reader? And if not: should we just null out mAsyncWaitWorkerRef here, too?

Flags: needinfo?(nika)
See Also: → 1819221
Blocks: 1825589

(In reply to Jens Stutte [:jstutte] from comment #14)

The problem here seems to be that mAsyncWaitWorkerRef keeps the worker alive, but nothing keeps alive the FetchStreamReader that holds it; since the FetchStreamReader has been allocated directly by JavaScript, it gets cycle collected early.

In the meantime we close mPipe here, but we do not null out mAsyncWaitWorkerRef. If the pipe object is capable of living longer than the FetchStreamReader instance, shouldn't it carry the mAsyncWaitWorkerRef away with it rather than leaving it here on the reader? And if not: should we just null out mAsyncWaitWorkerRef here, too?

My intention with the mAsyncWaitWorkerRef was that it would always be set so long as there is an outstanding AsyncWait happening, and that it would be cleared when the AsyncWait completes. We're intentionally keeping it around after closing mPipe, as theoretically closing the pipe should cause the outstanding AsyncWait callback to be invoked one last time to notify us of closure, which in turn will clear the reference here: https://searchfox.org/mozilla-central/rev/c41e3f43302c65b7ecdd85e0999c6138a7f2cbea/dom/fetch/FetchStreamReader.cpp#243,251.
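A rough sketch of that intended lifecycle (simplified; mPipeOut, mStreamClosed, BeginAsyncWait and the surrounding structure are placeholders, not the actual FetchStreamReader code):

```cpp
// Sketch only: hold the worker alive while an AsyncWait is outstanding,
// and drop the ref from the final OnOutputStreamReady notification.
void FetchStreamReaderSketch::BeginAsyncWait() {
  mAsyncWaitWorkerRef =
      StrongWorkerRef::Create(mWorkerPrivate, "FetchStreamReaderSketch");
  mPipeOut->AsyncWait(this, 0, 0, mOwningEventTarget);
}

NS_IMETHODIMP
FetchStreamReaderSketch::OnOutputStreamReady(nsIAsyncOutputStream* aStream) {
  if (mStreamClosed) {
    // Closing mPipe is expected to trigger one last callback; only now is
    // it safe to let go of the worker.
    mAsyncWaitWorkerRef = nullptr;
    return NS_OK;
  }
  // ... otherwise write more data and re-arm AsyncWait ...
  return NS_OK;
}
```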

I looked through the pernosco trace, and it appears that what happens is that the runnable to call OnOutputStreamReady() is created as I expected from the nsPipeOutputStream and dispatched to the worker as normal. Unfortunately, as the worker is shutting down, this runnable is then cancelled, since all runnables dispatched to worker threads are required to be cancelable (which is the reason the runnables used for this AsyncWait notification are cancelable). This means the OnOutputStreamReady callback is never invoked, and the reference to the callback is dropped.
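To illustrate why cancellation silently drops the reference, a hedged sketch of a cancelable notification runnable (names are invented; this is not the actual nsPipe code):

```cpp
// Sketch only: when the worker shuts down, Cancel() is called instead of
// Run(), so the callback never gets a chance to clear mAsyncWaitWorkerRef.
class StreamReadyEventSketch final : public mozilla::CancelableRunnable {
 public:
  StreamReadyEventSketch(nsIOutputStreamCallback* aCallback,
                         nsIAsyncOutputStream* aStream)
      : mozilla::CancelableRunnable("StreamReadyEventSketch"),
        mCallback(aCallback),
        mStream(aStream) {}

  NS_IMETHOD Run() override {
    // Normal path: the callback runs and can release its WorkerRef or
    // re-arm AsyncWait.
    return mCallback->OnOutputStreamReady(mStream);
  }

  nsresult Cancel() override {
    // Worker-shutdown path: the callback is dropped without ever being
    // invoked, so its cleanup never happens.
    mCallback = nullptr;
    mStream = nullptr;
    return NS_OK;
  }

 private:
  nsCOMPtr<nsIOutputStreamCallback> mCallback;
  nsCOMPtr<nsIAsyncOutputStream> mStream;
};
```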

The solution here isn't to null out mAsyncWaitWorkerRef earlier like you suggest, as we do need to keep the worker alive so we can deliver the async wait notification to it properly. Unfortunately, I had incorrectly assumed that a WorkerRef would prevent runnables dispatched to that worker from being Cancel()-ed.

I'm having a bit of trouble coming up with a great solution here, as we effectively need to make sure that the WorkerRef has the same lifecycle as the callback, but the callback might be dropped in the edge case where the worker is shutting down, meaning we'll never get the callback. The most obvious way forward right now, to me, is to make the nsIOutputStreamCallback a separate object which the FetchStreamReader only holds a WeakPtr to, storing the WorkerRef in it. That way, while we wouldn't get the callback we were supposed to, we would at least get the WorkerRef destroyed. This feels like a bit of a hack though, so I'm not the biggest fan.
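A hedged sketch of that idea, with invented class and member names (not a real patch): a standalone callback object owns the WorkerRef, the stream holds the callback strongly via AsyncWait, and the FetchStreamReader only keeps a weak back-pointer to it:

```cpp
// Sketch only: if the notification runnable is cancelled and the stream
// drops its strong reference, this object's destructor still releases the
// WorkerRef even though OnOutputStreamReady never ran.
class AsyncWaitCallbackSketch final : public nsIOutputStreamCallback,
                                      public mozilla::SupportsWeakPtr {
 public:
  NS_DECL_THREADSAFE_ISUPPORTS

  AsyncWaitCallbackSketch(RefPtr<StrongWorkerRef> aWorkerRef,
                          FetchStreamReader* aReader)
      : mWorkerRef(std::move(aWorkerRef)), mReader(aReader) {}

  NS_IMETHOD OnOutputStreamReady(nsIAsyncOutputStream* aStream) override {
    // Forward to the reader if it is still around; the reader itself only
    // holds a WeakPtr to this object, so no cycle is formed.
    return mReader ? mReader->OnOutputStreamReady(aStream) : NS_OK;
  }

 private:
  ~AsyncWaitCallbackSketch() = default;  // releases mWorkerRef

  RefPtr<StrongWorkerRef> mWorkerRef;
  FetchStreamReader* mReader;  // weak back-pointer
};
```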

That being said, according to :asuth, we're actively removing the need for cancellation in bug 1800659, which would also fix the issue by making it so that we don't cancel the event; the runnable is then delivered and we successfully clear everything. Depending on the timeline for that, it might be worth waiting for?

Depends on: 1800659
Flags: needinfo?(nika)

I think it makes sense to prioritize bug 1800659. :nika and I briefly discussed mitigations like making sure we can actually dispatch to the worker so that references can be dropped on the worker, but I think the work required to do that is comparable to removing cancellation and ClearMainEventQueue, with the mitigation being much harder to reason about because it greatly increases the state space and potential edge cases.

I think this patch nicely demonstrates the real correctness problems that cancellation is causing, which are likely manifesting as shutdown hangs. As worker usage increases, in particular with more sophisticated APIs, I think we can expect a general increase in manifestations of the problem if it is not addressed.

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #16)

I think it makes sense to prioritize bug 1800659. :nika and I briefly discussed mitigations like making sure we can actually dispatch to the worker so that references can be dropped on the worker, but I think the work required to do that is comparable to removing cancellation and ClearMainEventQueue, with the mitigation being much harder to reason about because it greatly increases the state space and potential edge cases.

OK, I'd then propose extracting the piece that moves the global-alive check to after the JS context is destroyed into a separate patch, in order to unblock bug 1825589 (and other potential false-alarm cases).

(In reply to Jens Stutte [:jstutte] from comment #17)

OK, I'd then propose extracting the piece that moves the global-alive check to after the JS context is destroyed into a separate patch, in order to unblock bug 1825589 (and other potential false-alarm cases).

Sounds good!

No longer blocks: 1825589
Depends on: 1825589
Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1347ad62d6c5
Assert that StrongWorkerRefs are not lazily released during final GC/CC. r=dom-worker-reviewers,smaug,asuth
Status: ASSIGNED → RESOLVED
Closed: 10 months ago
Resolution: --- → FIXED
Target Milestone: --- → 116 Branch