Closed Bug 1777921 Opened 2 years ago Closed 10 months ago

Assert that StrongWorkerRefs are not lazily released during final GC/CC

Categories

(Core :: DOM: Workers, task)


Tracking


RESOLVED FIXED
116 Branch
Tracking Status
firefox116 --- fixed

People

(Reporter: jstutte, Assigned: jstutte)

References

Details

Attachments

(1 file, 1 obsolete file)

During the investigations for bug 1775241 we noticed a scenario where a StrongWorkerRef was blocking a worker shutdown: due to a logic edge case it was released only during a GC/CC run, which never occurred because the worker did not receive any more events.

Every object holding a WorkerRef should really have a straight, deterministic line from the WorkerRef's callback being invoked to the WorkerRef being released, supported by strong references that cannot form a cycle. One might argue that once we have transitioned to Canceling, it should be a MOZ_DIAGNOSTIC_ASSERT-level crash if a WorkerRef is dropped during cycle collection, because that implies the WorkerRef contract was violated.
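For illustration only, a minimal self-contained sketch of that contract (the names below, such as WorkerState and inFinalGCCC, are invented, and plain assert() stands in for MOZ_DIAGNOSTIC_ASSERT; this is not the landed patch):

```cpp
// Hypothetical sketch of the contract, not the actual Gecko code.
#include <cassert>

struct WorkerState {
  bool canceling = false;    // the worker has transitioned to Canceling
  bool inFinalGCCC = false;  // the final GC/CC pass is currently running
};

class StrongWorkerRefSketch {
 public:
  explicit StrongWorkerRefSketch(WorkerState* aWorker) : mWorker(aWorker) {}

  ~StrongWorkerRefSketch() {
    // After Canceling, holders must drop their refs deterministically from
    // their notification callback; being released only by the final GC/CC
    // means the WorkerRef contract was violated.
    assert(!(mWorker->canceling && mWorker->inFinalGCCC) &&
           "StrongWorkerRef released lazily during final GC/CC");
  }

 private:
  WorkerState* mWorker;
};
```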

See Also: → 1775241
Summary: Assert that StrongWorkerRefs are not lazily released during GC/CC → Assert that StrongWorkerRefs are not lazily released during final GC/CC
Assignee: nobody → jstutte
Status: NEW → ASSIGNED
Attachment #9302853 - Attachment is obsolete: true
Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4629c85e137b
Assert that StrongWorkerRefs are not lazily released during final GC/CC. r=dom-worker-reviewers,smaug,asuth
Regressions: 1823731
Flags: needinfo?(jstutte)

(In reply to Norisz Fay [:noriszfay] from comment #4)

Backed out for causing multiple worker-related failures

Backout link

Push with failures - c2
Failure log - c2

Is "caused" by the new diagnostic assert from this patch. It seems the fix for bug 1538754 might cover this.

Push with failure - Wc
Failure log - Wc

This is a different assertion and I am fairly convinced it is unrelated to this patch: if anything, this patch makes us GC/CC more often than before, so if its own assertion is not triggered, the global is even less likely to stay alive, not more.

Flags: needinfo?(jstutte)
See Also: → 1538754

(In reply to Jens Stutte [:jstutte] from comment #5)

Failure log - Wc

This is a different assertion and I am fairly convinced it is unrelated to this patch: if anything, this patch makes us GC/CC more often than before, so if its own assertion is not triggered, the global is even less likely to stay alive, not more.

Hmm, there might be potential for a CC killing things that held a worker reference meant to prevent the worker from shutting down.

Depends on: 1823855
No longer depends on: 1823855
Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b5217784536c
Assert that StrongWorkerRefs are not lazily released during final GC/CC. r=dom-worker-reviewers,smaug,asuth

I think we definitely want to change the || (!aSyncLoopTarget && ParentStatus() > Running) conditional check in WorkerPrivate::DispatchLockHeld to > Canceling. Code has been somewhat assuming that check wouldn't fire, but it's clearly firing here and messing things up, and I think we're generally in a better place now, so I don't think that change is as likely to open us up to terrifying new permutations. r=asuth to make that change.
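For clarity, a paraphrased before/after of the check being discussed; the surrounding checks and the exact failure path are simplified here and are not the tree code:

```cpp
// Before: dispatching starts failing as soon as the parent status passes
// Running, even though WorkerRef holders may still expect runnables.
//   if (... || (!aSyncLoopTarget && ParentStatus() > Running)) {
//     /* reject the dispatch */
//   }

// After: dispatching keeps working until the status passes Canceling, so
// WorkerRef holders can still have runnables delivered that let them
// release their refs deterministically.
//   if (... || (!aSyncLoopTarget && ParentStatus() > Canceling)) {
//     /* reject the dispatch */
//   }
```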

Flags: needinfo?(bugmail)

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #10)

I think we definitely want to change the || (!aSyncLoopTarget && ParentStatus() > Running) conditional check in WorkerPrivate::DispatchLockHeld to > Canceling. Code has been somewhat assuming that check wouldn't fire, but it's clearly firing here and messing things up, and I think we're generally in a better place now, so I don't think that change is as likely to open us up to terrifying new permutations. r=asuth to make that change.

OK, I rebased on top of current central and added this change. Let's try.

(In reply to Jens Stutte [:jstutte] from comment #11)

OK, I rebased on top of current central and added this change. Let's try.

Even though it sounds right to allow dispatching while we potentially hold a worker alive through a strong worker ref, it seems like this was not helping. It looks like the fix from bug 1538754 was not sufficient to prevent FetchStreamReader from releasing its worker ref only during GC/CC. Pernosco session triggered; I will post a link here if it is successful.

The problem here seems to be that mAsyncWaitWorkerRef keeps the worker alive, but nothing keeps alive the FetchStreamReader that holds it; since the FetchStreamReader has been allocated directly by JavaScript, it gets cycle collected early.

In the meantime we close mPipe here, but we do not null out mAsyncWaitWorkerRef. If the pipe object is capable of living longer than the FetchStreamReader instance, shouldn't it carry the mAsyncWaitWorkerRef away with it rather than leaving it here on the reader? And if not: should we just null out mAsyncWaitWorkerRef here, too?

Flags: needinfo?(nika)
See Also: → 1819221
Blocks: 1825589

(In reply to Jens Stutte [:jstutte] from comment #14)

The problem here seems to be that mAsyncWaitWorkerRef keeps the worker alive, but nothing keeps alive the FetchStreamReader that holds it; since the FetchStreamReader has been allocated directly by JavaScript, it gets cycle collected early.

In the meantime we close mPipe here, but we do not null out mAsyncWaitWorkerRef. If the pipe object is capable of living longer than the FetchStreamReader instance, shouldn't it carry the mAsyncWaitWorkerRef away with it rather than leaving it here on the reader? And if not: should we just null out mAsyncWaitWorkerRef here, too?

My intention with the mAsyncWaitWorkerRef was that it would always be set so long as there is an outstanding AsyncWait happening, and that it would be cleared when the AsyncWait completes. We're intentionally keeping it around after closing mPipe, as theoretically closing the pipe should cause the outstanding AsyncWait callback to be invoked one last time to notify us of closure, which in turn will clear the reference here: https://searchfox.org/mozilla-central/rev/c41e3f43302c65b7ecdd85e0999c6138a7f2cbea/dom/fetch/FetchStreamReader.cpp#243,251.
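A rough sketch of that intended lifecycle (simplified; mPipeOut, mStreamClosed, BeginAsyncWait and the surrounding structure are placeholders, not the actual FetchStreamReader code):

```cpp
// Sketch only: hold the worker alive while an AsyncWait is outstanding,
// and drop the ref from the final OnOutputStreamReady notification.
void FetchStreamReaderSketch::BeginAsyncWait() {
  mAsyncWaitWorkerRef =
      StrongWorkerRef::Create(mWorkerPrivate, "FetchStreamReaderSketch");
  mPipeOut->AsyncWait(this, 0, 0, mOwningEventTarget);
}

NS_IMETHODIMP
FetchStreamReaderSketch::OnOutputStreamReady(nsIAsyncOutputStream* aStream) {
  if (mStreamClosed) {
    // Closing mPipe is expected to trigger one last callback; only now is
    // it safe to let go of the worker.
    mAsyncWaitWorkerRef = nullptr;
    return NS_OK;
  }
  // ... otherwise write more data and re-arm AsyncWait ...
  return NS_OK;
}
```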

I looked through the pernosco trace, and it appears that what happens is that the runnable to call OnOutputStreamReady() is created as I expected from the nsPipeOutputStream and dispatched to the worker as normal. Unfortunately, as the worker is shutting down, this runnable is then cancelled, since all runnables dispatched to worker threads are required to be cancelable (which is the reason the runnables used for this AsyncWait notification are cancelable). This means the OnOutputStreamReady callback is never invoked, and the reference to the callback is dropped.
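To illustrate why cancellation silently drops the reference, a hedged sketch of a cancelable notification runnable (names are invented; this is not the actual nsPipe code):

```cpp
// Sketch only: when the worker shuts down, Cancel() is called instead of
// Run(), so the callback never gets a chance to clear mAsyncWaitWorkerRef.
class StreamReadyEventSketch final : public mozilla::CancelableRunnable {
 public:
  StreamReadyEventSketch(nsIOutputStreamCallback* aCallback,
                         nsIAsyncOutputStream* aStream)
      : mozilla::CancelableRunnable("StreamReadyEventSketch"),
        mCallback(aCallback),
        mStream(aStream) {}

  NS_IMETHOD Run() override {
    // Normal path: the callback runs and can release its WorkerRef or
    // re-arm AsyncWait.
    return mCallback->OnOutputStreamReady(mStream);
  }

  nsresult Cancel() override {
    // Worker-shutdown path: the callback is dropped without ever being
    // invoked, so its cleanup never happens.
    mCallback = nullptr;
    mStream = nullptr;
    return NS_OK;
  }

 private:
  nsCOMPtr<nsIOutputStreamCallback> mCallback;
  nsCOMPtr<nsIAsyncOutputStream> mStream;
};
```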

The solution here isn't to null out mAsyncWaitWorkerRef earlier like you suggest, as we do need to keep the worker alive so we can deliver the async wait notification to it properly. Unfortunately, I had incorrectly assumed that a WorkerRef would prevent runnables dispatched to that worker from being Cancel()-ed.

I'm having a bit of trouble coming up with a great solution here, as we effectively need to make sure that the WorkerRef has the same lifecycle as the callback, but the callback might be dropped in the edge case where the worker is shutting down, meaning we'll never get the callback. The most obvious way forward right now, to me, is to make the nsIOutputStreamCallback a separate object which the FetchStreamReader only holds a WeakPtr to, storing the WorkerRef in it. That way, while we wouldn't get the callback we were supposed to, we would at least get the WorkerRef destroyed. This feels like a bit of a hack though, so I'm not the biggest fan.
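A hedged sketch of that idea, with invented class and member names (not a real patch): a standalone callback object owns the WorkerRef, the stream holds the callback strongly via AsyncWait, and the FetchStreamReader only keeps a weak back-pointer to it:

```cpp
// Sketch only: if the notification runnable is cancelled and the stream
// drops its strong reference, this object's destructor still releases the
// WorkerRef even though OnOutputStreamReady never ran.
class AsyncWaitCallbackSketch final : public nsIOutputStreamCallback,
                                      public mozilla::SupportsWeakPtr {
 public:
  NS_DECL_THREADSAFE_ISUPPORTS

  AsyncWaitCallbackSketch(RefPtr<StrongWorkerRef> aWorkerRef,
                          FetchStreamReader* aReader)
      : mWorkerRef(std::move(aWorkerRef)), mReader(aReader) {}

  NS_IMETHOD OnOutputStreamReady(nsIAsyncOutputStream* aStream) override {
    // Forward to the reader if it is still around; the reader itself only
    // holds a WeakPtr to this object, so no cycle is formed.
    return mReader ? mReader->OnOutputStreamReady(aStream) : NS_OK;
  }

 private:
  ~AsyncWaitCallbackSketch() = default;  // releases mWorkerRef

  RefPtr<StrongWorkerRef> mWorkerRef;
  FetchStreamReader* mReader;  // weak back-pointer
};
```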

That being said, according to :asuth, we're actively removing the need for cancellation in bug 1800659, which would also fix the issue by making it so that we don't cancel the event; the runnable is then delivered and we successfully clear everything. Depending on the timeline for that, it might be worth waiting for?

Depends on: 1800659
Flags: needinfo?(nika)

I think it makes sense to prioritize bug 1800659. :nika and I briefly discussed mitigations like making sure we can actually dispatch to the worker so that references can be dropped on the worker, but I think the work required to do that is comparable to removing cancellation and ClearMainEventQueue, with the mitigation being much harder to reason about because it greatly increases the state space and potential edge cases.

I think this patch nicely demonstrates the real correctness problems that cancellation is causing, which are likely manifesting as shutdown hangs. As worker usage increases, in particular with more sophisticated APIs, I think we can expect a general increase in manifestations of the problem if it is not addressed.

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #16)

I think it makes sense to prioritize bug 1800659. :nika and I briefly discussed mitigations like making sure we can actually dispatch to the worker so that references can be dropped on the worker, but I think the work required to do that is comparable to removing cancellation and ClearMainEventQueue, with the mitigation being much harder to reason about because it greatly increases the state space and potential edge cases.

OK, I'd then propose extracting the piece that moves the global-alive check to after the JS context is destroyed into a separate patch, in order to unblock bug 1825589 (and other potential false-alarm cases).

(In reply to Jens Stutte [:jstutte] from comment #17)

OK, I'd then propose extracting the piece that moves the global-alive check to after the JS context is destroyed into a separate patch, in order to unblock bug 1825589 (and other potential false-alarm cases).

Sounds good!

No longer blocks: 1825589
Depends on: 1825589
Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1347ad62d6c5
Assert that StrongWorkerRefs are not lazily released during final GC/CC. r=dom-worker-reviewers,smaug,asuth
Status: ASSIGNED → RESOLVED
Closed: 10 months ago
Resolution: --- → FIXED
Target Milestone: --- → 116 Branch