Bugzilla

Comment 1

•

3 months ago

The bug is linked to a topcrash signature, which matches the following criteria:

Top 20 desktop browser crashes on beta
Top 10 content process crashes on beta

For more information, please visit BugBot documentation.

Keywords: topcrash

Comment 2

•

3 months ago

•

Edited

It is kind of expected that we see those now. We should examine the runnable names on nightly instances, I assume.

Flags: needinfo?(jstutte) → needinfo?(echuang)

Pascal Chevrel:pascalc

Updated

•

3 months ago

tracking-firefox123: --- → +

Comment 3

•

3 months ago

Please note that bug 1867982 mitigated things for release, so I am not sure we need to track this.

Pascal Chevrel:pascalc

Comment 4

•

3 months ago

Untracking as the crash volume indeed died after beta 7, thanks.

tracking-firefox123: + → ---

Comment 5

•

3 months ago

Bug 1880231 should give us more details here.

Flags: needinfo?(echuang)

Updated

•

3 months ago

Severity: -- → S3

Priority: -- → P2

Comment 6

•

3 months ago

(In reply to Jens Stutte [:jstutte] from comment #2)

It is kind of expected that we see those now. We should examine the runnable names on nightly instances, I assume.

To be more precise: Part of what happened earlier on bug 1836937 is now happening here (but mitigated for release).

Comment 7

•

3 months ago

In the newest nightly crashes I see the CompileScriptRunnable. Given the order of things, I understand that the CompileScriptRunnable is always dispatched after the WorkerThreadPrimaryRunnable is dispatched through runtimeService->RegisterWorker, but we do not necessarily hold back its dispatch until we have reached a stable state inside our DoRunLoop(cx). I assume that if something goes wrong before that, we might end up in the RunLoopNeverRan case and thus exit the WorkerThreadPrimaryRunnable and clean up before the CompileScriptRunnable is ever executed.

In theory, any WorkerDebuggeeRunnable could know that it should not execute anymore, as we notified its worker ref. In practice, there seems to be no callback on creation that takes any action whatsoever (not even setting a flag). As a consequence, the worker ref is only released when the WorkerDebuggeeRunnable instance is destroyed, which feels a bit odd, too.

We could have a callback on that WorkerDebuggeeRunnable::mSender worker ref that signals us not to do anything anymore, and wrap/override WorkerRunnable::Run to check that condition and bail out in that case. This might be a cleaner course of action in general and could even help with some shutdown hangs as a side effect. In practice, the mitigation from bug 1867982 might be just fine for this specific case if we remove the diagnostic assert.
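
A minimal sketch of that idea, written in standard C++ rather than against the Gecko tree: a runnable shares a flag with its worker ref, the ref sets the flag when it is notified, and Run() bails out once the flag is set. ToyWorkerRef, GuardedRunnable, Notify and CanceledFlag are hypothetical names used only for illustration; they are not the real WorkerRef or WorkerRunnable APIs.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical stand-in for a worker ref: it owns a shared "worker is going
// away" flag and sets it when the worker notifies the ref.
class ToyWorkerRef {
 public:
  ToyWorkerRef() : mCanceled(std::make_shared<std::atomic<bool>>(false)) {}

  // Called by the (toy) worker when it starts shutting down.
  void Notify() { mCanceled->store(true); }

  std::shared_ptr<std::atomic<bool>> CanceledFlag() const { return mCanceled; }

 private:
  std::shared_ptr<std::atomic<bool>> mCanceled;
};

// Hypothetical runnable wrapper: Run() checks the flag first and bails out
// instead of touching worker state once the ref has been notified.
class GuardedRunnable {
 public:
  GuardedRunnable(const ToyWorkerRef& aRef, std::function<void()> aBody)
      : mCanceled(aRef.CanceledFlag()), mBody(std::move(aBody)) {
    assert(mBody && "the runnable needs a callable body");
  }

  // Returns true if the body ran, false if the runnable bailed out.
  bool Run() {
    if (mCanceled->load()) {
      return false;
    }
    mBody();
    return true;
  }

 private:
  std::shared_ptr<std::atomic<bool>> mCanceled;
  std::function<void()> mBody;
};
```

In this sketch a runnable that is still queued when Notify() is called becomes a no-op, so it no longer matters whether it runs before or after the worker is torn down. Whether the same shape fits the real runnable hierarchy is an open question here.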

Eden, wdyt?

Flags: needinfo?(echuang)

Assignee

Updated

•

2 months ago

Duplicate of this bug: 1836937

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 9

•

2 months ago

Attached file Bug 1879272 - Clear Worker thread event queue in WorkerPrivate::RunLoopNevenRan and WorkerThread::SetWorker(nullptr). r=asuth — Details

Unfortunately, WorkerThread could be held by other objects through nsIThread/nsIEventTarget. In this case, event dispatching is not restricted by Worker status. This is needed because some objects need to continue the shutdown work even though the Worker is in "Dead" status. Therefore, runnables can still be dispatched to the Worker thread while breaking the connection between the Worker thread and the WorkerPrivate.

To fix the problem, this patch adds a final run that processes the worker thread's pending events before the disconnection, that is, before calling WorkerThread::SetWorker(nullptr).
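
The ordering described here can be sketched with a toy event queue in standard C++. This is an assumption-laden illustration, not the patch itself: ToyThread, ToyWorker, DrainPendingEvents and Disconnect are hypothetical names, and the real WorkerThread and WorkerPrivate classes are considerably more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical stand-in for the worker-side state (not the real WorkerPrivate).
struct ToyWorker {
  int handled = 0;
};

// Hypothetical stand-in for a thread with an event queue. Dispatch() stays
// open regardless of worker status, mirroring the nsIThread/nsIEventTarget
// path described above.
class ToyThread {
 public:
  void SetWorker(ToyWorker* aWorker) { mWorker = aWorker; }
  ToyWorker* Worker() const { return mWorker; }

  void Dispatch(std::function<void(ToyThread&)> aEvent) {
    mQueue.push_back(std::move(aEvent));
  }

  // Runs events until the queue is empty, including events that were
  // dispatched by other events during the drain. Returns how many ran.
  int DrainPendingEvents() {
    int ran = 0;
    while (!mQueue.empty()) {
      auto event = std::move(mQueue.front());
      mQueue.pop_front();
      event(*this);
      ++ran;
    }
    return ran;
  }

  // Final drain first, then break the thread-to-worker connection.
  void Disconnect() {
    DrainPendingEvents();
    assert(mQueue.empty());
    SetWorker(nullptr);
  }

  std::size_t PendingCount() const { return mQueue.size(); }

 private:
  ToyWorker* mWorker = nullptr;
  std::deque<std::function<void(ToyThread&)>> mQueue;
};
```

With this ordering, every event that was queued before Disconnect() still sees a non-null worker, and nothing is left in the queue when the pointer is cleared.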

Phabricator Automation

Updated

•

2 months ago

Assignee: nobody → echuang

Status: NEW → ASSIGNED

Comment 10

•

2 months ago

(In reply to Eden Chuang[:edenchuang] from comment #9)

Unfortunately, WorkerThread could be held by other objects through nsIThread/nsIEventTarget. In this case, event dispatching is not restricted by Worker status. This is needed because some objects need to continue the shutdown work even though the Worker is in "Dead" status. Therefore, runnables can still be dispatched to the Worker thread while breaking the connection between the Worker thread and the WorkerPrivate.

Can you elaborate on what code is doing this? Is this just CompileScriptRunnable from comment 7 or did you find other code doing it too?

Flags: needinfo?(echuang)

Andrew Sutherland [:asuth] (he/him)

Updated

•

2 months ago

Flags: needinfo?(echuang)

Comment 11

•

2 months ago

•

Edited

(In reply to Eden Chuang[:edenchuang] from comment #9)

Bug 1879272 - Clear Worker thread event queue before disconnecting WorkerThread and WorkerPrivate r=asuth
To fix the problem, this patch adds a final run that processes the worker thread's pending events before the disconnection, that is, before calling WorkerThread::SetWorker(nullptr).

Looking at the patch, I have difficulty matching this description to what the patch seems to do. What I read there is that the edge case WorkerPrivate::RunLoopNeverRan now performs additional event processing (though it seems to be limited to a single NS_ProcessNextEvent?) and that we slightly anticipate "Status transitions to Closing/Canceling" in the loop, but I did not try to understand the consequences of that change.

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #10)

Can you elaborate on what code is doing this? Is this just CompileScriptRunnable from comment 7 or did you find other code doing it too?

In all builds since bug 1876301 landed completely, we only ever see the CompileScriptRunnable. This makes me think that comment 7 could be a starting point for investigation. In fact, doing just some extra event processing in WorkerPrivate::RunLoopNeverRan might paper over this problem without really solving it cleanly.

In general, I would expect the mitigation from bug 1867982 to be enough to avoid serious problems, and the assertion there to be helpful for finding remaining or new offenders and treating them specifically.

However, there might be a blind spot: bug 1867982 limits the assertion to WorkerRunnable-derived runnables only. But AFAICS a non-WorkerRunnable would probably not expect the worker to be alive or try to access it anyway. For example, there might be some thread management/closure related runnable dispatched at some very late point. For all other runnables, at least, we already do our best to ensure there are no pending events when leaving WorkerThreadPrimaryRunnable::Run.

Assignee

Comment 12

•

2 months ago

The root cause of this bug's assertion is the same: we have pending events during WorkerPrivate::ScheduleDeletion().

After analyzing the corresponding crash stacks and runnable dispatching stacks, there are two situations:

1. WorkerPrivate::DoRunLoop is never executed.
   This is the case where worker initialization on the worker thread fails, and WorkerPrivate::RunLoopNeverRan() is then called to handle the failure. However, WorkerPrivate::mPreStartRunnables have already been dispatched when WorkerPrivate::SetWorkerPrivateInWorkerThread() is called, so by the time WorkerPrivate::RunLoopNeverRan() executes, the mPreStartRunnables must already be in the worker thread's queue. CompileScriptRunnable is one of these runnables.

2. The worker enters DoRunLoop(). The WorkerThread can be held as an nsIThread or nsIEventTarget, for example by

   nsCOMPtr<nsIThread> thread = NS_GetCurrentThread();
   thread->Dispatch(myRunnable);

   We don't block these dispatches after the worker's status changes to "Dead" because some objects, such as the cycle collector, need to complete their shutdown after the worker's shutdown. So ExternalWrappedRunnables (wrapped here) show up in the dispatching stack when we hit the assertion. The original runnable can come from anywhere and is related to the Worker.

Before we removed ClearMainEventQueue (bug 1800659), all these pending events were processed by calling ClearMainEventQueue() in WorkerPrivate::ScheduleDeletion(). After we removed ClearMainEventQueue, we only remove the mPreStartRunnables if needed.
What ClearMainEventQueue did might have been the correct thing, because it is hard to distinguish which objects should be permitted to dispatch runnables to the Worker thread via nsIThread.

To fix case 1, the patch handles the pending events in WorkerPrivate::RunLoopNeverRan() correctly by calling ProcessPendingEvents if needed. (Yes, it should be ProcessPendingEvents, not ProcessNextEvent. Try runs did not complain because we always hit the case where there is only one runnable in the queue.)

To fix case 2, we process the pending events before setting WorkerThread::mWorkerPrivate to nullptr. I think this is the correct point in time, because any other shutdown jobs should be finished before we detach the WorkerPrivate from the Worker thread.
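
The difference between processing one event and draining the queue, noted for case 1, can be shown with a toy queue in standard C++. This is only a sketch under assumptions: ToyEventQueue, ProcessNext, ProcessPending and LeftoverAfterSingleStep are hypothetical names that mirror the idea and are not the XPCOM helpers themselves.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical toy queue; the two methods only mirror the idea of processing
// one event versus draining everything.
class ToyEventQueue {
 public:
  void Dispatch(std::function<void()> aEvent) {
    mEvents.push(std::move(aEvent));
  }

  // Runs at most one event. Returns false if the queue was empty.
  bool ProcessNext() {
    if (mEvents.empty()) {
      return false;
    }
    auto event = std::move(mEvents.front());
    mEvents.pop();
    event();
    return true;
  }

  // Runs events until none are left. Returns the number processed.
  std::size_t ProcessPending() {
    std::size_t count = 0;
    while (ProcessNext()) {
      ++count;
    }
    return count;
  }

  std::size_t Size() const { return mEvents.size(); }

 private:
  std::queue<std::function<void()>> mEvents;
};

// Queues aQueued no-op events, processes a single one, and reports how many
// are still pending afterwards.
inline std::size_t LeftoverAfterSingleStep(int aQueued) {
  assert(aQueued >= 0);
  ToyEventQueue queue;
  for (int i = 0; i < aQueued; ++i) {
    queue.Dispatch([] {});
  }
  queue.ProcessNext();
  return queue.Size();
}
```

In this sketch, LeftoverAfterSingleStep(1) leaves nothing behind, which matches the observation that try runs with a single queued runnable did not complain, while any larger count leaves events pending unless ProcessPending() is used.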

Flags: needinfo?(echuang)

Phabricator Automation

Updated

•

2 months ago

Attachment #9389282 - Attachment description: Bug 1879272 - Clear Worker thread event queue before disconnecting WorkerThread and WorkerPrivate r=asuth → Bug 1879272 - Clear Worker thread event queue in WorkerPrivate::RunLoopNevenRan and WorkerThread::SetWorker(nullptr). r=asuth

https://hg.mozilla.org/mozilla-central/rev/e3df02eaa087

Comment 13

•

2 months ago

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash

Pulsebot

Comment 14

•

1 month ago

Pushed by echuang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e3df02eaa087
Clear Worker thread event queue in WorkerPrivate::RunLoopNevenRan and WorkerThread::SetWorker(nullptr). r=asuth

Iulian Moraru

Comment 15

•

1 month ago

bugherder

Status: ASSIGNED → RESOLVED

Closed: 1 month ago

status-firefox126: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 126 Branch

Ryan VanderMeulen [:RyanVM]

Comment 16

•

1 month ago

Since nightly and release are affected, beta will likely be affected too.
For more information, please visit BugBot documentation.

status-firefox125: --- → affected

Updated

•

1 month ago

status-firefox123: affected → wontfix

status-firefox124: affected → wontfix

status-firefox125: affected → wontfix

Assignee

Comment 17

•

1 month ago

Attached file Bug 1879272 - Remove assertion in WorkerRunnable::Run. r=asuth — Details

Because a WorkerRunnable can still be dispatched and run after the Worker's cycle collector has shut down, the assertion condition no longer holds, so we remove this code.

Pulsebot

Comment 18

•

1 month ago

Pushed by echuang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/3187c4f491a8
Remove assertion in WorkerRunnable::Run. r=asuth

https://hg.mozilla.org/mozilla-central/rev/3187c4f491a8

Comment 19

•

1 month ago

A patch has been attached on this bug, which was already closed. Filing a separate bug will ensure better tracking. If this was not by mistake and further action is needed, please alert the appropriate party. (Or: if the patch doesn't change behavior -- e.g. landing a test case, or fixing a typo -- then feel free to disregard this message)

Serban Stanca [:SerbanS]

Comment 20

•

1 month ago

bugherder