Closed Bug 1879272 Opened 3 months ago Closed 1 month ago

Crash in [@ mozilla::dom::WorkerRunnable::Run]

Categories

(Core :: DOM: Workers, defect, P2)

Unspecified
Windows
defect

Tracking

()

RESOLVED FIXED
126 Branch
Tracking Status
firefox-esr115 --- unaffected
firefox122 --- unaffected
firefox123 --- wontfix
firefox124 --- wontfix
firefox125 --- wontfix
firefox126 --- fixed

People

(Reporter: aryx, Assigned: edenchuang)

References

Details

(Keywords: crash)

Crash Data

Attachments

(2 files)

There are 0-4 instances of Nightly 124.0a1 crashing per build. For beta, only 123.0b5 and 123.0b6 got crash reports (afterwards late beta) - could this be a regression from bug 1867982 which got uplifted in b5? All crashes on Windows.

Crash report: https://crash-stats.mozilla.org/report/index/eaa5238d-5012-42a6-a3ab-2feb30240208

MOZ_CRASH Reason: Runnable 'WorkerDebuggeeRunnable' executed after WorkerThreadPrimaryRunnable ended.

Top 10 frames of crashing thread:

0  xul.dll  MOZ_Crash  mfbt/Assertions.h:301
0  xul.dll  mozilla::dom::WorkerRunnable::Run  dom/workers/WorkerRunnable.cpp:237
1  xul.dll  nsThread::ProcessNextEvent  xpcom/threads/nsThread.cpp:1193
1  xul.dll  NS_ProcessNextEvent  xpcom/threads/nsThreadUtils.cpp:480
2  xul.dll  mozilla::ipc::MessagePumpForNonMainThreads::Run  ipc/glue/MessagePump.cpp:300
3  xul.dll  MessageLoop::RunInternal  ipc/chromium/src/base/message_loop.cc:370
3  xul.dll  MessageLoop::RunHandler  ipc/chromium/src/base/message_loop.cc:363
4  xul.dll  MessageLoop::Run  ipc/chromium/src/base/message_loop.cc:345
4  xul.dll  nsThread::ThreadFunc  xpcom/threads/nsThread.cpp:370
5  nss3.dll  _PR_NativeRunThread  nsprpub/pr/src/threads/combined/pruthr.c:399
Flags: needinfo?(jstutte)

The bug is linked to a topcrash signature, which matches the following criteria:

  • Top 20 desktop browser crashes on beta
  • Top 10 content process crashes on beta

For more information, please visit BugBot documentation.

Keywords: topcrash

It is kind of expected that we see those now. We should examine the runnable names on nightly instances, I assume.

Flags: needinfo?(jstutte) → needinfo?(echuang)

Please note that bug 1867982 mitigated things for release, not sure we need to track this.

Untracking as the crash volume indeed died after beta 7, thanks.

Bug 1880231 should give us more details here.

Flags: needinfo?(echuang)
See Also: → 1880231
Severity: -- → S3
Priority: -- → P2

(In reply to Jens Stutte [:jstutte] from comment #2)

It is kind of expected that we see those now. We should examine the runnable names on nightly instances, I assume.

To be more precise: Part of what happened earlier on bug 1836937 is now happening here (but mitigated for release).

See Also: → 1836937

In the newest nightly crashes I see the CompileScriptRunnable. Given the order of things, I understand that the CompileScriptRunnable is always dispatched after the WorkerThreadPrimaryRunnable is dispatched through runtimeService->RegisterWorker but we do not necessarily wait with its dispatch until we reached a stable state inside our DoRunLoop(cx). I assume that if something goes wrong before, we might end up in the RunLoopNeverRan case and thus exit the WorkerThreadPrimaryRunnable and cleanup before the CompileScriptRunnable is ever executed.

In theory any WorkerDebuggeeRunnable could know that it should not execute anymore as we notified its worker ref, in practice there seems to be no callback on create that takes any action (at least setting a flag) whatsoever (and as a consequence the worker ref is only released when the WorkerDebuggeeRunnable instance is deconstructed, which feels a bit odd, too).

We could have a callback on that WorkerDebuggeeRunnable::mSender worker ref that signals us to not do anything anymore and wrap/override WorkerRunnable::Run to check that condition and bail out in case, which might be a cleaner course of action in general and could even help with some shutdown hangs, as a side effect? In practice the mitigation from bug 1867982 might just be fine for this specific case if we remove the diagnostic assert.

Eden, wdyt?

Flags: needinfo?(echuang)
Duplicate of this bug: 1836937

Unfortunately, WorkerThread could be held by other objects through nsIThread/nsIEventTarget. In this case, event dispatching is not restricted by Worker status. This is needed because some objects need to continue the shutdown work even though the Worker is in "Dead" status. Therefore, runnables can still be dispatched to the Worker thread while breaking the connection between the Worker thread and the WorkerPrivate.

To fix the problem, this patch added the final run to process the worker thread's pending events before the disconnection that before calling WorkerThread::SetWorker(nullptr).

Assignee: nobody → echuang
Status: NEW → ASSIGNED

(In reply to Eden Chuang[:edenchuang] from comment #9)

Unfortunately, WorkerThread could be held by other objects through nsIThread/nsIEventTarget. In this case, event dispatching is not restricted by Worker status. This is needed because some objects need to continue the shutdown work even though the Worker is in "Dead" status. Therefore, runnables can still be dispatched to the Worker thread while breaking the connection between the Worker thread and the WorkerPrivate.

Can you elaborate on what code is doing this? Is this just CompileScriptRunnable from comment 7 or did you find other code doing it too?

Flags: needinfo?(echuang)
Flags: needinfo?(echuang)

(In reply to Eden Chuang[:edenchuang] from comment #9)

Bug 1879272 - Clear Worker thread event queue before disconnecting WorkerThread and WorkerPrivate r=asuth
To fix the problem, this patch added the final run to process the worker thread's pending events before the disconnection that before calling WorkerThread::SetWorker(nullptr).

Looking at the patch, I have difficulties to match this description to what the patch seems to do. What I read there is that the edge case WorkerPrivate::RunLoopNeverRan has become an additional event processing (though it seems to be limited to one single NS_ProcessNextEvent?) and we slightly anticipate "Status transitions to Closing/Canceling" in the loop but I did not try to understand the consequences of that change.

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #10)

Can you elaborate on what code is doing this? Is this just CompileScriptRunnable from comment 7 or did you find other code doing it too?

In all builds since bug 1876301 landed completely we only ever see the CompileScriptRunnable. This lets me think that comment 7 could be a starting point for investigation. In fact, doing just some extra event processing in WorkerPrivate::RunLoopNeverRan might paper over this problem without really solving it cleanly.

In general I would expect the mitigation from bug 1867982 to be enough to not have serious problems and the assertion there to be helpful to find remaining or new offenders and treat them specifically.

However there might be a blindspot, that is bug 1867982 limits the assertion to happen for WorkerRunnable derived runnables only. But AFAICS a non-WorkerRunnable would probably not expect the worker to live and try to access it, anyways. For example there might be some thread management/closure related runnable dispatched at some very late point. For all other runnables at least we already do our best to ensure there are no pending events when leaving WorkerThreadPrimaryRunnable::Run.

The root cause of this bug's assertion is the same as we have pending events during WorkerPrivate::ScheduleDeletion().

After analyzing the corresponding crash stacks and runnable dispatching stacks, it can be two situations

  1. The WorkerPrivate::DoRunLoop is never executed.
    This is the case where the worker initialization on the worker thread fails. And then WorkerPrivate::RunLoopNeverRan() would be called for handling these fail cases. However, WorkerPrivate::mPreStartRunnables had already dispatched when WorkerPrivate::SetWorkerPrivateInWorkerThread() is called, so when the moment executing WorkerPrivate::RunLoopNeverRan(), mPreStartRunnables must in the worker thread already. CompileScriptRunnable is one of these runnable.

  2. Worker enters into DoRunLoop(). The WorkerThread can be held as nsIThread or nsIEventTarget by

nsCOMPtr<nsIThread> thread = NS_GetCurrentThread()
thread->Dispatch(myRunnable);

We don't block these runnable dispatchings after the worker's status changes to "Dead" because some objects, such as cycle_collector, need to complete their shutdown after the worker's shutdown. So ExternalWrappedRunnables(wrapped here) shows up in the dispatching stack when we hit the assertion. The original runnable is from everywhere and is related to Worker.

Before we remove the ClearMainEventQueue(bug 1800659), all these pending events are processed by calling ClearMainEventQueue() in WorkerPrivate::ScheduleDeletion(). But after we remove the ClearMainEventQueue, we only remove the mPreStartRunnables if needed.
This might be the correct thing that ClearMainEventQueue did. This is because it is hard to distinguish which object should have the permit to dispatch runnable to the Worker thread in the nsIThread way.

To fix case 1, the patch wants to handle the pending events in WorkerPrivate::RunLoopNevenRan() correctly. So, call ProcessPendingEvents if needed(Yes, it should be ProcessPendingEvents, not ProcessNexEvent. Try runs did not complain because we always meet the case only one runnable in the queue.)

To fix case 2, we process the pending events before setting WorkerThread::mWorkerPrivate as nullptr. I think this is a correct time point because I think any other shutdown jobs should be finished before we detach the WorkerPrivate from the Worker thread.

Flags: needinfo?(echuang)
Attachment #9389282 - Attachment description: Bug 1879272 - Clear Worker thread event queue before disconnecting WorkerThread and WorkerPrivate r=asuth → Bug 1879272 - Clear Worker thread event queue in WorkerPrivate::RunLoopNevenRan and WorkerThread::SetWorker(nullptr). r=asuth

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash
Pushed by echuang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e3df02eaa087
Clear Worker thread event queue in WorkerPrivate::RunLoopNevenRan and WorkerThread::SetWorker(nullptr). r=asuth
Status: ASSIGNED → RESOLVED
Closed: 1 month ago
Resolution: --- → FIXED
Target Milestone: --- → 126 Branch

Since nightly and release are affected, beta will likely be affected too.
For more information, please visit BugBot documentation.

Because WorkerRunnable could be still dispatched and ran after Worker's cycle collector shutdown, the assertion condition is not fit anymore. So we remove these codes.

Pushed by echuang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/3187c4f491a8
Remove assertion in WorkerRunnable::Run. r=asuth

A patch has been attached on this bug, which was already closed. Filing a separate bug will ensure better tracking. If this was not by mistake and further action is needed, please alert the appropriate party. (Or: if the patch doesn't change behavior -- e.g. landing a test case, or fixing a typo -- then feel free to disregard this message)

Comment on attachment 9389282 [details]
Bug 1879272 - Clear Worker thread event queue in WorkerPrivate::RunLoopNevenRan and WorkerThread::SetWorker(nullptr). r=asuth

Beta/Release Uplift Approval Request

  • User impact if declined: Users might hit assertion during shutdown the Firefox.
  • Is this code covered by automated tests?: Unknown
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The patch removes the assertion that is not needed anymore.
  • String changes made/needed:
  • Is Android affected?: Yes
Attachment #9389282 - Flags: approval-mozilla-beta?

Comment on attachment 9389282 [details]
Bug 1879272 - Clear Worker thread event queue in WorkerPrivate::RunLoopNevenRan and WorkerThread::SetWorker(nullptr). r=asuth

AFAICT, this bug only impacts Nightly builds and given that 125 is moving to RC today, it doesn't seem like we need to uplift this.

Attachment #9389282 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: