Crash in [@ mozilla::dom::WorkerRunnable::Run]
Categories
(Core :: DOM: Workers, defect, P2)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr115 | --- | unaffected |
firefox122 | --- | unaffected |
firefox123 | --- | wontfix |
firefox124 | --- | wontfix |
firefox125 | --- | wontfix |
firefox126 | --- | fixed |
People
(Reporter: aryx, Assigned: edenchuang)
References
Details
(Keywords: crash)
Crash Data
Attachments
(2 files)
48 bytes,
text/x-phabricator-request
|
RyanVM
:
approval-mozilla-beta-
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
Details | Review |
There are 0-4 instances of Nightly 124.0a1 crashing per build. For beta, only 123.0b5 and 123.0b6 got crash reports (afterwards late beta) - could this be a regression from bug 1867982 which got uplifted in b5? All crashes on Windows.
Crash report: https://crash-stats.mozilla.org/report/index/eaa5238d-5012-42a6-a3ab-2feb30240208
MOZ_CRASH Reason: Runnable 'WorkerDebuggeeRunnable' executed after WorkerThreadPrimaryRunnable ended.
Top 10 frames of crashing thread:
0 xul.dll MOZ_Crash mfbt/Assertions.h:301
0 xul.dll mozilla::dom::WorkerRunnable::Run dom/workers/WorkerRunnable.cpp:237
1 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1193
1 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:480
2 xul.dll mozilla::ipc::MessagePumpForNonMainThreads::Run ipc/glue/MessagePump.cpp:300
3 xul.dll MessageLoop::RunInternal ipc/chromium/src/base/message_loop.cc:370
3 xul.dll MessageLoop::RunHandler ipc/chromium/src/base/message_loop.cc:363
4 xul.dll MessageLoop::Run ipc/chromium/src/base/message_loop.cc:345
4 xul.dll nsThread::ThreadFunc xpcom/threads/nsThread.cpp:370
5 nss3.dll _PR_NativeRunThread nsprpub/pr/src/threads/combined/pruthr.c:399
Comment 1•3 months ago
|
||
The bug is linked to a topcrash signature, which matches the following criteria:
- Top 20 desktop browser crashes on beta
- Top 10 content process crashes on beta
For more information, please visit BugBot documentation.
Comment 2•3 months ago
•
|
||
It is kind of expected that we see those now. We should examine the runnable names on nightly instances, I assume.
Updated•3 months ago
|
Comment 3•3 months ago
|
||
Please note that bug 1867982 mitigated things for release, not sure we need to track this.
Comment 4•3 months ago
|
||
Untracking as the crash volume indeed died after beta 7, thanks.
Comment 5•3 months ago
|
||
Bug 1880231 should give us more details here.
Updated•3 months ago
|
Comment 6•3 months ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #2)
It is kind of expected that we see those now. We should examine the runnable names on nightly instances, I assume.
To be more precise: Part of what happened earlier on bug 1836937 is now happening here (but mitigated for release).
Comment 7•3 months ago
|
||
In the newest nightly crashes I see the CompileScriptRunnable
. Given the order of things, I understand that the CompileScriptRunnable
is always dispatched after the WorkerThreadPrimaryRunnable
is dispatched through runtimeService->RegisterWorker
but we do not necessarily wait with its dispatch until we reached a stable state inside our DoRunLoop(cx)
. I assume that if something goes wrong before, we might end up in the RunLoopNeverRan
case and thus exit the WorkerThreadPrimaryRunnable
and cleanup before the CompileScriptRunnable
is ever executed.
In theory any WorkerDebuggeeRunnable
could know that it should not execute anymore as we notified its worker ref, in practice there seems to be no callback on create that takes any action (at least setting a flag) whatsoever (and as a consequence the worker ref is only released when the WorkerDebuggeeRunnable
instance is deconstructed, which feels a bit odd, too).
We could have a callback on that WorkerDebuggeeRunnable::mSender
worker ref that signals us to not do anything anymore and wrap/override WorkerRunnable::Run
to check that condition and bail out in case, which might be a cleaner course of action in general and could even help with some shutdown hangs, as a side effect? In practice the mitigation from bug 1867982 might just be fine for this specific case if we remove the diagnostic assert.
Eden, wdyt?
Assignee | ||
Comment 9•2 months ago
|
||
Unfortunately, WorkerThread could be held by other objects through nsIThread/nsIEventTarget. In this case, event dispatching is not restricted by Worker status. This is needed because some objects need to continue the shutdown work even though the Worker is in "Dead" status. Therefore, runnables can still be dispatched to the Worker thread while breaking the connection between the Worker thread and the WorkerPrivate.
To fix the problem, this patch added the final run to process the worker thread's pending events before the disconnection that before calling WorkerThread::SetWorker(nullptr).
Updated•2 months ago
|
Comment 10•2 months ago
|
||
(In reply to Eden Chuang[:edenchuang] from comment #9)
Unfortunately, WorkerThread could be held by other objects through nsIThread/nsIEventTarget. In this case, event dispatching is not restricted by Worker status. This is needed because some objects need to continue the shutdown work even though the Worker is in "Dead" status. Therefore, runnables can still be dispatched to the Worker thread while breaking the connection between the Worker thread and the WorkerPrivate.
Can you elaborate on what code is doing this? Is this just CompileScriptRunnable
from comment 7 or did you find other code doing it too?
Updated•2 months ago
|
Comment 11•2 months ago
•
|
||
(In reply to Eden Chuang[:edenchuang] from comment #9)
Bug 1879272 - Clear Worker thread event queue before disconnecting WorkerThread and WorkerPrivate r=asuth
To fix the problem, this patch added the final run to process the worker thread's pending events before the disconnection that before calling WorkerThread::SetWorker(nullptr).
Looking at the patch, I have difficulties to match this description to what the patch seems to do. What I read there is that the edge case WorkerPrivate::RunLoopNeverRan
has become an additional event processing (though it seems to be limited to one single NS_ProcessNextEvent
?) and we slightly anticipate "Status transitions to Closing/Canceling" in the loop but I did not try to understand the consequences of that change.
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #10)
Can you elaborate on what code is doing this? Is this just
CompileScriptRunnable
from comment 7 or did you find other code doing it too?
In all builds since bug 1876301 landed completely we only ever see the CompileScriptRunnable
. This lets me think that comment 7 could be a starting point for investigation. In fact, doing just some extra event processing in WorkerPrivate::RunLoopNeverRan
might paper over this problem without really solving it cleanly.
In general I would expect the mitigation from bug 1867982 to be enough to not have serious problems and the assertion there to be helpful to find remaining or new offenders and treat them specifically.
However there might be a blindspot, that is bug 1867982 limits the assertion to happen for WorkerRunnable
derived runnables only. But AFAICS a non-WorkerRunnable
would probably not expect the worker to live and try to access it, anyways. For example there might be some thread management/closure related runnable dispatched at some very late point. For all other runnables at least we already do our best to ensure there are no pending events when leaving WorkerThreadPrimaryRunnable::Run
.
Assignee | ||
Comment 12•2 months ago
|
||
The root cause of this bug's assertion is the same as we have pending events during WorkerPrivate::ScheduleDeletion().
After analyzing the corresponding crash stacks and runnable dispatching stacks, it can be two situations
-
The WorkerPrivate::DoRunLoop is never executed.
This is the case where the worker initialization on the worker thread fails. And then WorkerPrivate::RunLoopNeverRan() would be called for handling these fail cases. However, WorkerPrivate::mPreStartRunnables had already dispatched when WorkerPrivate::SetWorkerPrivateInWorkerThread() is called, so when the moment executing WorkerPrivate::RunLoopNeverRan(), mPreStartRunnables must in the worker thread already. CompileScriptRunnable is one of these runnable. -
Worker enters into DoRunLoop(). The WorkerThread can be held as nsIThread or nsIEventTarget by
nsCOMPtr<nsIThread> thread = NS_GetCurrentThread()
thread->Dispatch(myRunnable);
We don't block these runnable dispatchings after the worker's status changes to "Dead" because some objects, such as cycle_collector, need to complete their shutdown after the worker's shutdown. So ExternalWrappedRunnables(wrapped here) shows up in the dispatching stack when we hit the assertion. The original runnable is from everywhere and is related to Worker.
Before we remove the ClearMainEventQueue(bug 1800659), all these pending events are processed by calling ClearMainEventQueue() in WorkerPrivate::ScheduleDeletion(). But after we remove the ClearMainEventQueue, we only remove the mPreStartRunnables if needed.
This might be the correct thing that ClearMainEventQueue did. This is because it is hard to distinguish which object should have the permit to dispatch runnable to the Worker thread in the nsIThread way.
To fix case 1, the patch wants to handle the pending events in WorkerPrivate::RunLoopNevenRan() correctly. So, call ProcessPendingEvents if needed(Yes, it should be ProcessPendingEvents, not ProcessNexEvent. Try runs did not complain because we always meet the case only one runnable in the queue.)
To fix case 2, we process the pending events before setting WorkerThread::mWorkerPrivate as nullptr. I think this is a correct time point because I think any other shutdown jobs should be finished before we detach the WorkerPrivate from the Worker thread.
Updated•2 months ago
|
Comment 13•2 months ago
|
||
Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.
For more information, please visit BugBot documentation.
Comment 14•1 month ago
|
||
Pushed by echuang@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/e3df02eaa087 Clear Worker thread event queue in WorkerPrivate::RunLoopNevenRan and WorkerThread::SetWorker(nullptr). r=asuth
Comment 15•1 month ago
|
||
bugherder |
Comment 16•1 month ago
|
||
Since nightly and release are affected, beta will likely be affected too.
For more information, please visit BugBot documentation.
Updated•1 month ago
|
Assignee | ||
Comment 17•1 month ago
|
||
Because WorkerRunnable could be still dispatched and ran after Worker's cycle collector shutdown, the assertion condition is not fit anymore. So we remove these codes.
Comment 18•1 month ago
|
||
Pushed by echuang@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/3187c4f491a8 Remove assertion in WorkerRunnable::Run. r=asuth
Comment 19•1 month ago
|
||
A patch has been attached on this bug, which was already closed. Filing a separate bug will ensure better tracking. If this was not by mistake and further action is needed, please alert the appropriate party. (Or: if the patch doesn't change behavior -- e.g. landing a test case, or fixing a typo -- then feel free to disregard this message)
Comment 20•1 month ago
|
||
bugherder |
Assignee | ||
Comment 21•1 month ago
|
||
Comment on attachment 9389282 [details]
Bug 1879272 - Clear Worker thread event queue in WorkerPrivate::RunLoopNevenRan and WorkerThread::SetWorker(nullptr). r=asuth
Beta/Release Uplift Approval Request
- User impact if declined: Users might hit assertion during shutdown the Firefox.
- Is this code covered by automated tests?: Unknown
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): The patch removes the assertion that is not needed anymore.
- String changes made/needed:
- Is Android affected?: Yes
Comment 22•1 month ago
|
||
Comment on attachment 9389282 [details]
Bug 1879272 - Clear Worker thread event queue in WorkerPrivate::RunLoopNevenRan and WorkerThread::SetWorker(nullptr). r=asuth
AFAICT, this bug only impacts Nightly builds and given that 125 is moving to RC today, it doesn't seem like we need to uplift this.
Description
•