GetRunnableForMTTask seems to be stuck on mMainThreadCV.Wait();
Categories
(Core :: DOM: Content Processes, defect, P2)
Tracking
()
Tracking | Status | |
---|---|---|
firefox116 | --- | fixed |
People
(Reporter: jstutte, Assigned: jstutte)
References
(Blocks 1 open bug)
Details
Attachments
(2 files, 1 obsolete file)
Edit: See comment 5 of this bug for updated understanding. We came from:
From inspecting the minidump from several crashes like this with Visual Studio, it seems that we can see situations where we arrived inside the while (mMainThreadTasks.empty())
(thus having a supposedly empty task queue) and find us stuck on mMainThreadCV.Wait();
, but looking at mMainThreadTasks
it seems we have some task in the queue, as if there was a race (if we can trust the values in Visual Studio, but they seem reasonable and consistent over some crashes).
mMainThreadCV {...} mozilla::CondVar mozilla::OffTheBooksCondVar {mLock=0x000001c923d72060 {...} mImpl={platformData_=0x000001c923d72108 {0x0000000000000000, 0x0000000000000000, ...} } }
mozilla::OffTheBooksCondVar mCurrentTasksMT { size=0 } std::stack<RefPtr<mozilla::Task>,std::deque<RefPtr<mozilla::Task>,std::allocator<RefPtr<mozilla::Task>>>>
mThreadableTasks { size=0 } std::set<RefPtr<mozilla::Task>,mozilla::Task::PriorityCompare,std::allocator<RefPtr<mozilla::Task>>>
mMainThreadTasks { size=1 } std::set<RefPtr<mozilla::Task>,mozilla::Task::PriorityCompare,std::allocator<RefPtr<mozilla::Task>>>
Assignee | ||
Comment 1•1 years ago
|
||
Note that these are cases where the IPC shutdown state suggests that we received the side-channel NotifiedImpendingShutdown
(not running through the main thread) but the content process main thread apparently never processed the "real" shutdown event the ContentParent
sent.
Assignee | ||
Comment 2•1 years ago
|
||
Asking :nika for help with triage as she started to dig into this on matrix already. It might well be the wrong component and should live in XPCOM.
Comment 3•1 years ago
|
||
I looked into this more after the discussion on Matrix, but I don't think it was the issue I thought it might be from back then around the mixture of different waiting styles used. IIRC on Windows content processes, we don't really use the system event loop, so we'll end up using this condvar for waiting.
In terms of waiting, it is definitely possible for us to have no tasks to run while mMainThreadTasks
is non-empty AFAICT, due to how things like idle tasks and suspended tasks work under TaskController. That being said, I don't fully understand how we're ending up in this exact situation. The manipulation of mMainThreadTasks
is somewhat decoupled from mMainThreadCV
being signaled, so there could be some edge-case where we aren't signaling and we should be?
Redirecting the ni? for now over to :bas, as I don't have the time to super dig into the code flow here and figure out if there's a possible way to end up in this situation.
Comment 4•1 years ago
|
||
As Nika mentioned on here, there being tasks in the queue while we're not doing anything is expected. Idle tasks are the most likely thing in the queue here. However we should be waking up when we are ready to process these. (Olli would know better exactly how this work, but I think it's related with receiving the idle token)
When we are shutting down all idle tasks should be free to be executed though. And when we're not shutting down pending idle tasks should never be causing a deadlock, otherwise they shouldn't be idle tasks :-).
As Nika said though, some of this code has changed over the years and the decoupling means we could have accidentally created a situation where the CV is not being signalled. We could try signalling it somewhat more aggressively during manipulation of mMainThreadTasks
. The downside would be slightly more wakeups potentially, but I don't think it would do any real harm.
Assignee | ||
Comment 5•1 years ago
•
|
||
(In reply to Bas Schouten (:bas.schouten) from comment #4)
As Nika mentioned on here, there being tasks in the queue while we're not doing anything is expected. Idle tasks are the most likely thing in the queue here. However we should be waking up when we are ready to process these. (Olli would know better exactly how this work, but I think it's related with receiving the idle token)
OK, that makes it most likely to not matter if we have tasks in the queue.
When we are shutting down all idle tasks should be free to be executed though. And when we're not shutting down pending idle tasks should never be causing a deadlock, otherwise they shouldn't be idle tasks :-).
As Nika said though, some of this code has changed over the years and the decoupling means we could have accidentally created a situation where the CV is not being signalled. We could try signalling it somewhat more aggressively during manipulation of
mMainThreadTasks
. The downside would be slightly more wakeups potentially, but I don't think it would do any real harm.
The symptom we see is that apparently the ContentChild::RecvShutdown
function is never executed. IIUC that is supposed to happen executing a MessageChannel::MessageTask
that wraps ContentChild::OnMessage
and is dispatched to the main thread.
However, re-reading the ContentParent::MaybeBeginShutDown
code it seems to me, that if aSendShutDown
is false
we can end up starting our timer without having sent the ContentParent::SendShutdown
at all ( at least here in this function ) which would obviously perfectly explain what is going on.
That function and flag has been introduced by bug 1661364 and it seems to me we should either decide to die or not, but not halfway.
And also ContentParent::MaybeAsyncSendShutDownMessage
can still decide to not send the message.
I need to parse this a bit better, but it seems to me we might end up having started the timer on the parent side without ever even have scheduled the parent side task that is supposed to send the shutdown message.
Note: That would make it be in the correct component, of course.
Assignee | ||
Updated•1 years ago
|
Assignee | ||
Updated•1 years ago
|
Assignee | ||
Comment 6•1 years ago
|
||
Assignee | ||
Comment 7•1 years ago
|
||
The parent process might be busy such that the time that passes between NotifyTabDestroying and ShutdownProcess can become significant in terms of activating the ShutdownKill timer too early. In order to avoid this, we will start the timer only after we actually sent the shutdown message.
In all relevant cases we have already called SignalImpendingShutdownToContentJS such that we can omit it here (we do not expect any JS to run if we see a launch failure or other edge case).
Depends on D180958
Assignee | ||
Comment 8•1 years ago
|
||
Now that we start the ShutdownKill timer only after we actually sent the shutdown message, we want to ensure that a content process that we know will die but did not yet receive the shutdown message will start to idle as early as possible in case the parent process is too busy to send the shutdown message right away.
Depends on D180959
Assignee | ||
Comment 9•1 years ago
•
|
||
However, re-reading the
ContentParent::MaybeBeginShutDown
code it seems to me, that ifaSendShutDown
isfalse
we can end up starting our timer without having sent theContentParent::SendShutdown
at all ( at least here in this function ) which would obviously perfectly explain what is going on.
I think we are always calling NotifyTabDestroyed
(where aSendShutdown
is true) after we called NotifyTabDestroying
, so this seems to be ok, in principle.
However, as ContentParent::MaybeAsyncSendShutDownMessage
dispatches an event to the parent's MT queue and only that event will then ask the child to shutdown, we might just see cases where the parent's main thread is so busy that we never execute the send before the timer kills the child. I think we should move the StartForceKillTimer
very close to when we actually call SendShutdown
.
Note that we will still SignalImpendingShutdownToContentJS
early, such that the child process will hopefully stop to be (too) busy while waiting for the shutdown message. For that reason I'd propose to promote dom.abort_script_on_child_shutdown
to be enabled everywhere.
Updated•1 years ago
|
Updated•1 years ago
|
Updated•1 years ago
|
Updated•1 years ago
|
Updated•1 years ago
|
Updated•1 years ago
|
Updated•1 years ago
|
Updated•1 years ago
|
Comment 10•1 years ago
|
||
Comment on attachment 9339103 [details]
WIP: Bug 1837467 - Overhaul the thread safety and synchronization of our keepalive checks. r?#dom-worker-reviewers,smaug
Revision D180958 was moved to bug 1838779. Setting attachment 9339103 [details] to obsolete.
Updated•1 years ago
|
Comment 11•1 years ago
|
||
Comment 12•1 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/cd1285d0e9e6
https://hg.mozilla.org/mozilla-central/rev/a114eadef755
Description
•