Closed Bug 1837467 Opened 1 years ago Closed 1 years ago

GetRunnableForMTTask seems to be stuck on mMainThreadCV.Wait();

Categories

(Core :: DOM: Content Processes, defect, P2)

defect

Tracking

()

RESOLVED FIXED
116 Branch
Tracking Status
firefox116 --- fixed

People

(Reporter: jstutte, Assigned: jstutte)

References

(Blocks 1 open bug)

Details

Attachments

(2 files, 1 obsolete file)

Edit: See comment 5 of this bug for updated understanding. We came from:

From inspecting the minidump from several crashes like this with Visual Studio, it seems that we can see situations where we arrived inside the while (mMainThreadTasks.empty()) (thus having a supposedly empty task queue) and find us stuck on mMainThreadCV.Wait();, but looking at mMainThreadTasks it seems we have some task in the queue, as if there was a race (if we can trust the values in Visual Studio, but they seem reasonable and consistent over some crashes).

mMainThreadCV	{...}	mozilla::CondVar mozilla::OffTheBooksCondVar	{mLock=0x000001c923d72060 {...} mImpl={platformData_=0x000001c923d72108 {0x0000000000000000, 0x0000000000000000, ...} } }
mozilla::OffTheBooksCondVar mCurrentTasksMT	{ size=0 }	std::stack<RefPtr<mozilla::Task>,std::deque<RefPtr<mozilla::Task>,std::allocator<RefPtr<mozilla::Task>>>> 
mThreadableTasks	{ size=0 }	std::set<RefPtr<mozilla::Task>,mozilla::Task::PriorityCompare,std::allocator<RefPtr<mozilla::Task>>> 
mMainThreadTasks	{ size=1 }	std::set<RefPtr<mozilla::Task>,mozilla::Task::PriorityCompare,std::allocator<RefPtr<mozilla::Task>>>

Note that these are cases where the IPC shutdown state suggests that we received the side-channel NotifiedImpendingShutdown (not running through the main thread) but the content process main thread apparently never processed the "real" shutdown event the ContentParent sent.

Asking :nika for help with triage as she started to dig into this on matrix already. It might well be the wrong component and should live in XPCOM.

Flags: needinfo?(nika)

I looked into this more after the discussion on Matrix, but I don't think it was the issue I thought it might be from back then around the mixture of different waiting styles used. IIRC on Windows content processes, we don't really use the system event loop, so we'll end up using this condvar for waiting.

In terms of waiting, it is definitely possible for us to have no tasks to run while mMainThreadTasks is non-empty AFAICT, due to how things like idle tasks and suspended tasks work under TaskController. That being said, I don't fully understand how we're ending up in this exact situation. The manipulation of mMainThreadTasks is somewhat decoupled from mMainThreadCV being signaled, so there could be some edge-case where we aren't signaling and we should be?

Redirecting the ni? for now over to :bas, as I don't have the time to super dig into the code flow here and figure out if there's a possible way to end up in this situation.

Flags: needinfo?(nika) → needinfo?(bas)

As Nika mentioned on here, there being tasks in the queue while we're not doing anything is expected. Idle tasks are the most likely thing in the queue here. However we should be waking up when we are ready to process these. (Olli would know better exactly how this work, but I think it's related with receiving the idle token)

When we are shutting down all idle tasks should be free to be executed though. And when we're not shutting down pending idle tasks should never be causing a deadlock, otherwise they shouldn't be idle tasks :-).

As Nika said though, some of this code has changed over the years and the decoupling means we could have accidentally created a situation where the CV is not being signalled. We could try signalling it somewhat more aggressively during manipulation of mMainThreadTasks. The downside would be slightly more wakeups potentially, but I don't think it would do any real harm.

Flags: needinfo?(bas)

(In reply to Bas Schouten (:bas.schouten) from comment #4)

As Nika mentioned on here, there being tasks in the queue while we're not doing anything is expected. Idle tasks are the most likely thing in the queue here. However we should be waking up when we are ready to process these. (Olli would know better exactly how this work, but I think it's related with receiving the idle token)

OK, that makes it most likely to not matter if we have tasks in the queue.

When we are shutting down all idle tasks should be free to be executed though. And when we're not shutting down pending idle tasks should never be causing a deadlock, otherwise they shouldn't be idle tasks :-).

As Nika said though, some of this code has changed over the years and the decoupling means we could have accidentally created a situation where the CV is not being signalled. We could try signalling it somewhat more aggressively during manipulation of mMainThreadTasks. The downside would be slightly more wakeups potentially, but I don't think it would do any real harm.

The symptom we see is that apparently the ContentChild::RecvShutdown function is never executed. IIUC that is supposed to happen executing a MessageChannel::MessageTask that wraps ContentChild::OnMessage and is dispatched to the main thread.

However, re-reading the ContentParent::MaybeBeginShutDown code it seems to me, that if aSendShutDown is false we can end up starting our timer without having sent the ContentParent::SendShutdown at all ( at least here in this function ) which would obviously perfectly explain what is going on.

That function and flag has been introduced by bug 1661364 and it seems to me we should either decide to die or not, but not halfway.

And also ContentParent::MaybeAsyncSendShutDownMessage can still decide to not send the message.

I need to parse this a bit better, but it seems to me we might end up having started the timer on the parent side without ever even have scheduled the parent side task that is supposed to send the shutdown message.

Note: That would make it be in the correct component, of course.

Summary: GetRunnableForMTTask seems to be stuck on mMainThreadCV.Wait(); while having tasks in mMainThreadTasks → GetRunnableForMTTask seems to be stuck on mMainThreadCV.Wait();
Severity: -- → S3
Priority: -- → P2
Assignee: nobody → jstutte

The parent process might be busy such that the time that passes between NotifyTabDestroying and ShutdownProcess can become significant in terms of activating the ShutdownKill timer too early. In order to avoid this, we will start the timer only after we actually sent the shutdown message.
In all relevant cases we have already called SignalImpendingShutdownToContentJS such that we can omit it here (we do not expect any JS to run if we see a launch failure or other edge case).

Depends on D180958

Now that we start the ShutdownKill timer only after we actually sent the shutdown message, we want to ensure that a content process that we know will die but did not yet receive the shutdown message will start to idle as early as possible in case the parent process is too busy to send the shutdown message right away.

Depends on D180959

However, re-reading the ContentParent::MaybeBeginShutDown code it seems to me, that if aSendShutDown is false we can end up starting our timer without having sent the ContentParent::SendShutdown at all ( at least here in this function ) which would obviously perfectly explain what is going on.

I think we are always calling NotifyTabDestroyed (where aSendShutdown is true) after we called NotifyTabDestroying, so this seems to be ok, in principle.

However, as ContentParent::MaybeAsyncSendShutDownMessage dispatches an event to the parent's MT queue and only that event will then ask the child to shutdown, we might just see cases where the parent's main thread is so busy that we never execute the send before the timer kills the child. I think we should move the StartForceKillTimer very close to when we actually call SendShutdown.

Note that we will still SignalImpendingShutdownToContentJS early, such that the child process will hopefully stop to be (too) busy while waiting for the shutdown message. For that reason I'd propose to promote dom.abort_script_on_child_shutdown to be enabled everywhere.

Attachment #9339103 - Attachment description: Bug 1837467 - Remove the maybe part from AsyncSendShutdownMessage, relying on earlier keepalive checks. r?#dom-worker-reviewers,smaug → WIP: Bug 1837467 - Overhaul the thread safety and synchronization of our keepalive checks. r?#dom-worker-reviewers,smaug
Attachment #9339104 - Attachment description: Bug 1837467 - StartForceKillTimer only after SendShutdown has been called. r?smaug → WIP: Bug 1837467 - StartForceKillTimer only after SendShutdown has been called. r?smaug
Attachment #9339105 - Attachment description: Bug 1837467 - Enable dom.abort_script_on_child_shutdown everywhere. r?smaug → WIP: Bug 1837467 - Enable dom.abort_script_on_child_shutdown everywhere. r?smaug
Attachment #9339103 - Attachment description: WIP: Bug 1837467 - Overhaul the thread safety and synchronization of our keepalive checks. r?#dom-worker-reviewers,smaug → Bug 1837467 - Overhaul the thread safety and synchronization of our keepalive checks. r?#dom-worker-reviewers,smaug
Attachment #9339104 - Attachment description: WIP: Bug 1837467 - StartForceKillTimer only after SendShutdown has been called. r?smaug → Bug 1837467 - StartForceKillTimer only after SendShutdown has been called. r?smaug
Attachment #9339105 - Attachment description: WIP: Bug 1837467 - Enable dom.abort_script_on_child_shutdown everywhere. r?smaug → Bug 1837467 - Enable dom.abort_script_on_child_shutdown everywhere. r?smaug
Attachment #9339103 - Attachment description: Bug 1837467 - Overhaul the thread safety and synchronization of our keepalive checks. r?#dom-worker-reviewers,smaug → WIP: Bug 1837467 - Overhaul the thread safety and synchronization of our keepalive checks. r?#dom-worker-reviewers,smaug
Attachment #9339104 - Attachment description: Bug 1837467 - StartForceKillTimer only after SendShutdown has been called. r?smaug → WIP: Bug 1837467 - StartForceKillTimer only after SendShutdown has been called. r?smaug
Depends on: 1838779

Comment on attachment 9339103 [details]
WIP: Bug 1837467 - Overhaul the thread safety and synchronization of our keepalive checks. r?#dom-worker-reviewers,smaug

Revision D180958 was moved to bug 1838779. Setting attachment 9339103 [details] to obsolete.

Attachment #9339103 - Attachment is obsolete: true
Attachment #9339104 - Attachment description: WIP: Bug 1837467 - StartForceKillTimer only after SendShutdown has been called. r?smaug → Bug 1837467 - StartForceKillTimer only after SendShutdown has been called. r?smaug
Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/cd1285d0e9e6 StartForceKillTimer only after SendShutdown has been called. r=smaug https://hg.mozilla.org/integration/autoland/rev/a114eadef755 Enable dom.abort_script_on_child_shutdown everywhere. r=smaug
Status: NEW → RESOLVED
Closed: 1 years ago
Resolution: --- → FIXED
Target Milestone: --- → 116 Branch
No longer depends on: 1838779
Duplicate of this bug: 1789803
Blocks: 1813602
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: