Closed Bug 1752305 Opened 2 years ago Closed 2 years ago

Stack overflow through MediaTrackGraphShutDownRunnable

Categories

(Core :: Audio/Video: MediaStreamGraph, defect)

Tracking

RESOLVED FIXED
105 Branch
Tracking Status
firefox105 --- fixed

People

(Reporter: jstutte, Assigned: karlt)

References

Details

Crash Data

Attachments

(1 file)

+++ This bug was initially created as a clone of Bug #1691517 +++

At least some of the crashes still visible in bug 1691517 seem to be caused by unbounded recursion through MediaTrackGraphShutDownRunnable.

Example: https://crash-stats.mozilla.org/report/index/ddaa54a7-555d-4fb0-a00b-204350220126

Component: Audio/Video → Audio/Video: MediaStreamGraph
Severity: -- → S3
See Also: → 1742804
See Also: → 1775341

https://crash-stats.mozilla.org/report/index/92724d7e-f8ce-40ae-bf97-1a0e50220630#allthreads has 565 "MediaTrackGrph" threads.
477 of these have passed SHDRCV and SHDACK.
These all seem to be waiting in _PR_NotifyJoinWaiters(), implying that nsThreadShutdownAckEvent has not run on the joining (main) thread.
There is one "GraphRunner" thread in a similar state.

There are 479 nsThread::Shutdown() calls on the stack from MediaTrackGraphShutDownRunnable::Run().
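
For orientation, a minimal sketch, with assumed names (the real class is MediaTrackGraphShutDownRunnable), of the pattern behind those numbers: each runnable joins its graph thread synchronously on the main thread, and the nested event loop inside nsThread::Shutdown() can pick up the next graph's shutdown runnable before the first join completes.

    #include "nsCOMPtr.h"
    #include "nsIThread.h"
    #include "nsThreadUtils.h"

    // Sketch only; one such runnable is queued on the main thread per
    // MediaTrackGraph being shut down.
    class ShutDownRunnableSketch : public mozilla::Runnable {
     public:
      explicit ShutDownRunnableSketch(nsIThread* aGraphThread)
          : mozilla::Runnable("ShutDownRunnableSketch"),
            mGraphThread(aGraphThread) {}

      NS_IMETHOD Run() override {  // main thread
        // Spins a nested event loop inside nsThread::Shutdown() until this
        // thread's nsThreadShutdownAckEvent has run.  Another queued
        // shutdown runnable may be processed inside that loop, calling
        // Shutdown() again one frame deeper; the crash stack shows 479
        // such frames.
        mGraphThread->Shutdown();
        return NS_OK;
      }

     private:
      nsCOMPtr<nsIThread> mGraphThread;
    };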

Assignee: nobody → karlt

This avoids putting a nested event loop on the stack, which may not unwind if
further threads are shut down, leading to stack exhaustion.

GraphDriver::Shutdown() is called only after control of the graph has been
handed to the main thread, so the graph is not running while this is called,
and MediaTrackGraphShutDownRunnable on the main thread should not need to
wait for thread shutdown.
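
The shape of the fix, as a sketch (simplified; the actual patch changes the graph/driver shutdown path):

    #include "nsIThread.h"

    // Sketch only, assuming aGraphThread is the MediaTrackGrph thread.
    static void ShutdownGraphThread(nsIThread* aGraphThread) {
      // Before: aGraphThread->Shutdown();  // joins synchronously via a
      //                                    // nested main-thread event loop
      aGraphThread->AsyncShutdown();  // same handshake, but returns
                                      // immediately; the thread is joined
                                      // later from the regular event loop,
                                      // so nothing piles up on the stack
    }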


Copying crash signatures from duplicate bugs.

Crash Signature: [@ mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal] → [@ mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal] [@ mozilla::PerformanceCounterState::MaybeReportAccumulatedTime]

(In reply to Karl Tomlinson (back July 25 :karlt) from comment #1)

https://crash-stats.mozilla.org/report/index/92724d7e-f8ce-40ae-bf97-1a0e50220630#allthreads has 565 "MediaTrackGrph" threads.
477 of these have passed SHDRCV and SHDACK.
These all seem to be waiting in _PR_NotifyJoinWaiters(), implying that nsThreadShutdownAckEvent has not run on the joining (main) thread.

Some nsThreadShutdownAckEvents might actually have run. I assume what happened here is that we stack several SpinEventLoopUntil calls, one per thread shutdown, each waiting on a different nsThreadShutdownAckEvent. So if the topmost SpinEventLoopUntil has not received its own nsThreadShutdownAckEvent, it will continue to block the entire stack below it.
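
As a hypothetical illustration of that stacking (assumed names, not actual code): frame i can only return once its own ack has run, so the topmost loop pins every frame beneath it.

    #include <cstddef>
    #include "mozilla/SpinEventLoopUntil.h"
    #include "nsString.h"

    // Hypothetical illustration only: acked[i] is flipped when the i-th
    // thread's nsThreadShutdownAckEvent runs on the main thread.
    static bool acked[480] = {};

    static void SpinForAck(size_t aIndex) {
      mozilla::SpinEventLoopUntil("SpinForAck"_ns, [&] {
        // Events processed while spinning may call SpinForAck(aIndex + 1),
        // nesting deeper.  Even if acked[aIndex] becomes true meanwhile,
        // this frame cannot unwind until every deeper frame has returned.
        return acked[aIndex];
      });
    }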

There is one "GraphRunner" thread in a similar state.

There are 479 nsThread::Shutdown() calls on the stack from MediaTrackGraphShutDownRunnable::Run().

Karl, is it normal to see so many threads? My reading of MediaTrackGraphInitThreadRunnable is that we want to re-use existing threads. So my assumption would be that while the existing thread is waiting to be joined, we have already cleared the flags that signal a running thread, and someone asks us to start a new thread - but we should not try to do so during shutdown (it would be torn down immediately anyway). Either MediaTrackGraphInitThreadRunnable or ThreadedDriver::RunThread should do an AppShutdown::IsInOrBeyond check for an appropriate phase, I think.
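
A sketch of the guard suggested here; the phase named below is an assumption, chosen only for illustration:

    #include "mozilla/AppShutdown.h"

    // Hypothetical guard for MediaTrackGraphInitThreadRunnable or
    // ThreadedDriver::RunThread: skip creating a replacement graph thread
    // once shutdown has progressed far enough that the new thread would be
    // torn down immediately anyway.
    static bool ShouldSkipGraphThreadStart() {
      // AppShutdownConfirmed is an assumed phase; an appropriate one
      // would need to be chosen.
      return mozilla::AppShutdown::IsInOrBeyond(
          mozilla::ShutdownPhase::AppShutdownConfirmed);
    }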

Crash Signature: [@ mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal] [@ mozilla::PerformanceCounterState::MaybeReportAccumulatedTime] → [@ mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal] [@ mozilla::PerformanceCounterState::MaybeReportAccumulatedTime]
Flags: needinfo?(karlt)

(huh, I did not touch crash signatures?)

See Also: → 1781029

(In reply to Jens Stutte [:jstutte] from comment #6)

(In reply to Karl Tomlinson (back July 25 :karlt) from comment #1)

https://crash-stats.mozilla.org/report/index/92724d7e-f8ce-40ae-bf97-1a0e50220630#allthreads has 565 "MediaTrackGrph" threads.
477 of these have passed SHDRCV and SHDACK.
These all seem to be waiting in _PR_NotifyJoinWaiters(), implying that nsThreadShutdownAckEvent has not run on the joining (main) thread.

Some nsThreadShutdownAckEvents might actually have run. I assume what happened here is that we stack several SpinEventLoopUntil calls, one per thread shutdown, each waiting on a different nsThreadShutdownAckEvent. So if the topmost SpinEventLoopUntil has not received its own nsThreadShutdownAckEvent, it will continue to block the entire stack below it.

Nested event loops are truly evil, and the situation you describe is just one reason why, but what happened here is slightly different.

The non-main thread stacks in that report suggest that nsThreadShutdownAckEvent has not Run() for those threads. _PR_NotifyJoinWaiters() waits for PR_JoinThread() from nsThreadShutdownAckEvent::Run().

A different timing between thread shutdowns might have given some nsThreadShutdownAckEvents a chance to run, in which case their threads may have terminated and would no longer show in the crash report. However, the count of Shutdown() calls on the stack matching the count of threads in _PR_NotifyJoinWaiters() indicates that this is not what happened.

This state where nsThreadShutdownAckEvents have not run suggests that the threads have all been shut down in succession, with MediaTrackGraphShutDownRunnables all queued before any nsThreadShutdownAckEvents are queued. Such a situation might occur on GC or page unload.

There are 479 nsThread::Shutdown() calls on the stack from MediaTrackGraphShutDownRunnable::Run().

Karl, is it normal to see so many threads?

It is not normal to see so many threads, but sites can trigger this.

My reading of MediaTrackGraphInitThreadRunnable is that we want to re-use existing threads.

There is one active thread per MediaTrackGraph. The problem here is that we have multiple MediaTrackGraphs.

So my assumption would be that while the existing thread is waiting to be joined, we have already cleared the flags that signal a running thread, and someone asks us to start a new thread - but we should not try to do so during shutdown (it would be torn down immediately anyway). Either MediaTrackGraphInitThreadRunnable or ThreadedDriver::RunThread should do an AppShutdown::IsInOrBeyond check for an appropriate phase, I think.

The assumption is that creation of a new graph is triggered only from content JS, which will not run late in shutdown. Graphs are shut down when the window is destroyed. Realtime graphs are also shut down during xpcom-will-shutdown, but I guess that is not necessary with shutdown on window destroy.

The main thread has not commenced XPCOM shutdown in the report here.

Flags: needinfo?(karlt)

The assumption is that creation of a new graph is triggered only from content JS, which will not run late in shutdown.

Well, we know that running (and blocking) content JS can interfere at least with our parent-process shutdown; bug 1777198 tries to improve things here in general. Here we are in the same content process, IIUC, but some additional resilience would probably not hurt?

(In reply to Karl Tomlinson (:karlt) from comment #8)

Realtime graphs are also shut down during xpcom-will-shutdown, but I guess that is not necessary with shutdown on window destroy.

The first shutdown comes during discarding of a window, after which script can run but cannot create a new AudioContext because the window is no longer fully active.
The shutdown on xpcom-will-shutdown would have been necessary prior to the inclusion of the fully-active check.
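
For reference, a sketch of where that check bites (hypothetical helper name; the real check lives in the AudioContext constructor path): a discarded window's document is no longer fully active, so content JS can still run but can no longer create the AudioContext that would spin up a new graph.

    #include "mozilla/ErrorResult.h"
    #include "nsPIDOMWindow.h"

    // Assumed helper; stands in for however "fully active" is determined.
    extern bool WindowIsFullyActive(nsPIDOMWindowInner* aWindow);

    // Hypothetical sketch of the fully-active gate.
    static bool CanCreateAudioContext(nsPIDOMWindowInner* aWindow,
                                      mozilla::ErrorResult& aRv) {
      if (!WindowIsFullyActive(aWindow)) {
        aRv.ThrowInvalidStateError(
            "Cannot create an AudioContext in a window that is not fully "
            "active");
        return false;
      }
      return true;
    }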

(In reply to Jens Stutte [:jstutte] from comment #9)

Here we are in the same content process, IIUC, but some additional resilience would probably not hurt?

I expect the fully-active check provides the is-beyond check that you are asking for.

Pushed by ktomlinson@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2bbe4f4ee2c2
AsyncShutdown() MediaTrackGrph thread r=padenot
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 105 Branch