Closed Bug 1678330 Opened 4 years ago Closed 11 months ago

Crash in [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThreadManager::SpinEventLoopUntilInternal]

Categories

(Core :: XPCOM, defect)

Unspecified
Windows 10
defect

Tracking

()

RESOLVED INCOMPLETE
Tracking Status
firefox-esr78 --- unaffected
firefox83 --- wontfix
firefox84 --- wontfix
firefox85 --- fix-optional
firefox86 --- ?

People

(Reporter: pascalc, Unassigned)

References

Details

(Keywords: crash, regression)

Crash Data

Attachments

(2 files, 1 obsolete file)

Crash report: https://crash-stats.mozilla.org/report/index/a05e3e76-f5bf-4af9-89e4-0e0be0201119

MOZ_CRASH Reason: MOZ_CRASH(Shutdown hanging before starting.)

Top 10 frames of crashing thread:

0 ntdll.dll NtWaitForAlertByThreadId 
1 ntdll.dll RtlSleepConditionVariableSRW 
2 kernelbase.dll SleepConditionVariableSRW 
3 mozglue.dll mozilla::detail::ConditionVariableImpl::wait mozglue/misc/ConditionVariable_windows.cpp:50
4 xul.dll mozilla::TaskController::GetRunnableForMTTask xpcom/threads/TaskController.cpp:283
5 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1143
6 xul.dll nsThreadManager::SpinEventLoopUntilInternal xpcom/threads/nsThreadManager.cpp:719
7 xul.dll XPTC__InvokebyIndex 
8  @0x21b6b82e83f 
9 xul.dll trunc 

Bugbug thinks this bug should belong to this component, but please revert this change in case of error.

Component: General → XPCOM
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → DUPLICATE

The situation here is not the same as in bug 1505660, so let's keep them separate.

Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---

This signature doesn't show up on nightly since mid-to-late November, do you know if this got fixed or the signature changed?

I see a few instances of [@ shutdownhang | nsThreadManager::SpinEventLoopUntilInternal ] on Nightly, such as bp-3de2cfce-5bfe-4c7d-8d45-c21f80210106 . That signature looks similar to the one in comment 0.

It seems, that the variant with GetRunnableForMTTask is happening currently only under OSX.

Crash Signature: [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThreadManager::SpinEventLoopUntilInternal] → [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThreadManager::SpinEventLoopUntilInternal] [@ shutdownhang | nsThreadManager::SpinEventLoopUntilInternal ]
Flags: needinfo?(jstutte)
Crash Signature: [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThreadManager::SpinEventLoopUntilInternal] [@ shutdownhang | nsThreadManager::SpinEventLoopUntilInternal ] → [@ shutdownhang | __psynch_cvwait | mozilla::TaskController::GetRunnableForMTTask | nsThreadManager::SpinEventLoopUntilInternal] [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThreadManager::SpinEventLoopUntilInternal] [@ shutdownha…

FWIW, I cracked up a dump and it seems, we are executing some JS on the main thread that somehow calls nsThreadManager::SpinEventLoopUntil directly from JS:

...
 	xul.dll!nsThreadManager::SpinEventLoopUntilInternal(nsINestedEventLoopCondition * aCondition, bool aCheckingShutdown) Zeile 727	C++
 	xul.dll!nsThreadManager::SpinEventLoopUntil(nsINestedEventLoopCondition * aCondition) Zeile 706	C++
 	xul.dll!_NS_InvokeByIndex()	Unbekannt
 	[Inlineframe] xul.dll!CallMethodHelper::Invoke() Zeile 1620	C++
 	[Inlineframe] xul.dll!CallMethodHelper::Call() Zeile 1176	C++
 	xul.dll!XPCWrappedNative::CallMethod(XPCCallContext & ccx, XPCWrappedNative::CallMode mode) Zeile 1142	C++
 	xul.dll!XPC_WN_CallMethod(JSContext * cx, unsigned int argc, JS::Value * vp) Zeile 926	C++
 	[Inlineframe] xul.dll!CallJSNative(JSContext * cx, bool(*)(JSContext *, unsigned int, JS::Value *) native, js::CallReason reason, const JS::CallArgs & args) Zeile 503	C++
 	xul.dll!js::InternalCallOrConstruct(JSContext * cx, const JS::CallArgs & args, js::MaybeConstruct construct, js::CallReason reason) Zeile 594	C++
 	[Inlineframe] xul.dll!InternalCall(JSContext * cx, const js::AnyInvokeArgs & args, js::CallReason reason) Zeile 647	C++
 	[Inlineframe] xul.dll!js::CallFromStack(JSContext * cx, const JS::CallArgs & args) Zeile 651	C++
 	xul.dll!Interpret(JSContext * cx, js::RunState & state) Zeile 3309	C++
 	[Inlineframe] xul.dll!js::RunScript(JSContext * cx, js::RunState & state) Zeile 473	C++
...

From the MOZ_CRASH(Shutdown hanging before starting any known phase.) I deduce, that we just called the observer notification "profile-after-change" which started the terminator watchdog, but the terminator did never receive another shutdown heartbeat notification out of these.

So I would look out for any Javascript executed on the main thread that reacts on "profile-after-change" and potentially calls nsThreadManager::SpinEventLoopUntil directly from JS.

So we have JS that use an explicit nsThreadManager::SpinEventLoopUntilOrShutdown which bails out in case of shutdown.

We have some SpinEventLoopUntil cases out of this list where we explicitly handle shutdown.

And then there are JS using SpinEventLoopUntil that seem to not care about shutdown at all.

I was wondering if there is any good reason to not always use nsThreadManager::SpinEventLoopUntilOrShutdown from Javascript (or to not make nsThreadManager::SpinEventLoopUntil always aware of shutdown in JS)? It seems to be a relatively easy thing to do, but maybe I am overlooking something.

Flags: needinfo?(nika)

(In reply to Jens Stutte [:jstutte] from comment #8)

From the MOZ_CRASH(Shutdown hanging before starting any known phase.) I deduce, that we just called the observer notification "profile-after-change" which started the terminator watchdog, but the terminator did never receive another shutdown heartbeat notification out of these.

So I would look out for any Javascript executed on the main thread that reacts on "profile-after-change" and potentially calls nsThreadManager::SpinEventLoopUntil directly from JS.

Actually the crash report annotation says "ShutdownProgress: quit-application". There is a known issue in nsTerminator that may result in not having fired the heartbeat as the first observer notification for a given phase.

So looking for any Javascript that reacts on "quit-application" and potentially calls nsThreadManager::SpinEventLoopUntil directly from JS yields:

devtools/client/framework/devtools-browser.js

as only result.

(In reply to Jens Stutte [:jstutte] from comment #9)

I was wondering if there is any good reason to not always use nsThreadManager::SpinEventLoopUntilOrShutdown from Javascript (or to not make nsThreadManager::SpinEventLoopUntil always aware of shutdown in JS)? It seems to be a relatively easy thing to do, but maybe I am overlooking something.

Actually it is not clear to me if in this stage of the shutdown mShuttingDown would have been set already. So while being a legitimate question, it is not clear to me, if it would help to solve this problem, as we seem to be in the parent process here and "xpcom-shutdown" is not yet reached.

Crash Signature: shutdownhang | nsThreadManager::SpinEventLoopUntilInternal ] → shutdownhang | nsThreadManager::SpinEventLoopUntilInternal ] [@ shutdownhang | trunc | nsThreadManager::SpinEventLoopUntilInternal ]

(In reply to Jens Stutte [:jstutte] from comment #8)

So I would look out for any Javascript executed on the main thread that reacts on "profile-after-change" and potentially calls nsThreadManager::SpinEventLoopUntil directly from JS.

profile-after-change is actually fired early during startup (it's a confusing name I know), whereas profile-before-change is the call site during shutdown. In this case we're in "quit-application".

(In reply to Jens Stutte [:jstutte] from comment #10)

Actually the crash report annotation says "ShutdownProgress: quit-application". There is a known issue in nsTerminator that may result in not having fired the heartbeat as the first observer notification for a given phase.

Yup, the the normal flow for nsTerminator looks like:

  1. Using a manifest, nsTerminator is registered as a listener for profile-after-change (https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/toolkit/components/terminator/terminator.manifest#1), this event is fired as the user's profile becomes available, and it uses the notification to register shutdown observers. It doesn't start the watchdog thread at this time. https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/toolkit/components/terminator/nsTerminator.cpp#483-485

  2. When the first of the phase notifications are fired, the watchdog thread is started (https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/toolkit/components/terminator/nsTerminator.cpp#489-494), then we perform updates with that particular phase in mind. This is where the "ShutdownProgress" annotation comes from, which in your case is "quit-application", meaning the last shutdown phase notification the watchdog saw was "quit-application".

This means that the checks in the RunWatchdog function to generate the crash message are misleading, as we have definitely entered the "quit-application" phase, but the check is actually checking if we've moved past the "quit-application" phase, as we only set mTicks to a positive value when we observe a new notification in UpdateHeartbeat (https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/toolkit/components/terminator/nsTerminator.cpp#517).

AFAIK there are 2 main places where quit-application can be fired: From nsAppStartup::Quit (https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/toolkit/components/startup/nsAppStartup.cpp#446), or due to a MOZ_WM_APP_QUIT message on Windows instructing us to exit immediately (https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/widget/windows/nsWindow.cpp#5206). The most likely situation here is the first, but it probably doesn't matter.

(In reply to Jens Stutte [:jstutte] from comment #11)

Actually it is not clear to me if in this stage of the shutdown mShuttingDown would have been set already. So while being a legitimate question, it is not clear to me, if it would help to solve this problem, as we seem to be in the parent process here and "xpcom-shutdown" is not yet reached.

Yeah, I think it's unlikely that using spinEventLoopUntilOrShutdown will help much here, as that will only be aborted once the "xpcom-shutdown" observer notification has fired (https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/xpcom/threads/nsThreadManager.cpp#297-301), which hasn't happened yet, as otherwise either the watchdog or the "ShutdownProgress" annotation would know about it if we had.


Based on my current understanding, it seems like after quit-application is fired, the shutdown process can't continue until the event loop has become idle, and no nested events are being processed. The call to run the root event loop needs to have exited in nsAppStartup::Run(), which will happen next time the root event loop becomes idle, so if we have any nested event loop deadlocks, they'll definitely cause us to time out here.

I'm thinking a reasonable starting place here might be to require each caller of spinEventLoopUntil{OrShutdown} (perhaps including native callers) to pass in a string describing why they're spinning an event loop, and include the currently active set of these strings in a crash annotation explaining what nested event loops are currently being executed on the main thread if the crash occurs with nested loops spinning. It might end up looking something like:

NestedEventLoops: openwindow;debuggerpaused;alertdialog;

We shouldn't be using nested event loops in any places where performance is sensitive enough we can't get away with a bit of extra overhead for that, and it would really help for figuring out what's happening in situations like this one.

Flags: needinfo?(nika)

(In reply to Nika Layzell [:nika] (ni? for response) from comment #12)

(In reply to Jens Stutte [:jstutte] from comment #10)

Actually the crash report annotation says "ShutdownProgress: quit-application". There is a known issue in nsTerminator that may result in not having fired the heartbeat as the first observer notification for a given phase.

Yup, the the normal flow for nsTerminator looks like:

  1. Using a manifest, nsTerminator is registered as a listener for profile-after-change (https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/toolkit/components/terminator/terminator.manifest#1), this event is fired as the user's profile becomes available, and it uses the notification to register shutdown observers. It doesn't start the watchdog thread at this time. https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/toolkit/components/terminator/nsTerminator.cpp#483-485

  2. When the first of the phase notifications are fired, the watchdog thread is started (https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/toolkit/components/terminator/nsTerminator.cpp#489-494), then we perform updates with that particular phase in mind. This is where the "ShutdownProgress" annotation comes from, which in your case is "quit-application", meaning the last shutdown phase notification the watchdog saw was "quit-application".

This means that the checks in the RunWatchdog function to generate the crash message are misleading, as we have definitely entered the "quit-application" phase, but the check is actually checking if we've moved past the "quit-application" phase, as we only set mTicks to a positive value when we observe a new notification in UpdateHeartbeat (https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/toolkit/components/terminator/nsTerminator.cpp#517).

There is some effort I was doing together with :dthayer over in bug 1689953 to improve this situation and to make the nsTerminator crash reports and telemetry more accurate.

...
Based on my current understanding, it seems like after quit-application is fired, the shutdown process can't continue until the event loop has become idle, and no nested events are being processed. The call to run the root event loop needs to have exited in nsAppStartup::Run(), which will happen next time the root event loop becomes idle, so if we have any nested event loop deadlocks, they'll definitely cause us to time out here.

I'm thinking a reasonable starting place here might be to require each caller of spinEventLoopUntil{OrShutdown} (perhaps including native callers) to pass in a string describing why they're spinning an event loop, and include the currently active set of these strings in a crash annotation explaining what nested event loops are currently being executed on the main thread if the crash occurs with nested loops spinning. It might end up looking something like:

NestedEventLoops: openwindow;debuggerpaused;alertdialog;

We shouldn't be using nested event loops in any places where performance is sensitive enough we can't get away with a bit of extra overhead for that, and it would really help for figuring out what's happening in situations like this one.

Thanks, that sounds like a good plan for this bug here. There are not too many callers of SpinEventLoopUntil{OrShutdown} at least from JS side, I did not check native though.

See Also: → 1689953
Assignee: nobody → jstutte
Attachment #9204685 - Attachment description: Bug 1678330: Ensure nested JS SpinEventLoopUntil calls are traceable to the originating source in case of crash r?nika → Bug 1678330: Ensure nested SpinEventLoopUntil(OrShutdown) calls are traceable to the originating source in case of crash. r?nika
Keywords: leave-open

Data review for diagnostic patch D106029.

Attachment #9205072 - Flags: data-review?(chutten)

Comment on attachment 9205072 [details]
DataReviewRequest1678330.txt

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes.

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes, Jens Stutte and Nika Layzell are responsible.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

Is the data collection request for default-on or default-off?

Default on for all channels.

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does there need to be a check-in in the future to determine whether to renew the data?

No. This collection is permanent.


Result: datareview+

Attachment #9205072 - Flags: data-review?(chutten) → data-review+

Made AutoNestedEventLoopAnnotation constructor explicit, try builds on linux/mac/win: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e7896907a8349af8c80dcccf49208df5dcf7d94a

Builds are green, this should be ready to commandeer back from me and attempt landing. 🤞

We saw an issue where moz-phab patch complained that the revision was missing commit information; if there's something wrong that also breaks compatibility with Lando we may need to create a new revision for this change and carry forward the reviews.

Flags: needinfo?(jstutte)
Attachment #9204685 - Attachment is obsolete: true

(In reply to Nihanth Subramanya [:nhnt11] from comment #19)

We saw an issue where moz-phab patch complained that the revision was missing commit information; if there's something wrong that also breaks compatibility with Lando we may need to create a new revision for this change and carry forward the reviews.

I gave up on the old revision, it seems there was vital information for moz-phab missing from the commit. Kindly asking to get another formal sign-off, sorry for the mess.

Flags: needinfo?(jstutte) → needinfo?(nika)
Flags: needinfo?(nika)
See Also: → 1695820
Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/03cae7800b41
Ensure nested SpinEventLoopUntil(OrShutdown) calls are traceable to the originating source in case of crash. r=nika,extension-reviewers
Backout by csabou@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/106794cdd13b
Backed out changeset 03cae7800b41 for mochitest plain failures on test_window_open_discarded_bc.html. CLOSED TREE

It seems I overlooked some instance of SpinEventLoopUntil (in an html file, sic).

Flags: needinfo?(jstutte)
Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/194eb6cbf6ab
Ensure nested SpinEventLoopUntil(OrShutdown) calls are traceable to the originating source in case of crash. r=nika,extension-reviewers
Depends on: 1696397

For the records: all six crashes caught so far point to Prompter.jsm.

The leave-open keyword is there and there is no activity for 6 months.
:jstutte, maybe it's time to close this bug?

Flags: needinfo?(jstutte)

Taking a look at current crashes indicate XPIProvider.jsm:awaitPromise as a hanging SpinEventLoopUntil cause.

Unfortunately there seem to be (at least) two possible call paths that get us here:

Shane, you introduced those promises, any thought?

Flags: needinfo?(jstutte) → needinfo?(mixedpuppy)

Could you point me at a specific report? I looked and there are a ton, but not sure what to specifically look for.

Flags: needinfo?(mixedpuppy) → needinfo?(jstutte)

You can now display xpcom spin event lop stack and aggregate and filter on it.

05d129fc-b63f-4e8b-8ee9-d81310211103 is one example.

Flags: needinfo?(jstutte) → needinfo?(mixedpuppy)
Depends on: 1739171

So we should monitor now if the patch on bug 1739171 will help with the XPIProvider.jsm:awaitPromise cases. Thanks!

Flags: needinfo?(mixedpuppy)

Yes. There were two uses of awaitPromise, this one was easy to remove, the other one would require refactoring how startup works in general, so it's a larger issue.

(In reply to Jens Stutte [:jstutte] from comment #35)

So we should monitor now if the patch on bug 1739171 will help with the XPIProvider.jsm:awaitPromise cases. Thanks!

Unfortunately it seems this did not change much.

(In reply to Shane Caraveo (:mixedpuppy) from comment #36)

Yes. There were two uses of awaitPromise, this one was easy to remove, the other one would require refactoring how startup works in general, so it's a larger issue.

I assume we would need to look into the other use then.

(unassigning as I will not actively work on this myself)

Assignee: jstutte → nobody
Flags: needinfo?(mixedpuppy)

The leave-open keyword is there and there is no activity for 6 months.
:nika, maybe it's time to close this bug?
For more information, please visit auto_nag documentation.

Flags: needinfo?(nika)
See Also: → 1789803

I don't think there's anything new we can do here right now.

Flags: needinfo?(nika)

Removed an empty signature

Crash Signature: [@ shutdownhang | __psynch_cvwait | mozilla::TaskController::GetRunnableForMTTask | nsThreadManager::SpinEventLoopUntilInternal] [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThreadManager::SpinEventLoopUntilInternal] [@ shutdownha… → [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThreadManager::SpinEventLoopUntilInternal] [@ shutdownhang | nsThreadManager::SpinEventLoopUntilInternal] [@ shutdownhang | trunc | nsThreadManager::SpinEventLoopUntilInternal]
Duplicate of this bug: 1434745

Removed two empty signatures.

The remaining signature shows reports only up to major version 105. That would mean, that only ESR 102.x is still affected. So it seems we are in a better shape here now, in the sense of having more specific signatures resulting in bugs like bug 1505660.

Crash Signature: [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThreadManager::SpinEventLoopUntilInternal] [@ shutdownhang | nsThreadManager::SpinEventLoopUntilInternal] [@ shutdownhang | trunc | nsThreadManager::SpinEventLoopUntilInternal] → [@ shutdownhang | nsThreadManager::SpinEventLoopUntilInternal]
Flags: needinfo?(mixedpuppy)

Marking this as incomplete based on comment 42 - there are new bugs with new more specific signatures now.

Status: REOPENED → RESOLVED
Closed: 3 years ago11 months ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: