Crash in [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThreadManager::SpinEventLoopUntilInternal]
Categories
(Core :: XPCOM, defect)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr78 | --- | unaffected |
firefox83 | --- | wontfix |
firefox84 | --- | wontfix |
firefox85 | --- | fix-optional |
firefox86 | --- | ? |
People
(Reporter: pascalc, Unassigned)
References
Details
(Keywords: crash, regression)
Crash Data
Attachments
(2 files, 1 obsolete file)
2.29 KB,
text/plain
|
chutten|PTO
:
data-review+
|
Details |
48 bytes,
text/x-phabricator-request
|
Details | Review |
Crash report: https://crash-stats.mozilla.org/report/index/a05e3e76-f5bf-4af9-89e4-0e0be0201119
MOZ_CRASH Reason: MOZ_CRASH(Shutdown hanging before starting.)
Top 10 frames of crashing thread:
0 ntdll.dll NtWaitForAlertByThreadId
1 ntdll.dll RtlSleepConditionVariableSRW
2 kernelbase.dll SleepConditionVariableSRW
3 mozglue.dll mozilla::detail::ConditionVariableImpl::wait mozglue/misc/ConditionVariable_windows.cpp:50
4 xul.dll mozilla::TaskController::GetRunnableForMTTask xpcom/threads/TaskController.cpp:283
5 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1143
6 xul.dll nsThreadManager::SpinEventLoopUntilInternal xpcom/threads/nsThreadManager.cpp:719
7 xul.dll XPTC__InvokebyIndex
8 @0x21b6b82e83f
9 xul.dll trunc
Reporter | ||
Updated•4 years ago
|
Reporter | ||
Updated•4 years ago
|
Comment 1•4 years ago
|
||
Bugbug thinks this bug should belong to this component, but please revert this change in case of error.
Updated•4 years ago
|
Comment 4•4 years ago
|
||
The situation here is not the same as in bug 1505660, so let's keep them separate.
Updated•4 years ago
|
Comment 5•4 years ago
|
||
This signature doesn't show up on nightly since mid-to-late November, do you know if this got fixed or the signature changed?
Comment 6•4 years ago
|
||
I see a few instances of [@ shutdownhang | nsThreadManager::SpinEventLoopUntilInternal ] on Nightly, such as bp-3de2cfce-5bfe-4c7d-8d45-c21f80210106 . That signature looks similar to the one in comment 0.
Comment 7•4 years ago
|
||
It seems, that the variant with GetRunnableForMTTask is happening currently only under OSX.
Updated•4 years ago
|
![]() |
||
Updated•4 years ago
|
Comment 8•4 years ago
•
|
||
FWIW, I cracked up a dump and it seems, we are executing some JS on the main thread that somehow calls nsThreadManager::SpinEventLoopUntil
directly from JS:
...
xul.dll!nsThreadManager::SpinEventLoopUntilInternal(nsINestedEventLoopCondition * aCondition, bool aCheckingShutdown) Zeile 727 C++
xul.dll!nsThreadManager::SpinEventLoopUntil(nsINestedEventLoopCondition * aCondition) Zeile 706 C++
xul.dll!_NS_InvokeByIndex() Unbekannt
[Inlineframe] xul.dll!CallMethodHelper::Invoke() Zeile 1620 C++
[Inlineframe] xul.dll!CallMethodHelper::Call() Zeile 1176 C++
xul.dll!XPCWrappedNative::CallMethod(XPCCallContext & ccx, XPCWrappedNative::CallMode mode) Zeile 1142 C++
xul.dll!XPC_WN_CallMethod(JSContext * cx, unsigned int argc, JS::Value * vp) Zeile 926 C++
[Inlineframe] xul.dll!CallJSNative(JSContext * cx, bool(*)(JSContext *, unsigned int, JS::Value *) native, js::CallReason reason, const JS::CallArgs & args) Zeile 503 C++
xul.dll!js::InternalCallOrConstruct(JSContext * cx, const JS::CallArgs & args, js::MaybeConstruct construct, js::CallReason reason) Zeile 594 C++
[Inlineframe] xul.dll!InternalCall(JSContext * cx, const js::AnyInvokeArgs & args, js::CallReason reason) Zeile 647 C++
[Inlineframe] xul.dll!js::CallFromStack(JSContext * cx, const JS::CallArgs & args) Zeile 651 C++
xul.dll!Interpret(JSContext * cx, js::RunState & state) Zeile 3309 C++
[Inlineframe] xul.dll!js::RunScript(JSContext * cx, js::RunState & state) Zeile 473 C++
...
From the MOZ_CRASH(Shutdown hanging before starting any known phase.)
I deduce, that we just called the observer notification "profile-after-change" which started the terminator watchdog, but the terminator did never receive another shutdown heartbeat notification out of these.
So I would look out for any Javascript executed on the main thread that reacts on "profile-after-change"
and potentially calls nsThreadManager::SpinEventLoopUntil
directly from JS.
Comment 9•4 years ago
•
|
||
So we have JS that use an explicit nsThreadManager::SpinEventLoopUntilOrShutdown which bails out in case of shutdown.
We have some SpinEventLoopUntil cases out of this list where we explicitly handle shutdown.
And then there are JS using SpinEventLoopUntil that seem to not care about shutdown at all.
I was wondering if there is any good reason to not always use nsThreadManager::SpinEventLoopUntilOrShutdown
from Javascript (or to not make nsThreadManager::SpinEventLoopUntil
always aware of shutdown in JS)? It seems to be a relatively easy thing to do, but maybe I am overlooking something.
Comment 10•4 years ago
•
|
||
(In reply to Jens Stutte [:jstutte] from comment #8)
From the
MOZ_CRASH(Shutdown hanging before starting any known phase.)
I deduce, that we just called the observer notification "profile-after-change" which started the terminator watchdog, but the terminator did never receive another shutdown heartbeat notification out of these.So I would look out for any Javascript executed on the main thread that reacts on
"profile-after-change"
and potentially callsnsThreadManager::SpinEventLoopUntil
directly from JS.
Actually the crash report annotation says "ShutdownProgress: quit-application". There is a known issue in nsTerminator
that may result in not having fired the heartbeat as the first observer notification for a given phase.
So looking for any Javascript that reacts on "quit-application"
and potentially calls nsThreadManager::SpinEventLoopUntil
directly from JS yields:
devtools/client/framework/devtools-browser.js
as only result.
Comment 11•4 years ago
•
|
||
(In reply to Jens Stutte [:jstutte] from comment #9)
I was wondering if there is any good reason to not always use
nsThreadManager::SpinEventLoopUntilOrShutdown
from Javascript (or to not makensThreadManager::SpinEventLoopUntil
always aware of shutdown in JS)? It seems to be a relatively easy thing to do, but maybe I am overlooking something.
Actually it is not clear to me if in this stage of the shutdown mShuttingDown would have been set already. So while being a legitimate question, it is not clear to me, if it would help to solve this problem, as we seem to be in the parent process here and "xpcom-shutdown"
is not yet reached.
Updated•4 years ago
|
Comment 12•4 years ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #8)
So I would look out for any Javascript executed on the main thread that reacts on
"profile-after-change"
and potentially callsnsThreadManager::SpinEventLoopUntil
directly from JS.
profile-after-change
is actually fired early during startup (it's a confusing name I know), whereas profile-before-change
is the call site during shutdown. In this case we're in "quit-application".
(In reply to Jens Stutte [:jstutte] from comment #10)
Actually the crash report annotation says "ShutdownProgress: quit-application". There is a known issue in
nsTerminator
that may result in not having fired the heartbeat as the first observer notification for a given phase.
Yup, the the normal flow for nsTerminator
looks like:
-
Using a manifest,
nsTerminator
is registered as a listener forprofile-after-change
(https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/toolkit/components/terminator/terminator.manifest#1), this event is fired as the user's profile becomes available, and it uses the notification to register shutdown observers. It doesn't start the watchdog thread at this time. https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/toolkit/components/terminator/nsTerminator.cpp#483-485 -
When the first of the phase notifications are fired, the watchdog thread is started (https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/toolkit/components/terminator/nsTerminator.cpp#489-494), then we perform updates with that particular phase in mind. This is where the "ShutdownProgress" annotation comes from, which in your case is "quit-application", meaning the last shutdown phase notification the watchdog saw was "quit-application".
This means that the checks in the RunWatchdog
function to generate the crash message are misleading, as we have definitely entered the "quit-application" phase, but the check is actually checking if we've moved past the "quit-application" phase, as we only set mTicks
to a positive value when we observe a new notification in UpdateHeartbeat
(https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/toolkit/components/terminator/nsTerminator.cpp#517).
AFAIK there are 2 main places where quit-application
can be fired: From nsAppStartup::Quit
(https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/toolkit/components/startup/nsAppStartup.cpp#446), or due to a MOZ_WM_APP_QUIT
message on Windows instructing us to exit immediately (https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/widget/windows/nsWindow.cpp#5206). The most likely situation here is the first, but it probably doesn't matter.
(In reply to Jens Stutte [:jstutte] from comment #11)
Actually it is not clear to me if in this stage of the shutdown mShuttingDown would have been set already. So while being a legitimate question, it is not clear to me, if it would help to solve this problem, as we seem to be in the parent process here and
"xpcom-shutdown"
is not yet reached.
Yeah, I think it's unlikely that using spinEventLoopUntilOrShutdown
will help much here, as that will only be aborted once the "xpcom-shutdown"
observer notification has fired (https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/xpcom/threads/nsThreadManager.cpp#297-301), which hasn't happened yet, as otherwise either the watchdog or the "ShutdownProgress" annotation would know about it if we had.
Based on my current understanding, it seems like after quit-application
is fired, the shutdown process can't continue until the event loop has become idle, and no nested events are being processed. The call to run the root event loop needs to have exited in nsAppStartup::Run()
, which will happen next time the root event loop becomes idle, so if we have any nested event loop deadlocks, they'll definitely cause us to time out here.
I'm thinking a reasonable starting place here might be to require each caller of spinEventLoopUntil{OrShutdown}
(perhaps including native callers) to pass in a string describing why they're spinning an event loop, and include the currently active set of these strings in a crash annotation explaining what nested event loops are currently being executed on the main thread if the crash occurs with nested loops spinning. It might end up looking something like:
NestedEventLoops: openwindow;debuggerpaused;alertdialog;
We shouldn't be using nested event loops in any places where performance is sensitive enough we can't get away with a bit of extra overhead for that, and it would really help for figuring out what's happening in situations like this one.
Comment 13•4 years ago
|
||
(In reply to Nika Layzell [:nika] (ni? for response) from comment #12)
(In reply to Jens Stutte [:jstutte] from comment #10)
Actually the crash report annotation says "ShutdownProgress: quit-application". There is a known issue in
nsTerminator
that may result in not having fired the heartbeat as the first observer notification for a given phase.Yup, the the normal flow for
nsTerminator
looks like:
Using a manifest,
nsTerminator
is registered as a listener forprofile-after-change
(https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/toolkit/components/terminator/terminator.manifest#1), this event is fired as the user's profile becomes available, and it uses the notification to register shutdown observers. It doesn't start the watchdog thread at this time. https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/toolkit/components/terminator/nsTerminator.cpp#483-485When the first of the phase notifications are fired, the watchdog thread is started (https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/toolkit/components/terminator/nsTerminator.cpp#489-494), then we perform updates with that particular phase in mind. This is where the "ShutdownProgress" annotation comes from, which in your case is "quit-application", meaning the last shutdown phase notification the watchdog saw was "quit-application".
This means that the checks in the
RunWatchdog
function to generate the crash message are misleading, as we have definitely entered the "quit-application" phase, but the check is actually checking if we've moved past the "quit-application" phase, as we only setmTicks
to a positive value when we observe a new notification inUpdateHeartbeat
(https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/toolkit/components/terminator/nsTerminator.cpp#517).
There is some effort I was doing together with :dthayer over in bug 1689953 to improve this situation and to make the nsTerminator crash reports and telemetry more accurate.
...
Based on my current understanding, it seems like afterquit-application
is fired, the shutdown process can't continue until the event loop has become idle, and no nested events are being processed. The call to run the root event loop needs to have exited innsAppStartup::Run()
, which will happen next time the root event loop becomes idle, so if we have any nested event loop deadlocks, they'll definitely cause us to time out here.I'm thinking a reasonable starting place here might be to require each caller of
spinEventLoopUntil{OrShutdown}
(perhaps including native callers) to pass in a string describing why they're spinning an event loop, and include the currently active set of these strings in a crash annotation explaining what nested event loops are currently being executed on the main thread if the crash occurs with nested loops spinning. It might end up looking something like:NestedEventLoops: openwindow;debuggerpaused;alertdialog;
We shouldn't be using nested event loops in any places where performance is sensitive enough we can't get away with a bit of extra overhead for that, and it would really help for figuring out what's happening in situations like this one.
Thanks, that sounds like a good plan for this bug here. There are not too many callers of SpinEventLoopUntil{OrShutdown} at least from JS side, I did not check native though.
Comment 14•4 years ago
|
||
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Comment 15•4 years ago
|
||
Data review for diagnostic patch D106029.
Comment 16•4 years ago
|
||
Comment on attachment 9205072 [details]
DataReviewRequest1678330.txt
DATA COLLECTION REVIEW RESPONSE:
Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?
Yes.
Is there a control mechanism that allows the user to turn the data collection on and off?
Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.
If the request is for permanent data collection, is there someone who will monitor the data over time?
Yes, Jens Stutte and Nika Layzell are responsible.
Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?
Category 1, Technical.
Is the data collection request for default-on or default-off?
Default on for all channels.
Does the instrumentation include the addition of any new identifiers?
No.
Is the data collection covered by the existing Firefox privacy notice?
Yes.
Does there need to be a check-in in the future to determine whether to renew the data?
No. This collection is permanent.
Result: datareview+
Comment 17•4 years ago
|
||
Here's a try build with the patch rebased onto central: https://treeherder.mozilla.org/jobs?repo=try&revision=4cc7c711740d18327f7cd795f5debd57ba82f60c
Comment 18•4 years ago
|
||
Made AutoNestedEventLoopAnnotation constructor explicit, try builds on linux/mac/win: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e7896907a8349af8c80dcccf49208df5dcf7d94a
Comment 19•4 years ago
|
||
Builds are green, this should be ready to commandeer back from me and attempt landing. 🤞
We saw an issue where moz-phab patch
complained that the revision was missing commit information; if there's something wrong that also breaks compatibility with Lando we may need to create a new revision for this change and carry forward the reviews.
Comment 20•4 years ago
|
||
Updated•4 years ago
|
Comment 21•4 years ago
|
||
(In reply to Nihanth Subramanya [:nhnt11] from comment #19)
We saw an issue where
moz-phab patch
complained that the revision was missing commit information; if there's something wrong that also breaks compatibility with Lando we may need to create a new revision for this change and carry forward the reviews.
I gave up on the old revision, it seems there was vital information for moz-phab missing from the commit. Kindly asking to get another formal sign-off, sorry for the mess.
Updated•4 years ago
|
Comment 22•4 years ago
|
||
Comment 23•4 years ago
|
||
Comment 24•4 years ago
•
|
||
Failure log: https://treeherder.mozilla.org/logviewer?job_id=331785084&repo=autoland
https://treeherder.mozilla.org/logviewer?job_id=331787872&repo=autoland
Backout link: https://hg.mozilla.org/integration/autoland/rev/106794cdd13b
Comment 25•4 years ago
|
||
It seems I overlooked some instance of SpinEventLoopUntil (in an html file, sic).
Comment 26•4 years ago
|
||
Comment 27•4 years ago
|
||
bugherder |
Comment 28•4 years ago
•
|
||
So far no crashes with the new annotation, let's monitor.
Comment 29•4 years ago
|
||
First two catches:
6ea15ac5-c080-4258-9942-d4e750210304
7ca2cdec-eb9d-4689-a5fb-fbe1d0210303
point both to:
Comment 30•4 years ago
|
||
For the records: all six crashes caught so far point to Prompter.jsm.
Comment 31•4 years ago
|
||
The leave-open keyword is there and there is no activity for 6 months.
:jstutte, maybe it's time to close this bug?
Comment 32•4 years ago
•
|
||
Taking a look at current crashes indicate XPIProvider.jsm:awaitPromise
as a hanging SpinEventLoopUntil
cause.
Unfortunately there seem to be (at least) two possible call paths that get us here:
Shane, you introduced those promises, any thought?
Comment 33•3 years ago
|
||
Could you point me at a specific report? I looked and there are a ton, but not sure what to specifically look for.
Comment 34•3 years ago
|
||
You can now display xpcom spin event lop stack and aggregate and filter on it.
05d129fc-b63f-4e8b-8ee9-d81310211103 is one example.
Comment 35•3 years ago
|
||
So we should monitor now if the patch on bug 1739171 will help with the XPIProvider.jsm:awaitPromise
cases. Thanks!
Comment 36•3 years ago
|
||
Yes. There were two uses of awaitPromise, this one was easy to remove, the other one would require refactoring how startup works in general, so it's a larger issue.
Comment 37•3 years ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #35)
So we should monitor now if the patch on bug 1739171 will help with the
XPIProvider.jsm:awaitPromise
cases. Thanks!
Unfortunately it seems this did not change much.
(In reply to Shane Caraveo (:mixedpuppy) from comment #36)
Yes. There were two uses of awaitPromise, this one was easy to remove, the other one would require refactoring how startup works in general, so it's a larger issue.
I assume we would need to look into the other use then.
(unassigning as I will not actively work on this myself)
Comment 38•3 years ago
|
||
The leave-open keyword is there and there is no activity for 6 months.
:nika, maybe it's time to close this bug?
For more information, please visit auto_nag documentation.
Comment 39•3 years ago
|
||
I don't think there's anything new we can do here right now.
Comment 40•2 years ago
|
||
Removed an empty signature
Comment 42•2 years ago
|
||
Removed two empty signatures.
The remaining signature shows reports only up to major version 105. That would mean, that only ESR 102.x is still affected. So it seems we are in a better shape here now, in the sense of having more specific signatures resulting in bugs like bug 1505660.
Comment 43•2 years ago
|
||
Marking this as incomplete based on comment 42 - there are new bugs with new more specific signatures now.
Updated•2 years ago
|
Description
•