<a class="header-button" href="https://bugzilla.mozilla.org/home" title="Go to home page"> Bugzilla

Ryan VanderMeulen [:RyanVM]

Comment 4

•

4 years ago

The situation here is not the same as in bug 1505660, so let's keep them separate.

Status: RESOLVED → REOPENED

Resolution: DUPLICATE → ---

Updated

•

4 years ago

status-firefox83: affected → wontfix

status-firefox84: affected → fix-optional

Julien Cristau [:jcristau]

Comment 5

•

4 years ago

This signature doesn't show up on nightly since mid-to-late November, do you know if this got fixed or the signature changed?

status-firefox84: fix-optional → wontfix

status-firefox85: affected → fix-optional

status-firefox86: --- → ?

Flags: needinfo?(jstutte)

Andrew McCreight [:mccr8]

Comment 6

•

4 years ago

I see a few instances of [@ shutdownhang | nsThreadManager::SpinEventLoopUntilInternal ] on Nightly, such as bp-3de2cfce-5bfe-4c7d-8d45-c21f80210106 . That signature looks similar to the one in comment 0.

Comment 7

•

4 years ago

It seems, that the variant with GetRunnableForMTTask is happening currently only under OSX.

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Updated

•

4 years ago

Crash Signature: [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThreadManager::SpinEventLoopUntilInternal] → [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThreadManager::SpinEventLoopUntilInternal] [@ shutdownhang | nsThreadManager::SpinEventLoopUntilInternal ]

Flags: needinfo?(jstutte)

Updated

•

4 years ago

Crash Signature: [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThreadManager::SpinEventLoopUntilInternal] [@ shutdownhang | nsThreadManager::SpinEventLoopUntilInternal ] → [@ shutdownhang | __psynch_cvwait | mozilla::TaskController::GetRunnableForMTTask | nsThreadManager::SpinEventLoopUntilInternal] [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThreadManager::SpinEventLoopUntilInternal] [@ shutdownha…

Comment 8

•

4 years ago

•

Edited

FWIW, I cracked up a dump and it seems, we are executing some JS on the main thread that somehow calls nsThreadManager::SpinEventLoopUntil directly from JS:

...
 	xul.dll!nsThreadManager::SpinEventLoopUntilInternal(nsINestedEventLoopCondition * aCondition, bool aCheckingShutdown) Zeile 727	C++
 	xul.dll!nsThreadManager::SpinEventLoopUntil(nsINestedEventLoopCondition * aCondition) Zeile 706	C++
 	xul.dll!_NS_InvokeByIndex()	Unbekannt
 	[Inlineframe] xul.dll!CallMethodHelper::Invoke() Zeile 1620	C++
 	[Inlineframe] xul.dll!CallMethodHelper::Call() Zeile 1176	C++
 	xul.dll!XPCWrappedNative::CallMethod(XPCCallContext & ccx, XPCWrappedNative::CallMode mode) Zeile 1142	C++
 	xul.dll!XPC_WN_CallMethod(JSContext * cx, unsigned int argc, JS::Value * vp) Zeile 926	C++
 	[Inlineframe] xul.dll!CallJSNative(JSContext * cx, bool(*)(JSContext *, unsigned int, JS::Value *) native, js::CallReason reason, const JS::CallArgs & args) Zeile 503	C++
 	xul.dll!js::InternalCallOrConstruct(JSContext * cx, const JS::CallArgs & args, js::MaybeConstruct construct, js::CallReason reason) Zeile 594	C++
 	[Inlineframe] xul.dll!InternalCall(JSContext * cx, const js::AnyInvokeArgs & args, js::CallReason reason) Zeile 647	C++
 	[Inlineframe] xul.dll!js::CallFromStack(JSContext * cx, const JS::CallArgs & args) Zeile 651	C++
 	xul.dll!Interpret(JSContext * cx, js::RunState & state) Zeile 3309	C++
 	[Inlineframe] xul.dll!js::RunScript(JSContext * cx, js::RunState & state) Zeile 473	C++
...

From the MOZ_CRASH(Shutdown hanging before starting any known phase.) I deduce, that we just called the observer notification "profile-after-change" which started the terminator watchdog, but the terminator did never receive another shutdown heartbeat notification out of these.

So I would look out for any Javascript executed on the main thread that reacts on "profile-after-change" and potentially calls nsThreadManager::SpinEventLoopUntil directly from JS.

Comment 9

•

4 years ago

•

Edited

So we have JS that use an explicit nsThreadManager::SpinEventLoopUntilOrShutdown which bails out in case of shutdown.

We have some SpinEventLoopUntil cases out of this list where we explicitly handle shutdown.

And then there are JS using SpinEventLoopUntil that seem to not care about shutdown at all.

I was wondering if there is any good reason to not always use nsThreadManager::SpinEventLoopUntilOrShutdown from Javascript (or to not make nsThreadManager::SpinEventLoopUntil always aware of shutdown in JS)? It seems to be a relatively easy thing to do, but maybe I am overlooking something.

Flags: needinfo?(nika)

Comment 10

•

4 years ago

•

Edited

(In reply to Jens Stutte [:jstutte] from comment #8)

From the MOZ_CRASH(Shutdown hanging before starting any known phase.) I deduce, that we just called the observer notification "profile-after-change" which started the terminator watchdog, but the terminator did never receive another shutdown heartbeat notification out of these.

So I would look out for any Javascript executed on the main thread that reacts on "profile-after-change" and potentially calls nsThreadManager::SpinEventLoopUntil directly from JS.

Actually the crash report annotation says "ShutdownProgress: quit-application". There is a known issue in nsTerminator that may result in not having fired the heartbeat as the first observer notification for a given phase.

So looking for any Javascript that reacts on "quit-application" and potentially calls nsThreadManager::SpinEventLoopUntil directly from JS yields:

devtools/client/framework/devtools-browser.js

as only result.

Comment 11

•

4 years ago

•

Edited

(In reply to Jens Stutte [:jstutte] from comment #9)

I was wondering if there is any good reason to not always use nsThreadManager::SpinEventLoopUntilOrShutdown from Javascript (or to not make nsThreadManager::SpinEventLoopUntil always aware of shutdown in JS)? It seems to be a relatively easy thing to do, but maybe I am overlooking something.

Actually it is not clear to me if in this stage of the shutdown mShuttingDown would have been set already. So while being a legitimate question, it is not clear to me, if it would help to solve this problem, as we seem to be in the parent process here and "xpcom-shutdown" is not yet reached.

Sean Feng [:sefeng]

Updated

•

4 years ago

Crash Signature: shutdownhang | nsThreadManager::SpinEventLoopUntilInternal ] → shutdownhang | nsThreadManager::SpinEventLoopUntilInternal ] [@ shutdownhang | trunc | nsThreadManager::SpinEventLoopUntilInternal ]

Comment 12

•

4 years ago

(In reply to Jens Stutte [:jstutte] from comment #8)

So I would look out for any Javascript executed on the main thread that reacts on "profile-after-change" and potentially calls nsThreadManager::SpinEventLoopUntil directly from JS.

profile-after-change is actually fired early during startup (it's a confusing name I know), whereas profile-before-change is the call site during shutdown. In this case we're in "quit-application".

(In reply to Jens Stutte [:jstutte] from comment #10)

Actually the crash report annotation says "ShutdownProgress: quit-application". There is a known issue in nsTerminator that may result in not having fired the heartbeat as the first observer notification for a given phase.

Yup, the the normal flow for nsTerminator looks like:

Using a manifest, nsTerminator is registered as a listener for profile-after-change (https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/toolkit/components/terminator/terminator.manifest#1), this event is fired as the user's profile becomes available, and it uses the notification to register shutdown observers. It doesn't start the watchdog thread at this time. https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/toolkit/components/terminator/nsTerminator.cpp#483-485
When the first of the phase notifications are fired, the watchdog thread is started (https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/toolkit/components/terminator/nsTerminator.cpp#489-494), then we perform updates with that particular phase in mind. This is where the "ShutdownProgress" annotation comes from, which in your case is "quit-application", meaning the last shutdown phase notification the watchdog saw was "quit-application".

This means that the checks in the RunWatchdog function to generate the crash message are misleading, as we have definitely entered the "quit-application" phase, but the check is actually checking if we've moved past the "quit-application" phase, as we only set mTicks to a positive value when we observe a new notification in UpdateHeartbeat (https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/toolkit/components/terminator/nsTerminator.cpp#517).

AFAIK there are 2 main places where quit-application can be fired: From nsAppStartup::Quit (https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/toolkit/components/startup/nsAppStartup.cpp#446), or due to a MOZ_WM_APP_QUIT message on Windows instructing us to exit immediately (https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/widget/windows/nsWindow.cpp#5206). The most likely situation here is the first, but it probably doesn't matter.

(In reply to Jens Stutte [:jstutte] from comment #11)

Actually it is not clear to me if in this stage of the shutdown mShuttingDown would have been set already. So while being a legitimate question, it is not clear to me, if it would help to solve this problem, as we seem to be in the parent process here and "xpcom-shutdown" is not yet reached.

Yeah, I think it's unlikely that using spinEventLoopUntilOrShutdown will help much here, as that will only be aborted once the "xpcom-shutdown" observer notification has fired (https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/xpcom/threads/nsThreadManager.cpp#297-301), which hasn't happened yet, as otherwise either the watchdog or the "ShutdownProgress" annotation would know about it if we had.

Based on my current understanding, it seems like after quit-application is fired, the shutdown process can't continue until the event loop has become idle, and no nested events are being processed. The call to run the root event loop needs to have exited in nsAppStartup::Run(), which will happen next time the root event loop becomes idle, so if we have any nested event loop deadlocks, they'll definitely cause us to time out here.

I'm thinking a reasonable starting place here might be to require each caller of spinEventLoopUntil{OrShutdown} (perhaps including native callers) to pass in a string describing why they're spinning an event loop, and include the currently active set of these strings in a crash annotation explaining what nested event loops are currently being executed on the main thread if the crash occurs with nested loops spinning. It might end up looking something like:

NestedEventLoops: openwindow;debuggerpaused;alertdialog;

We shouldn't be using nested event loops in any places where performance is sensitive enough we can't get away with a bit of extra overhead for that, and it would really help for figuring out what's happening in situations like this one.

Flags: needinfo?(nika)

Comment 13

•

4 years ago

(In reply to Nika Layzell [:nika] (ni? for response) from comment #12)

(In reply to Jens Stutte [:jstutte] from comment #10)

Actually the crash report annotation says "ShutdownProgress: quit-application". There is a known issue in nsTerminator that may result in not having fired the heartbeat as the first observer notification for a given phase.

Yup, the the normal flow for nsTerminator looks like:

Using a manifest, nsTerminator is registered as a listener for profile-after-change (https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/toolkit/components/terminator/terminator.manifest#1), this event is fired as the user's profile becomes available, and it uses the notification to register shutdown observers. It doesn't start the watchdog thread at this time. https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/toolkit/components/terminator/nsTerminator.cpp#483-485

When the first of the phase notifications are fired, the watchdog thread is started (https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/toolkit/components/terminator/nsTerminator.cpp#489-494), then we perform updates with that particular phase in mind. This is where the "ShutdownProgress" annotation comes from, which in your case is "quit-application", meaning the last shutdown phase notification the watchdog saw was "quit-application".

This means that the checks in the RunWatchdog function to generate the crash message are misleading, as we have definitely entered the "quit-application" phase, but the check is actually checking if we've moved past the "quit-application" phase, as we only set mTicks to a positive value when we observe a new notification in UpdateHeartbeat (https://searchfox.org/mozilla-central/rev/281009d3b7e1e1496b9524d5478ff4f0b7369573/toolkit/components/terminator/nsTerminator.cpp#517).

There is some effort I was doing together with :dthayer over in bug 1689953 to improve this situation and to make the nsTerminator crash reports and telemetry more accurate.

...
Based on my current understanding, it seems like after quit-application is fired, the shutdown process can't continue until the event loop has become idle, and no nested events are being processed. The call to run the root event loop needs to have exited in nsAppStartup::Run(), which will happen next time the root event loop becomes idle, so if we have any nested event loop deadlocks, they'll definitely cause us to time out here.

I'm thinking a reasonable starting place here might be to require each caller of spinEventLoopUntil{OrShutdown} (perhaps including native callers) to pass in a string describing why they're spinning an event loop, and include the currently active set of these strings in a crash annotation explaining what nested event loops are currently being executed on the main thread if the crash occurs with nested loops spinning. It might end up looking something like:
NestedEventLoops: openwindow;debuggerpaused;alertdialog;
We shouldn't be using nested event loops in any places where performance is sensitive enough we can't get away with a bit of extra overhead for that, and it would really help for figuring out what's happening in situations like this one.

Thanks, that sounds like a good plan for this bug here. There are not too many callers of SpinEventLoopUntil{OrShutdown} at least from JS side, I did not check native though.

Updated

•

4 years ago

Comment 14

•

4 years ago

Attached file Bug 1678330: Ensure nested SpinEventLoopUntil(OrShutdown) calls are traceable to the originating source in case of crash. r?nika (obsolete) — Details

Phabricator Automation

Updated

•

4 years ago

Assignee: nobody → jstutte

Phabricator Automation

Updated

•

4 years ago

Attachment #9204685 - Attachment description: Bug 1678330: Ensure nested JS SpinEventLoopUntil calls are traceable to the originating source in case of crash r?nika → Bug 1678330: Ensure nested SpinEventLoopUntil(OrShutdown) calls are traceable to the originating source in case of crash. r?nika

Updated

•

4 years ago

Keywords: leave-open

Chris H-C :chutten|PTO (back March 17)

Comment 15

•

4 years ago

Attached file DataReviewRequest1678330.txt — Details

Data review for diagnostic patch D106029.

Attachment #9205072 - Flags: data-review?(chutten)

Comment 16

•

4 years ago

Comment on attachment 9205072 [details]
DataReviewRequest1678330.txt

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes.

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes, Jens Stutte and Nika Layzell are responsible.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

Is the data collection request for default-on or default-off?

Default on for all channels.

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does there need to be a check-in in the future to determine whether to renew the data?

No. This collection is permanent.

Result: datareview+

Attachment #9205072 - Flags: data-review?(chutten) → data-review+

(inactive) Nihanth Subramanya [:nhnt11]

Comment 17

•

4 years ago

Here's a try build with the patch rebased onto central: https://treeherder.mozilla.org/jobs?repo=try&revision=4cc7c711740d18327f7cd795f5debd57ba82f60c

(inactive) Nihanth Subramanya [:nhnt11]

Comment 18

•

4 years ago

Made AutoNestedEventLoopAnnotation constructor explicit, try builds on linux/mac/win: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e7896907a8349af8c80dcccf49208df5dcf7d94a

(inactive) Nihanth Subramanya [:nhnt11]

Comment 19

•

4 years ago

Builds are green, this should be ready to commandeer back from me and attempt landing. 🤞

We saw an issue where moz-phab patch complained that the revision was missing commit information; if there's something wrong that also breaks compatibility with Lando we may need to create a new revision for this change and carry forward the reviews.

Flags: needinfo?(jstutte)

Comment 20

•

4 years ago

Attached file Bug 1678330: Ensure nested SpinEventLoopUntil(OrShutdown) calls are traceable to the originating source in case of crash. r?nika — Details

Phabricator Automation

Updated

•

4 years ago

Attachment #9204685 - Attachment is obsolete: true

Comment 21

•

4 years ago

(In reply to Nihanth Subramanya [:nhnt11] from comment #19)

We saw an issue where moz-phab patch complained that the revision was missing commit information; if there's something wrong that also breaks compatibility with Lando we may need to create a new revision for this change and carry forward the reviews.

I gave up on the old revision, it seems there was vital information for moz-phab missing from the commit. Kindly asking to get another formal sign-off, sorry for the mess.

Flags: needinfo?(jstutte) → needinfo?(nika)

Updated

•

4 years ago

Flags: needinfo?(nika)

Push with failures: https://treeherder.mozilla.org/jobs?repo=autoland&group_state=expanded&selectedTaskRun=Egu8AAOqSxKrys16HO94VA.0&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel&revision=03cae7800b417391988a28de7a930af388f93a1a

Updated

•

4 years ago

Comment 22

•

4 years ago

Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/03cae7800b41 Ensure nested SpinEventLoopUntil(OrShutdown) calls are traceable to the originating source in case of crash. r=nika,extension-reviewers

Pulsebot

Comment 23

•

4 years ago

Backout by csabou@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/106794cdd13b Backed out changeset 03cae7800b41 for mochitest plain failures on test_window_open_discarded_bc.html. CLOSED TREE

Cosmin Sabou [:CosminS]

Comment 24

•

4 years ago

•

Edited

Failure log: https://treeherder.mozilla.org/logviewer?job_id=331785084&repo=autoland
https://treeherder.mozilla.org/logviewer?job_id=331787872&repo=autoland

Backout link: https://hg.mozilla.org/integration/autoland/rev/106794cdd13b

Flags: needinfo?(jstutte)

Bogdan Tara[:bogdan_tara | bogdant]

Comment 25

•

4 years ago

It seems I overlooked some instance of SpinEventLoopUntil (in an html file, sic).

Flags: needinfo?(jstutte)

Pulsebot

Comment 26

•

4 years ago

Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/194eb6cbf6ab Ensure nested SpinEventLoopUntil(OrShutdown) calls are traceable to the originating source in case of crash. r=nika,extension-reviewers

Comment 27

•

4 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/194eb6cbf6ab

Comment 28

•

4 years ago

•

Edited

So far no crashes with the new annotation, let's monitor.

6ea15ac5-c080-4258-9942-d4e750210304
7ca2cdec-eb9d-4689-a5fb-fbe1d0210303

Comment 29

•

4 years ago

First two catches:

point both to:

Prompter.jsm

Updated

•

4 years ago

Depends on: 1696397

BugBot [:suhaib / :marco/ :calixte]

Comment 30

•

4 years ago

For the records: all six crashes caught so far point to Prompter.jsm.

Florian Quèze [:florian]

Updated

•

4 years ago

Blocks: 1696406

Comment 31

•

4 years ago

The leave-open keyword is there and there is no activity for 6 months.
:jstutte, maybe it's time to close this bug?

Flags: needinfo?(jstutte)

Comment 32

•

4 years ago

•

Edited

Taking a look at current crashes indicate XPIProvider.jsm:awaitPromise as a hanging SpinEventLoopUntil cause.

Unfortunately there seem to be (at least) two possible call paths that get us here:

Shane, you introduced those promises, any thought?

Flags: needinfo?(jstutte) → needinfo?(mixedpuppy)

Comment 33

•

3 years ago

Could you point me at a specific report? I looked and there are a ton, but not sure what to specifically look for.

Flags: needinfo?(mixedpuppy) → needinfo?(jstutte)

Comment 34

•

3 years ago

You can now display xpcom spin event lop stack and aggregate and filter on it.

05d129fc-b63f-4e8b-8ee9-d81310211103 is one example.

Flags: needinfo?(jstutte) → needinfo?(mixedpuppy)

Updated

•

3 years ago

Depends on: 1739171

Comment 35

•

3 years ago

So we should monitor now if the patch on bug 1739171 will help with the XPIProvider.jsm:awaitPromise cases. Thanks!

Flags: needinfo?(mixedpuppy)

Comment 36

•

3 years ago

Yes. There were two uses of awaitPromise, this one was easy to remove, the other one would require refactoring how startup works in general, so it's a larger issue.

BugBot [:suhaib / :marco/ :calixte]

Comment 37

•

3 years ago

(In reply to Jens Stutte [:jstutte] from comment #35)

So we should monitor now if the patch on bug 1739171 will help with the XPIProvider.jsm:awaitPromise cases. Thanks!

Unfortunately it seems this did not change much.

(In reply to Shane Caraveo (:mixedpuppy) from comment #36)

Yes. There were two uses of awaitPromise, this one was easy to remove, the other one would require refactoring how startup works in general, so it's a larger issue.

I assume we would need to look into the other use then.

(unassigning as I will not actively work on this myself)

Assignee: jstutte → nobody

Flags: needinfo?(mixedpuppy)

Comment 38

•

3 years ago

The leave-open keyword is there and there is no activity for 6 months.
:nika, maybe it's time to close this bug?
For more information, please visit auto_nag documentation.

Flags: needinfo?(nika)

Gabriele Svelto [:gsvelto]

Updated

•

3 years ago

Comment 39

•

3 years ago

I don't think there's anything new we can do here right now.

Flags: needinfo?(nika)

Comment 40

•

2 years ago

Removed an empty signature

Crash Signature: [@ shutdownhang | __psynch_cvwait | mozilla::TaskController::GetRunnableForMTTask | nsThreadManager::SpinEventLoopUntilInternal] [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThreadManager::SpinEventLoopUntilInternal] [@ shutdownha… → [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThreadManager::SpinEventLoopUntilInternal] [@ shutdownhang | nsThreadManager::SpinEventLoopUntilInternal] [@ shutdownhang | trunc | nsThreadManager::SpinEventLoopUntilInternal]

Updated

•

2 years ago

Duplicate of this bug: 1434745

Comment 42

•

2 years ago

Removed two empty signatures.

The remaining signature shows reports only up to major version 105. That would mean, that only ESR 102.x is still affected. So it seems we are in a better shape here now, in the sense of having more specific signatures resulting in bugs like bug 1505660.

Crash Signature: [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThreadManager::SpinEventLoopUntilInternal] [@ shutdownhang | nsThreadManager::SpinEventLoopUntilInternal] [@ shutdownhang | trunc | nsThreadManager::SpinEventLoopUntilInternal] → [@ shutdownhang | nsThreadManager::SpinEventLoopUntilInternal]

Flags: needinfo?(mixedpuppy)