Open Bug 1505660 Opened 6 years ago Updated 7 months ago

Crash in shutdownhang | nsThread::Shutdown | nsThreadManager::ShutdownNonMainThreads

Categories

(Core :: XPCOM, defect, P2)

45 Branch
defect

Tracking

()

REOPENED
Tracking Status
firefox-esr60 --- wontfix
firefox-esr68 --- wontfix
firefox-esr78 --- wontfix
firefox63 --- wontfix
firefox64 --- wontfix
firefox65 --- wontfix
firefox67 --- wontfix
firefox68 --- wontfix
firefox69 --- wontfix
firefox70 --- wontfix
firefox71 --- wontfix
firefox72 --- wontfix
firefox84 --- wontfix
firefox85 --- wontfix
firefox86 --- wontfix

People

(Reporter: skywalker333, Assigned: jstutte)

References

(Depends on 4 open bugs, Blocks 2 open bugs)

Details

(4 keywords, Whiteboard: [Thunderbird see bug 1524247])

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is report bp-88218746-24bc-4e81-b9ce-d64cb0181106. ============================================================= Top 10 frames of crashing thread: 0 ntdll.dll NtWaitForAlertByThreadId 1 ntdll.dll RtlSleepConditionVariableSRW 2 kernelbase.dll SleepConditionVariableSRW 3 mozglue.dll mozilla::detail::ConditionVariableImpl::wait mozglue/misc/ConditionVariable_windows.cpp:58 4 xul.dll mozilla::ThreadEventQueue<mozilla::PrioritizedEventQueue<mozilla::EventQueue> >::GetEvent xpcom/threads/ThreadEventQueue.cpp:168 5 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1171 6 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:530 7 xul.dll nsThread::Shutdown xpcom/threads/nsThread.cpp:953 8 xul.dll nsThreadManager::Shutdown xpcom/threads/nsThreadManager.cpp:341 9 xul.dll static nsresult mozilla::ShutdownXPCOM xpcom/build/XPCOMInit.cpp:926 =============================================================

Jim, should we put shutdown hangs in a different component?

Flags: needinfo?(jmathies)
Blocks: shutdownkill
No longer blocks: shutdown
Component: General → DOM: Content Processes
Flags: needinfo?(jmathies)

FYI I observed two crashes today in TB version 65.0b2 (32-bit)

bp-dddcf79f-eab6-451c-a9f6-923c10190115 15/01/2019, 15:58
bp-9d9da253-2cbf-4896-8864-4542e0190115 15/01/2019, 12:42

I believe the first one occurred as I was trying to open a .eml attachment from a received email (opened in separate tab). IMAP/SMTP mailbox setup set to synchronise only 3 days worth of emails.

The second time when it happened I was not even actively using Thunderbird, but it was opened and close by itself... without my intervention :-)

Wayne I don't need info from you, it was just a way for me to let you know about crashes with my TB for your information as you seems to follow them up...

Flags: needinfo?(vseerror)

Your first crash better aligns with bug 1526127 - bp-dddcf79f-eab6-451c-a9f6-923c10190115 [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown ] - trying to open a .eml attachment from a received email (opened in separate tab). IMAP/SMTP mailbox setup set to synchronise only 3 days worth of emails.

Your second crash is bug 1534119 - bp-9d9da253-2cbf-4896-8864-4542e0190115 [@ nsImapProtocol::HandleMessageDownLoadLine ] - not actively using Thunderbird

Flags: needinfo?(vseerror)
Priority: -- → P2

Top Crashers for Firefox 69.0a1 (Nightly) - 7 days ago

#13 0.45% 0.39% shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown 37 37 0 0 34 0 2017-09-27

Signature report for shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown

Showing results from 7 days ago

2,710 Results

Windows 10 1401 51.7%
Windows 7 1013 37.4%
Windows 8.1 187 6.9%
Windows Vista 81 3.0%
Windows 8 27 1.0%
Windows XP 1 0.0%

Product
Firefox 69.0a1 37 1.4% 46
Thunderbird 69.0a1 7 0.3% 5
Firefox 68.0b7 131 4.8% 113
Firefox 68.0b8 130 4.8% 133
Firefox 68.0b6 71 2.6% 81
Firefox 68.0b5 32 1.2% 34
Firefox 68.0b4 17 0.6% 20
Firefox 68.0b9 15 0.6% 13
Firefox 68.0b3 10 0.4% 9
Firefox 67.0.1 376 13.9% 234
Firefox 67.0 227 8.4% 179
Firefox 67.0.2 43 1.6% 42

Firefox 60.7.0esr 289 10.7% 285
Thunderbird 60.7.0 425 15.7% 313
Thunderbird 60.7.1 2 0.1% 2
Firefox 60.6.2esr 43 1.6% 43
Firefox 60.6.3esr 42 1.5% 37
Firefox 60.6.1esr 28 1.0% 21

Firefox 52.9.0esr 81 3.0% 41

Architecture
amd64 1465 54.1%
x86 1245 45.9%

(In reply to Wayne Mery (:wsmwk) from comment #5)

Judging by release date of firefox 68, it looks like crash rate more than doubled.
https://crash-stats.mozilla.org/signature/?signature=shutdownhang%20%7C%20nsThread%3A%3AShutdown%20%7C%20nsThreadManager%3A%3AShutdown&date=>%3D2019-01-26T00%3A00%3A00.000Z&date=<2019-07-26T23%3A59%3A00.000Z#graphs

[@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown ]

There is still a high volume of crashes with this signature. It is the #2 Top Crasher on versions 68.0.1 and 68.0 (second only to the [@ OOM | small] signature). It is #3 Top Crasher on versions 69.0b13 and 60.7.2esr.

It is affecting the currently Nightly version of Firefox (70.0a1) and is the #19 Top Crasher for that version.

Could the priority for this bug be re-evaluated? I would suggest raising the priority from P2 to P1.

Could this bug be assigned to someone?

Thanks

Top Crashers for Firefox 68.0.1
Top 50 Crashing Signatures. 7 days ago

#1 9.01% -7.36% shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown 2225 2225 0 0 2101 0 2017-09-27

Top Crashers for Firefox 68.0.2
Top 50 Crashing Signatures. 7 days ago

#15 0.6% -0.02% shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown 645 645 0 0 647 0 2017-09-27

Top Crashers for Firefox 60.8.0esr
Top 50 Crashing Signatures. 7 days ago

#7 1.15% 0.02% shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown 318 318 0 0 274 0 2017-09-27

Top Crashers for Firefox 69.0b14
Top 50 Crashing Signatures. 7 days ago

#5 1.86% -1.59% shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown 193 193 0 0 180 0 2017-09-27

Top Crashers for Firefox 69.0b13
Top 50 Crashing Signatures. 7 days ago

#2 6.69% -5.61% shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown 177 177 0 0 149 0 2017-09-27

Top Crashers for Firefox 60.7.2esr
Top 50 Crashing Signatures. 7 days ago

#2 6.23% 1.25% shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown 168 168 0 0 155 0 2017-09-27

Top Crashers for Firefox 68.0
Top 50 Crashing Signatures. 7 days ago

#1 7.21% -0.95% shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown 148 148 0 0 139 0 2017-09-27

Top Crashers for Firefox 70.0a1
Top 50 Crashing Signatures. 7 days ago

#18 0.76% 0.63% shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown 55 55 0 0 54 0 2017-09-27

Signature report for shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown
Showing results from 7 days ago

Windows 10 2637 47.5%
Windows 7 1910 34.4%
Windows 8.1 886 15.9%
Windows Vista 62 1.1%
Windows 8 60 1.1%
Windows XP 2 0.0%

Firefox 68.0.1 2225 40.0% 1026
Firefox 68.0.2 646 11.6% 725
Firefox 60.8.0esr 318 5.7% 215
Firefox 69.0b14 192 3.5% 111
Firefox 69.0b13 182 3.3% 125
Firefox 60.7.2esr 168 3.0% 149
Firefox 68.0 147 2.6% 152
Firefox 69.0b15 82 1.5% 90
Firefox 60.6.2esr 74 1.3% 56
Firefox 52.9.0esr 65 1.2% 71
Firefox 69.0b12 65 1.2% 47
Firefox 70.0a1 55 1.0% 55
Firefox 69.0b11 51 0.9% 36

Uptime Range

1 hour 1643 29.6%
1-5 min 1507 27.1%
5-15 min 1287 23.2%
15-60 min 1117 20.1%
< 1 min 3 0.1%

Architecture
amd64 3924 70.6%
x86 1633 29.4%

Flags: needinfo?(skywalker333)

The rate of shutdown hangs has recently doubled. Jimm, is the reason for this known?

Flags: needinfo?(jmathies)

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #7)

The rate of shutdown hangs has recently doubled. Jimm, is the reason for this known?

It is doubled mainly from increased crashes of Thunderbird 68.x, probably due to users updating from 60.x, being tracked in bug 1526127.
Per the graph https://crash-stats.mozilla.org/signature/?signature=shutdownhang%20%7C%20nsThread%3A%3AShutdown%20%7C%20nsThreadManager%3A%3AShutdown&date=%3E%3D2019-10-21T01%3A45%3A00.000Z&date=%3C2020-01-21T01%3A45%3A00.000Z#graphs

Thunderbird crashes should be referred to bug 1524247

See Also: → 1524247
Whiteboard: [Thunderbird see bug 1524247]

Looking at the aggregation of MOZ_CRASH_REASON I see here now:

1 	MOZ_CRASH(Shutdown too long, probably frozen, causing a crash.) 	24732 	100.00 %
2 	MOZ_CRASH(Shutdown hanging before starting.) 	1 	0.00 %

For the vast majority then, looking at the RunWatchdog this seems to indicate that:

  1. All observers have been successfully notified and unblocked (sShutdownNotified is true)
  2. No workers are hanging (runtimeService->CrashIfHanging(); did not crash)

To me this seems to indicate that there is nothing left worth waiting for (otherwise we would report it here) and we are just bloating our crash statistics. Still this is unexpected behavior, but for the end-user it should be handled without harm and I do not see how we could transform those crash reports into something more actionable if there is nothing concrete left we wait for.

A possible fix might be, to replace this MOZ_CRASH with a MOZ_ASSERT_UNREACHABLE and an exit(0) ? Or to dispatch a last "kill yourself now" message to the main thread and wait another second or two?

Flags: needinfo?(amarchesini)
Crash Signature: [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown] → [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown] [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | nsThreadManager::Shutdown]
Crash Signature: [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown] [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | nsThreadManager::Shutdown] → [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown] [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | nsThreadManager::Shutdown] [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThread…

The signature just added by the release bot seems to be a different situation. If I look at MOZ_CRASH_REASON there, I see only the MOZ_CRASH(Shutdown hanging before starting.) case, which is quite the opposite extreme (not even the first shutdown phase succeeded).

Sylvestre, can we prevent the bot somehow from adding this signature here?

Crash Signature: [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown] [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | nsThreadManager::Shutdown] [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | → [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown] [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | nsThreadManager::Shutdown] [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask |
Flags: needinfo?(sledru)

Jens, the bot is doing that because bug 1678330 is marked as dup of this bug.

So, either you undup (and reopen) bug 1678330 or you remove the signature from bug 1678330

Flags: needinfo?(sledru)

Sorry, that was too easy...

Crash Signature: [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown] [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | nsThreadManager::Shutdown] [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThrea… → [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown] [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | nsThreadManager::Shutdown]
Assignee: nobody → jstutte
Status: NEW → ASSIGNED
Attachment #9192576 - Attachment description: Bug 1505660: Promote profile-before-change-qm to have a timer reset. r=#dom-workers-and-storage → Bug 1505660: Promote profile-before-change-net-teardown -qm and -telemetry to have a shutdown timer reset. r=#dom-workers-and-storage
Attachment #9192576 - Attachment description: Bug 1505660: Promote profile-before-change-net-teardown -qm and -telemetry to have a shutdown timer reset. r=#dom-workers-and-storage → Bug 1505660: Promote profile-change-net-teardown -before-change-qm and -telemetry to have a shutdown timer reset. r=#dom-workers-and-storage
Attachment #9192576 - Attachment description: Bug 1505660: Promote profile-change-net-teardown -before-change-qm and -telemetry to have a shutdown timer reset. r=#dom-workers-and-storage → Bug 1505660: Promote profile-change-net-teardown -before-change-qm and -telemetry to have a shutdown timer reset. r=dthayer,#dom-workers-and-storage
Attachment #9192576 - Attachment description: Bug 1505660: Promote profile-change-net-teardown -before-change-qm and -telemetry to have a shutdown timer reset. r=dthayer,#dom-workers-and-storage → Bug 1505660: Promote profile-change-net-teardown -before-change-qm and -telemetry to have a shutdown timer reset. r?dthayer,#dom-workers-and-storage
Attachment #9192576 - Attachment description: Bug 1505660: Promote profile-change-net-teardown -before-change-qm and -telemetry to have a shutdown timer reset. r?dthayer,#dom-workers-and-storage → Bug 1505660: Promote profile-change-net-teardown -before-change-qm and -telemetry to have a shutdown timer reset. r?#dom-workers-and-storage,dthayer
Attachment #9192576 - Attachment description: Bug 1505660: Promote profile-change-net-teardown -before-change-qm and -telemetry to have a shutdown timer reset. r?#dom-workers-and-storage,dthayer → Bug 1505660: Promote profile-change-net-teardown and -before-change-qm to have a shutdown timer reset. r?#dom-workers-and-storage,dthayer
Pushed by rmaries@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/45ba20e40892 Promote profile-change-net-teardown and -before-change-qm to have a shutdown timer reset. r=dthayer,mossop
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 86 Branch

While the fix is expected to potentially reduce the number of crashes, it is not necessary a complete solution. We expect QM shutdown to be faster in general now thanks to bug 1666214, and we hope to have better diagnostics about the remaining cases through this patch here and will continue to observe this.

And for some of these crashes the MOZ_CRASH_REASON might just change slightly to be more explicit about the phase we are stuck in.

Status: RESOLVED → REOPENED
Flags: needinfo?(amarchesini)
Keywords: leave-open
Resolution: FIXED → ---

The patch landed in nightly and beta is affected.
:jstutte, is this bug important enough to require an uplift?
If not please set status_beta to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(jstutte)

Comment on attachment 9192576 [details]
Bug 1505660: Promote profile-change-net-teardown and -before-change-qm to have a shutdown timer reset. r?#dom-workers-and-storage,dthayer

Beta/Release Uplift Approval Request

  • User impact if declined: Some users might experience less shutdown crashes thanks to this patch. And at least we might gain some more diagnostic insights on the different shutdown hangs in different phases.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The patch does only extend the explicit shutdown handling on two more shutdown phases. Normal operation (including successful shutdown) is unaffected.
  • String changes made/needed:
Flags: needinfo?(jstutte)
Attachment #9192576 - Flags: approval-mozilla-beta?
Target Milestone: 86 Branch → ---

Comment on attachment 9192576 [details]
Bug 1505660: Promote profile-change-net-teardown and -before-change-qm to have a shutdown timer reset. r?#dom-workers-and-storage,dthayer

approved for 85.0b6

Attachment #9192576 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Depends on: 1685204

Comment on attachment 9192576 [details]
Bug 1505660: Promote profile-change-net-teardown and -before-change-qm to have a shutdown timer reset. r?#dom-workers-and-storage,dthayer

https://hg.mozilla.org/releases/mozilla-beta/rev/f6ecda1b2d689320f4a3ce86893afb9efbbdef1a

[clearing approval to get this off the needs-uplift queries]

Attachment #9192576 - Flags: approval-mozilla-beta+
See Also: → 1689953
Blocks: 1629669
See Also: → 1704391
See Also: → 1684441
See Also: → 1730686
Depends on: 1741377
Depends on: 1741675
Depends on: 1740899

FWIW, the current statistics show two interesting data points:

  • Thunderbird is not appearing any more since mid-november. Is this just a change in reporting or did they actually solve something here?
  • Since FF 94 each release causes a high spike, presumably right after the updates happened.

Thunderbird crash reports: https://crash-stats.thunderbird.net/ is used for crash reports now, and https://crash-stats.mozilla.org/ rejects submissions for Thunderbird (bug 1608971).

(In reply to Jens Stutte [:jstutte] from comment #23)

FWIW, the current statistics show two interesting data points:

  • Thunderbird is not appearing any more since mid-november. Is this just a change in reporting or did they actually solve something here?

Sadly, yes. crash-stats terminated supporting Thunderbird crashes. Collection stopped Nov 9.

There is public access to individual crash reports. But the detailed data on backtrace.io requires a paid account.

There is public access to individual crash reports. But the detailed data on backtrace.io requires a paid account.

How expensive is such an account? How do you get one? What's the public interface for searching this information, once you have an account?

I'm not necessarily going to go through with this. But the https://backtrace.io site has no information on this, or on Thunderbird at all. So it'd be nice to know.

Flags: needinfo?(vseerror)

responded to smichaud via PM

Flags: needinfo?(vseerror)
See Also: → 1769211

Copying crash signatures from duplicate bugs.

Crash Signature: [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown] [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | nsThreadManager::Shutdown] → [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown] [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | nsThreadManager::Shutdown] [@ shutdownhang | nsThread::Shutdown | nsThreadManager::ShutdownNonMainThr…

crash-stats has the XPCOMSpinEventLoopStack field (under crash annotations) that is useful in this context. These signatures are going to be popular and will change over time but, as of today, I'm seeing:

@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown
By far the most popular cause is nsThread::Shutdown: BitsCommander, which is trying to stop a thread that is in the middle of a COM operation to do a background download. There are also BitsMonitor {some_guid} threads that are also related to downloads. I don't know if we expect to wait for or to kill incomplete downloads at shutdown. There are a few other stray thread names under this hang but most of the rest (about half) are just empty names. With those, it seems common to find BitsCommander doing COM on some thread (like with this crash ).

@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | nsThreadManager::Shutdown
This only goes to Fx 84 -- all of the others are Thunderbird crashes. Most of the labels are empty -- maybe Tb doesn't label the relevant threads.

@ shutdownhang | nsThread::Shutdown | nsThreadManager::ShutdownNonMainThreads
Again, mostly BitsCommander threads not shutting down, although this signature brings in a healthy number of Update Watcher and TelemetryModule threads. There isn't a strong pattern here but the Telemetry hangs suggest AuthenticodeImpl::GetBinaryOrgName is too slow. See e.g. thread 24 of this crash. The Update Watcher thread is stuck waiting for the updater process to finish in nsUpdateProcessor::WaitForProcess.

Crash Signature: [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown] [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | nsThreadManager::Shutdown] [@ shutdownhang | nsThread::Shutdown | → [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown] [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | nsThreadManager::Shutdown] [@ shutdownhang | nsThread::Shutdown |
See Also: → 1741675
See Also: → 1772903
See Also: → 1772908
Whiteboard: [Thunderbird see bug 1524247] → [Thunderbird see bug 1524247], [win:stability]

I scraped crashstats for nightly instances of this signature where some of the thread names contain SHDRCV and/or SHDACK (added by bug 1770451).

FWIW, here is the result.

There is definitely some correlation with the Updater process being waited on, as in bug 1772908. But we should also look at the other ones.

There is also one interesting instance concerning the BHMgr Monitor and Cookie threads.

  1. We enter phase ShutdownPhase::AppShutdown
  2. (MainThread) notifies the CookieService to shutdown.
  3. (MainThread) The CookieService starts a synchronous shutdown().
  4. (MainThread) nsThread::Shutdown() is spinning the event loop, waiting for the thread to finish
  5. (BHMgr Monitor) BackgroundHangManager::MonitorThread detects a hang and wants to report it (might have happened even earlier)
  6. (BHMgr Monitor) BackgroundHangThread::ReportHang is called with mManager->mLock IS locked
  7. (BHMgr Monitor) For whatever cosmic reason, writing to file takes some time inside nsLocalFile::OpenNSPRFileDesc
  8. (MainThread) ProcessNextEvent wants to BackgroundHangMonitor().NotifyWait(); which wants to lock the same mManager->mLock already held by BackgroundHangThread::ReportHang and starves
  9. (Cookie thread) In the meantime, the Cookie thread reached its end and is just waiting for the nsThreadShutdownAckEvent to be processed on the MainThread.

Also for performance reasons it seems questionable to me that a lock used by frequent event processing is held on the background hang monitor thread for long-running tasks like hang-report disk writings.

Bob, :jya, the lock that blocks the main thread has been introduced relatively recently (2 years ago) by bug 1634253.

Dragana, on a different side note the synchronous nsThread::Shutdown() in CookiePersistentStorage::Close could maybe be avoided in favor of nsThread::AsyncShutdown() ? It would not stop this from happening, of course, but there would be less blocking layers potentially stacking on the main thread.

Flags: needinfo?(jya-moz)
Flags: needinfo?(dd.mozilla)
Flags: needinfo?(bhood)

Doug, you you recall any of this stuff. Bug 1594577 changed some of the relevant code, though https://hg.mozilla.org/mozilla-central/annotate/2426bd765f8b74744fa7042826423ec5ddca53a7/xpcom/threads/nsThread.cpp#l1094 was added later.

Flags: needinfo?(dothayer)

Andrew, I know this is a necro, but would you mind having a cursory look at this when you get a chance just in case Jean-Yves fails to respond? (See comment 32)

Flags: needinfo?(bhood) → needinfo?(aosmond)

I will open a bug to investigate a possible async shutdown of the cookie storage.

Flags: needinfo?(dd.mozilla)
See Also: → 1781032

A bit of additional archaeology and a bit of additional correlation.


(In reply to David Parks [:handyman] from comment #30)

crash-stats has the XPCOMSpinEventLoopStack field (under crash annotations) that is useful in this context. These signatures are going to be popular and will change over time

And indeed:

@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown

@ shutdownhang | nsThread::Shutdown | nsThreadManager::ShutdownNonMainThreads

These are effectively the same callstack, only before and after bug 1764119.


@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | nsThreadManager::Shutdown
This only goes to Fx 84 -- all of the others are Thunderbird crashes. Most of the labels are empty -- maybe Tb doesn't label the relevant threads.

This is also the same callstack.

The primary difference — the presence of mozilla::TaskController::GetRunnableForMTTask — is spurious. The call stack is essentially identical in Fx and Tb, as can be seen in Visual Studio in the minidumps; but this frame is consistently inlined in Fx and consistently not inlined in Tb. Presumably this is due to some locally-uninteresting choice of compiler flags.

The secondary difference -- the absence of XPCOMSpinEventLoopStack annotations -- is mostly due to the fact that Tb didn't start writing these until about v95. As with Fx, the signature has changed to @ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | nsThreadManager::ShutdownNonMainThreads in recent versions, and XPCOMSpinEventLoopStack is generally present.

Thunderbird crashes with this signature are being investigated in bug 1768344 and bug 1749142.

(I've added the latter signature to the crash data, above, but perhaps I ought to have removed the former?)


More than 80% of these shutdown hangs appear to be in Thunderbird; of those, about 80% seem to be blocked on the "IMAP" thread. The remainder are stuck on "Update Watcher" and "TelemetryModule".

On the Firefox side (post-bug 1764119 only, for simplicity's sake), the vast majority of hangs are in "Update Watcher" and "BitsCommander", which are very nearly identical — it's not surprising that they're correlated, given that the only thing we use BITS for is downloading updates, but I have no explanation for why their ratio is approximately 1. The other hangs with this signature are mostly attributed to "TelemetryModule" and "Wifi Monitor", with "BitsMonitor {various uuids}" and "BHMgr Processor" being distant runners-up.

Crash Signature: nsThreadManager::ShutdownNonMainThreads] → nsThreadManager::ShutdownNonMainThreads] [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | nsThreadManager::ShutdownNonMainThreads]
Summary: Crash in shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown → Crash in shutdownhang | nsThread::Shutdown | nsThreadManager::ShutdownNonMainThreads
Flags: needinfo?(jya-moz)
Whiteboard: [Thunderbird see bug 1524247], [win:stability] → [Thunderbird see bug 1524247]

(In reply to Jens Stutte [:jstutte] from comment #31)

I scraped crashstats for nightly instances of this signature where some of the thread names contain SHDRCV and/or SHDACK (added by bug 1770451).

FWIW, here is the result.

I updated the sheet.

Moving to XPCOM, since many of the hangs are in the parent process.

Component: DOM: Content Processes → XPCOM
Depends on: 1791767

(In reply to Olli Pettay [:smaug][bugs@pettay.fi] from comment #38)

Moving to XPCOM, since many of the hangs are in the parent process.

Just to be precise: All of them.

See Also: → 1790983
Severity: critical → S2
Depends on: 1794785
Depends on: 1794799
Blocks: 1749142
Blocks: 1768344
Depends on: 1796060

Moved ni?s to bug 1796060.

Flags: needinfo?(dothayer)
Flags: needinfo?(aosmond)

(In reply to Jens Stutte [:jstutte] from comment #31)

I scraped crashstats for nightly instances of this signature where some of the thread names contain SHDRCV and/or SHDACK (added by bug 1770451).

FWIW, here is the result.

I updated the sheet. (Unfortunately this is not a simple query in crash-stats)

Depends on: 1796132

On the Firefox side (post-bug 1764119 only, for simplicity's sake), the vast majority of hangs are in "Update Watcher" and "BitsCommander", which are very nearly identical — it's not surprising that they're correlated, given that the only thing we use BITS for is downloading updates, but I have no explanation for why their ratio is approximately 1. The other hangs with this signature are mostly attributed to "TelemetryModule" and "Wifi Monitor", with "BitsMonitor {various uuids}" and "BHMgr Processor" being distant runners-up.

Jens observed that some of these have the BackgroundTaskName: backgroundupdate annotation. Such crashes during shutdown are exactly the kind of crash we would expect from background tasks: lots of Gecko doesn't have clean shutdown and background tasks can exit in a rush; there's often not much work to do. (And there's a timeout mechanism but last I checked, which was a while back, we don't see timeouts frequently.)

The two listed threads are very likely to be involved: backgroundupdate tasks install updates and interact with BITS (to start downloads and check on transfer progress).

Let me know if I can be of assistance debugging or reproducing these crashes. Thanks!

Crash Signature: nsThreadManager::ShutdownNonMainThreads] [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | nsThreadManager::ShutdownNonMainThreads] → nsThreadManager::ShutdownNonMainThreads] [@ shutdownhang | mozilla::SpinEventLoopUntil | nsThread::Shutdown | nsThreadManager::ShutdownNonMainThreads] [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | nsThreadManage…

I hit this on 107.0b3 a bit ago. TB had presented me with the OAuth2-type username and PW dialogue but I was in the middle of shutting down TB when it hung and then crashed. So either an OAuth2 or PW manager issue.

https://crash-stats.mozilla.org/report/index/0a3b7415-dc4d-48d4-ba59-1eccf0221104

And a second shortly afterwards: https://crash-stats.mozilla.org/report/index/5dd1a9ea-54a5-4db2-8d65-1163e0221104

Maybe related to bug 1768344 since that's what I did before it crashed this time.

Blocks: 1798108

Added a signature

Crash Signature: nsThreadManager::ShutdownNonMainThreads] [@ shutdownhang | mozilla::SpinEventLoopUntil | nsThread::Shutdown | nsThreadManager::ShutdownNonMainThreads] [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | nsThreadManage… → nsThreadManager::ShutdownNonMainThreads] [@ shutdownhang | mozilla::SpinEventLoopUntil | nsThread::Shutdown | nsThreadManager::ShutdownNonMainThreads] [@ shutdownhang | mozilla::SpinEventLoopUntil<T> | nsThread::Shutdown | nsThreadManager::ShutdownNonM…

@gsvelto: would it be feasible to bucket these crashes by their XPCOMSpinEventLoopStack value, rather than by their callstack? It should be both more stable and more indicative of the actual problem.

(Although at the moment it'd require some special-casing for BitsMonitor threads, due to the unfortunate embedded GUID.)

Flags: needinfo?(gsvelto)

(In reply to Nick Alexander :nalexander [he/him] from comment #42)

On the Firefox side (post-bug 1764119 only, for simplicity's sake), the vast majority of hangs are in "Update Watcher" and "BitsCommander", which are very nearly identical — it's not surprising that they're correlated, given that the only thing we use BITS for is downloading updates, but I have no explanation for why their ratio is approximately 1. The other hangs with this signature are mostly attributed to "TelemetryModule" and "Wifi Monitor", with "BitsMonitor {various uuids}" and "BHMgr Processor" being distant runners-up.

Jens observed that some of these have the BackgroundTaskName: backgroundupdate annotation. Such crashes during shutdown are exactly the kind of crash we would expect from background tasks: lots of Gecko doesn't have clean shutdown and background tasks can exit in a rush; there's often not much work to do. (And there's a timeout mechanism but last I checked, which was a while back, we don't see timeouts frequently.)

The two listed threads are very likely to be involved: backgroundupdate tasks install updates and interact with BITS (to start downloads and check on transfer progress).

Let me know if I can be of assistance debugging or reproducing these crashes. Thanks!

From recent nightly numbers (1 week) I see:

  • 15 non-background task instances. All I looked at have to do with BitsCommander or UpdateWatcher.
  • 172 background task ones, all of which have "backgroundupdate" as task name.

We should probably re-triage (and maybe rename) bug 1741675.

(In reply to Ray Kraesig [:rkraesig] from comment #46)

@gsvelto: would it be feasible to bucket these crashes by their XPCOMSpinEventLoopStack value, rather than by their callstack? It should be both more stable and more indicative of the actual problem.

Yes, it's possible and I'll file a bug for it.

(Although at the moment it'd require some special-casing for BitsMonitor threads, due to the unfortunate embedded GUID.)

That's a problem. For something to be in the signature it should be stable over time and crashes, otherwise bucketing won't work. We have steps within Socorro to sanitize stuff (mostly for privacy reason) but they add complexity and I don't think we want that in signature generation.

Flags: needinfo?(gsvelto)
Depends on: 1810491

(In reply to Jens Stutte [:jstutte] from comment #47)

From recent nightly numbers (1 week) I see:

  • 15 non-background task instances. All I looked at have to do with BitsCommander or UpdateWatcher.
  • 172 background task ones, all of which have "backgroundupdate" as task name.

We should probably re-triage (and maybe rename) bug 1741675.

This is still the case, it seems. And this is our second top-most crash in nightly, currently.

Updating signatures to the active ones for current versions.

Crash Signature: [@ shutdownhang | nsThread::Shutdown | nsThreadManager::Shutdown] [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | nsThreadManager::Shutdown] [@ shutdownhang | nsThread::Shutdown | nsThreadManager::ShutdownNonMainTh… → [@ shutdownhang | mozilla::SpinEventLoopUntil | nsThread::Shutdown | nsThreadManager::ShutdownNonMainThreads ] [@ shutdownhang | nsThread::Shutdown | nsThreadManager::ShutdownNonMainThreads ] [@ shutdownhang | mozilla::SpinEventLoopUntil<T> | nsThread::…

FWIW, most Thunderbird crashes are German locale (German 21%, English 19%, Japanese 15%, in that order).

Firefox crashes are overwhelmingly English - 38%

See Also: → 1843744

Currently ~95% of the Firefox crashes are from backgroundupdate, see bug 1741675.

Crash Signature: nsThreadManager::ShutdownNonMainThreads ] [@ shutdownhang | mozilla::SpinEventLoopUntil | nsThreadManager::ShutdownNonMainThreads ] [@ shutdownhang | nsThreadManager::ShutdownNonMainThreads ] [@ shutdownhang | mozilla::SpinEventLoopUntil<T> | nsThread… → nsThreadManager::ShutdownNonMainThreads ] [@ shutdownhang | mozilla::SpinEventLoopUntil | nsThreadManager::ShutdownNonMainThreads ] [@ shutdownhang | nsThreadManager::ShutdownNonMainThreads ] [@ shutdownhang | mozilla::SpinEventLoopUntil | nsThread::W…
No longer blocks: 1768344
See Also: → 1768344

I'm going to unmark this main bug as S2, it makes more sense to prioritize the actual signatures.

Severity: S2 → N/A

(In reply to Gian-Carlo Pascutto [:gcp] from comment #55)

I'm going to unmark this main bug as S2, it makes more sense to prioritize the actual signatures.

I agree, but we need some Sx value to make the triage bot happy.

Severity: N/A → S3

The severity field for this bug is set to S3. However, the following bug duplicate has higher severity:

:jstutte, could you consider increasing the severity of this bug to S2?

For more information, please visit BugBot documentation.

Flags: needinfo?(jstutte)

No, see comment 55.

Flags: needinfo?(jstutte)

The leave-open keyword is there and there is no activity for 6 months.
:jstutte, maybe it's time to close this bug?
For more information, please visit BugBot documentation.

Flags: needinfo?(jstutte)
Flags: needinfo?(jstutte)
Depends on: 1866944

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: