Increase in Windows shutdown crashes starting around 2021-10-27, e.g. Crash in [@ shutdownhang | NtFlushBuffersFile]
Categories
(Core :: XPCOM, defect, P1)
People
(Reporter: aryx, Assigned: valentin)
References
Details
(Keywords: crash, topcrash)
Crash Data
Attachments
(1 file, 1 obsolete file)
48 bytes, text/x-phabricator-request
diannaS: approval-mozilla-beta+
pascalc: approval-mozilla-release+
Jens, do you have insight into why this signature [started to spike](https://crash-stats.mozilla.org/signature/?product=Firefox&signature=shutdownhang%20%7C%20NtFlushBuffersFile&date=%3E%3D2021-08-02T16%3A18%3A00.000Z&date=%3C2021-11-02T16%3A18%3A00.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_columns=startup_crash&_sort=-date&page=2#graphs) (3.5-4x the previous level) around 2021-10-27, but on the release channel? Bug 1705360 saw a change, but that was for Nightly (95.0a1), and other signatures are also seeing an increase. Either people crash more, submit more, or we process more.
- @ mozilla::detail::RunnableMethodImpl<T>::~RunnableMethodImpl
- @ mozilla::ClearOnShutdown_Internal::PointerClearer<T>::Shutdown
- @ shutdownhang | mozilla::EventListenerManager::HandleEventInternal
Crash report: https://crash-stats.mozilla.org/report/index/8b3e2d08-629d-4729-90e3-082930211102
MOZ_CRASH Reason: Shutdown hanging at step quit-application. Something is blocking the main-thread.
Top 10 frames of crashing thread:
0 ntdll.dll NtFlushBuffersFile
1 KERNELBASE.dll FlushFileBuffers
2 nss3.dll winSync third_party/sqlite3/src/sqlite3.c:45420
3 nss3.dll syncJournal third_party/sqlite3/src/sqlite3.c:56920
4 nss3.dll sqlite3PagerCommitPhaseOne third_party/sqlite3/src/sqlite3.c:59166
5 nss3.dll sqlite3BtreeCommitPhaseOne third_party/sqlite3/src/sqlite3.c:69499
6 nss3.dll sqlite3VdbeHalt third_party/sqlite3/src/sqlite3.c:82024
7 nss3.dll sqlite3VdbeExec third_party/sqlite3/src/sqlite3.c:89967
8 nss3.dll sqlite3_step third_party/sqlite3/src/sqlite3.c:84918
9 nss3.dll sqlite3_exec third_party/sqlite3/src/sqlite3.c:126190
Comment 1•4 years ago
Hi Dana, do you have any idea what could have caused this rise here?
Hi Aryx, do I understand right that we see a general increase in crashes, or is it limited to just some signatures? Does this crash stand out in particular here?
(Reporter)
Comment 2•4 years ago
Most signatures starting with shutdownhang, but not all, saw the crash volume increase: check the rank in the first column and the red highlights for big moves in the ranking.
Comment 3•4 years ago
(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #2)
Most signatures starting with shutdownhang, but not all, saw the crash volume increase: check the rank in the first column and the red highlights for big moves in the ranking.
It seems we are more or less back to the previous level. I can only suspect that there was a bigger Windows update rolling out on those days while people wanted to shut down their machines, causing heavy IO and thus shutdown hangs? That would at least somewhat explain why mostly only shutdown hangs were increasing.
Comment 4•4 years ago
In the meantime we also had our 94 release, so it is actually hard to tell what makes the difference, though.
(Reporter)
Comment 5•4 years ago
https://docs.microsoft.com/en-us/windows/release-health/release-information mentions the latest OS revisions for Windows 10 as released on 2021-10-26; crash-stats doesn't expose the last part of the version except in the raw data.
Comment 6•4 years ago
(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #5)
https://docs.microsoft.com/en-us/windows/release-health/release-information mentions latest OS revisions for Windows 10 released on 2021-10-26, crash stats doesn't expose the last part of the version except in the raw data.
The dates match perfectly, so I tend towards saying that this was the reason. If so, I do not see much we can do here other than recommending that our users explicitly quit Firefox before installing Windows updates.
Comment 7•4 years ago
Seems to be similar to bug 1705360 and bug 1712118, but we still don't know what's going on here.
Comment 8•4 years ago
The volume of this crash is still quite high. In the comments, some people are talking about Firefox crashing while they are using it, but there is also a surprising number of comments from people saying they didn't even notice a crash, yet they got this dialog.
An instance of this last thing is this crash bp-48031eb1-0766-4ab6-b08a-a34220211201 with this comment: "I often get this pop-up but I'm not aware of my browser crashing."
bp-1fa0773e-f077-4f8f-9238-f0d700211206 In this crash, we're shutting down (in profile-before-change), spinning the event loop as part of CookiePersistentStorage::Close(), and then we hang in the middle of a captive portal check that is trying to sync some file to disk.
bp-1eac166f-e426-4cea-9cb8-0d7d30211205 This crash has the comment: "Installed Windows 11 a day ago. Completely shut down my desktop last night and restarted this morning first thing. So, then, this crash happened." The stack is similar to the previous one, except instead of the cookie storage thing doing the observation, it is something written in JS that ends up spinning the event loop.
Comment 9•4 years ago
The volume jumped from 500-1000/day to multiple thousands per day around Oct 26th or so. We heard there was a Windows update around that time. However there are crashes on win7/8/10 and 11. Just the one most common signature for this issue has had 16000 crashes in the last 7 days, and about 1/4 of the top 50 crash signatures are from this issue, including #2.
The volume had gone up from low 100's (from the various signatures) to the 500-1000 level from June to October.
About 50% of the crashes are in profile-before-change, and about 50% in quit-application (and a few profile-before-change-qm, and a handful with no shutdown-process attribution).
They all show up as non-Fission crashes; however, this might be because we never opened a window and set the Fission attribution. If so, that would point to a process that started up in order to notify a running Firefox to browse to a URL. I should note that the uptime of these crashes is typically in the hundreds of seconds, with a few measured in days - perhaps the process was just hung there for a long time until the user or the OS did something to trigger a quit.
Also: virtually all the stacks have NSS initialization, called from CaptivePortal initialization (except those where the stack backtrace failed before we got back that far). Some show us clearly in shutdown; others don't (though all appear to be in shutdown per above).
Captive Portal seems to be initialized off a timer; in a number of crash reports we see a thread in ::Shutdown spinning the event loop and running a timer which tries to start the captive portal.
There likely are better ways to solve this, but a simple quick failure on NSS init when shutdown has started may be enough to stop the hangs.
In discussion with nika, cpeterson, mccr8 we mentioned that we likely shouldn't start captive portal if this instance of Firefox is going to notify an existing firefox and exit. (Though why it didn't notify it and exit is a separate question, assuming that's what's happening).
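(A minimal sketch of that idea, assuming a hypothetical timer callback that would otherwise kick off the captive portal check; the helper name is made up for illustration, only AppShutdown::IsInOrBeyond and the AppShutdownConfirmed phase come from this bug:)
#include "mozilla/AppShutdown.h"

// Hypothetical timer callback that would otherwise start a captive portal
// check. Bail out once shutdown has been confirmed so we never reach NSS
// initialization this late.
static void MaybeStartCaptivePortalCheck() {
  if (mozilla::AppShutdown::IsInOrBeyond(
          mozilla::ShutdownPhase::AppShutdownConfirmed)) {
    // "quit-application" (or a later phase) has already fired; don't start
    // new network activity that could block the main thread at shutdown.
    return;
  }
  // ... normal captive portal detection would be started here ...
}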
Comment 10•4 years ago
I'd propose adding the following to nsNSSComponent::Init():
// Avoid a late initialization
if (AppShutdown::IsInOrBeyond(ShutdownPhase::AppShutdownNetTeardown)) {
  return NS_ERROR_ILLEGAL_DURING_SHUTDOWN;
}
(I cannot try this immediately myself due to bug 1744665)
Please note that NSS already checks AppShutdown::IsShuttingDown() in some (few) other places, which is equivalent to the earlier AppShutdown::IsInOrBeyond(ShutdownPhase::AppShutdownConfirmed). We might want to try whether we can avoid the initialization earlier, too, but I'd expect that to break some tests (actually, I would not even be surprised to see some breakage already with ShutdownPhase::AppShutdownNetTeardown, but that would clearly be something to avoid then).
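(For orientation, a rough sketch of how the two checks mentioned here relate; this is an assumption based only on the phases named in this bug, the authoritative ordering lives in the tree in ShutdownPhase.h/AppShutdown.h:)
// The existing early exits trigger from the "quit-application" phase onward:
bool pastQuitApplication = mozilla::AppShutdown::IsInOrBeyond(
    mozilla::ShutdownPhase::AppShutdownConfirmed);  // ~ IsShuttingDown()

// The proposed nsNSSComponent::Init() guard uses a later phase, so it only
// rejects initialization once network teardown has already begun:
bool pastNetTeardown = mozilla::AppShutdown::IsInOrBeyond(
    mozilla::ShutdownPhase::AppShutdownNetTeardown);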
(Assignee)
Comment 11•4 years ago
(Assignee)
Comment 12•4 years ago
Depends on D133064
(Assignee)
Comment 13•4 years ago
I did a couple of try pushes: one with just the captive portal fix and one with Jens' suggestion to prevent NSS initialization during shutdown.
Comment 14•4 years ago
(In reply to Valentin Gosu [:valentin] (he/him) from comment #13)
one with Jens' suggestion to prevent NSS initialization during shutdown
That seems to break some 300+ tests... We might want to try with later phases like ShutdownPhase::XPCOMWillShutdown, which is the first phase after telemetry shutdown.
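(If that route is taken, the guard from comment 10 would just swap the phase; a sketch, with everything else unchanged:)
// Same guard as proposed in comment 10, but only rejecting NSS initialization
// very late in shutdown (after telemetry), which should be less disruptive to
// tests.
if (AppShutdown::IsInOrBeyond(ShutdownPhase::XPCOMWillShutdown)) {
  return NS_ERROR_ILLEGAL_DURING_SHUTDOWN;
}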
Comment 15•4 years ago
What ShutdownPhase does quit-application correspond to? I'm not sure how those relate.
(Assignee)
Comment 16•4 years ago
I have a fixed version coming up here.
Arguably we shouldn't be creating new channels during shutdown either.
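(As a rough sketch of that idea, with a hypothetical entry point; this is not the actual patch, it just reuses the shutdown check and error code already discussed above:)
// Hypothetical guard at a channel-creation entry point: once shutdown has
// been confirmed, refuse to hand out new necko channels instead of starting
// work that can hang the main thread.
nsresult MaybeNewChannel(nsIURI* aURI, nsIChannel** aResult) {
  if (mozilla::AppShutdown::IsInOrBeyond(
          mozilla::ShutdownPhase::AppShutdownConfirmed)) {
    return NS_ERROR_ILLEGAL_DURING_SHUTDOWN;
  }
  // ... normal NS_NewChannel-style setup would continue here ...
  return NS_OK;
}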
Comment 17•4 years ago
(In reply to Andrew McCreight [:mccr8] from comment #15)
What ShutdownPhase does quit-application correspond to? I'm not sure how those relate.
The mapping is defined here. "quit-application" <-> AppShutdownConfirmed
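(To illustrate how the observer topic and the phase check line up, a small hypothetical observer; only the "quit-application" <-> AppShutdownConfirmed pairing comes from this comment:)
// Hypothetical observer implementation used purely for illustration.
NS_IMETHODIMP
ShutdownLogger::Observe(nsISupports*, const char* aTopic, const char16_t*) {
  if (!strcmp(aTopic, "quit-application")) {
    // This topic corresponds to ShutdownPhase::AppShutdownConfirmed, so code
    // that cannot observe notifications can ask AppShutdown directly:
    bool confirmed = mozilla::AppShutdown::IsInOrBeyond(
        mozilla::ShutdownPhase::AppShutdownConfirmed);
    (void)confirmed;  // e.g. log or branch on it here
  }
  return NS_OK;
}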
Comment 18•4 years ago
(In reply to Jens Stutte [:jstutte] from comment #17)
(In reply to Andrew McCreight [:mccr8] from comment #15)
What ShutdownPhase does quit-application correspond to? I'm not sure how those relate.
The mapping is defined here. "quit-application" <-> AppShutdownConfirmed
Note that this mapping is currently "parent process centric". We might want to look at bug 1697745 to improve things further here.
Comment 19•4 years ago
This approach sounds good. I was waiting to see if fission just fixed this, but evidently it doesn't.
Comment 20•4 years ago
[Tracking Requested - why for this release]: Number 5 top crash on release. Maybe higher if you combine the signatures.
Comment 21•4 years ago
Comment 22•4 years ago
Tracking as a ride-along for a potential dot release in 95, given that this is a high-volume crash. Valentin, please request uplift to beta and release after this has landed on mozilla-central. Thanks.
Comment 23•4 years ago
(bugherder)
https://hg.mozilla.org/mozilla-central/rev/d479dee92656
https://hg.mozilla.org/mozilla-central/rev/56ae8059453e
(Assignee)
Comment 24•4 years ago
Comment on attachment 9254073 [details]
Bug 1738984 - Prevent NSS initialization and channel creation during shutdown r=keeler!
Beta/Release Uplift Approval Request
- User impact if declined: Large number of crashes (possibly in background task) caused by NSS initialization during shutdown.
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: No
- If yes, steps to reproduce: Watching the crash rate should confirm if this patch works as expected
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): These patches ensure that the captive portal is properly stopped when the shutdown notification is received, and that no Necko channels are created and no NSS initialization happens after that point.
- String changes made/needed:
(Assignee)
Comment 25•4 years ago
Comment on attachment 9254073 [details]
Bug 1738984 - Prevent NSS initialization and channel creation during shutdown r=keeler!
Approved for 96.0b3
Comment 26•4 years ago
Comment on attachment 9254072 [details]
Bug 1738984 - Prevent captive portal service from working post-shutdown r=#necko
Approved for 96.0b3
Comment 27•4 years ago
(bugherder uplift)
Comment 28•4 years ago
This is going to be wontfix for 95 given that it caused regressions.
Comment 29•4 years ago
Backing out the NSS patch per bug 1745043 comment 4 to see if it helps with those crashes.
https://hg.mozilla.org/mozilla-central/rev/a2e05f530d49fc02680eafeabaaf8acbde6c4d7b
Comment 30•4 years ago
Hi Valentin, should we then file a new bug for the NSS / HTTP channel patch to investigate that further? It still seems like the right thing to do, AFAICS. There might be other caveats, but the crash from bug 1745043 could be easy to fix, see bug 1745043 comment 5.
(Assignee)
Comment 31•4 years ago
Definitely. I filed bug 1745925 and will move attachment 9254073 [details] to that bug.
Comment 32•4 years ago
Adding to comment 29, and given that there are no more crashes in the latest Nightly builds, backing out the NSS patch from beta as well:
https://hg.mozilla.org/releases/mozilla-beta/rev/eb59149ac90f87fa326eb06b1ce2b780ca7747f4
Comment 34•4 years ago
Comment on attachment 9254072 [details]
Bug 1738984 - Prevent captive portal service from working post-shutdown r=#necko
Approved for a 95 dot release, thanks.
Comment 35•4 years ago
(bugherder uplift)
Comment 38•2 years ago
Comment on attachment 9254073 [details]
Bug 1738984 - Prevent NSS initialization and channel creation during shutdown r=keeler!
Revision D133065 was moved to bug 1713749. Setting attachment 9254073 [details] to obsolete.