Closed Bug 1738984 Opened 4 years ago Closed 4 years ago

Increase in Windows shutdown crashes starting around 2021-10-27, e.g. Crash in [@ shutdownhang | NtFlushBuffersFile]

Categories

(Core :: XPCOM, defect, P1)

Unspecified
Windows
defect

Tracking

()

RESOLVED FIXED
97 Branch
Tracking Status
firefox-esr91 --- wontfix
firefox95 + fixed
firefox96 + fixed
firefox97 + fixed

People

(Reporter: aryx, Assigned: valentin)

References

Details

(Keywords: crash, topcrash)

Crash Data

Attachments

(1 file, 1 obsolete file)

Jens, do you have insight why this signature [started to spike]8https://crash-stats.mozilla.org/signature/?product=Firefox&signature=shutdownhang%20%7C%20NtFlushBuffersFile&date=%3E%3D2021-08-02T16%3A18%3A00.000Z&date=%3C2021-11-02T16%3A18%3A00.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_columns=startup_crash&_sort=-date&page=2#graphs9 (3.5-4x previous level) around 2021-10-27 but on release? Bug 1705360 saw a change but that was for Nightly (95.0a1) and other signatures are also seeing an increase. Either people crash more, submit more or we process more.

Crash report: https://crash-stats.mozilla.org/report/index/8b3e2d08-629d-4729-90e3-082930211102

MOZ_CRASH Reason: Shutdown hanging at step quit-application. Something is blocking the main-thread.

Top 10 frames of crashing thread:

0 ntdll.dll NtFlushBuffersFile 
1 KERNELBASE.dll FlushFileBuffers 
2 nss3.dll winSync third_party/sqlite3/src/sqlite3.c:45420
3 nss3.dll syncJournal third_party/sqlite3/src/sqlite3.c:56920
4 nss3.dll sqlite3PagerCommitPhaseOne third_party/sqlite3/src/sqlite3.c:59166
5 nss3.dll sqlite3BtreeCommitPhaseOne third_party/sqlite3/src/sqlite3.c:69499
6 nss3.dll sqlite3VdbeHalt third_party/sqlite3/src/sqlite3.c:82024
7 nss3.dll sqlite3VdbeExec third_party/sqlite3/src/sqlite3.c:89967
8 nss3.dll sqlite3_step third_party/sqlite3/src/sqlite3.c:84918
9 nss3.dll sqlite3_exec third_party/sqlite3/src/sqlite3.c:126190
Flags: needinfo?(jstutte)
See Also: → 1705360

Hi Dana, do you have any idea what could have caused this raise here?

Hi Aryx, do I understand right that we see a general increase in crashes or is it limited to just some signatures? Is this crash particularly outstanding here?

Flags: needinfo?(jstutte)
Flags: needinfo?(dkeeler)
Flags: needinfo?(aryx.bugmail)

Most signatures but not all with shutdownhang at the start saw the crash volume increase: check the first column rank and the red highlights for big moves in the ranking.

Flags: needinfo?(aryx.bugmail)

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #2)

Most signatures but not all with shutdownhang at the start saw the crash volume increase: check the first column rank and the red highlights for big moves in the ranking.

It seems we are back to the previous level more or less. I can only suspect that there was a bigger Windows update occurring often those days while people wanted to close there machines causing heavy IO and thus shutdown hangs? This would at least explain somehow why mostly only shutdown hangs were increasing.

In the meantime we had also our 94 release, so it is actually hard to tell what makes the difference, though.

https://docs.microsoft.com/en-us/windows/release-health/release-information mentions latest OS revisions for Windows 10 released on 2021-10-26, crash stats doesn't expose the last part of the version except in the raw data.

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #5)

https://docs.microsoft.com/en-us/windows/release-health/release-information mentions latest OS revisions for Windows 10 released on 2021-10-26, crash stats doesn't expose the last part of the version except in the raw data.

The dates match perfectly so I tend towards saying, that this was the reason. If so I do not see much that we can do here if not recommending our users to explicitly quit Firefox before installing Windows updates.

Seems to be similar to bug 1705360 and bug 1712118, but we still don't know what's going on here.

Flags: needinfo?(dkeeler)

The volume of this crash is still quite high. In the comments, some people are talking about Firefox crashing while they are using it, but there are also an odd number of comments talking about how they didn't even notice a crash, but they got this dialog.

An instance of this last thing is this crash bp-48031eb1-0766-4ab6-b08a-a34220211201 with this comment: "I often get this pop-up but I'm not aware of my browser crashing."

bp-1fa0773e-f077-4f8f-9238-f0d700211206 In this crash, we're shutting down (in profile before change), spinning the event loop as part of CookiePersistentStorage::Close(), then the hang in the middle of a captive portal check trying to sync some file to disk.

bp-1eac166f-e426-4cea-9cb8-0d7d30211205 This crash has the comment: "Installed Windows 11 a day ago. Completely shut down my desktop last night and restarted this morning first thing. So, then, this crash happened." The stack is similar to the previous one, except instead of the cookie storage thing doing the observation, it is something written in JS that ends up spinning the event loop.

The volume jumped from 500-1000/day to multiple thousands per day around Oct 26th or so. We heard there was a Windows update around that time. However there are crashes on win7/8/10 and 11. Just the one most common signature for this issue has had 16000 crashes in the last 7 days, and about 1/4 of the top 50 crash signatures are from this issue, including #2.

The volume had gone up from low 100's (from the various signatures) to the 500-1000 level from June to October.

About 50% of the crashes are in profile-before-change, and about 50% in quit-application (and a few profile-before-change-qm, and a handful with no shutdown-process attribution).

They all show up as non-fission crashes, however this might be because we never opened a window and set the fission attribution. If so, that would point to a process that's started up in order to notify a running firefox to browse to a url. I should note that the uptime of these crashes is typically in the 100's of seconds, and a few measured in days - perhaps it was just hung there for a long time until the user or the OS did something to trigger a quit.

Also: virtually all the stacks have NSS initialization, called from CaptivePortal initialization (except those where the stack backtrace failed before we got back that far). Some show us clearly in shutdown; others don't (though all appear to be in shutdown per above).

Captive Portal seems to be initialized off a timer; in a number of crash reports we see a thread in ::Shutdown spinning the event loop and running a timer which tries to start the captive portal.

There likely are better ways to solve this, but a simple quick failure on NSS init when shutdown has started may be enough to stop the hangs.

In discussion with nika, cpeterson, mccr8 we mentioned that we likely shouldn't start captive portal if this instance of Firefox is going to notify an existing firefox and exit. (Though why it didn't notify it and exit is a separate question, assuming that's what's happening).

Flags: needinfo?(valentin.gosu)
Flags: needinfo?(dkeeler)

I'd propose to add the following to nsNSSComponent::Init():

  // Avoid a late initialization
  if (AppShutdown::IsInOrBeyond(ShutdownPhase::AppShutdownNetTeardown)) {
    return NS_ERROR_ILLEGAL_DURING_SHUTDOWN;
  }

(I cannot try this immediately myself due to bug 1744665)

Please note that NSS already checks AppShutdown::IsShuttingDown() in some (few) other places, which is equivalent to the earlier AppShutdown::IsInOrBeyond(ShutdownPhase::AppShutdownConfirmed). We might want to try if we can avoid the initialization earlier, too, but I'd expect to break some tests if we do this (actually I would not even be surprised if we see some breakage already with ShutdownPhase::AppShutdownNetTeardown, but that would be clearly something to avoid then).

Assignee: nobody → valentin.gosu
Status: NEW → ASSIGNED

I did a couple of try pushes: just the captive portal fix and one with Jen's suggestions to prevent NSS initialization during shutdown

Flags: needinfo?(valentin.gosu)

(In reply to Valentin Gosu [:valentin] (he/him) from comment #13)

one with Jen's suggestions to prevent NSS initialization during shutdown

That seems to break some 300+ tests... We might want to try with later phases like ShutdownPhase::XPCOMWillShutdown which is the first after telemetry shut down.

What ShutdownPhase does quit-application correspond to? I'm not sure how those relate.

I have a fixed version coming up here.
Arguably we shouldn't be creating new channels during shutdown either.

(In reply to Andrew McCreight [:mccr8] from comment #15)

What ShutdownPhase does quit-application correspond to? I'm not sure how those relate.

The mapping is defined here. "quit-application" <-> AppShutdownConfirmed

Attachment #9254073 - Attachment description: Bug 1738984 - Prevent NSS initialization during shutdown r=keeler → Bug 1738984 - Prevent NSS initialization and channel creation during shutdown r=keeler!

(In reply to Jens Stutte [:jstutte] from comment #17)

(In reply to Andrew McCreight [:mccr8] from comment #15)

What ShutdownPhase does quit-application correspond to? I'm not sure how those relate.

The mapping is defined here. "quit-application" <-> AppShutdownConfirmed

Note that this mapping is currently "parent process centric". We might want to look at bug 1697745 to improve things further here.

This approach sounds good. I was waiting to see if fission just fixed this, but evidently it doesn't.

Flags: needinfo?(dkeeler)

[Tracking Requested - why for this release]: Number 5 top crash on release. Maybe higher if you combine the signatures.

Pushed by valentin.gosu@gmail.com: https://hg.mozilla.org/integration/autoland/rev/d479dee92656 Prevent captive portal service from working post-shutdown r=necko-reviewers,dragana https://hg.mozilla.org/integration/autoland/rev/56ae8059453e Prevent NSS initialization and channel creation during shutdown r=keeler

Tracking as a ride along for a potential dot release in 95 given that this is a high volume crash. Valentin, please request uplift to beta and release after this has landed on mozilla-central. Thanks.

Flags: needinfo?(valentin.gosu)
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 97 Branch

Comment on attachment 9254073 [details]
Bug 1738984 - Prevent NSS initialization and channel creation during shutdown r=keeler!

Beta/Release Uplift Approval Request

  • User impact if declined: Large number of crashes (possibly in background task) caused by NSS initialization during shutdown.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce: Watching the crash rate should confirm if this patch works as expected
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): These patches ensure that the captive portal is properly stopped when the shutdown notification is received, and that no Necko channels are created after that point or NSS initialization.
  • String changes made/needed:
Flags: needinfo?(valentin.gosu)
Attachment #9254073 - Flags: approval-mozilla-release?
Attachment #9254073 - Flags: approval-mozilla-beta?
Attachment #9254072 - Flags: approval-mozilla-release?
Attachment #9254072 - Flags: approval-mozilla-beta?

Comment on attachment 9254073 [details]
Bug 1738984 - Prevent NSS initialization and channel creation during shutdown r=keeler!

Approved for 96.0b3

Attachment #9254073 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Comment on attachment 9254072 [details]
Bug 1738984 - Prevent captive portal service from working post-shutdown r=#necko

Approved for 96.0b3

Attachment #9254072 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Regressions: 1745043

This is going to be wontfix for 95 given that it caused regressions.

Hi Valentin, should we have a new bug then for the NSS / HTTP Channel patch to investigate that further? It still seems like the right thing to do, AFAICS. There might be other caveats, but the crash from bug 1745043 could be easy to fix, see bug 1745043 comment 5.

Flags: needinfo?(valentin.gosu)
Depends on: 1745925

Definitely. I filed bug 1745925 and will move attachment 9254073 [details] to that bug.

Flags: needinfo?(valentin.gosu)

Adding to comment 29 and the fact that there are no more crashes in latest nightly builds, backing out the NSS patch from beta as well

https://hg.mozilla.org/releases/mozilla-beta/rev/eb59149ac90f87fa326eb06b1ce2b780ca7747f4

Crash Signature: [@ shutdownhang | NtFlushBuffersFile] → [@ shutdownhang | NtFlushBuffersFile] [@ shutdownhang | (anonymous namespace)::InterposedNtFlushBuffersFile] [@ shutdownhang | ntdll.dll | (anonymous namespace)::InterposedNtFlushBuffersFile]

Comment on attachment 9254072 [details]
Bug 1738984 - Prevent captive portal service from working post-shutdown r=#necko

Approved for a 95 dot release, thanks.

Attachment #9254072 - Flags: approval-mozilla-release? → approval-mozilla-release+
Attachment #9254073 - Flags: approval-mozilla-release? → approval-mozilla-release-
Crash Signature: [@ shutdownhang | NtFlushBuffersFile] [@ shutdownhang | (anonymous namespace)::InterposedNtFlushBuffersFile] [@ shutdownhang | ntdll.dll | (anonymous namespace)::InterposedNtFlushBuffersFile] → [@ shutdownhang | NtFlushBuffersFile] [@ shutdownhang | (anonymous namespace)::InterposedNtFlushBuffersFile] [@ shutdownhang | ntdll.dll | (anonymous namespace)::InterposedNtFlushBuffersFile]
No longer depends on: 1745925
See Also: → 1745925
Crash Signature: [@ shutdownhang | NtFlushBuffersFile] [@ shutdownhang | (anonymous namespace)::InterposedNtFlushBuffersFile] [@ shutdownhang | ntdll.dll | (anonymous namespace)::InterposedNtFlushBuffersFile] → [@ shutdownhang | NtFlushBuffersFile] [@ shutdownhang | (anonymous namespace)::InterposedNtFlushBuffersFile] [@ shutdownhang | ntdll.dll | (anonymous namespace)::InterposedNtFlushBuffersFile] [@ shutdownhang | mozilla::EventListenerManager::HandleEventI…
See Also: 1739703
Crash Signature: [@ shutdownhang | NtFlushBuffersFile] [@ shutdownhang | (anonymous namespace)::InterposedNtFlushBuffersFile] [@ shutdownhang | ntdll.dll | (anonymous namespace)::InterposedNtFlushBuffersFile] [@ shutdownhang | → [@ shutdownhang | NtFlushBuffersFile] [@ shutdownhang | (anonymous namespace)::InterposedNtFlushBuffersFile] [@ shutdownhang | ntdll.dll | (anonymous namespace)::InterposedNtFlushBuffersFile] [@ shutdownhang |
See Also: 1705360
Crash Signature: mozilla::EventListenerManager::HandleEventInternal] → mozilla::EventListenerManager::HandleEventInternal] [@ shutdownhang | NtWriteFile]

Comment on attachment 9254073 [details]
Bug 1738984 - Prevent NSS initialization and channel creation during shutdown r=keeler!

Revision D133065 was moved to bug 1713749. Setting attachment 9254073 [details] to obsolete.

Attachment #9254073 - Attachment is obsolete: true
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: