Closed Bug 1705360 Opened 10 months ago Closed 1 month ago

Crash in [@ shutdownhang | NtFlushBuffersFile]

Categories

(Toolkit :: Storage, defect)

Unspecified
Windows 10
defect

Tracking

RESOLVED DUPLICATE of bug 1738984
Tracking Status
firefox-esr78 --- wontfix
firefox-esr91 --- wontfix
firefox89 --- wontfix
firefox90 --- wontfix
firefox91 + wontfix
firefox92 --- wontfix
firefox93 --- wontfix
firefox94 --- wontfix
firefox95 --- fixed
firefox96 --- fixed
firefox97 --- fixed

People

(Reporter: aryx, Assigned: keeler)

Details

(Keywords: crash)

Attachments

(2 files, 1 obsolete file)

18 crashes on 10+ machines, all with Firefox 89.0a1 on Windows 10; the oldest reported build ID is 20210412213434. Is this an increased volume of crash reports from the WER changes (bug 1682516 etc.)? The signature had been around before, but with fewer reports per release cycle.

Crash report: https://crash-stats.mozilla.org/report/index/124f4721-4c03-4d8d-8da3-cb4a80210413

MOZ_CRASH Reason: Shutdown hanging at step profile-before-change. Something is blocking the main-thread.

Top 10 frames of crashing thread:

0 ntdll.dll NtFlushBuffersFile 
1 kernelbase.dll FlushFileBuffers 
2 nss3.dll winSync third_party/sqlite3/src/sqlite3.c:45165
3 nss3.dll sqlite3PagerCommitPhaseOne third_party/sqlite3/src/sqlite3.c:58723
4 nss3.dll sqlite3BtreeCommitPhaseOne third_party/sqlite3/src/sqlite3.c:69003
5 nss3.dll sqlite3VdbeHalt third_party/sqlite3/src/sqlite3.c:81496
6 nss3.dll sqlite3VdbeExec third_party/sqlite3/src/sqlite3.c:89413
7 nss3.dll sqlite3_step third_party/sqlite3/src/sqlite3.c:84388
8 nss3.dll sqlite3_exec third_party/sqlite3/src/sqlite3.c:125282
9 softokn3.dll sdb_init security/nss/lib/softoken/sdb.c:2319

bp-480a24dc-6091-484f-b2a9-296830210415 has one more frame near the start:

ntdll.dll	NtFlushBuffersFile	
xul.dll	`anonymous namespace'::InterposedNtFlushBuffersFile(void*, _IO_STATUS_BLOCK*)	xpcom/build/PoisonIOInterposerWin.cpp:365
kernelbase.dll	FlushFileBuffers	

The stacks look like they're mostly from nss/psm. Any idea what's going on here?

Flags: needinfo?(dkeeler)
Flags: needinfo?(bbeurdouche)

I looked at a few of these reports - they all seemed to be of softoken trying to do some sqlite operations. These operations (e.g. opening the database) should be fast (they absolutely should not be taking 10 minutes). One thing I've seen is if another process also has the cert/key databases open, it can make softoken very slow. Unfortunately this is a "supported" "feature" of softoken. Another thing I've heard of is if some antivirus/auditing software is interposing reads/writes, it again can make softoken quite slow. Unless this is a new bug in sqlite, I'm not sure what we can do here.
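As a rough illustration of the first scenario (another party holding the same database open), here is a minimal sketch using Python's built-in sqlite3 module. It is not softoken's code, which is C and calls SQLite directly; it only shows the general SQLite behavior where a second connection stalls and then fails while someone else holds a lock. The file name `example.db`, the function name, and the 0.2 second busy timeout are all hypothetical.

```python
import os
import sqlite3
import tempfile
import time


def demo_lock_contention(busy_timeout_s=0.2):
    """Return (was_blocked, waited_seconds, ok_after_release)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.db")  # hypothetical file name

        # Stand-in for "other software": holds an exclusive lock on the file.
        holder = sqlite3.connect(path, isolation_level=None)
        holder.execute("CREATE TABLE t (x INTEGER)")
        holder.execute("BEGIN EXCLUSIVE")

        # Stand-in for the application: waits up to busy_timeout_s, then gives up.
        client = sqlite3.connect(path, timeout=busy_timeout_s, isolation_level=None)
        start = time.monotonic()
        try:
            client.execute("BEGIN IMMEDIATE")
            client.execute("COMMIT")
            was_blocked = False
        except sqlite3.OperationalError:  # "database is locked"
            was_blocked = True
        waited = time.monotonic() - start

        # Once the holder releases its lock, the same write goes through.
        holder.execute("COMMIT")
        client.execute("INSERT INTO t VALUES (1)")
        count = client.execute("SELECT COUNT(*) FROM t").fetchone()[0]

        client.close()
        holder.close()
        return was_blocked, waited, count == 1
```

Calling `demo_lock_contention()` should report that the first attempt was blocked and that the write succeeded after the lock was released. How long the blocked attempt actually waits depends on how the SQLite library was built, so the sketch returns the measured time rather than assuming it.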

Flags: needinfo?(dkeeler)

Dana, Benjamin, these are shutdown hangs, but their volume has exploded on beta this cycle. Is there really nothing happening on our side that we should fix?

Flags: needinfo?(dkeeler)

Maybe backing out bug 1717559 (and potentially bug 1698592) would help us see if the sqlite upgrade(s) caused this?

Flags: needinfo?(dkeeler)

The timing of the spikes doesn't seem to correlate with the SQLite upgrades from what I can see. The 3.35.4 update landed on April 4, but the first tick up in crashes here didn't appear until about a week and a half later. The 3.36.0 update landed on June 22, which again doesn't seem to fit well with the June 10 and July 9 spikes.
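The date comparison above can be spelled out with a few lines of Python. This is only a sketch of the arithmetic: the year 2021 is an assumption inferred from the build ID 20210412213434 quoted earlier in this bug, and the function name is hypothetical.

```python
from datetime import date

YEAR = 2021  # assumption: inferred from build ID 20210412213434

# Landing date of the second SQLite update and the spike dates from the comment.
SQLITE_3_36_0_LANDED = date(YEAR, 6, 22)
SPIKES = [date(YEAR, 6, 10), date(YEAR, 7, 9)]


def days_after_landing(landing, spike):
    """Days from landing to spike; negative means the spike came first."""
    return (spike - landing).days


offsets = {s.isoformat(): days_after_landing(SQLITE_3_36_0_LANDED, s) for s in SPIKES}
print(offsets)
```

Under that assumption, one spike falls 12 days before the 3.36.0 landing and the other 17 days after it, which is why neither lines up well with the update.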

See Also: → 1712118

Some crash reports appear to be indicating that initializing NSS' certificate
and key databases is taking on the order of minutes in some cases, which is
unexpected. One hypothesis is that third-party software is opening these DBs at
the same time that NSS is operating on them, causing contention and thus
slowness. This patch experimentally (in Nightly only) renames these DBs in the
hopes that third-party software might not recognize them as the DBs it's
looking for, and will thus leave them alone.

Assignee: nobody → dkeeler
Status: NEW → ASSIGNED
Pushed by dkeeler@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4b6b66ff77ea
"hide" NSS DBs from meddling third party software r=jschanck,bbeurdouche
Pushed by dkeeler@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ee866eada1ad
"hide" NSS DBs from meddling third party software r=jschanck,bbeurdouche

Backed out changeset ee866eada1ad (Bug 1705360) for causing talos failures.
Backout link
Push with failures - x
Failure Log

Pushed by dkeeler@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/afd3d1fef036
"hide" NSS DBs from meddling third party software r=jschanck,bbeurdouche,perftest-reviewers,sparky
Flags: needinfo?(dkeeler)
Flags: needinfo?(bbeurdouche)
Pushed by dkeeler@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b5a5e31457dc
"hide" NSS DBs from meddling third party software r=jschanck,bbeurdouche,perftest-reviewers,sparky
Status: ASSIGNED → RESOLVED
Closed: 4 months ago
Resolution: --- → FIXED
Target Milestone: --- → 95 Branch
Flags: needinfo?(dkeeler)

The patch landed in nightly and beta is affected.
:keeler, is this bug important enough to require an uplift?
If not please set status_beta to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(dkeeler)

What landed is a Nightly-only experiment to see if it helps address the issue, so this shouldn't be uplifted.

According to crash reports, obfuscating the NSS DB locations did not help, so
this patch undoes the changes and un-migrates any migrated DB locations.

Pushed by dkeeler@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/cc14f4a79aba
un-do nightly experiment obsfucating NSS DB locations r=jschanck,perftest-reviewers,AlexandruIonescu

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) (out of office) from comment #17)

Created attachment 9247361 [details]
Bug 1705360 - un-do nightly experiment obsfucating NSS DB locations r?jschanck

According to crash reports, obfuscating the NSS DB locations did not help, so
this patch undoes the changes and un-migrates any migrated DB locations.

So I assume we should re-open this bug? Or shall we continue work in the new bug 1738984?

Flags: needinfo?(dkeeler)
See Also: → 1738984

Yes, I meant to re-open this when I landed the second patch.

Status: RESOLVED → REOPENED
Flags: needinfo?(dkeeler)
Keywords: leave-open
Resolution: FIXED → ---

Curiously, of the 80,000+ crash reports for NtFlushBuffersFile shutdown hangs over the last six months, only two had Fission enabled, even though Fission has been enabled for about 60% of Nightly, 30% of Beta, and 0.5% of Release 92 during that period.

The Fission crash reports' stack traces don't include sqlite, so they might be unrelated. In that case, zero of 80,000+ crash reports had Fission enabled.

bp-1855ff19-f2a8-46fc-a7a5-e1b750211102
bp-35bdad88-1c7b-4a7d-b03a-59d4b0210924
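To get a feel for how unlikely "zero of 80,000+" is, here is a back-of-the-envelope sketch in Python. It rests on deliberately crude assumptions that are not from the crash data: that every report came from a population with the lowest quoted Fission share (the 0.5% Release figure), and that hanging at shutdown is independent of whether Fission is enabled. The function and variable names are hypothetical.

```python
import math

REPORTS = 80_000        # "80,000+ crash reports" from the comment
FISSION_SHARE = 0.005   # assumption: lowest quoted share (0.5% of Release 92)


def expected_and_log10_p_zero(n, p):
    """Expected Fission reports, and log10 of P(no Fission reports at all)."""
    expected = n * p
    # P(zero) = (1 - p) ** n; work in logs to avoid floating-point underflow.
    log10_p_zero = n * math.log1p(-p) / math.log(10)
    return expected, log10_p_zero


expected, log10_p_zero = expected_and_log10_p_zero(REPORTS, FISSION_SHARE)
print(f"expected Fission reports: {expected:.0f}")
print(f"log10 P(zero Fission reports): {log10_p_zero:.0f}")
```

Even under the most conservative share, roughly 400 Fission reports would be expected, so seeing none (or two) suggests the hang is strongly tied to something that differs between Fission and non-Fission sessions rather than to chance.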

Crash Signature: [@ shutdownhang | NtFlushBuffersFile] → [@ shutdownhang | NtFlushBuffersFile] [@ shutdownhang | (anonymous namespace)::InterposedNtFlushBuffersFile] [@ shutdownhang | ntdll.dll | (anonymous namespace)::InterposedNtFlushBuffersFile]

It's looking like the fix for bug 1738984 resolved this. No crash reports in 96.0b3+. Dana, are you good calling this bug fixed by that or are there changes you still want to land in this bug?

Flags: needinfo?(dkeeler)

Awesome! I don't think we need to make any more changes here.

Status: REOPENED → RESOLVED
Closed: 4 months ago → 1 month ago
Flags: needinfo?(dkeeler)
Resolution: --- → DUPLICATE
Duplicate of bug: 1738984
Target Milestone: 95 Branch → ---
Attachment #9251395 - Attachment is obsolete: true
See Also: 1738984