Crash in [@ shutdownhang | NtFlushBuffersFile]
Categories
(Toolkit :: Storage, defect)
People
(Reporter: aryx, Assigned: keeler)
Details
(Keywords: crash)
Attachments
(2 files, 1 obsolete file)
18 crashes on 10+ machines, all with Firefox 89.0a1 on Windows 10; the oldest reported build ID is 20210412213434. Is this increased volume of crash reports from the WER changes (bug 1682516 etc.)? The signature had been around before but with fewer reports per release cycle.
Crash report: https://crash-stats.mozilla.org/report/index/124f4721-4c03-4d8d-8da3-cb4a80210413
MOZ_CRASH Reason: Shutdown hanging at step profile-before-change. Something is blocking the main-thread.
Top 10 frames of crashing thread:
0 ntdll.dll NtFlushBuffersFile
1 kernelbase.dll FlushFileBuffers
2 nss3.dll winSync third_party/sqlite3/src/sqlite3.c:45165
3 nss3.dll sqlite3PagerCommitPhaseOne third_party/sqlite3/src/sqlite3.c:58723
4 nss3.dll sqlite3BtreeCommitPhaseOne third_party/sqlite3/src/sqlite3.c:69003
5 nss3.dll sqlite3VdbeHalt third_party/sqlite3/src/sqlite3.c:81496
6 nss3.dll sqlite3VdbeExec third_party/sqlite3/src/sqlite3.c:89413
7 nss3.dll sqlite3_step third_party/sqlite3/src/sqlite3.c:84388
8 nss3.dll sqlite3_exec third_party/sqlite3/src/sqlite3.c:125282
9 softokn3.dll sdb_init security/nss/lib/softoken/sdb.c:2319
bp-480a24dc-6091-484f-b2a9-296830210415 has one more frame near the start:
ntdll.dll NtFlushBuffersFile
xul.dll `anonymous namespace'::InterposedNtFlushBuffersFile(void*, _IO_STATUS_BLOCK*) xpcom/build/PoisonIOInterposerWin.cpp:365
kernelbase.dll FlushFileBuffers
Comment 1•3 years ago
The stacks look like they're mostly from nss/psm. Any idea what's going on here?
Assignee
Comment 2•3 years ago
I looked at a few of these reports - they all seemed to be of softoken trying to do some sqlite operations. These operations (e.g. opening the database) should be fast (they absolutely should not be taking 10 minutes). One thing I've seen is if another process also has the cert/key databases open, it can make softoken very slow. Unfortunately this is a "supported" "feature" of softoken. Another thing I've heard of is if some antivirus/auditing software is interposing reads/writes, it again can make softoken quite slow. Unless this is a new bug in sqlite, I'm not sure what we can do here.
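As a rough illustration of how a second opener could stall softoken, the sketch below uses Python's sqlite3 module rather than NSS itself. It assumes lock contention is the mechanism, which the reports here do not confirm; the file and table names are made up for the demo. One connection holds an exclusive transaction while a second waits out a short busy timeout and then gives up.

```python
import os
import sqlite3
import tempfile


def demo_lock_contention():
    """Return what happens when a second connection wants the write lock."""
    # Hypothetical scratch database; not an NSS file.
    path = os.path.join(tempfile.mkdtemp(), "demo.db")

    # isolation_level=None puts the connection in autocommit mode, so the
    # explicit BEGIN/ROLLBACK statements below are sent to SQLite as written.
    holder = sqlite3.connect(path, isolation_level=None)
    holder.execute("CREATE TABLE t (x INTEGER)")
    holder.execute("BEGIN EXCLUSIVE")  # stands in for the other software
    holder.execute("INSERT INTO t VALUES (1)")

    # The waiter retries for 0.2 seconds before SQLite reports it as busy.
    waiter = sqlite3.connect(path, timeout=0.2, isolation_level=None)
    try:
        waiter.execute("BEGIN IMMEDIATE")
        waiter.execute("ROLLBACK")
        outcome = "acquired"
    except sqlite3.OperationalError as exc:
        outcome = str(exc)
    finally:
        holder.execute("ROLLBACK")
        holder.close()
        waiter.close()
    return outcome


print(demo_lock_contention())
```

With a long or unbounded busy timeout the waiter would block instead of failing, which is closer to what a hang at shutdown would look like.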
Comment 3•3 years ago
Dana, Benjamin, these are shutdown hangs, but their volume has exploded on beta this cycle. Is there really nothing happening on our side that we should fix?
Assignee
Comment 4•3 years ago
Maybe backing out bug 1717559 (and potentially bug 1698592) would help us see if the sqlite upgrade(s) caused this?
Comment 5•3 years ago
The timing of the spikes doesn't seem to correlate with the SQLite upgrades from what I can see. The 3.35.4 update landed on April 4, but the first tick up in crashes here didn't appear until about a week and a half later. The 3.36.0 update landed on June 22, which again doesn't seem to fit well with the June 10 and July 9 spikes.
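The date arithmetic for the 3.36.0 update can be checked with a few lines of Python. The year 2021 is an assumption inferred from the build ID 20210412213434 in the original report; the thread itself gives only months and days.

```python
from datetime import date

YEAR = 2021  # assumption: inferred from build ID 20210412213434

landed_3_36_0 = date(YEAR, 6, 22)
spikes = [date(YEAR, 6, 10), date(YEAR, 7, 9)]


def days_after(landing, event):
    """Days from landing to event; negative means the event came first."""
    return (event - landing).days


offsets = [days_after(landed_3_36_0, spike) for spike in spikes]
print(offsets)  # [-12, 17]
```

The first spike precedes the landing by 12 days, so that upgrade cannot explain it, and the second spike trails the landing by 17 days.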
Assignee
Comment 6•3 years ago
Some crash reports appear to be indicating that initializing NSS' certificate
and key databases is taking on the order of minutes in some cases, which is
unexpected. One hypothesis is that third-party software is opening these DBs at
the same time that NSS is operating on them, causing contention and thus
slowness. This patch experimentally (in Nightly only) renames these DBs in the
hopes that third-party software might not recognize them as the DBs it's
looking for, and will thus leave them alone.
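A minimal sketch of what such a rename step could look like, written in Python for brevity rather than taken from the patch. cert9.db and key4.db are the usual names of NSS's SQLite certificate and key databases; the replacement names and the function name are hypothetical, and the landed change may work quite differently.

```python
import os

# Keys are the usual NSS file names; values are hypothetical new names.
RENAMES = {
    "cert9.db": "cert9.renamed.db",
    "key4.db": "key4.renamed.db",
}


def migrate_db_names(profile_dir, renames):
    """Rename each database that exists and has not been migrated yet.

    Returns the list of old names that were actually renamed.
    """
    moved = []
    for old_name, new_name in renames.items():
        src = os.path.join(profile_dir, old_name)
        dst = os.path.join(profile_dir, new_name)
        # Skip files that are missing, and never overwrite an existing target.
        if os.path.exists(src) and not os.path.exists(dst):
            os.replace(src, dst)
            moved.append(old_name)
    return moved
```

Passing the mapping as an argument means the same function can undo the migration when given the inverted mapping.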
Pushed by dkeeler@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/4b6b66ff77ea "hide" NSS DBs from meddling third party software r=jschanck,bbeurdouche
Comment 8•3 years ago
Backed out for causing failures in test_refresh_firefox.py.
Backout link: https://hg.mozilla.org/integration/autoland/rev/72594a9e5b1f6aa8f494d77661072900d83969c1
Failure log: https://treeherder.mozilla.org/logviewer?job_id=353007593&repo=autoland&lineNumber=85782
Pushed by dkeeler@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/ee866eada1ad "hide" NSS DBs from meddling third party software r=jschanck,bbeurdouche
Comment 10•3 years ago
Backed out changeset ee866eada1ad (Bug 1705360) for causing talos failures.
Backout link
Push with failures - x
Failure Log
Comment 11•3 years ago
Pushed by dkeeler@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/afd3d1fef036 "hide" NSS DBs from meddling third party software r=jschanck,bbeurdouche,perftest-reviewers,sparky
Comment 12•3 years ago
Backed out for causing Android Btime failures
Backout link: https://hg.mozilla.org/integration/autoland/rev/1a7d94a7a1e6d83d7b8ad4f077683ded4bf1d893
Log link: https://treeherder.mozilla.org/logviewer?job_id=353412164&repo=autoland&lineNumber=2163
Comment 13•3 years ago
Pushed by dkeeler@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/b5a5e31457dc "hide" NSS DBs from meddling third party software r=jschanck,bbeurdouche,perftest-reviewers,sparky
Comment 14•3 years ago
bugherder
Comment 15•3 years ago
The patch landed in nightly and beta is affected.
:keeler, is this bug important enough to require an uplift?
If not, please set status_beta to wontfix.
For more information, please visit auto_nag documentation.
Assignee
Comment 16•3 years ago
What landed is a Nightly-only experiment to see if it helps address the issue, so this shouldn't be uplifted.
Assignee
Comment 17•3 years ago
According to crash reports, obfuscating the NSS DB locations did not help, so
this patch un-does the changes and un-migrates any migrated DB locations.
Comment 18•3 years ago
Pushed by dkeeler@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/cc14f4a79aba un-do nightly experiment obsfucating NSS DB locations r=jschanck,perftest-reviewers,AlexandruIonescu
Comment 19•3 years ago
bugherder
Comment 20•3 years ago
(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) (out of office) from comment #17)
Created attachment 9247361 [details]
Bug 1705360 - un-do nightly experiment obsfucating NSS DB locations r?jschanck
According to crash reports, obfuscating the NSS DB locations did not help, so
this patch un-does the changes and un-migrates any migrated DB locations.
So I assume we should re-open this bug? Or shall we continue work in the new bug 1738984?
Assignee
Comment 21•3 years ago
Yes, I meant to re-open this when I landed the second patch.
Comment 22•3 years ago
Curiously, of the 80,000+ crash reports for NtFlushBuffersFile shutdown hangs over the last six months, only two had Fission enabled, even though Fission has been enabled for about 60% of Nightly, 30% of Beta, and 0.5% of Release 92 during that period.
The Fission crash reports' stack traces don't include sqlite, so they might be unrelated. In that case, zero of 80,000+ crash reports had Fission enabled.
bp-1855ff19-f2a8-46fc-a7a5-e1b750211102
bp-35bdad88-1c7b-4a7d-b03a-59d4b0210924
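To put a number on how surprising two reports is, here is a back-of-the-envelope calculation in Python. It rests on assumptions that are not in the crash data: that hangs are independent of Fission, that every report came from Release (the channel with the lowest Fission share, so the most conservative case), and that a Poisson model is adequate.

```python
import math

TOTAL_REPORTS = 80_000        # "80,000+" in the comment, so a lower bound
LOWEST_FISSION_SHARE = 0.005  # Release 92: 0.5% of users with Fission
OBSERVED = 2                  # reports that had Fission enabled

# Expected number of Fission reports if Fission made no difference.
expected = TOTAL_REPORTS * LOWEST_FISSION_SHARE


def poisson_cdf(k, mean):
    """P(X <= k) for a Poisson-distributed X with the given mean."""
    terms = sum(mean**i / math.factorial(i) for i in range(k + 1))
    return math.exp(-mean) * terms


p_at_most_observed = poisson_cdf(OBSERVED, expected)
print(f"expected about {expected:.0f}, observed {OBSERVED}")
print(f"P(at most {OBSERVED}) = {p_at_most_observed:.3g}")
```

Even under the most conservative channel mix, roughly 400 Fission reports would be expected, so seeing two or fewer is effectively impossible by chance under these assumptions. Any Nightly or Beta share in the reports only raises the expected count.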
Comment 23•3 years ago
Also, a bunch of other shutdownhang | NtXxxxxx signatures seem to disappear with Fission. About 10 of the top 50 crashes in 94 in the last 3 days were Nt shutdownhangs; for the same period with Fission (enabled 2 days ago, up to 16 million users today) there are no Nt shutdownhangs.
Assignee
Comment 24•3 years ago
Comment 25•2 years ago
It's looking like the fix for bug 1738984 resolved this. No crash reports in 96.0b3+. Dana, are you good calling this bug fixed by that or are there changes you still want to land in this bug?
Assignee
Comment 26•2 years ago
Awesome! I don't think we need to make any more changes here.