Closed Bug 1186276 Opened 9 years ago Closed 6 years ago

Firefox keeps crashing: shutdown hang in the storage service

Categories

(Toolkit :: Storage, defect, P3)

39 Branch
x86
Windows 7
defect

Tracking


RESOLVED WONTFIX

People

(Reporter: rwnew2012, Unassigned)

References

Details

Crash Data

User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0
Build ID: 20150630154324

Steps to reproduce:

After doing a standalone Adobe Flash update outside of the Firefox browser, Firefox kept crashing.


Actual results:

After the Adobe Flash update, Firefox kept crashing. I didn't realize that the standalone update outside of the browser caused the problem until I did the same kind of update on two other computers I have. I couldn't even use the browser, and found out by doing a search that you can refresh the browser by holding the Shift key down while clicking on Firefox. I did refresh the browser, but I still had problems with Firefox starting. I think something is wrong with the code in a Firefox update and that there has been some kind of bug since the last update.


Expected results:

Firefox should have been fixed after the Refresh, but it has remained buggy: it works for a while and then crashes, and another Refresh is needed. The Refresh only works for a while, when it really should have solved the problem as it has in the past.
Removing security sensitive.

Reporter: Can you please type about:crashes in the URL bar and paste the crash report IDs in this bug?
Group: core-security
Flags: needinfo?(rwnew2012)
I don't have any URL except one, http://www.doityourself.com/forum, where the browser would go to the top of the page on its own after scrolling down to the bottom. Other times it would just hang and crash and not fully load. A friend of mine who uses the handle Chandler on that website has had the same problems, and I advised people not to use Firefox temporarily until the problem is resolved. The thread can be found easily in the Computers section of the forum. It is definitely related somehow to Adobe Flash and the standalone, outside-of-browser update, at least for me. There was no actual problem during the update, and no viruses or malware either. Other people on the forum don't seem to be having as many problems, or any at all. Sorry I can't really tell you much more than that.
Iteration: --- → 42.2 - Jul 27
QA Whiteboard: bp-0824f04e-6c73-44e2-902b-8fdab2150721
Flags: needinfo?(rwnew2012)
Whiteboard: This happened after a stand alone update of flash which shouldn't have happened as I have done this before but flash is definitely involved
I added the crash signature to the appropriate field. I am also changing a few of the other fields since (1) the status whiteboard is meant for brief notes and (2) the iteration field is not used here.

The stack that you put in the whiteboard is actually the crash report I was looking for: bp-0824f04e-6c73-44e2-902b-8fdab2150721
Status: UNCONFIRMED → NEW
Crash Signature: [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::storage::Service::Observe(nsISupports*, char const*, wchar_t const*) ]
Iteration: 42.2 - Jul 27 → ---
QA Whiteboard: bp-0824f04e-6c73-44e2-902b-8fdab2150721
Ever confirmed: true
Whiteboard: This happened after a stand alone update of flash which shouldn't have happened as I have done this before but flash is definitely involved
I apologize; I have never used Bugzilla before and wanted to make sure this doesn't happen again, as I really like Firefox. As I said before, I have several computers, and I have been using one of the affected laptops for several hours with Firefox as my primary browser. I still have problems at the website I mentioned with the page going to the top, but so far the browser has not crashed, nor has it refused to go to a web page. I haven't had a chance to check my other laptop or my desktop with Firefox. I hope the information I sent was helpful. Thank you for your help!
This is a hang on shutdown; the storage service is stuck with connections that are closing but not yet closed:

http://hg.mozilla.org/releases/mozilla-release/annotate/d3b3e57e8088/storage/src/mozStorageService.cpp#l943
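For reference, a minimal sketch of the shape of that spin (simplified from the linked code; names and structure are approximate, not the exact tree source):

// Rough sketch of the "xpcom-shutdown-threads" handling in
// mozilla::storage::Service::Observe: spin the main-thread event loop
// until every connection that started an async close has finished.
bool anyClosing = true;
while (anyClosing) {
  anyClosing = false;
  for (auto& conn : connections) {     // connections known to the service
    if (conn->isClosing()) {           // close requested but not completed
      anyClosing = true;
      break;
    }
  }
  if (anyClosing) {
    // Let the close-completion events run. If a connection never finishes
    // closing, we spin here until the shutdown hang detector kills the
    // process, which is the crash signature tracked by this bug.
    NS_ProcessNextEvent(mainThread);
  }
}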
Component: Untriaged → Storage
Product: Firefox → Toolkit
Summary: Firefox keeps crashing → Firefox keeps crashing: shutdown hang in the storage service
It hung while starting up too and wouldn't go to web pages, so I had to get Windows to shut Firefox down because it would not shut down normally.
Severity: normal → major
OS: Unspecified → Windows 7
Hardware: Unspecified → x86
Crash Signature: [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::storage::Service::Observe(nsISupports*, char const*, wchar_t const*) ] → [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::storage::Service::Observe(nsISupports*, char const*, wchar_t const*) ] [@ shutdownhang | …
Blocks: 1259890
Crash Signature: , wchar_t const*) ] [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | nsThread::ProcessNextEvent | NS_ProcessNextEvent | mozilla::storage::Service::Observe ] → , wchar_t const*) ] [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | nsThread::ProcessNextEvent | NS_ProcessNextEvent | mozilla::storage::Service::Observe ] [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait…
The combined crash-stats volume for these signatures is quite high.
It appears we're busy waiting in this loop:
http://hg.mozilla.org/mozilla-central/annotate/1da1937a9e03/storage/mozStorageService.cpp#l937
I suspect there are connections that simply never close as we expect them to (it's unclear to me exactly why we expect them to).

Benjamin, do you know someone who can work on this?
Flags: needinfo?(benjamin)
Redirecting to ckarlof, because I believe his team has responsibility for storage.
Flags: needinfo?(benjamin) → needinfo?(ckarlof)
Storage isn't part of my team (at least yet). :mak is listed as the module owner. :mak, can you help with this?
Flags: needinfo?(ckarlof) → needinfo?(mak77)
We wait for the db connections because we don't want them to survive past xpcom-shutdown-threads, and we don't want to interrupt the closing procedure abruptly, to avoid data loss or corruption.
The hang is doing its job properly, showing that something is wrong, though I can't tell whether we are blocked in the loop or whether there is thread contention on the isClosing() mutex acquisition.
We could add a timeout, but then we would never have known there is a problem here.
It's a pity we don't know which connection was the problematic one, nor do we know how to reach that state.

Andrew, any idea off-hand?
The stack seems to be in NS_ProcessNextEvent, which more or less rules out contention, I think.

Benjamin, how long is the hang timeout?
Flags: needinfo?(mak77) → needinfo?(bugmail)
60 seconds
The hang from comment 3 has someone actively using SQLite, with SQLite blocked on reads in thread 25.  The stack is a little short, so it's not clear who is using it, although there's a chance it's just async mozStorage use from the main thread, which wouldn't tell us anything useful either.  It looks somewhat like bug 1143866, where the scenario seems to just be that I/O can be slow on Windows, especially with antivirus active.  But looking at the other stacks, it seems like SQLite may be making things worse with its backoff logic when it contends with antivirus, and mozStorage's infinite-retry-on-busy logic also may not help.  Also, mozStorage could just stop spinning at shutdown.

Aside: I am a little concerned that the stackwalks all seem to be saying the SQLite symbols are coming from nss3.dll.  Hopefully that's just a problem of there being 2 mappings [addr, symbol], [symbol, dll] where it's the second mapping getting tricked by xul.dll and nss3.dll having the same symbols.

== Survey of Reports ==

A quick survey of what we're waiting on and sample reports:

* winRetryIoErr: This is a linear backoff loop.  The initial delay is 25ms, growing by 25ms each time with a max of 10 retries, so the total retry wait before giving up is 1.375 seconds (see the backoff sketch after this list).  When the retries are exhausted this is reported as SQLITE_IOERR_READ, which is what AsyncExecuteStatements sees.  The expired MOZ_STORAGE_ASYNC_REQUESTS_SUCCESS probe last told us, in Firefox 39 Release, that we observed 15,505,630 errors in general at a rate of 0.0012%.  Maybe we want that probe back as a non-boolean that encodes the error code returned.  The now-expired VFS probes in 39 tell us that the (non-main-thread) MOZ_SQLITE_PLACES_READ_MS 1.114sec-3sec bucket for reads was hit 3,604,385 times and the 3sec+ bucket 676,129 times.  The (non-main-thread) MOZ_SQLITE_OTHER_READ_MS for non-places/webapps hit the 1.114-3sec bucket 280,680 times and 3sec+ 80,724 times.  Which is to say, it's conceivable we've done super-long waits.
** bp-0d429f88-1df6-406b-97a5-5ace92160421 Fx38 with Sophos
** bp-dfcad6b6-981e-4398-b8d8-c9c842160420 Fx43 with no immediately obvious AV stuff in red in the stacks, but I sorta suspect it

* winLock: A less costly loop that will sleep for 1ms two times.  This manifests as SQLITE_BUSY, which can be bad because AsyncExecuteStatements::executeStatement never gives up when it sees SQLITE_BUSY; it just loops around again.  There is no telemetry for this specifically, but the delay would have shown up in the expired MOZ_STORAGE_ASYNC_REQUESTS_SUCCESS telemetry probe (true case).
** bp-12cab2ee-de74-40d9-83ba-5d1142160421 Fx38 with Sophos
** bp-a9345b59-26b2-4de9-add9-5565f2160421 Fx43 with Sophos

* walTryBeginRead: Serious, eventually quadratic backoff; it retries up to 100 times, and the 100th retry sleeps for 323ms.  The comment did the math for us: the total delay is "less than 10 seconds", which seems long to me.  The error on failure is SQLITE_PROTOCOL, which is going to look the same as SQLITE_IOERR_READ as far as mozStorage cares.  If it doesn't error out and encounters some other contention issues, it may return WAL_RETRY, which can cause the wait process to begin again; I expect we don't need to worry about that case since most of those would seem to involve other concurrent WAL consumers on other threads, and I don't think we have those.
** bp-cc111d35-2574-4eff-ada5-839382160421 Fx43 with Symantec
** bp-13a51c0d-4a63-4ad7-b085-f200d2160421 Fx41 with Novell remote FS stuff? (nwsrvloc.dll)
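For concreteness, the retry budgets described above work out as follows (illustrative arithmetic only, not SQLite's actual code):

// Illustrative only: the backoff timings described in the survey above.
// winRetryIoErr: linear backoff, first delay 25ms, growing by 25ms per
// retry, capped at 10 retries.
int totalLinearMs = 0;
for (int retry = 1; retry <= 10; ++retry) {
  totalLinearMs += 25 * retry;         // 25, 50, ..., 250
}
// totalLinearMs == 1375, i.e. the 1.375 seconds quoted above.
//
// walTryBeginRead: the delay grows roughly quadratically over up to 100
// retries (the 100th sleep is ~323ms), for a total of "less than 10
// seconds" per the SQLite comment, before SQLITE_PROTOCOL is returned.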

== The Problem, restated ==

We're facing read/lock contention and sleeps are trying to give whatever we're contending with a chance to stop contending with us.

== Options ==

Understand / mitigate the delays:
* Work with SQLite team to tune/allow configurable backoff heuristics.  Assuming our contention is primarily I/O bound, once we're sleeping for longer than a fair time slice, we risk wasting time.
* Get more telemetry data by:
** Having TelemetryVFS collect counters and error return codes for xLock
** Have async execution log the error status codes from stepping.
* More aggressively hold locks on files to keep antivirus software out of our business.  In no case do we expect/allow concurrent manipulation of our databases by external processes, and I don't think there is any code in gecko that needs to use filesystem locking to communicate internally.  We could have our win VFS keep and hold locks on the files until closed to avoid letting the antivirus software in.  It's possible this will just cause the AV software to become more aggressive and/or just cause our reads/writes to block.
* Try to use some affordance to keep the antivirus software out of our SQLite databases entirely, like getting them whitelisted or excluded?  In theory, malware could leverage this, so maybe there's no way to do it, etc.
* Move to a parallel universe where there's no antivirus software.
* Have the mozStorage async BUSY spinner introduce a failsafe counter so it gives up eventually; a minimal sketch follows this list.  (Places might need the failsafe counter cranked way up if it still uses a separate main-thread connection?)
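A minimal sketch of that failsafe-counter idea (hypothetical names; the real loop lives in AsyncExecuteStatements::executeStatement):

// Hypothetical sketch of a capped SQLITE_BUSY retry loop; kMaxBusyRetries
// and the surrounding structure are illustrative, not existing mozStorage
// code.
const int kMaxBusyRetries = 100;       // would need tuning; higher for Places?
int busyRetries = 0;
int rc;
while (true) {
  rc = sqlite3_step(stmt);
  if (rc != SQLITE_BUSY) {
    break;                             // finished, or a real error to report
  }
  if (++busyRetries > kMaxBusyRetries) {
    break;                             // give up instead of spinning forever
  }
  // optionally yield or sleep briefly before retrying
}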

Have mozStorage stop spinning at shutdown, at least in the default/general case.  We use SQLite because it knows how to survive crashes and is adept at rollbacks/etc.  In this family of Fx crashes, we seem to be getting into pathological I/O and/or antivirus contention where we will never win.  Arguably there comes a point where mozStorage should just "play dead" and neither do work nor report errors, because errors will cause more work (and things like the cookies database will freak out, delete itself, and try to re-write itself from scratch, which is more work that will fail).

We could let really important stuff mark itself to be kept alive, and just play dead for everything else.  Other than our session store, it's not clear that there's anything that absolutely needs to be 100% persisted at shutdown.  And I think that uses append-only JSON which is a perfect way to avoid AV I/O contentions.  (Although the atomic move at the end may be subject to some.)

Another thing we could do to better understand the problem is the following pseudocode:
* At observe shutdown, open a file, write all the still open databases.
** Spin for ~10 seconds.
** Write the databases still open.
** Close the file.
** Stop spinning.
** Trigger any "play dead" logic if we've implemented that.
* At startup:
** Look for that JSON file
** Submit it to FHR/telemetry under one of the richer-data-smaller-population-on-trunk rationales.  We'd sanitize by binning the connections based on explicit single-database users (places, webappstore, cookies, etc.) or API for things that use multiple databases (IndexedDB, ServiceWorkers registrations, ServiceWorkers Cache, etc.).  The main thing we'd want to know are who was open, who recovered, who did not recover.
** Delete that file.

I turned into a DOM platform engineer so I can probably help out with things if people think it's important.
Flags: needinfo?(bugmail)
(In reply to Andrew Sutherland [:asuth] from comment #12)
> Aside: I am a little concerned that the stackwalks all seem to be saying the
> SQLite symbols are coming from nss3.dll.  Hopefully that's just a problem of
> there being 2 mappings [addr, symbol], [symbol, dll] where it's the second
> mapping getting tricked by xul.dll and nss3.dll having the same symbols.

I actually think sqlite is currently linked into nss3.dll, not xul.dll
http://mxr.mozilla.org/mozilla-central/source/config/external/sqlite/moz.build#14

> * More aggressively hold locks on files to keep antivirus software out of
> our business.  In no case do we expect/allow concurrent manipulation of our
> databases by external processes, and I don't think there is any code in
> gecko that needs to use filesystem locking to communicate internally.  We
> could have our win VFS keep and hold locks on the files until closed to
> avoid letting the antivirus software in.  It's possible this will just cause
> the AV software to become more aggressive and/or just cause our reads/writes
> to block.

Sqlite has "unix-excl" vfs, that we use when storage.nfs_filesystem is set, but we could just be more aggressive and always use it...
At the time I remember I asked Sqlite team about having a win32-excl vfs, and they said it wasn't easily feasible due to how locking works on Windows.
We could try to ask again if something changed...
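For reference, selecting that VFS is just a matter of passing its name at open time; a sketch (mozStorage actually routes opens through its own VFS wrapper and the storage.nfs_filesystem pref rather than calling SQLite directly):

// Sketch only: opening a connection with SQLite's built-in "unix-excl" VFS,
// which takes and holds an exclusive lock for the connection's lifetime.
sqlite3* db = nullptr;
int rc = sqlite3_open_v2("places.sqlite", &db,
                         SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE,
                         "unix-excl");
if (rc != SQLITE_OK) {
  // fall back to the default VFS or report the error
}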

> * Have the mozStorage async BUSY spinner introduce a failsafe counter so it
> gives up eventually.  (Places might need the failsafe counter cranked way up
> if it still uses a separate main-thread connection?)

Yes, it still uses the main-thread connection (most of the async projects were stopped some time ago).  Other components also used a mixed connection, but mostly to set up/upgrade the db schema; after that they only use the async APIs.  We could maybe offer an opt-out path: once init is done you invoke this API, you are no longer allowed to use the sync API from that point on, and then we could maybe avoid mutexes for purely async connections (they could already, but we never had time to do it) and for these opt-out connections.

> Have mozStorage stop spinning at shutdown, at least in the default/general
> case.  We use SQLite because it knows how to survive crashes and is adept at
> rollbacks/etc.  In this family of Fx crashes, we seem to be getting into
> pathological I/O and/or antivirus-contention where we will never win. 
> Arguably there comes a point where mozStorage should just "play dead"

So, just time out after 30s in the final loop?

> (and
> things like the cookies database will freak out and delete itself and try
> and re-write itself from scratch, which is more work that will fail.)

Yeah, that decision was horrible, but there's already a bug filed iirc.

> We could let really important stuff mark itself to be kept alive, and just
> play dead for everything else.  Other than our session store, it's not clear
> that there's anything that absolutely needs to be 100% persisted at
> shutdown.

Due to "clear history on shutdown", many components want to be persisted at shutodown, for privacy reasons.

> Another thing we could do to better understand the problem is the following
> pseudocode:
> * At observe shutdown, open a file, write all the still open databases.
> ** Spin for ~10 seconds.
> ** Write the databases still open.
> ** Close the file.
> ** Stop spinning.
> ** Trigger any "play dead" logic if we've implemented that.
> * At startup:
> ** Look for that JSON file
> ** Submit it to FHR/telemetry under one of the
> richer-data-smaller-population-on-trunk rationales.

Reporting previously unclosed connections sounds like a good idea!

So the plan might be (a rough sketch follows):
1. have a 30s timeout in the final close loop
2. in case of timeout, annotate metadata for the leftover connections somewhere
3. at the next startup, report that metadata through a telemetry keyed histogram (FHR is gone). For privacy reasons we may have to report only known db names and put all the third-party dbs under "others".
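A rough sketch of what steps 1 and 2 could look like in the final close loop (names, helpers, and the 30s figure are illustrative):

// Illustrative sketch, not a patch: cap the xpcom-shutdown-threads spin at
// ~30 seconds and record which connections were still open when we gave up.
const mozilla::TimeDuration kTimeout = mozilla::TimeDuration::FromSeconds(30);
mozilla::TimeStamp start = mozilla::TimeStamp::Now();
while (AnyConnectionStillClosing()) {        // hypothetical helper
  if (mozilla::TimeStamp::Now() - start > kTimeout) {
    RecordLeftoverConnections();             // hypothetical: persist the names
    break;                                   // of the still-open databases for
  }                                          // the next-startup telemetry report
  NS_ProcessNextEvent(mainThread);
}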
(In reply to Marco Bonardo [::mak] from comment #13)
> I actually think sqlite is currently linked into nss3.dll, not xul.dll
> http://mxr.mozilla.org/mozilla-central/source/config/external/sqlite/moz.
> build#14

Good news, thanks!  Anytime I see SQLite and nss at the same time it has historically meant bad news.  I think my memory has crossed wires or just missed some changes.
 
> At the time I remember I asked Sqlite team about having a win32-excl vfs,
> and they said it wasn't easily feasible due to how locking works on Windows.
> We could try to ask again if something changed...

Ah, no, it sounds like that option has been considered, then.  Most of the crashes seem to be on Windows 7 machines, which is a 2009 OS.  I expect you've asked since then, and the SQLite team likely has a far better understanding than I do of how antivirus and Windows locking interact.
 
> So, just timeout after 30s in the final loop?

Sure.

But in terms of "play dead" I also meant that we could flip a flag so that at least our async execution threads would basically do a "sleep(infinity);".  If we assume they're never going to complete their work before the hang-killer kills us, we might as well have them stop generating I/O load.  (We wouldn't actually kill the thread, because that could result in a cascade of refcounts hitting zero and so on.)  It's a bit dramatic, and it assumes we were unlucky enough to have an expensive transaction scheduled around shutdown time that results in a lot of step() calls or BUSY being returned, so that we actually get a chance to have control flow returned to us.  Just brainstorming.  This is likely not an option we'd want unless other investigations showed it could help and it wasn't feasible to instead have known consumers just abort their async requests when they observe shutdown starting.
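A tiny sketch of that flag idea (purely hypothetical; none of these hooks exist today):

// Purely hypothetical "play dead" flag for the async execution threads.
static mozilla::Atomic<bool> sPlayDead(false);

// Flipped once we decide the shutdown hang cannot be won:
//   sPlayDead = true;

// Checked on the async execution thread whenever control returns to us,
// e.g. between step() calls or when SQLITE_BUSY is returned:
if (sPlayDead) {
  // Stop generating I/O load.  Park rather than exiting the thread, to avoid
  // a cascade of refcount releases happening on the wrong thread.
  while (true) {
    PR_Sleep(PR_INTERVAL_NO_TIMEOUT);
  }
}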

> > We could let really important stuff mark itself to be kept alive, and just
> > play dead for everything else.  Other than our session store, it's not clear
> > that there's anything that absolutely needs to be 100% persisted at
> > shutdown.
> 
> Due to "clear history on shutdown", many components want to be persisted at
> shutodown, for privacy reasons.

Doing a lot of database work that touches pages at shutdown seems likely to result in this hang problem.  If we're entirely cleaning out the databases in question (seems like we would be) and currently we're using DELETE instead of nuking the files, maybe we could instead:
* Straight up delete the databases.  (Still requires a clean shutdown.)
* Encrypt the database and keep the key in-memory only, so that if we ever crash or shut down, the means of decrypting the database goes away.  We'd still delete on shutdown if we don't crash, and purge at startup if we do crash.  I was looking at this for IndexedDB private browsing mode, and encrypting SQLite is quite easy.  If SQLITE_HAS_CODEC is defined, the hooks for SQLite's commercial encryption libraries are made available, which are separate from the VFS.  These allow space on the page to be reserved for IVs and MAC-ish codes like AEAD would contribute.  (Note that reserved space isn't really needed for this threat model; AES-XTS or other dumb CTR modes can work as long as incremental vacuuming gives us a chance to decrypt and re-encrypt when moving pages around.  I assume it does, or that can be addressed by doing the crypto in the VFS layer.)  This would potentially be useful should we ever allow explicitly encrypted IndexedDB databases.
   Further aside: It turns out the tricky part for IndexedDB is all the file path stuff that exposes origins.  Non-trivial Quota Manager and IndexedDB changes involving the assumed file structure would be required to avoid massive metadata reveals in the event of a crash.
 
> So the plan sounds like may be:
> 1. having a 30s timeout in the final close loop
> 2. in case of timeout annotate metadata for leftover connections somewhere
> 3. at next startup report that metadata through a telemetry keyed histogram
> (FHR is gone). For privacy reasons we may have to only report known db names
> and put all the third party dbs under "others".

Yes, this sounds like a very sane set of things to do.  Is this something you can do/find resources for or should I try and schedule this on my to-do list?
(In reply to Andrew Sutherland [:asuth] from comment #14)
> If we're entirely cleaning out the databases
> in question (seems like we would be) and currently we're using DELETE
> instead of nuking the files, maybe we could instead:
> * Straight up delete the databases.  (Still requires a clean shutdown.)

In some cases this may be a nightmare for Sync.  And the data-loss risk is higher, plus we'd pay a high price on startup for users who clear data on shutdown.

> * Encrypt the database and keep the key in-memory only so if we ever
> crash/shutdown, the means of decrypting the database go away.  We'd still
> delete on shutdown if we don't crash; purging at startup if we do crash.

The problems seem similar, especially the long startups.
It could be done for things like IndexedDB or DOM Storage, but I suspect those are less likely to be the culprits, since in general their dbs have a quota limit and operations, including close, should be quite fast.

> > So the plan sounds like may be:
> > 1. having a 30s timeout in the final close loop
> > 2. in case of timeout annotate metadata for leftover connections somewhere
> > 3. at next startup report that metadata through a telemetry keyed histogram
> > (FHR is gone). For privacy reasons we may have to only report known db names
> > and put all the third party dbs under "others".
> 
> Yes, this sounds like a very sane set of things to do.  Is this something
> you can do/find resources for or should I try and schedule this on my to-do
> list?

I don't think I'll have the time to look into this soon.  I have pending work for the awesomebar, a security bug, and HiDPI favicons, plus quite some review requests regularly incoming.  I'd be available as a reviewer, though.
(In reply to Marco Bonardo [::mak] from comment #15)
> (In reply to Andrew Sutherland [:asuth] from comment #14)
> > If we're entirely cleaning out the databases
> > in question (seems like we would be) and currently we're using DELETE
> > instead of nuking the files, maybe we could instead:

Also, as a side note, when clearing the dbs we block earlier, at profile-before-change, and we don't proceed until done (there's a timeout after which we crash on purpose though). In those cases the only thing that could happen at shutdown is merging back the journal. In some cases (wal) it may be a large journal and maybe that's where the problem is.
(In reply to Marco Bonardo [::mak] from comment #16)
> Also, as a side note, when clearing the dbs we block earlier, at
> profile-before-change, and we don't proceed until done (there's a timeout
> after which we crash on purpose though). In those cases the only thing that
> could happen at shutdown is merging back the journal. In some cases (wal) it
> may be a large journal and maybe that's where the problem is.

AIUI, the WAL journal will be merged during the AsyncClose() call, which Places blocks/spins on in its entirety.  We'd expect in that scenario to have Places' Database::Observe on the stack spinning, not Storage's.  It does happen, but apparently far less frequently than this storage spin:
https://crash-stats.mozilla.com/report/list?range_unit=days&range_value=14&signature=shutdownhang+%7C+mozilla%3A%3Aplaces%3A%3ADatabase%3A%3AObserve
Oddly, few of those actually show SQLite active on the other threads, but this one does, albeit with a truncated stack: bp-7c48eade-0ac0-4d02-a934-9cc282160420

Assigning to me based on relative busyness.  The last prioritizing quote is that the crash volume is "quite high", which I'm taking as important but not end-of-the-world; I'll try to have a patch up by the end of the week.
Assignee: nobody → bugmail
Status: NEW → ASSIGNED
Flags: needinfo?(bugmail)
After coming up to speed a bit on Telemetry and some surveying of mozStorage consumers, I think we should revisit our use-case and how we're addressing it.

The mozStorage spinning at "xpcom-shutdown-threads" was added in bug 744294 in order to avoid nsPlacesAutoComplete.js needing to add its own event-loop spinning when it called AsyncClose().  The choice of "xpcom-shutdown-threads" was arbitrary-ish.  I think the goal was to spin in a phase strictly after the last one we'd expect a call to AsyncClose to happen in so that there's no race between spinning and calling AsyncClose().

From my partial consumer survey, everyone triggers their shutdowns on either:
* "profile-change-teardown": Places' ConcurrentStatementHolder via "places-shutdown", ditto for Places' nsPlacesAutoComplete, with both of them issuing blind AsyncClose invocations
* "profile-before-change": DOMStorage (synchronous close), QuotaManager and thereby all its QuotaClients including IndexedDB and DOM Cache (spins the main thread while shutdown happens on PBackground), Places' Database (AsyncClose using nsIAsyncShutdownClient ShutdownBlocker idiom), Satchel's FormHistory.jsm (AsyncClose with spun event loop), Sqlite.jsm (AsyncClose with AsyncShutdown.jsm).

Telemetry shuts itself down on "profile-change-teardown2" (which is AsyncShutdown.jsm's "sendTelemetry" phase).  So if we wanted to do the telemetry thing, we would just need to run in an earlier phase... possibly by bumping telemetry to profile-change-teardown3, as you presciently raised the possibility of in bug 873156 comment 3.

However, I think maybe we don't want to do Telemetry.  nsIAsyncShutdown* is a nice unified interface with clear semantics and built-in detailed crash reporting that is way better than everyone using hand-rolled observer/loop spinning.  I propose we do the following (a rough sketch of the blocker shape follows the list):
* Have mozStorage register itself as a blocker of the sendTelemetry/"profile-before-change2" phase.  We don't actually care about telemetry in this proposal and this avoids races with observer-based shutdown on "profile-before-change".  (Although based on my survey so far, we'd actually be okay there.)
* Our blocker will use slightly modified rules.  If a caller does AsyncClose(null) then we're on the hook to block shutdown until it completes.  If they provided their own callback, they're on the hook and we do nothing.
** So for AsyncClose(null) we flag the connection as a close that we care about and provide our own callback.
** Our callback is a loop over all connections that checks for (careAbout && !closed), removing the blocker when there are no more of those.  We also run this loop when our nsIAsyncShutdownBlocker::BlockShutdown() handler is called in case we're not waiting on anything.
* Our blocker's state property bag will provide the list of telemetry filenames of connections we are waiting on if we need to crash.  We already maintain an mTelemetryFilename name for all connections for the benefit of the Slow SQL telemetry reporter.  It's the explicitly provided "telemetryFilename" URI parameter or just the SQLite database name.  Only IndexedDB provides these right now.
** To be clear, add-on database names will be surfaced, as is the case for the Slow SQL telemetry report at https://telemetry.mozilla.org/slowsql/.  We only sanitize their SQL, not their name.  If this is concerning (and I should probably needinfo data privacy people if we go ahead with this), we should maybe move the whitelist logic from Telemetry.cpp into mozStorage and have mTelemetryFilename pre-censor.
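Very roughly, the blocker shell could look like the following (a sketch assuming the existing nsIAsyncShutdownBlocker interface; the helper names and bookkeeping are illustrative):

// Sketch only: the general shape of the proposed mozStorage shutdown blocker.
class StorageShutdownBlocker final : public nsIAsyncShutdownBlocker {
 public:
  NS_DECL_ISUPPORTS

  NS_IMETHOD GetName(nsAString& aName) override {
    aName.AssignLiteral("mozStorage: waiting for AsyncClose(null) connections");
    return NS_OK;
  }

  NS_IMETHOD BlockShutdown(nsIAsyncShutdownClient* aClient) override {
    mClient = aClient;
    // Hypothetical helper: loops over connections checking
    // (careAbout && !closed) and calls mClient->RemoveBlocker(this) once
    // nothing is left to wait on; also invoked from each close callback.
    CheckPendingConnections();
    return NS_OK;
  }

  NS_IMETHOD GetState(nsIPropertyBag** aState) override {
    // Hypothetical: would return the telemetry filenames of the connections
    // we are still waiting on, so a shutdown crash annotates the stuck dbs.
    *aState = nullptr;
    return NS_OK;
  }

 private:
  ~StorageShutdownBlocker() = default;
  void CheckPendingConnections();  // hypothetical, see above
  nsCOMPtr<nsIAsyncShutdownClient> mClient;
};

NS_IMPL_ISUPPORTS(StorageShutdownBlocker, nsIAsyncShutdownBlocker)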

Does this seem reasonable, :mak?  It's more (complex) code than we have right now, but arguably significantly simpler and cleaner than the telemetry approach.
Flags: needinfo?(bugmail) → needinfo?(mak77)
(In reply to Andrew Sutherland [:asuth] from comment #18)
> The mozStorage spinning at "xpcom-shutdown-threads" was added in bug 744294
> in order to avoid nsPlacesAutoComplete.js needing to add its own event-loop
> spinning when it called AsyncClose().

It's a little different: what you describe was the outcome, but the main project this was part of was the exit(0) project.  We needed to ensure all the connections were sanely closed before allowing the process to die, also to avoid leaking threads and resources.

The choice of
> "xpcom-shutdown-threads" was arbitrary-ish.  I think the goal was to spin in
> a phase strictly after the last one we'd expect a call to AsyncClose to
> happen in so that there's no race between spinning and calling AsyncClose().

xpcom-shutdown-threads is where we merge the async threads back into the main thread, right? I think that was part of the reason for doing it there.

> From my partial consumer survey, everyone triggers their shutdowns on either:
> * "profile-change-teardown":

I think this is only Places, but it doesn't trigger connection shutdown; it asks consumers of Places APIs/connections to stop using them, because it's about to close them. The real connection shutdown starts at profile-before-change.
And per the XPCOM documentation, anything using profile files is supposed to stop using the profile after this point.

> Does this seem reasonable, :mak?  It's more (complex) code than we have
> right now, but arguably significantly simpler and cleaner than the telemetry
> approach.

It sounds like a plan!
My only fear now is due to bug 1261106: Places started using async shutdown, but it sounds like there's some problem with holding onto JS consumers and proxy releases. That may need a solution first. That bug is not completely clear to me at the moment, but it looks like an async-shutdown blocking problem.
Flags: needinfo?(mak77)
(In reply to Marco Bonardo [::mak] from comment #19)
> > From my partial consumer survey, everyone triggers their shutdowns on either:
> > * "profile-change-teardown":
> 
> I think this is only Places, but it doesn't trigger connection shutdown, it
> asks consumers of Places APIs/connections to stop using them, cause it's
> about to close them. The real connections shutdown starts at
> profile-before-change.

To clarify, Database::Observe monitors "profile-change-teardown" and notifies "places-shutdown", which directly triggers ConcurrentStatementHolder and (via History::Shutdown) nsPlacesAutocomplete.js to shut down their connections.
https://dxr.mozilla.org/mozilla-central/source/toolkit/components/places/Database.cpp#1747

But I think I get what you mean; they are consumers of Places with read-only usage(?) connections and shut themselves down at that earlier point; they're not the Places core with Database's writable connection.  (Although a connection only doing reads could still end up being the one to checkpoint the WAL if it ends up being the last one to close.)
Yep, that's right: Places consumers are expected to close a little bit earlier, and if they have a connection it will be closed earlier as well.
Priority: -- → P3
Having just looked at the DOMStorage-blocks-shutdown variant of this (bug 1335769) where a checkpoint was on the stack, I think we might also want to look at:

1) Having shutdown close()s explicitly disable auto-checkpointing so that we don't try to re-integrate the WAL back into the database at shutdown and incur all that I/O and the 2 fsync()s (under PRAGMA synchronous=NORMAL).  The exception would be cases like Places' "clear history at shutdown", where we intentionally have best-effort privacy characteristics and want deletions/erasures to be re-integrated so that we can be sure they no longer exist in the database.  This should avoid fsyncs under PRAGMA synchronous=NORMAL entirely.

2) Unifying fsync/FlushFileBuffers calls at shutdown for cases where we do need to perform checkpointing.  While our random-access database I/O does not particularly parallelize on its own, we do know that on existing hardware and file systems fsync() calls are usually going to be disproportionately expensive.  If we know we're going to do close-related fsync()s on N threads, the VFS layers could do something where N-1 of the threads skip their fsync call and block on the last thread completing its fsync.  (If the last thread decides not to fsync, as indicated by its close method being called, but another thread has flagged that an fsync is needed, it will still be compelled to trigger an fsync or to convince another thread to do so.)  These would be tracked independently for the WAL and database files, where we expect 1 fsync each.
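For (1), SQLite exposes a per-connection switch that skips the checkpoint when a connection closes; a sketch of that piece (assumes a recent enough SQLite for SQLITE_DBCONFIG_NO_CKPT_ON_CLOSE; the shutdown-path plumbing in mozStorage is hypothetical):

// Sketch for idea (1): ask SQLite not to checkpoint the WAL back into the
// main database file when this connection closes.
int wasEnabled = 0;
int rc = sqlite3_db_config(db, SQLITE_DBCONFIG_NO_CKPT_ON_CLOSE,
                           1 /* disable checkpoint-on-close */, &wasEnabled);
if (rc != SQLITE_OK) {
  // older SQLite or failure: fall back to the normal close path
}
// Exception: connections used by "clear history at shutdown" would keep the
// default behavior so that deletions are re-integrated into the database.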
Assignee: bugmail → nobody
Status: ASSIGNED → NEW
Depends on: 1405290
Depends on: 1437575
Closing because no crashes have been reported in the last 12 weeks.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX