Closed Bug 730495 Opened 12 years ago Closed 7 years ago

Need to guarantee that sqlite3_config is called before any other SQLite function

Categories

(Toolkit :: Storage, defect, P3)

defect

Tracking

()

RESOLVED FIXED
mozilla58
Tracking Status
firefox58 --- fixed

People

(Reporter: mbrubeck, Assigned: KaiE)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 8 obsolete files)

mozilla::storage::Service::initialize calls sqlite3_config, which must be called before sqlite is initialized.  If sqlite3_initialize is called, directly or indirectly, before the storage service is initialized, then the sqlite3_config call will fail.

This is what happened in bug 717904: A simple change caused Fennec to load NSS before the storage service, causing the storage service initialization to fail.

To prevent this from happening again, we need to guarantee that the sqlite3_config call happens before any other sqlite usage.  Some options:

1) Do the configuration within an application-specified sqlite3_os_init() function.  This has the advantage that it would transparently and automatically work with any code that uses sqlite.  However, it looks like this is only supported when compiling with SQLITE_OS_OTHER=1, in which case none of the built-in sqlite_os_* code is used or built.

2) Call the storage service initializer (or just the sqlite3_config part) from somewhere very early in the startup path, before other sqlite callers have a chance to run.

3) Factor out the sqlite3_config part of the storage service initializer into a new function, and ensure that any code that uses SQLite (e.g. NSS) calls that function first.
We may even ask SQLite team if it would be possible to have a sqlite_app_init invoked by (or just after) sqlite_os_init.
Blocks: 721784
After fixing bug 743715, and disabling jemalloc, my crasher in bug 741836 went away.
https://tbpl.mozilla.org/?tree=Try&rev=80754f169a60
Assignee: nobody → gpascutto
Attachment #613357 - Flags: review?(jones.chris.g)
Assignee: gpascutto → nobody
Attachment #613357 - Attachment is obsolete: true
Attachment #613357 - Flags: review?(jones.chris.g)
Sorry for that, wrong bug.
Should this block?
(In reply to Marco Bonardo [:mak] from comment #1)
> We may even ask SQLite team if it would be possible to have a
> sqlite_app_init invoked by (or just after) sqlite_os_init.

I think this sounds like the best proposal so far.  Dr. Hipp, would you accept a change like this?  Would you prefer to implement it yourself, or to accept a patch from us?

I like this solution because on Android we want Java code to use libmozsqlite3 before libxul is loaded, so any configuration code must exist outside of libxul.  (I think jemalloc is already compiled into a separate "libmozalloc" library, so that should be possible, right?)


(In reply to Mark Finkle (:mfinkle) from comment #7)
> Should this block?

Probably not.  Bug 721784 has a workaround (disable jemalloc for sqlite on Android) that should fix our blocking issues without needing any changes to sqlite itself.  However, that workaround might impact performance.  If there's an unacceptable performance hit, then this bug should block Fennec.
(In reply to Matt Brubeck (:mbrubeck) from comment #8)
> (In reply to Marco Bonardo [:mak] from comment #1)
> > We may even ask SQLite team if it would be possible to have a
> > sqlite_app_init invoked by (or just after) sqlite_os_init.
> 
> I think this sounds like the best proposal so far.  Dr. Hipp, would you
> accept a change like this?  

If I correctly understand your request, then this already exists.  Simply compile with -DSQLITE_EXTRA_INIT=app_init_func and then app_init_func() will be called shortly after sqlite3_os_init().  The app_init_func() is passed a single argument which is a NULL pointer (an historical artifact).  There is also -DSQLITE_EXTRA_SHUTDOWN=app_shutdown_func which you can use to cause SQLite to invoke a function just prior to shutdown.
(In reply to D. Richard Hipp from comment #9)
> If I correctly understand your request, then this already exists.  Simply
> compile with -DSQLITE_EXTRA_INIT=app_init_func and then app_init_func() will
> be called shortly after sqlite3_os_init().

That's perfect; thanks!
(In reply to Matt Brubeck (:mbrubeck) from comment #10)
> (In reply to D. Richard Hipp from comment #9)
> > If I correctly understand your request, then this already exists.  Simply
> > compile with -DSQLITE_EXTRA_INIT=app_init_func and then app_init_func() will
> > be called shortly after sqlite3_os_init().
> 
> That's perfect; thanks!

How well is this going to work for system sqlite, though?
(In reply to Nathan Froyd (:froydnj) from comment #11)
> How well is this going to work for system sqlite, though?
System SQLite is unsupported.
(In reply to Shawn Wilsher :sdwilsh from comment #12)
> (In reply to Nathan Froyd (:froydnj) from comment #11)
> > How well is this going to work for system sqlite, though?
> System SQLite is unsupported.

It's used by many linux distros.
See Also: → 938730
(In reply to Nathan Froyd (:froydnj) from comment #11)
> How well is this going to work for system sqlite, though?

For Linux distributions we need a solution that works with system sqlite, or we must patch Mozilla.
I think we cannot use the proposed approach (-DSQLITE_EXTRA_INIT=app_init_func).

Because that's a compile time option, but for Linux distributions we require a generic solution, that will allow sqlite to be compiled as a single binary package, and supporting any application's needs.

If a Mozilla application wants to ensure that any library will use a Mozilla provided allocator (instead of the default allocator), then Mozilla must make sure to register that allocator (by calling sqlite3_config(SQLITE_CONFIG_MALLOC)) prior to calling any library code (such as NSS code, which might indirectly trigger init of the allocation function).
(In reply to Matt Brubeck (:mbrubeck) from comment #0)
> 2) Call the storage service initializer (or just the sqlite3_config part)
> from somewhere very early in the startup path, before other sqlite callers
> have a chance to run.

+1

That's the only approach that's guaranteed to work in all scenarios.
Bug 938730 is going to fix the system sqlite case, by just not using jemalloc as the allocator.  Once that lands, that should let us use the solution from comment 9, which will let us use jemalloc as the allocator on Android.  Is that right?
Removing 938730 from the blocker-list (it's still in the see-also-list).

(In reply to Nicholas Nethercote [:njn] from comment #17)
> Bug 938730 is going to fix the system sqlite case, by just not using
> jemalloc as the allocator.

Yes, landed.

> Once that lands, that should let us use the
> solution from comment 9, which will let us use jemalloc as the allocator on
> Android.  Is that right?

I think that's right.

If you use an internal version of sqlite, you can register an additional function at build time, as suggested in comment 9.


Given this bug isn't fixed yet: Why doesn't it crash when using a current Mozilla release Linux binary (e.g. Firefox 25), and using the same NSS configuration as in bug 938730 (using the sqlite NSS database format)? I would expect that to crash, too.
No longer blocks: 938730
(In reply to Kai Engert (:kaie) from comment #18)
> Given this bug isn't fixed yet: Why doesn't it crash when using a current
> Mozilla release Linux binary (e.g. Firefox 25), and using the same NSS
> configuration as in bug 938730 (using the sqlite NSS database format)?

I've investigated and found the answer. The following happens.

Firefox built using internal sqlite (and jemalloc enabled).

- firefox starts
- NSS (indirectly) triggers sqlite3_initialize
- NSS starts using sqlite DB
- storage service gets created and calls sqlite3_config(SQLITE_CONFIG_MALLOC)
- sqlite rejects the call!
  because of a consistency check:
    if( sqlite3GlobalConfig.isInit ) return SQLITE_MISUSE_BKPT;
- storage service init fails
- storage service gets destructed and sqlite3_shutdown (!) gets called

	#1  0x00007ffff1309952 in mozilla::storage::Service::~Service
	#2  0x00007ffff1309abe in mozilla::storage::Service::~Service
	#3  0x00007ffff130939c in mozilla::storage::Service::Release
	#4  0x00007ffff1309652 in mozilla::storage::Service::getSingleton ()
	#5  0x00007ffff12e8c9f in mozilla::storage::ServiceConstructor
	#6  0x00007ffff1d90297 in mozilla::GenericFactory::CreateInstance
	#7  0x00007ffff1e006e2 in nsComponentManagerImpl::CreateInstanceByContractID
            (this=0x7ffff6c88840, 
             aContractID=0x7ffff326cd18 "@mozilla.org/storage/service;1", ...)
- shortly afterwards, the storage service gets initialized again, and 
  registers the jemalloc allocator.

This seems like a pretty bad scenario.
Shutting down the sqlite session, which NSS has already started using, probably causes unpredictable behaviour in NSS.

(Remember, this is when using the new NSS database format, which uses sqlite, which isn't yet used by Firefox by default, but we'd like to change Firefox to use it by default).
Blocks: 783994
bug 1026828 might have complicated things a little bit. If with bug 938730 we could just assume system sqlite didn't use jemalloc and thus was unaffected, with bug 1026828 we will use jemalloc if there's system jemalloc. That means we call again sqlite3_config even for some system sqlite, thus we can't again use -DSQLITE_EXTRA_INIT, or those system with both system sqlite and system jemalloc will hit this bug again.
Attached patch WIP patch (obsolete) — Splinter Review
As :mak said, this might not support system-jemalloc + system-sqlite but I'm not quite clear about the issue yet. I want your early feedback about whether you think we should do this or think about other solutions.
Attachment #8621569 - Flags: feedback?(mh+mozilla)
Why not initialize sqlite from a static initializer for the system-jemalloc+system-sqlite case?
Attachment #8621569 - Flags: feedback?(mh+mozilla)
The race condition seems to be able to cause breakage with add-on installations. Kan-Ru, can you confirm Tim's suspicion? What's the status of your patch?
Flags: needinfo?(kchen)
I am referring to bug 1236865.
Flags: needinfo?(kchen)
Flags: needinfo?(kchen)
(In reply to Christiane Ruetten [:cr] from comment #23)
> The race condition seems to be able to cause breakage with add-on
> installations. Kan-Ru, can you confirm Tim's suspicion? What's the status of
> your patch?

The patch needs rebasing and we also need to implement comment 22
Flags: needinfo?(kchen)
Assign to myself to investigate.
Assignee: nobody → kchen
This does not enable us to use jemalloc on Android because the malloc
functions are still in libxul. They has to be in libxul because the
customized DMD reporting.

Review commit: https://reviewboard.mozilla.org/r/39189/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/39189/
Attachment #8728906 - Flags: review?(mh+mozilla)
doesn't look like this patch resolves comment 20 and comment 22
(In reply to Marco Bonardo [::mak] from comment #28)
> doesn't look like this patch resolves comment 20 and comment 22

Yes, it only fixes the races when mozsqlite is used with jemalloc. It doesn't attempt to fix the Android issue or system-sqlite issue. The latter cases still use sqlite without jemalloc to avoid the problem.

I think the only way to fix Android/system-sqlite is to call sqlite3_config in program startup. We cannot do it in libxul.so because the reason stated in comment 8.
Comment on attachment 8728906 [details]
MozReview Request: Bug 730495 - guarantee that sqlite3_config is called before any other SQLite function. r=glandium

https://reviewboard.mozilla.org/r/39189/#review36103

::: db/sqlite3/src/sqlite_jemalloc.c:9
(Diff revision 1)
> +__attribute__((weak)) void sqlite3_really_config_jemalloc();

I don't think this works reliably. On Linux, if it works at all, which I'm not even sure, it works by chance. But there are plenty of reasons why this would not do anything, one of which is hardened builds with e.g. -Wl,-z,now (which is equivalent to setting LD_BIND_NOW=1). I'm not sure this would work on OSX either, and this definitely doesn't work on Windows, where __attribute__((weak)) is not supported.
Attachment #8728906 - Flags: review?(mh+mozilla)
Priority: -- → P3
Blocks: 1263209
Blocks: 1319379
Is there any chance this will get fixed/landed in the 57 time frame?  I'm assuming not, but if so we could use it for bug 870460 (which we're otherwise planning to just hack by finding an empirically safe time to load the cookie database).
Flags: needinfo?(mak77)
let's move on the discussion in bug 870460 first, since I'm not sure this bug really affects you in any way.
Flags: needinfo?(mak77)
Could we require that PSM triggers the init of the storage server, prior to NSS init?

This seems to solve the assertions that I saw when testing bug 783994.
Attachment #8884335 - Flags: superreview?(dkeeler)
Attachment #8884335 - Flags: review?(mh+mozilla)
Comment on attachment 8884335 [details] [diff] [review]
730495-psm-triggers-storage.patch

Review of attachment 8884335 [details] [diff] [review]:
-----------------------------------------------------------------

Seems like a reasonable solution to me, although see the comment about where we actually put this call.

::: security/manager/ssl/nsNSSComponent.cpp
@@ +113,5 @@
>    }
>  
>    if (XRE_IsParentProcess()) {
> +    nsresult rv;
> +    nsCOMPtr<nsISupports> storageService =

I think it might be best to put this in nsNSSComponent::Init so that it's impossible to instantiate PSM without the storage service. Also, since we're not checking rv, I think we can use the do_GetService variant that doesn't have the extra parameter.
Attachment #8884335 - Flags: superreview?(dkeeler) → superreview+
(I guess note that attachment 8884335 [details] [diff] [review] only addresses the specific case of NSS' copy of sqlite running first and causing bad things to happen, so that wouldn't solve this bug in general.)
Attachment #8884335 - Flags: review?(mh+mozilla) → review?(mak77)
Comment on attachment 8884335 [details] [diff] [review]
730495-psm-triggers-storage.patch

Review of attachment 8884335 [details] [diff] [review]:
-----------------------------------------------------------------

As pointed out in comment 35, the proposed patch doesn't really solve this bug, it just adds another workaround to nss. I'm fine with a workaround to move on, just that it's in the wrong bug. Thus clearing the request (regardless, there isn't much to review for a Storage peer here, the patch is just instantiating a service).
The workaround should be worked in an NSS component bug, and that bug should probably block this, that will stay unresolved for now.

Regarding a broader solution here, I'm prone to think we should:
1. Use -DSQLITE_EXTRA_INIT. It will just work on Windows, MacOS and Linux (non-system-sqlite).
2. Provide some kind of workaround for system Sqlite. Comment 22 suggested a static initializer for "System Sqlite + system Jemalloc" but I'm not sure the system jemalloc problem exists yet. I think our jemalloc version is so much far from upstream that we stopped using system jemalloc (I can't find MOZ_NATIVE_JEMALLOC define anywhere. Glandium may know). Regardless, we can have a workaround solution only for system sqlite, that consists into invoking sqlite3_config really early in the process life, and be done with that.
Attachment #8884335 - Flags: review?(mak77)
Glandium, is there any case where we may still use system jemalloc?
Do you have any suggestion for the static initializer you suggested, that would only exist for the system sqlite case?
Flags: needinfo?(mh+mozilla)
Depends on: 1380706
(In reply to Marco Bonardo [::mak] from comment #36)
> As pointed out in comment 35, the proposed patch doesn't really solve this
> bug, it just adds another workaround to nss. I'm fine with a workaround to
> move on, just that it's in the wrong bug. Thus clearing the request
> (regardless, there isn't much to review for a Storage peer here, the patch
> is just instantiating a service).
> The workaround should be worked in an NSS component bug, and that bug should
> probably block this, that will stay unresolved for now.

OK, moved to bug 1380706
(In reply to Marco Bonardo [::mak] from comment #37)
> Glandium, is there any case where we may still use system jemalloc?

On FreeBSD, but it's now treated as a generic system malloc (which we have in other cases, like 
ASAN builds)

> Do you have any suggestion for the static initializer you suggested, that
> would only exist for the system sqlite case?

A static instance of a class with a constructor, inside a #ifdef, would do that.
Flags: needinfo?(mh+mozilla)
No longer blocks: 783994
See Also: → 1401514
No longer depends on: 1380706
See Also: → 1380706
I discovered yet another issue related to this problem. This time it isn't about the init order, but about shutdown order.

If NSS/PSM uses sqlite, the following may happen:
- PSM service is requested
- PSM init triggers init of the storage service
- PSM calls NSS init
- at shutdown time, the XPCOM cleanup shuts down the storage service, which calls slite3_shutdown
- next, XPCOM cleanup calls PSM cleanup, which calls nss shutdown, which attempts to close sqlite databases, which fails because sqlite is already shut down.
I'd like to suggest that sqlite init and shutdown gets completely decoupled from XPCom.

Call sqlite_config/init prior to starting up the XPCom system,
and call sqlite_shutdown after XPCom cleanup is completed.
(In reply to Kai Engert (:kaie:) from comment #40)
> - at shutdown time, the XPCOM cleanup shuts down the storage service, which
> calls slite3_shutdown
> - next, XPCOM cleanup calls PSM cleanup, which calls nss shutdown, which
> attempts to close sqlite databases, which fails because sqlite is already
> shut down.

There are 2 things here:
1. https://sqlite.org/c3ref/initialize.html: A call to sqlite3_shutdown() is an "effective" call if it is the first call to sqlite3_shutdown() since the last sqlite3_initialize(). Only an effective call to sqlite3_shutdown() does any deinitialization. All other valid calls to sqlite3_shutdown() are harmless no-ops.
2. Why is NSS trying to close a database connection so late? It should close it when the profile is going away, it doesn't make sense to touch anything in the profile folder after the profile shutdown notifications

Off-hand the shutdown bug looks like a pure implementation bug in nss.
We can debate about NSS, but apparently this bug is waiting for a general solution for multiple components anyway.

The suggestion to do it outside of XPCom would solve it for everyone.
(In reply to Marco Bonardo [::mak] from comment #42)
> 2. Why is NSS trying to close a database connection so late? It should close
> it when the profile is going away, it doesn't make sense to touch anything
> in the profile folder after the profile shutdown notifications

NSS cannot shut down its storage until it's told to shutdown entirely.

NSS cannot be shut down until all memory provided by NSS to the application has been cleaned up.

Because of JS, XPCom, and delayed cleanup, you cannot destroy the global NSS resources while JS holds pointers to NSS.


> Off-hand the shutdown bug looks like a pure implementation bug in nss.

NSS never calls sqlite3_shutdown.
(In reply to Kai Engert (:kaie:) from comment #41)
> Call sqlite_config/init prior to starting up the XPCom system,
> and call sqlite_shutdown after XPCom cleanup is completed.

How about doing that in toolkit/xre/Bootstrap.cpp constructor and destructor?
(In reply to Kai Engert (:kaie:) from comment #43)
> We can debate about NSS, but apparently this bug is waiting for a general
> solution for multiple components anyway.

I agree that it's waiting for a solution, but not by multiple components, all the blocked bugs are related to NSS.
Note, I'm not trying to blame, just stating the only existing issue is between Storage service and NSS. That's what we try to solve from the beginning.
So far there was just an init problem, and that already had a suggested solution (that nobody found time to implement): comment 36 and comment 39.

(In reply to Kai Engert (:kaie:) from comment #44)
> > Off-hand the shutdown bug looks like a pure implementation bug in nss.
> 
> NSS never calls sqlite3_shutdown.

How does NSS know if the file it's acting upon is still valid? I suppose it's using a database from somewhere, and I suppose that "somewhere" is the profile folder. Once the profile has been shutdown by the app, there is no guarantee made on its contents, they could even be "virtual" or removed.
What I meant is not that NSS is wrongly invoking sqlite3_shutdown, but that anything that touches the profile shouldn't wait xpcom-shutdown before closing its handles, it should stop using profile files when the profile goes away, even if the app is alive.

(In reply to Kai Engert (:kaie:) from comment #44)
> NSS cannot shut down its storage until it's told to shutdown entirely.

This would be a problem indeed. I'd like to understand the technical reasons, but I guess it would be a too long description :(

(In reply to Kai Engert (:kaie:) from comment #45)
> How about doing that in toolkit/xre/Bootstrap.cpp constructor and destructor?

This is probably a question for someone more more involved with the XPCOM runtime story, than me. Maybe :nfroyd?
But note than at that point, any other consumers that is wrongly waiting xpcom-shutdown to close (and any other consumer would be wrong doing that) would be delayed until that new very-late point in the runtime, that may cause unexpected problems, since Storage and most consumers expect XPCOM to be available.
(In reply to Marco Bonardo [::mak] from comment #46)
> that may
> cause unexpected problems, since Storage and most consumers expect XPCOM to
> be available.

On a second thought, if the other consumers go through Storage, they should be handled by the normal Storage shutdown, provided we only delay the sqlite3_shutdown call. So it may not be a problem in the end.
(In reply to Marco Bonardo [::mak] from comment #46)
> I agree that it's waiting for a solution, but not by multiple components,
> all the blocked bugs are related to NSS.

Ok, I hadn't noticed that.


> How does NSS know if the file it's acting upon is still valid?

It uses its own files that nobody else is accessing. That fact they are in a directory where firefox stores files, too, shouldn't affect that NSS still owns those files.

> Once the profile has been shutdown by the app, there is no
> guarantee made on its contents, they could even be "virtual" or removed.

We should be able to assume that nobody does evil things and deliberately destroys the contents of the files.
Deleting isn't a problem, the OS handles deleted files that are still open by someone.


> What I meant is not that NSS is wrongly invoking sqlite3_shutdown, but that
> anything that touches the profile shouldn't wait xpcom-shutdown before
> closing its handles, it should stop using profile files when the profile
> goes away, even if the app is alive.

Probably very difficult to get implemented with NSS.


> This would be a problem indeed. I'd like to understand the technical
> reasons, but I guess it would be a too long description :(

Right, and it would require to bring in additional experts to provide you with that explanation.


> > How about doing that in toolkit/xre/Bootstrap.cpp constructor and destructor?
> 
> This is probably a question for someone more more involved with the XPCOM
> runtime story, than me. Maybe :nfroyd?

I've started a try build to test if that pragmatic approach works. (It built locally.):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=70943c6597a9905b69457281507b93fb1a358164
There may be minor complications with the TelemetryVFS we should address.  Assuming https://searchfox.org/mozilla-central/source/security/nss/lib/softoken/sdb.c#623 and its use of sqlite3_open are the SQLite use we're talking about, it's ending up with the default VFS, which is the one we set, telemetry-vfs.  We currently delete the mSqliteVFS sqlite3_vfs that serves as the vtable for VFS dispatch in Service::~Service(), so I would expect crashes on implementations where we're using jemalloc and fill freed memory with poision value 0xe5's.

Options would be:
- Have NSS use sqlite3_open_v2 and explicitly pick a VFS to bypass the telemetry VFS.  Our telemetry doesn't meaningfully recognize that NSS may use a database, so it's not like this is a big loss.  But it might be useful in the future.
- Defer deleting the sqlite3_vfs until after we really shutdown SQLite.  (This assumes Telemetry won't explode on its own.  It looks like Telemetry doesn't every really shutdown from the perspective of the calls we make, so this should be fine?
Thanks Andrew for these hints. What does telemetry-vfs do, is it limited to recording I/O timing, or does it modify how I/O functions?
Andrew, IIUC, when using the legacy sqlite3_open call, one gets the default VFS.

IIUC, if NSS isn't changed, then NSS wouldn't use the telemetry-vfs that you are registering, and wouldn't affected by the vfs being destroyed early.
Attached patch 730495-v5.patch (obsolete) — Splinter Review
(In reply to Kai Engert (:kaie:) from comment #48)
> 
> I've started a try build to test if that pragmatic approach works. (It built
> locally.):
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=70943c6597a9905b69457281507b93fb1a358164

It's still running (now 75%), but it built, and it seems to work.

That try build used a quick-and-dirty move-over of the code.

Here I'm attaching a cleaner patch, for the suggestion to move sqlite init/shutdown out of XPCom land, to avoid the races.
(In reply to Kai Engert (:kaie:) from comment #50)
> Thanks Andrew for these hints. What does telemetry-vfs do, is it limited to
> recording I/O timing, or does it modify how I/O functions?

It just records I/O timing/counts.  I do think it's useful to have the timing info in general[1].  My concern is just that we avoid a use-after-free/crash-after-free scenario.  Also, I want to make sure you're aware that you're probably getting that VFS.

1: Noting that unless we add a specific row for NSS's database file, all of its statistics will just go in the "OTHER" bucket.  For privacy reasons we only collect statistics for very specific databases file (places, cookies, localstorage), plus a catch-all OTHER.

(In reply to Kai Engert (:kaie:) from comment #51)
> Andrew, IIUC, when using the legacy sqlite3_open call, one gets the default
> VFS.
> 
> IIUC, if NSS isn't changed, then NSS wouldn't use the telemetry-vfs that you
> are registering, and wouldn't affected by the vfs being destroyed early.

Service::initialize() does `rc = sqlite3_vfs_register(mSqliteVFS, 1);` where that second argument is the "makeDflt" argument that makes it the default VFS.

Given that mozStorage already wraps all calls to sqlite3_open(_v2) for its callers, we could probably stop making the VFS the default if there's a convincing reason.  Again, I think it's arguably useful for us to have the I/O latency perspective, but that's a bit hand-wavey.  (Although since we no longer let extensions use SQLite directly, we may start paying more attention to these metrics and expanding them.)
Marco, IIUC comment 31 (bug 870460) also suggests a scenario that wants to ensure that sqlite has already been initialized, which is not related to NSS.
Andrew, the NSS databases store secret data. There are sometimes attacks that can deduce secret keys based on timing information. That might be an argument to exclude the NSS database from timing probes, and transmitting such probes to remote locations.
Comment on attachment 8910817 [details] [diff] [review]
730495-v5.patch

Mike, what's your opinion on this approach?
Attachment #8910817 - Flags: review?(mh+mozilla)
Comment on attachment 8910817 [details] [diff] [review]
730495-v5.patch

Also asking nfroyd for feedback, as suggested by Marco.
Attachment #8910817 - Flags: feedback?(nfroyd)
(In reply to Kai Engert (:kaie:) from comment #55)
> Andrew, the NSS databases store secret data. There are sometimes attacks
> that can deduce secret keys based on timing information. That might be an
> argument to exclude the NSS database from timing probes, and transmitting
> such probes to remote locations.

That would be quite an epic side-channel attack, but I agree that we have no concrete need for the NSS database to be included in telemetry, and for telemetry there always needs to be a good reason for the data being gathered.  And I could totally believe that an attacker could influence the histogram such that the telemetry payload size could be subject to traffic analysis due to ping size, and maybe the attacker could improve the signal rate/work around ping rate limits if MITMing to convince Firefox that the connection failed to complete and triggering retry logic that might get a fresh snapshot or something.  Unsure how that all works.  Don't want to have to worry about it, better to just avoid the surface area.

:mak, do you have any problem with us making the VFS non-default?  Looking at the VFS-using code, I notice that the "ignoreLockingMode" magic for read-only mode that picks the "win32-none" or "unix-none" VFS ends up bypassing the telemetry VFS.  When making things non-default, we could address that.  Maybe by making ConstructTelemetryVFS take the VFS name to wrap, and we just invoke it once for the standard VFS name and once for the non-locking VFS and registering them separately as "telemetry-vfs" and "telemetry-vfs-none".
Flags: needinfo?(mak77)
Alternately, we could make NSS explicitly specify a VFS.  We could do that by putting the appropriate ifdef's in NSS, and/or we could create an alias for "no-telemetry" (ifdef'ed for being built under gecko; nullptr would be used on non-Gecko) that we map to "win32"/"unix"/"unix-excl" as appropriate.  I guess the NFS-awareness pref-check we have in ConstructTelemetryVFS is something NSS is getting for free right now, and maybe it would want to keep it.
You don't control the NSS code. It can be a system-wide installed binary library file on a Linux distribution, which wasn't compiled with your modifications.
(In reply to Kai Engert (:kaie:) from comment #60)
> You don't control the NSS code. It can be a system-wide installed binary
> library file on a Linux distribution, which wasn't compiled with your
> modifications.

Accepted as reality.  Are we still expecting this to be the case going forward with Flatpak and Snappy packages and the mainline distros?
(In reply to Andrew Sutherland [:asuth] from comment #61)
> Accepted as reality.  Are we still expecting this to be the case going
> forward with Flatpak and Snappy packages and the mainline distros?

I'm pretty sure this will continue to be the case on several distributions. It will be almost certainly the case for Fedora and RHEL distributions.

We use adjusted NSS configurations on long term maintenance branches e.g. for backwards compatibility, e.g. TLS cipher suite order. We make modifications for OS integration, e.g. we use a different place to store the CA trust, which administrators can configure centrally, and these things are expected to work across all applications, so it would be a lot of work to adjust multiple copies. It's also a lot of work to update critical libraries like NSS when vulnerabilities must get patched. It's much easier to do that in just one central copy, and update just once for e.g. both Firefox, Thunderbird and other applications. I expect the configuration of workstations for the primary applications to continue to follow the traditional approach of shared libraries.
(In reply to Kai Engert (:kaie:) from comment #62)
> I'm pretty sure this will continue to be the case on several distributions.
> It will be almost certainly the case for Fedora and RHEL distributions.

Thank you.  We may be turning on FTS5 and/or other SQLite extensions in the future and I was wondering how this might interact with the analogous system SQLite issue.  But that's a different bug! :)
(In reply to Andrew Sutherland [:asuth] from comment #53)
> (In reply to Kai Engert (:kaie:) from comment #50)
> > Thanks Andrew for these hints. What does telemetry-vfs do, is it limited to
> > recording I/O timing, or does it modify how I/O functions?
> 
> It just records I/O timing/counts.  I do think it's useful to have the
> timing info in general[1].

I honestly think we could get rid of this telemetry. It was extremely useful when we were prioritizing the conversion from synchronous to asynchronous, but now I'm the only one periodically looking at its data, and apart from a few fluctuations, lately we have reduced our new uses of storage, so it has been pretty much stable.
While I think it has its own value, the benefit vs the maintenance cost may not pend anymore towards keeping it.
Slow SQL telemetry would be far more useful, and we lost it :/ Hopefully it will come back.

THOUGH, telemetry-vfs, despite the name, is also at the base of the Quota system afaict, indeed some time ago I suggested (in a comment lost in space-time) to rename it to quota-vfs and remove the telemetry part of it.

Would this solve the security concerns?
This could also mean we could limit quota-vfs to consumers caring about quota (I assume indexedDB and DOM Storage?)

(In reply to Andrew Sutherland [:asuth] from comment #58)
> :mak, do you have any problem with us making the VFS non-default?

I think my answers goes even further. What do you think of completely removing the telemetry part?

> Looking
> at the VFS-using code, I notice that the "ignoreLockingMode" magic for
> read-only mode that picks the "win32-none" or "unix-none" VFS ends up
> bypassing the telemetry VFS.

This means it's bypassing quota? That's not ok :(
Flags: needinfo?(mak77)
(In reply to Marco Bonardo [::mak] from comment #64)
> I honestly think we could get rid of this telemetry.

This works for me.  I agree that the interesting data exists at an `Connection::executeSql` granularity or higher.  For consumers using the async API, that telemetry effectively needs to be gathered by mozStorage (but for sync consumers they may want their own higher level telemetry).  And SQLite has been growing useful helpers like https://sqlite.org/stmt.html for letting us get stats out that are potentially more useful than acting at a VFS layer.

> THOUGH, telemetry-vfs, despite the name, is also at the base of the Quota
> system afaict, indeed some time ago I suggested (in a comment lost in
> space-time) to rename it to quota-vfs and remove the telemetry part of it.

Yeah, a rename sounds good.

:kaie/NSS team, don't worry, the Quota Manager stuff only kicks in if you open the database as a URI and pass all the right QuotaManager URI arguments in the URI.
 
> This could also mean we could limit quota-vfs to consumers caring about
> quota (I assume indexedDB and DOM Storage?)

Yes, this sounds good.  And we can add a debug-only assertion that we're specifying the VFS if the path is under PROFILE/storage/ and we can have the VFS assert that it's able to find the relevant quota object.

> > Looking
> > at the VFS-using code, I notice that the "ignoreLockingMode" magic for
> > read-only mode that picks the "win32-none" or "unix-none" VFS ends up
> > bypassing the telemetry VFS.
> 
> This means it's bypassing quota? That's not ok :(

It looks like only the Chrome profile migration logic at https://searchfox.org/mozilla-central/source/browser/components/migration/ChromeProfileMigrator.js it was added for in bug 1285041 uses it.  We definitely don't enforce quota on Chromium's profile directory, so we should be fine there ;)  (I was assuming it was something Places used on its read-only async clones, but that appears to not be the case.  And in fact, it looks like we fail to propagate the setting if we do an async clone.)
Comment on attachment 8910817 [details] [diff] [review]
730495-v5.patch

Review of attachment 8910817 [details] [diff] [review]:
-----------------------------------------------------------------

This seems like a pretty reasonable approach.  Some comments below.

::: security/manager/ssl/nsNSSComponent.cpp
@@ -2210,5 @@
>  
> -  // To avoid a sqlite3_config race in NSS init, as a workaround for
> -  // bug 730495, we require the storage service to get initialized first.
> -  nsCOMPtr<nsISupports> storageService =
> -    do_GetService(MOZ_STORAGE_SERVICE_CONTRACTID);

I think that mozStorageService was the only thing propagating out errors of sqlite initialization to this piece of code.  Will errors be discovered by pieces of code futher down, now that this has been removed, and will those errors be handled correctly?

::: toolkit/xre/SQLiteLifetime.cpp
@@ +133,5 @@
> +
> +SQLiteLifetime::SQLiteLifetime()
> +{
> +    if (++SQLiteLifetime::mSingletonEnforcer != 1)
> +        abort();

Nit: this condition is indented peculiarly, relative to the rest of the function.

::: toolkit/xre/SQLiteLifetime.h
@@ +7,5 @@
> +#define mozilla_SQLiteLifetime_h
> +
> +namespace mozilla {
> +
> +class SQLiteLifetime

This class should be final.  Maybe it should be called `AutoSQLiteLifetime`, in keeping with Mozilla naming of RAII things as `Auto`?

@@ +9,5 @@
> +namespace mozilla {
> +
> +class SQLiteLifetime
> +{
> +protected:

..and if the class is `final`, then these can just be normal private members.
Attachment #8910817 - Flags: feedback?(nfroyd) → feedback+
Attached patch 730495-v6.patch (obsolete) — Splinter Review
(In reply to Nathan Froyd [:froydnj] from comment #66)
> ::: security/manager/ssl/nsNSSComponent.cpp
> @@ -2210,5 @@
> >  
> > -  // To avoid a sqlite3_config race in NSS init, as a workaround for
> > -  // bug 730495, we require the storage service to get initialized first.
> > -  nsCOMPtr<nsISupports> storageService =
> > -    do_GetService(MOZ_STORAGE_SERVICE_CONTRACTID);
> 
> I think that mozStorageService was the only thing propagating out errors of
> sqlite initialization to this piece of code.  Will errors be discovered by
> pieces of code futher down, now that this has been removed, and will those
> errors be handled correctly?

We had added this very recently, it wasn't necessary until we started to use with NSS sqlite. The intention was to avoid the init race with the storage service, to ensure sqlite3_config was already called, prior to calling NSS init. If we use the suggested patch and call sqlite3_config prior to XPCom init, there will no longer be a sqlite init race, and this explicit trigger will no longer be necessary.

If sqlite fails for other reasons, and NSS init fails, we'll still discover that during NSS init.

I have addressed your other comments.
Attachment #8884335 - Attachment is obsolete: true
Attachment #8910817 - Attachment is obsolete: true
Attachment #8910817 - Flags: review?(mh+mozilla)
Attachment #8912431 - Flags: review?(nfroyd)
does this address comment 49 (from the patch it doesn't look like)?
(In reply to Marco Bonardo [::mak] from comment #68)
> does this address comment 49 (from the patch it doesn't look like)?

I had assumed that the telemetry-disable change would be worked on in a separate patch.
(In reply to Kai Engert (:kaie:) from comment #69)

> I had assumed that the telemetry-disable change would be worked on in a
> separate patch.

I agree, it needs a separate bug, but this can't land until that happens, because as Andrew said, there's risk of crashes and that you end up using a VFS you don't want. Surely you can get a preliminary review here.
Unfortunately at this time I'm busy with rolling up work for 57 and Storage doesn't really have a "team", usually who needs a change works on it and the peers provide feedback and reviews. So, unless Andrew has any spare time, you may have to do that work by yourself :(

There are these alternatives if I read correctly all the comments:
1. NSS moves to sqlite3_open_v2 and picks one of the sqlite vfs
2. Storage drops the sqlite3_vfs_register call and instead passes the telemetry vfs in its sqlite3_open_v2 calls
3. Telemetry is removed from TelemetryVFS, it gets renamed to QuotaVFS and we defer deleting the sqlite3_vfs until sqlite3_shutdown

Probably 2 is the simplest to implement, and I'd suggest to go that path.
The first part of 3 is something nice-to-have regardless, but not necessary here unless you want to dedicate your time helping us cleaning Storage.
:mak had already filed bug 1065923 for option 3 (s/TelemetryVFS/QuotaVFS/ with requisite work).  I've taken that bug and will try and turn it around quickly.
Comment on attachment 8912431 [details] [diff] [review]
730495-v6.patch

Review of attachment 8912431 [details] [diff] [review]:
-----------------------------------------------------------------

I will take your word for it that errors will propagate out in the expected manner.
Attachment #8912431 - Flags: review?(nfroyd) → review+
Nathan, thanks for the r+!

I agree that we need to change the VFS logic, prior to changing the NSS default to sqlite, to avoid crashes with the default VFS. Thanks Andrew and Marco for your detailed explanations, they are very helpful.

I'm keen on getting this fixed early for the Firefox 58 cycle, so I'm willing to help. I prefer to avoid changes to NSS, as they would be required to propagate into all copies of NSS used as shared libraries, so a solution that's limited to Gecko has the lowest complexity for me.

I appreciate the option 2 that Marco has explained, which seems very simple to do, and doesn't require us to wait for bug 1065923.

I see that sqlite_open_v2 takes a string parameter of a VFS name, so I think we must continue to register your VFS, just not set it as the default. If the use of the registered VFS is limited to the storage service, we don't have to worry about lifetime beyond the lifetime of the storage service.

I've started to work on a patch for option 2, and expect that I will attach it very soon.

Regarding the NFS awareness you've mentioned, I don't know the background here. I see you're using a pref named storage.nfs_filesystem, which doesn't seem to be set by default, and I cannot find code outside the storage service using it. Is this an optional preference, that only ever gets set manually?
(In reply to Kai Engert (:kaie:) from comment #73)
> Regarding the NFS awareness you've mentioned, I don't know the background
> here. I see you're using a pref named storage.nfs_filesystem, which doesn't
> seem to be set by default, and I cannot find code outside the storage
> service using it. Is this an optional preference, that only ever gets set
> manually?

It's an (hidden? don't recall) pref that some user can try to set when keeping the profile on a shared network. Sometimes it gives better results than a crashing or completely hung Firefox. The reason is the fact some filesystems lie to locking requests, for various reasons.
Okay, I'll hold off on option 3/bug 1065923 until after this bug is resolved.
Blocks: 1065923
No longer depends on: 1065923
Comment on attachment 8912431 [details] [diff] [review]
730495-v6.patch

This patch contains a mistake in Service::initialize(),
if MOZ_STORAGE_MEMORY is undefined,
we don't call sqlite3_config, and the result remains at the default failure, and we fail to construct the storage service. This obviously fails.

The obvious fix is to use

#ifdef MOZ_STORAGE_MEMORY
  sResult = ::sqlite3_config(SQLITE_CONFIG_MALLOC, &memMethods);
#else
  sResult = SQLITE_OK;
#endif

I'll attach an updated revision of this patch, which also renames the static variables to use an 's' prefix, instead of the previous 'm' prefix.
Attached patch 730495-v7.patch (obsolete) — Splinter Review
fix the mistake in v6
Attached patch no-default-vfs (obsolete) — Splinter Review
(In reply to Kai Engert (:kaie:) from comment #73)
> I see that sqlite_open_v2 takes a string parameter of a VFS name, so I think
> we must continue to register your VFS, just not set it as the default. If
> the use of the registered VFS is limited to the storage service, we don't
> have to worry about lifetime beyond the lifetime of the storage service.
> 
> I've started to work on a patch for option 2, and expect that I will attach
> it very soon.

This implementation uses a single copy of the string "telemetry-vfs". Given that you already have the external function ConstructTelemetryVFS(), I thought it might be ok to have another external function GetVFSName().

The patch, combined with the others, passes the problematic tests with my local build. (I haven't yet started a try build.)

Marco, what do you think?
Attachment #8913412 - Flags: review?(mak77)
Try build that combines patch v7 and the incremental no-default-vfs patch:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b5275886ad440875b3f709cccea7f84bba7369ba
Comment on attachment 8913412 [details] [diff] [review]
no-default-vfs

I'll leave the final say up to :mak, but this works for me and is nicely minimal.
Attachment #8913412 - Flags: feedback+
Comment on attachment 8913407 [details] [diff] [review]
730495-v7.patch

Review of attachment 8913407 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/xre/AutoSQLiteLifetime.cpp
@@ +13,5 @@
> +#    include "DMD.h"
> +#  endif
> +#endif
> +
> +#ifdef MOZ_STORAGE_MEMORY

Looks like you can remove the previous endif and this ifdef.

@@ +160,5 @@
> +  // Shutdown the sqlite3 API.  Warn if shutdown did not turn out okay, but
> +  // there is nothing actionable we can do in that case.
> +  sResult = ::sqlite3_shutdown();
> +  if (sResult != SQLITE_OK)
> +    NS_WARNING("sqlite3 did not shutdown cleanly.");

nit: You can probably use
NS_WARNING_ASSERTION(sResult == SQLITE_OK,
                     "sqlite3 did not shutdown cleanly.");
Comment on attachment 8913412 [details] [diff] [review]
no-default-vfs

Review of attachment 8913412 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM, thanks!
Attachment #8913412 - Flags: review?(mak77) → review+
could you please mark obsolete patches so that the bug status is clearer?
Attachment #8621569 - Attachment is obsolete: true
Attachment #8728906 - Attachment is obsolete: true
Attachment #8913412 - Attachment description: no-default-vfs.cpp → no-default-vfs
Attachment #8913412 - Attachment filename: no-default-vfs.cpp → no-default-vfs
This patch combines the earlier patch v7 and incremental patch no-default-vfs.

I assume the trivial changes between v6 and v7 (rename and default assignment) don't require another review from Nathan, and Marco has already looked at the new patch. I've also addressed the minor changes Marco requested in comment 81.

I'm carrying forward the r+, as a combined r=asuth/froydnj/mak
Assignee: kchen → kaie
Attachment #8912431 - Attachment is obsolete: true
Attachment #8913407 - Attachment is obsolete: true
Attachment #8913412 - Attachment is obsolete: true
Attachment #8913620 - Flags: review+
Pushed by kaie@kuix.de:
https://hg.mozilla.org/integration/mozilla-inbound/rev/be147f49de8e
guarantee that sqlite3_config is called before any other SQLite function, r=asuth, r=froydnj, r=mak
Blocks: 783994
https://hg.mozilla.org/mozilla-central/rev/be147f49de8e
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
Depends on: 1523874
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: