Closed Bug 1157950 Opened 7 years ago Closed 7 years ago

nsPermissionManager shouldn't opt in to DB corruption and data loss "for performance"

Categories

(Core :: Networking: Cookies, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla41
Tracking Status
firefox40 --- affected
firefox41 --- fixed

People

(Reporter: jld, Assigned: jld)

References

(Blocks 2 open bugs)

Details

Attachments

(1 file)

nsPermissionManager.cpp does this:

> // make operations on the table asynchronous, for performance
> mDBConn->ExecuteSimpleSQL(NS_LITERAL_CSTRING("PRAGMA synchronous = OFF"));

The SQLite documentation (https://www.sqlite.org/pragma.html#pragma_synchronous) warns that “With synchronous OFF (0), … the database might become corrupted if the operating system crashes or the computer loses power before that data has been written to the disk surface.”

Bug 1143607 has run into this on the B2G emulator — the OS seems to cache unsynced writes in memory for a relatively long time (possibly due to tuning for power use or low-end Flash memory), and exiting the emulator is roughly equivalent to a hard reboot.  However, this could also occur in real-world situations; we really shouldn't risk DB corruption or data loss, especially on B2G where losing the permissions DB essentially means the device is bricked.
Cookies, others do this too.
Attached patch PatchSplinter Review
Attachment #8597457 - Flags: review?(josh)
This feels like something we should fix in all affected consumers at the same time, or at least make it clear that we are deciding not to adjust the other consumers listed in http://mxr.mozilla.org/mozilla-central/search?string=pragma%20synchronous%20%3D%20off .
I don't think it's a good idea to do this without at least moving to WAL, permissions.sqlite is one of the offenders in the Slow SQL telemetry dashboard already.
(In reply to Marco Bonardo [::mak] from comment #4)
> I don't think it's a good idea to do this without at least moving to WAL,
> permissions.sqlite is one of the offenders in the Slow SQL telemetry
> dashboard already.

But corrupted (and therefore lost) permissions are a pretty big disaster (much worse than losing cookies, for example).
I'm going to end up abiding by whatever people who know about the database-level details say is best, since I know nothing about our storage implementation. This review should really go to somebody else.
Comment on attachment 8597457 [details] [diff] [review]
Patch

I'll go redo this with WAL/NORMAL.  It would help if there were some sort of benchmarking I could do locally, to get an idea of whether it's tuned reasonable well, but maybe I can improvise something with the B2G first-boot Webapps.jsm stuff.
Attachment #8597457 - Flags: review?(josh)
One problem with non-durable permissions changes is that we'll need to expose some way to explicitly checkpoint the DB before making interdependent changes to a different database — e.g., Webapps.json setting prefs (which *is* durable, but even making changes to a different non-durable sqlite file would be a consistency hazard).
I tried doing some simple experiments on B2G with my Flame, with Date.now() calls in Webapps.json around the preinstalled app processing.  The two options with integrity (DELETE/FULL and WAL/NORMAL) seemed to be slightly slower (<10%) than without integrity (DELETE/OFF), but there was enough noise that that might not be significant.  There didn't seem to be a difference between the two integrity options.

However, this was on a phone, using ext4 on eMMC Flash, so it could have very different performance characteristics to the systems where (according to comment #4) permissions.sqlite is a performance concern.  This is really not the right way to try to get meaningful information.

The other thing about WAL is that instead of this:

-rw------- root     root        57344 2015-05-04 22:19 permissions.sqlite

we get this:

-rw------- root     root         4096 2015-05-04 22:25 permissions.sqlite
-rw------- root     root        32768 2015-05-04 22:25 permissions.sqlite-shm
-rw------- root     root      1624480 2015-05-04 22:25 permissions.sqlite-wal

Also, I have reproduced a lack of durability on B2G with WAL/NORMAL (bricking the device, because it happened to the first-boot permission writes).  To reiterate comment #8, that's not a complete solution without extending the permissions manager API to add an explicit barrier operation and making sure it's used where it needs to be.
(In reply to Marco Bonardo [::mak] from comment #4)
> I don't think it's a good idea to do this without at least moving to WAL,
> permissions.sqlite is one of the offenders in the Slow SQL telemetry
> dashboard already.

Do you have a link to the data for this?  It would help to know in more detail what we're concerned about in terms of performance.
Flags: needinfo?(mak77)
http://telemetry.mozilla.org/slowsql/ holds all our telemetry data.
Flags: needinfo?(mak77)
(In reply to Ben Turner [:bent] (use the needinfo flag!) from comment #11)
> http://telemetry.mozilla.org/slowsql/ holds all our telemetry data.

So what shows up is creating the DB (once per profile):

  CREATE TABLE moz_hosts ( ... )

Upgrading the DB schema (at most once per profile):

  ALTER TABLE moz_hosts ADD modificationTime INTEGER

I'm not sure what the deal with these two is, because I don't see why these would take significant time:

  PRAGMA user_version = 4
  PRAGMA cache_size = -2048

And then this, which is once per session and could maybe benefit from an index or similar optimization if we're concerned about it:

  DELETE FROM moz_hosts WHERE expireType = ?1 AND expireTime <= ?2

I'm not seeing anything that's helpful in deciding whether or not it's reasonable for us to be concerned about the cost of doing disk I/O when adding individual permissions.

<hr />

One approach would be to just remove the current pragma (use the mozStorage default instead of OFF), let that bake on Nightly for a week, and see what the telemetry looks like after that.
:bent points out on IRC that all of the SQL statements I mentioned in comment #12 are executed on the main thread.  Adding and removing permissions are done asynchronously off the main thread, but they don't show up in the top 100 for any metric in the off-main-thread side of the telemetry dashboard.  This also means that, even if those operations become slower, that won't block the main thread.

This *also* also means that comment #8 is potentially still a problem even with `synchronous = FULL`; Webapps.jsm can't be completely fixed without a permission manager method to synchronize with the storage thread.
The main difference between wal+normal and normal is the number of fsyncs, much lower for the former. fsyncs are usually nor very nice to the system IO, I think they are slightly better on linux where they are fdatasync, so they might be less of a concern on b2g.

(In reply to Jed Davis [:jld] {UTC-7} from comment #9)
> The other thing about WAL is that instead of this:
> 
> -rw------- root     root        57344 2015-05-04 22:19 permissions.sqlite
> 
> we get this:
> 
> -rw------- root     root         4096 2015-05-04 22:25 permissions.sqlite
> -rw------- root     root        32768 2015-05-04 22:25 permissions.sqlite-shm
> -rw------- root     root      1624480 2015-05-04 22:25 permissions.sqlite-wal

you should still have a -journal file even without wal, unless you use MEMORY journal mode, but then you'd again have durability issues. usually TRUNCATE journal mode has decent behavior on linux.
the -shm file can be "removed" by using exclusive locking (though that means you can open the database only in one process and can't clone the connection).

> Also, I have reproduced a lack of durability on B2G with WAL/NORMAL
> (bricking the device, because it happened to the first-boot permission
> writes).  To reiterate comment #8, that's not a complete solution without
> extending the permissions manager API to add an explicit barrier operation
> and making sure it's used where it needs to be.

https://sqlite.org/pragma.html#pragma_wal_checkpoint

Btw, if there are time and resources contraints, surely moving to normal as a first step would already be better than not having any data safety.
Yeah, I say we just file a followup for making these dbs use WAL, and just remove the SYNCHRONOUS=OFF from everything. Jed doesn't need to be responsible for the performance of these databases.
Attachment #8597457 - Flags: review?(ehsan)
See Also: → 1165165
Comment on attachment 8597457 [details] [diff] [review]
Patch

Review of attachment 8597457 [details] [diff] [review]:
-----------------------------------------------------------------

r=me based on the information on the bug so far.
Attachment #8597457 - Flags: review?(ehsan) → review+
https://treeherder.mozilla.org/#/jobs?repo=try&revision=fcbfa4d367ed and I think all the failures are not-my-fault intermittents?
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/3b420a568ecb
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla41
You need to log in before you can comment on or make changes to this bug.