Closed Bug 848136 Opened 11 years ago Closed 11 years ago

Use WAL for Health Report database

Categories

(Firefox Health Report Graveyard :: Client: Desktop, defect, P2)

Tracking

(firefox20 unaffected, firefox21+ fixed, firefox22 fixed)

RESOLVED FIXED
Firefox 22

People

(Reporter: gps, Assigned: gps)

Details

Attachments

(2 files, 2 obsolete files)

xpcshell\tests\services\datareporting\tests\xpcshell\test_policy.js | running test ...
xpcshell\tests\services\datareporting\tests\xpcshell\test_policy.js | test passed (time: 2482.000ms)
xpcshell\tests\services\datareporting\tests\xpcshell\test_session_recorder.js | running test ...
services\datareporting\tests\xpcshell\test_session_recorder.js | test passed (time: 1760.000ms)
services\healthreport\tests\xpcshell\test_load_modules.js | running test ...
services\healthreport\tests\xpcshell\test_load_modules.js | test passed (time: 149.000ms)
services\healthreport\tests\xpcshell\test_profile.js | running test ...
services\healthreport\tests\xpcshell\test_profile.js | test passed (time: 5374.000ms)
services\healthreport\tests\xpcshell\test_healthreporter.js | running test ...
services\healthreport\tests\xpcshell\test_healthreporter.js | test passed (time: 344003.000ms)
services\healthreport\tests\xpcshell\test_provider_addons.js | running test ...
services\healthreport\tests\xpcshell\test_provider_addons.js | test passed (time: 7065.000ms)
services\healthreport\tests\xpcshell\test_provider_appinfo.js | running test ...
services\healthreport\tests\xpcshell\test_provider_appinfo.js | test passed (time: 19387.000ms)
services\healthreport\tests\xpcshell\test_provider_crashes.js | running test ...
services\healthreport\tests\xpcshell\test_provider_crashes.js | test passed (time: 9617.000ms)
services\healthreport\tests\xpcshell\test_provider_places.js | running test ...
services\healthreport\tests\xpcshell\test_provider_places.js | test passed (time: 3420.000ms)
services\healthreport\tests\xpcshell\test_provider_searches.js | running test ...
services\healthreport\tests\xpcshell\test_provider_searches.js | test passed (time: 9068.000ms)
services\healthreport\tests\xpcshell\test_provider_sysinfo.js | running test ...
services\healthreport\tests\xpcshell\test_provider_sysinfo.js | test passed (time: 4621.000ms)
services\healthreport\tests\xpcshell\test_provider_sessions.js | running test ...
services\healthreport\tests\xpcshell\test_provider_sessions.js | test passed (time: 15855.000ms)
services\metrics\tests\xpcshell\test_load_modules.js | running test ...
services\metrics\tests\xpcshell\test_load_modules.js | test passed (time: 132.000ms)
services\metrics\tests\xpcshell\test_metrics_provider.js | running test ...
services\metrics\tests\xpcshell\test_metrics_provider.js | test passed (time: 21399.000ms)
services\metrics\tests\xpcshell\test_metrics_collector.js | running test ...
services\metrics\tests\xpcshell\test_metrics_collector.js | test passed (time: 61681.000ms)
services\metrics\tests\xpcshell\test_metrics_storage.js | running test ...
services\metrics\tests\xpcshell\test_metrics_storage.js | test passed (time: 144678.000ms)

344s (5:44) - yikes!

This is presumably caused by SQLite interaction. Since our tests instantiate a new database for each test (within each test file), and DB creation, schema setup, and population are expensive, this is likely the largest source of the time.

Another source could be that the Windows builders are just plain slow. I've noticed this from the instrumented builds I've pushed out to try - I/O on the Windows machines is substantially worse than on the Linux and OS X builders (OS X especially).

By comparison, the long pole (test_healthreporter.js) runs in only 13.841s on OS X.

It's also possible SQLite on Windows is that much slower or that we are failing to use a SQLite optimization/PRAGMA that will make Windows much faster. Perhaps Marco could shed some light.
Will probably be helped by flushing less stuff, but keeping an explicit bug around for searchability.
Priority: -- → P4
I was experimenting with things locally (on OS X with an SSD however) and noticed that setting the journal mode to WAL (default is DELETE) decreased test_healthreporter.js's execution time from 3.9s to 1.9s. Not too shabby. Try at https://tbpl.mozilla.org/?tree=Try&rev=3bafe34d5c0e.
test_healthreporter.js executed in ~34s on Windows with this try push. That's a speedup of about 10x over journal_mode=TRUNCATE (our current default). Woah!

It looks like cookies is using WAL: https://mxr.mozilla.org/mozilla-central/source/netwerk/cookie/nsCookieService.cpp#1099

As is Places: https://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/Database.cpp#556

Time to investigate WAL! Marco should be able to help with that! Perhaps Sqlite.jsm should use WAL by default? Should we also consider loosening synchronous to NORMAL if we switch to WAL? The SQLite docs seem to indicate NORMAL is probably fine for WAL...
Flags: needinfo?(mak77)
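The journal-mode pragmas under discussion can be tried outside Gecko. Below is a minimal sketch using Python's stdlib sqlite3 (the actual FHR code is JS in storage.jsm; the table and values here are made up for illustration):

```python
import os
import sqlite3
import tempfile

# WAL needs a real database file, not :memory:.
path = os.path.join(tempfile.mkdtemp(), "health.sqlite")
conn = sqlite3.connect(path)

# Switch from the default rollback journal to write-ahead logging.
# The pragma returns the journal mode now in effect.
journal_mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]

# With WAL, synchronous=NORMAL still guarantees consistency; at worst
# the last few transactions before a power failure are lost.
conn.execute("PRAGMA synchronous=NORMAL")

conn.execute("CREATE TABLE provider_state (name TEXT, value TEXT)")
conn.execute("INSERT INTO provider_state VALUES ('last_collect', 'now')")
conn.commit()
conn.close()
```

With WAL active, ordinary commits append to the -wal file instead of rewriting (and fsyncing) the main database on every transaction, which is where the test-time win comes from.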
Only downside to WAL is startup work on unclean shutdown, AFAIK. I thought we already had it on for FHR, to be honest!
That much time in truncate mode is not common; it's likely the test is doing something unexpected...
What we know is that our Windows tinderboxes, for whatever reason, are extremely slow with fsyncs. WAL mode just reduces the number of fsyncs, which may easily explain the speedup. I know that IT disabled write flushes on some Windows tinderboxes due to this.
There are no big issues with WAL; mostly you sacrifice durability, but it's very likely you don't need it. You also have to run some additional pragmas on startup (a minor problem, though your measurements may have been done without those; see the pragmas run in Places' Database.cpp).
The startup work should not be an issue if you properly close the db connection on shutdown.

Regarding synchronous, it's already NORMAL by default and we use that with WAL, we use synchronous full with truncate in some components.
Flags: needinfo?(mak77)
Actually, consumers may open a connection to an existing profile db, and if we change the journal in Sqlite.jsm it may end up forcing a different journal on the db (if it's using a different one), so that doesn't sound like a good idea for now.
On my Linux VM (which is backed by a magnetic HD), using WAL reduces the execution time from 35s to 6.5s. Also, while the test is executing, my system monitor says the HD is writing out at 30MB/s in TRUNCATE mode and ~50MB/s in WAL mode. That seems excessive. I wonder what SQLite is doing behind the scenes. Do I care enough to break out the syscall tracer?
Err, 20MB/s in TRUNCATE mode, not 30MB/s. Still...

I really wish we had things like automatic I/O monitoring on every test just so we could catch things like this. The more data you are presented with, the more questions you start asking.
Truncate mode slows down to wait on fsyncs; WAL doesn't, since only checkpoints do fsyncs.
Summary: Health Report xpcshell tests take a long time to run on Windows builders → Use WAL for Health Report database
Attached patch Use WAL for FHR database, v1 (obsolete) — Splinter Review
I emulated the logic from Places' Database.cpp.
Assignee: nobody → gps
Status: NEW → ASSIGNED
Attachment #728315 - Flags: review?(rnewman)
Attachment #728315 - Flags: review?(mak77)
Comment on attachment 728315 [details] [diff] [review]
Use WAL for FHR database, v1

Review of attachment 728315 [details] [diff] [review]:
-----------------------------------------------------------------

I'm pretty happy with this, but do defer to mak on the niceties!

::: services/metrics/storage.jsm
@@ +1136,5 @@
> +        yield self._connection.execute("PRAGMA wal_autocheckpoint=" +
> +                                       Math.ceil(self.MAX_WAL_SIZE_KB * 1024 / pageSize));
> +      } else {
> +        if (journalMode != "truncate") {
> +         // Fall back to truncate (which is faster than delete.

Align, missing paren.
Attachment #728315 - Flags: review?(rnewman) → review+
This didn't cause a perf win according to my try push. I suspect our MAX_WAL_SIZE_KB isn't large enough. It appears to be set at 1. I /think/ it may need to be 2 for any kind of win.
I increased MAX_WAL_SIZE_KB from 32 to 128 and pushed to try: https://tbpl.mozilla.org/?tree=Try&rev=70690b042442
How did you choose the value for MAX_WAL_SIZE_KB? What are you aiming at: 1 fsync in the life of the app? Just 1 fsync at shutdown?
This value should be chosen based on measurements of the sizes the journal commonly grows to (32KB is likely too low; the journal would likely always be over that value and fsync every time).
Does the healthreport connection stay open for the whole app lifetime, or is it closed when done?
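The sizing logic the patch performs (convert a KB budget into a page count for wal_autocheckpoint) can be sketched standalone with Python's stdlib sqlite3; the MAX_WAL_SIZE_KB value here is the one from the patch, not a recommendation:

```python
import math
import sqlite3

MAX_WAL_SIZE_KB = 512  # budget from the patch; tune per measurements

conn = sqlite3.connect(":memory:")

# wal_autocheckpoint is denominated in pages, so convert the KB budget
# using the database's actual page size rather than assuming a default.
row = conn.execute("PRAGMA page_size").fetchone()
page_size = row[0] if row else 1024

# Round up so the threshold never undershoots the KB budget.
pages = math.ceil(MAX_WAL_SIZE_KB * 1024 / page_size)
conn.execute("PRAGMA wal_autocheckpoint=%d" % pages)
conn.close()
```

Rounding up means the page threshold always covers at least the configured KB budget, whatever the page size happens to be.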
Upped the WAL log limit to 512KB. Also set synchronous=NORMAL because full durability is not worth the performance loss from the extra fsync at every DB insert/update.

https://www.sqlite.org/pragma.html#pragma_synchronous
Attachment #728315 - Attachment is obsolete: true
Attachment #728315 - Flags: review?(mak77)
Attachment #729243 - Flags: review?(rnewman)
Attachment #729243 - Flags: review?(mak77)
Comment on attachment 729243 [details] [diff] [review]
Use WAL for FHR database, v2

Review of attachment 729243 [details] [diff] [review]:
-----------------------------------------------------------------

looks fine (apart from the useless synchronous set); should still add the checkpoint after the big write, as you suggested on IRC.

::: services/metrics/storage.jsm
@@ +1115,5 @@
> +      // does.
> +      let rows = yield self._connection.execute("PRAGMA page_size");
> +      let pageSize = 1024;
> +      if (rows.length) {
> +        pageSize = rows[0].getResultByIndex(0);

I think rows[0].page_size could just work fine, but please verify

@@ +1121,5 @@
> +
> +      self._log.debug("Page size is " + pageSize);
> +
> +      // Ensure temp tables are stored in memory, not on disk.
> +      yield self._connection.execute("PRAGMA temp_store=MEMORY");

at first glance it doesn't look like you have any TEMP or TEMPORARY items, so this doesn't look useful for now

@@ +1139,5 @@
> +        // We trade better performance for less durability because if we
> +        // lose some data due to a crash, power failure, etc, it isn't
> +        // the end of the world. If data is too important to lose, we
> +        // can always force a WAL checkpoint.
> +        yield self._connection.execute("PRAGMA synchronous=NORMAL");

this is pointless, normal is the default synchronous setting of Storage.
Attachment #729243 - Flags: review?(mak77) → review+
(In reply to Marco Bonardo [:mak] (Away Mar 27 - Apr 2) from comment #17)
> ::: services/metrics/storage.jsm
> @@ +1115,5 @@
> > +      // does.
> > +      let rows = yield self._connection.execute("PRAGMA page_size");
> > +      let pageSize = 1024;
> > +      if (rows.length) {
> > +        pageSize = rows[0].getResultByIndex(0);
> 
> I think rows[0].page_size could just work fine, but please verify

undefined. Do we need to QI the row in Sqlite.jsm?
 
> @@ +1121,5 @@
> > +
> > +      self._log.debug("Page size is " + pageSize);
> > +
> > +      // Ensure temp tables are stored in memory, not on disk.
> > +      yield self._connection.execute("PRAGMA temp_store=MEMORY");
> 
> at a first glance doesn't look like you have any TEMP or TEMPORARY items, so
> this doesn't look useful for now

No, we don't. But, it should be cheap. And, don't databases create temporary tables automatically sometimes as part of statement execution? Maybe I'm thinking of crazy MySQL behavior?

> @@ +1139,5 @@
> > +        // We trade better performance for less durability because if we
> > +        // lose some data due to a crash, power failure, etc, it isn't
> > +        // the end of the world. If data is too important to lose, we
> > +        // can always force a WAL checkpoint.
> > +        yield self._connection.execute("PRAGMA synchronous=NORMAL");
> 
> this is pointless, normal is the default synchronous setting of Storage.

I confirmed the default was normal (1) and removed this addition.
(In reply to Gregory Szorc [:gps] from comment #18)
> > I think rows[0].page_size could just work fine, but please verify
> 
> undefined. Do we need to QI the row in Sqlite.jsm?

Ah right, this is async; bug 512761 hits us yet again

> No, we don't. But, it should be cheap. And, don't databases create temporary
> tables automatically sometimes as part of statement execution? Maybe I'm
> thinking of crazy MySQL behavior?

nope, nothing that crazy afaik
Attachment #729277 - Flags: review?(rnewman)
Comment on attachment 729277 [details] [diff] [review]
Part 2: Checkpoint WAL after collect, v1

Review of attachment 729277 [details] [diff] [review]:
-----------------------------------------------------------------

This looks fine, except that we almost certainly want to log and measure the shit out of this.

::: services/healthreport/healthreporter.jsm
@@ +536,5 @@
>  
> +      // Flush gathered data to disk. This will incur an fsync. But, if
> +      // there is ever a time we want to persist data to disk, it's
> +      // after a massive collection.
> +      yield this._storage.checkpoint();

NEEDS MOAR TELEMETRY.
Attachment #729277 - Flags: review?(rnewman) → feedback+
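The checkpoint-after-collect idea can be illustrated outside Gecko with Python's stdlib sqlite3 (the real checkpoint() lives in storage.jsm; the schema and row counts below are invented):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "fhr.sqlite")
conn = sqlite3.connect(path)
conn.execute("PRAGMA journal_mode=WAL")

# Simulate a "massive collection": lots of writes landing in the WAL.
conn.execute("CREATE TABLE measurements (day INTEGER, value INTEGER)")
conn.executemany("INSERT INTO measurements VALUES (?, ?)",
                 [(i, i * 2) for i in range(1000)])
conn.commit()

# Force the WAL contents into the main database file. This incurs the
# fsync that WAL deferred, which is exactly what you want right after
# a large burst of writes you care about persisting.
busy, log_frames, ckpt_frames = conn.execute(
    "PRAGMA wal_checkpoint(FULL)").fetchone()
conn.close()
```

A FULL checkpoint reports (busy, frames-in-log, frames-checkpointed); with a single connection it completes cleanly and checkpoints every frame.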
Comment on attachment 729243 [details] [diff] [review]
Use WAL for FHR database, v2

Review of attachment 729243 [details] [diff] [review]:
-----------------------------------------------------------------

Little to add over Marco's comments. Do put ",mak" in the r= remark.

::: services/metrics/storage.jsm
@@ +731,5 @@
>  }
>  
>  MetricsStorageSqliteBackend.prototype = Object.freeze({
> +  // Max size (in kibibytes) the WAL log is allowed to grow to before it is
> +  // checkpointed.

Add a little note about how we came to this value, and how to calculate a better version a year from now. A pointer back to the bug with explanation would suffice.
Attachment #729243 - Flags: review?(rnewman) → review+
Hmm. Did some testing on my Linux VM (backed by magnetic storage). I timed test_healthreporter.js:

512KB - 14.25s +- 0.5
1MB   - 10 +- 0.5
2MB   - 7.5 +- 0.5
4MB+  - 6.25 +- 0.5

Please note that this test creates a lot of new empty databases and is probably not indicative of real-world use. Still, 2x wall time improvement for 2MB vs 512KB. Hmmm...
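A rough way to reproduce this kind of measurement with Python's stdlib sqlite3 (absolute numbers will vary wildly by disk and OS; the schema, row count, and threshold values are made up for illustration):

```python
import os
import sqlite3
import tempfile
import time

def timed_inserts(autocheckpoint_pages, rows=500):
    # Fresh database per run, mirroring how each xpcshell test
    # instantiates its own database.
    path = os.path.join(tempfile.mkdtemp(), "bench.sqlite")
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA wal_autocheckpoint=%d" % autocheckpoint_pages)
    conn.execute("CREATE TABLE t (v INTEGER)")
    start = time.monotonic()
    for i in range(rows):
        conn.execute("INSERT INTO t VALUES (?)", (i,))
        conn.commit()  # one transaction per row: many small writes
    elapsed = time.monotonic() - start
    conn.close()
    return elapsed

# A smaller autocheckpoint threshold checkpoints (and fsyncs) more often
# during the run; a larger one defers that work.
timings = {pages: timed_inserts(pages) for pages in (4, 1024)}
```

This also makes Marco's point below concrete: the larger threshold doesn't eliminate the fsyncs, it mostly shifts them to close/shutdown.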
Larger WALs increase read times, and as I said, what you are doing by increasing it is just moving all of the fsyncs to shutdown. That may not matter in a test, but it can matter in large apps.
I'm going to assert this is at least a P2 because of performance implications.
Priority: P4 → P2
Telemetry!
Attachment #729277 - Attachment is obsolete: true
Attachment #729757 - Flags: review?(rnewman)
Attachment #729757 - Flags: review?(rnewman) → review+
https://hg.mozilla.org/mozilla-central/rev/62b4fc06bf44
https://hg.mozilla.org/mozilla-central/rev/4c50962fcb8f
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Whiteboard: [fixed in services]
Target Milestone: --- → mozilla22
We should consider uplifting this to 21 because of significant performance wins.
Comment on attachment 729243 [details] [diff] [review]
Use WAL for FHR database, v2

[Approval Request Comment]
Bug caused by (feature/regressing bug #): FHR
User impact if declined: FHR DB performance may cause excessive I/O, resulting in bad Firefox perf.
Testing completed (on m-c, etc.): It has baked for a day. I didn't see a large uptick in database errors from submitted payloads. Try runs reveal perf wins are huge.
Risk to taking this patch (and alternatives if risky): Should be low.
String or IDL/UUID changes made by this patch: None

We arguably should have landed this code with the initial landing of FHR. It's been sitting on the backburner for a while because we didn't realize how significant the perf wins were. We didn't notice before because all I/O is off the main thread, so there was no jank indicating we were being slow.
Attachment #729243 - Flags: approval-mozilla-aurora?
Comment on attachment 729757 [details] [diff] [review]
Part 2: Checkpoint WAL after collect, v2

[Approval Request Comment]
See previous.
Attachment #729757 - Flags: approval-mozilla-aurora?
Comment on attachment 729243 [details] [diff] [review]
Use WAL for FHR database, v2

Considering the data from the try run, and given the stats at https://bugzilla.mozilla.org/show_bug.cgi?id=854606#c17, approving this patch in favor of the performance gains.
Attachment #729243 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #729757 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Component: Metrics and Firefox Health Report → Client: Desktop
Flags: in-testsuite+
Product: Mozilla Services → Firefox Health Report
Target Milestone: mozilla22 → Firefox 22
Product: Firefox Health Report → Firefox Health Report Graveyard