Closed Bug 848136 Opened 11 years ago Closed 11 years ago

Use WAL for Health Report database

Categories

(Firefox Health Report Graveyard :: Client: Desktop, defect, P2)

Tracking

(firefox20 unaffected, firefox21+ fixed, firefox22 fixed)

RESOLVED FIXED
Firefox 22

People

(Reporter: gps, Assigned: gps)

Details

Attachments

(2 files, 2 obsolete files)

xpcshell\tests\services\datareporting\tests\xpcshell\test_policy.js | running test ...
xpcshell\tests\services\datareporting\tests\xpcshell\test_policy.js | test passed (time: 2482.000ms)
xpcshell\tests\services\datareporting\tests\xpcshell\test_session_recorder.js | running test ...
services\datareporting\tests\xpcshell\test_session_recorder.js | test passed (time: 1760.000ms)
services\healthreport\tests\xpcshell\test_load_modules.js | running test ...
services\healthreport\tests\xpcshell\test_load_modules.js | test passed (time: 149.000ms)
services\healthreport\tests\xpcshell\test_profile.js | running test ...
services\healthreport\tests\xpcshell\test_profile.js | test passed (time: 5374.000ms)
services\healthreport\tests\xpcshell\test_healthreporter.js | running test ...
services\healthreport\tests\xpcshell\test_healthreporter.js | test passed (time: 344003.000ms)
services\healthreport\tests\xpcshell\test_provider_addons.js | running test ...
services\healthreport\tests\xpcshell\test_provider_addons.js | test passed (time: 7065.000ms)
services\healthreport\tests\xpcshell\test_provider_appinfo.js | running test ...
services\healthreport\tests\xpcshell\test_provider_appinfo.js | test passed (time: 19387.000ms)
services\healthreport\tests\xpcshell\test_provider_crashes.js | running test ...
services\healthreport\tests\xpcshell\test_provider_crashes.js | test passed (time: 9617.000ms)
services\healthreport\tests\xpcshell\test_provider_places.js | running test ...
services\healthreport\tests\xpcshell\test_provider_places.js | test passed (time: 3420.000ms)
services\healthreport\tests\xpcshell\test_provider_searches.js | running test ...
services\healthreport\tests\xpcshell\test_provider_searches.js | test passed (time: 9068.000ms)
services\healthreport\tests\xpcshell\test_provider_sysinfo.js | running test ...
services\healthreport\tests\xpcshell\test_provider_sysinfo.js | test passed (time: 4621.000ms)
services\healthreport\tests\xpcshell\test_provider_sessions.js | running test ...
services\healthreport\tests\xpcshell\test_provider_sessions.js | test passed (time: 15855.000ms)
services\metrics\tests\xpcshell\test_load_modules.js | running test ...
services\metrics\tests\xpcshell\test_load_modules.js | test passed (time: 132.000ms)
services\metrics\tests\xpcshell\test_metrics_provider.js | running test ...
services\metrics\tests\xpcshell\test_metrics_provider.js | test passed (time: 21399.000ms)
services\metrics\tests\xpcshell\test_metrics_collector.js | running test ...
services\metrics\tests\xpcshell\test_metrics_collector.js | test passed (time: 61681.000ms)
services\metrics\tests\xpcshell\test_metrics_storage.js | running test ...
services\metrics\tests\xpcshell\test_metrics_storage.js | test passed (time: 144678.000ms)

344s (5:44) - yikes!

This is presumably caused by SQLite interaction. Since our tests instantiate a new database for each test (within each test file), and DB creation, schema setup, and population are expensive, this is likely the largest source of the time.

Another source could be that the Windows builders are just plain slow. I've noticed this from the instrumented builds I've pushed out to try - I/O on the Windows machines is substantially worse than on the Linux and OS X builders (OS X especially).

By comparison, the long pole (test_healthreporter.js) runs in only 13.841s on OS X.

It's also possible SQLite on Windows is that much slower or that we are failing to use a SQLite optimization/PRAGMA that will make Windows much faster. Perhaps Marco could shed some light.
Will probably be helped by flushing less stuff, but keeping an explicit bug around for searchability.
Priority: -- → P4
I was experimenting with things locally (on OS X with an SSD however) and noticed that setting the journal mode to WAL (default is DELETE) decreased test_healthreporter.js's execution time from 3.9s to 1.9s. Not too shabby. Try at https://tbpl.mozilla.org/?tree=Try&rev=3bafe34d5c0e.
test_healthreporter.js executed in ~34s on Windows with this try push. That's a speedup of about 10x over journal_mode=TRUNCATE (our current default). Woah!

It looks like cookies is using WAL: https://mxr.mozilla.org/mozilla-central/source/netwerk/cookie/nsCookieService.cpp#1099

As is Places: https://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/Database.cpp#556

Time to investigate WAL! Marco should be able to help with that! Perhaps Sqlite.jsm should use WAL by default? Should we also consider loosening synchronous to NORMAL if we switch to WAL? The SQLite docs seem to indicate NORMAL is probably fine for WAL...
Flags: needinfo?(mak77)
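The journal-mode pragmas under discussion can be tried outside Gecko. Below is a minimal sketch using Python's stdlib sqlite3 (the actual FHR code is JS in storage.jsm; the table and values here are made up for illustration):

```python
import os
import sqlite3
import tempfile

# WAL needs a real database file, not :memory:.
path = os.path.join(tempfile.mkdtemp(), "health.sqlite")
conn = sqlite3.connect(path)

# Switch from the default rollback journal to write-ahead logging.
# The pragma returns the journal mode now in effect.
journal_mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]

# With WAL, synchronous=NORMAL still guarantees consistency; at worst
# the last few transactions before a power failure are lost.
conn.execute("PRAGMA synchronous=NORMAL")

conn.execute("CREATE TABLE provider_state (name TEXT, value TEXT)")
conn.execute("INSERT INTO provider_state VALUES ('last_collect', 'now')")
conn.commit()
conn.close()
```

With WAL active, ordinary commits append to the -wal file instead of rewriting (and fsyncing) the main database on every transaction, which is where the test-time win comes from.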
Only downside to WAL is startup work on unclean shutdown, AFAIK. I thought we already had it on for FHR, to be honest!
That much time in truncate mode is not common; it's likely the test is doing something unexpected...
What we know is that our Windows tinderboxes, for whatever reason, are extremely slow with fsyncs. WAL mode just reduces the number of fsyncs, which may easily explain the speedup. I know that IT disabled write flushes on some Windows tinderboxes due to this.
There are no big issues with WAL; mostly you sacrifice durability, but it's very likely you don't need it. You also have to run some additional pragmas on startup (a minor problem, though your measurements may have been done without those; see the pragmas run in Places' Database.cpp).
The startup work should not be an issue if you properly close the db connection on shutdown.

Regarding synchronous, it's already NORMAL by default and we use that with WAL, we use synchronous full with truncate in some components.
Flags: needinfo?(mak77)
Actually, consumers may open a connection to an existing profile db, and if we change the journal in Sqlite.jsm it may end up forcing a different journal on the db (if it's using a different one), so that doesn't sound like a good idea for now.
On my Linux VM (which is backed by a magnetic HD), using WAL reduces the execution time from 35s to 6.5s. Also, while the test is executing, my system monitor says the HD is writing out at 30MB/s in TRUNCATE mode and ~50MB/s in WAL mode. That seems excessive. I wonder what SQLite is doing behind the scenes. Do I care enough to break out the syscall tracer?
Err, 20MB/s in TRUNCATE mode, not 30MB/s. Still...

I really wish we had things like automatic I/O monitoring on every test just so we could catch things like this. The more data you are presented with, the more questions you start asking.
Truncate mode slows down to wait on fsyncs; WAL doesn't, since only checkpoints do fsyncs.
Summary: Health Report xpcshell tests take a long time to run on Windows builders → Use WAL for Health Report database
Attached patch Use WAL for FHR database, v1 (obsolete) — Splinter Review
I emulated the logic from Places' Database.cpp.
Assignee: nobody → gps
Status: NEW → ASSIGNED
Attachment #728315 - Flags: review?(rnewman)
Attachment #728315 - Flags: review?(mak77)
Comment on attachment 728315 [details] [diff] [review]
Use WAL for FHR database, v1

Review of attachment 728315 [details] [diff] [review]:
-----------------------------------------------------------------

I'm pretty happy with this, but do defer to mak on the niceties!

::: services/metrics/storage.jsm
@@ +1136,5 @@
> +        yield self._connection.execute("PRAGMA wal_autocheckpoint=" +
> +                                       Math.ceil(self.MAX_WAL_SIZE_KB * 1024 / pageSize));
> +      } else {
> +        if (journalMode != "truncate") {
> +         // Fall back to truncate (which is faster than delete.

Align, missing paren.
Attachment #728315 - Flags: review?(rnewman) → review+
This didn't cause a perf win according to my try push. I suspect our MAX_WAL_SIZE_KB isn't large enough. It appears to be set at 1. I /think/ it may need to be 2 for any kind of win.
I increased MAX_WAL_SIZE_KB from 32 to 128 and pushed to try: https://tbpl.mozilla.org/?tree=Try&rev=70690b042442
How did you choose the value for MAX_WAL_SIZE_KB? What are you aiming at: 1 fsync in the life of the app? Just 1 fsync at shutdown?
This value should be chosen based on measurements of the sizes the journal commonly grows to (32KB is likely too low; the journal would likely always be over that value and fsync every time).
Does the healthreport connection stay open for the whole app lifetime, or is it closed when done?
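The sizing logic the patch performs (convert a KB budget into a page count for wal_autocheckpoint) can be sketched standalone with Python's stdlib sqlite3; the MAX_WAL_SIZE_KB value here is the one from the patch, not a recommendation:

```python
import math
import sqlite3

MAX_WAL_SIZE_KB = 512  # budget from the patch; tune per measurements

conn = sqlite3.connect(":memory:")

# wal_autocheckpoint is denominated in pages, so convert the KB budget
# using the database's actual page size rather than assuming a default.
row = conn.execute("PRAGMA page_size").fetchone()
page_size = row[0] if row else 1024

# Round up so the threshold never undershoots the KB budget.
pages = math.ceil(MAX_WAL_SIZE_KB * 1024 / page_size)
conn.execute("PRAGMA wal_autocheckpoint=%d" % pages)
conn.close()
```

Rounding up means the page threshold always covers at least the configured KB budget, whatever the page size happens to be.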
Upped the WAL log limit to 512KB. Also set synchronous=NORMAL because full durability is not worth the performance loss from the extra fsync at every DB insert/update.

https://www.sqlite.org/pragma.html#pragma_synchronous
Attachment #728315 - Attachment is obsolete: true
Attachment #728315 - Flags: review?(mak77)
Attachment #729243 - Flags: review?(rnewman)
Attachment #729243 - Flags: review?(mak77)
Comment on attachment 729243 [details] [diff] [review]
Use WAL for FHR database, v2

Review of attachment 729243 [details] [diff] [review]:
-----------------------------------------------------------------

looks fine (apart from the useless synchronous set); should still add the checkpoint after the big write, as you suggested on IRC.

::: services/metrics/storage.jsm
@@ +1115,5 @@
> +      // does.
> +      let rows = yield self._connection.execute("PRAGMA page_size");
> +      let pageSize = 1024;
> +      if (rows.length) {
> +        pageSize = rows[0].getResultByIndex(0);

I think rows[0].page_size could just work fine, but please verify

@@ +1121,5 @@
> +
> +      self._log.debug("Page size is " + pageSize);
> +
> +      // Ensure temp tables are stored in memory, not on disk.
> +      yield self._connection.execute("PRAGMA temp_store=MEMORY");

at first glance it doesn't look like you have any TEMP or TEMPORARY items, so this doesn't look useful for now

@@ +1139,5 @@
> +        // We trade better performance for less durability because if we
> +        // lose some data due to a crash, power failure, etc, it isn't
> +        // the end of the world. If data is too important to lose, we
> +        // can always force a WAL checkpoint.
> +        yield self._connection.execute("PRAGMA synchronous=NORMAL");

this is pointless, normal is the default synchronous setting of Storage.
Attachment #729243 - Flags: review?(mak77) → review+
(In reply to Marco Bonardo [:mak] (Away Mar 27 - Apr 2) from comment #17)
> ::: services/metrics/storage.jsm
> @@ +1115,5 @@
> > +      // does.
> > +      let rows = yield self._connection.execute("PRAGMA page_size");
> > +      let pageSize = 1024;
> > +      if (rows.length) {
> > +        pageSize = rows[0].getResultByIndex(0);
> 
> I think rows[0].page_size could just work fine, but please verify

undefined. Do we need to QI the row in Sqlite.jsm?
 
> @@ +1121,5 @@
> > +
> > +      self._log.debug("Page size is " + pageSize);
> > +
> > +      // Ensure temp tables are stored in memory, not on disk.
> > +      yield self._connection.execute("PRAGMA temp_store=MEMORY");
> 
> at a first glance doesn't look like you have any TEMP or TEMPORARY items, so
> this doesn't look useful for now

No, we don't. But, it should be cheap. And, don't databases create temporary tables automatically sometimes as part of statement execution? Maybe I'm thinking of crazy MySQL behavior?

> @@ +1139,5 @@
> > +        // We trade better performance for less durability because if we
> > +        // lose some data due to a crash, power failure, etc, it isn't
> > +        // the end of the world. If data is too important to lose, we
> > +        // can always force a WAL checkpoint.
> > +        yield self._connection.execute("PRAGMA synchronous=NORMAL");
> 
> this is pointless, normal is the default synchronous setting of Storage.

I confirmed the default was normal (1) and removed this addition.
(In reply to Gregory Szorc [:gps] from comment #18)
> > I think rows[0].page_size could just work fine, but please verify
> 
> undefined. Do we need to QI the row in Sqlite.jsm?

Ah right, this is async; bug 512761 hits us yet again

> No, we don't. But, it should be cheap. And, don't databases create temporary
> tables automatically sometimes as part of statement execution? Maybe I'm
> thinking of crazy MySQL behavior?

nope, nothing that crazy afaik
Attachment #729277 - Flags: review?(rnewman)
Comment on attachment 729277 [details] [diff] [review]
Part 2: Checkpoint WAL after collect, v1

Review of attachment 729277 [details] [diff] [review]:
-----------------------------------------------------------------

This looks fine, except that we almost certainly want to log and measure the shit out of this.

::: services/healthreport/healthreporter.jsm
@@ +536,5 @@
>  
> +      // Flush gathered data to disk. This will incur an fsync. But, if
> +      // there is ever a time we want to persist data to disk, it's
> +      // after a massive collection.
> +      yield this._storage.checkpoint();

NEEDS MOAR TELEMETRY.
Attachment #729277 - Flags: review?(rnewman) → feedback+
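The checkpoint-after-collect idea can be illustrated outside Gecko with Python's stdlib sqlite3 (the real checkpoint() lives in storage.jsm; the schema and row counts below are invented):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "fhr.sqlite")
conn = sqlite3.connect(path)
conn.execute("PRAGMA journal_mode=WAL")

# Simulate a "massive collection": lots of writes landing in the WAL.
conn.execute("CREATE TABLE measurements (day INTEGER, value INTEGER)")
conn.executemany("INSERT INTO measurements VALUES (?, ?)",
                 [(i, i * 2) for i in range(1000)])
conn.commit()

# Force the WAL contents into the main database file. This incurs the
# fsync that WAL deferred, which is exactly what you want right after
# a large burst of writes you care about persisting.
busy, log_frames, ckpt_frames = conn.execute(
    "PRAGMA wal_checkpoint(FULL)").fetchone()
conn.close()
```

A FULL checkpoint reports (busy, frames-in-log, frames-checkpointed); with a single connection it completes cleanly and checkpoints every frame.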
Comment on attachment 729243 [details] [diff] [review]
Use WAL for FHR database, v2

Review of attachment 729243 [details] [diff] [review]:
-----------------------------------------------------------------

Little to add over Marco's comments. Do put ",mak" in the r= remark.

::: services/metrics/storage.jsm
@@ +731,5 @@
>  }
>  
>  MetricsStorageSqliteBackend.prototype = Object.freeze({
> +  // Max size (in kibibytes) the WAL log is allowed to grow to before it is
> +  // checkpointed.

Add a little note about how we came to this value, and how to calculate a better version a year from now. A pointer back to the bug with explanation would suffice.
Attachment #729243 - Flags: review?(rnewman) → review+
Hmm. Did some testing on my Linux VM (backed by magnetic storage). I timed test_healthreporter.js:

512KB - 14.25s +- 0.5
1MB   - 10 +- 0.5
2MB   - 7.5 +- 0.5
4MB+  - 6.25 +- 0.5

Please note that this test creates a lot of new empty databases and is probably not indicative of real-world use. Still, 2x wall time improvement for 2MB vs 512KB. Hmmm...
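A rough way to reproduce this kind of measurement with Python's stdlib sqlite3 (absolute numbers will vary wildly by disk and OS; the schema, row count, and threshold values are made up for illustration):

```python
import os
import sqlite3
import tempfile
import time

def timed_inserts(autocheckpoint_pages, rows=500):
    # Fresh database per run, mirroring how each xpcshell test
    # instantiates its own database.
    path = os.path.join(tempfile.mkdtemp(), "bench.sqlite")
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA wal_autocheckpoint=%d" % autocheckpoint_pages)
    conn.execute("CREATE TABLE t (v INTEGER)")
    start = time.monotonic()
    for i in range(rows):
        conn.execute("INSERT INTO t VALUES (?)", (i,))
        conn.commit()  # one transaction per row: many small writes
    elapsed = time.monotonic() - start
    conn.close()
    return elapsed

# A smaller autocheckpoint threshold checkpoints (and fsyncs) more often
# during the run; a larger one defers that work.
timings = {pages: timed_inserts(pages) for pages in (4, 1024)}
```

This also makes Marco's point below concrete: the larger threshold doesn't eliminate the fsyncs, it mostly shifts them to close/shutdown.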
Larger WALs increase read times, and as I said, what you are doing by increasing it is just moving all of the fsyncs to shutdown. That may not matter in a test, but it can matter in large apps.
I'm going to assert this is at least a P2 because of performance implications.
Priority: P4 → P2
Telemetry!
Attachment #729277 - Attachment is obsolete: true
Attachment #729757 - Flags: review?(rnewman)
Attachment #729757 - Flags: review?(rnewman) → review+
https://hg.mozilla.org/mozilla-central/rev/62b4fc06bf44
https://hg.mozilla.org/mozilla-central/rev/4c50962fcb8f
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Whiteboard: [fixed in services]
Target Milestone: --- → mozilla22
We should consider uplifting this to 21 because of significant performance wins.
Comment on attachment 729243 [details] [diff] [review]
Use WAL for FHR database, v2

[Approval Request Comment]
Bug caused by (feature/regressing bug #): FHR
User impact if declined: FHR DB performance may cause excessive I/O, resulting in bad Firefox perf.
Testing completed (on m-c, etc.): It has baked for a day. I didn't see a large uptick in database errors from submitted payloads. Try runs reveal perf wins are huge.
Risk to taking this patch (and alternatives if risky): Should be low.
String or IDL/UUID changes made by this patch: None

We arguably should have landed this code with the initial landing of FHR. It's been sitting on the backburner for a while because we didn't realize how significant the perf wins were. We didn't notice before because all I/O is off the main thread, so there was no jank indicating we were being slow.
Attachment #729243 - Flags: approval-mozilla-aurora?
Comment on attachment 729757 [details] [diff] [review]
Part 2: Checkpoint WAL after collect, v2

[Approval Request Comment]
See previous.
Attachment #729757 - Flags: approval-mozilla-aurora?
Comment on attachment 729243 [details] [diff] [review]
Use WAL for FHR database, v2

Considering the data from the try run, and given the stats at https://bugzilla.mozilla.org/show_bug.cgi?id=854606#c17, approving this patch in favor of the performance gains.
Attachment #729243 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #729757 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Component: Metrics and Firefox Health Report → Client: Desktop
Flags: in-testsuite+
Product: Mozilla Services → Firefox Health Report
Target Milestone: mozilla22 → Firefox 22
Product: Firefox Health Report → Firefox Health Report Graveyard