Closed Bug 830492 Opened 12 years ago Closed 12 years ago

Reduce total number of fsync() required by FHR

Categories

(Firefox Health Report Graveyard :: Client: Desktop, defect, P2)

defect

Tracking

(firefox21+ wontfix, firefox22+ fixed, firefox23 fixed)

RESOLVED FIXED
Firefox 23
Tracking Status
firefox21 + wontfix
firefox22 + fixed
firefox23 --- fixed

People

(Reporter: gps, Assigned: gps)

References

Details

(Whiteboard: [qa-])

Attachments

(1 file, 1 obsolete file)

Per conversations with the Perf team, we should intelligently batch accumulated data and periodically flush to database to prevent excessive writes. The most egregious abuse currently is the session activity counter, which currently writes every 5s. Bug 827157 tracks making that better. This bug /may/ be idle for a while depending on whether our current state after that bug is sufficient for perf in the near to medium term (until a new provider presses the issue).
Bug 841074 should make this a little easier to implement. Bug 841074 also paves the way for delayed opening of the DB until we actually need to write data.
Depends on: 841074
Priority: -- → P2
Component: Metrics and Firefox Health Report → Client: Desktop
Product: Mozilla Services → Firefox Health Report
We've talked about this considerably and the real problem we're trying to address is the number of fsync() calls during the lifetime of the app. fsync()'s by themselves don't sound too bad. But, combined with a journaled filesystem (many modern ones, especially on Linux), they effectively result in a stop-the-world while the filesystem commits its journal up to the point of the fsync'd file descriptor. The goal is to reduce the number of fsync()'s incurred by FHR as much as possible. Ideally, we get down to "1 per day." I'm interpreting that to mean "1 per daily collection" or "1 per major data collection." Those are essentially the same thing. But, the distinction is important because "1 per day" is very limiting when you consider corner cases.

I am actively working on assessing the current state of things. I will measure first then make recommendations later.
Summary: Accumulate and batch flush gathered metrics to database → Reduce total number of fsync() required by FHR
In the current FHR world, we use SQLite with a WAL journal.

The most common source of SQLite-issued fsync() is when the WAL is checkpointed. Checkpointing occurs when explicitly told to do so (PRAGMA wal_checkpoint()) or when the current WAL size has exceeded a configured threshold (PRAGMA wal_autocheckpoint). In the latter case, the checkpoint occurs automatically as part of processing the SQL statement that caused the journal to exceed the threshold. So, if you write enough data, you'll incur a checkpoint and thus an fsync().

But there are other sources of fsync()!

The first time we run FHR, the system calls on OS X (and presumably Linux) look like:

open db
open journal
write journal
fsync journal
fsync db
unlink journal
open wal
open shm
write wal
fsync wal

At this point our DB connection is opened and configured. Then, we proceed to creating the schema. This incurs a lot of write() to the wal and some periodic fsync() and ftruncate().

What's interesting about first run is the I/O involving the journal file. This is separate from the WAL journal. On first run, since there is no WAL, SQLite uses non-WAL mode as the default, creating a journal file in the process. When we issue the PRAGMA journal_mode=wal, SQLite flushes the old journal, deletes it, and creates the WAL.

On subsequent runs, it sees the WAL file exists and infers the DB is in WAL mode, skipping I/O involving the journal file. After some stat() to discover this, that looks like:

open db
open wal
open shm

Then, there's some read I/O (which I assume is playing back the wal). Most importantly, no fsync are involved!

On our initial daily collect with a WAL checkpoint threshold of 512kb, FHR's file descriptors incur 18 fsync(). Now, not every fsync is equal because SQLite seems to fsync() the WAL and DB itself in pairs. So, on journaled filesystems, that first fsync() loops in the rest of the world and that 2nd fsync() should be relatively quick.

But, the goal is to reduce the number of fsync(). Let's play around a bit.

If I increase the size of the WAL autocheckpoint to something ridiculous like 64MB, we only see 1 pair of fsync() during collection! Why still 1? Because we explicitly force a checkpoint at the end of our daily collection.

Unfortunately, we're not down to just 1 fsync during daily collect. The reason? OS.File. When we write the last payload file to disk, I see:

open healthreport/lastpayload.json.tmp
write 
fsync
close
rename

So, every file write performed with OS.File incurs an fsync?! Not quite. Turns out it's the default behavior of OS.File.writeAtomic(). We could optionally remove that fsync by passing {flush: false} to that function when we write the last payload file. But, since we already have an fsync() from the forced WAL checkpoint in the same time proximity, I'm not too worried about the additional fsync() here.

Anyway, here is my proposed solution:

1) Jack up the WAL auto checkpoint size to something like 16MB.
2) Perform a manual WAL checkpoint after a major/daily collection.
3) Retain existing behavior of OS.File.writeAtomic()

For continued FHR use, we'll incur 3 fsync() during daily collection. 2 of these come from SQLite as a pair. AFAIK you can't have one without the other: that's just how SQLite works. The other comes from OS.File.writeAtomic(). And, since writeAtomic comes near the SQLite ones, we shouldn't incur a penalty for the additional fsync().

The downside to the increased auto checkpoint threshold is that DB reads will be slower with a large WAL than a smaller one (the reader needs to "merge" the DB with the WAL to compute results). This could potentially cause more I/O and seeks (thus disk pressure) on FHR startup and during normal operation and may adversely affect performance near app startup. Whether this will be measurable, I honestly don't know. I theorize the wall time in FHR startup Telemetry probe may be a rough approximation of the impact of a larger WAL checkpoint threshold. I suspect this won't be an issue because FHR is write heavy. And, we don't have too many DB writes resulting from "pushed" metrics (most of FHR's data is pulled into FHR at collection time). Since the number of writes outside a checkpoint-terminated collection are few, the WAL should never grow too large and thus read times should not be adversely affected.

A potential workaround to concerns about large WALs is to temporary disable auto checkpoint during daily collection, force a checkpoint at the end, then restore a smaller checkpoint threshold during regular operation.

Thoughts?
Before I start working on patches, I'd like some buy-in from Perf land on the proposal.

David?
Flags: needinfo?(dteller)
We are going to remove the saving of the FHR payload to disk in bug 860094. That removes the OS.File.writeAtomic fsync. Yay!
Has anyone told you that your bugzilla comments often read like blog entries? :)

I will partially deflect the WAL question to mak, as he has considerably more experience than me on this.

However, a few questions to ensure that I understand correctly your idea. Is it correct that increasing the WAL threshold will:
- not increase memory usage;
- increase the duration of the final I/O at the end of the collection;
- increase the likelihood of losing data in case of crash;
- not increase the likelihood of corrupting the complete db;
- not improve the I/O of the first use of FHR
?

Also, I seem to remember that the "official" suggestion of the Perf team was to avoid using sqlite entirely, as the level of safety (and the incurred I/O cost) is disproportionate to the requirements of the data put in the table. Do I understand correctly that this bug is meant as a stopgap before eventually migrating to a solution that would reduce everything to 0 or 1 |fsync|?
Flags: needinfo?(dteller) → needinfo?(mak77)
(In reply to David Rajchenbach Teller [:Yoric] from comment #6)
> However, a few questions to ensure that I understand correctly your idea. Is
> it correct that increasing the WAL threshold will:
> - not increase memory usage;

memory usage of SQLite it controlled by us and capped to 2MB per connection currently. It can't grow over that. That said it may grow inside the 2MB limit.

> - increase the duration of the final I/O at the end of the collection;

Surely a larger journal to merge is more expensive than a small journal, it's possible.

> - increase the likelihood of losing data in case of crash;

This depends on the data, could be we don't care to lose FHR data?
This is an important point, reducing fsyncs surely increases the likelyhood of incomplete data.

> - not increase the likelihood of corrupting the complete db;

It won't increase this possibility, that actually already exists, but in very rare cases (power loss in very specific tiny timeframes)

> - not improve the I/O of the first use of FHR

To sum up:
GAIN: less fsync
NO CHANGE: writes, corruption rate
LOSS: lower reads, larger merge times (even at shutdown), increased risk of incoherent data

I have a couple thoughts though:
1. we should not try to request impossible improvements. Asking to reduce fsyncs is fine, asking N fsyncs per day or similar is not. It's like saying your car doesn't need 4 wheels cause it may work even with 3, or maybe even with 2. Fsyncs can be reduced up to a certain point, but it has a cost, no magic wands.
2. Delaying checkpoint as suggested (16MB) is basically the same as disabling auto checkpoint on a db that is currently < 2MB, so at that point just do that (set wal_autocheckpoint to 0). Be well aware that has a cost in both slower reads and fsync on shutdown, so there must be mitigation, that means being sure we can do a better job than an automatic threshold in understanding when to manually checkpoint. Per personal experience, situations where manual handling works better than automatic handling are quite rare.
3. I originally said a database for FHR was looking excessive, but at this point let's be realistic and fix the problem as is today. Having a plan for future doesn't mean we should just look at it and do nothing now.
Flags: needinfo?(mak77)
Marco, David, and I talked about this during the inaugural async and responsiveness meet-up yesterday.

While Marco did express concerns that overthinking SQLite could lead to pain, since FHR has very well-defined interactions with SQLite, we're going to move ahead with manual control over WAL checkpointing. We will disable automatic WAL checkpointing during heavy DB interaction, notably initial setup and daily collection. We will force a WAL checkpoint (fsync) at the end of each. During normal operations, auto WAL checkpointing will be enabled. However, the checkpoint threshold should not be reached except for a tiny fraction of our users that a) are very heavy users of some browser features b) have very long-lived sessions. If we are concerned with these background checkpoints, we can investigate increasing the WAL checkpoint threshold (possibly slowing down app startup and shutdown) and we can start batching and flushing data to SQLite.
It feels like if you need to go through so much trouble to precisely control sqlite IO, sqlite isn't really saving much work over raw IO
(In reply to Taras Glek (:taras) from comment #9)
> It feels like if you need to go through so much trouble to precisely control
> sqlite IO, sqlite isn't really saving much work over raw IO

I am open to moving away from SQLite. However, it is a significant change and it wouldn't likely make uplift to 21 (first because I'm not sure if I could finished it in time and 2nd I'm not sure if sheriffs would accept a patch that hasn't baked long enough). So, a non-exhaustive and non-mutually exclusive list of options is: a) work on small, minimally-invasive, high-impact patches like in this bug and uplift wins to 21 b) focus entirely on moving away from SQLite now. Leave users on 21, 22, maybe 23 on SQLite (with current sub-optimal I/O) c) disable FHR on 21, ... until everyone is satisfied with I/O characteristics.

The larger decisions (b and c) are not mine to make. But, I do believe that "a" should be pursued now because it's relatively simple and gives all our users an immediate win.
Uplifting to 21 sounds scary and I would be strongly against a format change for that. Nevermind comment 9
Attached patch Control WAL, v1 (obsolete) — Splinter Review
This does what was discussed.

We disable auto checkpoint during initial database creation and during daily collection. At the end of both we force a checkpoint and restore auto checkpointing every (currently) 512KB.

Please note that the enable/disable during collection is currently not inside collectMeasurements because there are SQL operations issued during initialization of some pull-only providers and we want to ensure these don't incur a checkpoint. We need better APIs for dealing with pull-only providers. However, this bug is not the appropriate place. Expect another patch in another bug shortly.
Assignee: nobody → gps
Status: NEW → ASSIGNED
Attachment #737654 - Flags: review?(rnewman)
Comment on attachment 737654 [details] [diff] [review]
Control WAL, v1

Review of attachment 737654 [details] [diff] [review]:
-----------------------------------------------------------------

Can we test this?

::: services/metrics/storage.jsm
@@ +700,5 @@
>  function MetricsStorageSqliteBackend(connection) {
>    this._log = Log4Moz.repository.getLogger("Services.Metrics.MetricsStorage");
>  
>    this._connection = connection;
> +  this._connectionAutoCheckpoint = null;

Perhaps "_enabledWALCheckpointSize" would be a better name for this?

You should also comment that:

• This should be set only once, because it controls enabling and disabling, and changing it mid-flight would break that
• You consequently cannot call _init more than once with different wal values. (Shouldn't happen, but…)

@@ +1139,5 @@
>        self._log.info("Journal mode is " + journalMode);
>  
>        if (journalMode == "wal") {
> +        self._connectionAutoCheckpoint =
> +          Math.ceil(self.MAX_WAL_SIZE_KB * 1024 / pageSize);

Log this: "Auto checkpoint size: " + val

@@ +1302,5 @@
>     */
>    checkpoint: function () {
> +    if (!this._connectionAutoCheckpoint) {
> +      return CommonUtils.laterTickResolvingPromise();
> +    }

Is this wrong? Surely you never want to avoid doing an explicit checkpoint if calling code decided to do so…

@@ +1319,5 @@
> +      return this._connection.execute("PRAGMA wal_autocheckpoint=0");
> +    }.bind(this));
> +  },
> +
> +  enableAutoCheckpoint: function () {

Not sure about code reuse here.

  setAutoCheckpoint: function (on) {
    if (!this._connectionAutoCheckpoint) {
      return CommonUtils.laterTickResolvingPromise();
    }

    let val = on ? this._connectionAutoCheckpoint : 0;
    return this.enqueueOperation(function op() {
      return this._connection.execute("PRAGMA wal_autocheckpoint=" + val);
    }.bind(this));
  },

saves a function.

Or use descriptive strings:

    let val = (on == "on") ? this._connectionAutoCheckpoint : 0;

so you can do

  db.setAutoCheckpoint("off");
Attachment #737654 - Flags: review?(rnewman) → feedback+
(In reply to Richard Newman [:rnewman] from comment #13)
> Can we test this?

What, exactly? I've casually asked a-team if we could start recording and annotating *all* our test suites with per-test resource info, including CPU usage and I/O counters. We could potentially lump fsync counts in there as well (although the infrastructure to measure system call count is much harder to deploy than simple process counters already tracked by the kernel). However, I've only casually asked and don't expect anything any time soon. Until then, I believe the answer is "use Talos." Even then, I don't know if the test harnesses have the ability to say "fail if fsync count greater than N." I wish they did...
Setting tracking flags as I believe we'd like to uplift this since it should result in a nice performance win. The goal is to get it on Nightly ASAP. We can let it bake for a little bit then hopefully make beta 4.
What is the risk of regression for uplifting this patch? Is there anything in particular QA should test before uplift is approved?
(In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #16)
> What is the risk of regression for uplifting this patch? Is there anything
> in particular QA should test before uplift is approved?

The risk is we do something horribly wrong and make the database unusable. That would show up pretty clearly after baking on Nightly. We also may somehow manage to regress performance. A spot check of Telemetry numbers before and after this patch should be all that's needed to confirm.
Attached patch Control WAL, v2Splinter Review
Review comments addressed.

Regarding no-op of the checkpoint APIs, if we aren't in WAL mode, there is nothing to do! Eventually checkpoint() may flush buffered data to storage. But until we buffer data we don't need to worry about that.
Attachment #737654 - Attachment is obsolete: true
Attachment #738091 - Flags: review?(rnewman)
Attachment #738091 - Flags: review?(rnewman) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/7f74457ed102
Target Milestone: --- → Firefox 23
(In reply to Gregory Szorc [:gps] from comment #14)
> (In reply to Richard Newman [:rnewman] from comment #13)
> > Can we test this?
> 
> What, exactly? I've casually asked a-team if we could start recording and
> annotating *all* our test suites with per-test resource info, including CPU
> usage and I/O counters. We could potentially lump fsync counts in there as
> well (although the infrastructure to measure system call count is much
> harder to deploy than simple process counters already tracked by the
> kernel). However, I've only casually asked and don't expect anything any
> time soon. Until then, I believe the answer is "use Talos." Even then, I
> don't know if the test harnesses have the ability to say "fail if fsync
> count greater than N." I wish they did...

Actually they can. ask jmaher/aklotz. We record some runs with xperf. That exposes all of the info you need for this.
(In reply to Taras Glek (:taras) from comment #20)
> (In reply to Gregory Szorc [:gps] from comment #14)
> > (In reply to Richard Newman [:rnewman] from comment #13)
> > > Can we test this?
> > 
> > What, exactly? I've casually asked a-team if we could start recording and
> > annotating *all* our test suites with per-test resource info, including CPU
> > usage and I/O counters. We could potentially lump fsync counts in there as
> > well (although the infrastructure to measure system call count is much
> > harder to deploy than simple process counters already tracked by the
> > kernel). However, I've only casually asked and don't expect anything any
> > time soon. Until then, I believe the answer is "use Talos." Even then, I
> > don't know if the test harnesses have the ability to say "fail if fsync
> > count greater than N." I wish they did...
> 
> Actually they can. ask jmaher/aklotz. We record some runs with xperf. That
> exposes all of the info you need for this.

I had a quick convo with jmaher on IRC. While xperf is deployed everywhere (or most everywhere), it sounds like we don't have the kind of control/features in the test harnesses to do what we want. I do dream of the day though.

Taras: you may also be interested in bug 859573.
https://hg.mozilla.org/mozilla-central/rev/7f74457ed102
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Tracking for FF22.If this is ready to be uplifted on aurora please request nomination asap.

Uplift on Beta depends on a clear risk evaluation along with the user impact considering we are in second half of our Beta cycle .
Comment on attachment 738091 [details] [diff] [review]
Control WAL, v2

[Approval Request Comment]
Bug caused by (feature/regressing bug #): FHR
User impact if declined: Likely non-optimal I/O performance.
Testing completed (on m-c, etc.): It just landed yesterday. It probably needs a few more days to bake.
Risk to taking this patch (and alternatives if risky): I believe this is a low-risk patch. We *may* see some changes in Telemetry numbers once this has gained enough users on Nightly. However, Telemetry numbers won't paint a full picture because they don't measure raw file I/O operations. I think we may have to go with our instincts that this will be a net win.
String or IDL/UUID changes made by this patch: None
Attachment #738091 - Flags: approval-mozilla-beta?
Attachment #738091 - Flags: approval-mozilla-aurora?
Comment on attachment 738091 [details] [diff] [review]
Control WAL, v2

low risk, perf win. Approving on aurora.

Will check back on beta approval on Monday Apr 22, while this gets the needed bake time on aurora,m-c and see if there are any fallout's .
Attachment #738091 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
(In reply to Gregory Szorc [:gps] from comment #17)
> (In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #16)
> > What is the risk of regression for uplifting this patch? Is there anything
> > in particular QA should test before uplift is approved?
> 
> The risk is we do something horribly wrong and make the database unusable.
> That would show up pretty clearly after baking on Nightly. We also may
> somehow manage to regress performance. A spot check of Telemetry numbers
> before and after this patch should be all that's needed to confirm.

I checked my telemetry numbers in Nightly but I see no difference between builds before and after this landed. Is this what you were expecting (ie. the numbers should not be regressing)?
(In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #27)
> I checked my telemetry numbers in Nightly but I see no difference between
> builds before and after this landed. Is this what you were expecting (ie.
> the numbers should not be regressing)?

Ideally they'd be improving. But, the strict requirement for verification is we not regress numbers. If you are on sufficiently modern hardware (especially an SSD), it's likely you won't see any difference with this patch. The best way we have of measuring this patch's impact is aggregate numbers reported through the raw Telemetry submissions. We need to wait a few more days before we have enough data to even think about reaching a conclusion. Even then, I reckon our Nightly and Aurora users are heavily biased towards nicer machines. "It's complicated."
Thanks Gregory. To be on the safe side I'm going to tag this bug as [qa-] to indicate that we were unable to reliably test this, citing the reasons you've given above. We should perhaps revisit the telemetry data when we get closer to the next merge day. I will also monitor our Nightly Testers mailing list for indicators of regressing performance.
Whiteboard: [qa-]
Yoric, Can you please help understand if there is a better way to measure the performance improvements here after/before the patch landed. I am worried about the late uplift on beta here and having us in a position where we may have regressed a new issue or caused a stability impact due to this .

I'd rather let this get resolved in Fx22 in which case it will get the desired baketime, if we are not able to go by numbers as this is a recent patch than inducing risk so late in the cycle . Thoughts ?
I don't think we have data supporting late uplift to Beta at this point.  We may see a significant shift via Telemetry, but at this point we should be mitigating risk for Fx21 unless we are fixing blocker-grade issues.
Given comment 30 & 31, I am wontfixing this for Fx21.Feel free to change the status flags and needsinfo me if there is a disagreement on the above and we need to take the forward fix for Fx21.
Attachment #738091 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
Product: Firefox Health Report → Firefox Health Report Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: