Open Bug 484119 Opened 16 years ago Updated 2 years ago

performance/data-safety tradeoffs need improvement [battery]

Categories

(MailNews Core :: Database, enhancement)

Tracking

(Not tracked)

People

(Reporter: hyc, Unassigned)

References

(Depends on 2 open bugs, Blocks 1 open bug)

Details

(Keywords: power, Whiteboard: [battery])

Currently the majority of writes to local mailboxes are done synchronously, to minimize the possibility of data loss. Unfortunately this policy is extremely unfriendly to laptops, in terms of both battery life and storage device lifetime. 

On HDDs, a laptop in "max battery life" mode will probably have set aggressive spin-down parameters, but every state change in a particular email will result in a sequence of synchronous writes to update flag bytes in headers, etc. As such, *more* battery power will be consumed, and the HDD motors will experience accelerated wear and ultimately shorter service life.

The cheap SSDs currently on the market are known to have extremely poor random write characteristics, and all NAND flash technologies have problems with in-place rewrites of data. (I.e., they can't do it at all, and require a lot of internal shuffling to make it appear to work.)

In both of these cases, the best thing to do is to allow all writes to accumulate as long as possible in the filesystem cache, to minimize the number of discrete I/O cycles that must be performed.
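
A minimal sketch of the accumulate-then-flush idea, written in Python for brevity and not taken from Thunderbird's code. The `FlagWriteBuffer` class, its methods, and the `flags=` record layout are hypothetical names made up for this illustration. It keeps in-place updates in memory and pays for one fsync per batch instead of one per state change.

```python
import os
import tempfile


class FlagWriteBuffer:
    """Hypothetical helper: coalesce small in-place updates to one file."""

    def __init__(self, path):
        self.path = path
        self.pending = {}     # offset -> bytes; a later update replaces an earlier one
        self.sync_count = 0   # number of fsync calls issued, kept for illustration

    def set_bytes(self, offset, data):
        # Only remember the change; nothing is written yet.
        self.pending[offset] = data

    def flush(self):
        # Apply every pending update, then sync once for the whole batch.
        if not self.pending:
            return
        fd = os.open(self.path, os.O_RDWR)
        try:
            for offset, data in sorted(self.pending.items()):
                os.lseek(fd, offset, os.SEEK_SET)
                os.write(fd, data)
            os.fsync(fd)
            self.sync_count += 1
        finally:
            os.close(fd)
        self.pending.clear()


def demo():
    # Round trip on a scratch file: two updates to the same offset, one sync.
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"flags=0000\n")
        os.close(fd)
        buf = FlagWriteBuffer(path)
        buf.set_bytes(6, b"0001")   # e.g. message marked read
        buf.set_bytes(6, b"0011")   # e.g. then replied to, before any flush
        buf.flush()
        with open(path, "rb") as f:
            return f.read(), buf.sync_count
    finally:
        os.remove(path)
```

When to call `flush()` (on a timer, on folder close, on shutdown) is exactly the policy question this bug is about; the sketch leaves that to the caller.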

I think the rationale behind the No DataLoss policy needs to be revisited. In particular, what causes of data loss are being protected against?

If the issue is power outages, then the decision ought to be dependent on whether a machine is on AC or battery power.

If the issue is application instability, then it seems there's no issue here, since anything the app wrote to the filesystem will still be intact regardless of the application state.

If the issue is OS instability ... oh well.

I think at the very least there should be a preference for this:
   * default behavior, always synchronous
   * no synchronous flushes when running on battery
   * no synchronous flushes at all
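
The three proposed settings could be sketched roughly as follows. This is Python for illustration only; `SyncPolicy`, `should_sync`, and `write_record` are hypothetical names, and how `on_battery` is detected is assumed to be supplied by the caller, since that differs per OS.

```python
import enum
import os


class SyncPolicy(enum.Enum):
    ALWAYS = "always"      # default behavior, always synchronous
    AC_ONLY = "ac-only"    # no synchronous flushes when running on battery
    NEVER = "never"        # no synchronous flushes at all


def should_sync(policy, on_battery):
    """Return True if a write should be followed by an explicit flush to disk."""
    if policy is SyncPolicy.ALWAYS:
        return True
    if policy is SyncPolicy.AC_ONLY:
        return not on_battery
    return False


def write_record(fd, data, policy, on_battery):
    """Write data to fd, syncing only if the policy asks for it.

    Returns True if an fsync was issued.
    """
    os.write(fd, data)
    if should_sync(policy, on_battery):
        os.fsync(fd)
        return True
    return False
```

With `SyncPolicy.NEVER`, durability is left entirely to the filesystem cache and whatever writeback schedule the OS applies.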
Fascinating comment.  I think the subject could be tweaked to be less scary -- it's not that you're advocating data loss per se, but that there are some disk I/O optimizations that should be considered, especially for IMAP accounts, where we're not really talking about data loss so much as "some data may need to be refetched".  Ideally we could still flush synchronously when "important" data is in the cache (new emails being composed, for example, which happens rarely).
As David points out, there is some set of cases where we can have smarter defaults.  We probably also want some level of configurability and/or policy hooks here too, since tradeoffs are likely to be different on different OSes, different storage types, netbook vs. laptop vs. desktop, etc.  Adjusting the summary to reflect this.
Whiteboard: performance/data-safety tradeoffs need improvement
Summary: No-data-loss policy needs to be configurable → performance/data-safety tradeoffs need improvement
I still think the actual issues being guarded against need to be spelled out first. Otherwise, from my perspective, none of this should be happening at app-level. My OS (Linux) already has "laptop mode" which extends the FS cache timeouts while on battery, and also automatically reverts back if the battery gets below a particular threshold. Why should any app duplicate the work that the OS already does? And from yet another perspective - when is any of this stuff more important than any other type of data on the machine? I've got IM conversation logs, source code I'm working on, config files for various other tools - everything else on the machine entrusts their data to the filesystem. Why is Mozilla so *distrustful* of the underlying OS policies?
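
For reference, the kernel-side knobs involved can be inspected with something like the sketch below. It assumes the usual Linux sysctl files under /proc/sys/vm (`laptop_mode`, `dirty_writeback_centisecs`, `dirty_expire_centisecs`); the function name is made up for this example, and it reports `None` for anything it cannot read, for instance on another OS.

```python
from pathlib import Path


def read_writeback_settings(vm_dir="/proc/sys/vm"):
    """Return the kernel writeback tunables as ints, or None where unreadable."""
    names = ("laptop_mode", "dirty_writeback_centisecs", "dirty_expire_centisecs")
    settings = {}
    for name in names:
        try:
            settings[name] = int((Path(vm_dir) / name).read_text().strip())
        except (OSError, ValueError):
            # File missing (non-Linux), unreadable, or not a plain integer.
            settings[name] = None
    return settings
```

An explicit fsync from the application bypasses these timeouts for the file in question, which is why app-level syncing can defeat the OS policy described here.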
Because we use a database engine that is ACID compliant.  We do give you the option of trusting the file system, however, by setting the preference "toolkit.storage.synchronous" to the number 0.
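
To illustrate what that setting means at the SQLite level, here is a small Python sketch. It assumes the preference is passed through to SQLite's `PRAGMA synchronous` (0 = OFF, 1 = NORMAL, 2 = FULL); the helper names are invented for this example and are not mozStorage APIs.

```python
import sqlite3


def open_with_synchronous(path, level):
    """Open a SQLite database and set PRAGMA synchronous to 0, 1 or 2."""
    if level not in (0, 1, 2):
        raise ValueError("level must be 0 (OFF), 1 (NORMAL) or 2 (FULL)")
    conn = sqlite3.connect(path)
    # The pragma must be issued outside a transaction; a fresh connection is.
    conn.execute(f"PRAGMA synchronous = {level}")
    return conn


def current_synchronous(conn):
    """Read back the synchronous level currently in effect."""
    return conn.execute("PRAGMA synchronous").fetchone()[0]
```

With level 0, SQLite hands data to the OS and does not wait for it to reach the disk, so a crash of the application alone should not lose committed data, while an OS crash or power loss might.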
Note that Shawn's comment relates to our use of SQLite databases through the mozStorage subsystem.  The mork database used for message store indices and the mbox mail storage logic used in MailNews do not use mozStorage and, as far as I know, do not have an 'off switch' for their use of fsync/fdatasync.
(In reply to comment #5)
> note that Shawn's comment relates to our use of SQLite databases through the
> mozStorage subsystem.  The mork database used for message store indices and
> mbox mail storage logic used in MailNews do not use mozStorage and as far as I
> know do not have an 'off switch' in regards to use of fsync/fdatasync.

Correct, their behavior is currently hardcoded.
(In reply to comment #3)
> I still think the actual issues being guarded against need to be spelled out
> first. 

Agreed.

> Otherwise, from my perspective, none of this should be happening at
> app-level. My OS (Linux) already has "laptop mode" which extends the FS cache
> timeouts while on battery, and also automatically reverts back if the battery
> gets below a particular threshold. Why should any app duplicate the work that
> the OS already does? 

I don't see how these questions can be answered for the general case; there's enough variance that they want to be asked and answered on a per-supported-OS basis, I think.

> And from yet another perspective - when is any of this
> stuff more important than any other type of data on the machine? I've got IM
> conversation logs, source code I'm working on, config files for various other
> tools - everything else on the machine entrusts their data to the filesystem.
> Why is Mozilla so *distrustful* of the underlying OS policies?

Since we haven't gone through the exercise of asking and answering these questions in great detail per OS before, having the default policy be conservative seems sensible to me.  That said, I'll happily agree that there's very likely to be room for improvement.
Blocks: 487375
Blocks: tb-netbooks
Is there a way to disable syncing the mbox file/msf (mork) after each received message? Can we have a pref for it? It annoys me even on my beefy desktop machine.
I could try to implement it if there is agreement we could offer this option and I get some code pointers :)
Keywords: power
Summary: performance/data-safety tradeoffs need improvement → performance/data-safety tradeoffs need improvement [battery]
Whiteboard: [battery]
(In reply to :aceman from comment #8)
> Is there a way to disable syncing the mbox file/msf (mork) after each
> received message? Can we have a pref for it? It annoys me even on my beefy
> desktop machine.
> I could try to implement it if there is agreement we could offer this option
> and I get some code pointers :)

Taking a stab at who to ask next.

Perhaps even Andrew might weigh in?
Flags: needinfo?(kent)
Flags: needinfo?(Pidgeot18)
At Wayne's request I spent a few minutes looking into this, but I did not get very far I'm afraid. I don't really know what the strategy is supposed to be for committing the database. Doing a quick look with a debugger, most of the calls to commit the database are coming either from 1) SetStringProperty, 2) Setting the msgdatabase on the folder object, which seems to happen very frequently, and 3) gloda activity.

These seem like odd, almost random choices to me, but maybe there is an underlying logic that I am missing.

Of course when using the UI marking messages read is what hits Commit.

It does seem to me that if someone was motivated to try to understand and optimize all of this, that would be a good thing. But my understanding is not to the point where I can say that "disable syncing the mbox file/msf (mork) after each received message" is a good idea, or if that even is an accurate statement of the current state of operation.
Flags: needinfo?(kent)
I don't know the deep-down details of mork very well, but I would be surprised if it ever actually called fsync.
Flags: needinfo?(Pidgeot18)
Gloda is already pretty lazy about its commits when it has ongoing work to do, but in an effectively idle system, the changes caused by messages moving from unread to read will likely result in transactions which will result in commits.

There are also possibly Firefoxy subsystems at work.  I forget if Thunderbird ended up disabling Places or not, but that's a database train-wreck.  (Gloda is fairly unique among all mozStorage users in the code-base in that it leaves commits open for a fairly long time rather than only for every small batch of operations.)  Periodic tab state saving is also a low-frequency thing.
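
The difference between committing per change and keeping one transaction open for a batch can be sketched with plain SQLite. This is Python for illustration, not gloda code; the `messages` table with an `is_read` column and both function names are hypothetical.

```python
import sqlite3


def mark_read_per_message(conn, ids):
    """Commit after every single update; returns the number of commits."""
    commits = 0
    for msg_id in ids:
        conn.execute("UPDATE messages SET is_read = 1 WHERE id = ?", (msg_id,))
        conn.commit()
        commits += 1
    return commits


def mark_read_batched(conn, ids):
    """Apply all updates inside one transaction; returns the number of commits."""
    with conn:  # commits once on success, rolls back on an exception
        conn.executemany(
            "UPDATE messages SET is_read = 1 WHERE id = ?",
            [(msg_id,) for msg_id in ids],
        )
    return 1
```

On a file-backed database with synchronous writes enabled, each commit typically costs at least one sync, so the batched form trades a longer window of uncommitted changes for fewer trips to the disk.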

Someone very interested in reducing fsyncs would be advised to use something like perf or systemtap on Linux, DTrace on OS X (or DTrace-based tools), or Sysinternals or xperf/xperf-based tools on Windows to find out what's actually triggering fsyncs and how long they are locking up the main/whatever thread and/or how long they're monopolizing the I/O device in question.
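
As one low-tech variant on Linux, strace can record the sync calls and a short script can total them. The sketch below assumes a log produced with something like `strace -f -T -e trace=fsync,fdatasync -o trace.log <command>`, where `-T` appends the time spent in each call in angle brackets; the function name is made up, and lines split into "unfinished"/"resumed" pairs are simply skipped.

```python
import re
from collections import defaultdict

# Matches e.g. "1234 fsync(57) = 0 <0.020000>" (the fd may carry a -y path suffix).
SYNC_CALL = re.compile(
    r"\b(fsync|fdatasync)\((\d+)[^)]*\)\s*=\s*-?\d+.*<(\d+\.\d+)>\s*$"
)


def summarize_syncs(lines):
    """Return {(syscall, fd): (call_count, total_seconds)} for sync calls."""
    totals = defaultdict(lambda: [0, 0.0])
    for line in lines:
        match = SYNC_CALL.search(line)
        if match is None:
            continue  # not a completed fsync/fdatasync line
        key = (match.group(1), int(match.group(2)))
        totals[key][0] += 1
        totals[key][1] += float(match.group(3))
    return {key: (count, secs) for key, (count, secs) in totals.items()}
```

Mapping the file descriptors back to mbox or .msf files (for example via strace's `-y` option) then shows which store is responsible for most of the syncing.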
(In reply to Joshua Cranmer [:jcranmer] from comment #11)
> I don't know the deep-down details of mork very well, but I would be
> surprised if it ever actually called fsync.

It would be good to revisit this bug after Chiaki's fixes for disk I/O buffering and other I/O issues have landed. http://mzl.la/1NkXfCP

Chiaki, do you agree (see comment 12), and can you provide the number of a bug you are fixing that should block this bug?
Flags: needinfo?(ishikawa)
Sorry I did not respond earlier.
When needinfo came in, it was during my summer vacation week.
Ever since I came back, I have been quite tied up with the preparation for an
overseas business trip starting on 7th September (for a full week).

There are some patches that I hope to submit in September (it is already September!); they are tracked in the following meta bug:
Bug 1121842 - [META] RFC: C-C Thunderbird - Cleaning of incorrect Close, unchecked Flush, Write etc. in nsPop3Sink.cpp and friends.

It may be a good idea to make it a blocker for this bug. However, I am afraid that
most of the I/O done by the low-level code I am tweaking is not really related to the database code, so doing so (blocking this bug) may be moot.

At the same time, I have been investigating I/O issues using strace under Linux, so
if anything new comes up in my investigation, I will post a comment here.

OTOH, I just realized I filed
Bug 1120444 - Use fdatasync properly instead of fsync where appropriate
earlier this year.
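
For readers unfamiliar with the distinction: fdatasync flushes a file's data plus only the metadata needed to read it back, while fsync also flushes the remaining metadata such as timestamps. A minimal Python sketch of "prefer fdatasync where it exists" follows; `sync_data` and `demo` are names made up for this example and say nothing about how the actual patch is written.

```python
import os
import tempfile


def sync_data(fd):
    """Flush file data to disk, preferring fdatasync when the platform has it.

    Returns the name of the call that was used.
    """
    if hasattr(os, "fdatasync"):
        os.fdatasync(fd)
        return "fdatasync"
    os.fsync(fd)  # fallback where Python does not expose fdatasync
    return "fsync"


def demo():
    # Write to a scratch file and sync it with whichever call is available.
    with tempfile.TemporaryFile() as f:
        f.write(b"flags=0001\n")
        f.flush()  # push Python's own buffer to the OS before syncing
        return sync_data(f.fileno())
```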
Flags: needinfo?(ishikawa)
(In reply to ISHIKAWA, Chiaki from comment #14)
> Sorry I did not respond earlier.
> When needinfo came in, it was during my summer vacation week.
> Ever since I came back, I have been quite tied up with the preparation for an
> overseas business trip starting on 7th September (for a full week).
> 
> There are some patches which I wish I could submit within September (it is
> already September!), there is a following meta bug:
> Bug 1121842 - [META] RFC: C-C Thunderbird - Cleaning of incorrect Close,
> unchecked Flush, Write etc. in nsPop3Sink.cpp and friends.
> 
> It may be a good idea to put this as a blocker for this bug, however, I am
> afraid that
> most of the I/O done by low-level code which I am tweaking is not quite
> related to database code, so doing it (blocking this bug) may be moot.
> 
> At the same time, I have been investigating I/O issues using strace() under
> linux and so
> if anything new comes up in my investigation, I would post a comment here.

It's been a long two years - any new perspectives?


> OTOH, I just realized I filed
> Bug 1120444 - Use fdatasync properly instead of fsync where appropriate
> earlier this year.

That bug has just been fixed. Hooray!
No longer blocks: 487375
Depends on: 1120444, 487375
Flags: needinfo?(ishikawa)
See Also: → 1121842
I'm not so sure that putting data at risk as suggested in comment 0 is a smart way to go, when there are many ways to improve battery life that don't potentially impact data integrity (unless there is a method that does NOT put data at risk).  To name a few:

* Use Thunderbird settings to ensure checking for new email is done infrequently - 15-20 minutes for example.
* Fix these Thunderbird power-related bugs https://mzl.la/2xasn7u  
* Core+Firefox bugs https://mzl.la/2xaZ849  - there has been a battery effort over the last few years, with slow progress. Perhaps only 10% are relevant to Thunderbird, but every bit helps.   
* Fix performance issues which tend to affect power usage. Most notably calendar https://mzl.la/2xbKMkl  ~90 major+critical Thunderbird bugs https://mzl.la/2xbMMZT  (Probably only 10-20% are relevant to everyday usage, but again every bit helps)
* Fix Thunderbird memory leaks, which cause high memory usage and thus excessive CPU usage from avoidable GC and CC overhead
* Finish and ship maildir?

Plus, battery life in general has improved in recent years.
Sorry, I cannot work on Thunderbird until next weekend because of a big conference week at my day job.
However, the Close and other error-check improvements are now in
bug 1242030 via bug 1121842.
https://bugzilla.mozilla.org/show_bug.cgi?id=1242030
Depends on: 1242030
Flags: needinfo?(ishikawa)
Severity: normal → S3