Closed Bug 1361262 Opened 3 years ago Closed 3 years ago

nsAppStartup::TrackStartupCrashEnd() can cause main-thread I/O and fflush()

Categories

(Toolkit :: Startup and Profile System, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: ehsan, Assigned: milan)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [bhr][qf:p1])

Attachments

(3 files, 4 obsolete files)

This can cause long running hangs of more than 8 seconds.  Go to https://people-mozilla.org/~mlayzell/bhr/20170429/all.html and search for "mozilla::Preferences::WritePrefFile".  There are many signatures with varying stages of main-thread I/O and FlushFileBuffers there.

Bug 294260 added this code.  In bug 294260 comment 15 Mossop questioned the performance impact of this code.  I would say now we have evidence from our user base to suggest that the performance impact of this has been disastrous.  :-(  In general we have learned our lesson about main-thread I/O, any amount of it can have an arbitrary run time.

Mossop, what do you think about this?  Should we switch to using a simple file here?

Also Benjamin, is it possible to make SavePrefFile() to its work off the main thread?  Also does it *really* need flushing?
Flags: needinfo?(dtownsend)
Flags: needinfo?(benjamin)
Using a simple file would certainly reduce the amount of data we're writing out. We have to flush synchronously in TrackStartupCrashBegin. TrackStartupCrashEnd is called shortly after startup and again at shutdown (which is a no-op if the first call ever happened). We can probably flush asynchronously in the first case but not in the second case.
Flags: needinfo?(dtownsend)
(In reply to Dave Townsend [:mossop] from comment #1)
> Using a simple file would certainly reduce the amount of data we're writing
> out. We have to flush synchronously in TrackStartupCrashBegin.
> TrackStartupCrashEnd is called shortly after startup and again at shutdown
> (which is a no-op if the first call ever happened). We can probably flush
> asynchronously in the first case but not in the second case.

Or to be clear, the second case must block shutdown but we could do that with a shutdown blocker and still do the work asynchronously.
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #0)
> This can cause long running hangs of more than 8 seconds.  Go to
> https://people-mozilla.org/~mlayzell/bhr/20170429/all.html and search for
> "mozilla::Preferences::WritePrefFile".  There are many signatures with
> varying stages of main-thread I/O and FlushFileBuffers there.
> 
> Bug 294260 added this code.  In bug 294260 comment 15 Mossop questioned the
> performance impact of this code.  I would say now we have evidence from our
> user base to suggest that the performance impact of this has been
> disastrous.  :-(

If it was so bad I would have expected ts_paint to have a significant regression on Talos causing a backout… See also bug 920083 though.

> Also Benjamin, is it possible to make SavePrefFile() to its work off the
> main thread?

I guess that's related to bug 789945. If we don't save the pref 

> Also does it *really* need flushing?

A reason why we flush is so that a crash after the "startup" phase that the code is interested in doesn't get treated as a "startup" crash. If we're fine with reducing accuracy and therefore causing the safe mode dialog to appear more often then there are some options.
See Also: → 1360214
(In reply to Matthew N. [:MattN] (behind on bugmail; PM if requests are blocking you) from comment #3)
> (In reply to :Ehsan Akhgari (super long backlog, slow to respond) from
> comment #0)
> > This can cause long running hangs of more than 8 seconds.  Go to
> > https://people-mozilla.org/~mlayzell/bhr/20170429/all.html and search for
> > "mozilla::Preferences::WritePrefFile".  There are many signatures with
> > varying stages of main-thread I/O and FlushFileBuffers there.
> > 
> > Bug 294260 added this code.  In bug 294260 comment 15 Mossop questioned the
> > performance impact of this code.  I would say now we have evidence from our
> > user base to suggest that the performance impact of this has been
> > disastrous.  :-(
> 
> If it was so bad I would have expected ts_paint to have a significant
> regression on Talos causing a backout… See also bug 920083 though.

Sadly Talos letting a main-thread IO pass by doesn't necessarily mean that it is going to be fast in all circumstances.  The speed of main thread IO can vary *vastly* depending on many things including but not limited to the type of hard disk, the file system type, the system IO load, the disk fragmentation, the amount of data being written to disk (the size of the prefs file can vary largely), and so on...

> > Also Benjamin, is it possible to make SavePrefFile() to its work off the
> > main thread?
> 
> I guess that's related to bug 789945.

Yes, thank you!  I was CCed on that bug but had forgotten about it.

> If we don't save the pref 

Looks like this was cut off?

> > Also does it *really* need flushing?
> 
> A reason why we flush is so that a crash after the "startup" phase that the
> code is interested in doesn't get treated as a "startup" crash. If we're
> fine with reducing accuracy and therefore causing the safe mode dialog to
> appear more often then there are some options.

OK, that makes sense.  If we pick a small (empty?) file instead, at least flushing here would be much cheaper.  What do you think about that?
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #4)
> (In reply to Matthew N. [:MattN] (behind on bugmail; PM if requests are
> blocking you) from comment #3)
> > (In reply to :Ehsan Akhgari (super long backlog, slow to respond) from
> > comment #0)
> > > This can cause long running hangs of more than 8 seconds.  Go to
> > > https://people-mozilla.org/~mlayzell/bhr/20170429/all.html and search for
> > > "mozilla::Preferences::WritePrefFile".  There are many signatures with
> > > varying stages of main-thread I/O and FlushFileBuffers there.
> > > 
> > > Bug 294260 added this code.  In bug 294260 comment 15 Mossop questioned the
> > > performance impact of this code.  I would say now we have evidence from our
> > > user base to suggest that the performance impact of this has been
> > > disastrous.  :-(
> > 
> > If it was so bad I would have expected ts_paint to have a significant
> > regression on Talos causing a backout… See also bug 920083 though.
> 
> Sadly Talos letting a main-thread IO pass by doesn't necessarily mean that
> it is going to be fast in all circumstances.  The speed of main thread IO
> can vary *vastly* depending on many things including but not limited to the
> type of hard disk, the file system type, the system IO load, the disk
> fragmentation, the amount of data being written to disk (the size of the
> prefs file can vary largely), and so on...

I just remembered that TrackStartupCrashEnd gets called after ts_paint stops measuring startup time. Note that performance was taken into account (you can see discussion in the bug) and it's the reason why we use the lock file's timestamp and prefs as a somewhat hacky approach rather than introducing new I/O for TrackStartupCrashBegin. Not regressing startup time (e.g. first paint) with I/O was prioritized over doing some I/O after startup.

> > > Also Benjamin, is it possible to make SavePrefFile() to its work off the
> > > main thread?
> > 
> > I guess that's related to bug 789945.
> 
> Yes, thank you!  I was CCed on that bug but had forgotten about it.
> 
> > If we don't save the pref 
> 
> Looks like this was cut off?

Oops, I had moved it to the paragraph below.

> > > Also does it *really* need flushing?
> > 
> > A reason why we flush is so that a crash after the "startup" phase that the
> > code is interested in doesn't get treated as a "startup" crash. If we're
> > fine with reducing accuracy and therefore causing the safe mode dialog to
> > appear more often then there are some options.
> 
> OK, that makes sense.  If we pick a small (empty?) file instead, at least
> flushing here would be much cheaper.  What do you think about that?

IIRC, the reason this was implemented using prefs was to avoid new file I/O in TrackStartupCrashBegin specifically to avoid regressing startup time (ts_paint). From my investigation at the time, any approach that adds new file I/O in TrackStartupCrashBegin will regress ts_paint though perhaps we'll be fine with it if that I/O is less than the TrackStartupCrashEnd I/O to be replaced (but isn't measured by ts_paint).
(In reply to Matthew N. [:MattN] (behind on bugmail; PM if requests are blocking you) from comment #5)
> I just remembered that TrackStartupCrashEnd gets called after ts_paint stops
> measuring startup time. Note that performance was taken into account (you
> can see discussion in the bug) and it's the reason why we use the lock
> file's timestamp and prefs as a somewhat hacky approach rather than
> introducing new I/O for TrackStartupCrashBegin. Not regressing startup time
> (e.g. first paint) with I/O was prioritized over doing some I/O after
> startup.

I think comment 0 may have come off the wrong way, I apologize, I didn't mean to imply that performance wasn't taken into account.  The bug clearly shows the contrary.  Obviously now we have 5 years of hindsight and also the BHR telemetry data, and I'm conscious of that.  I would certainly do the same as you did back then at my best.

The fact that TrackStartupCrashEnd gets called after ts_paint stops measuring startup time means that it will be blind to any regressions or improvements from this function specifically, right?  :-(

The other issue is that the size of the preferences file varies depending on the profile, there are unfortunately add-ons that use the prefs service as a key-value store to store large amounts of data so the file can get really large.  Sadly when a user gets into such a situation even uninstalling the add-on may not clear those prefs.  Fixing this bug can especially make a big difference for those users.

> > > > Also does it *really* need flushing?
> > > 
> > > A reason why we flush is so that a crash after the "startup" phase that the
> > > code is interested in doesn't get treated as a "startup" crash. If we're
> > > fine with reducing accuracy and therefore causing the safe mode dialog to
> > > appear more often then there are some options.
> > 
> > OK, that makes sense.  If we pick a small (empty?) file instead, at least
> > flushing here would be much cheaper.  What do you think about that?
> 
> IIRC, the reason this was implemented using prefs was to avoid new file I/O
> in TrackStartupCrashBegin specifically to avoid regressing startup time
> (ts_paint). From my investigation at the time, any approach that adds new
> file I/O in TrackStartupCrashBegin will regress ts_paint though perhaps
> we'll be fine with it if that I/O is less than the TrackStartupCrashEnd I/O
> to be replaced (but isn't measured by ts_paint).

Given the fact that we are seeing the impact of this through telemetry, I think this is one of the cases where optimizing for a benchmark may hurt real users.  I would advocate for regressing any Talos benchmark that we need in order to fix the impact of this to our users.  :-)  If we land a fix here, we can wait for a few days and see whether we see the impact through telemetry and use that as justification for a possible Talos regression.

(And we should also try to figure out how to create Talos tests that are representative of those users in order to be able to have benchmarks that we can use in lieu of telemetry here.)

I also think we should not totally rule out the last paragraph of comment 3.  I'm not sure who would make that decision.  Do you know Matt?
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #6)
> (In reply to Matthew N. [:MattN] (behind on bugmail; PM if requests are
> blocking you) from comment #5)
> > I just remembered that TrackStartupCrashEnd gets called after ts_paint stops
> > measuring startup time. Note that performance was taken into account (you
> > can see discussion in the bug) and it's the reason why we use the lock
> > file's timestamp and prefs as a somewhat hacky approach rather than
> > introducing new I/O for TrackStartupCrashBegin. Not regressing startup time
> > (e.g. first paint) with I/O was prioritized over doing some I/O after
> > startup.
> 
> I think comment 0 may have come off the wrong way, I apologize, I didn't
> mean to imply that performance wasn't taken into account.  The bug clearly
> shows the contrary.  Obviously now we have 5 years of hindsight and also the
> BHR telemetry data, and I'm conscious of that.  I would certainly do the
> same as you did back then at my best.
> 
> The fact that TrackStartupCrashEnd gets called after ts_paint stops
> measuring startup time means that it will be blind to any regressions or
> improvements from this function specifically, right?  :-(

Right

> The other issue is that the size of the preferences file varies depending on
> the profile, there are unfortunately add-ons that use the prefs service as a
> key-value store to store large amounts of data so the file can get really
> large.  Sadly when a user gets into such a situation even uninstalling the
> add-on may not clear those prefs.  Fixing this bug can especially make a big
> difference for those users.

Yeah, I understand, I just don't know a great solution.

> > > > > Also does it *really* need flushing?
> > > > 
> > > > A reason why we flush is so that a crash after the "startup" phase that the
> > > > code is interested in doesn't get treated as a "startup" crash. If we're
> > > > fine with reducing accuracy and therefore causing the safe mode dialog to
> > > > appear more often then there are some options.
> > > 
> > > OK, that makes sense.  If we pick a small (empty?) file instead, at least
> > > flushing here would be much cheaper.  What do you think about that?
> > 
> > IIRC, the reason this was implemented using prefs was to avoid new file I/O
> > in TrackStartupCrashBegin specifically to avoid regressing startup time
> > (ts_paint). From my investigation at the time, any approach that adds new
> > file I/O in TrackStartupCrashBegin will regress ts_paint though perhaps
> > we'll be fine with it if that I/O is less than the TrackStartupCrashEnd I/O
> > to be replaced (but isn't measured by ts_paint).
> 
> Given the fact that we are seeing the impact of this through telemetry, I
> think this is one of the cases where optimizing for a benchmark may hurt
> real users. 

Agreed. I'll admit that's what happened in bug 294260.

> I would advocate for regressing any Talos benchmark that we
> need in order to fix the impact of this to our users.  :-)  If we land a fix
> here, we can wait for a few days and see whether we see the impact through
> telemetry and use that as justification for a possible Talos regression.

You could also simply remove the 30s timeout on the call to trackStartupCrashEnd and use that as a baseline for Talos comparisons (though it wouldn't be totally fair since the I/O would be happening at a different time).

> (And we should also try to figure out how to create Talos tests that are
> representative of those users in order to be able to have benchmarks that we
> can use in lieu of telemetry here.)
> 
> I also think we should not totally rule out the last paragraph of comment 3.

Agreed.

> I'm not sure who would make that decision.  Do you know Matt?

Maybe bsmedberg or Mossop will have an opinion.

I don't have strong feelings on it. I think if the only goal of the feature is to offer Safe Mode after multiple consecutive "startup" crashes then I think the worst case with that change (without looking too closely at the code again) is that safe mode is offered to a user who crashed:
1) during "startup" once and then
2) crashed "during" or "after startup" twice 
(though I could be wrong.) i.e. it may be that we only change the accuracy if the user is already in the failure case of having one startup crash though someone should double-check my logic.

Offering Safe Mode isn't great when that user may have been better off with a regular startup and the about:sessionrestore page showing to uncheck the tab causing a crash… though now e10s also changes things too. It may be worth re-thinking the whole feature.

We also have data on the feature's usefulness but someone would need to correlate it with crash rates: https://mzl.la/2qtkAhB
    If we have a good definition of "reduced accuracy" and what "more often" means in the context of safe mode dialogs appearing, I would start with :canuckistani.

    If we don't flush in "...Begin", and crash, we won't notice it as a startup crash.  There would be more, perhaps a lot more, crashes the user would experience before we ever got them into the safe mode.  I don't know if we have telemetry on how often we "auto switch" into the safe mode.

    If we don't flush in "...End", and crash, a non-start up crash may look like a start up crash.  We would auto switch into the safe mode in the situations when it isn't necessary.
Would we consider a file append special case for this particular pref?  Instead of writing the whole pref file out, just append the toolkit.startup.recent_crashes line to the existing pref file in both start and end.  We lose the safety and chance a partial write (if a crash happens during the sync write), but it gives us a result independent of the size of the pref file.
(In reply to Milan Sreckovic [:milan] from comment #8)
> I don't know if we have telemetry on how often we "auto switch" into the safe mode.

That is part of the probe mentioned in but it's opt-in which may complicate calculations if we use the release channel…

(Quoting Matthew N. [:MattN] from comment #7)
> We also have data on the feature's usefulness but someone would need to
> correlate it with crash rates: https://mzl.la/2qtkAhB
s/ in / below /

To clarify, I'm talking about the "forced" bucket (value=3)
Whiteboard: [bhr]
So, bug 789945 has the initial patches for OMT pref file save, but we'd have to make a decision that this particular call should be async, or it won't help.
So in general I don't think this should be using prefs (it's basically abusing prefs as a key/value store). If we do keep using prefs, we're going to have to live with a more asynchronous file writing behavior: I believe we're going to support supporting fully-synchronous pref flushes altogether.

If the synchronicity/near-synchronicity is in fact important here, we could do this by touching/removing a small purpose-built file in the profile.
Flags: needinfo?(benjamin)
Whiteboard: [bhr] → [bhr][qf]
I'll take a quick look at the "touch the file" approach.
Flags: needinfo?(milan)
Since ...Begin() save pref only happens after a crash, and ...End() happens all the time, it makes sense to give more priority to getting rid of the ...End() pref save.  We can still get rid of the one in ...Begin(), but the one in ...End() is the more important one. 

This suggests that the "touch file", or whatever method we have that would eventually end up in the prefs, would need to exist between ...End() and prefs file eventually saved.  The information to be kept is two int preference values, although we could simplify things by treating the existence of such file on the next startup as the worst case scenario (details to be figured out.)

So, perhaps we can start with something as simple as "preference files are out of date" touch file without any information in it as to what the out of date preferences are.
I think my preferred thought right now is to add a callback whenever a successful file save happens.  The callers could then register for it, and find out when their changes were finally saved.  It could be a "one time use" or a "keep it until unregistered" type of callback, although we'd only use the first kind here.  We could then create the touch file in ...End(), remove it in the callback, check for it in ...Begin().
Assignee: nobody → milan
Flags: needinfo?(milan)
Attachment #8867759 - Flags: feedback?
Whiteboard: [bhr][qf] → [bhr][qf:p1]
Attachment #8867759 - Attachment is obsolete: true
Attachment #8867759 - Flags: feedback?
Attachment #8870875 - Flags: feedback?(dtownsend)
Attachment #8870875 - Flags: feedback?(dtownsend) → feedback?(benjamin)
The attached patches depend on those in bug 789945.
Comment on attachment 8870876 [details] [diff] [review]
Part 2. Touch file approach to help avoid sync prefs save (once we have sync prefs saves)

Review of attachment 8870876 [details] [diff] [review]:
-----------------------------------------------------------------

As discussed over IRC let's remove the request ID checking bits as they are unnecessary.

::: toolkit/components/startup/nsAppStartup.cpp
@@ +118,5 @@
> +    MOZ_ASSERT(aId >= sRequestId);
> +    // This removes the touch file
> +    nsCOMPtr<nsIFile> file = TouchFile();
> +    bool exists;
> +    if (file && NS_SUCCEEDED(file->Exists(&exists)) && exists) {

We don't need to check the file's existence before attempting to remove it.
Attachment #8870876 - Flags: feedback?(dtownsend) → feedback+
So, I have more up to date patches, but the more I think about it, the more I'm starting to lean towards just having an async-off-main-thread write, and be done with it.  So, if we crash between requesting the write and the preferences actually getting written (which is realistically going to be within a couple of seconds), so be it - we're using this to detect start up crashes, and crashing 2 seconds later is probably close to a start up crash anyway.

I'll still post the up to date patches in case we decide we want to complicate things :)
Attachment #8870875 - Flags: feedback?(benjamin)
I'm going to close this with patches from bug 789945.  We do a blocking pref save when reporting a previous crash on startup, which isn't often and shouldn't hurt us.  Otherwise, we just do asynchronous save, which means that comment 7 could happen, where we identify a post nsAppStartup::TrackStartupCrashEnd crash as the one that happened before it.  Given that it's a split second, it doesn't sound like that much of a wrong decision to do that.
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.