Closed Bug 1279389 Opened 4 years ago Closed 4 years ago

Enable osfile.reset_worker_delay on all products

Categories

(Toolkit :: OS.File, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla50
Tracking Status
firefox50 --- fixed

People

(Reporter: bkelly, Assigned: bkelly)

References

Details

(Whiteboard: [MemShrink])

Attachments

(3 files, 1 obsolete file)

In bug 927560 and bug 961317 we added the ability to shutdown the osfile worker after a timeout.  This was mainly done to reduce memory usage on b2g.

It would be nice, though, to run this consistently across all our products.  There is no good reason to waste memory on desktop if we have the means to clean it up.  This would save us about 3.5MB on desktop and 1MB on android.  While these are relatively small amounts, every little bit adds up and can also contribute to fragmentation over time.

Since the pref is already implemented, I think we should consider just turning it on.
This enables the osfile.reset_worker_delay pref with a fairly conservative 30 second value.  At this level try is green (with the follow-on P2 patch that tweaks one of the devtools tests):

https://treeherder.mozilla.org/#/jobs?repo=try&revision=9eaa178f554c

I also did a talos run before and after the patch to demonstrate that it did not regress performance:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=233bfce6458f&newProject=try&newRevision=9eaa178f554cfc4a7b186db93192b8f9f7510031&framework=1

I did try lower delay values, but it made more tests unstable.  For example, the devtools webide browser_tabs.js test opens and closes webide very quickly.  This tool has an async step to update the package list from disk, but nothing in the test waits for this.  With the added delay of the worker startup and its associated messaging sometimes this package list update would fire after shutdown leak collection completed.  I don't think this is a real production bug, but is an issue with the test not waiting for all operations to complete.

Rather than chase stuff like that down to set an aggressive timeout I figured we could start at something conservative like this and still reap the benefit of improved memory in a typical browsing session.
Attachment #8761885 - Flags: review?(dteller)
Ryan, I needed to relax the timing constraint in browser_telemetry_toolboxtabs_storage.js slightly in order to accommodate added latency from spinning up the osfile worker in some cases.  Without this the test passes all its cases, but runs into a "this.transport is null" in main.js as its ending.  I believe its hitting a similar "async file operation that doesn't block the test" as I described in the previous comment.
Attachment #8761886 - Flags: review?(jryans)
Comment on attachment 8761885 [details] [diff] [review]
P1 Enable osfile.reset_worker_delay with a conservative 30 second delay. r=yoric

Actually, I see that the worker is getting stuck open after the crash manager maintenance runs.  Its not getting requests, though.  If I can't fix that then this is moot.
Attachment #8761885 - Flags: review?(dteller)
Attachment #8761886 - Flags: review?(jryans)
It seems we are forgetting to close directory iterators somewhere.
In this case its like TelemetryStorage.jsm, but dxr suggests there are more in the tree. :-(
Comment on attachment 8761885 [details] [diff] [review]
P1 Enable osfile.reset_worker_delay with a conservative 30 second delay. r=yoric

See comment 1 for review description.
Attachment #8761885 - Flags: review?(dteller)
Comment on attachment 8761886 [details] [diff] [review]
P2 Relax the timing in browser_telemetry_toolboxtabs_storage.js to account for osfile worker startup. r=jryans

See comment 2 for review description.
Attachment #8761886 - Flags: review?(jryans)
Fix all the places I could find where we forget to call .close() on a DirectoryIterator.  Another advantage of turning on the osfile worker timeout is that we can more easily see if a new leak like this creeps in.
Attachment #8761913 - Flags: review?(dteller)
Comment on attachment 8761886 [details] [diff] [review]
P2 Relax the timing in browser_telemetry_toolboxtabs_storage.js to account for osfile worker startup. r=jryans

Review of attachment 8761886 [details] [diff] [review]:
-----------------------------------------------------------------

Sounds fine to me, assuming Talos looks good for the new run.
Attachment #8761886 - Flags: review?(jryans) → review+
In B2G, this feature has been a neverending cause of race conditions and deadlocks. So I'm actually voting for us to entirely get rid of the ability to shutdown OS.File, rather than extending it to other platforms.

Moreover, shutting down OS.File makes sense on B2G, because it is only ever used during the first ~1 second of booting a process, but not nearly as much on Desktop, where it is used all over the place.

Finally, we have had issues in the past with how long it takes to start the OS.File worker in the first place. According to Telemetry, this can take up to 5 seconds (!) on Desktop, depending on cache contention, etc. so we might end up hurting the performance of Firefox for users.
Flags: needinfo?(bkelly)
Comment on attachment 8761885 [details] [diff] [review]
P1 Enable osfile.reset_worker_delay with a conservative 30 second delay. r=yoric

Review of attachment 8761885 [details] [diff] [review]:
-----------------------------------------------------------------

(canceling review until we have discussed the usefulness of this bug)
Attachment #8761885 - Flags: review?(dteller)
Comment on attachment 8761913 [details] [diff] [review]
P3 Fix OS.File.DirectoryIterator leaks throughout tree. r=yoric

Review of attachment 8761913 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for the fix.

Note that, in practice, most of these iterators are automatically closed (this happens when we have iterated to the end), but being extra cautious is a good idea.
Attachment #8761913 - Flags: review?(dteller) → review+
(In reply to David Rajchenbach-Teller [:Yoric] (please use "needinfo") from comment #13)
> Comment on attachment 8761913 [details] [diff] [review]
> P3 Fix OS.File.DirectoryIterator leaks throughout tree. r=yoric
> 
> Review of attachment 8761913 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Thanks for the fix.
> 
> Note that, in practice, most of these iterators are automatically closed
> (this happens when we have iterated to the end), but being extra cautious is
> a good idea.

From what I have observed this auto-close is not working.  Those archive telemetry iterators seem like they iterate to the end quickly, but they were not closed.
(In reply to David Rajchenbach-Teller [:Yoric] (please use "needinfo") from comment #11)
> In B2G, this feature has been a neverending cause of race conditions and
> deadlocks. So I'm actually voting for us to entirely get rid of the ability
> to shutdown OS.File, rather than extending it to other platforms.

Deadlocks?  You mean within OS.File itself?

> Moreover, shutting down OS.File makes sense on B2G, because it is only ever
> used during the first ~1 second of booting a process, but not nearly as much
> on Desktop, where it is used all over the place.

From what I can tell we don't need this running 95% of the time.  We need it once every 10 minutes or so when sync runs.  Or if the user opens a devtools toolbox that uses it.  It seems a shame to waste resources for no reason.

> Finally, we have had issues in the past with how long it takes to start the
> OS.File worker in the first place. According to Telemetry, this can take up
> to 5 seconds (!) on Desktop, depending on cache contention, etc. so we might
> end up hurting the performance of Firefox for users.

Where is the telemetry for this for desktop?  I did run talos and it did not show any regressions.

In general the time it takes to start a worker thread should be negligible compared to disk IO times.
Flags: needinfo?(bkelly)
See questions in comment 15.
Flags: needinfo?(dteller)
I guess I'd like to propose enabling it in nightly since we're at the beginning of a merge cycle.  If it causes problems we can back it out.
(In reply to Ben Kelly [:bkelly] from comment #15)
> Deadlocks?  You mean within OS.File itself?

Indeed. Not the kind of deadlock that blocks a thread entirely, but the kind of deadlock we get with async programming that prevents progress (of an async codepath) because some resource is not available nor will ever be.

> > Moreover, shutting down OS.File makes sense on B2G, because it is only ever
> > used during the first ~1 second of booting a process, but not nearly as much
> > on Desktop, where it is used all over the place.
> 
> From what I can tell we don't need this running 95% of the time.  We need it
> once every 10 minutes or so when sync runs.  Or if the user opens a devtools
> toolbox that uses it.  It seems a shame to waste resources for no reason.

I understand that. If you are willing to put the time to debugging the aforementioned deadlocks, go ahead.

Personally, I'd prefer if we rewrote OS.File in Rust. This would save basically all of this memory, with a faster startup and without the need for the C++ fastpath designed to workaround the slow startup.


> > Finally, we have had issues in the past with how long it takes to start the
> > OS.File worker in the first place. According to Telemetry, this can take up
> > to 5 seconds (!) on Desktop, depending on cache contention, etc. so we might
> > end up hurting the performance of Firefox for users.
> 
> Where is the telemetry for this for desktop?  I did run talos and it did not
> show any regressions.

It's OSFILE_WORKER_READY. You'll see that the 75th percentile is already > 1 second and the 95th percentile almost at 4 seconds.

> In general the time it takes to start a worker thread should be negligible
> compared to disk IO times.

In theory, yes. We were pretty surprised when we realized that using OS.File with Session Restore, which made actual I/O more efficient, caused a multi-second regression in Firefox startup times.
Flags: needinfo?(dteller)
(In reply to David Rajchenbach-Teller [:Yoric] (please use "needinfo") from comment #18)
> (In reply to Ben Kelly [:bkelly] from comment #15)
> > Deadlocks?  You mean within OS.File itself?
> 
> Indeed. Not the kind of deadlock that blocks a thread entirely, but the kind
> of deadlock we get with async programming that prevents progress (of an
> async codepath) because some resource is not available nor will ever be.

Are there existing cases I can test?  This is pretty vague and hard to address at the moment?

> > From what I can tell we don't need this running 95% of the time.  We need it
> > once every 10 minutes or so when sync runs.  Or if the user opens a devtools
> > toolbox that uses it.  It seems a shame to waste resources for no reason.
> 
> I understand that. If you are willing to put the time to debugging the
> aforementioned deadlocks, go ahead.

I would, but how can I find them if we don't enable this?

> > > Finally, we have had issues in the past with how long it takes to start the
> > > OS.File worker in the first place. According to Telemetry, this can take up
> > > to 5 seconds (!) on Desktop, depending on cache contention, etc. so we might
> > > end up hurting the performance of Firefox for users.
> > 
> > Where is the telemetry for this for desktop?  I did run talos and it did not
> > show any regressions.
> 
> It's OSFILE_WORKER_READY. You'll see that the 75th percentile is already > 1
> second and the 95th percentile almost at 4 seconds.

This measured at browser start up today, right?  When the main thread is completely bogged down loading everything right?  We don't start this at any other time on desktop right now AFAIK.  How can this be comparable to starting the worker during more typical operation?

> > In general the time it takes to start a worker thread should be negligible
> > compared to disk IO times.
> 
> In theory, yes. We were pretty surprised when we realized that using OS.File
> with Session Restore, which made actual I/O more efficient, caused a
> multi-second regression in Firefox startup times.

Again, startup is a special case where the main thread is buried under runnables.  Avoiding async is a huge win here.  I don't think that has much bearing on performance once we are outside of startup conditions.

How about if I just enable it in non-release builds?  Aurora and nightly?  That would let us find bugs, etc.
Flags: needinfo?(dteller)
(In reply to Ben Kelly [:bkelly] from comment #19)
> Are there existing cases I can test?  This is pretty vague and hard to
> address at the moment?

Well, everything was on B2G, so it's going to be hard to track down these days.

Lissyx, maybe you remember about a semi-recent B2G issue that was caused by OS.File shutdown?

> > I understand that. If you are willing to put the time to debugging the
> > aforementioned deadlocks, go ahead.
> 
> I would, but how can I find them if we don't enable this?

Well, if you're game, go ahead and enable it.

> > It's OSFILE_WORKER_READY. You'll see that the 75th percentile is already > 1
> > second and the 95th percentile almost at 4 seconds.
> 
> This measured at browser start up today, right?  When the main thread is
> completely bogged down loading everything right?  We don't start this at any
> other time on desktop right now AFAIK.  How can this be comparable to
> starting the worker during more typical operation?

This is measured at the first call to OS.File, so yes, it's certainly during startup.

I don't have a good idea to measure regular restart.

> Again, startup is a special case where the main thread is buried under
> runnables.  Avoiding async is a huge win here.  I don't think that has much
> bearing on performance once we are outside of startup conditions.
> 
> How about if I just enable it in non-release builds?  Aurora and nightly? 
> That would let us find bugs, etc.

Fine by me.
Flags: needinfo?(dteller)
This only enables the pref on non release builds.
Attachment #8761885 - Attachment is obsolete: true
Attachment #8762131 - Flags: review?(dteller)
Attachment #8762131 - Flags: review?(dteller) → review+
Pushed by bkelly@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/2600b18edec8
P1 Enable osfile.reset_worker_delay on non-release builds with a conservative 30 second delay. r=yoric
https://hg.mozilla.org/integration/mozilla-inbound/rev/494ea9227b69
P2 Relax the timing in browser_telemetry_toolboxtabs_storage.js to account for osfile worker startup. r=jryans
https://hg.mozilla.org/integration/mozilla-inbound/rev/f2110c264778
P3 Fix OS.File.DirectoryIterator leaks throughout tree. r=yoric
Ben, is there some form of targeted manual testing that the QA team could perform to help with testing this?
Flags: needinfo?(bkelly)
(In reply to Florin Mezei, QA (:FlorinMezei) from comment #24)
> Ben, is there some form of targeted manual testing that the QA team could
> perform to help with testing this?

So to see if this worked, follow these steps:

1) Launch a fresh browser
2) Browser normally for a few minutes
3) Open an about:memory tab and minimize/measure
4) In the main process there should not be an osfile worker

When account sync (or other subsystems that use osfile async APIs) run, however, the osfile worker should be spawned again.  And then go away after its been idle for a while.  Again, you can use about:memory to see if the worker goes away.

Potential problems to look for would be sync or other file consumers intermittently breaking or being abnormally slow.

Again, this is only enabled in non-release builds.  So nightly now and eventually aurora once 50 merges there.
Flags: needinfo?(bkelly)
https://hg.mozilla.org/mozilla-central/rev/2600b18edec8
https://hg.mozilla.org/mozilla-central/rev/494ea9227b69
https://hg.mozilla.org/mozilla-central/rev/f2110c264778
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Depends on: 1284621
Depends on: 1290288
You need to log in before you can comment on or make changes to this bug.