Lower the skia glyph cache size when we have multiple content processes

RESOLVED FIXED in Firefox 63

Status

()

defect
P3
normal
RESOLVED FIXED
3 years ago
Last year

People

(Reporter: mchang, Assigned: lsalzman)

Tracking

(Blocks 1 bug)

unspecified
mozilla55
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox55 unaffected, firefox63 fixed)

Details

(Whiteboard: [gfx-noted][MemShrink:P1][overhead:10MB])

Attachments

(2 attachments)

From discussion with Eric, we can probably lower the Skia glyph cache size if we have more than one content process. This is so we can reduce memory. When we eventually get to multiple content processes, lower the glyph cache size.
Summary: Low the skia glyph cache size when we have multiple content processes → Lower the skia glyph cache size when we have multiple content processes
See Also: → 1239151
Whiteboard: [gfx-noted] → [gfx-noted][MemShrink]
Whiteboard: [gfx-noted][MemShrink] → [gfx-noted][MemShrink:P1]
Firefox now has multiple content processes, did the skia glyph cache size get reduced?
Flags: needinfo?(mchang)
(In reply to Danial Horton from comment #1)
> Firefox now has multiple content processes, did the skia glyph cache size
> get reduced?

No, but thanks for the reminder! Will get to it
Talked with Eric a bit. Since we now have 4 content processes, it'd be useful to trim some memory if we can. Chrome currently uses 20mb per process and we decided to go with 10 at the time due to a talos regression. Skia's default is 2mb and minimum is 256K. For now, we'll try 5mb to see if anyone notices anything, but we'll do it after the trains switch because 4 is already approved with 10mb per process.

m-c revision - 362360:cad53f061da6
Master talos try - https://treeherder.mozilla.org/#/jobs?repo=try&revision=d178d5a3fe2bd55e8b6e6b76b9684c2ed93463d3
5 mb talos try - https://treeherder.mozilla.org/#/jobs?repo=try&revision=6fde89c94343c398592d526379bb878f4e7e3462
Flags: needinfo?(mchang)
We're going to put in a preference for setting the size, either explicitly, or automatically, based on the system memory, perhaps number of content processes and some magical factor to adjust.  That way we can run experiments and see if telemetry shows us any kind of a performance/checkerboarding changes based on the values.
I don't know if it's worth having a telemetry for cache misses.
Comment on attachment 8874905 [details] [diff] [review]
Add pref for content skia font cache size

Review of attachment 8874905 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, are we going to leave this open for tweaking the value or just file a follow up?
Attachment #8874905 - Flags: review?(erahm) → review+
(In reply to Eric Rahm [:erahm] from comment #7)
> Comment on attachment 8874905 [details] [diff] [review]
> Add pref for content skia font cache size
> 
> Review of attachment 8874905 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Looks good, are we going to leave this open for tweaking the value or just
> file a follow up?

Probably file a follow up. Looks like setting the glyph cache size results in a 4.3% regression on OS X bloom_basic as well.
(In reply to Mason Chang [:mchang] from comment #8)
> (In reply to Eric Rahm [:erahm] from comment #7)
> > Comment on attachment 8874905 [details] [diff] [review]
> > Add pref for content skia font cache size
> > 
> > Review of attachment 8874905 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > Looks good, are we going to leave this open for tweaking the value or just
> > file a follow up?
> 
> Probably file a follow up. Looks like setting the glyph cache size results
> in a 4.3% regression on OS X bloom_basic as well.

Given that's the only regression and the test description [1] indicates "If the bloom filter isn't working correctly, the 'test' page will take orders of magnitude longer", a 4% regression for bloom on osx-only might be acceptable. I assume it's not a realistic test, but just making sure the feature doesn't catastrophically fail.

bholley, what are your thoughts on this (you're listed as a point of contact)? We're looking at a 20MB memory savings with the trade-off noted above.

[1] https://wiki.mozilla.org/Buildbot/Talos/Tests#bloom_basic.2C_bloom_basic_ref
Flags: needinfo?(bobbyholley)
Yep, that's fine totally fine. Curious, did bloom_basic_ref and bloom_basic change the same amount?
Flags: needinfo?(bobbyholley)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #10)
> Yep, that's fine totally fine. Curious, did bloom_basic_ref and bloom_basic
> change the same amount?

Yeah it's the same.
I'm not comfortable enabling this to ride the train until we get some more information on how it impacts the performance in the wild - as in, get some telemetry results on checkerboarding at least.  We'd need to make sure we have the right measurements (we probably do), then run the experiment(s) and look at the results. We can't afford this to lead to a QF:P1 regression, if you will.  If we're going to let this go beyond nightly, we need some numbers.
Pushed by mchang@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3c9a4dbc3409
Add pref for content skia font cache size. r=erahm
https://hg.mozilla.org/mozilla-central/rev/3c9a4dbc3409
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
(In reply to Ryan VanderMeulen [:RyanVM] from comment #14)
> https://hg.mozilla.org/mozilla-central/rev/3c9a4dbc3409

If we're doing the follow up where we actually change the default, we should probably change the subject of this bug.  If we're not doing a follow up, then this was probably a leave open?
(In reply to Milan Sreckovic [:milan] from comment #15)
> (In reply to Ryan VanderMeulen [:RyanVM] from comment #14)
> > https://hg.mozilla.org/mozilla-central/rev/3c9a4dbc3409
> 
> If we're doing the follow up where we actually change the default, we should
> probably change the subject of this bug.  If we're not doing a follow up,
> then this was probably a leave open?

Yeah this landing should have been from a blocking bug. Lets reopen.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Milan Sreckovic [:milan] from comment #12)
> I'm not comfortable enabling this to ride the train until we get some more
> information on how it impacts the performance in the wild - as in, get some
> telemetry results on checkerboarding at least.  We'd need to make sure we
> have the right measurements (we probably do), then run the experiment(s) and
> look at the results. We can't afford this to lead to a QF:P1 regression, if
> you will.  If we're going to let this go beyond nightly, we need some
> numbers.

So the deal is we know this will regress memory by ~30MB for e10s-multi which is enabled by default in 55 (current nightly), and when 54 (current beta) hits release next week we'll be doing a slow rollout to e10s users without add-ons. I'm fine holding off landing a change until 55 hits beta, but I'd expect us to uplift a change shortly.

While I get everyone is concerned about QF regressions, that's targeting 57 and we haven't seen any indication that there will be a regression going down to 5MB / CP. Testing out even lower numbers is certainly more questionable.
I'm only going on what's in this bug - I don't know the content of the conversations that may have happened, so I may not have all the information.

Do we have the data as to how much of the glyph cache we're using?  Are we filling the 20MB today, are we mostly sitting < 15MB, or < 5MB?  How do we argue that this "won't hurt us" without that information?

What kind of testing has been done so that we trust the "no indication that there will be a regression"?  Talos alone won't do it, and even that is showing regressions.  We're not about to let a performance regression through so that we can fix it later.  This will need telemetry data *and* pi-request for a lot more manual testing.  These are fonts - there may very well be existing bugs that get uncovered because we're blowing the cache limit and start getting correctness errors.

Also, why glyph cache?  What led us to this?  We have a 96mb canvas cache, and I'm happy for you to lower that to 86mb and get the same 10mb/process "savings".
Stay tuned: there will testing done with this preference set to 5mb and 20mb vs. 10mb, to see what kind of telemetry results we get when it comes to performance.
Assignee: mchang → nobody
memshrink team would love to see this land. milan: can we reassign to one of the skia experts?
Flags: needinfo?(milan)
To summarize some of the things from previous comments, I'd like to see the justification for this change.  "We will get 5MB per process back" is not enough, as it isn't clear why this cache, and not some other place.

We're asking to save 5MB per process, from the cache that is currently using 10MB.  Halving a cache is quite a change, especially since Chrome has one twice the size (20MB).

We don't know how much of this cache is being filled in the first place.  If we're using all of it all the time, this change would have quite an impact.

A telemetry experiment where we change the cache size and look at performance, and a measurement of performance change on our target hardware and benchmarks would tell us what kind of impact this would have.

Without these, I don't want to just randomly change a size of a cache and hope for the best.
Flags: needinfo?(milan)
(In reply to Milan Sreckovic [:milan] from comment #21)
> To summarize some of the things from previous comments, I'd like to see the
> justification for this change.  "We will get 5MB per process back" is not
> enough, as it isn't clear why this cache, and not some other place.

It's not just this cache, I'd like to reduce all of them to reasonable values. Please let me know of other large caches that you have in mind.

> We're asking to save 5MB per process, from the cache that is currently using
> 10MB.  Halving a cache is quite a change, especially since Chrome has one
> twice the size (20MB).
> 
> We don't know how much of this cache is being filled in the first place.  If
> we're using all of it all the time, this change would have quite an impact.

Sure adding telemetry for the cache utilization would be useful, I'm not sure if skia exposes that or not though.

> A telemetry experiment where we change the cache size and look at
> performance, and a measurement of performance change on our target hardware
> and benchmarks would tell us what kind of impact this would have.

We ran against talos with 5MB and there wasn't a regression (aside from bloom filter which was deemed okay).

> Without these, I don't want to just randomly change a size of a cache and
> hope for the best.

This is reasonable, when do you plan on doing the experiments? Saying we can't change this without telemetry, but not gathering telemetry is less reasonable.
Flags: needinfo?(milan)
Offline conversation on the scheduling.
Flags: needinfo?(milan)
Whiteboard: [gfx-noted][MemShrink:P1] → [gfx-noted][MemShrink:P1][overhead:10MB]
I did a lot of triggers of various talos tests at cache sizes between 5-10MB. It seems like no matter what we're going to take some hits on tests if we lower below 10MB, so if we're going to start with a reduction, it may as well be half at 5MB.

However, the hits are mostly below the noise level on Linux and reasonable single digit (maybe around 3-5%ish on tp6) on Mac. That seems worth it to potentially halve the amount of memory usage as we start cranking up process counts. So I'm okay with taking the trade if you are...
Assignee: nobody → lsalzman
Status: REOPENED → ASSIGNED
Attachment #8996185 - Flags: review?(bzbarsky)
Comment on attachment 8996185 [details] [diff] [review]
reduce Skia font cache size to 5MB

This seems OK, but I have a few questions:

1)  You mention Linux and Mac, but not Windows.  Was there no Tp impact there, or do we not use Skia for glyphs there, or something else?

2)  Did we ever get any of the numbers Milan was asking about wrt checkerboarding impact?
Attachment #8996185 - Flags: review?(bzbarsky) → review+
(In reply to Boris Zbarsky [:bz] (no decent commit message means r-) from comment #25)
> Comment on attachment 8996185 [details] [diff] [review]
> reduce Skia font cache size to 5MB
> 
> This seems OK, but I have a few questions:
> 
> 1)  You mention Linux and Mac, but not Windows.  Was there no Tp impact
> there, or do we not use Skia for glyphs there, or something else?

The Windows numbers seemed within the noise variance, at least on try. But for Windows these days, the high performance segment is going to be running on Direct2D, where the Skia cache size is not going to affect it. If they can't run Direct2D, it is likely they are also going to be on an older, more memory constrained system, so a tradeoff would still make sense there within reason.

It should also be noted that CoreGraphics and DirectWrite have their own caches above Skia's, which we lean on more heavily as a result of halving the cache size, but which DirectWrite seems a bit happier about than CoreGraphics on talos.

WebRender is not affected by this either, since it's caching is yet again different, but at least under our control too unlike CG or DWrite, so there's that.

> 2)  Did we ever get any of the numbers Milan was asking about wrt
> checkerboarding impact?

No. Admittedly the resistance was partially based on the initial multiprocess situation where we were only looking at a couple processes hanging around. But given that we're looking at potentially even higher and higher process counts, I'm willing to just retire this issue with a stupid simple solution, even at cost, since it will buy us enough of a win that maybe it can save of from even more extreme measures or heuristics for a while yet.
https://hg.mozilla.org/mozilla-central/rev/43bc700f24d8
Status: ASSIGNED → RESOLVED
Closed: 2 years agoLast year
Resolution: --- → FIXED
AWSY win!

== Change summary for alert #14693 (as of Tue, 31 Jul 2018 16:13:53 GMT) ==

Improvements:

  2%  Heap Unclassified osx-10-10 opt stylo     95,069,363.60 -> 92,932,657.03

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=14693
You need to log in before you can comment on or make changes to this bug.