Closed Bug 1397101 Opened 5 years ago Closed 5 years ago

stylo: More heap-overhead with stylo enabled

Categories

(Core :: CSS Parsing and Computation, defect, P3)

Tracking

RESOLVED FIXED
mozilla57
Tracking Status
firefox-esr52 --- wontfix
firefox55 --- wontfix
firefox56 --- wontfix
firefox57 --- fixed

People

(Reporter: bholley, Assigned: glandium)

References

(Blocks 1 open bug)

Details

Attachments

(5 files)

glandium is looking into this.
Flags: needinfo?(mh+mozilla)
There are mainly two parts to heap-overhead:
- bin-unused: those are entirely due to fragmentation, and I have yet to come to a conclusion about them
- page-cache: those are dirty pages that are kept around for reuse and that the OS hasn't been explicitly asked to purge. We're actually allowing this to grow up to 1MB per arena. That might be too large a number for the stylo arenas. Dropping it down to 128K per arena for the stylo arenas "only" makes us do between 2.5 and 3 times the number of madvise/VirtualFree(MEM_DECOMMIT) system calls, which may or may not be worth doing; the performance impact would need to be evaluated. This is memory we can actively reclaim through a call to jemalloc_free_dirty_pages (although that applies to all arenas in the given process), which we currently only do on memory pressure.
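The dirty-page cap mechanism described above can be sketched roughly as follows; the names (`Arena`, `PagesToPurge`) and the field layout are illustrative stand-ins, not the actual mozjemalloc code:

```cpp
#include <cstddef>

// Illustrative model of a mozjemalloc arena's dirty-page accounting;
// the field names approximate the real ones, but this is not the actual code.
struct Arena {
  size_t ndirty;     // dirty unused pages currently kept for reuse
  size_t dirty_max;  // cap on ndirty, in pages (1MiB / 4KiB pages = 256)
};

// How many pages must be purged (madvise(MADV_FREE) on Unix,
// VirtualFree(MEM_DECOMMIT) on Windows) to get back under the cap.
size_t PagesToPurge(const Arena& arena) {
  return arena.ndirty > arena.dirty_max ? arena.ndirty - arena.dirty_max : 0;
}
```

Lowering the cap (e.g. from 256 pages to 32 for a 128KiB limit) makes the purge threshold trip more often, which is why the number of purge syscalls goes up.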
> Dropping that down to 128K per arena for the stylo arenas "only" makes us do between 2.5 and 3 times the number of madvise/VirtualFree(MEM_DECOMMIT) system calls.

To put that in perspective: with the current settings, there's about 0.1% madvise/VirtualFree(MEM_DECOMMIT) system calls per mozjemalloc call on the stylo arenas; the change brings that to between 0.25 and 0.3%.
Oh, I should also mention that there seems to be between 300k and 400k of data per process allocated in the stylo arenas that is never freed, which obviously doesn't help with fragmentation. I haven't tracked down where it comes from yet.
> - bin-unused: those are entirely due to fragmentation, and I have yet to come to a conclusion about them

Hmm.  Is the fragmentation per-thread?  That is, is it possible that we're going from 1MB to 6MB of fragmentation, even with nothing else changing, merely because we have 6 threads, not 1?

> We're actually allowing this to go up to 1MB per arena.

Are arenas per-thread?

If the fragmentation and page-cache numbers are basically fixed and independent of workload, just dependent on the number of threads stylo uses, then maybe we shouldn't worry too much about them.  But I'm not sure the fragmentation number is that workload-independent; I've seen it varying quite a bit in bug 1384945 comment 29 and bug 1384945 comment 36...
(P3 just to avoid having multiple blocking bugs tracking memory usage)
Priority: -- → P3
(In reply to Boris Zbarsky [:bz] (still digging out from vacation mail) from comment #5)
> > - bin-unused: those are entirely due to fragmentation, and I have yet to come to a conclusion about them
> 
> Hmm.  Is the fragmentation per-thread?  That is, is it possible that we're
> going from 1MB to 6MB of fragmentation, even with nothing else changing,
> merely because we have 6 threads, not 1?

Probably not that high, but yes. And it does happen; that was my first observation: disabling the thread-local arenas for stylo decreases heap-overhead.

> > We're actually allowing this to go up to 1MB per arena.
> 
> Are arenas per-thread?

Stylo threads all have their own arena.

> If the fragmentation and page-cache numbers are basically fixed and
> independent of workload, just dependent on the number of threads stylo uses,
> then maybe we shouldn't worry too much about them.  But I'm not sure the
> fragmentation number is that workload-independent; I've seen it varying
> quite a bit in bug 1384945 comment 29 and bug 1384945 comment 36...

Page-cache varies between 0 and 1M per thread depending on workload, because it depends on how much adjacent memory has been allocated and freed. But it's capped at 1M per stylo thread (which is arguably way too much), so in some sense it /can/ be considered independent of workload.

Fragmentation is very much workload-dependent. I'm still trying to get a picture of what it looks like, but it's hard to find a "stable" point at which to look. Memory use varies while stylo works (obviously), but I also don't see an obvious drop when following a link and loading another page (although maybe I found one, actually; need to dig deeper). I've started looking at how things are split between threads, but so far my data doesn't sum up right.

As for the 300~400K of "leaks" per thread, they seem to be thread-local data from servo and rayon code.
(In reply to Mike Hommey [:glandium] from comment #7)
> As for the 300~400K of "leaks" per thread, they seem to be thread-local data
> from servo and rayon code.

That is much higher than I would expect. At least for the servo code, I would expect something closer to 10k. Where is it coming from?
Err, that was per process, not per thread, but that's still more than you're expecting. I'll try to get a thorough list.
(And to be clear, the large per-thread permaleaks I would expect would be from TLS: 4k for the bloom filters, ~8k for the style sharing cache buffers).
Attached patch: DMD hack
This is the DMD hack I used to gather the log I'm about to attach. It makes DMD only track things that were allocated from the stylo thread arenas.
Attachment #8905694 - Attachment is patch: true
Attached file DMD report
This is the DMD report for the content process, and gathered with:
  MOZ_DMD_SHUTDOWN_LOG=/tmp/dmd/ MOZ_CC_RUN_DURING_SHUTDOWN=1 STYLO_THREADS=3 ./mach run --dmd --mode=live --stacks=full -- -profile /tmp/profile -new-instance https://wikipedia.org/wiki/Obama

This is on an opt build with NS_FREE_PERMANENT_DATA manually enabled in nscore.h and all.js.

This is actually less (about half) than what I saw with allocation logging, and I'm not ruling out that I did something wrong there (probably the same root cause as why my numbers didn't sum up when I tried to look at fragmentation per thread).

Note that about half of the reported size comes from stylo code, and the other half from pango and fontconfig, through things like GetSystemFontInfo. Much of the latter is small allocations, too, which contribute to the overall fragmentation problem. It would be much better if those happened on the main arena instead of the stylo arenas...

That's still about 75K for 3 stylo threads, so about 25K per thread.
Same as the previous report, but with font allocation excluded.
So on awsy, at least, our heap-overhead regression from stylo in the "tabs opened" condition is about 20MB.  Even once we close all the tabs it's still about 8-10MB.

That's with STYLO_THREADS=4, and I'm pretty sure there are 4 content processes.  So 16 stylo threads total.  So we could have up to 16MB just from page-cache, in theory, right?
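That 16MB upper bound is just the cap multiplied out (assuming one thread-local arena per stylo thread and the 1MiB default cap from comment 1):

```cpp
// Worst-case page-cache attributable to stylo arenas alone: one
// thread-local arena per stylo thread, each allowed up to 1MiB of
// dirty pages under the default cap.
constexpr int kContentProcesses = 4;
constexpr int kStyloThreadsPerProcess = 4;  // STYLO_THREADS=4
constexpr double kDirtyCapMiB = 1.0;

constexpr double kMaxStyloPageCacheMiB =
    kContentProcesses * kStyloThreadsPerProcess * kDirtyCapMiB;  // 16 MiB
```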

The actual breakdowns inside heap-overhead while all the tabs are open look like this:

  Heap-overhead breakdown:
    page-cache: Gecko 2.857422, Stylo 11.080078, regression: 8.222656
    bin-unused: Gecko 27.299109, Stylo 38.512325, regression: 11.213216
    bookkeeping: Gecko 6.241396, Stylo 7.442486, regression: 1.201090

and after the tabs are closed look like this:

  Heap-overhead breakdown:
    page-cache: Gecko 0.710449, Stylo 3.678711, regression: 2.968262
    bin-unused: Gecko 11.625354, Stylo 18.060195, regression: 6.434841
    bookkeeping: Gecko 1.621473, Stylo 1.794022, regression: 0.172548
See the commit messages on the two attached patches for the cause of up to 1.2MiB of waste in page-cache and bin-unused per stylo thread.

The remainder is more straightforward fragmentation, which can also be decreased somewhat by tweaking size classes for small allocations, but that's harder to do without impacting the main arena, and touching the configuration of the main arena is bound to cause performance and memory regressions. I'm going to gather some more stats and see which specific size classes are problematic; maybe we can do something about them at a higher level (by adding some padding to some structs, for example; this is the fun, unintuitive situation where allocating more memory ends up using less memory...).
Flags: needinfo?(mh+mozilla)
Comment on attachment 8907888 [details]
Bug 1397101 - Reduce the number of dirty pages we allow to be kept in thread local arenas.

https://reviewboard.mozilla.org/r/179572/#review184752

::: memory/build/mozjemalloc.cpp:736
(Diff revision 1)
>  	 * dirty, and for which madvise(... MADV_FREE) has not been called.  By
>  	 * tracking this, we can institute a limit on how much dirty unused
>  	 * memory is mapped for each arena.
>  	 */
>  	size_t			ndirty;
> +	size_t			dirty_max;

Please add a comment that indicates what unit this number has. (AFAICT it's pages.)

::: memory/build/mozjemalloc.cpp:4627
(Diff revision 1)
>  #ifndef XP_WIN
>      malloc_mutex_unlock(&init_lock);
>  #endif
>      return true;
>    }
> +  /* arenas_new() sets this to a lower value for thread local arenas ;

Should be `arena_new()`. Also, remove the space before the ';'.
Attachment #8907888 - Flags: review?(n.nethercote) → review+
Comment on attachment 8907889 [details]
Bug 1397101 - Only use a thread local arena for small sizes.

https://reviewboard.mozilla.org/r/179574/#review184756

The word "hack" has multiple meanings and shades of meaning, and this patch encompasses most of them :)
Attachment #8907889 - Flags: review?(n.nethercote) → review+
Comment on attachment 8907889 [details]
Bug 1397101 - Only use a thread local arena for small sizes.

https://reviewboard.mozilla.org/r/179574/#review184760

::: memory/build/mozjemalloc.cpp:2295
(Diff revision 1)
>     * library version, libc's malloc is used by TLS allocation, which
>     * introduces a bootstrapping issue.
>     */
>  #ifndef NO_TLS
> -
> +  // Only use a thread local arena for small sizes.
> +  if (size < small_max) {

This is meant to be <=
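For context, the check being reviewed routes an allocation to a thread-local arena only when it is small; a simplified sketch of that intent (identifiers illustrative, not the actual mozjemalloc code), with the `<=` the review asks for:

```cpp
#include <cstddef>

constexpr size_t kSmallMax = 512;  // illustrative small/large boundary

// Small allocations go to the calling thread's arena; anything larger
// falls through to the main arena so it doesn't bloat the per-thread
// arenas. Note the <=: a request of exactly kSmallMax still counts as small.
bool UseThreadLocalArena(size_t size) {
  return size <= kSmallMax;
}
```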
So if I replay my AWSY allocation log for all processes, this is what I get at "TabsOpenForceGC":
- page-cache: 11,235,328
- bin-unused: 60,188,816
- bookkeeping: 10,945,680

With style thread arenas disabled:
- page-cache: 3,616,768
- bin-unused: 52,527,472
- bookkeeping: 10,652,592

With the two patches attached, with style thread arenas enabled:
- page-cache: 3,502,080
- bin-unused: 57,885,936
- bookkeeping: 10,885,168

Note this is replaying the exact same allocation log over and over again, so there's no noise between runs, like when comparing the output from awsy itself.

(and yes, bin-unused is actually very high on the main arena too, but that's a topic for another day)

If I further tweak the sizes passed to malloc functions on the stylo threads only to use size classes spaced by 32 instead of 16(*):
- page-cache: 4,104,192
- bin-unused: 56,989,160
- bookkeeping: 10,994,352

* so, 32, 64, 96, 128, etc. instead of 16, 32, 48, etc., meaning malloc(33) returns a 64-byte block instead of a 48-byte one.

So in some sense, tweaking size classes makes things worse (more dirty pages), even though it does reduce bin-unused a little.
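The footnote's example is just rounding up to the next class boundary; modeled as a helper (a sketch, not mozjemalloc's actual size-class machinery):

```cpp
#include <cstddef>

// Round a requested size up to the next size class, for classes spaced
// every `spacing` bytes. (mozjemalloc's real layout also has sub-quantum
// and large classes; this only models the quantum-spaced range.)
size_t RoundToClass(size_t size, size_t spacing) {
  return (size + spacing - 1) / spacing * spacing;
}
```

With 16-byte spacing, `RoundToClass(33, 16)` yields 48; with 32-byte spacing, `RoundToClass(33, 32)` yields 64, matching the malloc(33) example in the footnote.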
(In reply to Nicholas Nethercote [:njn] from comment #18)
> Comment on attachment 8907888 [details]
> Bug 1397101 - Reduce the number of dirty pages we allow to be kept in thread
> local arenas.
> 
> https://reviewboard.mozilla.org/r/179572/#review184752
> 
> ::: memory/build/mozjemalloc.cpp:736
> (Diff revision 1)
> >  	 * dirty, and for which madvise(... MADV_FREE) has not been called.  By
> >  	 * tracking this, we can institute a limit on how much dirty unused
> >  	 * memory is mapped for each arena.
> >  	 */
> >  	size_t			ndirty;
> > +	size_t			dirty_max;
> 
> Please add a comment that indicates what unit this number has. (AFAICT it's
> pages.)

I'm adding the comment "maximum value allowed for ndirty", and ndirty is already commented as being a count of pages.

> Also, remove the space before the ';'.

Dammit, I always forget that the typography rule for the semicolon is different between English and French.
(In reply to Mike Hommey [:glandium] from comment #21)
> So if I replay my AWSY allocation log for all processes, this is what I get
> at "TabsOpenForceGC":
> - page-cache: 11,235,328
> - bin-unused: 60,188,816
> - bookkeeping: 10,945,680
> 
> With style thread arenas disabled:
> - page-cache: 3,616,768
> - bin-unused: 52,527,472
> - bookkeeping: 10,652,592
> 
> With the two patches attached, with style thread arenas enabled:
> - page-cache: 3,502,080
> - bin-unused: 57,885,936
> - bookkeeping: 10,885,168

Those numbers are for 4 content processes and 4 stylo threads per process. That makes the difference in bin-unused between thread arenas enabled/disabled about 327KiB per thread.
Pushed by mh@glandium.org:
https://hg.mozilla.org/integration/autoland/rev/60f6c4563645
Reduce the number of dirty pages we allow to be kept in thread local arenas. r=njn
https://hg.mozilla.org/integration/autoland/rev/d9efe6d06931
Only use a thread local arena for small sizes. r=njn
Blocks: 1399695
Some more numbers, for "TabsClosedForceGC", this time, for the same AWSY allocations log:

Unpatched, style arenas enabled:
- page-cache: 13,336,576
- bin-unused: 55,715,936
- bookkeeping: 6,425,360

Unpatched, style arenas disabled:
- page-cache: 2,859,008
- bin-unused: 53,727,648
- bookkeeping: 6,304,208

Patched, style arenas enabled:
- page-cache: 3,747,840
- bin-unused: 53,469,088
- bookkeeping: 6,366,160
Keywords: leave-open
This is awesome, thanks glandium!

We already limit style struct allocations to be < 512 bytes:

http://searchfox.org/mozilla-central/rev/6326724982c66aaeaf70bb7c7ee170f7a38ca226/layout/style/nsStyleStruct.cpp#60

So I think that limit will generally work well for us.
(In reply to Mike Hommey [:glandium] from comment #21)
> With the two patches attached, with style thread arenas enabled:
> - page-cache: 3,502,080
> - bin-unused: 57,885,936
> - bookkeeping: 10,885,168

Still same AWSY allocations log, for "TabsOpenForceGC", patched, style arenas enabled, and all allocations >= 240 and < 512 clamped to 512:
- page-cache: 4,005,888
- bin-unused: 55,950,352
- bookkeeping: 11,179,504
A seemingly smarter clamping:
  368 < size <= 432 -> 432
  432 < size <= 480 -> 480
  480 < size <= 512 -> 512

- page-cache: 3,600,384
- bin-unused: 57,777,616
- bookkeeping: 10,873,168

Adding some more clamping:
  272 < size <= 368 -> 368
  208 < size <= 272 -> 272

- page-cache: 3,571,712
- bin-unused: 57,265,568
- bookkeeping: 10,856,528

Adding even more clamping:
  176 < size <= 208 -> 208

- page-cache: 3,588,096
- bin-unused: 57,277,664
- bookkeeping: 10,856,720

And the final stroke, grouping all runs of the same size for runs larger than a page:
  80 < size <= 176 -> 176

- page-cache: 4,239,360
- bin-unused: 56,325,808
- bookkeeping: 10,852,000
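Taken together, the successive tweaks in this comment amount to a clamp map like this (a sketch of the experiment, not landed code):

```cpp
#include <cstddef>

// Cumulative clamp map from the experiments above: a requested size
// falling inside a bucket is rounded up to the bucket's upper bound
// before being passed to malloc on the stylo threads.
size_t ClampStyloSize(size_t size) {
  if (size <= 80)  return size;  // small sizes left alone
  if (size <= 176) return 176;
  if (size <= 208) return 208;
  if (size <= 272) return 272;
  if (size <= 368) return 368;
  if (size <= 432) return 432;
  if (size <= 480) return 480;
  if (size <= 512) return 512;
  return size;                   // large sizes left alone
}
```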

This is as far as tweaking struct sizes would get us, and it's not much: the best result was the dumb attempt from comment 29, and even that only reduced bin-unused by 118KiB per thread on average.

There *are* things we can do to improve the situation, but as mentioned earlier, they're not on the low risk side of the things we can do for the short term.

So I'm inclined to say, at this point, that the landed patches are as far as we're going to go with this bug.
Keywords: leave-open
> So I'm inclined to say, at this point, that the landed patches are as far as we're going to go with this bug.

Sounds reasonable to me.
FWIW, with a riskier approach, kind of related to bug 1399695 but not entirely (making all runs on stylo threads one page large), and assuming I didn't actually break anything with the horrible hack:

- page-cache: 3,330,048
- bin-unused: 54,374,432
- bookkeeping: 10,677,728

which is the closest I've gotten so far to the overhead with thread arenas disabled, but I'm actually surprised bookkeeping is so low, so I'm not ruling out that I've done something wrong. Anyway, it means we do have some room for improvement with more aggressive approaches.
(In reply to Mike Hommey [:glandium] from comment #32)
> - bookkeeping: 10,677,728

corrected value: 11,775,232

which is less surprising.
For what it's worth, my worry with the various clamping approaches is that they may overtune for current allocation patterns.  But those patterns can change, especially as we add or remove things in style structs...
Depends on: 1436541