Closed Bug 1397101 Opened 5 years ago Closed 5 years ago

stylo: More heap-overhead with stylo enabled

Categories

(Core :: CSS Parsing and Computation, defect, P3)

Tracking

RESOLVED FIXED
mozilla57
Tracking Status
firefox-esr52 --- wontfix
firefox55 --- wontfix
firefox56 --- wontfix
firefox57 --- fixed

People

(Reporter: bholley, Assigned: glandium)

References

(Blocks 1 open bug)

Details

Attachments

(5 files)

glandium is looking into this.
Flags: needinfo?(mh+mozilla)
There are mainly two parts to heap-overhead:
- bin-unused: those are entirely due to fragmentation, and I have yet to come to a conclusion about them
- page-cache: those are dirty pages that are kept around for reuse and that the OS hasn't been explicitly asked to purge. We're actually allowing this to grow up to 1MB per arena. That might be too large a number for the stylo arenas. Dropping it down to 128K per arena for the stylo arenas "only" makes us do between 2.5 and 3 times the number of madvise/VirtualFree(MEM_DECOMMIT) system calls, which may or may not be worth doing; the performance impact would need to be evaluated. This is memory we can actively reclaim through a call to jemalloc_free_dirty_pages (although that applies to all arenas in the given process), which we currently only do on memory pressure.
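The dirty-page cap mechanism described above can be sketched roughly as follows; the names (`Arena`, `PagesToPurge`) and the field layout are illustrative stand-ins, not the actual mozjemalloc code:

```cpp
#include <cstddef>

// Illustrative model of a mozjemalloc arena's dirty-page accounting;
// the field names approximate the real ones, but this is not the actual code.
struct Arena {
  size_t ndirty;     // dirty unused pages currently kept for reuse
  size_t dirty_max;  // cap on ndirty, in pages (1MiB / 4KiB pages = 256)
};

// How many pages must be purged (madvise(MADV_FREE) on Unix,
// VirtualFree(MEM_DECOMMIT) on Windows) to get back under the cap.
size_t PagesToPurge(const Arena& arena) {
  return arena.ndirty > arena.dirty_max ? arena.ndirty - arena.dirty_max : 0;
}
```

Lowering the cap (e.g. from 256 pages to 32 for a 128KiB limit) makes the purge threshold trip more often, which is why the number of purge syscalls goes up.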
> Dropping that down to 128K per arena for the stylo arenas "only" makes us do between 2.5 and 3 times the number of madvise/VirtualFree(MEM_DECOMMIT) system calls.

To put that in perspective: with the current settings, there's about 0.1% madvise/VirtualFree(MEM_DECOMMIT) system calls per mozjemalloc call on the stylo arenas; the change brings that to between 0.25 and 0.3%.
Oh, I should also mention that there seems to be between 300k and 400k of data per process allocated in the stylo arenas that is never freed, which obviously doesn't help with fragmentation. I haven't tracked down where it comes from yet.
> - bin-unused: those are entirely due to fragmentation, and I have yet to come to a conclusion about them

Hmm.  Is the fragmentation per-thread?  That is, is it possible that we're going from 1MB to 6MB of fragmentation, even with nothing else changing, merely because we have 6 threads, not 1?

> We're actually allowing this to go up to 1MB per arena.

Are arenas per-thread?

If the fragmentation and page-cache numbers are basically fixed and independent of workload, just dependent on the number of threads stylo uses, then maybe we shouldn't worry too much about them.  But I'm not sure the fragmentation number is that workload-independent; I've seen it varying quite a bit in bug 1384945 comment 29 and bug 1384945 comment 36...
(P3 just to avoid having multiple blocking bugs tracking memory usage)
Priority: -- → P3
(In reply to Boris Zbarsky [:bz] (still digging out from vacation mail) from comment #5)
> > - bin-unused: those are entirely due to fragmentation, and I have yet to come to a conclusion about them
> 
> Hmm.  Is the fragmentation per-thread?  That is, is it possible that we're
> going from 1MB to 6MB of fragmentation, even with nothing else changing,
> merely because we have 6 threads, not 1?

Probably not that high, but yes. And it does happen; that was my first observation: disabling the thread-local arenas for stylo decreases heap-overhead.

> > We're actually allowing this to go up to 1MB per arena.
> 
> Are arenas per-thread?

Stylo threads all have their own arena.

> If the fragmentation and page-cache numbers are basically fixed and
> independent of workload, just dependent on the number of threads stylo uses,
> then maybe we shouldn't worry too much about them.  But I'm not sure the
> fragmentation number is that workload-independent; I've seen it varying
> quite a bit in bug 1384945 comment 29 and bug 1384945 comment 36...

Page-cache varies between 0 and 1M per thread depending on workload, because it depends on how much adjacent memory has been allocated and freed. But it's capped at 1M per stylo thread (which is arguably way too much), so in some sense it /can/ be considered independent of workload.

Fragmentation is very much workload-dependent. I'm still trying to get a picture of what it looks like, but it's hard to find a "stable" point at which to look. Memory use varies while stylo works (obviously), but I also don't see an obvious drop when following a link and loading another page (although maybe I found one, actually; need to dig deeper). I've started looking at how things are split between threads, but so far my data doesn't sum up right.

As for the 300~400K of "leaks" per thread, they seem to be thread-local data from servo and rayon code.
(In reply to Mike Hommey [:glandium] from comment #7)
> As for the 300~400K of "leaks" per thread, they seem to be thread-local data
> from servo and rayon code.

That is much higher than I would expect. At least for the servo code, I would expect something closer to 10k. Where is it coming from?
Err, that was per process, not per thread, but that's still more than you're expecting. I'll try to get a thorough list.
(And to be clear, the large per-thread permaleaks I would expect would be from TLS: 4k for the bloom filters, ~8k for the style sharing cache buffers).
Attached patch: DMD hack
This is the DMD hack I used to gather the log I'm about to attach. It makes DMD only track things that were allocated from the stylo thread arenas.
Attachment #8905694 - Attachment is patch: true
Attached file DMD report
This is the DMD report for the content process, and gathered with:
  MOZ_DMD_SHUTDOWN_LOG=/tmp/dmd/ MOZ_CC_RUN_DURING_SHUTDOWN=1 STYLO_THREADS=3 ./mach run --dmd --mode=live --stacks=full -- -profile /tmp/profile -new-instance https://wikipedia.org/wiki/Obama

This is on an opt build with NS_FREE_PERMANENT_DATA manually enabled in nscore.h and all.js.

This is actually less (about half) than what I saw with allocation logging, and I'm not ruling out that I did something wrong there (probably the same root cause as why my numbers didn't sum up when I tried to look at fragmentation per thread).

Note that about half of the reported size comes from stylo code, and the other half from pango and fontconfig, through things like GetSystemFontInfo. Much of the latter is small allocations, too, which contribute to the overall fragmentation problem. It would be much better if those happened on the main arena instead of the stylo arenas...

That's still about 75K for 3 stylo threads, so about 25K per thread.
Same as the previous report, but with font allocation excluded.
So on awsy, at least, our heap-overhead regression from stylo in the "tabs opened" condition is about 20MB.  Even once we close all the tabs it's still about 8-10MB.

That's with STYLO_THREADS=4, and I'm pretty sure there are 4 content processes.  So 16 stylo threads total.  So we could have up to 16MB just from page-cache, in theory, right?
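That 16MB upper bound is just the cap multiplied out (assuming one thread-local arena per stylo thread and the 1MiB default cap from comment 1):

```cpp
// Worst-case page-cache attributable to stylo arenas alone: one
// thread-local arena per stylo thread, each allowed up to 1MiB of
// dirty pages under the default cap.
constexpr int kContentProcesses = 4;
constexpr int kStyloThreadsPerProcess = 4;  // STYLO_THREADS=4
constexpr double kDirtyCapMiB = 1.0;

constexpr double kMaxStyloPageCacheMiB =
    kContentProcesses * kStyloThreadsPerProcess * kDirtyCapMiB;  // 16 MiB
```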

The actual breakdowns inside heap-overhead while all the tabs are open look like this:

  Heap-overhead breakdown:
    page-cache: Gecko 2.857422, Stylo 11.080078, regression: 8.222656
    bin-unused: Gecko 27.299109, Stylo 38.512325, regression: 11.213216
    bookkeeping: Gecko 6.241396, Stylo 7.442486, regression: 1.201090

and after the tabs are closed look like this:

  Heap-overhead breakdown:
    page-cache: Gecko 0.710449, Stylo 3.678711, regression: 2.968262
    bin-unused: Gecko 11.625354, Stylo 18.060195, regression: 6.434841
    bookkeeping: Gecko 1.621473, Stylo 1.794022, regression: 0.172548
See the commit messages on the two attached patches for the cause of up to 1.2MiB of waste in page-cache and bin-unused per stylo thread.

The remainder is more straightforward fragmentation, which can also be decreased somewhat by tweaking size classes for small allocations, but that's harder to do without impacting the main arena, and touching the configuration of the main arena is bound to cause performance and memory regressions. I'm going to gather some more stats and see which specific size classes are problematic; maybe we can do something about them at a higher level (by adding some padding to some structs, for example; this is the fun, unintuitive situation where allocating more memory ends up using less memory...).
Flags: needinfo?(mh+mozilla)
Comment on attachment 8907888 [details]
Bug 1397101 - Reduce the number of dirty pages we allow to be kept in thread local arenas.

https://reviewboard.mozilla.org/r/179572/#review184752

::: memory/build/mozjemalloc.cpp:736
(Diff revision 1)
>  	 * dirty, and for which madvise(... MADV_FREE) has not been called.  By
>  	 * tracking this, we can institute a limit on how much dirty unused
>  	 * memory is mapped for each arena.
>  	 */
>  	size_t			ndirty;
> +	size_t			dirty_max;

Please add a comment that indicates what unit this number has. (AFAICT it's pages.)

::: memory/build/mozjemalloc.cpp:4627
(Diff revision 1)
>  #ifndef XP_WIN
>      malloc_mutex_unlock(&init_lock);
>  #endif
>      return true;
>    }
> +  /* arenas_new() sets this to a lower value for thread local arenas ;

Should be `arena_new()`. Also, remove the space before the ';'.
Attachment #8907888 - Flags: review?(n.nethercote) → review+
Comment on attachment 8907889 [details]
Bug 1397101 - Only use a thread local arena for small sizes.

https://reviewboard.mozilla.org/r/179574/#review184756

The word "hack" has multiple meanings and shades of meaning, and this patch encompasses most of them :)
Attachment #8907889 - Flags: review?(n.nethercote) → review+
Comment on attachment 8907889 [details]
Bug 1397101 - Only use a thread local arena for small sizes.

https://reviewboard.mozilla.org/r/179574/#review184760

::: memory/build/mozjemalloc.cpp:2295
(Diff revision 1)
>     * library version, libc's malloc is used by TLS allocation, which
>     * introduces a bootstrapping issue.
>     */
>  #ifndef NO_TLS
> -
> +  // Only use a thread local arena for small sizes.
> +  if (size < small_max) {

This is meant to be <=
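For context, the check being reviewed routes an allocation to a thread-local arena only when it is small; a simplified sketch of that intent (identifiers illustrative, not the actual mozjemalloc code), with the `<=` the review asks for:

```cpp
#include <cstddef>

constexpr size_t kSmallMax = 512;  // illustrative small/large boundary

// Small allocations go to the calling thread's arena; anything larger
// falls through to the main arena so it doesn't bloat the per-thread
// arenas. Note the <=: a request of exactly kSmallMax still counts as small.
bool UseThreadLocalArena(size_t size) {
  return size <= kSmallMax;
}
```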
So if I replay my AWSY allocation log for all processes, this is what I get at "TabsOpenForceGC":
- page-cache: 11,235,328
- bin-unused: 60,188,816
- bookkeeping: 10,945,680

With style thread arenas disabled:
- page-cache: 3,616,768
- bin-unused: 52,527,472
- bookkeeping: 10,652,592

With the two patches attached, with style thread arenas enabled:
- page-cache: 3,502,080
- bin-unused: 57,885,936
- bookkeeping: 10,885,168

Note this is replaying the exact same allocation log over and over again, so there's no noise between runs, like when comparing the output from awsy itself.

(and yes, bin-unused is actually very high on the main arena too, but that's a topic for another day)

If I further tweak the sizes passed to malloc functions on the stylo threads only to use size classes spaced by 32 instead of 16(*):
- page-cache: 4,104,192
- bin-unused: 56,989,160
- bookkeeping: 10,994,352

* so, 32, 64, 96, 128, etc. instead of 16, 32, 48, etc., meaning malloc(33) returns a 64-byte block instead of a 48-byte one.

So in some sense, tweaking size classes makes things worse (more dirty pages), even though it does reduce bin-unused a little.
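The footnote's example is just rounding up to the next class boundary; modeled as a helper (a sketch, not mozjemalloc's actual size-class machinery):

```cpp
#include <cstddef>

// Round a requested size up to the next size class, for classes spaced
// every `spacing` bytes. (mozjemalloc's real layout also has sub-quantum
// and large classes; this only models the quantum-spaced range.)
size_t RoundToClass(size_t size, size_t spacing) {
  return (size + spacing - 1) / spacing * spacing;
}
```

With 16-byte spacing, `RoundToClass(33, 16)` yields 48; with 32-byte spacing, `RoundToClass(33, 32)` yields 64, matching the malloc(33) example in the footnote.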
(In reply to Nicholas Nethercote [:njn] from comment #18)
> Comment on attachment 8907888 [details]
> Bug 1397101 - Reduce the number of dirty pages we allow to be kept in thread
> local arenas.
> 
> https://reviewboard.mozilla.org/r/179572/#review184752
> 
> ::: memory/build/mozjemalloc.cpp:736
> (Diff revision 1)
> >  	 * dirty, and for which madvise(... MADV_FREE) has not been called.  By
> >  	 * tracking this, we can institute a limit on how much dirty unused
> >  	 * memory is mapped for each arena.
> >  	 */
> >  	size_t			ndirty;
> > +	size_t			dirty_max;
> 
> Please add a comment that indicates what unit this number has. (AFAICT it's
> pages.)

I'm adding the comment "maximum value allowed for ndirty", and ndirty is already commented as being a count of pages.

> Also, remove the space before the ';'.

Dammit, I always forget that the typography rule for the semicolon is different between English and French.
(In reply to Mike Hommey [:glandium] from comment #21)
> So if I replay my AWSY allocation log for all processes, this is what I get
> at "TabsOpenForceGC":
> - page-cache: 11,235,328
> - bin-unused: 60,188,816
> - bookkeeping: 10,945,680
> 
> With style thread arenas disabled:
> - page-cache: 3,616,768
> - bin-unused: 52,527,472
> - bookkeeping: 10,652,592
> 
> With the two patches attached, with style thread arenas enabled:
> - page-cache: 3,502,080
> - bin-unused: 57,885,936
> - bookkeeping: 10,885,168

Those numbers are for 4 content processes and 4 stylo threads per process. That makes the difference in bin-unused between thread arenas enabled/disabled about 327KiB per thread.
Pushed by mh@glandium.org:
https://hg.mozilla.org/integration/autoland/rev/60f6c4563645
Reduce the number of dirty pages we allow to be kept in thread local arenas. r=njn
https://hg.mozilla.org/integration/autoland/rev/d9efe6d06931
Only use a thread local arena for small sizes. r=njn
Blocks: 1399695
Some more numbers, for "TabsClosedForceGC", this time, for the same AWSY allocations log:

Unpatched, style arenas enabled:
- page-cache: 13,336,576
- bin-unused: 55,715,936
- bookkeeping: 6,425,360

Unpatched, style arenas disabled:
- page-cache: 2,859,008
- bin-unused: 53,727,648
- bookkeeping: 6,304,208

Patched, style arenas enabled:
- page-cache: 3,747,840
- bin-unused: 53,469,088
- bookkeeping: 6,366,160
Keywords: leave-open
This is awesome, thanks glandium!

We already limit style struct allocations to be < 512 bytes:

http://searchfox.org/mozilla-central/rev/6326724982c66aaeaf70bb7c7ee170f7a38ca226/layout/style/nsStyleStruct.cpp#60

So I think that limit will generally work well for us.
(In reply to Mike Hommey [:glandium] from comment #21)
> With the two patches attached, with style thread arenas enabled:
> - page-cache: 3,502,080
> - bin-unused: 57,885,936
> - bookkeeping: 10,885,168

Still same AWSY allocations log, for "TabsOpenForceGC", patched, style arenas enabled, and all allocations >= 240 and < 512 clamped to 512:
- page-cache: 4,005,888
- bin-unused: 55,950,352
- bookkeeping: 11,179,504
A seemingly smarter clamping:
  368 < size <= 432 -> 432
  432 < size <= 480 -> 480
  480 < size <= 512 -> 512

- page-cache: 3,600,384
- bin-unused: 57,777,616
- bookkeeping: 10,873,168

Adding some more clamping:
  272 < size <= 368 -> 368
  208 < size <= 272 -> 272

- page-cache: 3,571,712
- bin-unused: 57,265,568
- bookkeeping: 10,856,528

Adding even more clamping:
  176 < size <= 208 -> 208

- page-cache: 3,588,096
- bin-unused: 57,277,664
- bookkeeping: 10,856,720

And the final stroke, grouping all runs of the same size for runs larger than a page:
  80 < size <= 176 -> 176

- page-cache: 4,239,360
- bin-unused: 56,325,808
- bookkeeping: 10,852,000
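Taken together, the successive tweaks in this comment amount to a clamp map like this (a sketch of the experiment, not landed code):

```cpp
#include <cstddef>

// Cumulative clamp map from the experiments above: a requested size
// falling inside a bucket is rounded up to the bucket's upper bound
// before being passed to malloc on the stylo threads.
size_t ClampStyloSize(size_t size) {
  if (size <= 80)  return size;  // small sizes left alone
  if (size <= 176) return 176;
  if (size <= 208) return 208;
  if (size <= 272) return 272;
  if (size <= 368) return 368;
  if (size <= 432) return 432;
  if (size <= 480) return 480;
  if (size <= 512) return 512;
  return size;                   // large sizes left alone
}
```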

This is as far as tweaking struct sizes would get us, and it's not much: the best result was the dumb attempt from comment 29, and even that only reduced bin-unused by 118KiB per thread on average.

There *are* things we can do to improve the situation, but as mentioned earlier, they're not on the low risk side of the things we can do for the short term.

So I'm inclined to say, at this point, that the landed patches are as far as we're going to go with this bug.
Keywords: leave-open
> So I'm inclined to say, at this point, that the landed patches are as far as we're going to go with this bug.

Sounds reasonable to me.
FWIW, with a riskier approach, kind of related to bug 1399695 but not entirely (making all runs on stylo threads one page large), and assuming I didn't actually break anything with the horrible hack:

- page-cache: 3,330,048
- bin-unused: 54,374,432
- bookkeeping: 10,677,728

which is the closest I've gotten so far to the overhead with thread arenas disabled, but I'm actually surprised bookkeeping is so low, so I'm not ruling out that I've done something wrong. Anyway, it means we do have some room for improvement with more aggressive approaches.
(In reply to Mike Hommey [:glandium] from comment #32)
> - bookkeeping: 10,677,728

corrected value: 11,775,232

which is less surprising.
For what it's worth, my worry with the various clamping approaches is that they may overtune for current allocation patterns.  But those patterns can change, especially as we add or remove things in style structs...
Depends on: 1436541