Closed
Bug 1397101
Opened 7 years ago
Closed 7 years ago
stylo: More heap-overhead with stylo enabled
Categories: Core :: CSS Parsing and Computation, defect, P3
Tracking: RESOLVED FIXED in mozilla57
People
(Reporter: bholley, Assigned: glandium)
References
(Blocks 1 open bug)
Details
Attachments
(5 files)
Assignee
Comment 2•7 years ago
There are mainly two parts to heap-overhead:
- bin-unused: entirely due to fragmentation; I have yet to come to a conclusion about it.
- page-cache: dirty pages that are kept around for reuse, and that we haven't explicitly asked the OS to purge. We currently allow this to grow to 1MB per arena, which may be too large a number for the stylo arenas. Dropping that down to 128K per arena for the stylo arenas "only" makes us do between 2.5 and 3 times the number of madvise/VirtualFree(MEM_DECOMMIT) system calls, which may or may not be worth doing; the performance impact would need to be evaluated. This is memory that we can actively reclaim through a call to jemalloc_free_dirty_pages (although that applies to all arenas in the given process), which we currently only do on memory pressure.
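For illustration, here is a toy C++ model of the dirty-page cap described above. The names (`Arena`, `dirty_max`, `FreePages`) echo the discussion but are not mozjemalloc's actual code, and purging everything at once is a simplification; the real allocator purges incrementally, issuing one madvise/VirtualFree(MEM_DECOMMIT) per range.

```cpp
#include <cassert>
#include <cstddef>

// Toy model: each arena caches freed ("dirty") pages for reuse, up to a cap.
// Once the cap is exceeded, the pages are handed back to the OS.
struct Arena {
  std::size_t dirty_max;   // cap, in pages (cf. the ~1MB-per-arena default)
  std::size_t ndirty = 0;  // dirty pages currently cached for reuse
  std::size_t purges = 0;  // simulated madvise/VirtualFree(MEM_DECOMMIT) calls

  explicit Arena(std::size_t max) : dirty_max(max) {}

  // Freeing leaves pages dirty; going over the cap triggers a purge.
  void FreePages(std::size_t npages) {
    ndirty += npages;
    if (ndirty > dirty_max) {
      ++purges;     // simplification: one purge releases everything
      ndirty = 0;
    }
  }
};
```

Lowering `dirty_max` makes `purges` grow faster for the same free traffic, which is the trade-off behind the "2.5 to 3 times the number of system calls" observation.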
Assignee
Comment 3•7 years ago
> Dropping that down to 128K per arena for the stylo arenas "only" makes us do between 2.5 and 3 times the number of madvise/VirtualFree(MEM_DECOMMIT) system calls.
To put that in perspective: with the current settings, about 0.1% of mozjemalloc calls on the stylo arenas result in a madvise/VirtualFree(MEM_DECOMMIT) system call; dropping to 128K makes that between 0.25 and 0.3%.
Assignee
Comment 4•7 years ago
Oh, I should also mention that there seems to be between 300k and 400k of data per process allocated in the stylo arenas that is never freed, which obviously doesn't help with fragmentation. I haven't tracked down where it comes from yet.
Comment 5•7 years ago
> - bin-unused: those are entirely due to fragmentation, and I have yet to come to a conclusion about them

Hmm. Is the fragmentation per-thread? That is, is it possible that we're going from 1MB to 6MB of fragmentation, even with nothing else changing, merely because we have 6 threads, not 1?

> We're actually allowing this to go up to 1MB per arena.

Are arenas per-thread?

If the fragmentation and page-cache numbers are basically fixed and independent of workload, just dependent on the number of threads stylo uses, then maybe we shouldn't worry too much about them. But I'm not sure the fragmentation number is that workload-independent; I've seen it vary quite a bit in bug 1384945 comment 29 and bug 1384945 comment 36...
Reporter
Comment 6•7 years ago
(P3 just to avoid having multiple blocking bugs tracking memory usage)
Priority: -- → P3
Assignee
Comment 7•7 years ago
(In reply to Boris Zbarsky [:bz] (still digging out from vacation mail) from comment #5)

> Hmm. Is the fragmentation per-thread? That is, is it possible that we're going from 1MB to 6MB of fragmentation, even with nothing else changing, merely because we have 6 threads, not 1?

Probably not that high, but yes. And it does happen; that was my first observation: disabling the thread-local arenas for stylo decreases heap-overhead.

> Are arenas per-thread?

Stylo threads all have their own arena.

> If the fragmentation and page-cache numbers are basically fixed and independent of workload, just dependent on the number of threads stylo uses, then maybe we shouldn't worry too much about them.

Page-cache varies between 0 and 1M per thread depending on workload, because it depends on how much adjacent memory has been allocated and freed. But it's capped at 1M per stylo thread (which can be argued to be way too much), so in some sense it /can/ be considered independent of workload. Fragmentation is very much workload-dependent. I'm still trying to get a picture of what it looks like, but it's hard to find a "stable" point at which to look. Memory use varies while stylo works (obviously), but I also don't see an obvious drop when following a link and loading another page (although maybe I found one, actually; need to dig deeper). I've started looking at how things are split between threads, but so far my data is not summing up right.

As for the 300~400K of "leaks" per thread, they seem to be thread-local data from servo and rayon code.
Reporter
Comment 8•7 years ago
(In reply to Mike Hommey [:glandium] from comment #7)

> As for the 300~400K of "leaks" per thread, they seem to be thread-local data from servo and rayon code.

That is much higher than I would expect. At least for the servo code, I would expect something closer to 10k. Where is it coming from?
Assignee
Comment 9•7 years ago
Err, that was per process, not per thread, but that's still more than you're expecting. I'll try to get a thorough list.
Reporter
Comment 10•7 years ago
(And to be clear, the large per-thread permaleaks I would expect would be from TLS: 4k for the bloom filters, ~8k for the style sharing cache buffers).
Assignee
Comment 11•7 years ago
This is the DMD hack I used to gather the log I'm about to attach. It makes DMD only track things that were allocated from the stylo thread arenas.
Assignee
Updated•7 years ago
Attachment #8905694 - Attachment is patch: true
Assignee
Comment 12•7 years ago
This is the DMD report for the content process, gathered with:

MOZ_DMD_SHUTDOWN_LOG=/tmp/dmd/ MOZ_CC_RUN_DURING_SHUTDOWN=1 STYLO_THREADS=3 ./mach run --dmd --mode=live --stacks=full -- -profile /tmp/profile -new-instance https://wikipedia.org/wiki/Obama

This is on an opt build with NS_FREE_PERMANENT_DATA manually enabled in nscore.h and all.js. This is actually less (half) than what I saw with allocation logging, and I'm not excluding that I did something wrong there (probably the same root cause as why my numbers didn't sum up when I tried to look at fragmentation per thread). Note that about half of the reported size comes from stylo code, and the other half... pango and fontconfig, through things like GetSystemFontInfo. A lot of the latter is small allocations, too, which contribute to the overall fragmentation problem. It would be much better if those happened on the main arena instead of the stylo arenas... That's still about 75K for 3 stylo threads, so about 25K per thread.
Assignee
Comment 13•7 years ago
Same as the previous report, but with font allocation excluded.
Comment 14•7 years ago
So on awsy, at least, our heap-overhead regression from stylo in the "tabs opened" condition is about 20MB. Even once we close all the tabs it's still about 8-10MB. That's with STYLO_THREADS=4, and I'm pretty sure there are 4 content processes, so 16 stylo threads total. So we could have up to 16MB just from page-cache, in theory, right?

The actual breakdowns inside heap-overhead while all the tabs are open look like this:

Heap-overhead breakdown:
- page-cache: Gecko 2.857422, Stylo 11.080078, regression: 8.222656
- bin-unused: Gecko 27.299109, Stylo 38.512325, regression: 11.213216
- bookkeeping: Gecko 6.241396, Stylo 7.442486, regression: 1.201090

and after the tabs are closed look like this:

Heap-overhead breakdown:
- page-cache: Gecko 0.710449, Stylo 3.678711, regression: 2.968262
- bin-unused: Gecko 11.625354, Stylo 18.060195, regression: 6.434841
- bookkeeping: Gecko 1.621473, Stylo 1.794022, regression: 0.172548
Updated•7 years ago
status-firefox55: --- → wontfix
status-firefox56: --- → wontfix
status-firefox57: --- → affected
status-firefox-esr52: --- → wontfix
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Assignee
Comment 17•7 years ago
See the commit messages on the two attached patches for the cause of up to 1.2MiB of waste in page-cache and bin-unused per stylo thread. The remainder is more straightforward fragmentation, which can also be decreased somewhat by tweaking size classes for small allocations, but that's harder to do without impacting the main arena, and touching the configuration of the main arena is bound to cause performance and memory regressions. I'm going to gather some more stats and see which specific size classes are problematic; maybe we can do something about them at a higher level (by adding some padding to some structs, for example; this is the fun, unintuitive situation where allocating more memory ends up using less memory...).
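To make the "allocate more to use less" intuition concrete, here is a hypothetical sketch: with fixed-size runs, a region size that divides the run more evenly wastes less tail space, so a slightly padded struct can reduce overall waste. A single 4KiB run is assumed here, which simplifies mozjemalloc's actual run sizing.

```cpp
#include <cassert>
#include <cstddef>

// Assumed run size: one 4KiB page (real runs can span several pages).
constexpr std::size_t kRunSize = 4096;

// Bytes left unusable at the end of a run for a given region (size class).
std::size_t TailWaste(std::size_t regionSize) {
  return kRunSize % regionSize;
}
```

For example, padding a 112-byte struct to 120 bytes drops the per-run tail waste from 64 bytes to 16: the bigger struct packs the run more tightly.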
Flags: needinfo?(mh+mozilla)
Comment 18•7 years ago
Comment on attachment 8907888 [details]
Bug 1397101 - Reduce the number of dirty pages we allow to be kept in thread local arenas.

https://reviewboard.mozilla.org/r/179572/#review184752

::: memory/build/mozjemalloc.cpp:736
(Diff revision 1)
>  * dirty, and for which madvise(... MADV_FREE) has not been called. By
>  * tracking this, we can institute a limit on how much dirty unused
>  * memory is mapped for each arena.
>  */
>  size_t ndirty;
> + size_t dirty_max;

Please add a comment that indicates what unit this number has. (AFAICT it's pages.)

::: memory/build/mozjemalloc.cpp:4627
(Diff revision 1)
> #ifndef XP_WIN
>   malloc_mutex_unlock(&init_lock);
> #endif
>   return true;
> }
> + /* arenas_new() sets this to a lower value for thread local arenas ;

Should be `arena_new()`. Also, remove the space before the ';'.
Attachment #8907888 - Flags: review?(n.nethercote) → review+
Comment 19•7 years ago
Comment on attachment 8907889 [details]
Bug 1397101 - Only use a thread local arena for small sizes.

https://reviewboard.mozilla.org/r/179574/#review184756

The word "hack" has multiple meanings and shades of meaning, and this patch encompasses most of them :)
Attachment #8907889 - Flags: review?(n.nethercote) → review+
Assignee
Comment 20•7 years ago
Comment on attachment 8907889 [details]
Bug 1397101 - Only use a thread local arena for small sizes.

https://reviewboard.mozilla.org/r/179574/#review184760

::: memory/build/mozjemalloc.cpp:2295
(Diff revision 1)
>  * library version, libc's malloc is used by TLS allocation, which
>  * introduces a bootstrapping issue.
>  */
> #ifndef NO_TLS
> -
> + // Only use a thread local arena for small sizes.
> + if (size < small_max) {

This is meant to be <=
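A minimal sketch of the boundary in question, assuming `small_max` is 512 (illustrative; the real value comes from mozjemalloc's size-class setup): with `<`, a request of exactly `small_max` bytes would skip the thread-local arena even though it is still a small allocation.

```cpp
#include <cassert>
#include <cstddef>

// Assumed small-size threshold for illustration only.
constexpr std::size_t small_max = 512;

// Route small allocations to the thread-local arena; the off-by-one noted
// in the review means the comparison must be inclusive.
bool UsesThreadLocalArena(std::size_t size) {
  return size <= small_max;  // the fix: `<=`, so size == small_max qualifies
}
```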
Assignee
Comment 21•7 years ago
So if I replay my AWSY allocation log for all processes, this is what I get at "TabsOpenForceGC":
- page-cache: 11,235,328
- bin-unused: 60,188,816
- bookkeeping: 10,945,680

With style thread arenas disabled:
- page-cache: 3,616,768
- bin-unused: 52,527,472
- bookkeeping: 10,652,592

With the two patches attached, with style thread arenas enabled:
- page-cache: 3,502,080
- bin-unused: 57,885,936
- bookkeeping: 10,885,168

Note this is replaying the exact same allocation log over and over again, so there's no noise between runs, unlike when comparing the output from awsy itself. (And yes, bin-unused is actually very high on the main arena too, but that's a topic for another day.)

If I further tweak the sizes passed to malloc functions on the stylo threads only to use size classes spaced by 32 instead of 16(*):
- page-cache: 4,104,192
- bin-unused: 56,989,160
- bookkeeping: 10,994,352

(*) so 32, 64, 96, 128, etc. instead of 16, 32, 48, etc., meaning malloc(33) returns a 64-byte block instead of a 48-byte one.

So in some sense, tweaking size classes makes things worse (more dirty pages), while it does reduce bin-unused a little.
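The footnote's arithmetic can be sketched as a small rounding helper (not jemalloc code, just the model of spacing size classes 16 or 32 bytes apart):

```cpp
#include <cassert>
#include <cstddef>

// Round a requested size up to the next size class for a given spacing.
std::size_t RoundUpToClass(std::size_t size, std::size_t spacing) {
  return (size + spacing - 1) / spacing * spacing;
}
```

With 16-byte spacing, malloc(33) lands in a 48-byte class; with 32-byte spacing, a 64-byte class, matching the footnote.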
Assignee
Comment 22•7 years ago
(In reply to Nicholas Nethercote [:njn] from comment #18)

> Please add a comment that indicates what unit this number has. (AFAICT it's pages.)

I'm adding the comment "maximum value allowed for ndirty", and ndirty is already commented as being a count of pages.

> Also, remove the space before the ';'.

Dammit, I always forget that the typography rule for the semicolon is different between English and French.
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Assignee
Comment 25•7 years ago
(In reply to Mike Hommey [:glandium] from comment #21)

> So if I replay my AWSY allocation log for all processes, this is what I get at "TabsOpenForceGC" [...]

Those numbers are for 4 content processes and 4 stylo threads per process. That makes the difference in bin-unused between thread arenas enabled/disabled about 327KiB per thread.
Comment 26•7 years ago
Pushed by mh@glandium.org:
https://hg.mozilla.org/integration/autoland/rev/60f6c4563645
Reduce the number of dirty pages we allow to be kept in thread local arenas. r=njn
https://hg.mozilla.org/integration/autoland/rev/d9efe6d06931
Only use a thread local arena for small sizes. r=njn
Assignee
Comment 27•7 years ago
Some more numbers, this time for "TabsClosedForceGC", for the same AWSY allocations log:

Unpatched, style arenas enabled:
- page-cache: 13,336,576
- bin-unused: 55,715,936
- bookkeeping: 6,425,360

Unpatched, style arenas disabled:
- page-cache: 2,859,008
- bin-unused: 53,727,648
- bookkeeping: 6,304,208

Patched, style arenas enabled:
- page-cache: 3,747,840
- bin-unused: 53,469,088
- bookkeeping: 6,366,160
Assignee
Updated•7 years ago
Keywords: leave-open
Reporter
Comment 28•7 years ago
This is awesome, thanks glandium! We already limit style struct allocations to be < 512 bytes:

http://searchfox.org/mozilla-central/rev/6326724982c66aaeaf70bb7c7ee170f7a38ca226/layout/style/nsStyleStruct.cpp#60

So I think that limit will generally work well for us.
Assignee
Comment 29•7 years ago
(In reply to Mike Hommey [:glandium] from comment #21)
> With the two patches attached, with style thread arenas enabled:
> - page-cache: 3,502,080
> - bin-unused: 57,885,936
> - bookkeeping: 10,885,168

Still the same AWSY allocations log, for "TabsOpenForceGC", patched, style arenas enabled, and all allocations >= 240 and < 512 clamped to 512:
- page-cache: 4,005,888
- bin-unused: 55,950,352
- bookkeeping: 11,179,504
Assignee
Comment 30•7 years ago
Seemingly smarter:
  368 < size <= 432 -> 432
  432 < size <= 480 -> 480
  480 < size <= 512 -> 512
- page-cache: 3,600,384
- bin-unused: 57,777,616
- bookkeeping: 10,873,168

Adding some more clamping:
  272 < size <= 368 -> 368
  208 < size <= 272 -> 272
- page-cache: 3,571,712
- bin-unused: 57,265,568
- bookkeeping: 10,856,528

Adding even more clamping:
  176 < size <= 208 -> 208
- page-cache: 3,588,096
- bin-unused: 57,277,664
- bookkeeping: 10,856,720

And the final stroke, grouping all runs of the same size for runs larger than a page:
  80 < size <= 176 -> 176
- page-cache: 4,239,360
- bin-unused: 56,325,808
- bookkeeping: 10,852,000

This is as far as tweaking struct sizes would get us, and that's not much: the best was the dumb attempt from comment 29, and that only reduced bin-unused by 118KiB per thread on average. There *are* things we can do to improve the situation, but as mentioned earlier, they're not on the low-risk side of what we can do in the short term. So I'm inclined to say, at this point, that the landed patches are as far as we're going to go with this bug.
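Written out as code, the full clamping table experimented with above looks like this. This is a sketch: the break points are taken verbatim from the comment, the function name is made up, and sizes outside all ranges pass through unchanged.

```cpp
#include <cassert>
#include <cstddef>

// Clamp a requested allocation size to one of the experimental size classes.
// Ranges are checked from largest to smallest; anything else is untouched.
std::size_t ClampSize(std::size_t size) {
  if (size > 480 && size <= 512) return 512;
  if (size > 432 && size <= 480) return 480;
  if (size > 368 && size <= 432) return 432;
  if (size > 272 && size <= 368) return 368;
  if (size > 208 && size <= 272) return 272;
  if (size > 176 && size <= 208) return 208;
  if (size > 80  && size <= 176) return 176;
  return size;  // below 80 or above 512: pass through
}
```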
Keywords: leave-open
Comment 31•7 years ago
> So I'm inclined to say, at this point, that the landed patches are as far as we're going to go with this bug.
Sounds reasonable to me.
Assignee
Comment 32•7 years ago
FWIW, with a riskier approach, kind of related to bug 1399695 but not entirely (making all runs on stylo threads one page large), and assuming I didn't actually break anything with the horrible hack:
- page-cache: 3,330,048
- bin-unused: 54,374,432
- bookkeeping: 10,677,728

which is the closest I've gotten so far to the overhead with thread arenas disabled. I'm actually surprised bookkeeping is so low, though, so I'm not excluding that I've done something wrong. Anyway, it means we do have some room for improvement with more aggressive approaches.
Comment 33•7 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/60f6c4563645
https://hg.mozilla.org/mozilla-central/rev/d9efe6d06931
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
Assignee
Comment 34•7 years ago
(In reply to Mike Hommey [:glandium] from comment #32)
> - bookkeeping: 10,677,728

Corrected value: 11,775,232, which is less surprising.
Comment 35•7 years ago
For what it's worth, my worry with the various clamping approaches is that they may overtune for current allocation patterns. But those patterns can change, especially as we add or remove things in the style structs...