Experiment with deferring and/or avoiding arena_t::Purge where possible
Categories
(Core :: Memory Allocator, enhancement)
Tracking
()
| Tracking | Status | |
|---|---|---|
| firefox136 | --- | fixed |
People
(Reporter: jstutte, Assigned: jstutte)
References
(Blocks 2 open bugs, Regressed 1 open bug)
Details
(Keywords: perf-alert, Whiteboard: [sp3])
Attachments
(5 files, 6 obsolete files)
arena_t::Purge checks if if it reached our maximum of non-used but allocated memory and in case is responsible to update our internal book-keeping and give back memory chunks to the OS. It does so synchronously whenever a free or realloc happens. Both our internal book-keeping and giving back memory to the OS can be quite time consuming. We already have ways to influence how much dirty memory we allow before checking, but this comes with the price of memory overhead if taken too far.
There are situations where we know at higher levels of abstraction, that during certain operations a purge is not needed at all or can be done later in order to reduce or avoid this overhead, like:
- process shutdown. If a process knows that it is on the straight path to shutdown, any purge operation can be entirely avoided.
- memory drop and raise. When we destroy a bigger tree of objects, just to re-create another, new tree of other objects, any purge that happens in between could be somehow deferred to happen after the end of that cycle.
This bug wants to explore the potential of making moz_jemalloc aware of such conditions or find other ways to avoid their impact, if possible. A big part of this is probably to set up good ways of measuring the impact of what we are trying to do here.
| Assignee | ||
Updated•1 year ago
|
| Assignee | ||
Comment 1•1 year ago
|
||
:smaug said, we already looked at this from various angles in the past, so we know there might be potential but we do not really know, how much and if it is worth it. Having some PoC patches might help to find out.
| Assignee | ||
Comment 2•1 year ago
|
||
When a process is about to shutdown, there is no point in spending effort to purge our memory, once it exits it will be returned to the system, anyways. Rather than having explicit different code paths everywhere, we can signal jemalloc via a global flag, that it can stop to bother. Once flipped, jemalloc will not execute any purge, and also realloc will only have effect if the new size is bigger than the old.
For content processes, we flip the flag when shutdown is first notified. This notification also interrupts content JS execution, such that there should be no more significant allocations until shutdown.
For the parent process, this patch chooses the highest risk by associating the flip with entering shutdown phase AppShutdownConfirmed, just to verify the potential impact. FWIW, given we only block purge, we do not expect to see unnecessary allocations of system memory.
There might be room for doing more on huge object free, which currently is unaltered.
Note that in leak checking builds we do not skip anything, of course.
This is a PoC patch and might want some adjustment to integrate better with jemalloc. There might be relevant telemetry already in place to monitor the effect.
Note that the profiler currently gets fooled about memory consumption as we bail out at lower levels and would probably need some adjustment, if we want to go this route.
| Assignee | ||
Comment 3•1 year ago
•
|
||
memory drop and raise. When we destroy a bigger tree of objects, just to re-create another, new tree of other objects, any purge that happens in between could be somehow deferred to happen after the end of that cycle.
As far as this case is concerned, there could be some mechanism that allows to give moz_jemalloc a callback that should be called on purge, something like:
- The consuming process sets a callback
std::function<...> cbRequestPurgefor moz_jemalloc purge
... - Some free happens
- moz_jemalloc notices it should purge, does
setPurgePending()and callscbRequestPurge([arena = this]() {arena->doPendingPurge(); } - the
cbRequestPurgedoes an idle dispatch of the passed callback to the current serial event target - when that event is executed, the real purge happens.
| Assignee | ||
Comment 4•1 year ago
|
||
The basic idea is to hook into jemalloc on purge and defer it to when we are less busy.
This patch is meant to just show that this may move some needles. It also shows that it is a a bit hairy to inject something like this into jemalloc and might not be the best way how to do this at all.
Updated•1 year ago
|
| Assignee | ||
Comment 5•1 year ago
•
|
||
The lazy purge seems to have some positive effect on (not only) sp3 numbers, thus smaug suggested to link this bug also there to keep it on the radar for further investigations.
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
| Assignee | ||
Comment 6•1 year ago
|
||
Bug 1488780 will be an interesting, complementary improvement: Once we can purge without holding an arena's lock, the impact of purging on the main thread while other threads do some work that uses such an arena should be minimal.
| Assignee | ||
Comment 7•1 year ago
|
||
there could be some mechanism that allows to give moz_jemalloc a callback that should be called on purge
We do not need a callback, the updated patch avoids that.
| Assignee | ||
Comment 8•1 year ago
|
||
| Assignee | ||
Comment 9•1 year ago
|
||
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
| Assignee | ||
Comment 10•1 year ago
|
||
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Comment 11•1 year ago
|
||
Does this bug depend on Bug 1913753?
| Assignee | ||
Comment 12•1 year ago
•
|
||
(In reply to Paul Bone [:pbone] from comment #11)
Does this bug depend on Bug 1913753?
I assume only for the order of landing the patches to avoid conflicts, but it's probably good to make it explicit. It definitely depends on bug 1488780 and that is already blocking Bug 1913753.
Updated•1 year ago
|
Comment 13•1 year ago
|
||
Comment on attachment 9408621 [details]
WIP: Bug 1903758 - PoC: Avoid purge in our memory allocator after shutdown started.
Revision D214381 was moved to bug 1920450. Setting attachment 9408621 [details] to obsolete.
| Assignee | ||
Updated•1 year ago
|
Comment 14•1 year ago
|
||
I mentioned this in DMs, but I think you want to eagerly purge in situations like heap minimization, as there's probably tons of random stuff that will break, not just bug 1924148, if we delay things more. It should really be treated as a signal like "please free as much as possible, immediately". It might even come into play with situations where a process is on the verge of OOMing.
| Assignee | ||
Comment 15•1 year ago
•
|
||
(In reply to Andrew McCreight [:mccr8] from comment #14)
I mentioned this in DMs, but I think you want to eagerly purge in situations like heap minimization, as there's probably tons of random stuff that will break, not just bug 1924148, if we delay things more. It should really be treated as a signal like "please free as much as possible, immediately". It might even come into play with situations where a process is on the verge of OOMing.
Yes, the code path that comes through nsMemoryPressureWatcher::Observe should remain intact with lazy purge (and I think the patch keeps it intact but might want some tweak).
| Comment hidden (obsolete) |
| Assignee | ||
Comment 17•1 year ago
•
|
||
The awsy tests indicate that D220616 is mainly responsible for regressing the Base Content Resident Unique Memory measurement.
I tried to exclude that we just need to give the OS a bit more time to update its book-keeping before taking the measurements, but that seems to not change anything, making it likely that we see a real regression.
I think we might see a result of a maybe poor strategy of recycling the memory for (probably small) allocations which causes us fragmentation and keeps resident more pages than would be strictly needed if we were able to pack the memory in an optimal way (which might be hard to impossible if we want to continue to be fast).
I believe during allocation this is mostly driven by arena_t::GetNonFullBinRun, but I am not sure I parse this comment well regarding "lowest in memory" when I look at ArenaRunTreeTrait - are we just trying to say "with the lowest address" here ?
I'd be inclined to spin-off a new bug for improving this and for now to reason about for which platforms we could possibly accept a slight regression in this area that buys us a good performance improvement and for which not.
| Comment hidden (obsolete) |
| Comment hidden (obsolete) |
| Assignee | ||
Comment 20•1 year ago
|
||
(sorry for the noise, wrong bug number though it is supposed to be the same stack)
Comment 21•1 year ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #17)
I believe during allocation this is mostly driven by arena_t::GetNonFullBinRun, but I am not sure I parse this comment well regarding "lowest in memory" when I look at
ArenaRunTreeTrait- are we just trying to say "with the lowest address" here ?
Correct, it's sorted by memory order.
I'd be inclined to spin-off a new bug for improving this and for now to reason about for which platforms we could possibly accept a slight regression in this area that buys us a good performance improvement and for which not.
I have a set of patches that turn the tree of runs into a priority queue. It seems to improve things but it's a complex patch that currently involves adding another kind of base allocator into jemalloc. I should tidy it up and test it again.
| Assignee | ||
Comment 22•1 year ago
|
||
(In reply to Paul Bone [:pbone] from comment #21)
I have a set of patches that turn the tree of runs into a priority queue. It seems to improve things but it's a complex patch that currently involves adding another kind of base allocator into jemalloc. I should tidy it up and test it again.
I went ahead on bug 1924444 and experimented with substituting the tree with a simple O(1) MRU list which seems to pay off quite well (and is a quite simple patch).
| Assignee | ||
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Comment 23•1 year ago
|
||
Comment on attachment 9424062 [details]
Bug 1903758 - Let chunk_dealloc do a lazy pages_unmap. r?glandium,pbone,smaug
Revision D221761 was moved to bug 1935590. Setting attachment 9424062 [details] to obsolete.
| Assignee | ||
Updated•1 year ago
|
| Assignee | ||
Comment 24•1 year ago
|
||
Introduce a list of outstanding purge requests and API calls to enable its use and hooks to actually purge.
Without enabling, the behavior remains unchanged.
Updated•1 year ago
|
| Assignee | ||
Comment 25•1 year ago
|
||
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
| Assignee | ||
Comment 26•1 year ago
|
||
This is meant to increase readability where we used to use EffectiveMaxDirty() directly.
Note that ExtraCommitPages() uses different thresholds than Purge(), for that use we do not change anything.
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
| Assignee | ||
Comment 27•1 year ago
|
||
EffectiveMaxDirty() is called on every free and doing some calculations and reads memory from gArenas. After an arena is created, the value can only change when moz_set_max_dirty_page_modifier is called.
This showed up in profiles while working on Bug 1903758. During a full speedometer 3 run on that machine we used to spend 44ms in EffectiveMaxDirty(), with this patch only 4ms in ShouldStartPurge().
Updated•1 year ago
|
Updated•1 year ago
|
| Assignee | ||
Comment 28•1 year ago
|
||
NOT FOR SUBMIT
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Comment 29•1 year ago
|
||
Comment 30•1 year ago
|
||
| bugherder | ||
https://hg.mozilla.org/mozilla-central/rev/0e771414af93
https://hg.mozilla.org/mozilla-central/rev/c20c7fb738dc
https://hg.mozilla.org/mozilla-central/rev/43a16bf722e2
https://hg.mozilla.org/mozilla-central/rev/38ff70a42938
https://hg.mozilla.org/mozilla-central/rev/0927646a79dd
Comment 31•1 year ago
|
||
Probably improved Jetstream2-Bomb-workers by 14%
Comment 32•1 year ago
|
||
Perfherder has detected a devtools performance change from push 0927646a79dd3149e2e7463d01adb43e6295cba1.
Improvements:
| Ratio | Test | Platform | Options | Absolute values (old vs new) |
|---|---|---|---|---|
| 22% | damp server.protocoljs.DAMP | windows11-64-shippable-qr | e10s fission stylo webrender | 670.12 -> 520.97 |
| 13% | damp server.protocoljs.DAMP | linux1804-64-shippable-qr | e10s fission stylo webrender | 848.89 -> 736.22 |
| 10% | damp browser-toolbox.debugger-ready.DAMP | windows11-64-shippable-qr | e10s fission stylo webrender | 533.99 -> 480.37 |
| 7% | damp custom.inspector.close.DAMP | windows11-64-shippable-qr | e10s fission stylo webrender | 26.64 -> 24.73 |
| 6% | damp simple.jsdebugger.close.DAMP | windows11-64-shippable-qr | e10s fission stylo webrender | 9.41 -> 8.86 |
| ... | ... | ... | ... | ... |
| 2% | damp custom.jsdebugger.stepOut.DAMP | windows11-64-shippable-qr | e10s fission stylo webrender | 451.00 -> 441.63 |
Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests.
If you need the profiling jobs you can trigger them yourself from treeherder job view or ask a sheriff to do that for you.
You can run these tests on try with ./mach try perf --alert 43391
For more information on performance sheriffing please see our FAQ.
| Assignee | ||
Comment 33•1 year ago
|
||
While bug 1944005 tracks a (probably not too important) regression, it also contains:
Improvements:
| Ratio | Test | Platform | Options | Absolute values (old vs new) |
|---|---|---|---|---|
| 5% | speedometer3 TodoMVC-jQuery/DeletingAllItems/total | android-hw-a55-14-0-aarch64-shippable | fission webrender | 120.94 -> 115.24 |
| 4% | speedometer3 TodoMVC-jQuery/total | android-hw-a55-14-0-aarch64-shippable | fission webrender | 469.88 -> 448.78 |
| 4% | speedometer3 TodoMVC-JavaScript-ES6-Webpack-Complex-DOM/Adding100Items/Sync | android-hw-a55-14-0-aarch64-shippable | fission webrender | 56.77 -> 54.45 |
| 4% | speedometer3 TodoMVC-jQuery/CompletingAllItems/Sync | android-hw-a55-14-0-aarch64-shippable | fission webrender | 206.82 -> 199.18 |
| 4% | speedometer3 TodoMVC-jQuery/DeletingAllItems/Sync | android-hw-a55-14-0-aarch64-shippable | fission webrender | 116.13 -> 111.94 |
| ... | ... | ... | ... | ... |
| 2% | speedometer3 Editor-TipTap/total | android-hw-a55-14-0-aarch64-shippable | fission webrender | 185.18 -> 181.17 |
Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests.
Description
•