Closed Bug 1924444 Opened 1 year ago Closed 1 year ago

Improve the order with which we (re-)use non-full runs during allocations

Categories

(Core :: Memory Allocator, enhancement)

Tracking

RESOLVED FIXED
134 Branch
Tracking Status
firefox134 --- fixed

People

(Reporter: jstutte, Assigned: jstutte)

References

(Blocks 1 open bug, Regressed 1 open bug)

Details

(Keywords: perf-alert, Whiteboard: [sp3])

Attachments

(1 file)

Edit: See comment 4.

Bug 1903758 shows a significant regression in resident memory usage. It seems that re-using memory more often can result in a higher number of only partly-used (non-full) runs for small allocations.

In particular, our current policy for re-use is based on a run's position in memory, which leaves room for improvement.

We assume that allocations for objects of the same size often happen in bursts, as do deallocations, as a consequence of object trees or other structures being built up and later torn down all together.
In such situations it would help to keep allocations from the same burst in the same runs as much as possible, making it more likely that entire runs become free when the corresponding burst of frees occurs.

But content doesn't randomize small allocations right? So randomization can't be the culprit?

(In reply to Emilio Cobos Álvarez (:emilio) from comment #1)

But content doesn't randomize small allocations right? So randomization can't be the culprit?

Thanks, I was not aware of that. It might even just be that more re-use inevitably causes more sparse free regions, though I think/hope we can improve something here. Edited the description.

The current order of re-using regions from non-full runs relies on the run's position in memory.

In general we assume that allocations for objects of the same size often happen in bursts, as do deallocations, as a consequence of object trees or other structures being built up and later torn down all together or in larger portions.
In such situations it helps to keep allocations from the same burst in the same runs as much as possible, making it more likely that entire runs become free when the corresponding burst of frees occurs. Keeping objects close together also helps to avoid cache misses during use.

We want to use an MRF ("most-recently-freed") policy to refill non-full runs.

There are two reasons for this:

  1. It is cheap: all of our non-full-runs book-keeping is O(1); no tree balancing or walking is needed.
  2. It hopefully makes it more likely that allocations from the same unit of code end up together in the same run, as they will most likely be freed and/or used at the same time.

The policy is as follows (a sketch in code follows the list):
a. A newly allocated run becomes the head. This happens only when the list was empty.
b. If a run becomes entirely full again, it is removed from the list and the next MRF run in the list becomes the head.
c. If a run becomes entirely free, it is also removed from the list (and from the bin, entering the tree of recyclable runs).
d. Each time we free something from a run, we move it to the head of the list, if possible and needed. This keeps the entire list in "most recently freed" order.
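
As a rough illustration, here is a minimal sketch of these operations as a doubly-linked list. The Run and Bin types and the member names are simplified stand-ins (the patch's bin member is called mNotFullRuns; mNotFullRunsHead here is a hypothetical name), not the actual mozjemalloc structures:

    // Minimal sketch of the MRF list; simplified stand-in types, not the
    // actual mozjemalloc structures.
    struct Run {
      Run* mPrev = nullptr;
      Run* mNext = nullptr;
      // ... per-run metadata (free-region bitmap, etc.) elided ...
    };

    struct Bin {
      Run* mNotFullRunsHead = nullptr;  // hypothetical name for this sketch

      void PushHead(Run* aRun) {
        aRun->mPrev = nullptr;
        aRun->mNext = mNotFullRunsHead;
        if (mNotFullRunsHead) {
          mNotFullRunsHead->mPrev = aRun;
        }
        mNotFullRunsHead = aRun;
      }

      void Remove(Run* aRun) {
        if (aRun->mPrev) {
          aRun->mPrev->mNext = aRun->mNext;
        }
        if (aRun->mNext) {
          aRun->mNext->mPrev = aRun->mPrev;
        }
        if (mNotFullRunsHead == aRun) {
          mNotFullRunsHead = aRun->mNext;  // (b): the next MRF run becomes the head
        }
        aRun->mPrev = aRun->mNext = nullptr;
      }

      // (a) A newly allocated run becomes the head (the list was empty).
      void OnNewRun(Run* aRun) { PushHead(aRun); }

      // (b)/(c) A run that became entirely full again, or entirely free,
      // leaves the list.
      void OnRunFullOrEmpty(Run* aRun) { Remove(aRun); }

      // (d) Every free moves the run to the head, keeping the list in
      // most-recently-freed order. All operations are O(1).
      void OnFree(Run* aRun, bool aRunWasFull) {
        if (aRunWasFull) {
          PushHead(aRun);  // full -> non-full: the run (re-)enters the list
        } else if (mNotFullRunsHead != aRun) {
          Remove(aRun);
          PushHead(aRun);
        }
      }
    };

Allocation then always refills from the head of the list, i.e. the most recently freed run.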

Attachment #9430830 - Attachment description: WIP: Bug 1924444 - Use an MRU policy for NotFull runs to keep allocations as local as possible. → WIP: Bug 1924444 - Use an MRF policy for NotFull runs to keep allocations as local as possible.
Assignee: nobody → jstutte
Attachment #9430830 - Attachment description: WIP: Bug 1924444 - Use an MRF policy for NotFull runs to keep allocations as local as possible. → Bug 1924444 - Use an MRF policy for NotFull runs to keep allocations as local as possible. r?smaug,glandium,pbone
Status: NEW → ASSIGNED
No longer depends on: 1903758
See Also: → 1903758

This patch is very impactful for performance in general, but it does not help with the resident memory regression we see in bug 1903758 (it is neutral in that regard).

For landing purposes it depends on bug 1913753.

Depends on: 1913753
See Also: → 1925744
Whiteboard: [sp3]
Attachment #9430830 - Attachment description: Bug 1924444 - Use an MRF policy for NotFull runs to keep allocations as local as possible. r?smaug,glandium,pbone → Bug 1924444 - Use a linked list with LIFO policy for a bin's mNotFullRuns to reduce overhead and foster locality. r?smaug,glandium,pbone

Here are some benchmark results for this patch and a few variations. We were communicating on Slack, but I want to post them here for posterity.

I'm using 5 different tests: sp3 on Windows and Android, and three logalloc-replay tests which measure single-threaded performance of the allocator by replaying a stream of allocations. They're deterministic, but without the actual use of the memory, so access patterns either don't exist or are different. The "no touch" version does not use the allocated memory at all. The "touch" version performs a write to the memory as it is allocated but does not access it later. Finally, I recorded RSS when replaying these allocations.
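
For clarity, here is a hand-rolled sketch of what "touch" vs "no touch" means for such a replay loop; the Event type and Replay() function are hypothetical, not the actual logalloc-replay tool:

    // Hand-rolled sketch of "touch" vs "no touch" replay; hypothetical
    // Event/Replay, not the actual logalloc-replay tool.
    #include <cstdlib>
    #include <cstring>
    #include <vector>

    struct Event {
      bool mIsAlloc;   // true: allocate mSize bytes; false: free slot mIndex
      size_t mSize;
      size_t mIndex;   // slot identifying which live allocation to free
    };

    void Replay(const std::vector<Event>& aEvents, bool aTouch) {
      std::vector<void*> slots(aEvents.size(), nullptr);
      for (const Event& e : aEvents) {
        if (e.mIsAlloc) {
          void* p = std::malloc(e.mSize);
          if (aTouch && p) {
            // "touch": write the memory once at allocation time,
            // never access it again later.
            std::memset(p, 0xab, e.mSize);
          }
          slots[e.mIndex] = p;
        } else {
          std::free(slots[e.mIndex]);
          slots[e.mIndex] = nullptr;
        }
      }
    }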

Before
replay no touch: 3.998 s ± 0.058 s
replay touch: 4.421 s ± 0.104 s
RSS: 169967616
sp3 windows: 20.87 +- 0.99%
sp3 android: 6.6 +- 0.94%

Jens's MRF version:
replay no touch: 4.001 s ± 0.077 s
replay touch: 4.368 s ± 0.033 s
RSS: 169664512
sp3 windows: 21.26 +- 0.72%
sp3 android: 6.78 +- 0.71%

Jens's LIFO version:
replay no touch: 3.959 s ± 0.051 s
replay touch: 4.334 s ± 0.093 s
RSS: 170483712
sp3 windows: 21.43 +- 0.48%
sp3 android: 6.8 +- 2.96%

MRF + re-introducing mCurrentRun and dropping the tail pointer of the linked list:
replay no touch: 4.001 s ± 0.073 s
replay touch: 4.386 s ± 0.066 s
RSS: 169345024
sp3 windows: 21.24 +- 0.63%
sp3 android: 6.73 +- 0.97%

The above plus moving the list linkage back to the chunk map header:
replay no touch: 3.986 s ± 0.030 s
replay touch: 4.377 s ± 0.066 s
RSS: 169250816
sp3 windows: 21.33 +- 0.79%
sp3 android: 6.76 +- 0.46%

Jens's LIFO version is the winner for performance: it is better on sp3, "replay touch" and "replay no touch". The latter tells us that the simpler list management during free() is a clear advantage (this is a hot path). It may also have better cache locality, but we can't be sure from these results alone. However, it also leads to more fragmentation (RSS is higher).
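
To make the "simpler list management during free()" point concrete, here is a hypothetical side-by-side of the two free paths, reusing the simplified Run/Bin types from the sketch in comment 4 (not the landed code):

    // MRF: unlink + relink on *every* free into an already-listed run.
    void OnFreeMRF(Bin& aBin, Run* aRun, bool aRunWasFull) {
      if (aRunWasFull) {
        aBin.PushHead(aRun);  // full -> non-full: enter the list
      } else if (aBin.mNotFullRunsHead != aRun) {
        aBin.Remove(aRun);
        aBin.PushHead(aRun);
      }
    }

    // LIFO: the only list operation is on the full -> non-full transition.
    void OnFreeLIFO(Bin& aBin, Run* aRun, bool aRunWasFull) {
      if (aRunWasFull) {
        aBin.PushHead(aRun);
      }
      // Runs already in the list stay put: cheaper per free() (a hot path),
      // but the reuse order tracks "most recently became non-full" rather
      // than "most recently freed", which can cost some fragmentation
      // (higher RSS).
    }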

It looks like adding mCurrentRun back is probably not an advantage, but I'm not a good enough statistician to say so. There may be other reasons that come up later to bring back mCurrentRun, and we could test it on the LIFO version too, just in case. That said, it improves fragmentation, at least for this workload.

Moving the list links back into the chunk map does seem to be worth it; I'll port it onto the LIFO version and test it again there.

Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/9de0c1db3489 Use a linked list with LIFO policy for a bin's mNotFullRuns to reduce overhead and foster locality. r=smaug,pbone
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 134 Branch

Highlight

1.6% overall improvement on Windows SP3. But the benchmark is noisy, so there will be some deviation from this.

(In reply to Sandor Molnar[:smolnar] from comment #7)

https://hg.mozilla.org/mozilla-central/rev/9de0c1db3489

Perfherder has detected a browsertime performance change from push 9de0c1db34892663017e0fa3f271e3c97841582e.

Improvements:

Ratio | Test | Platform | Options | Absolute values (old vs new) | Performance Profiles
5% | speedometer3 TodoMVC-WebComponents/DeletingAllItems/total | windows11-64-shippable-qr | fission webrender | 5.22 -> 4.97 | Before/After
4% | speedometer3 TodoMVC-jQuery/Adding100Items/Sync | windows11-64-shippable-qr | fission webrender | 39.03 -> 37.56 | Before/After
4% | speedometer3 TodoMVC-jQuery/Adding100Items/total | windows11-64-shippable-qr | fission webrender | 41.11 -> 39.59 | Before/After
3% | speedometer3 NewsSite-Nuxt/NavigateToUS/total | windows11-64-shippable-qr | fission webrender | 40.24 -> 38.93 | Before/After
3% | speedometer3 TodoMVC-Vue/total | windows11-64-shippable-qr | fission webrender | 28.35 -> 27.56 | Before/After
... | ... | ... | ... | ... | ...
2% | speedometer3 total | windows11-64-shippable-qr | fission webrender | 1,159.29 -> 1,135.07 | Before/After

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests.

If you need the profiling jobs you can trigger them yourself from treeherder job view or ask a sheriff to do that for you.

You can run these tests on try with ./mach try perf --alert 42608

For more information on performance sheriffing please see our FAQ.

Keywords: perf-alert
Regressions: 1931952

(In reply to Pulsebot from comment #6)

Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9de0c1db3489
Use a linked list with LIFO policy for a bin's mNotFullRuns to reduce
overhead and foster locality. r=smaug,pbone

Perfherder has detected a talos performance change from push 9de0c1db34892663017e0fa3f271e3c97841582e.

Improvements:

Ratio | Test | Platform | Options | Absolute values (old vs new)
12% | tp5o_scroll | macosx1015-64-shippable-qr | e10s fission stylo webrender-sw | 1.43 -> 1.26
6% | perf_reftest_singletons bloom-basic-2.html | linux1804-64-shippable-qr | e10s fission stylo webrender | 54.84 -> 51.62
5% | cross_origin_pageload | linux1804-64-shippable-qr | e10s fission stylo webrender-sw | 224.92 -> 213.39
4% | tscrollx | macosx1015-64-shippable-qr | e10s fission stylo webrender-sw | 0.68 -> 0.65
3% | pdfpaint bug1865341.pdf | macosx1015-64-shippable-qr | e10s fission stylo webrender | 201.46 -> 194.62
... | ... | ... | ... | ...
2% | pdfpaint issue16843.pdf | windows11-64-shippable-qr | e10s fission stylo webrender | 200.72 -> 196.58

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests.

If you need the profiling jobs you can trigger them yourself from treeherder job view or ask a sheriff to do that for you.

You can run these tests on try with ./mach try perf --alert 42699

For more information on performance sheriffing please see our FAQ.

Regressions: 1941085