Closed Bug 1924444 Opened 1 year ago Closed 1 year ago

Improve the order with which we (re-)use non-full runs during allocations

Categories

(Core :: Memory Allocator, enhancement)

Tracking

RESOLVED FIXED
134 Branch
Tracking Status
firefox134 --- fixed

People

(Reporter: jstutte, Assigned: jstutte)

References

(Blocks 1 open bug, Regressed 1 open bug)

Details

(Keywords: perf-alert, Whiteboard: [sp3])

Attachments

(1 file)

Edit: See comment 4.

Bug 1903758 shows a significant regression in resident memory usage. It seems that re-using memory more often can result in a higher number of only partly-used (non-full) runs for small allocations.

In particular, our current policy for re-use is based on a run's position in memory, which leaves room for improvement.

We assume that allocations for objects of the same size often happen in bursts, as do deallocations, as a consequence of object trees or other structures being built up and later torn down all together.
In such situations it would help to keep allocations from the same burst in the same runs as much as possible, making it more likely that entire runs become free when the corresponding burst of frees occurs.

But content doesn't randomize small allocations right? So randomization can't be the culprit?

(In reply to Emilio Cobos Álvarez (:emilio) from comment #1)

But content doesn't randomize small allocations right? So randomization can't be the culprit?

Thanks, I was not aware of that. It might even just be that more re-use inevitably causes more sparse free regions, though I think/hope we can improve something here. Edited the description.

The current order of re-using regions from non-full runs relies on the run's position in memory.

In general we assume that allocations for objects of the same size often happen in bursts, as do deallocations, as a consequence of object trees or other structures being built up and later torn down all together or in larger portions.
In such situations it helps to keep allocations from the same burst in the same runs as much as possible, making it more likely that entire runs become free when the corresponding burst of frees occurs. Keeping objects close together also helps to avoid cache misses during use.

We want to use an MRF ("most-recently-freed") policy to refill non-full runs.

There are two reasons for this:

  1. It is cheap: all of our non-full-runs book-keeping is O(1); no tree balancing or walking is needed.
  2. It hopefully makes it more likely that allocations from the same unit of code end up together in the same run, as they will most likely be freed and/or used at the same time.

The policy is as follows (a sketch in code follows the list):
a. A newly allocated run becomes the head. This happens only when the list was empty.
b. If a run becomes entirely full again, it is removed from the list and the next MRF run in the list becomes the head.
c. If a run becomes entirely free, it is also removed from the list (and from the bin, entering the tree of recyclable runs).
d. Each time we free something from a run, we move it to the head of the list, if possible and needed. This keeps the entire list in "most recently freed" order.
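
As a rough illustration, here is a minimal sketch of these operations as a doubly-linked list. The Run and Bin types and the member names are simplified stand-ins (the patch's bin member is called mNotFullRuns; mNotFullRunsHead here is a hypothetical name), not the actual mozjemalloc structures:

    // Minimal sketch of the MRF list; simplified stand-in types, not the
    // actual mozjemalloc structures.
    struct Run {
      Run* mPrev = nullptr;
      Run* mNext = nullptr;
      // ... per-run metadata (free-region bitmap, etc.) elided ...
    };

    struct Bin {
      Run* mNotFullRunsHead = nullptr;  // hypothetical name for this sketch

      void PushHead(Run* aRun) {
        aRun->mPrev = nullptr;
        aRun->mNext = mNotFullRunsHead;
        if (mNotFullRunsHead) {
          mNotFullRunsHead->mPrev = aRun;
        }
        mNotFullRunsHead = aRun;
      }

      void Remove(Run* aRun) {
        if (aRun->mPrev) {
          aRun->mPrev->mNext = aRun->mNext;
        }
        if (aRun->mNext) {
          aRun->mNext->mPrev = aRun->mPrev;
        }
        if (mNotFullRunsHead == aRun) {
          mNotFullRunsHead = aRun->mNext;  // (b): the next MRF run becomes the head
        }
        aRun->mPrev = aRun->mNext = nullptr;
      }

      // (a) A newly allocated run becomes the head (the list was empty).
      void OnNewRun(Run* aRun) { PushHead(aRun); }

      // (b)/(c) A run that became entirely full again, or entirely free,
      // leaves the list.
      void OnRunFullOrEmpty(Run* aRun) { Remove(aRun); }

      // (d) Every free moves the run to the head, keeping the list in
      // most-recently-freed order. All operations are O(1).
      void OnFree(Run* aRun, bool aRunWasFull) {
        if (aRunWasFull) {
          PushHead(aRun);  // full -> non-full: the run (re-)enters the list
        } else if (mNotFullRunsHead != aRun) {
          Remove(aRun);
          PushHead(aRun);
        }
      }
    };

Allocation then always refills from the head of the list, i.e. the most recently freed run.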

Attachment #9430830 - Attachment description: WIP: Bug 1924444 - Use an MRU policy for NotFull runs to keep allocations as local as possible. → WIP: Bug 1924444 - Use an MRF policy for NotFull runs to keep allocations as local as possible.
Assignee: nobody → jstutte
Attachment #9430830 - Attachment description: WIP: Bug 1924444 - Use an MRF policy for NotFull runs to keep allocations as local as possible. → Bug 1924444 - Use an MRF policy for NotFull runs to keep allocations as local as possible. r?smaug,glandium,pbone
Status: NEW → ASSIGNED
No longer depends on: 1903758
See Also: → 1903758

This patch is very impactful for performance in general, but it does not help with the resident memory regression we see in bug 1903758 (it is neutral in that regard).

For landing purposes it depends on bug 1913753.

Depends on: 1913753
See Also: → 1925744
Whiteboard: [sp3]
Attachment #9430830 - Attachment description: Bug 1924444 - Use an MRF policy for NotFull runs to keep allocations as local as possible. r?smaug,glandium,pbone → Bug 1924444 - Use a linked list with LIFO policy for a bin's mNotFullRuns to reduce overhead and foster locality. r?smaug,glandium,pbone

Here are some benchmark results for this patch and a few variations. We were communicating on Slack, but I want to post them here for posterity.

I'm using 5 different tests: sp3 on Windows and Android, and three logalloc-replay tests which measure single-threaded performance of the allocator by replaying a stream of allocations. They're deterministic, but without the actual use of the memory, so access patterns either don't exist or are different. The "no touch" version does not use the allocated memory at all. The "touch" version performs a write to the memory as it is allocated but does not access it later. Finally, I recorded RSS when replaying these allocations.
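
For clarity, here is a hand-rolled sketch of what "touch" vs "no touch" means for such a replay loop; the Event type and Replay() function are hypothetical, not the actual logalloc-replay tool:

    // Hand-rolled sketch of "touch" vs "no touch" replay; hypothetical
    // Event/Replay, not the actual logalloc-replay tool.
    #include <cstdlib>
    #include <cstring>
    #include <vector>

    struct Event {
      bool mIsAlloc;   // true: allocate mSize bytes; false: free slot mIndex
      size_t mSize;
      size_t mIndex;   // slot identifying which live allocation to free
    };

    void Replay(const std::vector<Event>& aEvents, bool aTouch) {
      std::vector<void*> slots(aEvents.size(), nullptr);
      for (const Event& e : aEvents) {
        if (e.mIsAlloc) {
          void* p = std::malloc(e.mSize);
          if (aTouch && p) {
            // "touch": write the memory once at allocation time,
            // never access it again later.
            std::memset(p, 0xab, e.mSize);
          }
          slots[e.mIndex] = p;
        } else {
          std::free(slots[e.mIndex]);
          slots[e.mIndex] = nullptr;
        }
      }
    }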

Before
replay no touch: 3.998 s ± 0.058 s
replay touch: 4.421 s ± 0.104 s
RSS: 169967616
sp3 windows: 20.87 +- 0.99%
sp3 android: 6.6 +- 0.94%

Jens's MRF version:
replay no touch: 4.001 s ± 0.077 s
replay touch: 4.368 s ± 0.033 s
RSS: 169664512
sp3 windows: 21.26 +- 0.72%
sp3 android: 6.78 +- 0.71%

Jens's LIFO version:
replay no touch: 3.959 s ± 0.051 s
replay touch: 4.334 s ± 0.093 s
RSS: 170483712
sp3 windows: 21.43 +- 0.48%
sp3 android: 6.8 +- 2.96%

MRF + re-introducing mCurrentRun and dropping the tail pointer of the linked list:
replay no touch: 4.001 s ± 0.073 s
replay touch: 4.386 s ± 0.066 s
RSS: 169345024
sp3 windows: 21.24 +- 0.63%
sp3 android: 6.73 +- 0.97%

The above plus moving the list linkage back to the chunk map header:
replay no touch: 3.986 s ± 0.030 s
replay touch: 4.377 s ± 0.066 s
RSS: 169250816
sp3 windows: 21.33 +- 0.79%
sp3 android: 6.76 +- 0.46%

Jens's LIFO version is the winner for performance: it is better on sp3, "replay touch" and "replay no touch". The latter tells us that the simpler list management during free() is a clear advantage (this is a hot path). It may also have better cache locality, but we can't be sure from these results alone. However, it also leads to more fragmentation (RSS is higher).
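
To make the "simpler list management during free()" point concrete, here is a hypothetical side-by-side of the two free paths, reusing the simplified Run/Bin types from the sketch in comment 4 (not the landed code):

    // MRF: unlink + relink on *every* free into an already-listed run.
    void OnFreeMRF(Bin& aBin, Run* aRun, bool aRunWasFull) {
      if (aRunWasFull) {
        aBin.PushHead(aRun);  // full -> non-full: enter the list
      } else if (aBin.mNotFullRunsHead != aRun) {
        aBin.Remove(aRun);
        aBin.PushHead(aRun);
      }
    }

    // LIFO: the only list operation is on the full -> non-full transition.
    void OnFreeLIFO(Bin& aBin, Run* aRun, bool aRunWasFull) {
      if (aRunWasFull) {
        aBin.PushHead(aRun);
      }
      // Runs already in the list stay put: cheaper per free() (a hot path),
      // but the reuse order tracks "most recently became non-full" rather
      // than "most recently freed", which can cost some fragmentation
      // (higher RSS).
    }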

It looks like adding mCurrentRun back is probably not an advantage, but I'm not a good enough statistician to say so. There may be other reasons that come up later to bring back mCurrentRun, and we could test it on the LIFO version too, just in case. That said, it improves fragmentation, at least for this workload.

Moving the list links back into the chunk map does seem to be worth it; I'll port it onto the LIFO version and test it again there.

Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/9de0c1db3489 Use a linked list with LIFO policy for a bin's mNotFullRuns to reduce overhead and foster locality. r=smaug,pbone
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 134 Branch

Highlight

1.6% overall improvement on Windows SP3. But the benchmark is noisy, so there will be some deviation from this.

(In reply to Sandor Molnar[:smolnar] from comment #7)

https://hg.mozilla.org/mozilla-central/rev/9de0c1db3489

Perfherder has detected a browsertime performance change from push 9de0c1db34892663017e0fa3f271e3c97841582e.

Improvements:

Ratio | Test | Platform | Options | Absolute values (old vs new) | Performance Profiles
5% | speedometer3 TodoMVC-WebComponents/DeletingAllItems/total | windows11-64-shippable-qr | fission webrender | 5.22 -> 4.97 | Before/After
4% | speedometer3 TodoMVC-jQuery/Adding100Items/Sync | windows11-64-shippable-qr | fission webrender | 39.03 -> 37.56 | Before/After
4% | speedometer3 TodoMVC-jQuery/Adding100Items/total | windows11-64-shippable-qr | fission webrender | 41.11 -> 39.59 | Before/After
3% | speedometer3 NewsSite-Nuxt/NavigateToUS/total | windows11-64-shippable-qr | fission webrender | 40.24 -> 38.93 | Before/After
3% | speedometer3 TodoMVC-Vue/total | windows11-64-shippable-qr | fission webrender | 28.35 -> 27.56 | Before/After
... | ... | ... | ... | ... | ...
2% | speedometer3 total | windows11-64-shippable-qr | fission webrender | 1,159.29 -> 1,135.07 | Before/After

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests.

If you need the profiling jobs you can trigger them yourself from treeherder job view or ask a sheriff to do that for you.

You can run these tests on try with ./mach try perf --alert 42608

For more information on performance sheriffing please see our FAQ.

Keywords: perf-alert
Regressions: 1931952

(In reply to Pulsebot from comment #6)

Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9de0c1db3489
Use a linked list with LIFO policy for a bin's mNotFullRuns to reduce
overhead and foster locality. r=smaug,pbone

Perfherder has detected a talos performance change from push 9de0c1db34892663017e0fa3f271e3c97841582e.

Improvements:

Ratio | Test | Platform | Options | Absolute values (old vs new)
12% | tp5o_scroll | macosx1015-64-shippable-qr | e10s fission stylo webrender-sw | 1.43 -> 1.26
6% | perf_reftest_singletons bloom-basic-2.html | linux1804-64-shippable-qr | e10s fission stylo webrender | 54.84 -> 51.62
5% | cross_origin_pageload | linux1804-64-shippable-qr | e10s fission stylo webrender-sw | 224.92 -> 213.39
4% | tscrollx | macosx1015-64-shippable-qr | e10s fission stylo webrender-sw | 0.68 -> 0.65
3% | pdfpaint bug1865341.pdf | macosx1015-64-shippable-qr | e10s fission stylo webrender | 201.46 -> 194.62
... | ... | ... | ... | ...
2% | pdfpaint issue16843.pdf | windows11-64-shippable-qr | e10s fission stylo webrender | 200.72 -> 196.58

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests.

If you need the profiling jobs you can trigger them yourself from treeherder job view or ask a sheriff to do that for you.

You can run these tests on try with ./mach try perf --alert 42699

For more information on performance sheriffing please see our FAQ.

Regressions: 1941085