Use a priority queue for `mNonFullRuns` rather than red-black tree
Categories
(Core :: Memory Allocator, enhancement, P1)
People
(Reporter: pbone, Assigned: pbone)
References
(Blocks 1 open bug)
Details
(Whiteboard: [sp3])
The mNonFullRuns lookup doesn't need a tree. Technically it doesn't need any sorting, since any "non full run" will satisfy a memory request, but some structures may lead to better locality and less fragmentation. I have started to investigate this.
Further background: the current implementation is an RBTree sorted by memory address, so it chooses the run with the lowest address; see this comment: https://searchfox.org/mozilla-central/source/memory/build/mozjemalloc.cpp#1012-1016. This is a claim that needs some verification. Because it is a red-black tree, choosing the first run from the tree here: https://searchfox.org/mozilla-central/source/memory/build/mozjemalloc.cpp#3083-3087 means first searching the tree (following left nodes, O(log N)); then, a few lines later, Remove() rebalances the tree (another O(log N), and more complex). Both operations show up in profiles.
Note that I don't propose changing any of the other red-black trees in mozjemalloc; the others are used as more key-value-like stores where we're searching for a specific entry, rather than the "lowest" entry or "any" entry.
I tested the following:
Choose root
Without changing the RBTree structure, change this code to pick the entry at the "root" of the tree rather than the First(). This eliminates the first iteration but still requires rebalancing the tree:
Windows sp3: 0.81% better, 5.33 confidence.
Windows resident memory base: 0.62% worse, 6.22 confidence.
Windows resident memory tp6: 0.99% worse, 0.87 confidence.
While this is faster, it does use more memory, which validates the claim in the code comment above that choosing the entry with the lowest address reduces fragmentation. It also confirms this is worth investigating further.
Linked list
Replacing the tree with a linked list. This imposes no ordering other than a FIFO or LIFO policy.
Windows sp3: 1.54% better, 7.32 confidence.
Windows resident memory base: 0.02% worse, 0.24 confidence
Windows resident memory tp6: 1.23% worse, 1.3 confidence
Nice performance improvement. I don't remember whether I chose a LIFO or FIFO policy.
Priority queue
Replacing the tree with a priority queue implemented in an array (a binary heap).
Windows sp3: 1.42% better, 10.41 confidence
Windows resident memory base: 1.62% worse, 12.61 confidence
Windows resident memory tp6: 0.18% better, 0.19 confidence
Most of the performance of a linked list. Worse base memory usage (probably because of the array?) but better tp6 memory usage.
This is also pretty awkward, because it means adding another memory allocator within jemalloc to handle realloc() for the array, one that isn't reentrant (into jemalloc). There's more to make sure we "get right" with this patch.
What I didn't yet test:
Implementing the priority queue as a tree rather than an array.
My intuition is that it will be a bit slower than the array-based implementation but may have a similar memory footprint. Part of the challenge is to make this fit into the same fields used by a mozjemalloc run: adding a union and extra fields to that structure mustn't make it bigger, since a run is already (trying to be) a multiple of the machine's cache line size.
Sorting the structure differently, e.g. by the number of free cells in a run.
Choosing the run with the smallest number of free cells should lead to better memory usage, but it also means switching runs more frequently, making this code slower, since we'd exhaust the current run more often. (The benchmarking above that changes this policy doesn't take this into account.)
Choosing the run by number of free cells is also difficult because the count can change while a run is in the queue; we'd either need to tolerate the queue not always being sorted, or potentially re-sort parts of the queue when a run's free-cell count changes.
See also
See also Bug 1843998, where putting a lower bound on the number of cells per run can also improve the performance of this lookup.