Use a vector instead of linked list for VirtualRegister ranges
Categories
(Core :: JavaScript Engine: JIT, task, P2)
Tracking
Tracking | Status
---|---
firefox132 | fixed
People
(Reporter: jandem, Assigned: jandem)
References
Details
Attachments
(10 files)
48 bytes, text/x-phabricator-request | Details | Review (×10)
A large part of the performance issue in bug 1916442 is quadratic behavior from updating the linked list of ranges in VirtualRegister. Adding a range does a slow insertion sort.
A better data structure for this is a vector that we sort lazily for operations that need it to be sorted. This lets the list be unsorted during the core register allocation loop. A Vector also has better cache locality and allows binary search when sorted. These changes improve performance a lot: 276 seconds to 14 seconds locally for the Wasm module in that bug after some other changes.
Comment 1 • 2 months ago (Assignee)
FixedList doesn't call destructors for its items. When VirtualRegister has a Vector in later patches, we need to call its destructor to avoid leaking memory.
Comment 2 • 2 months ago (Assignee)
First remove all dead ranges before doing other operations on the ranges. It's easier
to reason about the other loops when they don't change the list of ranges at the same time.
Comment 3 • 2 months ago (Assignee)
Later patches will change the implementation of this iterator without having to
change the consumers.
Comment 4 • 2 months ago (Assignee)
This lets us share some code between the two places that call this.
Comment 5 • 2 months ago (Assignee)
When we use a vector, this is a bit more efficient than removing a single range
at a time.
Comment 6 • 2 months ago (Assignee)
The vector is lazily sorted in order of descending start position for operations that
need it to be sorted.
This lets the list be unsorted during the core register allocation loop. A Vector also
has better cache locality and allows binary search when sorted. These changes improve
performance a lot: 276 seconds to 16 seconds locally for the Wasm module in bug 1916442
after some other changes.
A later patch will update the SMDOC comment.
Comment 7 • 2 months ago (Assignee)
All ranges that start after the position that's passed in definitely can't overlap.
Use binary search to skip these ranges.
This reduces the Ion compilation time for the Wasm module in bug 1916442 from 16 seconds to
14 seconds locally.
The time complexity for rangeFor and how it affects some of the callers still isn't great, but optimizing that will require more work.
Comment 8 • 2 months ago (Assignee)
Now that LiveRange is part of a single linked list instead of two lists, we can
remove its bundleLink and make it a standard linked-list node.
Updated • 2 months ago
Comment 9 • 2 months ago (Assignee)
The virtual register ranges are guaranteed to be sorted after the initial live range
building and grouping. After this phase, they can become unsorted when allocating
bundles to registers.
This patch adds an explicit sortVirtualRegisterRanges step to the allocator to ensure the ranges are sorted after allocating all bundles. This is now the only place where we can call sortRanges.
This makes it easier to reason about the sorting behavior and helps avoid pathological
behavior from future changes.
A VirtualRegister::replaceLastRangeLinear helper function is added to maintain sort order so that we don't have to sort explicitly there.
Because sorting is now explicit, we can remove the activeRangeIterators_ bookkeeping; the possible footgun of sorting while iterating no longer exists.
Comment 10 • 2 months ago (Assignee)
std::sort is not a stable sorting function. To avoid relying on implementation-defined ordering, extend the comparator function to also compare the to position and the bundle id when ranges start at the same position.
Bundles now store an 'id' in non-debug builds too. Each compilation has its own
counter that starts at 0 so that we don't need atomic operations to determine the next
id.
Comment 11 • 2 months ago
Comment 12 • 2 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/79c79897921e
https://hg.mozilla.org/mozilla-central/rev/311340102af3
https://hg.mozilla.org/mozilla-central/rev/eb5f3abe6d1f
https://hg.mozilla.org/mozilla-central/rev/af0ed3a86838
https://hg.mozilla.org/mozilla-central/rev/ed873a1e9156
https://hg.mozilla.org/mozilla-central/rev/3ffe6980ef17
https://hg.mozilla.org/mozilla-central/rev/5658a571443c
https://hg.mozilla.org/mozilla-central/rev/0f2df5df3636
https://hg.mozilla.org/mozilla-central/rev/56e5df0ee5bf
https://hg.mozilla.org/mozilla-central/rev/4c0ff51c2ce6
Comment 13 • 2 months ago
Improvement:
30% on AWFY-Jetstream2-HashSet-wasm-Runtime
24% on AWFY-Jetstream2-float-mm.c-Average on MacOS only
3%-6% on AWFY-webassembly-godot-* numbers
Regression:
10% on wasm-misc-optimizing-fannkuch
1.6% on wasm-misc-optimizing-compile. But this is effectively a "return to baseline" after the 1.6% improvement from bug 1917817.
Comment 14 • 2 months ago (Assignee)
(In reply to Mayank Bansal from comment #13)
Improvement:
30% on AWFY-Jetstream2-HashSet-wasm-Runtime
For posterity: I looked into this and it's because HashSet.wasm has a function with a ton of MIR/LIR call instructions. The splitting for this results in many tiny live ranges for a single virtual register that's live for the whole function. The slow part is connecting these ranges by inserting moves.
That loop is currently quadratic. It was optimized somewhat in this bug by using binary search to find the start position, and bug 1920951 will make it linear, so there will be another speedup.
Local Ion compile times for HashSet.wasm:
before this bug: 2800 ms
after this bug: 690 ms
after other changes like bug 1920951: 179 ms
Comment 15 • 2 months ago (Assignee)
NI myself to look at wasm-misc-optimizing-fannkuch to see if something changed there.
Comment 16 • 2 months ago (Assignee)
(In reply to Mayank Bansal from comment #13)
Regression:
10% on wasm-misc-optimizing-fannkuch
I looked into this. The code we generate for this test is very similar before/after. After later changes (probably bug 1920951) it's clearly better now because we have fewer moves.
AFAICT this only affects MacOS, and the test is bimodal if you zoom out, but we get the higher number more often now. The last two data points have been better (after bug 1920951 landed), but that could be a coincidence.