Closed Bug 1918970 Opened 2 months ago Closed 2 months ago

Use a vector instead of linked list for VirtualRegister ranges

Categories

(Core :: JavaScript Engine: JIT, task, P2)

task

Tracking


RESOLVED FIXED
132 Branch
Tracking Status
firefox132 --- fixed

People

(Reporter: jandem, Assigned: jandem)

References

Details

Attachments

(10 files)


A large part of the performance issue in bug 1916442 is quadratic behavior from updating the linked list of ranges in VirtualRegister. Adding a range does a slow insertion sort.

A better data structure for this is a vector that we sort lazily for operations that need it to be sorted. This lets the list be unsorted during the core register allocation loop. A Vector also has better cache locality and allows binary search when sorted. These changes improve performance a lot: 276 seconds to 14 seconds locally for the Wasm module in that bug after some other changes.
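The lazily-sorted vector described above can be sketched as follows. This is an illustrative model, not the actual SpiderMonkey code: the type and member names (`RangeVector`, `ensureSorted`, `sorted_`) are hypothetical. The key idea is that appends during the core allocation loop just set a dirty flag, and sorting by descending start position happens only when a consumer actually needs sorted order.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the lazily-sorted approach. Ranges are appended
// unsorted during the core allocation loop; callers that need sorted order
// call ensureSorted() first, which sorts by descending start position.
struct Range {
  uint32_t from;  // start position
  uint32_t to;    // end position
};

class RangeVector {
  std::vector<Range> ranges_;
  bool sorted_ = true;

 public:
  void append(Range r) {
    ranges_.push_back(r);
    sorted_ = false;  // defer sorting until someone needs it
  }
  void ensureSorted() {
    if (sorted_) {
      return;
    }
    std::sort(ranges_.begin(), ranges_.end(),
              [](const Range& a, const Range& b) { return a.from > b.from; });
    sorted_ = true;
  }
  const Range& operator[](size_t i) const { return ranges_[i]; }
  size_t length() const { return ranges_.size(); }
};
```

Compared to a sorted linked list, each append is amortized O(1) instead of an O(n) insertion-sort step, which is where the quadratic behavior came from.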

FixedList doesn't call destructors for its items. When VirtualRegister gains a Vector in later patches, we need to call its destructor to avoid leaking memory.
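The destructor issue can be shown with a small generic sketch (the `Tracked` type here is illustrative, not a SpiderMonkey type): a container that hands out raw storage and never runs element destructors forces the owner to destroy elements explicitly, which is exactly the step needed once the element owns heap memory.

```cpp
#include <new>

// Illustrative sketch: FixedList-style containers construct elements in raw
// storage and skip element destructors, so elements with non-trivial
// destructors must be destroyed explicitly by the owner or their resources
// leak.
struct Tracked {
  static int liveCount;  // counts constructed-but-not-destroyed objects
  Tracked() { ++liveCount; }
  ~Tracked() { --liveCount; }
};
int Tracked::liveCount = 0;

// Construct in caller-provided raw storage, then destroy explicitly -- the
// step the patch adds for VirtualRegister once it holds a Vector member.
void constructAndDestroy(unsigned char* storage) {
  Tracked* t = new (storage) Tracked();
  t->~Tracked();  // omitting this would leave liveCount at 1 (a leak)
}
```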

First remove all dead ranges before doing other operations on the ranges. It's easier
to reason about the other loops when they don't change the list of ranges at the same time.

Later patches will change the implementation of this iterator without having to
change the consumers.

This lets the two places calling it share some code.

When we use a vector, this is a bit more efficient than removing a single range
at a time.
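Why bulk removal is cheaper with a vector can be sketched with the standard erase-remove idiom (field names here are hypothetical): erasing one element at a time is O(n) per erase because everything after it shifts, while one erase-remove pass drops all of a bundle's ranges in O(n) total.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative sketch of removing all ranges belonging to one bundle in a
// single pass over the vector, instead of one O(n) erase per range.
struct RangeEntry {
  uint32_t bundleId;
  uint32_t from;
  uint32_t to;
};

void removeRangesForBundle(std::vector<RangeEntry>& ranges, uint32_t bundleId) {
  ranges.erase(std::remove_if(ranges.begin(), ranges.end(),
                              [bundleId](const RangeEntry& r) {
                                return r.bundleId == bundleId;
                              }),
               ranges.end());
}
```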

The vector is lazily sorted in order of descending start position for operations that
need it to be sorted.

This lets the list be unsorted during the core register allocation loop. A Vector also
has better cache locality and allows binary search when sorted. These changes improve
performance a lot: 276 seconds to 16 seconds locally for the Wasm module in bug 1916442
after some other changes.

A later patch will update the SMDOC comment.

All ranges that start after the position that's passed in definitely can't overlap.
Use binary search to skip these ranges.
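A sketch of this lookup, under the assumption (stated in the earlier patches) that the vector is sorted by descending start position: every range with `from > pos` sits in a prefix of the vector, so one binary search skips the whole prefix and only the remaining ranges can possibly contain `pos`. The names are illustrative, not the real `rangeFor` signature.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Range {
  uint32_t from;
  uint32_t to;  // half-open [from, to)
  bool covers(uint32_t pos) const { return from <= pos && pos < to; }
};

// Hypothetical sketch: find a range containing pos in a vector sorted by
// descending start position, or return nullptr.
const Range* rangeFor(const std::vector<Range>& sorted, uint32_t pos) {
  // Binary search past the prefix of ranges that start after pos; none of
  // them can overlap pos.
  auto it = std::partition_point(
      sorted.begin(), sorted.end(),
      [pos](const Range& r) { return r.from > pos; });
  for (; it != sorted.end(); ++it) {
    if (it->covers(pos)) {
      return &*it;
    }
  }
  return nullptr;
}
```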

This reduces the Ion compilation time for the Wasm module in bug 1916442 from 16 seconds to
14 seconds locally.

The time complexity for rangeFor and how it affects some of the callers still isn't great,
but optimizing that will require more work.

Now that LiveRange is part of a single linked list instead of two lists, we can
remove its bundleLink and make it a standard linked-list node.

Severity: -- → N/A
Priority: -- → P2

The virtual register ranges are guaranteed to be sorted after the initial live range
building and grouping. After this phase, they can become unsorted when allocating
bundles to registers.

This patch adds an explicit sortVirtualRegisterRanges step to the allocator
to ensure the ranges are sorted after allocating all bundles. This is now the
only place where we can call sortRanges.

This makes it easier to reason about the sorting behavior and helps avoid pathological
behavior from future changes.

A VirtualRegister::replaceLastRangeLinear helper function is added to maintain
sort order so that we don't have to sort explicitly there.

Because sorting is now explicit, we can remove the activeRangeIterators_ bookkeeping
because we no longer have this possible footgun from sorting while iterating.

std::sort is not a stable sorting function. To avoid relying on implementation-defined ordering, extend the comparator function to also compare the 'to' position and the bundle id when ranges start at the same position.

Bundles now store an 'id' in non-debug builds too. Each compilation has its own
counter that starts at 0 so that we don't need atomic operations to determine the next
id.

Pushed by jdemooij@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/79c79897921e
part 1 - Store virtual registers in a Vector instead of FixedList. r=jseward
https://hg.mozilla.org/integration/autoland/rev/311340102af3
part 2 - Remove dead ranges first in createMoveGroupsFromLiveRangeTransitions. r=jseward
https://hg.mozilla.org/integration/autoland/rev/eb5f3abe6d1f
part 3 - Add VirtualRegister::RangeIterator. r=jseward
https://hg.mozilla.org/integration/autoland/rev/af0ed3a86838
part 4 - Add LiveBundle::removeAllRangesFromVirtualRegisters. r=jseward
https://hg.mozilla.org/integration/autoland/rev/ed873a1e9156
part 5 - Add VirtualRegister::removeRangesForBundle. r=jseward
https://hg.mozilla.org/integration/autoland/rev/3ffe6980ef17
part 6 - Use a vector instead of linked list for VirtualRegister ranges. r=jseward
https://hg.mozilla.org/integration/autoland/rev/5658a571443c
part 7 - Use binary search to optimize VirtualRegister::rangeFor. r=jseward
https://hg.mozilla.org/integration/autoland/rev/0f2df5df3636
part 8 - Simplify LiveRange linked list code, update comments. r=jseward
https://hg.mozilla.org/integration/autoland/rev/56e5df0ee5bf
part 9 - Sort ranges explicitly instead of implicitly. r=jseward
https://hg.mozilla.org/integration/autoland/rev/4c0ff51c2ce6
part 10 - Define sort order for ranges with same start position. r=jseward

Improvement:
30% on AWFY-Jetstream2-HashSet-wasm-Runtime
24% on AWFY-Jetstream2-float-mm.c-Average (macOS only)
3%-6% on AWFY-webassembly-godot-*

Regression:
10% on wasm-misc-optimizing-fannkuch
1.6% on wasm-misc-optimizing-compile. But this is effectively a "return to baseline" after the 1.6% improvement from bug 1917817.

(In reply to Mayank Bansal from comment #13)

Improvement:
30% on AWFY-Jetstream2-HashSet-wasm-Runtime

For posterity: I looked into this and it's because HashSet.wasm has a function with a ton of MIR/LIR call instructions. The splitting for this results in many tiny live ranges for a single virtual register that's live for the whole function. The slow part is connecting these ranges by inserting moves.

That loop is currently quadratic. It was optimized somewhat in this bug by using binary search to find the start position, but bug 1920951 will make it linear, so there will be another speedup.

Local Ion compile times for HashSet.wasm:

before this bug: 2800 ms
after this bug: 690 ms
after other changes like bug 1920951: 179 ms

NI myself to look at wasm-misc-optimizing-fannkuch to see if something changed there.

Flags: needinfo?(jdemooij)

(In reply to Mayank Bansal from comment #13)

Regression:
10% on wasm-misc-optimizing-fannkuch

I looked into this. The code we generate for this test is very similar before/after. After later changes (probably bug 1920951) it's clearly better now because we have fewer moves.

AFAICT this only affects macOS and the test is bimodal if you zoom out, but we get the higher number more often now. The last two data points have been better (after bug 1920951 landed), but that could be a coincidence.

Flags: needinfo?(jdemooij)
Blocks: 1887312
