Tweak Ion's Backtracking Allocator spill costs
Categories
(Core :: JavaScript Engine: JIT, enhancement, P1)
People
(Reporter: yulia, Assigned: yulia)
References
(Blocks 1 open bug)
Attachments
(4 files)
Some of the work I've been doing has resulted in this small change to the register allocator. It overlapped with Julian's work in bug 1758274 to introduce spill weighting by loop depth, also referred to as the "frequency". The change appears to have no negative consequences on Perfherder, improves the rust-fannkuch benchmark by about 20%, and aligns us more closely with the literature and with LLVM's implementation of the same register allocator (see https://llvm.org/doxygen/LiveIntervals_8cpp_source.html#l00857).
Comment 1•1 month ago
This moves the calculation of the def spill weight out of computeSpillWeight
and caches it as part of the range definition. If a range contains the definition, it
sets a spill cost that takes loop depth into account (at the same time as setting
hasDefinition). This is roughly inspired by the corresponding code in LLVM's
backtracking register allocator:
https://llvm.org/doxygen/CalcSpillWeights_8cpp_source.html#l00201
The end result is roughly the same as the original algorithm, but this change alone
yields a performance improvement of about 8% on rust-fannkuch in the embenchen
test suite. rust-fannkuch is one large, long-running loop, which may explain why.
The other affected test is copy, which slows down by about 3%, though that looks
like it may be noise. Otherwise, performance might be lost because the def weights
are out of sync with the use weights: if a def occurs in a loop of any size, it is
weighted disproportionately relative to the uses. Another contributing factor is that
defs without uses are stores, and stores are cheaper than loads, so right now the
weight is incorrect because we have no depth calculation for uses. That is corrected
in the next patch.
An interesting observation: simply removing any weight for defs gives the best
performance, but that likely holds only for the tests we have.
There is no noticeable impact on CI tests (found here:
https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=d6bf54b3e00e9fc5e96a7e4c3f226031bf3e245a&newProject=try&newRevision=a95de146f7655404143fa77069f99b285a75f3ae )
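The idea can be sketched roughly as follows. This is an illustrative C++ sketch, not SpiderMonkey's actual API: the type, field, and constant names (LiveRange, defSpillCost, kLoopFactor, and the factor value itself) are all hypothetical, and stand in for the real def-spill-cost caching described above.

```cpp
#include <cstdint>

// Illustrative constants, not SpiderMonkey's actual values.
constexpr double kDefStoreCost = 1.0;  // base cost of spilling a def (a store)
constexpr double kLoopFactor = 3.0;    // assumed per-loop-level frequency multiplier

// Estimated execution frequency of a block at a given loop depth: each
// enclosing loop is assumed to multiply the execution count by kLoopFactor.
double estimatedFrequency(uint32_t loopDepth) {
  double f = 1.0;
  for (uint32_t i = 0; i < loopDepth; ++i) {
    f *= kLoopFactor;
  }
  return f;
}

// Hypothetical stand-in for the range type: when the definition is recorded,
// a depth-scaled spill cost is computed once and cached on the range, instead
// of being recomputed inside computeSpillWeight every time.
struct LiveRange {
  bool hasDefinition = false;
  double defSpillCost = 0.0;  // cached alongside hasDefinition

  void setHasDefinition(uint32_t loopDepth) {
    hasDefinition = true;
    defSpillCost = kDefStoreCost * estimatedFrequency(loopDepth);
  }
};
```

A def inside two nested loops would then cache a cost of 1.0 * 3^2 = 9.0, making ranges defined in hot loops correspondingly more expensive to spill.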
Comment 2•1 month ago
This is part 2 of using loop depth to calculate spill weights. The work here balances the
def loop-depth calculation introduced in the previous patch: introducing an estimated
frequency for use-related spill weights accounts for an additional 12% improvement on
the rust-fannkuch test.
There is no change in our CI tests:
https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=d6bf54b3e00e9fc5e96a7e4c3f226031bf3e245a&newProject=try&newRevision=09a22f83bb140c23cdb0ec7a6a83b8552aa81d54&page=1
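Once both defs and uses are frequency-scaled, the combined weight can be sketched like this. Again an illustrative C++ sketch under assumed constants (kStoreCost, kLoadCost, kLoopFactor are hypothetical values, not SpiderMonkey's): the key points are that each use's cost is scaled by its own loop depth, and that a reload (load) is assumed to cost more than a spill (store), which is why def-only weighting was out of balance before.

```cpp
#include <cstdint>
#include <vector>

// Illustrative cost model, not SpiderMonkey's actual values.
constexpr double kStoreCost = 1.0;   // spilling a def writes to the stack slot
constexpr double kLoadCost = 2.0;    // each use must reload; loads cost more
constexpr double kLoopFactor = 3.0;  // assumed per-loop-level multiplier

// Estimated execution frequency for a block at the given loop depth.
double frequency(uint32_t loopDepth) {
  double f = 1.0;
  for (uint32_t i = 0; i < loopDepth; ++i) {
    f *= kLoopFactor;
  }
  return f;
}

// Total spill weight: one frequency-scaled store for the def (if any), plus
// one frequency-scaled load per use, each at its own loop depth.
double spillWeight(bool hasDef, uint32_t defDepth,
                   const std::vector<uint32_t>& useDepths) {
  double w = hasDef ? kStoreCost * frequency(defDepth) : 0.0;
  for (uint32_t d : useDepths) {
    w += kLoadCost * frequency(d);
  }
  return w;
}
```

For example, a value defined at depth 1 with two uses at depth 1 would weigh 1*3 + 2*3 + 2*3 = 15.0, so its uses now dominate its def, matching the store-versus-load asymmetry noted in the previous comment.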
Depends on D210269
Comment 3•1 month ago
This patch removes some complexity related to spill weights: namely, the calculation of
spill weight according to register policy. This appears to have been introduced with the
first version of the register allocator.
As far as I can tell, modifying the spill weight by policy does not result in better
register allocation, and this strategy is not used by LLVM:
https://llvm.org/doxygen/LiveIntervals_8cpp_source.html#l00857
CI tests can be found here:
https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=d6bf54b3e00e9fc5e96a7e4c3f226031bf3e245a&newProject=try&newRevision=a6ad16a841c83f342aa06f4560577278af04aa0a&page=1&framework=13
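The shape of the simplification can be sketched as a before/after. This is purely illustrative: the enum, function names, and weight values below are hypothetical, not the actual Ion code, and only show the general pattern of a per-use weight that used to vary with the use's register policy being collapsed to a uniform base cost.

```cpp
// Hypothetical before/after sketch of dropping policy-based spill weighting.
// Names and values are illustrative, not SpiderMonkey's actual constants.
enum class UsePolicy { Any, Register, Fixed };

// Before: each use's contribution to the spill weight was scaled according
// to how constrained its register policy was.
double useWeightOld(UsePolicy p) {
  switch (p) {
    case UsePolicy::Any:      return 1000.0;
    case UsePolicy::Register: return 2000.0;
    case UsePolicy::Fixed:    return 2000.0;
  }
  return 0.0;
}

// After: every use contributes the same base cost regardless of policy;
// only the frequency scaling from the previous patches differentiates uses.
double useWeightNew(UsePolicy) {
  return 1000.0;
}
```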
Depends on D210270
Comment 4•1 month ago
This patch takes, wholesale, Julian's work from
https://bugzilla.mozilla.org/show_bug.cgi?id=1758274
The cache lets us avoid recalculating spill weights during register allocation unless we
need to (for example, after a spill).
In my testing on embenchen, we see a very modest improvement of ~1% across the tests.
This patch is optional.
Depends on D210271
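The caching pattern is simple to sketch. This is an illustrative C++ sketch rather than the actual patch: the Bundle struct, field names, and the constant 42.0 stand-in weight are hypothetical; the point is that the weight is computed lazily, served from the cache on later queries, and invalidated when allocation decisions (such as a spill) change the bundle.

```cpp
// Hypothetical sketch of spill-weight caching (names are illustrative).
struct Bundle {
  bool weightValid = false;
  double cachedWeight = 0.0;
  int recomputes = 0;  // for illustration: counts actual recomputations

  // Stand-in for the real spill-weight calculation, which is relatively
  // expensive (it walks the bundle's ranges, defs, and uses).
  double computeSpillWeight() {
    ++recomputes;
    return 42.0;  // placeholder value
  }
};

// Return the cached weight, computing it only if the cache is stale.
double spillWeight(Bundle& b) {
  if (!b.weightValid) {
    b.cachedWeight = b.computeSpillWeight();
    b.weightValid = true;
  }
  return b.cachedWeight;
}

// Called when something changes the bundle's cost (e.g. after a spill), so
// the next query recomputes.
void invalidateWeight(Bundle& b) {
  b.weightValid = false;
}
```

Repeated queries between spills then hit the cache, which is where the modest ~1% win would come from.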
Comment 5•1 month ago
Concretely, embenchen reports the following results (comparison is of the average of 9 runs):
test            new    old   diff
---------------------------------
box2d           589    587   1.003
bullet          703    703   1.000
conditionals    156    156   1.000
copy            486    486   1.000
corrections     388    388   1.000
fannkuch       2016   2020   0.998
fasta           804    806   0.998
fib             942    941   1.001
ifs             171    171   1.000
binarytrees    2467   2469   0.999
memops          786    786   1.000
primes          219    217   1.009
raybench        803    809   0.993
rust-fannkuch  1557   1897   0.821  <-- main improvement of this patch stack
skinning        414    414   1.000
zlib            822    821   1.001
Comment 6•1 month ago
There is also a slight improvement on one of the files in https://bugzilla.mozilla.org/show_bug.cgi?id=1884572. It doesn't solve that bug's problem, but it is an unexpected benefit.
Before:
Benchmarking, the higher ops/sec the better.
Firefox 127.0 on OS X 10.15.
Compress:
- zhipeng-jia/snappyjs x 6.89 ops/sec ±0.33% (22 runs sampled): 558 MB/s
- pierrec/node-lz4 x 6.23 ops/sec ±2.37% (19 runs sampled): 504 MB/s
- gorhill/lz4-block.js x 14.21 ops/sec ±1.15% (28 runs sampled): 1150 MB/s
- gorhill/lz4-block.wasm x 12.62 ops/sec ±0.60% (36 runs sampled): 1022 MB/s
Done.
Uncompress:
- zhipeng-jia/snappyjs x 9.34 ops/sec ±0.46% (27 runs sampled): 756 MB/s
- pierrec/node-lz4 x 8.58 ops/sec ±1.65% (26 runs sampled): 694 MB/s
- gorhill/lz4-block.js x 15.82 ops/sec ±0.82% (31 runs sampled): 1281 MB/s
- gorhill/lz4-block.wasm x 49.05 ops/sec ±1.79% (53 runs sampled): 3972 MB/s
Done.
After:
Benchmarking, the higher ops/sec the better.
Firefox 127.0 on OS X 10.15.
Compress:
- zhipeng-jia/snappyjs x 6.92 ops/sec ±0.39% (22 runs sampled): 560 MB/s
- pierrec/node-lz4 x 6.19 ops/sec ±0.48% (20 runs sampled): 501 MB/s
- gorhill/lz4-block.js x 14.15 ops/sec ±0.84% (28 runs sampled): 1146 MB/s
- gorhill/lz4-block.wasm x 12.93 ops/sec ±0.54% (36 runs sampled): 1047 MB/s
Done.
Uncompress:
- zhipeng-jia/snappyjs x 9.39 ops/sec ±0.48% (27 runs sampled): 760 MB/s
- pierrec/node-lz4 x 8.81 ops/sec ±1.02% (26 runs sampled): 713 MB/s
- gorhill/lz4-block.js x 16.19 ops/sec ±0.46% (31 runs sampled): 1311 MB/s
- gorhill/lz4-block.wasm x 49.46 ops/sec ±0.32% (53 runs sampled): 4006 MB/s
Done.