Bug 1896581 (Open) - Opened 1 month ago, Updated 1 month ago

Tweak Ion's Backtracking Allocator spill costs

Categories: Core :: JavaScript Engine: JIT, enhancement, P1

People: Reporter: yulia; Assigned: yulia

References: Blocks 1 open bug

Attachments: 4 files

Some of the work I've been doing has resulted in this small change to the register allocator. It overlaps with Julian's work in bug 1758274 to introduce spill weighting by loop depth, also referred to as the "frequency". The change appears to have no negative consequences on Perfherder, improves the rust-fannkuch benchmark by about 20%, and brings us more in line with the literature and with LLVM's implementation of the same register allocator (see https://llvm.org/doxygen/LiveIntervals_8cpp_source.html#l00857).

This moves the calculation of the def spill weight out of computeSpillWeight and
caches it as part of the range definition. If a range contains the definition, it
sets (while setting hasDefinition) a spill cost that takes the loop depth into account.

This is roughly inspired by the same code for LLVM's backtracking register allocator:
https://llvm.org/doxygen/CalcSpillWeights_8cpp_source.html#l00201
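The idea can be sketched roughly as follows. This is a minimal illustration, not the actual patch: the identifiers (estimatedFrequency, defSpillWeight), the base cost, the 10x-per-level assumption, and the depth cap are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch: the def's spill weight is scaled by an estimated
// execution frequency derived from loop depth, and cached on the range
// when hasDefinition is set, instead of being recomputed every time
// computeSpillWeight is queried. Constants are illustrative.
static const size_t BASE_DEF_WEIGHT = 3000;  // assumed base cost of a def
static const size_t MAX_DEPTH = 10;          // cap so the weight cannot overflow

size_t estimatedFrequency(size_t loopDepth) {
  // Assume each nesting level runs ~10x more often than its parent.
  size_t depth = loopDepth < MAX_DEPTH ? loopDepth : MAX_DEPTH;
  size_t freq = 1;
  for (size_t i = 0; i < depth; i++) {
    freq *= 10;
  }
  return freq;
}

size_t defSpillWeight(size_t loopDepth) {
  return BASE_DEF_WEIGHT * estimatedFrequency(loopDepth);
}
```

With these assumed constants, a def at depth 2 would be considered 100x more expensive to spill than one outside any loop, which is the effect the patch is after.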

The end result is roughly the same as the original algorithm, but we get a
performance improvement of about 8% on rust-fannkuch in the embenchen test suite
from this change alone. rust-fannkuch is one large, long-running loop, which may explain why.

The other affected test is the copy test, which slows down by about 3%, but
this looks like it may be noise. Otherwise, performance might be lost because
the def weights are out of sync with the use weights: if a def occurs in a loop
of any size, it will be weighted disproportionately relative to the uses.
Another contributing factor is that defs without uses are stores, which are
cheaper than loads, so right now the weight is incorrect since we have no depth
calculation for uses. That is corrected in the next patch.

An interesting observation: simply removing any weight for defs gives the best performance,
but that is likely an artifact of the tests we have.

There is no noticeable impact on CI tests (found here:
https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=d6bf54b3e00e9fc5e96a7e4c3f226031bf3e245a&newProject=try&newRevision=a95de146f7655404143fa77069f99b285a75f3ae )

This is part 2 of using loop depth to calculate spill weight. The work here balances the def loop-depth
calculation introduced in the previous patch. Introducing an estimated frequency for use-related spill weights accounts for an
additional 12% improvement on the rust-fannkuch test.
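The balance described above can be sketched like this. Everything here is illustrative, not the patch itself: the cost constants, the frequency model, and the function names are assumptions, chosen only to show how def (store) and use (load) weights can be kept in sync under one frequency estimate.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch: a spilled def becomes a store and each spilled use
// becomes a reload, with loads assumed more expensive than stores. Both
// sides are scaled by the same estimated frequency so the def weight no
// longer dominates the use weights. Costs and names are illustrative.
static const size_t STORE_COST = 1000;  // assumed cost of spilling a def
static const size_t LOAD_COST = 2000;   // assumed cost of reloading a use

size_t frequencyAt(size_t loopDepth) {
  // Same assumed model as for defs: ~10x per nesting level, capped.
  size_t freq = 1;
  for (size_t i = 0; i < loopDepth && i < 10; i++) {
    freq *= 10;
  }
  return freq;
}

// Total spill weight of a bundle: one store per def plus one load per use,
// each weighted by the frequency of the block it occurs in.
size_t bundleSpillWeight(bool hasDef, size_t defDepth,
                         const size_t* useDepths, size_t numUses) {
  size_t weight = 0;
  if (hasDef) {
    weight += STORE_COST * frequencyAt(defDepth);
  }
  for (size_t i = 0; i < numUses; i++) {
    weight += LOAD_COST * frequencyAt(useDepths[i]);
  }
  return weight;
}
```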

There is no change in our CI tests:
https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=d6bf54b3e00e9fc5e96a7e4c3f226031bf3e245a&newProject=try&newRevision=09a22f83bb140c23cdb0ec7a6a83b8552aa81d54&page=1

Depends on D210269

This patch removes some complexity related to spill weights: namely, the calculation of
spill weight according to register policy. It looks like this was introduced with the first
version of the register allocator.

As far as I can tell, modifying the spill weight by policy does not seem to result in
better register allocation, and this strategy is not used by LLVM:
https://llvm.org/doxygen/LiveIntervals_8cpp_source.html#l00857
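To make the removal concrete, here is a before/after sketch. The Policy enum, the multiplier, and both function names are hypothetical, invented only to illustrate what "spill weight according to policy" means; the actual policies and factors in the allocator differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch of dropping the policy adjustment. Previously
// (in this illustration) a def with a constrained policy was treated as
// more expensive to spill; afterwards the weight depends only on the
// frequency-scaled def/use costs.
enum class Policy { Any, Register, Fixed };

size_t policyFactor(Policy p) {
  // Old behavior (sketch): fixed-register defs counted as more expensive.
  switch (p) {
    case Policy::Fixed:
      return 2;
    default:
      return 1;
  }
}

size_t oldSpillWeight(size_t baseWeight, Policy p) {
  return baseWeight * policyFactor(p);
}

size_t newSpillWeight(size_t baseWeight, Policy) {
  // Policy no longer influences the weight.
  return baseWeight;
}
```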

CI tests can be found here:
https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=d6bf54b3e00e9fc5e96a7e4c3f226031bf3e245a&newProject=try&newRevision=a6ad16a841c83f342aa06f4560577278af04aa0a&page=1&framework=13

Depends on D210270

This patch takes, wholesale, Julian's work from
https://bugzilla.mozilla.org/show_bug.cgi?id=1758274

The cache allows us to avoid recalculating spill weights during register allocation unless
we need to (for example, after a spill).
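The caching scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from Julian's patch: the Bundle fields, the cost model, and the function names are all invented for the sketch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch of a spill-weight cache: the weight is computed
// once, served from the cache on later queries, and invalidated when a
// spill changes the bundle so the next query recomputes it.
struct Bundle {
  size_t rangeCount;      // stand-in for whatever the weight depends on
  size_t cachedWeight;
  bool weightValid;
  size_t recomputations;  // instrumentation for this sketch only
};

size_t computeSpillWeight(Bundle& b) {
  b.recomputations++;
  return b.rangeCount * 1000;  // assumed cost model
}

size_t spillWeight(Bundle& b) {
  if (!b.weightValid) {
    b.cachedWeight = computeSpillWeight(b);
    b.weightValid = true;
  }
  return b.cachedWeight;
}

void noteSpill(Bundle& b) {
  // The bundle changed; force recomputation on the next query.
  b.weightValid = false;
}
```

Repeated queries between spills then cost a single comparison rather than a full recomputation, which is where the modest across-the-board win would come from.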

In my testing on embenchen, we see a very modest improvement of ~1% across the tests. This patch is
optional.

CI results:
https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=d6bf54b3e00e9fc5e96a7e4c3f226031bf3e245a&newProject=try&newRevision=c806b09d6d6e5d7d14cd4ed0b252056f4881f7a5&framework=13&page=3

Depends on D210271


Concretely, embenchen reports the following results (each figure is the average of 9 runs; diff = new/old, so lower is faster):

test		new	old	diff
---------------------------------------
box2d		589	587	1.003
bullet		703	703	1.0
conditionals	156	156	1.0
copy		486	486	1.0
corrections	388	388	1.0
fannkuch	2016	2020	0.998
fasta		804	806	0.998
fib		942	941	1.001
ifs		171	171	1.0
binarytrees	2467	2469	0.999
memops		786	786	1.0
primes		219	217	1.009
raybench	803	809	0.993
rust-fannkuch	1557	1897	0.821	<-- main improvement of this patch stack
skinning	414	414	1.0
zlib		822	821	1.001

There is also a slight improvement on one of the files in https://bugzilla.mozilla.org/show_bug.cgi?id=1884572. It doesn't solve that bug's problem, but it is an unexpected benefit.

Before:

Benchmarking, the higher ops/sec the better.
Firefox 127.0 on OS X 10.15.

Compress:
  -   zhipeng-jia/snappyjs x 6.89 ops/sec ±0.33% (22 runs sampled): 558 MB/s
  -       pierrec/node-lz4 x 6.23 ops/sec ±2.37% (19 runs sampled): 504 MB/s
  -   gorhill/lz4-block.js x 14.21 ops/sec ±1.15% (28 runs sampled): 1150 MB/s
  - gorhill/lz4-block.wasm x 12.62 ops/sec ±0.60% (36 runs sampled): 1022 MB/s
  Done.

Uncompress:
  -   zhipeng-jia/snappyjs x 9.34 ops/sec ±0.46% (27 runs sampled): 756 MB/s
  -       pierrec/node-lz4 x 8.58 ops/sec ±1.65% (26 runs sampled): 694 MB/s
  -   gorhill/lz4-block.js x 15.82 ops/sec ±0.82% (31 runs sampled): 1281 MB/s
  - gorhill/lz4-block.wasm x 49.05 ops/sec ±1.79% (53 runs sampled): 3972 MB/s
  Done.

After:

Benchmarking, the higher ops/sec the better.
Firefox 127.0 on OS X 10.15.

Compress:
  -   zhipeng-jia/snappyjs x 6.92 ops/sec ±0.39% (22 runs sampled): 560 MB/s
  -       pierrec/node-lz4 x 6.19 ops/sec ±0.48% (20 runs sampled): 501 MB/s
  -   gorhill/lz4-block.js x 14.15 ops/sec ±0.84% (28 runs sampled): 1146 MB/s
  - gorhill/lz4-block.wasm x 12.93 ops/sec ±0.54% (36 runs sampled): 1047 MB/s
  Done.

Uncompress:
  -   zhipeng-jia/snappyjs x 9.39 ops/sec ±0.48% (27 runs sampled): 760 MB/s
  -       pierrec/node-lz4 x 8.81 ops/sec ±1.02% (26 runs sampled): 713 MB/s   
  -   gorhill/lz4-block.js x 16.19 ops/sec ±0.46% (31 runs sampled): 1311 MB/s
  - gorhill/lz4-block.wasm x 49.46 ops/sec ±0.32% (53 runs sampled): 4006 MB/s
  Done.
See Also: → 1884572
Blocks: sm-opt-jits
Severity: -- → N/A
Priority: -- → P1
