Improve capabilities of the EffectiveAddressAnalysis pass
Categories
(Core :: JavaScript: WebAssembly, enhancement, P1)
Tracking
()
Tracking | Status | |
---|---|---|
firefox143 | --- | fixed |
People
(Reporter: jseward, Assigned: jseward)
References
(Blocks 1 open bug)
Details
Attachments
(4 files, 3 obsolete files)
36.77 KB,
patch
|
Details | Diff | Splinter Review | |
44.82 KB,
patch
|
Details | Diff | Splinter Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review |
The EAA pass has the following limitations:
(1) it can create expressions of the form
signedImm32 + base + (index << shift)
, but not
signedImm32 + (index << shift)
even though the latter is important for
wasm and x86 can do it in one insn (and I think arm64 too).
(2) If any of the subterms -- but in particular (index << shift)
-- is used
more than once, then it won't merge it into an EffectiveAddress node,
whereas we can see that V8 will do that.
(3) there's a kind-of unrelated transformation hacked in there, which removes
x & MASK
in the case where it's obvious that x
has a value that makes
the &
operation redundant.
More generally, the pattern matching is done by a forward pass over each block.
This leads to hard to understand code and the limitation (2). A more natural
way to do it would be via a backwards pass, which would also avoid (2), and
could possibly reimplement (3) in a more decoupled way.
(1) is important for wasm in cases where C/C++/Rust/whatever synthesises
arrays in linear memory, and those arrays are C/C++/Rust/etc global variables.
To index into such an array, the offset in linear memory is (eg)
offset_of_start + (index << 2)
. The compile/link that generates the .wasm
knows what the base offset in linear memory of the array is, and so
offset_of_start
becomes a constant baked into the .wasm. This means the
pattern isn't one that the EAA pass currently matches. This shortcoming is
particularly marked for JS3/gcc-loops-wasm, where we generate many
mov-shl-add
sequences but V8 generates a single lea (,%index,$sh)
insn.
(2) is important for blocks in which there are, for example, two accesses to
arrays of the same element size using the same index (eg arr1[i] = arr2[i]
).
Then there will be two terms of the form index << shift
, which I assume will
be GVN'd before the EAA pass runs. Hence we do not generate an EAA for either
access. V8 by contrast will (implicitly) duplicate the shift by creating EAAs
for both accesses. Note this limitation applies to us even in the case where
we have base
present.
Assignee | ||
Comment 1•5 months ago
|
||
WIP patch. Gives significant speedup on JS3/gcc-loops-wasm on x64.
Assignee | ||
Comment 2•4 months ago
|
||
WIP patch. Perf results on Intel Tiger Lake:
2nd best of 20 runs, 3.0 GHz, " Score: " values (higher is better)
BASIS AFTER
-------------------------------------------
HashSet-wasm 125.170 124.908 -0.2%
tsf-wasm 161.221 159.373 -1.1%
quicksort-wasm 133.032 132.693 -0.3%
gcc-loops-wasm 134.788 179.931 +33.5%
richards-wasm 193.273 193.292 0.0%
sqlite3-wasm 63.909 63.864 -0.1%
tfjs-wasm 1.167 1.166 -0.1%
tfjs-wasm-simd 2.810 2.822 +0.4%
argon2-wasm 97.408 98.202 +0.8%
8bitbench-wasm 16.904 17.996 +6.5%
Dart-flute-wasm 2.660 2.659 0.0%
zlib-wasm 76.732 76.730 0.0%
Updated•4 months ago
|
Assignee | ||
Comment 3•4 months ago
|
||
I looked into the 1.1% regression for tsf-wasm. I cannot find any significant
differences in generated code.
Profiling with perf stat
shows the new version runs in slightly fewer
instructions (0.058% reduction) but requires 1.1% more cycles. Hence the new
version has an IPC of 3.23 vs the original's IPC of 3.26, and the perf lossage
is entirely due to IPC lossage.
Numbers (however fictional) from Callgrind show, for the new version, a 1.3%
reduction in icache miss rate (presumably because there's less generated code),
and negligable changes in anything else it can measure.
So I guess the IPC lossage is due to some scheduling or resource-limit
wierdness in the processor (Intel Tiger Lake).
Assignee | ||
Comment 4•4 months ago
|
||
WIP patch, as much as I can be bothered to tune it. Needs cleanup for review
and needs arm32 assembler bits filled in.
Assignee | ||
Comment 5•4 months ago
|
||
With arm32 bits filled in, and more tuning for arm64.
Best of 20 runs, RPi5, 2.4 GHz
before after
HashSet-wasm Score: 87.072 88.507 +1.6 %
tsf-wasm Score: 91.426 91.332 -0.1 %
quicksort-wasm Score: 111.139 116.623 +4.9 %
gcc-loops-wasm Score: 77.256 79.849 +3.3 %
richards-wasm Score: 91.998 90.688 -1.4 %
sqlite3-wasm Score: 39.431 39.832 +1.0 %
tfjs-wasm Score: 0.801 0.810 +1.1 %
tfjs-wasm-simd Score: 1.559 1.590 +1.2 %
argon2-wasm Score: 49.675 49.757 +0.2 %
8bitbench-wasm Score: 10.562 10.519 -0.4 %
zlib-wasm Score: 60.462 62.925 +4.1 %
-------
+ 1.41% avg
Assignee | ||
Comment 6•3 months ago
|
||
Comprehensive (kitchen-sink) test/evaluation patch, for reference. Not proposed
as something we'd want to land.
Assignee | ||
Comment 7•3 months ago
|
||
This patch provides MIR and LIR modifications needed for reworking of the
EffectiveAddressAnalysis pass:
-
MEffectiveAddress is renamed to MEffectiveAddress3, to signify that this is a
3-addend computation. LEffectiveAddress is similarly renamed. -
new node type MEffectiveAddress2, a 2-addend node, signifying
constant + (index << scale)
. In other words, same as MEffectiveAddress3, but without
thebase
operand. This is important for linear-address computation for
some wasm inputs. -
CodeGenerator-arm.cpp:
- visitEA3: limit the scope of the scratch register
- visitEA2: new method
-
CodeGenerator-arm64.cpp:
- visitEA3: better handling for scale == 1 and/or displacement == 0
- visitEA2: new method
-
CodeGenerator-riscv64.cpp: new method visitEA2
-
Assembler-x86-shared.h, BaseAssembler-x86-shared.h:
- new memory operand kind MEM_SCALE_NOBASE, which is like MEM_SCALE,
but without a base register. - add an emitter and printer for MEM_SCALE_NOBASE
- new memory operand kind MEM_SCALE_NOBASE, which is like MEM_SCALE,
-
CodeGenerator-x86-shared.cpp: new method visitEA2
Assignee | ||
Comment 8•3 months ago
|
||
This patch provides a new and simpler implementation of the EAA pass:
-
for JS inputs, the pass now does nothing
-
for wasm inputs, all previous transformations are gone, and replaced with a
single transformation that recognisesbase + (index << shift)
only.
The rationale for these changes is described in a comment at the top of the
file.
The traversal direction has been changed: matching now proceeds backwards
through basic blocks. This makes it possible to fold the same shift node into
multiple MEffectiveAddress{2,3} nodes, which is something the previous
implementation could not do. The new implementation also recognises and folds
the case where base
is a constant value, a situation which is sometimes
important for wasm, but which the previous implementation could not.
Comment 10•3 months ago
|
||
20%-30% improvement on gcc-loops on JS2/JS3
(Caveat: backfills still in progress)
Comment 11•3 months ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/2b3ba957f99e
https://hg.mozilla.org/mozilla-central/rev/d1e1c2e15bd7
https://hg.mozilla.org/mozilla-central/rev/6ed0c528f690
Comment 12•3 months ago
•
|
||
Jetstream3-Andoid:
3% improvement on Hashset
4.3% on quicksort-wasm
2.3% on tfjs-wasm-simd
Updated•3 months ago
|
Updated•2 months ago
|
Description
•