Closed Bug 1970035 Opened 5 months ago Closed 3 months ago

Improve capabilities of the EffectiveAddressAnalysis pass

Categories

(Core :: JavaScript: WebAssembly, enhancement, P1)

Tracking

RESOLVED FIXED
143 Branch
Tracking Status
firefox143 --- fixed

People

(Reporter: jseward, Assigned: jseward)

References

(Blocks 1 open bug)

Details

Attachments

(4 files, 3 obsolete files)

The EAA pass has the following limitations:

(1) it can create expressions of the form
signedImm32 + base + (index << shift), but not
signedImm32 + (index << shift), even though the latter is important for
wasm, and x86 can do it in one insn (and I believe arm64 can too).

(2) if any of the subterms -- in particular (index << shift) -- is used
more than once, then the pass won't merge it into an EffectiveAddress node,
whereas we can see that V8 will.

(3) there's a kind-of unrelated transformation hacked in there, which removes
x & MASK in the case where it's obvious that x has a value that makes
the & operation redundant.

More generally, the pattern matching is done by a forward pass over each block.
This leads to hard-to-understand code and to limitation (2). A more natural
way to do it would be a backwards pass, which would avoid (2) and
could possibly reimplement (3) in a more decoupled way.

(1) is important for wasm in cases where C/C++/Rust/whatever synthesises
arrays in linear memory, and those arrays are C/C++/Rust/etc global variables.

To index into such an array, the offset in linear memory is (eg)
offset_of_start + (index << 2). The compiler/linker that generates the .wasm
knows the base offset of the array in linear memory, and so
offset_of_start becomes a constant baked into the .wasm. This means the
pattern isn't one that the EAA pass currently matches. This shortcoming is
particularly marked for JS3/gcc-loops-wasm, where we generate many
mov-shl-add sequences but V8 generates a single lea (,%index,$sh) insn.
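The global-array case above can be mimicked in plain C. In this sketch the offsets are made up for illustration; the point is that every access computes constant + (index << 2) with no base operand at all, which is exactly the form from limitation (1):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sketch (ARR_OFFSET is a made-up value): a wasm module
 * addressing a global int32 array that the linker placed at a fixed
 * linear-memory offset.  Each access is constant + (index << 2). */
enum { ARR_OFFSET = 1024 };          /* baked into the .wasm by the linker */
static uint8_t linear_memory[4096];

static uint32_t elem_addr(uint32_t i) {
    return ARR_OFFSET + (i << 2);    /* one x86 addressing mode: disp(,%i,4) */
}

static int32_t load_i32(uint32_t addr) {
    int32_t v;
    memcpy(&v, &linear_memory[addr], sizeof v);
    return v;
}

static void store_i32(uint32_t addr, int32_t v) {
    memcpy(&linear_memory[addr], &v, sizeof v);
}
```

On x86 the whole address computation fits in one memory operand, which is why the mov-shl-add sequences versus V8's single lea matter for gcc-loops-wasm.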

(2) is important for blocks in which there are, for example, two accesses to
arrays of the same element size using the same index (eg arr1[i] = arr2[i]).
Then there will be two terms of the form index << shift, which I assume will
be GVN'd before the EAA pass runs. Hence we do not generate an EAA for either
access. V8 by contrast will (implicitly) duplicate the shift by creating EAAs
for both accesses. Note this limitation applies to us even when a
base operand is present.
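The motivating case for limitation (2) is visible in plain C. Both accesses below need the same i << 2 address term, so after GVN there is a single shift node with two uses, which the forward-pass EAA refused to fold into either access:

```c
#include <assert.h>
#include <stdint.h>

/* The pattern behind limitation (2): both memory accesses share a single
 * (i << 2) term.  After GVN there is one shift node with two users, so
 * the old forward pass folded it into neither access; V8 (and a backward
 * pass) can fold a copy into each. */
void copy_elem(int32_t *arr1, const int32_t *arr2, uint32_t i) {
    arr1[i] = arr2[i];
}
```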

Attached patch bug1970035-EAA.diff-5 (obsolete) — Splinter Review

WIP patch. Gives significant speedup on JS3/gcc-loops-wasm on x64.

Attached patch bug1970035-EAA.diff-9 (obsolete) — Splinter Review

WIP patch. Perf results on Intel Tiger Lake:

2nd best of 20 runs, 3.0 GHz, "  Score: " values (higher is better)

                   BASIS      AFTER
-------------------------------------------
HashSet-wasm     125.170    124.908   -0.2%
tsf-wasm         161.221    159.373   -1.1%
quicksort-wasm   133.032    132.693   -0.3%
gcc-loops-wasm   134.788    179.931  +33.5%
richards-wasm    193.273    193.292    0.0%
sqlite3-wasm      63.909     63.864   -0.1%
tfjs-wasm          1.167      1.166   -0.1%
tfjs-wasm-simd     2.810      2.822   +0.4%
argon2-wasm       97.408     98.202   +0.8%
8bitbench-wasm    16.904     17.996   +6.5%
Dart-flute-wasm    2.660      2.659    0.0%
zlib-wasm         76.732     76.730    0.0%
Attachment #9493193 - Attachment is obsolete: true
Blocks: 1942596
Severity: -- → N/A
Priority: -- → P1
Summary: wasm: improve capabilities of the EffectiveAddressAnalysis pass → Improve capabilities of the EffectiveAddressAnalysis pass

I looked into the 1.1% regression for tsf-wasm. I cannot find any significant
differences in generated code.

Profiling with perf stat shows the new version runs in slightly fewer
instructions (0.058% reduction) but requires 1.1% more cycles. Hence the new
version has an IPC of 3.23 vs the original's IPC of 3.26, and the perf lossage
is entirely due to IPC lossage.
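The quoted figures can be sanity-checked arithmetically (the percentages are the ones from this comment; the helper name is just for illustration):

```c
#include <assert.h>
#include <math.h>

/* Sanity check: ~0.058% fewer instructions but ~1.1% more cycles turns
 * an IPC of 3.26 into roughly 3.22-3.23, matching the quoted 3.23
 * within rounding. */
static double scaled_ipc(double base_ipc, double insn_delta, double cycle_delta) {
    return base_ipc * (1.0 + insn_delta) / (1.0 + cycle_delta);
}
```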

Numbers (however fictional) from Callgrind show, for the new version, a 1.3%
reduction in icache miss rate (presumably because there's less generated code),
and negligible changes in anything else it can measure.

So I guess the IPC lossage is due to some scheduling or resource-limit
weirdness in the processor (Intel Tiger Lake).

Attached file bug1970035-EAA.diff-11 (obsolete) —

WIP patch, as much as I can be bothered to tune it. Needs cleanup for review
and needs arm32 assembler bits filled in.

Attachment #9494360 - Attachment is obsolete: true

With arm32 bits filled in, and more tuning for arm64.

Best of 20 runs, RPi5, 2.4 GHz

                           before       after

HashSet-wasm       Score:  87.072     88.507   +1.6 %
tsf-wasm           Score:  91.426     91.332   -0.1 %
quicksort-wasm     Score: 111.139    116.623   +4.9 %
gcc-loops-wasm     Score:  77.256     79.849   +3.3 %
richards-wasm      Score:  91.998     90.688   -1.4 %
sqlite3-wasm       Score:  39.431     39.832   +1.0 %
tfjs-wasm          Score:   0.801      0.810   +1.1 %
tfjs-wasm-simd     Score:   1.559      1.590   +1.2 %
argon2-wasm        Score:  49.675     49.757   +0.2 %
8bitbench-wasm     Score:  10.562     10.519   -0.4 %
zlib-wasm          Score:  60.462     62.925   +4.1 %
                                              -------
                                               + 1.41% avg

Attachment #9494575 - Attachment is obsolete: true

Comprehensive (kitchen-sink) test/evaluation patch, for reference. Not proposed
as something we'd want to land.

This patch provides MIR and LIR modifications needed for reworking of the
EffectiveAddressAnalysis pass:

  • MEffectiveAddress is renamed to MEffectiveAddress3, to signify that this is a
    3-addend computation. LEffectiveAddress is similarly renamed.

  • new node type MEffectiveAddress2, a 2-addend node, signifying
    constant + (index << scale). In other words, the same as
    MEffectiveAddress3 but without the base operand. This is important
    for linear-address computation for some wasm inputs.

  • CodeGenerator-arm.cpp:

    • visitEA3: limit the scope of the scratch register
    • visitEA2: new method
  • CodeGenerator-arm64.cpp:

    • visitEA3: better handling for scale == 1 and/or displacement == 0
    • visitEA2: new method
  • CodeGenerator-riscv64.cpp: new method visitEA2

  • Assembler-x86-shared.h, BaseAssembler-x86-shared.h:

    • new memory operand kind MEM_SCALE_NOBASE, which is like MEM_SCALE,
      but without a base register.
    • add an emitter and printer for MEM_SCALE_NOBASE
  • CodeGenerator-x86-shared.cpp: new method visitEA2
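The address forms the two node types denote can be written out as plain arithmetic (the function names here are illustrative only, not the actual MIR API):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only -- not SpiderMonkey's MIR API.  The value each node
 * type computes: */

/* MEffectiveAddress3: displacement + base + (index << scale). */
static uint32_t effective_address3(int32_t disp, uint32_t base,
                                   uint32_t index, unsigned scale) {
    return (uint32_t)disp + base + (index << scale);
}

/* MEffectiveAddress2: the same, but with no base operand. */
static uint32_t effective_address2(int32_t disp, uint32_t index,
                                   unsigned scale) {
    return (uint32_t)disp + (index << scale);
}
```

On x86 the 2-addend form corresponds to the new MEM_SCALE_NOBASE operand kind: a scaled index plus displacement with no base register.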

This patch provides a new and simpler implementation of the EAA pass:

  • for JS inputs, the pass now does nothing

  • for wasm inputs, all previous transformations are gone, and replaced with a
    single transformation that recognises base + (index << shift) only.

The rationale for these changes is described in a comment at the top of the
file.

The traversal direction has been changed: matching now proceeds backwards
through basic blocks. This makes it possible to fold the same shift node into
multiple MEffectiveAddress{2,3} nodes, which is something the previous
implementation could not do. The new implementation also recognises and folds
the case where base is a constant value, a situation which is sometimes
important for wasm, but which the previous implementation could not handle.
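The backward-matching idea can be sketched over a toy IR (the data structures below are made up for illustration, not SpiderMonkey's MIR). Walking the block backwards, each memory access whose address matches add(x, shl(index, scale)) gets its own folded EA record, even when the shl node has several users -- the case the old forward pass could not handle:

```c
#include <assert.h>
#include <stddef.h>

/* Toy sketch of backward EA matching; types and names are hypothetical. */
typedef enum { OP_CONST, OP_SHL, OP_ADD, OP_LOAD } Op;

typedef struct Node {
    Op op;
    struct Node *lhs, *rhs;   /* operands, where applicable */
    unsigned imm;             /* shift amount, for OP_SHL */
    int folded;               /* did we fold an EA for this access? */
} Node;

/* Try to fold one access's address; returns 1 on success. */
static int try_fold_ea(Node *access) {
    Node *addr = access->lhs;
    if (addr->op == OP_ADD && addr->rhs->op == OP_SHL) {
        access->folded = 1;   /* would create an MEffectiveAddress node here */
        return 1;
    }
    return 0;
}

/* Backward pass over a block (array of node pointers, in program order).
 * Because matching happens per use, a shift with many users can be
 * folded into every access that consumes it. */
static int fold_block(Node **block, size_t n) {
    int folded = 0;
    for (size_t i = n; i-- > 0; )
        if (block[i]->op == OP_LOAD)
            folded += try_fold_ea(block[i]);
    return folded;
}
```

In the real pass the folded shift's original node becomes dead when all its users are absorbed, and is cleaned up by later dead-code elimination.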

Status: NEW → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → 143 Branch
Regressions: 1978731

JetStream3-Android:
3% improvement on HashSet-wasm
4.3% on quicksort-wasm
2.3% on tfjs-wasm-simd

Regressed by: 1978996
No longer regressed by: 1978996
Regressions: 1978996
Regressions: 1979265
QA Whiteboard: [qa-triage-done-c144/b143]
