Closed Bug 1970035 Opened 5 months ago Closed 3 months ago

Improve capabilities of the EffectiveAddressAnalysis pass

Categories

(Core :: JavaScript: WebAssembly, enhancement, P1)

Tracking

RESOLVED FIXED
143 Branch
Tracking Status
firefox143 --- fixed

People

(Reporter: jseward, Assigned: jseward)

References

(Blocks 1 open bug)

Details

Attachments

(4 files, 3 obsolete files)

The EAA pass has the following limitations:

(1) it can create expressions of the form
signedImm32 + base + (index << shift), but not
signedImm32 + (index << shift), even though the latter is important for
wasm, and x86 can do it in one insn (and I believe arm64 can too).

(2) if any of the subterms -- in particular (index << shift) -- is used
more than once, then the pass won't merge it into an EffectiveAddress node,
whereas we can see that V8 will.

(3) there's a kind-of unrelated transformation hacked in there, which removes
x & MASK in the case where it's obvious that x has a value that makes
the & operation redundant.

More generally, the pattern matching is done by a forward pass over each block.
This leads to hard-to-understand code and to limitation (2). A more natural
way to do it would be a backwards pass, which would avoid (2) and
could possibly reimplement (3) in a more decoupled way.

(1) is important for wasm in cases where C/C++/Rust/whatever synthesises
arrays in linear memory, and those arrays are C/C++/Rust/etc global variables.

To index into such an array, the offset in linear memory is (eg)
offset_of_start + (index << 2). The compiler/linker that generates the .wasm
knows the base offset of the array in linear memory, and so
offset_of_start becomes a constant baked into the .wasm. This means the
pattern isn't one that the EAA pass currently matches. This shortcoming is
particularly marked for JS3/gcc-loops-wasm, where we generate many
mov-shl-add sequences but V8 generates a single lea (,%index,$sh) insn.
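The global-array case above can be mimicked in plain C. In this sketch the offsets are made up for illustration; the point is that every access computes constant + (index << 2) with no base operand at all, which is exactly the form from limitation (1):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sketch (ARR_OFFSET is a made-up value): a wasm module
 * addressing a global int32 array that the linker placed at a fixed
 * linear-memory offset.  Each access is constant + (index << 2). */
enum { ARR_OFFSET = 1024 };          /* baked into the .wasm by the linker */
static uint8_t linear_memory[4096];

static uint32_t elem_addr(uint32_t i) {
    return ARR_OFFSET + (i << 2);    /* one x86 addressing mode: disp(,%i,4) */
}

static int32_t load_i32(uint32_t addr) {
    int32_t v;
    memcpy(&v, &linear_memory[addr], sizeof v);
    return v;
}

static void store_i32(uint32_t addr, int32_t v) {
    memcpy(&linear_memory[addr], &v, sizeof v);
}
```

On x86 the whole address computation fits in one memory operand, which is why the mov-shl-add sequences versus V8's single lea matter for gcc-loops-wasm.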

(2) is important for blocks in which there are, for example, two accesses to
arrays of the same element size using the same index (eg arr1[i] = arr2[i]).
Then there will be two terms of the form index << shift, which I assume will
be GVN'd before the EAA pass runs. Hence we do not generate an EAA for either
access. V8 by contrast will (implicitly) duplicate the shift by creating EAAs
for both accesses. Note this limitation applies to us even when a
base operand is present.
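The motivating case for limitation (2) is visible in plain C. Both accesses below need the same i << 2 address term, so after GVN there is a single shift node with two uses, which the forward-pass EAA refused to fold into either access:

```c
#include <assert.h>
#include <stdint.h>

/* The pattern behind limitation (2): both memory accesses share a single
 * (i << 2) term.  After GVN there is one shift node with two users, so
 * the old forward pass folded it into neither access; V8 (and a backward
 * pass) can fold a copy into each. */
void copy_elem(int32_t *arr1, const int32_t *arr2, uint32_t i) {
    arr1[i] = arr2[i];
}
```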

Attached patch bug1970035-EAA.diff-5 (obsolete) — Splinter Review

WIP patch. Gives significant speedup on JS3/gcc-loops-wasm on x64.

Attached patch bug1970035-EAA.diff-9 (obsolete) — Splinter Review

WIP patch. Perf results on Intel Tiger Lake:

2nd best of 20 runs, 3.0 GHz, "  Score: " values (higher is better)

                   BASIS      AFTER
-------------------------------------------
HashSet-wasm     125.170    124.908   -0.2%
tsf-wasm         161.221    159.373   -1.1%
quicksort-wasm   133.032    132.693   -0.3%
gcc-loops-wasm   134.788    179.931  +33.5%
richards-wasm    193.273    193.292    0.0%
sqlite3-wasm      63.909     63.864   -0.1%
tfjs-wasm          1.167      1.166   -0.1%
tfjs-wasm-simd     2.810      2.822   +0.4%
argon2-wasm       97.408     98.202   +0.8%
8bitbench-wasm    16.904     17.996   +6.5%
Dart-flute-wasm    2.660      2.659    0.0%
zlib-wasm         76.732     76.730    0.0%
Attachment #9493193 - Attachment is obsolete: true
Blocks: 1942596
Severity: -- → N/A
Priority: -- → P1
Summary: wasm: improve capabilities of the EffectiveAddressAnalysis pass → Improve capabilities of the EffectiveAddressAnalysis pass

I looked into the 1.1% regression for tsf-wasm. I cannot find any significant
differences in generated code.

Profiling with perf stat shows the new version runs in slightly fewer
instructions (0.058% reduction) but requires 1.1% more cycles. Hence the new
version has an IPC of 3.23 vs the original's IPC of 3.26, and the perf lossage
is entirely due to IPC lossage.
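The quoted figures can be sanity-checked arithmetically (the percentages are the ones from this comment; the helper name is just for illustration):

```c
#include <assert.h>
#include <math.h>

/* Sanity check: ~0.058% fewer instructions but ~1.1% more cycles turns
 * an IPC of 3.26 into roughly 3.22-3.23, matching the quoted 3.23
 * within rounding. */
static double scaled_ipc(double base_ipc, double insn_delta, double cycle_delta) {
    return base_ipc * (1.0 + insn_delta) / (1.0 + cycle_delta);
}
```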

Numbers (however fictional) from Callgrind show, for the new version, a 1.3%
reduction in icache miss rate (presumably because there's less generated code),
and negligible changes in anything else it can measure.

So I guess the IPC lossage is due to some scheduling or resource-limit
weirdness in the processor (Intel Tiger Lake).

Attached file bug1970035-EAA.diff-11 (obsolete) —

WIP patch, as much as I can be bothered to tune it. Needs cleanup for review
and needs arm32 assembler bits filled in.

Attachment #9494360 - Attachment is obsolete: true

With arm32 bits filled in, and more tuning for arm64.

Best of 20 runs, RPi5, 2.4 GHz

                           before       after

HashSet-wasm       Score:  87.072     88.507   +1.6 %
tsf-wasm           Score:  91.426     91.332   -0.1 %
quicksort-wasm     Score: 111.139    116.623   +4.9 %
gcc-loops-wasm     Score:  77.256     79.849   +3.3 %
richards-wasm      Score:  91.998     90.688   -1.4 %
sqlite3-wasm       Score:  39.431     39.832   +1.0 %
tfjs-wasm          Score:   0.801      0.810   +1.1 %
tfjs-wasm-simd     Score:   1.559      1.590   +1.2 %
argon2-wasm        Score:  49.675     49.757   +0.2 %
8bitbench-wasm     Score:  10.562     10.519   -0.4 %
zlib-wasm          Score:  60.462     62.925   +4.1 %
                                              -------
                                               + 1.41% avg

Attachment #9494575 - Attachment is obsolete: true

Comprehensive (kitchen-sink) test/evaluation patch, for reference. Not proposed
as something we'd want to land.

This patch provides MIR and LIR modifications needed for reworking of the
EffectiveAddressAnalysis pass:

  • MEffectiveAddress is renamed to MEffectiveAddress3, to signify that this is a
    3-addend computation. LEffectiveAddress is similarly renamed.

  • new node type MEffectiveAddress2, a 2-addend node, signifying
    constant + (index << scale). In other words, the same as
    MEffectiveAddress3 but without the base operand. This is important
    for linear-address computation for some wasm inputs.

  • CodeGenerator-arm.cpp:

    • visitEA3: limit the scope of the scratch register
    • visitEA2: new method
  • CodeGenerator-arm64.cpp:

    • visitEA3: better handling for scale == 1 and/or displacement == 0
    • visitEA2: new method
  • CodeGenerator-riscv64.cpp: new method visitEA2

  • Assembler-x86-shared.h, BaseAssembler-x86-shared.h:

    • new memory operand kind MEM_SCALE_NOBASE, which is like MEM_SCALE,
      but without a base register.
    • add an emitter and printer for MEM_SCALE_NOBASE
  • CodeGenerator-x86-shared.cpp: new method visitEA2
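The address forms the two node types denote can be written out as plain arithmetic (the function names here are illustrative only, not the actual MIR API):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only -- not SpiderMonkey's MIR API.  The value each node
 * type computes: */

/* MEffectiveAddress3: displacement + base + (index << scale). */
static uint32_t effective_address3(int32_t disp, uint32_t base,
                                   uint32_t index, unsigned scale) {
    return (uint32_t)disp + base + (index << scale);
}

/* MEffectiveAddress2: the same, but with no base operand. */
static uint32_t effective_address2(int32_t disp, uint32_t index,
                                   unsigned scale) {
    return (uint32_t)disp + (index << scale);
}
```

On x86 the 2-addend form corresponds to the new MEM_SCALE_NOBASE operand kind: a scaled index plus displacement with no base register.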

This patch provides a new and simpler implementation of the EAA pass:

  • for JS inputs, the pass now does nothing

  • for wasm inputs, all previous transformations are gone, and replaced with a
    single transformation that recognises base + (index << shift) only.

The rationale for these changes is described in a comment at the top of the
file.

The traversal direction has been changed: matching now proceeds backwards
through basic blocks. This makes it possible to fold the same shift node into
multiple MEffectiveAddress{2,3} nodes, which is something the previous
implementation could not do. The new implementation also recognises and folds
the case where base is a constant value, a situation which is sometimes
important for wasm, but which the previous implementation could not handle.
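The backward-matching idea can be sketched over a toy IR (the data structures below are made up for illustration, not SpiderMonkey's MIR). Walking the block backwards, each memory access whose address matches add(x, shl(index, scale)) gets its own folded EA record, even when the shl node has several users -- the case the old forward pass could not handle:

```c
#include <assert.h>
#include <stddef.h>

/* Toy sketch of backward EA matching; types and names are hypothetical. */
typedef enum { OP_CONST, OP_SHL, OP_ADD, OP_LOAD } Op;

typedef struct Node {
    Op op;
    struct Node *lhs, *rhs;   /* operands, where applicable */
    unsigned imm;             /* shift amount, for OP_SHL */
    int folded;               /* did we fold an EA for this access? */
} Node;

/* Try to fold one access's address; returns 1 on success. */
static int try_fold_ea(Node *access) {
    Node *addr = access->lhs;
    if (addr->op == OP_ADD && addr->rhs->op == OP_SHL) {
        access->folded = 1;   /* would create an MEffectiveAddress node here */
        return 1;
    }
    return 0;
}

/* Backward pass over a block (array of node pointers, in program order).
 * Because matching happens per use, a shift with many users can be
 * folded into every access that consumes it. */
static int fold_block(Node **block, size_t n) {
    int folded = 0;
    for (size_t i = n; i-- > 0; )
        if (block[i]->op == OP_LOAD)
            folded += try_fold_ea(block[i]);
    return folded;
}
```

In the real pass the folded shift's original node becomes dead when all its users are absorbed, and is cleaned up by later dead-code elimination.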

Status: NEW → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → 143 Branch
Regressions: 1978731

JetStream3-Android:
3% improvement on HashSet-wasm
4.3% on quicksort-wasm
2.3% on tfjs-wasm-simd

Regressed by: 1978996
No longer regressed by: 1978996
Regressions: 1978996
Regressions: 1979265
QA Whiteboard: [qa-triage-done-c144/b143]
