Closed Bug 1709863 Opened 4 years ago Closed 4 years ago

ARM64: Better code generation for loads from / stores to constant heap offsets

Categories

(Core :: JavaScript: WebAssembly, enhancement, P3)

ARM64
All
enhancement

RESOLVED FIXED
90 Branch
Tracking Status
firefox90 --- fixed

People

(Reporter: lth, Assigned: lth)

References

(Blocks 1 open bug)

Attachments

(1 file, 1 obsolete file)

Simple test case:

(module
  (memory 1)
  (func $f1 (result i32)
    (i32.load (i32.const 128))))

This is the core of the output:

0x34c80e78034  52800000  mov     w0, #0x0
0x34c80e78038  91020010  add     x16, x0, #0x80 (128)
0x34c80e7803c  b8706aa0  ldr     w0, [x21, x16]

This is pretty sad. The architecture has ldr [x21, #immediate] for fairly large immediates, which we should use here. For bytes and halfwords, there are more limited instructions with smaller offset ranges that could be used. The same goes for stores, floating point, and SIMD. We should test all sizes.
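As a rough sketch of the offset ranges in question: the ARM64 LDR/STR unsigned-offset form takes a 12-bit immediate scaled by the access size, and the LDUR/STUR form takes a signed 9-bit unscaled byte offset. The helper names below are hypothetical and are not taken from the SpiderMonkey sources; this only illustrates which constant offsets could be embedded directly in the instruction.

```python
def fits_scaled_uimm12(offset: int, size: int) -> bool:
    """LDR/STR unsigned-offset form: 12-bit immediate, scaled by access size.

    `size` is the access width in bytes (1, 2, 4, 8, or 16 for a Q register).
    The offset must be a non-negative multiple of `size`, at most 4095 * size.
    """
    return offset >= 0 and offset % size == 0 and offset // size <= 0xFFF


def fits_unscaled_simm9(offset: int) -> bool:
    """LDUR/STUR form: signed 9-bit byte offset, no alignment requirement."""
    return -256 <= offset <= 255


def can_use_immediate(offset: int, size: int) -> bool:
    """True if either immediate form can encode the offset (hypothetical helper)."""
    return fits_scaled_uimm12(offset, size) or fits_unscaled_simm9(offset)


# Largest scaled offset per access size, in bytes.
MAX_SCALED = {size: 0xFFF * size for size in (1, 2, 4, 8, 16)}
```

Under these rules the test case's offset of 128 fits the 32-bit form easily, while byte accesses top out at 4095 and halfword accesses at 8190.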

In the case where the access has an offset, the offset may be foldable into the constant in the compiler.

Some experiments; these need test cases and some documentation:

  • generate much better code for constant-address accesses
  • generate better code when storing zero

Piling on:

(module
  (memory 1)
  (func $f1 (result i32)
    (i32.load offset=16 (i32.const 128))))

turns into

0x1755cbc6034  52800000  mov     w0, #0x0
0x1755cbc6038  91024010  add     x16, x0, #0x90 (144)
0x1755cbc603c  b8706aa0  ldr     w0, [x21, x16]

which shows that the folding happens, as it should, but the code is bad.

In cases where the constant is too large, just loading it into a temp is still preferable to what's generated above. Not sure yet what motivates that addition.

Assignee: nobody → lhansen
Status: NEW → ASSIGNED

(In reply to Lars T Hansen [:lth] from comment #2)

> Not sure yet what motivates that addition.

It's just a consequence of [base, offset] being converted to [0, base+offset] in WasmIonCompile when the offset won't overflow the guard limit, followed by naive code generation in the arm64 back-end, which loads zero into w0 and then adds in the offset.
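A minimal sketch of that interaction, under stated assumptions: the function names are hypothetical, the guard limit is passed in as a plain parameter (its real value and exact comparison are not given in this bug), and the emitted strings only mirror the disassembly shown earlier.

```python
from typing import List, Optional, Tuple


def fold_constant_base(base: int, offset: int,
                       guard_limit: int) -> Optional[Tuple[int, int]]:
    """Rewrite [base, offset] as [0, base + offset] when that is allowed.

    Assumption: folding is refused if the sum leaves the 32-bit range or
    reaches the guard limit; otherwise the base becomes the constant 0.
    """
    folded = base + offset
    if folded > 0xFFFFFFFF or folded >= guard_limit:
        return None
    return (0, folded)


def naive_lowering(base: int, offset: int) -> List[str]:
    """Mimic the unoptimized sequence: load base, add offset, indexed load."""
    return [
        f"mov w0, #{base:#x}",
        f"add x16, x0, #{offset:#x}",
        "ldr w0, [x21, x16]",
    ]


# The "piling on" test case: constant 128 with offset=16.
pair = fold_constant_base(128, 16, guard_limit=1 << 20)  # hypothetical limit
```

With `pair == (0, 144)`, `naive_lowering(*pair)` reproduces the shape of the three instructions in the second listing: a zero base in w0, then an add of 0x90.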

I think I have a working POC now for the i32.load case. It's somewhat subtle because of obscure special cases in the front end, but it is manageable.

This patch folds constant addresses and their offsets into absolute
addresses and then either embeds small absolute addresses directly in
the load instruction or loads the absolute address into a temp using
an optimal sequence, and then uses the temp as the load index.
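
The strategy in that description can be sketched roughly as follows for a 32-bit load. This is a simplified illustration, not the patch itself: the function names are hypothetical, the registers (x21 as base, x16 as temp, w0 as destination) are borrowed from the disassembly above, and the materialization only uses movz/movk, whereas a truly optimal sequence may also use other encodings.

```python
from typing import List


def materialize_u32(value: int, reg: str = "w16") -> List[str]:
    """Load a 32-bit constant into `reg` with at most two instructions."""
    assert 0 <= value <= 0xFFFFFFFF
    lo = value & 0xFFFF
    hi = value >> 16
    if hi == 0:
        return [f"movz {reg}, #{lo:#x}"]
    if lo == 0:
        return [f"movz {reg}, #{hi:#x}, lsl #16"]
    return [f"movz {reg}, #{lo:#x}", f"movk {reg}, #{hi:#x}, lsl #16"]


def lower_const_load32(addr: int) -> List[str]:
    """Pick between the immediate form and a temp-register form.

    `addr` is the already folded absolute address (constant plus offset).
    """
    # Small, 4-byte-aligned addresses fit the scaled 12-bit immediate.
    if addr % 4 == 0 and addr // 4 <= 0xFFF:
        return [f"ldr w0, [x21, #{addr}]"]
    # Otherwise put the address in the temp and use it as the load index.
    # Writing w16 zero-extends into x16, so x16 holds the full address.
    return materialize_u32(addr) + ["ldr w0, [x21, x16]"]
```

For the folded address 144 from the earlier test case, this yields a single `ldr w0, [x21, #144]` instead of the three-instruction sequence.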

Attachment #9221081 - Attachment description: WIP: Bug 1709863 - Generate good ARM64 code for loads from constant addresses. → Bug 1709863 - Improve ARM64 code for loads from/stores to constant addresses. r?yury
Pushed by lhansen@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/2eea3e40c7ed Improve ARM64 code for loads from/stores to constant addresses. r=yury
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 90 Branch
Attachment #9220614 - Attachment is obsolete: true