ARM64: Better code generation for loads from / stores to constant heap offsets
Categories
(Core :: JavaScript: WebAssembly, enhancement, P3)
Tracking
Release | Tracking | Status |
---|---|---|
firefox90 | --- | fixed |
People
(Reporter: lth, Assigned: lth)
References
(Blocks 1 open bug)
Attachments
(1 file, 1 obsolete file)
Simple test case:
(module
  (memory 1)
  (func $f1 (result i32)
    (i32.load (i32.const 128))))
This is the core of the output:
0x34c80e78034 52800000 mov w0, #0x0
0x34c80e78038 91020010 add x16, x0, #0x80 (128)
0x34c80e7803c b8706aa0 ldr w0, [x21, x16]
This is pretty sad. The architecture has ldr [x21, immediate] for fairly large immediates, which we should use here. For bytes and halfwords, there are more limited instructions with smaller offset ranges that could be used. The same goes for stores, floating point, and SIMD. We should test all sizes.
In the case where the access has an offset, the offset may be foldable into the constant in the compiler.
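As a rough illustration of when the immediate form applies: the unsigned-offset form of LDR/STR on AArch64 encodes a 12-bit immediate scaled by the access size, so the reachable range depends on the width of the access. The sketch below checks whether a folded constant address fits that form. The function names are hypothetical and are not taken from the SpiderMonkey sources.

```python
# Sketch only: models the unsigned-offset LDR/STR immediate on AArch64,
# which is a 12-bit field scaled by the access size in bytes.
# Helper names here are hypothetical, not SpiderMonkey identifiers.

IMM12_MAX = 0xFFF  # largest value of the 12-bit immediate field


def max_scaled_offset(access_bytes: int) -> int:
    """Largest byte offset encodable for an access of the given size."""
    return IMM12_MAX * access_bytes


def fits_scaled_imm12(addr: int, access_bytes: int) -> bool:
    """True if addr can be used directly as [base, #addr] for this access size."""
    if access_bytes not in (1, 2, 4, 8, 16):
        raise ValueError("unsupported access size")
    # The immediate is unsigned and must be a multiple of the access size.
    if addr < 0 or addr % access_bytes != 0:
        return False
    return addr // access_bytes <= IMM12_MAX
```

Under this model the addresses 128 and 144 used in this bug both fit a 4-byte load, so a single `ldr w0, [x21, #imm]` would be enough, while byte accesses stop at 4095 and halfword accesses at 8190.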
Comment 1 • 4 years ago (Assignee)
Some experiments, need test cases + some documentation:
- generate much better code for constant-address accesses
- generate better code when storing zero
Comment 2 • 4 years ago (Assignee)
Piling on:
(module
  (memory 1)
  (func $f1 (result i32)
    (i32.load offset=16 (i32.const 128))))
turns into
0x1755cbc6034 52800000 mov w0, #0x0
0x1755cbc6038 91024010 add x16, x0, #0x90 (144)
0x1755cbc603c b8706aa0 ldr w0, [x21, x16]
which shows that the folding happens, as it should, but the code is bad.
In cases where the constant is too large, just loading it into a temp is still preferable to what's generated above. Not sure yet what motivates that addition.
Comment 3 • 4 years ago (Assignee)
(In reply to Lars T Hansen [:lth] from comment #2)
> Not sure yet what motivates that addition.
It is just a consequence of [base, offset] being converted to [0, base+offset] in WasmIonCompile when the offset won't overflow the guard limit, followed by naive code generation in the arm64 back-end, which loads zero into w0 and then adds in the offset.
I think I have a working POC now for the i32.load case. It's somewhat subtle because of obscure special cases in the front end, but it is manageable.
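The [base, offset] to [0, base+offset] conversion described in this comment can be sketched roughly as follows. This is only one plausible reading: the exact condition used in WasmIonCompile is not spelled out here, so the `offset_guard_limit` parameter, the 32-bit wraparound check, and the function name are all assumptions made for illustration.

```python
from typing import Optional

UINT32_MAX = 0xFFFFFFFF  # assumes 32-bit wasm memory addresses


def fold_constant_access(base: int, offset: int,
                         offset_guard_limit: int) -> Optional[int]:
    """Return the absolute address base+offset, or None if not folded.

    Hypothetical model: fold only when the offset is below the guard
    limit and the sum still fits in 32 bits.
    """
    if offset >= offset_guard_limit:
        return None  # assumed: large offsets take a different path
    absolute = base + offset
    if absolute > UINT32_MAX:
        return None  # assumed: never fold an address that would wrap
    return absolute
```

For the test case in comment 2, folding base 128 with offset 16 gives 144 (0x90), which matches the immediate in the generated `add`.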
Comment 4 • 4 years ago (Assignee)
This patch folds constant addresses and their offsets into absolute
addresses and then either embeds small absolute addresses directly in
the load instruction or loads the absolute address into a temp using
an optimal sequence, and then uses the temp as the load index.
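
For addresses too large to embed in the load, the fallback is to materialize the address in a temp. The sketch below shows a simple MOVZ/MOVK selection for a 32-bit value. It is not claimed to be the sequence the patch emits: a real assembler may also use MOVN or a logical-immediate ORR, and the function name and the default choice of `w16` as the temp are assumptions (the temp matches the x16 index register in the output shown earlier).

```python
from typing import List


def materialize_u32(value: int, reg: str = "w16") -> List[str]:
    """Return a short MOVZ/MOVK sequence that loads value into reg.

    Hypothetical helper; only considers MOVZ and MOVK on 16-bit halves.
    """
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value must fit in 32 bits")
    lo = value & 0xFFFF
    hi = value >> 16
    if hi == 0:
        # Fits in the low halfword (this also covers zero).
        return [f"movz {reg}, #{lo:#x}"]
    if lo == 0:
        # Only the high halfword is set: one shifted MOVZ suffices.
        return [f"movz {reg}, #{hi:#x}, lsl #16"]
    # Both halves are nonzero: set the low half, then patch in the high half.
    return [f"movz {reg}, #{lo:#x}", f"movk {reg}, #{hi:#x}, lsl #16"]
```

Writing the 32-bit `w16` zero-extends into `x16`, so the result can then serve as the index in `ldr w0, [x21, x16]`.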
Comment 6 • 4 years ago (bugherder)