ARM64: generate better code for cyclic move groups
Categories
(Core :: JavaScript Engine: JIT, enhancement, P3)
Tracking
()
| Tracking | Status | |
|---|---|---|
| firefox101 | --- | affected |
People
(Reporter: lth, Assigned: rhunt)
References
(Blocks 5 open bugs)
Details
Attachments
(1 file)
(Spun off from bug 1701164)
Consider:
(func (param i32) (param $p i64) (result i64)
(i64.sub (i64.const 0) (local.get $p)))
(notice the extra dead parameter which will occupy r0 and make r1 unavailable for $p) where the body turns into this horror:
0x47fc302c e1a01003 e1a01003 mov r1, r3
0x47fc3030 e1a00002 e1a00002 mov r0, r2
0x47fc3034 e24dd008 e24dd008 sub sp, sp, #8
0x47fc3038 e58d1000 e58d1000 str r1, [sp, #+0]
0x47fc303c e1a01000 e1a01000 mov r1, r0
0x47fc3040 e59d0000 e59d0000 ldr r0, [sp, #+0]
0x47fc3044 e28dd008 e28dd008 add sp, sp, #8
0x47fc3048 e2711000 e2711000 rsbs r1, r1, #0 ;; negate!
0x47fc304c e2e00000 e2e00000 rsc r0, r0, #0 ;; in case you missed it
0x47fc3050 e24dd008 e24dd008 sub sp, sp, #8
0x47fc3054 e58d0000 e58d0000 str r0, [sp, #+0]
0x47fc3058 e1a00001 e1a00001 mov r0, r1
0x47fc305c e59d1000 e59d1000 ldr r1, [sp, #+0]
0x47fc3060 e28dd008 e28dd008 add sp, sp, #8
because the function body is elaborated to this by the register allocator:
[RegAlloc] Register Allocation Integrity State:
[RegAlloc] Block 0
[RegAlloc] [2,3 WasmParameter] [def v1<i>:r0]
[RegAlloc] [4,5 WasmParameterI64] [def v2<g>:r2] [def v3<g>:r3]
[RegAlloc] [6,7 WasmParameter] [def v4<g>:r9]
[RegAlloc] [MoveGroup] [r3 -> r1] [r2 -> r0]
[RegAlloc] [MoveGroup] [r1 -> r0] [r0 -> r1]
[RegAlloc] [8,9 NegI64] [def v5<g>:r1] [def v6<g>:r0] [use v2:r r1] [use v3:r r0]
[RegAlloc] [MoveGroup] [r0 -> r1] [r1 -> r0]
[RegAlloc] [10,11 WasmReturnI64] [use v5:r0 r0] [use v6:r1 r1] [use v4:r9 r9]
and the move resolver deals with the r0 <-> r1 register swaps by going via memory with ad-hoc stack allocation. There's no obvious reason why a register would not be used for that, there are plenty available. Even going via the scratch register would be better, and this platform has one. Even going via the FP scratch would probably be better. Even the XOR trick might be better, certainly when it's only two registers.
The above is ARM code, but ARM64 appears to have the same issue. It may be harder to provoke a meltdown like that on ARM64, though.
Also see bug 1701164, which is about some redundant move groups being emitted as above (r2+r3 -> r0+r1 followed by r0 <-> r1; there's no reason for the initial moves not to move directly into the correct registers).
| Reporter | ||
Comment 1•4 years ago
|
||
We have good move cycle handling on x86/x64. This code is tied in with calculations about when to use xchg vs when to do other things, but this seems incidental - every platform has the ability to xchg, if only via the XOR trick. Thus the x86/x64 cycle handling code should probably be lifted to common code and it should be parameterized by the expected cost of various strategies.
At a minimum, the cycle handling code (and the move resolver in general, frankly) ought to be able to make use of a temp register when the platform has a dedicated temp, because the temp should be available when moves are resolved. We have a dedicated temp on all platforms except the increasingly irrelevant x86.
Comment 2•4 years ago
|
||
(In reply to Lars T Hansen [:lth] from comment #1)
Thus the x86/x64 cycle handling code should probably be lifted to common code and it should be parameterized by the expected cost of various strategies.
I will note that ARM (and MIPS32?) have to deal with float/double/vector register aliasing which does not exists on other platform.
| Reporter | ||
Comment 3•4 years ago
|
||
(In reply to Nicolas B. Pierron [:nbp] from comment #2)
(In reply to Lars T Hansen [:lth] from comment #1)
Thus the x86/x64 cycle handling code should probably be lifted to common code and it should be parameterized by the expected cost of various strategies.
I will note that ARM (and MIPS32?) have to deal with float/double/vector register aliasing which does not exists on other platform.
Yeah, good point re the "high" f32 registers aliasing double regs on ARM. (The MIPS32 port is officially dead, whatever the situation is there, and I more or less expect we will never port SIMD to ARM so the vector registers will not be an issue.)
Mostly I'm worried about ARM64, really, and it would be good to share the code if we can. At least ARM64 has the same FP aliasing rules as x86.
Updated•4 years ago
|
| Reporter | ||
Updated•4 years ago
|
| Reporter | ||
Updated•4 years ago
|
| Assignee | ||
Comment 4•4 years ago
|
||
I took a look at this. I'd like to focus it on ARM64 to avoid dealing with the single/double aliasing issues of ARM32.
| Assignee | ||
Comment 5•4 years ago
|
||
This commit modifies ARM64's move emitter to use a scratch GPR as
the storage for most cycles that need to be broken. The only
exception is for SIMD128 cycles, which fall back to allocating
space on the stack.
Comment 7•4 years ago
|
||
| bugherder | ||
| Reporter | ||
Comment 8•4 years ago
•
|
||
This needs:
- a whitebox codegen test case (ideally several, to cover the several cases)
- a solution for the SIMD case, because that actually matters
- a solution for the FP case that does not include int<->fp moves (probably what solves the SIMD case will solve this case too)
I would expect memory to memory FP moves to be rare, and if so it would be better for those to use the GPR scratch (SIMD would require two loads and two stores) if that allows the FP scratch to be used for optimizing FP and SIMD moves in most cases - assuming it's possible of course to make the logic work out.
| Reporter | ||
Updated•4 years ago
|
Updated•4 years ago
|
Comment 9•5 months ago
|
||
Is this still valid?
| Assignee | ||
Comment 10•5 months ago
|
||
Yeah this is still valid.
Description
•