Open Bug 1713180 Opened 4 years ago Updated 5 months ago

ARM64: generate better code for cyclic move groups

Categories

(Core :: JavaScript Engine: JIT, enhancement, P3)

ARM64
All
enhancement

Tracking

()

REOPENED
101 Branch
Tracking Status
firefox101 --- affected

People

(Reporter: lth, Assigned: rhunt)

References

(Blocks 5 open bugs)

Details

Attachments

(1 file)

(Spun off from bug 1701164)

Consider:

  (func (param i32) (param $p i64) (result i64)
    (i64.sub (i64.const 0) (local.get $p)))

(notice the extra dead parameter which will occupy r0 and make r1 unavailable for $p) where the body turns into this horror:

0x47fc302c  e1a01003  e1a01003       mov r1, r3
0x47fc3030  e1a00002  e1a00002       mov r0, r2
0x47fc3034  e24dd008  e24dd008       sub sp, sp, #8
0x47fc3038  e58d1000  e58d1000       str r1, [sp, #+0]
0x47fc303c  e1a01000  e1a01000       mov r1, r0
0x47fc3040  e59d0000  e59d0000       ldr r0, [sp, #+0]
0x47fc3044  e28dd008  e28dd008       add sp, sp, #8
0x47fc3048  e2711000  e2711000       rsbs r1, r1, #0    ;; negate!
0x47fc304c  e2e00000  e2e00000       rsc r0, r0, #0     ;;  in case you missed it
0x47fc3050  e24dd008  e24dd008       sub sp, sp, #8
0x47fc3054  e58d0000  e58d0000       str r0, [sp, #+0]
0x47fc3058  e1a00001  e1a00001       mov r0, r1
0x47fc305c  e59d1000  e59d1000       ldr r1, [sp, #+0]
0x47fc3060  e28dd008  e28dd008       add sp, sp, #8

because the function body is elaborated to this by the register allocator:

[RegAlloc] Register Allocation Integrity State:
[RegAlloc]   Block 0
[RegAlloc]     [2,3 WasmParameter] [def v1<i>:r0]
[RegAlloc]     [4,5 WasmParameterI64] [def v2<g>:r2] [def v3<g>:r3]
[RegAlloc]     [6,7 WasmParameter] [def v4<g>:r9]
[RegAlloc]     [MoveGroup] [r3 -> r1] [r2 -> r0]
[RegAlloc]     [MoveGroup] [r1 -> r0] [r0 -> r1]
[RegAlloc]     [8,9 NegI64] [def v5<g>:r1] [def v6<g>:r0] [use v2:r r1] [use v3:r r0]
[RegAlloc]     [MoveGroup] [r0 -> r1] [r1 -> r0]
[RegAlloc]     [10,11 WasmReturnI64] [use v5:r0 r0] [use v6:r1 r1] [use v4:r9 r9]

and the move resolver deals with the r0 <-> r1 register swaps by going via memory with ad-hoc stack allocation. There's no obvious reason why a register would not be used for that, there are plenty available. Even going via the scratch register would be better, and this platform has one. Even going via the FP scratch would probably be better. Even the XOR trick might be better, certainly when it's only two registers.

The above is ARM code, but ARM64 appears to have the same issue. It may be harder to provoke a meltdown like that on ARM64, though.

Also see bug 1701164, which is about some redundant move groups being emitted as above (r2+r3 -> r0+r1 followed by r0 <-> r1; there's no reason for the initial moves not to move directly into the correct registers).

We have good move cycle handling on x86/x64. This code is tied in with calculations about when to use xchg vs when to do other things, but this seems incidental - every platform has the ability to xchg, if only via the XOR trick. Thus the x86/x64 cycle handling code should probably be lifted to common code and it should be parameterized by the expected cost of various strategies.

At a minimum, the cycle handling code (and the move resolver in general, frankly) ought to be able to make use of a temp register when the platform has a dedicated temp, because the temp should be available when moves are resolved. We have a dedicated temp on all platforms except the increasingly irrelevant x86.

(In reply to Lars T Hansen [:lth] from comment #1)

Thus the x86/x64 cycle handling code should probably be lifted to common code and it should be parameterized by the expected cost of various strategies.

I will note that ARM (and MIPS32?) have to deal with float/double/vector register aliasing which does not exists on other platform.

(In reply to Nicolas B. Pierron [:nbp] from comment #2)

(In reply to Lars T Hansen [:lth] from comment #1)

Thus the x86/x64 cycle handling code should probably be lifted to common code and it should be parameterized by the expected cost of various strategies.

I will note that ARM (and MIPS32?) have to deal with float/double/vector register aliasing which does not exists on other platform.

Yeah, good point re the "high" f32 registers aliasing double regs on ARM. (The MIPS32 port is officially dead, whatever the situation is there, and I more or less expect we will never port SIMD to ARM so the vector registers will not be an issue.)

Mostly I'm worried about ARM64, really, and it would be good to share the code if we can. At least ARM64 has the same FP aliasing rules as x86.

Severity: -- → N/A
Type: defect → enhancement
Priority: -- → P2
Assignee: nobody → jseward
Status: NEW → ASSIGNED
Priority: P2 → P3

I took a look at this. I'd like to focus it on ARM64 to avoid dealing with the single/double aliasing issues of ARM32.

Assignee: jseward → rhunt
Summary: ARM/ARM64: generate better code for cyclic move groups → ARM64: generate better code for cyclic move groups

This commit modifies ARM64's move emitter to use a scratch GPR as
the storage for most cycles that need to be broken. The only
exception is for SIMD128 cycles, which fall back to allocating
space on the stack.

Pushed by rhunt@eqrion.net: https://hg.mozilla.org/integration/autoland/rev/ede0b744903b [arm64] Use scratch register to break most move cycles. r=nbp
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 101 Branch

This needs:

  • a whitebox codegen test case (ideally several, to cover the several cases)
  • a solution for the SIMD case, because that actually matters
  • a solution for the FP case that does not include int<->fp moves (probably what solves the SIMD case will solve this case too)

I would expect memory to memory FP moves to be rare, and if so it would be better for those to use the GPR scratch (SIMD would require two loads and two stores) if that allows the FP scratch to be used for optimizing FP and SIMD moves in most cases - assuming it's possible of course to make the logic work out.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Blocks: sm-regalloc

Is this still valid?

Yeah this is still valid.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: