Open Bug 1701164 Opened 3 years ago Updated 2 years ago

[meta] Redundant MoveGroups and other Move badness

Categories

(Core :: JavaScript Engine: JIT, enhancement, P3)

enhancement

Tracking

()

People

(Reporter: lth, Unassigned)

References

(Depends on 3 open bugs, Blocks 3 open bugs)

Details

(Keywords: meta)

A little too often I observe that the regalloc inserts apparently redundant MoveGroups when inputs and outputs to an operation are constrained (param registers, output registers). Frequently this gives rise to code that moves redundantly via a temp that is then immediately dead. I'll dig up some examples and make dependent bugs.

These kinds of redundancies matter more for wasm than for JS probably, but it will increase register pressure and instruction counts for all.

Here's an example:

wasmDis(new WebAssembly.Module(wasmTextToBinary(`
  (module
    (func (param v128) (param v128) (result v128)
      (i32x4.add (local.get 1) (local.get 0))))
`)))

The core of the code generated for this is:

00000024  66 0f 6f d1               movdqa %xmm1, %xmm2
00000028  66 0f 6f ca               movdqa %xmm2, %xmm1
0000002C  66 0f fe c8               paddd %xmm0, %xmm1
00000030  66 0f 6f c1               movdqa %xmm1, %xmm0

which is not great. Integer SIMD addition is commutative so the optimal code is just the addition; almost-optimal code has at most one move (the one at the end). But even ignoring that, the first two moves are clearly redundant, as xmm2 is dead and the move accomplishes nothing.

(Note the swapped operand order. If the body were param0 + param1 then the machine code is simply the paddd.)

Regalloc reports that the input is this:

[RegAlloc]     [2,3 WasmParameter] [def v1<simd128>:%xmm0.i4]
[RegAlloc]     [4,5 WasmParameter] [def v2<simd128>:%xmm1.i4]
[RegAlloc]     [6,7 WasmParameter] [def v3<g>:r14]
[RegAlloc]     [8,9 WasmBinarySimd128] [def v4<simd128>:tied(0)] [use v2:r?] [use v1:r]
[RegAlloc]     [10,11 WasmReturn] [use v4:%xmm0.d] [use v3:r14]

but at the end the IR looks like this:

[RegAlloc]     [2,3 WasmParameter] [def v1<simd128>:%xmm0.i4]
[RegAlloc]     [4,5 WasmParameter] [def v2<simd128>:%xmm1.i4]
[RegAlloc]     [MoveGroup] [%xmm1.i4 -> %xmm2.i4]
[RegAlloc]     [6,7 WasmParameter] [def v3<g>:r14]
[RegAlloc]     [MoveGroup] [%xmm2.i4 -> %xmm1.i4]
[RegAlloc]     [8,9 WasmBinarySimd128] [def v4<simd128>:%xmm1.i4] [use v2:r %xmm1.i4] [use v1:r %xmm0.i4]
[RegAlloc]     [MoveGroup] [%xmm1.i4 -> %xmm0.i4]
[RegAlloc]     [10,11 WasmReturn] [use v4:%xmm0.d %xmm0.i4] [use v3:r14 r14]

Here's another one:

wasmDis(new WebAssembly.Module(wasmTextToBinary(`
  (module
    (memory (export "mem") 1)
    (func (param i32) (result i32)
      (i32.load (local.get 0))))
`)))

Run this with --disable-wasm-huge-memory --large-arraybuffers and the regalloc is handed this:

[RegAlloc]     [2,3 WasmParameter] [def v1<i>:rdi]
[RegAlloc]     [4,5 WasmParameter] [def v2<g>:r14]
[RegAlloc]     [6,7 WasmLoadTls] [def v3<g>] [use v2:r]
[RegAlloc]     [8,9 WasmExtendU32Index] [def v4<g>:tied(0)] [use v1:r?]
[RegAlloc]     [10,11 WasmBoundsCheck64] [def v5<g>:tied(0)] [use v4:r?] [use v3:r]
[RegAlloc]     [12,13 WasmWrapU32Index] [def v6<i>:tied(0)] [use v5:r?]
[RegAlloc]     [14,15 WasmLoad] [def v7<i>] [use v6:r]
[RegAlloc]     [16,17 WasmReturn] [use v7:rax] [use v2:r14]

where the nodes at 8 and and 12 emit no code on x64 and use defineReuseInput. And yet redundant moves are inserted by the regalloc:

[RegAlloc]     [2,3 WasmParameter] [def v1<i>:rdi]
[RegAlloc]     [4,5 WasmParameter] [def v2<g>:r14]
[RegAlloc]     [6,7 WasmLoadTls] [def v3<g>:rax] [use v2:r r14]
[RegAlloc]     [MoveGroup] [rdi -> rcx]
[RegAlloc]     [8,9 WasmExtendU32Index] [def v4<g>:rcx] [use v1:r rcx]
[RegAlloc]     [10,11 WasmBoundsCheck64] [def v5<g>:rcx] [use v4:r rcx] [use v3:r rax]
[RegAlloc]     [MoveGroup] [rcx -> rax]
[RegAlloc]     [12,13 WasmWrapU32Index] [def v6<i>:rax] [use v5:r rax]
[RegAlloc]     [14,15 WasmLoad] [def v7<i>:rax] [use v6:r rax] [use bogus]
[RegAlloc]     [16,17 WasmReturn] [use v7:rax rax] [use v2:r14 r14]

In this case there could be an interesting complication since the extend and wrap nodes are technically changing the MIR types. But on every 64-bit platform the 32-bit and 64-bit GPRs overlap, so this is not a great explanation.

This bug currently bites really hard on ARM when manipulating 64-bit data. Consider:

  (func (param i32) (param $p i64) (result i64)
    (i64.sub (i64.const 0) (local.get $p)))

(notice the extra dead parameter) where the body turns into this:

0x47fc302c  e1a01003  e1a01003       mov r1, r3
0x47fc3030  e1a00002  e1a00002       mov r0, r2
0x47fc3034  e24dd008  e24dd008       sub sp, sp, #8
0x47fc3038  e58d1000  e58d1000       str r1, [sp, #+0]
0x47fc303c  e1a01000  e1a01000       mov r1, r0
0x47fc3040  e59d0000  e59d0000       ldr r0, [sp, #+0]
0x47fc3044  e28dd008  e28dd008       add sp, sp, #8
0x47fc3048  e2711000  e2711000       rsbs r1, r1, #0
0x47fc304c  e2e00000  e2e00000       rsc r0, r0, #0
0x47fc3050  e24dd008  e24dd008       sub sp, sp, #8
0x47fc3054  e58d0000  e58d0000       str r0, [sp, #+0]
0x47fc3058  e1a00001  e1a00001       mov r0, r1
0x47fc305c  e59d1000  e59d1000       ldr r1, [sp, #+0]
0x47fc3060  e28dd008  e28dd008       add sp, sp, #8

because the function body is elaborated to this by the register allocator:

[RegAlloc] Register Allocation Integrity State:
[RegAlloc]   Block 0
[RegAlloc]     [2,3 WasmParameter] [def v1<i>:r0]
[RegAlloc]     [4,5 WasmParameterI64] [def v2<g>:r2] [def v3<g>:r3]
[RegAlloc]     [6,7 WasmParameter] [def v4<g>:r9]
[RegAlloc]     [MoveGroup] [r3 -> r1] [r2 -> r0]
[RegAlloc]     [MoveGroup] [r1 -> r0] [r0 -> r1]
[RegAlloc]     [8,9 NegI64] [def v5<g>:r1] [def v6<g>:r0] [use v2:r r1] [use v3:r r0]
[RegAlloc]     [MoveGroup] [r0 -> r1] [r1 -> r0]
[RegAlloc]     [10,11 WasmReturnI64] [use v5:r0 r0] [use v6:r1 r1] [use v4:r9 r9]

It looks like 10 of those instructions are basically r0 <-> r1 register swaps through memory with ad-hoc stack allocation, really ugly and something we ought to be able to specialize. But there's no obvious reason why a register would not be used for that, there are plenty available. So maybe there's an underlying bug.

Depends on: 1713180

Looks like there's been no attempt to optimize simple reg->reg move cycles on ARM or ARM64, unlike on x86. Filed as bug 1713180.

Summary: [meta] Redundant MoveGroups → [meta] Redundant MoveGroups and other Move badness
Blocks: wasm-jit-bugs
No longer blocks: wasm-lang
Depends on: 1746596
Blocks: sm-regalloc
You need to log in before you can comment on or make changes to this bug.