Open Bug 1714280 Opened 3 years ago Updated 2 years ago

Ion x86_64: parameter-carrying registers forced into spill slots in the presence of calls

Categories

(Core :: JavaScript Engine: JIT, enhancement, P3)


People

(Reporter: jseward, Assigned: jseward)

References

(Blocks 2 open bugs)

Details

Attachments

(2 files, 3 obsolete files)

Under certain circumstances, Ion appears to force incoming parameters in
registers into spill slots and then accesses them only from the spill slots.
This is something I've seen repeatedly over the last month as part of
unrelated investigations into Ion's generated code quality for wasm. I
suspect this is a significant source of inefficiency, and in particular I
suspect this contributes to the slowness reported in bug 1708419.

The problem is easily reproduced with this:

(module
  ;; no-op fn to call
  (func)

  (func (param $p1 f64) (result f64)
    (local $f1 f64)
    ;; access both p1 and f1 four times
    (local.set $f1 (f64.add (local.get $f1) (local.get $p1)))
    (local.set $f1 (f64.sub (local.get $f1) (local.get $p1)))
    (local.set $f1 (f64.mul (local.get $f1) (local.get $p1)))
    (local.set $f1 (f64.div (local.get $f1) (local.get $p1)))
    (call 0)
    ;; and again
    (local.set $f1 (f64.add (local.get $f1) (local.get $p1)))
    (local.set $f1 (f64.sub (local.get $f1) (local.get $p1)))
    (local.set $f1 (f64.mul (local.get $f1) (local.get $p1)))
    (local.set $f1 (f64.div (local.get $f1) (local.get $p1)))
    (return (local.get $f1))
  )
  )
)

What we get, minus prologue/epilogue and other irrelevances, is

  movsd      %xmm0, 0x18(%rsp)  ;; p1 -> 0x18(rsp) [dead assignment]
  xorpd      %xmm1, %xmm1       ;; f1 = 0
  movsd      %xmm0, 0x10(%rsp)  ;; p1 -> 0x10(rsp)
  addsd      %xmm1, %xmm0       ;; f1 = p1 and f1 now lives in xmm0
  subsd      0x10(%rsp), %xmm0  ;; f1 -= p1
  mulsd      0x10(%rsp), %xmm0  ;; f1 *= p1
  divsd      0x10(%rsp), %xmm0  ;; f1 /= p1
  movsd      %xmm0, 0x18(%rsp)  ;; dump f1 on the stack
  call       .Lfrom144          ;; trashes xmm0, xmm1 I assume
  movsd      0x18(%rsp), %xmm0  ;; reload f1
  addsd      0x10(%rsp), %xmm0  ;; f1 += p1
  subsd      0x10(%rsp), %xmm0  ;; f1 -= p1
  mulsd      0x10(%rsp), %xmm0  ;; f1 *= p1
  divsd      0x10(%rsp), %xmm0  ;; f1 /= p1

Removing the (call 0) causes all the stack accesses to disappear:

  xorpd      %xmm2, %xmm2
  movapd     %xmm0, %xmm1
  addsd      %xmm2, %xmm0
  subsd      %xmm1, %xmm0
  mulsd      %xmm1, %xmm0
  divsd      %xmm1, %xmm0
  addsd      %xmm1, %xmm0
  subsd      %xmm1, %xmm0
  mulsd      %xmm1, %xmm0
  divsd      %xmm1, %xmm0

It is as if Ion decided (correctly) to split the live range for f1 (the
running total) across the call, but didn't split the range of p1 (the
parameter). I wonder if there is a heuristic in the splitting logic that says
"a live range attached to a fixed register may not be split (across a call?)".
That would explain this, since p1 is a live range with a fixed register
(xmm0) whereas f1 can be any register [notwithstanding the confusing
xmm0/xmm1 swap at the start].

Below are the results for x86_32, arm64 and arm32. The problem doesn't
manifest on arm64, so the comment 0 theory, that a fixed-reg live range can't
be split across a call, must be incorrect. arm32 isn't different in any
interesting way from arm64. x86_32 is different again: the param is
referenced as 0x10(%ebp) and there's no attempt to move it elsewhere,
whilst -- as with x86_64 -- the running total is indeed split across the call.

I wonder now if the x86_64 force-the-param-into-memory behaviour is something
it mistakenly inherited from Ion's x86_32 port, and whether it should instead
be adjusted to be more like arm32/64, given that x86_32 is the only one of
these targets that passes args exclusively in memory.

X86_32:
  xorpd      %xmm1, %xmm1
  movsd      0x10(%ebp), %xmm0
  addsd      %xmm1, %xmm0
  subsd      0x10(%ebp), %xmm0
  mulsd      0x10(%ebp), %xmm0
  divsd      0x10(%ebp), %xmm0
  movsd      %xmm0, 0x10(%esp) ;; NB 0x10(esp); the others are 0x10(ebp)
  call       .Lfrom143
  movsd      0x10(%esp), %xmm0 ;; ditto
  addsd      0x10(%ebp), %xmm0
  subsd      0x10(%ebp), %xmm0
  mulsd      0x10(%ebp), %xmm0
  divsd      0x10(%ebp), %xmm0

ARM64:
  str     d0, [x28, #24]
  fmov    d1, xzr
  fadd    d1, d0, d1
  fsub    d1, d1, d0
  fmul    d1, d1, d0
  fdiv    d0, d1, d0
  str     d0, [x28, #16]
  bl      #+0x0 (addr 0x5618d479ab10)
  ldr     d1, [x28, #16]
  ldr     d0, [x28, #24]
  fadd    d1, d1, d0
  fsub    d1, d1, d0
  fmul    d1, d1, d0
  fdiv    d0, d1, d0

ARM32:
  vstr d0, [sp + 4*4]
  vmov.f64 d1, #1
  vsub.f64 d1, d1, d1 ;; what? 1.0 - 1.0 is the new 0.0 ?
  vadd.f64 d1, d0, d1
  vsub.f64 d1, d1, d0
  vmul.f64 d1, d1, d0
  vdiv.f64 d0, d1, d0
  vstr d0, [sp + 4*2]
  bl  -> (link-time target)
  vldr d1, [sp + 4*2]
  vldr d0, [sp + 4*4]
  vadd.f64 d1, d1, d0
  vsub.f64 d1, d1, d0
  vmul.f64 d1, d1, d0
  vdiv.f64 d0, d1, d0

Yeah, something looks really weird here. I've just skimmed the regalloc spew:
the vreg v1 gets two bundles after grouping/queueing, one quite large bundle
consisting of many small ranges that more or less span the function, and one
single-range bundle that spans the call. The first bundle is split properly
but the second one is not, and since it collides with the call it gets
spilled completely.

Other things look pretty confusing too. Since this is not a problem on other
platforms, it could well be some problem with register assignments on x64.
On this platform, reused inputs are moved from class REGISTER to class ANY
early on, and this may confuse the allocator. It will be interesting to
contrast the regalloc spew on different platforms.

On arm64, the function's parameter and its various uses wind up in a single
bundle, with the constraint must-be-d0. This collides with the call and is
split across it, as expected:

  Allocating  v1 3-24 { 3_def 8_v1:R 10_v1:R 12_v1:R 14_v1:R 18_v1:R 20_v1:R 22_v1:R 24_v1:R } [priority 22] [weight 818]
    Requirement d0, fixed by definition
    d0 collides with fixed use v0 17-17 { }
    bundle does not contain hot code
    split across calls at 17
      splitting bundle  v1 3-24 { 3_def } into:
         v1 3-14 { 3_def 8_v1:R 10_v1:R 12_v1:R 14_v1:R }
         v1 18-24 { 18_v1:R 20_v1:R 22_v1:R 24_v1:R }
         v1 4-24 { }

On x86_64 it's more involved. The parameter's live range has the initial part
split off it, and the remaining main section -- composed of uses only -- is
the part that spans the call and that we hope will be split. Unfortunately
that's not what happens. Instead, it has neither a "requirement" nor a
register hint -- the use points are all marked "any" -- and so ends up being
caught by the second if-clause in
BacktrackingAllocator::tryAllocateNonFixed, and hence spilled:

  // Spill bundles which have no hint or register requirement.
  if (requirement.kind() == Requirement::NONE &&
      hint.kind() != Requirement::REGISTER) {

viz:

  Allocating  v1 8-25 { 11_v1:A 13_v1:A 15_v1:A 19_v1:A 21_v1:A 23_v1:A 25_v1:A } [priority 18] [weight 388]
    %xmm0.d collides with fixed use v0 17-17 { }
    ...
    %xmm14.d collides with fixed use v0 17-17 { }
    postponed spill (no register requirement)

If I disable this test (and a similar one below), then the bundle flows
through to splitting, but that leaves the bundle unchanged, which triggers a
splitting-had-no-effect assertion.
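To make the decision concrete, here is a minimal toy model (mine, not
SpiderMonkey code; the enum and function names are made up) of the quoted
if-clause in tryAllocateNonFixed: a bundle whose requirement is NONE and
whose hint is not REGISTER is spilled without any splitting attempt.

```python
from enum import Enum

class Req(Enum):
    # Simplified stand-ins for the allocator's Requirement kinds.
    NONE = 0
    REGISTER = 1
    FIXED = 2

def try_allocate_non_fixed(requirement: Req, hint: Req) -> str:
    """Mirror of the quoted if-clause: spill when there is no hint or
    register requirement, otherwise proceed to allocation/splitting."""
    if requirement == Req.NONE and hint != Req.REGISTER:
        return "spill"
    return "allocate"

# The x86_64 parameter bundle: all uses are "any", so after the initial
# split it carries neither a requirement nor a register hint.
print(try_allocate_non_fixed(Req.NONE, Req.NONE))    # spill

# The arm64 parameter bundle keeps its must-be-d0 definition, so it has a
# fixed requirement and flows on to be split across the call.
print(try_allocate_non_fixed(Req.FIXED, Req.NONE))   # allocate
```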

Open questions:

  • Should the initial split, that created v1 8-25 .. above, have copied the
    original must-be-xmm0 constraint to the fragments? Presumably not, since
    the reason for splitting a range is exactly to solve the problem that no one
    register can hold the entire original.

  • Why does the splitter have no effect on v1 8-25 above? Is it because
    there's no definition point? Or is it because the use points are all "any"?
    In the arm64 version, the range does have a definition point.

Attached patch Patch referred to by comment 4 (obsolete) — Splinter Review

Summary of current findings

On x86_64, the bundle of interest is composed only of uses (has no def) and
the uses are all marked "any". Hence it ends up being caught by the
second if-clause in BacktrackingAllocator::tryAllocateNonFixed, and hence
spilled, per comment 3.

Making that check more restrictive, so that a bundle with no hint or register
requirement will be spilled only when it has no defs or uses, does allow
this bundle to be split. That also requires two followup changes, which are
shown in the attached patch. That does get us more or less what we want, viz:

  movsd      %xmm0, 0x18(%rsp)
  xorpd      %xmm1, %xmm1
  movsd      %xmm0, 0x10(%rsp)
  addsd      %xmm1, %xmm0
  movsd      0x10(%rsp), %xmm1
  subsd      %xmm1, %xmm0
  mulsd      %xmm1, %xmm0
  divsd      %xmm1, %xmm0
  movsd      %xmm0, 0x18(%rsp)
  call       .Lfrom144
  movsd      0x18(%rsp), %xmm0  ;; reload f1
  movsd      0x10(%rsp), %xmm1  ;; reload p1
  addsd      %xmm1, %xmm0
  subsd      %xmm1, %xmm0
  mulsd      %xmm1, %xmm0
  divsd      %xmm1, %xmm0

However, the patch quickly leads to failures when compiling JS. Specifically,
it causes the allocator to sometimes insert a MoveGroup immediately before an
OsiPoint. The CodeGenerator clause for OsiPoint specifically asserts that
it is not preceded by a MoveGroup, since that would, so it seems, invalidate
the LSafepoint info that the OsiPoint refers to. Here's an example, as
usual omitting fluff:

  --------------------------------
  # block3 js/src/jit-test/tests/wasm/binary-slow.js:15:20 (loop header):
                                  # LIR=InterruptCheck
  movabsq    $0x55bc31ad6328, %r11
  cmpl       $0x0, 0x0(%r11)
  jne        .Lfrom1156
  .set .Llabel1156, .
                                  # LIR=MoveGroup
  movq       0x28(%rsp), %rcx
                                  # LIR=OsiPoint
  .set .Llabel1161, .
Assertion failure: !iter->isMoveGroup(), at js/src/jit/CodeGenerator.cpp:3548

Here, the OsiPoint has been separated from a preceding InterruptCheck. The
same failure has been observed to happen after a Call.

It is unclear:

(1) what, if anything, stops the RA inserting a MoveGroup before an OsiPoint
in the unmodified RA sources?

(2) what to do about this. Can the OsiPoint's LSafepoint be updated to take
account of changes created by the MoveGroup?

(3) if we are relying on some fragile invariant for (1), would it be
feasible/safer for any LIR that requires an OsiPoint to carry the
relevant field(s) directly, and remove the OsiPoint LIR entirely? That
way, there's no possibility that an (implied) OsiPoint can get separated
from the LIR that it describes.

See Also: → 1382643
Severity: -- → N/A
Priority: -- → P1

(In reply to Julian Seward [:jseward] from comment #4)

(1) what, if anything, stops the RA inserting a MoveGroup before an OsiPoint
in the unmodified RA sources?

This is still unclear. I do notice that all OsiPoint LIRs I've seen have
arguments (uses) that are either "c" (constants) or (for virtual registers)
"KEEPALIVE", and I wonder if that's related. The RA does mention special
treatment of KEEPALIVE uses a couple of times. The definition is

    // Keep the used virtual register alive, and use whatever allocation is
    // available. This is similar to ANY but hints to the register allocator
    // that it is never useful to optimize this site.
    KEEPALIVE,

What does "optimize this site" mean?

(2) what to do about this. Can the OsiPoint's LSafepoint be updated to take
account of changes created by the MoveGroup?

Evidently no. The OsiPoint is intended to describe the instruction that
immediately precedes it. If it is updated to take account of an intervening
MoveGroup, then what it would describe is the situation after the MoveGroup,
but would be used to examine the world at the point of the described
instruction, before the MoveGroup.

(3) if we are relying on some fragile invariant for (1), would it be
feasible/safer for any LIR that requires an OsiPoint to carry the
relevant field(s) directly, and remove the OsiPoint LIR entirely?

No (or at least not easily). OsiPoints are emitted after circa 40 different
LIR kinds, per Lowering.cpp.

OsiPoints are On-Stack Invalidation points: fake LIR instructions which the
invalidation mechanism expects to be able to patch in case an invalidation
occurs.

The OsiPoint is allocated at the same time we lower the MIR instruction which
can exit the code and return into invalidated code (see assignSafepoint).

When the OsiPoint is allocated, it is assigned the post-snapshot of the
instruction. The post-snapshot defines the layout of the stack expected once
the instruction has completed its execution. It is allocated with the virtual
registers which exist at the time this code was inserted.

The problem can occur, for example, if we attempt to recover the stack (such
as for fun.arguments) using the OsiIndex's snapshot, which is the
post-snapshot held by the OsiPoint. If we have a MoveGroup in between, then
the snapshot will capture the state after the MoveGroup. Recovering data from
the snapshot can then lead to corrupted frame recovery, and potentially to
leaking information.

This problem arises from the fact that a LIR instruction only has a single
snapshot entry, and the OsiPoint's snapshot is a convenient holder for it,
without adding a second snapshot pointer to all LIR instructions.
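A toy model (my sketch, not SpiderMonkey code; the register names and state
dictionaries are invented) of why an intervening MoveGroup is unsound: the
snapshot maps each value to a location that is only valid after the
MoveGroup has run, yet invalidation wants the world as it was at the
described instruction.

```python
def recover(snapshot, machine_state):
    """Read each named value out of the machine state at the location
    recorded for it in the snapshot."""
    return {name: machine_state[loc] for name, loc in snapshot.items()}

# Hypothetical snapshot: it claims value x lives in rcx.
snapshot_loc = {"x": "rcx"}

# Machine state right after the described instruction (rcx not yet loaded),
# and after a MoveGroup like `movq 0x28(%rsp), %rcx` has moved x into rcx.
state_at_instr    = {"rcx": "stale", "stack28": 42}
state_at_osipoint = {"rcx": 42,      "stack28": 42}

print(recover(snapshot_loc, state_at_osipoint))  # {'x': 42} -- valid
print(recover(snapshot_loc, state_at_instr))     # {'x': 'stale'} -- corrupt
```

This is exactly the mismatch described above: consulting the post-MoveGroup
snapshot against the pre-MoveGroup world recovers garbage.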

(In reply to Julian Seward [:jseward] from comment #4)

(1) what, if anything, stops the RA inserting a MoveGroup before an OsiPoint
in the unmodified RA sources?

I do not know enough of the register allocator to know if such a thing exists.

(2) what to do about this. Can the OsiPoint's LSafepoint be updated to take
account of changes created by the MoveGroup?

The problem is that the snapshot of the OsiPoint is used for two things. It
is used to recover the stack slots (while debugging, for fun.arguments, etc…)
and it is also used to bail out from the OsiPoint once it is patched. Thus we
have two consumers which expect the same interpretation.

Therefore, we cannot patch the snapshot as long as it is used for both cases.

(3) if we are relying on some fragile invariant for (1), would it be
feasible/safer for any LIR that requires an OsiPoint to carry the
relevant field(s) directly, and remove the OsiPoint LIR entirely? That
way, there's no possibility that an (implied) OsiPoint can get separated
from the LIR that it describes.

Yes, it would be possible, but it was avoided so as not to add an extra
LSnapshot pointer to every LIR instruction.

We cannot remove the OsiPoint as it is used to bail out from when the code is
invalidated, and we need the register recovery of the post-snapshot to
properly interpret the return value of the call. So any bailout has to happen
after the LIR under which the code got invalidated.

Attached patch Patch referred to by comment 7 (obsolete) — Splinter Review

Summary of status, 14 June 2021

(1) The patch at comment 4 does cause splitting across calls to happen on
x86_64.

(2) Unfortunately -- also per comment 4 -- this causes MoveGroups to be
inserted before OsiPoints, causing an assertion failure.

(3) Fortunately, (2) happens only when compiling JS, for reasons unknown.
This means that it is possible to evaluate the patch (1) at least for wasm
by running the shell with --no-ion.

(4) Using (1) on wasm/Embenchen gives promising results, shown in the
attachment at comment 8. Overall, a reduction in loads of around 8%,
sometimes insignificant, sometimes dramatically so (64% reduction for
wasm_conditionals). An increase in stores averaging around 0.7%, more
than offset by the reduction in loads. Unfortunately also an increase in
the overall instruction count averaging 1.65%. The Mandelbrot benchmark
at bug 1382643 is affected similarly: although the memory transaction
count falls, the run time increases.

(5) The increased number of instructions happens -- again for reasons unknown
-- because now every use of a spilled value is turned into an explicit
reload followed by a use, even in the case where the value is only used
once, eg:
        movsd 0x10(%rsp), %xmmTmp
        addsd %xmmTmp, %xmm2
    vs
        addsd 0x10(%rsp), %xmm2
This is no problem on load-store architectures (eg AArch64) but is clearly
a big loss on x86_{32,64}.

(6) I tried to "fix" (5) by causing the allocator to spill any bundle having
no def, only one use, and no register requirement, rather than trying to
allocate it [patch attached]. This works for the trivial example in
comment 0, but for all real inputs fails in
BacktrackingAllocator::pickStackSlots on
MOZ_ASSERT(!bundle->allocation().isBogus()).

So it seems promising, but delivering this means fixing both (6), which seems
like something to do with spill ranges and allocations, and (2), which has the
feel of an interface problem (excessive coupling) between the allocator and
the lowering pass.

See Also: → 1708419

Further observations and simplifications, 2022-Feb-01. Re-analysing
with a simpler test case and with unmodified code base. Test case:

(func (param $p1 f64) (result f64)
  (local $f1 f64)
  (local.set $f1 (f64.sub (local.get $f1) (local.get $p1)))
  (local.set $f1 (f64.sub (local.get $f1) (local.get $p1)))
  (call 0)
  (local.set $f1 (f64.div (local.get $f1) (local.get $p1)))
  (local.set $f1 (f64.div (local.get $f1) (local.get $p1)))
  (return (local.get $f1))
)
)

Initial LIR:
  2-3 WasmParameter [def v1<d>:%xmm0.d]
  4-5 WasmParameter [def v2<g>:r14]
  6-7 Double [def v3<d>]
  8-9 MathD [def v4<d>:tied(0)] [use v3:R] [use v1:A]
  10-11 MathD [def v5<d>:tied(0)] [use v4:R] [use v1:A]
  12-13 WasmCall [use v2:F:r14]
  14-15 MathD [def v6<d>:tied(0)] [use v5:R] [use v1:A]
  16-17 MathD [def v7<d>:tied(0)] [use v6:R] [use v1:A]
  18-19 WasmReturn [use v2:F:r14] [use v7:F:%xmm0.s]

Live ranges by bundle (after grouping/queueing regs):
  LB1(parent=none v1 3-17 { 3_def:F:%xmm0.d 9_v1:A 11_v1:A 15_v1:A 17_v1:A })
  LB2(parent=none v2 5-19 { 5_def:F:r14 12_v2:F:r14 19_v2:F:r14 })
  LB7(parent=none v3 7-8 { 7_def 8_v3:A }
               ## v4 9-10 { 9_def 10_v4:A }
               ## v5 11-14 { 11_def 14_v5:A }
               ## v6 15-16 { 15_def 16_v6:A }
               ## v7 17-19 { 17_def 19_v7:F:%xmm0.s })

LB2 is the TLS pointer (uninteresting; ignored below)
LB1 is p1 (the param)
LB7 is f1 (the running value)

LB1 (p1) collides with call at 13. It is split, nonsensically, into

  • LB9 (initial def only)
  • LB8 (all the uses; no def)
Splitting LB1(parent=none v1 3-17 { 3_def:F:%xmm0.d 9_v1:A 11_v1:A 15_v1:A
                                    17_v1:A }) ..
  .. bundle does not contain hot code
  .. split across calls at 13
  .. into:
  LB9(parent=LB8 v1 3-3 { 3_def:F:%xmm0.d })
  LB8(parent=none v1 4-17 { 9_v1:A 11_v1:A 15_v1:A 17_v1:A })

LB8 (p1) still collides with the call so it is spilled. Duh.

LB7 (f1) collides with call at 13. It is split, this time sensibly, into

  • LB11 (def and all uses before the call)
  • LB12 (all uses after the call)
  • LB10 (spill parent; entire range except the initial def; no uses and no def)
Splitting LB7(parent=none v3 7-8 { 7_def 8_v3:A } ## v4 9-10 { 9_def 10_v4:A }
                 ## v5 11-14 { 11_def 14_v5:A } ## v6 15-16 { 15_def 16_v6:A }
                 ## v7 17-19 { 17_def 19_v7:F:%xmm0.s }) ..
  .. bundle does not contain hot code
  .. split across calls at 13
  .. into:
  LB11(parent=LB10 v3 7-8 { 7_def 8_v3:A } ## v4 9-10 { 9_def 10_v4:A } 
                            ## v5 11-11 { 11_def })
  LB12(parent=LB10 v5 14-14 { 14_v5:A } ## v6 15-16 { 15_def 16_v6:A }
                              ## v7 17-19 { 17_def 19_v7:F:%xmm0.s })
  LB10(parent=none v3 8-8 { } ## v4 10-10 { } ## v5 12-14 { } ## v6 16-16 { }
                   ## v7 18-19 { })

LB12 allocated to xmm0, as expected; holds f1 after the call
LB11 allocated to xmm0, as expected; holds f1 before the call
LB10 spilled, as expected; holds f1 across the call
(so, handling of f1 is fine)

But for p1 ..
LB9 (initial def only) is allocated to xmm0, fine
LB8 (all uses) is live across the call, so is spilled at all 4 uses

Hence: xmm0 (f1) saved/restored across the call, in 0x10(%rsp), as expected.
But p1 is always spilled, hence in 0x18(%rsp).

  subsdq 0x18(%rsp), %xmm0
  subsdq 0x18(%rsp), %xmm0
  movsdq %xmm0, 0x10(%rsp)
  call 0xFFFFFFFFFFFFFFF0
  movsdq 0x10(%rsp), %xmm0
  divsdq 0x18(%rsp), %xmm0
  divsdq 0x18(%rsp), %xmm0

So the question is narrowed to: why did splitting across the call do the right
thing for LB7 but the wrong thing for LB1?

Interestingly, BacktrackingAllocator::splitAcrossCalls is pretty short. It
merely identifies the call position(s) within the bundle, correctly, per the
output above, then tailcalls splitAt, asking it to split at those positions
(in both cases, only 1). So the trail leads back to splitAt.
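The shape of splitAcrossCalls, as described above, can be modelled in a few
lines (this is my sketch of the described behaviour, not the real C++; the
function signature is invented): it filters the call positions down to those
inside the bundle's range and hands them to splitAt.

```python
def split_across_calls(bundle_start, bundle_end, call_positions, split_at):
    """Find the call positions that fall inside the bundle, then delegate
    the actual splitting to split_at (a stand-in for splitAt)."""
    inside = [p for p in call_positions if bundle_start <= p <= bundle_end]
    return split_at(inside)

# For both LB1 (3-17) and LB7 (7-19) above, only the call at 13 qualifies,
# so both bundles reach splitAt with the same single split position:
print(split_across_calls(3, 17, [13], lambda positions: positions))  # [13]
print(split_across_calls(7, 19, [13], lambda positions: positions))  # [13]
```

Since both bundles present splitAt with identical split positions, the
divergent outcomes for LB1 and LB7 must come from splitAt itself.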

This happens because the central splitting loop in splitAt does not consider
an ANY-class use to count as a register use. The attached patch "fixes" it by
causing ANY-class uses to be classified as register uses provided the range
has a fixed-register definition. That idiom -- a fixed-register def followed
by multiple ANY uses -- is exactly what you'd expect for an x86_64 parameter.

The patch at least doesn't break anything, AFAICT.
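The classification change can be sketched as follows (my simplified model of
the described patch, not the actual splitAt code; the policy strings and the
`patched` flag are invented for illustration):

```python
def is_register_use(use_policy, def_is_fixed_reg, patched):
    """Decide whether a use should count as a 'register use' for the
    purposes of splitAt's central loop."""
    if use_policy == "REGISTER":
        return True
    if patched and use_policy == "ANY" and def_is_fixed_reg:
        # The patch's addition: ANY uses of a value defined in a fixed
        # register (the x86_64 parameter idiom) now count as register uses.
        return True
    return False

# Unpatched, the parameter's ANY uses are invisible to the splitter, so
# splitAt carves out no register fragments and the bundle is spilled:
print(is_register_use("ANY", True, patched=False))  # False
print(is_register_use("ANY", True, patched=True))   # True
```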

This reduces spill/reload traffic in some tests, but causes a big regression
for rust-fannkuch. There exist deeper problems, resulting from spill bundles
overlapping with their children, for example bug 1752520.

Attachment #9225837 - Attachment is obsolete: true
Attachment #9226764 - Attachment is obsolete: true
Attachment #9226765 - Attachment is obsolete: true
Blocks: sm-regalloc
Flags: needinfo?(sdetar)
Priority: P1 → P3

Assignee: nobody → jseward