Open Bug 1758274 Opened 2 years ago Updated 2 months ago

Ion's RA: implement and evaluate an alternative splitting strategy

Categories

(Core :: JavaScript Engine: JIT, enhancement, P3)

People

(Reporter: jseward, Assigned: jseward)

References

(Blocks 1 open bug)

Details

Attachments

(5 files, 15 obsolete files)


In the best case this would result in a usable replacement for the existing
splitting logic. If not, it should at least generate documentation and
understanding for what currently exists, and for the problem in general.

Attached patch splitting-X0.diff-10-Xi-Xo (obsolete) — Splinter Review

Status 2022Mar06 (c.f. splitting-X0.diff-10-Xi-Xo)

  • Implemented BacktrackingAllocator::qsplitAt, a replacement for ::splitAt
    that splits precisely at the points specified. With comments! Seems to do
    at least that task OK. It does not (yet) handle the no-points-specified
    (split-for-spill) case though.

  • Status: it now asserts/segfaults in the (badly misnamed) routine
    BacktrackingAllocator::resolveControlFlow.

  • I still don't know the exact meaning of the points in the splitPositions
    vector. Tried three different semantics, MaybeSplitRangeAt_Xi_Xo,
    MaybeSplitRangeAt_NewAtPos, MaybeSplitRangeAt_NewAfterPos, but none of these
    appear to match what the original splitAt does. Settled on using the _Xi_Xo
    variant as that seems to me the most "natural" semantics.

  • Implemented BacktrackingAllocator::qTrySplitAcrossCalls, a replacement for
    ::splitAcrossCalls that is aimed at splitting across calls incrementally
    rather than all-at-once. Also aimed at using the MaybeSplitRangeAt_Xi_Xo
    semantics. Seems like a good start but needs more work.

  • The existing BacktrackingAllocator::splitAt doesn't split at the points
    requested, instead possibly earlier or later. I wonder now if this is a
    case of confusing policy (where to split, computed by ::trySplit*) with
    implementation (actually doing the split, ::splitAt). This strikes me as
    suboptimal because the heuristics for splitting across calls and loops are
    in conflict, so a single set of heuristics in ::splitAt can't possibly be
    optimal for both.

    For both calls and loops, ranges that are live across them really need to be
    split into 3 pieces: before the call/loop, the call/loop itself, and after
    the call/loop. But the requirements are different:

    • For a loop, we want to maximise the chance that the middle section gets a
      register. So it should be as short as possible, covering exactly only the
      loop and nothing before or after it.

    • For a call, we want to minimise total interference. Rather than spill the
      value immediately before the call and reload it immediately afterwards, it
      would be better to spill it right after the last ref (def or use) before
      the call, and right before the first use after the call. In other words,
      the middle section should be made as long as possible.

    It would seem to me the way to achieve that is for ::trySplitAcrossHotcode
    and ::splitAcrossCalls to compute exactly the places to split at (policy)
    and ::splitAt to do exactly that (implementation).
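    The across-calls requirement (make the middle section as long as possible)
    can be made concrete with a small sketch. The flat uint32_t positions and
    the name splitPointsForCall are hypothetical simplifications, not the
    allocator's real API; the real code works in terms of CodePosition and
    LiveRange:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of the across-calls split policy: spill just after the last ref
// (def or use) before the call, reload just before the first ref after it,
// so the middle (spilled) section is as long as possible.
struct CallSplit {
  uint32_t spillAfter;   // end of the "before" piece
  uint32_t reloadBefore; // start of the "after" piece
};

// refs: positions of the range's refs, sorted ascending, assumed to have
// at least one ref on each side of callPos.
CallSplit splitPointsForCall(const std::vector<uint32_t>& refs,
                             uint32_t callPos) {
  uint32_t lastBefore = 0;
  for (uint32_t r : refs) {
    if (r < callPos) lastBefore = r;  // last ref strictly before the call
  }
  // First ref strictly after the call (requires sorted input).
  auto it = std::upper_bound(refs.begin(), refs.end(), callPos);
  uint32_t firstAfter = *it;
  return CallSplit{lastBefore + 1, firstAfter};
}
```

    For a loop the same two-cut scheme applies, but the policy chooses the cuts
    to hug the loop boundaries instead, so the middle piece is as short as
    possible.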

  • Next steps:

    • create a split-for-spill routine (replacement for ::splitAt with an empty
      splitPositions vector)

    • investigate/fix asserting/segfaulting in ::resolveControlFlow, so we
      once again have runnable and hence evaluateable code.

Status 2022Mar16 (c.f. splitting-X0.diff-23-kinda-working)

The allocation pipeline conceptually has 3 stages:

(1) liveness analysis (::buildLivenessInfo, ::mergeAndQueueRegister)

(2) main allocation loops, including splitting (::processBundle,
::pickStackSlots)

(3) rewriting (::resolveControlFlow, ::reifyAllocations, ::populateSafepoints,
::annotateMoveGroups)

This patch reimplements the splitting logic in (2). The rest of the bundle
processing in (2), and all of (1) and (3), are unchanged.

  • rewritten ::splitAt to do "precise" (children exactly cover the original)
    non-spill splitting

  • new function ::splitForSpill to do splitting for when a range is to be
    spilled -- different requirements than for ::splitAt

  • new function ::exactifySpillBundles to re-impose non-overlap invariants
    after splitting

  • rewritten ::[try]splitAcrossCalls to impose a new across-calls strategy

Collectively these impose known invariants on the splitting process, separate
policy from implementation (choosing where to split vs actually doing the
splits), and overall make splitting behaviour easier to understand.
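The "precise" splitting contract (children exactly cover the original, with
cuts exactly where the caller asked) can be sketched as follows. The Range
struct and preciseSplitAt are illustrative stand-ins, not the real LiveRange
machinery:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of precise (non-spill) splitting: the returned pieces partition
// the original range exactly, cutting only at the requested positions.
// Policy (choosing the cuts) is thereby fully separated from mechanism
// (this function).
struct Range { uint32_t from, to; };  // half-open [from, to)

// cuts: assumed sorted ascending and strictly inside (r.from, r.to).
std::vector<Range> preciseSplitAt(Range r, const std::vector<uint32_t>& cuts) {
  std::vector<Range> pieces;
  uint32_t start = r.from;
  for (uint32_t c : cuts) {
    pieces.push_back({start, c});  // piece ends exactly at the cut
    start = c;
  }
  pieces.push_back({start, r.to});  // final piece runs to the original end
  return pieces;
}
```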

The contracts holding at the (1)-(2) and (2)-(3) boundaries are still unclear
and require further work. ::resolveControlFlow at a minimum will require
rewriting. There remains much uncertainty.

Allocation of most functions still fails. However, for a couple of
previously analysed test cases, the splitting now behaves as desired. In
particular for this, as compiled with emcc -O1:

#include <stdlib.h>

#define NUM 100000  /* 0x186a0, per the generated code below */

int main(void) {
  float* a1 = calloc(NUM, sizeof(float));
  float* a2 = calloc(NUM, sizeof(float));

  float sum = 0.0;
  for (int i = 0; i < NUM; i++) {
    sum += a1[i] * a2[i];
  }

  free(a1); free(a2);
  return (int)sum;
}

it now produces an optimal loop body, with 2 memory references (plus the
interrupt check) per iteration. Doing so requires splitting a1 across the
second calloc call, a2 across the first free call, sum across both
free calls, and dealing with the fixed-reg conflict at opposite ends of the
a1 and a2 ranges. The existing allocator spills all three variables and
produces 6 refs (plus i.c.) in the loop body:

After                              Before

.set .Llabel160, .                 .set .Llabel144, .
cmpl  $0x0, 0x48(%r14)             cmpl  $0x0, 0x40(%r14)
je    .Lfrom171je                  je    .Lfrom155
ud2                                ud2
.set  .Llabel171, .                .set  .Llabel155, .
movl  %esi, %ebx                   movl  %eax, %ecx
shll  $2, %ebx                     shll  $2, %ecx
movl  %edx, %eax                   movl  0x18(%rsp), %edx  -- reload
addl  %ebx, %eax                   addl  %ecx, %edx
movss 0x0(%r15,%rax,1), %xmm1      movss 0x0(%r15,%rdx,1), %xmm0
movl  %ecx, %eax                   movl  0x14(%rsp), %edx  -- reload
addl  %ebx, %eax                   addl  %ecx, %edx
movss 0x0(%r15,%rax,1), %xmm2      movss 0x0(%r15,%rdx,1), %xmm1
mulss %xmm2, %xmm1                 mulss %xmm1, %xmm0
                                   movss 0x1c(%rsp), %xmm1 -- reload
addss %xmm1, %xmm0                 addss %xmm0, %xmm1
                                   movss %xmm1, 0x1c(%rsp) -- spill
addl  $1, %esi                     addl  $1, %eax
cmpl  $0x186a0, %esi               cmpl  $0x186a0, %eax
jne   .Llabel160                   jne   .Llabel144
Attachment #9266677 - Attachment is obsolete: true
Attached patch splitting-X0.diff-24-final (obsolete) — Splinter Review

TL;DR

This is the final version of the splitting-X0 (Splitting Experiment #0)
patches. The patch is able to correctly allocate a couple of small test cases
and stays alive long enough for a more in-depth assessment on the wasm
jit-tests.

It is clear now that the design of the patch cannot work as-is. But developing it
has cast a lot of light on how the existing allocator works, and points the
way to a possible rework that is more likely to be successful.

                           ----------------

In marginally more detail:

This patch introduces two main changes relative to the existing allocator:

  1. ::splitAt has been reworked so as to split exactly at the points
    requested. This allows a clean separation of mechanism (implementing
    splits) from policy (deciding where to have them). Together with a rework
    of ::trySplitAcrossCalls, this appears to give much better (viz, actually
    understandable) splitting behaviour across calls and loops, at least for
    the tiny test case in comment 2, and one other. This change is worth
    keeping and is not discussed further.

  2. The existing allocator generates overlapping bundles when splitting. This
    was a cause for concern -- what would that even mean? -- and so this patch
    implements a strict no-overlap policy: when a bundle is split, all of the
    resulting pieces, including the spill bundle, are mutually
    non-overlapping. (It's a bit more complex than that, but that is the end
    result.)

    Although that worked for tiny test cases and even for some of the wasm
    test suite, the wasm tests also make clear this is unworkable. Creating
    overlapping bundles is actually necessary in the most general case.
    However:

    • how the existing allocator does that leads to situations where we have
      unnecessary spills, particularly in loops. Hence a rework is desirable.

    • we now have an answer to "what is the meaning of overlapping bundles?"

                           ----------------
      

In gory detail:

"SSA and the Art of Overlapping Bundles" (with hat-tip to Robert M. Pirsig)

Addressing (2) in detail. What does it mean for bundles to overlap? First
let's see why a strict no-overlap policy is unworkable. Consider this LIR:

148-149 Float32 [def v32<f>]
150-151 WasmCall [use v32:F:xmm0] [use v32:F:xmm1] [use v3:F:r14]

This defines v32 as a float value, assigned a constant by the first LIR,
which is then used twice in the second LIR. The floatness and constantness
are irrelevant. What matters is that the uses -- because they are for a call
-- produce a fixed-register conflict: the value needs to be in both xmm0 and
xmm1 simultaneously.

The (patched) allocator produces the following bundle to encapsulate the def
and two uses:

LB95(p=none) v32 149-150 { 149_def:R 150_v32:F:xmm0 150_v32:F:xmm1 }

The fixed-reg conflict means this can't be allocated as-is, so it's handed off
to the splitter. That is constrained by the no-overlap rule, so its only
option is to split between the two LIRs, producing:

LB411(p=LB410) v32 149-149 { 149_def:R }
LB412(p=LB410) v32 150-150 { 150_v32:F:xmm0 150_v32:F:xmm1 }

But this is useless: LB412 is still un-allocatable, but now it's unsplittable
(and unspillable) too. The allocator asserts, and it's clear something is
broken at a conceptual level.

The original allocator (obviously) handles this without difficulty by
splitting this into three bundles, two of which overlap:

LB287(p=LB285) v32 149-149 { 149_def }
LB288(p=LB285) v32 150-150 { 150_v32:F:xmm0 }
LB289(p=LB285) v32 150-150 { 150_v32:F:xmm1 }
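The shape of that resolution, one single-position piece per fixed-register use
at the same point, with the pieces deliberately overlapping, can be sketched
with hypothetical types (FixedUse, Piece, splitPerFixedUse are illustrative,
not the allocator's real names):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Sketch: when one position carries several fixed-register uses of the
// same value, the only workable split is one single-position piece per
// fixed use. The pieces overlap; what that overlap means is the subject
// of the discussion below.
struct FixedUse { uint32_t pos; std::string reg; };
struct Piece { uint32_t from, to; std::string fixedReg; };

std::vector<Piece> splitPerFixedUse(const std::vector<FixedUse>& uses) {
  std::vector<Piece> pieces;
  for (const FixedUse& u : uses) {
    // One piece covering exactly the use's position, pinned to its register.
    pieces.push_back({u.pos, u.pos, u.reg});
  }
  return pieces;
}
```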

But what does such an overlap mean? And in particular, what is the
MoveGroup-generation phase (::resolveControlFlow) supposed to do with it?

The answer lies in the fact that the input LIR's virtual regs (the vXX
values) are in SSA form. The examples below show how to create MoveGroups for
such a case, using straight-line code (within a basic block). The same
principles hold in the presence of arbitrary control flow -- the only
difficulty there is the added complexity of deciding where to put the
MoveGroups.

Recall that an LIR virtual register (vreg) will have a bunch of live ranges
attached to it. Because the vreg is SSA, only one of those ranges may contain
a definition. All the rest must contain only uses. Hence it's OK to have
overlapping ranges in certain circumstances, because they must all by
definition hold the same value. So when we need to "use" the value, we can
take it from any range that is live at the use point -- they are all the same.

A live range -- even with no defs and no uses -- represents a request for
storage of a value (the vreg) in some (statically) consecutive range of
instructions. Each range will need to "acquire" that value from somewhere. A
range that contains a definition is considered to start at the point of the
definition, so the definition provides the value. Non-defining ranges will
need to get a copy of the value from some other range, and it is this "get a
copy" mechanism that the MoveGroup-generator (::resolveControlFlow)
implements.

Consider this, with instructions left-to-right across the page:

(v42 10-20 { 10_def })  (v42 21-29 { 25_use 27_use })

where each (..) pair is one range, and all (by definition) are for the same
vreg, v42. Note that the actual allocations for each bundle are not shown
-- they are important for constructing the MoveGroups but are irrelevant here.

The first range defines its own value, so we ignore it. The second range
needs to get its value from somewhere, and fortunately the first range flows
directly into it. So ::resolveControlFlow must insert a MoveGroup in between
points 20 and 21.

Now consider this, with overlapping bundles:

(v42 10-20 { 10_def })  (v42 21-29 { 25_use 27_use })
            (v42 15-25 { 19_use })

As before we need a MoveGroup at the 20/21 transition. The third range will
also need an initial value. Where can we get it from? Well, from any range that
happens to be live at the third range's start point. In this case we can choose
only the defining group. Hence we can place a MoveGroup at the 14/15
transition. In fact, it can be anywhere before the first use in the third
group: that is, any transition in 14/15 to 18/19.
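That window of legal MoveGroup placements can be computed directly. The helper
legalMoveTransitions below is hypothetical, not something the allocator has;
it just restates the example's reasoning in code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Sketch: a non-defining range may acquire its value at any instruction
// transition n/n+1 from just before the range starts up to just before
// its first use, provided the source range is still live there. Assumes
// destStart >= 1 (flat positions, as in the example above).
struct TransitionWindow { uint32_t first, last; };  // transitions n/n+1

TransitionWindow legalMoveTransitions(uint32_t destStart, uint32_t destFirstUse,
                                      uint32_t srcFrom, uint32_t srcTo) {
  uint32_t first = std::max(destStart - 1, srcFrom);  // source must exist
  uint32_t last = std::min(destFirstUse - 1, srcTo);  // before the first use
  return {first, last};
}
```

    For the example above (dest range 15-25 with first use at 19, source range
    live 10-20), this yields transitions 14/15 through 18/19, as stated.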

With this in mind, ::resolveControlFlow now makes a whole lot more sense.
What it basically does is to visit all of the ranges of a vreg, in increasing
order of their starting points, ignoring any overlaps. For each new range, it
(presumably) tries to find either a predecessor or an overlapping range, and
copies the value from it:

LiveRange* predecessorRange =
    reg.rangeFor(start.previous(), /* preferRegister = */ true);
..
if (!moveInput(ins->toInstruction(), predecessorRange, range, reg.type())) {

This explains a couple of other things. Firstly, this:

// The range which defines the register does not have a predecessor
// to add moves from.
if (range->hasDefinition()) {
  iter++;
  continue;
}

Of course .. a defining range doesn't need to acquire its value from anywhere
else. So it is ignored.

Secondly, ::splitAt has the curious behaviour that any spill bundle created
starts from the point immediately following the definition point:

CodePosition from = range->from();
if (isRegisterDefinition(range)) {
  from = minimalDefEnd(insData[from]).next();
}

Why is that? It is so that the "insert MoveGroup every time a new non-def
range appears" strategy will cause a spill to be inserted immediately after
the def. Hence the value is safely spilled. This is suboptimal though; see
below.

I believe that this overall design also has the advantage that:

  • it avoids the "re-spilling" that is a natural consequence of a strict
    no-overlap discipline. Here, if a range-for-reload comes to an end, so
    what? We generate no MoveGroup. Other ranges-for-reload will have to pull
    a copy from some other simultaneously live range (one in the spill bundle),
    but that's fine.

  • more generally, it gives flexibility in where a non-defining group can
    acquire its value from. It can pull it from any simultaneously-live or
    directly-flows-to-here range. If one of those has been assigned to a stack
    slot and another to a register, then we can take a copy from the register as
    that's cheaper.

                             ----------------
    

The design of overlapping bundles and how they interact with MoveGroup generation
is certainly ingenious. But the above, and experimentation, does show some
weaknesses:

  1. ::resolveControlFlow again mixes up policy with implementation. It would
    be better if it limited itself to implementation. Recall this example:
(v42 10-20 { 10_def })  (v42 21-29 { 25_use 27_use })
            (v42 15-25 { 19_use })

The transfer from the first to the third ranges can happen anywhere in 14/15
to 18/19. ::resolveControlFlow is hardwired to use the first opportunity, viz
14/15, even if a later transition would be better.

  2. The above, combined with the fact that spill bundles are defined to start
    at the point immediately after the original's definition, leads to bad results
    (observed in the wild). Consider the following loop, written SSAishly for
    clarity, with each x1/2/3 representing a bundle:
int x1 = 0;
loop {
   x2 = phi(x1, x3);
   x3 = x2 + 1;
}
call foo()
.. use x3 ..

x3 will be split into (at least) two bundles: its defining point till the
end of the loop, and the point just after the loop onwards. The latter will
be further split across the call, but that's irrelevant here. What matters is
that the first split will have created a spill bundle that (per ::splitAt's
behaviour) starts immediately after x3's definition -- inside the loop.
Unfortunately that means that ::resolveControlFlow will spill it at that point
-- inside the loop, quite unnecessarily. What we really need is for that
spill to happen after the loop.

                           ----------------

So what now? Possibly:

  • Think up a new, more restricted semantics for splitting, that ..

  • .. still allows overlapping bundles (because it must), but ..

  • .. that retains the advantages of "split where I tell you to, and not at
    some other random place", as detailed far above ..

  • .. and that allows placement of MoveGroups (per ::resolveControlFlow) to be
    precisely controlled. In other words, move policy decisions out of
    ::resolveControlFlow.

Attachment #9267987 - Attachment is obsolete: true
Attached patch splitting-X1-11.diff (obsolete) — Splinter Review

This is a majorly reworked version of the patch, based on the observations in
comment 3. It reinstates the existing scheme of allowing creation of
overlapping bundles when splitting, but only when splitting to implement
spilling (split-for-spill). For ad-hoc splitting (around calls, loops, etc)
exact splitting is still enforced.

More test cases work now. On x86_64-Linux, there are 20 failures out of 2468
jit_tests for wasm, mostly in the JS support code. And overall 88 failures
out of 9224 jit_tests. So large amounts of JS also work.

The top level changes in the patch are (with many names q-prefixed to avoid
collisions with old code):

  • new exact-split routine ::qsplitAt, for non-spill splitting, plus helpers

  • new split-for-spill routine ::qsplitForSpill

  • new split-across-call policy routine, ::qtrySplitAcrossCalls

  • dedicated splitting to deal with fixed stack refs (wasm multi-value
    returns), ::qtrySplitOffFixedStackRefs

  • major rewrite of ::trySplitAfterLastRegisterUse

New and modified code is extensively commented. A top level block comment
describing how the parts hang together is in preparation but is not present in
this patch.

The following are arguably failures in separation-of-concerns across the
allocator interface. They cause additional complication in the allocator that
might be better handled outside the allocator:

  • OsiPoint insns. MoveGroups may not be inserted between an OsiPoint and the
    insn that precedes it. It follows that splitting is not allowed there.
    This causes a bunch of special-casing throughout the splitting logic -- in
    effect "insn ; OsiPoint" must be handled as a single insn.

  • fixed stack refs for wasm multivalue returns. These appear to require a
    special splitting rule because they sidestep the standard bundle-allocation
    mechanism. I wonder if there's a cleaner way to implement this.
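The OsiPoint constraint above, that "insn ; OsiPoint" must behave as a single
instruction for splitting purposes, can be sketched as a position-normalising
helper. unitEnd and the vector<bool> representation are illustrative
assumptions, not the allocator's actual handling:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch: since no MoveGroup (and hence no split) may be inserted between
// an instruction and the OsiPoint that follows it, any candidate split
// position inside such a pair must be pushed past the OsiPoint.
// isOsiPoint[p] is true iff position p holds an OsiPoint, which by
// assumption immediately follows its instruction.
uint32_t unitEnd(const std::vector<bool>& isOsiPoint, uint32_t pos) {
  if (pos + 1 < isOsiPoint.size() && isOsiPoint[pos + 1]) {
    return pos + 1;  // pos is an insn with an attached OsiPoint: extend
  }
  return pos;  // ordinary instruction: the unit is just pos itself
}
```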

As it stands it is still unknown whether the patch can serve as the basis for
improvements to the allocator. That requires at a minimum:

  • passing all tests, also on x86_32 and arm64, so we have some assurance that
    the rework has no terminal design-level problems

  • getting it to work when spill-bundle creation is optionally disabled for
    ad-hoc splitting. Spill bundles cause excessive interference and hence
    copying/spilling, so are to be avoided where possible. This patch creates a
    spill bundle for all splits, as an interim measure.

Debugging continues.

Attachment #9268404 - Attachment is obsolete: true
Attached patch splitting-X1-14.diff (obsolete) — Splinter Review

With further bug fixes. No fails on jit_tests (wasm) with
--wasm-compiler=ion. 6 fails of 9306 tests for a full jit_tests run:

2 preexisting (unrelated to RA)
3 look like wrong code
1 harmless splitting assertion
Attachment #9269354 - Attachment is obsolete: true
Attached patch splitting-X1-16.diff (obsolete) — Splinter Review

With further bug fixes. No fails on jit_tests (wasm) on x86_64-linux with
--wasm-compiler=ion. 3 fails in 9306 tests for a full jit_tests run:

2 preexisting (unrelated to RA)
1 harmless splitting assertion

Fails more on x86(_32)-linux due to non-handling of implicit dependencies
between LWasmCall and LWasmRegister{,Pair}Result. Is WIP.

Attachment #9269468 - Attachment is obsolete: true
Blocks: sm-regalloc
Attached patch splitting-X1-20-BASIS.diff (obsolete) — Splinter Review

A further bug-fixed patch usable as a basis for investigation of tuning
possibilities. Now passes tests on all 4 primary platforms:

x86_64-linux [9310| 2| 0| 0]
x86_32-linux [9307| 5| 0| 0]
arm64 (sim) [9303| 9| 1| 0]
arm32 (sim) [9307| 5| 0| 0]

Of the failures, one/two are unrelated to RA, and the rest are due to the
allocated code being different in tests that check that explicitly, that is,
wasm/ion-adhoc-multiplatform.js et al.

Attachment #9269880 - Attachment is obsolete: true
Attached patch splitting-X1-24.diff (obsolete) — Splinter Review

A cleaned-up version of the X1-20 patch, that is probably buildable and usable
directly against m-c.

Attachment #9271557 - Attachment is obsolete: true

Here are some numbers for wasm tests on x64 and x32, using the X1-24 patch in
comment 8. The patch is often effective in removing reads (reloads) but also
causes large numbers of extra writes (spills). This is under investigation.

Attached patch splitting-X2-5.diff (obsolete) — Splinter Review

A new patch, with many improvements:

  • A complete rewrite of top level fn ::processBundle, to make it easier to
    understand and reason about.

  • A spill-cost model that more fully takes into account loop-nesting depths.
    Subsumes bug 1689328. Not as effective as one might hope.

  • A new splitting rule ::tryCostBasedSplitting, which tries to find split
    candidates using a more systematic estimated-gain heuristic rather than with
    the existing ad-hoc "structural" rules. Inspired by Chris Fallin's RA2
    work. Has some small benefit.

  • Improved heuristic ::splitAcrossHotcode, to stop it splitting bundles across
    hot sections if the bundle isn't actually used/def'd in the hot section.
    This seems to be quite important.

  • Reordered the sequencing of rules in top-level ::splitOrSpillBundle.

  • Further improved diagnostic/debug printing.
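The loop-depth-aware spill-cost idea mentioned above can be sketched roughly
as follows. This is illustrative only: the weighting factor and the function
estimatedSpillCost are assumptions, and the patch's actual cost model is not
reproduced in this bug:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sketch of a spill-cost model that accounts for loop-nesting depth: each
// ref (def or use) contributes a base cost scaled by the nesting depth of
// the block it sits in, approximating dynamic execution frequency. Here a
// ref one loop deeper is (hypothetically) weighted 10x hotter.
double estimatedSpillCost(const std::vector<int>& refNestingDepths) {
  double cost = 0.0;
  for (int depth : refNestingDepths) {
    cost += std::pow(10.0, depth);  // 10x weight per loop level (assumed)
  }
  return cost;
}
```

A bundle whose refs all sit inside loops thus costs far more to spill than one
referenced only at top level, which is what steers the splitter away from
spilling loop-carried values.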

Numbers are significantly better than with the comment 8 patch. There is
still excessive spilling (but not excessive reloading) in many cases. This
may be related to interactions between the strategies for spilling across
loops and spilling across calls, and is under investigation.

Attachment #9271640 - Attachment is obsolete: true
Attachment #9271641 - Attachment is obsolete: true
Attached patch splitting-X2-14.diff (obsolete) — Splinter Review

With significantly reduced spilling on x86_64. Has initial implementation of tunnelling (spilling/restoring) values under loops that contain a call, when the value is only read (or not accessed at all) in the loop. This is the last of the X2 line of patches.

Attachment #9273392 - Attachment is obsolete: true
Attached patch splitting-X3-1.diff (obsolete) — Splinter Review

This is functionally identical to the X2-14 patch of comment 12, but with about 1000 lines of code deleted. The deleted lines are for experiments that didn't work well but were if (0)'d rather than simply deleted in the X2-14 patch. Some minor renaming/reordering of functions was also done.

Attached patch splitting-X3-15.diff (obsolete) — Splinter Review

This is the end of the line for the X1/2/3 series of investigation. It doesn't give a consistent enough perf win to be considered for incorporation as-is. However, we can maybe pick some of the various good bits out of it and land those separately. The patch includes use of AVL trees for interference checks, a spill-weight cache for more speedup, reordering of the methods in BacktrackingAllocator.cpp for improved readability, and many smaller cleanups.

Attachment #9273393 - Attachment is obsolete: true
Attachment #9274800 - Attachment is obsolete: true
Attachment #9275380 - Attachment is obsolete: true
Attached patch splitting-X4-02.diff (obsolete) — Splinter Review

This is the best so far, in terms of allocation quality, with splitting across calls being aware of multi-level loops. A worthwhile improvement over the previous patch, X3-15. It sometimes asserts due to being too kludgey in how it finds non-innermost loop bounds. Making that work reliably requires a proper loop tree, which wouldn't be too difficult (or expensive) to hack up.

This applies to m-c 615509:79a65e971e59

Attachment #9278475 - Attachment is obsolete: true

Same as the comment 15 patch, but rebased to m-c 621842:6eea641faf12.

Attachment #9282510 - Attachment is obsolete: true

Here's a summary of what has been learnt via the work in this bug (1758274),
about how our splitting strategies could be improved. It somewhat builds upon
the commentary added by bug 1778945.

Severity: S2 → N/A

Kludgey fixes to work around some problems caused by other changes in Ion over
the past year, in particular to do with wasm exceptions. Still does not work
(asserts) for most wasm/exception tests. Patch is to be applied on top of the
patch in comment 18, which in turn applies to (at least)
m-c 695291:995a3050d70c.

With that stack, I get 42 failures out of 11156 jit_tests on x86_64-linux.

Depends on D201556
