Closed Bug 1712078 Opened 3 years ago Closed 3 years ago

Stop Ion's LICM pass from hoisting computations out of low-probability paths

Categories

(Core :: JavaScript Engine: JIT, enhancement, P2)

RESOLVED FIXED
91 Branch
Tracking Status
firefox91 --- fixed

People

(Reporter: jseward, Assigned: jseward)

Attachments

(4 files)

Ion has an LICM pass on MIR. It contains adequate checks to ensure that
hoisting MIR nodes maintains correctness. But it is almost completely
indiscriminate when it comes to the question of which nodes are profitable to
hoist. It will hoist any (valid-to-hoist) node, with the sole exception of
constants that are used by non-invariant nodes. In particular, it will hoist
nodes from arbitrarily low-probability paths within the loop body. This is
believed to be the cause of the factor-of-3 perf regression shown in bug 1708381.

Bug 1708381's bad case has at its core a loop like this:

   loop {
      switch (meh) {
         case 0: { generally a single block; break }
         ...
         case N: { generally a single block; break } for N = several hundred
      }
   }

Many (most?) of the switch arms contain loop-invariant nodes, all of which are
lifted out of the loop. This has two bad performance effects:

  • all of them have to be executed before the loop can start, even if the loop
    iterates only a few times

  • the register allocator is overwhelmed and has to spill almost all of them,
    leading to worse code than having just left them in place.

Bug 1708381 comment 18 contains a tiny standalone test case that illustrates
the problem.

Antique bug 669517 is probably a duplicate of this.

See Also: → 669517

We're going to call this a JIT bug even though it probably affects wasm disproportionately.

Component: Javascript: WebAssembly → JavaScript Engine: JIT

Finding any leads on this seems difficult. Some googling didn't turn up
anything useful. From source-code reading, it appears that GCC's LICM pass
tries to estimate the benefit of hoisting each candidate, and that includes
some consideration of the effects on register pressure. I couldn't tell if
loop-structure/block-probabilities were involved. From LLVM's LICM pass I drew
no conclusion. I didn't see any reference to these kinds of problems in the
source of WebKit's B3's LICM pass.


At first it seems attractive to fix this by adding a simple structural
heuristic, for example: don't lift out of a block which is logically
downstream from a multiway branch. That would certainly fix bug 1708381, but
it's too aggressive, since it would inhibit legitimate cases, e.g.:

loop {
   switch (meh) {
      case 0: { ...; break; }
      case 1: { ...; break; }
      default: { ...; break; }
   }
}

If we crudely assume that all loops iterate 10 times, then it roughly follows
that it's a win to lift an invariant computation out of a block if that block
is executed on at least 10% of the iterations. In other words, if the block
has hotness of 0.1 or more relative to the loop header, which we take to have a
hotness of 1.0.

In this 3-way switch example, at least one of the arms must have a relative
hotness of 1/3 or more, hence more than 0.1, so the simple heuristic is too
aggressive. That's even more the case if we consider

loop {
   if (check for error) {
     trap
   } else {
     normal_case
   }
}

in which case we'd certainly want to lift invariant nodes out of the
else-clause, but not out of the then-clause.
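As a sketch, the 10%-of-iterations rule of thumb amounts to a one-line
predicate. The names here are hypothetical, not from any actual patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch: under the crude assumption that loops iterate ~10
   times, hoisting out of a block only pays off if the block runs on at
   least 10% of iterations, i.e. has hotness >= 0.1 relative to the loop
   header (whose hotness is 1.0 by definition). */
static const double HotnessThreshold = 0.1;

static bool worthHoistingFrom(double relativeHotness) {
    return relativeHotness >= HotnessThreshold;
}
```

Under this rule the else-clause above (hotness close to 1.0) qualifies, while
the trapping then-clause (hotness close to 0.0) does not.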


An overall better approach would seem to be, for each block in the body, to
estimate its hotness relative to the header node. This would give each block a
value between 0.0 and 1.0, with the header always getting 1.0, and drive the
hoist/no-hoist decision based on that. This would also facilitate cases where
we know something about conditional branch probabilities, in particular the
case where a block is effectively never executed because it traps.

I looked in some detail at the idea of performing a function-wide forward
iterative analysis, which would propagate forward constraints on relative
frequencies, identifying and specially handling loop closing edges. This would
give a hotness for each block in the function and be done once, before LICM.
One reason this is attractive is that it would also be useful in computing
spill costs in the RA.

Although it is probably doable, making it work reliably and cheaply looks like
a big job, so I gave up on it.


A simpler alternative would be to do a quick analysis pass before processing
each loop during LICM. The idea is to do a single forward pass through all
blocks in the loop, starting at the header, following successor edges until all
blocks in the loop have been visited. Crucially, we would ignore back edges,
hence dealing with an effectively acyclified version of the loop body. Since
the loop body is not itself a loop, what this acyclification means is that we
don't interpret any loops nested within the body as real loops during the
analysis. (That's OK; those inner loops will have their own analysis done when
LICM.cpp later visits them.)

The analysis would propagate a hotness value of between 0.0 and 1.0 along each
edge, dividing hotness equally when a node has multiple successors and summing
the hotness from each predecessor. The sum of the flows out of the header
would need to be 1.0. Because the graph is acyclic, no iteration would be
needed, so it would be a simple breadth-first traversal (maybe) of the body
blocks.
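A minimal sketch of that propagation, assuming back edges (and hence all
cycles) have already been removed, and using a hypothetical fixed-size graph
representation in which block 0 is the loop header:

```c
#include <assert.h>
#include <string.h>

#define MAX_BLOCKS 64

/* Hypothetical representation: block 0 is the loop header; back edges
   (and hence all cycles) are assumed to have been removed already. */
typedef struct {
    int numBlocks;
    int numSuccs[MAX_BLOCKS];
    int succs[MAX_BLOCKS][4];
} Graph;

/* Propagate hotness forward through the acyclic body: the header gets
   1.0, each block splits its hotness equally among its successors, and
   each successor sums the contributions from its predecessors.  Because
   the graph is acyclic, a single worklist pass in topological order
   (Kahn-style) suffices; no iteration to a fixed point is needed. */
static void computeHotness(const Graph* g, double* hotness) {
    int indegree[MAX_BLOCKS] = {0};
    for (int b = 0; b < g->numBlocks; b++)
        for (int i = 0; i < g->numSuccs[b]; i++)
            indegree[g->succs[b][i]]++;

    memset(hotness, 0, (size_t)g->numBlocks * sizeof(double));
    hotness[0] = 1.0; /* the loop header, by definition */

    int work[MAX_BLOCKS];
    int n = 0;
    work[n++] = 0;
    while (n > 0) {
        int b = work[--n];
        int ns = g->numSuccs[b];
        for (int i = 0; i < ns; i++) {
            int s = g->succs[b][i];
            hotness[s] += hotness[b] / ns; /* split equally */
            /* A block is ready once all its predecessors are done. */
            if (--indegree[s] == 0)
                work[n++] = s;
        }
    }
}
```

On a diamond (header branching to two arms that rejoin), this assigns 0.5 to
each arm and 1.0 to the join block, consistent with the requirement that the
flows out of the header sum to 1.0.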

The details remain to be worked out, as does the interaction with LICM.cpp's
strategy of visiting outer loops before inner loops.

Attached file flow_solver2.c

A simpler alternative would be to do a quick analysis pass [..]

Here's a test program including 9 test graphs, which implements the above. It
correctly computes block-hotness for the first 8 graphs (loop bodies). These
include loop bodies with nested sub-loops. The hotness for the nested loop
blocks isn't correct, but that doesn't matter; we only care about hotness for
the top-level blocks in the loop body [provided, perhaps, that LICM processes
innermost loops first, which isn't currently the case].

Unfortunately the code does not compute correct hotness in cases where the body
contains an irreducible (multiple-entry) nested loop. The 9th and final graph
example in the program shows this. This wouldn't be a problem if it were only
nested-loop hotnesses that were wrong, but alas it affects the body-top-level
blocks after the nested loop too:

          H
        / | \
       v  v  v
      /   |   \
 /->-A--->B--->C-->-\
 |        |         |
 \---<----+----<----/
          |
          v
          |
          X

viz: H->A, H->B, H->C (the nested loop entries), A->B, B->C, C->A (the nested
loop itself), B->X (the only exit from the nested loop).

The algorithm computes hotnesses

H: 1.0, A: 0.33, B: 0.67, C: 0.67, X: 0.33

For A, B, and C, we don't care what it computes. H with 1.0 is correct (it
must be 1.0, since it's the loop body's entry node). But we also require X to
be 1.0, and we didn't get that.

One fix is to first collapse the loop body graph into an acyclic graph by
computing the strongly connected components and merging each into a single
node. Then do the above cost-propagation on the guaranteed-acyclic SCC graph.
This gives no useful information for each cycle, but we don't care about that.

It does seem depressingly complex, though.

Yeah, tricky. Since C->A is not actually a back edge, your graph is not
acyclic, contrary to assumption. So identify the SCCs, or compute dominators
in order to identify true back edges, and then implement some sort of
fixed-point computation, which ends up with A=1, B=1, C=1 anyway. If there's
an easier way to identify true back edges than computing dominators, that
might be interesting.

On the other hand, it's possible Ion has sufficient infrastructure for this
already. It has an implementation of the Cooper/Torczon dominator-finding
algorithm, and there are concepts of loop headers and back edges. I haven't
looked enough to know whether these are featureful enough for you, but it
might be worth looking into. Ion is not supposed to be dependent on reducible
control flow.

The test program of comment 3 already computes and takes note of back edges,
but this (and associated machinery with dominators) basically only works right
for single-entry loops. My plan is to instead use Tarjan's 1972 algorithm to
compute strongly connected components, collapse the graph accordingly and only
then run the hotness propagation step. This seems to me like it has a good
chance of working robustly. It's just tedious to implement because of the
graph-collapsing phase. At least all the algorithms involved are linear-time
in the number of blocks, though.
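For reference, the core of Tarjan's algorithm is small. Here is a
self-contained sketch that only assigns SCC ids; the subsequent
graph-collapsing phase, which is the tedious part, is omitted, and all names
are hypothetical rather than taken from the test program:

```c
#include <assert.h>

#define MAXN 16

/* Hypothetical, minimal Tarjan (1972) SCC computation over a small
   fixed-size graph.  Blocks on a common cycle receive the same id in
   sccOf[]; condensing each SCC to a single node is left to the caller. */
static int nSuccs[MAXN], succ[MAXN][4];
static int idx[MAXN], low[MAXN], sccOf[MAXN];
static int stk[MAXN], sp, onStk[MAXN];
static int indexCtr, sccCtr;

static void strongconnect(int v) {
    idx[v] = low[v] = indexCtr++;
    stk[sp++] = v;
    onStk[v] = 1;
    for (int i = 0; i < nSuccs[v]; i++) {
        int w = succ[v][i];
        if (idx[w] < 0) {
            strongconnect(w); /* tree edge: recurse, then take its low */
            if (low[w] < low[v]) low[v] = low[w];
        } else if (onStk[w] && idx[w] < low[v]) {
            low[v] = idx[w];  /* back/cross edge into the current stack */
        }
    }
    if (low[v] == idx[v]) { /* v is the root of an SCC: pop it */
        int w;
        do {
            w = stk[--sp];
            onStk[w] = 0;
            sccOf[w] = sccCtr;
        } while (w != v);
        sccCtr++;
    }
}

static void tarjan(int n) {
    indexCtr = sccCtr = sp = 0;
    for (int v = 0; v < n; v++) { idx[v] = -1; onStk[v] = 0; }
    for (int v = 0; v < n; v++)
        if (idx[v] < 0) strongconnect(v);
}
```

Running this on the comment 3 graph (H, A, B, C, X numbered 0..4) puts A, B,
and C into one component and H and X into singleton components, after which
the hotness propagation can run on the collapsed, guaranteed-acyclic graph.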

(In reply to Julian Seward [:jseward] from comment #5)

The test program of comment 3 already computes and takes note of back edges,
but this (and associated machinery with dominators) basically only works right
for single-entry loops.

Apart from OSR and fixup blocks, where the extra entry point matches the loop entry, I do not think the MIR has any way of representing a loop with multiple entry points.
Would you have an example where such graph exists in the MIR?

Severity: -- → N/A
Priority: -- → P2
See Also: → 1710012
Attached file flow_solver3.c

Test program that implements the scheme described at the end of comment 3
above. It contains 14 test cases covering every type of degenerate input I
could think of. The test cases contain expected hotness values and the program
verifies that the computed values match them.

This might be worth trying out as a hotness-calculator in Ion/LICM, even if we
don't go with it in the end. For one thing it is available now, and for
another, it is probably pretty close to a "gold standard" for hotness
estimation. Aside from profile-driven approaches, which we don't have, any
simpler scheme will generate estimates that are no better, and may be worse.

It could do with further cleanup:

  • (Even) More assertions:

    • check the final State, to check that the forward-backward Node-SCC links
      are consistent

    • check the final hotness computation: the sum of hotnesses of all blocks in
      the region of analysis (the GraphIn) that have no successors must equal
      1.0

  • Change representation of State::nodeGroups and State::succGroups to use a
    conventional (start-index, length) pair. The existing start-index and
    sentinel arrangement turns out to be hard to work with.

  • Change type of all the integer indexes from uint32_t to uint16_t so as to
    save space and hence improve cache behaviour. This would limit Ion's LICM to
    loops with at max 2^16-3 == 65533 basic blocks, which doesn't seem like a big
    limitation.

The scheme could trivially be enhanced to generate more accurate estimates
based on the observation that we can regard any block which ends in an
unconditional trap as having zero hotness.

Assignee: nobody → jseward

First attempt at integration of the SCC-based hotness estimator into LICM.cpp.

Ion has an LICM pass on MIR. It contains adequate checks to ensure that
hoisting MIR nodes maintains correctness. But it is almost completely
indiscriminate when it comes to the question of which nodes are profitable to
hoist. It will hoist any (valid-to-hoist) node, with the sole exception of
constants that are used by non-invariant nodes. In particular, it will hoist
nodes from arbitrarily low-probability paths within the loop body. This is
believed to be the cause of the factor-of-3 perf regression shown in bug
1708381.

This patch:

  • Disables LICM for any loop with more than 100 blocks in the body or that
    contains an MTableSwitch instruction with more than 25 successors. Either
    of these guards individually fixes the original 1708381.

  • Adds comments explaining the rationale for the above two thresholds, plus a
    tiny amount of profiling data that suggests they are reasonable values.

  • Fixes some indentation and wording in the debug printing, so as to make it
    easier to read.

The perf effect on most wasm inputs is somewhere between nonexistent and very
small. The above limits are pretty generous and, at least for wasm, most
loops are fairly small and so will still qualify for LICMing. For the
Embenchen wasm perf suite, the geomeans for instruction count, data reads and
data writes, in JIT-generated code, fall by 0.10%, 0.19% and 0.13%
respectively.
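The two guards can be pictured roughly as follows. This is a hypothetical
stand-in sketch with invented types and names, not the actual LICM.cpp change:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Thresholds as quoted in the patch description above. */
enum { MaxBlocksForLICM = 100, MaxTableSwitchSuccessors = 25 };

/* Hypothetical stand-in for the loop properties the real code would read
   off Ion's MIR (body block count, MTableSwitch successor counts). */
typedef struct {
    size_t numBodyBlocks;
    size_t largestTableSwitchSuccessors; /* 0 if no MTableSwitch present */
} LoopInfo;

/* Either guard tripping on its own disables LICM for the whole loop. */
static bool loopIsLICMable(const LoopInfo* loop) {
    return loop->numBodyBlocks <= MaxBlocksForLICM &&
           loop->largestTableSwitchSuccessors <= MaxTableSwitchSuccessors;
}
```

A loop failing either test is simply skipped by LICM rather than having its
invariant nodes selectively hoisted; per the numbers above, most wasm loops
are small enough to pass both tests.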

Pushed by jseward@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4c3dbe88ee1f
Stop Ion's LICM pass from hoisting computations out of low-probability paths.  r=jandem.
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 91 Branch