Closed Bug 646923 Opened 13 years ago Closed 7 years ago

Build a model compiler backend

Categories

(Core :: JavaScript Engine, enhancement)

Tracking

RESOLVED FIXED

People

(Reporter: dmandelin, Assigned: dmandelin)

References

(Blocks 1 open bug)

Details

I've been working on a model compiler backend to experiment with code generation ideas for future versions of SpiderMonkey. The primary purpose is to understand the effect of different optimizations on benchmark code extracts, so we know what optimizations to work on. A secondary goal is to gain experience to help with the designs of IRs, program analyses, and optimization frameworks.

Development is currently in:

  http://hg.mozilla.org/users/dmandelin_mozilla.com/pjit/

In the first round of experiments, I compiled an empty loop of 100M iterations with different optimizations. This was on 64-bit Linux, but using the nunboxing representation, which we currently use on x86, and should be doable (and good for performance) on x64 Linux and Windows. Results:

 configuration         isns (M)     cycles (M)

 RA                         800            400
 RA + LTT                   900            400
 RA - OG                    700            300
 RA - TG                    400            200
 RA - TG - OG               300            200

RA is the result of basic codegen with trivial global register allocation, i.e., every variable in the function (in this case, just the loop induction variable) gets its own register. In this version, I also load type tags directly from memory instead of loading them into a register first.

RA + LTT means the type tags get loaded into a register at the start of the loop, and then are not reloaded. This means one fewer load than plain RA. But that didn't help perf: the initial load into a register was an extra instruction, and the cycle count was the same.

RA - OG means overflow guards are removed. I didn't actually implement an analysis for it; I just removed them because it's safe in this case.

RA - TG means type guards are removed, as would be done with type inference. (I think it could actually be done with local type inference in this case, although it's probably hard to extend to more complex code because of the possibility of function calls.) This was the big perf win here.

RA - TG - OG means both guards are removed. In this case, it didn't help further relative to -TG.

These results are somewhat interesting, although they don't tell us a lot yet, because the test is too simple. It does tell us that removing various guards, even though they are predictable, can help a lot. The main thing I wanted to test here was whether it's better to do type guards directly from memory instead of loading (as Opera seems to), and here at least, it seems to be. But that may or may not extend to larger programs.

I forgot to say earlier that the code I have is Python and compiles an SSA high-level IR to native code, for one function. In the high-level IR, each instruction has a meaning similar to an SM bytecode (e.g., ADD), but the source and destination operands are virtual registers instead of stack slots. A virtual register at that level represents a jsval: either a program variable, a program literal, or a temporary value.

This means I'm not treating any issues involving the conversion of bytecode to high-level IR, which is an interesting problem in itself.
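
For concreteness, here is a rough sketch (in Python, like the rest of pjit) of what such an SSA HIR could look like; the class and field names are made up for illustration and are not the actual pjit data structures:

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class VReg:
        # A virtual register holding a jsval: a program variable,
        # a program literal, or a temporary value.
        index: int
        name: Optional[str] = None      # e.g. 'i' for a program variable

    @dataclass
    class Insn:
        # One HIR instruction; 'op' has a meaning similar to an SM bytecode.
        op: str                          # e.g. 'ADD', 'LT', 'IFGOTO'
        dst: Optional[VReg] = None
        srcs: list = field(default_factory=list)
        imm: Optional[object] = None     # constant operand or branch target

    # The empty 100M-iteration loop from the first experiment, by hand:
    i0, i1, i2, t0 = VReg(0, 'i'), VReg(1, 'i'), VReg(2, 'i'), VReg(3)
    loop = [
        Insn('CONST',  dst=i0, imm=0),                     # i = 0 (initial def)
        Insn('PHI',    dst=i1, srcs=[i0, i2]),             # loop header phi
        Insn('ADD',    dst=i2, srcs=[i1], imm=1),          # ++i
        Insn('LT',     dst=t0, srcs=[i2], imm=100000000),  # i < 100M?
        Insn('IFGOTO', srcs=[t0], imm=1),                  # back edge to the phi
    ]
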
Experiment 2 is on this JS code and the equivalent HIR:

    for (var i = 0; i < 100000000; ++i) {
        var x = 1000;
        var y = 301000;
        var lsw = (x & 0xFFFF);
    }

(I'm building my way up to safe_add.)

 system   config         isns (M)    cycs (M)

 SM       -m -a              3800        1400
          -j                 1700        1000

 V8       nocs               1600         550
          crankshaft          740         325

 pjit     RA                 1800         600
          RA-OG              1700         500
          RA-OG-TG           1100         400
          RA-OG-TS           1300         400
          RA-OG-TS+TiR       1300         400
          RA-OG-TG-TS         700         300

          RA+CTGE            1600         600
          RA+CTGE+DTSE       1300         500
          RA+CTGE+DTSE-OG    1200         400

   pjit configurations:

      RA    trivial register allocator (one per payload variable)
      TiR   type tags held in registers

      // The following group requires type inference or loop variable interval
      // analysis to implement in practice
      -OG   omit overflow guards
      -TG   omit all type guards
      -TS   omit all type tag stores to memory

      // The following group are like optimizations done in JM2, but more
      // powerful, as they operate on the whole function. They require SSA
      // or data-flow analysis.
      CTGE  constant type guard elimination: eliminate type guards if they can
            be proven not taken.
      DTSE  dead tag store elimination: eliminate stores of type tags if they
            are never read.
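
As a rough illustration of how these two passes might be written, here is a minimal single-block sketch in Python; the instruction encoding and the TAGSTORE/TAGGUARD names are hypothetical, and the real passes would run over the whole SSA function with proper data-flow analysis:

    # Each instruction is (op, reg, tag). 'TAGSTORE' writes a known tag for
    # reg's slot, 'TAGGUARD' tests reg's tag, and any other op that mentions
    # reg is treated as reading its tag and possibly redefining it.
    def ctge_dtse(insns):
        known = {}        # reg -> tag established by the last store/guard
        pending = {}      # reg -> index of the last not-yet-read tag store
        dead = set()
        for idx, (op, reg, tag) in enumerate(insns):
            if op == 'TAGSTORE':
                if reg in pending:            # prior store never read: DTSE
                    dead.add(pending[reg])
                pending[reg] = idx
                known[reg] = tag
            elif op == 'TAGGUARD':
                pending.pop(reg, None)        # the stored tag gets read here
                if known.get(reg) == tag:     # guard provably not taken: CTGE
                    dead.add(idx)
                else:
                    known[reg] = tag          # a passing guard establishes the tag
            else:                             # other read / possible redefinition
                pending.pop(reg, None)
                known.pop(reg, None)
        # Stores still pending at the end are kept: they may be read later.
        return [ins for i, ins in enumerate(insns) if i not in dead]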

Remarks:

1. v8/nocs is much better than JM and TM, and about equal to pjit/RA. This is mostly because of register allocation for the loop variable: that's the big win here.

2. At least in this test, removing overflow guards, type guards, and stores of type tags are all helpful.

3. Note that the entire loop body is subject to LICM. If I hoist it out at the source level, v8/nocs gets much faster, but v8/cs does not. This is evidence that CS does LICM. The numbers here for CS are also similar to those for an empty loop. This suggests that the CS code for an empty loop is something like |RA-OG|.

Experiment 3 is on safe_add (from crypto-(md5|sha1)) in JS and HIR. The JS is:

    for (var i = 0; i < 100000000; ++i) {
        var lsw = (x & 0xFFFF) + (y & 0xFFFF);
        var msw = (x >> 16) + (y >> 16) + (lsw >> 16);
        var z = (msw << 16) | (lsw & 0xFFFF);
    }

 system     config     isns/100M    cycles/100M

 SM         -m -a                 110             36.5
            -j                     33             11.7

 V8         nocs                   67.6           33.6
            cs (*)                 26.4            9.6

 pjit/LS    ExtraMoves             79             28
            base                   72             27
            -OG                    68             26
            CTGE+DTSE              40             14
            CTGE+DTSE-OG           36             12
            CTGE+DTSE+DG           24              9
            CTGE+DTSE+DG-loopOG    23              8
            CTGE+DTSE+DG-OG        20              7

  (*) Crankshaft will hoist everything, turning this into an empty loop, so I
      modified the JS a bit to make that not happen, trying to change the
      program as little as possible.

  pjit configurations:

    pjit/LS base  base version with linear scan register allocator, including
                  the optimization of reusing an input register for the output
                  when the live ranges are disjoint.
    ExtraMoves    without the register-reuse optimization, which generates
                  a few extra reg-reg move instructions

   Some from before:

    -loopOG       remove all overflow guards for the loop variable only. This
                  requires interval analysis on the loop variable.
    -OG           remove all overflow guards. This requires a general interval
                  analysis that understands bitand and shift operators. It's
                  probably a bit academic: although it helps a little here, it
                  probably doesn't have much effect on overall benchmarks.
    CTGE,DTSE     as in comment 3

   New optimization:

    DG            move type guards to definitions. For an add, in the base
                  version we store an int tag on the fast path and test it
                  later. If we want to specialize the function to run faster
                  when things are always ints (optimistically or 
                  profile-guided), then we can instead guard on the original
                  definition and jump to a deoptimization path if not an int.

Remarks:

1. JM2 and V8/nocs are about the same. V8's 10% advantage is probably because of loop variable register allocation.

2. On x64, the linear scan allocator uses 7 registers and gets every variable in a register without any unnecessary reg-reg moves. The basic version is 25% faster than JM2 or V8/nocs, so it seems like linear scan allocation generates better code than either of those jits.

3. Eliminating unneeded type tag stores and guards really helps, giving us a 2x speedup, but leaving us still slower than TM or v8/cs.

4. Moving type guards to definitions is what makes us about equal to v8/cs and faster than TM. This is important, because it tells us that:

    The _exact_ reason why type specialization helps on this benchmark is by 
    turning this pattern:

        add r1, r2
        jo  slow
        mov INT_TAG, tag(r2)
        ...
        cmp INT_TAG, tag(r2)
        jne slow

    to this:

        add r1, r2
        jo  deoptimized_version

    It eliminates 3 instructions that don't do very much. That saves about
    1 cycle per instance of this pattern, which amounts to 4 cycles in this 
    code (one instance per + operation). That is the only effect of type
    specialization on this benchmark, and it is the same whether done by
    tracing, profiling, or inference.

5. Interval analysis can potentially get rid of all the overflow guards, resulting in about the best possible code for this program. That cuts out another 2 cycles, which is a 1.3x improvement here. I don't know if doing that would really help noticeably on full benchmarks, though.

Quick note about the 32-bit version of the previous experiment: it gets the same results.

The initial version of linear scan, which didn't handle spills, couldn't allocate registers for this function. But once I implemented spills, it was able to keep everything in a register. That is surprising, but it works out: it turns out that the live range of 'z' extends over the entire function, because the starting value of undefined at the top does reach the end (I defined z to be a return variable). However, a register is needed for z inside the loop only after z is assigned its return value. Thus, we can 'spill' z early in the loop (it never ends up using a register at all near there) and can then fit everything in 5 regs.

Blocks: 650496

I'm still working toward am3, but I found out a few interesting things in the meantime:

1. Optimizing away locally unnecessary tags (e.g., for the result of a bitop) is really easy on SSA form: about 50 lines of code. It's basically SCCP (sparse conditional constant propagation) on type tags.

2. I wanted to try interval analysis algorithms. They need to understand a bit of control flow (more than SSA directly accommodates), so I implemented a non-SSA general abstract interpretation for intervals, with some heuristic-based widening. It turned out to require a fair amount of code, so there's a good chance it's too slow and complicated to use in practice.

That motivated me to think harder about really simple algorithms for intervals, and I think I came up with something decent. The main application I have is to infer that loop iteration variables are integers, but I think it can be extended in various ways. Here is the algorithm on the simplest example:

 // Code
 for (var i = 0; i < 1000; ++i) {}

 // SSA form
 0:  i1 = 0
 1:  i2 = phi(i1, i3)
 2:  i3 = i2 + 1
 3:  if i3 < 1000: goto 1

Step 1. Identify loop back edges that carry an integer constraint on a variable. 

In the example, the jump from 3 to 1 is a back edge with constraint |i3 < 1000|. (A back edge is detected as an edge from a node to a dominator of that node; conversion to SSA form "requires" dominator information, so we will have it.)

Step 2. The back edge will always lead to a phi statement that consumes the constrained variable. Find the assignment for the other input to the phi. Look at the def of that input; we'll continue the analysis only if that can be resolved to a constant int. 

In the example, the phi expression is phi(i1, i3), the other input is i1, and its def is "i1 = 0". 

Remark: at this point, we know that on entry to the loop, (a) i must be less than 1000, and (b) i started at 0. The idea of the algorithm is to guess that, on entry to the loop body, i >= 0, and prove that using assume-guarantee reasoning.

Step 3. Assume that on loop entry, the variable lies in the interval [initial value, edge constraint). Then propagate forward through the SSA assignments, trying to prove that when we reach the back edge, the lower bound still holds. Also, as we process each def, record if it is proven to be an integer, or in some interval of interest. This can be done with a worklist algorithm.

In the example, assume that i2 is in [0, 1000), and initialize the worklist to 'i2'. An element is in the worklist if we have computed a new interval for it and not yet updated its dependents. Note that everything we derive is only an assumption until we prove our assumption is loop-invariant. Hence:

  worklist: [ i2 ]
  action:   pop i2, record that i3 is in [1, 1001), and add i3 to worklist
  worklist: [ i3 ]
  action:   pop i3, and see that it hits the back edge. So we have to prove
            that i3 >= 0. Well, we know i3 is in [1, 1001), so we are done.

This algorithm is simple, works on SSA, requires only one pass through the loop, should handle anything similar to the basic loop, and can be extended to collect other facts.
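
A toy version of the whole thing might look like this in Python; the def encoding is invented for the sketch, and it only handles the add-a-constant case needed for the example:

    def prove_loop_var_interval(init_val, bound, defs, phi, back_var):
        # Step 2/3: assume the phi variable lies in [init_val, bound), then
        # propagate forward and check the assumption holds at the back edge.
        ivals = {phi: (init_val, bound)}
        worklist = [phi]
        while worklist:
            name = worklist.pop()
            for user, (op, src, k) in defs.items():
                if op == 'add' and src == name and user not in ivals:
                    lo, hi = ivals[name]
                    ivals[user] = (lo + k, hi + k)
                    worklist.append(user)
        # The assumption is proven if the value carried around the back edge
        # is still >= the assumed lower bound.
        lo, _ = ivals.get(back_var, (None, None))
        return ivals if lo is not None and lo >= init_val else None

    # for (var i = 0; i < 1000; ++i) {}   --  i2 = phi(i1, i3), i3 = i2 + 1
    defs = {'i3': ('add', 'i2', 1)}
    print(prove_loop_var_interval(0, 1000, defs, phi='i2', back_var='i3'))
    # -> {'i2': (0, 1000), 'i3': (1, 1001)}
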
This is basically what we're already doing in JM+TI to hoist array bound checks, and will soon be using to eliminate integer overflow checks as well.  Main differences are that the JM+TI one handles symbolic bounds (though only if loop invariant), can use inferred types, and has a grubbier way of doing the inductive proof of the lower bound at entry (use lifetimes to grep for writes to the index var in the loop body, and reject if any are not 'x++').  That analysis currently works on the normal bytecode, but will be working on SSA form soon and should be reusable for IonMonkey.

(A little) more in bug 650496, which also has a fair amount of stuff on am3.

Experiment 4 is a mini-am3:

 function f(a) {
    var x = 0;
    for (var i = 0; i < 100000000; ++i) {
        x ^= a[i & 0xf];
    }
    return x;
 }

 var x = f([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 ]);
 print(x);

  system    config              isns/100M      cyc/100M

  js64      -m                         73            24
            -j                         38            14

  js32      -m                         45            18
            -j                         34            14
            -n                         17             5

  v8        nocs                       31.6          11.6
            cs                         23             7.2

  pjit/64   base                       34            13
            OptimizeTags               28            11
            OptimizeIntervals          20             8
            test tag from reg          19             6
            hoist obj check            17             5
            hoist dense check          15             4
            hoist slots load           14             4
            fold hole check            12             3
            hoist range check          10             3
            single-load tag+value      11             3

            "easy" optimizations       15             5

    pjit configs:

        OptimizeTags: optimize away const tag stores/checks (see comment 6)
        OptimizeIntervals: use interval analysis to infer int types for loop
            vars and avoid overflow checks (see comment 6)
        test tag from reg: the type tag for the array element read out is
            stored to mem, then tested there, in the base version. This tests
            from the temp register instead. This should be fairly simple to
            work into a compiler.
        hoist obj check: hoist a-is-obj check above loop. This should be
            doable with vanilla LICM.
        hoist dense check: hoist a-is-dense check above loop. This is LICM but
            also requires us to know what can make an array not-dense: type
            inference, or seeing that the array is not written to.
        hoist slots load: hoist load of a's slot array. Similar to previous:
            we need to know no resize/GC.
        fold hole check: hole check is redundant with int guard, so we can
            eliminate it
        hoist range check: hoist range check for start of loop, requires
            knowing size can't change.
        single-load tag+value: the base version uses 2 32-bit loads. Instead,
            we can do one 64-bit load, and then 2 extra instructions to
            separate out the tag (see the sketch after this list).
        "easy": includes OptimizeTags, OptimizeIntervals, test tag from reg,
            hoist obj check, and fold hole check.
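
For the single-load case, the arithmetic is just a shift and a mask on the 64-bit word: the tag sits in the high half and the payload in the low half (assuming a little-endian nunbox layout). A tiny sketch; the tag constant is illustrative, not SpiderMonkey's actual encoding:

    TAG_INT32 = 0xFFFF0001            # illustrative tag value only

    def split_nunboxed(word):
        # One 64-bit load plus two ALU ops instead of two 32-bit loads.
        tag = word >> 32               # type tag from the high 32 bits
        payload = word & 0xFFFFFFFF    # payload from the low 32 bits
        return tag, payload

    tag, payload = split_nunboxed((TAG_INT32 << 32) | 7)
    assert (tag, payload) == (TAG_INT32, 7)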

Remarks:

1. Naive punboxing hurts JM64 quite a bit here.

2. Yay type inference! Currently beating all the other JS engines and equal to the "reasonable, non-super-hard" optimization set.

3. Pretty much every optimization helps, although once we get pretty tight, not everything speeds things up all by itself ("half a cycle" per optimization). 

4. Loading tag+value together didn't help in the configuration I tested above. I tried again, applying it to "OptimizeTags+OptimizeIntervals", and there it saved 1 cycle. So it probably does help, just not necessarily once the code is very tight, as with the other optimizations.

5. The best code I was able to come up with was 2.4x faster than Crankshaft on this microbenchmark, which indicates there is plenty of room to keep innovating and improving here (although it is probably harder to get improvements of that magnitude on large programs).

Next up is to try this example on x86, where there is register pressure, to see how that plays out.

For x86, I'm using 5 free registers. We should have 6 or 7 in practice. In that case, the array microbenchmark would fit everything in registers and get the same results as x64.

But with 5 registers, we need to spill something when we read the array element. Standard linear-scan picks 'a', because it is not read after that (until the next loop iteration). The resulting code takes about 1 cycle longer than the x64 version.

I want to see what happens if I spill 'x' instead (the other reasonable spill selection). To that end, I implemented most of Wimmer's linear scan algorithm, which was fairly educational. I need to add more support for general spilling before I can test spilling 'x', though.
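
For reference, the standard linear-scan choice mentioned above amounts to "evict the live variable whose next use is farthest away". A toy version; the positions and use lists are made up to mirror the array loop:

    def pick_spill_victim(active, position):
        # active: live variable -> sorted list of its use positions.
        def next_use(uses):
            return next((u for u in uses if u >= position), float('inf'))
        return max(active, key=lambda var: next_use(active[var]))

    # At the array read, 'a' isn't used again until the next iteration,
    # so it is the interval standard linear scan evicts.
    active = {'a': [2, 20], 'i': [2, 3, 5, 19], 'x': [6, 7]}
    print(pick_spill_victim(active, position=4))    # -> 'a'
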
Is it possible to do something like this:
  1.  track which registers are allocated to which variables;
  2.  trace execution of 2-3 iterations of the loop, marking which (variable, register) pairs are hot, which don't change and which are most likely to spill;
  3.  apply the information found in (2) to calculate spill information *across* loop iterations.

The idea here is to collect information about how the register allocation behaves on the loop as a whole, between iterations, and not just in a single execution of the body.

Going further, it may be applicable over nested loops. For example, matrix multiplication has 3 nested loops (i, j, k), so spilling those would significantly reduce performance.

Maybe the variable could be annotated that it is a loop variable and take precedence over other variables (less likely to spill)?

Not sure how these ideas will work in practice.

(In reply to comment #10)
> Maybe the variable could be annotated that it is a loop variable and take
> precedence over other variables (less likely to spill)?

That is a good thing to do. I haven't given much thought to how to do it yet, partly because the benchmarks I've been studying don't feature nested loops. The paper I am working from mentions that but doesn't give specifics. One interesting thing they do is order the basic blocks so that inner loops come first, to give them first chance to get registers.

You mention using dynamic information as well, which could also be interesting--if we compile after profiling, we might be able to use dynamic estimates to prioritize better.

Side note: I tried the alternate spill (spilling 'x') on the previous example. Mostly I ended up learning more about how Wimmer's allocator works. In a moderately optimizing configuration, it is about 1 cycle slower than spilling 'a'. That's 10% in this example, so fairly significant. What I take from this is that fairly small changes in the register allocator behavior can matter.

OK, am3 results are in. I'll break this up into a few comments because there's a lot of info.

A. I created a benchmark that runs am3 with a small modification (2 arrays are passed in rather than objects creating arrays) on the same inputs 100M times. The loop in am3 runs |n| times (n is a function parameter). I found n=19 is the most common value, so I selected a parameter list arbitrarily from the ones that run with n=19.

B. Results for the existing JS engines:

 system   options     isns/100M      cyc/100M

 js/32    -m               4501          1519
          -j               2780          1161
          -m -n            1470           536

 js/64    -m               7268          2383               
          -j               2703           970

 v8/32    nocs             4154          1459
          cs               1667           626

Remarks:

1. JM2 on 64-bit takes a big hit because of punboxing. We'll need to fix this for OSX.

2. JM2 and v8/nocs are about the same on 32-bit.

3. TI is the fastest of all. I looked at the code and it is very good. Experiments with pjit suggest that the reason is probably that TI can eliminate the type tests for elements read out of the arrays.

4. Interestingly, TI is 1.17x faster than CS on this microbenchmark, but 1.23x slower on the full crypto benchmark. I still don't know why. 

One guess was cache behavior, but I think I've ruled that out: (a) TI runs for 640M more cycles (3.14B vs. 2.77B); CS actually makes more L3 accesses, enough to make CS 5-10M cycles slower, and while TI does have more L3 misses, it is only 100K more, which shouldn't cost more than 10M cycles; and (b) TI runs 640M more instructions (8.33B vs. 7.55B), which seems like a sufficient explanation.

Why does TI run more instructions? My next ideas are (a) something in the full benchmark partially defeats the type inference or some optimizations, or (b) stuff outside am3 is a bunch slower. I would guess that (b) is at least partially in play; comparing the code TI generates for the benchmark against the code it generates for the microbenchmark would be one way of checking (a).

am3 results continued:

Correction from last time: I run the loop 1M times, and the values quoted in the previous comment should be 'isns/1M' and 'cyc/1M'.

C. pjit results on the same microbenchmark as part B. This is for 64-bit code with nunboxing.

                           isns/1M    cyc/1M
 
 (1) base                     3038        1375

This is with the usual local tag elimination stuff I've been doing. As usual, it's between v8nocs/JM2 and TM.

 (2) 1 + reorder arith        2904        1671

This is reordering the arithmetic (using an accumulator for a*b+c*d+e*f rather than doing all the muls first, and pushing the i++ and j++ down below their statements). It reduced instruction count a bit, but noticeably increased cycle count. I noticed this can happen a lot.

 (3) 1 + rpo LSRA order       3092        1493

For this one, I ordered instructions in rpo order for forming live intervals for linear scan register allocation. I thought that might give better allocation, but apparently not, at least in this case.

 (4) 1 + fold hole checks     2905        1218

We do a hole check after reading from arrays, but we check the same tags to make sure they are ints right after, so the hole check is not needed. Taking it out gives a surprisingly big boost.

 (5) 4 + interval analysis    2316        1265

For this, I simulated interval analysis on the values computed inside the loop to remove type tags and overflow checks. That removed many instructions, but slightly increased the run time.

 (6) 5 + hoist array guards   2016         911

This one hoists the object-type and dense-array guards for the arrays, for a big boost. At this point, we're about the same as TM, although by rather different means, because TM does more type specialization but pjit does better register allocation.

 (7) 6 + guard on array tag   1959         873

This one guards for ints on the tags stored within the array rather than reading them out, storing them locally, and then guarding. It helps a little bit.

 (8) 7 + ivl analysis for |n| 1920        1090

Simulates interval analysis on the loop var. It cuts out a few instructions, but for unknown reasons increases run time.

 (9) 8 + type specialize      1434         776

This means removing most type tags and guarding on the type of values when they are def'd, not used. This was the next big boost in perf, now getting close to CS.

(10) 9 + single slot load     1529         681

This means that when we read from an array, instead of loading the value and testing the type separately, we load the 64-bit value at once and mask out the type. It is more instructions but a bit faster.

(11) 10 + elim range check    1323         630

This means eliminating the range checks for array accesses. This is about the same as hoisting them, which TI now does using an interval-type analysis. Here we are the same as CS.

(12) 11 + elim tag tests      1095         510

This means eliminating guards on the types of array elements. TI can do this; dynamic systems could possibly do it with an array field that tracks type properties, but that might be too expensive to maintain. This is the same score as TI.

General remarks:

1. Many optimizations can increase the time taken. I don't know why. It does seem that once we strip out everything unnecessary, we are as fast as possible, but some of the intermediate points are not fast.

2. Eliminating predictable branches seems fairly important overall.

3. This test is not entirely the same in all systems because the code generated to call am3, and to loop over that call, differs. I measured the time to run the benchmark with an empty loop body in am3 to try to isolate the cost of the loop contents. It's not clear that's a valid thing to do, though, because the presence of the loop body could affect how stuff around it runs.

Modulo that warning, TI has the least empty-loop overhead (13M cycles), pjit the most (138M cycles, probably from its C++ code), and the others in the middle (26-71M cycles). So, TI and CS might actually be about the same on the code within am3, but we don't really know. The final, heavily manually modified, pjit code might be the fastest, but that's not clear either. From inspection, the final pjit code is very similar to the TI code, but the pjit code gets to use more registers, cheats on skipping -0 checks, and has fewer moves, so it could actually be a bit faster.

4. Minor changes in register allocator behavior seem to be able to generate noticeable perf differences. Moving a restore down to the first use seemed to help a fair amount (50M cycles). But moving 2 other restores from the loop top to the bottom had no effect.

5. It's clear that getting the best score here requires eliminating pretty much all unnecessary computation. Here, that means for dynamic systems: full type specialization, full local type inference based on that specialization, interval analysis, working with values directly from array slots instead of copying them out when possible, folding hole checks away, hoisting guards and slots pointer loads, and good register allocation, including good victim selection and placement of spills and restores. This test suggests that loading value+tag together on 64-bit when we need both is also a win. Type inference further enables eliminating guards on values read out of arrays.

We have a compiler: closing old bugs.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED