Closed Bug 545406 Opened 14 years ago Closed 13 years ago

TM: loop invariant code motion (LICM)

Categories

(Core :: JavaScript Engine, defect)


Tracking


RESOLVED WONTFIX

People

(Reporter: jseward, Assigned: n.nethercote)

References

Details

(Whiteboard: PACMAN)

Attachments

(1 file, 4 obsolete files)

With bug 545274 (proper alias info for loads, stores and calls)
in place, it's possible to contemplate how to do loop invariant
code hoisting on traces.  Implementation sketch follows
(warning: at most 60% baked, possibly less)

We want to compute the transitive closure of all value-producing
nodes whose value does not depend on anything computed during the
loop.  So the root nodes are (a) those producing constants
and (b) loads whose alias set does not intersect the alias set of
any store in the trace nor the store-set of any call in the
trace.  Then visit all nodes that would be reachable from these
in a forward traversal of the LIR.

I believe the complete algorithm can be achieved in 3 passes over
the LIR buffer.

Pass 1: compute aggregate store alias set, and auxiliary forward
   index table:

   let S:AliasSet = empty
       T:Array of LIR* = empty

   for each node in LIR buffer, working backwards {
      if node is the trace-branch-back label (head of the loop)
         break
      if node is a store
         S = S union alias-set(node)
      if node is a call
         S = S union store-alias-set(node)

      add node to the end of T
   }

   so now   S tells us which loads can't be considered loop
            invariant due to being stored to inside the loop proper

            T makes it possible to scan forwards during Pass 2
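
   For concreteness, here is Pass 1 rendered as a small, self-contained C++
   sketch.  The Ins/Kind types and the bitmask alias sets are stand-ins
   invented for this example (they are not Nanojit's LIns/AccSet), but the
   control flow matches the pseudocode above.

  #include <cstdint>
  #include <vector>

  enum Kind { Const, Load, Store, Call, Arith, LoopLabel };

  struct Ins {
      Kind     kind;
      uint32_t aliasSet;       // regions read (Load) or written (Store)
      uint32_t storeAliasSet;  // regions a Call might write
      Ins*     args[2];        // operands; null if unused
  };

  // Walk the buffer backwards to the loop-back label, accumulating the
  // aggregate store alias set S and the table T used for forward scans later.
  static void pass1(const std::vector<Ins*>& buf,   // oldest instruction first
                    uint32_t& S, std::vector<Ins*>& T)
  {
      S = 0;
      for (auto it = buf.rbegin(); it != buf.rend(); ++it) {
          Ins* ins = *it;
          if (ins->kind == LoopLabel)
              break;                         // reached the head of the loop
          if (ins->kind == Store)
              S |= ins->aliasSet;
          if (ins->kind == Call)
              S |= ins->storeAliasSet;
          T.push_back(ins);                  // T ends up newest-first
      }
  }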


Pass 2: compute transitive closure of loop invariant nodes

   let I:Set of LIR* = empty    // the invariant nodes

   for each node in T, working backwards (hence fwds in LIR buf) {
      if node denotes a literal
         I = I union {node}
      else
      if node is a load 
         and intersect( alias-set(node), S ) == empty
         and node's address argument is in I
        then
         I = I union {node}
      else
      if node is an arithmetic node of some kind
         and all of node's arguments are in I
        then
         I = I union {node}
   }

   so now I is the set (or, a conservative subset) of loop invariant
   nodes.  Note that because Pass 1 stopped at the loop-back label, we
   never visit the existing loop pre-header, even though all
   value-producing nodes in the pre-header ought to be added to I too.
   So this loop is a bit too conservative.
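
   And a matching C++ sketch of Pass 2, reusing the toy Ins/Kind types from
   the Pass 1 sketch above (again, invented for illustration, not Nanojit's
   real classes).

  #include <unordered_set>

  static void pass2(const std::vector<Ins*>& T,      // newest-first, from Pass 1
                    uint32_t S,                      // aggregate store alias set
                    std::unordered_set<Ins*>& I)     // out: the invariant nodes
  {
      auto invariant = [&I](Ins* a) { return a == nullptr || I.count(a) != 0; };

      // Scan T back-to-front, i.e. oldest instruction first, so operands are
      // classified before their users.
      for (auto it = T.rbegin(); it != T.rend(); ++it) {
          Ins* ins = *it;
          switch (ins->kind) {
          case Const:
              I.insert(ins);
              break;
          case Load:
              // Invariant if nothing in the loop writes the loaded region
              // and the address operand is itself invariant.
              if ((ins->aliasSet & S) == 0 && invariant(ins->args[0]))
                  I.insert(ins);
              break;
          case Arith:
              if (invariant(ins->args[0]) && invariant(ins->args[1]))
                  I.insert(ins);
              break;
          default:
              break;
          }
      }
  }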


Pass 3: move I into the pre-header.

   Not sure it's possible to rearrange LIR buffer in-place in O(#
   nodes) time.  Hence rearrange while copying into a new LIR
   buffer of the same size.  The copy has to be done in four stages:

   (a) copy the pre-header (all stuff before the loopback label)
   (b) copy nodes in I that aren't already in the pre header
   (c) (now copy the loopback label)
   (d) copy all nodes after the loopback label, that aren't in I

   again using T to facilitate a forwards traversal



Uh, ok.  So this requires some further study on the details of
the pre-header.  

Also, there are conflicting requirements on the representation of I.
We require I to be some kind of set so that we can quickly determine
whether a node is already part of it, in Pass 2.  But Pass 3 (b)
requires I to be array(ish) so that the order in which I's nodes
appear in the pre-header (after the copy) is the same as it was
to start with.
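
One way to reconcile those two requirements is an insertion-ordered set: a hash
set for the O(1) membership tests Pass 2 needs, plus a vector recording
insertion order for Pass 3 (b).  A minimal sketch, again using the toy Ins type
from the sketches above rather than Nanojit's LIns:

  #include <unordered_set>
  #include <vector>

  struct InvariantSet {
      std::unordered_set<Ins*> members;   // O(1) membership tests for Pass 2
      std::vector<Ins*>        order;     // insertion order for Pass 3 (b)

      bool contains(Ins* ins) const { return members.count(ins) != 0; }

      void add(Ins* ins) {
          if (members.insert(ins).second)
              order.push_back(ins);       // recorded once, in first-seen order
      }
  };

Since Pass 2 visits nodes in forward LIR-buffer order, 'order' automatically
ends up matching the nodes' original order, which is exactly what Pass 3 (b)
wants.
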
This bug is motivated by the fact that TM (not sure about TR, but probably) does lots of redundant stuff inside loops.  Eg. in this:

  for (var i = 0; i < N; i++) {
    x += a[i];
  }

various checks (is 'a' an array?  does it have 'dslots' allocated?) are done every time around the loop.  Loop invariant code hoisting will improve this by hoisting the checks outside the loop so they're only done once.

Another approach is to do some static analysis of the loop body to decide if the checks can be omitted altogether.  The above example doesn't have enough context to tell if the checks can be omitted, but if 'a' was accessed as an array just before the loop they probably could be, for example.  This static analysis wouldn't necessarily be at the LIR level -- in TM it would probably be easier at the JS bytecode level.

I'm not sure which of these ideas is better for handling this kind of thing.  There may be room for both of them.
Depends on: 517910
JS is full of annoying memory and control flow ambiguities, due to things such as getters and setters. Linear SSA helps a lot, though, so I'd focus on LIR level, not JS bytecode level -- unless you meant linearized JS bytecode as seen by the TM recorder.

/be
> so I'd focus on LIR level, not JS bytecode level

Yes, pushing optimizations into the level that generates the LIR tends
to make it very complex.  Where I was trying to get to with the
original alias set proposal in bug 517910 was to annotate LIR with
whatever information is needed to do the transformations we really
want.  Then, NJ can optimise the LIR without having to know anything
about JS semantics.  I'm all in favour of modularity in compiler
internals.
I totally agree NJ shouldn't have to know anything about the semantics of the front-end language.  I'm just saying that we should do optimisations wherever they have the best cost/benefit ratio -- that might be in NJ, it might be in the front-end.
Getting back to the general approach...

I think values that get hoisted may also need to be marked as live with a 'live' instruction at the end of the loop.

Hoisting increases register pressure, potentially quite a lot.

Also, this is interesting.  The first example I looked at:

  state = iparam 0 ecx           state

  label1:                        state label1
  sp = ld state[8]               state sp
  cx = ld state[0]               state sp cx
  eos = ld state[12]             eos state sp cx
  ld1 = ld cx[0]                 eos ld1 state sp
  eq1 = eq ld1, 0                eos eq1 state sp
  xf1: xf eq1 -> pc=0x955fc37 imacpc=(nil) sp+0 rp+0 (GuardID=001)
                                 eos state sp

ld1 is JSContext's 'operationCallBack'.  AFAICT this is set when a script has been running for too long and should be aborted.  But if you consider the LIR code in isolation it looks like you can hoist the 'ld1 = ld cx[0]' because there are no stores that can alias with cx[0] elsewhere in the loop.  But really cx[0] can be set by some other code that the LIR/Nanojit doesn't know about, so hoisting it is unsafe.

To handle that properly would require an annotation to say that a load is volatile.  I wonder how many other such loads there are.
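
A tiny sketch of what such an annotation could look like; the names here are
invented for illustration and are not existing Nanojit or TM API:

  #include <cstdint>

  struct LoadDesc {
      uint32_t aliasSet;    // regions this load reads
      bool     isVolatile;  // e.g. the JSContext word in the example above,
                            // which can be written behind the JIT's back
  };

  // A load may seed the invariant set only if it isn't volatile and nothing in
  // the loop writes the region it reads (S is the aggregate store alias set).
  static bool canSeedInvariants(const LoadDesc& ld, uint32_t S)
  {
      return !ld.isVolatile && (ld.aliasSet & S) == 0;
  }
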
Another complication is the fact that LIR pretends to be SSA but really isn't because it has jumps but no phi nodes.  If there are any jumps other than the back-to-the-top-of-the-loop jump things get more complicated.
Also, hoisting guards is tricky.  We have to make sure the VM state is consistent before taking a guard;  this mostly consists of doing stack stores.  I think we can't hoist any guards that are preceded by such stores.  But I think there's also no clear identification of which stores are the make-the-VM-state-consistent stores.
Whiteboard: PACMAN
Target Milestone: --- → Future
Attached patch extremely rough starting patch (obsolete) — Splinter Review
This is an extremely rough patch.  It works on this code:

  function f() {
      var sum = 0.5;
      var a = [1, 1];
      for (var i = 0; i < 1000; i++) {
          sum += a[1]; 
      }
      return sum;
  }
  print(f());

But I haven't really tried it on anything beyond that.  It successfully hoists all the constants (which doesn't make much difference as they are mostly folded into other operations, on x86 at least) and a couple of loads from InterpState.
That's not much but it shows this can work.

There's potential for hoisting more but it'll require alias analysis to do a much better job with the stack, ie. be able to track every stack slot separately.  That way any local variables that are read but not written can be hoisted (eg. 'a' above), which is crucial for this to be effective.

One annoying thing about the patch: when copying/rearranging the instructions into the new buffer, because of the pointer operands I have to maintain a table mapping the old LIns pointers to the new ones, and change operands as they're copied.  It's doable, though.

Another annoying thing is that you end up with lots of LIR_live instructions at the end, which fills up the AR.

I have no idea about the cost of this, the code is extremely rough and I haven't measured it at all.
Attached patch WIP patch, v1 (obsolete) — Splinter Review
Still very rough, but much better than the previous one.  I realised that I can do alias analysis on stack accesses accurately quite easily -- stack loads/stores are already marked with ACC_STACK, and then you can just look at the 'disp' value and treat all disp values as different regions.  So eg. sp[0] is a different region to sp[8], etc.
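
Here's a minimal sketch of that disp-based scheme, written against invented types rather than the real ACC_STACK machinery: each distinct displacement gets its own region bit so it slots into a bitmask alias-set representation.  (A real version would also have to consider access widths, which this toy ignores; and sharing bits once we run out only makes the analysis more conservative, never unsound.)

  #include <cstdint>
  #include <unordered_map>

  struct StackRegions {
      std::unordered_map<int32_t, uint32_t> bitForDisp;
      unsigned nextBit = 0;

      // Map sp[disp] to a region bit; distinct disps get distinct bits until
      // we run out, after which bits are shared (conservative but safe).
      uint32_t regionFor(int32_t disp) {
          auto it = bitForDisp.find(disp);
          if (it == bitForDisp.end())
              it = bitForDisp.emplace(disp, 1u << (nextBit++ % 32)).first;
          return it->second;
      }
  };

  // sp[0] and sp[8] now land in different regions, so a store to sp[0] no
  // longer kills the invariance of a load from sp[8].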

This allows many more values to be found as loop-invariant.  I've been using this code as my test case:

function f() {
    var sum = 0.5;
    var a = [0, 1, 2, 3, 4, 5, 6, 7];
    for (var i = 0; i < 100000000; i++) {
        sum += a[i & 0x7]; 
    }
    return sum;
}
print(f());

It's successfully hoisting everything except the guards.

This increases register pressure a lot.  In particular, handleLoopCarriedExprs() was trying to register-allocate every loop-live value.  This worked fine when there were only one or two (eg. 'state'), but was a disaster when there were a dozen or more.  So I turned that off and things got a lot better.

Despite all the hoisting, I'm only getting a 5% speedup on that loop so far.  This is because many of the hoisted values have to be spilled before the loop starts and thus reloaded within the loop at their use point.  And so far the hoisted values aren't any more expensive to compute than a store.  Hoisting guards will be crucial, I'll do that next.
Assignee: nobody → nnethercote
Attachment #456394 - Attachment is obsolete: true
Status: NEW → ASSIGNED
I've almost got guard-hoisting working, but there's a big snag.  If I hoist a guard I need to generate a new GuardRecord, which involves calling snapshot().  Unfortunately snapshot() depends on and updates various bits of TraceRecorder state (regs->pc, cx->regs, cx->fp) which I think are only correct mid-trace.  

Whereas I need to generate these snapshots once the recording of the fragment is finished, ie. I'm rewriting the final LIR.

Any suggestions?  I'm not sure how to proceed further.  I'm going to need help from someone who understands GuardRecords better than I do (which isn't difficult!)  This is currently a showstopper for a very promising optimisation...
(In reply to comment #10)
> I've almost got guard-hoisting working, but there's a big snag.  If I hoist a
> guard I need to generate a new GuardRecord, which involves calling snapshot(). 
> Unfortunately snapshot() depends on and updates various bits of TraceRecorder
> state (regs->pc, cx->regs, cx->fp) which I think are only correct mid-trace.

Only correct mid-recording, because the interpreter is calling the recorder for each (non-fused) opcode.

> Whereas I need to generate these snapshots once the recording of the fragment
> is finished, ie. I'm rewriting the final LIR.

If you always hoist to a known point before the loop, we could try to snapshot there and save the guard record in the trace recorder, for you to steal away.

IOW, is the hoisting point well-defined and easy to identify from the bytecode at recording time?

/be
(I'm reclassifying this from NJ to TM, because the current code is awfully TM-specific.)


(In reply to comment #11)

> If you always hoist to a known point before the loop, we could try to snapshot
> there and save the guard record in the trace recorder, for you to steal away.

That is the best hack, er, idea I've come up with as well :)

> IOW, is the hoisting point well-defined and easy to identify from the bytecode
> at recording time?

I think so.

Here's some TMFLAGS=recorder output:

    start
    ebx = parami 0 ebx
    esi = parami 1 esi
    edi = parami 2 edi
    state = parami 0 ecx
  **
    About to try emitting guard code for SideExit=0x86d804 exitType=TIMEOUT
    label1:
    sp = ldi.o state[8]
    rp = ldi.o state[24]
    cx = ldi.o state[0]
    eos = ldi.o state[12]
    eor = ldi.o state[28]
    ldi1 = ldi.csro cx[0]
    immi1 = immi 0
    eqi1 = eqi ldi1, immi1/*0*/
    xf1: xf eqi1 -> pc=0x40d566 imacpc=0x0 sp+0 rp+0 (GuardID=001)

  00022:  27  trace
  00023:  28  bindname "bitwiseAndValue"
  ...

I'm hoisting code to the "**" point, just before the loop header label.  I find the interleaving of bytecodes and LIR given by TMFLAGS=recorder very confusing, but I think that point corresponds to the TRACE bytecode.  If I do a speculative snapshot just before emitting the label I think/hope it'll work.

This is assuming that I don't hoist any stores.  AFAICT stores are the important part when it comes to connecting bytecode to LIR.  For example, for an ADD bytecode you'll have something like this LIR:

  x = ldd.s sp[0]
  y = ldd.s sp[8]
  z = addd x, y
  sp[0] = std.s z

It's the final store that commits the result, so you can hoist the two loads and the add (assuming they're loop invariant... sp[0] clearly isn't, but that's beside the point) without changing the meaning of anything.  What this means is that you can end up with lots of hoisted loads and arithmetic ops in the loop pre-header, which correspond to lots of bytecodes within the loop, but because the stores haven't moved the entire pre-header still corresponds to the TRACE bytecode.

If I start hoisting stores (I'm not sure if that's a good idea/worthwhile yet) then I'll need to make sure that any hoisted guards precede any hoisted stores in the pre-header, else the speculatively snapshotted GuardRecord will be wrong.

That's a lot of hand-waving, hopefully it's correct.  I'll try it on Monday.
Summary: NJ: loop invariant code hoisting → TM: loop invariant code motion (LICM)
Component: Nanojit → JavaScript Engine
QA Contact: nanojit → general
Target Milestone: Future → ---
(In reply to comment #12)
> (I'm reclassifying this from NJ to TM, because the current code is awfully
> TM-specific.)

Bummer -- I saw the bug description and thought that TR would get this for free :-) [Still, if this pans out I'm guessing we'll try adapting it for TR...]
(In reply to comment #13)
> 
> Bummer -- I saw the bug description and thought that TR would get this for free
> :-) [Still, if this pans out I'm guessing we'll try adapting it for TR...]

I'd be happy if it worked for TR as well, but I have no idea what TR code tends to look like, and thus how applicable this is.
TR-generated code is generally a plain old linearized control flow graph in the same order that our ABC bytecodes were generated.  I am guessing, without looking, that the TM trace recorder is identifying loops and therefore where expressions can be hoisted to.  TR would have to do its own analysis, but in an ideal world, once that was done, we could reuse some NJ infrastructure to do the hoisting.  (I freely admit the world is not always ideal.)

> One annoying thing about the patch is When copying/rearranging the instructions
> into the new buffer, because of the pointer operands I have to maintain a table
> mapping the old LIns pointers to the new ones, and change operands as they're
> copied.  It's doable, though.

I've run into this during experiments too.  Are you using a map indexed by LIns* or something else?  I was thinking that if we think of a LIR->LIR transformer as a LIR backend, then it could be sensible to add a new entry to the union containing SharedFields for an ID field.  This could be used to make an array that maps old LIns to new LIns during a full rewrite.

> Another annoying thing is that you end up with lots of LIR_live instructions at
> the end, which fills up the AR.

yeah, hoisting increases register/stack pressure, fact of life.
(In reply to comment #15)
> TR generated code is generally a plain old linearized control flow graph in the
> same order that our ABC bytecodes were generated.  I am guessing without
> looking, that the TM trace recorder is identifying loops and therefore where
> expressions can be hoisted to.  TR would have to do its own analysis, but in
> ideal world once that was done we could reuse some NJ infrastructure to do the
> hoisting.  (freely admit the world is not always ideal).

Yes, it might be possible.

> I've run into this during experiments too.  Are you using a map indexed by
> LIns* or something else?  I was thinking that If we think of a LIR->LIR
> transformer as a LIR backend, then it could be sensible to add a new entry to
> the union containing SharedFields, for an ID field. this could be used to make
> an array that maps old LIns to new LIns during a full rewrite.

I'm just using HashMap<LIns*,LIns*>, see "OldNewTable" in the patch.  As for the array, we can generate large loops in TM -- crypto-md5 in SunSpider has one with more than 10,000 instructions in it -- which might make an array map problematic.  Besides, I need to demonstrate good speed-ups on the generated code before worrying about reducing compile-time...
> 
> I'm just using HashMap<LIns*,LIns*>, see "OldNewTable" in the patch.

This works fine in most cases because def-points always precede use-points for data values.  But that isn't true for control flow -- for forward jumps a label is effectively used before it's def'd.  So I'll need an extra table that lets me go back and patch jumps when their target labels are copied.  Sigh, what a hassle.
Depends on: 552812
I found a way to avoid the hashtable lookup for converting operand pointers from old-to-new -- when an instruction is copied I overwrite the old copy with a pointer to the new copy.  This makes subsequent old-to-new lookups O(1).  It's a bit of a hack, but it reduces the overhead for crypto-md5 from 1.30x to 1.10x so I'll take it.
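
The trick in toy form (the real code does this to Nanojit LIns in place; these structs are invented for illustration): once an instruction has been copied, its old storage is reused to hold a forwarding pointer, and since defs are copied before their uses the pointer is always there by the time an operand needs translating.  (Forward branches to labels still need the separate patch table from comment 19.)

  #include <cassert>

  struct NewIns;                  // instruction in the new buffer

  struct OldIns {                 // instruction in the old buffer
      bool    copied  = false;
      NewIns* forward = nullptr;  // valid once 'copied' is set
      // ... original opcode/operand fields ...
  };

  static void noteCopied(OldIns* o, NewIns* n) {
      o->forward = n;             // clobber the old instruction's payload
      o->copied  = true;
  }

  static NewIns* translateOperand(OldIns* o) {
      assert(o->copied);          // data defs precede uses, so this holds
      return o->forward;          // O(1), no hash lookup
  }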

Still lots more to do, but alias analysis is going to have to improve further so more instructions can be marked as loop-invariant.  Bug 552812 is the first step for that and I just posted patches for it.
Attachment #456535 - Attachment is obsolete: true
Ah, the old leave-a-forwarding-address trick.

I keep telling you hacks are good things :-P. Great work here and in bug 552812.

/be
Blocks: 579285
Depends on: 584279
New version.  I found that register pressure was being increased a lot -- eg. it could hoist a dozen or more values that are used in the loop.  So I made the following changes:

- Only hoist invariant guards (and everything they depend on).

- If a hoisted value is used in both the loop pre-header and the loop body,
  I copy it twice, once in the pre-header and once in the body.  The idea is
  to not add any additional register pressure across the loop boundary.

Perf is something of a wash for SS -- some get better, some get worse (mostly due to compile-time, esp. crypto-md5 which is still 1.10x slower).
---------------------------------------------------------------
| millions of instructions executed                           |
| total                        | on-trace (may overestimate)  |
---------------------------------------------------------------
|   111.182   111.939 (0.993x) |    20.798    20.817 (0.999x) | 3d-cube
|    56.986    57.038 (0.999x) |    22.313    22.314 (------) | 3d-morph
|   147.869   148.886 (0.993x) |    21.982    22.887 (0.960x) | 3d-raytrace
|    77.551    77.482 (1.001x) |    17.401    17.400 (------) | access-binary-trees
|   183.724   183.858 (0.999x) |   103.707   103.709 (------) | access-fannkuch
|    41.567    42.080 (0.988x) |    17.553    17.759 (0.988x) | access-nbody
|    60.019    60.037 (------) |    27.013    27.013 (------) | access-nsieve
|    15.687    14.833 (1.058x) |     2.862     1.969 (1.453x) | bitops-3bit-bits-in-byte
|    43.306    43.316 (------) |    30.280    30.281 (------) | bitops-bits-in-byte
|    22.773    22.775 (------) |    10.219    10.219 (------) | bitops-bitwise-and
|    71.623    71.644 (------) |    39.453    39.453 (------) | bitops-nsieve-bits
|    42.397    42.363 (1.001x) |    26.655    26.654 (------) | controlflow-recursive
|    47.105    52.594 (0.896x) |     5.541     5.657 (0.979x) | crypto-md5
|    31.332    31.460 (0.996x) |     7.030     7.032 (------) | crypto-sha1
|   127.749   127.150 (1.005x) |    10.457     9.846 (1.062x) | date-format-tofte
|    99.146    99.183 (------) |    11.118    11.116 (------) | date-format-xparb
|    52.625    52.749 (0.998x) |    27.614    27.715 (0.996x) | math-cordic
|    38.424    38.357 (1.002x) |     6.252     6.065 (1.031x) | math-partial-sums
|    28.581    28.517 (1.002x) |    10.286     9.944 (1.034x) | math-spectral-norm
|   100.938   100.952 (------) |    75.045    75.045 (------) | regexp-dna
|    44.074    43.726 (1.008x) |     8.098     7.626 (1.062x) | string-base64
|   123.879   123.184 (1.006x) |    24.544    24.545 (------) | string-fasta
|   272.585   274.830 (0.992x) |     6.839     6.775 (1.009x) | string-tagcloud
|   251.380   251.376 (------) |     5.637     5.636 (------) | string-unpack-code
|    64.500    64.405 (1.001x) |     7.401     7.175 (1.031x) | string-validate-input
-------
|  2157.013  2164.746 (0.996x) |   546.111   544.666 (1.003x) | all
(In reply to comment #20)
> Perf is something of a wash for SS -- some get better, some get worse (mostly
> due to compile-time, esp. crypto-md5 which is still 1.10x slower).

That's a bit disappointing for something which probably took a lot of
futzing around to make work, I'd guess.

I was wondering if d{vander,mandelin}'s work to tune the JM/tracer
integration (bug 580468, excellent stuff btw) might tilt the field in
your favour.  As I read it, that might result in fewer absolutely huge
loops being traced, which would help offset the compile time losses
you're getting here, and might help reduce the extra register
pressure.
(In reply to comment #22)
> 
> I was wondering if d{vander,mandelin}'s work to tune the JM/tracer
> integration (bug 580468, excellent stuff btw) might tilt the field in
> your favour.  As I read it, that might result in fewer absolutely huge
> loops being traced, which would help offset the compile time losses
> you're getting here, and might help reduce the extra register
> pressure.

I'm currently avoiding the register pressure issue fairly well, but yes, this'll be worth looking at again once all the current big changes have landed.  One issue is that we don't spend that much time in trace code -- it's about 25% of total SunSpider time, for example, and in V8 the proportion is even less.  

I could also make it stop if the number of instructions hoisted isn't high enough -- eg. for the main loop of crypto-md5 it hoists 4 instructions out of 19,000+ which isn't exactly worth it.  But then, even working out those numbers requires a couple of passes over the code.
This version is much closer to something landable.  Notable things:

- When hoisting is done in a fragment, instead of rewriting the entire
  fragment, which is difficult and slow, we now insert the hoisted
  instructions by using LIR_skip instructions to trampoline to an
  out-of-line sequence and then back.  This also means we don't need to do
  any forward passes, so there's no need to build the vector of LIns
  pointers.
  
  crypto-md5 now only has a 9% instruction count increase when compiled with
  -j, and it's a pathologically bad case for LICM, having one loop with
  36,000 LIR instructions of which only two are hoisted, combined with a
  very short running time that means compile-time increases really hurt.
  Other than crypto-md5, the worst instruction count increase is 1% for all
  of Sunspider when running with -j (running with -j -m -p the increases are
  much less because less code is traced).

  (Full numbers are below.)

- Most of the new code is in two new files, tracejit/LICM.{cpp,h}.

- Kraken results aren't bad, though not as good as I'd like given how much
  effort went into the patch.  With -j the timing improvement is about
  1.03x, mostly due to big wins in ai-astar and audio-dft.  With -j -m -p
  the results are similar except that ai-astar currently doesn't trace (bug
  606890 is open to fix this) so the improvement is less.

- The one big question mark over this patch is the fact that it reorders
  guards.  It's not clear to me that this is safe in general (although
  jit-tests are all passing).  For example, if a guard that when taken
  causes an exception to be thrown is moved earlier than another guard, the
  exception behaviour of the program may change.

  This question needs to be resolved before this patch can be considered for
  landing.


Sunspider with -j (this is mostly a measure of how much extra compile-time
LICM incurs):
---------------------------------------------------------------
| millions of instructions executed                           |
| total                        | on-trace (may overestimate)  |
---------------------------------------------------------------
|    76.997    77.310 (0.996x) |    22.622    22.628 (------) | 3d-cube
|    38.279    38.367 (0.998x) |    23.129    23.131 (------) | 3d-morph
|   167.503   168.298 (0.995x) |    15.688    16.068 (0.976x) | 3d-raytrace
|   130.658   130.662 (------) |    537758    537875 (------) | access-binary-
|    83.818    83.768 (1.001x) |    74.571    74.317 (1.003x) | access-fannkuc
|    27.568    27.585 (0.999x) |    15.193    14.933 (1.017x) | access-nbody
|    30.483    30.171 (1.010x) |    23.624    23.277 (1.015x) | access-nsieve
|     7.005     6.136 (1.142x) |     2.858     1.965 (1.454x) | bitops-3bit-bi
|    34.364    34.380 (------) |    30.097    30.097 (------) | bitops-bits-in
|    14.069    14.074 (------) |    10.215    10.215 (------) | bitops-bitwise
|    36.272    34.723 (1.045x) |    31.217    29.621 (1.054x) | bitops-nsieve-
|   149.555   149.552 (------) |         0         0 (------) | controlflow-re
|    32.863    35.910 (0.915x) |     4.284     4.368 (0.981x) | crypto-md5
|    19.040    19.158 (0.994x) |     6.269     6.272 (------) | crypto-sha1
|    99.440    97.823 (1.017x) |    11.624    10.159 (1.144x) | date-format-to
|    69.238    69.295 (0.999x) |     9.548     9.547 (------) | date-format-xp
|    42.243    42.843 (0.986x) |    28.678    29.254 (0.980x) | math-cordic
|    22.397    22.315 (1.004x) |     6.158     6.002 (1.026x) | math-partial-s
|    21.533    21.468 (1.003x) |    13.190    12.938 (1.019x) | math-spectral-
|    48.510    48.515 (------) |    34.578    34.578 (------) | regexp-dna
|    27.459    27.486 (0.999x) |     9.026     9.026 (------) | string-base64
|    84.275    84.329 (0.999x) |    23.421    23.422 (------) | string-fasta
|   141.033   141.056 (------) |    12.243    12.243 (------) | string-tagclou
|   123.671   123.519 (1.001x) |     7.663     7.664 (------) | string-unpack-
|    42.286    42.316 (0.999x) |     8.212     8.213 (------) | string-validat
-------
|  1570.572  1571.069 (------) |   424.678   420.503 (1.010x) | all


Sunspider with -j -m -p:
---------------------------------------------------------------
| millions of instructions executed                           |
| total                        | on-trace (may overestimate)  |
---------------------------------------------------------------
|    68.494    68.649 (0.998x) |    43.768    43.771 (------) | 3d-cube
|    39.728    39.788 (0.998x) |    24.716    24.718 (------) | 3d-morph
|    65.557    65.555 (------) |    37.044    37.044 (------) | 3d-raytrace
|    24.407    24.403 (------) |    11.010    11.010 (------) | access-binary-
|    88.397    88.394 (------) |    83.306    83.306 (------) | access-fannkuc
|    28.390    28.258 (1.005x) |    16.264    16.008 (1.016x) | access-nbody
|    35.169    35.166 (------) |    28.735    28.735 (------) | access-nsieve
|     7.524     6.656 (1.130x) |     3.255     2.363 (1.378x) | bitops-3bit-bi
|    35.393    35.391 (------) |    30.400    30.400 (------) | bitops-bits-in
|    15.917    15.923 (------) |    12.019    12.019 (------) | bitops-bitwise
|    38.143    36.582 (1.043x) |    32.966    31.370 (1.051x) | bitops-nsieve-
|    17.137    17.134 (------) |    12.875    12.875 (------) | controlflow-re
|    24.011    24.015 (------) |    11.836    11.836 (------) | crypto-md5
|    19.922    20.035 (0.994x) |     8.528     8.531 (------) | crypto-sha1
|    67.306    67.310 (------) |    21.026    21.026 (------) | date-format-to
|    63.581    63.578 (------) |     9.945     9.945 (------) | date-format-xp
|    43.648    44.242 (0.987x) |    29.513    30.086 (0.981x) | math-cordic
|    22.939    22.852 (1.004x) |     6.310     6.154 (1.025x) | math-partial-s
|    31.372    31.390 (0.999x) |    26.189    26.189 (------) | math-spectral-
|    48.370    48.362 (------) |    34.580    34.580 (------) | regexp-dna
|    28.804    28.830 (0.999x) |     9.277     9.277 (------) | string-base64
|    63.958    63.966 (------) |    32.064    32.064 (------) | string-fasta
|   103.304   103.182 (1.001x) |    17.211    17.211 (------) | string-tagclou
|    97.861    97.671 (1.002x) |    12.829    12.829 (------) | string-unpack-
|    43.169    43.203 (0.999x) |     8.573     8.573 (------) | string-validat
-------
|  1122.513  1120.548 (1.002x) |   564.251   561.933 (1.004x) | all


Kraken with -j (this is a best-case measurement for LICM):
---------------------------------------------------------------
| millions of instructions executed                           |
| total                        | on-trace (may overestimate)  |
---------------------------------------------------------------
|  3217.152  2646.218 (1.216x) |  3001.706  2430.546 (1.235x) | ai-astar
|  1860.749  1870.246 (0.995x) |  1536.491  1545.498 (0.994x) | audio-beat-det
|  1257.911   985.993 (1.276x) |  1098.428   826.386 (1.329x) | audio-dft
|  1685.241  1664.531 (1.012x) |  1225.797  1204.295 (1.018x) | audio-fft
|  2392.974  2367.852 (1.011x) |  1588.545  1563.964 (1.016x) | audio-oscillat
|  5871.686  5897.334 (0.996x) |  4693.845  4719.372 (0.995x) | imaging-gaussi
|  2327.179  2326.063 (------) |   739.492   737.894 (1.002x) | imaging-darkro
|  4927.677  4969.903 (0.992x) |  3752.846  3795.033 (0.989x) | imaging-desatu
|   659.733   660.350 (0.999x) |    10.214    10.214 (------) | json-parse-fin
|   455.590   455.667 (------) |     5.148     5.148 (------) | json-stringify
|  1547.472  1541.430 (1.004x) |   599.730   594.972 (1.008x) | stanford-crypt
|   867.638   857.053 (1.012x) |   319.781   313.732 (1.019x) | stanford-crypt
|  1133.817  1131.367 (1.002x) |   609.569   606.929 (1.004x) | stanford-crypt
|   555.423   552.587 (1.005x) |   199.499   196.206 (1.017x) | stanford-crypt
-------
| 28760.248 27926.599 (1.030x) | 19381.099 18550.196 (1.045x) | all


Kraken with -j -m -p:
---------------------------------------------------------------
| millions of instructions executed                           |
| total                        | on-trace (may overestimate)  |
---------------------------------------------------------------
|  4198.443  4198.522 (------) |  3907.436  3907.436 (------) | ai-astar
|  1979.742  1960.657 (1.010x) |  1353.759  1334.330 (1.015x) | audio-beat-det
|  1302.076  1025.964 (1.269x) |  1135.932   859.698 (1.321x) | audio-dft
|  1709.187  1688.923 (1.012x) |  1248.676  1227.172 (1.018x) | audio-fft
|  2585.718  2585.346 (------) |  1778.296  1778.290 (------) | audio-oscillat
|  6973.203  6999.127 (0.996x) |  4784.010  4809.827 (0.995x) | imaging-gaussi
|  3337.549  3334.614 (1.001x) |   748.425   745.135 (1.004x) | imaging-darkro
|  6017.814  6060.041 (0.993x) |  3836.515  3878.702 (0.989x) | imaging-desatu
|   659.792   660.409 (0.999x) |    10.221    10.221 (------) | json-parse-fin
|   496.092   496.164 (------) |     5.932     5.932 (------) | json-stringify
|  1288.445  1272.218 (1.013x) |   746.812   746.486 (------) | stanford-crypt
|   704.767   699.804 (1.007x) |   361.605   356.196 (1.015x) | stanford-crypt
|  1153.845  1144.825 (1.008x) |   585.183   576.007 (1.016x) | stanford-crypt
|   528.984   525.988 (1.006x) |   199.318   196.024 (1.017x) | stanford-crypt
-------
| 32935.664 32652.608 (1.009x) | 20702.127 20431.463 (1.013x) | all
Attachment #459330 - Attachment is obsolete: true
Attachment #463072 - Attachment is obsolete: true
Comment on attachment 491065 [details] [diff] [review]
patch, v4 (against TM 57130:bc000c1509ac)

Sayre has expressed interest in getting this into the tree, even if it's turned off for Firefox 4.0.  So I'm asking for reviews.

Ed, I'm asking you for feedback about just the LIR.h/LIR.cpp changes.  Most notably, the whole idea of using LIR_skip to trampoline to out-of-line chunks of code.  If we start allowing this it will preclude us ever changing our LIR operands from pointers to offsets.  That will preclude bug 578264, for one.  I'm ok with this -- I think the out-of-line trampolines could end up being used quite extensively, as they're a low-cost way to rewrite LIR that's already been written into a buffer -- but I'd like your feedback.
Attachment #491065 - Flags: review?(gal)
Attachment #491065 - Flags: feedback?(edwsmith)
Attachment #491065 - Flags: review?(jseward)
Comment on attachment 491065 [details] [diff] [review]
patch, v4 (against TM 57130:bc000c1509ac)

Commentary, musing, not for/against the patch:

I've been tempted several times to stuff extra info into SharedFields,
so I like the getExtra/setExtra() tweak.  I always lament the fact that
on x64, a whole machine word is there, sitting idle (maybe we could use it,
but pragmatically, we can't).

Tamarin's SeqReader class (CodegenLIR.cpp) is used to stitch together
two LIR buffers that we generate simultaneously.  One is the prologue.
If we could use LIR_skip to connect them, we can drop the reader and
save one virtual call per emitted LIR instruction (real costs that add up).

We also (still) use LIR_skip to erase instructions;  much as I'd like to 
clean this up, the cleanup is to incorporate dead-store elimination in-line with
the assembly pass, which ends up looking remarkably similar to StackFilter.
The fact that it hasn't been worth the work to do it yet is a data point
when considering the fate of LIR_skip.  It seems genuinely useful, is all.

(I'm really off topic now, please forgive me).  

Heck, if getExtra/setExtra wasn't used for LICM, then could it be used to
hold a sequence number, which then could be used to optimize CSE lua-style?

A sequential id in each LIns could be useful several ways... lua-style cse, and maybe also for making fast-to-index maps using LIns->id as a key, instead of a hashtable using LIns* as a key.  It would burn one word on 32bit machines, and fit inside wasted padding space on 64bit machines.

Replacing the whole SharedFields struct with a 32-bit id (or 8-bit opcode + 24-bit id) would a) give us a sequence number, and b) let stages build arrays indexed by LIns->id, for their own data.  (24 bits is enough: "16M instructions is enough for anybody".)  Things like inReg/inAr could pack into bitsets if desired, arIndex could be bigger if desired, and so on.  The extra load per access (arIndex[lins->id] vs lins->arIndex) could be a deal killer, but maybe not.

I've also experimented somewhat with a traditional CFG that holds
lists of LIns*, then a Reader that feeds them to Assembler in the required
order.  Nothing worth landing, but it makes me think that the super-tight
LuaJIT-style IR (integer offsets, contiguous instructions) is not in our future.
Emitting code with some extra LIR_skips here and there would allow reordering
blocks or inserting code somewhat arbitrarily, without extra buffer copying
or chained LirReaders.

Nitty things about the patch:

Is LIR_comment supposed to be in there or is that an accident?   (okay with me either way, it looks useful).

I take it we should update Tamarin to use overwriteWithSkip().

I skimmed LICM.cpp/h but didn't go looking for bugs.

A comment on arIndex's definition that it's invalid during LIR generation, and only holds a real arIndex when inAr==1, would be a good idea.  Should get/setExtra assert when inAr==1?
Attachment #491065 - Flags: feedback?(edwsmith) → feedback+
(In reply to comment #26)
> 
> Heck, if getExtra/setExtra wasn't used for LiCM, then could it be used to
> hold a sequence number, which then could be used to optimize CSE lua-style?
> 
> A sequential id in each LIns could be useful several ways... lua-style cse, and
> maybe also for making fast-to-index maps using LIns->id as a key, instead of a
> hashtable using LIns* as a key.  It would burn one word on 32bit machines, and
> fit inside wasted padding space on 64bit machines.

The 24 non-opcode bits in SharedFields are only used during assembly.  So we could use them for a sequential id in any of the other passes.  The fast-to-index maps are tricky, though, because you don't know how big to make them in the writer pipeline.

(In reply to comment #26)
> Is LIR_comment supposed to be in there or is that an accident?   (okay with me
> either way, it looks useful).

Not sure what you mean, can you clarify?
 
> A comment on arIndex's definition that its invalid during LIR generation, and
> only holds a real arIndex when inAr==1, would be a good idea.  Should
> get/setExtra assert when inAr==1?

Good idea.
> > Is LIR_comment supposed to be in there or is that an accident?   (okay with me
> > either way, it looks useful).
> 
> Not sure what you mean, can you clarify?

When I first glanced over the patch, it looked as though it included both the LICM code and the new LIR_comment opcode, which I remembered from a different patch, and wondered if they got lumped together by accident.  Looking again, that's not the case -- you just added LIns.getComment(), which is no big deal.
(In reply to comment #24)
> 
> - The one big question mark over this patch is the fact that it reorders
>   guards.  It's not clear to me that this is safe in general (although
>   jit-tests are all passing).  For example, if a guard that when taken
>   causes an exception to be thrown is moved earlier than another guard, the
>   exception behaviour of the program may change.
> 
>   This question needs to be resolved before this patch can be considered for
>   landing.

http://article.gmane.org/gmane.comp.lang.lua.general/58908 says this:

  Code hoisting via unrolling and copy-substitution (LOOP):
  Traditional loop-invariant code motion (LICM) is mostly useless
  for the IR resulting from dynamic languages. The IR has many
  guards and most subsequent instructions are control-dependent on
  them. The first non-hoistable guard would effectively prevent
  hoisting of all subsequent instructions.

It then describes what LuaJIT2 does instead:

  The LOOP pass does synthetic unrolling of the recorded IR,
  combining copy-substitution with redundancy elimination to
  achieve code hoisting. The unrolled and copy-substituted
  instructions are simply fed back into the compiler pipeline,
  which allows reuse of all optimizations for redundancy
  elimination. Loop recurrences are detected on-the-fly and a
  minimized set of PHIs is generated.

which I don't understand at all.
When I read that I thought he was talking about loop peeling; make one copy of the loop body, and put it first, so it dominates the loop body, and redundant expressions in the body get deleted.

Back on TT, one way I thought about doing it was to keep the trace recorder recording for one extra loop iteration.  The code fragment is shaped like a little 'b' with a longer stem.  I never got to it, and I'd imagine it would really hurt compile time for long loop bodies (crypto?)
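
To make the peeling idea concrete, here is a purely source-level C++ illustration (hypothetical code, not what LuaJIT or TM actually emit): once one iteration is copied in front of the loop, the guard inside the loop is dominated by the peeled copy, so ordinary redundancy elimination can delete it without a separate LICM pass.

  #include <cstdlib>

  struct Obj { bool isDenseArray; const double* elems; };

  static bool guardDenseArray(const Obj* a) { return a->isDenseArray; }
  static void sideExit()                    { std::abort(); }  // stand-in for leaving the trace

  // Before peeling: the guard runs on every iteration.
  static double sumBefore(const Obj* a, int n) {
      double sum = 0.0;
      for (int i = 0; i < n; i++) {
          if (!guardDenseArray(a)) sideExit();
          sum += a->elems[i];
      }
      return sum;
  }

  // After peeling one iteration: the guard in the steady-state loop is
  // dominated by the peeled copy, so redundancy elimination removes it.
  static double sumAfter(const Obj* a, int n) {
      double sum = 0.0;
      if (n > 0) {
          if (!guardDenseArray(a)) sideExit();   // peeled, runs once
          sum += a->elems[0];
          for (int i = 1; i < n; i++)
              sum += a->elems[i];                // no guard here any more
      }
      return sum;
  }
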
(In reply to comment #30)
> When I read that I thought he was talking about loop peeling; make one copy of
> the loop body, and put it first, so it dominates the loop body, and redundant
> expressions in the body get deleted.

Ah, that makes sense.  Why couldn't he have just said that? :)

> Back on TT, one way I thought about doing it was to keep the trace
> recorder recording for one extra loop iteration.  The code fragment is shaped
> like a little 'b' with a longer stem.  I never got to it, and I'd imagine it
> would really hurt compile time for long loop bodies (crypto?)

Just about anything extra done by the compiler hurts compile-time of crypto-md5, it's such a pathological case.  E.g. LICM causes a 9% instruction count increase.
Blocks: 622494
Attachment #491065 - Flags: review?(jseward)
Attachment #491065 - Flags: review?(gal)
With JM can we re-try this?
(In reply to comment #32)
> With JM can we re-try this?

Comment 24 already did so.  It's maybe a 1% speedup for Kraken.  Combine that with (a) the fact that I'm not confident the transformation is valid, and (b) the spectre of IonMonkey coming over the horizon and trampling TraceMonkey into the dust, and my motivation to work on this further is limited.
TM's days are numbered:  WONTFIX.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → WONTFIX