Closed Bug 536630 Opened 10 years ago Closed 9 years ago

TM: awful LIR generated for bitwise-and.js (and everything else)

Categories

(Core :: JavaScript Engine, defect)

defect
Not set

Tracking

()

RESOLVED FIXED

People

(Reporter: njn, Assigned: njn)

References

Details

Yesterday I was trying to gauge the quality of Nanojit's LIR-to-native
translations.  Nothing leapt out at me, I think there may be some tweaks to
do but the quality is fairly good.  This isn't so surprising because LIR is
pretty close to machine code.

What did leap out at me was the LIR itself, which to my eyes looks, well,
awful.  For example, bitops-bitwise-and.js looks like this:

    bitwiseAndValue = 4294967296;
    for (var i = 0; i < 600000; i++)
        bitwiseAndValue = bitwiseAndValue & i;

The bytecode looks like this:

00022:  27  trace
00023:  28  bindname "bitwiseAndValue"
00026:  28  name "bitwiseAndValue"
00029:  28  getgvar "i"
00032:  28  bitand
00033:  28  setname "bitwiseAndValue"
00037:  27  gvarinc "i"
00040:  27  pop
00041:  27  getgvar "i"
00044:  27  uint24 600000
00048:  27  lt

The LIR looks like this (instructions indented to the right are optimised
away by StackFilter or dead-code elimination):

  00: start
  01: ebx = iparam 0 ebx
  02: esi = iparam 1 esi
  03: edi = iparam 2 edi
  04: state = iparam 0 ecx
  05: label1:
  06: sp = ld state[8]
                                  07: rp = ld state[24]
  08: cx = ld state[0]
  09: eos = ld state[12]
                                  10: eor = ld state[28]
  11: ld1 = ld cx[0]
                                  12: 0 = int 0
  13: eq1 = eq ld1, 0
  14: xf1: xf eq1 -> pc=0x30cba6 imacpc=0x0 sp+0 rp+0 (GuardID=001)
                                  15: obj = int 2961408
                                  16: sti sp[0] = obj 
                                  17: ld2 = ld cx[148]
                                  18: ld3 = ld ld2[56]
                                  19: map = ld ld3[0]
  20: ld4 = ld eos[664]
                                  21: $global0 = i2f ld4 
                                  22: sti sp[8] = ld4
  23: ld5 = ld eos[656]
                                  24: $global1 = i2f ld5
                                  25: sti sp[16] = ld5
  26: and1 = and ld4, ld5
                                  27: i2f1 = i2f and1
                                  28: sti sp[8] = and1
                                  29: map = ld obj[0]
  30: sti eos[664] = and1
                                  31: 1 = float 1
                                  32: 1 = int 1
  33: add1 = add ld5, 1
  34: ov1 = ov add1
  35: xt1: xt ov1 -> pc=0x30cbb5 imacpc=0x0 sp+0 rp+0 (GuardID=002)
  36: i2f2 = i2f add1
                                  37: sti sp[0] = ld5
  38: sti eos[656] = add1
  39: sti sp[0] = add1
  40: 600000 = float 600000
                                  41: 600000 = int 600000
  42: sti sp[8] = 600000
  43: flt1 = flt i2f2, 600000
  44: xf2: xf flt1 -> pc=0x30cbc0 imacpc=0x0 sp+16 rp+0 (GuardID=003)
  45: sti eos[664] = and1
  46: sti eos[656] = add1
  47: j -> label1
  48: live state
  49: x1: x  -> pc=0x30cba6 imacpc=0x0 sp+0 rp+0 (GuardID=004)

I sure was surprised to see so much LIR generated for such a small loop.
Some thoughts I had about this:   

- I wonder if a quick optimisation of the bytecode would be a win.  The 
  gvarinc "i" / pop / getgvar "i" sequence seems sub-optimal.  That could
  also help with interpretation speed.

- There sure is a lot of memory traffic via 'eos' and 'sp'.  Some of the
  latter cases are removed by StackFilter but it still seems like this could
  be improved, perhaps in the bytecode-to-LIR translation itself.
  - I imagine some of the write-backs are necessary because of the guards,
    maybe these could be moved to exit stubs?
  - Instruction 45 seems redundant w.r.t. instruction 30, maybe applying
    StackFilter to eos would be beneficial?  Likewise instructions 38 and
    46.
  - Writing 600000 to sp[8] is pointless.
  - I also saw this code sequence for bitops-bits-in-byte.js:
                     
      44: ld1 = ld sp[-8]
      46: sti sp[-8] = ld1

    And ld1 had no other uses.  There was an instruction between them that
    was dead and so removed, but still, yuk.

- It's depressing to see overflow checks on the i++.  That for-loop pattern
  must be really common.

- i is an integer for the bitwise-and and the increment, but is a float for
  the loop limit test, necessitating the i2f (and FP comparisons generate
  much worse machine code on i386/x64 than integer comparisons).

- The operation callback stuff (the first guard) is also depressing, but I
  don't know much about it.

- It's a shame that the bytecode-to-LIR translation is printed one basic
  block at a time by TMFLAGS=recorder, it would be easier to understand
  the translation if it were one bytecode instruction at a time.

I realise I'm wading into an area of the implementation that I know very
little about, but I wonder if there is some low-hanging fruit here that I
should pursue before eking out another couple of percent from Nanojit
codegen.  Suggestions are welcome!
  - Writing 600000 to sp[8] is pointless.

If the guard fails, this value has to be on the interpreter stack. But yeah, in general all these stores to the stack on the fast path are just awful. We should be sinking stores to side exits or figuring out how to spill to only one stack and not two.

- It's depressing to see overflow checks on the i++

Language semantics :(

- i is an integer for the bitwise-and and the increment, but is a float for
  the loop limit test

That's worth looking into, it might have something to do with the global var being float and there being an & operator.

- The operation callback stuff (the first guard) is also depressing, but I
  don't know much about it.

This is for the slow script dialog in the browser. WebKit (according to Luke) pins the timeout to a fixed register. We check a variable in the JSContext.
(In reply to comment #1)
> 
> - The operation callback stuff (the first guard) is also depressing, but I
>   don't know much about it.
> 
> This is for the slow script dialog in the browser. WebKit (according to Luke)
> pins the timeout to a fixed register. We check a variable in the JSContext.

I just tried taking it out, the changes to SunSpider were in the noise.  This matches what dmandelin told me.  (But if we got rid of all the other crap I wonder if its significance would increase.)
Operation callback is used for gc too, not just the slow script dialog, iirc...

We generally have plans for unifying the interp and jit stacks, right?

Applying stackfilter to eos might be a pretty simple quick-and-dirty thing to try and measure...
Oh, and fixing the gvarinc/pop/getgvar thing would basically involve optimizing our bytecode a bit right after generating it, right?  I wonder how often such patterns come up...
(In reply to comment #4)
> Oh, and fixing the gvarinc/pop/getgvar thing would basically involve optimizing
> our bytecode a bit right after generating it, right?  I wonder how often such
> patterns come up...

All the time for C-style for (i=0;i<N;i++) do_something(); loops, because we compile them like so:

i=0;
goto L2;
L1: do_something();
i++;
L2: if (i<N) goto L1;

We need a peephole optimizer. I dimly recall a bug being on file already. Can anyone spot it?

On the overflow check: language semantics requires it in general, but not for this case where we can see i start at 0 and be manipulated only by i++. True it is a global variable, so aliased as a property -- but if we can rule out a getter or setter for bitwiseAndValue that could mutate i to be on the edge of overflowing to a double, then we could avoid the overflow check.

This calls for a separate bug: cheap static analysis of loop variables to avoid overflow guards. Please file or cite if already filed.

/be
I already commented on these things on IRC but I figured I should say something again here for posterity.

> - I wonder if a quick optimisation of the bytecode would be a win.  The 
>   gvarinc "i" / pop / getgvar "i" sequence seems sub-optimal.  That could
>   also help with interpretation speed.

Peephole optimization of bytecode? Seems reasonable, although I don't have any real evidence yet that it would be a big win. It also seems to me that some of those kinds of optimizations could be expressed in the tracer or the baseline-compiler-to-be.

> - There sure is a lot of memory traffic via 'eos' and 'sp'.  Some of the
>   latter cases are removed by StackFilter but it still seems like this could
>   be improved, perhaps in the bytecode-to-LIR translation itself.
>   - I imagine some of the write-backs are necessary because of the guards,
>     maybe these could be moved to exit stubs?
>   - Instruction 45 seems redundant w.r.t. instruction 30, maybe applying
>     StackFilter to eos would be beneficial?  Likewise instructions 38 and
>     46.
>   - Writing 600000 to sp[8] is pointless.
>   - I also saw this code sequence for bitops-bits-in-byte.js:
> 
>       44: ld1 = ld sp[-8]
>       46: sti sp[-8] = ld1
> 
>     And ld1 had no other uses.  There was an instruction between them that
>     was dead and so removed, but still, yuk.

I agree there's way too many stack loads/stores. But I'm not sure which you are concerned about: (a) more LIR -> longer compilation time, or (b) loads/stores getting through to the native code and making the generated code bigger and slower.

Either way, some basic approaches for fixing this:

    (1) My personal favorite is to dispense with the stack loads/stores from
        the LIR, and instead maintain on the side (say in TraceRecorder) a
        representation of what logically is on the stack, i.e., a mapping from
        stack location to LIR instruction. Presumably we already have this in
        the tracker. The state of this map needs to be snapshotted at every
        side exit. Then, when we are generating the code, every time we do a
        side exit, we generate stores for the current stack state either on
        the main trace or in a side exit compensation block. When we do generate
        stores on trace, we can track that so that we don't generate them again
        next exit if unchanged. What I like about this is that it minimizes
        the intermediate info that needs to be tracked, and doesn't need a
        high-quality general-purpose LIR optimizer to get rid of the stores.

    (2) Alternatively, use a high-quality dead store elimination. :-) This
        doesn't help shrink the LIR, though.

> - It's depressing to see overflow checks on the i++.  That for-loop pattern
>   must be really common.

It seems to me they should be pretty predictable, so they may not hurt us that much. We should measure. It might also be possible to mitigate by scheduling
1 or 2 non-side-effecting ops that would come after the oflow branch to run before it instead, or something like that.

Otherwise, I think we should demote less aggressively, for this reason and so that we generate fewer traces. But in this particular test we would probably want to demote anyway, because we are doing bitops. I've been recommending that we try out some of these codes with various demotion decisions for a while now, so that we have a clearer idea of what the effect really is.

The static analysis of loop bounds idea is worth pursuing, *if* measurements suggest it will give a significant boost.

> - i is an integer for the bitwise-and and the increment, but is a float for
>   the loop limit test, necessitating the i2f (and FP comparisons generate
>   much worse machine code on i386/x64 than integer comparisons).

Why is the loop limit test done with floats? That seems clearly wrong. IMO we should strive for "connected" variables on a trace to use the same representation ("connected" meaning they are operands to the same operation).

> - It's a shame that the bytecode-to-LIR translation is printed one basic
>   block at a time by TMFLAGS=recorder, it would be easier to understand
>   the translation if it were one bytecode instruction at a time.

Agreed.
Depends on: 536641
Depends on: 536642
(In reply to comment #5)
> On the overflow check: language semantics requires it in general, but not for
> this case where we can see i start at 0 and be manipulated only by i++.

"and stopping at 60000", I meant to add.

> True it
> is a global variable, so aliased as a property -- but if we can rule out a
> getter or setter for bitwiseAndValue that could mutate i to be on the edge of
> overflowing to a double, then we could avoid the overflow check.

We don't have to do compile-time property vs. global alias analysis, thanks to js_LeaveTraceIfGlobalObject.

This all seems doable, even the canonical for-loop analysis, given our def-use chains through the AST. No CFG full of BBs, and no peephole optimization pass overhead required.

/be
(In reply to comment #5)
>
> We need a peephole optimizer. I dimly recall a bug being on file already. Can
> anyone spot it?

I couldn't find any open JS bugs involving the word "peephole" so I filed bug 536642.

> This calls for a separate bug: cheap static analysis of loop variables to avoid
> overflow guards. Please file or cite if already filed.

Bug 536641.


> Peephole optimization of bytecode? Seems reasonable, although I don't have any
> real evidence yet that it would be a big win. It also seems to me that some of
> those kinds of optimizations could be expressed in the tracer or the
> baseline-compiler-to-be.

For this one, I don't see how we'll get the evidence without implementing a peephole optimizer.  Fortunately they are pretty easy things to implement.

> I agree there's way too many stack loads/stores. But I'm not sure which you are
> concerned about: (a) more LIR -> longer compilation time, or (b) loads/stores
> getting through to the native code and making the generated code bigger and
> slower.

I imagine (b) is more important, but it doesn't really matter because we'll get both (a) and (b) if we can improve things.

>     (2) Alternatively, use a high-quality dead store elimination. :-) This
>         doesn't help shrink the LIR, though.

bz was talking about something similar, but I don't see how NJ can do much better than StackFilter.  I'd be happy to be enlightened.

> > - i is an integer for the bitwise-and and the increment, but is a float for
> >   the loop limit test, necessitating the i2f (and FP comparisons generate
> >   much worse machine code on i386/x64 than integer comparisons).
> 
> Why is the loop limit test done with floats? That seems clearly wrong. IMO we
> should strive for "connected" variables on a trace to use the same
> representation ("connected" meaning they are operands to the same operation).

I filed bug 536643 for this.
(In reply to comment #8)
> (In reply to comment #5)
> >     (2) Alternatively, use a high-quality dead store elimination. :-) This
> >         doesn't help shrink the LIR, though.
> 
> bz was talking about something similar, but I don't see how NJ can do much
> better than StackFilter.  I'd be happy to be enlightened.

Hmmm, I think I forgot half of what I meant to say. Does NJ do full store-after-store dead store elimination already? That was one piece of what I was talking about.

The other thing that would be needed that I totally forgot about is a store-sinking optimization. It would have to know that stack stores have no effect until you take a side exit and use that to push them into the side exit.
(In reply to comment #9)
>
> Hmmm, I think I forgot half of what I meant to say. Does NJ do full
> store-after-store dead store elimination already? That was one piece of what I
> was talking about.

StackFilter does this for sp-relative and rp-relative stores within a basic block.  (Although bz recently found an edge case of interest that it misses, bug 536458.)  Doing it also for eos might be worthwhile, although I think in the example above that alone won't help because all the repeats are in different basic blocks.
 
> The other thing that would be needed that I totally forgot about is a
> store-sinking optimization. It would have to know that stack stores have no
> effect until you take a side exit and use that to push them into the side exit.

Sounds like what dvander said in comment 1.  Should I file a bug for that as well?
(In reply to comment #10)
> > The other thing that would be needed that I totally forgot about is a
> > store-sinking optimization. It would have to know that stack stores have no
> > effect until you take a side exit and use that to push them into the side exit.
> 
> Sounds like what dvander said in comment 1.  Should I file a bug for that as
> well?

Please do. The idea has come up several times but we haven't done anything with it yet. I think the argument against is that it will increase code size. Whether the perf gains are worth the increase is something we probably can't figure out until we actually do it.
(In reply to comment #11)
> > > The other thing that would be needed that I totally forgot about is a
> > > store-sinking optimization.
> > 
> > Should I file a bug for that as well?
> 
> Please do.

Bug 537842.
Depends on: 536643, 537842
OS: Mac OS X → All
Hardware: x86 → All
(In reply to comment #0)
>
>   - Instruction 45 seems redundant w.r.t. instruction 30, maybe applying
>     StackFilter to eos would be beneficial?  Likewise instructions 38 and
>     46.
>   - I also saw this code sequence for bitops-bits-in-byte.js:
> 
>       44: ld1 = ld sp[-8]
>       46: sti sp[-8] = ld1
> 
>     And ld1 had no other uses.  There was an instruction between them that
>     was dead and so removed, but still, yuk.

Bug 551885 fixes these cases.
Depends on: 551885
FWIW, the LIR for bitwise-and.js is substantially better now:

      ebx = parami 0 ebx
      esi = parami 1 esi
      edi = parami 2 edi
      state = parami 0 ecx
      label1:
      sp = ldi.o state[8]
      cx = ldi.o state[0]
      eos = ldi.o state[12]
      ldi3 = ldi.csro cx[0]
      immi3 = immi 0
      eqi1 = eqi ldi3, immi3/*0*/
      xf2: xf eqi1 -> pc=0x90add56 imacpc=(nil) sp+0 rp+0 (GuardID=001)

      ldi2 = ldi.o eos[1384]
      ldi1 = ldi.o eos[1376]
      andi1 = andi ldi2, ldi1
      sti.o eos[1384] = andi1
      immi2 = immi 1
      addxovi1 = addxovi ldi1, immi2/*1*/ -> pc=0x90add65 imacpc=(nil) sp+0 rp+0 (GuardID=002)

      sti.o eos[1376] = addxovi1
      sti.s sp[0] = addxovi1
      immi1 = immi 600000
      sti.s sp[8] = immi1/*600000*/
      lti1 = lti addxovi1, immi1/*600000*/
      xf1: xf lti1 -> pc=0x90add70 imacpc=(nil) sp+16 rp+0 (GuardID=003)

      j -> label1
      livei state
      x1: x  -> pc=0x90add56 imacpc=(nil) sp+0 rp+0 (GuardID=004)

There's nothing egregiously stupid (except for the dead guard at the end, but that doesn't matter much).
This tracking bug has served its purpose, even though some blocking bugs are still open.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.