Closed Bug 495569 Opened 11 years ago Closed 10 years ago

[meta] Semantics level cleanups to LIR (in nanojit)

Categories

(Core Graveyard :: Nanojit, defect)

defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED
Future

People

(Reporter: jseward, Unassigned)

References

Details

Attachments

(1 file)

User-Agent:       Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.10) Gecko/2009042700 SUSE/3.0.10-1.1 Firefox/3.0.10
Build Identifier: 

This is a meta-bug, intended as a recording-place for information
regarding cleaning up the semantics of LIR a bit.

I think the following adjustments to LIR would be beneficial in
the longer term, and would not require major changes to the
existing Nanojit implementation, nor slow it down.

(1) Clearly distinguish between values and effects (or, if you
    like, expressions and statements) in LIR

(2) Formalise the concept of "LIR temporary" (virtual register,
    variable, etc)

(3) Get rid of LIR_ov.  It has an implicit dependency through a
    notional, unstated %eflags, to the immediately preceding
    instruction.  Not currently sure what to replace it with, tho.

(4) Possibly make the type system a bit stronger, and have types
    for LIR temporaries as well as for the operations.  Write a LIR
    typechecker, to check for correctly typed, SSA-form LIR.

(5) (Further ahead) Get rid of hardwired amodes in ld/st, and 
    require addresses simply to be arbitrary LIR temporaries.

Expected benefits:

(1) Make the implementation (LIR generation, transformation, conversion
    to native code) simpler, easier to reason about, easier to verify
    as correct.  Similarly, make it easier to reason about correctness
    of proposed LIR transformations.

(2) Put us in a good position to do instruction selection off trees,
    which should increase the quality of the code we can generate.

LIR at present is almost equivalent to straight-line SSA, although
LIR_ov breaks that.  I believe its removal would make LIR be pretty
much bog-standard SSA (the straight-line aspect, anyway, no need
for phi nodes here).


Reproducible: Always
Status: UNCONFIRMED → NEW
Ever confirmed: true
One way to get rid of LIR_ov is to have a specialised form of
exit instruction, which combines an add, an assignment of a
result to a LIR temporary, and an exit.  So instead of (eg) this

  ...
  ld1 = ld sp[-16]
  ...
  ld2 = ld sp[-8]
  ...
  add1 = add ld1, ld2
  ov1 = ov add1
  xt1: xt ov1 -> pc=0x81d9e24 imacpc=(nil) sp+16 rp+0
  # add1 is used later, ov1 is not

we'd have this:

  ...
  ld1 = ld sp[-16]
  ...
  ld2 = ld sp[-8]
  ...
  add1 = x_ov_add ld1, ld2 -> pc=0x81d9e24 imacpc=(nil) sp+16 rp+0
  # add1 is used later

It removes the implicit dependency between LIR_ov and the
immediately preceding add, yet retains the ability to
generate "addl %rS,%rD ; jo ...", and retains also the property
that no LIR statement assigns to more than one temporary.
A question: in https://developer.mozilla.org/en/Nanojit/LIR
there are a bunch of load variants (ldcb, ldcs, ldc, ldqc) which
"may be removed by common subexpression elimination".

What propert(ies) is this intended to signify?

* that the loaded-from location will not be stored to by any
  instruction in this fragment?

* that the load is guaranteed not to take an exception?

* something else?
Don't we need ov for other operations as well (sub, if not mul)?

This sounds like it would let us kill the guard more easily, which would certainly be nice!
> Don't we need ov for other operations as well (sub, if not mul)?

Yes, sure.  Presumably this would have to be done for some small
subset of binary and unary ops; afaict add, sub, mul.
(In reply to comment #2)
> A question: in https://developer.mozilla.org/en/Nanojit/LIR
> there are a bunch of load variants (ldcb, ldcs, ldc, ldqc) which
> "may be removed by common subexpression elimination".
> 
> What propert(ies) is this intended to signify?
> 
> * that the loaded-from location will not be stored to by any
>   instruction in this fragment?
> 
> * that the load is guaranteed not to take an exception?
> 
> * something else?

the 'c' variants are meant to mean:
 - a second load with the same effective address will return the same value
 - if the result is not used, the load can be eliminated
(In reply to comment #1)
> One way to get rid of LIR_ov is to have a specialised form of
> exit instruction, which combines an add, an assignment of a
> result to a LIR temporary, and an exit.  So instead of (eg) this
> 
>   ...
>   ld1 = ld sp[-16]
>   ...
>   ld2 = ld sp[-8]
>   ...
>   add1 = add ld1, ld2
>   ov1 = ov add1
>   xt1: xt ov1 -> pc=0x81d9e24 imacpc=(nil) sp+16 rp+0
>   # add1 is used later, ov1 is not
> 
> we'd have this:
> 
>   ...
>   ld1 = ld sp[-16]
>   ...
>   ld2 = ld sp[-8]
>   ...
>   add1 = x_ov_add ld1, ld2 -> pc=0x81d9e24 imacpc=(nil) sp+16 rp+0
>   # add1 is used later
> 
> It removes the implicit dependency between LIR_ov and the
> immediately preceding add, yet retains the ability to
> generate "addl %rS,%rD ; jo ...", and retains also the property
> that no LIR statement assigns to more than one temporary.

seems fine.

thinking ahead:
 - with more arithmetic operators (mul, sub?)
 - and also handling the jump instructions (jt, jf)

will this combining of operator+branch lead to too many IR instructions?

LIns field width should not be a problem.  The Risk of many combined
op+branch instructions is footprint of LIR itself, complexity in
optimization phases and the backends, etc. 

this could be low risk if we dont anticipate a big explosion of ops.
(in reply to Comment #5)

What I was getting at was that it might be preferable to record the
underlying properties of the load, as an annotation, rather than one specific
transformation that those properties permit.

For example, here we would say "(1) load is not aliased by any stores in the 
fragment, and (2) load does not except".  Taken together, these would allow
CSEing with a load of the same size and location, as you say.

But taken separately: (1) by itself also appears to allow us to float the
load forwards/backwards past arbitrary stores, and has nothing to do with
CSE.  Similarly (2) by itself might have some bearing on the possibility of
floating the load forwards/backwards past exits without changing the exception
behaviour.
Those would be good improvements.  

What if we followed the CallInfo pattern and invented a struct that would be referenced by load & store instructions, another building block in a LIR type system?

LLVM might be an example for us to follow (loads/stores take an address computed by an address-computation instruction which takes a reference, a type descriptor, and a field designator and/or index).  we dont necessarily need all the bells and whistles of LLVM, but a simple type model would be great.
also, all the loads and stores in LIR currently assume no exception behavior.
We need all ops, add/sub/mul/div, and potentially all forms of exits and jumps (xt/xf/jt/jf), so thats a lot of different LIR opcodes. We also might need LIR_cs in the future, which is another 16 combination. The new LIR_div/LIR_mod pair is a similar problem. LIR_mod has an invisible dependency on a LIR_div being close by. I would think we want to model the invisible data flow explicitly with a pseudo register instead of forming the cross product of the opcodes. No?

In my experience with writing the Hotpath optimization passes you almost always handle all loads together, independently of the type. We had one LOAD op, and it had the type and constant-ness as flags (constant or loop-invariant). We could do the same with the LIR encoding. LIR_load | LIR_word | LIR_constant. Generating code for this would be pretty trivial.

That would work for the side exits as well if we don't want to encode the data flow. LIR_xt | LIR_add | LIR_ov. However, I imagine this case will look a bit messy in the code generator, and filters also would have to be written carefully to decide what this really is. An arithmetic op, or a side exit. Might work though.

The more general problem are inline exits, which Ed and I talked about a while ago. We would like to have side exits that behave like arithmetic ops and can be dead code eliminated if they have no uses. Right now a LIR_ov/LIR_xt chain always keeps a LIR_add alive, even if LIR_add has no uses. That shouldn't be the case.

Maybe we can model LIR_ov and LIR_xt together? And model flags<>register moves?

r = LIR_add b, c
c = LIR_flags(r) // this must immediately follow the add, keep 1 return reg format

... // if nothing is in between, c is not materialized into an actual register

i = LIR_ov|LIR_xt r, c, exit

x = LIR_mul 1, i // if this goes dead, everything above is killed

Just thinking out loud. This probably needs quite a bit more work.
What if ld/st/xt/xf took a LIR_ref as operand;  A LIR_ref being an address computation?  

It would be easy to cse like-refs and also during code gen we'd be able to easily fold simple refs (i.e. base+small immediate).  I'm thinking this would also make the exits easier to process as a LIR_xt would look like a 'store' and if its associated LIR_ref was not 'used' the expression would not be generated.
(In reply to comment #6)
> will this combining of operator+branch lead to too many IR instructions?
> 
> LIns field width should not be a problem.  The Risk of many combined
> op+branch instructions is footprint of LIR itself, complexity in
> optimization phases and the backends, etc. 
> 
> this could be low risk if we dont anticipate a big explosion of ops.

I think it's easy to overestimate the added complexity of a few more (clean) ops, and easy to underestimate the complexity of treacherous instructions like LIR_ov that pollute an otherwise clean semantics.

For example, Valgrind's current IR (which Julian designed) has a lot of opcodes (it handles SIMD instructions, among other things) but it's semantically very clean: it is SSA, clearly distinguishes statements from expressions, is type-safe, and avoids hacks.  This makes it very pleasant to deal with, and it's surprisingly difficult to generate incorrect IR.  Valgrind's original IR, although much smaller and thus seemingly simpler, was not as semantically clean and generally much more error-prone and less pleasant to deal with.

If Julian's suggestions leads to a lot of new ops then perhaps another way should be found, but having instructions that must appear near other instructions is asking for trouble.
(In reply to comment #10)
> We need all ops, add/sub/mul/div, and potentially all forms of exits and jumps
> (xt/xf/jt/jf), so thats a lot of different LIR opcodes. We also might need
> LIR_cs in the future, which is another 16 combination. The new LIR_div/LIR_mod
> pair is a similar problem. LIR_mod has an invisible dependency on a LIR_div
> being close by. I would think we want to model the invisible data flow
> explicitly with a pseudo register instead of forming the cross product of the
> opcodes. No?

I'm not sure what a pseudo-register is, but it sounds like a plan to replace one hack with another hack :)
 
It's also worth mentioning bug 493125, which aims to remove LIR_cs, because it's currently dead in both TM and Tamarin.
I dislike LIR_ov just as much as the next guy. But I would prefer replacing it with something better, not just with something different :)

As for LIR_cs, if we want to support unsigned types, we are going to need it.
As long as we're here, would it be possible to document the remaining opcodes in LIRopcode.tbl?  I did some of them at one point, but I just couldn't figure out what things like LIR_ov were supposed to be at the time...  I think that would make it a lot easier to read and understand LIR-producing code (like jstracer.cpp).
Lots of discussion here, nice. I've a (probably naive) question:

Is it possible (or pleasant) to model a LIR instruction as defining 0..N values rather than always "exactly one"? It seems to me that most insns produce a bunch of flag bits, and several x86 insns write to >1 register. Also, *all* the SIMD stuff works on operand bundles, and some "pure effect" operations really don't produce values at all.. it seems like "1 LIns == 1 defined value" is possibly a source of muddle.

If we let each LIns define multiple indexed values, we could then refer to an operand by a pair<LIns*,size_t> or something, rather than just a LIns*. If you wanted to access a flag produced by a flag-setting operator, you'd treat the flag-set like any other store produced by the setting insn, and have to abide by the constraint of not floating any other flag-set insn in between producer and consumer, but otherwise it'd be like any other bit of data flow (living in a special register class of flag bits and "read" through conditional jumps, of course).

I am presumably missing an obvious complication of this scheme?
(in reply to Comment #16)

> Is it possible (or pleasant) to model a LIR instruction as defining 0..N
> values rather than always "exactly one"?

I'm sure it's possible, although I don't actually know.  A bit of dredging
around the literature on SSA would probably throw something up.

> It seems to me that most insns produce a bunch of flag bits, and several
> x86 insns write to >1 register.

Uh .. do you really want to represent (and have to optimise around, etc)
that stuff in an architecture-neutral LIR?  imo that's a job for the 
instruction selectors, downstream of the LIR machinery.

The tricky bits I forsee, w.r.t multiple outputs, are (1) Andreas'
desire to have a single x86 'div' insn produce both the div and mod
results, and (2) how to use ARM addressing modes effectively, 
specifically the ones that write the effective address back to a
register after the load/store.  In both cases these insns write
two registers.

Ideally you don't want to represent in the LIR operations that produce
more than one value, as that complicates the LIR (== slows the JIT)
for the majority of ops (vanilla adds etc) which produce only one
value.

> it seems like "1 LIns == 1 defined value" is possibly a source of muddle.

Right.  That's why I favour partitioning LIR into LIR statements, which
are executed for their side effects (eg, write register, write memory,
take an exit, do a branch) and LIR expressions, which denote a pure value
(eg, constant, read register, load, unary op, binary op, call to guaranteed
side-effect free helper function).  Statements contain expressions, and
the top level definition of an LIR fragment becomes "sequence of statements".
(In reply to comment #16)
> Is it possible (or pleasant) to model a LIR instruction as defining 0..N values
> rather than always "exactly one"? It seems to me that most insns produce a
> bunch of flag bits, and several x86 insns write to >1 register.

The x86 is a difficult case, because 1) all instructions set condition bits in the same register, 2) all conditional branches check that register, and 3) that register is hard to move things into and out of (you can push and pop %eflags, or move a portion of it to and from %ah, and that's it).  So while one *could* treat the flags as another value that the instruction yields, the cost of using that particular kind of value changes drastically depending on the intervening instructions.  You'd really want to treat those values specially.  Thus the appeal of fused test/branch instructions.
(In reply to comment #18)

> The x86 is a difficult case, because 1) all instructions set condition bits in
> the same register, 2) all conditional branches check that register, and 3) that
> register is hard to move things into and out of (you can push and pop %eflags,
> or move a portion of it to and from %ah, and that's it). 

I was thinking of modeling each *bit* as its own value, not the complete status register as a single value. Each conditional branch variant is only sensitive to a few of the bits, and not every instruction sets all of the bits.

> treat the flags as another value that the instruction yields, the cost of using
> that particular kind of value changes drastically depending on the intervening
> instructions.  You'd really want to treat those values specially.  

I'm not sure how it differs from eg. the split between ALU and FPU instructions, that can only act on a particular register class. Or SSE ops. Here you have a register class called "bit in the status word" that only a few kinds of conditional jump can read from. 

An expr/statement split to separate between values and effects is also reasonable though, and will probably help with modeling a variety of other issues. Thought I'd mention another possibly-simpler alternative to handle flags alone, is all.
For comparison-in-the-literature sake:

GCC RTL used to use an implicit adjacent-instruction constraint on the condition code register[1], but seems to have switched to using pseudo-register representation, somewhat like I was describing.

LLVM has a nutty (or clever?) conditional branch instruction[2] that takes a special "inline assembly reference" operand that denotes the test-that-sets-the-condition, and then transplants/fuses it with the branch. So there's no opportunity for the branch and the flag-setting code to separate.

Clearly, the field's wide open! :)

[1] http://gcc.gnu.org/onlinedocs/gccint/Condition-Code.html
[2] http://llvm.org/docs/LangRef.html#i_br
(In reply to comment #17)
> 
> The tricky bits I forsee, w.r.t multiple outputs, are (1) Andreas'
> desire to have a single x86 'div' insn produce both the div and mod
> results, and (2) how to use ARM addressing modes effectively, 
> specifically the ones that write the effective address back to a
> register after the load/store.  In both cases these insns write
> two registers.

ARM predicated instructions also aren't being used at all, right?
A machine-independent IR is always a compromise. It will never capture all the fancy instruction formats (idiv on x86), wonky wide effects (cc updates) or scheduling properties (conditional branches and delay slots anyone?). I think we should try to keep LIR simple and fast to generate and consume (assemble into native code). Micro-optimizations like conditional arithmetic ops and such are not really were we lose performance right now. Introducing guards that are expressions and not always live statements would buy us a lot more IMO than modeling cc properly and thus being able to CSE a bit better. Just my 2c.
You're right we shouldn't worry about micro-optimisations, so comment #21 is a distraction, and should be ignored, sorry.

But the original point of the bug is that LIR has some hacks in it: LIR_ov, the new div instruction, etc.  Getting rid of them would be good.  And that's independent from micro-optimisations.
So do we have a simple way of dealing with LIR_ov thats better than what we have right now? I don't like it either, but I went with this approach because I couldn't think of anything else. I don't think the issue is serious enough to warrant a complete redesign of the IR.
Since JavaScript and Lua semantics are similar enough when it comes to
requirements for IR design, I thought I'd share what the IR in LuaJIT 2.x
looks like:

Every instruction has either 0, 1 or 2 operands -- no exceptions. Every
instruction has either 0 or 1 result -- no exceptions. All of this is
independent of the instruction class or guard property. High orthogonality
of the IR pays off big time for traversals, analysis, debug output etc.

There are four classes of instructions:
N = normal instructions (same as R = ref ops)
L = loads
A = allocations
S = stores

Additionally every instruction can be a guard (G), i.e. it may branch to a
side exit or not. This is independent of the instruction class.

Optimizations can be applied depending on this categorization:
FOLD: only for N
CSE: only for N (but independent of G or not)
FWD: only for L
SINK: only for S and A
DCE: ok for N, L, A, but only if not G
LOOP: simply pipe the IR again through all opts to get the variant part

There are no separate conditional branch instructions. Instead there are
guarded assertions (EQ, NE, LT, LE, GT, GE, ABC and so on). This is
beneficial in several ways: assertions are subject to CSE, serve as inputs
to VRP and simplify the assembler a lot.

Overflow guards are separate instructions. And no, you don't need that many.
I have ADDOV and SUBOV and that's it. Everything else would be lost in the
noise. This also nicely solves the problem where you want to apply algebraic
simplifications to ADD, but not to ADDOV. And the assembler can use lea
instead of an add for integer ADD, but not for ADDOV (lea does not affect
the flags, but it helps to avoid register moves).

All loads and stores point to a reference op (except for some simple cases,
like loads from Lua stack slots). A ref op provides an address, but also
serves as the common anchor point for alias analysis (AA). Using regular
arithmetic ops for address computations would impede or prevent any
higher-level analysis. E.g. an array store cannot alias a hash load; now try
to deduce this from a mess of adds/shifts/base-reg loads -- won't happen.

Ref ops can be a guard, too (like a hash table lookup). Most loads do type
checks, i.e. they are also guards (the whole IR is typed). The assembler
fuses references into loads and stores if the CPU has an addressing mode
that matches the ref op. E.g. x86 has [base+k*idx+ofs] which can be used for
array indexing. Some ref ops like unspecialized hash refs cannot be fused
and only provide an address. The assembler also fuses loads into other
instructions (see Bug 444682 for an example).

Since there are no instructions with two outputs, combining DIV/MOD or
SIN/COS is done differently: there is already an auxiliary hash table used
to pass on sparse info about the IR, such as analysis results or predicted
hash slots for constant keys. It's indexed by info type and IR instruction
number. This links a SIN to the COS and vice versa if they have the same
argument. The assembler uses this to suppress whichever instruction it
encounters first (the last one, since it goes backwards) and to fill two
registers when it encounters the other one.

Modeling target CPU features in a medium-level IR is almost always a losing
proposition. I have argued in the past one doesn't need a low-level IR,
since this lowering can be done on-the-fly by the assembler. Things like
flag bits, instruction combining or fusion are orthogonal aspects and have
no place in a target-independent IR.

There are only a couple of places where the target architecture bleeds
through to the IR. E.g. x86 CPUs automatically mask shift amounts by the
operand width, but ARM CPUs don't. You don't want to emit one bitwise AND
for every shift on x86, only to bother the assembler with its elimination.
But you want to have them in the IR for ARM to be able to reassociate,
simplify and CSE them. So the IR emitter and the FOLD rules differ in this
case, but I think that's an acceptable compromise.

Appended below is an excerpt of the LJ2 IR instruction definitions. And
here's a small example of traced bytecode, optimized IR and generated x86
code for this simple Lua snippet:

  local x = 0; for i=1,100 do x = bit.bxor(x, i) end

[Don't get distracted by the fact that JS has the ^ operator and wouldn't
need to resolve the bit.bxor function. It would need to do something like
this for builtins and functions in other examples.]

---- TRACE 0 start test.lua:1
[... Self-explanatory bytecode :-) ]
0006  GGET     5   0      ; "bit"
0007  TGETS    5   5   1  ; "bxor"
0008  MOV      6   0
0009  MOV      7   4
0010  CALL     5   2   3  ; bit.bxor
0011  MOV      0   5
0012  FORL     1 => 0006
---- TRACE 0 IR
[... Legend: '>': guard, '+': PHI ref, '*': dead ins]
0001 >  int SLOAD  #2    notc            // i, narrowed to int
0002 >  fun SLOAD  #0    notc            // Function at frame base
0003 >  fun FRAME  0002  test.lua:1
0004    tab FLOAD  test.lua:1  func.env  // Load function env ('globals')
0005    int FLOAD  0004  tab.hmask
0006 >  int EQ     0005  +63             // Specialize to hash table size
0007    int FLOAD  0004  tab.node
0008 >  int HREFK  0007  "bit" [slot 43] // Hash ref, specialized to slot
0009 >  tab HLOAD  0008
0010    int FLOAD  0009  tab.hmask
0011 >  int EQ     0010  +15             // Specialize to hash table size
0012    int FLOAD  0009  tab.node
0013 >  int HREFK  0012  "bxor" [slot 11]// Hash ref, specialized to slot
0014 >  fun HLOAD  0013
0015 >  num SLOAD  #1                    // x
0016 >  fun FRAME  0014  bit.bxor        // Frame guard, specialize dispatch
0017    int TOBIT  0015  bias            // Double -> int32 with bit rules
0018  + int BXOR   0017  0001            // x xor i
0019  + int ADD    0001  +1              // Overflow check eliminated
0020 >  int LE     0019  +100
0021 ------ LOOP -----------             // Note: loads and dispatch hoisted!
0022  * num TONUM  0018                  // Compensate type mismatch (dead)
0023  + int BXOR   0019  0018            // Note: folded TOBIT(TONUM(x)) ==> x
0024  + int ADD    0019  +1
0025 >  int LE     0024  +100
0026    int PHI    0019  0024            // PHI for i (narrowed to int)
0027    int PHI    0018  0023            // PHI for x (narrowed to int)
---- TRACE 0 mcode 257
[... Only the inner loop is shown]
->LOOP:
f7fa2fed  33F5              xor esi, ebp
f7fa2fef  83C501            add ebp, +0x01
f7fa2ff2  83FD64            cmp ebp, +0x64
f7fa2ff5  0F8EF2FFFFFF      jle 0xf7fa2fed     ->LOOP
f7fa2ffb  E990AD0A10        jmp 0x0804dd90     ->EXIT_0
---- TRACE 0 end

** Glossary:

FOLD: Constant Folding, Algebraic Simplifications and Reassociation
CSE:  Common-Subexpression Elimination
FWD:  Load and Store Forwarding (L2L and S2L) using Alias Analysis (AA)
SINK: Store Sinking and Allocation Sinking.
DCE:  Dead-Code Elimination
VRP:  Value-Range Propagation
ABC:  Array-Bounds Check

** Excerpt from IR instruction definition:

// Constants grow downwards from HEAD, hidden/inlined by dump
KPRI    N   ___  ___
KINT    N   cst  ___
KGC     N   cst  ___
KNUM    N   cst  ___

// Guarded assertions
LT      G   ref  ref
GE      G   ref  ref
LE      G   ref  ref
GT      G   ref  ref
EQ      G   ref  ref
NE      G   ref  ref
ABC     G   ref  ref
...

// Bit ops
BNOT    N   ref  ___
BAND    N   ref  ref
BOR     N   ref  ref
BXOR    N   ref  ref
BSHL    N   ref  ref
...

// Arithmetic ops
ADD     N   ref  ref
SUB     N   ref  ref
MUL     N   ref  ref
DIV     N   ref  ref
FPMATH  N   ref  lit
NEG     N   ref  ref
ABS     N   ref  ref
...

// Overflow-checking arithmetic ops
ADDOV   G   ref  ref
SUBOV   G   ref  ref

// Memory ops. A = array, H = hash, U = upvalue, F = field, S = stack.
// Ref ops
AREF    R   ref  ref
HREFK   RG  ref  ref
HREF    R   ref  ref
UREFO   RG  ref  lit
UREFC   RG  ref  lit
FREF    R   ref  lit

// Loads and stores
ALOAD   LG  ref  ___
HLOAD   LG  ref  ___
ULOAD   LG  ref  ___
FLOAD   L   ref  lit
SLOAD   LG  lit  lit

ASTORE  S   ref  ref
HSTORE  S   ref  ref
USTORE  S   ref  ref
FSTORE  S   ref  ref

// Allocations
TNEW    A   lit  lit
SNEW    A   ref  ref
...

// Type conversions
TOINT   G   ref  ___
TONUM   N   ref  ___
TOBIT   N   ref  ref
...

// Misc
FRAME   G   ref  ref
PHI     N   ref  ref
HEAD    G   ___  ___
LOOP    G   ___  ___
...
Depends on: 496693
(In reply to comment #25)
> Overflow guards are separate instructions. And no, you don't need that many.
> I have ADDOV and SUBOV and that's it. Everything else would be lost in the
> noise. This also nicely solves the problem where you want to apply algebraic
> simplifications to ADD, but not to ADDOV. And the assembler can use lea
> instead of an add for integer ADD, but not for ADDOV (lea does not affect
> the flags, but it helps to avoid register moves).

That's pretty convincing.
Without a separate overflow check instruction, how does LuaJIT trap overflows from multiplies?
LuaJIT doesn't narrow multiplication, because this is rarely beneficial.
Instead it performs strength-reduction and hoisting which covers most
uses of multiplication in indexing expressions. You definitely don't
want to narrow a multiply which is not used for indexing.

That's why I said, you really don't need more than ADDOV and SUBOV.

[
A slight digression:

IMHO TraceMonkey's narrowing is too aggressive and counterproductive.
I've mentioned this in a previous message to the Tamarin mailing list:

| Generous use of integer operations with overflow guards is more
| expensive than the equivalent FP operations.

The problem is related to how current x86/x64 super-scalar CPUs work:

- All branches (such as the 'jo' for overflow checks) issue to a single
  execution port. A high frequency of branches (add+jo, sub+jo, add+jo,
  etc.) effectively serializes your code, because this execution port
  gets clogged up.

- Overflow branches are highly predictable, so in theory they could all
  move out of the latency chain. But the CPU has to keep track of all of
  the renames to be able to restore the state in case the prediction
  turned out to be wrong. There is a finite amount of reorder buffers,
  so this quickly leads to a stall. Basically the reorder buffer fills
  up faster (up to 3 integer ops per cycle) than it can be emptied (only
  up to 1 branch per cycle). Modern CPU hardware resources are balanced
  to work best for much lower branch frequencies than what TraceMonkey
  generates.

- Replacing an FP add with an integer add plus an overflow check is
  turning a pure data-flow dependency into a different data-flow
  dependency (with slightly lower latency) *plus* a control dependency.
  In general, you don't want to do this. Latencies due to data-flow
  dependencies can be well hidden by out-of-order execution. The
  slightly longer latency of FP operations turns out to be a non-issue,
  except in very tight loops.

- The total CPU execution bandwidth is the sum of the bandwidth of the
  FP and the integer units, because they execute in parallel. The FP
  units have an equal or higher bandwidth than the integer units. Not
  using them means losing execution bandwidth. Moving work away from
  them to the already quite busy integer units is a losing proposition.

So that's the reason why both plain Lua and the LuaJIT 2.x interpreter
always keep all numbers as FP values. This greatly simplifies the
codebase and has been benchmarked as being faster than a mixed FP/int
variant. The latencies for FP operations and conversions are irrelevant
due to other overhead of the interpreter.

OTOH the LuaJIT 2.x JIT compiler generates much tighter code, so it
needs to do something about the FP->int conversion latencies caused by
array indexing. It uses predictive narrowing for induction variables and
demand-driven narrowing for indexing expressions. Additionally it can
eliminate or hoists most overflow checks resulting from narrowing.
Regular numeric computations are never narrowed to integers.

BTW: this also means there can be no FP/int type mismatches (which seems
to be a major source of complexity in TM).

Details upon request, but it's already getting OT for this bug.
]
This is a comparison of the VM running with integer demotion and without. Demotion is a 6% speedup overall. The speedup is quite dramatic on tight loops with bit banging but its also a clear speedup across the board. The data was obtained by poisoning the oracle that inhibits future demotion once we detected a misspeculation for an expression once.

TEST                   COMPARISON            FROM                 TO             DETAILS

=============================================================================

** TOTAL **:           1.060x as fast    1010.2ms +/- 0.5%   953.4ms +/- 0.3%     significant

=============================================================================

  3d:                  -                  143.8ms +/- 1.7%   143.2ms +/- 0.6% 
    cube:              1.115x as fast      44.9ms +/- 5.1%    40.3ms +/- 1.6%     significant
    morph:             *1.123x as slow*    27.1ms +/- 0.9%    30.4ms +/- 0.9%     significant
    raytrace:          *1.008x as slow*    71.8ms +/- 0.4%    72.4ms +/- 0.2%     significant

  access:              1.046x as fast     139.6ms +/- 0.2%   133.4ms +/- 0.3%     significant
    binary-trees:      1.016x as fast      39.6ms +/- 0.5%    39.0ms +/- 0.4%     significant
    fannkuch:          1.006x as fast      56.7ms +/- 0.3%    56.3ms +/- 0.4%     significant
    nbody:             ??                  25.2ms +/- 0.5%    25.4ms +/- 0.8%     not conclusive: might be *1.005x as slow*
    nsieve:            1.42x as fast       18.1ms +/- 0.6%    12.7ms +/- 1.3%     significant

  bitops:              1.98x as fast       70.4ms +/- 0.4%    35.6ms +/- 0.9%     significant
    3bit-bits-in-byte: 1.91x as fast        2.9ms +/- 3.0%     1.5ms +/- 9.3%     significant
    bits-in-byte:      3.30x as fast       26.3ms +/- 0.5%     8.0ms +/- 1.4%     significant
    bitwise-and:       6.22x as fast       15.4ms +/- 0.9%     2.5ms +/- 5.8%     significant
    nsieve-bits:       1.092x as fast      25.8ms +/- 0.6%    23.6ms +/- 0.9%     significant

  controlflow:         ??                  32.7ms +/- 2.3%    32.8ms +/- 0.7%     not conclusive: might be *1.005x as slow*
    recursive:         ??                  32.7ms +/- 2.3%    32.8ms +/- 0.7%     not conclusive: might be *1.005x as slow*

  crypto:              1.078x as fast      59.5ms +/- 1.7%    55.2ms +/- 0.7%     significant
    aes:               1.051x as fast      34.1ms +/- 2.0%    32.4ms +/- 1.0%     significant
    md5:               -                   14.9ms +/- 2.4%    14.7ms +/- 0.9% 
    sha1:              1.30x as fast       10.5ms +/- 1.4%     8.1ms +/- 1.3%     significant

  date:                1.026x as fast     136.3ms +/- 0.6%   132.8ms +/- 0.2%     significant
    format-tofte:      1.055x as fast      67.5ms +/- 0.8%    63.9ms +/- 0.3%     significant
    format-xparb:      -                   68.8ms +/- 0.4%    68.8ms +/- 0.3% 

  math:                1.28x as fast       48.8ms +/- 1.7%    38.2ms +/- 0.6%     significant
    cordic:            1.53x as fast       29.2ms +/- 2.1%    19.1ms +/- 0.5%     significant
    partial-sums:      ??                  13.2ms +/- 1.6%    13.4ms +/- 1.0%     not conclusive: might be *1.015x as slow*
    spectral-norm:     1.101x as fast       6.3ms +/- 2.1%     5.8ms +/- 2.1%     significant

  regexp:              1.010x as fast      45.7ms +/- 0.8%    45.3ms +/- 0.4%     significant
    dna:               1.010x as fast      45.7ms +/- 0.8%    45.3ms +/- 0.4%     significant

  string:              *1.011x as slow*   333.4ms +/- 0.5%   337.0ms +/- 0.6%     significant
    base64:            1.054x as fast      17.8ms +/- 0.9%    16.9ms +/- 0.6%     significant
    fasta:             ??                  73.4ms +/- 1.1%    73.5ms +/- 1.1%     not conclusive: might be *1.001x as slow*
    tagcloud:          *1.012x as slow*    98.4ms +/- 0.7%    99.5ms +/- 0.8%     significant
    unpack-code:       *1.026x as slow*   113.5ms +/- 0.4%   116.5ms +/- 0.6%     significant
    validate-input:    *1.011x as slow*    30.3ms +/- 0.5%    30.6ms +/- 0.9%     significant
(In reply to comment #29)
> This is a comparison of the VM running with integer demotion and without.
> Demotion is a 6% speedup overall.

Is this an implicit argument for keeping LIR_ov?  If so, I don't think anyone is arguing that the feature embodied by LIR_ov should be removed, but rather that the current implementation could be made to have cleaner semantics.
#29 was a followup for #28, quantifying the speedup LIR_ov gets us.
(In reply to comment #30)
> (In reply to comment #29)
> > This is a comparison of the VM running with integer demotion and without.
> > Demotion is a 6% speedup overall.
> 
> Is this an implicit argument for keeping LIR_ov?  If so, I don't think anyone
> is arguing that the feature embodied by LIR_ov should be removed, but rather
> that the current implementation could be made to have cleaner semantics.

Mike was arguing that LuaJIT got away with much less narrowing-to-ints than we have (and I was chiming in support for this notion on IRC). But nobody implied we have *none*. Last I checked the memory controller doesn't like floating point addresses :)
(In reply to comment #32)

Plus I don't think we should get too enamoured of running everything
possible on the FPU.  That might be viable on big heavy CPUs, but on
ARMs with minimal FPUs or softfloat it doesn't sound too good.
The comparison in #29 is missing the point. I said you *do* need
narrowing for indexing expressions and induction variables. Turning it
completely off is a no go.

Obviously bit operations need narrowing, too. But there are no overflow
conditions, so this is OT. You can get away with simple backpropagation,
if the ES semantics for bit operations allow this transform:
  tobit(add_fp(x, y)) ==> add_int(tobit(x), tobit(y))
(With add_int() wrapping around and ignoring overflow.)

[BTW: IMHO that measurement script is statistically unsound. For
fannkuch it shows +-0.3% and +-0.4% variation, i.e. significantly
overlapping ranges of 54.99-58.40 vs. 54.05-58.55. And then the script
goes on to *conclusively* state a significant difference of 0.06%?!?

Apart from that, measurements in the milli/microsecond range have too
much systematic error. This can lead to fictitious differences for only
slightly different code. You can't eliminate this error with repeated
measurements, only with longer measurements.]


But back to the topic: the only remaining argument for keeping LIR_ov 
(and LIR_x*) was that the number of combinations with other instructions
is too high. I dispute that on the basis that a) there are many fewer
combinations that actually make sense and b) most of the "flag"
producing operations already have complements. Fusing the branch-only
instructions into their producers wouldn't necessarily lead to an
explosion of the instruction space.

In general, instruction streams which have a concept of flags and other
implicit dependencies are horrible to work with. Julian can probably
tell you more about that. :-)


But I can buy in on #33. Sure, for FP-challenged CPUs you need to 
rethink narrowing. But there's a limit to that. Either the ES spec has
to change or ARM needs to design CPUs with decent performance for
double-precision FP. I'm not sure which of the two is more unlikely. :-)

[Many of the ARM devices out there nowadays have NEON, but this is
single-precision only. With VFP you get double-precision. Alas, the
vendors like to hide the fact that most of their gadgets actually
contain VFPlite. The high latencies and the low throughput makes 
softfp suddenly look like an attractive option.]
[Sorry, correcting my own math: 'ranges of 56.53-56.87 vs. 56.07-56.53' and
a conclusive '0.6% difference' (= 340 microseconds). Still not convincing.]
Summary of my position re LIR_ov:

* first, just to be clear, I am not trying to get rid of overflow
  checking per se, or change the generated code.  I'm merely trying
  to deal with the fact that LIR_ov differs from almost all other
  value-consuming LIRs in that it doesn't have a pointer back to the
  value-producing LIR.  Instead it implicitly references the
  sourcing LIR by immediately following it in the LIR buffer, thereby
  constituting a special case requiring special handling and
  special consideration when we come to think about "how can we 
  transform/mangle LIR", etc.

* doing all math on the FPU isn't desirable, esp for low end rigs

* a fused compute-and-exit construct, as per Comment #1, would
  work fine.

* I don't see that arguments about "opcode explosion", etc, are
  relevant.  We'd have just a single "conditional exit" construct
  as at present, producing a value too, with fields for the two
  operands, the operation (add, sub, mul, etc), the exit condition
  (ov, cs, etc).

* re comment 18 (push/pop %eflags), that's very expensive.  Leastways
  was last time I looked.  We used it extensively in the Valgrind 2.x
  line and discovered they have latencies (in cycles) roughly equivalent
  to the pipeline depth.  Really the only sane way to do conditional
  branches is to set the flags in one (x86) insn and branch immediately
  in the next one.  That's what gcc generates 99% of the time anyway.

I had hoped to do some LIR hackery this week, but have got bogged down
in LIR debug printing swampery (bug 494864).
This comment updates the initial list of stuff in LIR to look at,
as stated in (1) ... (5) in the Description for this bug.

New things are:

(6) Look at forward jumps in LIR fragments, and see if there is a
    way to clean this up.  It breaks the nice simple story that
    an LIR fragment is a superblock (single-entry-multiple-exit)
    and causes the compiler to have to deal with control flow
    merges, which it wouldn't otherwise have to.  There's
    evidently some magic to do with LIR_live and the lack of
    phi-nodes that I have yet to take on board.

    I understand that the motivation is to be able to do simple
    control-flow diamonds without the overhead (and implied tail
    duplication) of a full trace exit and re-entry.

(7) Similar to the LIR_ov discussion, find a way to remove
    implicit dependencies between LIR_div and LIR_mod without
    affecting code quality.

(8) Document assumptions and requirements about LIR node ordering
    during code generation.

    At the moment, LIR nodes are ordered in two ways:

    - there is a total order imposed on them by the order they
      are parked in the LIR buffer.  ("buffer ordering")

    - there is also a partial order resulting from the fact that
      (most) value-consuming nodes have pointers back to the
      nodes that generated the values. ("operand ordering").

    How these are related (or are supposed to relate) isn't
    stated.  Is the operand ordering required to be consistent
    with the buffer ordering?

    For example, it is the buffer ordering that dictates the
    order that memory references must appear in, since nothing
    else constraints them.  But for value-producing/consuming
    node-pairs (eg, a shift-left feeding into an add), only the
    operand ordering actually matters.  So is their relative
    buffer ordering significant or not?  It would be good to
    clarify this.
(In reply to comment #37)

I'll attempt to clarify the unstated intentions behind LIR,
read on.

> This comment updates the initial list of stuff in LIR to look at,
> as stated in (1) ... (5) in the Description for this bug.
> 
> New things are:
> 
> (6) Look at forward jumps in LIR fragments, and see if there is a
>     way to clean this up.  It breaks the nice simple story that
>     an LIR fragment is a superblock (single-entry-multiple-exit)
>     and causes the compiler to have to deal with control flow
>     merges, which it wouldn't otherwise have to.  There's
>     evidently some magic to do with LIR_live and the lack of
>     phi-nodes that I have yet to take on board.

there not supposed to be a story that LIR is a superblock.  LIR is intended
to house a complete control-flow graph complete with forward & backward
jumps, etc.

>     I understand that the motivation is to be able to do simple
>     control-flow diamonds without the overhead (and implied tail
>     duplication) of a full trace exit and re-entry.

and also, to support full CFG's, as tamarin's JIT front-end emits.

> (7) Similar to the LIR_ov discussion, find a way to remove
>     implicit dependencies between LIR_div and LIR_mod without
>     affecting code quality.

cool.  i've been mute on this but I lead towards keeping LIR_div/mod
separate and combining them with LIR_qjoin or something semantically
stronger, that the instruction selector can identify and then fuse.

> (8) Document assumptions and requirements about LIR node ordering
>     during code generation.
> 
>     At the moment, LIR nodes are ordered in two ways:
> 
>     - there is a total order imposed on them by the order they
>       are parked in the LIR buffer.  ("buffer ordering")
> 
>     - there is also a partial order resulting from the fact that
>       (most) value-consuming nodes have pointers back to the
>       nodes that generated the values. ("operand ordering").
> 
>     How these are related (or are supposed to relate) isn't
>     stated.  Is the operand ordering required to be consistent
>     with the buffer ordering?

I'm all for firming this up!  so i'll try to state our intentions
but without overly biasing the outcome of what we do next.

what's the (most)?  I think it's all.  loads and stores to the local
stack frame (i.e. to areas inside LIR_alloc instructions) do not follow
any particular rules.  they should be considered to be side-effect
instructions and their buffer order is their intended order.

(that said, an analysis pass could connect stores & loads to local
stack area, find true dependencies, and use that to do code motion. 
those dependencies are just not expressed in LIR between stores & loads,
only between direct LIR->LIR operand references).

Within a basic block, LIR IR intends to require that value-producing instructions sit earlier in LirBuffer than value-consuming instructions.

inbetween basic blocks, LIR IR doesn't intend to make any guarantees,
however:
  * the Assember wants to see all uses before all definitions, therefore
    it requires that the order of traversal of the LirReader pipeline
    feeding it also visit basic blocks in an order such that value-consuming
    instructions are seen before value-producing instructions.

importantly: a different LIR walker, and a different LIR consumer, say in an intermediate pass, could have different assumptions.  the LIR IR itself doesn't intend to nail down the order in which blocks are traversed.

I'ts always been the intention that the buffer ordering imply statement ordering.  if a walker just skipped non-side-effect instructions, you'd have the statement list.  this is why there's currently no separate statement list.

>     For example, it is the buffer ordering that dictates the
>     order that memory references must appear in, since nothing
>     else constraints them.  But for value-producing/consuming
>     node-pairs (eg, a shift-left feeding into an add), only the
>     operand ordering actually matters.  So is their relative
>     buffer ordering significant or not?  It would be good to
>     clarify this.

fully agree.  did I cover it above or miss a subtle point?
(In reply to comment #38):

Ed, thanks for the background.

> LIR is intended to house a complete control-flow graph complete with 
> forward & backward jumps, etc.

In which case, afaict, it would be essentially the same as bog-standard
SSA, right?  Apart from one thing: there are no phi-nodes in LIR, and
they are integral to SSA.  So, uh, how are you able to get away without
having phi-nodes?
correct, if it had phi nodes, it would be standard SSA.  

the way it avoids phi nodes is by just making it the front-end's problem to emit stores & loads, where phi-nodes would otherwise exist.  

we've talked about adding phi-nodes.
 - pro: make the IR be standard SSA, enable SSA-based IR transformations
 - con: additional complexity in the backend, converting out of SSA, and in
   the frontend, properly placing phi nodes as the IR is being generated

our frontends (TM's tracer and Tamarin's lir generator) are bone-simple, which is good and bad.  but that's why things are as they are, currently.

we've also talked about introducing chained def-use instructions, that could provide semantic equivalence with phi-nodes (I think) but be easier to deal with
in the frontend and backend.

(tangent to chained def-use idea..)

  LIR_def instructions replace local LIR_st's, at BB ends. 
  LIR_use replace LIR_ld's, at BB starts (or mid-BB, wherever the reference occurs)

LIR_def takes 2 operands: a value-producing expr, and a pointer to another LIR_def.  As LIR is being constructed, when merge nodes are encountered and earlier definitions need to be merged, connect the LIR_def's together.

in backends, always follow the chain of LIR_def's to its first occurance, and do register allocation & spilling with this one single occurance.

the expressions visited along the chain of defs are the ones that would appear in a phi node at the point of a LIR_ref.

....  we implemnted the above in MIR, tamarin's previous jit compiler.  it's light, and it works.  but it's also kind of homebrew, and its not clear that its a win over real SSA with real phi nodes.

(end tangent)

Also, a register-allocator and instruction selector that could work directly on an SSA representation would shove this strongly to pure SSA.  I understand that the "register allocation by puzzle solving" appraoch being worked on this summer for nanojit, is meant to work right on the SSA input.
Hi Mike, greatly appreciate your detailed comments.

(In reply to comment #34)
> Obviously bit operations need narrowing, too. But there are no overflow
> conditions, so this is OT. You can get away with simple backpropagation,
> if the ES semantics for bit operations allow this transform:
>   tobit(add_fp(x, y)) ==> add_int(tobit(x), tobit(y))
> (With add_int() wrapping around and ignoring overflow.)

Indeed JS semantics allow this transform (non-finite values become 0, negative zero goes to 0 too).

> [BTW: IMHO that measurement script is statistically unsound. For
> fannkuch it shows +-0.3% and +-0.4% variation, i.e. significantly
> overlapping ranges of 54.99-58.40 vs. 54.05-58.55. And then the script
> goes on to *conclusively* state a significant difference of 0.06%?!?

This is something we've all noticed. Cc'ing dmandelin, I would like to have a conclusive proof to file as a SunSpider bug in http://bugs.webkit.org.

> Apart from that, measurements in the milli/microsecond range have too
> much systematic error. This can lead to fictitious differences for only
> slightly different code. You can't eliminate this error with repeated
> measurements, only with longer measurements.]

Indeed SunSpider's tests are absurdly short for this day and age. The MD5 data set was cut down so that the test wouldn't take too long in IE three years ago, from what I heard.

> But I can buy in on #33. Sure, for FP-challenged CPUs you need to 
> rethink narrowing. But there's a limit to that. Either the ES spec has
> to change or ARM needs to design CPUs with decent performance for
> double-precision FP. I'm not sure which of the two is more unlikely. :-)

If you made me bet, I would bet on ARM changing sooner. The JS standard changes mostly by adding selectively and enabling such things as "use strict" in ES5. Down the road if we have led programmers away from bad patterns by this combo of carrots and sticks, we may be able to remove things, but only after uses of the feature to remove have all but disappeared on the web.

/be
(In reply to comment #34)
> The comparison in #29 is missing the point. I said you *do* need
> narrowing for indexing expressions and induction variables. Turning it
> completely off is a no go.

No, I only turned of _speculative_ narrowing. All non-speculative narrowing was still done. Also, current TM tip does not narrow mul/div/mod, only add/sub. Narrowing mul/div/mod is a 20ms win over tip (patch is about to land), with some more to come in the array path in a separate patch (we currently always read values as doubles from an array when reading numbers, since specializing to 31-bit integer loads causes trace explosion in crypto code).

> Obviously bit operations need narrowing, too. But there are no overflow
> conditions, so this is OT. You can get away with simple backpropagation,
> if the ES semantics for bit operations allow this transform:
>   tobit(add_fp(x, y)) ==> add_int(tobit(x), tobit(y))
> (With add_int() wrapping around and ignoring overflow.)

Yeah, we always do this and it was still done in the modified benchmark run.

> [BTW: IMHO that measurement script is statistically unsound. For
> fannkuch it shows +-0.3% and +-0.4% variation, i.e. significantly
> overlapping ranges of 54.99-58.40 vs. 54.05-58.55. And then the script
> goes on to *conclusively* state a significant difference of 0.06%?!?
> 
> Apart from that, measurements in the milli/microsecond range have too
> much systematic error. This can lead to fictitious differences for only
> slightly different code. You can't eliminate this error with repeated
> measurements, only with longer measurements.]

Yes, absolutely, the measurements and the harness are crap. Still, the magnitude of the difference proves an impact beyond reasonable doubt.

> But back to the topic: the only remaining argument for keeping LIR_ov 
> (and LIR_x*) was that the number of combinations with other instructions
> is too high. I dispute that on the basis that a) there are many fewer
> combinations that actually make sense and b) most of the "flag"
> producing operations already have complements. Fusing the branch-only
> instructions into their producers wouldn't necessarily lead to an
> explosion of the instruction space.
> 
> In general, instruction streams which have a concept of flags and other
> implicit dependencies are horrible to work with. Julian can probably
> tell you more about that. :-)
> 
> 
> But I can buy in on #33. Sure, for FP-challenged CPUs you need to 
> rethink narrowing. But there's a limit to that. Either the ES spec has
> to change or ARM needs to design CPUs with decent performance for
> double-precision FP. I'm not sure which of the two is more unlikely. :-)
> 
> [Many of the ARM devices out there nowadays have NEON, but this is
> single-precision only. With VFP you get double-precision. Alas, the
> vendors like to hide the fact that most of their gadgets actually
> contain VFPlite. The high latencies and the low throughput makes 
> softfp suddenly look like an attractive option.]

I was hoping for a win with narrowing mul/div/mod on ARM, but it turns out its integer path sucks even more than its FPU. The VFP devices we ship on don't have a fast hardware integer divider, so VFP is actually better there. The mul/div/mod narrowing is disabled on anything but x86 at the moment.

We should probably take this discussion to a separate bug. I suggest https://bugzilla.mozilla.org/show_bug.cgi?id=474443
Here's an interim patch which gets rid of LIR_ov, as per Comment #1.
This is WIP and the patch is far from being correct or usable.  It
does however demonstrate that LIR_ov is removable without major ado.

The main features are:

* three new LIR opcodes, x_{add,sub_mul}_ov, which combine the
  activities of doing a binary operation, overflow detection, and
  possible trace exit.

  They are used like this:  op1 = LIR_2(argL, argR), op2 = guardRec
  and themselves generate the result of the operation (add, sub, mul).

  The use of LIR_2 is a kludge necessary because 3 operands are
  required (argL, argR, guardRec).

* modified TraceRecorder::alu and TraceRecorder::record_JSOP_NEG
  which creates these

* Assembler::asm_branch_x_binop_ov to emit instructions for them.

It runs for quite some number of tests in trace-tests.js before
asserting.

There is much to dislike about the patch:

* first and foremost, the changes to TraceRecorder::alu and
  TraceRecorder::record_JSOP_NEG are almost certainly wrong, in not
  handling overflow correctly

* Assembler::asm_branch_x_binop_ov in Nativei386.cpp essentially
  duplicates logic from Assembler::asm_arith for the LIR_add/sub/mul
  cases.  This could be commoned up.

* all the constant folding and CSE etc machinery does not operate on
  LIR_x_{add,sub_mul}_ov, as that would require duplicating the logic
  from the normal binary ops LIR_{add,sub,mul}.

* it uses up 3 more entries in the top level LIR opcode space and is
  therefore un-expandable since that space is nearly full.  A solution
  would be to split the space into two levels, and put all the
  arithmetic opcodes at the second level.

Overall, it turned out that messing with the LIR representation to add
the new opcodes, and dealing with the instruction selection aspects
was easy.  The real difficulty is in jstracer.cpp, in forming these
LIRs in the first place.  There is a subtle interaction between
overflow checking, constant folding and speculative integer demotion
which I did not so far manage to understand.  Because of that (and the
fact that the JIT eventually asserts) I am sure the patch is wrong.

Anyway I am posting it just in case it later turns out to be useful,
and so as to record my explorations here.  As the patch does not solve
any immediate problem, I'll move on to other stuff.
Definitively the right direction. I like most about the patch that this will allow us to dead-code eliminate the addition. Previously the guard code always kept the math op alive, even if it was otherwise dead.
I don't like the x_add,x_sub, etc. idea, for the reasons listed in comment 43, and because it seems like it will spawn more and more duplication of opcodes and logic, and finally because it just feels weird to have such specialized instructions. In particular, I would like to see nanojit evolve to a general yet small compiler backend, and eschew one-off features added for particular front-end applications. (For the same reason, I want to keep support for full-strength CFGs, so I was glad to hear that's already being used in Tamarin.)

Discussion above has said that nanojit LIR is pretty much standard SSA, which I think is great and exactly where we should go in the future. Going with that idea, it seems the way to handle overflow (and other flags) is to make it an explicit part of the semantics, so the LIR looks like this:

  15     LIR_add  6, 12
  16     LIR_ov  15          // 1 iff isn 15 overflows
  17     LIR_xt  16

I think this is pretty much like the current version except that LIR_ov has an explicit operand. In the sequence above, LIR_ov requires no code on x86, and this should be fairly easy to detect. I assume op/ov/branch is the only common sequence that we care that much to optimize. If there are intervening ops, we can just generate flag-loading instructions, and worry about optimizing that more iff it ever matters.
On the subject of int vs. floats and how much narrowing to do, it would seem to depend on the benchmark. I think our results show pretty conclusively that aggressive narrowing is good for SunSpider on x86, possibly because of the high usage of bit ops and array indexing. If we wanted to know more about this issue, I would suggest generating some x86 for an interesting trace, disassembling it, recoding by hand with a variety of narrowing choices, and then measuring direct runs.

On the subject of SunSpider statistics, if we want a good SunSpider score, then we have no choice but to use SunSpider metrics, whether the benchmarks are good or not. At least on MacOS, the results are repeatable, so in that sense they're pretty usable. I think you can get down to about +/-5 ms maximum precision by doing a large number of runs. 

In terms of new benchmarks, I think the things that would help us most would be microbenchmarks and application-based benchmarks. We really don't know a lot about the performance of specific operations because we haven't used microbenchmarks much yet. In particular, it would help us a lot in learning which features we definitely have a lot of room to improve on. And on the other hand, I think we're finding that improved SS scores doesn't necessarily mean improved SS scores, primarily because SS exercises only a small part of the language, and also because the programs are so short.
#45: If we can extend that somehow to make the guard part of the data-flow, we would get the same dead code elimination benefits.
(In reply to comment #43)
>   The use of LIR_2 is a kludge necessary because 3 operands are
>   required (argL, argR, guardRec).

FWIW, bug 501232 will remove LIR_2 by allowing trinary instructions.
(In reply to comment #45)
> Discussion above has said that nanojit LIR is pretty much standard SSA, which I
> think is great and exactly where we should go in the future. Going with that
> idea, it seems the way to handle overflow (and other flags) is to make it an
> explicit part of the semantics, so the LIR looks like this:
> 
>   15     LIR_add  6, 12
>   16     LIR_ov  15          // 1 iff isn 15 overflows
>   17     LIR_xt  16
> 
> I think this is pretty much like the current version except that LIR_ov has an
> explicit operand.

I find this less clean that the current approach -- "LIR_ov 15" looks like it's using the result of insn 15 (ie. the sum) as an input, when it's really using something else.  That "something else" is either a side-effect of the add (if you view it in terms of setting condition codes) or a second output of the add (if you view the overflow result as a second output).

It's a similar problem to LIR_callh, which also looks like it's operating on something (the low 32 bits of a return value) when it's really operating on something else (the corresponding high 32 bits).
(In reply to comment #45)

There's two orthogonal issues addressed in your post.  (1) whether the
approach to compute-and-if-overflow-exit shown by the patch in #43 is
a good thing.  (2) whether we should move LIR/NJ to full SSA.

Re (1), I don't like the duplication either.  With some careful
refactoring, perhaps the code duplication could be avoided.  Something
like: we have a LIR binary op node, and we have a guards which incorporate
LIR binary op nodes; in both cases the folding machinery etc should
simply operate on the underlying op node.

Re (2)

> Discussion above has said that nanojit LIR is pretty much standard SSA,
> which I think is great and exactly where we should go in the future.

I think it's a bit misleading to say that NJ LIR is pretty much 
standard SSA.  LIR has no phi-nodes, which are mandatory in standard
SSA.  Also, LIR doesn't have a notion of a real control flow graph, in
which blocks can be moved around and visited in an order which gives
the best code quality: instead, we are forced to visit the blocks in 
the particular order they have been parked in the LIR buffer, which
may not be the best from a register allocation point of view.

I believe that implementing full SSA, with phi-nodes, the dominance frontier
stuff needed to make them work, plus representing an arbitrary CFG, plus
identifying loops, plus having to do decent register allocation for variables
in loops, represent a big step up in JIT code complexity and JIT-time cost,
the latter of which we're unlikely to recover in increased code quality.

My instinct is to head in the opposite direction.

Most of the fragments I've looked at are single-entry, multiple exit 
affairs.  I suspect the vast majority of executed fragments are like that.
In which case the added overhead of the full SSA machinery buys you
not very much.

I'd suggest:

(a) restrict LIR to actually be this superblock form (single entry,
    multiple exit) -- that is, get rid of all forward and backwards
    jumps

(b) find a way to improve the inter-fragment jumping conventions, in
    particular so we don't have to flush all live variables to memory
    at such transitions

Assuming that most executed fragments really are superblocks, then (a)
doesn't hurt performance, and (b) improves it for general-case code.
This means we could reuse the existing compiler more or less, with no
change to the code quality within each superblock relative to at present,
and with no significant extra JIT-time expense.

Some years ago there was a Java JIT project at Mozilla, called
ElectricalFire:
http://web.archive.org/web/20021020161548/www.mozilla.org/projects/ef/techdocs/index.html

IIUC, they were implementing full SSA in a JIT context.  What happened to
those people?  Is there anything we can learn from them?
(In reply to comment #49)
> (In reply to comment #45)
> > 
> >   15     LIR_add  6, 12
> >   16     LIR_ov  15          // 1 iff isn 15 overflows
> >   17     LIR_xt  16
>
> I find this less clean that the current approach -- "LIR_ov 15" looks like it's
> using the result of insn 15 (ie. the sum) as an input, when it's really using
> something else.  That "something else" is either a side-effect of the add (if
> you view it in terms of setting condition codes) or a second output of the add
> (if you view the overflow result as a second output).

I had viewed "LIR_ov 15" as using the result of 15, with the result understood to be the result of logical integer addition (i.e., a possibly bigger than 32-bit value) as opposed to 32-bit 2's complement addition yielding a 32-bit value. I guess there's nothing wrong with viewing that as referring to a second output, but then again, I have nothing against second outputs.

I was also thinking of LLVM, which seems to use a similar design.

If we had a type system for LIR, with defined conversions, it wouldn't be too hard to deal with things like this:. The typing could be something like:

  type int32                           # primitive 32-bit integer type
  type flags                           # ALU result flags type
  type int32_result = (int32 * flags)  # ALU operation result type

  typeof(LIR_add) :    int32 int32 -> int32_result
  # LIR_add and most other ALU ops accept inputs of type int32_result, with
  # explicit conversion to int32 by extracting that element
  typeof(LIR_ov)  :    int32_result -> bit

At least to me :-), the semantics are clear and it seems a type checker on LIR could prevent most mistakes over this sort of thing.
(In reply to comment #50)
> (In reply to comment #45)
> 
> There's two orthogonal issues addressed in your post.  (1) whether the
> approach to compute-and-if-overflow-exit shown by the patch in #43 is
> a good thing.  (2) whether we should move LIR/NJ to full SSA.
> 
> Re (1), I don't like the duplication either.  With some careful
> refactoring, perhaps the code duplication could be avoided.  Something
> like: we have a LIR binary op node, and we have a guards which incorporate
> LIR binary op nodes; in both cases the folding machinery etc should
> simply operate on the underlying op node.

I'm not entirely sure what you are proposing here, but it sounds like you are suggesting an instruction like "LIR_xov" that guards on the overflow result of a LIR ALU op. Is that right? It seems pretty closely related to what I suggested, but it seems are you are also suggesting some explicit requirement to tightly link these exits with ALU ops. Perhaps the type system I sketched could also do this.

> Re (2)
> 
> > Discussion above has said that nanojit LIR is pretty much standard SSA,
> > which I think is great and exactly where we should go in the future.
> 
> I think it's a bit misleading to say that NJ LIR is pretty much 
> standard SSA.  LIR has no phi-nodes, which are mandatory in standard
> SSA.  Also, LIR doesn't have a notion of a real control flow graph, in
> which blocks can be moved around and visited in an order which gives
> the best code quality: instead, we are forced to visit the blocks in 
> the particular order they have been parked in the LIR buffer, which
> may not be the best from a register allocation point of view.

Those seem to be implementation aspects as opposed to conceptual aspects. But fair enough, LIR certainly doesn't allow phi nodes currently.

> I believe that implementing full SSA, with phi-nodes, the dominance frontier
> stuff needed to make them work, plus representing an arbitrary CFG, plus
> identifying loops, plus having to do decent register allocation for variables
> in loops, represent a big step up in JIT code complexity and JIT-time cost,
> the latter of which we're unlikely to recover in increased code quality.

I wasn't proposing all of the above. In particular, we won't need dominance analysis and phi-node placement because our code generators will know exactly where to put the phi nodes. One of my applications is generating 'fast-path' versions of ops. Consider this pseudocode:

    if (condition Q):
        # fast path
        result_1 = (fast path LIR)
    else:
        # slow path
        args = (set up call)
        LIR_call
        result_2 = (process call results)
    result = LIR_phi(result_1, result_2)

Being able to do this is valuable: our competition does it, and it's not hard for them because they are doing their register allocation and value tracking directly over simple assembly blocks. I hope we will be able to generate code like this as well, and supporting some possibly restricted kind of phi function seems like a good way to do it. I think all would need to support applications like that is iterative liveness analysis and a register allocator that understands LIR_phi.

We also have a grad student currently working on a JIT-friendly "puzzle solving" register allocator as part of a GSoC, so we should find out soon enough whether that works well or not.

> My instinct is to head in the opposite direction.
> 
> Most of the fragments I've looked at are single-entry, multiple exit 
> affairs.  I suspect the vast majority of executed fragments are like that.

That's mostly true. I think dense array accesses and some imacro things are the main exceptions. But we also have an intern working on inline threading for TraceMonkey using nanojit, which will certainly want structured control flow in LIR. I believe Tamarin is also using structured control flow in LIR.

> In which case the added overhead of the full SSA machinery buys you
> not very much.
> 
> I'd suggest:
> 
> (a) restrict LIR to actually be this superblock form (single entry,
>     multiple exit) -- that is, get rid of all forward and backwards
>     jumps
> 
> (b) find a way to improve the inter-fragment jumping conventions, in
>     particular so we don't have to flush all live variables to memory
>     at such transitions
> 
> Assuming that most executed fragments really are superblocks, then (a)
> doesn't hurt performance, and (b) improves it for general-case code.
> This means we could reuse the existing compiler more or less, with no
> change to the code quality within each superblock relative to at present,
> and with no significant extra JIT-time expense.

This seems at least as difficult to implement as a LIR_phi plus a linear-scan-type register allocator. In particular, loops don't seem very easy to handle that way. It also seems rather painful from the user's (i.e. person writing code to generate LIR) point of view.

> Some years ago there was a Java JIT project at Mozilla, called
> ElectricalFire:
> http://web.archive.org/web/20021020161548/www.mozilla.org/projects/ef/techdocs/index.html
> 
> IIUC, they were implementing full SSA in a JIT context.  What happened to
> those people?  Is there anything we can learn from them?
(In reply to comment #50)
> Some years ago there was a Java JIT project at Mozilla, called
> ElectricalFire:
> http://web.archive.org/web/20021020161548/www.mozilla.org/projects/ef/techdocs/index.html
> 
> IIUC, they were implementing full SSA in a JIT context.  What happened to
> those people?

Waldemar Horwat, who led that team, is at Google. He is a member of Ecma TC39.

> Is there anything we can learn from them?

Possibly, but that was a lont time ago, and EF never reached a usable state of completion. Others who hacked on it have scattered in the Netscape diaspora, and I've lost track of them.

I think you're right to shy away from full SSA. For the short term, can we get a small (2x would not be small for benchmark results) linear speedup out of Nanojit, with cleanups as we go, by generating better code, avoiding going thru memory, etc. -- without jumping into the deep end of the SSA pool?

/be
(In reply to comment #51)
> 
>   type int32                           # primitive 32-bit integer type
>   type flags                           # ALU result flags type
>   type int32_result = (int32 * flags)  # ALU operation result type
> 
>   typeof(LIR_add) :    int32 int32 -> int32_result
>   # LIR_add and most other ALU ops accept inputs of type int32_result, with
>   # explicit conversion to int32 by extracting that element
>   typeof(LIR_ov)  :    int32_result -> bit

Does LIR_add take int32 or int32_result operands as inputs?  Did you mean this:

>   typeof(LIR_add) :    int32_result int32_result -> int32_result

That would be bad because it's not true -- the add doesn't rely on input flags.

Or do you really intend to introduce an explicit conversion like this:

    typeof(LIR_extract_int32) : int32_result -> int32

That would be bad because LIR_extract_int32 would be extremely common.

Or did you mean something else?

I see type-checking LIR as a very worthwhile exercise in correctness, and a compelling argument for fixing instructions like LIR_ov -- because I can't see how it can be sensibly typed in its current form.
(In reply to comment #54)

Preface: I'm with Andreas on this stuff that we should focus on stuff that has concrete benefits as opposed to conceptual cleanup. The LIR_ov debate is questionable by that criterion, although I do think typing or otherwise validating LIR is very worthwhile.

> (In reply to comment #51)
> > 
> >   type int32                           # primitive 32-bit integer type
> >   type flags                           # ALU result flags type
> >   type int32_result = (int32 * flags)  # ALU operation result type
> > 
> >   typeof(LIR_add) :    int32 int32 -> int32_result
> >   # LIR_add and most other ALU ops accept inputs of type int32_result, with
> >   # explicit conversion to int32 by extracting that element
> >   typeof(LIR_ov)  :    int32_result -> bit
> 
> Or did you mean something else?

My comment explains it, although there was a typo that totally changed the meaning. :-) I meant to say that LIR_add and such accept inputs of type int32_result, with *implicit* conversion to int32 by extracting that element. 

> I see type-checking LIR as a very worthwhile exercise in correctness, and a
> compelling argument for fixing instructions like LIR_ov -- because I can't see
> how it can be sensibly typed in its current form.

It could be typed by a type-and-effect system or whatever you want to call it. Although, when I think about it, it seems like the current LIR_ov is always valid, and just refers to the last instruction that set condition flags. So it seems that current LIR_ov is not hard to type, just doesn't have clear expression semantics.
(In reply to comment #55)

> I meant to say that LIR_add and such accept inputs of type
> int32_result, with *implicit* conversion to int32 by extracting that element. 

What's the type of LIR_imm?  Presumably int32, but then you can't feed one into an LIR_add.

I'm glad there's broad agreement that type-checking LIR is a worthwhile "concrete" goal.  I'm just having trouble seeing a sane way to type LIR_ov in its current form.
Why? LIR_ov has a very concrete strict usage form: {LIR_add|sub|mul} is immediately followed by LIR_ov, with that LIR_ov directly pointing to that prior instruction. Just like LIR_callh. Its not pretty, but you can definitively type-check it easily.

For me the more interesting problem is how do we make overflow guards dataflow dependent and subject to dead-code elimination if the result of the op isn't used. Example:

x = add a, b
y = ov x
guard y, exit

The guard will always keep y alive, which always keeps x alive. This is the problem I would prefer to focus on. If we end up finding a cool solution for this, that also happens to clarify our semantics along the way, that would be awesome.
(In reply to comment #56)
> (In reply to comment #55)
> 
> > I meant to say that LIR_add and such accept inputs of type
> > int32_result, with *implicit* conversion to int32 by extracting that element. 
> 
> What's the type of LIR_imm?  Presumably int32, but then you can't feed one into
> an LIR_add.

OK, clearly there is excessive confusion here. Let me start over.

  # Type definitions
  type int32                           # primitive 32-bit integer type
  type flags                           # ALU result flags type
  type int32_result = (int32 * flags)  # ALU operation result type

  # Function types of LIR operations
  typeof(LIR_imm) :    -> int32
  typeof(LIR_add) :    int32 int32 -> int32_result
  typeof(LIR_ov)  :    int32_result -> bit

Note: any operation that takes an 'int32'-typed input can also be given an 'int32_result'-typed input. An 'int32_result'->'int32' conversion with the obvious semantics is inserted implicitly. 

> I'm glad there's broad agreement that type-checking LIR is a worthwhile
> "concrete" goal.  I'm just having trouble seeing a sane way to type LIR_ov in
> its current form.
(In reply to comment #57)
> For me the more interesting problem is how do we make overflow guards dataflow
> dependent and subject to dead-code elimination if the result of the op isn't
> used. Example:
> 
> x = add a, b
> y = ov x
> guard y, exit
> 
> The guard will always keep y alive, which always keeps x alive. This is the
> problem I would prefer to focus on. If we end up finding a cool solution for
> this, that also happens to clarify our semantics along the way, that would be
> awesome.

This problem seems independent of overflow handling, because it could apply to any guard. A clean way to get this DCE to happen in one pass would be to create a LIR_xadd instruction or something like that. LIR_xadd would return a value, but also include a guard and exit. Thus, if the value is not used, the LIR_xadd can be eliminated in the usual way.

Another possibility would be to make a new kind of guard instruction that doesn't keep its input live. This would be used for guards that only protect a value being computed, while the current kind of guard would be used for tests that protect the trace itself.
> For me the more interesting problem is how do we make overflow guards dataflow
> dependent and subject to dead-code elimination if the result of the op isn't
> used.

Umm .. but I provided a suggestion for fixing that in #1, and a partial
proof of concept patch in #43.  Said patch, if integrated properly in the
tree, would make dead code removal of guards trivial.

dmandelin is suggesting the same thing in #59 w.r.t. LIR_xadd.
(In reply to comment #59)
> (In reply to comment #57)
> > For me the more interesting problem is how do we make overflow guards dataflow
> > dependent and subject to dead-code elimination if the result of the op isn't
> > used. Example:
> > 
> > x = add a, b
> > y = ov x
> > guard y, exit
> > 
> > The guard will always keep y alive, which always keeps x alive. This is the
> > problem I would prefer to focus on. If we end up finding a cool solution for
> > this, that also happens to clarify our semantics along the way, that would be
> > awesome.

That takes us back to Julian's LIR_x_add_ov instruction.

> This problem seems independent of overflow handling, because it could apply to
> any guard. A clean way to get this DCE to happen in one pass would be to create
> a LIR_xadd instruction or something like that. LIR_xadd would return a value,
> but also include a guard and exit. Thus, if the value is not used, the LIR_xadd
> can be eliminated in the usual way.

I don't think that makes sense.  This is closely tied to the presence of the ov.  LIR_x_add_ov embodies a condition: "exit if the add overflows".  LIR_xadd doesn't embody any such condition.
We discussed LIR_x_add_ov above. Its a bit heavyweight. I was hoping for something that is more inline with the fine-grained nature of LIR. Ed pointed out we might want to branch on these things as well, and we might want to branch and exit in case of overflow of not-overflow, and for at least 3 instructions, which is a whole bunch of combinations (its 6am, I let someone else count them all).

My attempt at this is documented in https://bugzilla.mozilla.org/show_bug.cgi?id=498035. Its a value-producing instruction that marks a block and the entire block goes dead if the value isn't used. Such guard constructs can be encapsulated in it. Wiring that up to DCE is a bit ugly, but certainly doable.
(In reply to comment #50)
> My instinct is to head in the opposite direction.
> 
> Most of the fragments I've looked at are single-entry, multiple exit 
> affairs.  I suspect the vast majority of executed fragments are like that.
> In which case the added overhead of the full SSA machinery buys you
> not very much.
> 
I am currently working on implementing inline threading using nanojit, and the code I generate is very different than that generated for traces. Absolute and conditional jumps in the javascript correspond directly to absolute and conditional jumps in the generated LIR. Despite having a real CFG, the generated code doesn't really have any need for Phi-nodes (nor will it once I begin inlining simple opcodes). It's possible that possible future optimizations such as further inlining fast paths could require Phi-nodes, but I haven't thought that through fully.

> I'd suggest:
> 
> (a) restrict LIR to actually be this superblock form (single entry,
>     multiple exit) -- that is, get rid of all forward and backwards
>     jumps
> 
This would *seriously* complicate code generation for inline threaded code. The generated code would need to be a large number of linked together fragments (one fragment per jump target, I believe), instead of one fragment with straightforward internal control flow.

> (b) find a way to improve the inter-fragment jumping conventions, in
>     particular so we don't have to flush all live variables to memory
>     at such transitions
> 
Inter-fragment jumps would need to be very cheap to not be a deal-breaker for nanojit based inline threading. How expensive are they now?
(In reply to comment #62)
> We discussed LIR_x_add_ov above. Its a bit heavyweight. I was hoping for
> something that is more inline with the fine-grained nature of LIR. Ed pointed
> out we might want to branch on these things as well, and we might want to
> branch and exit in case of overflow of not-overflow, and for at least 3
> instructions, which is a whole bunch of combinations (its 6am, I let someone
> else count them all).
> 
> My attempt at this is documented in
> https://bugzilla.mozilla.org/show_bug.cgi?id=498035. Its a value-producing
> instruction that marks a block and the entire block goes dead if the value
> isn't used. Such guard constructs can be encapsulated in it. Wiring that up to
> DCE is a bit ugly, but certainly doable.

1. Encoding the blocks using pseudoinstructions is indeed ugly, although I think the idea itself is not too bad. Expanding your explanation a bit, I would say that the (logical) CFG can contain a "smurf-block" of multiple instructions, where *only* the final instruction produces a value that can be used outside the "smurf-block", and if the final instruction is dead, then the entire "smurf-block" is dead. I would want this to come along with a test-mode checker that verifies that only the final instruction is used outside the "smurf-block". The other hazard is that the user can generate a "smurf-block" that contains a guard that verifies a property needed by the rest of the trace, not just the "smurf-block"'s value, in which case eliminating it would be erroneous. But so far it appears that risk applies in every dead-value guard-elimination algorithm I have thought of.

2. I thought a little more about more traditional solutions today, and it appears that nothing involving LIR_ov is likely to work cleanly. Consider this trace:

  10   add  5, 6
  11   ov  10
  12   xt  11

For all the traces we generate, the following code will never refer to the value of instruction 11. Thus, when we arde scanning instruction 12, there is no information directly available *at all*: the form of instruction 11 and the state of its operands is the same, no matter what code follows, so clearly it can't be used to make any decisions. 

3. What makes sense to me today is a LIR_xov instruction, that exits if the single operand value caused an overflow, and otherwise evaluates to the value of its operand. Now the code is this:

  10 add  5, 6
  11 xov 10

When scanning 11, if 11 is not used, we can say that it is dead and eliminate it, and then we would eliminate 10 as well because it is now dead. If 11 is used, then we correctly keep it around.

I like this because the conceptual effect and purpose of a LIR_xov is to produce a non-overflowed integer value. If said value is not used, it is dead.

A side issue is that theoretically we could generate a side exit on overflow that we intend to guard a trace property, and not to produce a value. Thus, we would never want that instruction eliminated. I think in that case, we would not want to use the value-producing LIR_xov. Instead, we would generate a LIR_ov like now and use it with a non-value producing guard, or create a new LIR_xovn that exits if its operand overflowed, but does not produce a value.
(In reply to comment #57)
> For me the more interesting problem is how do we make overflow guards dataflow
> dependent and subject to dead-code elimination if the result of the op isn't
> used. Example:
> 
> x = add a, b
> y = ov x
> guard y, exit
> 
> The guard will always keep y alive, which always keeps x alive.

If that exit will report an exception, then it must never be eliminated, because doing so would change the behavior of the code.  If the exit would simply bail back to the interpreter to compute the value slowly and then ignore it, then it can be eliminated.  Do we distinguish these two cases at the moment?
It seems there is really no good solution for this problem. xov is my personal favorite too, given the choices we discussed so far. We have already xt/xf, xo (or xov) seems somewhat logical. We will likely not need xno (xnov), but we might end up adding jo jno. We can probably drop LIR_cs, since we have no immediate plans of using it, which would further reduce the cross product of names.
(In reply to comment #66)
> We can probably drop LIR_cs, since we have no
> immediate plans of using it, which would further reduce the cross product of
> names.

Already done in bug 493125, which was pushed to TM yesterday.
I'll add my $.02

I like LIR_xov too, given the alternatives.

LIR currently doesn't force any particular block traversal order; block traversal is determined by the LirReader which feeds the backend code gen pipeline.

tamarin-redux contains a VerifyWriter filter used for basic type checking and LIR validation during the LIR generation pass.  (full of asserts).  Lets promote it into nanojit and improve it?

We might still want LIR_ov for use with branching:  jt(ov(add)).  Or, jov(add).  since plain branches don't need to be value producing, either one seems simple enough.  but i haven't thought through it completely; all the needs for DCE of guards are still there for plain branches too, so maybe we really do want value-producing branches and thus jov(add).

Think about the non-tracing version of integer-add speculation that TM's competitors emit, and that we want to emit in Tamarin, and that TM might even emit if the speculation doesn't predict well someplace in the code.
(In reply to comment #68)
> tamarin-redux contains a VerifyWriter filter used for basic type checking and
> LIR validation during the LIR generation pass.  (full of asserts).  Lets
> promote it into nanojit and improve it?

I hg clone'd http://hg.mozilla.org/tamarin-redux, but couldn't find any
VerifyWriter in there.  Should I look somewhere different?  I'm working
on a LIR typechecker and would like to have a look at VerifyWriter so as
not to duplicate work.
I believe Ed is referring to ValidateWriter in CodegenLIR, if you haven't already found it.
Yep, I misspoke. class ValidateWriter in core/CodegenLIR.cpp
Target Milestone: --- → Future
Component: JIT Compiler (NanoJIT) → Nanojit
Product: Tamarin → Core
QA Contact: nanojit → nanojit
Depends on: 463137, 534309
FWIW, LIR_cs no longer exists.
Depends on: 539874
(In reply to comment #0)
> 
> (1) Clearly distinguish between values and effects (or, if you
>     like, expressions and statements) in LIR

Expression instructions and statement instructions are still the same types, but isStmt() is better than it used to be -- it's now type-based rather than being a hodge-podge of pattern matching.

> (2) Formalise the concept of "LIR temporary" (virtual register,
>     variable, etc)

No change, I was never quite sure what this meant.  The address of a value-producing LIns serves as the handle for the value.

> (3) Get rid of LIR_ov.  It has an implicit dependency through a
>     notional, unstated %eflags, to the immediately preceding
>     instruction.  Not currently sure what to replace it with, tho.

Done by bug 539874.
 
> (4) Possibly make the type system a bit stronger, and have types
>     for LIR temporaries as well as for the operations.  Write a LIR
>     typechecker, to check for correctly typed, SSA-form LIR.

Done by bug 463137.

> (5) (Further ahead) Get rid of hardwired amodes in ld/st, and 
>     require addresses simply to be arbitrary LIR temporaries.

I'm not sure what this means.  Loads and stores have a base pointer, which is a LIns* (ie. effectively an arbitrary LIR expression), and an integer offset.

Julian, I think this bug should be closed and any issues you still care about pushed into follow-up bugs.  Do you agree?
No longer depends on: 534309
(In reply to comment #73)
> Julian, I think this bug should be closed and any issues you still care about
> pushed into follow-up bugs.  Do you agree?

I'm closing this bug due to lack of interest.  Please file any remaining issues in follow-ups!  Thanks.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.