Closed
Bug 495569
Opened 15 years ago
Closed 14 years ago
[meta] Semantics level cleanups to LIR (in nanojit)
Categories
(Core Graveyard :: Nanojit, defect)
Core Graveyard
Nanojit
Tracking
(Not tracked)
RESOLVED
FIXED
Future
People
(Reporter: jseward, Unassigned)
References
Details
Attachments
(1 file)
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.10) Gecko/2009042700 SUSE/3.0.10-1.1 Firefox/3.0.10 Build Identifier: This is a meta-bug, intended as a recording-place for information regarding cleaning up the semantics of LIR a bit. I think the following adjustments to LIR would be beneficial in the longer term, and would not require major changes to the existing Nanojit implementation, nor slow it down. (1) Clearly distinguish between values and effects (or, if you like, expressions and statements) in LIR (2) Formalise the concept of "LIR temporary" (virtual register, variable, etc) (3) Get rid of LIR_ov. It has an implicit dependency through a notional, unstated %eflags, to the immediately preceding instruction. Not currently sure what to replace it with, tho. (4) Possibly make the type system a bit stronger, and have types for LIR temporaries as well as for the operations. Write a LIR typechecker, to check for correctly typed, SSA-form LIR. (5) (Further ahead) Get rid of hardwired amodes in ld/st, and require addresses simply to be arbitrary LIR temporaries. Expected benefits: (1) Make the implementation (LIR generation, transformation, conversion to native code) simpler, easier to reason about, easier to verify as correct. Similarly, make it easier to reason about correctness of proposed LIR transformations. (2) Put us in a good position to do instruction selection off trees, which should increase the quality of the code we can generate. LIR at present is almost equivalent to straight-line SSA, although LIR_ov breaks that. I believe its removal would make LIR be pretty much bog-standard SSA (the straight-line aspect, anyway, no need for phi nodes here). Reproducible: Always
Updated•15 years ago
|
Status: UNCONFIRMED → NEW
Ever confirmed: true
Reporter | ||
Comment 1•15 years ago
|
||
One way to get rid of LIR_ov is to have a specialised form of exit instruction, which combines an add, an assignment of a result to a LIR temporary, and an exit. So instead of (eg) this ... ld1 = ld sp[-16] ... ld2 = ld sp[-8] ... add1 = add ld1, ld2 ov1 = ov add1 xt1: xt ov1 -> pc=0x81d9e24 imacpc=(nil) sp+16 rp+0 # add1 is used later, ov1 is not we'd have this: ... ld1 = ld sp[-16] ... ld2 = ld sp[-8] ... add1 = x_ov_add ld1, ld2 -> pc=0x81d9e24 imacpc=(nil) sp+16 rp+0 # add1 is used later It removes the implicit dependency between LIR_ov and the immediately preceding add, yet retains the ability to generate "addl %rS,%rD ; jo ...", and retains also the property that no LIR statement assigns to more than one temporary.
Reporter | ||
Comment 2•15 years ago
|
||
A question: in https://developer.mozilla.org/en/Nanojit/LIR there are a bunch of load variants (ldcb, ldcs, ldc, ldqc) which "may be removed by common subexpression elimination". What propert(ies) is this intended to signify? * that the loaded-from location will not be stored to by any instruction in this fragment? * that the load is guaranteed not to take an exception? * something else?
Don't we need ov for other operations as well (sub, if not mul)? This sounds like it would let us kill the guard more easily, which would certainly be nice!
Reporter | ||
Comment 4•15 years ago
|
||
> Don't we need ov for other operations as well (sub, if not mul)?
Yes, sure. Presumably this would have to be done for some small
subset of binary and unary ops; afaict add, sub, mul.
Comment 5•15 years ago
|
||
(In reply to comment #2) > A question: in https://developer.mozilla.org/en/Nanojit/LIR > there are a bunch of load variants (ldcb, ldcs, ldc, ldqc) which > "may be removed by common subexpression elimination". > > What propert(ies) is this intended to signify? > > * that the loaded-from location will not be stored to by any > instruction in this fragment? > > * that the load is guaranteed not to take an exception? > > * something else? the 'c' variants are meant to mean: - a second load with the same effective address will return the same value - if the result is not used, the load can be eliminated
Comment 6•15 years ago
|
||
(In reply to comment #1) > One way to get rid of LIR_ov is to have a specialised form of > exit instruction, which combines an add, an assignment of a > result to a LIR temporary, and an exit. So instead of (eg) this > > ... > ld1 = ld sp[-16] > ... > ld2 = ld sp[-8] > ... > add1 = add ld1, ld2 > ov1 = ov add1 > xt1: xt ov1 -> pc=0x81d9e24 imacpc=(nil) sp+16 rp+0 > # add1 is used later, ov1 is not > > we'd have this: > > ... > ld1 = ld sp[-16] > ... > ld2 = ld sp[-8] > ... > add1 = x_ov_add ld1, ld2 -> pc=0x81d9e24 imacpc=(nil) sp+16 rp+0 > # add1 is used later > > It removes the implicit dependency between LIR_ov and the > immediately preceding add, yet retains the ability to > generate "addl %rS,%rD ; jo ...", and retains also the property > that no LIR statement assigns to more than one temporary. seems fine. thinking ahead: - with more arithmetic operators (mul, sub?) - and also handling the jump instructions (jt, jf) will this combining of operator+branch lead to too many IR instructions? LIns field width should not be a problem. The Risk of many combined op+branch instructions is footprint of LIR itself, complexity in optimization phases and the backends, etc. this could be low risk if we dont anticipate a big explosion of ops.
Reporter | ||
Comment 7•15 years ago
|
||
(in reply to Comment #5) What I was getting at was that it might be preferable to record the underlying properties of the load, as an annotation, rather than one specific transformation that those properties permit. For example, here we would say "(1) load is not aliased by any stores in the fragment, and (2) load does not except". Taken together, these would allow CSEing with a load of the same size and location, as you say. But taken separately: (1) by itself also appears to allow us to float the load forwards/backwards past arbitrary stores, and has nothing to do with CSE. Similarly (2) by itself might have some bearing on the possibility of floating the load forwards/backwards past exits without changing the exception behaviour.
Comment 8•15 years ago
|
||
Those would be good improvements. What if we followed the CallInfo pattern and invented a struct that would be referenced by load & store instructions, another building block in a LIR type system? LLVM might be an example for us to follow (loads/stores take an address computed by an address-computation instruction which takes a reference, a type descriptor, and a field designator and/or index). we dont necessarily need all the bells and whistles of LLVM, but a simple type model would be great.
Comment 9•15 years ago
|
||
also, all the loads and stores in LIR currently assume no exception behavior.
Comment 10•15 years ago
|
||
We need all ops, add/sub/mul/div, and potentially all forms of exits and jumps (xt/xf/jt/jf), so thats a lot of different LIR opcodes. We also might need LIR_cs in the future, which is another 16 combination. The new LIR_div/LIR_mod pair is a similar problem. LIR_mod has an invisible dependency on a LIR_div being close by. I would think we want to model the invisible data flow explicitly with a pseudo register instead of forming the cross product of the opcodes. No? In my experience with writing the Hotpath optimization passes you almost always handle all loads together, independently of the type. We had one LOAD op, and it had the type and constant-ness as flags (constant or loop-invariant). We could do the same with the LIR encoding. LIR_load | LIR_word | LIR_constant. Generating code for this would be pretty trivial. That would work for the side exits as well if we don't want to encode the data flow. LIR_xt | LIR_add | LIR_ov. However, I imagine this case will look a bit messy in the code generator, and filters also would have to be written carefully to decide what this really is. An arithmetic op, or a side exit. Might work though. The more general problem are inline exits, which Ed and I talked about a while ago. We would like to have side exits that behave like arithmetic ops and can be dead code eliminated if they have no uses. Right now a LIR_ov/LIR_xt chain always keeps a LIR_add alive, even if LIR_add has no uses. That shouldn't be the case. Maybe we can model LIR_ov and LIR_xt together? And model flags<>register moves? r = LIR_add b, c c = LIR_flags(r) // this must immediately follow the add, keep 1 return reg format ... // if nothing is in between, c is not materialized into an actual register i = LIR_ov|LIR_xt r, c, exit x = LIR_mul 1, i // if this goes dead, everything above is killed Just thinking out loud. This probably needs quite a bit more work.
Comment 11•15 years ago
|
||
What if ld/st/xt/xf took a LIR_ref as operand; A LIR_ref being an address computation? It would be easy to cse like-refs and also during code gen we'd be able to easily fold simple refs (i.e. base+small immediate). I'm thinking this would also make the exits easier to process as a LIR_xt would look like a 'store' and if its associated LIR_ref was not 'used' the expression would not be generated.
Comment 12•15 years ago
|
||
(In reply to comment #6) > will this combining of operator+branch lead to too many IR instructions? > > LIns field width should not be a problem. The Risk of many combined > op+branch instructions is footprint of LIR itself, complexity in > optimization phases and the backends, etc. > > this could be low risk if we dont anticipate a big explosion of ops. I think it's easy to overestimate the added complexity of a few more (clean) ops, and easy to underestimate the complexity of treacherous instructions like LIR_ov that pollute an otherwise clean semantics. For example, Valgrind's current IR (which Julian designed) has a lot of opcodes (it handles SIMD instructions, among other things) but it's semantically very clean: it is SSA, clearly distinguishes statements from expressions, is type-safe, and avoids hacks. This makes it very pleasant to deal with, and it's surprisingly difficult to generate incorrect IR. Valgrind's original IR, although much smaller and thus seemingly simpler, was not as semantically clean and generally much more error-prone and less pleasant to deal with. If Julian's suggestions leads to a lot of new ops then perhaps another way should be found, but having instructions that must appear near other instructions is asking for trouble.
Comment 13•15 years ago
|
||
(In reply to comment #10) > We need all ops, add/sub/mul/div, and potentially all forms of exits and jumps > (xt/xf/jt/jf), so thats a lot of different LIR opcodes. We also might need > LIR_cs in the future, which is another 16 combination. The new LIR_div/LIR_mod > pair is a similar problem. LIR_mod has an invisible dependency on a LIR_div > being close by. I would think we want to model the invisible data flow > explicitly with a pseudo register instead of forming the cross product of the > opcodes. No? I'm not sure what a pseudo-register is, but it sounds like a plan to replace one hack with another hack :) It's also worth mentioning bug 493125, which aims to remove LIR_cs, because it's currently dead in both TM and Tamarin.
Comment 14•15 years ago
|
||
I dislike LIR_ov just as much as the next guy. But I would prefer replacing it with something better, not just with something different :) As for LIR_cs, if we want to support unsigned types, we are going to need it.
Comment 15•15 years ago
|
||
As long as we're here, would it be possible to document the remaining opcodes in LIRopcode.tbl? I did some of them at one point, but I just couldn't figure out what things like LIR_ov were supposed to be at the time... I think that would make it a lot easier to read and understand LIR-producing code (like jstracer.cpp).
Comment 16•15 years ago
|
||
Lots of discussion here, nice. I've a (probably naive) question: Is it possible (or pleasant) to model a LIR instruction as defining 0..N values rather than always "exactly one"? It seems to me that most insns produce a bunch of flag bits, and several x86 insns write to >1 register. Also, *all* the SIMD stuff works on operand bundles, and some "pure effect" operations really don't produce values at all.. it seems like "1 LIns == 1 defined value" is possibly a source of muddle. If we let each LIns define multiple indexed values, we could then refer to an operand by a pair<LIns*,size_t> or something, rather than just a LIns*. If you wanted to access a flag produced by a flag-setting operator, you'd treat the flag-set like any other store produced by the setting insn, and have to abide by the constraint of not floating any other flag-set insn in between producer and consumer, but otherwise it'd be like any other bit of data flow (living in a special register class of flag bits and "read" through conditional jumps, of course). I am presumably missing an obvious complication of this scheme?
Reporter | ||
Comment 17•15 years ago
|
||
(in reply to Comment #16) > Is it possible (or pleasant) to model a LIR instruction as defining 0..N > values rather than always "exactly one"? I'm sure it's possible, although I don't actually know. A bit of dredging around the literature on SSA would probably throw something up. > It seems to me that most insns produce a bunch of flag bits, and several > x86 insns write to >1 register. Uh .. do you really want to represent (and have to optimise around, etc) that stuff in an architecture-neutral LIR? imo that's a job for the instruction selectors, downstream of the LIR machinery. The tricky bits I forsee, w.r.t multiple outputs, are (1) Andreas' desire to have a single x86 'div' insn produce both the div and mod results, and (2) how to use ARM addressing modes effectively, specifically the ones that write the effective address back to a register after the load/store. In both cases these insns write two registers. Ideally you don't want to represent in the LIR operations that produce more than one value, as that complicates the LIR (== slows the JIT) for the majority of ops (vanilla adds etc) which produce only one value. > it seems like "1 LIns == 1 defined value" is possibly a source of muddle. Right. That's why I favour partitioning LIR into LIR statements, which are executed for their side effects (eg, write register, write memory, take an exit, do a branch) and LIR expressions, which denote a pure value (eg, constant, read register, load, unary op, binary op, call to guaranteed side-effect free helper function). Statements contain expressions, and the top level definition of an LIR fragment becomes "sequence of statements".
Comment 18•15 years ago
|
||
(In reply to comment #16) > Is it possible (or pleasant) to model a LIR instruction as defining 0..N values > rather than always "exactly one"? It seems to me that most insns produce a > bunch of flag bits, and several x86 insns write to >1 register. The x86 is a difficult case, because 1) all instructions set condition bits in the same register, 2) all conditional branches check that register, and 3) that register is hard to move things into and out of (you can push and pop %eflags, or move a portion of it to and from %ah, and that's it). So while one *could* treat the flags as another value that the instruction yields, the cost of using that particular kind of value changes drastically depending on the intervening instructions. You'd really want to treat those values specially. Thus the appeal of fused test/branch instructions.
Comment 19•15 years ago
|
||
(In reply to comment #18) > The x86 is a difficult case, because 1) all instructions set condition bits in > the same register, 2) all conditional branches check that register, and 3) that > register is hard to move things into and out of (you can push and pop %eflags, > or move a portion of it to and from %ah, and that's it). I was thinking of modeling each *bit* as its own value, not the complete status register as a single value. Each conditional branch variant is only sensitive to a few of the bits, and not every instruction sets all of the bits. > treat the flags as another value that the instruction yields, the cost of using > that particular kind of value changes drastically depending on the intervening > instructions. You'd really want to treat those values specially. I'm not sure how it differs from eg. the split between ALU and FPU instructions, that can only act on a particular register class. Or SSE ops. Here you have a register class called "bit in the status word" that only a few kinds of conditional jump can read from. An expr/statement split to separate between values and effects is also reasonable though, and will probably help with modeling a variety of other issues. Thought I'd mention another possibly-simpler alternative to handle flags alone, is all.
Comment 20•15 years ago
|
||
For comparison-in-the-literature sake: GCC RTL used to use an implicit adjacent-instruction constraint on the condition code register[1], but seems to have switched to using pseudo-register representation, somewhat like I was describing. LLVM has a nutty (or clever?) conditional branch instruction[2] that takes a special "inline assembly reference" operand that denotes the test-that-sets-the-condition, and then transplants/fuses it with the branch. So there's no opportunity for the branch and the flag-setting code to separate. Clearly, the field's wide open! :) [1] http://gcc.gnu.org/onlinedocs/gccint/Condition-Code.html [2] http://llvm.org/docs/LangRef.html#i_br
Comment 21•15 years ago
|
||
(In reply to comment #17) > > The tricky bits I forsee, w.r.t multiple outputs, are (1) Andreas' > desire to have a single x86 'div' insn produce both the div and mod > results, and (2) how to use ARM addressing modes effectively, > specifically the ones that write the effective address back to a > register after the load/store. In both cases these insns write > two registers. ARM predicated instructions also aren't being used at all, right?
Comment 22•15 years ago
|
||
A machine-independent IR is always a compromise. It will never capture all the fancy instruction formats (idiv on x86), wonky wide effects (cc updates) or scheduling properties (conditional branches and delay slots anyone?). I think we should try to keep LIR simple and fast to generate and consume (assemble into native code). Micro-optimizations like conditional arithmetic ops and such are not really were we lose performance right now. Introducing guards that are expressions and not always live statements would buy us a lot more IMO than modeling cc properly and thus being able to CSE a bit better. Just my 2c.
Comment 23•15 years ago
|
||
You're right we shouldn't worry about micro-optimisations, so comment #21 is a distraction, and should be ignored, sorry. But the original point of the bug is that LIR has some hacks in it: LIR_ov, the new div instruction, etc. Getting rid of them would be good. And that's independent from micro-optimisations.
Comment 24•15 years ago
|
||
So do we have a simple way of dealing with LIR_ov thats better than what we have right now? I don't like it either, but I went with this approach because I couldn't think of anything else. I don't think the issue is serious enough to warrant a complete redesign of the IR.
Comment 25•15 years ago
|
||
Since JavaScript and Lua semantics are similar enough when it comes to requirements for IR design, I thought I'd share what the IR in LuaJIT 2.x looks like: Every instruction has either 0, 1 or 2 operands -- no exceptions. Every instruction has either 0 or 1 result -- no exceptions. All of this is independent of the instruction class or guard property. High orthogonality of the IR pays off big time for traversals, analysis, debug output etc. There are four classes of instructions: N = normal instructions (same as R = ref ops) L = loads A = allocations S = stores Additionally every instruction can be a guard (G), i.e. it may branch to a side exit or not. This is independent of the instruction class. Optimizations can be applied depending on this categorization: FOLD: only for N CSE: only for N (but independent of G or not) FWD: only for L SINK: only for S and A DCE: ok for N, L, A, but only if not G LOOP: simply pipe the IR again through all opts to get the variant part There are no separate conditional branch instructions. Instead there are guarded assertions (EQ, NE, LT, LE, GT, GE, ABC and so on). This is beneficial in several ways: assertions are subject to CSE, serve as inputs to VRP and simplify the assembler a lot. Overflow guards are separate instructions. And no, you don't need that many. I have ADDOV and SUBOV and that's it. Everything else would be lost in the noise. This also nicely solves the problem where you want to apply algebraic simplifications to ADD, but not to ADDOV. And the assembler can use lea instead of an add for integer ADD, but not for ADDOV (lea does not affect the flags, but it helps to avoid register moves). All loads and stores point to a reference op (except for some simple cases, like loads from Lua stack slots). A ref op provides an address, but also serves as the common anchor point for alias analysis (AA). Using regular arithmetic ops for address computations would impede or prevent any higher-level analysis. E.g. an array store cannot alias a hash load; now try to deduce this from a mess of adds/shifts/base-reg loads -- won't happen. Ref ops can be a guard, too (like a hash table lookup). Most loads do type checks, i.e. they are also guards (the whole IR is typed). The assembler fuses references into loads and stores if the CPU has an addressing mode that matches the ref op. E.g. x86 has [base+k*idx+ofs] which can be used for array indexing. Some ref ops like unspecialized hash refs cannot be fused and only provide an address. The assembler also fuses loads into other instructions (see Bug 444682 for an example). Since there are no instructions with two outputs, combining DIV/MOD or SIN/COS is done differently: there is already an auxiliary hash table used to pass on sparse info about the IR, such as analysis results or predicted hash slots for constant keys. It's indexed by info type and IR instruction number. This links a SIN to the COS and vice versa if they have the same argument. The assembler uses this to suppress whichever instruction it encounters first (the last one, since it goes backwards) and to fill two registers when it encounters the other one. Modeling target CPU features in a medium-level IR is almost always a losing proposition. I have argued in the past one doesn't need a low-level IR, since this lowering can be done on-the-fly by the assembler. Things like flag bits, instruction combining or fusion are orthogonal aspects and have no place in a target-independent IR. There are only a couple of places where the target architecture bleeds through to the IR. E.g. x86 CPUs automatically mask shift amounts by the operand width, but ARM CPUs don't. You don't want to emit one bitwise AND for every shift on x86, only to bother the assembler with its elimination. But you want to have them in the IR for ARM to be able to reassociate, simplify and CSE them. So the IR emitter and the FOLD rules differ in this case, but I think that's an acceptable compromise. Appended below is an excerpt of the LJ2 IR instruction definitions. And here's a small example of traced bytecode, optimized IR and generated x86 code for this simple Lua snippet: local x = 0; for i=1,100 do x = bit.bxor(x, i) end [Don't get distracted by the fact that JS has the ^ operator and wouldn't need to resolve the bit.bxor function. It would need to do something like this for builtins and functions in other examples.] ---- TRACE 0 start test.lua:1 [... Self-explanatory bytecode :-) ] 0006 GGET 5 0 ; "bit" 0007 TGETS 5 5 1 ; "bxor" 0008 MOV 6 0 0009 MOV 7 4 0010 CALL 5 2 3 ; bit.bxor 0011 MOV 0 5 0012 FORL 1 => 0006 ---- TRACE 0 IR [... Legend: '>': guard, '+': PHI ref, '*': dead ins] 0001 > int SLOAD #2 notc // i, narrowed to int 0002 > fun SLOAD #0 notc // Function at frame base 0003 > fun FRAME 0002 test.lua:1 0004 tab FLOAD test.lua:1 func.env // Load function env ('globals') 0005 int FLOAD 0004 tab.hmask 0006 > int EQ 0005 +63 // Specialize to hash table size 0007 int FLOAD 0004 tab.node 0008 > int HREFK 0007 "bit" [slot 43] // Hash ref, specialized to slot 0009 > tab HLOAD 0008 0010 int FLOAD 0009 tab.hmask 0011 > int EQ 0010 +15 // Specialize to hash table size 0012 int FLOAD 0009 tab.node 0013 > int HREFK 0012 "bxor" [slot 11]// Hash ref, specialized to slot 0014 > fun HLOAD 0013 0015 > num SLOAD #1 // x 0016 > fun FRAME 0014 bit.bxor // Frame guard, specialize dispatch 0017 int TOBIT 0015 bias // Double -> int32 with bit rules 0018 + int BXOR 0017 0001 // x xor i 0019 + int ADD 0001 +1 // Overflow check eliminated 0020 > int LE 0019 +100 0021 ------ LOOP ----------- // Note: loads and dispatch hoisted! 0022 * num TONUM 0018 // Compensate type mismatch (dead) 0023 + int BXOR 0019 0018 // Note: folded TOBIT(TONUM(x)) ==> x 0024 + int ADD 0019 +1 0025 > int LE 0024 +100 0026 int PHI 0019 0024 // PHI for i (narrowed to int) 0027 int PHI 0018 0023 // PHI for x (narrowed to int) ---- TRACE 0 mcode 257 [... Only the inner loop is shown] ->LOOP: f7fa2fed 33F5 xor esi, ebp f7fa2fef 83C501 add ebp, +0x01 f7fa2ff2 83FD64 cmp ebp, +0x64 f7fa2ff5 0F8EF2FFFFFF jle 0xf7fa2fed ->LOOP f7fa2ffb E990AD0A10 jmp 0x0804dd90 ->EXIT_0 ---- TRACE 0 end ** Glossary: FOLD: Constant Folding, Algebraic Simplifications and Reassociation CSE: Common-Subexpression Elimination FWD: Load and Store Forwarding (L2L and S2L) using Alias Analysis (AA) SINK: Store Sinking and Allocation Sinking. DCE: Dead-Code Elimination VRP: Value-Range Propagation ABC: Array-Bounds Check ** Excerpt from IR instruction definition: // Constants grow downwards from HEAD, hidden/inlined by dump KPRI N ___ ___ KINT N cst ___ KGC N cst ___ KNUM N cst ___ // Guarded assertions LT G ref ref GE G ref ref LE G ref ref GT G ref ref EQ G ref ref NE G ref ref ABC G ref ref ... // Bit ops BNOT N ref ___ BAND N ref ref BOR N ref ref BXOR N ref ref BSHL N ref ref ... // Arithmetic ops ADD N ref ref SUB N ref ref MUL N ref ref DIV N ref ref FPMATH N ref lit NEG N ref ref ABS N ref ref ... // Overflow-checking arithmetic ops ADDOV G ref ref SUBOV G ref ref // Memory ops. A = array, H = hash, U = upvalue, F = field, S = stack. // Ref ops AREF R ref ref HREFK RG ref ref HREF R ref ref UREFO RG ref lit UREFC RG ref lit FREF R ref lit // Loads and stores ALOAD LG ref ___ HLOAD LG ref ___ ULOAD LG ref ___ FLOAD L ref lit SLOAD LG lit lit ASTORE S ref ref HSTORE S ref ref USTORE S ref ref FSTORE S ref ref // Allocations TNEW A lit lit SNEW A ref ref ... // Type conversions TOINT G ref ___ TONUM N ref ___ TOBIT N ref ref ... // Misc FRAME G ref ref PHI N ref ref HEAD G ___ ___ LOOP G ___ ___ ...
Comment 26•15 years ago
|
||
(In reply to comment #25) > Overflow guards are separate instructions. And no, you don't need that many. > I have ADDOV and SUBOV and that's it. Everything else would be lost in the > noise. This also nicely solves the problem where you want to apply algebraic > simplifications to ADD, but not to ADDOV. And the assembler can use lea > instead of an add for integer ADD, but not for ADDOV (lea does not affect > the flags, but it helps to avoid register moves). That's pretty convincing.
Comment 27•15 years ago
|
||
Without a separate overflow check instruction, how does LuaJIT trap overflows from multiplies?
Comment 28•15 years ago
|
||
LuaJIT doesn't narrow multiplication, because this is rarely beneficial. Instead it performs strength-reduction and hoisting which covers most uses of multiplication in indexing expressions. You definitely don't want to narrow a multiply which is not used for indexing. That's why I said, you really don't need more than ADDOV and SUBOV. [ A slight digression: IMHO TraceMonkey's narrowing is too aggressive and counterproductive. I've mentioned this in a previous message to the Tamarin mailing list: | Generous use of integer operations with overflow guards is more | expensive than the equivalent FP operations. The problem is related to how current x86/x64 super-scalar CPUs work: - All branches (such as the 'jo' for overflow checks) issue to a single execution port. A high frequency of branches (add+jo, sub+jo, add+jo, etc.) effectively serializes your code, because this execution port gets clogged up. - Overflow branches are highly predictable, so in theory they could all move out of the latency chain. But the CPU has to keep track of all of the renames to be able to restore the state in case the prediction turned out to be wrong. There is a finite amount of reorder buffers, so this quickly leads to a stall. Basically the reorder buffer fills up faster (up to 3 integer ops per cycle) than it can be emptied (only up to 1 branch per cycle). Modern CPU hardware resources are balanced to work best for much lower branch frequencies than what TraceMonkey generates. - Replacing an FP add with an integer add plus an overflow check is turning a pure data-flow dependency into a different data-flow dependency (with slightly lower latency) *plus* a control dependency. In general, you don't want to do this. Latencies due to data-flow dependencies can be well hidden by out-of-order execution. The slightly longer latency of FP operations turns out to be a non-issue, except in very tight loops. - The total CPU execution bandwidth is the sum of the bandwidth of the FP and the integer units, because they execute in parallel. The FP units have an equal or higher bandwidth than the integer units. Not using them means losing execution bandwidth. Moving work away from them to the already quite busy integer units is a losing proposition. So that's the reason why both plain Lua and the LuaJIT 2.x interpreter always keep all numbers as FP values. This greatly simplifies the codebase and has been benchmarked as being faster than a mixed FP/int variant. The latencies for FP operations and conversions are irrelevant due to other overhead of the interpreter. OTOH the LuaJIT 2.x JIT compiler generates much tighter code, so it needs to do something about the FP->int conversion latencies caused by array indexing. It uses predictive narrowing for induction variables and demand-driven narrowing for indexing expressions. Additionally it can eliminate or hoists most overflow checks resulting from narrowing. Regular numeric computations are never narrowed to integers. BTW: this also means there can be no FP/int type mismatches (which seems to be a major source of complexity in TM). Details upon request, but it's already getting OT for this bug. ]
Comment 29•15 years ago
|
||
This is a comparison of the VM running with integer demotion and without. Demotion is a 6% speedup overall. The speedup is quite dramatic on tight loops with bit banging but its also a clear speedup across the board. The data was obtained by poisoning the oracle that inhibits future demotion once we detected a misspeculation for an expression once. TEST COMPARISON FROM TO DETAILS ============================================================================= ** TOTAL **: 1.060x as fast 1010.2ms +/- 0.5% 953.4ms +/- 0.3% significant ============================================================================= 3d: - 143.8ms +/- 1.7% 143.2ms +/- 0.6% cube: 1.115x as fast 44.9ms +/- 5.1% 40.3ms +/- 1.6% significant morph: *1.123x as slow* 27.1ms +/- 0.9% 30.4ms +/- 0.9% significant raytrace: *1.008x as slow* 71.8ms +/- 0.4% 72.4ms +/- 0.2% significant access: 1.046x as fast 139.6ms +/- 0.2% 133.4ms +/- 0.3% significant binary-trees: 1.016x as fast 39.6ms +/- 0.5% 39.0ms +/- 0.4% significant fannkuch: 1.006x as fast 56.7ms +/- 0.3% 56.3ms +/- 0.4% significant nbody: ?? 25.2ms +/- 0.5% 25.4ms +/- 0.8% not conclusive: might be *1.005x as slow* nsieve: 1.42x as fast 18.1ms +/- 0.6% 12.7ms +/- 1.3% significant bitops: 1.98x as fast 70.4ms +/- 0.4% 35.6ms +/- 0.9% significant 3bit-bits-in-byte: 1.91x as fast 2.9ms +/- 3.0% 1.5ms +/- 9.3% significant bits-in-byte: 3.30x as fast 26.3ms +/- 0.5% 8.0ms +/- 1.4% significant bitwise-and: 6.22x as fast 15.4ms +/- 0.9% 2.5ms +/- 5.8% significant nsieve-bits: 1.092x as fast 25.8ms +/- 0.6% 23.6ms +/- 0.9% significant controlflow: ?? 32.7ms +/- 2.3% 32.8ms +/- 0.7% not conclusive: might be *1.005x as slow* recursive: ?? 32.7ms +/- 2.3% 32.8ms +/- 0.7% not conclusive: might be *1.005x as slow* crypto: 1.078x as fast 59.5ms +/- 1.7% 55.2ms +/- 0.7% significant aes: 1.051x as fast 34.1ms +/- 2.0% 32.4ms +/- 1.0% significant md5: - 14.9ms +/- 2.4% 14.7ms +/- 0.9% sha1: 1.30x as fast 10.5ms +/- 1.4% 8.1ms +/- 1.3% significant date: 1.026x as fast 136.3ms +/- 0.6% 132.8ms +/- 0.2% significant format-tofte: 1.055x as fast 67.5ms +/- 0.8% 63.9ms +/- 0.3% significant format-xparb: - 68.8ms +/- 0.4% 68.8ms +/- 0.3% math: 1.28x as fast 48.8ms +/- 1.7% 38.2ms +/- 0.6% significant cordic: 1.53x as fast 29.2ms +/- 2.1% 19.1ms +/- 0.5% significant partial-sums: ?? 13.2ms +/- 1.6% 13.4ms +/- 1.0% not conclusive: might be *1.015x as slow* spectral-norm: 1.101x as fast 6.3ms +/- 2.1% 5.8ms +/- 2.1% significant regexp: 1.010x as fast 45.7ms +/- 0.8% 45.3ms +/- 0.4% significant dna: 1.010x as fast 45.7ms +/- 0.8% 45.3ms +/- 0.4% significant string: *1.011x as slow* 333.4ms +/- 0.5% 337.0ms +/- 0.6% significant base64: 1.054x as fast 17.8ms +/- 0.9% 16.9ms +/- 0.6% significant fasta: ?? 73.4ms +/- 1.1% 73.5ms +/- 1.1% not conclusive: might be *1.001x as slow* tagcloud: *1.012x as slow* 98.4ms +/- 0.7% 99.5ms +/- 0.8% significant unpack-code: *1.026x as slow* 113.5ms +/- 0.4% 116.5ms +/- 0.6% significant validate-input: *1.011x as slow* 30.3ms +/- 0.5% 30.6ms +/- 0.9% significant
Comment 30•15 years ago
|
||
(In reply to comment #29) > This is a comparison of the VM running with integer demotion and without. > Demotion is a 6% speedup overall. Is this an implicit argument for keeping LIR_ov? If so, I don't think anyone is arguing that the feature embodied by LIR_ov should be removed, but rather that the current implementation could be made to have cleaner semantics.
Comment 31•15 years ago
|
||
#29 was a followup for #28, quantifying the speedup LIR_ov gets us.
Comment 32•15 years ago
|
||
(In reply to comment #30) > (In reply to comment #29) > > This is a comparison of the VM running with integer demotion and without. > > Demotion is a 6% speedup overall. > > Is this an implicit argument for keeping LIR_ov? If so, I don't think anyone > is arguing that the feature embodied by LIR_ov should be removed, but rather > that the current implementation could be made to have cleaner semantics. Mike was arguing that LuaJIT got away with much less narrowing-to-ints than we have (and I was chiming in support for this notion on IRC). But nobody implied we have *none*. Last I checked the memory controller doesn't like floating point addresses :)
Reporter | ||
Comment 33•15 years ago
|
||
(In reply to comment #32) Plus I don't think we should get too enamoured of running everything possible on the FPU. That might be viable on big heavy CPUs, but on ARMs with minimal FPUs or softfloat it doesn't sound too good.
Comment 34•15 years ago
|
||
The comparison in #29 is missing the point. I said you *do* need narrowing for indexing expressions and induction variables. Turning it completely off is a no go. Obviously bit operations need narrowing, too. But there are no overflow conditions, so this is OT. You can get away with simple backpropagation, if the ES semantics for bit operations allow this transform: tobit(add_fp(x, y)) ==> add_int(tobit(x), tobit(y)) (With add_int() wrapping around and ignoring overflow.) [BTW: IMHO that measurement script is statistically unsound. For fannkuch it shows +-0.3% and +-0.4% variation, i.e. significantly overlapping ranges of 54.99-58.40 vs. 54.05-58.55. And then the script goes on to *conclusively* state a significant difference of 0.06%?!? Apart from that, measurements in the milli/microsecond range have too much systematic error. This can lead to fictitious differences for only slightly different code. You can't eliminate this error with repeated measurements, only with longer measurements.] But back to the topic: the only remaining argument for keeping LIR_ov (and LIR_x*) was that the number of combinations with other instructions is too high. I dispute that on the basis that a) there are many fewer combinations that actually make sense and b) most of the "flag" producing operations already have complements. Fusing the branch-only instructions into their producers wouldn't necessarily lead to an explosion of the instruction space. In general, instruction streams which have a concept of flags and other implicit dependencies are horrible to work with. Julian can probably tell you more about that. :-) But I can buy in on #33. Sure, for FP-challenged CPUs you need to rethink narrowing. But there's a limit to that. Either the ES spec has to change or ARM needs to design CPUs with decent performance for double-precision FP. I'm not sure which of the two is more unlikely. :-) [Many of the ARM devices out there nowadays have NEON, but this is single-precision only. With VFP you get double-precision. Alas, the vendors like to hide the fact that most of their gadgets actually contain VFPlite. The high latencies and the low throughput makes softfp suddenly look like an attractive option.]
Comment 35•15 years ago
|
||
[Sorry, correcting my own math: 'ranges of 56.53-56.87 vs. 56.07-56.53' and a conclusive '0.6% difference' (= 340 microseconds). Still not convincing.]
Reporter | ||
Comment 36•15 years ago
|
||
Summary of my position re LIR_ov: * first, just to be clear, I am not trying to get rid of overflow checking per se, or change the generated code. I'm merely trying to deal with the fact that LIR_ov differs from almost all other value-consuming LIRs in that it doesn't have a pointer back to the value-producing LIR. Instead it implicitly references the sourcing LIR by immediately following it in the LIR buffer, thereby constituting a special case requiring special handling and special consideration when we come to think about "how can we transform/mangle LIR", etc. * doing all math on the FPU isn't desirable, esp for low end rigs * a fused compute-and-exit construct, as per Comment #1, would work fine. * I don't see that arguments about "opcode explosion", etc, are relevant. We'd have just a single "conditional exit" construct as at present, producing a value too, with fields for the two operands, the operation (add, sub, mul, etc), the exit condition (ov, cs, etc). * re comment 18 (push/pop %eflags), that's very expensive. Leastways was last time I looked. We used it extensively in the Valgrind 2.x line and discovered they have latencies (in cycles) roughly equivalent to the pipeline depth. Really the only sane way to do conditional branches is to set the flags in one (x86) insn and branch immediately in the next one. That's what gcc generates 99% of the time anyway. I had hoped to do some LIR hackery this week, but have got bogged down in LIR debug printing swampery (bug 494864).
Reporter | ||
Comment 37•15 years ago
|
||
This comment updates the initial list of stuff in LIR to look at, as stated in (1) ... (5) in the Description for this bug. New things are: (6) Look at forward jumps in LIR fragments, and see if there is a way to clean this up. It breaks the nice simple story that an LIR fragment is a superblock (single-entry-multiple-exit) and causes the compiler to have to deal with control flow merges, which it wouldn't otherwise have to. There's evidently some magic to do with LIR_live and the lack of phi-nodes that I have yet to take on board. I understand that the motivation is to be able to do simple control-flow diamonds without the overhead (and implied tail duplication) of a full trace exit and re-entry. (7) Similar to the LIR_ov discussion, find a way to remove implicit dependencies between LIR_div and LIR_mod without affecting code quality. (8) Document assumptions and requirements about LIR node ordering during code generation. At the moment, LIR nodes are ordered in two ways: - there is a total order imposed on them by the order they are parked in the LIR buffer. ("buffer ordering") - there is also a partial order resulting from the fact that (most) value-consuming nodes have pointers back to the nodes that generated the values. ("operand ordering"). How these are related (or are supposed to relate) isn't stated. Is the operand ordering required to be consistent with the buffer ordering? For example, it is the buffer ordering that dictates the order that memory references must appear in, since nothing else constraints them. But for value-producing/consuming node-pairs (eg, a shift-left feeding into an add), only the operand ordering actually matters. So is their relative buffer ordering significant or not? It would be good to clarify this.
Comment 38•15 years ago
|
||
(In reply to comment #37) I'll attempt to clarify the unstated intentions behind LIR, read on. > This comment updates the initial list of stuff in LIR to look at, > as stated in (1) ... (5) in the Description for this bug. > > New things are: > > (6) Look at forward jumps in LIR fragments, and see if there is a > way to clean this up. It breaks the nice simple story that > an LIR fragment is a superblock (single-entry-multiple-exit) > and causes the compiler to have to deal with control flow > merges, which it wouldn't otherwise have to. There's > evidently some magic to do with LIR_live and the lack of > phi-nodes that I have yet to take on board. there not supposed to be a story that LIR is a superblock. LIR is intended to house a complete control-flow graph complete with forward & backward jumps, etc. > I understand that the motivation is to be able to do simple > control-flow diamonds without the overhead (and implied tail > duplication) of a full trace exit and re-entry. and also, to support full CFG's, as tamarin's JIT front-end emits. > (7) Similar to the LIR_ov discussion, find a way to remove > implicit dependencies between LIR_div and LIR_mod without > affecting code quality. cool. i've been mute on this but I lead towards keeping LIR_div/mod separate and combining them with LIR_qjoin or something semantically stronger, that the instruction selector can identify and then fuse. > (8) Document assumptions and requirements about LIR node ordering > during code generation. > > At the moment, LIR nodes are ordered in two ways: > > - there is a total order imposed on them by the order they > are parked in the LIR buffer. ("buffer ordering") > > - there is also a partial order resulting from the fact that > (most) value-consuming nodes have pointers back to the > nodes that generated the values. ("operand ordering"). > > How these are related (or are supposed to relate) isn't > stated. Is the operand ordering required to be consistent > with the buffer ordering? I'm all for firming this up! so i'll try to state our intentions but without overly biasing the outcome of what we do next. what's the (most)? I think it's all. loads and stores to the local stack frame (i.e. to areas inside LIR_alloc instructions) do not follow any particular rules. they should be considered to be side-effect instructions and their buffer order is their intended order. (that said, an analysis pass could connect stores & loads to local stack area, find true dependencies, and use that to do code motion. those dependencies are just not expressed in LIR between stores & loads, only between direct LIR->LIR operand references). Within a basic block, LIR IR intends to require that value-producing instructions sit earlier in LirBuffer than value-consuming instructions. inbetween basic blocks, LIR IR doesn't intend to make any guarantees, however: * the Assember wants to see all uses before all definitions, therefore it requires that the order of traversal of the LirReader pipeline feeding it also visit basic blocks in an order such that value-consuming instructions are seen before value-producing instructions. importantly: a different LIR walker, and a different LIR consumer, say in an intermediate pass, could have different assumptions. the LIR IR itself doesn't intend to nail down the order in which blocks are traversed. I'ts always been the intention that the buffer ordering imply statement ordering. if a walker just skipped non-side-effect instructions, you'd have the statement list. this is why there's currently no separate statement list. > For example, it is the buffer ordering that dictates the > order that memory references must appear in, since nothing > else constraints them. But for value-producing/consuming > node-pairs (eg, a shift-left feeding into an add), only the > operand ordering actually matters. So is their relative > buffer ordering significant or not? It would be good to > clarify this. fully agree. did I cover it above or miss a subtle point?
Reporter | ||
Comment 39•15 years ago
|
||
(In reply to comment #38): Ed, thanks for the background. > LIR is intended to house a complete control-flow graph complete with > forward & backward jumps, etc. In which case, afaict, it would be essentially the same as bog-standard SSA, right? Apart from one thing: there are no phi-nodes in LIR, and they are integral to SSA. So, uh, how are you able to get away without having phi-nodes?
Comment 40•15 years ago
|
||
correct, if it had phi nodes, it would be standard SSA. the way it avoids phi nodes is by just making it the front-end's problem to emit stores & loads, where phi-nodes would otherwise exist. we've talked about adding phi-nodes. - pro: make the IR be standard SSA, enable SSA-based IR transformations - con: additional complexity in the backend, converting out of SSA, and in the frontend, properly placing phi nodes as the IR is being generated our frontends (TM's tracer and Tamarin's lir generator) are bone-simple, which is good and bad. but that's why things are as they are, currently. we've also talked about introducing chained def-use instructions, that could provide semantic equivalence with phi-nodes (I think) but be easier to deal with in the frontend and backend. (tangent to chained def-use idea..) LIR_def instructions replace local LIR_st's, at BB ends. LIR_use replace LIR_ld's, at BB starts (or mid-BB, wherever the reference occurs) LIR_def takes 2 operands: a value-producing expr, and a pointer to another LIR_def. As LIR is being constructed, when merge nodes are encountered and earlier definitions need to be merged, connect the LIR_def's together. in backends, always follow the chain of LIR_def's to its first occurance, and do register allocation & spilling with this one single occurance. the expressions visited along the chain of defs are the ones that would appear in a phi node at the point of a LIR_ref. .... we implemnted the above in MIR, tamarin's previous jit compiler. it's light, and it works. but it's also kind of homebrew, and its not clear that its a win over real SSA with real phi nodes. (end tangent) Also, a register-allocator and instruction selector that could work directly on an SSA representation would shove this strongly to pure SSA. I understand that the "register allocation by puzzle solving" appraoch being worked on this summer for nanojit, is meant to work right on the SSA input.
Comment 41•15 years ago
|
||
Hi Mike, greatly appreciate your detailed comments. (In reply to comment #34) > Obviously bit operations need narrowing, too. But there are no overflow > conditions, so this is OT. You can get away with simple backpropagation, > if the ES semantics for bit operations allow this transform: > tobit(add_fp(x, y)) ==> add_int(tobit(x), tobit(y)) > (With add_int() wrapping around and ignoring overflow.) Indeed JS semantics allow this transform (non-finite values become 0, negative zero goes to 0 too). > [BTW: IMHO that measurement script is statistically unsound. For > fannkuch it shows +-0.3% and +-0.4% variation, i.e. significantly > overlapping ranges of 54.99-58.40 vs. 54.05-58.55. And then the script > goes on to *conclusively* state a significant difference of 0.06%?!? This is something we've all noticed. Cc'ing dmandelin, I would like to have a conclusive proof to file as a SunSpider bug in http://bugs.webkit.org. > Apart from that, measurements in the milli/microsecond range have too > much systematic error. This can lead to fictitious differences for only > slightly different code. You can't eliminate this error with repeated > measurements, only with longer measurements.] Indeed SunSpider's tests are absurdly short for this day and age. The MD5 data set was cut down so that the test wouldn't take too long in IE three years ago, from what I heard. > But I can buy in on #33. Sure, for FP-challenged CPUs you need to > rethink narrowing. But there's a limit to that. Either the ES spec has > to change or ARM needs to design CPUs with decent performance for > double-precision FP. I'm not sure which of the two is more unlikely. :-) If you made me bet, I would bet on ARM changing sooner. The JS standard changes mostly by adding selectively and enabling such things as "use strict" in ES5. Down the road if we have led programmers away from bad patterns by this combo of carrots and sticks, we may be able to remove things, but only after uses of the feature to remove have all but disappeared on the web. /be
Comment 42•15 years ago
|
||
(In reply to comment #34) > The comparison in #29 is missing the point. I said you *do* need > narrowing for indexing expressions and induction variables. Turning it > completely off is a no go. No, I only turned of _speculative_ narrowing. All non-speculative narrowing was still done. Also, current TM tip does not narrow mul/div/mod, only add/sub. Narrowing mul/div/mod is a 20ms win over tip (patch is about to land), with some more to come in the array path in a separate patch (we currently always read values as doubles from an array when reading numbers, since specializing to 31-bit integer loads causes trace explosion in crypto code). > Obviously bit operations need narrowing, too. But there are no overflow > conditions, so this is OT. You can get away with simple backpropagation, > if the ES semantics for bit operations allow this transform: > tobit(add_fp(x, y)) ==> add_int(tobit(x), tobit(y)) > (With add_int() wrapping around and ignoring overflow.) Yeah, we always do this and it was still done in the modified benchmark run. > [BTW: IMHO that measurement script is statistically unsound. For > fannkuch it shows +-0.3% and +-0.4% variation, i.e. significantly > overlapping ranges of 54.99-58.40 vs. 54.05-58.55. And then the script > goes on to *conclusively* state a significant difference of 0.06%?!? > > Apart from that, measurements in the milli/microsecond range have too > much systematic error. This can lead to fictitious differences for only > slightly different code. You can't eliminate this error with repeated > measurements, only with longer measurements.] Yes, absolutely, the measurements and the harness are ****. Still, the magnitude of the difference proves an impact beyond reasonable doubt. > But back to the topic: the only remaining argument for keeping LIR_ov > (and LIR_x*) was that the number of combinations with other instructions > is too high. I dispute that on the basis that a) there are many fewer > combinations that actually make sense and b) most of the "flag" > producing operations already have complements. Fusing the branch-only > instructions into their producers wouldn't necessarily lead to an > explosion of the instruction space. > > In general, instruction streams which have a concept of flags and other > implicit dependencies are horrible to work with. Julian can probably > tell you more about that. :-) > > > But I can buy in on #33. Sure, for FP-challenged CPUs you need to > rethink narrowing. But there's a limit to that. Either the ES spec has > to change or ARM needs to design CPUs with decent performance for > double-precision FP. I'm not sure which of the two is more unlikely. :-) > > [Many of the ARM devices out there nowadays have NEON, but this is > single-precision only. With VFP you get double-precision. Alas, the > vendors like to hide the fact that most of their gadgets actually > contain VFPlite. The high latencies and the low throughput makes > softfp suddenly look like an attractive option.] I was hoping for a win with narrowing mul/div/mod on ARM, but it turns out its integer path sucks even more than its FPU. The VFP devices we ship on don't have a fast hardware integer divider, so VFP is actually better there. The mul/div/mod narrowing is disabled on anything but x86 at the moment. We should probably take this discussion to a separate bug. I suggest https://bugzilla.mozilla.org/show_bug.cgi?id=474443
Reporter | ||
Comment 43•15 years ago
|
||
Here's an interim patch which gets rid of LIR_ov, as per Comment #1. This is WIP and the patch is far from being correct or usable. It does however demonstrate that LIR_ov is removable without major ado. The main features are: * three new LIR opcodes, x_{add,sub_mul}_ov, which combine the activities of doing a binary operation, overflow detection, and possible trace exit. They are used like this: op1 = LIR_2(argL, argR), op2 = guardRec and themselves generate the result of the operation (add, sub, mul). The use of LIR_2 is a kludge necessary because 3 operands are required (argL, argR, guardRec). * modified TraceRecorder::alu and TraceRecorder::record_JSOP_NEG which creates these * Assembler::asm_branch_x_binop_ov to emit instructions for them. It runs for quite some number of tests in trace-tests.js before asserting. There is much to dislike about the patch: * first and foremost, the changes to TraceRecorder::alu and TraceRecorder::record_JSOP_NEG are almost certainly wrong, in not handling overflow correctly * Assembler::asm_branch_x_binop_ov in Nativei386.cpp essentially duplicates logic from Assembler::asm_arith for the LIR_add/sub/mul cases. This could be commoned up. * all the constant folding and CSE etc machinery does not operate on LIR_x_{add,sub_mul}_ov, as that would require duplicating the logic from the normal binary ops LIR_{add,sub,mul}. * it uses up 3 more entries in the top level LIR opcode space and is therefore un-expandable since that space is nearly full. A solution would be to split the space into two levels, and put all the arithmetic opcodes at the second level. Overall, it turned out that messing with the LIR representation to add the new opcodes, and dealing with the instruction selection aspects was easy. The real difficulty is in jstracer.cpp, in forming these LIRs in the first place. There is a subtle interaction between overflow checking, constant folding and speculative integer demotion which I did not so far manage to understand. Because of that (and the fact that the JIT eventually asserts) I am sure the patch is wrong. Anyway I am posting it just in case it later turns out to be useful, and so as to record my explorations here. As the patch does not solve any immediate problem, I'll move on to other stuff.
Comment 44•15 years ago
|
||
Definitively the right direction. I like most about the patch that this will allow us to dead-code eliminate the addition. Previously the guard code always kept the math op alive, even if it was otherwise dead.
Comment 45•15 years ago
|
||
I don't like the x_add,x_sub, etc. idea, for the reasons listed in comment 43, and because it seems like it will spawn more and more duplication of opcodes and logic, and finally because it just feels weird to have such specialized instructions. In particular, I would like to see nanojit evolve to a general yet small compiler backend, and eschew one-off features added for particular front-end applications. (For the same reason, I want to keep support for full-strength CFGs, so I was glad to hear that's already being used in Tamarin.) Discussion above has said that nanojit LIR is pretty much standard SSA, which I think is great and exactly where we should go in the future. Going with that idea, it seems the way to handle overflow (and other flags) is to make it an explicit part of the semantics, so the LIR looks like this: 15 LIR_add 6, 12 16 LIR_ov 15 // 1 iff isn 15 overflows 17 LIR_xt 16 I think this is pretty much like the current version except that LIR_ov has an explicit operand. In the sequence above, LIR_ov requires no code on x86, and this should be fairly easy to detect. I assume op/ov/branch is the only common sequence that we care that much to optimize. If there are intervening ops, we can just generate flag-loading instructions, and worry about optimizing that more iff it ever matters.
Comment 46•15 years ago
|
||
On the subject of int vs. floats and how much narrowing to do, it would seem to depend on the benchmark. I think our results show pretty conclusively that aggressive narrowing is good for SunSpider on x86, possibly because of the high usage of bit ops and array indexing. If we wanted to know more about this issue, I would suggest generating some x86 for an interesting trace, disassembling it, recoding by hand with a variety of narrowing choices, and then measuring direct runs. On the subject of SunSpider statistics, if we want a good SunSpider score, then we have no choice but to use SunSpider metrics, whether the benchmarks are good or not. At least on MacOS, the results are repeatable, so in that sense they're pretty usable. I think you can get down to about +/-5 ms maximum precision by doing a large number of runs. In terms of new benchmarks, I think the things that would help us most would be microbenchmarks and application-based benchmarks. We really don't know a lot about the performance of specific operations because we haven't used microbenchmarks much yet. In particular, it would help us a lot in learning which features we definitely have a lot of room to improve on. And on the other hand, I think we're finding that improved SS scores doesn't necessarily mean improved SS scores, primarily because SS exercises only a small part of the language, and also because the programs are so short.
Comment 47•15 years ago
|
||
#45: If we can extend that somehow to make the guard part of the data-flow, we would get the same dead code elimination benefits.
Comment 48•15 years ago
|
||
(In reply to comment #43) > The use of LIR_2 is a kludge necessary because 3 operands are > required (argL, argR, guardRec). FWIW, bug 501232 will remove LIR_2 by allowing trinary instructions.
Comment 49•15 years ago
|
||
(In reply to comment #45) > Discussion above has said that nanojit LIR is pretty much standard SSA, which I > think is great and exactly where we should go in the future. Going with that > idea, it seems the way to handle overflow (and other flags) is to make it an > explicit part of the semantics, so the LIR looks like this: > > 15 LIR_add 6, 12 > 16 LIR_ov 15 // 1 iff isn 15 overflows > 17 LIR_xt 16 > > I think this is pretty much like the current version except that LIR_ov has an > explicit operand. I find this less clean that the current approach -- "LIR_ov 15" looks like it's using the result of insn 15 (ie. the sum) as an input, when it's really using something else. That "something else" is either a side-effect of the add (if you view it in terms of setting condition codes) or a second output of the add (if you view the overflow result as a second output). It's a similar problem to LIR_callh, which also looks like it's operating on something (the low 32 bits of a return value) when it's really operating on something else (the corresponding high 32 bits).
Reporter | ||
Comment 50•15 years ago
|
||
(In reply to comment #45) There's two orthogonal issues addressed in your post. (1) whether the approach to compute-and-if-overflow-exit shown by the patch in #43 is a good thing. (2) whether we should move LIR/NJ to full SSA. Re (1), I don't like the duplication either. With some careful refactoring, perhaps the code duplication could be avoided. Something like: we have a LIR binary op node, and we have a guards which incorporate LIR binary op nodes; in both cases the folding machinery etc should simply operate on the underlying op node. Re (2) > Discussion above has said that nanojit LIR is pretty much standard SSA, > which I think is great and exactly where we should go in the future. I think it's a bit misleading to say that NJ LIR is pretty much standard SSA. LIR has no phi-nodes, which are mandatory in standard SSA. Also, LIR doesn't have a notion of a real control flow graph, in which blocks can be moved around and visited in an order which gives the best code quality: instead, we are forced to visit the blocks in the particular order they have been parked in the LIR buffer, which may not be the best from a register allocation point of view. I believe that implementing full SSA, with phi-nodes, the dominance frontier stuff needed to make them work, plus representing an arbitrary CFG, plus identifying loops, plus having to do decent register allocation for variables in loops, represent a big step up in JIT code complexity and JIT-time cost, the latter of which we're unlikely to recover in increased code quality. My instinct is to head in the opposite direction. Most of the fragments I've looked at are single-entry, multiple exit affairs. I suspect the vast majority of executed fragments are like that. In which case the added overhead of the full SSA machinery buys you not very much. I'd suggest: (a) restrict LIR to actually be this superblock form (single entry, multiple exit) -- that is, get rid of all forward and backwards jumps (b) find a way to improve the inter-fragment jumping conventions, in particular so we don't have to flush all live variables to memory at such transitions Assuming that most executed fragments really are superblocks, then (a) doesn't hurt performance, and (b) improves it for general-case code. This means we could reuse the existing compiler more or less, with no change to the code quality within each superblock relative to at present, and with no significant extra JIT-time expense. Some years ago there was a Java JIT project at Mozilla, called ElectricalFire: http://web.archive.org/web/20021020161548/www.mozilla.org/projects/ef/techdocs/index.html IIUC, they were implementing full SSA in a JIT context. What happened to those people? Is there anything we can learn from them?
Comment 51•15 years ago
|
||
(In reply to comment #49) > (In reply to comment #45) > > > > 15 LIR_add 6, 12 > > 16 LIR_ov 15 // 1 iff isn 15 overflows > > 17 LIR_xt 16 > > I find this less clean that the current approach -- "LIR_ov 15" looks like it's > using the result of insn 15 (ie. the sum) as an input, when it's really using > something else. That "something else" is either a side-effect of the add (if > you view it in terms of setting condition codes) or a second output of the add > (if you view the overflow result as a second output). I had viewed "LIR_ov 15" as using the result of 15, with the result understood to be the result of logical integer addition (i.e., a possibly bigger than 32-bit value) as opposed to 32-bit 2's complement addition yielding a 32-bit value. I guess there's nothing wrong with viewing that as referring to a second output, but then again, I have nothing against second outputs. I was also thinking of LLVM, which seems to use a similar design. If we had a type system for LIR, with defined conversions, it wouldn't be too hard to deal with things like this:. The typing could be something like: type int32 # primitive 32-bit integer type type flags # ALU result flags type type int32_result = (int32 * flags) # ALU operation result type typeof(LIR_add) : int32 int32 -> int32_result # LIR_add and most other ALU ops accept inputs of type int32_result, with # explicit conversion to int32 by extracting that element typeof(LIR_ov) : int32_result -> bit At least to me :-), the semantics are clear and it seems a type checker on LIR could prevent most mistakes over this sort of thing.
Comment 52•15 years ago
|
||
(In reply to comment #50) > (In reply to comment #45) > > There's two orthogonal issues addressed in your post. (1) whether the > approach to compute-and-if-overflow-exit shown by the patch in #43 is > a good thing. (2) whether we should move LIR/NJ to full SSA. > > Re (1), I don't like the duplication either. With some careful > refactoring, perhaps the code duplication could be avoided. Something > like: we have a LIR binary op node, and we have a guards which incorporate > LIR binary op nodes; in both cases the folding machinery etc should > simply operate on the underlying op node. I'm not entirely sure what you are proposing here, but it sounds like you are suggesting an instruction like "LIR_xov" that guards on the overflow result of a LIR ALU op. Is that right? It seems pretty closely related to what I suggested, but it seems are you are also suggesting some explicit requirement to tightly link these exits with ALU ops. Perhaps the type system I sketched could also do this. > Re (2) > > > Discussion above has said that nanojit LIR is pretty much standard SSA, > > which I think is great and exactly where we should go in the future. > > I think it's a bit misleading to say that NJ LIR is pretty much > standard SSA. LIR has no phi-nodes, which are mandatory in standard > SSA. Also, LIR doesn't have a notion of a real control flow graph, in > which blocks can be moved around and visited in an order which gives > the best code quality: instead, we are forced to visit the blocks in > the particular order they have been parked in the LIR buffer, which > may not be the best from a register allocation point of view. Those seem to be implementation aspects as opposed to conceptual aspects. But fair enough, LIR certainly doesn't allow phi nodes currently. > I believe that implementing full SSA, with phi-nodes, the dominance frontier > stuff needed to make them work, plus representing an arbitrary CFG, plus > identifying loops, plus having to do decent register allocation for variables > in loops, represent a big step up in JIT code complexity and JIT-time cost, > the latter of which we're unlikely to recover in increased code quality. I wasn't proposing all of the above. In particular, we won't need dominance analysis and phi-node placement because our code generators will know exactly where to put the phi nodes. One of my applications is generating 'fast-path' versions of ops. Consider this pseudocode: if (condition Q): # fast path result_1 = (fast path LIR) else: # slow path args = (set up call) LIR_call result_2 = (process call results) result = LIR_phi(result_1, result_2) Being able to do this is valuable: our competition does it, and it's not hard for them because they are doing their register allocation and value tracking directly over simple assembly blocks. I hope we will be able to generate code like this as well, and supporting some possibly restricted kind of phi function seems like a good way to do it. I think all would need to support applications like that is iterative liveness analysis and a register allocator that understands LIR_phi. We also have a grad student currently working on a JIT-friendly "puzzle solving" register allocator as part of a GSoC, so we should find out soon enough whether that works well or not. > My instinct is to head in the opposite direction. > > Most of the fragments I've looked at are single-entry, multiple exit > affairs. I suspect the vast majority of executed fragments are like that. That's mostly true. I think dense array accesses and some imacro things are the main exceptions. But we also have an intern working on inline threading for TraceMonkey using nanojit, which will certainly want structured control flow in LIR. I believe Tamarin is also using structured control flow in LIR. > In which case the added overhead of the full SSA machinery buys you > not very much. > > I'd suggest: > > (a) restrict LIR to actually be this superblock form (single entry, > multiple exit) -- that is, get rid of all forward and backwards > jumps > > (b) find a way to improve the inter-fragment jumping conventions, in > particular so we don't have to flush all live variables to memory > at such transitions > > Assuming that most executed fragments really are superblocks, then (a) > doesn't hurt performance, and (b) improves it for general-case code. > This means we could reuse the existing compiler more or less, with no > change to the code quality within each superblock relative to at present, > and with no significant extra JIT-time expense. This seems at least as difficult to implement as a LIR_phi plus a linear-scan-type register allocator. In particular, loops don't seem very easy to handle that way. It also seems rather painful from the user's (i.e. person writing code to generate LIR) point of view. > Some years ago there was a Java JIT project at Mozilla, called > ElectricalFire: > http://web.archive.org/web/20021020161548/www.mozilla.org/projects/ef/techdocs/index.html > > IIUC, they were implementing full SSA in a JIT context. What happened to > those people? Is there anything we can learn from them?
Comment 53•15 years ago
|
||
(In reply to comment #50) > Some years ago there was a Java JIT project at Mozilla, called > ElectricalFire: > http://web.archive.org/web/20021020161548/www.mozilla.org/projects/ef/techdocs/index.html > > IIUC, they were implementing full SSA in a JIT context. What happened to > those people? Waldemar Horwat, who led that team, is at Google. He is a member of Ecma TC39. > Is there anything we can learn from them? Possibly, but that was a lont time ago, and EF never reached a usable state of completion. Others who hacked on it have scattered in the Netscape diaspora, and I've lost track of them. I think you're right to shy away from full SSA. For the short term, can we get a small (2x would not be small for benchmark results) linear speedup out of Nanojit, with cleanups as we go, by generating better code, avoiding going thru memory, etc. -- without jumping into the deep end of the SSA pool? /be
Comment 54•15 years ago
|
||
(In reply to comment #51) > > type int32 # primitive 32-bit integer type > type flags # ALU result flags type > type int32_result = (int32 * flags) # ALU operation result type > > typeof(LIR_add) : int32 int32 -> int32_result > # LIR_add and most other ALU ops accept inputs of type int32_result, with > # explicit conversion to int32 by extracting that element > typeof(LIR_ov) : int32_result -> bit Does LIR_add take int32 or int32_result operands as inputs? Did you mean this: > typeof(LIR_add) : int32_result int32_result -> int32_result That would be bad because it's not true -- the add doesn't rely on input flags. Or do you really intend to introduce an explicit conversion like this: typeof(LIR_extract_int32) : int32_result -> int32 That would be bad because LIR_extract_int32 would be extremely common. Or did you mean something else? I see type-checking LIR as a very worthwhile exercise in correctness, and a compelling argument for fixing instructions like LIR_ov -- because I can't see how it can be sensibly typed in its current form.
Comment 55•15 years ago
|
||
(In reply to comment #54) Preface: I'm with Andreas on this stuff that we should focus on stuff that has concrete benefits as opposed to conceptual cleanup. The LIR_ov debate is questionable by that criterion, although I do think typing or otherwise validating LIR is very worthwhile. > (In reply to comment #51) > > > > type int32 # primitive 32-bit integer type > > type flags # ALU result flags type > > type int32_result = (int32 * flags) # ALU operation result type > > > > typeof(LIR_add) : int32 int32 -> int32_result > > # LIR_add and most other ALU ops accept inputs of type int32_result, with > > # explicit conversion to int32 by extracting that element > > typeof(LIR_ov) : int32_result -> bit > > Or did you mean something else? My comment explains it, although there was a typo that totally changed the meaning. :-) I meant to say that LIR_add and such accept inputs of type int32_result, with *implicit* conversion to int32 by extracting that element. > I see type-checking LIR as a very worthwhile exercise in correctness, and a > compelling argument for fixing instructions like LIR_ov -- because I can't see > how it can be sensibly typed in its current form. It could be typed by a type-and-effect system or whatever you want to call it. Although, when I think about it, it seems like the current LIR_ov is always valid, and just refers to the last instruction that set condition flags. So it seems that current LIR_ov is not hard to type, just doesn't have clear expression semantics.
Comment 56•15 years ago
|
||
(In reply to comment #55) > I meant to say that LIR_add and such accept inputs of type > int32_result, with *implicit* conversion to int32 by extracting that element. What's the type of LIR_imm? Presumably int32, but then you can't feed one into an LIR_add. I'm glad there's broad agreement that type-checking LIR is a worthwhile "concrete" goal. I'm just having trouble seeing a sane way to type LIR_ov in its current form.
Comment 57•15 years ago
|
||
Why? LIR_ov has a very concrete strict usage form: {LIR_add|sub|mul} is immediately followed by LIR_ov, with that LIR_ov directly pointing to that prior instruction. Just like LIR_callh. Its not pretty, but you can definitively type-check it easily. For me the more interesting problem is how do we make overflow guards dataflow dependent and subject to dead-code elimination if the result of the op isn't used. Example: x = add a, b y = ov x guard y, exit The guard will always keep y alive, which always keeps x alive. This is the problem I would prefer to focus on. If we end up finding a cool solution for this, that also happens to clarify our semantics along the way, that would be awesome.
Comment 58•15 years ago
|
||
(In reply to comment #56) > (In reply to comment #55) > > > I meant to say that LIR_add and such accept inputs of type > > int32_result, with *implicit* conversion to int32 by extracting that element. > > What's the type of LIR_imm? Presumably int32, but then you can't feed one into > an LIR_add. OK, clearly there is excessive confusion here. Let me start over. # Type definitions type int32 # primitive 32-bit integer type type flags # ALU result flags type type int32_result = (int32 * flags) # ALU operation result type # Function types of LIR operations typeof(LIR_imm) : -> int32 typeof(LIR_add) : int32 int32 -> int32_result typeof(LIR_ov) : int32_result -> bit Note: any operation that takes an 'int32'-typed input can also be given an 'int32_result'-typed input. An 'int32_result'->'int32' conversion with the obvious semantics is inserted implicitly. > I'm glad there's broad agreement that type-checking LIR is a worthwhile > "concrete" goal. I'm just having trouble seeing a sane way to type LIR_ov in > its current form.
Comment 59•15 years ago
|
||
(In reply to comment #57) > For me the more interesting problem is how do we make overflow guards dataflow > dependent and subject to dead-code elimination if the result of the op isn't > used. Example: > > x = add a, b > y = ov x > guard y, exit > > The guard will always keep y alive, which always keeps x alive. This is the > problem I would prefer to focus on. If we end up finding a cool solution for > this, that also happens to clarify our semantics along the way, that would be > awesome. This problem seems independent of overflow handling, because it could apply to any guard. A clean way to get this DCE to happen in one pass would be to create a LIR_xadd instruction or something like that. LIR_xadd would return a value, but also include a guard and exit. Thus, if the value is not used, the LIR_xadd can be eliminated in the usual way. Another possibility would be to make a new kind of guard instruction that doesn't keep its input live. This would be used for guards that only protect a value being computed, while the current kind of guard would be used for tests that protect the trace itself.
Reporter | ||
Comment 60•15 years ago
|
||
> For me the more interesting problem is how do we make overflow guards dataflow
> dependent and subject to dead-code elimination if the result of the op isn't
> used.
Umm .. but I provided a suggestion for fixing that in #1, and a partial
proof of concept patch in #43. Said patch, if integrated properly in the
tree, would make dead code removal of guards trivial.
dmandelin is suggesting the same thing in #59 w.r.t. LIR_xadd.
Comment 61•15 years ago
|
||
(In reply to comment #59) > (In reply to comment #57) > > For me the more interesting problem is how do we make overflow guards dataflow > > dependent and subject to dead-code elimination if the result of the op isn't > > used. Example: > > > > x = add a, b > > y = ov x > > guard y, exit > > > > The guard will always keep y alive, which always keeps x alive. This is the > > problem I would prefer to focus on. If we end up finding a cool solution for > > this, that also happens to clarify our semantics along the way, that would be > > awesome. That takes us back to Julian's LIR_x_add_ov instruction. > This problem seems independent of overflow handling, because it could apply to > any guard. A clean way to get this DCE to happen in one pass would be to create > a LIR_xadd instruction or something like that. LIR_xadd would return a value, > but also include a guard and exit. Thus, if the value is not used, the LIR_xadd > can be eliminated in the usual way. I don't think that makes sense. This is closely tied to the presence of the ov. LIR_x_add_ov embodies a condition: "exit if the add overflows". LIR_xadd doesn't embody any such condition.
Comment 62•15 years ago
|
||
We discussed LIR_x_add_ov above. Its a bit heavyweight. I was hoping for something that is more inline with the fine-grained nature of LIR. Ed pointed out we might want to branch on these things as well, and we might want to branch and exit in case of overflow of not-overflow, and for at least 3 instructions, which is a whole bunch of combinations (its 6am, I let someone else count them all). My attempt at this is documented in https://bugzilla.mozilla.org/show_bug.cgi?id=498035. Its a value-producing instruction that marks a block and the entire block goes dead if the value isn't used. Such guard constructs can be encapsulated in it. Wiring that up to DCE is a bit ugly, but certainly doable.
Comment 63•15 years ago
|
||
(In reply to comment #50) > My instinct is to head in the opposite direction. > > Most of the fragments I've looked at are single-entry, multiple exit > affairs. I suspect the vast majority of executed fragments are like that. > In which case the added overhead of the full SSA machinery buys you > not very much. > I am currently working on implementing inline threading using nanojit, and the code I generate is very different than that generated for traces. Absolute and conditional jumps in the javascript correspond directly to absolute and conditional jumps in the generated LIR. Despite having a real CFG, the generated code doesn't really have any need for Phi-nodes (nor will it once I begin inlining simple opcodes). It's possible that possible future optimizations such as further inlining fast paths could require Phi-nodes, but I haven't thought that through fully. > I'd suggest: > > (a) restrict LIR to actually be this superblock form (single entry, > multiple exit) -- that is, get rid of all forward and backwards > jumps > This would *seriously* complicate code generation for inline threaded code. The generated code would need to be a large number of linked together fragments (one fragment per jump target, I believe), instead of one fragment with straightforward internal control flow. > (b) find a way to improve the inter-fragment jumping conventions, in > particular so we don't have to flush all live variables to memory > at such transitions > Inter-fragment jumps would need to be very cheap to not be a deal-breaker for nanojit based inline threading. How expensive are they now?
Comment 64•15 years ago
|
||
(In reply to comment #62) > We discussed LIR_x_add_ov above. Its a bit heavyweight. I was hoping for > something that is more inline with the fine-grained nature of LIR. Ed pointed > out we might want to branch on these things as well, and we might want to > branch and exit in case of overflow of not-overflow, and for at least 3 > instructions, which is a whole bunch of combinations (its 6am, I let someone > else count them all). > > My attempt at this is documented in > https://bugzilla.mozilla.org/show_bug.cgi?id=498035. Its a value-producing > instruction that marks a block and the entire block goes dead if the value > isn't used. Such guard constructs can be encapsulated in it. Wiring that up to > DCE is a bit ugly, but certainly doable. 1. Encoding the blocks using pseudoinstructions is indeed ugly, although I think the idea itself is not too bad. Expanding your explanation a bit, I would say that the (logical) CFG can contain a "smurf-block" of multiple instructions, where *only* the final instruction produces a value that can be used outside the "smurf-block", and if the final instruction is dead, then the entire "smurf-block" is dead. I would want this to come along with a test-mode checker that verifies that only the final instruction is used outside the "smurf-block". The other hazard is that the user can generate a "smurf-block" that contains a guard that verifies a property needed by the rest of the trace, not just the "smurf-block"'s value, in which case eliminating it would be erroneous. But so far it appears that risk applies in every dead-value guard-elimination algorithm I have thought of. 2. I thought a little more about more traditional solutions today, and it appears that nothing involving LIR_ov is likely to work cleanly. Consider this trace: 10 add 5, 6 11 ov 10 12 xt 11 For all the traces we generate, the following code will never refer to the value of instruction 11. Thus, when we arde scanning instruction 12, there is no information directly available *at all*: the form of instruction 11 and the state of its operands is the same, no matter what code follows, so clearly it can't be used to make any decisions. 3. What makes sense to me today is a LIR_xov instruction, that exits if the single operand value caused an overflow, and otherwise evaluates to the value of its operand. Now the code is this: 10 add 5, 6 11 xov 10 When scanning 11, if 11 is not used, we can say that it is dead and eliminate it, and then we would eliminate 10 as well because it is now dead. If 11 is used, then we correctly keep it around. I like this because the conceptual effect and purpose of a LIR_xov is to produce a non-overflowed integer value. If said value is not used, it is dead. A side issue is that theoretically we could generate a side exit on overflow that we intend to guard a trace property, and not to produce a value. Thus, we would never want that instruction eliminated. I think in that case, we would not want to use the value-producing LIR_xov. Instead, we would generate a LIR_ov like now and use it with a non-value producing guard, or create a new LIR_xovn that exits if its operand overflowed, but does not produce a value.
Comment 65•15 years ago
|
||
(In reply to comment #57) > For me the more interesting problem is how do we make overflow guards dataflow > dependent and subject to dead-code elimination if the result of the op isn't > used. Example: > > x = add a, b > y = ov x > guard y, exit > > The guard will always keep y alive, which always keeps x alive. If that exit will report an exception, then it must never be eliminated, because doing so would change the behavior of the code. If the exit would simply bail back to the interpreter to compute the value slowly and then ignore it, then it can be eliminated. Do we distinguish these two cases at the moment?
Comment 66•15 years ago
|
||
It seems there is really no good solution for this problem. xov is my personal favorite too, given the choices we discussed so far. We have already xt/xf, xo (or xov) seems somewhat logical. We will likely not need xno (xnov), but we might end up adding jo jno. We can probably drop LIR_cs, since we have no immediate plans of using it, which would further reduce the cross product of names.
Comment 67•15 years ago
|
||
(In reply to comment #66) > We can probably drop LIR_cs, since we have no > immediate plans of using it, which would further reduce the cross product of > names. Already done in bug 493125, which was pushed to TM yesterday.
Comment 68•15 years ago
|
||
I'll add my $.02 I like LIR_xov too, given the alternatives. LIR currently doesn't force any particular block traversal order; block traversal is determined by the LirReader which feeds the backend code gen pipeline. tamarin-redux contains a VerifyWriter filter used for basic type checking and LIR validation during the LIR generation pass. (full of asserts). Lets promote it into nanojit and improve it? We might still want LIR_ov for use with branching: jt(ov(add)). Or, jov(add). since plain branches don't need to be value producing, either one seems simple enough. but i haven't thought through it completely; all the needs for DCE of guards are still there for plain branches too, so maybe we really do want value-producing branches and thus jov(add). Think about the non-tracing version of integer-add speculation that TM's competitors emit, and that we want to emit in Tamarin, and that TM might even emit if the speculation doesn't predict well someplace in the code.
Reporter | ||
Comment 69•15 years ago
|
||
(In reply to comment #68) > tamarin-redux contains a VerifyWriter filter used for basic type checking and > LIR validation during the LIR generation pass. (full of asserts). Lets > promote it into nanojit and improve it? I hg clone'd http://hg.mozilla.org/tamarin-redux, but couldn't find any VerifyWriter in there. Should I look somewhere different? I'm working on a LIR typechecker and would like to have a look at VerifyWriter so as not to duplicate work.
Comment 70•15 years ago
|
||
I believe Ed is referring to ValidateWriter in CodegenLIR, if you haven't already found it.
Comment 71•15 years ago
|
||
Yep, I misspoke. class ValidateWriter in core/CodegenLIR.cpp
Updated•15 years ago
|
Target Milestone: --- → Future
Updated•15 years ago
|
Component: JIT Compiler (NanoJIT) → Nanojit
Product: Tamarin → Core
QA Contact: nanojit → nanojit
Updated•15 years ago
|
Comment 72•15 years ago
|
||
FWIW, LIR_cs no longer exists.
Comment 73•14 years ago
|
||
(In reply to comment #0) > > (1) Clearly distinguish between values and effects (or, if you > like, expressions and statements) in LIR Expression instructions and statement instructions are still the same types, but isStmt() is better than it used to be -- it's now type-based rather than being a hodge-podge of pattern matching. > (2) Formalise the concept of "LIR temporary" (virtual register, > variable, etc) No change, I was never quite sure what this meant. The address of a value-producing LIns serves as the handle for the value. > (3) Get rid of LIR_ov. It has an implicit dependency through a > notional, unstated %eflags, to the immediately preceding > instruction. Not currently sure what to replace it with, tho. Done by bug 539874. > (4) Possibly make the type system a bit stronger, and have types > for LIR temporaries as well as for the operations. Write a LIR > typechecker, to check for correctly typed, SSA-form LIR. Done by bug 463137. > (5) (Further ahead) Get rid of hardwired amodes in ld/st, and > require addresses simply to be arbitrary LIR temporaries. I'm not sure what this means. Loads and stores have a base pointer, which is a LIns* (ie. effectively an arbitrary LIR expression), and an integer offset. Julian, I think this bug should be closed and any issues you still care about pushed into follow-up bugs. Do you agree?
No longer depends on: 534309
Comment 74•14 years ago
|
||
(In reply to comment #73) > Julian, I think this bug should be closed and any issues you still care about > pushed into follow-up bugs. Do you agree? I'm closing this bug due to lack of interest. Please file any remaining issues in follow-ups! Thanks.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Assignee | ||
Updated•10 years ago
|
Product: Core → Core Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•