Closed Bug 545406 Opened 15 years ago Closed 13 years ago

TM: loop invariant code motion (LICM)

Categories

(Core :: JavaScript Engine, defect)

Not set
normal

Tracking


RESOLVED WONTFIX

People

(Reporter: jseward, Assigned: n.nethercote)

References

Details

(Whiteboard: PACMAN)

Attachments

(1 file, 4 obsolete files)

With bug 545274 (proper alias info for loads, stores and calls) in place, it's possible to contemplate how to do loop invariant code hoisting on traces. Implementation sketch follows (warning: at most 60% baked, possibly less).

We want to compute the transitive closure of all value-producing nodes whose value does not depend on anything computed during the loop. So the root nodes are (a) those producing constants and (b) loads whose alias set does not intersect the alias set of any store in the trace nor the store-set of any call in the trace. Then visit all nodes that would be reachable from these in a forward traversal of the LIR. I believe the complete algorithm can be achieved in 3 passes over the LIR buffer.

Pass 1: compute aggregate store alias set, and auxiliary forward index table:

  let S:AliasSet = empty
      T:Array of LIR* = empty
  for each node in LIR buffer, working backwards {
      if node is the trace-branch-back label (head of the loop)
          break
      if node is a store
          S = S union alias-set(node)
      if node is a call
          S = S union store-alias-set(node)
      add node to the end of T
  }

So now S tells us which loads can't be considered loop invariant due to being stored to inside the loop proper, and T makes it possible to scan forwards during Pass 2.

Pass 2: compute transitive closure of loop invariant nodes:

  let I:Set of LIR* = empty   // the invariant nodes
  for each node in T, working backwards (hence fwds in LIR buf) {
      if node denotes a literal
          I = I union {node}
      else if node is a load
              and intersect( alias-set(node), S ) == empty
              and node's address argument is in I
          then I = I union {node}
      else if node is an arithmetic node of some kind
              and all of node's arguments are in I
          then I = I union {node}
  }

So now I is the set (or, a conservative subset) of loop invariant nodes. Note that we scan over the existing loop pre-header, and we ought to add all value producing nodes in the pre-header to I. So this loop is a bit too conservative.

Pass 3: move I into the pre-header.
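Passes 1 and 2 above could be sketched roughly like this in C++ (toy types, not the real LIR classes -- alias sets are bitmasks, instructions are indices into a vector, and the "forward index table" is just the vector itself):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Toy stand-ins for value-producing LIR nodes inside the loop.
enum Kind { Imm, Load, Store, Call, Arith };

struct Ins {
    Kind kind;
    uint32_t aliasSet;      // regions read (Load) or written (Store/Call)
    std::vector<int> args;  // indices of operand instructions
};

// Pass 1: union the alias sets of all stores (and call store-sets) in the loop.
uint32_t aggregateStoreSet(const std::vector<Ins>& loop) {
    uint32_t S = 0;
    for (const Ins& ins : loop)
        if (ins.kind == Store || ins.kind == Call)
            S |= ins.aliasSet;
    return S;
}

// Pass 2: forward transitive closure of invariant instructions.
std::unordered_set<int> invariantSet(const std::vector<Ins>& loop, uint32_t S) {
    std::unordered_set<int> I;
    for (int i = 0; i < (int)loop.size(); i++) {
        const Ins& ins = loop[i];
        bool argsInvariant = true;
        for (int a : ins.args)
            if (!I.count(a)) { argsInvariant = false; break; }
        if (ins.kind == Imm)
            I.insert(i);                       // literals are roots
        else if (ins.kind == Load && (ins.aliasSet & S) == 0 && argsInvariant)
            I.insert(i);                       // load from un-stored-to region
        else if (ins.kind == Arith && argsInvariant)
            I.insert(i);                       // all operands invariant
    }
    return I;
}
```

With a loop like {imm; load(region 1); store(region 2); load(region 2); arith(load1)}, the store set is {2}, so the imm, the first load, and the arith are invariant while the second load is not.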
Not sure it's possible to rearrange the LIR buffer in-place in O(# nodes) time. Hence rearrange while copying into a new LIR buffer of the same size. The copy has to be done in 4 stages:

  (a) copy the pre-header (all stuff before the loopback label)
  (b) copy nodes in I that aren't already in the pre-header
  (c) copy the loopback label
  (d) copy all nodes after the loopback label that aren't in I,
      again using T to facilitate a forwards traversal

Uh, ok. So this requires some further study on the details of the pre-header. Also, there are conflicting requirements in the representation of I. We require I to be some kind of set so that we can quickly determine whether a node is already part of it, in Pass 2. But Pass 3 (b) requires I to be array(ish) so that the order in which I's nodes appear in the pre-header (after the copy) is the same as it was to start with.
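One way to resolve the set-vs-array tension for I (an assumption about a possible representation, not what any eventual patch does) is an insertion-ordered set: a hash set for O(1) membership tests in Pass 2, plus a vector recording insertion order for the Pass 3 copy:

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Hypothetical container for I: O(1) membership plus stable iteration order.
template <typename T>
class OrderedSet {
    std::unordered_set<T> members_;
    std::vector<T> order_;
public:
    bool contains(const T& x) const { return members_.count(x) != 0; }
    void insert(const T& x) {
        if (members_.insert(x).second)  // only record first insertion
            order_.push_back(x);
    }
    const std::vector<T>& inOrder() const { return order_; }
};
```

Pass 2 would use contains()/insert(), and Pass 3 (b) would walk inOrder() so the copied nodes keep their original relative order.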
This bug is motivated by the fact that TM (not sure about TR, but probably) does lots of redundant stuff inside loops. E.g. in this:

  for (var i; i < N; i++) { x += a[i]; }

various checks (is 'a' an array? does it have 'dslots' allocated?) are done every time around the loop. Loop invariant code hoisting will improve this by hoisting the checks outside the loop so they're only done once.

Another approach is to do some static analysis of the loop body to decide if the checks can be omitted altogether. The above example doesn't have enough context to tell if the checks can be omitted, but if 'a' was accessed as an array just before the loop they probably could be, for example. This static analysis wouldn't necessarily be at the LIR level -- in TM it would probably be easier at the JS bytecode level. I'm not sure which of these ideas is better for handling this kind of thing. There may be room for both of them.
Depends on: 517910
JS is full of annoying memory and control flow ambiguities, due to things such as getters and setters. Linear SSA helps a lot, though, so I'd focus on LIR level, not JS bytecode level -- unless you meant linearized JS bytecode as seen by the TM recorder. /be
> so I'd focus on LIR level, not JS bytecode level

Yes, pushing optimizations into the level that generates the LIR tends to make it very complex. Where I was trying to get to with the original alias set proposal in bug 517910 was to annotate LIR with whatever information is needed to do the transformations we really want. Then, NJ can optimise the LIR without having to know anything about JS semantics. I'm all in favour of modularity in compiler internals.
I totally agree NJ shouldn't have to know anything about the semantics of the front-end language. I'm just saying that we should do optimisations wherever they have the best cost/benefit ratio -- that might be in NJ, it might be in the front-end.
Getting back to the general approach... I think values that get hoisted may also need to be marked as live with a 'live' instruction at the end of the loop. Hoisting increases register pressure, potentially quite a lot.

Also, this is interesting. The first example I looked at:

  state = iparam 0 ecx                                           state
  label1:                                                        state label1
  sp = ld state[8]                                               state sp
  cx = ld state[0]                                               state sp cx
  eos = ld state[12]                                             eos state sp cx
  ld1 = ld cx[0]                                                 eos ld1 state sp
  eq1 = eq ld1, 0                                                eos eq1 state sp
  xf1: xf eq1 -> pc=0x955fc37 imacpc=(nil) sp+0 rp+0 (GuardID=001)  eos state sp

ld1 is JSContext's 'operationCallBack'. AFAICT this is set when a script has been running for too long and should be aborted. But if you consider the LIR code in isolation it looks like you can hoist the 'ld1 = ld cx[0]' because there are no stores that can alias with cx[0] elsewhere in the loop. But really cx[0] can be set by some other code that the LIR/Nanojit doesn't know about, so hoisting it is unsafe. To handle that properly would require an annotation to say that a load is volatile. I wonder how many other such loads there are.
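One possible shape for that annotation (a hypothetical sketch, not the actual nanojit alias-set API) is to reserve a "volatile" region bit and treat it as if some unseen code always stores to it, so Pass 2 never considers such loads invariant:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical region bit for loads like cx->operationCallback that can be
// written by code outside the trace.
const uint32_t ACC_VOLATILE = 1u << 31;

// A load is hoistable only if its regions intersect neither the loop's
// aggregate store set nor the always-dirty volatile region.
bool loadIsHoistable(uint32_t loadAliasSet, uint32_t storeSet) {
    return (loadAliasSet & (storeSet | ACC_VOLATILE)) == 0;
}
```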
Another complication is the fact that LIR pretends to be SSA but really isn't because it has jumps but no phi nodes. If there are any jumps other than the back-to-the-top-of-the-loop jump things get more complicated.
Also, hoisting guards is tricky. We have to make sure the VM state is consistent before taking a guard; this mostly consists of doing stack stores. I think we can't hoist any guards that are preceded by such stores. But I think there's also no clear way to identify which stores are the make-the-VM-state-consistent stores.
Whiteboard: PACMAN
Target Milestone: --- → Future
Attached patch extremely rough starting patch (obsolete) — Splinter Review
This is an extremely rough patch. It works on this code:

  function f() {
      var sum = 0.5;
      var a = [1, 1];
      for (var i = 0; i < 1000; i++) {
          sum += a[1];
      }
      return sum;
  }
  print(f());

But I haven't really tried it on anything beyond that. It successfully hoists all the constants (which doesn't make much difference as they are mostly folded into other operations, on x86 at least) and a couple of loads from InterpState. That's not much but it shows this can work.

There's potential for hoisting more but it'll require alias analysis to do a much better job with the stack, ie. be able to track every stack slot separately. That way any local variables that are read but not written can be hoisted (eg. 'a' above), which is crucial for this to be effective.

One annoying thing about the patch: when copying/rearranging the instructions into the new buffer, because of the pointer operands I have to maintain a table mapping the old LIns pointers to the new ones, and change operands as they're copied. It's doable, though. Another annoying thing is that you end up with lots of LIR_live instructions at the end, which fills up the AR. I have no idea about the cost of this; the code is extremely rough and I haven't measured it at all.
Attached patch WIP patch, v1 (obsolete) — Splinter Review
Still very rough, but much better than the previous one. I realised that I can do alias analysis on stack accesses accurately quite easily -- stack loads/stores are already marked with ACC_STACK, and then you can just look at the 'disp' value and treat all disp values as different regions. So eg. sp[0] is a different region to sp[8], etc. This allows many more values to be found as loop-invariant.

I've been using this code as my test case:

  function f() {
      var sum = 0.5;
      var a = [0, 1, 2, 3, 4, 5, 6, 7];
      for (var i = 0; i < 100000000; i++) {
          sum += a[i & 0x7];
      }
      return sum;
  }
  print(f());

It's successfully hoisting everything except the guards.

This increases register pressure a lot. In particular, handleLoopCarriedExprs() was trying to register-allocate every loop-live value. This worked fine when there were only one or two (eg. 'state'), but was a disaster when there were a dozen or more. So I turned that off and things got a lot better.

Despite all the hoisting, I'm only getting a 5% speedup on that loop so far. This is because many of the hoisted values have to be spilled before the loop starts and thus reloaded within the loop at their use point. And so far the hoisted values aren't any more expensive to compute than a store. Hoisting guards will be crucial, I'll do that next.
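The per-slot stack aliasing idea could be sketched like this (illustrative names, not the real alias-analysis interface; the comment above just compares disp values, while this slightly more general version checks byte-range overlap so differently-sized accesses are handled too):

```cpp
#include <cassert>
#include <cstdint>

// A toy ACC_STACK access: byte offset from sp plus access width.
struct StackAccess {
    int32_t disp;   // byte offset from sp
    int32_t size;   // access width in bytes
};

// Two stack accesses alias only if their byte ranges overlap; e.g. sp[0]
// and sp[8] (8-byte accesses) are disjoint regions and never alias.
bool stackAccessesAlias(const StackAccess& a, const StackAccess& b) {
    return a.disp < b.disp + b.size && b.disp < a.disp + a.size;
}
```

With this, a loop that only loads sp[0] but stores sp[8] can still treat the sp[0] load as invariant, which is what lets locals like 'a' above be hoisted.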
Assignee: nobody → nnethercote
Attachment #456394 - Attachment is obsolete: true
Status: NEW → ASSIGNED
I've almost got guard-hoisting working, but there's a big snag. If I hoist a guard I need to generate a new GuardRecord, which involves calling snapshot(). Unfortunately snapshot() depends on and updates various bits of TraceRecorder state (regs->pc, cx->regs, cx->fp) which I think are only correct mid-trace. Whereas I need to generate these snapshots once the recording of the fragment is finished, ie. I'm rewriting the final LIR. Any suggestions? I'm not sure how to proceed further. I'm going to need help from someone who understands GuardRecords better than I do (which isn't difficult!) This is currently a showstopper for a very promising optimisation...
(In reply to comment #10)
> I've almost got guard-hoisting working, but there's a big snag. If I hoist a
> guard I need to generate a new GuardRecord, which involves calling snapshot().
> Unfortunately snapshot() depends on and updates various bits of TraceRecorder
> state (regs->pc, cx->regs, cx->fp) which I think are only correct mid-trace.

Only correct mid-recording, because the interpreter is calling the recorder for each (non-fused) opcode.

> Whereas I need to generate these snapshots once the recording of the fragment
> is finished, ie. I'm rewriting the final LIR.

If you always hoist to a known point before the loop, we could try to snapshot there and save the guard record in the trace recorder, for you to steal away. IOW, is the hoisting point well-defined and easy to identify from the bytecode at recording time?

/be
(I'm reclassifying this from NJ to TM, because the current code is awfully TM-specific.)

(In reply to comment #11)
> If you always hoist to a known point before the loop, we could try to snapshot
> there and save the guard record in the trace recorder, for you to steal away.

That is the best hack, er, idea I've come up with as well :)

> IOW, is the hoisting point well-defined and easy to identify from the bytecode
> at recording time?

I think so. Here's some TMFLAGS=recorder output:

  start
  ebx = parami 0 ebx
  esi = parami 1 esi
  edi = parami 2 edi
  state = parami 0 ecx
  ** About to try emitting guard code for SideExit=0x86d804 exitType=TIMEOUT
  label1:
  sp = ldi.o state[8]
  rp = ldi.o state[24]
  cx = ldi.o state[0]
  eos = ldi.o state[12]
  eor = ldi.o state[28]
  ldi1 = ldi.csro cx[0]
  immi1 = immi 0
  eqi1 = eqi ldi1, immi1/*0*/
  xf1: xf eqi1 -> pc=0x40d566 imacpc=0x0 sp+0 rp+0 (GuardID=001)

  00022:  27  trace
  00023:  28  bindname "bitwiseAndValue"
  ...

I'm hoisting code to the "**" point, just before the loop header label. I find the interleaving of bytecodes and LIR given by TMFLAGS=recorder very confusing, but I think that point corresponds to the TRACE bytecode. If I do a speculative snapshot just before emitting the label I think/hope it'll work.

This is assuming that I don't hoist any stores. AFAICT stores are the important part when it comes to connecting bytecode to LIR. For example, for an ADD bytecode you'll have something like this LIR:

  x = ldd.s sp[0]
  y = ldd.s sp[8]
  z = addd x, y
  sp[0] = std.s z

It's the final store that commits the result, so you can hoist the two loads and the add (assuming they're loop invariant... sp[0] clearly isn't, but that's beside the point) without changing the meaning of anything. What this means is that you can end up with lots of hoisted loads and arithmetic ops in the loop pre-header, which correspond to lots of bytecodes within the loop, but because the stores haven't moved the entire pre-header still corresponds to the TRACE bytecode.
If I start hoisting stores (I'm not sure if that's a good idea/worthwhile yet) then I'll need to make sure that any hoisted guards precede any hoisted stores in the pre-header, else the speculatively snapshotted GuardRecord will be wrong. That's a lot of hand-waving, hopefully it's correct. I'll try it on Monday.
Summary: NJ: loop invariant code hoisting → TM: loop invariant code motion (LICM)
Component: Nanojit → JavaScript Engine
QA Contact: nanojit → general
Target Milestone: Future → ---
(In reply to comment #12)
> (I'm reclassifying this from NJ to TM, because the current code is awfully
> TM-specific.)

Bummer -- I saw the bug description and thought that TR would get this for free :-) [Still, if this pans out I'm guessing we'll try adapting it for TR...]
(In reply to comment #13)
> Bummer -- I saw the bug description and thought that TR would get this for free
> :-) [Still, if this pans out I'm guessing we'll try adapting it for TR...]

I'd be happy if it worked for TR as well, but I have no idea what TR code tends to look like, and thus how applicable this is.
TR generated code is generally a plain old linearized control flow graph in the same order that our ABC bytecodes were generated. I am guessing, without looking, that the TM trace recorder is identifying loops and therefore where expressions can be hoisted to. TR would have to do its own analysis, but in an ideal world, once that was done, we could reuse some NJ infrastructure to do the hoisting. (Freely admit the world is not always ideal.)

> One annoying thing about the patch is When copying/rearranging the instructions
> into the new buffer, because of the pointer operands I have to maintain a table
> mapping the old LIns pointers to the new ones, and change operands as they're
> copied. It's doable, though.

I've run into this during experiments too. Are you using a map indexed by LIns* or something else? I was thinking that if we think of a LIR->LIR transformer as a LIR backend, then it could be sensible to add a new entry to the union containing SharedFields, for an ID field. This could be used to make an array that maps old LIns to new LIns during a full rewrite.

> Another annoying thing is that you end up with lots of LIR_live instructions at
> the end, which fills up the AR.

Yeah, hoisting increases register/stack pressure, fact of life.
(In reply to comment #15)
> TR generated code is generally a plain old linearized control flow graph in
> the same order that our ABC bytecodes were generated. I am guessing without
> looking, that the TM trace recorder is identifying loops and therefore where
> expressions can be hoisted to. TR would have to do its own analysis, but in
> ideal world once that was done we could reuse some NJ infrastructure to do the
> hoisting. (freely admit the world is not always ideal).

Yes, it might be possible.

> I've run into this during experiments too. Are you using a map indexed by
> LIns* or something else? I was thinking that If we think of a LIR->LIR
> transformer as a LIR backend, then it could be sensible to add a new entry to
> the union containing SharedFields, for an ID field. this could be used to make
> an array that maps old LIns to new LIns during a full rewrite.

I'm just using HashMap<LIns*,LIns*>, see "OldNewTable" in the patch. As for the array, we can generate large loops in TM; crypto-md5 in SunSpider has one with more than 10,000 instructions in it, which might make an array map problematic. Besides, I need to demonstrate good speed-ups on the generated code before worrying about reducing compile-time...
> > I'm just using HashMap<LIns*,LIns*>, see "OldNewTable" in the patch.

This works fine in most cases because def-points always precede use-points for data values. But that isn't true for control flow -- for forward jumps a label is effectively used before it's def'd. So I'll need an extra table that lets me go back and patch jumps when their target labels are copied. Sigh, what a hassle.
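That back-patching could look something like this sketch (toy types and names, not the real LirWriter code): jumps whose target label hasn't been copied yet go on a pending list keyed by the old label, and get patched when the label is finally copied:

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

struct Jump { int target = -1; };   // -1 = not yet resolved

struct Rewriter {
    std::unordered_map<int, int> oldToNew;               // labels already copied
    std::unordered_map<int, std::vector<Jump*>> pending; // jumps awaiting a label

    void copyJump(Jump* j, int oldLabel) {
        auto it = oldToNew.find(oldLabel);
        if (it != oldToNew.end())
            j->target = it->second;          // backward jump: label known
        else
            pending[oldLabel].push_back(j);  // forward jump: patch later
    }

    void copyLabel(int oldLabel, int newLabel) {
        oldToNew[oldLabel] = newLabel;
        for (Jump* j : pending[oldLabel])
            j->target = newLabel;            // back-patch all waiting jumps
        pending.erase(oldLabel);
    }
};
```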
Depends on: 552812
I found a way to avoid the hashtable lookup for converting operand pointers from old-to-new -- when an instruction is copied I overwrite the old copy with a pointer to the new copy. This makes subsequent old-to-new lookups O(1). It's a bit of a hack, but it reduces the overhead for crypto-md5 from 1.30x to 1.10x so I'll take it. Still lots more to do, but alias analysis is going to have to improve further so more instructions can be marked as loop-invariant. Bug 552812 is the first step for that and I just posted patches for it.
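The forwarding-address trick could be sketched like this (toy types; the real code presumably reuses space inside the old LIns rather than a separate field):

```cpp
#include <cassert>

// After an instruction is copied, the old copy is overwritten with a
// pointer to the new one, so later operand lookups are a single load
// instead of a hash-table probe.
struct Node {
    int value;
    Node* forward = nullptr;   // set once the node has been copied
};

Node* copyNode(Node* old, Node* storage) {
    *storage = Node{old->value, nullptr};
    old->forward = storage;    // leave a forwarding address behind
    return storage;
}

Node* lookupNew(Node* old) {
    return old->forward;       // O(1), no hash table
}
```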
Attachment #456535 - Attachment is obsolete: true
Ah, the old leave a forwarding address trick. I keep telling you hacks are good things :-P. Great work here and in bug 552812. /be
Blocks: 579285
Depends on: 584279
New version. I found that register pressure was being increased a lot -- eg. it could hoist a dozen or more values that are used in the loop. So I made the following changes:

- Only hoist invariant guards (and everything they depend on).

- If a hoisted value is used in both the loop pre-header and the loop body, I copy it twice, once in the pre-header and once in the body. The idea is to not add any additional register pressure across the loop boundary.

Perf is something of a wash for SS -- some get better, some get worse (mostly due to compile-time, esp. crypto-md5 which is still 1.10x slower).
---------------------------------------------------------------
|       millions of instructions executed                     |
|       total              |  on-trace (may overestimate)     |
---------------------------------------------------------------
|  111.182  111.939 (0.993x) |   20.798   20.817 (0.999x) | 3d-cube
|   56.986   57.038 (0.999x) |   22.313   22.314 (------) | 3d-morph
|  147.869  148.886 (0.993x) |   21.982   22.887 (0.960x) | 3d-raytrace
|   77.551   77.482 (1.001x) |   17.401   17.400 (------) | access-binary-trees
|  183.724  183.858 (0.999x) |  103.707  103.709 (------) | access-fannkuch
|   41.567   42.080 (0.988x) |   17.553   17.759 (0.988x) | access-nbody
|   60.019   60.037 (------) |   27.013   27.013 (------) | access-nsieve
|   15.687   14.833 (1.058x) |    2.862    1.969 (1.453x) | bitops-3bit-bits-in-byte
|   43.306   43.316 (------) |   30.280   30.281 (------) | bitops-bits-in-byte
|   22.773   22.775 (------) |   10.219   10.219 (------) | bitops-bitwise-and
|   71.623   71.644 (------) |   39.453   39.453 (------) | bitops-nsieve-bits
|   42.397   42.363 (1.001x) |   26.655   26.654 (------) | controlflow-recursive
|   47.105   52.594 (0.896x) |    5.541    5.657 (0.979x) | crypto-md5
|   31.332   31.460 (0.996x) |    7.030    7.032 (------) | crypto-sha1
|  127.749  127.150 (1.005x) |   10.457    9.846 (1.062x) | date-format-tofte
|   99.146   99.183 (------) |   11.118   11.116 (------) | date-format-xparb
|   52.625   52.749 (0.998x) |   27.614   27.715 (0.996x) | math-cordic
|   38.424   38.357 (1.002x) |    6.252    6.065 (1.031x) | math-partial-sums
|   28.581   28.517 (1.002x) |   10.286    9.944 (1.034x) | math-spectral-norm
|  100.938  100.952 (------) |   75.045   75.045 (------) | regexp-dna
|   44.074   43.726 (1.008x) |    8.098    7.626 (1.062x) | string-base64
|  123.879  123.184 (1.006x) |   24.544   24.545 (------) | string-fasta
|  272.585  274.830 (0.992x) |    6.839    6.775 (1.009x) | string-tagcloud
|  251.380  251.376 (------) |    5.637    5.636 (------) | string-unpack-code
|   64.500   64.405 (1.001x) |    7.401    7.175 (1.031x) | string-validate-input
-------
| 2157.013 2164.746 (0.996x) |  546.111  544.666 (1.003x) | all
(In reply to comment #20)
> Perf is something of a wash for SS -- some get better, some get worse (mostly
> due to compile-time, esp. crypto-md5 which is still 1.10x slower).

That's a bit disappointing for something which probably took a lot of futzing around to make work, I'd guess.

I was wondering if d{vander,mandelin}'s work to tune the JM/tracer integration (bug 580468, excellent stuff btw) might tilt the field in your favour. As I read it, that might result in fewer absolutely huge loops being traced, which would help offset the compile time losses you're getting here, and might help reduce the extra register pressure.
(In reply to comment #22)
> I was wondering if d{vander,mandelin}'s work to tune the JM/tracer
> integration (bug 580468, excellent stuff btw) might tilt the field in
> your favour. As I read it, that might result in fewer absolutely huge
> loops being traced, which would help offset the compile time losses
> you're getting here, and might help reduce the extra register
> pressure.

I'm currently avoiding the register pressure issue fairly well, but yes, this'll be worth looking at again once all the current big changes have landed. One issue is that we don't spend that much time in trace code -- it's about 25% of total SunSpider time, for example, and in V8 the proportion is even less.

I could also make it stop if the number of instructions hoisted isn't high enough -- eg. for the main loop of crypto-md5 it hoists 4 instructions out of 19,000+ which isn't exactly worth it. But then, even working out those numbers requires a couple of passes over the code.
This version is much closer to something landable. Notable things:

- When hoisting is done in a fragment, instead of rewriting the entire fragment, which is difficult and slow, we now insert the hoisted instructions by using LIR_skip instructions to trampoline to an out-of-line sequence and then back. This also means we don't need to do any forward passes, so there's no need to build the vector of LIns pointers. crypto-md5 now only has a 9% instruction count increase when compiled with -j, and it's a pathologically bad case for LICM, having one loop with 36,000 LIR instructions of which only two are hoisted, combined with a very short running time that means compile-time increases really hurt. Other than crypto-md5, the worst instruction count increase is 1% for all of Sunspider when running with -j (running with -j -m -p the increases are much less because less code is traced). (Full numbers are below.)

- Most of the new code is in two new files, tracejit/LICM.{cpp,h}.

- Kraken results aren't bad, though not as good as I'd like given how much effort went into the patch. With -j the timing improvement is about 1.03x, mostly due to big wins in ai-astar and audio-dft. With -j -m -p the results are similar except that ai-astar currently doesn't trace (bug 606890 is open to fix this) so the improvement is less.

- The one big question mark over this patch is the fact that it reorders guards. It's not clear to me that this is safe in general (although jit-tests are all passing). For example, if a guard that when taken causes an exception to be thrown is moved earlier than another guard, the exception behaviour of the program may change. This question needs to be resolved before this patch can be considered for landing.
Sunspider with -j (this is mostly a measure of how much extra compile-time LICM incurs):

---------------------------------------------------------------
|       millions of instructions executed                     |
|       total              |  on-trace (may overestimate)     |
---------------------------------------------------------------
|   76.997   77.310 (0.996x) |   22.622   22.628 (------) | 3d-cube
|   38.279   38.367 (0.998x) |   23.129   23.131 (------) | 3d-morph
|  167.503  168.298 (0.995x) |   15.688   16.068 (0.976x) | 3d-raytrace
|  130.658  130.662 (------) |   537758   537875 (------) | access-binary-
|   83.818   83.768 (1.001x) |   74.571   74.317 (1.003x) | access-fannkuc
|   27.568   27.585 (0.999x) |   15.193   14.933 (1.017x) | access-nbody
|   30.483   30.171 (1.010x) |   23.624   23.277 (1.015x) | access-nsieve
|    7.005    6.136 (1.142x) |    2.858    1.965 (1.454x) | bitops-3bit-bi
|   34.364   34.380 (------) |   30.097   30.097 (------) | bitops-bits-in
|   14.069   14.074 (------) |   10.215   10.215 (------) | bitops-bitwise
|   36.272   34.723 (1.045x) |   31.217   29.621 (1.054x) | bitops-nsieve-
|  149.555  149.552 (------) |        0        0 (------) | controlflow-re
|   32.863   35.910 (0.915x) |    4.284    4.368 (0.981x) | crypto-md5
|   19.040   19.158 (0.994x) |    6.269    6.272 (------) | crypto-sha1
|   99.440   97.823 (1.017x) |   11.624   10.159 (1.144x) | date-format-to
|   69.238   69.295 (0.999x) |    9.548    9.547 (------) | date-format-xp
|   42.243   42.843 (0.986x) |   28.678   29.254 (0.980x) | math-cordic
|   22.397   22.315 (1.004x) |    6.158    6.002 (1.026x) | math-partial-s
|   21.533   21.468 (1.003x) |   13.190   12.938 (1.019x) | math-spectral-
|   48.510   48.515 (------) |   34.578   34.578 (------) | regexp-dna
|   27.459   27.486 (0.999x) |    9.026    9.026 (------) | string-base64
|   84.275   84.329 (0.999x) |   23.421   23.422 (------) | string-fasta
|  141.033  141.056 (------) |   12.243   12.243 (------) | string-tagclou
|  123.671  123.519 (1.001x) |    7.663    7.664 (------) | string-unpack-
|   42.286   42.316 (0.999x) |    8.212    8.213 (------) | string-validat
-------
| 1570.572 1571.069 (------) |  424.678  420.503 (1.010x) | all

Sunspider with -j -m -p:

---------------------------------------------------------------
|       millions of instructions executed                     |
|       total              |  on-trace (may overestimate)     |
---------------------------------------------------------------
|   68.494   68.649 (0.998x) |   43.768   43.771 (------) | 3d-cube
|   39.728   39.788 (0.998x) |   24.716   24.718 (------) | 3d-morph
|   65.557   65.555 (------) |   37.044   37.044 (------) | 3d-raytrace
|   24.407   24.403 (------) |   11.010   11.010 (------) | access-binary-
|   88.397   88.394 (------) |   83.306   83.306 (------) | access-fannkuc
|   28.390   28.258 (1.005x) |   16.264   16.008 (1.016x) | access-nbody
|   35.169   35.166 (------) |   28.735   28.735 (------) | access-nsieve
|    7.524    6.656 (1.130x) |    3.255    2.363 (1.378x) | bitops-3bit-bi
|   35.393   35.391 (------) |   30.400   30.400 (------) | bitops-bits-in
|   15.917   15.923 (------) |   12.019   12.019 (------) | bitops-bitwise
|   38.143   36.582 (1.043x) |   32.966   31.370 (1.051x) | bitops-nsieve-
|   17.137   17.134 (------) |   12.875   12.875 (------) | controlflow-re
|   24.011   24.015 (------) |   11.836   11.836 (------) | crypto-md5
|   19.922   20.035 (0.994x) |    8.528    8.531 (------) | crypto-sha1
|   67.306   67.310 (------) |   21.026   21.026 (------) | date-format-to
|   63.581   63.578 (------) |    9.945    9.945 (------) | date-format-xp
|   43.648   44.242 (0.987x) |   29.513   30.086 (0.981x) | math-cordic
|   22.939   22.852 (1.004x) |    6.310    6.154 (1.025x) | math-partial-s
|   31.372   31.390 (0.999x) |   26.189   26.189 (------) | math-spectral-
|   48.370   48.362 (------) |   34.580   34.580 (------) | regexp-dna
|   28.804   28.830 (0.999x) |    9.277    9.277 (------) | string-base64
|   63.958   63.966 (------) |   32.064   32.064 (------) | string-fasta
|  103.304  103.182 (1.001x) |   17.211   17.211 (------) | string-tagclou
|   97.861   97.671 (1.002x) |   12.829   12.829 (------) | string-unpack-
|   43.169   43.203 (0.999x) |    8.573    8.573 (------) | string-validat
-------
| 1122.513 1120.548 (1.002x) |  564.251  561.933 (1.004x) | all

Kraken with -j (this is a best-case measurement for LICM):
---------------------------------------------------------------
|       millions of instructions executed                     |
|       total              |  on-trace (may overestimate)     |
---------------------------------------------------------------
|  3217.152  2646.218 (1.216x) |  3001.706  2430.546 (1.235x) | ai-astar
|  1860.749  1870.246 (0.995x) |  1536.491  1545.498 (0.994x) | audio-beat-det
|  1257.911   985.993 (1.276x) |  1098.428   826.386 (1.329x) | audio-dft
|  1685.241  1664.531 (1.012x) |  1225.797  1204.295 (1.018x) | audio-fft
|  2392.974  2367.852 (1.011x) |  1588.545  1563.964 (1.016x) | audio-oscillat
|  5871.686  5897.334 (0.996x) |  4693.845  4719.372 (0.995x) | imaging-gaussi
|  2327.179  2326.063 (------) |   739.492   737.894 (1.002x) | imaging-darkro
|  4927.677  4969.903 (0.992x) |  3752.846  3795.033 (0.989x) | imaging-desatu
|   659.733   660.350 (0.999x) |    10.214    10.214 (------) | json-parse-fin
|   455.590   455.667 (------) |     5.148     5.148 (------) | json-stringify
|  1547.472  1541.430 (1.004x) |   599.730   594.972 (1.008x) | stanford-crypt
|   867.638   857.053 (1.012x) |   319.781   313.732 (1.019x) | stanford-crypt
|  1133.817  1131.367 (1.002x) |   609.569   606.929 (1.004x) | stanford-crypt
|   555.423   552.587 (1.005x) |   199.499   196.206 (1.017x) | stanford-crypt
-------
| 28760.248 27926.599 (1.030x) | 19381.099 18550.196 (1.045x) | all

Kraken with -j -m -p:

---------------------------------------------------------------
|       millions of instructions executed                     |
|       total              |  on-trace (may overestimate)     |
---------------------------------------------------------------
|  4198.443  4198.522 (------) |  3907.436  3907.436 (------) | ai-astar
|  1979.742  1960.657 (1.010x) |  1353.759  1334.330 (1.015x) | audio-beat-det
|  1302.076  1025.964 (1.269x) |  1135.932   859.698 (1.321x) | audio-dft
|  1709.187  1688.923 (1.012x) |  1248.676  1227.172 (1.018x) | audio-fft
|  2585.718  2585.346 (------) |  1778.296  1778.290 (------) | audio-oscillat
|  6973.203  6999.127 (0.996x) |  4784.010  4809.827 (0.995x) | imaging-gaussi
|  3337.549  3334.614 (1.001x) |   748.425   745.135 (1.004x) | imaging-darkro
|  6017.814  6060.041 (0.993x) |  3836.515  3878.702 (0.989x) | imaging-desatu
|   659.792   660.409 (0.999x) |    10.221    10.221 (------) | json-parse-fin
|   496.092   496.164 (------) |     5.932     5.932 (------) | json-stringify
|  1288.445  1272.218 (1.013x) |   746.812   746.486 (------) | stanford-crypt
|   704.767   699.804 (1.007x) |   361.605   356.196 (1.015x) | stanford-crypt
|  1153.845  1144.825 (1.008x) |   585.183   576.007 (1.016x) | stanford-crypt
|   528.984   525.988 (1.006x) |   199.318   196.024 (1.017x) | stanford-crypt
-------
| 32935.664 32652.608 (1.009x) | 20702.127 20431.463 (1.013x) | all
Attachment #459330 - Attachment is obsolete: true
Attachment #463072 - Attachment is obsolete: true
Comment on attachment 491065 [details] [diff] [review]
patch, v4 (against TM 57130:bc000c1509ac)

Sayre has expressed interest in getting this into the tree, even if it's turned off for Firefox 4.0. So I'm asking for reviews.

Ed, I'm asking you for feedback about just the LIR.h/LIR.cpp changes. Most notably, the whole idea of using LIR_skip to trampoline to out-of-line chunks of code. If we start allowing this it will preclude us ever changing our LIR operands from pointers to offsets. That will preclude bug 578264, for one. I'm ok with this -- I think the out-of-line trampolines could end up being used quite extensively, as they're a low-cost way to rewrite LIR that's already been written into a buffer -- but I'd like your feedback.
Attachment #491065 - Flags: review?(gal)
Attachment #491065 - Flags: feedback?(edwsmith)
Attachment #491065 - Flags: review?(jseward)
Comment on attachment 491065 [details] [diff] [review]
patch, v4 (against TM 57130:bc000c1509ac)

Commentary, musing, not for/against the patch:

I've been tempted several times to stuff extra info into SharedFields, so I like the getExtra/setExtra() tweak. I always lament the fact that on x64, a whole machine word is there, sitting idle (maybe we could use it, but pragmatically, we can't).

Tamarin's SeqReader class (CodegenLIR.cpp) is used to stitch together two LIR buffers that we generate simultaneously. One is the prologue. If we could use LIR_skip to connect them, we can drop the reader and save one virtual call per emitted LIR instruction (real costs that add up).

We also (still) use LIR_skip to erase instructions; much as I'd like to clean this up, the cleanup is to incorporate dead-store elimination in-line with the assembly pass, which ends up looking remarkably similar to StackFilter. The fact that it hasn't been worth the work to do it yet is a data point when considering the fate of LIR_skip. It seems genuinely useful, is all.

(I'm really off topic now, please forgive me.) Heck, if getExtra/setExtra wasn't used for LICM, then could it be used to hold a sequence number, which then could be used to optimize CSE lua-style?

A sequential id in each LIns could be useful several ways... lua-style CSE, and maybe also for making fast-to-index maps using LIns->id as a key, instead of a hashtable using LIns* as a key. It would burn one word on 32-bit machines, and fit inside wasted padding space on 64-bit machines. Replacing the whole SharedFields struct with a 32-bit id (or 8-bit opcode + 24-bit id) would a) give us a sequence number, and b) let stages build arrays indexed by LIns->id, for their own data. (24 bits is enough: "16M instructions is enough for anybody".) Things like inReg/inAr could pack into bitsets if desired. arIndex could be bigger, if desired, and so on. The extra load per access (arindex[lins->id] vs lins->arindex) could be a deal killer, but maybe not.
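The packed opcode+id layout being mused about might look like this (a sketch of the idea only, not actual nanojit code), with a pass keeping its per-instruction data in a plain vector indexed by id instead of a hashtable keyed by LIns*:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical replacement for SharedFields: 8-bit opcode + 24-bit
// sequence id packed into one 32-bit word.
struct PackedIns {
    uint32_t word;
    PackedIns(uint8_t op, uint32_t id)
        : word((uint32_t(op) << 24) | (id & 0xFFFFFF)) {}
    uint8_t opcode() const { return word >> 24; }
    uint32_t id() const { return word & 0xFFFFFF; }
};
```

A stage could then allocate `std::vector<T> sideData(numInstructions)` once and use `sideData[ins.id()]` for O(1) lookups, at the cost of the extra indirection noted above.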
I've also experimented somewhat with a traditional CFG that holds lists of LIns*, then a Reader that feeds them to Assembler in the required order. Nothing worth landing, but it makes me think that the super-tight LuaJIT-style IR (integer offsets, contiguous instructions) is not in our future. Emitting code with some extra LIR_skips here and there would allow reordering blocks or inserting code somewhat arbitrarily, without extra buffer copying or chained LirReaders.

Nitty things about the patch:

Is LIR_comment supposed to be in there, or is that an accident? (Okay with me either way; it looks useful.)

I take it we should update Tamarin to use overwriteWithSkip().

I skimmed LICM.cpp/h but didn't go looking for bugs. A comment on arIndex's definition noting that it's invalid during LIR generation, and only holds a real arIndex when inAr==1, would be a good idea. Should get/setExtra assert when inAr==1?
Attachment #491065 - Flags: feedback?(edwsmith) → feedback+
(In reply to comment #26)
> Heck, if getExtra/setExtra wasn't used for LiCM, then could it be used to
> hold a sequence number, which then could be used to optimize CSE lua-style?
>
> A sequential id in each LIns could be useful several ways... lua-style cse, and
> maybe also for making fast-to-index maps using LIns->id as a key, instead of a
> hashtable using LIns* as a key. It would burn one word on 32bit machines, and
> fit inside wasted padding space on 64bit machines.

The 24 non-opcode bits in SharedFields are only used during assembly, so we could use them for a sequential id in any of the other passes. The fast-to-index maps are tricky, though, because you don't know how big to make them in the writer pipeline.

(In reply to comment #26)
> Is LIR_comment supposed to be in there or is that an accident? (okay with me
> either way, it looks useful).

Not sure what you mean, can you clarify?

> A comment on arIndex's definition that its invalid during LIR generation, and
> only holds a real arIndex when inAr==1, would be a good idea. Should
> get/setExtra assert when inAr==1?

Good idea.
> > Is LIR_comment supposed to be in there or is that an accident? (okay with me > > either way, it looks useful). > > Not sure what you mean, can you clarify? When I first glanced over the patch, it looked as though it included both the LICM code and the new LIR_comment opcode, which I remembered from a different patch, and wondered if they got lumped together by accident. Looking again, that's not the case -- you just added LIns.getComment(), which is no big deal.
(In reply to comment #24)
> - The one big question mark over this patch is the fact that it reorders
>   guards. It's not clear to me that this is safe in general (although
>   jit-tests are all passing). For example, if a guard that when taken
>   causes an exception to be thrown is moved earlier than another guard, the
>   exception behaviour of the program may change.
>
> This question needs to be resolved before this patch can be considered for
> landing.

http://article.gmane.org/gmane.comp.lang.lua.general/58908 says this:

  Code hoisting via unrolling and copy-substitution (LOOP): Traditional
  loop-invariant code motion (LICM) is mostly useless for the IR resulting
  from dynamic languages. The IR has many guards and most subsequent
  instructions are control-dependent on them. The first non-hoistable guard
  would effectively prevent hoisting of all subsequent instructions.

It then describes what LuaJIT2 does instead:

  The LOOP pass does synthetic unrolling of the recorded IR, combining
  copy-substitution with redundancy elimination to achieve code hoisting.
  The unrolled and copy-substituted instructions are simply fed back into
  the compiler pipeline, which allows reuse of all optimizations for
  redundancy elimination. Loop recurrences are detected on-the-fly and a
  minimized set of PHIs is generated.

which I don't understand at all.
When I read that I thought he was talking about loop peeling: make one copy of the loop body and put it first, so it dominates the loop body, and redundant expressions in the body get deleted.

Back on TT, one way I thought about doing it was to keep the trace recorder recording for one extra loop iteration; the code fragment is shaped like a little 'b' with a longer stem. I never got to it, and I'd imagine it would really hurt compile time for long loop bodies (crypto?).
(In reply to comment #30)
> When I read that I thought he was talking about loop peeling; make one copy of
> the loop body, and put it first, so it dominates the loop body, and redundant
> expressions in the body get deleted.

Ah, that makes sense. Why couldn't he have just said that? :)

> Back on TT, one way I thought about doing it was to leave keep the trace
> recorder recording for one extra loop iteration. the code fragment is shaped
> like a little 'b' with a longer stem. I never got to it, and i'd imagine it
> would really hurt compile time for long loop bodies (crypto?)

Just about anything extra done by the compiler hurts compile-time of crypto-md5; it's such a pathological case. E.g. LICM causes a 9% instruction count increase.
Blocks: 622494
Attachment #491065 - Flags: review?(jseward)
Attachment #491065 - Flags: review?(gal)
With JM can we re-try this?
(In reply to comment #32)
> With JM can we re-try this?

Comment 24 already did so. It's maybe a 1% speedup for Kraken. Combine that with (a) the fact that I'm not confident the transformation is valid, and (b) the spectre of IonMonkey coming over the horizon and trampling TraceMonkey into the dust, and my motivation to work on this further is limited.
TM's days are numbered: WONTFIX.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → WONTFIX