Closed Bug 442379 Opened 12 years ago Closed 6 years ago
try inline-threading, at least with GCC if not MSVC
142.66 KB, patch
|Details | Diff | Splinter Review|
6.29 KB, text/plain
3.62 KB, text/plain
3.61 KB, text/plain
255.28 KB, patch
|Details | Diff | Splinter Review|
Work in progress patch attached. It is flummoxed by GCC's moving (tail splitting I think) the L_END_JSOP_ADD label "up", making the inline-threader copy too little code and leaving a relative jump diving off the end of malloc'ed memory. But trivial tests work, e.g. js> function f(x) x; js> f(42) 42 inline-threads JSOP_GETARG before jumping back to the interpreter's JSOP_RETURN opcode label. Dave, if you can figure out this end-labeling conundrum, we can eliminate intra-basic-block dispatch overhead entire, use x86 jumps (condition code jumps not yet done in the patch) to align virtual and physical PCs for ultimat branch prediction goodness! /be
Forgot a table header. The columns are op, length, delta to next higher-addressed opcode, delta minus length. /be
The other problem Dave noticed is relative-displacement call instructions. We may want to put those through memory for quick joy (I already did this in the patch for GOTO_ERROR). In general, coping with tail splitting, identifying the DO_NEXT_OP pattern, squashing unconditional jumps to common tails, and relocating any relative out-of-interpreter calls cries out for a disassembler -- which Andreas pointed Dave at. /be
The code is really rough. In particular, it has to be run in a debug config, and you have to install libdasm and update the Makefile to point at it. (We should be able to get by with a much simpler disassembler in the final version.) I made 2 improvements over the previous patch: 1. Keeping (most) PC increments. The previous version didn't include the PC increment code in the code block that gets memcpy'd. This caused instructions like JSOP_INT8 to get the wrong value, because the PC is out of date. For now, I just verify that the next 3 x86 instructions "look like" an interpreter PC increment. If so, I increase the length of the JS op block accordingly. There are about 43 ops that don't get extended. Not sure why yet. In general, we need a more reliable way to detect the end of jsop cases. With this improvement, "function (x) x - 2" can be inlined correctly. (When called with int args, at least.) 2. Patching (most) out-of-block jumps. CALL, JMP, and Jcc instructions use a relative address for the target. Thus, if the address of the instruction in the SM main cases is A, the length of the instruction is L, and the relative operand is R, then the target is (A+L+R). If the instruction is memcpy'd to address A', then the new target is (A'+L+R). If the target is part of the block that is memcpy'd, then this works fine: the new target is a copy of the original target code. If the target is outside the block, then the new target is a bogus address and the program crashes. To fix this, I scan all the code in the blocks and build a table of the addresses of relative operands with out-of-block targets. Then, after doing the memcpy for inlining, I patch all the relative operands. This works, but I'm not handling all the cases yet. Most of these out-of-block jumps use a 32-bit address, but a few use an 8-bit relative address. I haven't handled these yet, and it's probably going to be harder, because the adjusted relative address won't necessarily fit in 8 bits. The solution is probably to make sure that the targets are included in the block to be copied, so the address doesn't have to be adjusted. But I also want to understand why there are short jumps that seem to go outside the block, anyway. With this improvement, "function (x) x + 2" is inlined correctly (with int args, again).
Quick update: I've been working on a call-threaded version as something slightly easier to try. The first step was to convert all the bytecode cases to individual functions. Hopefully this will also make the code in the functions easier to identify for inline-threading and patching. This conversion is done, although I'm pretty sure there are a bunch of bugs in my converted code. To verify that the conversion mostly works, I whipped up a test dispatch loop that simply looks up the function in a table and calls the function: opFuncs[op](state); It's good enough to run a test of adding numbers from 1 to 1M. In an optimized build, baseline trunk SM takes 85ms on my machine, and the function-based version takes 130ms. Note that we expect it to be slower because: * The function-based version still does a lookup and indirect jump, just like trunk. * There is only one static indirect jump instruction in the function-based version, so there must be more mispredicts. * The function-based version also has to return. * Each function returns a value indicating whether to continue, exit, return, etc., and this has to be tested for each opcode. This replaces things like GOTO_ERROR in trunk. * The called functions have prolog/epilog code. E.g., my function F_JSOP_PUSH: .globl __Z11F_JSOP_PUSHR16InterpreterState __Z11F_JSOP_PUSHR16InterpreterState: # Prolog: save stack ptr & load arg into edx pushl %ebp movl %esp, %ebp movl 8(%ebp), %edx # Access state members: this looks ok movl 36(%edx), %eax movl 32(%edx), %ecx # Do stuff movl $22, (%eax) addl $4, %eax movl %eax, 36(%edx) leal 1(%ecx), %eax movl %eax, 32(%edx) movzbl 1(%ecx), %eax movl %eax, 44(%edx) # Epilog: set return value & return xorl %eax, %eax leave ret An extra 6 instructions in a 10-instruction op has to hurt. Anyway, many of these problems will go away in the actual call-threaded version, and others can be mitigated. So don't take the numbers above too seriously.
I got call-threading working with the function-per-opcode system. The numbers are good or bad depending on what you're testing. The total time on the same test as before was 110ms, or 100ms with -fomit-frame-pointer (which didn't speed up the previous test, btw). (Also, I'm going to use a revised time of 80ms for baseline, because that's what i saw today.) So it's .8x as fast as SpiderMonkey, which is bad, but call-threading by itself was 1.3x as fast as the similar non-call-threaded version, which is good. Also, this call-threading was a bit crude in the way it handles control-flow ops, and fixing that should give a bigger boost. Keep in mind that the non-call-threaded function-based version was only about .6x as fast as SM, which is a huge gap to win back. I'm going to redo call-threading with inline cases (as in SM), which will hopefully keep most of the 1.3x boost without getting the big loss. For informational purposes, in the call-threaded native code, most JS opcodes turn into this (gnu argument order asm): mov ebx, ecx // Put interpreter state into fastcall arg 1 call F_JSOP_INCVAR The control-flow crudeness is in the way I implemented GOTO: mov ebx, ecx call F_JSOP_GOTO jmp <target in call-threaded code> and IFNE/IFEQ: mov ebx, ecx call F_JSOP_IFNE test eax, eax jne <target in call-threaded code> It is better to test the boolean value just once. We should be able to do that, but it's just a bit harder because it generally requires inlining the IF op, which in turn requires more native code patching and other trickery, so I didn't do it just yet.
This does call threading using labeled cases instead of functions. For my standard loop test, the time is now about 84, so .95x as fast as SM. For higher numbers of loop iterations, perf is almost identical. Note that I disabled the pop-after-store and if-after-rel optimizations in the call-threaded version, so we pay a few percent there. As to why it's not a win yet, I don't really know. I do still use the crude control-flow implementation. Also, I had to add some prolog and epilog code to the cases for the call-threading version. I call a case simply with a call instruction. But that puts the return address on the stack. If the case calls a function, it will stomp all over that. (When compiling the cases in js_Interpret, GCC sets esp to point to the end of the activation record for js_Interpret, with space preallocated for all possible calls. So it just sticks arguments to any calls into [esp], [esp+4], etc. This is bad if I have a return address in [esp].) To fix it, I made cases look like this (this is JSOP_POP): // pop return address into %edx 0x00031d31 <js_Interpret+1073>: pop %edx // actual code for JSOP_POP 0x00031d32 <js_Interpret+1074>: incl -0x54(%ebp) 0x00031d35 <js_Interpret+1077>: movl $0x1,(%eax) 0x00031d3b <js_Interpret+1083>: add $0x4,%eax // store the return address in a variable 0x00031d3e <js_Interpret+1086>: mov %edx,0x75ca9(%ebx) 0x00031d44 <js_Interpret+1092>: mov %eax,-0x50(%ebp) // push return address back on the stack so we can return 0x00031d47 <js_Interpret+1095>: push %edx 0x00031d48 <js_Interpret+1096>: ret So, compared with the standard case, we have an extra push, pop, and store. The store is because the asm I use to store the return address puts it in a variable (which happens to be a global): asm("pop %0" : "=r"(js_ReturnAddr) : : ); I think for small opcodes like POP, we might want to just inline them, or re-enable the pop-after-set stuff. All other opcodes that don't call functions should just skip popping and pushing the return address. For other opcodes, we may want to structure them so that the fast path is exposed in a way that lets us store the return address only if we go to a slow path that makes calls. (Or we could even inline the fast path.) I also have a problem where it crashes in JSOP_GETVAR, because the generated code seems to think there will be a useful value, like regs.sp or something, in esi, and it tries to store to [esi]. But there isn't: JSOP_GOTO, and possibly others, overwrite esi to non-pointer values like "12". (This seems to happen in CHECK_BRANCH.) For my small test, I was able to fix this by writing asm at the start of JSOP_GETVAR that explicitly clobbers esi and putting esi in the clobber list. Then GCC loads something into to esi explicitly and it works. But I don't know the real answer yet.
Wow, a 0.95x speedup isn't really what we were looking for. The extra push pop shouldn't be all that expensive. Sure, we can still gain some by inlining some small ops but still. Mhm.
After trying various permutations of call- and inline-threading but getting small (5-10%) and inconsistent perf improvements, I decided to spend some time with the Core 2 perf counters. Here's what I found on my tiny loop test (M=10^6, G=10^9): * Trunk SM: Operations 23G instructions retired (INSTR_RETIRED) 30G uops dispatched (RS_UOPS_DISPATCHED) 24G uops retired (UOPS_RETIRED) 6G fused uops retired 15G clock cycles (CPU_CLK_UNHALTED.CORE) // If I understand what fused uops are and how to cycle-account them, this // means that about 30G "original uops" were retired, so there were almost // no "wasted" uops dispatched (i.e., uops dispatched because of branch // mispredict). // // IPC is about 1.5, which doesn't sound too bad, but is 1/2 or 1/3 of // Core 2 max, I think. Indirect Branches 700M indirect branches executed (BR_IND_EXEC) 1M indirect branch mispredicts (BR_IND_EXEC_MISSP) // Branch mispredict rate is about .2% on indirect branches, so it is // probably not a significant impact on perf. Stall Cycles ~ 6.5G total (RESOURCE_STALL) 70% reservation stations full (.RS_FULL) 25% load/store buffers full (.LD_ST) 3% branch mispredict flush (.BRANCH_MISS_CLEAR) // Again, not much from branch mispredicts. About 40% of all cycles appear // to be stalls, so clearly there is a big opportunity for improvement. // // If I understand RS_FULL, it means there is at least a long-latency op, // or maybe a long-latency chain, that blocks 64 dependent instructions // in the pipeline. // // I think LD_ST means too many loads and/or stores in flight at a given time. // My guess is that there are a lot of loads and stores to ®s.pc, regs.pc, // ®s.sp, and regs.sp. To optimize this, we could either try to keep more // of these in registers, or try to optimize away some ops entirely (like // regs.pc-related stuff), reducing both stalls and uops count. // It appears that branch mispredicts are not costing much more than 1% of // total cycles here, so optimizing them alone probably will not help. Blocked Loads (LOAD_BLOCK) ~ 1G total 500M by preceding store of unknown data (.STD) 300M by preceding store to unknown address (.STA) 200M by partially overlapping store (.OVERLAP_STORE) 30M by L1 data cache (.L1D) 5M blocked until retirement (.UNTIL_RETIRE) // The main items here seem to be loads that come soon after "inconvenient" // or "confusing" stores. I think each of these guys has about a 5-6 cycle // penalty, so if I understand how to correlate this with cycle counting // (which I doubt), these could explain most of the stalls. Again, I suspect // regs.pc and regs.sp. // // These results suggest there could be benefits from somehow altering // things so the store addresses are easier to see, and reducing unaligned // accesses. Cache Stats 15G L1 I-cache loads 26M L1 I-cache misses 11G L1 D-cache loads 366M L2 loads 5G L1 D-cache stores 10M L2 stores // These numbers seem good, with generally 99%+ hit rates. I guess there // could be some benefit from reducing L2 loads, but I don't know where // those are coming from. I took some similar numbers for bitops-3bit-bits-in-byte.js, a similar microkernel-type SunSpider benchmark. The results are similar, but with somewhat more branch mispredicts. Again, about 40% of cycles were stalls. About 8% of indirect branches were mispredicted, but still accounted for only about 2% of stall cycles. It does appear that maybe about 7% of dispatched uops are not retired, which amounts to 10G squashed uops. This is a *very* rough estimate. There are about 300M misdirected indirect branches, meaning 30 wrong uops executed per mispredict, which seems like about the right order of magnitude, anyway. I'll put higher-level remarks in a separate comment.
By the way, I've been using Shark for all this, which I think has certain deficiencies vs. VTune. In particular, I think Shark doesn't give control over the CMASK and INV bitfields, which means I can't count cycles in which no ops were dispatched, which is a key metric according to a set of Intel slides I've been referring to. Also, Shark is supposed to be able to record (ISA) PCs for events like RESOURCE_STALL.RS_FULL, but the results look weird to me. It seems like the PC is "approximate" in some sense. * Branch Mispredicts & Dispatch Costs Anyway, on the 3bit benchmark, it looks like there at most an 8% win available from eliminating all branch mispredicts, vs. the 20-40% wins reported in papers on context (or call) and inline threading. It would seem that Core 2 has an excellent indirect branch predictor, esp. compared to what experimenters were using a few years ago. I think an 8% reduction in time taken would be nice, but it could be difficult to achieve. I think call threading increases the number of micro-ops relative to direct threading, and it also puts significantly more pressure on the load/store units. Inline threading avoids those issues, so in theory I think it could get that 8%, *if* everything is inline-threadable. Bigger jsops, jsops that call functions, and jsops that overlaps other jsops are somewhat more difficult to inline, and might have to be call-threaded, which is less good. * Instruction Counts & Stalls Probably the most promising place for optimization is simply reducing the instruction counts. One way is to get rid of regs.pc operations, which should be doable with inline or call threading. For example, JSOP_GETVAR is about 17 x86 instructions. 4 are devoted to getting the slot number (could get this to 1 or 2 by specializing for the slot), and 1 is a jne over JSOP_CALLVAR code (both have the same entry point--could get this to 0 by specalizing for GETVAR), so we can potentially reduce the instruction count by 25%. This optimization could be applied in various forms with any dispatch technique. Both of these optimizations also reduce loads and stores, which is important. Based on the inconsistent timing results I've been seeing and these perf counters, I suspect that reducing register ops won't help--the reservation station stalls/load blocks seem to be on the critical path or a near-critical path. These optimizations should also reduce the number of byte reads and unaligned reads, which are currently used to read the slot number and other jsop immediate operands.
(In reply to comment #9) > For example, JSOP_GETVAR is about 17 x86 instructions. 4 are devoted to getting > the slot number (could get this to 1 or 2 by specializing for the slot), This is bug 364683. I had a patch around, will attach if I can find it. Feel free to steal this bug from my list. > and 1 > is a jne over JSOP_CALLVAR code (both have the same entry point--could get this > to 0 by specalizing for GETVAR), Please do file a bug on this and try patching it. For the small GET vs. CALL ops it's worth the code size increase. May even help with branch prediction, although perhaps that's good as can be. > Both of these optimizations also reduce loads and stores, which is important. Anent that, Andreas pointed to JamVM's three dispatch tables, which registerize the top and next-to-top items on the stack. Worth a shot? Another bug would be good, seems a separate patch. > These optimizations should also reduce the number of byte reads and unaligned > reads, which are currently used to read the slot number and other jsop > immediate operands. Yeah, the property cache won big by avoiding loading immediate atom (interned identifier) operands, using the pc as key. Direct threading might be worth a try, especially to minimize regs.pc updates. /be
We could download jamvm and give the different configurations a spin. It implements direct/indirect threading with and without top-of-stack caching and also inline threading. Its a different VM, different bytecode, different everything, but the relative speedups might still be interested, especially if we can track down a Core1 and AMD platform for relative comparison. If Core2 really revolutionized branch prediction, we might be able to measure the differences between processors using jamvm instead before going for a SM implementation. We got surprised once already :)
Good news. I got some better looking numbers with a more realistic benchmark, which is a version of bitops-3bit-bits-in-byte.js modified to be easier to run in my prototype. With call threading, time taken is down about 6% vs. trunk (although this is still trunk with pop/if fusions disabled). In trunk, there are 7.7M indirect jumps, with a 6.5% mispredict rate. In call-threaded, there are 0.1M indirect jumps, again with a 6.5% mispredict rate. It looks like the key perf differences here are: + CT gets fewer branch mispredicts. About 5% of ops dispatched by trunk are useless (due to branch mispredict), while CT has almost no useless ups. + CT has about 50% as many stall cycles overall. This reduces total cycles by about 15%. I'm not sure why the stall cycles are reduced. I assume it's probably ultimately derived from the reduction in branch mispredicts, but I don't really understand it. - CT appears to put more pressure on the memory units, probably for return address manipulations. - CT performs about 10% more microops, probably something to do with the call/return pairs as well as the return address saving. I'll see if I can get some similar numbers for inlining, then see what makes sense to do next.
Inlining looks pretty good on on this benchmark. I got a time reduction of about 15% vs. de-optimized trunk, which should still be worth a good 5-10% vs. real trunk. The 15% can be read off almost entirely as a 5% reduction in branch mispredict uops + a 10% reduction in real uops retired. I'm pretty sure the literature results on call threading must have had much higher baseline branch mispredict rates, but I didn't see absolute numbers in the paper, only reduction vs. baseline.
I implemented specialization and regs.pc implementation in a very hacky way that's just barely good enough to run the modified 3 bits. By the above I mean: specialization: For inlinable ops that take an explicit argument, create a version that loads the argument as an immediate operand. When inlining, copy in the code, then patch the real value for the immediate. I did this for getvar, setvar, and the intX ops. With specialization, intX ops are only 4 x86 instructions. regs.pc elimination: Remove the update to and load of regs.pc from most ops. When call-threading or inlining an op that still needs the PC (ifne and varinc in my example), generate a mov instruction that updates regs.pc. This pulls x86 ops out of everything and reduces NOP instructions (such as GROUP) to empty code. Starting from the basic inlining version, specialization reduced time taken by 10%. regs.pc elimination reduced time taken by 17% vs. the baseline (so, about 7% more from the specialized version). In HPCs, the improvement shows up directly as a reduction in x86 ops executed. In total, this is 30% less time taken than the de-optimized trunk, or 25% less than regular trunk (and older trunk, but the speedup factors should not change terribly much, I think). Note that regs.pc elimination doesn't make a lot of sense without specialization, and specialization itself may be difficult to implement for real. The problem is how to identify the immediate operand to be patched in the x86 code for the specialized op. For now, I disassemble, find the location, and refer to that location explicitly in the x86 code generator. Some variant of this could be used in practice but it basically requires some manual attention at build time so it would not be suitable for regular builds, which seems pretty bad. Some ideas on how to do it better: - Make the operand be 0xdeadbeef, look for that in the code at run time, and hope false positives don't occur. Maybe there's some sequence that's illegal in x86 code that could be used. - Make the operand be a register, and then generate an immediate move to that register just before the inlined code. This would probably result in slightly worse x86 code, because of the extra constraints on the compiler. For example, my current specialized JSOP_INT32 (which I use for all int pushes) has 'mov imm, (%eax)' which wouldn't be possible with this method. - Code up the case in asm (by copying off the compiler) and put it as inline asm. That way, we'd know the exact shape of the code. The compiler would pick only the offsets for the local variables used (e.g., regs.sp). Knowing the instructions and lengths, we could reliably patch the operand. (This assumes the compiler doesn't alter the assembly, which I think we can ensure with 'volatile'). - Code up the case in x86 and patch both specialized values and local addresses. In this case, we need to know how to refer to locals used. E.g., in my current optimized builds, regs.sp is -0x50(%ebp). Presumably we could discover this at run time by comparing %ebp and ®s.sp. regs.pc elimination is fairly easy to implement, but I'd like to do it better by completely eliminating the PC. Presumably we'd need to keep a mapping table for the decompiler. The only hard part is ops like varinc that have an explicit operand and can't be inlined easily (they call functions/share code with other cases/are big). But maybe we can store an input or 2 to such functions in a register or a variable in generated x86 code just before the call. We have enough to do for now but I just want to mention the idea of refactoring things like varinc to be an inlinable fast path with side exits to a called slow path so I don't forget it. In sum, at this point we have a reasonable idea what dispatch optimizations help and how much they help, and a prototype of each of them. I think the challenge now is to decide which ones to do and figure out how to implement them robustly. Based on the data I have so far, call threading by itself is perf neutral (perhaps a 5% win if it keeps the trunk dispatch opts), and inlining is a perf improvement. So, I can start by cleaning up my implementation of those and start running bigger/more tests. There is one hard problem there: it seems the compiler generates wrong code unless the inlinable and callable cases end with a goto to a real code location (e.g., error). I still don't know the cause for sure but I think if the compiler thinks cases exit or fall through, it makes liveness assumptions and corresponding optimizations that are not valid. In the meantime, I add the goto and then patch in rets at run time using mprotect. This seems OK but I think the startup cost is substantial (~1ms) in my current version so I need to fix that. Any insight into this issue would help a lot.
Note that if you attempt to read this code your head will probably explode due to sloppy incrementally developed macrology and spaghetti control flow in js_InlineThreadScript.
I've got an implementation of call threading sufficiently robust to run SunSpider. Results are attached compared to a trunk checkout from a week or 2 ago. Overall slowdown is in the top end of the range I would have predicted from the tiny benchmark results. The ones that slowed down the most I think will also benefit the most from inlining, which is the next thing to try. I did a bit of shark testing, and it looks the situation is the same as before: lower branch mispredict rates, but more micro-ops retired because of the call/return and pop/push pairs. Inlining will remove those costs for the small ops.
I'm still trying to figure out what to do with the call threading. In theory, it should be able to run a bit faster than indirect threading, but in practice, it doesn't. Before moving ahead with the other optimizations, I would really like to come up with a version of call threading, or even an alternative solution, that doesn't regress baseline performance. I do need to collect a few more numbers first, to make sure the regression isn't just because I haven't implemented the call-threading analogue of if-after-compare optimization yet, but I want to lay out the issues here first to see if anyone has any ideas. In baseline SM, dispatch is indirect threaded and is 4 instructions: 3 loads (opcode, table base address, opcode target) and an indirect jump. On 11 SunSpider benchmarks I profiled with HPCs (the 3d, access, and bitops groups), the mispredict rate on the jump ranges from 0 to 27%, with an average of 6.25%. Baseline SM runs 4 instructions in 3 cycles, so we could crudely estimate the cost of SM dispatch in SunSpider as 3+P/16, where P is the penalty in cycles for an indirect branch mispredict, and is probably something around 24, for a total cost estimate of 4.5 cycles. In these benchmarks, SM runs about 28 cycles per op, so one cycle per vm op is worth about 3.5% of perf in this range. In theoretical call threading, dispatch is 2 instructions: a call and a return, and the mispredict rate on both is effectively zero, so the cost might be just 2 cycles. That assumes there isn't any penalty for correctly predicted calls. I'm not sure if that's true. But my implementation of call threading so far runs more like 2-3 cycles per dispatch *more* than baseline. There appear to be several reasons for the extra cycles: 1. The call/ret pair means that inside the op implementation, the x86 stack pointer (esp) will be 4 down from where it would normally be. This doesn't work if the op calls a function, for two reasons. First, gcc stores to [esp] to pass the first argument, clobbering the return address. Second, esp is supposed to be aligned on 16-byte boundaries, and having it off by 4 causes various crashes. I have so far not found any solution to this problem other than popping the return address off the top of the stack at the start of the function, and pushing it back on before the return. This adds 2 instructions. They don't need to be there if the op doesn't call a function. I've made that optimization, and it helps, but not dramatically, since most ops can call functions. Another option is to pop/push only if the op really needs to a call a function, but that is much more complex to implement. 2. GCC generates more code for the cases in the call threaded version. I'm still don't know why this happens, so it might be solvable. The most obvious change is that in baseline SM, the assignment to 'op' in the dispatch code (DO_OP) is implemented by a move to edi. This load to edi is required as part of the indirect threading lookup, so loading the value into the 'op' variable is free. But in the call-threaded version, the assignment to 'op' is implemented with a store to a stack frame location. Overall, the extra code usually comes out to 2-3 extra instructions. With these 2 items, call threading is left with a dispatch cost of 2 push/pop, 2.5 instructions of compiler inefficiency (relative to baseline), and a call/ret pair, for 6.5 instructions, 2.5 more than in baseline. This pretty well accounts for the observed perf regression. Now, there are other wins enabled by call threading, so it's not necessarily a bad idea as is, but I'd really like a better solution to these problems. The wins are: - Inline no-call ops: save 2 instructions per no-call op relative to baseline (or 6.5 relative to call threading), because dispatch is free with inlining. This could also be applied to ops with calls, if a call-free path can be exposed and inlined. - Eliminate 'op' variable: save 1-2 instructions per op in call-threaded version. Because 'op' seems to be the main culprit in the compiler inefficiency, eliminating it should help. Most cases don't even need it. For cases that do, we could either rewrite the code to not use 'op', or generate a store in the call-threading code to set op before the call. - Specialize ops with immediate operands: save 4-8 instructions for ops where this applies. - Eliminate 'pc' variable: save 3 instructions per op. Again, most cases don't need it, and we can generate an extra store for the ones that do. So, dispatch costs relative to baseline look something like this: Op Type Baseline CT-Now Magic CT-Max Magic-Max Most Ops 0 +2.5 - 2 - 2 - 5 No-Call 0 + .5 - 2 - 6 - 7 Specializable 0 + .5 - 2 -12 -13 CT-Now: current call-threading prototype; Magic: hypothetical magical call-threading without problems 1&2 above; CT-Max: call-threading with all further optimizations; Magic-Max: magic with all further optimizations. So, CT-Max is not bad, but Magic-Max is so much better, I'd really like to find a way to get there. I estimate CT-Max as an 10-15% reduction in time taken on SunSpider, but Magic-Max as 20-30%. Also, Magic-Max will be a speedup at every stage, while CT-Max regresses perf until almost the very last stage of work. I think we need to look further afield, into things like turning cases into functions (but somehow better than what I've been able to do with that so far), direct/inline-threading hybrids, etc. The goal is this: a dispatch mechanism that works for any vm op, costs 2-4 cycles (pretty much has to be 2-4 instructions because of dep chains), and is compatible with the optimizations above.
(In reply to comment #18) > In baseline SM, dispatch is indirect threaded and is 4 instructions: 3 loads > (opcode, table base address, opcode target) and an indirect jump. Are there any estimates for direct threading dispatch? > Baseline SM runs 4 instructions in 3 cycles, so we could crudely estimate the > cost of SM dispatch in SunSpider as 3+P/16, where P is the penalty in cycles > for an indirect branch mispredict, and is probably something around 24, for a > total cost estimate of 4.5 cycles. Are there any reasons for this number, 24? AFAICS it should be possible to measure it directly. For that the mainline interpreter can be patched so DO_OP just jumps to a label that in turn does the dispatch. This way the branch misprediction would be 100% providing the reliable estimates. > > In these benchmarks, SM runs about 28 cycles per op, so one cycle per vm op is > worth about 3.5% of perf in this range. > > In theoretical call threading, dispatch is 2 instructions: a call and a return, > and the mispredict rate on both is effectively zero, so the cost might be just > 2 cycles. That assumes there isn't any penalty for correctly predicted calls. > I'm not sure if that's true. > > But my implementation of call threading so far runs more like 2-3 cycles per > dispatch *more* than baseline. There appear to be several reasons for the extra > cycles: > > 1. The call/ret pair means that inside the op implementation, the x86 stack > pointer (esp) will be 4 down from where it would normally be. This doesn't work > if the op calls a function, for two reasons. First, gcc stores to [esp] to pass > the first argument, clobbering the return address. Second, esp is supposed to > be aligned on 16-byte boundaries, and having it off by 4 causes various > crashes. > > I have so far not found any solution to this problem other than popping the > return address off the top of the stack at the start of the function, and > pushing it back on before the return. This adds 2 instructions. They don't need > to be there if the op doesn't call a function. I've made that optimization, and > it helps, but not dramatically, since most ops can call functions. Another > option is to pop/push only if the op really needs to a call a function, but > that is much more complex to implement. > > 2. GCC generates more code for the cases in the call threaded version. I'm > still don't know why this happens, so it might be solvable. The most obvious > change is that in baseline SM, the assignment to 'op' in the dispatch code > (DO_OP) is implemented by a move to edi. This load to edi is required as part > of the indirect threading lookup, so loading the value into the 'op' variable > is free. But in the call-threaded version, the assignment to 'op' is > implemented with a store to a stack frame location. Overall, the extra code > usually comes out to 2-3 extra instructions. > > With these 2 items, call threading is left with a dispatch cost of 2 push/pop, > 2.5 instructions of compiler inefficiency (relative to baseline), and a > call/ret pair, for 6.5 instructions, 2.5 more than in baseline. This pretty > well accounts for the observed perf regression. > > Now, there are other wins enabled by call threading, so it's not necessarily a > bad idea as is, but I'd really like a better solution to these problems. The > wins are: > > - Inline no-call ops: save 2 instructions per no-call op relative to baseline > (or 6.5 relative to call threading), because dispatch is free with inlining. > This could also be applied to ops with calls, if a call-free path can be > exposed and inlined. > > - Eliminate 'op' variable: save 1-2 instructions per op in call-threaded > version. Because 'op' seems to be the main culprit in the compiler > inefficiency, eliminating it should help. Most cases don't even need it. For > cases that do, we could either rewrite the code to not use 'op', or generate a > store in the call-threading code to set op before the call. > > - Specialize ops with immediate operands: save 4-8 instructions for ops where > this applies. > > - Eliminate 'pc' variable: save 3 instructions per op. Again, most cases don't > need it, and we can generate an extra store for the ones that do. > > So, dispatch costs relative to baseline look something like this: > > Op Type Baseline CT-Now Magic CT-Max Magic-Max > Most Ops 0 +2.5 - 2 - 2 - 5 > No-Call 0 + .5 - 2 - 6 - 7 > Specializable 0 + .5 - 2 -12 -13 > > CT-Now: current call-threading prototype; Magic: hypothetical magical > call-threading without problems 1&2 above; CT-Max: call-threading with all > further optimizations; Magic-Max: magic with all further optimizations. > > So, CT-Max is not bad, but Magic-Max is so much better, I'd really like to find > a way to get there. I estimate CT-Max as an 10-15% reduction in time taken on > SunSpider, but Magic-Max as 20-30%. Also, Magic-Max will be a speedup at every > stage, while CT-Max regresses perf until almost the very last stage of work. > > I think we need to look further afield, into things like turning cases into > functions (but somehow better than what I've been able to do with that so far), > direct/inline-threading hybrids, etc. > > The goal is this: a dispatch mechanism that works for any vm op, costs 2-4 > cycles (pretty much has to be 2-4 instructions because of dep chains), and is > compatible with the optimizations above. >
(In reply to comment #19) > (In reply to comment #18) > > In baseline SM, dispatch is indirect threaded and is 4 instructions: 3 loads > > (opcode, table base address, opcode target) and an indirect jump. > > Are there any estimates for direct threading dispatch? Direct threading is 2 instructions per dispatch. At -2 relative to baseline, this is competitive with CT-Max, for much less work and less compiler dependence. I can't see how to do any of the other optimizations above with direct threading, except specialization, which we can partially achieve by making all immediate ops word-length, simplifying the load code and avoiding alignment issues. > > Baseline SM runs 4 instructions in 3 cycles, so we could crudely estimate the > > cost of SM dispatch in SunSpider as 3+P/16, where P is the penalty in cycles > > for an indirect branch mispredict, and is probably something around 24, for a > > total cost estimate of 4.5 cycles. > > Are there any reasons for this number, 24? AFAICS it should be possible to > measure it directly. For that the mainline interpreter can be patched so DO_OP > just jumps to a label that in turn does the dispatch. This way the branch > misprediction would be 100% providing the reliable estimates. I just did it and the mispredict rate is not 100%, but it did increase to 13.7%. This version makes 7.56M more mispredicts total over my test subsuite, and runs for 258M more cycles. But is also presumably executes 108M more direct jumps, and it's not obvious how to separate the jump cost from the miss penalty. I tried to use linear regression to separate the effects. The results don't look terribly reliable, but my best guess is a 16 cycle penalty. Theoretically, it should be about the same as the pipeline depth, so that sounds good.
(In reply to comment #18) > On 11 SunSpider benchmarks I profiled with HPCs (the 3d, access, and bitops > groups), the mispredict rate on the jump ranges from 0 to 27%, with an > average of 6.25%. ... > In theoretical call threading, dispatch is 2 instructions: a call and a > return, and the mispredict rate on both is effectively zero, so the cost > might be just 2 cycles. ... > But my implementation of call threading so far runs more like 2-3 cycles per > dispatch *more* than baseline. This clearly indicates that indirect.direct threading prefers thin-native-code-wise bytecodes while practical call threading needs fat bytecodes for the maximum benefits. With thin bytecodes there are natural correlations between instructions as typical script constructions correspond to repeated series of bytecodes. Thus the branch misprediction is lesser issue than the size of the dispatch code that present in every instruction. With fat bytecodes the correlation is gone as high-level constructions can be implemented with single bytecode and their sequences are random. So the cost of branch misprediction is high while the cycle cost of the dispatch code is a lesser issue due to the domination of the fat. Now, since current SM bytecode design uses a lot of thin bytecodes due to its stack-oriented nature, it is not surprising that practical call threading can be slower. So what the results tell is that call threading needs different bytecode model for the maximum benefits.
(In reply to comment #20) > I just did it and the mispredict rate is not 100%, but it did increase to > 13.7%. Hm, does the rate is the total rate for the tests including the rate for ifs inside bytecodes?
I don't see how to have only fat opcodes without having a lot of them, and dynamically generating them (shades of Tamarin-Tracing's superword synthesis ideas). We have a mix of fat and thin today. This suggests a mix of call and inline threading, which should let us win big (relatively speaking). But the trick is getting call threading not to turn Mr. Optimizer to go away, to borrow from shaver. /be
(In reply to comment #23) > I don't see how to have only fat opcodes without having a lot of them, and > dynamically generating them (shades of Tamarin-Tracing's superword synthesis > ideas). Using a register-based interpreter is a simple way to get that since it effectively fattens each bytecode with register indexes. With that model a = b translates into mov a, b instead of the current load b store a and a = b + c translates into add a, b, c to replace the current load b load c add store a Of cause, if in this example a,b,c would not be locals, then with extra bytecode to load them into temporaries the end result would look the same as with the stack-based approach: load tmp1, b store a, tmp1 and load tmp1, b load tmp2, c add tmp2, tmp1, tmp2 store a, tmp2 But, judging by SunSpider benchmarks, people long have learned to use local variables for anything that goes into a tight loop.
(In reply to comment #24) > But, judging by SunSpider benchmarks, people long have learned to use local > variables for anything that goes into a tight loop. I can only assume you're being sarcastic (SunSpider is full of undeclared loop-control variables), but regardless I think we should remember that the vast majority of SunSpider was never run on the web, even once -- it's translated Java/Fortran/whatever from the alioth benchmarks, without concern for whether it was translated well or not. It is many things, but representative of the web it is not, in the general case.
(In reply to comment #25) > I can only assume you're being sarcastic (SunSpider is full of undeclared > loop-control variables), I have not spotted a place in the benchmarks that uses undeclared global inside a tight loop. Could you point one? > but regardless I think we should remember that the > vast majority of SunSpider was never run on the web, The advise to use function-local var for temporaries is as old as the JS itself if not for performance, then just for code maintainability.
The ones I could quickly find are: * loop-control variables h, v, and i in 3d-raytrace:invertMatrix * the loop-control variable i in 3d-raytrace:Scene.prototype.intersect and Scene.prototype.blocked * bitwiseAndValue in bitops-bitwise-and * a2 through a9 in math-partial-sums:partial They definitely contribute significantly to the runtime of the tests; prior to the bacon cache, they dominated bitwiseAnd, f.e.
I've finally gotten call threading to perf parity with baseline, which I think is good enough to move on to the inlining and the other optimizations. - I left the pop/push strategy as is. For these extra pop-push instructions, the only alternatives I could think of were (a) to make the cases functions, which is a clear loser from the previous tests and counting instructions in function prologs (e.g. saving registers); and (b) performing the call "by hand", storing the return address in a different place so that the C stack is not misaligned. I tried this and it didn't work well. One of the problems is that the "return" becomes an indirect jump, with a worse mispredict rate than trunk. - The other extra instructions I found were generated by a GCC RTL pass called gcse2 or gcse-after-reload. This pass performs partially-redundant load elimination and runs right after register allocation. The purpose is to remove loads introduced in register allocation (presumably spills). If a load is partially redundant, eliminating it requires some store instructions to be added. This could slow down the program so the it should only be done on cold paths. But when not in PGO mode, the compiler can only use heuristics to guess cold paths. I think the compiler was making bad guesses, because it was adding loads to relatively hot paths, like JSOP_GETLOCAL. Because 'op' was the variable that got extra stores this way, I tried removing op and instead referred to '*regs.pc'. By itself, this causes a few percent regression because the gcse-after-reload pass starts adding extra stores to remove loads of 'pc'. But removing op together with call threading gives about performance parity. Clearly, the last few percent of performance in call threading are dependent on the vagaries of compiler optimization. I thought that maybe GCC 4.2 would do better, but in my SunSpider tests, that caused a 5% regression relative to GCC 4.0 on both baseline and the call-threaded version. And it didn't solve the gcse problem. The gcse problem may go away with PGO, but apparently PGO is pretty fussy because the data collection runs just crash for me, even on baseline.
David, Is the PGO crash with GCC? - moh
(In reply to comment #29) > Is the PGO crash with GCC? Yes, all of my work has been on MacOS with GCC. Clearly, we should be trying icc on Windows as well. :-)
Just want to describe a problem I hit when inlining things in case people have any ideas. To do inlining, I make an op end with a sequence like this: CT_END_JSOP_POPV: // dispatch sequence to make GCC's CFG correct I'm finding that often GCC puts a jmp just before where the label would be, then puts the label and the dispatch sequence above the main body of the case. This is a problem, because then I can't use the location of the label to find the end of the case--instead, I compute a negative length and crash. It appears the move first happens in the "reorder blocks" optimization, which tries to reorder basic blocks to reduce the number of taken branches. I tried turning off that optimization, and somehow the change didn't happen in that phase, but it looks like it still did in the final assembly code. Anyway, I doubt it's worth it to try to force the compiler not to reorder blocks, especially since it might deoptimize other parts of the code. Possible solutions include: - Don't use an end label. Then GCC won't have a basic block boundary (the end label with a taken address implies incoming control flow, although there isn't any in reality) so it won't reorder. I have confirmed this. This would require searching for the end point in the native code somehow--disassembly, searching for a certain instruction, or something like that. - Disassemble the code to look for the end label. The idea here is to disassemble down from the start point, following any jumps, until we get to the end point. The end point should just be some final jump. This idea would seem to require a much more robust disassembler than the first one. - Use ASM to code inlinable cases. With volatile asm, GCC won't reorder it. The disadvantage is of course the maintenance burden of requiring assembly blocks to be replaced when a case changes. But at least this applies only to the short cases. I like the first idea best. To make it robust I a way to put some marker in the native code that won't appear by chance in the instruction stream. E.g., a 5-byte sequence that can't occur in valid x86.
David, Two questions: 1. With the first solution, how would you ensure that the instructions scheduler does not move that "particular designated instruction" above other instructions in the same case body? Even with volatile memory accesses, isn't it possible that the compiler might move them above reg-reg instructions? 2. Do you also assume that the individual case bodies do not include other jump tables themselves? - moh
(In reply to comment #32) > David, > > Two questions: > > 1. With the first solution, how would you ensure that the instructions > scheduler does not move that "particular designated instruction" above other > instructions in the same case body? Even with volatile memory accesses, isn't > it possible that the compiler might move them above reg-reg instructions? Good point. So far, I haven't seen GCC move nops around too much. I figured something like that would work. I guess the ideal would be something that creates a barrier (wrt reordering), is clearly understood by the compiler to allow control to flow through but not to be a side entry. I wonder if an int instruction would work. Or are you hinting that option 2 or 3 would be more portable to icc? > 2. Do you also assume that the individual case bodies do not include other jump > tables themselves? Not sure what you're getting at. For now, I don't plan to try to inline anything that large and complex, just small ops with limited internal control flow that don't call any functions. Inlining the more complex case statements certainly would be more difficult.
I have specialization going for JSOP_U?INT\d+ and JSOP_(G|S)ET_(LOCAL|VAR), which are the common easily specialized ops. This is 4% better than my baseline on SunSpider. The results vary wildly by benchmark. The best is bitops-3bit-bits-in-byte, which is 1.32x as fast, while one of the worst is bitops-bitwise-and, with a 5% slowdown. I suspect that the slowdown is more compiler weirdness, and it may not be worth looking into too deeply just yet if further changes (such as PC elimination) make the compiler do different things anyway. But I think it might be worthwhile to make a script to print out the instruction count in each case, which should not be too hard to read off from the generated assembly, although basic block reordering makes it a bit harder. If I could at least see an increase in instruction count for jsops that are hot in benchmarks that regress, I would have some notion of where the problem is.
I would like to land something like this patch. In particular, the same functionality, minus the non-SpiderMonkeyness and any other crappiness. Perf is now about 8% better than (non-tracing) baseline on SunSpider overall, and 10-20% on the "good" test groups (3d, access, bitops, crypto, math). The only perf problem in SunSpider in this version is regexp-dna, which is actually 7% slower. I don't know why this is or if it is acceptable given the other speedups. There is definitely room for further perf improvement, such as doing inlining and specializing more ops, especially inc/dec. But it's reached a point where it really does seem to work, so I think it's better to land it and start getting some real testing. I think I can handle most of the fixup work, but there's a few areas I could use extra guidance in: - What kind of controls should turn this off or on? Build macros? Runtime options? If macros, any advice on what to set up? Just imitate JS_THREADED_INTERP? - The technique may generalize to other architectures, but this code is gcc and x86-specific. Should those aspects (e.g. x86 opcode defines) be partitioned into an arch-specific file or anything? - What do we need to know about memory usage effects? What are the current procedures for measuring JS memory usage? - Currently I free the native buffers only when their associated script is destroyed. But in the browser I guess certain GC or memory pressure events should be able to result in freeing a native buffer. What are those events? Also, freeing a buffer for currently running code is hard, although it should be possible. Is it necessary? That's probably enough questions for now.
Progress update: I've converted to a mini-assembler instead of trying to memcpy code blocks because it's the only practical way to get specialization. I just finished implementing TABLESWITCH and LOOKUPSWITCH as jitted ops with tables inline with the code, so I no longer need to keep a pc-to-ip map around forever. Current speedup is now about 15% on SunSpider. It's a very small change--before I was putting inline-threaded native code into page-sized buffers to avoid mixing code and data pages. But that was causing a page fault on the first native op generated for every function. Replacing that with straight JS_malloc (plus some resizing logic because I can longer easily know the exact size ahead of time) got rid of the page faults and produced a dramatic speedup. Code is now available at http://hg.mozilla.org/users/dmandelin_mozilla.com/inline-threading/ as a patch queue parented by 'changeset: 18807:1e32a7b9bd2f' Next things I need to do: - Get tracing instrumentation in there. Basic approach is to inline it using the mini-assembler. Should not be too hard. The main question is how inline-threaded code and tracing interact with the frame stack, and if they are compatible there. I'm sure there will be a few other bugs but I don't foresee anything major. - Arrange for inline threading to be disabled if debugger is enabled. Hopefully I can just decline to call an inline-threaded buffer if debug hooks are non-null - Arrange for browser-initiated GC (which occurs when SM is not executing, i.e., not in js_Interpret) to be able to nuke inline-threaded buffers. I guess this requires adding a parameter to the GC method. - Fix general bugs that make the browser crash. I have been using a small test suite of SunSpider plus small programs of my own, but I know that there are a good number of jsops not exercised at all by SunSpider.
(In reply to comment #36) > Current speedup is now about 15% on SunSpider. ... Very nice. Is the 15% gain on top of TM?
This is very close to the patch I would want to check in. I have finished the general cleanup, and both tracing and inline threading work and both can be used together. Inline threading is enabled with a '-g' option to the shell, and a corresponding JS option. Also, a build where JS_INLINETHREAD is not defined will not include any of the inline threading code. Such a build would be slightly different from current trunk--in particular, the 'op' local is removed. At present, enabling inline threading speeds up non-tracing SunSpider runs by about 10% and tracing runs are about the same speed. In particular, even with tracing control-flow runs about 15% faster. access-binary-trees is also a bit faster. With tracing, there is about a 12% regression on date-format-tofte because it calls eval a lot. I'll look into that. I think the solution is not to inline thread functions generated by eval. In general, anything that traces well is not going to get faster with inline threading, which is all but 2 or 3 SunSpider benchmarks. I currently seem to be getting a 15% speedup on the v8 suite (w/o tracing), which kind of surprises me. It seems that as of this minute, trunk tracing dies on v8-raytrace, but my earlier tests showed a 2% improvement when enabling inline threading with tracing. There are various other improvements that could be made to make inline threading even faster but I'd much rather get what I have landed before trying to do any of that. Also I think a Windows version (using some variant of the SFX techniques) would be a higher priority. What remains to be done is mostly stuff to get it to run in the browser: - Get it to run in the browser (probably just crashes immediately now) - Disable inline threading when being run from the debugger - Free inline threaded code when called from browser GC and interpreter is not running. I'm wondering if it makes sense to start the review process while I work on that. The patch may take some time to review. I'm sure I have some non-SMisms to be fixed. Also, I think these high-level decisions should be evaluated by reviewer(s) and others: - Moving cases into jsops.cpp so that it can be included twice, once for indirect threading and once for inline threading. - The JS_INLINETHREAD option for enabling inline threading at run time - The JS_INLINETHREAD cpp define for enabling inline threading at build time - The mini-assembler in jsemit.cpp. It's modeled after the one in nanojit, but goes forward, because that's what I want here. Also, instead of macros I used inline functions with reference arguments, because it was a lot easier to get that to work and the type checking is nice.
Attachment #340625 - Flags: review? → review?(brendan)
Will look at this tomorrow! /be
This is an update for current mozilla-central. The changes required for the update were trivial. Notes: - There is currently a huge perf regression when running 3d-raytrace with both inline threading and tracing enabled. (The other 3 combinations of options are all fine.) I think this might be caused by the correctness bug with tracing and 3d-raytrace, so I haven't looked any deeper yet. - I checked the memory allocation reserves in js_InlineThreadScript with static analysis. The previous version of the patch did have a bug there. I think this should fix it.
Attachment #344409 - Flags: review? → review?(brendan)
Attachment #344409 - Flags: review?(brendan)
Comment on attachment 344409 [details] [diff] [review] Updated patch for current mozilla-central This is happening over in bug 506182. /be
No longer relevant as we spend very little time in the interpreter these days.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.