Bug 551636 (Closed): Opened 15 years ago, Closed 15 years ago

JM: load/store directly from args, locals and globals

Categories

(Core :: JavaScript Engine, defect)

Tracking

RESOLVED FIXED

People

(Reporter: luke, Assigned: dvander)

Attachments

(2 files, 2 obsolete files)

Even after removing sp updates (and compiling slot access as fixed offsets from fp), we still pay a copying overhead (compared to a register-based scheme) for loads/stores to/from arguments, locals and (with bug 548844) globals. E.g., for the code:

  function f(x) { x = x + 1 }

we emit, and JIT the equivalent of:

  00001: getarg 0
  00004: one
  00005: add
  00006: setarg 0

A register-based strategy would simply use the argument's slot for the input and output of add, skipping the copying of arg 0 onto and off of the stack.

Talking to dvander today, it seems like it is possible to boil the stackiness away at compile time without too much extra work. Recording the idea here for future reference:

- For loads from args, locals and globals, we don't emit anything and simply keep a record of the fact that stack value X receives its value from slot Y, in a manner reminiscent of the tracing JIT's tracker.
- For an op that consumes stack values, we check the tracker to change our loads from the temporary stack into loads from the appropriate arg, local, or global.
- Stub-call slow paths will need to write all currently outstanding tracker entries to their proper stack values (but this is no different than the copying JSC does for the arguments to a stub call).
- For any opcode that produces a value on the stack, we peek to see whether the next opcode is a setlocal/setglobal/setarg, and if so, effectively fuse the two and write directly to the local/global/arg. Although this peephole optimization will miss setX ops that don't immediately follow the op that produces the value, I'm not sure how often, if at all, this comes up.

Although I could be missing something, it seems this allows us to copy no more than if we were emitting from a register-based bytecode and, in some cases, less. Consider the stub-call path for an op whose operands are not locals/args/globals. E.g., for the code:

  f() + g()

when we emit the stub-call path for the +, the stack will already hold the temporary results of 'f()' and 'g()' in the proper positions, and we simply emit the stub call without copying/moving these values. With a register-based bytecode, the values of 'f()' and 'g()' would need to be copied into known locations where the stub call expects to find them.
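[Editorial sketch] To make the bookkeeping concrete, here is a minimal C++ sketch of the tracker idea above. All names (SlotTracker, Backing, TrackedValue) are invented for illustration, not from any patch: loads push a record instead of emitting code, consuming ops ask where an operand really lives, and a stub call flushes every outstanding entry to its canonical stack location.

  #include <cassert>
  #include <cstddef>
  #include <cstdint>
  #include <vector>

  enum class Backing { Stack, ArgSlot, LocalSlot, GlobalSlot };

  struct TrackedValue {
      Backing where;   // Stack means "already materialized on the temp stack"
      uint32_t slot;   // which arg/local/global, if not Stack
  };

  class SlotTracker {
      std::vector<TrackedValue> stack;  // mirrors the operand stack
    public:
      // getarg/getlocal/getgvar: push a record instead of emitting a load.
      void pushSlot(Backing where, uint32_t slot) { stack.push_back({where, slot}); }

      // An op consuming an operand asks where to load it from.
      TrackedValue peek(std::size_t fromTop) const {
          assert(fromTop < stack.size());
          return stack[stack.size() - 1 - fromTop];
      }

      void pop(std::size_t n) {
          assert(n <= stack.size());
          stack.resize(stack.size() - n);
      }

      // Before a stub call, every outstanding entry must be written to its
      // proper stack value so the slow path sees a coherent stack.
      template <typename EmitStore>
      void flush(EmitStore emitStore) {
          for (std::size_t depth = 0; depth < stack.size(); depth++) {
              if (stack[depth].where != Backing::Stack) {
                  emitStore(depth, stack[depth]);          // store slot -> stack[depth]
                  stack[depth] = {Backing::Stack, 0};      // now materialized
              }
          }
      }
  };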
This is great stuff. Is the following formulation equivalent? When the internal semantics (i.e., SM's stack bytecode machine semantics) require a load or store, the JIT doesn't generate any code right away, but rather just creates a record (basically abstractly interpreting the method). At any given time, the abstract state is the SM stack, where each value is the set of regs/mem locs/constant values that currently hold that value. (The mem locs could be local slots, arg slots, property slots, or stack slots.) Only when we need to actually *do something* with the values (e.g., add them, or any other such op; and also counting stub calls as using whatever portion of the stack they use) do we emit the necessary load/store code.

It would be really interesting if this could somehow be extended to do dead code elimination. Maybe we could even delay emitting the add code etc. until the end of the basic block. I.e., track all the state and what we're delaying, and then emit code only at the end of the basic block.
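[Editorial sketch] A tiny C++ model of that abstract state (all types invented for illustration): the abstract SM stack maps each stack value to the set of locations currently known to hold it.

  #include <cstdint>
  #include <variant>
  #include <vector>

  struct Reg       { int id; };        // machine register
  struct LocalSlot { uint32_t n; };    // fp-relative local
  struct ArgSlot   { uint32_t n; };    // fp-relative argument
  struct StackSlot { uint32_t depth; };// temp-stack slot
  struct Constant  { int64_t v; };     // known constant value

  using Location = std::variant<Reg, LocalSlot, ArgSlot, StackSlot, Constant>;

  // One set of equivalent locations per operand-stack value.
  using AbstractStack = std::vector<std::vector<Location>>;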
(In reply to comment #1)
> This is great stuff. Is the following formulation equivalent?

Indeed; I like the way you've put it.

> It would be really interesting if this could somehow be extended to do dead
> code elimination. Maybe we could even delay emitting the add code etc until the
> end of the basic block. I.e., track all the state and what we're delaying, and
> then emit code only at the end of the basic block.

That is an interesting generalization. I can see simple constant propagation of literals working with this scheme. As always, the decompiler will be our enemy if we do anything too interesting. Also, assuming that we need to emit all un-emitted computation before leaving JITed code through a stub call, we might run into the same bloat problem as trying to sink stores to side exits on trace (perhaps with the same delta-compression solution we've talked about).
We'll also need to figure out how the tracer will bail back to method-JIT'd code when there are more opcode fusions/optimizations taking place.
(In reply to comment #1)
> It would be really interesting if this could somehow be extended to do dead
> code elimination. Maybe we could even delay emitting the add code etc until the
> end of the basic block. I.e., track all the state and what we're delaying, and
> then emit code only at the end of the basic block.

Or until the stack pops and something else wants to push, overwriting the tracked location. Then you'd have to emit code, or buffer more deeply (keep a history of stacks in the abstract interpretation).

Treating the stack as a register file should be equivalent. No loads from regs, just moves -- but you may have to spill.

/be
(In reply to comment #3)

Ew, yeah. Changing evaluation seems to be the point where complexity takes off. As long as all we are doing is delaying stores to the temp stack, it seems like the "spill on stub call" rule will save us from exposure by the tracing JIT and decompiler.
Starting on this now.

As a follow-up to comment #3, always being able to transition directly from the tracing JIT to the method JIT seems pretty complicated. For example:

  getlocal 0
  getlocal 2
  add

We could imagine emitting no code for the getlocals and then having "add" test slots 0 and 2 for fast-pathiness. The trace JIT couldn't jump directly into the "add", though. The slow case for "add" would have to emit stack writebacks and create a safe point for the trace JIT to resume at. That's all kind of gross, and it's not clear whether it works everywhere. If the tracer could bail out at a "setlocal" that expected a value in a register, things would go wrong, so every opcode would need a slow case for the trace JIT.

Instead I'm going to try just interpreting once we leave the tracer. The method JIT will mark safe points (calls, returns, ops with incoming edges) where the tracker is guaranteed to be empty/flushed. If we leave trace in the middle of an optimized block, we'll js_Interpret until we hit a safe point, and then return back to the method JIT.

I tested this out and it's about a 2% perf loss on SunSpider and a 1% loss on V8. I think we can win that back, so I'm willing to take this loss for the reduced complexity.
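[Editorial sketch] A rough C++ sketch of that safe-point scheme, under assumed names (SafePointMap, ResumeAtSafePoint, and the step callback are invented; only js_Interpret and the safe-point criteria come from the comment above): the compiler records native entry points only where the tracker is known to be flushed, and a tracer bailout single-steps the interpreter until it lands on one.

  #include <unordered_map>

  using jsbytecode = unsigned char;

  struct SafePointMap {
      // pc -> native entry, recorded only at calls, returns, and ops with
      // incoming edges, where the tracker is guaranteed empty/flushed.
      std::unordered_map<const jsbytecode*, void*> entries;

      void add(const jsbytecode* pc, void* native) { entries[pc] = native; }
      void* lookup(const jsbytecode* pc) const {
          auto it = entries.find(pc);
          return it == entries.end() ? nullptr : it->second;
      }
  };

  // Called when the tracer exits in the middle of an optimized block. |step|
  // stands in for interpreting a single bytecode (a one-op js_Interpret) and
  // returning the next pc; it is an assumption, not a real SpiderMonkey API.
  template <typename StepFn>
  void* ResumeAtSafePoint(const SafePointMap& map, const jsbytecode* pc, StepFn step) {
      void* native;
      while (!(native = map.lookup(pc)))  // not a safe point yet...
          pc = step(pc);                  // ...so keep interpreting
      return native;                      // method-JIT code to jump back into
  }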
Assignee: general → dvander
This patch only creates the pc->jit mapping for safe points, and uses the interpreter to reach safe points. Since it's a perf loss, I'm just putting it here for posterity until it's absolutely needed.
Attached patch experimenting (obsolete) — Splinter Review
Experimental tracker design. The ValueTracker mirrors the operand stack using little ValueInfo structs. These structs now get passed around instead of using RegisterID/Address. For example, this patch can optimize:

  JSOP_ZERO
  SETGVAR 0
  POP

This used to emit (imagine 180 is the distance for sp[-1]):

  ; store 0 to stack
  movq 180(%rbx), 0
  ; ... later, store top of stack to global dslots
  movq %rcx, 180(%rbx)
  movq -0x28(%rax, %rdx, 8), %rcx

Now it can emit only:

  movq -0x28(%rax, %rdx, 8), 0

This is because setObjectSlot() now gets a ValueInfo instead of a register, and then it calls a helper to discharge the contents into a BaseIndex.

My hope is that this will be simpler and more flexible than what we've got now. This is going to be a large opcode-by-opcode overhaul. Unfortunately it's not something that can land bit by bit, because the tracker must be used correctly everywhere. ETA: May 1.
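[Editorial sketch] A hedged C++ sketch of the ValueInfo shape (ValueInfo, setObjectSlot, and BaseIndex are named above; the members and the printf stand-ins for emitted instructions are assumptions): a value that is still a constant can be discharged straight into the slot store, which is exactly why the JSOP_ZERO example collapses to a single movq.

  #include <cstdint>
  #include <cstdio>

  using RegisterID = int;

  struct BaseIndex {        // stand-in for the assembler's base+index*scale mode
      RegisterID base;
      RegisterID index;
      int scale;
      int32_t offset;
  };

  struct ValueInfo {
      enum Kind { Constant, InRegister, OnStack } kind;
      union {
          int64_t constant;     // Kind::Constant
          RegisterID reg;       // Kind::InRegister
          int32_t stackOffset;  // Kind::OnStack (offset from fp)
      };
  };

  // Discharge a tracked value straight into an object slot. Only an OnStack
  // value forces the extra load that the old, stacky code always paid.
  void storeToSlot(const ValueInfo& v, const BaseIndex& slot) {
      switch (v.kind) {
        case ValueInfo::Constant:   // one store-immediate, no staging register
          std::printf("movq [base+index*8%+d], $%lld\n",
                      slot.offset, (long long)v.constant);
          break;
        case ValueInfo::InRegister: // one register store
          std::printf("movq [base+index*8%+d], r%d\n", slot.offset, v.reg);
          break;
        case ValueInfo::OnStack:    // load from the temp stack, then store
          std::printf("movq rtmp, [fp%+d]\n", v.stackOffset);
          std::printf("movq [base+index*8%+d], rtmp\n", slot.offset);
          break;
      }
  }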
(In reply to comment #9)
> Created an attachment (id=437759) [details]
> experimenting
>
> Experimental tracker design. The ValueTracker mirrors the operand stack using
> little ValueInfo structs. These structs now get passed around instead of using
> RegisterID/Address.

Looks good so far. I really like the example in comment 8 that describes the goal, too. Random thoughts on the patch so far:

- It looks like StackState and ValueTracker hold the same data. Are they planned to diverge later, or can we combine them?
- I like the more abstract names better. I like |StackState| a lot. |ValueTracker| sounds too operational to me. |AbstractValue| has a better ring to me than |ValueInfo|, although |AbstractValue| is also pretty vague.
- What do stackDepth and trackDepth mean?
(In reply to comment #10)
> Are they planned to diverge later, or can we combine them?

I would like a way to combine them, if possible. For simplicity the tracker uses the entire stack depth (that way it doesn't have to use some sort of queue). But I don't want to malloc() a tracker every time we need a slow case. The queue might not be too bad.

> I like the more abstract names better.

Agreed on ValueTracker/Info not being quite right. I'll see how Abstract* looks.

> What do stackDepth and trackDepth mean?

The tracker's stack depth is the same thing as Compiler::stackDepth, but they're kept separate until the entire compiler is updated. The track depth is the number of values at the top of the stack that are being tracked.
Starting to play with simple register allocation. The compiler keeps track of temporary registers, and you can allocate/free them with allocReg()/freeReg(). If the compiler sees that no registers are free, it will look in the tracker and see if any live values are pinned to registers. If so, it evicts one, spilling it back to the JS stack. If there are no registers available anywhere, this is a compiler bug and it will assert.

Of note, the act of popping an abstract value off the tracker means you now own it. The tracker will not evict it for you.

So, taking a bit of example code:

  var i = 1
  var j = i

This compiles to:

  ONE
  SETGVAR "i"
  POP
  GETGVAR "i"
  SETGVAR "j"

With stack transitions, this compiles (roughly) to:

  ; ONE
  mov [rbx + DEPTH], 0x3
  ; SETGVAR
  mov rcx, [rbx + DEPTH]
  mov [r8+r9*8-0x28], rcx
  ; GETGVAR
  mov rcx, [r8+r9*8-0x28]
  mov [rbx + DEPTH], rcx
  ; SETGVAR
  mov rcx, [rbx + DEPTH]
  mov [r8+r9*8-0x28], rcx

With this patch it compiles (roughly) to:

  ; SETGVAR
  mov [r8+r9*8-0x28], 0x3
  ; GETGVAR
  mov r8, [r8+r9*8-0x28]
  ; SETGVAR
  mov [r10+r9*8-0x28], r8

The slow path gets slightly slower because it has to spill state and then re-merge it. If this gets to be a problem, we can be smarter about where stub calls put their values. But really we shouldn't take slow paths.
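[Editorial sketch] A C++ sketch of that allocation policy (allocReg/freeReg come from the comment above; everything else, including the FIFO eviction choice, is assumed): temporaries come from a free list, and when none are free a live tracked value is evicted by spilling it back to its JS stack slot. Popped values are owned by the caller and never appear in the pinned list, matching the ownership rule above.

  #include <cassert>
  #include <cstdint>
  #include <utility>
  #include <vector>

  using RegisterID = int;

  struct Pinned { RegisterID reg; uint32_t stackSlot; };  // tracked value in a reg

  class RegAlloc {
      std::vector<RegisterID> freeRegs;
      std::vector<Pinned> pinned;  // live tracker entries currently in registers
    public:
      explicit RegAlloc(std::vector<RegisterID> regs) : freeRegs(std::move(regs)) {}

      template <typename EmitSpill>
      RegisterID allocReg(EmitSpill spill) {
          if (!freeRegs.empty()) {
              RegisterID r = freeRegs.back();
              freeRegs.pop_back();
              return r;
          }
          // No free registers: evict a pinned value, spilling it to the JS stack.
          assert(!pinned.empty() && "compiler bug: no registers available anywhere");
          Pinned victim = pinned.front();
          pinned.erase(pinned.begin());
          spill(victim.reg, victim.stackSlot);  // emit: store reg -> stack slot
          return victim.reg;
      }

      void freeReg(RegisterID r) { freeRegs.push_back(r); }
      void pin(RegisterID r, uint32_t slot) { pinned.push_back({r, slot}); }
  };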
Attachment #437759 - Attachment is obsolete: true
Attached patch WIP v1 — Splinter Review
Okay, this is sort of what the final patch will look like. Barring any big architectural changes, I'll stop patch-spamming this bug and just land it when the whole compiler is updated.

The last big change is that values on the stack can propagate longer. For example:

  UINT24 5000
  NAME "i"

The NAME opcode has to flush the stack, but we want to keep propagating the constant below it. So now flushing the tracker also takes an invalidation-depth parameter: anything below this depth won't get clobbered (it also takes care not to spill values twice).

There is enough here right now to run bitops-bitwise-and.js, though there is no speedup because this test is gated on an intolerable JSOP_NAME. If I force the benchmark to use GETGVAR instead (by adding a "var" keyword), we get about a 10-20% speedup. Compile time seems to have increased by anywhere from 2-5 microseconds. Code size got a 2-3% reduction.
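[Editorial sketch] A C++ sketch of the invalidation-depth flush (the signature and the Entry fields are assumptions): everything outstanding is written back once so the flushed stack is coherent, but entries below the invalidation depth stay tracked and keep propagating, like the UINT24 constant in the example.

  #include <cstdint>
  #include <vector>

  struct Entry {
      bool tracked = false;   // does this depth hold an unmaterialized value?
      bool spilled = false;   // already written to its stack slot?
  };

  template <typename EmitStore>
  void flushTracker(std::vector<Entry>& stack, uint32_t invalidateDepth,
                    EmitStore emitStore) {
      for (uint32_t d = 0; d < stack.size(); d++) {
          Entry& e = stack[d];
          if (!e.tracked || e.spilled)
              continue;                 // nothing to do, or already on the stack
          emitStore(d);                 // materialize stack[d]
          e.spilled = true;             // take care not to spill values twice
          if (d >= invalidateDepth)
              e.tracked = false;        // clobbered: stop propagating this entry
      }
  }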
Attachment #438231 - Attachment is obsolete: true
Could try imputing GVAR-ness to undeclared free names that are assigned before being used.

/be
The major refactoring of this patch is done, but there are a few trace-test failures remaining. For everything but the one benchmark I initially tried (bitops-bits-in-byte), there is little perf gain (~10ms on SS). This makes sense: since we have no fast paths, for most benchmarks we're basically just getting tiny load/store collapses that are undone as we hit slow cases. The reason bits-in-byte got faster is that we have fast paths for bit ops. To test this theory I added register-allocated fast paths for ADD and SUB, and bam - 70ms faster. I'm going to finish this baseline patch up, and tackle fast paths and improving register allocation in a separate bug.
Exciting results! Was the 70ms comparing (trunk + ADD,SUB fast paths) vs. (patch + ADD,SUB fast paths) or was it (trunk) vs. (patch + ADD,SUB fast paths)?
http://hg.mozilla.org/users/danderson_mozilla.com/jaegermonkey/rev/98f7d8deae38 (part 1)
http://hg.mozilla.org/users/danderson_mozilla.com/jaegermonkey/rev/6a1e87dca3df (part 2)

This brings us closer to not being stacky, but there's still work to do:

* Fast paths
* More register allocation (locals)
* Using type information in the tracker to peephole-optimize

I'll file follow-up bugs for those. To answer the question in comment #16, that result did not test pre-regalloc with a fast path for ADD.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED