Closed Bug 1361474 Opened 7 years ago Closed 8 months ago

Proposal: Improve interpreter speed in SpiderMonkey

Categories

(Core :: JavaScript Engine, enhancement, P3)

RESOLVED INCOMPLETE

People

(Reporter: djvj, Unassigned)

Details

(Keywords: triage-deferred)

Attachments

(3 files)

Attached file RESULTS.TXT
This grew out of an e-mail discussion.  The hypothesis I have is that we have a lot to gain from optimizing our interpreter speed through a combination of measures:

1. Optimize our instruction set to allow for more addressing modes (locals, args, etc.) to reduce the number of interpreter instruction dispatches.
2. Thread the interpreter execution more aggressively.
3. Lift the Inline-Cache optimization techniques we use in the baseline JIT right into the interpreter.

There's some question of whether this would actually improve perf, since we know that many instructions are executed exactly once.  As the interpreter is easy enough to instrument, I added a |printf| to it to print out every (filename, lineno, column, pc) tuple the interpreter executed.
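
For reference, a minimal sketch of what the instrumentation looks like, with stand-in types and names (ScriptInfo, LogDispatch) rather than the real SpiderMonkey ones - the actual patch against the interpreter is attached:

#include <cstdio>
#include <cstdint>

struct ScriptInfo {            // stand-in for the real script object
  const char* filename;
};

// Called once per interpreter dispatch, just before the opcode handler runs.
static void LogDispatch(const ScriptInfo& script, unsigned lineno,
                        unsigned column, const uint8_t* scriptCode,
                        const uint8_t* pc) {
  // One line per dispatch: filename, line, column, pc offset within the script.
  std::fprintf(stderr, "DISPATCH %s:%u:%u @%zu\n",
               script.filename, lineno, column, size_t(pc - scriptCode));
}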

I then built a browser with this change and ran it across a bunch of sites (google docs, visiting my gmail and reading a mail, visiting cnn and clicking on an article, visiting amazon and clicking on an item, visiting facebook front page).  I left baseline and ion enabled - so the numbers collected here should reflect the actual dispatches the interpreter does across all instructions during "normal" execution now.

Then I collected instruction execution counts for each unique bytecode location, and inverted that table to get a mapping from number-of-times-executed to number-of-instructions-executed-that-many-times.  See attachment for raw results.

The summary is this:

3.9 million unique bytecode locations visited
18.5 million instructions dispatched by interpreter

1.7 million unique bytecode locations dispatched exactly once.
0.77 million unique bytecode locations dispatched exactly twice.
1.5 million unique bytecode locations dispatched three or more times.

Scaling each of these by the number of times each bytecode location was dispatched by the interpreter:

1.7 million single-dispatches.
1.55 million double-dispatches.
15.2 million multi-dispatches (bytecode location dispatched >= 3 times).

Let's say we are able to attach an optimized IC for each of these instructions (caveat: that's a simplification since many of them will be PUSH/POP/GETLOCAL/etc.).

Let's further say that on the first execution, attaching an IC costs us 1 performance point.  On the second execution, using the IC gains us 1 performance point, and each subsequent execution gains us 1 performance point as well. (caveat: we're not considering IC chains longer than one)

Under these assumptions, even if we "optimize" every instruction at first run and eat the cost of optimization for bytecode locations that are dispatched exactly once, 74% of dispatches will be "pure win" dispatches (third or later dispatch on the same bytecode location).

I honestly think we are leaving a lot of performance on the table with our interpreter - especially load-time performance.

Thoughts?
Patch of instrumentation to interpreter to print bytecode locations on dispatch.
Attached file parse.py
Python script to get dispatch-counts to bytecode-locations table.
I wonder how much direct threading speeds things up vs. the indirect threading that we have now, as well as how much code size increases. A lot of this strikes me, at first blush, as almost an alternative Baseline.
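
(For concreteness, a minimal toy sketch of the two dispatch styles, using the GCC/Clang computed-goto extension; the opcodes and handlers are made up, not SpiderMonkey's. The "indirect" loop goes through an opcode-indexed table on every dispatch; the "direct" loop pre-translates the bytecode into handler addresses so dispatch is a single indirect jump, at the cost of a pointer-sized slot per instruction.)

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

enum Op : uint8_t { OP_INC, OP_HALT };

// Token/indirect-threaded dispatch, roughly the current style: load the
// opcode byte, then load the handler address from a table.
static int RunIndirect(const uint8_t* pc) {
  static void* const table[] = { &&inc, &&halt };
  int acc = 0;
#define DISPATCH() goto *table[*pc++]
  DISPATCH();
inc:  ++acc; DISPATCH();
halt: return acc;
#undef DISPATCH
}

// Direct-threaded dispatch: each bytecode slot already holds the handler
// address, so dispatch needs no table load.
static int RunDirect(const uint8_t* code, size_t len) {
  static void* const table[] = { &&inc, &&halt };
  std::vector<void*> threaded(len);
  for (size_t i = 0; i < len; i++) threaded[i] = table[code[i]];
  void** tpc = threaded.data();
  int acc = 0;
#define DISPATCH() goto *(*tpc++)
  DISPATCH();
inc:  ++acc; DISPATCH();
halt: return acc;
#undef DISPATCH
}

int main() {
  const uint8_t code[] = { OP_INC, OP_INC, OP_HALT };
  std::printf("%d %d\n", RunIndirect(code), RunDirect(code, sizeof code));
  return 0;
}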

As for the different addressing modes -- how much would this complicate the bytecode compiler? Any gains would have to pay back whatever the frontend loses, after all.

For the ICs, Stefan Brunthaler's "quickening" paper [1] comes to mind. What he calls quickening is just ICs.

[1] https://www.sba-research.org/wp-content/uploads/publications/dls10.pdf
I suspect the bytecode compiler wouldn't speed up or slow down much either way.  You'd be doing a bit more work to figure out the proper instruction to emit, but it would be in lieu of emitting the stack shuffling ops that a traditional stack-based machine would have to emit.

There would be some concern about the potential to blow the L1 instruction cache (32 KB these days; assuming 64 bytes per op, that gives us 512 ops that'll fit in icache at one time), but we should be able to mitigate that with cache hint instructions that pull in the next instruction's codeblock at the top of the execution of the current one.
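
(A sketch of that mitigation in compiler-intrinsic terms, assuming the fixed-size-handler layout described in comment 7 where the handler for opcode N lives at interpBase + (N << 6). The names are hypothetical, and __builtin_prefetch is strictly a data-cache hint - a real implementation would likely want an instruction-cache prefetch such as AArch64's PRFM PLIL1KEEP.)

#include <cstddef>
#include <cstdint>

static constexpr unsigned kHandlerShift = 6;   // 64 bytes per handler

// While the current opcode's handler is still executing, hint the cache
// about the next opcode's handler so its code is (hopefully) resident by
// the time we dispatch to it.
static inline void HintNextHandler(const uint8_t* interpBase,
                                   uint8_t nextOpcode) {
  __builtin_prefetch(interpBase + (size_t(nextOpcode) << kHandlerShift),
                     /* rw = read */ 0, /* locality */ 3);
}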
(In reply to Kannan Vijayan [:djvj] from comment #4)
> I suspect the bytecode compiler wouldn't speed up or slow down much either
> way.  You'd be doing a bit more work to figure out the proper instruction to
> emit, but it would be in lieu of emitting the stack shuffling ops that a
> traditional stack-based machine would have to emit.

How would instruction selection work? I thought you were talking about some kind of hybrid stack/register machine. Are there unlimited virtual registers? I'm afraid of having to run even a stupid regalloc algorithm.
Nothing that complicated.  Just that if you're looking at an AST fragment for:

x = a + (b * c) + d.bar

The wide-opcode hybrid machine would want to generate:

MUL_LLS(b, c)         // Push(GetLocal(b) * GetLocal(c))
GETPROP_LS(d, bar)    // Push(GetProp(GetLocal(d), bar))
ADD_LSS(a)            // Push(GetLocal(a) + Pop())
ADD_SSL(x)            // SetLocal(x, Pop() + Pop())

instead of something more traditionally stack-machine-oriented like:

LOAD_LS(a)      // Push(GetLocal(a))
LOAD_LS(b)      // Push(GetLocal(b))
LOAD_LS(c)      // Push(GetLocal(c))
MUL_SSS         // Push(Pop() * Pop())
ADD_SSS         // Push(Pop() + Pop())
LOAD_LS(d)      // Push(GetLocal(d))
GETPROP_SS(bar) // Push(GetProp(Pop(), bar))
ADD_SSS         // Push(Pop() + Pop())
STORE_SL(x)     // SetLocal(x, Pop())

The bytecode compiler would have to peek into the AST node's children to see if there are convenient leaf nodes that are amenable to the hybrid instructions, instead of walking the tree blindly and generating code a node at a time.  This is not that complicated (I wrote toy Python code for this a year back when I first thought it up).
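
A rough sketch of what that peek looks like, with made-up Emitter/AST/opcode names (only the multiply case shown):

#include <cstdint>
#include <vector>

// Everything here is illustrative, not real SpiderMonkey code.
enum : uint8_t { OP_GETLOCAL, OP_MUL_SSS, OP_MUL_LLS };

struct Emitter {
  std::vector<uint8_t> code;
  void op(uint8_t o) { code.push_back(o); }
  void op16(uint8_t o, uint16_t a) {
    op(o);
    code.push_back(a & 0xff);
    code.push_back(a >> 8);
  }
  void op16x2(uint8_t o, uint16_t a, uint16_t b) {
    op16(o, a);
    code.push_back(b & 0xff);
    code.push_back(b >> 8);
  }
};

enum class NodeKind { Mul, Local };
struct Node { NodeKind kind; const Node* left; const Node* right; uint16_t slot; };

static void emitExpr(Emitter& e, const Node* n);

// Peek at the children: if both operands are locals, emit the wide form
// directly (one dispatch, no pushes); otherwise fall back to the generic
// stack form (three or more dispatches).
static void emitMul(Emitter& e, const Node* n) {
  if (n->left->kind == NodeKind::Local && n->right->kind == NodeKind::Local) {
    e.op16x2(OP_MUL_LLS, n->left->slot, n->right->slot);
    return;
  }
  emitExpr(e, n->left);
  emitExpr(e, n->right);
  e.op(OP_MUL_SSS);
}

static void emitExpr(Emitter& e, const Node* n) {
  if (n->kind == NodeKind::Local) {
    e.op16(OP_GETLOCAL, n->slot);
    return;
  }
  emitMul(e, n);  // n->kind == NodeKind::Mul
}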

The two instruction sequences above also show the promise of this approach - the stack-machine approach forces the interpreter to perform 9 dispatches to execute the code.  The wide-opcode approach takes 4 dispatches.

Note the equivalent register-machine style instructions for this:

MUL b, c, tempReg
ADD a, tempReg, tempReg
GETPROP d, bar, tempReg2
ADD tempReg, tempReg2, x

Also involve just 4 dispatches.

So the wide-opcode approach gets us down to register-machine levels for number-of-dispatches for executing code.

The nice thing is, it's more space efficient than the register-machine style.  Let's compare the encoding space for all three variants.  The stack and register machines have narrow opcode spaces, so they can encode their opcode in 1 byte.  All the approaches use 2-byte operands.

The register machine uses (7 + 7 + 7 + 7 = 28) bytes for its code.
The stack machine uses (3 + 3 + 3 + 1 + 1 + 3 + 3 + 1 + 3 = 21) bytes.
The wide-opcode machine uses (6 + 6 + 4 + 4 = 20) bytes.

We actually hit bytecode sizes SMALLER than both the register and stack machine.  We get the best of both worlds: the compactness of stack-machine style because we can use the stack implicitly where appropriate, and the dispatch-efficiency of register-machine style because we can address operands freely.

The reason for this characteristic is that both stack-machines and register-machines force a particular addressing mode.  Stack machines always address stack (with special instructions to move things onto the stack), and register machines always address a register file (with special instructions to move things onto the register file).

By specializing operations with information about where their inputs/outputs are flowing to, we gain the flexibility to choose the best option depending on circumstance.

And this whole business comes without the complexity of things like "fused instruction" approaches like ADDMUL (*shudder*), because we're not trying to fuse operations that do "real work".  We're just eliminating the data shuffling operations that usually loiter around our core workhorse instructions.

The price we pay for that flexibility is a bigger opcode space, but that's ok because we've always been underutilizing opcode space - both stack machines and register machines underutilize opcode spaces.  If we just buy ourselves some breathing room in our opcodes (an extra byte), we gain enough flexibility to shove all that data shuffling away.
Also note that while this means our interpreter has a number of variants for workhorse instructions, if we use a direct-threaded interpreter (each instruction gets a fixed-size region of machine instructions, and instruction dispatch becomes JUMP [baseAddr + (opcode << N)]), we can still have just one implementation of the actual operation, and the code for each instruction variant just does the data shuffling around it:

Pseudo-machine code for ADD_LLS:

; read local operands into r0 and r1
LOAD16 [bytecodePc + 2], r0
LOAD16 [bytecodePc + 4], r1
LOADWORD [framePointer + r0*8], r0
LOADWORD [framePointer + r1*8], r1

; pre-load next dispatch address for instruction cache hint.
LOAD16 [bytecodePc + 6], r2
ICACHE_HINT [interpBaseAddress + (r2 << 6)]

; call add implementation, result will be in r0.
CALL [addImplementation]

; store result onto stack
PUSHWORD r0

; advance past this instruction (2-byte opcode + two 2-byte operands)
ADD bytecodePc, 6

; dispatch next instruction
JUMP [interpBaseAddress + (r2 << 6)]
A couple of related points:

The compilation strategy can be adapted to the code being compiled, e.g., for straight-line top-level code and the top-level code in top-level IIFEs (etc.) one should never try to match "better" opcodes, because chances are they'll only run once and the overhead of matching is (probably) not worth it, while function bodies in general and loop bodies in particular might reward more optimization.  That generalizes to figuring out whether functions might only be invoked once, which, with lazy function compilation, we might already know something about.  Hard to say what's worthwhile.

I've always been curious about, with threaded interpreters, whether prefetching the destination address early in the opcode might be a win.  Since it's loaded from memory it may be hard for the microarchitecture to prove that it won't be affected by the opcode implementation, but prefetching the target address would allow the CPU to start fetching at the target address before the opcode is done.  (I guess comment 7 also addresses this.)

(I also think that the instruction patterns that most benefit from being fused should be selected based on a representative corpus of code; I suspect the patterns that are executed most frequently may not be the ones we would naively expect, though presumably some profiling of those patterns has been done previously.   I also think that fusing should be performed by a simple automaton in the bytecode emitter that fuses adjacent instructions in the emitted stream, and should not be performed ad hoc; see the sketch below.  This is simple and reliable, and additionally an adaptive compilation strategy as I outlined above is a matter of enabling or disabling the automaton.)
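
(A sketch of that automaton shape, with made-up opcode names: it watches the last two emitted ops and, when enabled, rewrites GETLOCAL a; GETLOCAL b; ADD into a wide ADD_LLS a, b as the ADD is emitted.)

#include <cstdint>
#include <vector>

enum : uint8_t { OP_GETLOCAL, OP_ADD, OP_ADD_LLS };

struct PeepholeEmitter {
  bool enabled = true;            // the adaptive on/off switch
  std::vector<uint8_t> code;
  int prev[2] = { -1, -1 };       // offsets of the last two emitted ops

  // Append an op with 0, 2, or 4 bytes of little-endian operands.
  void raw(uint8_t op, int operandBytes, uint16_t a = 0, uint16_t b = 0) {
    prev[0] = prev[1];
    prev[1] = int(code.size());
    code.push_back(op);
    if (operandBytes >= 2) { code.push_back(a & 0xff); code.push_back(a >> 8); }
    if (operandBytes >= 4) { code.push_back(b & 0xff); code.push_back(b >> 8); }
  }

  void getLocal(uint16_t slot) { raw(OP_GETLOCAL, 2, slot); }

  void add() {
    if (enabled && prev[0] >= 0 &&
        code[prev[0]] == OP_GETLOCAL && code[prev[1]] == OP_GETLOCAL) {
      // Fold the two loads into the wide form: no stack traffic at all.
      uint16_t a = code[prev[0] + 1] | (code[prev[0] + 2] << 8);
      uint16_t b = code[prev[1] + 1] | (code[prev[1] + 2] << 8);
      code.resize(prev[0]);
      prev[0] = prev[1] = -1;
      raw(OP_ADD_LLS, 4, a, b);
      return;
    }
    raw(OP_ADD, 0);
  }
};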
I think first we need some data on how much time we actually spend *in* (not under) Interpret itself. I've looked at a lot of profiles and I don't think it's a lot of time, but maybe I'm wrong.

As mentioned, making opcodes fatter will bloat caches and will regress memory usage - we have tons of scripts.

Honestly, I've been pretty happy with our current dumb/compact bytecode, so we need some very convincing numbers to complicate that design. SpiderMonkey used to have these fused opcodes that caused all kinds of performance and correctness nightmares for the JITs.
Sorry, I don't want to sound too negative - I'm all for exploring our options here and making things faster.

I'm just saying it would be good to start with some real-world profiles (ideally from both the Gecko profiler and an external profiler) to get a better sense of both the maximum and the expected gains (in ms and %).
I find the addressing mode interesting memory wise:

If we consider the addressing mode to be one byte, then any opcode with:
 - 2 locals would be smaller by 1 byte.
 - 1 local would be identical in size.
 - no locals would be larger by 1 byte.

I am also concerned about the size of the interpreter, and thus I wonder if we could do a double dispatch: one dispatch for the addressing mode, and one dispatch for the operation. The addressing-mode dispatch would basically do a load-effective-address for where the operands are located, and the operation would read them.
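
A rough sketch of that double-dispatch shape, with made-up names (Value, Frame, the handlers): the addressing-mode handler materializes operand pointers, and the operation handler only ever sees those pointers.

#include <cstddef>
#include <cstdint>

using Value = int64_t;          // stand-in for the real boxed value type

struct Frame {
  Value* locals;                // frame slots
  Value* sp;                    // value-stack pointer (grows upward)
};

struct Operands {
  Value* in0;
  Value* in1;
  Value* out;
};

// Dispatch 1: the addressing mode. `pc` points at the operand bytes.
// Returns how many operand bytes were consumed.
static size_t ModeLL(const uint8_t* pc, Frame& f, Operands& ops) {
  // Two locals in, result pushed onto the stack.
  ops.in0 = &f.locals[pc[0] | (pc[1] << 8)];
  ops.in1 = &f.locals[pc[2] | (pc[3] << 8)];
  ops.out = f.sp++;
  return 4;
}

static size_t ModeSS(const uint8_t*, Frame& f, Operands& ops) {
  // Two stack operands in, result pushed back onto the stack.
  ops.in1 = --f.sp;
  ops.in0 = --f.sp;
  ops.out = f.sp++;
  return 0;
}

// Dispatch 2: the operation, written once, independent of addressing mode.
static void OpAdd(const Operands& ops) { *ops.out = *ops.in0 + *ops.in1; }

The main loop would then do two chained indirect jumps per instruction: first to the mode handler, then to the operation handler.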

Having smaller bytecode is interesting, as we could save memory when saving the bytecode to the cache.

I also find it interesting performance-wise:

Likewise, doing a double dispatch for opcodes with:
 - 2 locals saves 1 dispatch.
 - 1 local keeps the same number of dispatches.
 - no locals costs 1 dispatch.

The addressing mode also offers the ability to skip the stack for the locals, which implies that loading from a local to do an operation removes the store & load to/from the top of the stack, as well as the stack-pointer mutations related to this operation.  Opcodes (2 operands) with:
 - 2 locals save 2 loads & 2 stores and 2 stack-pointer mutations.
 - 1 local saves 1 load & 1 store and 2 stack-pointer mutations. (the number of stack slots remains identical)
 - no locals have the same cost.

I think it would be interesting to have the number of LOADs & STOREs (= L) and compare it to the number of other operations (= O), in which case we could save X bytes and X dispatches (X = L - O) compared to our current interpreter.
(In reply to Lars T Hansen [:lth] from comment #8)
> A couple of related points:
> 
> The compilation strategy can be adapted to the code being compiled, eg, for
> straight-line top-level code and the top-level code in top-level IIFEs (etc)
> one should never try to match "better" opcodes because chances are they'll
> only run once and the overhead of matching is (probably) not worth it.

This is a good point.  For expressions in top-level scripts that are not contained in loops, a more naive instruction generation strategy would work fine.

> (I also think that the instruction patterns that most benefit from being
> fused should be selected based on a representative corpus of code; I suspect
> the patterns that are executed most frequently may not be the ones we would
> naively expect, though presumably some profiling of those patterns has been
> done previously.   I also think that fusing should be performed by a simple
> automaton in the bytecode emitter that fuses adjacent instructions in the
> emitted stream, and should not be performed ad-hoc.  This is simple and
> reliable, and additionally an adaptive compilation strategy as I outlined
> above is a matter of enabling or disabling the automaton.)

I'd really prefer to not call this "fused ops".  It suggests that we're doing things like ADDMUL, which we absolutely don't want to do.  This is just embedding some of the data flow information into the opcode.  I really don't want to consider "fusing" of instructions that do actual work, just the data shuffling bits, which is what this gets at.

The "peephole optimization" approach you're suggesting though sounds like a reasonable approach.


(In reply to Jan de Mooij [:jandem] from comment #9)
> I think first we need some data on how much time we actually spend *in* (not
> under) Interpret itself. I've looked at a lot of profiles and I don't think
> it's a lot of time, but maybe I'm wrong.
> 
> As mentioned, making opcodes fatter will bloat caches and will regress
> memory usage - we have tons of scripts.
> 
> Honestly I've been pretty happy with our current dumb/compact bytecode so we
> need some very convincing numbers to complicate that design. SpiderMonkey
> used to have these fused opcodes that caused all kinds of performance and
> correctness nightmares for the JITs.

My vtune analysis showed 8% of JS-execution time in js::Interpret alone.  That's not js::Interpret plus children, just direct time spent inside the function itself, measured on a webpage loading a google docs presentation and scrolling around a bit.

Don't worry about sounding negative.  It's good to have voices of caution ensuring we don't step blindly into big work.

But I really really want to emphasize this: this is NOT, I repeat NOT, fused opcodes.  Basically, all this is a hybrid between register and stack machines.  We know for a fact that register machines are more dispatch-efficient because they embed operand info in the instructions.  We know for a fact that stack machines have more compact bytecode because of the use of implicit operands in stack ops.

All we're doing here is building a hybrid approach that allows us the flexibility to choose which style to use on an instruction-to-instruction basis.  We are NOT fusing operations.

The only instructions that would go away are things like GETLOCAL, SETLOCAL, PUSH, POP, SWAP, etc. - infallible operations that are pure data shuffling.  This approach is basically saying "stack machine? register machine? why not both?"
To wit, the JSC people mentioned that they have something like CacheIR and that their llint interpreter is able to use the same stubs as their JITs.
(In reply to Luke Wagner [:luke] from comment #13)
> To wit, the JSC people mentioned that they have something like CacheIR and
> that their llint interpreter is able to use the same stubs as their JITs.

Honestly I think that's where the biggest interpreter wins are going to come from.  It would be reasonably straightforward to do this on top of our current interpreter.  For example, a JSOP_GETPROP_IC instruction could be added which has an extra operand - the index of the IC in the script's IC table.  The interpreter uses that to look up the IC and call into it.
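
A sketch of the shape being described, with hypothetical names (ICEntry, icTable, the stub signature) - not the real CacheIR/Baseline interfaces. Stubs would be attached from the slow path on a miss, much like Baseline's fallback stubs do today:

#include <cstdint>
#include <vector>

using Value = uint64_t;                 // stand-in for js::Value

// One IC entry per *_IC bytecode site, owned by the script.
struct ICEntry {
  // First stub in the chain; null until something gets attached.
  bool (*stub)(ICEntry& ic, const Value& obj, Value* result) = nullptr;
};

struct Script {
  std::vector<ICEntry> icTable;         // indexed by the bytecode's IC operand
};

// Slow path: the existing generic GetProperty; it may also decide to
// attach a stub to `ic` for next time (elided here).
static bool GetPropSlow(ICEntry& ic, const Value& obj, Value* result) {
  (void)ic; (void)obj;
  *result = 0;
  return true;
}

// Interpreter case for a hypothetical JSOP_GETPROP_IC. The extra 16-bit
// operand is the index into the script's IC table; in practice the call
// into the stub would go through a trampoline, since the interpreter's
// ABI and stack layout differ from Baseline's.
static bool InterpretGetPropIC(Script& script, const uint8_t* pc,
                               const Value& obj, Value* result) {
  uint16_t icIndex = uint16_t(pc[1] | (pc[2] << 8));
  ICEntry& ic = script.icTable[icIndex];
  if (ic.stub && ic.stub(ic, obj, result)) {
    return true;                        // stub hit: fast path
  }
  return GetPropSlow(ic, obj, result);  // miss or no stub attached yet
}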

We'd take a hit from the fact that our interpreter stack isn't on the C stack, and our baseline ABI is not the same as the interpreter's ABI, so we'd have to jump through a trampoline to get into the ICs and back.  But it should still be faster than our slowpath.

Taking Lars' good observation that top-level scripts - and other run-once scripts such as (function() { ... })() - would not benefit much from optimization, the bytecodegen for those scripts could emit the old-style instructions instead of the new JSOP_*_IC instructions, and we'd never incur IC overhead for those scripts.

This is a good short-term approach; it should be possible to investigate it pretty quickly and deliver educated/speculative numbers in short order.
My intuition here is that applying CacheIR to the interpreter gets us "smoother" performance, since it'll act as a sub-Baseline perf tier that ramps up over time. That's definitely good. Not sure if this helps with cold code performance as much. llint's performance gains for the cold code cases, I thought, are due to the ll part.
I think you're probably right that the cold-load perf is more influenced by raw interpreter speed.

But moving IC techniques to the interpreter will likely lead to _better_ performance, not just smoother performance.  The numbers in the first comment show that 74% of interpreter instruction dispatches are happening at bytecode locations that have been dispatched to at least twice already.

Interpreter execution seems to be very fat-tailed with respect to repeated dispatch of the same bytecode.  Yes there's a big chunk of stuff that gets executed only 1 or 2 times, but the volume of stuff under "3 or more dispatches" is a super-majority.
Keywords: triage-deferred
Priority: -- → P3
Severity: normal → S3
Status: NEW → RESOLVED
Closed: 8 months ago
Resolution: --- → INCOMPLETE