Closed Bug 799777 Opened 12 years ago Closed 12 years ago

Unravel the threaded interpreter

Categories

(Core :: JavaScript Engine, defect)

Tracking

RESOLVED FIXED
mozilla19

People

(Reporter: terrence, Assigned: terrence)

Details

(Whiteboard: [js:t])

Attachments

(1 file)

Given the JITs, it is no longer clear that the extra complexity of the threaded interpreter loop is paying for useful performance.  Since it is easy to pref off, I ran some quick tests:

SunSpider (Interpreter)
=======================
Threaded:  3,031.2ms
Unraveled: 3,283.8ms (1.083x slower)

V8 (JIT)
========
Threaded:  13,193.33ms
Unraveled: 13,196.67ms (same)

SunSpider (JIT)
===============
Threaded:  250.8ms
Unraveled: 221.2ms (1.134x faster)

I'm not sure what mechanism could be causing the speedup on SS under IonMonkey; what is clear, however, is that the threaded interpreter is not winning us much of anything.
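
For anyone who hasn't dug into jsinterp.cpp: "threaded" here refers to the dispatch style of the interpreter loop (computed-goto vs. a plain switch), not OS threads. A stripped-down toy showing the difference -- invented opcodes and structure, not our actual interpreter code:

#include <cstdint>
#include <cstdio>

enum Op : uint8_t { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

// "Unraveled" dispatch: one big switch, one indirect branch shared by all ops.
static void runSwitch(const uint8_t* pc) {
    int stack[16]; int sp = 0;
    for (;;) {
        switch (*pc++) {
          case OP_PUSH1: stack[sp++] = 1;                   break;
          case OP_ADD:   sp--; stack[sp - 1] += stack[sp];  break;
          case OP_PRINT: printf("%d\n", stack[sp - 1]);     break;
          case OP_HALT:  return;
        }
    }
}

#if defined(__GNUC__)
// Threaded dispatch: each op ends with its own computed goto (a GCC/Clang
// extension, which is why MSVC builds never had it), giving the branch
// predictor a separate indirect jump per opcode implementation.
static void runThreaded(const uint8_t* pc) {
    static void* const dispatch[] = { &&do_push1, &&do_add, &&do_print, &&do_halt };
    int stack[16]; int sp = 0;
    goto *dispatch[*pc++];
  do_push1: stack[sp++] = 1;                  goto *dispatch[*pc++];
  do_add:   sp--; stack[sp - 1] += stack[sp]; goto *dispatch[*pc++];
  do_print: printf("%d\n", stack[sp - 1]);    goto *dispatch[*pc++];
  do_halt:  return;
}
#endif

int main() {
    const uint8_t prog[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
    runSwitch(prog);
#if defined(__GNUC__)
    runThreaded(prog);
#endif
    return 0;
}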
Seriously, a 29ms speedup on SS?!!  Are you sure you are measuring the right thing?

Anyhow, assuming it isn't a slowdown with the JITs, I'd like to see it go. Do you see any reason why not, Dave?
(In reply to Terrence Cole [:terrence] from comment #0)
> Given the JITs, it is no longer clear that the extra complexity of the
> threaded interpreter loop is paying for useful performance.  Since it is
> easy to pref off, I ran some quick tests:
> 
> SunSpider (Interpreter)
> =======================
> Threaded:  3,031.2ms
> Unraveled: 3,283.8ms (1.083x slower)
> 
> SunSpider (JIT)
> ===============
> Threaded:  250.8ms
> Unraveled: 221.2ms (1.134x faster)
> 
> I'm not sure what mechanism could be causing the speedup on SS under
> IonMonkey, however, what is clear is that the threaded interpreter is not
> winning us much of anything.

IonMonkey uses the jsinterpinline functions, so shared cache lines between Ion and the interpreter might explain a little of this (which I would qualify as a random, unmaintainable perf improvement).  If it persists with a higher number of runs, can you check with a cache simulator such as valgrind, oprofile, or VTune?
Attached patch v0: rip it out
Sean is investigating the potential perf win under IM as we speak.
Attachment #669812 - Flags: review?(luke)
There doesn't appear to be a performance difference with or without JS_THREADED_INTERP. Locally, both result in SS runs of ~245.5ms.
Comment on attachment 669812 [details] [diff] [review]
v0: rip it out

Good riddance.
Attachment #669812 - Flags: review?(luke) → review+
(In reply to Luke Wagner [:luke] from comment #1)
> Seriously, a 29ms speedup on SS?!!  Are you sure you are measuring the right
> thing?
> 
> Anyhow, assuming it isn't a slow down with the jits, I'd like to see it go. 
> Do you see any reason why not Dave?

Fascinating. Go ahead and take it out if it doesn't slow down the standard with-JITs configuration. Be sure to check ARM as well.

Historical note for the curious + some remarks for the future: the threaded interpreter was bug 121414 and goes back to 2004-2005. It is one of the oldest big optimizations for SpiderMonkey in this Bugzilla database. SunSpider didn't even come out until a few years later, but I think the threaded interpreter was a notable (10%+) speedup for it.

Given that we've never had the threaded interpreter on Windows (MSVC doesn't have the right extensions) and our Windows perf is good, it seems very likely that we don't need the threaded interpreter any more. A fast interpreter would be a great thing to have in the future. These days, it looks like for a fast interpreter you need a register-based bytecode, no profiling or recording or any features like that, and quite probably custom code generation à la LLint.
https://hg.mozilla.org/mozilla-central/rev/44079242ee9b
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla19
(In reply to David Mandelin [:dmandelin] from comment #7)
> Given that we've never had the threaded interpreter on Windows (MSVC doesn't
> have the right extensions) and our Windows perf is good, it seems very
> likely that we don't need the threaded interpreter any more. A fast
> interpreter would be a great thing to have in the future. These days, it
> looks like for a fast interpreter you need a register-based bytecode, no
> profiling or recording or any features like that, and quite probably custom
> code generation à la LLint.

I've been thinking about the LLint implementation and I believe there is potential for a really performant interpreter design that adopts LLint techniques but applies them on top of a stack architecture.

The big advantage of a register-based architecture is instruction density, but we can achieve the same gains in a stack arch with two techniques (a toy sketch of both follows the list):
1. instruction coalescing - creating "bundle" instructions for common sequences like "dup getprop swap" (which shows up for every method call), "dup <OP>", "<OP> pop", "int8 getelem", "int8 bitand", etc.
2. extending stack instructions to be able to get operands from deeper in the stack than just the top N values.
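
To make (1) and (2) concrete, here is a toy stack VM -- the opcode names are made up for illustration and are not real JSOPs:

#include <cstdint>
#include <cstdio>

enum Op : uint8_t {
    OP_INT8,          // push the next byte as an int
    OP_BITAND,        // pop a, pop b, push a & b
    OP_INT8_BITAND,   // (1) coalesced bundle: "int8 bitand" in one dispatch
    OP_PICK,          // (2) push a copy of the value <operand> slots below the top
    OP_ADD,           // pop a, pop b, push a + b
    OP_PRINT,
    OP_HALT,
};

static void run(const uint8_t* pc) {
    int stack[16]; int sp = 0;
    for (;;) {
        switch (*pc++) {
          case OP_INT8:        stack[sp++] = (int8_t)*pc++;              break;
          case OP_BITAND:      sp--; stack[sp - 1] &= stack[sp];         break;
          case OP_INT8_BITAND: stack[sp - 1] &= (int8_t)*pc++;           break;
          case OP_PICK:        stack[sp] = stack[sp - 1 - *pc++]; sp++;  break;
          case OP_ADD:         sp--; stack[sp - 1] += stack[sp];         break;
          case OP_PRINT:       printf("%d\n", stack[sp - 1]);            break;
          case OP_HALT:        return;
        }
    }
}

int main() {
    // (x & 7) + x with x = 12, without ever reloading x from a memory slot:
    const uint8_t prog[] = {
        OP_INT8, 12,          // stack: 12
        OP_PICK, 0,           // stack: 12 12   (copy from depth 0, i.e. dup)
        OP_INT8_BITAND, 7,    // stack: 12 4    (one dispatch instead of two)
        OP_ADD,               // stack: 16
        OP_PRINT, OP_HALT,
    };
    run(prog);
    return 0;
}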

A stack also has a distinct notion of locality not present in register files: the nearer to the top a value on the stack is, the more likely it's going to be used soon.  We can take advantage of this to move the top of the virtual stack into a reserved range of registers.  The "top" of the stack can vary within this range (however, the register-mapping of the top of the stack will be statically mappable for any bytecode PC).  Once we have this, we can actually parametrize the various opcode implementations for every possible "optimized stack state".

e.g. Given a register range (%r12, %r13, %r14, %r15) for the top-of-stack, JSOP_ADD<3> will assume that the top 3 values of the stack are in (%r12, %r13, and %r14).  Its entire implementation will require no loads or stores, and when it exits, it'll leave the stack in a state where the top 2 values are in registers (%r12 and %r13).

Whatever opcode follows that add, the bytecode generator will choose the variant of that opcode with parametrization <2>.  Every opcode implementation will have five variants, each of which assumes a different stack state.  The bytecode emitter will know the optimized stack state at any given PC, so it can emit the correct sequence of variants as it generates them.
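
Very roughly, and ignoring how the variants would actually be stitched together (a real implementation would pin the cached values in machine registers; plain locals stand in for %r12-%r14 in this sketch), the data-movement difference between the <0> and <3> variants of an add looks like:

#include <cstdio>

struct VM {
    int mem[64];   // the in-memory part of the operand stack
    int sp = 0;    // number of values spilled to mem
};

// add with 0 values cached: two loads and one store against the memory stack.
static void addCached0(VM& vm) {
    int rhs = vm.mem[--vm.sp];
    int lhs = vm.mem[--vm.sp];
    vm.mem[vm.sp++] = lhs + rhs;
}

// add with the top 3 values cached (r2 on top): no loads or stores; on exit
// the top 2 values live in r0 and r1, so the emitter would statically select
// the <2>-variant of whatever opcode follows.
static void addCached3(int r0, int r1, int r2, int& outR0, int& outR1) {
    outR0 = r0;
    outR1 = r1 + r2;
}

int main() {
    // Memory-stack version of 3 + 4:
    VM vm;
    vm.mem[vm.sp++] = 3;
    vm.mem[vm.sp++] = 4;
    addCached0(vm);
    printf("cached0: %d\n", vm.mem[vm.sp - 1]);   // 7

    // Register-cached version of the same add, with a third value (1) below it:
    int r0, r1;
    addCached3(/*bottom*/ 1, /*lhs*/ 3, /*rhs*/ 4, r0, r1);
    printf("cached3: %d %d\n", r0, r1);           // 1 7
    return 0;
}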


By doing this, we can generate optimized bytecode that avoids memory load/store operations over long runs of several instructions, which I suspect will actually yield better performance than an architecture based on a virtual register file, which is hard to optimize in this way.

All of the other LLint-style techniques will still apply.  We can still use "pointers-to-code" as the bytecode ops, and still custom-gen the opcode implementations.
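
For clarity, "pointers-to-code as the bytecode ops" means the compiled bytecode stream stores handler addresses directly, so dispatch never consults an opcode table. A toy version of the layout (ordinary C++ functions and a loop stand in for LLint's custom-generated handlers and tail jumps):

#include <cstdio>

struct VM { int stack[16]; int sp = 0; bool done = false; };
union Cell;
using Handler = const Cell* (*)(const Cell* ip, VM& vm);
union Cell { Handler h; int imm; };

static const Cell* opPushConst(const Cell* ip, VM& vm) {
    vm.stack[vm.sp++] = ip->imm;            // immediate stored in the next cell
    return ip + 1;
}
static const Cell* opAdd(const Cell* ip, VM& vm) {
    vm.sp--; vm.stack[vm.sp - 1] += vm.stack[vm.sp];
    return ip;
}
static const Cell* opPrintHalt(const Cell* ip, VM& vm) {
    printf("%d\n", vm.stack[vm.sp - 1]);
    vm.done = true;
    return ip;
}

int main() {
    // 2 + 3, with handler addresses embedded directly in the "bytecode":
    Cell program[6];
    program[0].h = opPushConst; program[1].imm = 2;
    program[2].h = opPushConst; program[3].imm = 3;
    program[4].h = opAdd;
    program[5].h = opPrintHalt;

    VM vm;
    const Cell* ip = program;
    while (!vm.done) {
        Handler handler = ip->h;
        ip = handler(ip + 1, vm);   // jump to whatever the current cell points at
    }
    return 0;
}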
I just realized that perhaps the above post may be a bit off topic for this bug.  Apologies :)
No problem!  Those are some very interesting ideas.

I was under the impression that we were planning to go the v8 route and create a baseline compiler on top of Ion to more-or-less replace jsinterp.cpp.  It would be nice to be able to compare these methods -- is there a reasonably good way to race v8's baseline compiler against jsc so we don't have to build both?
Yeah, that's the plan: baseline compile à la Crankshaft.

That's a good suggestion to try to compare the LLint vs. baseline-compiler approach.  My educated guess would be that on pure performance, the baseline compiler would win out.  It eliminates the opcode fetch/dispatch overhead (which additionally reduces pollution of the data cache), and otherwise has no real disadvantages compared to the interpreter approach.

The main advantages of the LLint approach are more deterministic memory usage, and operation in restricted environments where we can't map PROT_EXEC memory to write jitcode.  The latter issue might become kind of significant over time, given recent unfortunate trends in computing devices...
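
To spell out the PROT_EXEC point: a baseline JIT fundamentally needs to allocate pages along these lines (POSIX mmap shown for illustration; SpiderMonkey's actual executable allocator differs) and then write machine code into them, which is exactly what locked-down platforms forbid, while an LLint-style interpreter never needs it:

#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main() {
    // Ask for one page that is both writable and executable; W^X or fully
    // locked-down environments fail or forbid exactly this step.
    const std::size_t size = 4096;
    void* code = mmap(nullptr, size, PROT_READ | PROT_WRITE | PROT_EXEC,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (code == MAP_FAILED) {
        perror("mmap(PROT_EXEC)");
        return 1;
    }
    printf("got %zu executable bytes at %p\n", size, code);
    munmap(code, size);
    return 0;
}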
(In reply to Kannan Vijayan [:djvj] from comment #9)
> (In reply to David Mandelin [:dmandelin] from comment #7)
> > Given that we've never had the threaded interpreter on Windows (MSVC doesn't
> > have the right extensions) and our Windows perf is good, it seems very
> > likely that we don't need the threaded interpreter any more. A fast
> > interpreter would be a great thing to have in the future. These days, it
> > looks like for a fast interpreter you need a register-based bytecode, no
> > profiling or recording or any features like that, and quite probably custom
> > code generation à la LLint.
> 
> I've been thinking about the LLint implemntation and I believe there is
> potential for a really performant interpreter design which adopts LLint
> techniques but applyies them on top of a stack architecture.
> 
> The big advantages of a register-based architecture is instruction density,
> but we can achieve the same gains in a stack arch with two techniques:
>
> 1. instruction coalesing  - creating "bundle" instructions for common
> sequences like "dup getprop swap" (which shows up for every method call),
> "dup <OP>", "<OP> pop", "int8 getelem", "int8 bitand", etc.

That technique was tried back in the day, with SpiderMonkey to some extent and also with the Tamarin interpreter, but I don't believe it got more than incremental gains.

> 2. extending stack instructions to be able to get operands from deeper in
> the stack than just the top N values.

I'm not sure where that applies, so off the top of my head I would imagine limited benefits. Maybe you have some good examples, though.

> A stack also has a distinct notion of locality not present in register
> files: the nearer to the top a value on the stack is, the more likely it's
> going to be used soon.  We can take advantage of this to move the top of the
> virtual stack into a reserved range of registers.  The "top" of the stack
> can vary within this range (however, the register-mapping of the top of the
> stack will be statically mappable for any bytecode PC).  Once we have this,
> we can actually parametrize the various opcode implementations for every
> possible "optimized stack state".
> 
> e.g. Given a register range (%r12, %r13, %r14, %r15) for the top-of-stack
> JSOP_ADD<3> will assume that the top 3 values of the stack are in (%r12,
> %r13, and %r14).  Its entire implementation will require no loads or stores,
> and when it exits, it'll leave the stack in a state where the top 2 values
> are in registers (%r12, and %r13).
> 
> Whatever the opcode is that follows that add, the bytecode generator will
> choose the variant of that opcode with parametrization <2>.  Every opcode
> implementation will have five variants, each of which assume a different
> stack state.  The bytecode emitter will know the optimized-stack-state at
> any given PC, so it can go emit the correct sequence of variants as it
> generates them.

Using registers for temporary values is interesting. 

> By doing this, we can generate optimized bytecode that avoids memory
> load/store operations over long runs of several instructions, which I
> suspect will actually yield better performance than an architecture based on
> a virtual register file, which is hard to optimize in this way.
> 
> All of the other LLint-style techniques will still apply.  We can still use
> "pointers-to-code" as the bytecode ops, and still custom-gen the opcode
> implementations.

The most important thing to find out is what we need in terms of interpreter speed. On desktop, I haven't seen evidence that it matters in practice (with the current JIT setup), but I can't say for sure that it doesn't. Mobile seems a more likely place for it to matter: e.g., if it turned some 100ms interaction (that the JITs don't apply to) into a 200ms interaction, that would be something to care about.