Closed Bug 910782 Opened 6 years ago Closed 6 years ago

indirect-goto-based interpreter

Categories

(Core :: JavaScript Engine, enhancement)

enhancement
Not set

Tracking

()

RESOLVED FIXED
mozilla28

People

(Reporter: sunfish, Assigned: sunfish)

References

Details

(Whiteboard: [qa-])

Attachments

(19 files, 5 obsolete files)

2.06 KB, patch
jorendorff
: review+
Details | Diff | Splinter Review
3.14 KB, patch
terrence
: review+
Details | Diff | Splinter Review
4.23 KB, patch
luke
: review+
Details | Diff | Splinter Review
7.02 KB, patch
luke
: review+
Details | Diff | Splinter Review
2.02 KB, patch
luke
: review+
Details | Diff | Splinter Review
6.87 KB, patch
luke
: review+
Details | Diff | Splinter Review
2.84 KB, patch
luke
: review+
Details | Diff | Splinter Review
4.79 KB, patch
luke
: review+
Details | Diff | Splinter Review
7.70 KB, patch
luke
: review+
Details | Diff | Splinter Review
8.81 KB, patch
luke
: review+
Details | Diff | Splinter Review
7.33 KB, patch
luke
: review+
Details | Diff | Splinter Review
44.87 KB, patch
luke
: review+
Details | Diff | Splinter Review
3.97 KB, patch
luke
: review+
Details | Diff | Splinter Review
2.11 KB, patch
luke
: review+
Details | Diff | Splinter Review
23.43 KB, patch
luke
: review+
Details | Diff | Splinter Review
100.83 KB, patch
luke
: review+
Details | Diff | Splinter Review
4.44 KB, patch
luke
: review+
Details | Diff | Splinter Review
1.37 KB, patch
luke
: review+
Details | Diff | Splinter Review
2.27 KB, patch
luke
: review+
Details | Diff | Splinter Review
For a tangentially-related experiment, I (re-)implemented an indirect-goto-based interpreter loop. I'm aware of some reasons why this may not be desirable upstream, so I don't mind if this isn't accepted, but I'm sending it upstream in case someone finds it interesting. It speeds up the interpreter by about 10-15%, but this is typically only noticeable when the JITs are disabled.

Along the way, I also made several cleanup patches, which I split out in case they are desirable even if the main indirect-goto patch is not.

Some differences between this new interp-indirect-goto.patch and the code removed in bug 799777 include:
 - It gets rid of the global (to the interpreter loop) len and op variables, and significantly reduces the scope of the switchOp variable.
 - The table contains offsets instead of absolute addresses, which is more PIC/PIE friendly (though it is slightly slower).
 - There's only one table; the interrupt mechanism is simpler.
 - The indirect-goto code shares more with the switch code; there's only one #ifdef.
Assignee: general → sunfish
Attached patch interp-cleanup-branch.patch (obsolete) — Splinter Review
Attached patch interp-cleanup-switch-mask.patch (obsolete) — Splinter Review
Attached patch interp-prep-do-macros.patch (obsolete) — Splinter Review
Attached patch interp-prep-advance-n.patch (obsolete) — Splinter Review
Attached patch interp-indirect-goto.patch (obsolete) — Splinter Review
This patch is the one the actually implements the indirect goto feature.
Comment on attachment 797337 [details] [diff] [review]
interp-cleanup-comments.patch

Looks like the patch in bug 877599 accidentally re-introduced an old comment. This patch removes it again, plus one other unneeded comment.
Attachment #797337 - Flags: review?(jorendorff)
Comment on attachment 797338 [details] [diff] [review]
interp-cleanup-end-case.patch

Some fairly straight-forward cleanups.
Attachment #797338 - Flags: review?(terrence)
Whiteboard: [leave open]
Comment on attachment 797338 [details] [diff] [review]
interp-cleanup-end-case.patch

Review of attachment 797338 [details] [diff] [review]:
-----------------------------------------------------------------

Nice! \o/

r=me
Attachment #797338 - Flags: review?(terrence) → review+
We removed the indirect-goto code last time because it didn't seem to help performance and it was nice to avoid the extra macro goop.  Do you see any interesting cases where this improves performance?  (All the preparatory interp cleanup patches of course would be nice to land.)
(In reply to Luke Wagner [:luke] from comment #16)
> We removed the indirect-goto code last time because it didn't seem to help
> performance and it was nice to avoid the extra macro goop.  Do you see any
> interesting cases where this improves performance?  (All the preparatory
> interp cleanup patches of course would be nice to land.)

I don't have any general-interest cases where this improves performance, but I haven't really looked. If I decide to propose the indirect-goto patch for review, I'll definitely collect some more data to support it. For now I'll just work on landing the cleanups.
(In reply to Dan Gohman [:sunfish] from comment #18)
Sounds great!
Attachment #797339 - Flags: review?(luke)
Attachment #797340 - Flags: review?(luke)
Attachment #797339 - Flags: review?(luke) → review+
Attachment #797340 - Flags: review?(luke) → review+
Attachment #797342 - Flags: review?(luke)
Attachment #797341 - Attachment is obsolete: true
Attachment #798641 - Flags: review?(luke)
Comment on attachment 797343 [details] [diff] [review]
interp-cleanup-outer-for.patch

This diff looks pretty crazy, it's mostly indentation changes. The regular cases in the interpreter loop body are left-justified, so this patch left-justifies the special cases for consistency, but feel free to disagree with that choice :-).
Attachment #797343 - Flags: review?(luke)
Attachment #797344 - Attachment is obsolete: true
Attachment #798647 - Flags: review?(luke)
Comment on attachment 797347 [details] [diff] [review]
interp-prep-scoping.patch

I named this patch with "prep", but it's really a cleanup. InvokeState contains a RootedScript, so it's nice to give it a more explicit lifetime.
Attachment #797347 - Flags: review?(luke)
Comment on attachment 797337 [details] [diff] [review]
interp-cleanup-comments.patch

Review of attachment 797337 [details] [diff] [review]:
-----------------------------------------------------------------

embarrassing
Attachment #797337 - Flags: review?(jorendorff) → review+
Attachment #797342 - Flags: review?(luke) → review+
Attachment #797343 - Flags: review?(luke) → review+
Attachment #797347 - Flags: review?(luke) → review+
Comment on attachment 798641 [details] [diff] [review]
interp-cleanup-branch.patch

Review of attachment 798641 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/vm/Interpreter.cpp
@@ +1303,5 @@
>  
>  #define BRANCH(n)                                                             \
>      JS_BEGIN_MACRO                                                            \
> +        int32_t nlen = (n);                                                   \
> +        regs.pc += (nlen);                                                    \

does nlen need the parens?
Attachment #798641 - Flags: review?(luke) → review+
Attachment #798647 - Flags: review?(luke) → review+
(In reply to Luke Wagner [:luke] from comment #26)
> Comment on attachment 798641 [details] [diff] [review]
> interp-cleanup-branch.patch
> 
> Review of attachment 798641 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: js/src/vm/Interpreter.cpp
> @@ +1303,5 @@
> >  
> >  #define BRANCH(n)                                                             \
> >      JS_BEGIN_MACRO                                                            \
> > +        int32_t nlen = (n);                                                   \
> > +        regs.pc += (nlen);                                                    \
> 
> does nlen need the parens?

Nope.

https://hg.mozilla.org/integration/mozilla-inbound/rev/ebb5b0609fe6
https://hg.mozilla.org/integration/mozilla-inbound/rev/6df80a4fdb06
https://hg.mozilla.org/integration/mozilla-inbound/rev/d09951d9e0c0
https://hg.mozilla.org/integration/mozilla-inbound/rev/7792dc26b3e1
https://hg.mozilla.org/integration/mozilla-inbound/rev/0e876da27431
https://hg.mozilla.org/integration/mozilla-inbound/rev/bf286f1d5489
Attachment #797345 - Attachment is obsolete: true
Attachment #797346 - Attachment is obsolete: true
Attachment #797348 - Attachment is obsolete: true
This patch moves the interpreter's FrameGuard, FrameRegs, and opMask variables into the InterpreterActivation, allowing InterpreterActivation's methods  and other code to access them directly rather than through an extra pointer indirection.
For straight-line code, parser speed dwarfs interpreter speed. For loops above the baseline threshold, baseline/Ion take over for the win. But there is a small window around and below the baseline threshold where code runs in the interpreter where these patches often provide 10-15% speedups.

I've added several additional cleanup patches, and have further simplified the switch/goto macros, such that the maintenance burden of the threaded interpreter code should be quite low. There's only one #if, and it's pretty small and simple.

Also worth noting is bug 922440, where there is discussion of increasing the thresholds for the JITs on b2g, if not outright disabling them, either of which would put significantly more pressure on the interpreter. I haven't measured the performance of this patch on b2g, but a speedup is not unlikely.
Comment on attachment 821818 [details] [diff] [review]
interp-cleanup-case-macro-names.patch

This is just a simple rename; CASE instead of BEGIN_CASE because it really is just a case (and it's just a label in the indirect-goto version).
Attachment #821818 - Flags: review?(luke)
Comment on attachment 821819 [details] [diff] [review]
interp-cleanup-cases.patch

This adds some missing opcodes, and makes brace usage consistent.
Attachment #821819 - Flags: review?(luke)
Comment on attachment 821821 [details] [diff] [review]
interp-cleanup-var.patch

This cleans up what appears to be an obsolete optimization.
Attachment #821821 - Flags: review?(luke)
Comment on attachment 821748 [details] [diff] [review]
interp-prep-do-macros.patch

Ok. This patch is not strictly a cleanup; it's the first step towards re-enabling the threaded interpreter.

For SunSpider on a 2012 Nexus 7 (quad-core Cortex-A9) with --no-baseline, I see a pretty clear 10% speedup overall with this patch series. Some benchmarks see more, some less, but I don't see any slowdowns. With baseline enabled, I don't see a speedup, though I don't see any slowdowns either. I see similar results on an Intel laptop and an Intel desktop machine.

I am proposing this now because:
 - With all the cleanups, the complexity of this code is significantly lower than what was removed in bug 799777, and I claim the maintenance burden is significantly reduced.
 - In situations where the interpreter is used, either because the JIT is expensive, possibly including bug 922440, or not available for some reason, it's a pretty clear 10%-ish win on every platform I'm able to test on, excluding Windows where the optimization is currently unavailable.
Attachment #821748 - Flags: review?(luke)
I think somebody was using SpiderMonkey on iOS, which forces them to just use the interpreter. This would probably make them quite happy.
Attachment #821818 - Flags: review?(luke) → review+
Attachment #821748 - Flags: review?(luke) → review+
Attachment #821819 - Flags: review?(luke) → review+
Comment on attachment 821821 [details] [diff] [review]
interp-cleanup-var.patch

Hehe, real hot.
Attachment #821821 - Flags: review?(luke) → review+
(In reply to Dan Gohman [:sunfish] from comment #42)
I'm down.
Comment on attachment 821817 [details] [diff] [review]
interp-prep-advance-and-dispatch.patch

This is the next preparatory patch. Instead of setting len and then doing ADVANCE_AND_OP_OP(), this passes the length to the macro. It also renames the macro to ADVANCE_AND_DISPATCH.
Attachment #821817 - Flags: review?(luke)
Comment on attachment 821822 [details] [diff] [review]
interp-indirect-goto.patch

This is the main indirect goto patch. A few notes:

It eliminates the "op" variable, which was just a copy of *regs.pc. In theory this was an optimization, but it doesn't appear to be very important, and it does make the code more complex, since op and *regs.pc have to be manually kept in sync.

It changes the increment of regs.pc from being in a single location at advanceAndDoOp at the top of the interpreter loop to being duplicated at every ADVANCE_AND_DISPATCH. This is theoretically a minor pessimization for the case where a switch is used instead of indirect gotos, but this doesn't appear to be very significant.
Attachment #821822 - Flags: review?(luke)
Attachment #821817 - Flags: review?(luke) → review+
Comment on attachment 821827 [details] [diff] [review]
interp-indirection.patch

This patch folds the FrameGuard into InterpreterActivation and moves the interpreter's FrameRegs into InterpreterActivation too. And it fold in the opMask variable as well. The main effect of this is to eliminate InterpreterActivation's pointers to the FrameRegs, FrameGuard, and opMask, and FrameGuard's pointers. More code can access the data it needs without that indirection, and this results in about a 1% speedup of interpreter speed on Intel.

The main downside is the replacement of regs by the REGS macro in most of the interpreter body.
Attachment #821827 - Flags: review?(luke)
Comment on attachment 821829 [details] [diff] [review]
interp-cleanup-interpret.patch

This just reorganized a few things at the top of the Interpreter function.
Attachment #821829 - Flags: review?(luke)
Comment on attachment 821830 [details] [diff] [review]
interp-micro-mask.patch

This is a little optimization to finish things off. jsbytecode is a uint8_t, and this makes it easier for compilers to eliminate the zero-extend to size_t after or-ing the mask in. Since this happens after each opcode before the dispatch to the next opcode, this is actually noticeable.
Attachment #821830 - Flags: review?(luke)
Comment on attachment 821822 [details] [diff] [review]
interp-indirect-goto.patch

This is much nicer than it was before.
Attachment #821822 - Flags: review?(luke) → review+
Comment on attachment 821827 [details] [diff] [review]
interp-indirection.patch

Nice!
Attachment #821827 - Flags: review?(luke) → review+
Attachment #821829 - Flags: review?(luke) → review+
Attachment #821830 - Flags: review?(luke) → review+
I was looking at this all a bit more and I had two few questions:

- Why did you decide to store offsets instead of absolute addresses of labels?  Even if x86 can (I assume) fold in the displacement of &LABEL(JSOP_NOP), it still seems like an extra 4-byte immediate per jump instruction.  Is there a downside to storing the addresses of the labels directly?  It seems like it requires sizeof(void*) instead of sizeof(uint32_t) which could help on 64-bit.  Another possibility is if the compiler was able to do a better job baking in the constant offsets than constant addresses of labels, but I wouldn't think so.

- By having 'offsets' be a static const, is the compiler able to bake in a constant address?  It seems like, in the best case, DISPATCH_TO would be a single
  jmp *(ADDRESS_OF_OFFSETS + op<<2)
instruction.  Is this what you see in the generated code?
(In reply to Luke Wagner [:luke] from comment #53)
> I was looking at this all a bit more and I had two few questions:
> 
> - Why did you decide to store offsets instead of absolute addresses of
> labels?  Even if x86 can (I assume) fold in the displacement of
> &LABEL(JSOP_NOP), it still seems like an extra 4-byte immediate per jump
> instruction.  Is there a downside to storing the addresses of the labels
> directly?  It seems like it requires sizeof(void*) instead of
> sizeof(uint32_t) which could help on 64-bit.  Another possibility is if the
> compiler was able to do a better job baking in the constant offsets than
> constant addresses of labels, but I wouldn't think so.

Absolute addresses in static initializers require load-time relocations in PIC mode. See the remark about shared libraries in GCC's manual:

[0] http://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html

I'll add this info as a comment.

> - By having 'offsets' be a static const, is the compiler able to bake in a
> constant address?  It seems like, in the best case, DISPATCH_TO would be a
> single
>   jmp *(ADDRESS_OF_OFFSETS + op<<2)
> instruction.  Is this what you see in the generated code?

Offsets actually make it slightly slower at runtime. On x64, the sequence with offsets is:

  movslq (%r12,%rax,4),%rax
  addq   %r14,%rax
  jmpq   *%rax

It's easy enough to switch to absolute addresses if you prefer.
(In reply to Dan Gohman [:sunfish] from comment #54)
> It's easy enough to switch to absolute addresses if you prefer.

I don't have any a priori preferences, I was just wanting to understand the choices.  Thanks, I fogot about PIC code.  Assuming we just eat the load-time cost, though, is there any noticeable speed difference using absolute addresses?

>   movslq (%r12,%rax,4),%rax
>   addq   %r14,%rax
>   jmpq   *%rax

I assume %r12 is the address of 'offsets' and %r14 is the LABEL(JSOP_NOP)?  If so, are these kept in registers or do most cases have to fill these registers from the stack or somewhere else?
(In reply to Luke Wagner [:luke] from comment #55)
> (In reply to Dan Gohman [:sunfish] from comment #54)
> > It's easy enough to switch to absolute addresses if you prefer.
> 
> I don't have any a priori preferences, I was just wanting to understand the
> choices.  Thanks, I fogot about PIC code.  Assuming we just eat the
> load-time cost, though, is there any noticeable speed difference using
> absolute addresses?

In some ad-hoc testing, absolute addresses appear to be about a 1% speedup at runtime.

> >   movslq (%r12,%rax,4),%rax
> >   addq   %r14,%rax
> >   jmpq   *%rax
> 
> I assume %r12 is the address of 'offsets' and %r14 is the LABEL(JSOP_NOP)? 

Yep.

> If so, are these kept in registers or do most cases have to fill these
> registers from the stack or somewhere else?

It looks like they're in registers most of the time. They get rematerialized in a few places, but most of the time they're preserved.
Here's a patch which changes the interpreter to use absolute addresses.

On x86-64 Linux for example, it increases the number of R_X86_64_RELATIVE relocations in libmozjs-27.0a.so from 25709 to 25965. I hadn't realized it was already that high.  Given that, the added cost is possibly not very significant, so the 1% runtime speedup may be worth it.
Attachment #822913 - Flags: review?(luke)
(In reply to Dan Gohman [:sunfish] from comment #56)
Thanks for looking into this!

> It looks like they're in registers most of the time. They get rematerialized
> in a few places, but most of the time they're preserved.

Wow, I'm impressed.  I mean, to know that 'offsets' and 'LABEL(JSOP_NOP)' are in registers at the beginning of a case, the compiler has to know that they are in registers, and the same registers at end of every case.  (Of course they are, since each indirect goto is using 'offsets' and 'LABEL(JSOP_NOP)', but still :)
Comment on attachment 822913 [details] [diff] [review]
interp-absolute.patch

Review of attachment 822913 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/vm/Interpreter.cpp
@@ +1220,4 @@
>  # define OPPAD(v)                                                             \
> +    (v) == EnableInterruptsPseudoOpcode ?                                     \
> +    LABEL(EnableInterruptsPseudoOpcode) :                                     \
> +    LABEL(default),

SM style is ? and : at the beginning of the line, aligned with the condition.

Alternatively, what if you added a third OPINT (in keeping with the 5-char macros :) that was only used for EnableInterruptsPseudoOpcode to avoid the ternary operator (which I realize should be const-evaluated; it might just read a bit better).
Attachment #822913 - Flags: review?(luke) → review+
(In reply to Luke Wagner [:luke] from comment #59)
> Comment on attachment 822913 [details] [diff] [review]
> interp-absolute.patch
> 
> Review of attachment 822913 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: js/src/vm/Interpreter.cpp
> @@ +1220,4 @@
> >  # define OPPAD(v)                                                             \
> > +    (v) == EnableInterruptsPseudoOpcode ?                                     \
> > +    LABEL(EnableInterruptsPseudoOpcode) :                                     \
> > +    LABEL(default),
> 
> SM style is ? and : at the beginning of the line, aligned with the condition.

Fixed. I also put parens around the whole thing.

> Alternatively, what if you added a third OPINT (in keeping with the 5-char
> macros :) that was only used for EnableInterruptsPseudoOpcode to avoid the
> ternary operator (which I realize should be const-evaluated; it might just
> read a bit better).

I considered this, but I also like keeping the Interpreter's implementation details out of jsopcode.tbl as much as possible. OPPAD is intrusion enough :-}.

remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/0f80724faca5
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/88e167fa6f4b
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/d4322df6c15e
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/c5186c7a40ac
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/af637e56ce12
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/fb2bf717ab24
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/953dc75f2bb9
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/a2074887deb6
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/9a5e42743781
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/9540960e67a2
~                                                                                        
~                                                                                        
~                                                                                        
~                                                                                        
~                                                                                        
~                                                                                        
~                                                                                        
~                                                                                        
~
Whiteboard: [leave open]
Whiteboard: [qa-]
You need to log in before you can comment on or make changes to this bug.