Closed Bug 1499324 (BaselineInterpreter) Opened 6 years ago Closed 5 years ago

[meta] Consider compiling the interpreter dynamically

Categories

(Core :: JavaScript Engine: JIT, enhancement, P2)

enhancement

Tracking


RESOLVED FIXED
Performance Impact ?
Tracking Status
relnote-firefox --- 70+
firefox64 --- wontfix
firefox70 --- fixed

People

(Reporter: jandem, Assigned: jandem)

References

(Depends on 2 open bugs)

Details

(Keywords: meta)

Attachments

(1 file)

Our current C++ interpreter has a number of issues, most importantly:

(1) It's pretty slow and has no IC support.
(2) It has its own stack layout and calling convention. This makes calls and OSR into Baseline slower and more complicated than they should be.

Baseline is pretty fast but compilation time sometimes shows up in profiles.

Over the past few weeks we have been talking about generating an interpreter dynamically. It would work a lot like Baseline, but instead of compiling each script to JIT code it would interpret bytecode at runtime. Concretely:

* Frame layout exactly like Baseline, reusing BaselineFrame and JIT calling convention.
* ICs will be attached to JSScript instead of BaselineScript (completely shared by interpreter and Baseline).
* Code to compile a given JSOp will be factored out of BaselineCompiler (with some policy template magic; see the sketch after this list). Implementing a new op will then ensure it works in both Baseline and the generated interpreter.
* The interpreter trampoline we already have will contain the interpreter code instead of the current call-into-C++-interpreter machinery.
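
To illustrate the "policy template" sharing, here is a minimal sketch (all names here are hypothetical stand-ins, not the real API): op implementations are written once against a handler policy and instantiated for both the compiler and the interpreter generator.

```
// Minimal policy-template sketch; all names are hypothetical.
struct MacroAssembler {};  // stand-in for the real assembler type

template <typename Handler>
class BaselineCodeGen {
 protected:
  Handler handler;
  MacroAssembler masm;

 public:
  // Fully shared op: identical codegen in compiler and interpreter.
  bool emit_JSOP_ADD() {
    // ...pop operands, emit the shared IC call, push the result...
    return true;
  }

  // Op whose codegen differs per mode: delegate to the policy.
  bool emit_JSOP_INT8() { return handler.emitPushInt8Operand(masm); }
};

struct CompilerHandler {
  // The compiler knows the pc statically and can bake the operand in.
  bool emitPushInt8Operand(MacroAssembler&) { return true; }
};

struct InterpreterHandler {
  // The interpreter must load the operand from the bytecode at runtime.
  bool emitPushInt8Operand(MacroAssembler&) { return true; }
};

using CompilerSketch = BaselineCodeGen<CompilerHandler>;
using InterpreterGeneratorSketch = BaselineCodeGen<InterpreterHandler>;
```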

This has some nice benefits:

(a) The interpreter will use exactly the same ICs and stubs (code + data) as Baseline.

(b) With a faster interpreter, we can consider increasing the Baseline compile threshold so we spend less time Baseline compiling.

(c) Interpreter => Baseline OSR becomes trivial because the frame layout is the same, so at loop entries we could just change the instruction pointer to the Baseline code (see the sketch after this list).

(d) Ion bailouts could probably be simplified a bit by resuming into the interpreter instead of Baseline (until the next loop edge where we can then use (c) to get back into Baseline).
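
A rough sketch of (c), in hypothetical pseudocode (the real OSR path would also need to map the pc to native code and keep the frame in a synced state):

```
// Hypothetical sketch of interpreter -> Baseline OSR at a loop entry.
// Because the frame layout is identical, no frame conversion is needed:
// look up the Baseline native address for the current pc and jump there.
uint8_t* osrTarget = baselineScript->nativeCodeForPC(loopEntryPC);
jumpTo(osrTarget);  // continue running Baseline code on the same frame
```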

Some more things we discussed:

(A) The current C++ interpreter could still be used for platforms without a JIT, for differential testing and debugging, or (at least initially) for functions that have > 20000 arguments or so (because the C++ interpreter allocates them on the heap instead of on the much smaller native stack). This interpreter could then be simplified a lot.

(B) To determine the current bytecode pc for a given Baseline frame, we currently use the native code => pc map. An interpreter would have to maintain this bytecode pc on the BaselineFrame when we make calls. BaselineFrame already has an "override pc" field that we could probably use.

(C) To load the ICEntry for the current bytecode pc we have a few options (a sketch of both follows the list):

    (1) Store the IC index in the bytecode as an operand.
    (2) Flag all bytecode ops that have an IC as JOF_IC and have the bytecode emitter only store the IC index for jump target ops. This means the interpreter could then keep an ICEntry* pointer that is bumped between ops.
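
A rough sketch of what the two options could look like from the interpreter's point of view (all names and the operand encoding are hypothetical):

```
#include <cstdint>
#include <cstring>

struct ICEntry { /* stub chain, fallback data, ... */ };
struct BaselineFrame { ICEntry* interpreterICEntry; };

// Option (1): read a 32-bit IC index straight from the bytecode operand.
ICEntry* icEntryFromOperand(ICEntry* icEntries, const uint8_t* pc) {
    uint32_t icIndex;
    std::memcpy(&icIndex, pc + 1, sizeof(icIndex));
    return &icEntries[icIndex];
}

// Option (2): keep a running cursor on the frame. Jump-target ops reset
// it from a stored index; every JOF_IC op bumps it when it finishes.
ICEntry* icEntryFromCursor(BaselineFrame* frame) {
    ICEntry* entry = frame->interpreterICEntry;
    frame->interpreterICEntry = entry + 1;  // bump at end of a JOF_IC op
    return entry;
}
```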

(D) Most code on websites is very cold (executed once or twice). Using ICs for such code might actually make us slower because we will waste time attaching stubs that are then never used. One idea is to compile two interpreters, one that uses ICs and one that doesn't, and then switch dynamically between the two. A similar mechanism could maybe be used when toggling debug mode for on-stack scripts.
Depends on: 1494618, 1494590
My biggest concerns are about the development cost needed to maintain the interpreter.

As I mentioned in the meeting we had in April 2018, generating an Interpreter implies that everybody who deals with JavaScript language features has to know how to write code for the Interpreter, and writing code with our current MacroAssembler is by far worse than writing C++.

* Interpreter Optimization

From what I understand, the concern here is that we have no control over the layout of the interpreter stack. Maybe there is an option that does not involve rewriting everything with our MacroAssembler but still gives us full control over the stack layout.

My understanding is that C++ compilers give up because of the heavy use of goto branches, and create a single stack frame in which all stack variables are live at all times.

* Inline Caches (in C++)

About having Inline Caches in the interpreter: we can implement most of our Inline Caches as statically compiled functions with associated data. Doing so would avoid having to generate code at runtime, while still getting a data specialization of the C++ code, such as pre-computed property indexes based on known shapes.

CASE(JSOP_GETPROP) {
    lhs = ...;
    CppInlineCache* ic = getCppICForPC(pc);
    if (!ic->call(ic, lhs)) { goto error; }
}

Here the CppInlineCache stubs would be chained, and the data for each specialization stub would be attached to the CppInlineCache pointer.

bool jsop_getprop_NativeKnownShape(CppInlineCache* ic, MutableHandleValue obj) {
    auto* knownShapeIC = ic->downcast<GetPropNativeKnownShapeCppIC>();
    if (obj->shape() != knownShapeIC->shape()) {
        // Shape guard failed: try the next stub in the chain.
        CppInlineCache* next = ic->next();
        return next->call(next, obj);
    }

    obj = obj->getProp(knownShapeIC->slot());
    return true;
}

bool jsop_getprop_fallback(CppInlineCache* ic, MutableHandleValue lhs) {
    ...
}
I think it should be stressed (at least this is my understanding) that the intent would be to *not* rewrite things into MacroAssembler. The existing BaselineCompiler (with 99% coverage) would be leveraged directly. We would leave in VM calls for anything that is not on the critical path. If we needed to rewrite all the opcodes to make this work, this would probably not make sense to even consider.
Keywords: meta
This sounds promising!
I am in general highly supportive of this idea - mostly because it seems to deliver a number of different short-term wins while still moving us towards a single source of optimization truth.  CacheIR was a good step in that direction, unifying large chunks of Ion and Baseline ICs (with some remaining work to be done).

Moving the interpreter optimization behaviour to start unifying with CacheIR is a good direction.

(In reply to Jan de Mooij [:jandem] from comment #0)
> Our current C++ interpreter has a number of issues, most importantly:
> 
> (1) It's pretty slow and has no IC support.
> (2) It has its own stack layout and calling convention. This makes calls
> and OSR into Baseline slower and more complicated than they should be.

I'd like to add that on pages with heavy load-time JS (e.g. GDocs), we also see a lot of Interpreter time related to setting up calls. Not only is interpreter speed a big deal, but the call and ABI boundaries between Interp => Interp, Interp => JIT, and JIT => Interp calls are insane, and constitute a legitimately heavy component of the performance issues we are trying to address today.

> * Frame layout exactly like Baseline, reusing BaselineFrame and JIT calling
> convention.
> * ICs will be attached to JSScript instead of BaselineScript (completely
> shared by interpreter and Baseline).

I have a slightly tangential question: currently when we compile a script with Ion, and decide to emit an Ion IC for a particular instruction, do we automatically convert any existing baseline CacheIR IC stubs into Ion IC stubs? 

> * Code to compile a given JSOp will be factored out of BaselineCompiler
> (with some policy template magic). Implementing a new op will then ensure it
> works in both Baseline and the generated interpreter.

Additionally, we should be able to optimize the interpreter dispatch here.  If we're generating the interpreter op implementations, we can easily place them all in memory at well-aligned contiguous addresses (e.g. 256-byte chunks), starting at some base address, and turn our dispatch into:

```
   // DISPATCH NEXT OP: jump directly to this op's 256-byte-aligned chunk.
   // (Computed-goto pseudocode; in generated code this is a single
   // indirect jump instruction.)
   goto *(void*)(InterpreterBaseAddress + (op << 8));
```
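
For reference, here is a minimal runnable sketch of threaded dispatch using the GCC/Clang computed-goto extension (table-based for simplicity; the aligned-chunk variant above computes the target directly from the opcode and avoids even the table load):

```
#include <cstdint>
#include <cstdio>

// Minimal threaded-dispatch sketch (GCC/Clang computed-goto extension).
static int run(const uint8_t* pc) {
    static void* const table[] = { &&op_push1, &&op_add, &&op_ret };
    int acc = 0;
#define DISPATCH() goto *table[*pc++]
    DISPATCH();
op_push1:
    acc += 1;
    DISPATCH();
op_add:
    acc += acc;
    DISPATCH();
op_ret:
#undef DISPATCH
    return acc;
}

int main() {
    const uint8_t code[] = { 0, 0, 1, 2 };  // push1, push1, add, ret
    printf("%d\n", run(code));              // prints 4
}
```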

> (C) To load the ICEntry for the current bytecode pc we have a few options:
> 
>     (1) Store the IC index in the bytecode as an operand.
>     (2) Flag all bytecode ops that have an IC as JOF_IC and have the
> bytecode emitter only store the IC index for jump target ops. This means the
> interpreter could then keep an ICEntry* pointer that is bumped between ops.
> 

I would strongly advocate choice (1). There is just too much weird potential for things to get out of sync. It's also very nice to have an O(1) way of going from a bytecode op to its corresponding IC.

> (D) Most code on websites is very cold (executed once or twice). Using ICs
> for such code might actually make us slower because we will waste time
> attaching stubs that are then never used. One idea is to compile two
> interpreters, one that uses ICs and one that doesn't, and then switch
> dynamically between the two. A similar mechanism could maybe be used when
> toggling debug mode for on-stack scripts.

Couldn't we just make sure not to generate an optimized stub on the first run through an IC?  We would always jump immediately to the fallback stub on first run anyway.  That stub has a hit-counter, and could easily choose to skip the optimized stub generation on first execution.
Depends on: 1499644
Priority: -- → P2
Whiteboard: [qf:p1]
Attached patch: Prototype (Splinter Review)
This is a very primitive/hackish/throw-away prototype to test some of these ideas:

* Disclaimer 1: it doesn't share any code with BaselineCompiler, a more serious version would share most opcode implementations.

* Disclaimer 2: it's slow because the pc => IC index lookup requires a C++ call. With bytecode format changes we can make this way faster. It's also slow because the dispatch loop is just an if-else chain on the current JSOp instead of a tableswitch or something.

* It works: we can enter Baseline ICs without having a BaselineScript. We attach optimized stubs and can call into them, etc.

* One part that doesn't work is type monitoring, because it relies on the bytecode map stored in BaselineScript. This shouldn't be too hard to fix, but the prototype just hacks around it.

* Stack walking and GC work as expected. For this, BaselineFrame stores the bytecode pc if we're in interpreter mode.

* It supports enough bytecode ops to interpret "f" in the script below:

---


function f() { // Must be on line 3 to run in prototype interpreter.
    var x = 2;
    for (var i = 0; i < 3; i = i + 1) {
	x *= Math.sqrt(x);
    }
    gc();
    print(getBacktrace());
    return 1;
}
for (var i = 0; i < 50; i++) {}
for (var i = 0; i < 5; i++) {
    print(f());
}
---

* It applies on top of the patch in bug 1499644, inbound rev ca965ec1bc91.
I did an audit of all ops implemented in BaselineCompiler. The majority of op implementations there can be easily shared with the interpreter. A number of them can be shared after some minor refactoring (for instance, we can just pass the jsbytecode* pc to slow VM functions instead of a bytecode operand). The ones that are harder to share can often still reuse the most complicated codegen, VM calls, etc.

Because we will reuse most of Baseline (frame layout, ICs, op implementations, VM/GC/Debugger interaction), the interpreter-specific code should be pretty small - most of it should be the interpreter loop, (bits of) the prologue and unshared opcodes.

For debugger breakpoints/stepping, the simplest option is to do something like the C++ interpreter's opMask trick (see the sketch below). Having a special debug-mode interpreter we can switch to is more complicated (because of scripts on the stack when the debugger is enabled) and probably not worth it initially (but it's fun to think about).
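
A sketch of the opMask idea (loosely modeled on the C++ interpreter; the names and the 0xff value are assumptions): dispatch on `op | opMask`, so that in debug mode every dispatch lands on a single pseudo-op handler that checks for breakpoints and stepping before running the real op.

```
#include <cstdint>

// Hypothetical sketch of opMask-style dispatch. With mask == 0 each op
// dispatches to its own handler; with mask == 0xff (a table slot no real
// opcode uses) every dispatch lands on the debug pseudo-op handler.
constexpr uint32_t EnableInterruptsPseudoOp = 0xff;

uint32_t dispatchIndex(uint8_t op, bool debugMode) {
    uint32_t mask = debugMode ? EnableInterruptsPseudoOp : 0;
    return uint32_t(op) | mask;
}
```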

At this point my main concerns are:

(1) Perf/memory overhead from allocating ICScript and TypeScript eagerly (and using ICs immediately) instead of waiting for Baseline compilation, because most scripts are cold. However:

(a) This will hopefully be mitigated by execution becoming faster (because ICs and optimized interpreter loop), Baseline compilation becoming faster (because no ICs to allocate), and Baseline compiling less code (because faster interpreter).

(b) TypeScript allocation we could potentially do more lazily.

(c) The "cold interpreter" design mentioned in comment 0 is an option: generate two copies and have one of them use VM calls for everything, then when the script warms up (use count of 3 for example) we allocate the ICScript and switch to the other interpreter. This would be very similar to tiering up to Baseline. The main downside is that it will require some extra code for ops that use an IC.

(d) Shorter-term we could also use the C++ interpreter as "cold interpreter". That requires keeping the "OSR into Baseline (Interpreter)" machinery and it would be really nice to get rid of that, but it's something to consider.

(2) Bytecode size overhead from storing IC indexes. I'll collect some data so we can see what the overhead is %-wise.
One of the possibilities to address the size concern is this: with a native-stack JIT-ABI-compatible interpreter, we can more seriously consider compiling with baseline on a per-block basis.

That's a performance no-go right now because the ABI differences between baseline and the interpreter mean we would lose efficiency to the overhead of moving values between the interpreter stack and the C stack. But once this code lands, jumping between baseline and the interpreter should become a simple jump instruction (we can ensure that stack slots are synced whenever a new basic block is entered in baseline).

This follow-up would have the following benefits:

1. Less time spent compiling baseline code (and BaselineCompile is showing up in profiles).

2. Less memory used for jitcode.

This should mitigate (if not eliminate) the memory costs that are of concern here.  Baseline JITcode is very heavyweight.
(In reply to Kannan Vijayan [:djvj] from comment #7)

That's a great idea. Same frame layout has so many benefits (for calls, OSR, debug-mode OSR, Ion bailouts, more fine-grained Baseline compilation) that I'm increasingly thinking that alone makes this worth doing. In this world Baseline (the compiler [0]) really just becomes "unroll the interpreter loop and do some trivial optimization".

I'm also hoping we can increase the Baseline compile threshold in general to compensate for extra time spent upfront. Next week I'll try to get some data on GDocs about our current interpreter and Baseline behavior.

[0] Terminology can get a bit confusing; it will be Baseline = Baseline Interpreter + Baseline Compiler.
Storing the IC index as a 4-byte bytecode operand for each op that has an IC will, for most scripts, be a 40-65% bytecode size regression (with outliers in both directions). I like the simplicity of it, but unfortunately that's quite significant, so it might make sense to store the IC index only for JSOP_JUMPTARGET ops, keep an interpreterICIndex/ICEntry* in BaselineFrame, and increment it at the end of all JOF_IC ops. If everything is based on JOF_IC, it's less likely for things to go out of sync.

I also logged all JSOP_GETPROPs running in the C++ interpreter when starting Nightly, opening a Google Doc and scrolling a bit. Out of 132403 "locations", 82455 (62%) were executed only once. This suggests we shouldn't try to attach IC stubs on the first hit but wait for the second one. Note that the remaining 38% of GETPROP locations (the ones executed more than once) account for > 83% of *all* GETPROPs in the log, so it seems most GETPROPs executed in the interpreter would still benefit from ICs.
An idea Ted and I just had: once all fallback stub codes are runtime-wide trampolines (bug 1501310, something we want to do anyway for perf/simplicity), then EmitCallIC in the *cold* interpreter could just emit a call to that with a nullptr stub.

That makes very cold code fast because (1) we don't have to allocate an ICScript/TypeScript for it and (2) we don't try to attach IC stubs. When the script's warm-up count reaches 2-3 we allocate the ICScript/TypeScript and jump to the warm interpreter.

This is similar to the cold/warm interpreter mentioned before, but in this design the only difference between the two is in EmitCallIC and that's really nice for simplicity and code size.
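
A sketch of how the two EmitCallIC variants could differ (MacroAssembler-style pseudocode; every name here is hypothetical, not the real API):

```
// Hypothetical pseudocode, not the real MacroAssembler API.
void EmitCallIC(MacroAssembler& masm, bool coldInterpreter) {
    if (coldInterpreter) {
        // No ICScript exists yet: pass a nullptr stub and call the shared,
        // runtime-wide fallback trampoline. It performs the VM call but
        // never tries to attach an optimized stub.
        masm.movePtr(ImmPtr(nullptr), ICStubReg);
        masm.call(sharedFallbackTrampoline);
    } else {
        // Warm interpreter: load the current ICEntry's first stub and
        // call its stub code, just like Baseline-compiled code does.
        masm.loadPtr(Address(ICEntryReg, offsetOfFirstStub), ICStubReg);
        masm.call(Address(ICStubReg, offsetOfStubCode));
    }
}
```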
Depends on: 1501310
Depends on: 1503496
Depends on: 1503542
Summary: Consider compiling the interpreter dynamically → [meta] Consider compiling the interpreter dynamically
Alias: BaselineInterpreter
Depends on: 1507066
Depends on: 1507120
Depends on: 1508962
Assignee: nobody → jdemooij
Status: NEW → ASSIGNED
Depends on: 1509537
See Also: → 1507271
Depends on: 1511739
Depends on: 1511747
Depends on: 1511837
Depends on: 1511891
Depends on: 1519378
Depends on: 1521491
Depends on: 1522048
Depends on: 1522068
Depends on: 1522394
Depends on: 1524499
Depends on: 1537906
Depends on: 1541810
Whiteboard: [qf:p1] → [qf:meta]
Depends on: 1548775
Blocks: 1552154
Depends on: 1562129
Depends on: 1562830
Depends on: 1563510
Depends on: 1564017
Depends on: 1564337
Depends on: 1564887
Depends on: 1565788
Depends on: 1565807
Blocks: 1566330
Depends on: 1566332
Blocks: 1567388
Blocks: 1567438
Blocks: 1567920
Blocks: 1570241
Depends on: 1576567

Adding relnote for 70 beta as: "The Baseline Interpreter for JavaScript bytecode execution is now enabled" with a link to the hacks.m.o write-up: https://hacks.mozilla.org/2019/08/the-baseline-interpreter-a-faster-js-interpreter-in-firefox-70/

(In reply to Liz Henry (:lizzard) from comment #11)

> Adding relnote for 70 beta as: "The Baseline Interpreter for JavaScript bytecode execution is now enabled" with a link to the hacks.m.o write-up: https://hacks.mozilla.org/2019/08/the-baseline-interpreter-a-faster-js-interpreter-in-firefox-70/

Thanks! I think we can close this bug now.

Note: initially we wanted the Baseline Interpreter to replace the C++ Interpreter in most cases, but we decided to keep the C++ Interpreter for cold code (because trying to use ICs for cold code would waste some memory and add runtime overhead). A lot of the win from the Baseline Interpreter now comes from delaying Baseline JIT compilation. Nothing is set in stone and this might change in the future, but we're pretty happy with this setup atm.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Performance Impact: --- → ?
Whiteboard: [qf:meta]