Closed Bug 639099 Opened 9 years ago Closed 9 years ago

TI+JM: inline scripted calls

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set

Tracking

()

RESOLVED FIXED

People

(Reporter: bhackett, Assigned: bhackett)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 3 obsolete files)

After discussion with dvander today, inlining scripted calls in JM doesn't look like an incredibly insurmountable problem (doesn't look easy, either).  When we get to a call we want to inline, start a new instance of the Compiler that knows about (and can modify) the frame state of the root function.  During compilation, record addresses of stub calls and other information in that root function so we have the necessary information to reify the stack --- when calling into the VM the slots themselves will be synced, and the contents of the JSStackFrames (which we reserve space for but do not fill in) can be determined from VMFrame/JSStackFrame return addresses, and can be reified during recompilation.

The main challenge is making sure the stack is left in a way that the rest of the VM works right.  By restricting things so that inlined functions cannot have closed args/vars and must have the same parent as the function they are inlined into (what else? accessing f.arguments probably needs to trigger recompilation of functions inlining f), we should be able to get away with erasing the stack frames of inlined functions.
Well, and actual heuristics to decide what's worth inlining, right?
I think that inlining short, loop-free functions to some maximum depth (should even be able to do some recursive call inlining/unrolling) will probably work to a first approximation.  Hopefully we won't need to worry too much about cases like the crypto-md5 benchmark in SS that has a loop with a ton of inlinable code but hardly runs at all.
Taking.
Assignee: general → bhackett1024
Attached patch barely working WIP (obsolete) — Splinter Review
WIP.  Main changes, which only theoretically work so far:

- When inlining frames within a script, to the rest of the VM the JSStackFrame will look like it is at the outermost call site in the script, i.e. the one whose call was inlined.  To dig further, when a call is inlined at the fp's pc, fp->pc(...) also produces a call site describing the frames that were inlined and the PC of the innermost frame.  This call site is a counterpart to the outer PC --- it is a field in JSFrameRegs, and is implicitly stored and recovered the same way as prevPC() if the inner frame makes another scripted call.  There is an extra write when making a stub call, but this will be able to go away when the write to fp->pc itself goes away (hopefully before too long).

- A single instance of mjit::Compiler and mjit::FrameState are created for the entire script's compilation (as before).  They now additionally maintain an active stack of any inlined frames, and push/pop these when deciding to inline.  While an inner frame is active, the outer frame(s) are not modified; this will make it simple to generate multiple inline fragments at polymorphic but not megamorphic sites (i.e. v8-deltablue).

The FrameState itself maintains very little state specific to the outer script (just loop regalloc stuff), and could be made a per-frame thing rather than a per-compilation thing.  The only problem here is that cc.frame would need to be a pointer rather than a reference, and the resulting diff would contain most of JM.

Anyways:

function bar(x, y) {
  return x + y;
}
function foo(x, y) {
  var a = 0;
  for (var i = 0; i < 10000000; i++) {
    a += bar(x, y);
    a += bar(x, y);
    a += bar(x, y);
    a += bar(x, y);
  }
  return a;
}
var q = foo(0, 1);

js -m -n (old): 327
js -m -n (new): 54
js -m: 387
js -j: 134
d8: 41

I think the difference vs. V8 is more memory traffic than necessary on our part (regalloc needs tuning; also, thinking about adding a way to measure static memory traffic in loops along with dynamic loop hotness, to quickly identify loops whose regalloc needs work).

Still to do:

- Work on more than one example.
- Handle parent registers when doing cross-branch regalloc in callee.
- Remat inlined frames and state, for recompilation.
- Fix VM to use inlined frame info where necessary.
Attached patch WIP (obsolete) — Splinter Review
Patch with most needed features in (still barely tested).  This fixes codegen bugs reflected in the above example where we were allocating loop registers inside the first inline call, improving our time from 54ms to 41ms (the same as V8).
Attachment #521386 - Attachment is obsolete: true
Attached patch WIP v3 (obsolete) — Splinter Review
Mostly working. Passes jit-tests.
Attachment #521873 - Attachment is obsolete: true
Attached patch patchSplinter Review
Patch landed on JM.

http://hg.mozilla.org/projects/jaegermonkey/rev/2a6ecf5af25c

This is the 'passes tests' push, not the 'makes stuff faster' one.  We don't yet inline the scripted calls through CALLPROP that dominate v8bench (will be simple to do), and we always inline everywhere possible, which hurts sunspider's warm but not hot loops (especially crypto-md5).  For this I think we want another compilation stage:

When we detect a script getting warm and compile, we won't inline anything and just generate type specialized code.  When the script gets pretty hot (maybe 500 calls/backedges), we can recompile a version that inlines calls where possible, and do on-stack replacement.  (On stack replacement works fine with frame inlining; we expand all inlined frames before doing the OSR, and whenever inlining a frame we leave a hook for a non-inlined frame to rejoin after the call finishes).
Attachment #522132 - Attachment is obsolete: true
Depends on: 645632
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
(In reply to comment #7)
>and we always inline everywhere possible, which hurts

How about adding some heuristic what to inline and what not instead of recompiling? Inlining only makes sense if the overhead of function calls is significant, which obviously won't be the case if the function contains costly things or simply is large.

Afaik the java hotspot client compiler's main criterion for inlining is the number of instructions in the target function. Since it can inline both at the bytecode level and the compiled level it has two thresholds there.
The hotspot compiler does more (call frequency based recompiling etc.) but it does that off-thread, which means it doesn't slow down the main thread while doing so.
You need to log in before you can comment on or make changes to this bug.