Bug 721031 - IonMonkey: Make native calls faster

Product: Core
Classification: Components
Component: JavaScript Engine
Version: unspecified
Platform: All
OS: All
Importance: -- normal
Assigned To: Sean Stangl [:sstangl]
QA Contact: Jason Orendorff [:jorendorff]
Depends on: 722238, 767679
Blocks: IonSpeed
Reported: 2012-01-25 07:38 PST by Jan de Mooij [:jandem]
Modified: 2012-06-23 00:39 PDT
CC: 9 users

Attachments
WIP patch. (21.28 KB, patch) 2012-01-31 14:35 PST, Sean Stangl [:sstangl]
Patch v1. (29.33 KB, patch) 2012-02-02 17:18 PST, Sean Stangl [:sstangl]
Patch v2 (29.34 KB, patch) 2012-02-02 17:30 PST, Sean Stangl [:sstangl]
Patch v3. (33.42 KB, patch) 2012-02-06 16:00 PST, Sean Stangl [:sstangl] (dvander: review+)

Description Jan de Mooij [:jandem] 2012-01-25 07:38:13 PST
Ion+TI runs Kraken imaging-darkroom slower than JM+TI, because our native calls are slower. The benchmark calls Math.log/Math.pow 3844801 times; we can measure the overhead of this with the following micro-benchmark:
function f() {
    for (var i=0; i<3844801; i++)
        Math.log(i);
}
var t = new Date;
f();
print(new Date - t);

JM+TI:  87 ms (65% in math_log)
IM+TI: 210 ms (23% in math_log)

We spend a lot of time in js::Invoke, ContextStack::onTop, ContextStack::pushInvokeArgs etc. This seems to be responsible for most of our slowdown vs JM+TI.

FWIW, V8 inlines/specializes Math.log/Math.pow - we can do something similar, but this bug is about native calls in general, not specifically Math.log/Math.pow.
Comment 1 Sean Stangl [:sstangl] 2012-01-31 14:35:00 PST
Created attachment 593221 [details] [diff] [review]
WIP patch.

This is basically done. Some common patterns still need to be factored out, but it otherwise works. The test case runs at the same speed with '--ion -n' as with '-m -n'.

It can't land until OSI changes are made -- since we're not going through callVM(), there is no call to a generateReturnWrapper() that can take a return address. We could fake-call the next instruction, but that's a bit silly. OSI is changing soon anyway, so this blocks on that.

Passes all tests except basic/arrayConcat.js, which requires native invalidation.
Comment 2 Sean Stangl [:sstangl] 2012-01-31 14:37:02 PST
This is about a 4% speedup on SS.
Comment 3 Sean Stangl [:sstangl] 2012-02-02 17:18:30 PST
Created attachment 594029 [details] [diff] [review]
Patch v1.

Handles native calls on all platforms in an efficient way, creating fake exit frames. Passes arrayConcat.js (invalidation). The ARM code is written and compiles, and should work, but can't be tested because ARM segfaults for unrelated reasons in generate().

On x86, jandem's test with an iteration count of 10,000,000 gives '-m -n' a score of 262 and '--ion -n' a score of 258; both results are consistent.
Comment 4 Sean Stangl [:sstangl] 2012-02-02 17:30:16 PST
Created attachment 594034 [details] [diff] [review]
Patch v2

Actual patch.
Comment 5 David Anderson [:dvander] 2012-02-03 14:35:22 PST
Comment on attachment 594029 [details] [diff] [review]
Patch v1.

Review of attachment 594029 [details] [diff] [review]:

Looks good, but there are a bunch of little things that should probably be addressed first:

::: js/src/ion/CodeGenerator.cpp
@@ +454,5 @@
> +    const Register argVpReg = ToRegister(argVp);
> +
> +    // Misc. temporary registers.
> +    const LAllocation *temp = call->getTempReg();
> +    const Register tempReg = ToRegister(temp);

I'd just do ToRegister(call->getTempReg())

@@ +458,5 @@
> +    const Register tempReg = ToRegister(temp);
> +
> +    DebugOnly<uint32> initialStack = masm.framePushed();
> +
> +#ifdef JS_CPU_ARM

Does this need to be #ifdef ARM? Might be nicer to make it an empty call on x86/x64.

@@ +469,5 @@
> +    // are the function arguments.
> +
> +    // Allocate space for the outparam, moving the StackPointer to &vp[0].
> +    unusedStack -= sizeof(Value);
> +    masm.adjustStack(unusedStack);

This must push a Value containing the callee object. Natives are allowed to access their callee before setting rval.

::: js/src/ion/IonBuilder.cpp
@@ +130,5 @@
>      return state;
>  }
> +JSFunction *
> +IonBuilder::getSingleCallTarget(uint32 argc, jsbytecode *pc, bool inlining)

Why does this take in an |argc|?

@@ +153,5 @@
>      }
>      if (calleeTypes->getObjectCount() > 1) {
> +        if (inlining)
> +            IonSpew(IonSpew_Inlining, "Cannot inline due to multiple objects");

Using the bool parameter for this seems unwieldy. I'd just switch these to IonSpew_MIR, and change "Cannot inline" to "Cannot determine single call target". Or just remove the spew, it's not associated with any pc or line number so it doesn't seem useful.

@@ +2392,4 @@
>      // Bytecode order: Function, This, Arg0, Arg1, ..., ArgN, Call.
> +    for (int i = argc; i >= 0; i--)
> +        call->addArg(i, current->pop()->toPassArg());
> +    call->initFunction(current->pop());

If constructing, |thisv| must be replaced with a Magic value with why=JS_IS_CONSTRUCTING. You can do this either in MIR or in the CodeGenerator (by just writing to the |thisv| slot like JM, see MonoIC.cpp generateNative).

::: js/src/ion/Lowering.cpp
@@ +177,5 @@
> +        JSObject &obj = funcdef->toConstant()->value().toObject();
> +        if (!obj.isFunction())
> +            break;
> +
> +        target = obj.toFunction();

It might look cleaner to extract this code out as a separate function, like CheckKnownCallee(MDefinition *) or something.

@@ +186,5 @@
> +
> +    // Monomorphic native calls lower to LCallNative.
> +    if (target->isNative()) {
> +        LCallNative *lcall = new LCallNative(target, argslot,
> +#if defined(JS_CPU_X64)

We should try to avoid #ifdefs like this, especially in really generic files like this one. If this optimization makes a big difference (probably unlikely), it's worth making CPU-specific lowering functions.

::: js/src/ion/arm/MacroAssembler-arm.cpp
@@ +1047,5 @@
> +uint32
> +MacroAssemblerARMCompat::buildFakeExitFrame(const Register &scratch)
> +{
> +    DebugOnly<uint32> initialDepth = framePushed();

Same comments as on x86.

::: js/src/ion/shared/MacroAssembler-x86-shared.h
@@ +200,5 @@
>      }
> +
> +    // Builds an exit frame on the stack, with a return address to an internal
> +    // non-function. Returns offset to be passed to markSafepointAt().
> +    uint32 buildFakeExitFrame(const Register &scratch) {

I think this function can be a lot simpler:

    Label label;
    Push(Imm32(descriptor));
    call(&label);
    bind(&label);
If it's followed immediately by a markSafepoint(), you don't even need to return anything.

::: js/src/ion/x64/MacroAssembler-x64.h
@@ +605,4 @@
>              unboxNonDouble(Operand(address), dest.gpr());
>      }
> +    void loadInstructionPointerAfterCall(const Register &scratch) {


::: js/src/ion/x86/MacroAssembler-x86.h
@@ +549,4 @@
>          shrl(imm, dest);
>      }
> +    void loadInstructionPointerAfterCall(const Register &scratch) {

Comment 6 David Anderson [:dvander] 2012-02-03 14:37:43 PST
Note on the JS_IS_CONSTRUCTING bit: that is only for when calling natives.
Comment 7 Sean Stangl [:sstangl] 2012-02-03 15:07:32 PST
(In reply to David Anderson [:dvander] from comment #5)
> I think this function can be a lot simpler:
>     Label label;
>     Push(Imm32(descriptor));
>     call(&label);
>     bind(&label);
> If it's followed immediately by a markSafepoint(), you don't even need to
> return anything.

The branch predictor on Intel chips wants each call to be matched with a ret. I can measure both ways, but there are strongly-worded warnings in Intel documentation to absolutely never call non-functions.
Comment 8 David Anderson [:dvander] 2012-02-03 15:10:38 PST
Would it work to just have a bogus ret as well then? (leaveFakeExitFrame?)
Comment 9 Sean Stangl [:sstangl] 2012-02-03 15:22:44 PST
Not reasonably. The ret would go to the code after the call, which would have to be a jump to after leaveFakeExitFrame().

But the code after the call has to be from markSafepoint(), and we probably don't want that code to be re-executed. Since we can't have both a jump and markSafepoint() code immediately following the call, the jump has to be after markSafepoint(), so we wind up not only using space in the branch predictor, but also executing more instructions.

Furthermore, that jump then has to be bound in visitCallNative(), but not in the case of ARM, so visitCallNative() becomes platform-specific, at least with #ifdefs.

That would break buildFakeExitFrame() into two halves and require the caller to stitch them together. Keeping the frame-building code close together seems simpler.
Comment 10 David Anderson [:dvander] 2012-02-03 15:27:50 PST
Random idea: do we even need a call instruction? We could just push the next offset, and make sure in C++ to call the right overload to getSafepointIndex/getOsiIndex. For debugging we will need some way to differentiate C++ exit frames from JSNative exit frames, that could be a bit in the frame descriptor, telling us how to interpret the return address.
Comment 11 Sean Stangl [:sstangl] 2012-02-03 15:36:39 PST
That's a good idea.

I was talking with Marty last night, and it's likely the (untested) ARM implementation of markSnapshot() is wrong -- because of the moving pools, currentOffset() can never be called, since the code can be shifted to make way for a pool. We could always patch it in later, though.

If we fix that, keeping a unique offset around would be sufficient for the native case.
Comment 12 Sean Stangl [:sstangl] 2012-02-06 16:00:53 PST
Created attachment 594853 [details] [diff] [review]
Patch v3.

The Magic value is now placed in MIR; the callee value is pushed as part of codegen. The x64-specific optimization was removed -- it made only about a 4 ms difference in a test with 1,000,000 iterations.

The logic to create the exit frame was kept, since it works on x86/x64, is no more broken than ARM, and lets this patch not touch OSI. Will file a follow-up bug for that work once this lands.
Comment 13 David Anderson [:dvander] 2012-02-06 17:08:15 PST
Comment on attachment 594853 [details] [diff] [review]
Patch v3.

Review of attachment 594853 [details] [diff] [review]:

::: js/src/ion/Lowering.cpp
@@ +191,5 @@
> +                tempFixed(CallTempReg0), tempFixed(CallTempReg1), tempFixed(CallTempReg2),
> +                tempFixed(CallTempReg3));
> +        if (!defineReturn(lcall, call))
> +            return false;
> +        ins = (LInstruction *)lcall;

Are these casts needed?
Comment 14 Sean Stangl [:sstangl] 2012-02-06 17:18:34 PST
Comment 15 Sean Stangl [:sstangl] 2012-02-06 22:37:36 PST

Small fix to get building on OS X.
Comment 16 Sean Stangl [:sstangl] 2012-02-07 13:09:23 PST

Backed out: although the check-* tests for v8 pass individually, this patch prevented the harness from running to completion. Will debug that and re-land.
Comment 17 Sean Stangl [:sstangl] 2012-02-07 15:39:52 PST

The bug involved argument padding for native calls -- since the Native functions take an argc parameter, padding with |undefined| and lying about argc is incorrect.
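(The argc-sensitivity is observable from script. The example below is standard JavaScript semantics, independent of this patch: a native receives argc directly, so padding a call with |undefined| while inflating argc produces a different result than the unpadded call.)

```javascript
// Math.max is a native that reads argc: with no arguments it returns
// -Infinity, but a padded call sees argc == 1 and coerces undefined to NaN.
console.log(Math.max());           // -Infinity (argc == 0)
console.log(Math.max(undefined));  // NaN (argc == 1)
```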
