IonMonkey: Make native calls faster

RESOLVED FIXED

Status

()

Core
JavaScript Engine
RESOLVED FIXED
6 years ago
5 years ago

People

(Reporter: jandem, Assigned: sstangl)

Tracking

(Blocks: 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment, 3 obsolete attachments)

(Reporter)

Description

6 years ago
Ion+TI runs Kraken imaging-darkroom slower than JM+TI, because our native calls are slower. The benchmark calls Math.log/Math.pow 3844801 times; we can measure the overhead of this with the following micro-benchmark:
--
function f() {
    for (var i=0; i<3844801; i++) {
        Math.log(0.4);
    }
}
var t = new Date;
f();
print(new Date - t);
--

JM+TI:  87 ms (65% in math_log)
IM+TI: 210 ms (23% in math_log)

We spend a lot of time in js::Invoke, ContextStack::onTop, ContextStack::pushInvokeArgs etc. This seems to be responsible for most of our slowdown vs JM+TI.

FWIW, V8 inlines/specializes Math.log/Math.pow - we can do something similar, but this bug is about native calls in general, not specifically Math.log/Math.pow.
(Assignee)

Updated

6 years ago
Assignee: general → sstangl
(Assignee)

Comment 1

6 years ago
Created attachment 593221 [details] [diff] [review]
WIP patch.

This is basically done. It needs factoring out some common patterns, but otherwise works. The test case runs at the same speed with '--ion -n' as with '-m -n'.

It can't land yet until OSI changes are made -- since we're not going through callVM(), there is no call to a generateReturnWrapper() that can have a return address passed. We could fake-call the next instruction, but that's a bit silly. OSI is changing soon anyway, so blocking on that.

Passes all tests except basic/arrayConcat.js, which requires native invalidation.
(Assignee)

Updated

6 years ago
Depends on: 722238
(Assignee)

Comment 2

6 years ago
This is about a 4% speedup on SS.
(Assignee)

Comment 3

6 years ago
Created attachment 594029 [details] [diff] [review]
Patch v1.

Handles native calls on all platforms in an efficient way, creating a fake exit frames. Passes arrayConcat.js (invalidation). ARM code is written and compiles and should work, but can't be tested because ARM segfaults for unrelated reasons in generate().

On x86, jandem's test with an iteration counter of 10,000,000 gives '-m -n' a score of 262; '--ion -n' gets 258, both results consistent.
Attachment #593221 - Attachment is obsolete: true
Attachment #594029 - Flags: review?(dvander)
(Assignee)

Comment 4

6 years ago
Created attachment 594034 [details] [diff] [review]
Patch v2

Actual patch.
Attachment #594029 - Attachment is obsolete: true
Attachment #594029 - Flags: review?(dvander)
Attachment #594034 - Flags: review?(dvander)
Comment on attachment 594029 [details] [diff] [review]
Patch v1.

Review of attachment 594029 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good but there's a bunch of little things that should probably be addressed first:

::: js/src/ion/CodeGenerator.cpp
@@ +454,5 @@
> +    const Register argVpReg = ToRegister(argVp);
> +
> +    // Misc. temporary registers.
> +    const LAllocation *temp = call->getTempReg();
> +    const Register tempReg = ToRegister(temp);

I'd just do ToRegister(call->getTempReg())

@@ +458,5 @@
> +    const Register tempReg = ToRegister(temp);
> +
> +    DebugOnly<uint32> initialStack = masm.framePushed();
> +
> +#ifdef JS_CPU_ARM

Does this need to be #ifdef ARM? Might be nicer to make it an empty call on x86/x64.

@@ +469,5 @@
> +    // are the function arguments.
> +
> +    // Allocate space for the outparam, moving the StackPointer to &vp[0].
> +    unusedStack -= sizeof(Value);
> +    masm.adjustStack(unusedStack);

This must push a Value containing the callee object. Natives are allowed to access their callee before setting rval.

::: js/src/ion/IonBuilder.cpp
@@ +130,5 @@
>      return state;
>  }
>  
> +JSFunction *
> +IonBuilder::getSingleCallTarget(uint32 argc, jsbytecode *pc, bool inlining)

Why does this take in an |argc|?

@@ +153,5 @@
>      }
>  
>      if (calleeTypes->getObjectCount() > 1) {
> +        if (inlining)
> +            IonSpew(IonSpew_Inlining, "Cannot inline due to multiple objects");

Using the bool parameter for this seems unwieldy. I'd just switch these to IonSpew_MIR, and change "Cannot inline" to "Cannot determine single call target". Or just remove the spew, it's not associated with any pc or line number so it doesn't seem useful.

@@ +2392,4 @@
>      // Bytecode order: Function, This, Arg0, Arg1, ..., ArgN, Call.
> +    for (int i = argc; i >= 0; i--)
> +        call->addArg(i, current->pop()->toPassArg());
> +    call->initFunction(current->pop());

If constructing, |thisv| must be replaced with a Magic value with why=JS_IS_CONSTRUCTING. You can do this either in MIR or in the CodeGenerator (by just writing to the |thisv| slot like JM, see MonoIC.cpp generateNative).

::: js/src/ion/Lowering.cpp
@@ +177,5 @@
> +        JSObject &obj = funcdef->toConstant()->value().toObject();
> +        if (!obj.isFunction())
> +            break;
> +
> +        target = obj.toFunction();

It might look cleaner to extract this code out as a separate function, like CheckKnownCallee(MDefinition *) or something.

@@ +186,5 @@
> +
> +    // Monomorphic native calls lower to LCallNative.
> +    if (target->isNative()) {
> +        LCallNative *lcall = new LCallNative(target, argslot,
> +#if defined(JS_CPU_X64)

We should try to avoid #ifdefs like this, especially in really generic files like this one. If this optimization makes a big difference (probably unlikely), it's worth making CPU-specific lowering functions.

::: js/src/ion/arm/MacroAssembler-arm.cpp
@@ +1047,5 @@
>  
> +uint32
> +MacroAssemblerARMCompat::buildFakeExitFrame(const Register &scratch)
> +{
> +    DebugOnly<uint32> initialDepth = framePushed();

Same comments as on x86.

::: js/src/ion/shared/MacroAssembler-x86-shared.h
@@ +200,5 @@
>      }
> +
> +    // Builds an exit frame on the stack, with a return address to an internal
> +    // non-function. Returns offset to be passed to markSafepointAt().
> +    uint32 buildFakeExitFrame(const Register &scratch) {

I think this function can be a lot simpler:

    Label label;
    Push(Imm32(descriptor));
    call(&label);
    bind(&label);

If it's followed immediately by a markSafepoint(), you don't even need to return anything.

::: js/src/ion/x64/MacroAssembler-x64.h
@@ +605,4 @@
>              unboxNonDouble(Operand(address), dest.gpr());
>      }
>  
> +    void loadInstructionPointerAfterCall(const Register &scratch) {

s/scratch/dest/

::: js/src/ion/x86/MacroAssembler-x86.h
@@ +549,4 @@
>          shrl(imm, dest);
>      }
>  
> +    void loadInstructionPointerAfterCall(const Register &scratch) {

s/scratch/dest/
Attachment #594029 - Attachment is obsolete: false
Attachment #594034 - Flags: review?(dvander)
Note on the JS_IS_CONSTRUCTING bit: that is only for when calling natives.
(Assignee)

Comment 7

6 years ago
(In reply to David Anderson [:dvander] from comment #5)
> I think this function can be a lot simpler:
> 
>     Label label;
>     Push(Imm32(descriptor));
>     call(&label);
>     bind(&label);
> 
> If it's followed immediately by a markSafepoint(), you don't even need to
> return anything.

The branch predictor on Intel chips wants each call to be matched with a ret. I can measure both ways, but there are strongly-worded warnings in Intel documentation to absolutely never call non-functions.
Would it work to just have a bogus ret as well then? (leaveFakeExitFrame?)
(Assignee)

Comment 9

6 years ago
Not reasonably. The ret would go to the code after the call, which would have to be a jump to after leaveFakeExitFrame().

But the code after the call has to be from markSafepoint(), and we probably don't want that code to be re-executed. Since we can't have both a jump and markSafepoint() code immediately following the call, the jump has to be after markSafepoint(), so we wind up not only using space in the branch predictor, but also executing more instructions.

Furthermore, that jump then has to be bound in visitCallNative(), but not in the case of ARM, so visitCallNative() becomes platform-specific, at least with #ifdefs.

It would be breaking up buildFakeExitFrame() into two halves, and requiring the caller to stitch them together. Keeping the frame building code close together seems simpler.
Random idea: do we even need a call instruction? We could just push the next offset, and make sure in C++ to call the right overload to getSafepointIndex/getOsiIndex. For debugging we will need some way to differentiate C++ exit frames from JSNative exit frames, that could be a bit in the frame descriptor, telling us how to interpret the return address.
(Assignee)

Comment 11

6 years ago
That's a good idea.

I was talking with Marty last night, and it's likely the (untested) ARM implementation of markSnapshot() is wrong -- because of the moving pools, currentOffset() can never be called, since the code can be shifted to make way for a pool. We could always patch it in later, though.

If we fix that, keeping a unique offset around would be sufficient for the native case.
(Assignee)

Comment 12

6 years ago
Created attachment 594853 [details] [diff] [review]
Patch v3.

Magic value placed in MIR; callee value pushed as part of codegen. X64-specific optimization removed -- it made about 4ms difference in a test suite with 1,000,000 iterations.

The logic to create the exit frame was kept, since it works on x86/x64, is no more broken than ARM, and lets this patch not touch OSI. Will file a follow-up bug for that work once this lands.
Attachment #594029 - Attachment is obsolete: true
Attachment #594034 - Attachment is obsolete: true
Attachment #594853 - Flags: review?(dvander)
Comment on attachment 594853 [details] [diff] [review]
Patch v3.

Review of attachment 594853 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/ion/Lowering.cpp
@@ +191,5 @@
> +                tempFixed(CallTempReg0), tempFixed(CallTempReg1), tempFixed(CallTempReg2),
> +                tempFixed(CallTempReg3));
> +        if (!defineReturn(lcall, call))
> +            return false;
> +        ins = (LInstruction *)lcall;

Are these casts needed?
Attachment #594853 - Flags: review?(dvander) → review+
(Assignee)

Comment 14

6 years ago
http://hg.mozilla.org/projects/ionmonkey/rev/f46cfb199e77
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
(Assignee)

Comment 15

6 years ago
http://hg.mozilla.org/projects/ionmonkey/rev/2ce3daef5bd2

Small fix to get building on OS X.
(Assignee)

Comment 16

6 years ago
http://hg.mozilla.org/projects/ionmonkey/rev/1c5fcda56a00

Backed out: although the check-* tests for v8 pass individually, this patch prevented the harness from running to completion. Will debug that and re-land.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 17

6 years ago
hg.mozilla.org/projects/ionmonkey/rev/41382184b0f5

The bug involved argument padding for native calls -- since the Native functions take an argc parameter, padding with |undefined| and lying about argc is incorrect.
Status: REOPENED → RESOLVED
Last Resolved: 6 years ago6 years ago
Resolution: --- → FIXED
Depends on: 767679
You need to log in before you can comment on or make changes to this bug.