Closed Bug 1108290 Opened 5 years ago Closed 4 years ago

Optimize f.apply(x, array) in Ion

Categories

(Core :: JavaScript Engine: JIT, defect)

Tracking

Status: RESOLVED FIXED
Target milestone: mozilla45
Tracking status: firefox45 --- fixed

People

(Reporter: jandem, Assigned: lth)

Attachments

(5 files, 3 obsolete files)

Two issues I noticed while profiling crypto.swf (bug 1103441):

(1) Shumway uses .asApply(). I think most of our |apply| optimizations apply only to JSOP_FUNAPPLY (a literal .apply call), so they won't kick in here. This is pretty silly and no longer necessary, I think; it'd be nice if we could remove JSOP_FUNCALL/JSOP_FUNAPPLY completely.

(2) Ion has optimizations for f.apply(x, arguments) but we should also inline f.apply(x, array).

On crypto.swf this matters for the method.asApply(..) call in asCallProperty, it's executed > 10 million times so it's pretty hot.
Also, in this Shumway function, |method| is sometimes String.prototype.charCodeAt and |receiver| is a String object. The mix of scripted and native callees will make the asApply a bit harder to optimize, and passing a String object to charCodeAt is also not ideal (a few % under ToStringSlow)...
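For context, a minimal sketch of the call shape under discussion. The `dispatch` helper and the call count are illustrative, not Shumway's actual code; the point is that an aliased apply compiles to a plain JSOP_CALL rather than JSOP_FUNAPPLY:

```javascript
// Sketch of the hot call shape: a generic dispatcher that forwards an
// argument array via Function.prototype.apply. An aliased apply (as
// Shumway's asApply effectively is) is not a literal `f.apply(...)`
// call site, so it misses the JSOP_FUNAPPLY-only fast paths.
function dispatch(method, receiver, args) {
  return method.apply(receiver, args);
}

// Mixing native and scripted callees at the same call site, as in the
// Shumway case described above.
var viaNative = dispatch(String.prototype.charCodeAt, "abc", [1]);
var viaScripted = dispatch(function (x) { return x * 2; }, null, [21]);
```
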
(In reply to Jan de Mooij [:jandem] from comment #0)
> (1) Shumway uses .asApply(). I think most of our |apply| optimizations are
> only used for JSOP_FUNAPPLY (.apply), so they won't work. This is pretty
> silly and no longer necessary I think, it'd be nice if we could remove
> JSOP_FUNCALL/JSOP_FUNAPPLY completely.

By removing them, do you mean moving this logic into jsop_call, checking whether the callee corresponds to the apply function?

> (2) Ion has optimizations for f.apply(x, arguments) but we should also
> inline f.apply(x, array).

For calling, this should be easy to do (except on x86), as we duplicate the vector of arguments before making a call.
For inlining it, this would be harder as we would have to modify the way we inline functions and do bailouts.
Attached patch WIP - optimize apply with arrays (obsolete) — Splinter Review
This optimizes calling fun.apply(self, array). There is still some stuff that I need to figure out. We probably need an overflow check in case the array is too large. Do I need to add a type barrier for loading the values? It seems to me that the callee would usually check the typeset of its arguments. What do you think of the approach with a template?
Comment on attachment 8542626 [details] [diff] [review]
WIP - optimize apply with arrays

Review of attachment 8542626 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Tom Schuster [:evilpie] from comment #3)
> This optimizes calling fun.apply(self, array). There is still some stuff
> that I need to figure out. We probably actually need an overflow check, if
> the array is too large.

For the array version, yes.  For the arguments version, this should already be limited by the trampoline.  I suggest you reuse the same limit, such that there is no additional check needed before choosing between the Baseline/Ion or the Invoke VM function.

> Do I need to add a type barrier for loading the
> values?

I don't think so, as there is no location associated with these loads in the bytecode of the caller.

> It seems to me that the callee would usually check the typeset of
> its arguments. What do you think of the approach with a template?

It is good but it would be nicer if the parametrized functions were smaller.

::: js/src/jit/CodeGenerator.cpp
@@ +3197,5 @@
>          ImmPtr ptr = ImmPtr(&JSFunction::class_);
>          bailoutCmpPtr(Assembler::NotEqual, objreg, ptr, apply->snapshot());
>      }
>  
> +    // XXX Arguments overflow check

This is not needed for LApplyArgs, as the size of the list of arguments is already limited by the caller, but it would be in LApplyArray.

@@ +3342,5 @@
> +    Label end;
> +    // Initialize the loop counter AND Compute the stack usage (if == 0)
> +    // Load the array length for the counter.
> +    Address length(elementsreg, ObjectElements::offsetOfLength());
> +    masm.load32(length, extraStackSpace);

This looks like you might want to use emitArgumentsCount with a different set of registers.

::: js/src/jit/TypePolicy.cpp
@@ +990,5 @@
>      _(Mix3Policy<ObjectPolicy<0>, BoxPolicy<1>, ObjectPolicy<2> >)      \
>      _(Mix3Policy<ObjectPolicy<0>, IntPolicy<1>, BoxPolicy<2> >)         \
>      _(Mix3Policy<ObjectPolicy<0>, IntPolicy<1>, IntPolicy<2> >)         \
>      _(Mix3Policy<ObjectPolicy<0>, ObjectPolicy<1>, IntPolicy<2> >)      \
> +    _(Mix3Policy<ObjectPolicy<0>, ObjectPolicy<1>, BoxPolicy<2> >)      \

nit: Move this one line above to keep it sorted.
Assignee: nobody → evilpies
Blocks: 1087917
AMD FX4100 @ 3.6GHz, x64.

applyArguments:         91 ms
applyArray w/o patch: 1640 ms
applyArray w/ patch:    94 ms
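The applyArguments.js/applyArray.js benchmark files are not attached to this bug; the sketch below is a plausible reconstruction of the applyArray shape only. The callee, the literal array, and the iteration count (scaled down here) are all assumptions:

```javascript
// Hypothetical reconstruction of the applyArray microbenchmark: call
// g.apply(null, xs) in a tight loop with a small literal array, so the
// cost of the apply itself dominates.
function g(a, b, c) { return a + b + c; }

function applyArray(n) {
  var xs = [1, 1, 1];
  var result = 0;
  for (var i = 0; i < n; i++)
    result += g.apply(null, xs);   // the call this bug optimizes
  return result;
}

var start = Date.now();
var result = applyArray(1000000);  // 1e6 iterations; the bug's runs used more
var elapsed = Date.now() - start;
```
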
Attached patch WIP v2 (obsolete) — Splinter Review
This is Tom's patch rebased and cleaned up a little (in line with nbp's suggestions) and with code to support 32-bit.

Nbp, is anything missing here?  I still need to write (quite a lot of) test cases but feedback on whether something is obviously left out would be good.

(The templated function probably still pays for itself, even if there are a lot of specializations around it.)
Attachment #8542626 - Attachment is obsolete: true
Attachment #8556569 - Flags: feedback?(nicolas.b.pierron)
Assignee: evilpies → lhansen
Just as a side-note, I made some changes myself to fun.apply in Bug 1112161.  So there is likely a merge conflict coming soonish. :/
Comment on attachment 8556569 [details] [diff] [review]
WIP v2

Review of attachment 8556569 [details] [diff] [review]:
-----------------------------------------------------------------

I think these should be all the locations which have to be patched, as the marking of the stack should be handled the same way as for any other fun.apply(…, arguments) call.

::: js/src/jit/CodeGenerator.cpp
@@ +3221,5 @@
> +// Do not bailout after the execution of this function since the stack no longer
> +// corresponds to what is expected by the snapshots.
> +void
> +CodeGenerator::emitPushArguments(LApplyArrayGeneric *apply, Register extraStackSpace,
> +                                 Register copyreg)

Seems that this signature is duplicated?

@@ +3248,5 @@
> +        masm.bind(&loop);
> +
> +        Register elementsreg = ToRegister(apply->getElements());
> +#if defined(JS_CODEGEN_X64)
> +        masm.loadPtr(BaseObjectElementIndex(elementsreg, count, -8), copyreg);

nit: I prefer sizeof over constants.
Attachment #8556569 - Flags: feedback?(nicolas.b.pierron) → feedback+
Depends on: 1112161
Performance situation unchanged; looks like this is worth cleaning up and landing:

applyArguments.js 
Time: 88; result: 30000000
Time: 87; result: 30000000

applyArray.js
Time: 1255; result: 30000000
Time: 1259; result: 30000000
Attached patch bug1108290-apply-array.patch (obsolete) — Splinter Review
(The rewrite of apply earlier this year made this code simpler.)

This is basically right, I think, though I still have to run the full test suite and test compile on all platforms.

There's one TODO item here as a comment in the code: there's a guard on the code that optimizes apply(...,arguments) that I don't quite understand, and the question is whether a similar guard would be appropriate and/or necessary for the code that optimizes apply(..., array).

Another question is whether the limit on the array length that I've chosen (corresponds to what baseline and the intrinsic are using) is reasonable for jitted code, or if a shorter limit would be appropriate.

Also, I have some test cases coming.

(In summary, I expect one more round of reviews.)
Attachment #8556569 - Attachment is obsolete: true
Attachment #8689463 - Flags: review?(nicolas.b.pierron)
Comment on attachment 8689463 [details] [diff] [review]
bug1108290-apply-array.patch

Pulling review for now for these additional issues:

- the use of Operand() in a guard is not portable, it fails to compile on
  ARM64 and asserts on ARM
- the original patch on this bug uses js_JitOptions.maxStackArgs as a limit
  on the array length, this seems more reasonable

Also need to make sure we don't regress (badly) the case where length exceeds initializedLength.
Attachment #8689463 - Flags: review?(nicolas.b.pierron)
Attached file Test case: headroom.js
This tests fun.apply() with an array where length > initializedLength.  This is the worst (realistic) case for the bailout: there is just one value in the array and the function just negates its one argument.

On current m-i, this takes 35ms on my system.  With the patch in place it takes 73ms, so this particular case regresses significantly - the running time doubles.  If we think this case is important (I do not) then more work is needed to mitigate this.

(With this test case and length == initializedLength, m-i uses 22ms without the patch and 2ms with the patch.  So that's nice.)
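The headroom.js attachment itself is not inlined in this page; the following is a reconstruction of the worst case it describes (one value in the array, a function that just negates its one argument), with names and counts chosen for illustration:

```javascript
// Sketch of the headroom.js worst case: an array whose length exceeds
// its initialized length. The optimized apply path cannot safely read
// the uninitialized tail, so it bails out on every call.
function neg(x) { return -x; }

var xs = [1];
xs.length = 2;   // length 2, but only xs[0] is initialized

function f(n) {
  var sum = 0;
  for (var i = 0; i < n; i++)
    sum += neg.apply(null, xs);   // trailing slot reads as undefined
  return sum;
}

var total = f(1000);
```
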
A little bit of churn on the MIPS side, so hoping :hev can take a look at that, as I'm not set up to test compile locally.
Attachment #8690088 - Flags: review?(r)
Attachment #8690088 - Flags: review?(nicolas.b.pierron)
Attachment #8689463 - Attachment is obsolete: true
Attachment #8690089 - Flags: review?(nicolas.b.pierron)
Comment on attachment 8690088 [details] [diff] [review]
Add sub32 to the common MacroAssembler

Review of attachment 8690088 [details] [diff] [review]:
-----------------------------------------------------------------

I just looked at the MIPS code.

::: js/src/jit/mips64/MacroAssembler-mips64.cpp
@@ +926,3 @@
>  {
>      ma_subu(dest, dest, imm);
>  }

Can we remove MacroAssemblerMIPSCompat::ma_sub and MacroAssemblerMIPS64Compat::ma_sub32, and use MacroAssemblerMIPSShared::ma_subu instead?
Attachment #8690088 - Flags: review?(r) → review+
(In reply to Heiher [:hev] from comment #18)
> Comment on attachment 8690088 [details] [diff] [review]
> Add sub32 to the common MacroAssembler
> 
> Review of attachment 8690088 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Can we remove MacroAssemblerMIPSCompat::ma_sub and
> MacroAssemblerMIPS64Compat::ma_sub32, use MacroAssemblerMIPSShared::ma_subu?

Sure, I'll make the change before I land.
Comment on attachment 8690088 [details] [diff] [review]
Add sub32 to the common MacroAssembler

Review of attachment 8690088 [details] [diff] [review]:
-----------------------------------------------------------------

Side note: This patch factors sub32 into the generic MacroAssembler, and it adds "MacroAssembler::sub32(const Address&, Register)" for the next patch.

::: js/src/jit/MacroAssembler.h
@@ +719,5 @@
> +    // Arithmetic functions
> +
> +    inline void sub32(const Address& src, Register dest) PER_SHARED_ARCH;
> +    inline void sub32(Register src, Register dest) PER_SHARED_ARCH;
> +    inline void sub32(Imm32 imm, Register dest) PER_SHARED_ARCH;

nit: Follow the same order as in the -inl.h files.
Attachment #8690088 - Flags: review?(nicolas.b.pierron) → review+
Comment on attachment 8690089 [details] [diff] [review]
Inline fun.apply(..., array) in Ion

Review of attachment 8690089 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit-test/tests/basic/testApplyArrayInline.js
@@ +5,5 @@
> +}
> +
> +function f(xs) {
> +    var sum = 0;
> +    for ( var i=0 ; i < 10000 ; i++ )

nit: use setJitCompilerOption or inIon to avoid baking in high limits.

@@ +49,5 @@
> +    print("No bailout: " + (B - A));
> +    print("Bailout: " + (C - B));
> +    print("Short: " + (D - C));
> +    assertEq((C - B) > (B - A), true);
> +    assertEq((C - B) > (D - C), true);

This is time-sensitive, use inIon() to check if we remained in Ion generated code after the loop.

var inIonResult = { between: false, return: false };
function g(a, b, c) { return a + b + c + 1; }  // 1 + 2 + 3 + 1 == 7

function f(xs) {
  var sum = 0;
  for (var i = 0; i < 100; i++) {
    inIonResult.between = inIonResult.between || inIon();
    sum += g.apply(null, xs);
  }
  inIonResult.return = inIon();
  return sum;
}

assertEq(f([1,2,3]), 7*100);
assertEq(inIonResult.between && !inIonResult.return, inIonResult.between);

which also works, even if Ion is disabled.

::: js/src/jit/CodeGenerator.cpp
@@ +3618,5 @@
> +    Register tmp = ToRegister(apply->getTempObject());
> +
> +    Address length(ToRegister(apply->getElements()), ObjectElements::offsetOfLength());
> +    masm.load32(length, tmp);
> +    bailoutCmp32(Assembler::Above, tmp, Imm32(js_JitOptions.maxStackArgs), snapshot);

Would it make sense to have an OOL path for that?
Attachment #8690089 - Flags: review?(nicolas.b.pierron) → review+
(In reply to Nicolas B. Pierron [:nbp] from comment #20)
> Comment on attachment 8690088 [details] [diff] [review]
> Add sub32 to the common MacroAssembler
> 
> Review of attachment 8690088 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Side note: This patch factor sub32 to the generic MacroAssembler, and it
> adds  "MacroAssembler::sub(const Address&, Register)" for the next patch.

nbp: I don't think I understand what you mean by this.  Did you mean to reference another bug with this comment?
Flags: needinfo?(nicolas.b.pierron)
(In reply to Nicolas B. Pierron [:nbp] from comment #21)
> Comment on attachment 8690089 [details] [diff] [review]
> Inline fun.apply(..., array) in Ion
> 
> Review of attachment 8690089 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> ::: js/src/jit/CodeGenerator.cpp
> @@ +3618,5 @@
> > +    Register tmp = ToRegister(apply->getTempObject());
> > +
> > +    Address length(ToRegister(apply->getElements()), ObjectElements::offsetOfLength());
> > +    masm.load32(length, tmp);
> > +    bailoutCmp32(Assembler::Above, tmp, Imm32(js_JitOptions.maxStackArgs), snapshot);
> 
> Would it make sense to have an OOL path for that?

I'm guessing probably not, since maxStackArgs is 4096 by default and at that point the overhead of the bailout will be tiny compared to the cost of the operation itself.
(In reply to Lars T Hansen [:lth] from comment #22)
> (In reply to Nicolas B. Pierron [:nbp] from comment #20)
> > Review of attachment 8690088 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > Side note: This patch factor sub32 to the generic MacroAssembler, and it
> > adds  "MacroAssembler::sub(const Address&, Register)" for the next patch.
> 
> nbp: I don't think I understand what you mean by this.  Did you mean to
> reference another bug with this comment?

Sorry, I suggest you split the patch into 2 parts if possible, such that one deals with the sub32 macro-assembler technical debt, and the second adds the extra function.

(In reply to Lars T Hansen [:lth] from comment #23)
> (In reply to Nicolas B. Pierron [:nbp] from comment #21)
> > ::: js/src/jit/CodeGenerator.cpp
> > > +    bailoutCmp32(Assembler::Above, tmp, Imm32(js_JitOptions.maxStackArgs), snapshot);
> > 
> > Would it make sense to have an OOL path for that?
> 
> I'm guessing probably not, since maxStackArgs is 4096 by default and at that
> point the overhead of the bailout will be tiny compared to the cost of the
> operation itself.

Ok, let's go with that, but be aware that the cost of a bailout is the time spent out of IonMonkey, which includes the time spent in Baseline (which can be a lot if we are banned from Ion).
Flags: needinfo?(nicolas.b.pierron)
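To make the trade-off above concrete, a sketch of an over-limit call. The 5000-element array is an arbitrary value above the 4096 default mentioned in the discussion; the exact threshold and fallback path are engine details, and semantically the call works either way:

```javascript
// An apply whose argument array exceeds the engine's stack-args limit
// takes the slow path (or bails out of Ion) instead of the inlined fast
// path, but the result is the same.
function count() { return arguments.length; }

var big = new Array(5000);
big.fill(0);
var n = count.apply(null, big);
```
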
Depends on: 1228421
Apparently this made dromaeo-jslib-traverse-prototype on Win8 on AWFY 45% faster!
(In reply to Guilherme Lima from comment #28)
> Apparently this made dromaeo-jslib-traverse-prototype on Win8 on AWFY 45%
> faster!

Well alright then :)
(In reply to Lars T Hansen [:lth] from comment #29)
> (In reply to Guilherme Lima from comment #28)
> > Apparently this made dromaeo-jslib-traverse-prototype on Win8 on AWFY 45%
> > faster!
> 
> Well alright then :)

prototype.js uses fn.apply(..., array) here and there - for its substitution of "bind", and for its Responder framework, and elsewhere.  In most cases those arrays are either literal arrays (and Function.prototype.call might have been the better API to use, though it might be slow too) or the result of munging the arguments object; in either case the arrays will tend to be short.  Inlining apply is unsurprisingly a big win in those cases.

Once the spread operator becomes more widely available apply will be used less, but it will still be used when a this object needs to be specified as part of the call, as in many of the cases in prototype.js.
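A sketch of the prototype.js-style patterns described above. The `bindLike` helper is a made-up stand-in for prototype.js's bind substitute, not its actual code; it shows the short-array apply calls that this optimization speeds up:

```javascript
// A bind-like helper: capture leading arguments, then forward a munged
// arguments object to the target via apply with an explicit receiver.
// The forwarded arrays are short, matching the common case in the bug.
function bindLike(fn, self) {
  var partial = Array.prototype.slice.call(arguments, 2);
  return function () {
    var args = partial.concat(Array.prototype.slice.call(arguments));
    return fn.apply(self, args);   // hot apply(self, array) call
  };
}

var obj = { base: 10, add: function (x, y) { return this.base + x + y; } };
var add5 = bindLike(obj.add, obj, 5);
var r = add5(2);   // 10 + 5 + 2
```
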
In light of the discussion on bug 1235092 comment 7 I'm now not completely sure that this optimization is correct as it stands (apart from the register allocation issue in bug 1231024): I have done nothing explicit to guard against changes to the array iterator protocol, cf. arai's last four bullets:

  * array[@@iterator] is not modified
  * the array's prototype is Array.prototype
  * Array.prototype[@@iterator] is not modified
  * %ArrayIteratorPrototype%.next is not modified

and I'm not yet sure that TI guarantees this in my case.
arai corrects me: the issue here is that spread is explicitly defined in terms of iteration, not indexed access, but apply is backward-compatible and uses indexed access only. So we should be OK.
Depends on: 1351278