BaselineCompiler: Compile JSOP_SPREAD* ops

RESOLVED FIXED in mozilla32

People

(Reporter: arai, Assigned: arai)

Tracking

Version: Trunk
Target Milestone: mozilla32

Details

Attachments

(2 attachments, 6 obsolete attachments)

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:29.0) Gecko/20100101 Firefox/29.0 (Beta/Release)
Build ID: 20140421221237

Steps to reproduce:

(separated from bug 889456)

We need JSOP_SPREADCALL, JSOP_SPREADNEW and JSOP_SPREADEVAL.
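For reference, the three source forms that emit these ops look like this (a sketch; the op-name mapping follows the bytecode names above, and the variable names are illustrative):

```javascript
// Spread in a call, a constructor call, and a direct eval — the three
// forms handled by JSOP_SPREADCALL, JSOP_SPREADNEW and JSOP_SPREADEVAL.
function sum(a, b, c, d, e) { return a + b + c + d + e; }
const callResult = sum(1, 2, ...[3, 4, 5]);   // JSOP_SPREADCALL

function A(...args) { this.v = args.length; }
const ctorResult = new A(1, ...[2, 3]);       // JSOP_SPREADNEW

const evalResult = eval(...["1 + 2"]);        // JSOP_SPREADEVAL
```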
Blocks: 872184, 1000780
OS: Mac OS X → All
Hardware: x86 → All
Implemented JSOP_SPREAD* in the Baseline Compiler.
  Part1: fallback stub (ICCall_Fallback)
  Part2: optimized stubs (ICCall_Scripted and ICCall_Native)

Also re-enabled js/src/jit-test/tests/basic/spread-call-maxarg.js
on ASAN and debug builds, because bug 919948 has already landed.


jstests and jit-tests (--tbpl) pass on Mac OS X 10.9.2 64-bit.
A performance comparison with the --baseline-eager flag follows:
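Note: elapsed() is a SpiderMonkey shell builtin returning microseconds. To rerun these micro-benchmarks outside the shell, a rough stand-in can be used (an assumption on my part; millisecond precision is enough to see the relative differences):

```javascript
// Portable stand-in for the shell's elapsed() builtin (microseconds).
function elapsed() { return Date.now() * 1000; }

let s = elapsed();
for (let i = 0; i < 500000; i++) Math.min(1, 2, ...[3, 4, 5]);
const totalMicros = elapsed() - s;
```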

1. SPREADCALL/CALL

  SPREADCALL (scripted):
    function f(a, b, c, d, e) { return a + b + c + d + e; } let s = elapsed(); for (let i = 0; i < 500000; i ++) f(1, 2, ...[3, 4, 5]); print(elapsed() - s);
  result:
                per-iter (total)
    original :  0.866686 (433342) :     --
    part1    :  0.590597 (295298) :  46.7% faster
    part1+2  :  0.439997 (219998) :  97.0% faster

  CALL (scripted):
    function f(a, b, c, d, e) { return a + b + c + d + e; } let s = elapsed(); for (let i = 0; i < 500000000; i ++) f(1, 2, 3, 4, 5); print(elapsed() - s);
  result:
    original :  0.000282 (141087) :     --
    part1    :  0.000282 (140887) :   0.1% faster
    part1+2  :  0.000282 (140970) :   0.1% faster

  SPREADCALL (native):
    let s = elapsed(); for (let i = 0; i < 500000; i ++) Math.min(1, 2, ...[3, 4, 5]); print(elapsed() - s);
  result:
    original :  0.831337 (415668) :     --
    part1    :  0.540356 (270178) :  53.8% faster
    part1+2  :  0.455972 (227986) :  82.3% faster

  CALL (native):
    let s = elapsed(); for (let i = 0; i < 500000000; i ++) Math.min(1, 2, 3, 4, 5); print(elapsed() - s);
  result:
    original :  0.000283 (141476) :     --
    part1    :  0.000282 (140910) :   0.4% faster
    part1+2  :  0.000282 (140967) :   0.4% faster

  SPREAD only:
    let s = elapsed(); for (let i = 0; i < 500000; i ++) [1, 2, ...[3, 4, 5]]; print(elapsed() - s);
  result:
    original :  0.397787 (198893) :     --
    part1    :  0.385565 (192782) :   3.2% faster
    part1+2  :  0.391209 (195604) :   1.7% faster


2. SPREADNEW/NEW

  SPREADNEW (scripted):
    function A(a, b, c, d, e) { this.v = a + b + c + d + e; } let s = elapsed(); for (let i = 0; i < 500000; i ++) new A(1, 2, ...[3, 4, 5]); print(elapsed() - s);
  result:
                per-iter (total)
    original :  1.055098 (527549) :     --
    part1    :  0.805464 (402732) :  31.0% faster
    part1+2  :  0.563302 (281651) :  87.3% faster

  NEW (scripted):
    function A(a, b, c, d, e) { this.v = a + b + c + d + e; } let s = elapsed(); for (let i = 0; i < 50000000; i ++) new A(1, 2, 3, 4, 5); print(elapsed() - s);
  result:
    original :  0.005144 (257184) :     --
    part1    :  0.005128 (256405) :   0.3% faster
    part1+2  :  0.005138 (256909) :   0.1% faster

  SPREADNEW (native):
    let s = elapsed(); for (let i = 0; i < 500000; i ++) new Array(1, 2, ...[3, 4, 5]); print(elapsed() - s);
  result:
    original :  0.899546 (449773) :     --
    part1    :  0.753050 (376524) :  19.5% faster
    part1+2  :  0.553810 (276904) :  62.4% faster

  NEW (native):
    let s = elapsed(); for (let i = 0; i < 500000; i ++) new Array(1, 2, 3, 4, 5); print(elapsed() - s);
  result:
    original :  0.008495 (4247) :     --
    part1    :  0.008479 (4239) :   0.2% faster
    part1+2  :  0.008464 (4231) :   0.4% faster


3. SPREADEVAL/EVAL

  SPREADEVAL:
    let s = elapsed(); for (let i = 0; i < 100000; i ++) eval(...["1 + 2"]); print(elapsed() - s);
  result:
                per-iter (total)
    original :  8.786327 (878632) :     --
    part1    :  8.509529 (850952) :   3.3% faster
    part1+2  :  8.521812 (852181) :   3.1% faster

  EVAL:
    let s = elapsed(); for (let i = 0; i < 100000; i ++) eval("1 + 2"); print(elapsed() - s);
  result:
    original :  7.867457 (786745) :     --
    part1    :  7.867739 (786773) :   0.0% slower
    part1+2  :  7.879573 (787957) :   0.2% slower

  SPREAD only:
    let s = elapsed(); for (let i = 0; i < 100000; i ++) [...["1 + 2"]]; print(elapsed() - s);
  result:
    original :  0.224546 (22454) :     --
    part1    :  0.213659 (21365) :   5.1% faster
    part1+2  :  0.216147 (21614) :   3.9% faster
Attachment #8415208 - Flags: review?(jdemooij)
Status: UNCONFIRMED → NEW
Ever confirmed: true
Comment on attachment 8415208 [details] [diff] [review]
Part1: Implement JSOP_SPREAD* fallback stubs in the baseline compiler.

Review of attachment 8415208 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for working on this!

::: js/src/jit/BaselineCompiler.cpp
@@ +2531,5 @@
> +{
> +    JS_ASSERT(IsCallPC(pc));
> +
> +    frame.syncStack(0);
> +    masm.mov(ImmWord(1), R0.scratchReg());

Nit: masm.move32(Imm32(1), ..);

While you're here, can you also change the code in emitCall to use move32(Imm32( ?

::: js/src/jit/BaselineIC.cpp
@@ +8218,2 @@
>  
> +    JS_ASSERT(isSpread || argc == GET_ARGC(pc));

Nit: MOZ_ASSERT_IF is a bit more readable: MOZ_ASSERT_IF(!isSpread, argc == GET_ARGC(pc));

@@ +8221,5 @@
>      RootedValue callee(cx, vp[0]);
>      RootedValue thisv(cx, vp[1]);
>  
> +    Value *args;
> +    AutoValueVector argv(cx);

99.99% of calls are not spread calls. This patch adds quite a bit of code to this path. Could you add a separate DoSpreadCallFallback for spread calls? You may have to duplicate *some* lines of code but I think it will be worth it.

We can still share the ICCall_Fallback etc code; we just need to override getKey on its Compiler class like you did in the second patch in this bug.

Let me know if you need help with this.

@@ +8238,5 @@
> +        argv.resize(argc);
> +        args = argv.begin();
> +
> +        if (!GetElements(cx, aobj, argc, args))
> +            return false;

Another advantage of separating the two is that you can maybe more easily share this with the interpreter code. Can we add a separate SpreadCallOperation or something?
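For what it's worth, the observable behavior such a helper would need to implement can be modeled at the JS level (a sketch with hypothetical names; the real helper would be C++ in Interpreter.cpp):

```javascript
// Hypothetical model of a SpreadCallOperation: by this point the spread
// operand has been flattened into a dense array of arguments, so the call
// reduces to an apply of the callee with the given |this|.
function spreadCallModel(callee, thisv, args) {
  return Function.prototype.apply.call(callee, thisv, args);
}
const modelResult = spreadCallModel(Math.min, null, [3, 4, 5]);
```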
Attachment #8415208 - Flags: review?(jdemooij)
Comment on attachment 8415209 [details] [diff] [review]
Part2: Implement JSOP_SPREAD* optimized stubs in the baseline compiler.

Clearing review for now as this patch will also need changes after the first patch is updated. Sorry for the review delay btw.
Attachment #8415209 - Flags: review?(jdemooij)
Thank you for reviewing!

Added DoSpreadCallFallback and SpreadCallOperation.
Also added MOZ_UNLIKELY to |if (isSpread) ...| in the shared generateStubCode implementations.

Then I removed the JSOP_SPREADNEW optimization in types::UseNewType. I noticed that the target code (such as |Sub.prototype = new Super(...args);|) is not a common case, so this optimization makes no sense for now.
So I dropped the types::UseNewType call in DoSpreadCallFallback while separating it from DoCallFallback.
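For context, the uncommon pattern the removed optimization targeted can be sketched as follows (illustrative names):

```javascript
// A prototype object created by |new| with spread arguments — the
// |Sub.prototype = new Super(...args);| shape mentioned above.
function Super(a, b, c) { this.sum = a + b + c; }
function Sub() {}
const args = [1, 2, 3];
Sub.prototype = new Super(...args);  // JSOP_SPREADNEW feeding a prototype
const inst = new Sub();              // inherits |sum| from the shared prototype
```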


jstests and jit-tests (--tbpl) pass on Mac OS X 10.9.2 64-bit.
A performance comparison with the --baseline-eager flag follows:

1. SPREADCALL/CALL

  SPREADCALL (scripted):
    function f(a, b, c, d, e) { return a + b + c + d + e; } let s = elapsed(); for (let i = 0; i < 500000; i ++) f(1, 2, ...[3, 4, 5]); print(elapsed() - s);
  result:
    original :  0.849620 (424810) :     --
    part1    :  0.584474 (292237) :  45.4% faster
    part1+2  :  0.438068 (219034) :  93.9% faster

  CALL (scripted):
    function f(a, b, c, d, e) { return a + b + c + d + e; } let s = elapsed(); for (let i = 0; i < 500000000; i ++) f(1, 2, 3, 4, 5); print(elapsed() - s);
  result:
    original :  0.000282 (140914) :     --
    part1    :  0.000282 (140851) :   0.0% faster
    part1+2  :  0.000282 (141014) :   0.1% slower

  SPREADCALL (native):
    let s = elapsed(); for (let i = 0; i < 500000; i ++) Math.min(1, 2, ...[3, 4, 5]); print(elapsed() - s);
  result:
    original :  0.830256 (415128) :     --
    part1    :  0.524412 (262206) :  58.3% faster
    part1+2  :  0.457374 (228687) :  81.5% faster

  CALL (native):
    let s = elapsed(); for (let i = 0; i < 500000000; i ++) Math.min(1, 2, 3, 4, 5); print(elapsed() - s);
  result:
    original :  0.000282 (140997) :     --
    part1    :  0.000282 (141020) :   0.0% slower
    part1+2  :  0.000282 (141144) :   0.1% slower

  SPREAD only:
    let s = elapsed(); for (let i = 0; i < 500000; i ++) [1, 2, ...[3, 4, 5]]; print(elapsed() - s);
  result:
    original :  0.398640 (199320) :     --
    part1    :  0.387716 (193858) :   2.8% faster
    part1+2  :  0.387056 (193528) :   3.0% faster


2. SPREADNEW/NEW

  SPREADNEW (scripted):
    function A(a, b, c, d, e) { this.v = a + b + c + d + e; } let s = elapsed(); for (let i = 0; i < 500000; i ++) new A(1, 2, ...[3, 4, 5]); print(elapsed() - s);
  result:
    original :  1.063494 (531747) :     --
    part1    :  0.807190 (403595) :  31.8% faster
    part1+2  :  0.557620 (278810) :  90.7% faster

  NEW (scripted):
    function A(a, b, c, d, e) { this.v = a + b + c + d + e; } let s = elapsed(); for (let i = 0; i < 50000000; i ++) new A(1, 2, 3, 4, 5); print(elapsed() - s);
  result:
    original :  0.005102 (255099) :     --
    part1    :  0.005107 (255347) :   0.1% slower
    part1+2  :  0.005106 (255282) :   0.1% slower

  SPREADNEW (native):
    let s = elapsed(); for (let i = 0; i < 500000; i ++) new Array(1, 2, ...[3, 4, 5]); print(elapsed() - s);
  result:
    original :  0.883216 (441608) :     --
    part1    :  0.627534 (313767) :  40.7% faster
    part1+2  :  0.579846 (289923) :  52.3% faster

  NEW (native):
    let s = elapsed(); for (let i = 0; i < 50000000; i ++) new Array(1, 2, 3, 4, 5); print(elapsed() - s);
  result:
    original :  0.006032 (301590) :     --
    part1    :  0.006041 (302067) :   0.2% slower
    part1+2  :  0.006024 (301199) :   0.1% faster


3. SPREADEVAL/EVAL

  SPREADEVAL:
    let s = elapsed(); for (let i = 0; i < 100000; i ++) eval(...["1 + 2"]); print(elapsed() - s);
  result:
    original :  8.449750 (844975) :     --
    part1    :  8.662900 (866290) :   2.5% slower
    part1+2  :  8.798130 (879813) :   4.0% slower

  EVAL:
    let s = elapsed(); for (let i = 0; i < 100000; i ++) eval("1 + 2"); print(elapsed() - s);
  result:
    original :  7.940280 (794028) :     --
    part1    :  7.815970 (781597) :   1.6% faster
    part1+2  :  7.966240 (796624) :   0.3% slower

  SPREAD only:
    let s = elapsed(); for (let i = 0; i < 1000000; i ++) [...["1 + 2"]]; print(elapsed() - s);
  result:
    original :  0.216936 (216936) :     --
    part1    :  0.207803 (207803) :   4.4% faster
    part1+2  :  0.210440 (210440) :   3.1% faster
Attachment #8415208 - Attachment is obsolete: true
Attachment #8418279 - Flags: review?(jdemooij)
and part2.
Attachment #8415209 - Attachment is obsolete: true
Attachment #8418280 - Flags: review?(jdemooij)
Comment on attachment 8418279 [details] [diff] [review]
Part1: Implement JSOP_SPREAD* fallback stubs in the baseline compiler.

Review of attachment 8418279 [details] [diff] [review]:
-----------------------------------------------------------------

Great job, looks perfect! Please post an updated patch addressing the minor comments below and we can land this.

::: js/src/jit/BaselineIC.cpp
@@ +8493,5 @@
> +        masm.pushValue(Address(argPtr, 0)); // args[0]
> +        masm.addPtr(Imm32(sizeof(Value)), argPtr);
> +        masm.pushValue(Address(argPtr, 0)); // this
> +        masm.addPtr(Imm32(sizeof(Value)), argPtr);
> +        masm.pushValue(Address(argPtr, 0)); // callee

Nice idea to unroll this instead of calling pushCallArguments. We can optimize this a little more to avoid the addPtrs though:

masm.pushValue(Address(argPtr, 0 * sizeof(Value))); // array
masm.pushValue(Address(argPtr, 1 * sizeof(Value))); // this
masm.pushValue(Address(argPtr, 2 * sizeof(Value))); // callee

::: js/src/vm/Interpreter.cpp
@@ +3940,5 @@
> +
> +bool
> +js::SpreadCallOperation(JSContext *cx, Handle<GlobalObject*> global, HandleScript script,
> +                        jsbytecode *pc, JSOp op, HandleValue thisv, HandleValue callee,
> +                        HandleValue arr, MutableHandleValue res)

You can remove the op argument, and just do:

JSOp op = JSOp(*pc);

in this function.

Also remove the "global" argument and use cx->global()->valueIsEval(..) below; it's guaranteed to be the same value.

In general fewer arguments is better, because it makes the callsites simpler and makes it less likely a caller passes the wrong values :)
Attachment #8418279 - Flags: review?(jdemooij) → review+
Comment on attachment 8418280 [details] [diff] [review]
Part2: Implement JSOP_SPREAD* optimized stubs in the baseline compiler.

Review of attachment 8418280 [details] [diff] [review]:
-----------------------------------------------------------------

Nice work again :) Some comments below to make this even faster by doing more in JIT code, let me know if you need help with this.

::: js/src/jit/BaselineIC.cpp
@@ +8112,1 @@
>          if (!GetTemplateObjectForNative(cx, script, pc, fun->native(), args, &templateObject))

Change this to:

RootedObject templateObject(cx);
if (!isSpread) {
    CallArgs args = ..;
    ..GetTemplateObject..
}

@@ +8243,5 @@
> +
> +    // Try attaching a call stub.
> +    if (op != JSOP_SPREADEVAL &&
> +        !TryAttachCallStub(cx, stub, script, pc, op, 1, vp, constructing, true, false))
> +        return false;

Nit: condition does not fit on one line, in this case we add {} (with { on its own line) because it's more readable.

@@ +8320,5 @@
> +ICCallStubCompiler::guardSpreadCall(MacroAssembler &masm, GeneralRegisterSet regs,
> +                                    Register argcReg, Register actualArgc, Label *failure)
> +{
> +    Register argPtr = regs.takeAny();
> +    masm.mov(BaselineStackReg, argPtr);

Instead of the VM call below, we can easily load the array length in JIT code:

Register scratch = regs.takeAny();
masm.unboxObject(Address(BaselineStackReg, ICStackValueOffset), scratch);
masm.loadPtr(Address(scratch, JSObject::offsetOfElements()), scratch);
masm.load32(Address(scratch, ObjectElements::offsetOfLength()), actualArgc);

// Ensure actual argc <= ARGS_LENGTH_MAX
...

You may have to add unboxObject(Address, Register) to the macro assemblers. For x86/ARM, it's exactly the same as unboxInt32 etc. On x64 it can just call unboxNonDouble(Operand(src), dest);

@@ +8371,5 @@
> +
> +    // Allocate stack space to extract arguments (sizeof(Value) * argc).
> +    Register argsSizeReg = regs.takeAny();
> +    masm.mov(actualArgc, argsSizeReg);
> +    JS_STATIC_ASSERT(sizeof(Value) == 8);

Nit: we're using C++11 static_assert now instead of JS_STATIC_ASSERT, so:

static_assert(sizeof(Value) == 8, "Value must be 8 bytes");

@@ +8372,5 @@
> +    // Allocate stack space to extract arguments (sizeof(Value) * argc).
> +    Register argsSizeReg = regs.takeAny();
> +    masm.mov(actualArgc, argsSizeReg);
> +    JS_STATIC_ASSERT(sizeof(Value) == 8);
> +    masm.lshiftPtr(Imm32(3), argsSizeReg);

Nit: lshift32

@@ +8374,5 @@
> +    masm.mov(actualArgc, argsSizeReg);
> +    JS_STATIC_ASSERT(sizeof(Value) == 8);
> +    masm.lshiftPtr(Imm32(3), argsSizeReg);
> +    masm.subPtr(argsSizeReg, BaselineStackReg);
> +    regs.add(argsSizeReg);

At this point the object must be an array with dense elements and initializedLength == length (we should assert all this in SpreadCallOperation, let me know if you have questions about this).

It'd be great if we could do the copy in JIT code, without the VM call. It's a bit complicated but fortunately we have code for this already, in pushArrayArguments (for fun.apply(x, array)). Can you call that function here?

@@ +8723,5 @@
>  
> +    Register actualArgc;
> +    if (MOZ_UNLIKELY(isSpread_)) {
> +        actualArgc = regs.takeAny();
> +        if (!guardSpreadCall(masm, regs, argcReg, actualArgc, &failure))

I think we should just use argcReg instead of a separate actualArgc reg, or does that lead to problems elsewhere?
Attachment #8418280 - Flags: review?(jdemooij)
Thank you again!

Here is the updated patch.

(In reply to Jan de Mooij [:jandem] from comment #8)
> It'd be great if we could do the copy in JIT code, without the VM call. It's
> a bit complicated but fortunately we have code for this already, in
> pushArrayArguments (for fun.apply(x, array)). Can you call that function
> here?

I partially copied pushArrayArguments into pushSpreadCallArguments instead of calling it,
because the array length is already known (in argcReg) and it's better to use the register
than to load the length from memory again.

> I think we should just use argcReg instead of a separate actualArgc reg, or
> does that lead to problems elsewhere?

Yeah, you are correct.
The reason I used actualArgc was that some code between guardSpreadCall and
pushSpreadCallArguments, such as |BaseIndex calleeSlot(...)|, assumes that
argcReg keeps its original value.
But I noticed that code can be optimized to omit argcReg,
because the original argcReg is always 1 in a spread call.
Attachment #8418279 - Attachment is obsolete: true
Attachment #8420193 - Flags: review?(jdemooij)
jstests and jit-tests (--tbpl) pass on Mac OS X 10.9.2 64-bit.
A performance comparison with the --baseline-eager flag follows:

1. SPREADCALL/CALL

  SPREADCALL (scripted):
    function f(a, b, c, d, e) { return a + b + c + d + e; } let s = elapsed(); for (let i = 0; i < 500000; i ++) f(1, 2, ...[3, 4, 5]); print(elapsed() - s);
  result:
    original :  0.865526 (432763) :     --
    part1    :  0.575428 (287714) :  50.4% faster
    part1+2  :  0.408342 (204171) : 112.0% faster

  CALL (scripted):
    function f(a, b, c, d, e) { return a + b + c + d + e; } let s = elapsed(); for (let i = 0; i < 500000000; i ++) f(1, 2, 3, 4, 5); print(elapsed() - s);
  result:
    original :  0.000281 (140704) :     --
    part1    :  0.000282 (140821) :   0.1% slower
    part1+2  :  0.000282 (140780) :   0.1% slower

  SPREADCALL (native):
    let s = elapsed(); for (let i = 0; i < 500000; i ++) Math.min(1, 2, ...[3, 4, 5]); print(elapsed() - s);
  result:
    original :  0.856160 (428080) :     --
    part1    :  0.522518 (261259) :  63.9% faster
    part1+2  :  0.435112 (217556) :  96.8% faster

  CALL (native):
    let s = elapsed(); for (let i = 0; i < 500000000; i ++) Math.min(1, 2, 3, 4, 5); print(elapsed() - s);
  result:
    original :  0.000282 (140832) :     --
    part1    :  0.000281 (140651) :   0.1% faster
    part1+2  :  0.000282 (140754) :   0.1% faster

  SPREAD only:
    let s = elapsed(); for (let i = 0; i < 500000; i ++) [1, 2, ...[3, 4, 5]]; print(elapsed() - s);
  result:
    original :  0.398302 (199151) :     --
    part1    :  0.392210 (196105) :   1.6% faster
    part1+2  :  0.387166 (193583) :   2.9% faster


2. SPREADNEW/NEW

  SPREADNEW (scripted):
    function A(a, b, c, d, e) { this.v = a + b + c + d + e; } let s = elapsed(); for (let i = 0; i < 500000; i ++) new A(1, 2, ...[3, 4, 5]); print(elapsed() - s);
  result:
    original :  1.058806 (529403) :     --
    part1    :  0.766202 (383101) :  38.2% faster
    part1+2  :  0.533848 (266924) :  98.3% faster

  NEW (scripted):
    function A(a, b, c, d, e) { this.v = a + b + c + d + e; } let s = elapsed(); for (let i = 0; i < 50000000; i ++) new A(1, 2, 3, 4, 5); print(elapsed() - s);
  result:
    original :  0.005091 (254561) :     --
    part1    :  0.005110 (255475) :   0.4% slower
    part1+2  :  0.005093 (254639) :   0.0% slower

  SPREADNEW (native):
    let s = elapsed(); for (let i = 0; i < 500000; i ++) new Array(1, 2, ...[3, 4, 5]); print(elapsed() - s);
  result:
    original :  0.904744 (452372) :     --
    part1    :  0.629578 (314789) :  43.7% faster
    part1+2  :  0.570500 (285250) :  58.6% faster

  NEW (native):
    let s = elapsed(); for (let i = 0; i < 50000000; i ++) new Array(1, 2, 3, 4, 5); print(elapsed() - s);
  result:
    original :  0.006019 (300943) :     --
    part1    :  0.006033 (301638) :   0.2% slower
    part1+2  :  0.006025 (301254) :   0.1% slower


3. SPREADEVAL/EVAL

  SPREADEVAL:
    let s = elapsed(); for (let i = 0; i < 100000; i ++) eval(...["1 + 2"]); print(elapsed() - s);
  result:
    original :  8.842150 (884215) :     --
    part1    :  8.731040 (873104) :   1.3% faster
    part1+2  :  8.711200 (871120) :   1.5% faster

  EVAL:
    let s = elapsed(); for (let i = 0; i < 100000; i ++) eval("1 + 2"); print(elapsed() - s);
  result:
    original :  7.903420 (790342) :     --
    part1    :  8.008430 (800843) :   1.3% slower
    part1+2  :  8.002970 (800297) :   1.2% slower

  SPREAD only:
    let s = elapsed(); for (let i = 0; i < 1000000; i ++) [...["1 + 2"]]; print(elapsed() - s);
  result:
    original :  0.212673 (212673) :     --
    part1    :  0.209087 (209087) :   1.7% faster
    part1+2  :  0.207979 (207979) :   2.3% faster
Attachment #8418280 - Attachment is obsolete: true
Attachment #8420195 - Flags: review?(jdemooij)
Comment on attachment 8420193 [details] [diff] [review]
Part1: Implement JSOP_SPREAD* fallback stubs in the baseline compiler.

Review of attachment 8420193 [details] [diff] [review]:
-----------------------------------------------------------------

Perfect, thanks!
Attachment #8420193 - Flags: review?(jdemooij) → review+
Comment on attachment 8420195 [details] [diff] [review]
Part2: Implement JSOP_SPREAD* optimized stubs in the baseline compiler.

Review of attachment 8420195 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry for the delay. This looks great! I'll take another look after the comments below are addressed, but should be good to go then.

::: js/src/jit-test/tests/basic/spread-call-maxarg.js
@@ -2,5 @@
> -var config = getBuildConfiguration();
> -
> -// FIXME: ASAN and debug builds run this too slowly for now.  Re-enable
> -// after bug 919948 lands.
> -if (!config.debug && !config.asan) {

How fast is this test now in a debug build?

::: js/src/jit/BaselineIC.cpp
@@ +8252,5 @@
> +    bool constructing = (op == JSOP_SPREADNEW);
> +
> +    // Try attaching a call stub.
> +    if (op != JSOP_SPREADEVAL &&
> +        !TryAttachCallStub(cx, stub, script, pc, op, 1, vp, constructing, true, false)) {

Nit: because the condition does not fit on one line, move the { to its own line so that it's clearer where the condition ends and the body starts.

@@ +8321,5 @@
> +    Register scratch = regs.takeAny();
> +    masm.unboxObject(Address(BaselineStackReg, ICStackValueOffset), scratch);
> +    masm.loadPtr(Address(scratch, JSObject::offsetOfElements()), scratch);
> +    masm.load32(Address(scratch, ObjectElements::offsetOfLength()), argcReg);
> +    regs.add(scratch);

You can use argcReg instead of scratch (and also remove the "regs" argument).

@@ +8324,5 @@
> +    masm.load32(Address(scratch, ObjectElements::offsetOfLength()), argcReg);
> +    regs.add(scratch);
> +
> +    // Ensure actual argc <= ARGS_LENGTH_MAX
> +    masm.branchTest32(Assembler::Above, argcReg, Imm32(ARGS_LENGTH_MAX), failure);

It's a comparison so this should be branch32.

@@ +8325,5 @@
> +    regs.add(scratch);
> +
> +    // Ensure actual argc <= ARGS_LENGTH_MAX
> +    masm.branchTest32(Assembler::Above, argcReg, Imm32(ARGS_LENGTH_MAX), failure);
> +    return true;

Nit: change the return type to |void|

@@ +8358,5 @@
> +
> +    // Push the callee and |this|.
> +    masm.pushValue(Address(BaselineFrameReg, STUB_FRAME_SIZE + 1 * sizeof(Value)));
> +    masm.pushValue(Address(BaselineFrameReg, STUB_FRAME_SIZE + 2 * sizeof(Value)));
> +    return true;

Same here.

@@ +8665,4 @@
>      regs.take(ArgumentsRectifierReg);
>      regs.takeUnchecked(BaselineTailCallReg);
>  
> +    if (MOZ_UNLIKELY(isSpread_)) {

Nit: remove the MOZ_UNLIKELY here and below: we only use it in cases where it really helps performance. The IC generation code isn't very hot because IC code is shared/reused :)

::: js/src/vm/Interpreter.cpp
@@ +3952,5 @@
>          return false;
>      }
>  
> +#ifdef DEBUG
> +    JS_ASSERT(aobj->getDenseInitializedLength() == length);

Nit: add a comment above this line:

    // The object must be an array with dense elements and no holes. Baseline's
    // optimized spread call stubs rely on this.

@@ +3953,5 @@
>      }
>  
> +#ifdef DEBUG
> +    JS_ASSERT(aobj->getDenseInitializedLength() == length);
> +    JS_ASSERT(!ObjectMayHaveExtraIndexedProperties(aobj));

This assert will fail when there are indexed properties somewhere on the prototype, like this: Array.prototype[0] = 1. Because the array has no holes, we don't consult the prototype chain so we don't have to worry about that.

You could change it to: JS_ASSERT(!aobj->isIndexed());

isIndexed() returns true if the object itself has non-dense, indexed properties.
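To illustrate why the dense/no-holes invariant matters: spreading a holey array consults the prototype chain for the missing indices, which an optimized stub reading only the dense elements would get wrong (self-contained sketch):

```javascript
// A hole makes element access fall through to Array.prototype, so an
// optimized stub may only run when the array is dense and hole-free.
Array.prototype[1] = 99;          // indexed property on the prototype
const holey = [1, , 3];           // hole at index 1
const spreadResult = [...holey];  // the hole is looked up on Array.prototype
delete Array.prototype[1];        // undo the prototype pollution
```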

@@ +3956,5 @@
> +    JS_ASSERT(aobj->getDenseInitializedLength() == length);
> +    JS_ASSERT(!ObjectMayHaveExtraIndexedProperties(aobj));
> +    for (uint32_t i = 0; i < length; i++) {
> +        JS_ASSERT(!aobj->getDenseElement(i).isMagic());
> +    }

Nit: no {}
Attachment #8420195 - Flags: review?(jdemooij)
Thank you again!
Here is the updated patch.

jstests and jit-tests (--tbpl) pass on Mac OS X 10.9.2 64-bit.
Performance is almost the same as with the previous patch.

(In reply to Jan de Mooij [:jandem] from comment #12)
> How fast is this test now in a debug build?

(as discussed on IRC)
The test takes 9 seconds per run, so I added the check back.
Also removed the comment about bug 919948.

> Nit: because the condition does not fit on one line, move the { to its own
> line so that it's clearer where the condition ends and the body starts.

Oh sorry, I misunderstood what you meant.
I've fixed it.
Attachment #8420195 - Attachment is obsolete: true
Attachment #8423326 - Flags: review?(jdemooij)
Comment on attachment 8423326 [details] [diff] [review]
Part2: Implement JSOP_SPREAD* optimized stubs in the baseline compiler.

Canceling r? to conform the patch to bug 1010775.
I'll post an updated patch soon.
Attachment #8423326 - Flags: review?(jdemooij)
Updated the arguments in the macro assembler.

jstests and jit-tests (--tbpl) pass on Mac OS X 10.9.2 64-bit.
Attachment #8423326 - Attachment is obsolete: true
Attachment #8424001 - Flags: review?(jdemooij)
Comment on attachment 8424001 [details] [diff] [review]
Part2: Implement JSOP_SPREAD* optimized stubs in the baseline compiler.

Review of attachment 8424001 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for updating the patch! I'll test your changes on x86/ARM now and push them to Try if that works :)
Attachment #8424001 - Flags: review?(jdemooij) → review+
It builds on x86 and ARM too and passes the spread jit-tests, so I pushed this to Try:

https://tbpl.mozilla.org/?tree=Try&rev=a8f3ee1b69ee
Flags: needinfo?(jdemooij)
Green on Try so I pushed it:

https://hg.mozilla.org/integration/mozilla-inbound/rev/4106f916a9b5
https://hg.mozilla.org/integration/mozilla-inbound/rev/bcf7062a8d78

Thanks again!
Assignee: nobody → arai_a
Status: NEW → ASSIGNED
Flags: needinfo?(jdemooij)
This seems to have regressed some of the asmjs-ubench tests (with --no-asmjs):
http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=307078994617&tochange=ec4b23a08c5e

asmjs-ubench-skinning: 32%
asmjs-ubench-fannkuch: 4.4%
Weird. It's possible one of the patches subtly changes non-spread calls somehow. I'll take a look later today.
Flags: needinfo?(jdemooij)
https://hg.mozilla.org/mozilla-central/rev/4106f916a9b5
https://hg.mozilla.org/mozilla-central/rev/bcf7062a8d78
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla32
I can't reproduce this regression with a 32-bit shell build. I updated my AWFY clone and ran

./harness.py $shell -- --no-asmjs

in asmjs-ubench

Hannes, can you reproduce this? I wonder which patch would have regressed this; it shouldn't affect non-spread calls at all.
Flags: needinfo?(jdemooij) → needinfo?(hv1989)
I couldn't reproduce this on my local machine, but I can on the Mac machine.

The first bad revision is:
http://hg.mozilla.org/integration/mozilla-inbound/rev/bcf7062a8d78

on 32bit optimized build:
CC="clang -m32" CXX="clang++ -m32" AR=ar CROSS_COMPILE=1 ../configure --enable-optimize --disable-debug --enable-threadsafe --disable-intl-api --without-intl-api --target=i686-apple-darwin10.0.0 --enable-nspr-build
Flags: needinfo?(hv1989)
Unfortunately, I still cannot reproduce the regression on my machine
(iMac Late 2013 / Core i5 2.9GHz / 16GB RAM / OS X 10.9.3 / Xcode 5.1.1).
Even when I use the configure options in comment 24 and run the command in comment 23,
there is no significant difference between that revision and the previous one.
It might be an environment-dependent problem.

Looking for another solution...
Yes, those regressions are really annoying. They happen on a single machine + platform and aren't reproducible anywhere else. There's probably some weird compiler or CPU cache thing going on.

Hannes, maybe we could update the Clang version we use on AWFY and see if it helps? Binaries for OS X can be downloaded from llvm.org, though you may need 10.9 for the latest versions. I have 3.4.1 here; maybe that's why I can't repro it.
I've updated the Clang version to 3.3 (which inbound uses to compile).
Skinning seems to be back to its original score, so the Clang compiler may have been tripping up and producing worse code.
Depends on: 1053692