Closed Bug 595884 Opened 10 years ago Closed 9 years ago

JM: make f.apply(x, arguments) fast

Categories

(Core :: JavaScript Engine, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
Tracking Status
blocking2.0 --- -
status2.0 --- wanted

People

(Reporter: jandem, Assigned: luke)

References

Details

Attachments

(1 file, 15 obsolete files)

26.31 KB, patch
dvander
: review+
Details | Diff | Splinter Review
As discussed in bug 594972, we need this for the Class.create pattern in v8-raytrace. This makes up for at least 100 ms on awfy, and speeding this up will also help many other benchmarks and websites.
Which aspect of |arguments| do you think is costing the most? Passing it to fun.apply, simply creating it, or accessing members of it?
(In reply to comment #1)
> Which aspect of |arguments| do you think is costing the most? Passing it to
> fun.apply, simply creating it, or accessing members of it?

From the numbers in bug 594972 it looks like arguments[i] (ARGSUB) is pretty fast. "return arguments" is much slower than v8. Assigning it to a global  variable or passing it to a function is also slow, so I assume creating the arguments object is slow. I can profile when I get home.
Blocks: 594972
The prototypejs.org-style passing arguments only to fun.apply (Class initialize delegated application in v8 raytrace) is juicy, because we don't really need to create an arguments object. Never creating beats optimizing creation costs any day.

/be
Attached patch WIP 1 (obsolete) — Splinter Review
This patch adds a fast path for f.apply(x, arguments). It makes v8-raytrace more than twice as fast (415 ms -> 204 ms for me) 

One remaining TODO is adding support for call IC's; this patch generates just uncached calls, and some parts are hackish. I'd appreciate some feedback on how to fix this properly.

Btw, I won't have much time next week to hack on this; feel free to steal it.
Assignee: general → jandemooij
Status: NEW → ASSIGNED
Attachment #475891 - Flags: feedback?(dvander)
Great work! Cc'ing potential stealers.

/be
Attached patch WIP v2 (obsolete) — Splinter Review
Added some helper functions to make things easier to read.
Also fixes the case where f is not a function in f.apply.
Attachment #475891 - Attachment is obsolete: true
Attachment #475891 - Flags: feedback?(dvander)
There's one jstest failure. It can be reduced to this:
-----
function f() {
    return print.apply(null, arguments);
}
res = f(1, 2, 3);
print(res);
-----
This prints:
----
1 2 3
function apply() {
    [native code]
}
----
This happens for all builtins. For non-native functions everything works fine. What do native functions do differently?
Attached patch WIP v3 (obsolete) — Splinter Review
Further cleanup. The bug I mentioned is still there.
Attachment #476206 - Attachment is obsolete: true
Natives get the callee pushed on the interpreter stack slot where the return value should go on return. So a classic native function implementation mistake is to forget to JS_SET_RVAL before returning and the called native returns itself. Something in JM needs to track the apply'ed r.v. as copied there?

/be
(In reply to comment #9)
> Natives get the callee pushed on the interpreter stack slot where the return
> value should go on return. So a classic native function implementation mistake
> is to forget to JS_SET_RVAL before returning and the called native returns
> itself. Something in JM needs to track the apply'ed r.v. as copied there?
> 
> /be

There are (or were, a few months ago) a fair number of these bugs still around.  Developing the type inference was great at finding them, because it checks the actual returned value of all calls against the type specification.  The inference patch should fix the ones I found.
Attached patch WIP v4 (obsolete) — Splinter Review
@brendan: thanks, that made me realize what went wrong. This patch fixes the test and passes trace-tests and jstests. 

Now we need to fix the remaining hacks. It may be easier to steal now that it passes all tests..
Attachment #476233 - Attachment is obsolete: true
I remember bhackett talking on IRC about many calls to Function.prototype.call in v8-deltablue. I think this could be fixed easily with this patch. For f.call(x, y) the stack looks like this:call, f, x, y. testNativeFunction can check for builtin Function.prototype.call. If it's not, just pass argc to emitUncachedCallVarArgs. If it *is* builtin Function.prototype.call, increment the argc register, so that we call f with x as this-value. Maybe we don't even need this as the number of arguments is not really variable. Could be worth a try though.
The Function.call thing on IRC turned out to be mostly incompetence in my measuring abilities.  deltablue calls Function.call about 125k times, which works out to a loss of ~20ms (not nothing, but not huge) vs JSC.  There are about 2 million calls to Invoke though (which I was confusing with Function.call), which look to be coming from non-Function.call slow paths and which are probably costing significantly (haven't looked further).
Attached patch WIP v5 (obsolete) — Splinter Review
Cleanup and a small improvement. Now we use a single lea instruction to calculate the dynamic stack pointer instead of a shift and two adds. 

With this patch the micro-benchmark in bug 594972 becomes almost 3x faster. v8-raytrace still shows a 2x speedup (~200 ms on v8).

Anybody willing to take over and drive this in?
Attachment #476403 - Attachment is obsolete: true
I can take a look tomorrow.  What's left -- I see various "XXX" marks in the patch, is there anything other than those?
I like the idea of function specific optimisations. I think it might be worthwile in real world applications to fastpath Array.prototype.slice.call(arguments, 1) as this is most common approach from my point of view to convert arguments into an array.
(In reply to comment #15)
> I can take a look tomorrow.  What's left -- I see various "XXX" marks in the
> patch, is there anything other than those?

I just looked at this patch for over an hour... I've barely looked at JM prior to now, and I barely understand what's going on.  And for several of the "XXX" marks I can't even tell what the shortcoming is.  Sorry, I'm not much use here.
blocking2.0: --- → ?
Sorry Nick, I will add some comments and attach a new version of the patch monday morning PDT.
(In reply to comment #18)
> Sorry Nick, I will add some comments and attach a new version of the patch
> monday morning PDT.

While adding comments I realized that I can rewrite some stuff to make it a bit cleaner, I will try this approach tomorrow. Please don't review the current patch, it will change quite a bit.
Attached patch WIP v6 (obsolete) — Splinter Review
Something like this. I added some comments and changed a few things.

@dvander: I know there are more important bugs right now. Just requesting feedback so you won't forget this bug.
Attachment #476515 - Attachment is obsolete: true
Attachment #477252 - Flags: feedback?(dvander)
Comment on attachment 477252 [details] [diff] [review]
WIP v6

>+mjit::Compiler::emitUncachedCallVarArgs(Assembler& masm, RegisterID varArgcReg, int nonVarArgc)
>+{

This is a great start. We could cache the call with more work, but this will already be much faster.

>+void
>+mjit::Compiler::testNativeFunction(FrameEntry *fe, void *funcPtr, JumpList& notFunction)

Comments on this at the end.

>+    /* Check if receiver is a function. 
>+     * XXX: is testObject redundant after callprop? */
>+    FrameEntry *receiver = frame.peek(-2);
>+    RegisterID dataReg = frame.copyDataIntoReg(receiver);
>+    Jump notObject = frame.testObject(Assembler::NotEqual, receiver);
>+    Jump notFunction = masm.testFunction(Assembler::NotEqual, dataReg);

CALLPROP doesn't guarantee a function or even an object, though it probably won't get very far without one. So your checks here are correct. Maybe we should have a follow-up bug on eliminating them in the fast-path, but it probably wouldn't buy a lot for the added complexity.

>+    
>+    /* Check for builtin apply. */
>+    JumpList notApply;
>+    testNativeFunction(frame.peek(-3), JS_DATA_TO_FUNC_PTR(void *, js_fun_apply), notApply);

If we're compile-and-go (which content scripts are), there's an alternative to all this testing:
 1. At compile time, lookup "apply" on our global object's Function.prototype.
 2. If "apply" is an function-object whose JSFunction is js_fun_apply, bake in this check:
   cmp receiverType, OBJECT_TAG
   cmp receiverData, <funobj ptr>

Another option is to layout JSFunction such that |native| overlays |script|. You wouldn't have to check flags and you're guaranteed the pointer is enough. You'll still end up with more checks, but but either way for scripts that will be shared across multiple globals, we'll have to do something a little slower.
Attachment #477252 - Flags: feedback?(dvander) → feedback+
blocking2.0: ? → -
status2.0: --- → wanted
Comment on attachment 477252 [details] [diff] [review]
WIP v6

>+int JS_FASTCALL
>+stubs::PushArguments(VMFrame &f)
...
>+        Value lenval;
>+        if (!js_GetArgsProperty(cx, fp, ATOM_TO_JSID(rt->atomState.lengthAtom), &lenval))
>+            THROWV(0);
>+        
>+        if (lenval.isInt32()) {
>+            argc = jsuint(lenval.toInt32());
>+        } else {
>+            JS_STATIC_ASSERT(sizeof(jsuint) == sizeof(uint32_t));
>+            if (!ValueToECMAUint32(cx, lenval, (uint32_t *)&argc))
>+                THROWV(0);
>+        }

Since this logic is duplicated from js_fun_apply, perhaps it could be factored out into a function called by both.

>+    /* XXX: return -1 to fall back to slow path? */
>+    if (!stack.ensureSpace(cx, f.regs.sp, argc))
>+    	THROWV(0);

ensureSpace will report an error if it returns false, so you are correct in propagating the error.

>+    for(uint i=0; i<argc; i++) {
>+        /*XXX: can this happen?
>+        if (argsobj) {
>+            if (argsobj->getArgsElement(arg).isMagic(JS_ARGS_HOLE))
>+                return argsobj->getProperty(cx, id, vp);
>+        }*/
>+        f.regs.sp[i] = fp->canonicalActualArg(i);
>+    }

It is faster to call fp->forEachCanonicalActualArg(op) than calling fp->canonicalActualArg(i) for each arg.  Also, I think the XXX condition can happen, a la:
  function f() { delete arguments[1]; g.apply(null, arguments); }
In the case of a hole, though, you will want to do what js_fun_apply does, which is to the corresponding stack arg to undefined.  Addressing both points (and hoisting CopyTo and CopyNonHoleArgs into a shared header, say, jsinterpinlines.h), you can simply write:

  if (argsobj)
    fp->forEachCanonicalActualArg(CopyNonHoleArgs(argsobj, regs.sp));
  else
    fp->forEachCanonicalActualArg(CopyTo(regs.sp));
Thanks for the feedback guys. 

The Compartmental-GC patch was a 100 ms. win on v8-raytrace. I suppose that a big part of this is speeding up allocating/freeing the arguments object. This reduces the expected perf win from this patch - 200 ms to 150 ms, iirc.
Attached patch WIP v7 (obsolete) — Splinter Review
This addresses David's comments and adds the compile-and-go optimization. I hope I got it right...

I will address Luke's comments later this week. Factoring out arguments/apply logic sounds good though.
Attachment #477252 - Attachment is obsolete: true
Attached patch WIP v7a (obsolete) — Splinter Review
Forgot StubCalls.cpp + check for js_fun_apply at compile time.
Attachment #479138 - Attachment is obsolete: true
Attached patch WIP v8 (obsolete) — Splinter Review
Update to new call machinery. 

If I run V8 in the browser, v8-raytrace goes from 1185 to 2234 points (average of 5 runs). Total score from 2274 to 2518 (= 244 points). 

I still need to address Luke's comments.
Attachment #479148 - Attachment is obsolete: true
Delicious! fun_apply bake-in looks good. Some drive by comments:

dataReg comparison should be branchPtr, against the ImmPtr(funobj). I don't think it's safe to ignore the top 17 bits on x64.

js_GetClassPrototype and GetProperty are fallible, so failure there has to propagate Compile_Error all the way up (yeah, gross).

This is a pre-existing bug, but the compiler should pass scopeChain to js_GetClassPrototype instead of NULL. This is because a debug-mode recompilation can give the compiler a different frame than cx->fp(), and NULL grabs the scope off the current frame.
Thanks for the feedback. Unfortunately, I'm AFK the coming days, I will try to finish this patch monday/tuesday. (as always, feel free to steal if you want it in sooner)
Attached patch WIP v9 (obsolete) — Splinter Review
This fixes all comments so far. I removed an if in js_fun_apply (because ValueToECMAUint32 special-cases it too) to make it more consistent with stubs::PushArguments.

@dvander, is there a better way to test the value? I added an #ifdef for now, this works on 32 and 64 bit.
Attachment #480387 - Attachment is obsolete: true
Attachment #480987 - Flags: feedback?(lw)
Attachment #480987 - Flags: feedback?(dvander)
There is a problem with this patch if you override arguments.length or replace the arguments object (arguments = ...). Should the fast path support this? It's possible, but I'm afraid this will add/duplicate quite some code. Otherwise, we can bake in a check before calling the stub. Or return -1 from stubs::PushArguments and if that happens jump to the slow case.

Any suggestions?
(In reply to comment #30)
> There is a problem with this patch if you override arguments.length or replace
> the arguments object (arguments = ...). Should the fast path support this?

I would say it's not important to make that case fast, unless there is a current important benchmark that does that. But I don't think they do. I think web frameworks do things like that sometimes, but hopefully not in a hot loop.
Attached patch WIP v10 (obsolete) — Splinter Review
Rebease + fix overridden arguments and arguments.length. Let me know if there is a better solution, but this seems to work well.
Attachment #480987 - Attachment is obsolete: true
Attachment #481047 - Flags: feedback?(lw)
Attachment #481047 - Flags: feedback?(dvander)
Attachment #480987 - Flags: feedback?(lw)
Attachment #480987 - Flags: feedback?(dvander)
Pretty sure dromaeo includes prototype stuff in hot loops that might do evil things with arguments....
Comment on attachment 481047 [details] [diff] [review]
WIP v10

Thanks for making the changes.

>--- a/js/src/jscntxt.h	Tue Oct 05 10:49:27 2010 -0700
>+++ b/js/src/jscntxt.h	Tue Oct 05 23:25:04 2010 +0200
>+    public:
>     inline bool ensureSpace(JSContext *maybecx, Value *from, ptrdiff_t nvals) const;
>-
>+	private:

  public:
    inline bool ensureSpace(...
  private:

>+    CopyNonHoleArgs(JSObject *aobj, js::Value *dst) : aobj(aobj), dst(dst) {}
>+    JSObject *aobj;
>+    js::Value *dst;
>+    void operator()(uintN argi, js::Value *src) {
>+    	if (aobj->getArgsElement(argi).isMagic(JS_ARGS_HOLE)) {
>+            dst->setUndefined();
>+         }
>+        else
>+            *dst = *src;
>+        ++dst;

Could you remove the curlies around dst->setUndefined.  Also could CopyTo and CopyNonHoleArgs go in jsinterpinlines.h (perhaps right below forEachCanonicalActualArg) inside namespace js?

>+int JS_FASTCALL
>+stubs::PushArguments(VMFrame &f)
>+{
[snip]
>+    f.regs.sp += argc;  
>+    return argc;
>+}

Normally, jit code knows the static offset of sp from fp, so I wouldn't think bumping f.regs.sp was necessary.  Is there a special dependency on f.regs.sp here?
Summary: JM: make JSOP_ARGUMENTS fast → JM: make f.apply(x, arguments) fast
Duplicate of this bug: 595902
I just want to add some numbers that show how important this bug is. On this line of code:

  new Flag.RayTracer.Vector(...)

we are slower than V8 by 0.43ms/(1000 calls) (all numbers are on a Mac Pro). I confirmed that the property accesses (Flag.RayTracer and Flag.RayTracer.Vector) are not the issue. We are a bit slower on new Object, by 0.043ms/(1000 calls), so that's 10% of the loss. But most of it is probably the apply.

The benchmark makes 58268 calls to Class.Create() (specifically, a closure returned by that call), so leaving out the |new Object| slowdown, we are losing about 23 ms here. This is on a benchmark that does 1 cycle in 15 ms in V8 and 48 ms in SpiderMonkey. So this is 2/3 of the time gap.
I made a mistake in counting calls to Class.create() in comment 36. There are actually 69372 calls, so the speedup from this patch is expected to be 27 ms. And we are losing 3 ms on the basic |new Object| functionality.

So there are only 3 ms of other losses to be found.
Attached patch Patch v11 (obsolete) — Splinter Review
Rebased, addressed luke's comments.

@luke: I removed the "f.regs.sp += argc" line (it has no effect but I added it to match the other stubs) Can you help me get this in?
Attachment #481047 - Attachment is obsolete: true
Attachment #482611 - Flags: review?(lw)
Attachment #481047 - Flags: feedback?(lw)
Attachment #481047 - Flags: feedback?(dvander)
Attached patch Patch v11 (obsolete) — Splinter Review
Oops, this is the correct version.
Attachment #482611 - Attachment is obsolete: true
Attachment #482623 - Flags: review?(lw)
Attachment #482611 - Flags: review?(lw)
@jandem: Sorry for the silence.  I was considering how this patch would best interact with other Function.call/apply optimizations (bug 602129 and bug 602262) and the possibility of reusing the callIC instead of using uncached calls for all these.  I'll play around with your path and see if the idea works.  Whether it does or doesn't, I am happy to help get this in.
Update: the call IC seems to be working out.  The work in bug 602129 will be a pre-requisite to this patch.
Depends on: 605192
Attached patch WIP 1 (obsolete) — Splinter Review
Patch is lightly tested and applies on top of bug 602129.  Measuring just this patch (-m in the ss harness):
    raytrace:     1.64x as fast      233.5ms +/- 0.5%    142.2ms +/- 0.9%
Combined (i.e., compared to trunk):
    raytrace:     2.12x as fast      299.5ms +/- 0.7%    141.0ms +/- 0.3%
Assignee: jandemooij → lw
Attached patch WIP 2 (obsolete) — Splinter Review
Attachment #482623 - Attachment is obsolete: true
Attachment #485436 - Attachment is obsolete: true
Attachment #482623 - Flags: review?(lw)
Attached patch patchSplinter Review
Green on try.
Attachment #485464 - Attachment is obsolete: true
Attachment #486281 - Flags: review?(dvander)
Attachment #486281 - Attachment is patch: true
Attachment #486281 - Attachment mime type: application/octet-stream → text/plain
Wow, this is way better than my hacks ;) I guess the call ICs did help quite a bit? 2.12x is more than I got IIRC.
Comment on attachment 486281 [details] [diff] [review]
patch

>+        JS_ASSERT_IF(applyTricks == NoApplyTricks, frame.stackDepth() == opinfo->stackDepth);

Instead of this,

>+                applyTricks = LazyArgsObj;

Move the other frame.pushSynced() outside of the else branch, and pop the placeholder when you don't want it anymore. That way the compiler state is more coherent outside of the apply trick.

> void
>-mjit::Compiler::checkCallSpeculation(uint32 argc,
>+mjit::Compiler::checkCallSpeculation(uint32 callImmArgc, uint32 origArgc,

Nit: It wasn't obvious at first what "origArgc" was. It seems like the argc in the bytecode should be "origArgc" and the speculated argc should be something else, like "specialArgc".

> #ifdef JS_MONOIC
>     if (origCallee->isConstant() || origCallee->isNotType(JSVAL_TYPE_OBJECT) || debugMode) {
> #endif
>-        emitUncachedCall(argc, callingNew);
>+        JS_ASSERT(applyTricks == NoApplyTricks);

Okay, so this shouldn't fire because something like 3.apply(5, arguments) will deoptimize, even for strings where JM still assumes primitives need to be wrapped?

>+    JS_ASSERT(GET_ARGC(f.regs.pc) == 2);
>+
>+    /*
>+     * The lazyArgsObj flag indicates an optimized call |f.apply(x, arguments)|
>+     * where the args obj has not been created (so regs.sp[-1] is uninitialized).
>+     */

Explain the state of f.regs.sp before and after, here, for each case.

>+        /* Can't optimize; just create an args obj and take the normal path. */
>+        f.regs.sp++;

Add a comment here about why f.regs.sp needs to be bumped. In stubs we tend to bump sp where &sp[-X] would suffice, so it's best to give an explicit reason in cases like this.
Attachment #486281 - Flags: review?(dvander) → review+
> Move the other frame.pushSynced() outside of the else branch, and pop the
> placeholder when you don't want it anymore. That way the compiler state is more
> coherent outside of the apply trick.

Awesome observation

> Okay, so this shouldn't fire because something like 3.apply(5, arguments) will
> deoptimize, even for strings where JM still assumes primitives need to be
> wrapped?

My reasoning was: JSOP_FUNAPPLY is only emitted if you have x.apply, which means JSOP_CALLPROP, and JSOP_CALLPROP seemed to push non-constant callee/this.  But, it seems like this could change in the future, so I'll handle the case, good catch!
Landed last month on mozilla-central:

http://hg.mozilla.org/mozilla-central/rev/92af3359a18f
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Depends on: 613732
You need to log in before you can comment on or make changes to this bug.