Closed Bug 599214 Opened 10 years ago Closed 10 years ago

JM: Object equality tests are slow

Categories

(Core :: JavaScript Engine, defect)

defect
Not set

Tracking

()

RESOLVED FIXED

People

(Reporter: billm, Assigned: billm)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(2 files, 4 obsolete files)

When we do |o1 == o2|, where both vars are objects, we take a stub call in JM. However, the stub just does a pointer comparison. The microbenchmark that is attached exhibits this behavior. V8 runs it about 2x faster than JM. This is important because there are a lot of these tests in the Kraken ai-astar benchmark.
Attached patch WIP patch (obsolete) — Splinter Review
This patch adds object equality as an out-of-line fast path. It speeds up the benchmark by about 95%.

The patch has two problems. First, it doesn't check obj->getClass()->ext.equality, which is required for correctness. Second, it would be much better to do this as a PIC. I'll keep working on this.
Assignee: general → wmccloskey
Status: NEW → ASSIGNED
Assignee: wmccloskey → general
Status: ASSIGNED → NEW
Assignee: general → wmccloskey
Nice. JSC does not appear to fast-path this case, and spends 70% of its time in cti_op_eq.

v8 inlines the integer case. If this fails, it also generates an OOL path at runtime that effectively inlines the entire stub call (see CompareStub::Generate in ia32/code-stubs-ia32.cc). This approach also seems reasonable, and would definitely be faster than calling our StubEqualityOp.

That might be more appealing if a full-blown PIC is too complicated.
Is the speed-up TM-only?  Presumably in a TM+JM run all the time is spent in trace-generated code, since astar is dominated by a single small loop.
(In reply to comment #3)
> Is the speed-up TM-only?  Presumably in a TM+JM run all the time is spent in
> trace-generated code, since astar is dominated by a single small loop.

The speedup is JM-only (I think that's what you meant). The trace tuning heuristics make the "wrong" decision here and say that the loop will not trace well. This is because of all the branchiness in the main loop (see the neighbors method). Rather than fudge this somehow, it's easier just to fix the performance problem in JM. With this patch, the two JITs are close to even.
This seems like basically a perfect case for tracing, and there are ways to make it faster still on trace (see bug 579285), so it seems like longer term we want to make the JIT selection pick up these sorts of hot loops.
(In reply to comment #4)
> 
> The speedup is JM-only (I think that's what you meant).

Yes, that's what I meant, sorry for any confusion.

> The trace tuning
> heuristics make the "wrong" decision here and say that the loop will not trace
> well. This is because of all the branchiness in the main loop (see the
> neighbors method).

The benchmark is totally dominated by the performance of the loop in findGraphNode().  If you have one or more calls to a function containing a very tight loop, and those calls are within an outer loop that is very branchy, I would have thought that the inner loops would definitely be traced and the outer loop maybe not...
Bill, what's the neighbors method you mention?
(In reply to comment #7)
> Bill, what's the neighbors method you mention?

I was looking at the main loop, where neighbors is called. I missed the inner loop, because it was getting blacklisted due to a low initial iteration count. I'm not entirely convinced that the inner loop should be traced, but this bug probably isn't the right place to discuss that. I guess I should have posted this patch independent of trace tuning, since it's really just an improvement to the method JIT.

Regarding the trace tuning, I'll post an updated patch for that in Bug 580468 soon. Then we can talk about what to do for A*.
Attached patch new patch (obsolete) — Splinter Review
Attachment #478847 - Flags: review?(dvander)
Comment on attachment 478847 [details] [diff] [review]
new patch

I fixed the equality case. I didn't have time for the PIC. I think it makes sense to get this in and then do additional work later, if it seems necessary.
I just tested this benchmark, extracted from astar, with the new patch:

    function objEq(a, b) {
	for (var i = 0; i < 4000; ++i) {
	    if (a == b)
		break;
	}
    }

    var t0 = new Date;
    var a = {}, b = {};
    for (var i = 0; i < 10000; ++i)
	objEq(a, b);
    var t1 = new Date;
    print('objEq ' + (t1-t0));

Before the patch, the methodjit is 7x slower than the tracer. After the patch, it is still 2x slower. If I change |==| to |===|, then the methodjit is as fast as the tracer was on ==. (The tracer gets much slower when I do that--should I file a bug on that?) I didn't look at this patch in detail, but it seems there is still room to make it faster.
(In reply to comment #11)
> I didn't look at this patch in detail, but it seems there
> is still room to make it faster.

Okay, maybe it's time to do the PIC. I'll look into this.
Attached patch inline cache version (obsolete) — Splinter Review
Okay, this version uses an inline cache. It always generates the int path. The object and string paths are generated as needed. To avoid issues with TRACE MICs, I made the decision that we should only use inline caching for forward jumps. Otherwise, things would get really complex. And anyway, I doubt there are many loops whose condition is an object or string comparison. (Loops of the form |while (x!=null)| don't count, I think, because null is its own type.)

The patch also adds a guard for the equality extension to the tracer, since it was missing.

The performance of the IC version on A* doesn't really improve at all over the non-IC version. I guess this isn't too surprising, since we really only avoid one extra branch for the string case. Nevertheless, the code is cleaner, I think. And it will be a lot easier to add additional cases, if needed.

I don't see too big a difference between the tracer and the method JIT on Dave's microbenchmark: 113ms for tracer versus 138ms for the method JIT. Something very strange is going on, though. The equality extension guard has no performance impact on the tracer. However, if I take it out of the method JIT, performance on the microbenchmark goes down to 113ms---exactly parallel with the tracer. I tried generating exactly the same assembly ops in the two, but it has no effect. Weird.
Attachment #478152 - Attachment is obsolete: true
Attachment #478847 - Attachment is obsolete: true
Attachment #481656 - Flags: review?(dvander)
Attachment #478847 - Flags: review?(dvander)
What about doing this for strict object equality too?  V8's earley-boyer calls stubs::StrictEq about 1 million times. It's called 500.000 times with both lhs and rhs objects.
Comment on attachment 481656 [details] [diff] [review]
inline cache version

Sorry for the late review.

As discussed IRL, we should move the HAS_EQUALITY flag set into XML object creation, to avoid adding overhead to the object allocation path.

>+void
>+mjit::Compiler::passEqualityICAddress(EqualityGenInfo &eic)
>+{
>+    eic.addrLabel = stubcc.masm.moveWithPatch(ImmPtr(NULL), Registers::ArgReg1);
>+}

This only gets used in one place, I'd just inline it.

Also a general comment, I've been removing "mic" and "pic" in favor of just "ic" as the catch-all canonical name for inline caches.

>-mjit::Compiler::jumpAndTrace(Jump j, jsbytecode *target, Jump *slowOne, Jump *slowTwo)
>+mjit::Compiler::jumpAndTrace(Jump j, jsbytecode *target, Jump *slow)

Yay.

>+void JS_FASTCALL
>+ic::Equality(VMFrame &f, ic::EqualityICInfo *eic)
>+{
>+    EqualityCompiler cc(f, *eic);
>+    if (!cc.update())
>+        THROW();
>+}

It doesn't look like the IC ever disables itself. We probably want to do that.

Patch looks great otherwise, should probably review the final changes though.
Attachment #481656 - Flags: review?(dvander) → review+
Comment on attachment 481656 [details] [diff] [review]
inline cache version

Sorry for the late review.

As discussed IRL, we should move the HAS_EQUALITY flag set into XML object creation, to avoid adding overhead to the object allocation path.

>+void
>+mjit::Compiler::passEqualityICAddress(EqualityGenInfo &eic)
>+{
>+    eic.addrLabel = stubcc.masm.moveWithPatch(ImmPtr(NULL), Registers::ArgReg1);
>+}

This only gets used in one place, I'd just inline it.

Also a general comment, I've been removing "mic" and "pic" in favor of just "ic" as the catch-all canonical name for inline caches.

>-mjit::Compiler::jumpAndTrace(Jump j, jsbytecode *target, Jump *slowOne, Jump *slowTwo)
>+mjit::Compiler::jumpAndTrace(Jump j, jsbytecode *target, Jump *slow)

Yay.

>+void JS_FASTCALL
>+ic::Equality(VMFrame &f, ic::EqualityICInfo *eic)
>+{
>+    EqualityCompiler cc(f, *eic);
>+    if (!cc.update())
>+        THROW();
>+}

It doesn't look like the IC ever disables itself. We probably want to do that.

Patch looks great otherwise, should probably review the final changes though.
Attached patch updated patch (obsolete) — Splinter Review
This resolves the equality extension issue in the way we discussed. It's slightly tricky ensuring that the bit gets set in the right places, but I think I got them all.

I also added the ability to disable the IC.
Attachment #481656 - Attachment is obsolete: true
Attachment #482730 - Flags: review?(dvander)
Attachment #482730 - Flags: review?(dvander) → review+
Attached patch updated patchSplinter Review
Attachment #482730 - Attachment is obsolete: true
Attachment #484111 - Flags: review?(nnethercote)
Comment on attachment 484111 [details] [diff] [review]
updated patch

>@@ -112,36 +113,37 @@ static const nanojit::AccSet ACCSET_CX  
> static const nanojit::AccSet ACCSET_EOS           = (1 <<  4);
> static const nanojit::AccSet ACCSET_ALLOC         = (1 <<  5);
> static const nanojit::AccSet ACCSET_FRAMEREGS     = (1 <<  6);
> static const nanojit::AccSet ACCSET_STACKFRAME    = (1 <<  7);
> static const nanojit::AccSet ACCSET_RUNTIME       = (1 <<  8);
> 
> // Nb: JSObject::{lastProp,map,flags} don't have an AccSet because they are never accessed on trace

Nit: remove ",flags" from that comment.

The rest of the tracer-related changes look fine.  I didn't look at the methodjit-related changes, I assume previous reviews covered them.
Attachment #484111 - Flags: review?(nnethercote) → review+
Please remember to set the whiteboard and post the changeset to the bug(s) when pushed.  Thanks.
Alas, this broke ARM's trace tests. I'm going to back this change out for the time being so we can work on PICs (backout link to follow).

Failing trace tests:

    -m basic/bug524826-2.js
    -m basic/bug524826.js
    -m basic/equalInt.js
    -m basic/strings.js
    -m basic/truthies.js
    -m sunspider/check-date-format-tofte.js
    -m sunspider/check-string-tagcloud.js
    -m v8-v5/check-crypto.js
    -m v8-v5/check-deltablue.js
This patch blocks tracer tuning, which we want to land tomorrow. It might be better to #ifdef POLYIC the new code. The ARM ICs are in-flux anyway.
(In reply to comment #23)
> It might be
> better to #ifdef POLYIC the new code. The ARM ICs are in-flux anyway.

I don't think I understand what you're saying we should ifdef out -- this causes a regression on ARM with a clean tip and methodjit enabled. I can investigate it, but we shouldn't leave the tree broken on a platform, right?
Oh, maybe you meant to disable JS_MONOIC on ARM for the time being -- I'll try that and see if everything is copacetic.
Okay, disabled MONOIC for ARM: http://hg.mozilla.org/tracemonkey/rev/b7c0fd2370c6

See bug 605415 for ARM debugging progress.
I'm confused about how |ic.jumpToStub.executableAddress()| can be safely used unconditionally in |EqualityCompiler::linkForIC| when it is conditionally initialized in |Compiler::jsop_equality_int_string| and, subsequently, |Compiler::finishThisUp|. I might figure it out as I probe more, but having a recommended comment to throw into my commit would be appreciated in any case!
Sorry about this, Chris. I ran the tests you mentioned through valgrind and it comes up clean, so it'll be hard for me to debug it.

I'm pretty sure the jumpToStub thing is safe, although admittedly not very clear. Here's the rationale. Inside jsop_equality_int_string, the only way jumpToStub won't be set is if both operands are known integers. In this case, there's no chance we'll ever enter the IC, so linkForIC is never called.
http://hg.mozilla.org/mozilla-central/rev/81d0ca612cc8
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.