Bug 622494 - Investigate performance on IMVU model skinning benchmark

Status: NEW
Alias: imvu
Product: Core
Classification: Components
Component: JavaScript Engine
Version: Trunk
Platform: x86 All
Importance: normal, 5 votes
Assigned To: Nobody; OK to take it and work on it
QA Contact: Jason Orendorff [:jorendorff]
Depends on: TypeInference, bug 545406, bug 606892, bug 700101
Reported: 2011-01-02 23:27 PST by Chad Austin
Modified: 2014-07-24 11:07 PDT
CC: 29 users

Attachments
Typed array shell testcase (3.66 KB, text/plain)
2011-01-03 00:20 PST, Boris Zbarsky [:bz] (still a bit busy)

Description  Chad Austin 2011-01-02 23:27:09 PST
I claimed that JavaScript will never be as fast as native code, and thus the web needs Native Client or something like it.  Shaver called my bluff: "Take a performance-sensitive algorithm that matters to you, write a benchmark, and submit it as a bug."

I took the SSE-accelerated Cal3D skinning loop and implemented four versions of it.  I benchmarked them on a MacBook Pro, 2.5 GHz Core i5.  C++ is compiled with gcc -O2, and JavaScript is run with Firefox 4 beta 8.  Results are in millions of vertices skinned per second.

C++ w/ SSE intrinsics (near theoretical maximum on a single core):  98.3
C++ w/ scalars:  61.2
JavaScript:  5.1
JavaScript w/ typed arrays:  8.4

The code is available at https://github.com/chadaustin/Web-Benchmarks
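The repository holds the full Cal3D port; as a rough illustration of what the typed-array variant's hot loop looks like, here is a minimal linear-blend-skinning sketch.  The function and variable names are mine, not from the benchmark, and it handles only one bone influence per vertex:

```javascript
// Minimal sketch (not the benchmark's actual code) of a typed-array
// skinning loop.  Each bone matrix is 12 floats: a row-major 3x4
// affine transform, kept in a flat Float32Array.
function skinVertices(boneMatrices, positions, boneIndices, output) {
  var n = positions.length / 3;
  for (var v = 0; v < n; ++v) {
    var b = boneIndices[v] * 12;   // start of this vertex's bone matrix
    var p = v * 3;
    var x = positions[p], y = positions[p + 1], z = positions[p + 2];
    output[p]     = boneMatrices[b]     * x + boneMatrices[b + 1] * y + boneMatrices[b + 2]  * z + boneMatrices[b + 3];
    output[p + 1] = boneMatrices[b + 4] * x + boneMatrices[b + 5] * y + boneMatrices[b + 6]  * z + boneMatrices[b + 7];
    output[p + 2] = boneMatrices[b + 8] * x + boneMatrices[b + 9] * y + boneMatrices[b + 10] * z + boneMatrices[b + 11];
  }
}

// Smoke test with a single identity bone: output should copy positions.
var bones = new Float32Array([1, 0, 0, 0,  0, 1, 0, 0,  0, 0, 1, 0]);
var pos = new Float32Array([1, 2, 3,  4, 5, 6]);
var idx = new Uint16Array([0, 0]);
var out = new Float32Array(6);
skinVertices(bones, pos, idx, out);
```

Keeping everything in flat Float32Arrays is what the typed-array version exploits: no per-vertex object allocation, and loads/stores map directly onto the underlying buffer.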

vlad did some investigation on IRC:

<vlad>	are we really guarding that these things are still functions?
<vlad>	is that what's going on here?
<vlad>	well, got a 15% win by removing all the inner calls to TransformPoint/TransformVector/ScaleMatrix/AddScaledMatrix and inlining those
<vlad>	which is stupid

The JS testcases are trivial to turn into shell testcases by changing dump() to print().  (I think vlad meant alert() there.)
Comment 1  Boris Zbarsky [:bz] (still a bit busy) 2011-01-03 00:19:48 PST
No, he probably meant dump(), which writes to stdout in the browser if the right pref is enabled.

I'll attach a shell version of the typed array JS, with profiling hooks added but commented out; running a shark profile there shows pretty much all of the time spent in tracejit-generated code.

Given that the tracejit inlines method calls, the fact that inlining them manually helped does seem to point to function guards of some sort.

I wonder how this testcase does with the LICM patches...
Comment 2  Boris Zbarsky [:bz] (still a bit busy) 2011-01-03 00:20:17 PST
Created attachment 500754 [details]
Typed array shell testcase
Comment 3  Boris Zbarsky [:bz] (still a bit busy) 2011-01-03 00:22:10 PST
Fwiw, v8 can't run the typed array testcase, but their tip (with crankshaft) gets about 8e6 vertices per second on the non-typed-array version; we get about 5.5e6 (with -j or -m -j -p; the results are about the same; -m alone is at 3e6).  So there's definite headroom here!
Comment 4  Boris Zbarsky [:bz] (still a bit busy) 2011-01-03 00:39:34 PST
Also fwiw, I can't seem to reproduce vlad's 15% speedup here, though maybe my lazy way of inlining (copy-paste, with some var decls to assign the things that got passed to the function before) is responsible... but that seems like the right way to inline in this case anyway, so we only compute the somewhat-complicated arguments once.

One other note: we seem to compile two copies of the hot loop here; it looks like something is type-unstable (a bunch of I and O types the first time through; all U the second time through).

And if I inline the ScaleMatrix/AddScaledMatrix calls (not like the latter is called anyway, in my testing), and then re-roll those loops, performance goes down by a factor of 2, and the number of traces compiled goes from 4 to 10 (presumably due to more type instability?).
Comment 5  Vladimir Vukicevic [:vlad] [:vladv] 2011-01-03 12:26:04 PST
(In reply to comment #4)
> Also fwiw, I can't seem to reproduce vlad's 15% speedup here, though maybe my
> lazy way of inlining (copy-paste, with some var decls to assign the things that
> got passed to the function before) is responsible... but that seems like the
> right way to inline in this case anyway, so we only compute the
> somewhat-complicated arguments once.

Yeah, that's what I did.  After a bunch more runs, the win wasn't consistent, but it showed up.

> One other note: we seem to compile two copies of the hot loop here; it looks
> like something is type-unstable (a bunch of I and O types the first time
> through; all U the second time through).
> 
> And if I inline the ScaleMatrix/AddScaledMatrix calls (not like the latter is
> called anyway, in my testing), and then re-roll those loops, performance goes
> down by a factor of 2, and the number of traces compiled goes from 4 to 10
> (presumably due to more type instability?).

One thing that I did is change the timing loop:

    const start = new Date();
    const now = new Date();
    while ((new Date() - start) < 1000) {
	calculateVerticesAndNormals(bt, v, i, output);
	vertices_skinned += N;
    }

to

    var start = Date.now();
    while (true) {
	calculateVerticesAndNormals(bt, v, i, output);
	vertices_skinned += N;

	if ((vertices_skinned % 100000) == 0 &&
	    (Date.now() - start) >= 1000)
	{
	    break;
	}
    }

to avoid constant Date object creation.  It made a difference, but not a huge one; it made me feel better, though!  (Also the testcase has some odd const usage; replacing most of those with var didn't change the results, but I think we largely ignore const for this?)
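The pattern in the rewritten loop, polling the clock only at batch boundaries, can be written as a small reusable harness.  This is a sketch; the function name and parameters are illustrative, not from the testcase:

```javascript
// Run workFn repeatedly for at least minMs milliseconds, but only
// consult Date.now() every checkEvery iterations, so the clock read
// (and, before vlad's change, the Date allocation) stays out of the
// hot loop.
function runForAtLeast(workFn, minMs, checkEvery) {
  var iterations = 0;
  var start = Date.now();
  do {
    workFn();
    iterations++;
  } while (iterations % checkEvery !== 0 || (Date.now() - start) < minMs);
  return iterations;
}
```

With minMs of 1000 and checkEvery of 100000 this behaves like the rewritten loop above, minus the per-call vertex accounting.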
Comment 6  Nicholas Nethercote [:njn] 2011-01-03 16:09:22 PST
(In reply to comment #0)
> 
> <vlad>    are we really guarding that these things are still functions?

Sounds like bug 606892 would help?

And I bet type inference (bug 608741) will help a *lot* with this program.
Comment 7  Boris Zbarsky [:bz] (still a bit busy) 2011-01-03 16:20:49 PST
Yeah, the date thing isn't too bad, esp on non-Windows.  And yes, const is pretty much ignored at runtime for optimization purposes.

Marking dependent on the obvious bugs that can help here.  Chances are, it'll be a few months before we get far enough on type inference to see an effect here, right?
Comment 8  Nicholas Nethercote [:njn] 2011-01-03 16:26:53 PST
(In reply to comment #7)
> 
> Marking dependent on the obvious bugs that can help here.  Chances are, it'll
> be a few months before we get far enough on type inference to see an effect
> here, right?

Yep.

Type inference will cannibalize some of LICM's potential benefits too.  E.g., LICM can often hoist the "is this an array?" test that currently occurs on every array access; type inference will typically eliminate that test outright.  IMO, type inference is crucial if JS is to have a chance of getting remotely close to C++ on computationally-intensive programs like this.
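The guard in question is emitted internally by the JIT, but the LICM-versus-type-inference distinction can be illustrated at source level.  These hand-written checks are stand-ins for engine-generated guards, purely for illustration:

```javascript
// Naive form: the loop-invariant "is this an array?" check runs on
// every iteration, the way a per-access guard would without LICM.
function sumChecked(a) {
  var total = 0;
  for (var i = 0; i < a.length; i++) {
    if (!Array.isArray(a)) throw new TypeError("expected an array");
    total += a[i];
  }
  return total;
}

// Hoisted form: LICM moves the invariant check before the loop so it
// runs once.  Type inference goes further: if the engine can prove the
// operand is always an array, the check disappears entirely.
function sumHoisted(a) {
  if (!Array.isArray(a)) throw new TypeError("expected an array");
  var total = 0;
  for (var i = 0; i < a.length; i++) {
    total += a[i];
  }
  return total;
}
```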
Comment 9  Vladimir Vukicevic [:vlad] [:vladv] 2011-01-03 17:04:17 PST
I'm not as convinced, but maybe my methodology is flawed.  I went in and stubbed out guardClass to be a no-op -- it only gave back a few percent of perf.  Might even have been noise.  Given that I don't see any side exits other than the normal loop termination, it should be safe to stub out -all- guards here and see what the perf is like, right?  I suspect we'll still be a long ways off.
Comment 10  Nicholas Nethercote [:njn] 2011-09-29 09:22:43 PDT
I wonder how we're doing on this now that type inference is enabled.
Comment 11  Alon Zakai (:azakai) 2013-09-22 09:24:10 PDT
Almost 2 years have passed here. The non-SSE version of this benchmark is tested on

http://arewefastyet.com/#machine=12&view=breakdown&suite=asmjs-ubench

and it runs at about half the speed of native with odin. Not quite native yet, but getting close.
