Translated C++ code is not as fast as it could be

RESOLVED FIXED

Status

defect
9 years ago
8 years ago

People

(Reporter: azakai, Unassigned)

Details

Attachments

(2 attachments, 2 obsolete attachments)

Reporter

Description

9 years ago
Posted file: Ray tracing demo (obsolete)
Attachment is a C++ ray tracer, compiled to JavaScript using Emscripten. To test it, press 'Go!'. It will print out the time elapsed when it is done.

On my laptop, it takes 10.2 seconds on the latest TraceMonkey nightly. Chrome 6 is almost 4X faster, at 2.6 seconds.

Perhaps the issue is that the code relies heavily on emulated memory using the 'HEAP' global array, so lots of HEAP[x+y] = z and so forth? See discussion in bug 598655, things that improve that bug may help here. Although this code does *not* use switch() operations (the code is compiled - native looping structures are utilized), which is an issue there.
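To make the access pattern concrete, here is a minimal sketch of what "emulated memory" looks like in generated JS. Only the name HEAP comes from this report; STACKTOP, allocate() and the Vec layout are hypothetical and are not taken from Emscripten's actual output.

```javascript
// One flat JS array stands in for the C++ address space.
var HEAP = [];
var STACKTOP = 0; // hypothetical bump pointer into HEAP

// Hypothetical helper: reserve `slots` cells and return their base "address".
function allocate(slots) {
  var ptr = STACKTOP;
  for (var i = 0; i < slots; i++) HEAP[ptr + i] = 0;
  STACKTOP += slots;
  return ptr;
}

// C++: struct Vec { double x, y, z; };  Vec v; v.x = 1; v.y = 2; v.z = 3;
var v = allocate(3);
HEAP[v + 0] = 1; // v.x
HEAP[v + 1] = 2; // v.y
HEAP[v + 2] = 3; // v.z

// C++: double len2 = v.x*v.x + v.y*v.y + v.z*v.z;
// Every field read becomes a HEAP[base + offset] element access.
var len2 = HEAP[v] * HEAP[v] +
           HEAP[v + 1] * HEAP[v + 1] +
           HEAP[v + 2] * HEAP[v + 2];
```

In a ray tracer nearly every arithmetic operation touches a vector field, so almost all of the work ends up as element reads and writes on this single array.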

Links to original C++ code etc. are in the attached html file.
Reporter

Updated

9 years ago
Depends on: 598655
Reporter

Comment 1

9 years ago
Hmm, running Firebug, it seems most of the time is spent in memory allocation functions, which are called a lot but do very little (stackAlloc(), stackEnter(), stackExit(), etc.). So perhaps the efficiency of function calls is a factor here.

(The generated code should be much more efficient, for sure - focus so far has been on accuracy, not performance.)
Comment 2

I see a fair amount of memory traffic in the profile: the finalizer thread is 15% of the total samples, and allocating arrays (the allocations themselves, not the GC) is 10%. Other than that, there are many methodjit stub calls.

Alon, I wouldn't trust the Firebug profiler to tell you anything useful at all...
Reporter

Comment 3

9 years ago
> Alon, I wouldn't trust the Firebug profiler to tell you
> anything useful at all...

I guess not, but I was hoping at least the order would be meaningful.

Anyhow it does seem that there are many, many calls to short functions. Perhaps v8 does better at that sort of thing?
Comment 4

(In reply to comment #0)
> 
> Perhaps the issue is that the code relies heavily on emulated memory using the
> 'HEAP' global array, so lots of HEAP[x+y] = z and so forth? See discussion in
> bug 598655, things that improve that bug may help here. Although this code does
> *not* use switch() operations (the code is compiled - native looping structures
> are utilized), which is an issue there.

I think the most concrete suggestion in bug 598655 was for a COPYELEM bytecode, which would speed up copies within an array, e.g. "a[i] = a[j]". I see some cases of that in your benchmark, but mostly they're just normal array accesses. Kraken is similar in having lots of interleaved GETELEM and SETELEM opcodes, so I'll be looking to improve that in the tracer; e.g. bug 584279 (currently waiting for a review) is a step in that direction. (That's assuming this benchmark traces well.)
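For illustration, the "a[i] = a[j]" pattern typically shows up in translated code when a struct is copied from one place in emulated memory to another. The sketch below is an assumption about what such code might look like; copyCells() is a hypothetical name, and nothing here shows how an engine would implement a COPYELEM bytecode.

```javascript
// Emulated memory holding a 3-cell struct at address 0 and free space at 3.
var HEAP = [10, 20, 30, 0, 0, 0];

// Hypothetical helper, roughly what a translated struct assignment or a
// small memcpy boils down to: each iteration is one element read (GETELEM)
// feeding directly into one element write (SETELEM) on the same array.
function copyCells(dest, src, count) {
  for (var i = 0; i < count; i++) {
    HEAP[dest + i] = HEAP[src + i];
  }
}

copyCells(3, 0, 3); // C++: b = a; for a 3-field struct
```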

Alon, if you use typed arrays (see eg. https://developer.mozilla.org/en/JavaScript_typed_arrays, though that's incomplete, Google may tell you more) that might help a lot.
Reporter

Comment 5

9 years ago
Thanks, yeah, typed arrays are on my todo list. It isn't trivial, since I'll need separate arrays for ints and floats, and another for everything else. But it looks like typed arrays speed up fannkuch, a simple benchmark of compiled C++ code (it happens to use only ints, so it was easy to test), by 30%.
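A rough sketch of the split described above, assuming one typed array per value kind plus a plain array as a fallback. The names IHEAP and FHEAP, the size, and the store helpers are illustrative assumptions, not the generated code.

```javascript
var TOTAL_CELLS = 1024;                    // hypothetical heap size

var IHEAP = new Int32Array(TOTAL_CELLS);   // integer values
var FHEAP = new Float64Array(TOTAL_CELLS); // floating-point values
var HEAP  = new Array(TOTAL_CELLS);        // everything else (e.g. object references)

// The translator has to know the C++ type at each access to pick the array.
function storeInt(ptr, value)   { IHEAP[ptr] = value; }
function storeFloat(ptr, value) { FHEAP[ptr] = value; }

storeInt(0, 3.7);   // Int32Array truncates toward zero, like a C int conversion
storeFloat(1, 0.5); // Float64Array keeps the full double

var i = IHEAP[0];
var f = FHEAP[1];
```

The difficulty mentioned above is visible here: the same address can no longer be read back as a different type without extra bookkeeping, which is why code that only uses ints is the easy case.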
Reporter

Comment 6

9 years ago
Posted file: Much better code for testing (obsolete)
Ok, this is with inlining all the small method calls. The previous code was bottlenecked on that entirely - this runs several times faster, and is closer to what code compiled from C++ should look like.

The difference between V8 and SM remains: V8 takes 5.2 seconds, trunk TraceMonkey takes 17.8 seconds, which is almost 3.5X slower.
Attachment #480974 - Attachment is obsolete: true
Comment 7

Given that TM would inline those anyway, sounds like we weren't tracing this? Or do you mean you inlined the C calls (thus not having to do the stack setup etc)?
Reporter

Comment 8

9 years ago
I meant that I inlined when compiling the C++ to JS. So before there were calls to stackEnter() in the generated JS code, and I replaced them with the contents of that function. The original C++ was not changed.
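As a sketch of the kind of transformation meant here: the function names stackEnter(), stackExit() and stackAlloc() are from comment 1, but their bodies below are guesses made for illustration, as are STACKTOP and STACK_STACK.

```javascript
var STACKTOP = 0;     // hypothetical top of the emulated C stack
var STACK_STACK = []; // hypothetical record of saved stack tops

// Guessed bodies: tiny helpers called at the start and end of each function.
function stackEnter() { STACK_STACK.push(STACKTOP); }
function stackExit()  { STACKTOP = STACK_STACK.pop(); }
function stackAlloc(size) { var ret = STACKTOP; STACKTOP += size; return ret; }

// Before: every translated function pays for three extra JS calls.
function withCalls() {
  stackEnter();
  var p = stackAlloc(3);
  stackExit();
  return p;
}

// After: the translator pastes the helper bodies in place. A local variable
// can replace the push/pop pair because the saved value never leaves the
// function.
function inlined() {
  var saved = STACKTOP;                // was stackEnter()
  var p = STACKTOP; STACKTOP += 3;     // was stackAlloc(3)
  STACKTOP = saved;                    // was stackExit()
  return p;
}
```

Both versions return the same address and leave STACKTOP unchanged; only the number of JS-level calls differs.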
Comment 9

OK, so yeah, the tracer would have done that anyway.
Reporter

Comment 10

9 years ago
Hmm, perhaps it isn't surprising that those weren't traced - those function calls were not inside loops, they were done right at the beginning of a function, before any loops.
Right, but are those functions in any loops?  I guess maybe not well enough to trace.
Inlining in the C++ to JS translation process can indeed win more than current JS-level optimizations.
Getting TMFLAGS=stats output would be good.

/be
Reporter

Comment 13

9 years ago
Ok, did pulls and clean rebuilds for everything, here is some better data (running on a faster machine):

tm        7.61 seconds
tm -j     5.66
tm -m     3.10
tm -m -j  3.11
v8        1.91

(tm = tracemonkey trunk). So, with the best TraceMonkey settings (-m, 3.10 seconds), the difference from V8 is around 60%.
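For reference, the "around 60%" figure can be reproduced from the timings in the table. The helper name percentSlower() is made up for this sketch.

```javascript
// How much longer `slow` takes than `fast`, as a percentage of `fast`.
function percentSlower(slow, fast) {
  return (slow / fast - 1) * 100;
}

var tmMethodJit = 3.10; // "tm -m" from the table, in seconds
var v8 = 1.91;          // "v8" from the table, in seconds

var slowdown = percentSlower(tmMethodJit, v8); // roughly 62
```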

Results with TMFLAGS=stats:

recorder: started(51), aborted(39), completed(56), different header(0), trees trashed(15), slot promoted(0), unstable loop variable(3), breaks(32), returns(4), merged loop exits(0), unstableInnerCalls(10), blacklisted(530)
monitor: exits(630), timeouts(0), type mismatch(0), triggered(630), global mismatch(4), flushed(4)
Reporter

Updated

9 years ago
Depends on: 602366
Reporter

Comment 14

9 years ago
Same benchmark, after more optimizations, including a pass through Closure Compiler's advanced optimizations.

The code is now somewhat readable, unlike before, so hopefully easier to figure out what would make it faster. In particular it now looks like it would greatly benefit from a COPYELEM bytecode.

Benchmarks:
  v8:                2.63 seconds
  tracemonkey -m -j: 4.10 seconds (about 56% slower)
Attachment #481422 - Attachment is obsolete: true
Reporter

Comment 15

9 years ago
And here is a version that uses typed arrays.

Benchmarks:
  v8:                2.54
  tracemonkey -m -j: 3.32 (30% slower)

So, this is better than without typed arrays, but not tremendously so. Interestingly though, in other benchmarks with typed arrays, tracemonkey beats v8.
Reporter

Updated

9 years ago
Blocks: WebJSPerf
Reporter

Updated

9 years ago
Depends on: 594247
No longer depends on: 598655
Reporter

Updated

9 years ago
Attachment #482899 - Attachment is patch: false
Reporter

Updated

9 years ago
Attachment #482901 - Attachment is patch: false
Comment 16

I tried to adapt this to a browser test so I could run with Chrome 10, but it seems not to work in Chrome.
Reporter

Comment 17

8 years ago
You can see this code live here:

http://www.syntensity.com/static/raytrace.html

It prints out how long it takes to run. On my laptop I get 2.91 seconds in Firefox (nightly) and 3.53 seconds in Chrome 10. I suspect typed arrays are a factor here.

So, looks good!
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
Comment 18

At least until they make typed arrays work with Crankshaft... ;)