Bug 1340106
Raybench runs +35.9% slower in wasm compared to asm.js
Opened 8 years ago, closed 3 years ago
Status: RESOLVED FIXED
Category: Core :: JavaScript Engine: JIT (defect, P3)
People: Reporter: jujjyl; Assignee: Unassigned
References: Blocks 2 open bugs
Attachments: 3 files
While testing the performance of Lars Hansen's Raybench app from http://github.com/lars-t-hansen/moz-sandbox, I noticed that the wasm variant runs much slower than the asm.js one. I have attached both asm.js and wasm builds, which can be run and profiled locally.
On my Linux box, the asm.js run takes 9.2 seconds, whereas the wasm run takes 12.5 seconds.
The application is pure floating-point number crunching, so it should run at the same speed as asm.js; if it does not, we should figure out what causes such a large discrepancy.
Comment 1 • 8 years ago (Reporter)
Benchmark run of the asm.js version on a Windows PC with an Intel i7-5960X:
Setup time: 0 ms
Render time: 7139 ms
See a Gecko profile of the Windows run at https://perfht.ml/2kMUyQf
Comment 2 • 8 years ago (Reporter)
Benchmark run of the wasm version on the same Windows i7-5960X machine:
Setup time: 0 ms
Render time: 10211 ms
That is 43% more time than the asm.js version.
See a Gecko profile of the run here: https://perfht.ml/2kMP8of
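The slowdown percentages in the title and in this comment are simple ratios of the reported times. A minimal sketch of that arithmetic (the helper name `slowdown_pct` is hypothetical, not something from the bug):

```python
def slowdown_pct(slow: float, fast: float) -> float:
    """Extra time taken by the slower run, as a percentage of the faster run's time."""
    return (slow / fast - 1.0) * 100.0


# Linux box, seconds: wasm 12.5 vs asm.js 9.2 (the "+35.9%" in the bug title)
print(f"Linux:   +{slowdown_pct(12.5, 9.2):.1f}%")
# Windows i7-5960X, milliseconds: wasm 10211 vs asm.js 7139 (the "43%" above)
print(f"Windows: +{slowdown_pct(10211, 7139):.1f}%")
```

The same formula with the roles swapped, slowdown_pct(10287, 7734) and slowdown_pct(4371, 2889), gives roughly 33% and 51%, matching the asm.js-vs-wasm figures reported in comment 16.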
Comment 3 • 8 years ago (Reporter)
One thing that stands out when comparing the two profiles above is that the wasm version goes through a number of slow FFI trampolines to reach the pow function (see the attached screenshot), whereas in the asm.js version pow shows up as a "native call". Is there a difference in how these are handled in asm.js vs. wasm? Perhaps wasm takes a slow path to reach the pow() built-in?
This looks similar to bug 1339089, where the floor() function takes a slow path in wasm. Perhaps the two issues have something in common?
Flags: needinfo?(luke)
Flags: needinfo?(bbouvier)
Comment 4 • 8 years ago (Reporter)
I removed all calls to pow() in Raybench locally to see whether they explain the performance difference. This changes the picture a little, but not nearly enough to account for the overall gap.
Tweaking Emscripten build flags: the -O1, -O2, -O3 and -Oz wasm builds all run in about 12.5 seconds on the Linux PC (2.2 GHz Intel Xeon), so Emscripten/Binaryen optimizations do not seem to have much effect.
The profiles also show no integer div/rem and no float-to-int conversions, both of which are known performance regressions compared to asm.js (https://github.com/kripken/emscripten/issues/4625, https://github.com/WebAssembly/design/issues/986, https://github.com/WebAssembly/binaryen/pull/907), so this is something else altogether.
Wasm builds made with -s BINARYEN_METHOD='native-wasm' and with -s BINARYEN_METHOD='native-wasm,asmjs' also perform the same, so Binaryen-side codegen does not seem to matter much either.
One thing I do see in the profiles that is relatively uncommon compared to other apps I have profiled is heavy use of recursion. I wonder whether this makes a difference in either the wasm backend or Binaryen.
Otherwise, this suggests that either a) Binaryen's asm2wasm generates slower code than the asm.js version, or b) the backend generates slower x86 code for the wasm file than for the asm.js. I had a brief look at the wasm-dis output for the hot functions, but nothing there really catches my eye. Alon, can you get anything out of looking at the builds?
Flags: needinfo?(azakai)
Comment 5 • 8 years ago
(In reply to Jukka Jylänki from comment #4)
>
> One thing that I do see in the profiles that is relatively uncommon compared
> to other profiled apps is the heavy use of recursion. I wonder if either in
> Wasm backend or Binaryen this might have any difference.
Additionally, most function calls within the application that aren't inlined are virtual.
Comment 6 • 8 years ago
On my end, I replaced the call to pow with a built-in Pow for integer powers. This speeds up the wasm version quite a bit, but not the asm.js version, so there is something there.
Another hot function is sqrt. Replacing Sqrt with a function that just does five (unrolled) iterations of Newton's method brings the time down to roughly where asm.js is, and the output image is recognizably the same, if of poor quality.
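The replacement code itself is not included in this comment, so the following is only a sketch of the two techniques it names, under stated assumptions. `int_pow` (hypothetical name) raises a value to an integer power by repeated squaring instead of calling the general pow import, and `newton_sqrt5` (hypothetical name) runs exactly five unrolled Newton steps from an assumed starting guess of 1.0. Python is used purely for illustration; the benchmark itself is built with Emscripten.

```python
def int_pow(base: float, n: int) -> float:
    """base**n for integer n via exponentiation by squaring (O(log |n|) multiplies)."""
    if n < 0:
        return 1.0 / int_pow(base, -n)
    result = 1.0
    while n:
        if n & 1:          # low bit set: fold the current power of base into the result
            result *= base
        base *= base       # walk through base, base^2, base^4, ...
        n >>= 1
    return result


def newton_sqrt5(a: float) -> float:
    """Approximate sqrt(a) with five unrolled Newton steps: x <- (x + a/x) / 2.

    Assumes a starting guess of 1.0 (hypothetical choice). Accuracy degrades as a
    moves far from that guess, since five steps are not enough to converge.
    """
    if a <= 0.0:
        return 0.0  # simplification to avoid dividing by zero; a real sqrt gives NaN for a < 0
    x = 1.0
    x = 0.5 * (x + a / x)
    x = 0.5 * (x + a / x)
    x = 0.5 * (x + a / x)
    x = 0.5 * (x + a / x)
    x = 0.5 * (x + a / x)
    return x
```

With this sketch, newton_sqrt5(4.0) is within 1e-9 of 2.0, but newton_sqrt5(1e6) is still above 31000 rather than 1000: a fixed-iteration approximation like this can be fast yet inexact, which fits the poor image quality described above.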
Comment 7 • 8 years ago (Reporter)
I split the pow() part off into bug 1340219 for separate handling, since fixing it won't fix the whole benchmark. Interesting about sqrt; I'll see if I can write a small synthetic benchmark for that.
Depends on: 1340219
Comment 8 • 8 years ago
Note: wasm has f32.sqrt and f64.sqrt; is Emscripten emitting those?
Flags: needinfo?(luke)
Comment 9 • 8 years ago (Reporter)
Lars mentioned virtual function calls in the recursion above, so I created a test case for that, and it does uncover a 2.15x performance difference, with wasm being the slower one. See bug 1340235.
Depends on: 1340235
Comment 10 • 8 years ago (Reporter)
(In reply to Luke Wagner [:luke] from comment #8)
> Note: wasm has f32.sqrt and f64.sqrt; is Emscripten emitting those?
Yes, I had a peek with wasm-dis, and Emscripten is using f32.sqrt and f64.sqrt. I do not see a performance difference for sqrt in synthetic tests.
For pow, I see that Emscripten generates
(import "global.Math" "pow" (func $import$3 (param f64 f64) (result f64)))
Does an f32.pow/f64.pow exist? The import above computes a double-precision pow() even for f32 operands.
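To make the precision point concrete: with only the f64 Math.pow import shown above, an f32 pow has to widen both operands to f64, call the import, and round the result back to f32. A minimal Python sketch of that round trip, using `struct` to emulate f32 rounding (the helper names `to_f32` and `f32_pow_via_f64` are hypothetical, for illustration only):

```python
import math
import struct


def to_f32(x: float) -> float:
    """Round a Python float (an IEEE double) to the nearest IEEE single, like f32.demote_f64."""
    try:
        return struct.unpack("<f", struct.pack("<f", x))[0]
    except OverflowError:  # finite doubles that round past the f32 range become +/-infinity
        return math.copysign(math.inf, x)


def f32_pow_via_f64(x: float, y: float) -> float:
    """An f32 pow routed through a double-precision pow, as with the f64 import above."""
    wide = math.pow(to_f32(x), to_f32(y))  # f32 operands promoted to f64; pow evaluated in double
    return to_f32(wide)                    # result demoted back to f32
```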
Comment 11 • 8 years ago
OK, probably best to ignore my comments about sqrt. It is used, but the speedup I saw from my tweaking is probably a result of not computing a proper square root.
Comment 12 • 8 years ago
(In reply to Jukka Jylänki from comment #10)
> does there exist a f32.pow/f64.pow? The above will do a double precision
> pow() even for f32.
No, there is no f32.pow/f64.pow in wasm. However, asm.js has no single-precision pow either, so the two should be in the same boat.
If you really want a single-precision pow, the only option I know of is to compile one (e.g., from musl's libm).
Comment 13 • 8 years ago
(In reply to Luke Wagner [:luke] from comment #8)
> Note: wasm has f32.sqrt and f64.sqrt; is Emscripten emitting those?
Yeah, as Jukka saw, we emit those: we should emit every LLVM IR operation that has a corresponding wasm instruction, unless we have a bug, of course.
Flags: needinfo?(azakai)
Updated • 8 years ago
Priority: -- → P3
Comment 14 • 6 years ago
Per policy at https://wiki.mozilla.org/Bug_Triage/Projects/Bug_Handling/Bug_Husbandry#Inactive_Bugs. If this bug is not an enhancement request or a bug not present in a supported release of Firefox, then it may be reopened.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INACTIVE
Updated • 6 years ago
Status: RESOLVED → REOPENED
Resolution: INACTIVE → ---
Updated • 4 years ago
Component: JavaScript Engine: JIT → Javascript: WebAssembly
Comment 15 • 4 years ago
Register allocation bug.
Component: Javascript: WebAssembly → JavaScript Engine: JIT
Updated • 3 years ago
Blocks: wasm-jit-bugs
Comment 16 • 3 years ago
I love progress:
Xeon: wasm 7734 ms, asm.js 10287 ms; asm.js is 33% slower than wasm. Wasm WINS!
Apple M1: wasm 2889 ms, asm.js 4371 ms; asm.js is 51% slower than wasm. Wasm WINS AGAIN!
(This is with the original content, as it was compiled at the time. The wasm blob is version 0xd, so I had to remove the version check in the engine, but the rendered output looks correct and there is no reason to assume it is not.)
Status: REOPENED → RESOLVED
Closed: 6 years ago → 3 years ago
Resolution: --- → FIXED