Closed Bug 1340106 Opened 7 years ago Closed 2 years ago

Raybench runs +35.9% slower in wasm compared to asm.js

Categories

(Core :: JavaScript Engine: JIT, defect, P3)

RESOLVED FIXED

People

(Reporter: jujjyl, Unassigned)

References

(Blocks 2 open bugs)

Details

Attachments

(3 files)

Attached file raybench.tar.gz
Testing performance of Lars Hansen's Raybench app from http://github.com/lars-t-hansen/moz-sandbox, it looks like the wasm variant is running much slower than the asm.js one. Attached both asm.js and wasm builds that can be run and profiled locally.

The asm.js run on my Linux box takes 9.2 seconds, whereas the wasm run takes 12.5 seconds.

The application is pure floating point number crunching, so it should run at the same speed as asm.js; if it does not, we should figure out what causes such a large discrepancy.
Attached image raybench_asmjs.png
Benchmarking the asm.js run on a Windows PC with Intel i7 5960X:

Setup time: 0 ms
Render time: 7139 ms

See geckoprofile of a Windows run at https://perfht.ml/2kMUyQf
Attached image raybench_wasm.png
Benchmark run of the wasm version on the same Windows i7 5960X:

Setup time: 0 ms
Render time: 10211 ms

which is +43% more time compared to the asm.js version.

See a geckoprofile of the run here: https://perfht.ml/2kMP8of
One thing in particular that pops up when comparing the above two profiles is that the wasm version has a number of slow FFI trampolines to the pow function (see the screenshot above), whereas the pow function shows up as "native call" for the asm.js version. Is there a difference in how these are handled in asm.js vs wasm? Perhaps wasm is taking a slow path in reaching the pow() built-in?

This looks a bit similar to bug 1339089 where the floor() function is taking a slow path in wasm. Perhaps these share some of the same characteristics?
Flags: needinfo?(luke)
Flags: needinfo?(bbouvier)
Removed all calls to the pow() function in Raybench locally to see if that would explain the perf difference, and it does change the landscape a little, but not nearly enough to explain the overall performance difference.

Tweaking Emscripten build flags, it looks like -O1, -O2, -O3 and -Oz builds all run in ~12.5 seconds in wasm on the Linux PC (2.2GHz Intel Xeon), so Emscripten/Binaryen optimizations do not seem to have much effect.

Also the profiles show that there's no int div/rem in play here, nor float-to-int conversions, which are known to be a perf regression compared to asm.js. (https://github.com/kripken/emscripten/issues/4625, https://github.com/WebAssembly/design/issues/986, https://github.com/WebAssembly/binaryen/pull/907) So this is something else altogether.

The Emscripten wasm builds with -s BINARYEN_METHOD='native-wasm' vs -s BINARYEN_METHOD='native-wasm,asmjs' also run at equal performance, so Binaryen side codegen does not have much effect either.

One thing that I do see in the profiles that is relatively uncommon compared to other profiled apps is the heavy use of recursion. I wonder if either in Wasm backend or Binaryen this might have any difference. 

Otherwise this suggests that either a) Binaryen asm2wasm is generating slower code compared to asm.js, or b) the backend is generating slower x86 code for the wasm file than for asm.js. Had a brief look at the wasm-dis output for the hot functions, although nothing there really catches my eye. Alon, anything you might be able to get out of looking at the builds?
Flags: needinfo?(azakai)
(In reply to Jukka Jylänki from comment #4)
> 
> One thing that I do see in the profiles that is relatively uncommon compared
> to other profiled apps is the heavy use of recursion. I wonder if either in
> Wasm backend or Binaryen this might have any difference. 

Additionally, most function calls within the application that aren't inlined are virtual.
On my end I replaced the call to pow with a built-in Pow on integer powers; this speeds up the wasm version quite a bit but not the asm.js version.  So there's something there.
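
For reference, a replacement of this kind might look roughly like the sketch below. The code actually used is not shown in this bug, so the name `pow_int`, its signature, and the choice of exponentiation by squaring are all assumptions made for illustration. The point is only that an integer-power helper compiled into the module avoids the call out to the imported Math.pow.

```cpp
#include <cassert>

// Hypothetical helper (not taken from Raybench): raise `base` to a
// non-negative integer power by repeated squaring, so the module never
// calls out to the imported Math.pow for these cases.
double pow_int(double base, unsigned exponent) {
    double result = 1.0;
    while (exponent != 0) {
        if (exponent & 1u) {
            result *= base;      // fold in the current bit of the exponent
        }
        exponent >>= 1;
        if (exponent != 0) {
            base *= base;        // square only while bits remain
        }
    }
    return result;
}

// Tiny self-check a caller could run once at startup.
void pow_int_self_check() {
    assert(pow_int(2.0, 10) == 1024.0);
    assert(pow_int(7.0, 0) == 1.0);
}
```

A call such as `pow_int(x, 5)` would then stand in for `pow(x, 5.0)` wherever the exponent is known to be a small non-negative integer.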

Another hot function is sqrt.  Replacing Sqrt with a function that just does five iterations (unrolled) of Newton's approximation brings times down to where asm.js is, roughly; and the output image is recognizably the same, even if of poor quality.
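
A minimal sketch of what "five unrolled iterations of Newton's approximation" could look like is below. The function name `approx_sqrt` and the initial guess (the input itself) are assumptions, since the experiment's code is not attached here. With so few steps the result is only close for small inputs, which fits the poor image quality described above.

```cpp
#include <cassert>

// Hypothetical stand-in for sqrt: five unrolled Newton steps
// y <- (y + x / y) / 2, starting from the assumed initial guess y = x.
// Deliberately imprecise: accuracy degrades quickly for large x.
double approx_sqrt(double x) {
    if (x <= 0.0) {
        return 0.0;              // avoid dividing by zero; negatives clamp to 0
    }
    double y = x;                // assumed initial guess
    y = 0.5 * (y + x / y);       // iteration 1
    y = 0.5 * (y + x / y);       // iteration 2
    y = 0.5 * (y + x / y);       // iteration 3
    y = 0.5 * (y + x / y);       // iteration 4
    y = 0.5 * (y + x / y);       // iteration 5
    return y;
}

// Tiny self-check: near-exact for a small input.
void approx_sqrt_self_check() {
    double r = approx_sqrt(4.0);
    assert(r > 2.0 - 1e-6 && r < 2.0 + 1e-6);
}
```

For an input of 4.0 the five steps converge to within about 1e-6 of 2.0, while for 100.0 they are still off by a few hundredths, so any timing win from a routine like this has to be weighed against the wrong results it produces.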
Split off the pow() part to bug 1340219 for separate handling, since fixing that won't fix the whole benchmark. Interesting about sqrt, I'll see if I can find a small synthetic benchmark for that.
Depends on: 1340219
Note: wasm has f32.sqrt and f64.sqrt; is Emscripten emitting those?
Flags: needinfo?(luke)
Lars referred to virtual function calls in the recursion above, so created a test case about that, and that does uncover a 2.15x performance differential against wasm. See bug 1340235.
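
To illustrate the call pattern in question, here is a rough sketch of recursion through virtual calls. It is not the test case from bug 1340235; the types `Node`, `Add`, `Scale` and the driver `run_virtual_recursion` are made up for this example.

```cpp
#include <cassert>

// Hypothetical types: two node kinds that recurse into each other through
// the base-class interface, so every step is an indirect (virtual) call.
struct Node {
    virtual ~Node() = default;
    virtual double eval(const Node& other, int depth) const = 0;
};

struct Add : Node {
    double eval(const Node& other, int depth) const override {
        if (depth == 0) return 1.0;
        return 1.0 + other.eval(*this, depth - 1);
    }
};

struct Scale : Node {
    double eval(const Node& other, int depth) const override {
        if (depth == 0) return 1.0;
        return 0.5 * other.eval(*this, depth - 1);
    }
};

// Runs the mutually recursive virtual calls `repeats` times and sums the
// results. An optimizing compiler may devirtualize a case this simple, so a
// real benchmark would need to hide the dynamic types from it.
double run_virtual_recursion(int depth, int repeats) {
    Add add;
    Scale scale;
    const Node& a = add;
    const Node& s = scale;
    double total = 0.0;
    for (int i = 0; i < repeats; ++i) {
        total += a.eval(s, depth);
    }
    return total;
}

// Tiny self-check: depth 2 evaluates to 1 + 0.5 * 1.
void virtual_recursion_self_check() {
    assert(run_virtual_recursion(2, 1) == 1.5);
}
```

Timing a call like `run_virtual_recursion(64, 1000000)` in both an asm.js and a wasm build would be one way to compare indirect-call overhead; the depth and repeat count here are arbitrary.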
Depends on: 1340235
(In reply to Luke Wagner [:luke] from comment #8)
> Note: wasm has f32.sqrt and f64.sqrt; is Emscripten emitting those?

Yes, had a peek with wasm-dis and Emscripten is using f32.sqrt and f64.sqrt. I do not see a perf difference for sqrt in synthetic scenarios.

For pow, I see Emscripten generates

(import "global.Math" "pow" (func $import$3 (param f64 f64) (result f64)))

does there exist a f32.pow/f64.pow? The above will do a double precision pow() even for f32.
OK, probably best to ignore my comments about sqrt.  It is used, but the speedup I saw from my tweaking is probably a result of not computing a proper square root.
(In reply to Jukka Jylänki from comment #10)
> does there exist a f32.pow/f64.pow? The above will do a double precision
> pow() even for f32.

No, there is no f32.pow/f64.pow in wasm. However, asm.js also has no single precision pow, so it should be in the same boat.

If you really want a single precision pow, the only option I know of is to compile one (eg. from musl's libm).
(In reply to Luke Wagner [:luke] from comment #8)
> Note: wasm has f32.sqrt and f64.sqrt; is Emscripten emitting those?

Yeah, as Jukka saw, we emit those - we should emit every single thing LLVM IR has that has a wasm instruction, unless we have a bug of course.
Flags: needinfo?(azakai)
Blocks: wasm
Component: JavaScript Engine → JavaScript Engine: JIT
Flags: needinfo?(bbouvier)
Priority: -- → P3
Per policy at https://wiki.mozilla.org/Bug_Triage/Projects/Bug_Handling/Bug_Husbandry#Inactive_Bugs. If this bug is not an enhancement request or a bug not present in a supported release of Firefox, then it may be reopened.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INACTIVE
Status: RESOLVED → REOPENED
Resolution: INACTIVE → ---
Component: JavaScript Engine: JIT → Javascript: WebAssembly

Register allocation bug.

Component: Javascript: WebAssembly → JavaScript Engine: JIT
Depends on: 1742930

I love progress:

Xeon: wasm 7734ms, asm 10287ms, asm 33% slower than wasm, wasm WINS!
Apple M1: wasm 2889ms, asm 4371ms, asm 51% slower than wasm, wasm WINS AGAIN!

(This is with the original content as it was compiled then. The wasm blob is version 0xd so I had to remove the version check in the engine, but the rendered content looks correct and there's no reason to assume it is not.)
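
The percentages quoted in this bug all follow the same arithmetic, sketched below for anyone re-running the numbers. The helper name `percent_slower` is made up for this note.

```cpp
#include <cassert>

// Hypothetical helper: how much slower `slow_ms` is than `fast_ms`,
// expressed as a percentage of the faster time.
double percent_slower(double slow_ms, double fast_ms) {
    return (slow_ms / fast_ms - 1.0) * 100.0;
}

// Tiny self-check using the Xeon figures above (asm.js vs wasm).
void percent_slower_self_check() {
    double p = percent_slower(10287.0, 7734.0);
    assert(p > 32.5 && p < 33.5);
}
```

With the figures from this thread, `percent_slower(10287, 7734)` comes out at about 33 and `percent_slower(4371, 2889)` at about 51, matching the numbers above. The original Windows report, `percent_slower(10211, 7139)`, gives about 43.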

Status: REOPENED → RESOLVED
Closed: 6 years ago → 2 years ago
Resolution: --- → FIXED
No longer depends on: 1340235
See Also: → 1340235
Blocks: 1742930
Severity: normal → --
No longer depends on: 1742930