Closed Bug 1340106 Opened 8 years ago Closed 3 years ago

Raybench runs +35.9% slower in wasm compared to asm.js

Tracking

()

Status:

RESOLVED FIXED

People

(Reporter: jujjyl, Unassigned)

References

(Blocks 2 open bugs)

Details

Attachments

(3 files)

raybench.tar.gz, 8 years ago, Jukka Jylänki, 211.16 KB, application/gzip
raybench_asmjs.png, 8 years ago, Jukka Jylänki, 208.89 KB, image/png
raybench_wasm.png, 8 years ago, Jukka Jylänki, 311.30 KB, image/png

Jukka Jylänki

Reporter

Description

•

8 years ago

Attached file raybench.tar.gz

Testing the performance of Lars Hansen's Raybench app from http://github.com/lars-t-hansen/moz-sandbox, it looks like the wasm variant is running much slower than the asm.js one. Attached are both asm.js and wasm builds that can be run and profiled locally. The asm.js run on my Linux box takes 9.2 seconds, whereas the wasm run takes 12.5 seconds. The application is pure floating-point number crunching, so it should run at the same speed as asm.js; if not, we should figure out what causes such a large discrepancy.

Jukka Jylänki

Reporter

Comment 1

•

8 years ago

Attached image raybench_asmjs.png

Benchmarking the asm.js run on a Windows PC with Intel i7 5960X:

Setup time: 0 ms
Render time: 7139 ms

See a geckoprofile of the Windows run at https://perfht.ml/2kMUyQf

Jukka Jylänki

Reporter

Comment 2

•

8 years ago

Attached image raybench_wasm.png

Benchmark run of the wasm version on the same Windows i7 5960X:

Setup time: 0 ms
Render time: 10211 ms

which is +43% more time compared to the asm.js version. See a geckoprofile of the run here: https://perfht.ml/2kMP8of
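For reference, the slowdown percentages quoted in this bug can be reproduced with a few lines of arithmetic. This is only a sketch: the helper name `percent_slower` is made up for illustration, and the inputs are the timings reported in the description and in comments 1 and 2.

```python
def percent_slower(baseline: float, candidate: float) -> float:
    """Return how much longer `candidate` takes than `baseline`, in percent."""
    return (candidate / baseline - 1.0) * 100.0


# Linux timings from the description, in seconds (asm.js vs wasm).
linux = percent_slower(9.2, 12.5)

# Windows timings from comments 1 and 2, in milliseconds (asm.js vs wasm).
windows = percent_slower(7139, 10211)

print(f"Linux:   +{linux:.1f}%")
print(f"Windows: +{windows:.1f}%")
```

The units cancel in the ratio, so seconds and milliseconds can each be used as reported.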

Jukka Jylänki

Reporter

Comment 3

•

8 years ago

One thing in particular that pops up when comparing the above two profiles is that the wasm version has a number of slow FFI trampolines to the pow function (see the screenshot above), whereas the pow function shows up as "native call" for the asm.js version. Is there a difference in how these are handled in asm.js vs wasm? Perhaps wasm is taking a slow path in reaching the pow() built-in? This looks a bit similar to bug 1339089, where the floor() function is taking a slow path in wasm. Perhaps these share some of the same characteristics?

Flags: needinfo?(luke)

Flags: needinfo?(bbouvier)

Jukka Jylänki

Reporter

Comment 4

•

8 years ago

Removed all calls to the pow() function in Raybench locally to see if that would explain the perf difference. It does change the landscape a little, but not nearly enough to explain the overall performance difference.

Tweaking Emscripten build flags, it looks like -O1, -O2, -O3 and -Oz builds all run in ~12.5 seconds in wasm on the Linux PC (2.2GHz Intel Xeon), so Emscripten/Binaryen optimizations do not seem to have much effect. The profiles also show that there's no int div/rem in play here, nor float-to-int conversions, which are known to be a perf regression compared to asm.js (https://github.com/kripken/emscripten/issues/4625, https://github.com/WebAssembly/design/issues/986, https://github.com/WebAssembly/binaryen/pull/907). So this is something else altogether.

The Emscripten wasm builds with -s BINARYEN_METHOD='native-wasm' vs -s BINARYEN_METHOD='native-wasm,asmjs' also run at equal performance, so Binaryen side codegen does not have much effect either.

One thing that I do see in the profiles that is relatively uncommon compared to other profiled apps is the heavy use of recursion. I wonder if either in Wasm backend or Binaryen this might have any difference. Otherwise this suggests that either a) Binaryen asm2wasm is generating slower code compared to asm.js, or b) the backend is generating slower x86 code from the wasm file compared to asm.js. Had a brief look at the wasm-dis output for the hot functions, although nothing there really catches my eye.

Alon, anything you might be able to get out of looking at the builds?

Flags: needinfo?(azakai)

Lars T Hansen [:lth]

Comment 5

•

8 years ago

(In reply to Jukka Jylänki from comment #4)
> One thing that I do see in the profiles that is relatively uncommon compared
> to other profiled apps is the heavy use of recursion. I wonder if either in
> Wasm backend or Binaryen this might have any difference.

Additionally, most function calls within the application that aren't inlined are virtual.

Lars T Hansen [:lth]

Comment 6

•

8 years ago

On my end I replaced the call to pow with a built-in Pow on integer powers; this speeds up the wasm version quite a bit but not the asm.js version. So there's something there.

Another hot function is sqrt. Replacing Sqrt with a function that just does five iterations (unrolled) of Newton's approximation brings times down to where asm.js is, roughly; and the output image is recognizably the same, even if of poor quality.
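The replacement Pow itself is not shown in this bug. As a rough illustration of what a pow restricted to integer exponents can look like, here is a sketch using exponentiation by squaring; the name `pow_int` and the choice of algorithm are assumptions, not the code that was actually benchmarked. It is written in Python for brevity, whereas Raybench itself is compiled through Emscripten.

```python
def pow_int(base: float, exponent: int) -> float:
    """Raise `base` to an integer `exponent` without calling the libm pow().

    Hypothetical helper: uses exponentiation by squaring, so it needs
    O(log |exponent|) multiplications and no call out to Math.pow.
    """
    if exponent < 0:
        return 1.0 / pow_int(base, -exponent)
    result = 1.0
    while exponent > 0:
        if exponent & 1:      # lowest bit set: fold the current square in
            result *= base
        base *= base          # square for the next bit
        exponent >>= 1
    return result


print(pow_int(2.0, 10))
print(pow_int(1.5, 3))
```

A routine like this stays entirely inside the compiled module, so it avoids whatever call path the imported pow takes.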

Jukka Jylänki

Reporter

Comment 7

•

8 years ago

Split off the pow() part to bug 1340219 for separate handling, since fixing that won't fix the whole benchmark. Interesting about sqrt; I'll see if I can come up with a small synthetic benchmark for that.

Depends on: 1340219

Luke Wagner [:luke]

Comment 8

•

8 years ago

Note: wasm has f32.sqrt and f64.sqrt; is Emscripten emitting those?

Flags: needinfo?(luke)

Jukka Jylänki

Reporter

Comment 9

•

8 years ago

Lars referred to virtual function calls in the recursion above, so I created a test case for that, and it does uncover a 2.15x performance differential against wasm. See bug 1340235.

Depends on: 1340235

Jukka Jylänki

Reporter

Comment 10

•

8 years ago

(In reply to Luke Wagner [:luke] from comment #8)
> Note: wasm has f32.sqrt and f64.sqrt; is Emscripten emitting those?

Yes, had a peek with wasm-dis and Emscripten is using f32.sqrt and f64.sqrt. I do not see a perf difference for sqrt in synthetic scenarios.

For pow, I see Emscripten generates

(import "global.Math" "pow" (func $import$3 (param f64 f64) (result f64)))

Does there exist a f32.pow/f64.pow? The above will do a double precision pow() even for f32.

Lars T Hansen [:lth]

Comment 11

•

8 years ago

OK, probably best to ignore my comments about sqrt. It is used, but the speedup I saw from my tweaking is probably a result of not computing a proper square root.
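To illustrate why a fixed five-step Newton iteration may not compute a proper square root, here is a small sketch. The starting guess of 1.0 is an assumption (comment 6 does not say which seed was used), and `newton_sqrt5` is a made-up name; the point is only that accuracy after five steps depends heavily on how far the seed is from the true root.

```python
import math


def newton_sqrt5(a: float, seed: float = 1.0) -> float:
    """Approximate sqrt(a) for a >= 0 with exactly five unrolled Newton steps.

    The seed value is an assumption made for this sketch.
    """
    x = seed
    x = 0.5 * (x + a / x)  # step 1
    x = 0.5 * (x + a / x)  # step 2
    x = 0.5 * (x + a / x)  # step 3
    x = 0.5 * (x + a / x)  # step 4
    x = 0.5 * (x + a / x)  # step 5
    return x


for value in (4.0, 10000.0):
    approx = newton_sqrt5(value)
    exact = math.sqrt(value)
    print(f"a={value}: newton={approx:.6f} exact={exact:.6f}")
```

With this seed, the result for 4.0 is very close to 2, while the result for 10000.0 is still several times larger than 100, which fits the "recognizably the same, even if of poor quality" image described in comment 6.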

Dan Gohman [:sunfish]

Comment 12

•

8 years ago

(In reply to Jukka Jylänki from comment #10)
> does there exist a f32.pow/f64.pow? The above will do a double precision
> pow() even for f32.

No, there is no f32.pow/f64.pow in wasm. However, asm.js also has no single precision pow, so it should be in the same boat. If you really want a single precision pow, the only option I know of is to compile one (eg. from musl's libm).
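As a sketch of what "double precision pow() even for f32" means in practice: the f32 operands are promoted to f64, the f64 pow is called, and the result is rounded back to f32. The helper names `to_f32` and `pow_f32_via_f64` below are hypothetical and only model the value semantics; they are not engine or Emscripten code.

```python
import math
import struct


def to_f32(x: float) -> float:
    """Round a Python float (f64) to the nearest f32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def pow_f32_via_f64(base: float, exponent: float) -> float:
    """Model an f32 pow that goes through the imported f64 pow.

    Hypothetical helper: promote both operands, call the double
    precision pow, then demote the result back to f32.
    """
    promoted_base = to_f32(base)      # f32 operand, held exactly in an f64
    promoted_exp = to_f32(exponent)
    return to_f32(math.pow(promoted_base, promoted_exp))


result = pow_f32_via_f64(2.0, 0.5)
print(result, math.sqrt(2.0))
```

The printed values differ after the first seven or so significant digits, because the final result is rounded to single precision even though the computation itself ran in double precision.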

Alon Zakai (:azakai)

Comment 13

•

8 years ago

(In reply to Luke Wagner [:luke] from comment #8)
> Note: wasm has f32.sqrt and f64.sqrt; is Emscripten emitting those?

Yeah, as Jukka saw, we emit those - we should emit every single thing LLVM IR has that has a wasm instruction, unless we have a bug of course.

Flags: needinfo?(azakai)

Benjamin Bouvier [:bbouvier] (inactive)

Updated

•

8 years ago

Blocks: wasm

Component: JavaScript Engine → JavaScript Engine: JIT

Flags: needinfo?(bbouvier)

Hannes Verschore [:h4writer]

Updated

•

8 years ago

Priority: -- → P3

BMO Automation

Comment 14

•

7 years ago

Per policy at https://wiki.mozilla.org/Bug_Triage/Projects/Bug_Handling/Bug_Husbandry#Inactive_Bugs. If this bug is not an enhancement request or a bug not present in a supported release of Firefox, then it may be reopened.

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → INACTIVE

Benjamin Bouvier [:bbouvier] (inactive)

Updated

•

7 years ago

Status: RESOLVED → REOPENED

Resolution: INACTIVE → ---

André Bargull [:anba]

Updated

•

4 years ago

Component: JavaScript Engine: JIT → Javascript: WebAssembly

Lars T Hansen [:lth]

Comment 15

•

4 years ago

Component: Javascript: WebAssembly → JavaScript Engine: JIT

Lars T Hansen [:lth]

Updated

•

4 years ago

Blocks: wasm-jit-bugs

Lars T Hansen [:lth]

Updated

•

4 years ago

Depends on: 1742930

Lars T Hansen [:lth]

Comment 16

•

3 years ago

I love progress:

Xeon: wasm 7734ms, asm 10287ms, asm 33% slower than wasm, wasm WINS!
Apple M1: wasm 2889ms, asm 4371ms, asm 51% slower than wasm, wasm WINS AGAIN!

(This is with the original content as it was compiled then. The wasm blob is version 0xd so I had to remove the version check in the engine, but the rendered content looks correct and there's no reason to assume it is not.)

Status: REOPENED → RESOLVED

Closed: 7 years ago → 3 years ago

Resolution: --- → FIXED

Mathew Hodson

Updated

•

3 years ago

No longer depends on: 1340235

Updated

•

3 years ago

Blocks: 1742930

Severity: normal → --

No longer depends on: 1742930
