Benchmarking an upcoming Unreal Engine 4 WebGL 2 + WebAssembly demo in a special execution mode where all WebGL 2 calls are no-opped out to maximize stress on wasm execution itself and not on WebGL 2, it looks like asm.js is winning by a small but noticeable margin. The numbers are as follows: asm.js: Run Date Total time (lower is better) | FPS | CPU Time | CPU Idle | Page load time | # of janks -------------------------------------------------------------------------------------------------------- 2017-02-22 16:10 28437ms |272.79 | 11238ms | 2.13% | 16467.24ms | 0 2017-02-22 16:11 34596ms |259.33 | 11532ms | 3.99% | 17205.26ms | 5 2017-02-22 16:11 27986ms |270.20 | 11348ms | 2.33% | 15748.36ms | 0 2017-02-22 16:12 27774ms |274.83 | 11039ms | 2.99% | 15797.39ms | 1 2017-02-22 16:12 28916ms |260.97 | 11719ms | 2.47% | 16239.13ms | 1 2017-02-22 16:13 28130ms |274.35 | 11163ms | 2.33% | 16180.13ms | 0 2017-02-22 16:13 29466ms |271.77 | 11244ms | 2.61% | 17259.35ms | 1 2017-02-22 16:14 29288ms |262.49 | 11661ms | 2.47% | 16677.63ms | 1 2017-02-22 16:14 28898ms |268.46 | 11403ms | 2.38% | 16574.93ms | 0 2017-02-22 16:15 28441ms |269.51 | 11348ms | 2.48% | 16176.18ms | 0 (7 samples Averaged) 28585ms |268.03 | 11325ms | 2.44% | 16301.80ms | 0.4 wasm: Run Date Total time (lower is better) | FPS | CPU Time | CPU Idle | Page load time | # of janks -------------------------------------------------------------------------------------------------------- 2017-02-22 16:16 29542ms |255.94 | 11981ms | 2.31% | 16792.20ms | 2 2017-02-22 16:17 27529ms |264.26 | 11599ms | 2.25% | 15047.03ms | 1 2017-02-22 16:17 27833ms |243.70 | 12315ms | 4.00% | 14256.06ms | 5 2017-02-22 16:18 28394ms |244.93 | 12278ms | 4.06% | 15100.01ms | 5 2017-02-22 16:18 28821ms |252.72 | 12133ms | 2.17% | 15845.27ms | 1 2017-02-22 16:19 28432ms |248.81 | 12110ms | 3.93% | 15189.23ms | 6 2017-02-22 16:19 28343ms |241.16 | 12477ms | 3.89% | 14679.72ms | 5 2017-02-22 16:20 28048ms |242.36 | 12485ms | 3.39% | 14421.75ms | 4 2017-02-22 16:20 27809ms |249.11 | 12132ms | 3.49% | 14587.31ms | 5 2017-02-22 16:21 29136ms |242.98 | 12356ms | 3.98% | 15658.47ms | 6 (7 samples Averaged) 28240ms |246.37 | 12186ms | 3.32% | 14954.79ms | 3.9 The relevant column is the "CPU Time" column, which measures how long asm.js/wasm code execution took inside the requestAnimationFrame() functions accumulated, so it will ignore scheduling due to vsync etc. Last line shows average score of 7/10 results, with three outliers removed. Averaged, asm.js gives a score of 11325ms, and wasm gives 12186ms, which is 7.6% slower than asm.js. Looking at geckoprofiler, this does not seem to be explained by none of bug 1340219, bug 1338914 or https://github.com/WebAssembly/design/issues/986. This could be due to bug 1340235 but profiles don't show anything regarding one way or the other. It could also be something completely different. Please contact me for a test case.
To help determine whether this is engine-side or compiler-side, what numbers do you see for Chrome (Canary)?
Sent a test case via email. I don't currently have Canary set up to test, I'll try that once I have a free time slice for that.
Created attachment 8840211 [details] [diff] [review] measure-calls-loads-stores So attached is a measurement patch I whipped up to count the total number of loads, stores, internal calls, import calls, and indirect calls executed. My understanding is that, given a deterministic workload, these numbers should be meaningful and roughly the same between wasm and asm.js. (Technically, wasm should have a smaller numbers of loads/stores b/c asm.js has to break up i64 and unaligned accesses into multiple smaller accesses.) So the first thing I noticed is that the workload isn't entirely deterministic, but the variance is small (<1% difference in each category), so still good enough to do a gross comparison. Given that, the differences I saw between wasm and asm.js are significant: wasm asm.js load 16.1B 14.8B store 6.14B 6.17B direct call 600M 535M import call 92M 16.5M indirect call 214M 214M So we can see that wasm is being given 8.7% more loads, 12.1% more calls, and 5.5x more calls to imports. Together, it seems like this could explain the slowdown. (Note on the patch if you want to try it out: the numbers get accumulated process-wide and you print them by calling Math.toSource() from the console which also resets the counters.)
The 5.5x more import calls suggests this is the known precision issue (that UE4 can't turn on fast imprecise mode because it crashes). If there's a way to print out the # of calls per import, that could confirm. Another option could be to add a counter in the JS imports that are called, given the relevant suspects.
Yeah, imports could be a major part of it. Do you have any idea about the 1.3B extra loads? This seemed pretty surprising.
Looking at the gecko profiles I posted in the email thread, the amount of time spent in the i32s-div and float-to-int imports was not nearly enough to account for 7.6%. i32s-div is the hottest with 0.9% of total time, and f64-to-int is 0.3% of total time. js::wasm::Instance::callImport accounts for 0.2% + 0.1% + 0.1% of total time. All in all, that's 1.6% of total execution time that gecko profiler attributes for this. If it turns out that those indeed are the root cause, then I think there's a major bias in the gecko profiler itself which would be quite suspicious of the profiler itself that should be investigated as well. Larger number of direct calls is also interesting, if I understand correctly we should not have any extra functions compiled inside wasm itself?
The extra direct calls, I might guess are the calls to the safe-i32-div/rem (i.e. the optimized precise versions, which guard against div/rem of 0). Those are extra in wasm. If we can profile the # of direct calls per target, that could confirm. But no idea about the loads. If we can profile # of loads per function then we could find out. (Neither the extra imports or extra direct calls suggests anything about the loads, as the precise stuff shouldn't affect loads at all. Very strange!)
For comparison, I measured embenchen's bullet (arg=3) (where wasm is 13% faster on my machine). So first, bullet is *almost* entirely deterministic; between two runs, the load and store counts varied by only 1 or 2 and the rest of the fields were identical. wasm asm.js load 1.68B 1.89B store 646M 864M direct call 20.5M 18.7M import call 644K 14K indirect call 18.9M 18.9M So this time we see wasm with far fewer loads/stores. But other than that, the same pattern as above of more import and direct calls and roughly the same number of indirect calls.
(In reply to Jukka Jylänki from comment #6) Are you also seeing/including the "fast FFI trampoline" and "slow FFI trampoline" self time? But yes, the profiler could be missing some time here. But I expect this isn't all of it and the issues (particularly the loads) are a major component too. (In reply to Alon Zakai (:azakai) from comment #7) It's a bit of work to do the finer-grained per-function profiling from the machine code; is it possible to harvest the data you need from Emscripten? This might also be beneficial if it allows you to figure out which functions are supposed to correspond (so you can compare them pairwise); in wasm/asm.js all I'd have is indices which I assume don't correspond between asm.js/wasm.
Here's some more data points for the cpu time fields: FF Nightly 54.0a1 (2017-02-22) (64-bit) asm.js: 11413ms FF Nightly 54.0a1 (2017-02-22) (64-bit) wasm: 12512ms Chrome 58.0.3020.0 (Official Build) canary (64-bit) asm.js: 30775ms Chrome 58.0.3020.0 (Official Build) canary (64-bit) wasm: 16117ms
(In reply to Luke Wagner [:luke] from comment #8) > For comparison, I measured embenchen's bullet (arg=3) (where wasm is 13% > faster on my machine). > > So first, bullet is *almost* entirely deterministic; between two runs, the > load and store counts varied by only 1 or 2 and the rest of the fields were > identical. > > wasm asm.js > load 1.68B 1.89B > store 646M 864M > direct call 20.5M 18.7M > import call 644K 14K > indirect call 18.9M 18.9M > > So this time we see wasm with far fewer loads/stores. But other than that, > the same pattern as above of more import and direct calls and roughly the > same number of indirect calls. Adding some manual logging in JS on each import, looks like the extra import calls (644K - 14K) are to Math.pow. This is something we also saw on Raybench. It was low priority there, but if this is a bigger factor in UE4, perhaps we should prioritize fixing that more - would be good to confirm it is in fact pow in UE4, though. Anyhow seems like optimizing pow would help bullet at least.
So on bullet, the patch in bug 1340219 indeed takes the number of import calls from 644K down to 14K, matching asm.js exactly. So that pretty much settles bullet. However, it appears calls to imported Math.* builtins (with no intermediate JS function) only account for 717K import calls (out of the total 91M import calls) in UE4.
Created attachment 8841716 [details] [diff] [review] measure-loads Here's a hacky patch that prints out load counts (when you run Math.toSource()) along with function names. It shows some C library functions (mdct, vorbis) are way higher, but in the core engine C++ code, wasm is equal or better.
Summary: UE4 timedemo runs 7.6% slower in wasm compared to asm.js → UE4 Zen Garden demo runs 7.6% slower in wasm compared to asm.js
The demo is now live, with public test case URLs: wasm: https://s3.amazonaws.com/mozilla-games/ZenGarden/EpicZenGarden.html?playback&novsync&fakegl&noaudio asm.js: https://s3.amazonaws.com/mozilla-games/ZenGarden/EpicZenGarden.html?playback&novsync&fakegl&noaudio&asmjs Link to an offline build of the demo for local testing: https://s3.amazonaws.com/mozilla-games/ZenGarden/2017-03-16-ZenGarden.zip The results in current FF Nightly 55.0a1 (2017-03-16) (64-bit) gives: > wasm: > "totalTime": 25104.535, > "totalRenderTime": 12325.125000000002, > "cpuTime": 11831.075000000006, > "fps": 259.6322552509609, > "pageLoadTime": 12700.52, > "WebAssembly.instantiate()": 59.335000000000036, > "main()": 3545.49, > asm.js: > "totalTime": 28324.52, > "totalRenderTime": 13350.019999999999, > "cpuTime": 12792.315000000055, > "fps": 239.70001543068852, > "pageLoadTime": 14878.33, > "asm.js compilation": 1954.5400000000002, > "main()": 3481.625, However this public URL is a newer build than the original one. The original one was identified to have an issue that some libraries had not been built with optimizations enabled, this newer build has that issue resolved. With this, wasm is a bit faster than asm.js on all interesting metrics in the demo. It could have been the case that unoptimized asm.js just was faster than unoptimized wasm, but it would be really good to verify. Luke, would you be able to do the same test with the above optimized builds? That .zip file has both asm.js and wasm versions of the demo in it. Not a big priority, but would be good to close whether there might still be something else than bug 1340219 in play here?
Great, so it looks like UE4 is back in line with the general ~10% improvement we see from wasm over asm.js. Re-running the measurements I posted in comment 8, things look much better: wasm asm.js load 9.4B 10.2B store 4.2B 5.0B direct call 995M 960M import call 91M 16M indirect call 214M 214M This seems to confirm that you fixed the underlying issue here. I guess we can close this bug now?
I guess an interesting result from this is that unoptimized LLVM IR run through the asm.js optimizer does better than when run through the wasm optimizer. While optimized LLVM IR (what we've focused on) of course does better in the wasm optimizer. I suppose that's not a bug we need to fix anytime soon, but it's odd.
Thanks Luke, those numbers look better. I believe the remaining difference in the import call is from the float-to-int conversion, which I think still routes out to JS in that build.
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.