1341633 - UE4 Zen Garden demo runs 7.6% slower in wasm compared to asm.js

Reporter

Description

•

7 years ago

Benchmarking an upcoming Unreal Engine 4 WebGL 2 + WebAssembly demo in a special execution mode where all WebGL 2 calls are no-opped out to maximize stress on wasm execution itself and not on WebGL 2, it looks like asm.js is winning by a small but noticeable margin.

The numbers are as follows:

asm.js:

 Run Date       Total time (lower is better) |   FPS | CPU Time | CPU Idle | Page load time | # of janks
--------------------------------------------------------------------------------------------------------
 2017-02-22 16:10            28437ms |272.79 |  11238ms |    2.13% |     16467.24ms |                0
 2017-02-22 16:11            34596ms |259.33 |  11532ms |    3.99% |     17205.26ms |                5
 2017-02-22 16:11            27986ms |270.20 |  11348ms |    2.33% |     15748.36ms |                0
 2017-02-22 16:12            27774ms |274.83 |  11039ms |    2.99% |     15797.39ms |                1
 2017-02-22 16:12            28916ms |260.97 |  11719ms |    2.47% |     16239.13ms |                1
 2017-02-22 16:13            28130ms |274.35 |  11163ms |    2.33% |     16180.13ms |                0
 2017-02-22 16:13            29466ms |271.77 |  11244ms |    2.61% |     17259.35ms |                1
 2017-02-22 16:14            29288ms |262.49 |  11661ms |    2.47% |     16677.63ms |                1
 2017-02-22 16:14            28898ms |268.46 |  11403ms |    2.38% |     16574.93ms |                0
 2017-02-22 16:15            28441ms |269.51 |  11348ms |    2.48% |     16176.18ms |                0
 (7 samples Averaged)        28585ms |268.03 |  11325ms |    2.44% |     16301.80ms |              0.4

wasm:

 Run Date       Total time (lower is better) |   FPS | CPU Time | CPU Idle | Page load time | # of janks
--------------------------------------------------------------------------------------------------------
 2017-02-22 16:16            29542ms |255.94 |  11981ms |    2.31% |     16792.20ms |                2
 2017-02-22 16:17            27529ms |264.26 |  11599ms |    2.25% |     15047.03ms |                1
 2017-02-22 16:17            27833ms |243.70 |  12315ms |    4.00% |     14256.06ms |                5
 2017-02-22 16:18            28394ms |244.93 |  12278ms |    4.06% |     15100.01ms |                5
 2017-02-22 16:18            28821ms |252.72 |  12133ms |    2.17% |     15845.27ms |                1
 2017-02-22 16:19            28432ms |248.81 |  12110ms |    3.93% |     15189.23ms |                6
 2017-02-22 16:19            28343ms |241.16 |  12477ms |    3.89% |     14679.72ms |                5
 2017-02-22 16:20            28048ms |242.36 |  12485ms |    3.39% |     14421.75ms |                4
 2017-02-22 16:20            27809ms |249.11 |  12132ms |    3.49% |     14587.31ms |                5
 2017-02-22 16:21            29136ms |242.98 |  12356ms |    3.98% |     15658.47ms |                6
 (7 samples Averaged)        28240ms |246.37 |  12186ms |    3.32% |     14954.79ms |              3.9

The relevant column is the "CPU Time" column, which measures how long asm.js/wasm code execution took inside the requestAnimationFrame() functions accumulated, so it will ignore scheduling due to vsync etc. Last line shows average score of 7/10 results, with three outliers removed.

Averaged, asm.js gives a score of 11325ms, and wasm gives 12186ms, which is 7.6% slower than asm.js. Looking at geckoprofiler, this does not seem to be explained by none of bug 1340219, bug 1338914 or https://github.com/WebAssembly/design/issues/986. This could be due to bug 1340235 but profiles don't show anything regarding one way or the other. It could also be something completely different.

Please contact me for a test case.

Luke Wagner [:luke]

Comment 1

•

7 years ago

To help determine whether this is engine-side or compiler-side, what numbers do you see for Chrome (Canary)?

Jukka Jylänki

Reporter

Comment 2

•

7 years ago

Sent a test case via email. I don't currently have Canary set up to test, I'll try that once I have a free time slice for that.

Luke Wagner [:luke]

Comment 3

•

7 years ago

Attached patch measure-calls-loads-stores — Details — Splinter Review

So attached is a measurement patch I whipped up to count the total number of loads, stores, internal calls, import calls, and indirect calls executed.

My understanding is that, given a deterministic workload, these numbers should be meaningful and roughly the same between wasm and asm.js.  (Technically, wasm should have a smaller numbers of loads/stores b/c asm.js has to break up i64 and unaligned accesses into multiple smaller accesses.)

So the first thing I noticed is that the workload isn't entirely deterministic, but the variance is small (<1% difference in each category), so still good enough to do a gross comparison.

Given that, the differences I saw between wasm and asm.js are significant:

               wasm   asm.js
load           16.1B  14.8B
store          6.14B  6.17B
direct call    600M   535M
import call    92M    16.5M
indirect call  214M   214M

So we can see that wasm is being given 8.7% more loads, 12.1% more calls, and 5.5x more calls to imports.  Together, it seems like this could explain the slowdown.

(Note on the patch if you want to try it out: the numbers get accumulated process-wide and you print them by calling Math.toSource() from the console which also resets the counters.)

Alon Zakai (:azakai)

Comment 4

•

7 years ago

The 5.5x more import calls suggests this is the known precision issue (that UE4 can't turn on fast imprecise mode because it crashes). If there's a way to print out the # of calls per import, that could confirm. Another option could be to add a counter in the JS imports that are called, given the relevant suspects.

Luke Wagner [:luke]

Comment 5

•

7 years ago

Yeah, imports could be a major part of it.  Do you have any idea about the 1.3B extra loads?  This seemed pretty surprising.

Jukka Jylänki

Reporter

Comment 6

•

7 years ago

Looking at the gecko profiles I posted in the email thread, the amount of time spent in the i32s-div and float-to-int imports was not nearly enough to account for 7.6%. i32s-div is the hottest with 0.9% of total time, and f64-to-int is 0.3% of total time. js::wasm::Instance::callImport accounts for 0.2% + 0.1% + 0.1% of total time. All in all, that's 1.6% of total execution time that gecko profiler attributes for this. If it turns out that those indeed are the root cause, then I think there's a major bias in the gecko profiler itself which would be quite suspicious of the profiler itself that should be investigated as well.

Larger number of direct calls is also interesting, if I understand correctly we should not have any extra functions compiled inside wasm itself?

Alon Zakai (:azakai)

Comment 7

•

7 years ago

The extra direct calls, I might guess are the calls to the safe-i32-div/rem (i.e. the optimized precise versions, which guard against div/rem of 0). Those are extra in wasm. If we can profile the # of direct calls per target, that could confirm.

But no idea about the loads. If we can profile # of loads per function then we could find out. (Neither the extra imports or extra direct calls suggests anything about the loads, as the precise stuff shouldn't affect loads at all. Very strange!)

Luke Wagner [:luke]

Comment 8

•

7 years ago

For comparison, I measured embenchen's bullet (arg=3) (where wasm is 13% faster on my machine).

So first, bullet is *almost* entirely deterministic; between two runs, the load and store counts varied by only 1 or 2 and the rest of the fields were identical.

                wasm    asm.js
load            1.68B   1.89B
store           646M    864M
direct call     20.5M   18.7M
import call     644K    14K
indirect call   18.9M   18.9M

So this time we see wasm with far fewer loads/stores.  But other than that, the same pattern as above of more import and direct calls and roughly the same number of indirect calls.

Luke Wagner [:luke]

Comment 9

•

7 years ago

(In reply to Jukka Jylänki from comment #6)
Are you also seeing/including the "fast FFI trampoline" and "slow FFI trampoline" self time?  But yes, the profiler could be missing some time here.  But I expect this isn't all of it and the issues (particularly the loads) are a major component too.

(In reply to Alon Zakai (:azakai) from comment #7)
It's a bit of work to do the finer-grained per-function profiling from the machine code; is it possible to harvest the data you need from Emscripten?  This might also be beneficial if it allows you to figure out which functions are supposed to correspond (so you can compare them pairwise); in wasm/asm.js all I'd have is indices which I assume don't correspond between asm.js/wasm.

Jukka Jylänki

Reporter

Comment 10

•

7 years ago

Here's some more data points for the cpu time fields:

FF Nightly 54.0a1 (2017-02-22) (64-bit) asm.js: 11413ms
FF Nightly 54.0a1 (2017-02-22) (64-bit)   wasm: 12512ms
Chrome 58.0.3020.0 (Official Build) canary (64-bit) asm.js: 30775ms
Chrome 58.0.3020.0 (Official Build) canary (64-bit)   wasm: 16117ms

Alon Zakai (:azakai)

Comment 11

•

7 years ago

(In reply to Luke Wagner [:luke] from comment #8)
> For comparison, I measured embenchen's bullet (arg=3) (where wasm is 13%
> faster on my machine).
> 
> So first, bullet is *almost* entirely deterministic; between two runs, the
> load and store counts varied by only 1 or 2 and the rest of the fields were
> identical.
> 
>                 wasm    asm.js
> load            1.68B   1.89B
> store           646M    864M
> direct call     20.5M   18.7M
> import call     644K    14K
> indirect call   18.9M   18.9M
> 
> So this time we see wasm with far fewer loads/stores.  But other than that,
> the same pattern as above of more import and direct calls and roughly the
> same number of indirect calls.

Adding some manual logging in JS on each import, looks like the extra import calls (644K - 14K) are to Math.pow. This is something we also saw on Raybench. It was low priority there, but if this is a bigger factor in UE4, perhaps we should prioritize fixing that more - would be good to confirm it is in fact pow in UE4, though. Anyhow seems like optimizing pow would help bullet at least.

Luke Wagner [:luke]

Comment 12

•

7 years ago

So on bullet, the patch in bug 1340219 indeed takes the number of import calls from 644K down to 14K, matching asm.js exactly.  So that pretty much settles bullet.

However, it appears calls to imported Math.* builtins (with no intermediate JS function) only account for 717K import calls (out of the total 91M import calls) in UE4.

Luke Wagner [:luke]

Comment 13

•

7 years ago

Attached patch measure-loads — Details — Splinter Review

Here's a hacky patch that prints out load counts (when you run Math.toSource()) along with function names.  It shows some C library functions (mdct, vorbis) are way higher, but in the core engine C++ code, wasm is equal or better.

Jukka Jylänki

Reporter

Updated

•

7 years ago

Summary: UE4 timedemo runs 7.6% slower in wasm compared to asm.js → UE4 Zen Garden demo runs 7.6% slower in wasm compared to asm.js

Jukka Jylänki

Reporter

Comment 14

•

7 years ago

The demo is now live, with public test case URLs:

wasm: https://s3.amazonaws.com/mozilla-games/ZenGarden/EpicZenGarden.html?playback&novsync&fakegl&noaudio
asm.js: https://s3.amazonaws.com/mozilla-games/ZenGarden/EpicZenGarden.html?playback&novsync&fakegl&noaudio&asmjs

Link to an offline build of the demo for local testing: https://s3.amazonaws.com/mozilla-games/ZenGarden/2017-03-16-ZenGarden.zip

The results in current FF Nightly 55.0a1 (2017-03-16) (64-bit) gives:

> wasm:
>	"totalTime": 25104.535,
>	"totalRenderTime": 12325.125000000002,
>	"cpuTime": 11831.075000000006,
>	"fps": 259.6322552509609,
>	"pageLoadTime": 12700.52,
>	"WebAssembly.instantiate()": 59.335000000000036,
>	"main()": 3545.49,

> asm.js:
>	"totalTime": 28324.52,
>	"totalRenderTime": 13350.019999999999,
>	"cpuTime": 12792.315000000055,
>	"fps": 239.70001543068852,
>	"pageLoadTime": 14878.33,
>	"asm.js compilation": 1954.5400000000002,
>	"main()": 3481.625,

However this public URL is a newer build than the original one. The original one was identified to have an issue that some libraries had not been built with optimizations enabled, this newer build has that issue resolved. With this, wasm is a bit faster than asm.js on all interesting metrics in the demo. It could have been the case that unoptimized asm.js just was faster than unoptimized wasm, but it would be really good to verify.

Luke, would you be able to do the same test with the above optimized builds? That .zip file has both asm.js and wasm versions of the demo in it. Not a big priority, but would be good to close whether there might still be something else than bug 1340219 in play here?

Luke Wagner [:luke]

Comment 15

•

7 years ago

Great, so it looks like UE4 is back in line with the general ~10% improvement we see from wasm over asm.js.  Re-running the measurements I posted in comment 8, things look much better:

               wasm   asm.js
load           9.4B   10.2B
store          4.2B   5.0B
direct call    995M   960M
import call    91M    16M
indirect call  214M   214M

This seems to confirm that you fixed the underlying issue here.  I guess we can close this bug now?

Alon Zakai (:azakai)

Comment 16

•

7 years ago

I guess an interesting result from this is that unoptimized LLVM IR run through the asm.js optimizer does better than when run through the wasm optimizer. While optimized LLVM IR (what we've focused on) of course does better in the wasm optimizer. I suppose that's not a bug we need to fix anytime soon, but it's odd.

Jukka Jylänki

Reporter

Comment 17

•

7 years ago

Thanks Luke, those numbers look better. I believe the remaining difference in the import call is from the float-to-int conversion, which I think still routes out to JS in that build.

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → FIXED

measure-calls-loads-stores 7 years ago Luke Wagner [:luke] 9.91 KB, patch		Details \| Diff \| Splinter Review
measure-loads 7 years ago Luke Wagner [:luke] 12.04 KB, patch		Details \| Diff \| Splinter Review