Open Bug 1341133 Opened 7 years ago Updated 2 years ago

WebAssembly execution performance stutters in Zen Garden Wasm demo in 64-bit Firefox, but enabling built-in profiler or geckoprofiler takes the performance stutters away

Categories: (Core :: JavaScript Engine, defect, P3)
People: (Reporter: jujjyl, Unassigned)
Details: (Keywords: triage-deferred)
Attachments: (5 files)

Here's something very peculiar. When benchmarking the performance of a WebAssembly + WebGL 2 demo built with Unreal Engine 4, performance is generally good, but the animation suffers from constant stuttering: execution misses frames randomly every second or two. The execution time that causes the stutters appears to be spent inside the wasm code running in the requestAnimationFrame loop.

However, enabling geckoprofiler or the built-in profiler to analyze where these stutters come from actually removes the stutters completely, giving perfectly smooth animation. Having the profiler enabled does have a small impact on overall performance (CPU load rises from 89.03% to 94.87%), but each rAF() then finishes consistently within 16 ms, as opposed to spiking to 30-50 ms when the profiler is disabled.
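For reference, here is a minimal sketch (not the demo's actual instrumentation; the 1.5x-budget threshold is an arbitrary assumption) of how such per-rAF frame times can be flagged from Emscripten/C++:

```cpp
// Minimal sketch: flag rAF ticks that blow the 60 fps frame budget.
// Not the demo's actual instrumentation; the 1.5x threshold is an
// arbitrary assumption.
#include <emscripten/html5.h>
#include <cstdio>

static double lastTime = 0.0;

static EM_BOOL tick(double time, void* /*userData*/) {
  // 'time' is the rAF timestamp in milliseconds; the delta between
  // consecutive ticks is the effective frame time.
  if (lastTime != 0.0) {
    double frameMs = time - lastTime;
    if (frameMs > 16.67 * 1.5) {
      printf("spike: %.2f ms frame\n", frameMs);
    }
  }
  lastTime = time;
  // ... run one frame of the engine here ...
  return EM_TRUE;  // keep the animation frame loop running
}

int main() {
  emscripten_request_animation_frame_loop(tick, nullptr);
  return 0;  // the Emscripten runtime stays alive to service the loop
}
```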

Why would enabling the profiler g
Attached image wasm_stutters.png
View of the Emscripten requestAnimationFrame performance profiler when the Firefox profiler is disabled. Each one-pixel column represents a single rAF tick, with the height of the column measuring the number of milliseconds the rAF took. Note the constant random-height spikes at random intervals.
Enabling geckoprofiler completely fixes performance, and all rAF()s now finish in < 16.67 ms for smooth 60 fps animation.
Oh, I notice the text was cut off in comment 0.

Why would enabling the profiler generate such a different performance footprint? I think this should not be anything GC- or CC-related, since the added time is spent inside the rAF(), and GC and CC run when the browser main loop is idle, right?
Could you perhaps profile with a native C++ profiler and see if any C++ functions (e.g., GC) occur during the stuttering frames? Profiling mode does affect tracing and discarding of JIT code, IIRC.
I don't have a native C++ profiler that would let me see into the stuttering frames. The native profilers that I have all aggregate samples together and don't show them on a timeline.

Here is an overview of what AMD CodeXL reports as hotspots.
Attached image hot_samples.png
And a detailed view.

The biggest hotspot, setUniform(), takes 986 samples, which corresponds to 1.70% of total execution time, so it is difficult to conclude that it is the slow item. moz_speex_resampler_set_rate_frac is about audio resampling, which should run the same with or without profiling enabled.

AMD CodeXL is not able to show wasm frames, so the "Unknown Module (PID 15540)" is a black box in this case.
Filed the speex part separately as bug 1341254, although that is not the root issue, since disabling all audio in the demo does not take the stutters away.
I'm unable to repro on Linux (Nightly or Beta). While I see if I can repro on Windows, one quick test would be useful: could you enable GC logging (javascript.options.mem.log = true), open the Browser Console (note: not the *Web* Console), and see if the GC logging output (which includes max pause times) coincides with the janks you're observing?
I am able to see more stuttery behavior on Windows (subjectively: much more in Beta than in Nightly), but looking at the GC log, GC activity doesn't seem to be the culprit. So far this isn't looking like a JS issue; it seems some timing instrumentation is needed here to narrow down what is going slower during these dropped frames.
Summary: WebAssembly execution performance stutters in requestAnimationFrame-based animation, but enabling built-in profiler or geckoprofiler takes the performance stutters away → WebAssembly execution performance stutters in Zen Garden Wasm demo, but enabling built-in profiler or geckoprofiler takes the performance stutters away
The demo went live today. Here is a public link, which force-disables WebGL and WebAudio to take them out of the equation and remove any stuttering they might cause:

https://s3.amazonaws.com/mozilla-games/tmp/2017-03-07-ZenGarden-disablewebaudioandwebgl/EpicZenGarden.html?playback&novsync&cpuprofiler&fakegl&noaudio

Note that this bug is about a different source of stuttering than bug 1345212. That bug focuses on the extra stuttering caused specifically by running in 32-bit Firefox, whereas this bug focuses on performance spikes in 64-bit Firefox that seem to vanish when the profiler is enabled.
Summary: WebAssembly execution performance stutters in Zen Garden Wasm demo, but enabling built-in profiler or geckoprofiler takes the performance stutters away → WebAssembly execution performance stutters in Zen Garden Wasm demo in 64-bit Firefox, but enabling built-in profiler or geckoprofiler takes the performance stutters away
Attached image zengarden_spikes.png
Here is a new profiling run, executed twice with geckoprofiler enabled. Both runs were practically stutter-free, except for three distinct, deterministically occurring spikes, which geckoprofiler attributes to the demo initializing particle systems the first time they come into view of the camera.

When running with geckoprofiler disabled, however (the two bottom graphs), there is much more short-term spiking in the execution of the demo's wasm code, or in some FFI calls it makes. These spikes are marked in red in the lower graphs.

Luke asked whether these could be GCs or CCs. My understanding is that GC & CC only run on the main thread when the browser main loop is idle; is that assumption correct? If so, in these cpuprofiler graphs such GC & CC spikes would show up in light green, light yellow, or light red (depending on how much time they take, i.e. the vertical height in the graph). The interesting spikes for this bug are drawn in blue in the lower graphs, which is time spent inside the user requestAnimationFrame() code.

The demo is heavily optimized to avoid temporary garbage, and there are only a handful of locations that generate any. Large memcpys dominate with 52% of all allocations, localtime_r does a new Date(), and mouse input injection needs to estimate bounding client rectangles for the canvas; that's about it.

Profiling the source is hard for this reason: I cannot attach geckoprofiler or the built-in profiler, because doing so takes the spikes away, and native profilers such as VTune or CodeXL are not able to show interactive timelines (nor do they show wasm call stacks). Attempting to instrument the UE4 codebase itself is also difficult, since the codebase is huge and I don't have a clue where to start, especially since the UE4 code does not look like the culprit, given that performance is smooth when geckoprofiler is enabled.

Are geckoprofiler and/or the built-in profiler expected to change how GC & CC operations behave?
One thing that comes to mind is that the Gecko profiler calls timeBeginPeriod/timeEndPeriod on Windows in order to increase the timer resolution. IIRC we also do that for requestAnimationFrame, but maybe with a different value.
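To illustrate the mechanism being described, here is a minimal Win32 sketch; the 1 ms period is an assumption, not necessarily the value Gecko requests:

```cpp
// Sketch of the Win32 timer-resolution mechanism described above.
// timeBeginPeriod/timeEndPeriod live in winmm; the 1 ms period is an
// assumption, not necessarily what the Gecko profiler requests.
#include <windows.h>
#pragma comment(lib, "winmm.lib")

int main() {
  // Raise the system timer resolution for the whole process. While in
  // effect, Sleep() and timer-based waits wake much closer to their
  // requested deadlines, which can mask frame-scheduling jitter.
  if (timeBeginPeriod(1) == TIMERR_NOERROR) {
    Sleep(16);         // wakes near 16 ms instead of rounding up to the
                       // default ~15.6 ms tick boundaries (up to ~31 ms)
    timeEndPeriod(1);  // must be paired with the same period value
  }
  return 0;
}
```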
When I tried this out in comment 9, I didn't see any problematic GC/CC behavior, so that's probably not the culprit.

Jukka, is there any way to use that integrated profiler whose graph you're showing to get some insight into what functions are running during those spikes?
(In reply to Luke Wagner [:luke] from comment #13)
> Jukka, is there any way to use that integrated profiler whose graph you're
> showing to get some insight into what functions are running during those
> spikes?

Unfortunately not. The integrated profiler is only able to hook into high-level JS DOM elements, since it's not possible to inject into WebAssembly code on the fly. If I were able to do that, I certainly would have attempted it a long time ago.

(In reply to Markus Stange [:mstange] from comment #12)
> One thing that comes to mind is that the Gecko profiler calls
> timeBeginPeriod/timeEndPeriod on Windows in order to increase the timer
> resolution. IIRC we also do that for requestAnimationFrame, but maybe with a
> different value.

Would you be able to help me figure out the effect of this? I'd like to either

a) make those time*Period calls happen even when profiling is disabled, i.e., perhaps always run with them set to their "profiling enabled" states, or

b) adjust profiler to run without changing any of the time*Period values.
b) is a lot easier. Just remove the two calls in tools/profiler/core/platform-win32.cpp.
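As a hypothetical diagnostic (not from this bug), hypothesis a) can also be sanity-checked by measuring the effective Sleep(1) granularity with the profiler on and off; a coarse (~15-16 ms) worst case means the default timer resolution is in effect:

```cpp
// Hypothetical diagnostic: measure worst-case Sleep(1) wake-up latency
// to see whether a raised timer resolution (timeBeginPeriod) is active.
#include <windows.h>
#include <cstdio>

int main() {
  LARGE_INTEGER freq, t0, t1;
  QueryPerformanceFrequency(&freq);
  double worstMs = 0.0;
  for (int i = 0; i < 100; ++i) {
    QueryPerformanceCounter(&t0);
    Sleep(1);  // asks for 1 ms; the actual wake-up depends on resolution
    QueryPerformanceCounter(&t1);
    double ms = (t1.QuadPart - t0.QuadPart) * 1000.0 / freq.QuadPart;
    if (ms > worstMs) worstMs = ms;
  }
  // ~1-2 ms implies a raised resolution; ~15-16 ms implies the default.
  printf("worst Sleep(1) wake-up: %.2f ms\n", worstMs);
  return 0;
}
```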
Keywords: triage-deferred
Priority: -- → P3
Severity: normal → S3