Poor performance in webgl aquarium demo (even with EGL/DmaBuf)
Categories
(Core :: Graphics: CanvasWebGL, defect)
Tracking
Release | Tracking | Status |
---|---|---|
firefox86 | --- | disabled |
People
(Reporter: sk.griffinix, Unassigned)
References
(Depends on 1 open bug)
Details
(Keywords: nightly-community, perf)
Attachments
(2 files)
User Agent: Mozilla/5.0 (Android 10; Mobile; rv:86.0) Gecko/86.0 Firefox/86.0
Steps to reproduce:
- Enable webrender and widget.dmabuf-webgl.enabled, then launch Firefox with MOZ_X11_EGL=1.
- Run the webgl aquarium benchmark: https://webglsamples.org/aquarium/aquarium.html
Actual results:
Firefox runs it at 60fps until 1000 fish, drops to 23fps at 5000 fish and runs at 5-6fps at 30000 fish
Chrome runs it at 60fps at 10000 fish, 45-50 fps at 15000 and 22-23fps at 30000 fish
Expected results:
Performance should be comparable to chrome
Comment 2•4 years ago
Bugbug thinks this bug should belong to this component, but please revert this change in case of error.
Comment 3•4 years ago
Which results do you get on KDE Wayland (MOZ_ENABLE_WAYLAND=1) and KDE Xwayland (MOZ_X11_EGL=1) for comparison?
(Do you get the same results with Gnome Wayland and Gnome Xwayland?)
(Martin Stránský [:stransky] from bug 1586696 comment 6)
Bug 1608800 is also related; once it lands we can create dmabuf framebuffers/textures with modifiers. It should improve dmabuf performance for surfaces/textures used exclusively by the GPU, like the WebGL framebuffer.
(Martin Stránský [:stransky] from bug 1588736 comment 2 + bug 1588736 comment 3)
Gnome bug: https://bugzilla.gnome.org/show_bug.cgi?id=785779
This may be also related: https://phabricator.kde.org/T8067
(Martin Stránský [:stransky] from bug 1662409 comment 0)
Dmabuf modifiers are not used with WebGL right now.
Comment 4•4 years ago
For the record: the aquarium demo is known to perform much better on Chrome (at high fish counts) on all platforms (also win/mac), and there are a bunch of issues for that already. I remember having read that it's about spidermonkey vs v8, but on a short search I could only find e.g. bug 1663084
Anyhow, all I wanted to say is that for this specific demo the result is expected - and it's most likely not something we can fix in the dmabuf code.
Comment 5•4 years ago
Profiles would help with demonstrating that! :)
Comment 6•4 years ago
There's a Windows one in bug 1662811 comment 1 with 10000 fish IIUC (https://share.firefox.dev/3gT2DPq)
Comment 7•4 years ago
It's easy enough for someone who's running Linux to make one here.
Comment 8•4 years ago
Hehe, true. Here are two with 10000 fish on current nightly:
- wayland: https://share.firefox.dev/2WRuN60
- x11/egl: https://share.firefox.dev/2L64MNy
Both, wayland and x11/egl, use the dmabuf buffer sharing and show almost identical performance here.
Comment 9•4 years ago
Jan, would you know why some functions such as renderMono remain running in baseline-interpreter mode for multiple seconds in both profiles? (feel free to fork this to another bug blocking this one)
Comment 10•4 years ago
~25% of time in webgl dispatch code, so webgl's not the long pole here.
Comment 11•4 years ago
(In reply to Nicolas B. Pierron [:nbp] from comment #9)
Jan, would you know why some functions such as renderMono remain running in baseline-interpreter mode for multiple seconds in both profiles? (feel free to fork this to another bug blocking this one)
I looked into this but I think the profiler is buggy for JIT frames and as a result incorrectly attributes time to renderMono and some C++ functions. I filed bug 1691504 for that.
Comment 12•4 years ago
Here is a fresh profile (10000 fish) with the fix from bug 1691504: https://share.firefox.dev/3poBBnv
For comparison a chromium profile: https://share.firefox.dev/2Zkmscq
Comment 13•4 years ago
With fixed profiling, there is a lot of time (17%) in:
    tdl.math.pseudoRandom = function() {
      var math = tdl.math;
      return (math.randomSeed_ =
              (134775813 * math.randomSeed_ + 1) %
              math.RANDOM_RANGE_) / math.RANDOM_RANGE_;
    };
Comment 14•4 years ago
So this could just be bug 1518857? Bug 1673840 improved the situation for power-of-two moduli which are representable as int32_t, but math.RANDOM_RANGE_ is Math.pow(2, 32), so it doesn't use the fast path. Changing LModPowTwoD to accept larger values may help here.
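To make the int32 limitation concrete, here is a minimal sketch of the arithmetic only. It is not SpiderMonkey's LModPowTwoD code; the helper names modPow2Int32 and modPow2Double are made up for illustration, and both helpers assume a non-negative, integer-valued input.

```javascript
// Illustration only: helper names are hypothetical, not SpiderMonkey APIs.
const INT32_MAX = 2 ** 31 - 1;
const RANDOM_RANGE = Math.pow(2, 32); // same value the demo uses

// 2^32 does not fit in an int32: ToInt32 wraps it to 0.
console.log(RANDOM_RANGE > INT32_MAX); // true
console.log((RANDOM_RANGE | 0) === 0); // true

// For a non-negative int32 x and a power-of-two modulus m that fits in
// int32, the remainder is just a bit mask.
function modPow2Int32(x, m) {
  return x & (m - 1);
}

// For a non-negative integer-valued double and a power-of-two modulus,
// dividing and multiplying by m only change the exponent, so every step
// below is exact and the result equals x % m.
function modPow2Double(x, m) {
  return x - Math.floor(x / m) * m;
}

console.log(modPow2Int32(1000, 256) === 1000 % 256);
console.log(modPow2Double(134775814 * 40, RANDOM_RANGE) ===
            (134775814 * 40) % RANDOM_RANGE);
```

The mask trick cannot be reused as-is for 2^32 because JS bitwise operators truncate their operands to 32 bits, which is why a double-based path would be needed for this modulus.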
Comment 15•4 years ago
Took a quick look at this. Relative to Ion:
- Warp emits a pre-barrier when storing to math.randomSeed_, because it doesn't have TI to indicate that the previous value isn't a GC thing
- Warp loads math.RANDOM_RANGE_ out of the slot each time instead of using a constant value, presumably because of the singleton optimizations.
However, the overall difference between Ion and Warp on this is minimal (<5%), so it's likely that neither of those optimizations would help us much here. (However, we might need something like the second optimization to enable the LModPowTwoD improvements anba mentions above.)
This function is small enough that Warp is willing to inline it, and in general the LIR looks pretty reasonable. (I haven't looked at the actual generated code.) Unfortunately we end up using double arithmetic, but I don't think that can be avoided because the product of the multiplication is sometimes greater than 2^53, where double arithmetic loses precision, so we'd get the wrong answer if we tried computing this using 64-bit integer math.
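A small sketch of that precision point, using BigInt as a stand-in for exact integer math. The names nextSeedDouble and nextSeedExact are invented for this example; the multiplier and modulus are the ones from the demo's pseudoRandom function.

```javascript
// Illustration only: compares the demo's double arithmetic with exact
// integer arithmetic for a single step of the generator.
const MULTIPLIER = 134775813;
const RANDOM_RANGE = Math.pow(2, 32);

// What the demo computes: every operation is done on doubles.
function nextSeedDouble(seed) {
  return (MULTIPLIER * seed + 1) % RANDOM_RANGE;
}

// The same formula with exact integers (BigInt stands in for int64 math).
function nextSeedExact(seed) {
  return (BigInt(MULTIPLIER) * BigInt(seed) + 1n) % BigInt(RANDOM_RANGE);
}

const seed = 2 ** 31;
console.log(MULTIPLIER * seed > Number.MAX_SAFE_INTEGER); // true
console.log(nextSeedDouble(seed)); // 2147483648 (the "+ 1" is rounded away)
console.log(nextSeedExact(seed));  // 2147483649n
```

For this seed the product lies between 2^58 and 2^59, where adjacent doubles are 64 apart, so adding 1 has no effect. An engine that switched to exact integer math would therefore produce a different sequence from the one the double-based formula yields.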
Comment 16•4 years ago
Presumably Chrome has the same ISA constraints as us w.r.t. precision, so are we actually slower and they're still correct, and if so, what are they doing differently here?
Comment 17•4 years ago
Is it <5% faster overall? If it's 5% faster overall, and this is 17% of overall in the slow version, is this code then 5/17 = 30% faster in Ion? (Is that the faster one?)
It's possible the answer here will be a bunch of 3-5% improvements causing a big win in aggregate.
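The back-of-the-envelope arithmetic in that question can be written out as follows. This is only a sketch of the reasoning, under the assumption that the whole overall gain comes from the one function; impliedLocalGain is a made-up helper name.

```javascript
// Illustration only: if a function accounts for `share` of total time and
// speeding it up (and nothing else) saves `overallGain` of total time,
// the function itself got faster by overallGain / share.
function impliedLocalGain(overallGain, share) {
  return overallGain / share;
}

const share = 0.17;       // pseudoRandom's share in the fixed profile
const overallGain = 0.05; // hypothetical 5% overall improvement
console.log((impliedLocalGain(overallGain, share) * 100).toFixed(1) + "%"); // 29.4%
```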
Comment 18•4 years ago
(In reply to Jeff Gilbert [:jgilbert] from comment #16)
Presumably Chrome has the same ISA constraints as us w.r.t. precision, so are we actually slower and they're still correct, and if so, what are they doing differently here?
They're using x87 instructions. See bug 1518857, comment #0 for a more detailed explanation.
Comment 19•4 years ago
(In reply to Jeff Gilbert [:jgilbert] from comment #17)
Is it <5% faster overall? If it's 5% faster overall, and this is 17% of overall in the slow version, is this code then 5/17 = 30% faster in Ion? (Is that the faster one?)
It's possible the answer here will be a bunch of 3-5% improvements causing a big win in aggregate.
Taking this particular function, wrapping it in a loop, and comparing the performance with Ion (our old backend) to Warp (our new backend) we are <5% slower. (Warp deliberately traded off aggressive optimization of hot number-crunching loops for reduced overhead, which is a good deal overall but sometimes causes regressions on code like this.)
Digging in slightly deeper, the ~5% gap only occurs with inlining. If we turn inlining off, the difference between Ion and Warp is actually <0.5%. Warp's issue post-inlining is that we generate guards to ensure that tdl.math.pseudoRandom is still the same function we inlined, and LICM doesn't hoist those guards out of the loop because our alias analysis isn't precise enough to realize that writing to tdl.math.randomSeed_ won't change tdl.math.pseudoRandom. The cost of those guards is amortized over the body of the loop, so for more realistic code that actually uses the random value, the overhead of the extra guards would be smaller.
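For anyone who wants to try something similar, a micro-benchmark along those lines might look like the sketch below. This is not the harness used for the numbers above: the object layout mimics tdl.math, while runLoop, timeLoop and the iteration count are assumptions made for this example. Comparing backends would mean running the same file under different engine configurations.

```javascript
// Illustration only: stand-in for tdl.math with the demo's generator.
const math = { randomSeed_: 0, RANDOM_RANGE_: Math.pow(2, 32) };
math.pseudoRandom = function () {
  return (math.randomSeed_ =
          (134775813 * math.randomSeed_ + 1) %
          math.RANDOM_RANGE_) / math.RANDOM_RANGE_;
};

// Hot loop: the call goes through the property each iteration, which is
// the pattern that needs a "still the same function" guard once inlined.
function runLoop(iterations) {
  let sum = 0;
  for (let i = 0; i < iterations; i++) {
    sum += math.pseudoRandom();
  }
  return sum;
}

// Wall-clock timing; the checksum keeps the loop from being optimized away.
function timeLoop(iterations) {
  const start = Date.now();
  const sum = runLoop(iterations);
  return { elapsedMs: Date.now() - start, sum };
}

const result = timeLoop(1e6);
console.log(`1M calls: ${result.elapsedMs} ms (checksum ${result.sum})`);
```

Because the loop body does nothing but call the generator, any per-iteration guard cost shows up at its worst here, which matches the point about amortization in more realistic code.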
Comment 20•4 years ago
Leo, can you please run Firefox with dmabuf logging:
MOZ_LOG="Dmabuf:5" MOZ_X11_EGL=1 firefox
and attach the log here? I suspect Bug 1696869 may be related and we fall back to shm here.
Comment 21•4 years ago (Reporter)
I don't currently have access to a Linux device. I will try to post the log within a week.
Comment 22•4 years ago
Hi Martin, I can provide the log you requested, but for an Arch Linux + Nvidia system.
Before posting it and causing confusion (since you mention something related to an AMD Radeon driver configuration), I would like to ask whether you would find it useful or whether it is unrelated to this bug.
Comment 23•4 years ago (Reporter)
Comment 24•4 years ago (Reporter)
Some about:config entries that were active while the log was made. Do tell me if I need to change any and paste another log. I also ran the webgl sample while logging was enabled.
media.ffmpeg.dmabuf-textures.disabled false
media.ffmpeg.dmabuf-textures.enabled true
widget.dmabuf-textures.enabled false
widget.dmabuf-webgl.enabled true
Comment 25•4 years ago
From the log, dmabuf appears to be working correctly. Do you see any difference when you disable the dmabuf framebuffer, i.e. set widget.dmabuf-webgl.enabled to false and restart the browser?
Thanks.
Comment 26•4 years ago (Reporter)
(In reply to Martin Stránský [:stransky] from comment #25)
From the log, dmabuf appears to be working correctly. Do you see any difference when you disable the dmabuf framebuffer, i.e. set widget.dmabuf-webgl.enabled to false and restart the browser?
Thanks.
Setting widget.dmabuf-webgl.enabled to false actually reduces the frame rate by about 16-20%.
Comment 27•4 years ago (Reporter)
One thing I would like to point out particularly with regard to webgl aquarium benchmark is that frame rates drop suddenly in firefox with increasing number of fishes. It is something of the sort:
1000-60fps
5000-24fps
10000-13fps
15000-9fps
20000-7fps
25000-6fps
30000-5fps
With chrome on other hand, it is as follows:
1000-60fps
5000-60fps
10000-58fps
15000-45fps
20000-35fps
25000-29fps
30000-24fps
In Chrome, the decrease in fps is gradual as the number of elements increases, while Firefox for some reason takes a massive sudden dip. I am nowhere close to being trained in computers, but the only time I have seen such dips in performance was when not enough memory was available.
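One way to read these figures is to convert fps into frame time, since results at 60fps are presumably capped by the display refresh rate and hide the real per-frame cost. The sketch below only does arithmetic on the numbers reported above; it is not a new measurement, and frameTimeMs and msPerThousandFish are names made up for this example.

```javascript
// Illustration only: [fish count, fps] pairs copied from the lists above.
const firefox = [[1000, 60], [5000, 24], [10000, 13], [15000, 9],
                 [20000, 7], [25000, 6], [30000, 5]];
const chrome = [[1000, 60], [5000, 60], [10000, 58], [15000, 45],
                [20000, 35], [25000, 29], [30000, 24]];

// Time spent on one frame, in milliseconds.
function frameTimeMs(fps) {
  return 1000 / fps;
}

// Frame time normalized by fish count, to check for roughly linear scaling.
function msPerThousandFish(fish, fps) {
  return frameTimeMs(fps) / (fish / 1000);
}

for (const [name, rows] of [["Firefox", firefox], ["Chrome", chrome]]) {
  for (const [fish, fps] of rows) {
    console.log(`${name} ${fish} fish: ` +
                `${frameTimeMs(fps).toFixed(1)} ms/frame, ` +
                `${msPerThousandFish(fish, fps).toFixed(2)} ms per 1000 fish`);
  }
}
```

With these inputs, Firefox lands at roughly 6.7 to 8.3 ms per 1000 fish from 5000 fish upward, and Chrome at roughly 1.4 to 1.7 ms per 1000 fish from 10000 fish upward. If the reported numbers are representative, that looks more like a steady per-fish cost in both browsers, with a larger constant in Firefox, than like a sudden cliff.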
Comment 28•4 years ago
(In reply to Leo_sk from comment #26)
Setting widget.dmabuf-webgl.enabled to false actually reduces the frame rate by about 16-20%.
In that case the dmabuf backend is working correctly and the cause must be something else.
What does the performance look like when you run Firefox without MOZ_X11_EGL=1 set, i.e. with the GLX backend?
Comment 29•4 years ago (Reporter)
Sorry for the delay. In the latest nightly (89.0a1 (2021-03-24) (64-bit)), about:troubleshoot shows X11_EGL as 'blocklisted by env: Blocklisted by gfxInfo'. It shows the same with MOZ_X11_EGL set to 0 or 1. Does that mean the GLX backend is used in both cases?
Comment 30•3 years ago
I just hit an interesting blog post concerning the Zink driver, which is hitting issues with exactly this demo as well: https://www.supergoodcode.com/underwater/
Quote:
it’s one of the only test cases for GL_EXT_multisampled_render_to_texture I’m aware of, at least when running Chrome in EGL mode.
This might help explain why this demo is so notoriously slow on Firefox compared to Chrome.
Comment 31•3 years ago
The bug has a release status flag that shows some version of Firefox is affected, thus it will be considered confirmed.