DrawTargetWebgl interleaves glBufferSubData calls with draw calls
Categories
(Core :: Graphics: CanvasWebGL, defect)
Tracking
()
Tracking | Status | |
---|---|---|
firefox113 | --- | fixed |
People
(Reporter: jnicol, Assigned: jnicol)
References
(Blocks 2 open bugs)
Details
(Whiteboard: [sp3])
Attachments
(1 file, 1 obsolete file)
Speedometer 3's React-Stockcharts is slow on Android with accelerated canvas enabled.
In this profile we see the CanvasRenderThread spends a lot of time in glBufferSubData
.
In DrawTargetWebl::SharedContext::DrawPathAccel()
we generate vertices for a path, push them to the back of a VBO, then issue a draw call. Once the VBO is full we orphan it and allocate a new one. In practice this means that to render many strokes, many glBufferSubData()
calls are interleaved with draw calls. This is bad practice on Mali, as the driver is not clever enough to realize our buffer updates do not overlap with the draw call inputs, and therefore has to create resource ghosts. This might explain the poor performance.
Comment 1•2 years ago
•
|
||
Seems like a far bigger issue in that profile is the constant memcpy and RawTexImage calls, with glBufferSubData's contribution maybe being somewhat questionable by comparison? For some reason we're doing a lot of regenerating the clip mask, which ideally we reeeeally don't want to happen a lot. Maybe that aspect is fixable and can make performance here tolerable without any extensive overhaul of the VBO setup otherwise.
Also the biggest overhead here is still SendGetFrontBuffer waiting for all this GPU work to complete, which means we're still blocked partially on Sotaro's work on fixing up async-present to work around that overhead.
Updated•2 years ago
|
Updated•2 years ago
|
Comment 2•2 years ago
|
||
We have bug 1819996 about the stroking slow path.
Comment 3•2 years ago
|
||
Removing bug 1804233 from the dependency list again - it's already tracked under speedometer3 for this benchmark, and this bug is about a different narrow aspect of the benchmark.
Assignee | ||
Comment 4•2 years ago
|
||
(In reply to Lee Salzman [:lsalzman] from comment #1)
Seems like a far bigger issue in that profile is the constant memcpy and RawTexImage calls, with glBufferSubData's contribution maybe being somewhat questionable by comparison? For some reason we're doing a lot of regenerating the clip mask, which ideally we reeeeally don't want to happen a lot. Maybe that aspect is fixable and can make performance here tolerable without any extensive overhaul of the VBO setup otherwise.
Also the biggest overhead here is still SendGetFrontBuffer waiting for all this GPU work to complete, which means we're still blocked partially on Sotaro's work on fixing up async-present to work around that overhead.
I'm not quite up to speed on async present yet, but was aware Sotaro was working on something related to it. My assumption was that will allow us to avoid having to wait for Msg_GetFrontBuffer in the content process. But presumably we'll still need to wait on the GPU in order to composite the canvas, and therefore finding the underlying cause of, and reducing, the GPU time is still imperative?
I assume you're refering to the content process' main thread in the profile here? I was looking at the GPU process' CanvasRenderThread, where glBufferSubData accounts for 50% of the CPU time, and I suspect the resource ghosting that causes is a major factor in the GPU time.
Comment 5•2 years ago
|
||
(In reply to Lee Salzman [:lsalzman] from comment #1)
For some reason we're doing a lot of regenerating the clip mask
Oh, I misread this part. I thought this was bug 1819996, but it is something different. The stroking is accelerated, but the clipping is not.
Comment 6•2 years ago
|
||
We also don't seem to have the complex clip showing up on desktop. My guess is that the normally rectangular pixel aligned clip is ending up not that way on Android for some reason.
Comment 7•2 years ago
•
|
||
I can confirm that this seems to have to do with window.devicePixelRatio. On macOS, I can see DrawTargetWebgl::GenerateComplexClipMask
in the profiles if I zoom out one step before starting the benchmark runner.
Updated•2 years ago
|
Updated•2 years ago
|
Assignee | ||
Comment 8•2 years ago
|
||
Great we have multiple avenues for improving this test on Android. Shall we file a separate bug about the clip mask issue?
On the VBO front, I made a quick prototype: Use one fixed VAO/VBO for the rect data and separate one for the paths, orphaning the path VBO for each individual upload and allocating only the required size for that draw. Obviously this is still far from an optimal setup, in fact on non-Mali it's probably worse than what we're currently doing, but from my local testing on Mali this sees a considerable improvement. I'll try to get some proper numbers from try.
Here's a profile. glBufferSubData is down to around 4% of the CanvasRenderThread time.
Comment 10•2 years ago
|
||
This adds interpolants to the AA distance calculation to handle the AA'ing of
the clip rect.
Updated•2 years ago
|
Assignee | ||
Comment 11•2 years ago
|
||
Can we file a separate bug for that please? Things get confusing down the line when there are multiple patches on the same bug
Comment 12•2 years ago
|
||
Comment on attachment 9326947 [details]
Bug 1826134 - Fast-path non-aligned clip rects in DrawTargetWebgl. r?jrmuizel
Revision D174650 was moved to bug 1826420. Setting attachment 9326947 [details] to obsolete.
Assignee | ||
Comment 13•2 years ago
|
||
DrawTargetWebgl renders a path by uploading vertex data to the back of
a large VBO using glBufferSubData then issuing a draw call, orphaning
the buffer when it becomes full. This results in many glBufferSubData
calls being interleaved with draw calls. On Mali GPUs this causes
severe performance issues as the driver is unable to determine that
any pending draw calls do not reference the updated region of the
buffer, and therefore must create a copy of the buffer for each
update.
However, since we know that we never overwrite a region that is
referenced by a submitted draw call, we can force the driver to avoid
making these copies. We do so by adding a new function
UnsynchronizedBufferSubData(), which acts like BufferSubData so long
as this rule is followed. Internally, this uses glMapBufferRange with
GL_MAP_UNSYNCHRONIZED_BIT, allowing the driver to omit the extraneous
copies.
Comment 14•2 years ago
|
||
Comment 15•2 years ago
|
||
bugherder |
Updated•2 years ago
|
Comment 16•2 years ago
|
||
(In reply to Pulsebot from comment #14)
Pushed by jnicol@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d043e462e390
Avoid interleaving glBufferSubData calls with draw calls in DrawTargetWebgl.
r=gfx-reviewers,jgilbert,lsalzman
== Change summary for alert #38049 (as of Wed, 12 Apr 2023 05:30:10 GMT) ==
Improvements:
Ratio | Test | Platform | Options | Absolute values (old vs new) |
---|---|---|---|---|
16% | speedometer3 | android-hw-a51-11-0-aarch64-shippable-qr | webrender | 34.13 -> 39.46 |
15% | speedometer3 | android-hw-a51-11-0-aarch64-shippable-qr | webrender | 34.15 -> 39.42 |
For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=38049
Description
•