Closed Bug 1826134 Opened 2 years ago Closed 2 years ago

DrawTargetWebgl interleaves glBufferSubData calls with draw calls

Categories

(Core :: Graphics: CanvasWebGL, defect)

defect

Tracking

()

RESOLVED FIXED
113 Branch
Tracking Status
firefox113 --- fixed

People

(Reporter: jnicol, Assigned: jnicol)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [sp3])

Attachments

(1 file, 1 obsolete file)

Speedometer 3's React-Stockcharts is slow on Android with accelerated canvas enabled.

In this profile we see the CanvasRenderThread spends a lot of time in glBufferSubData.

In DrawTargetWebl::SharedContext::DrawPathAccel() we generate vertices for a path, push them to the back of a VBO, then issue a draw call. Once the VBO is full we orphan it and allocate a new one. In practice this means that to render many strokes, many glBufferSubData() calls are interleaved with draw calls. This is bad practice on Mali, as the driver is not clever enough to realize our buffer updates do not overlap with the draw call inputs, and therefore has to create resource ghosts. This might explain the poor performance.

Seems like a far bigger issue in that profile is the constant memcpy and RawTexImage calls, with glBufferSubData's contribution maybe being somewhat questionable by comparison? For some reason we're doing a lot of regenerating the clip mask, which ideally we reeeeally don't want to happen a lot. Maybe that aspect is fixable and can make performance here tolerable without any extensive overhaul of the VBO setup otherwise.

Also the biggest overhead here is still SendGetFrontBuffer waiting for all this GPU work to complete, which means we're still blocked partially on Sotaro's work on fixing up async-present to work around that overhead.

Flags: needinfo?(lsalzman)
Flags: needinfo?(jmuizelaar)
Blocks: gpu-canvas
Severity: -- → S3
Depends on: 1804233
Flags: needinfo?(jmuizelaar)

We have bug 1819996 about the stroking slow path.

Removing bug 1804233 from the dependency list again - it's already tracked under speedometer3 for this benchmark, and this bug is about a different narrow aspect of the benchmark.

No longer depends on: 1804233

(In reply to Lee Salzman [:lsalzman] from comment #1)

Seems like a far bigger issue in that profile is the constant memcpy and RawTexImage calls, with glBufferSubData's contribution maybe being somewhat questionable by comparison? For some reason we're doing a lot of regenerating the clip mask, which ideally we reeeeally don't want to happen a lot. Maybe that aspect is fixable and can make performance here tolerable without any extensive overhaul of the VBO setup otherwise.

Also the biggest overhead here is still SendGetFrontBuffer waiting for all this GPU work to complete, which means we're still blocked partially on Sotaro's work on fixing up async-present to work around that overhead.

I'm not quite up to speed on async present yet, but was aware Sotaro was working on something related to it. My assumption was that will allow us to avoid having to wait for Msg_GetFrontBuffer in the content process. But presumably we'll still need to wait on the GPU in order to composite the canvas, and therefore finding the underlying cause of, and reducing, the GPU time is still imperative?

I assume you're refering to the content process' main thread in the profile here? I was looking at the GPU process' CanvasRenderThread, where glBufferSubData accounts for 50% of the CPU time, and I suspect the resource ghosting that causes is a major factor in the GPU time.

(In reply to Lee Salzman [:lsalzman] from comment #1)

For some reason we're doing a lot of regenerating the clip mask

Oh, I misread this part. I thought this was bug 1819996, but it is something different. The stroking is accelerated, but the clipping is not.

We also don't seem to have the complex clip showing up on desktop. My guess is that the normally rectangular pixel aligned clip is ending up not that way on Android for some reason.

I can confirm that this seems to have to do with window.devicePixelRatio. On macOS, I can see DrawTargetWebgl::GenerateComplexClipMask in the profiles if I zoom out one step before starting the benchmark runner.

Whiteboard: [sp3]

Great we have multiple avenues for improving this test on Android. Shall we file a separate bug about the clip mask issue?

On the VBO front, I made a quick prototype: Use one fixed VAO/VBO for the rect data and separate one for the paths, orphaning the path VBO for each individual upload and allocating only the required size for that draw. Obviously this is still far from an optimal setup, in fact on non-Mali it's probably worse than what we're currently doing, but from my local testing on Mali this sees a considerable improvement. I'll try to get some proper numbers from try.

Here's a profile. glBufferSubData is down to around 4% of the CanvasRenderThread time.

And here are the results from try

This adds interpolants to the AA distance calculation to handle the AA'ing of
the clip rect.

Assignee: nobody → lsalzman
Status: NEW → ASSIGNED

Can we file a separate bug for that please? Things get confusing down the line when there are multiple patches on the same bug

Assignee: lsalzman → jnicol
Depends on: 1826420

Comment on attachment 9326947 [details]
Bug 1826134 - Fast-path non-aligned clip rects in DrawTargetWebgl. r?jrmuizel

Revision D174650 was moved to bug 1826420. Setting attachment 9326947 [details] to obsolete.

Attachment #9326947 - Attachment is obsolete: true

DrawTargetWebgl renders a path by uploading vertex data to the back of
a large VBO using glBufferSubData then issuing a draw call, orphaning
the buffer when it becomes full. This results in many glBufferSubData
calls being interleaved with draw calls. On Mali GPUs this causes
severe performance issues as the driver is unable to determine that
any pending draw calls do not reference the updated region of the
buffer, and therefore must create a copy of the buffer for each
update.

However, since we know that we never overwrite a region that is
referenced by a submitted draw call, we can force the driver to avoid
making these copies. We do so by adding a new function
UnsynchronizedBufferSubData(), which acts like BufferSubData so long
as this rule is followed. Internally, this uses glMapBufferRange with
GL_MAP_UNSYNCHRONIZED_BIT, allowing the driver to omit the extraneous
copies.

No longer depends on: 1826420
Pushed by jnicol@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/d043e462e390 Avoid interleaving glBufferSubData calls with draw calls in DrawTargetWebgl. r=gfx-reviewers,jgilbert,lsalzman
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 113 Branch
Regressions: 1827047
Regressions: 1827050
Flags: needinfo?(lsalzman)
No longer regressions: 1827050

(In reply to Pulsebot from comment #14)

Pushed by jnicol@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d043e462e390
Avoid interleaving glBufferSubData calls with draw calls in DrawTargetWebgl.
r=gfx-reviewers,jgilbert,lsalzman

== Change summary for alert #38049 (as of Wed, 12 Apr 2023 05:30:10 GMT) ==

Improvements:

Ratio Test Platform Options Absolute values (old vs new)
16% speedometer3 android-hw-a51-11-0-aarch64-shippable-qr webrender 34.13 -> 39.46
15% speedometer3 android-hw-a51-11-0-aarch64-shippable-qr webrender 34.15 -> 39.42

For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=38049

Regressions: 1827591
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: