Closed Bug 1826134 Opened 2 years ago Closed 2 years ago

DrawTargetWebgl interleaves glBufferSubData calls with draw calls

Tracking

Status:

RESOLVED FIXED

Milestone:

113 Branch

Tracking Flags:

firefox113: tracking ---, status fixed

People

(Reporter: jnicol, Assigned: jnicol)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [sp3])

Attachments

(1 file, 1 obsolete file)

Bug 1826134 - Fast-path non-aligned clip rects in DrawTargetWebgl. r?jrmuizel 2 years ago Lee Salzman [:lsalzman] 48 bytes, text/x-phabricator-request
Bug 1826134 - Avoid interleaving glBufferSubData calls with draw calls in DrawTargetWebgl. r?#gfx-reviewers 2 years ago Jamie Nicol [:jnicol] 48 bytes, text/x-phabricator-request

Jamie Nicol [:jnicol]

Assignee

Description

•

2 years ago

Speedometer 3's React-Stockcharts is slow on Android with accelerated canvas enabled.

In this profile we see the CanvasRenderThread spends a lot of time in glBufferSubData.

In DrawTargetWebgl::SharedContext::DrawPathAccel() we generate vertices for a path, push them to the back of a VBO, then issue a draw call. Once the VBO is full we orphan it and allocate a new one. In practice this means that to render many strokes, many glBufferSubData() calls are interleaved with draw calls. This is bad practice on Mali, as the driver is not clever enough to realize our buffer updates do not overlap with the draw call inputs, and therefore has to create resource ghosts. This might explain the poor performance.
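
The append-then-draw pattern described above can be modelled on the CPU side roughly as follows. This is a minimal sketch, not the DrawTargetWebgl code: `AppendOnlyVbo`, `Append`, and the two counters are hypothetical names, no GL calls are made, and the comments only mark where the real glBufferSubData and draw calls would sit.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical CPU-side model of a VBO that is filled from front to back.
// No GL calls are made; comments show where they would go.
struct AppendOnlyVbo {
  std::size_t capacity;         // size of the buffer storage in bytes
  std::size_t writeOffset = 0;  // next free byte
  int orphanCount = 0;          // how many times the storage was replaced
  int uploadCount = 0;          // how many glBufferSubData-style uploads

  explicit AppendOnlyVbo(std::size_t aCapacity) : capacity(aCapacity) {}

  // Reserve `bytes` at the back of the buffer and return the offset the
  // vertex data would be uploaded to. Orphans the buffer when it is full.
  std::size_t Append(std::size_t bytes) {
    assert(bytes <= capacity);
    if (writeOffset + bytes > capacity) {
      // Real code would orphan here, e.g. glBufferData with a null pointer.
      ++orphanCount;
      writeOffset = 0;
    }
    const std::size_t offset = writeOffset;
    // Real code: glBufferSubData(target, offset, bytes, vertices),
    // followed immediately by a draw that reads [offset, offset + bytes).
    ++uploadCount;
    writeOffset += bytes;
    return offset;
  }
};
```

In this model every path costs one upload and one draw, so N strokes produce N uploads interleaved with N draws. Each upload lands in a range that no earlier draw reads, which is the property the driver fails to notice.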

Lee Salzman [:lsalzman]

Comment 1

•

2 years ago

•

Edited

Seems like a far bigger issue in that profile is the constant memcpy and RawTexImage calls, with glBufferSubData's contribution maybe being somewhat questionable by comparison? For some reason we're doing a lot of regenerating the clip mask, which ideally we reeeeally don't want to happen a lot. Maybe that aspect is fixable and can make performance here tolerable without any extensive overhaul of the VBO setup otherwise.

Also the biggest overhead here is still SendGetFrontBuffer waiting for all this GPU work to complete, which means we're still blocked partially on Sotaro's work on fixing up async-present to work around that overhead.

Flags: needinfo?(lsalzman)

Flags: needinfo?(jmuizelaar)

Lee Salzman [:lsalzman]

Updated

•

2 years ago

Blocks: gpu-canvas

Severity: -- → S3

Depends on: 1804233

Lee Salzman [:lsalzman]

Updated

•

2 years ago

Flags: needinfo?(jmuizelaar)

Markus Stange [:mstange]

Comment 2

•

2 years ago

We have bug 1819996 about the stroking slow path.

Markus Stange [:mstange]

Comment 3

•

2 years ago

Removing bug 1804233 from the dependency list again - it's already tracked under speedometer3 for this benchmark, and this bug is about a different narrow aspect of the benchmark.

No longer depends on: 1804233

Jamie Nicol [:jnicol]

Assignee

Comment 4

•

2 years ago

(In reply to Lee Salzman [:lsalzman] from comment #1)

> Seems like a far bigger issue in that profile is the constant memcpy and RawTexImage calls, with glBufferSubData's contribution maybe being somewhat questionable by comparison? For some reason we're doing a lot of regenerating the clip mask, which ideally we reeeeally don't want to happen a lot. Maybe that aspect is fixable and can make performance here tolerable without any extensive overhaul of the VBO setup otherwise.
>
> Also the biggest overhead here is still SendGetFrontBuffer waiting for all this GPU work to complete, which means we're still blocked partially on Sotaro's work on fixing up async-present to work around that overhead.

I'm not quite up to speed on async present yet, but was aware Sotaro was working on something related to it. My assumption was that will allow us to avoid having to wait for Msg_GetFrontBuffer in the content process. But presumably we'll still need to wait on the GPU in order to composite the canvas, and therefore finding the underlying cause of, and reducing, the GPU time is still imperative?

I assume you're referring to the content process' main thread in the profile here? I was looking at the GPU process' CanvasRenderThread, where glBufferSubData accounts for 50% of the CPU time, and I suspect the resource ghosting that causes is a major factor in the GPU time.

Markus Stange [:mstange]

Comment 5

•

2 years ago

(In reply to Lee Salzman [:lsalzman] from comment #1)

> For some reason we're doing a lot of regenerating the clip mask

Oh, I misread this part. I thought this was bug 1819996, but it is something different. The stroking is accelerated, but the clipping is not.

Jeff Muizelaar [:jrmuizel]

Comment 6

•

2 years ago

We also don't seem to have the complex clip showing up on desktop. My guess is that the normally rectangular pixel aligned clip is ending up not that way on Android for some reason.

Markus Stange [:mstange]

Comment 7

•

2 years ago

•

Edited

I can confirm that this seems to have to do with window.devicePixelRatio. On macOS, I can see DrawTargetWebgl::GenerateComplexClipMask in the profiles if I zoom out one step before starting the benchmark runner.
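
To illustrate why a scale factor can push a rectangular clip off the aligned path, here is a small sketch of a pixel-alignment test. It follows the guess in comment 6 that the slow path is taken when the clip is not pixel aligned. The `Rect` type, the function names, the tolerance, and the 0.9 scale are all illustrative assumptions, not values taken from Gecko or from the actual zoom step.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical rectangle in CSS pixels.
struct Rect {
  double x, y, width, height;
};

// True if v is within a small tolerance of a whole number.
bool IsNearInteger(double v) {
  return std::fabs(v - std::round(v)) < 1e-4;
}

// True if all four edges of the rect land on device-pixel boundaries
// once the given scale (for example devicePixelRatio) is applied.
bool IsPixelAligned(const Rect& r, double scale) {
  return IsNearInteger(r.x * scale) &&
         IsNearInteger(r.y * scale) &&
         IsNearInteger((r.x + r.width) * scale) &&
         IsNearInteger((r.y + r.height) * scale);
}
```

A 15x15 clip at the origin is aligned at scales 1.0 and 2.0, but at an illustrative scale of 0.9 its far edges fall at 13.5 device pixels, so a check like this would reject it.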

Patricia Lawless

Updated

•

2 years ago

Whiteboard: [sp3]

Jira Integration Bot

Updated

•

2 years ago

See Also: → https://mozilla-hub.atlassian.net/browse/SP3-308

Jamie Nicol [:jnicol]

Assignee

Comment 8

•

2 years ago

Great, we have multiple avenues for improving this test on Android. Shall we file a separate bug about the clip mask issue?

On the VBO front, I made a quick prototype: Use one fixed VAO/VBO for the rect data and a separate one for the paths, orphaning the path VBO for each individual upload and allocating only the required size for that draw. Obviously this is still far from an optimal setup, in fact on non-Mali it's probably worse than what we're currently doing, but from my local testing on Mali this sees a considerable improvement. I'll try to get some proper numbers from try.

Here's a profile. glBufferSubData is down to around 4% of the CanvasRenderThread time.

Jamie Nicol [:jnicol]

Assignee

Comment 9

•

2 years ago

And here are the results from try

Lee Salzman [:lsalzman]

Comment 10

•

2 years ago

Attached file Bug 1826134 - Fast-path non-aligned clip rects in DrawTargetWebgl. r?jrmuizel (obsolete)

This adds interpolants to the AA distance calculation to handle the AA'ing of
the clip rect.

Phabricator Automation

Updated

•

2 years ago

Assignee: nobody → lsalzman

Status: NEW → ASSIGNED

Jamie Nicol [:jnicol]

Assignee

Comment 11

•

2 years ago

Can we file a separate bug for that please? Things get confusing down the line when there are multiple patches on the same bug.

Assignee: lsalzman → jnicol

Lee Salzman [:lsalzman]

Updated

•

2 years ago

Depends on: 1826420

Phabricator Automation

Comment 12

•

2 years ago

Comment on attachment 9326947
Bug 1826134 - Fast-path non-aligned clip rects in DrawTargetWebgl. r?jrmuizel

Revision D174650 was moved to bug 1826420. Setting attachment 9326947 to obsolete.

Attachment #9326947 - Attachment is obsolete: true

Jamie Nicol [:jnicol]

Assignee

Comment 13

•

2 years ago

Attached file Bug 1826134 - Avoid interleaving glBufferSubData calls with draw calls in DrawTargetWebgl. r?#gfx-reviewers

DrawTargetWebgl renders a path by uploading vertex data to the back of
a large VBO using glBufferSubData then issuing a draw call, orphaning
the buffer when it becomes full. This results in many glBufferSubData
calls being interleaved with draw calls. On Mali GPUs this causes
severe performance issues as the driver is unable to determine that
any pending draw calls do not reference the updated region of the
buffer, and therefore must create a copy of the buffer for each
update.

However, since we know that we never overwrite a region that is
referenced by a submitted draw call, we can force the driver to avoid
making these copies. We do so by adding a new function
UnsynchronizedBufferSubData(), which acts like BufferSubData so long
as this rule is followed. Internally, this uses glMapBufferRange with
GL_MAP_UNSYNCHRONIZED_BIT, allowing the driver to omit the extraneous
copies.
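
The mapping approach described in this commit message could look roughly like the sketch below. The thread does not show the real UnsynchronizedBufferSubData() signature, so `UnsynchronizedUpload` and the `FakeGl` stand-in are hypothetical names used only for illustration. `FakeGl` mimics the shape of glMapBufferRange and glUnmapBuffer on host memory so the example runs without a GPU. The constants carry the standard OpenGL enum values, defined locally so that no GL headers are needed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Standard OpenGL enum values, defined locally for this sketch.
constexpr std::uint32_t kArrayBuffer = 0x8892;           // GL_ARRAY_BUFFER
constexpr std::uint32_t kMapWriteBit = 0x0002;           // GL_MAP_WRITE_BIT
constexpr std::uint32_t kMapUnsynchronizedBit = 0x0020;  // GL_MAP_UNSYNCHRONIZED_BIT

// Hypothetical stand-in for a GL context, backed by host memory.
struct FakeGl {
  std::vector<unsigned char> storage;
  std::uint32_t lastAccess = 0;
  bool mapped = false;

  explicit FakeGl(std::size_t size) : storage(size) {}

  // Mirrors the shape of glMapBufferRange: returns null on a bad range.
  void* MapBufferRange(std::uint32_t /*target*/, std::size_t offset,
                       std::size_t length, std::uint32_t access) {
    if (mapped || offset > storage.size() ||
        length > storage.size() - offset) {
      return nullptr;
    }
    mapped = true;
    lastAccess = access;
    return storage.data() + offset;
  }

  // Mirrors the shape of glUnmapBuffer.
  bool UnmapBuffer(std::uint32_t /*target*/) {
    const bool wasMapped = mapped;
    mapped = false;
    return wasMapped;
  }
};

// Copy `length` bytes to `offset` without asking the driver to synchronize
// with pending draws. The caller must guarantee that no submitted draw call
// reads the range [offset, offset + length).
template <typename Gl>
bool UnsynchronizedUpload(Gl& gl, std::size_t offset, const void* data,
                          std::size_t length) {
  void* dst = gl.MapBufferRange(kArrayBuffer, offset, length,
                                kMapWriteBit | kMapUnsynchronizedBit);
  if (!dst) {
    return false;
  }
  std::memcpy(dst, data, length);
  return gl.UnmapBuffer(kArrayBuffer);
}
```

The safety of this pattern rests entirely on the caller: the unsynchronized flag removes the driver's own hazard tracking, so the append-only discipline described above is what keeps the writes correct.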

Jamie Nicol [:jnicol]

Assignee

Updated

•

2 years ago

No longer depends on: 1826420

Pulsebot

Comment 14

•

2 years ago

Pushed by jnicol@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/d043e462e390 Avoid interleaving glBufferSubData calls with draw calls in DrawTargetWebgl. r=gfx-reviewers,jgilbert,lsalzman

Atila Butkovits

Comment 15

•

2 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/d043e462e390

Status: ASSIGNED → RESOLVED

Closed: 2 years ago

status-firefox113: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 113 Branch

Mayank Bansal

Updated

•

2 years ago

Regressions: 1827047

Mayank Bansal

Updated

•

2 years ago

Regressions: 1827050

Lee Salzman [:lsalzman]

Updated

•

2 years ago

Flags: needinfo?(lsalzman)

Sotaro Ikeda [:sotaro]

Updated

•

2 years ago

No longer regressions: 1827050

Acasandrei Beatrice (needinfo me)

Comment 16

•

2 years ago

(In reply to Pulsebot from comment #14)

> Pushed by jnicol@mozilla.com:
> https://hg.mozilla.org/integration/autoland/rev/d043e462e390
> Avoid interleaving glBufferSubData calls with draw calls in DrawTargetWebgl.
> r=gfx-reviewers,jgilbert,lsalzman

== Change summary for alert #38049 (as of Wed, 12 Apr 2023 05:30:10 GMT) ==

Improvements:

Ratio	Test	Platform	Options	Absolute values (old vs new)
16%	speedometer3	android-hw-a51-11-0-aarch64-shippable-qr	webrender	34.13 -> 39.46
15%	speedometer3	android-hw-a51-11-0-aarch64-shippable-qr	webrender	34.15 -> 39.42

For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=38049

Jamie Nicol [:jnicol]

Assignee

Updated

•

2 years ago

Regressions: 1827591
