ClientStorage could maybe work better on Intel GPUs

RESOLVED FIXED in Firefox 67

Status

()

enhancement
P3
normal
RESOLVED FIXED
11 months ago
2 months ago

People

(Reporter: jrmuizel, Assigned: mstange)

Tracking

(Depends on 1 bug, Blocks 1 bug)

63 Branch
mozilla68
Unspecified
macOS
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox63 wontfix, firefox66 wontfix, firefox67 fixed, firefox68 fixed)

Details

(Whiteboard: [gfx-noted])

Attachments

(2 attachments)

Reporter

Description

11 months ago
It looks like client storage is working okish on my Intel Mac, but there are still some pretty big chunks of compositor time that happen:

https://upload.wikimedia.org/wikipedia/commons/c/c0/Big_Buck_Bunny_4K.webm playback: https://perfht.ml/2NN0NBN

In this profile things are working great except for period of slow composites.

gsmarena: https://perfht.ml/2NSUlJE
Here we're nearly always slow and we sometimes spend appreciable time in memmove.

We might be running into the limits of the hardware but I think some more investigation is warranted. If it is a memory bandwidth limit, bug 1191965 will hopefully help with that.
Reporter

Updated

11 months ago
Blocks: 1400787
Are those memmoves just from scrolling on gsmarena.com? I can't get those to happen on intel hardware locally with the build on autoland right now.
Just tried using GL_LOCAL_STORAGE_SHARED_APPLE instead of GL_LOCAL_STORAGE_CACHED_APPLE and it significantly improves things on intel hardware. I'll take a look at what we can do with this later. The problem was it was regressing some things on the discrete GPU.
(This is regarding the Big Buck Bunny video. I don't know if it helps for the gsmarena.com memmoves because I can't repro those.)
Reporter

Comment 4

11 months ago
Yeah. It was the gsmareana.com profile is just scrolling on that site. Here's another profile: https://perfht.ml/2LYKhhl.

Using GL_LOCAL_STORAGE_SHARED_APPLE is interesting. In theory that causes the GPU to do texture fetches directly from the bufffer instead of doing a copy into separate GPU memory. In makes sense that GL_LOCAL_STORAGE_SHARED_APPLE would be slower on discrete GPUs. Though I'm actually kind surprised it still works.

I don't think I've ever seen an Apple software use it. Definitely worth investigating. We could always choose the type depending on the GPU.
Hmm, I think I was a bit hasty in this. I'm not sure GL_STORAGE_SHARED_APPLE is any better than GL_STORAGE_CACHED_APPLE here. The difference I was seeing must have been a fluke or an error of measurement. Looking now it might be better, but the profiles are noisy enough and the difference is small enough that it's hard to tell.

Trying to think up more ideas for this. One thing is that in all examples I can observe, glTextureRangeAPPLE is supposed to be used with a large buffer containing multiple textures stored contiguously, whereas for us we call it once per texture. We'd have to jump through a bunch of hoops to actually get textures into a big contiguous buffer, and it's hard to conduct a great test without doing this extra work. I might play with the sample app for client storage though and see what kind of impact allocating all the textures as separate buffers has.

Other than that I don't know that I have any good ideas for this.
So the GL_STORAGE_SHARED_APPLE thing wasn't imagined after all, thought it's not as dramatic as I first thought. With the test app[1] (I blew up the texture size in order to create a performance problem), switching to GL_STORAGE_SHARED_APPLE yielded something like a 20% performance boost. glTextureRangeAPPLE seems to make no difference on my hardware (neither the integrated nor the discrete GPU.)

So it seems looking into using GL_STORAGE_SHARED_APPLE based on the user's hardware is probably worth it, but we'll have to sort out what the trade-off is.

[1] https://github.com/jrmuizel/client-storage
OS: Unspecified → Mac OS X
Priority: -- → P3
Whiteboard: [gfx-noted]
Assignee

Updated

3 months ago
Duplicate of this bug: 1539680
Assignee

Updated

3 months ago
Summary: Client storage could maybe work better on Intel GPUs → ClientStorage could maybe work better on Intel GPUs
Assignee

Comment 8

3 months ago

If I switch to GL_STORAGE_SHARED_APPLE, FPS improves on my Intel GPU (less time spent blocking on texture deletion, it seems), but I still see the copies. Do you have a patch that adds texture range?

Flags: needinfo?(dothayer)

(In reply to Markus Stange [:mstange] from comment #8)

If I switch to GL_STORAGE_SHARED_APPLE, FPS improves on my Intel GPU (less time spent blocking on texture deletion, it seems), but I still see the copies. Do you have a patch that adds texture range?

Well, we actually do use glTextureRangeAPPLE, but only at a single texture level. Based on Apple's documentation I think it's intended to be used for a contiguous buffer encompassing multiple textures, which would take quite a chunk of work for us to implement.

I don't have the sample code that I was using to test this in isolation, but I think you can play with it in Jeff's test app.

Flags: needinfo?(dothayer)
Assignee

Comment 10

3 months ago

So I compared our textures with the ones in Jeff's new test app (this one uses rust) and found no difference in the texture formats. But there was one difference that Jeff pointed out to me: The textures in Firefox sometimes have widths that are not a multiple of 8 pixels. And that seems to make all the difference!
In the example app I verified that it's enough to have an 32-byte aligned stride (GL_UNPACK_ROW_LENGTH multiple of 8). The width doesn't necessarily have to be aligned.

Changing the 4 in ComputeRGBStride to a 32 eradicates the memmov calls from my profiles on the integrated GPU. It doesn't necessarily increase the frame rate, but it should reduce CPU usage at least.

I haven't found any evidence that glTextureRangeAPPLE makes any difference.

Assignee

Updated

3 months ago
Assignee: nobody → mstange
Status: NEW → ASSIGNED

Reduced Firefox CPU usage would really be appreciated, at least on the MacBook Pro 13" (2015) with Intel Graphics.
Is this something that can be expected to be ported back to the Beta channel?

Assignee

Comment 12

3 months ago

In particular, it looks like this alignment is required by the Intel driver on
macOS if you want to avoid CPU copies.

It was already known that the efficiency gains from ClientStorage only
materialize if you follow certain restrictions:

  • The textures need to use the TEXTURE_RECTANGLE_ARB texture target.
  • The textures' format, internalFormat and type need to be chosen from a small
    list of supported configurations. Unsupported configurations will trigger
    format conversions on the CPU.
  • The GL_TEXTURE_STORAGE_HINT_APPLE may need to be set to shared or cached.
  • glTextureRangeAPPLE may or may not make a difference.

It now appears that the stride alignment is another requirement:

When uploading textures which otherwise comply with the above requirements, the
Intel driver will still make copies using the CPU if the texture's stride is not
32-byte aligned. These CPU copies are reflected in a high CPU usage (as observed
in Activity Monitor) and they show up in profiles as time spent inside
_platform_memmove under glrUpdateTexture.

However, when uploading 32-byte stride aligned textures which comply with the
above requirements, this CPU usage goes away. There might still be hardware
copies behind the scenes, but they no longer take up CPU time.

Assignee

Comment 14

3 months ago

I would have liked to add an assertion about aligned strides in DirectMapTextureSource::UpdateInternal, which is the place where we actually care about it, but it seems tricky to require stride alignment from all texture origins. Especially video decoders usually make their own choice about strides.

Assignee

Comment 15

3 months ago

(In reply to Stefan Fleiter (:sfleiter) from comment #11)

Reduced Firefox CPU usage would really be appreciated, at least on the MacBook Pro 13" (2015) with Intel Graphics.
Is this something that can be expected to be ported back to the Beta channel?

I wouldn't expect too much from this change - most of the textures we upload are already aligned and already benefit from client storage. If you see consistently high CPU usage, it usually comes from other places, e.g. from web pages executing JavaScript code.

That said, this patch is really simple and does help in certain cases. We should definitely consider uplifting it to the Beta channel.

Assignee

Comment 16

3 months ago

So on https://jrmuizel.github.io/webrender/facebook-refresh.html (warning: flashing background), at a window size of 1084x638 (2168x1276 device pixels), I get fluctuating FPS numbers between 32 and 40 FPS both with and without this patch. But without this patch, the parent process CPU usage is at 53% and with the patch it's at 27%.
Profiles: without patch and with patch

(In reply to Markus Stange [:mstange] from comment #14)

I would have liked to add an assertion about aligned strides in DirectMapTextureSource::UpdateInternal, which is the place where we actually care about it, but it seems tricky to require stride alignment from all texture origins. Especially video decoders usually make their own choice about strides.

Maybe an NS_WARNING? For video at least, we almost always have to copy at least once (from memory owned by the decoder, to shmem that we can share with the compositor), so we have the option to adjust the strides for free-ish at that point.

Comment 18

3 months ago
Pushed by mstange@themasta.com:
https://hg.mozilla.org/integration/autoland/rev/b64006992c1c
Give RGB textures a 32-byte aligned stride in order to improve texture upload efficiency on certain drivers. r=mattwoodrow
Assignee

Comment 19

3 months ago

(In reply to Matt Woodrow (:mattwoodrow) from comment #17)

(In reply to Markus Stange [:mstange] from comment #14)

I would have liked to add an assertion about aligned strides in DirectMapTextureSource::UpdateInternal, which is the place where we actually care about it, but it seems tricky to require stride alignment from all texture origins. Especially video decoders usually make their own choice about strides.

Maybe an NS_WARNING? For video at least, we almost always have to copy at least once (from memory owned by the decoder, to shmem that we can share with the compositor), so we have the option to adjust the strides for free-ish at that point.

I see, that seems reasonable then. But I don't want to add the warning without fixing at least some occurrences first (or at least checking whether there are any, during regular browsing). I'll file a bug about adding such a warning.
On the other hand, most videos probably have aligned widths anyway. Maybe we shouldn't spend time on this until we actually see memmoves show up in profiles of video playback.

Assignee

Comment 20

3 months ago

I've filed bug 1540013 on this.

Comment 21

3 months ago
Backout by rgurzau@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0898fe281bfe
Backed out changeset b64006992c1c for failing at /client/CanvasClient.cpp on a CLOSED TREE.

Backed out changeset b64006992c1c (bug 1479145) for failing at /client/CanvasClient.cpp on a CLOSED TREE.

Backout link: https://hg.mozilla.org/integration/autoland/rev/0898fe281bfee176cf31fabfa66494c6c34d0f23

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception&revision=b64006992c1c321e1e99c8ef53d0b772d662c744&selectedJob=236796876

Log link: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=236796876&repo=autoland&lineNumber=24720

Log snippet:

[task 2019-03-29T02:48:29.186Z] 02:48:29 INFO - REFTEST TEST-START | file:///builds/worker/workspace/build/tests/reftest/tests/dom/canvas/crashtests/727547.html
[task 2019-03-29T02:48:29.187Z] 02:48:29 INFO - REFTEST TEST-LOAD | file:///builds/worker/workspace/build/tests/reftest/tests/dom/canvas/crashtests/727547.html | 350 / 3704 (9%)
[task 2019-03-29T02:48:29.226Z] 02:48:29 INFO - ++DOMWINDOW == 76 (0x7f2a815a2000) [pid = 1116] [serial = 2657] [outer = 0x7f2a881b9d40]
[task 2019-03-29T02:48:29.294Z] 02:48:29 INFO - [Child 1116, Main Thread] WARNING: GLX_swap_control unsupported, ASAP mode may still block on buffer swaps.: file /builds/worker/workspace/build/src/gfx/gl/GLContextProviderGLX.cpp, line 217
[task 2019-03-29T02:48:29.302Z] 02:48:29 INFO - [Child 1116, Main Thread] WARNING: robust_buffer_access_behavior marked as unsupported: file /builds/worker/workspace/build/src/gfx/gl/GLContextFeatures.cpp, line 621
[task 2019-03-29T02:48:29.303Z] 02:48:29 INFO - [Child 1116, Main Thread] WARNING: Robustness supported, strategy is not LOSE_CONTEXT_ON_RESET!: file /builds/worker/workspace/build/src/gfx/gl/GLContext.cpp, line 960
[task 2019-03-29T02:48:29.303Z] 02:48:29 INFO - [Child 1116, Main Thread] WARNING: robustness marked as unsupported: file /builds/worker/workspace/build/src/gfx/gl/GLContextFeatures.cpp, line 621
[task 2019-03-29T02:48:29.320Z] 02:48:29 INFO - JavaScript error: file:///builds/worker/workspace/build/tests/reftest/tests/dom/canvas/crashtests/727547.html, line 6: TypeError: Argument 6 is not valid for any of the 6-argument overloads of WebGLRenderingContext.texImage2D.
[task 2019-03-29T02:48:29.336Z] 02:48:29 INFO - Assertion failure: mapped.stride / 4 == mapped.size.width, at /builds/worker/workspace/build/src/gfx/layers/client/CanvasClient.cpp:320
[task 2019-03-29T02:48:29.337Z] 02:48:29 INFO - #01: mozilla::layers::CanvasClientSharedSurface::Update(mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits>, mozilla::layers::ShareableCanvasRenderer*, mozilla::wr::RenderRoot) [mfbt/MaybeOneOf.h:93]
[task 2019-03-29T02:48:29.337Z] 02:48:29 INFO -
[task 2019-03-29T02:48:29.338Z] 02:48:29 INFO - #02: mozilla::layers::ShareableCanvasRenderer::UpdateCompositableClient(mozilla::wr::RenderRoot) [gfx/layers/CanvasRenderer.h:133]
[task 2019-03-29T02:48:29.338Z] 02:48:29 INFO -
[task 2019-03-29T02:48:29.339Z] 02:48:29 INFO - #03: mozilla::layers::ClientCanvasLayer::RenderLayer() [gfx/layers/client/ClientCanvasLayer.h:74]
[task 2019-03-29T02:48:29.340Z] 02:48:29 INFO -
[task 2019-03-29T02:48:29.341Z] 02:48:29 INFO - #04: mozilla::layers::ClientContainerLayer::RenderLayer() [gfx/layers/client/ClientContainerLayer.h:128]
[task 2019-03-29T02:48:29.342Z] 02:48:29 INFO -
[task 2019-03-29T02:48:29.343Z] 02:48:29 INFO - #05: mozilla::layers::ClientLayerManager::EndTransactionInternal(void ()(mozilla::layers::PaintedLayer, gfxContext*, mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> const&, mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> const&, mozilla::layers::DrawRegionClip, mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> const&, void*), void*, mozilla::layers::LayerManager::EndTransactionFlags) [gfx/layers/client/ClientLayerManager.cpp:324]
[task 2019-03-29T02:48:29.344Z] 02:48:29 INFO -
[task 2019-03-29T02:48:29.349Z] 02:48:29 INFO - #06: mozilla::layers::ClientLayerManager::EndTransaction(void ()(mozilla::layers::PaintedLayer

Flags: needinfo?(mstange)
Assignee

Updated

3 months ago
Depends on: 1540209
Attachment #9054350 - Attachment description: Bug 1479145 - Give RGB textures a 32-byte aligned stride in order to improve texture upload efficiency on certain drivers. r?mattwoodrow → Bug 1479145 - Give RGB textures a 32-byte aligned stride on macOS in order to improve texture upload efficiency on certain drivers. r?mattwoodrow
Assignee

Comment 23

3 months ago

The only platforms that do not support GL_PACK_ROW_LENGTH are platforms with
GLES 2. So on those platforms, trying to read back into buffers whose stride is
not width * 4 will assert.
That's fine because we usually don't encounter buffers with such large strides
on GLES 2 platforms. The only platform that really needs to handle the large
strides is macOS, and it always supports GL_PACK_ROW_LENGTH.
On macOS, we often run into large strides on surfaces that we intend to upload
as textures at some point, because large stride alignments are required for
efficient upload performance on some drivers.

Bug 1540209 tracks fixing the general case.

Depends on D25316

Assignee

Comment 24

3 months ago

I made the 32-byte alignment only apply on macOS, and fixed the readback path.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6e4e0e7821fbfea26aa4d389df2222fc67fc2fba

Flags: needinfo?(mstange)

Comment 25

3 months ago
Pushed by mstange@themasta.com:
https://hg.mozilla.org/integration/autoland/rev/0ea914bf03b7
Give RGB textures a 32-byte aligned stride on macOS in order to improve texture upload efficiency on certain drivers. r=mattwoodrow
https://hg.mozilla.org/integration/autoland/rev/d5836acadff9
Handle arbitrary strides for WebGL-to-SharedSurface readback on platforms that support it. r=jgilbert

Comment 26

3 months ago
bugherder
Status: ASSIGNED → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68

== Change summary for alert #20216 (as of Sat, 30 Mar 2019 09:41:58 GMT) ==

Improvements:

8% Resident Memory osx-10-10-shippable opt stylo 600,563,393.37 -> 550,107,308.26

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=20216

Looking back to comment 15 and comment 27, should this be backported to Firefox 67?

Assignee

Comment 29

3 months ago

Comment on attachment 9054350 [details]
Bug 1479145 - Give RGB textures a 32-byte aligned stride on macOS in order to improve texture upload efficiency on certain drivers. r?mattwoodrow

Beta/Release Uplift Approval Request

  • Feature/Bug causing the regression: None
  • User impact if declined: slower composites and a bit more memory usage on macOS with Intel GPUs
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): very simple fix
  • String changes made/needed:
Attachment #9054350 - Flags: approval-mozilla-beta?
Assignee

Updated

3 months ago
Attachment #9054594 - Flags: approval-mozilla-beta?

Comment on attachment 9054350 [details]
Bug 1479145 - Give RGB textures a 32-byte aligned stride on macOS in order to improve texture upload efficiency on certain drivers. r?mattwoodrow

Low risk patch improving macos graphics performance, uplift approved given that the patch is tiny and well understoof and the results measured in comment #27 are signifivant, uplift approved for 67 beta 9, thanks.

Attachment #9054350 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9054594 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Regressions: 1544478
You need to log in before you can comment on or make changes to this bug.