Use ClientStorage for texture upload on Mac
Categories
(Core :: Graphics: WebRender, enhancement, P3)
Tracking
()
 | Tracking | Status |
---|---|---|
firefox63 | --- | affected |
People
(Reporter: jrmuizel, Unassigned)
References
(Depends on 3 open bugs, Blocks 2 open bugs)
Details
(Whiteboard: wr-planning)
The current upload path still seems pretty suboptimal.
Comment 1•5 years ago
Here's a profile that shows some of the problems: https://perfht.ml/2XQPazL
- The glBufferSubData call in TextureUploader::upload performs a copy on the CPU: https://perfht.ml/2Dvf1nT
- A glTexSubImage2D call in UploadTarget::update_impl performs CPU-side format conversion: https://perfht.ml/34prCVf (glgProcessPixelsWithProcessor)
Comment 2•4 years ago
Discussed this with Dzmitry a bit.
(In reply to Markus Stange [:mstange] from comment #1)
Here's a profile that shows some of the problems: https://perfht.ml/2XQPazL
- The glBufferSubData call in TextureUploader::upload performs a copy on the CPU: https://perfht.ml/2Dvf1nT
There are two steps to eliminating this copy:
- First, we want to remove the glBufferSubData call. Rather than having the driver copy our data into the PBO, we want to have a mapped PBO and copy into the mapping ourselves. Bug 1602550 will help with this.
- Then, we want to eliminate the copy by having the RenderBackend write data directly into the PBO mapping. This is bug 1604546.
At that point, texture upload with PBOs "should" be as efficient as using ClientStorage. (We should make a small test app to make sure that this is the case. The driver might not be doing exactly what we expect.)
However, adding a path for texture upload that uses ClientStorage textures would allow for the following:
- It would mean that we wouldn't need to map buffers ahead of time and guess their sizes. ClientStorage allows us to make our own allocations without talking to GL.
- It would probably encounter fewer bugs with GL drivers on macOS.
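To make the two copy-elimination steps described above more concrete, here is a minimal sketch that models them without any real GL. FakePbo, render_pixels, and the three upload_* functions are hypothetical names invented for this illustration; they are not WebRender or OpenGL APIs. buffer_sub_data stands in for glBufferSubData and map stands in for a mapped PBO. The sketch only tracks who performs the extra copy; it says nothing about how a real driver behaves.

```python
class FakePbo:
    """Hypothetical stand-in for a GL pixel buffer object (no real GL calls)."""

    def __init__(self, size):
        self._storage = bytearray(size)
        self.driver_copied = 0  # bytes the "driver" copied on our behalf

    def buffer_sub_data(self, offset, data):
        # Models glBufferSubData: the driver copies caller memory into the PBO.
        self._storage[offset:offset + len(data)] = data
        self.driver_copied += len(data)

    def map(self):
        # Models a mapped PBO: a writable view straight onto the PBO's storage.
        return memoryview(self._storage)

    def contents(self):
        return bytes(self._storage)


def render_pixels(dest):
    """Hypothetical producer (think RenderBackend) writing pixels into dest."""
    for i in range(len(dest)):
        dest[i] = i % 256


def upload_sub_data(pbo, size):
    """Current path: render to scratch memory, then the driver copies it."""
    scratch = bytearray(size)
    render_pixels(scratch)
    pbo.buffer_sub_data(0, scratch)
    return 1  # number of extra copies after rendering


def upload_mapped_copy(pbo, size):
    """Step 1: same scratch buffer, but we copy into the mapping ourselves."""
    scratch = bytearray(size)
    render_pixels(scratch)
    pbo.map()[:size] = scratch
    return 1


def upload_direct(pbo, size):
    """Step 2: render straight into the mapping, so no extra copy remains."""
    render_pixels(pbo.map()[:size])
    return 0


SIZE = 64 * 64 * 4  # one hypothetical 64x64 RGBA8 tile
for upload in (upload_sub_data, upload_mapped_copy, upload_direct):
    pbo = FakePbo(SIZE)
    extra = upload(pbo, SIZE)
    print(f"{upload.__name__}: extra copies={extra}, "
          f"driver copied={pbo.driver_copied} bytes")
```

In this model, step 1 does not reduce the number of copies; it only moves the copy out of the driver and into our own code. Only step 2 removes the copy altogether.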
- A glTexSubImage2D call in UploadTarget::update_impl performs CPU-side format conversion: https://perfht.ml/34prCVf (glgProcessPixelsWithProcessor)
This is probably the path that gets taken for things like software-decoded video. We can improve this by having Gecko do the upload (which will be using ClientStorage from an existing code path) and then expose the texture to WebRender as an external-image-native-texture rather than as a bag of bytes. I'll file a separate bug about this. (I thought we already had one on it but I cannot find it at the moment.)
Comment 3•4 years ago
(In reply to Markus Stange [:mstange] from comment #2)
This is probably the path that gets taken for things like software-decoded video. We can improve this by having Gecko do the upload (which will be using ClientStorage from an existing code path) and then expose the texture to WebRender as an external-image-native-texture rather than as a bag of bytes. I'll file a separate bug about this. (I thought we already had one on it but I cannot find it at the moment.)
I assume bug 1403618 is meant. (?)
Comment 4•4 years ago (Reporter)
This page shows off the problem pretty well: https://bug554004.bmoattachments.org/attachment.cgi?id=477057&particles=2000
Comment 5•4 years ago (Reporter)
YouTube VP9 video playback also shows the problem pretty clearly.
Comment 6•4 years ago (Reporter)
As a first step to comment 2 we should see if we can get a texture upload path that works well enough on mac using PBOs. Using https://github.com/jrmuizel/client-storage-rs should make it easy to compare PBOs and client storage.
Comment 7•4 years ago
Related to bug 1640952, which is required to get the staging belt overhead close to that of client storage.
Comment 8•4 years ago
(In reply to Jeff Muizelaar [:jrmuizel] from comment #6)
As a first step to comment 2 we should see if we can get a texture upload path that works well enough on mac using PBOs. Using https://github.com/jrmuizel/client-storage-rs should make it easy to compare PBOs and client storage.
I wrote a benchmarking script to try out all the different texture upload methods in this test app. The benchmark renders 300 frames with vsync off.
Results on 2019 MBP with Intel UHD Graphics 630:
['--apple-format', '--texture-storage', '--pbo', '1'] 8.89855162s, 8.831704057s, 8.85521106s
['--apple-format', '--texture-storage', '--pbo', '1', '--pbo-reallocate-buffer'] 8.923318468s, 8.902088411s, 8.919655329s
['--apple-format', '--texture-storage', '--pbo', '2'] 8.843013143s, 8.829951477s, 8.843549752s
['--apple-format', '--texture-storage', '--pbo', '2', '--pbo-reallocate-buffer'] 8.876591648s, 8.892376154s, 8.871327972s
['--apple-format', '--texture-storage', '--client-storage'] 7.124985015s, 7.289057942s, 7.229166325s
['--apple-format', '--texture-storage', '--texture-rectangle', '--pbo', '1'] 8.844125617s, 8.835762945s, 8.859816758s
['--apple-format', '--texture-storage', '--texture-rectangle', '--pbo', '1', '--pbo-reallocate-buffer'] 8.863563841s, 8.873639903s, 8.892736942s
['--apple-format', '--texture-storage', '--texture-rectangle', '--pbo', '2'] 8.784697434s, 8.820122089s, 8.831129245s
['--apple-format', '--texture-storage', '--texture-rectangle', '--pbo', '2', '--pbo-reallocate-buffer'] 8.88656608s, 8.873930213s, 8.888188097s
['--apple-format', '--texture-storage', '--texture-rectangle', '--client-storage'] 7.233205088s, 7.228350659s, 7.183014964s
['--apple-format', '--texture-storage', '--texture-array', '--pbo', '1'] 8.833866519s, 8.840146932s, 8.838348655s
['--apple-format', '--texture-storage', '--texture-array', '--pbo', '1', '--pbo-reallocate-buffer'] 8.860797364s, 9.043897408s, 8.887586905s
['--apple-format', '--texture-storage', '--texture-array', '--pbo', '2'] 8.831442529s, 8.860461698s, 8.851406145s
['--apple-format', '--texture-storage', '--texture-array', '--pbo', '2', '--pbo-reallocate-buffer'] 8.890968003s, 8.899123153s, 8.915493115s
['--apple-format', '--texture-storage', '--texture-array', '--client-storage'] 7.173057978s, 7.156567533s, 7.200522028s
['--apple-format', '--texture-rectangle', '--pbo', '1'] 4.70128211s, 4.695886643s, 4.698270348s
['--apple-format', '--texture-rectangle', '--pbo', '1', '--pbo-reallocate-buffer'] 7.770578027s, 7.686980041s, 7.750275808s
['--apple-format', '--texture-rectangle', '--pbo', '2'] 4.706645099s, 4.660831551s, 4.663864238s
['--apple-format', '--texture-rectangle', '--pbo', '2', '--pbo-reallocate-buffer'] 4.474235542s, 4.464127835s, 4.47519886s
['--apple-format', '--texture-rectangle', '--client-storage'] 3.228521601s, 3.225977283s, 3.234716105s
['--apple-format', '--texture-array', '--pbo', '1'] 4.693244955s, 4.697209266s, 4.697757732s
['--apple-format', '--texture-array', '--pbo', '1', '--pbo-reallocate-buffer'] 7.704128247s, 7.719084604s, 7.611890613s
['--apple-format', '--texture-array', '--pbo', '2'] 4.710451563s, 4.680535763s, 4.693366143s
['--apple-format', '--texture-array', '--pbo', '2', '--pbo-reallocate-buffer'] 4.465463219s, 4.470689369s, 4.474402022s
['--apple-format', '--texture-array', '--client-storage'] 4.459025465s, 4.461897402s, 4.484525335s
['--swizzle', '--texture-storage', '--pbo', '1'] 4.689885722s, 4.660584195s, 4.701891236s
['--swizzle', '--texture-storage', '--pbo', '1', '--pbo-reallocate-buffer'] 7.718375091s, 7.695965396s, 7.662461593s
['--swizzle', '--texture-storage', '--pbo', '2'] 4.719654608s, 4.686956754s, 4.688548537s
['--swizzle', '--texture-storage', '--pbo', '2', '--pbo-reallocate-buffer'] 4.466023294s, 4.459831265s, 4.465406599s
['--swizzle', '--texture-storage', '--client-storage'] 4.456106935s, 4.425488113s, 4.445511299s
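A rough way to compare these configurations is to average the three runs for each flag combination and divide by the 300 rendered frames. The sketch below does that for a few of the result lines above, copied verbatim. parse_line and summarize are names made up for this illustration, and treating each figure as the total wall time for all 300 frames is an assumption; the benchmark output does not state it.

```python
import ast
import re
from statistics import mean

FRAMES = 300  # the benchmark renders 300 frames per run

# A subset of the result lines above, copied verbatim.
RAW = """\
['--apple-format', '--texture-rectangle', '--pbo', '1'] 4.70128211s, 4.695886643s, 4.698270348s
['--apple-format', '--texture-rectangle', '--pbo', '1', '--pbo-reallocate-buffer'] 7.770578027s, 7.686980041s, 7.750275808s
['--apple-format', '--texture-rectangle', '--client-storage'] 3.228521601s, 3.225977283s, 3.234716105s
['--apple-format', '--texture-storage', '--pbo', '1'] 8.89855162s, 8.831704057s, 8.85521106s
['--apple-format', '--texture-storage', '--client-storage'] 7.124985015s, 7.289057942s, 7.229166325s
"""


def parse_line(line):
    """Split one result line into (flags tuple, list of run times in seconds)."""
    end = line.index("]") + 1
    flags = tuple(ast.literal_eval(line[:end]))
    times = [float(t) for t in re.findall(r"([0-9.]+)s", line[end:])]
    return flags, times


def summarize(raw):
    """Map each flag combination to the mean of its run times."""
    parsed = map(parse_line, raw.strip().splitlines())
    return {flags: mean(times) for flags, times in parsed}


means = summarize(RAW)
# Print fastest configuration first.
for flags, avg in sorted(means.items(), key=lambda kv: kv[1]):
    per_frame_ms = avg / FRAMES * 1000  # assumes avg is the total for 300 frames
    print(f"{avg:6.3f}s total, {per_frame_ms:5.2f} ms/frame  {' '.join(flags)}")
```

On this subset, client storage with --texture-rectangle has the lowest mean (about 3.23s, versus about 4.70s for the single-PBO variant of the same configuration).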
Comment 9•4 years ago (Reporter)
We put together a "shortsighted" plan for what will happen next here:
The kinds of uploads that we need to deal with:
- imagelib - content process, also needs to respect alignment
- blobs - gpu process, needs pointer passing, lifetime is extended
- glyphs - gpu process (scene building), async and ahead of time, needs alignment
- video - lifetime issues, cross-process, can encapsulate the client storage
Complications:
- lifetimes (client storage buffer needs to stay alive for the duration of the texture)
- PBOs can only be allocated on the GL threads
- PBOs can't work with shmems
Current issues:
- We're currently hitting zero-fill-on-demand when copying into the PBO
- reallocating is bad
- drivers search the orphan lists
- drivers zero-fill on demand
Plan:
- kvark: reduced test case for the AMD issue
- miko: make software video decoding use the existing client storage infrastructure
- jeff: look into why/if we are page faulting on a re-used orphan PBO
- i.e. why isn't the memory being reused for blobs
Comment 10•4 years ago
Client storage upload for software-decoded video and canvas2d was implemented in bug 1536515.
Comment 11•3 years ago
I filed two bugs about using client storage in the texture cache:
- bug 1690682 for standalone textures
- bug 1690685 for the upload staging texture