Poor texture upload performance on Mali
Categories: Core :: Graphics: WebRender, defect
Tracking: firefox84 fixed
People: Reporter: jnicol; Assigned: jnicol
Attachments: 3 files
In bug 1661135 we established that stuttery zooming on Mali G71/G72 can occur due to excessive texture upload.
On Adreno, we did not notice this issue because the texture upload path is fast enough to hide it on most sites. On Mali, however, it is unbearably slow. We should investigate whether we are hitting a slow path on Mali, and, if so, find out how to get on the fast path instead.
Here is a profile: https://share.firefox.dev/2QpwggF
Profiles on Adreno prior to bug 1498732 clearly showed lots of time in glTexSubImage3D or similar functions. In the Mali profile, all the time is less intuitively spent in eglSwapBuffers(). This indicates that the upload was being done synchronously on Adreno, whereas on Mali it is in fact being done asynchronously but taking too long, resulting in eglSwapBuffers blocking until it completes. Hopefully, despite being performed asynchronously, it is not in fact taking the optimal path, and we can get an improvement.
Comment 1•4 years ago
I have figured out that the time taken to upload the texture data is correlated with the number of separate glTexSubImage calls, rather than the total number of bytes. (These are quite often inversely proportional, for example when zooming text: when zoomed out there are many glyphs to upload but they are very small; when zoomed in there are fewer glyphs but they are very large.)
Some higher level ideas we have had which could help are:
- Upload all subpixel-position variants of a glyph together
- Cache text-runs as opposed to individual glyphs
The latter might help a lot but I'm not convinced the former would be enough.
A lower-level idea is to maintain a copy of the texture data in a persistently-mapped PBO, and copy texture updates into it with the same memory layout as in the texture. It is then possible to call glTexSubImage for coarser regions than the individual items being uploaded, as the existing texture data will just be overwritten with equal data.
I've hacked together a prototype of this and it appears to be a huge improvement. I still need to determine how eagerly we should group updates, e.g. if a texture has 2 updates, one in layer 0 and one in layer 15, do we upload all 16 regions together?
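To illustrate the coalescing step, here is a minimal sketch of the per-layer grouping logic. The types and function names are hypothetical, not WebRender's actual API; the point is that because the mapped PBO mirrors the texture's contents, uploading each layer's bounding rect only rewrites pixels with identical data, so many small updates collapse into one glTexSubImage call per layer.

```rust
/// A dirty region within one layer of a texture array.
/// Illustrative type; the real code uses WebRender's own rect types.
#[derive(Clone, Copy, Debug, PartialEq)]
struct DirtyRect { x: u32, y: u32, w: u32, h: u32, layer: u32 }

impl DirtyRect {
    /// Bounding rect of two dirty regions in the same layer.
    fn union(self, other: DirtyRect) -> DirtyRect {
        let x0 = self.x.min(other.x);
        let y0 = self.y.min(other.y);
        let x1 = (self.x + self.w).max(other.x + other.w);
        let y1 = (self.y + self.h).max(other.y + other.h);
        DirtyRect { x: x0, y: y0, w: x1 - x0, h: y1 - y0, layer: self.layer }
    }
}

/// Coalesce dirty rects into one coarse rect per layer. Overwriting the
/// non-dirty pixels inside the bounding rect is harmless, since the PBO
/// holds a byte-identical copy of the texture.
fn coalesce_per_layer(mut rects: Vec<DirtyRect>) -> Vec<DirtyRect> {
    rects.sort_by_key(|r| r.layer);
    let mut out: Vec<DirtyRect> = Vec::new();
    for r in rects {
        match out.last_mut() {
            Some(last) if last.layer == r.layer => *last = last.union(r),
            _ => out.push(r),
        }
    }
    out
}
```

Whether to go further and merge *across* layers (the "layer 0 and layer 15" question above) is a policy decision layered on top of this.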
There is also the question of what to do about data written to the texture cache from render tasks rather than uploads. If our buffer doesn't contain that data, then a subsequent upload could overwrite it in the texture. Hopefully it is possible to simply glReadPixels it into our PBO.
My biggest concern, however, is of code quality. We already have 3 texture upload paths, and longer term goals to implement bug 1602550 and bug 1640952. I'm struggling to come up with a clean abstraction for this. Dzmitry, do you have any ideas for how you want the belt uploader's API to look? Do you think it could be compatible with this system?
Glenn, can you think of any issues with the above plan?
Comment 2•4 years ago
That's an interesting discovery!
Going with glReadPixels to update the parts of the PBO just to upload it afterwards seems rather concerning.
I'm thinking about a middle path between what we do now (a separate glTexSubImage for each update region) and composing everything into a single update. What if we get all the sub-regions and upload them into a 512x512 texture (packed in some way), only to follow up with a bunch of glCopyTexImage2D calls from that temporary into the real places where that data needs to live?
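A minimal sketch of the packing half of this idea, using a simple shelf packer for the 512x512 staging texture (the actual packing strategy is left open above, and `ShelfPacker` is an illustrative name, not a real WebRender type). Each successful allocation yields a position in the staging texture; the caller would record a corresponding copy from that position to the update's real destination, and start a new batch when `allocate` returns `None`.

```rust
/// Minimal shelf packer for a square staging texture: updates are laid
/// out left-to-right on the current shelf, and a new shelf is opened
/// below when the current one is full.
struct ShelfPacker { size: u32, cursor_x: u32, shelf_y: u32, shelf_h: u32 }

impl ShelfPacker {
    fn new(size: u32) -> Self {
        ShelfPacker { size, cursor_x: 0, shelf_y: 0, shelf_h: 0 }
    }

    /// Returns the (x, y) position for a w x h update, or None when the
    /// staging texture is full and a new batch must be started.
    fn allocate(&mut self, w: u32, h: u32) -> Option<(u32, u32)> {
        if w > self.size || h > self.size {
            return None;
        }
        if self.cursor_x + w > self.size {
            // Current shelf is full: open a new shelf below it.
            self.shelf_y += self.shelf_h;
            self.cursor_x = 0;
            self.shelf_h = 0;
        }
        if self.shelf_y + h > self.size {
            return None;
        }
        let pos = (self.cursor_x, self.shelf_y);
        self.cursor_x += w;
        self.shelf_h = self.shelf_h.max(h);
        Some(pos)
    }
}
```

The win is that the whole batch reaches the GPU in a single upload call, and the follow-up copies are GPU-side operations rather than further CPU-to-GPU transfers.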
Comment 3•4 years ago
I quite like that idea, and we'll actually need to do something similar for Adreno 3xx anyway. (Due to bug 1513185, where PBO uploads always write to layer 0 instead of the specified layer.)
I'll hack together a prototype, hopefully it will perform well.
Comment 4•4 years ago
I like the idea kvark proposed above. It's something we've discussed previously (I forget the exact details, I think it was a similar issue). It seems like a good potential solution to reduce driver sync points.
Comment 5•4 years ago
Provides glBufferStorage and glFlushMappedBufferRange, and fixes
glClientWaitSync's return type.
Comment 6•4 years ago
Currently we upload texture data by first splitting the updates based
on their destination texture. Then we iterate through every update for
the texture and calculate the total required size, then allocate and
map a PBO large enough to contain the updates. Finally we fill the PBO
with the data and submit the uploads, before moving on to the next
texture. This results in a lot of allocating and mapping PBOs of
highly variable sizes, and also requires an extra iteration of the
updates to calculate the required size.
This change makes it so that we allocate fixed size PBOs. We fill the
PBOs with as many updates as they have room for, then start filling a
new one. Furthermore, we recycle the PBOs to avoid having to
reallocate them, and if possible we persistently map them too.
The TextureUploader struct can now be used to upload to more than one
texture, ensuring buffers are filled more efficiently. It is also now
possible to manually stage some data in the PBO to be uploaded, in
case the client requires more fine-grained control over the layout of
the data.
If a single upload is too large to fit within the fixed-size PBO, then
we allocate a single-use larger PBO for the upload. As before, on Mac
we unfortunately must use a separate PBO for each individual upload,
so we also allocate single-use custom-sized PBOs in that case.
Depends on D96023
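The recycling scheme described above can be modeled with a small pool abstraction. This is an illustrative sketch, not the real TextureUploader: `Pbo`, `PboPool`, and the chunk size are assumptions, and in the real code the objects wrap GL buffer handles (with oversized ones freed via glDeleteBuffers rather than silently dropped).

```rust
/// Assumed fixed chunk size for recyclable PBOs.
const PBO_SIZE: usize = 256 * 1024;

/// Stand-in for a GL pixel buffer object.
struct Pbo { id: u32, size: usize }

struct PboPool { free: Vec<Pbo>, next_id: u32 }

impl PboPool {
    fn new() -> Self {
        PboPool { free: Vec::new(), next_id: 1 }
    }

    /// Fixed-size PBOs are served from the free list when possible.
    /// Uploads too large for the fixed size get a single-use,
    /// custom-sized PBO instead.
    fn get(&mut self, required: usize) -> Pbo {
        if required <= PBO_SIZE {
            if let Some(pbo) = self.free.pop() {
                return pbo;
            }
            let id = self.next_id;
            self.next_id += 1;
            Pbo { id, size: PBO_SIZE }
        } else {
            let id = self.next_id;
            self.next_id += 1;
            Pbo { id, size: required } // one-off, never recycled
        }
    }

    /// Only fixed-size PBOs return to the pool; oversized ones are
    /// dropped (deleted in the real code).
    fn recycle(&mut self, pbo: Pbo) {
        if pbo.size == PBO_SIZE {
            self.free.push(pbo);
        }
    }
}
```

This captures the key property of the patch: steady-state uploads reuse a small set of identically-sized buffers, avoiding the allocate-and-map churn of the old per-texture, variable-size path.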
Comment 7•4 years ago
Mali-Gxx performance really struggles when we make a large number of
glTexSubImage calls, even if the total amount of memory being uploaded
isn't unreasonably large. To work around this, this change makes it so
that we batch many texture updates into fewer upload calls.
We sort the updates by their target texture format, then use a
guillotine allocator to pack the updates efficiently. Using the
TextureUploader's newly added ability to manually stage texture data,
we arrange the updates' data as they have been packed, then upload the
large texture in a single call. Finally, we blit each update from its
location in the staging texture to its correct location in the texture
cache.
Depends on D96024
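For reference, the guillotine allocation mentioned above works roughly as follows. This is a deliberately tiny sketch, not the allocator WebRender actually uses (that one has smarter free-rect selection and merging): each allocation takes a free rectangle that fits and "guillotines" the leftover space into a right remainder and a bottom remainder.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Rect { x: u32, y: u32, w: u32, h: u32 }

/// Minimal guillotine allocator: take the first free rect that fits,
/// then split the leftover space with two straight cuts.
struct Guillotine { free: Vec<Rect> }

impl Guillotine {
    fn new(w: u32, h: u32) -> Self {
        Guillotine { free: vec![Rect { x: 0, y: 0, w, h }] }
    }

    fn allocate(&mut self, w: u32, h: u32) -> Option<Rect> {
        let i = self.free.iter().position(|r| r.w >= w && r.h >= h)?;
        let r = self.free.swap_remove(i);
        // Guillotine split: remainder to the right of the allocation,
        // and remainder below the full width of the chosen rect.
        if r.w > w {
            self.free.push(Rect { x: r.x + w, y: r.y, w: r.w - w, h });
        }
        if r.h > h {
            self.free.push(Rect { x: r.x, y: r.y + h, w: r.w, h: r.h - h });
        }
        Some(Rect { x: r.x, y: r.y, w, h })
    }
}
```

Packing the sorted updates this way keeps the staging texture dense, so one large glTexSubImage call plus a series of GPU-side blits replaces the many small uploads that Mali handles so poorly.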
Comment 9•4 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/77c01f13f298
https://hg.mozilla.org/mozilla-central/rev/7ba854d835c4
https://hg.mozilla.org/mozilla-central/rev/5f5c269d15d6