Open Bug 1729328 Opened 3 years ago Updated 3 years ago

Avoid copying images row by row in texture uploads

Categories

(Core :: Graphics: WebRender, enhancement, P3)

People

(Reporter: nical, Assigned: nical)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

We see a lot of CPU time spent copying from the cache item into the staging buffer, and suspect that part of it comes from copying row by row instead of doing one large memcpy for the whole image.
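To make the distinction concrete, here is a minimal sketch of the two copy strategies. It is not the WebRender code: `copy_image` and its parameters are hypothetical names, and it assumes tightly packed rows are the only case where a single copy is possible.

```rust
/// Copy `height` rows of `row_bytes` bytes each from `src` into `dst`.
/// `src_stride` and `dst_stride` are the distances in bytes between the
/// starts of consecutive rows. All names here are illustrative assumptions.
fn copy_image(
    src: &[u8],
    src_stride: usize,
    dst: &mut [u8],
    dst_stride: usize,
    row_bytes: usize,
    height: usize,
) {
    if height == 0 || row_bytes == 0 {
        return;
    }
    // Fast path: both buffers are tightly packed, so the whole image is one
    // contiguous range and a single copy (one memcpy) is enough.
    if src_stride == row_bytes && dst_stride == row_bytes {
        let len = row_bytes * height;
        dst[..len].copy_from_slice(&src[..len]);
        return;
    }
    // Slow path: rows are separated by padding, so copy them one at a time.
    for y in 0..height {
        let s = y * src_stride;
        let d = y * dst_stride;
        dst[d..d + row_bytes].copy_from_slice(&src[s..s + row_bytes]);
    }
}
```

The slow path issues one small copy per row, which is the pattern suspected of costing time here; the fast path only applies when the destination layout has no per-row padding.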

As far as I could measure, packing all items linearly and unpacking them in a shader only has a modest impact on the time spent copying into the staging buffer (6% improvement in the copy time at best). It can remove 1 ms in a bad frame, which isn't bad, but I was hoping for a better speedup.
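As an illustration of what "packing linearly" means, the sketch below places each item right after the previous one in the staging buffer and computes where a given texel ends up, which is the address computation the unpacking shader would have to mirror. `LinearPacker` and `texel_offset` are hypothetical names; the actual layout used by the patch is not described in this bug.

```rust
/// Hypothetical bump allocator: each item is stored contiguously, directly
/// after the previous one, with no per-row padding.
struct LinearPacker {
    cursor: usize,
    capacity: usize,
}

impl LinearPacker {
    fn new(capacity: usize) -> Self {
        LinearPacker { cursor: 0, capacity }
    }

    /// Reserve space for a `width` x `height` item and return its byte
    /// offset, or `None` if it does not fit in the remaining space.
    fn push(&mut self, width: usize, height: usize, bytes_per_pixel: usize) -> Option<usize> {
        let size = width * height * bytes_per_pixel;
        if size > self.capacity - self.cursor {
            return None;
        }
        let offset = self.cursor;
        self.cursor += size;
        Some(offset)
    }
}

/// Byte offset of texel (x, y) of an item stored at `item_offset`.
fn texel_offset(
    item_offset: usize,
    item_width: usize,
    x: usize,
    y: usize,
    bytes_per_pixel: usize,
) -> usize {
    item_offset + (y * item_width + x) * bytes_per_pixel
}
```

With this layout each item can be written with a single copy, at the cost of the shader doing the 1D-to-2D address arithmetic.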

Storing the images contiguously in the staging buffer opens the door to potentially rasterizing blobs directly into it (for the glTexSubImage code path we use on Windows). However, that means more risk of false sharing of cache lines, since the blob tiles are rasterized in parallel.

The worse data locality doesn't appear to hurt the copy shader GPU time (or it's compensated by the simplicity of the shader). In RenderDoc there is no observable time difference between the two.

Using larger (1024x1024 instead of 512x512) staging textures doesn't affect the copy time, but the number of draw calls required to do the GPU copy goes from 40-ish to 15-ish when testing with the bottom of https://creativecluster.lu/ (it has a large animated blob).

On linux+intel, having bigger blob tiles (512x512 instead of 256x256) and uploading them directly instead of using the batched upload path makes a pretty large difference (creativecluster test case total cache update time goes from avg 18.2 max 56.4 to avg 8.9 max 27.3).

Bigger blob tiles mean coarser invalidation granularity; however, they also mean less per-tile overhead during rasterization, so it would be a tradeoff.
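A rough way to look at the tradeoff is to count how many tiles a dirty rect touches and how many pixels have to be re-rasterized as a result. The helpers below are a hypothetical sketch, assuming tiles sit on a regular grid starting at the origin and that an invalidated tile is always re-rasterized in full.

```rust
/// Number of tiles along one axis overlapped by the half-open range
/// [min, max), for a grid of `tile_size`-wide tiles starting at 0.
fn tiles_along_axis(min: u32, max: u32, tile_size: u32) -> u32 {
    if max <= min {
        return 0;
    }
    (max - 1) / tile_size - min / tile_size + 1
}

/// Number of tiles touched by the dirty rect [x0, x1) x [y0, y1).
fn invalidated_tiles(x0: u32, y0: u32, x1: u32, y1: u32, tile_size: u32) -> u32 {
    tiles_along_axis(x0, x1, tile_size) * tiles_along_axis(y0, y1, tile_size)
}

/// Pixels re-rasterized, assuming every touched tile is redone entirely.
fn rerasterized_pixels(x0: u32, y0: u32, x1: u32, y1: u32, tile_size: u32) -> u64 {
    invalidated_tiles(x0, y0, x1, y1, tile_size) as u64 * tile_size as u64 * tile_size as u64
}
```

For example, a 10x10 dirty rect at (300, 300) touches a single tile at either size, but that tile covers four times as many pixels at 512x512 as at 256x256. Conversely, a fully invalidated 2048x2048 blob needs 64 tiles at 256x256 and only 16 at 512x512, so the per-tile overhead is paid a quarter as often.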

I added a code path to upload directly from the image buffer into the staging texture (skipping the staging CPU buffer) when the image is large enough that we are unlikely to fit another one in the staging CPU buffer. With that and setting the blob tile size to 512, the memcpy time more or less goes away. However, the time we spend in glTexSubImage2D increases a lot on windows+intel (16ms to 22ms average on the creativecluster test case), which is odd since it should be doing exactly the same thing (except reading from a different source). The time is spent in WaitForSynchronizationObjectForCpu under UpdateSubResource.
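The choice between the two paths could be sketched as below. The threshold is an assumption made for illustration (an image counts as "too large to share" when it takes more than half of the staging CPU buffer); the actual condition used in the patch is not stated here, and `UploadPath` and `choose_upload_path` are hypothetical names.

```rust
/// Which upload path an image takes. Hypothetical type for illustration.
#[derive(Debug, PartialEq, Eq)]
enum UploadPath {
    /// Copy into the shared staging CPU buffer and upload as part of a batch.
    Batched,
    /// Upload straight from the image buffer, skipping the staging CPU buffer.
    Direct,
}

/// Assumed heuristic: if the image takes more than half of the staging
/// buffer, a second image of similar size would not fit next to it, so
/// batching buys little and the extra CPU copy is skipped.
fn choose_upload_path(image_bytes: usize, staging_buffer_bytes: usize) -> UploadPath {
    if image_bytes * 2 > staging_buffer_bytes {
        UploadPath::Direct
    } else {
        UploadPath::Batched
    }
}
```

Under this assumed rule and a 512x512 RGBA8 staging buffer (1 MiB), a 512x512 blob tile goes through the direct path while a 256x256 tile stays batched.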

See Also: → 1730707