Closed Bug 1616901 Opened 5 months ago Closed 3 months ago

Texture cache (re)allocation performance issues

Categories

(Core :: Graphics: WebRender, defect, P3)

defect

Tracking

()

RESOLVED FIXED
mozilla76
Tracking Status
firefox76 --- fixed

People

(Reporter: nical, Assigned: nical)

References

(Depends on 1 open bug, Blocks 4 open bugs)

Details

Attachments

(10 files)


Jrmuizel was looking into performance issues on Windows + Intel, related to the texture array being resized. Part of the problem is that we sometimes have high peaks of texture memory usage, and part comes from having to reallocate the arrays. The badness tends to scale with the size of the texture array.

A typical test case is to go to youtube's front page and scroll to the bottom. By the time we get there, the array will have been resized multiple times. Another one is scrolling down the testcase from bug 1520401.

A nice way to look at the texture cache's allocation patterns is to enable both prefs gfx.webrender.debug.texture-cache and gfx.webrender.debug.texture-cache.clear-evicted.

Some notes/ideas discussed after the standup meeting:

  • Tweak the texture cache's eviction policy. See the code in texture_cache.rs around EvictionThreshold and EvictionThresholdBuilder. It could be made a lot more aggressive than it currently is, and could be tuned per class of devices.
  • Grow the texture arrays in larger increments, to resize less often (at the cost of worse overall memory usage).
  • After a texture array reaches a certain size, allocate an extra texture array instead of growing the current one. This will break batches but avoid the cost of reallocating and copying the contents of the current texture array.
  • In the same vein, if having more texture arrays provides desirable flexibility, have a separate texture array for opaque and non-opaque content, since that won't cause any batch breaks.
  • After a texture array reaches a certain size, allocate standalone entries instead of growing the array (even worse for batching but maybe simpler?).
  • Start with a larger texture array for main windows (and keep initially small texture arrays for popups).
  • Upload all cached images to long lived standalone textures and copy them on-demand in an atlas that has a very aggressive eviction policy. This may require more work than the rest but would prevent extra texture uploads which some other proposals would regress.
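To make the trade-off between the "larger increments" and "extra array" ideas concrete, here is a minimal sketch of such a growth policy. The names `Growth`, `plan_growth` and `resize_count` are hypothetical and do not come from texture_cache.rs; the increment and cap values are parameters, not recommendations.

```rust
/// What to do when a texture array runs out of room.
#[derive(Debug, PartialEq)]
pub enum Growth {
    /// Reallocate the existing array with this many layers (requires a copy).
    Resize { new_layer_count: usize },
    /// Leave the existing array alone and allocate another one (breaks batches).
    NewArray { layer_count: usize },
}

/// Hypothetical policy: grow in steps of `increment` layers up to `max_layers`,
/// then stop resizing and allocate additional arrays instead.
pub fn plan_growth(current_layers: usize, increment: usize, max_layers: usize) -> Growth {
    assert!(increment > 0 && max_layers > 0);
    if current_layers >= max_layers {
        Growth::NewArray { layer_count: increment.min(max_layers) }
    } else {
        Growth::Resize { new_layer_count: (current_layers + increment).min(max_layers) }
    }
}

/// Number of reallocations needed to go from `start` to at least `target`
/// layers when growing `increment` layers at a time.
pub fn resize_count(start: usize, target: usize, increment: usize) -> usize {
    assert!(increment > 0);
    if target <= start {
        0
    } else {
        (target - start + increment - 1) / increment
    }
}
```

With this sketch, `resize_count(2, 27, 1)` is 25 while `resize_count(2, 27, 4)` is 7, which is the kind of reduction coarser growth would buy, at the cost of up to `increment - 1` unused layers.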

Note: we already eagerly evict items associated with deleted image keys.

Priority: -- → P3

On my laptop (4k screen so hidpi scaling is probably relevant), just starting the browser on about:newtab makes us (re)allocate the 512x512 color texture array 5 times, with 2, 6, 7, 8 and finally 9 layers before the page stabilizes.
After that going to https://fr.wikipedia.org/wiki/Caf%C3%A9 causes us to reallocate the texture array with 10, 11, 15, 16 and 18 layers before the page stabilizes. After leaving the tab idle for a moment, it looks like the array is resized down to 12 layers.
Then, scrolling that page all the way to the bottom makes us reallocate the texture array a bunch, until we reach 27 layers.

I think that it is safe to assume that we can start with a larger texture array for main windows (the layer count won't go below 10 on simple web pages on my laptop), and grow it at a coarser granularity. This would be a nice occasion to pick a larger size for the array layers (maybe 1024x1024), allowing slightly larger images to participate in batching. Maybe start with 4 1024x1024 layers for the main window and keep small 512x512 arrays for popups initially.

Edit: I got the same results running the steps above with a device pixel ratio of 1 at a quarter of the screen resolution.
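For reference, the memory arithmetic behind the "4 1024x1024 layers" suggestion can be written out as below. The helper names are made up for illustration, and 4 bytes per pixel is an assumption that holds for 8-bit RGBA/BGRA formats.

```rust
/// Pixels in a texture array with square layers of `layer_size` x `layer_size`.
pub fn array_pixels(layer_size: u64, layers: u64) -> u64 {
    layer_size * layer_size * layers
}

/// Bytes for the same array, given the bytes per pixel
/// (4 for an 8-bit-per-channel RGBA format).
pub fn array_bytes(layer_size: u64, layers: u64, bytes_per_pixel: u64) -> u64 {
    array_pixels(layer_size, layers) * bytes_per_pixel
}
```

Four 1024x1024 layers hold as many pixels as sixteen 512x512 layers (16 MiB at 4 bytes per pixel), so that starting point sits above the 10 layers of 512x512 (10 MiB) that simple pages did not go below in the measurements above.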

I experimented with 1024x1024 layers, however the slab allocation strategy is not very good with larger sizes. We end up wasting a lot of space on items that are slightly bigger than 512 pixels and allocate a lot of layers as well (though some of it also comes from not placing some images in standalone textures).
Some of the layers could use a guillotine allocator instead of the slab allocator, or we could have 1024x1024 textures but still place the 512px+ items in standalone textures initially.

Another interesting thing is that the slab allocator appears to have a fixed slab size per layer. As a result, we tend to always have at least a layer for each slab size. Setting the layer size to 1024 while keeping the max item size to 512 still leads to a lot of allocated layers, with a lot of free space.
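A rough illustration of why items slightly bigger than 512 pixels are so costly with power-of-two slabs. This is a simplified model that assumes square slabs only and a minimum slab size of 16; the real allocator also has non-square slabs, and `slab_dim` and `wasted_fraction` are hypothetical helpers.

```rust
/// Round a dimension up to the next power-of-two slab dimension,
/// never going below `min_slab`.
pub fn slab_dim(size: u32, min_slab: u32) -> u32 {
    size.max(min_slab).next_power_of_two()
}

/// Fraction of a square slab's pixels left unused by a `width` x `height` item.
pub fn wasted_fraction(width: u32, height: u32, min_slab: u32) -> f64 {
    let dim = slab_dim(width.max(height), min_slab);
    let slab_pixels = dim as f64 * dim as f64;
    1.0 - (width as f64 * height as f64) / slab_pixels
}
```

In this model a 520x520 item lands in a 1024x1024 slab and leaves about 74% of it unused, whereas a 512x512 item wastes nothing.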

I added some quick and dirty logging to see what's going on in the shared cache (patch).

  • "total" is the number of used AND free regions multiplied by the amount of pixels per region (closest to actual GPU memory usage).
  • "allocated" percentages are the pixels in allocated entries vs total gpu memory allocated.
  • "wasted" percentages are the amount of pixels in allocated slots that are wasted due to rounding slab sizes to the next power of two.
  • "buckets" shows a distribution of the texture regions depending on how full they are (the first bucket is the number of regions less than 12.5% full, the next one is the number of regions 12.5% to 25% full, and so on; the last one is the number of regions more than 87.5% full).
  • "region sizes" are the number of regions of each slab size (all non-square regions lumped in the same count).
youtube front page
====================
- color 8 linear: regions: 27 + 2 empty, total 7602176px, allocated: 82.421875%, wasted: 46.48007%, buckets: [1, 1, 1, 0, 1, 0, 1, 22], region sizes: 16: 1, 32: 1, 64: 1, 128: 1, 256: 0, 512: 0, non-square: 23
- color 8 nearest: regions: 0 + 0 empty, total 0px, allocated: NaN%, wasted: NaN%, buckets: [0, 0, 0, 0, 0, 0, 0, 0], region sizes: 16: 0, 32: 0, 64: 0, 128: 0, 256: 0, 512: 0, non-square: 0
- alpha 8 linear: regions: 1 + 2 empty, total 786432px, allocated: 0.13020833%, wasted: 12.109375%, buckets: [1, 0, 0, 0, 0, 0, 0, 0], region sizes: 16: 0, 32: 1, 64: 0, 128: 0, 256: 0, 512: 0, non-square: 0
- alpha 16 linear: regions: 0 + 0 empty, total 0px, allocated: NaN%, wasted: NaN%, buckets: [0, 0, 0, 0, 0, 0, 0, 0], region sizes: 16: 0, 32: 0, 64: 0, 128: 0, 256: 0, 512: 0, non-square: 0


wikipedia coffee (top of the page)
====================
- color 8 linear: regions: 10 + 0 empty, total 2621440px, allocated: 54.296875%, wasted: 68.59066%, buckets: [3, 0, 1, 0, 2, 0, 0, 4], region sizes: 16: 2, 32: 1, 64: 1, 128: 1, 256: 2, 512: 2, non-square: 1
- color 8 nearest: regions: 0 + 0 empty, total 0px, allocated: NaN%, wasted: NaN%, buckets: [0, 0, 0, 0, 0, 0, 0, 0], region sizes: 16: 0, 32: 0, 64: 0, 128: 0, 256: 0, 512: 0, non-square: 0
- alpha 8 linear: regions: 0 + 2 empty, total 524288px, allocated: 0.0%, wasted: NaN%, buckets: [0, 0, 0, 0, 0, 0, 0, 0], region sizes: 16: 0, 32: 0, 64: 0, 128: 0, 256: 0, 512: 0, non-square: 0
- alpha 16 linear: regions: 0 + 0 empty, total 0px, allocated: NaN%, wasted: NaN%, buckets: [0, 0, 0, 0, 0, 0, 0, 0], region sizes: 16: 0, 32: 0, 64: 0, 128: 0, 256: 0, 512: 0, non-square: 0


about:newtab
====================
- color 8 linear: regions: 7 + 3 empty, total 2621440px, allocated: 26.65039%, wasted: 54.420403%, buckets: [3, 0, 0, 1, 1, 0, 1, 1], region sizes: 16: 2, 32: 1, 64: 1, 128: 1, 256: 2, 512: 0, non-square: 0
- color 8 nearest: regions: 0 + 0 empty, total 0px, allocated: NaN%, wasted: NaN%, buckets: [0, 0, 0, 0, 0, 0, 0, 0], region sizes: 16: 0, 32: 0, 64: 0, 128: 0, 256: 0, 512: 0, non-square: 0
- alpha 8 linear: regions: 0 + 2 empty, total 524288px, allocated: 0.0%, wasted: NaN%, buckets: [0, 0, 0, 0, 0, 0, 0, 0], region sizes: 16: 0, 32: 0, 64: 0, 128: 0, 256: 0, 512: 0, non-square: 0
- alpha 16 linear: regions: 0 + 0 empty, total 0px, allocated: NaN%, wasted: NaN%, buckets: [0, 0, 0, 0, 0, 0, 0, 0], region sizes: 16: 0, 32: 0, 64: 0, 128: 0, 256: 0, 512: 0, non-square: 0

cnn.com
====================
- color 8 linear: regions: 15 + 5 empty, total 5242880px, allocated: 47.910156%, wasted: 58.60028%, buckets: [1, 2, 2, 0, 1, 1, 2, 6], region sizes: 16: 2, 32: 1, 64: 1, 128: 1, 256: 2, 512: 2, non-square: 6
- color 8 nearest: regions: 0 + 1 empty, total 262144px, allocated: 0.0%, wasted: NaN%, buckets: [0, 0, 0, 0, 0, 0, 0, 0], region sizes: 16: 0, 32: 0, 64: 0, 128: 0, 256: 0, 512: 0, non-square: 0
- alpha 8 linear: regions: 1 + 2 empty, total 786432px, allocated: 8.333333%, wasted: 94.174194%, buckets: [0, 0, 1, 0, 0, 0, 0, 0], region sizes: 16: 0, 32: 0, 64: 0, 128: 0, 256: 1, 512: 0, non-square: 0
- alpha 16 linear: regions: 0 + 0 empty, total 0px, allocated: NaN%, wasted: NaN%, buckets: [0, 0, 0, 0, 0, 0, 0, 0], region sizes: 16: 0, 32: 0, 64: 0, 128: 0, 256: 0, 512: 0, non-square: 0

Take with a pinch of salt, I may have poorly counted the allocations and waste (don't hesitate to double-check the patch).
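For anyone double-checking the numbers, here is a small sketch of how the "total", "buckets" and percentage fields can be computed. It is not the logging patch itself; the function names are made up, and the 512x512 region size is inferred from the logged totals.

```rust
/// Pixels per region; the logged totals are consistent with 512x512 regions.
pub const REGION_PIXELS: u64 = 512 * 512;

/// "total": used and empty regions multiplied by the pixels per region.
pub fn total_pixels(used_regions: u64, empty_regions: u64) -> u64 {
    (used_regions + empty_regions) * REGION_PIXELS
}

/// "buckets": histogram of used regions by fullness, in eight 12.5% steps.
/// Each `fullness` value is a fraction between 0.0 and 1.0.
pub fn fullness_buckets(fullness: &[f64]) -> [u32; 8] {
    let mut buckets = [0u32; 8];
    for &f in fullness {
        // A completely full region (1.0) goes into the last bucket.
        let index = ((f.clamp(0.0, 1.0) * 8.0) as usize).min(7);
        buckets[index] += 1;
    }
    buckets
}

/// Percentage of `part` in `whole`. With floating point, 0 out of 0 is NaN.
pub fn percent(part: u64, whole: u64) -> f32 {
    part as f32 / whole as f32 * 100.0
}
```

`total_pixels(27, 2)` gives 7602176, matching the youtube "color 8 linear" line, and `percent(0, 0)` is NaN, which is consistent with the NaN entries logged for caches that have nothing allocated.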

Some observations:

  • I expected the slab allocation strategy with powers-of-two slab sizes to waste memory, but not nearly as much as these quick measurements suggest.
  • The rectangular pages appear to pay off on some sites, like youtube with lots of large images.
  • Regions with small slab sizes tend to have a lot of free space. In other words we have large regions with just a few items and using smaller regions would significantly reduce the memory cost.
  • Regions with large slab sizes tend to have less free space, but that's a consequence of having few slots per region and they waste a lot of pixels.
  • It's worth adjusting the alpha and color arrays separately. The alpha one could grow one layer at a time instead of 4, for example.
  • We don't run into nearest sampled and alpha16 content much, but I suspect that when we do we'll allocate 4 texture layers just for that one item so it's probably worth growing them in small increments.

It would be nice to reduce the amount of waste per allocation. I suspect that a best-fit shelf allocator with power-of-two shelf sizes would do OK with fragmentation and at least remove a bunch of the waste on the x-axis.
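A minimal sketch of what a best-fit shelf allocator with power-of-two shelf heights could look like. This is only an illustration: it has no deallocation, the minimum shelf height of 16 is an arbitrary assumption, and `ShelfAllocator` is not an existing WebRender type.

```rust
/// One horizontal shelf in the atlas.
struct Shelf {
    y: u32,
    height: u32,
    /// X coordinate where the next item on this shelf will be placed.
    next_x: u32,
}

/// A minimal shelf allocator without deallocation.
pub struct ShelfAllocator {
    width: u32,
    height: u32,
    /// Y coordinate where the next new shelf would start.
    next_y: u32,
    shelves: Vec<Shelf>,
}

impl ShelfAllocator {
    pub fn new(width: u32, height: u32) -> Self {
        ShelfAllocator { width, height, next_y: 0, shelves: Vec::new() }
    }

    /// Returns the (x, y) origin of the allocated rectangle, or None if it does not fit.
    pub fn allocate(&mut self, w: u32, h: u32) -> Option<(u32, u32)> {
        if w == 0 || h == 0 || w > self.width || h > self.height {
            return None;
        }
        // Shelf heights are rounded up to a power of two (minimum 16),
        // so waste only occurs on the y-axis; items are packed tightly in x.
        let shelf_height = h.max(16).next_power_of_two();
        let width = self.width;

        // Best fit: the shortest existing shelf that is tall enough and has room.
        let best = self
            .shelves
            .iter_mut()
            .filter(|s| s.height >= shelf_height && width - s.next_x >= w)
            .min_by_key(|s| s.height);
        if let Some(shelf) = best {
            let origin = (shelf.next_x, shelf.y);
            shelf.next_x += w;
            return Some(origin);
        }

        // Otherwise open a new shelf if there is enough vertical space left.
        if self.height - self.next_y < shelf_height {
            return None;
        }
        let y = self.next_y;
        self.next_y += shelf_height;
        self.shelves.push(Shelf { y, height: shelf_height, next_x: w });
        Some((0, y))
    }
}
```

For example, in a 512x512 atlas a 100x20 item opens a 32-pixel shelf at y=0, a 50x60 item opens a 64-pixel shelf at y=32, and a following 30x30 item is placed at (100, 0) on the first shelf rather than on the taller one.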

Without changing the allocator type, though, I think that we would benefit in the short term from having 256x256 regions for most slab sizes and 512x512 ones for larger items.

Very relevant comment from Jamie:

one thing we need to be careful about with texture cache changes is that the X position can be important to whether uploads can be DMAd or not. If we try to squeeze more thin items in horizontally we might hit a slow path during uploads on some platforms. I think we might already get this wrong on Mac: https://bugzilla.mozilla.org/show_bug.cgi?id=1603783#c2

Means we'll have a hard time having both nicely packed atlas allocation schemes and fast uploads, unless we decouple the upload from the packing by uploading conservatively aligned data and then issuing GPU-side copies into a tightly packed shared cache.
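As a sketch of the "conservatively aligned" half of that idea, the staging data could pad each row to a fixed byte alignment before a GPU-side copy moves it into its tightly packed slot. The helpers below are hypothetical, and the alignment that actually keeps uploads on the fast path depends on the platform and driver; the 256 used in the example is purely illustrative.

```rust
/// Round `value` up to the next multiple of `alignment` (which must be non-zero).
pub fn align_up(value: u32, alignment: u32) -> u32 {
    assert!(alignment > 0);
    (value + alignment - 1) / alignment * alignment
}

/// Bytes per row for a staging upload whose rows are padded to `row_alignment` bytes.
pub fn staging_stride(width_px: u32, bytes_per_pixel: u32, row_alignment: u32) -> u32 {
    align_up(width_px * bytes_per_pixel, row_alignment)
}
```

With a 256-byte row alignment, a 30-pixel-wide RGBA8 item (120 bytes per row) would be uploaded with a 256-byte stride, and a 100-pixel-wide one (400 bytes) with a 512-byte stride, regardless of where the item ends up in the shared cache.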

As an initial step to reduce the reallocation churn a bit. This texture array grows up to 12 layers for any trivial page and goes up to 40+ on many sites (like the youtube front page) so it's far from enough but it's a start. Simple popups like Help > About Nightly don't need more than 4 layers, though.

Assignee: nobody → nical.bugzilla
Status: NEW → ASSIGNED

Also make sure that the pressure factor never gets to zero.

These new constants aren't definitive in any way, though I think that they are a bit more reasonable.

Depends on D67366

Doesn't make for a stellar API but texture_cache.rs has a lot of code and I had a hard time wrapping my head around the scattered parts of picture-cache specific code.

In addition (and more importantly) texture arrays for each tile size can be allocated on demand and do not need to be created when initializing the texture cache. This avoids allocating and deallocating the mispredicted picture texture size when the initial window size is garbage.

Depends on D67367

It was previously set on mac only due to driver mischiefs, however the cost of growing texture arrays becomes high with large layer counts, which capping the layer count to 32 everywhere helps mitigate at the expense of batch breaks.

Depends on D67368

Pushed by nsilva@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a781ce6dd020
Allocate rgb8 linear shared texture cache regions four at a time. r=gw
https://hg.mozilla.org/integration/autoland/rev/86167a71e706
Put a bit more texture cache eviction pressure. r=gw
https://hg.mozilla.org/integration/autoland/rev/955e7790f963
Move picture cache texture array logic into its own struct. r=gw
https://hg.mozilla.org/integration/autoland/rev/8f4b47079a44
Cap the maximum texture array layer count to 32 on all platforms. r=gw
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Keywords: leave-open
Pushed by nsilva@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/622591504941
Allocate 16 layers for rgba8 linear textures in the cache and allocate new arrays instead of growing it. r=gw
Depends on: 1624272
Blocks: wr-intel
Blocks: 1616646

I suspect the number of texture array layers impacts some arithmetic related to how the GL implementation deals with texture coordinates. These reftests started failing more consistently (though not 100% of the time) with the patch that sets the number of rgba8 linear texture array layers to 16.

Pushed by nsilva@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6e0471fb1731
Allocate 16 layers for rgba8 linear textures in the cache and allocate new arrays instead of growing it. r=gw
https://hg.mozilla.org/integration/autoland/rev/b2eed1f2b242
Adjust reftest expectations. r=jrmuizel
Blocks: wr-perf-p1
Pushed by nsilva@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8eb82d6066cc
More reftest adjustments. r=jrmuizel
https://hg.mozilla.org/integration/autoland/rev/01b9bdb6e5ef
Yet more reftest adjustments. r=jrmuizel
Depends on: 1624565
Depends on: 1624640
Depends on: 1624644

The texture cache code modified in 1624644 can't affect scene building. It's much more likely that the regression comes from bug 1616412 (unfortunately).

Flags: needinfo?(mikokm)
Regressions: 1616412
No longer regressions: 1624250

Oh snap I updated the wrong bug.

Flags: needinfo?(mikokm)
No longer regressions: 1616412
Pushed by nsilva@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/7b36a0458e50
Remove accidental print statements. r=jrmuizel
Status: REOPENED → RESOLVED
Closed: 4 months ago → 3 months ago
Keywords: leave-open
Resolution: --- → FIXED
Regressions: 1632795