1628530 - [meta] Scene/frame building performance improvements

Glenn Watson [:gw]

Reporter

Description

4 years ago

Goal:

Make significant improvements to the CPU performance of WR, primarily in frame/scene building.

Background:

There are, roughly speaking, four areas of WR that we care about for performance:

(1) Scene building
(2) Frame building
(3) Rendering
(4) GPU

For (3), we have many ideas and options for improvement. For example:

Unifying shaders to improve batching performance (both CPU batch generation and driver time)
More efficient management of resources (e.g. UBOs instead of vertex textures, SSBO on ANGLE)
Simpler batching code (e.g. simplify segment support to nine-patches)
Texture atlas staging updates strategy
Mapping a pool of vertex buffers for batching to write instances directly into

For (4), we already have quite good performance in most cases, especially with picture caching. We can:

Fix cases where picture caching doesn't work as well as it could
Make use of more render task caching (e.g. box-shadows, gradients)
Simplify shaders (e.g. remove perspective requirements from most shader kinds)
Continue feature work to support native compositors (e.g. webgl into a compositor surface)

Improving performance for (1) and (2) is more interesting, and the topic of this meta-bug.

For (1) and (2), there is a range of micro-optimizations we can (and should) make to improve cache locality, reduce memory allocations and so on, but these will only improve things by a small percentage. However, there are also a number of changes we can make to how we process scene and frame builds that could reduce the amount of work we need to do per frame. These changes have the potential to realize much larger performance gains than micro-optimizations.

Doing too much (redundant) work:

Frame building, and to a lesser extent scene building, do a lot of work per frame that is mostly redundant (either the result is the same on most frames, or the work is done per-primitive rather than at a coarser granularity).

Some examples:

(1) Segment building is done during frame building. It is cached for each subsequent APZ frame, but still redone every time a new display list is received.

(2) Per-primitive culling. We check visibility of each primitive. For some cases, we need the per-primitive results (e.g. to work out an allocation size for an off-screen target), but for most purposes these are not required.

(3) Clip chain building. These get rebuilt on each frame build. Any clip in a chain may be attached to a spatial node that has moved, and we don't currently have a way to know whether that changed the result, even though most of the time the chain is the same as on the previous frame.

How to fix:

There are two things that prevent us making a heap of optimizations to avoid doing all this redundant work:

(1) We currently store primitives in render order (by storing primitive instances directly inside the picture primitives).

(2) We don't have an easy way to correlate primitives in a new display list with existing primitives in the frame builder.

Solving those two issues would unlock a large number of optimizations, which could drastically reduce the per-frame work we need to do.

Roughly speaking, the plan is:

(a) Decouple storage of primitive instances from render order. We can do this by storing primitive instances in a custom storage container (similar to a freelist) and having the picture tree refer to these primitive instances by index / handle.

(b) Add a hashing method that allows us to correlate and remap new primitive instances in a new display list with existing primitive instances in the custom storage container (we can correlate by prim interning id + spatial id).
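
As a rough illustration of (a) and (b), here is a minimal sketch of a freelist-style store that hands out stable handles and correlates instances across display lists by key. All names here (PrimKey, PrimHandle, PrimStore, intern, end_display_list) are hypothetical and are not existing WR types. The sketch also assumes that (interning id, spatial id) uniquely identifies an instance within one display list, which may not hold in practice, so a real key would probably need more information.

```rust
use std::collections::HashMap;

/// Hypothetical correlation key: prim interning id + spatial node id.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct PrimKey {
    pub intern_id: u64,
    pub spatial_id: u32,
}

/// Stable handle the picture tree would hold instead of the instance itself.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct PrimHandle {
    index: u32,
    generation: u32,
}

/// State that survives display list updates (illustrative field only).
#[derive(Default, Debug)]
pub struct PrimInstance {
    pub clip_chain_valid: bool,
}

struct Slot {
    generation: u32,
    key: Option<PrimKey>, // None while the slot sits on the free list
    instance: PrimInstance,
    seen: bool, // referenced by the display list currently being processed
}

#[derive(Default)]
pub struct PrimStore {
    slots: Vec<Slot>,
    free: Vec<u32>,
    by_key: HashMap<PrimKey, u32>,
}

impl PrimStore {
    /// Return the existing handle for `key`, or allocate a slot for it,
    /// reusing a freed slot when one is available.
    pub fn intern(&mut self, key: PrimKey) -> PrimHandle {
        if let Some(&index) = self.by_key.get(&key) {
            let slot = &mut self.slots[index as usize];
            slot.seen = true;
            return PrimHandle { index, generation: slot.generation };
        }
        let index = match self.free.pop() {
            Some(index) => {
                let slot = &mut self.slots[index as usize];
                slot.key = Some(key);
                slot.instance = PrimInstance::default();
                slot.seen = true;
                index
            }
            None => {
                self.slots.push(Slot {
                    generation: 0,
                    key: Some(key),
                    instance: PrimInstance::default(),
                    seen: true,
                });
                (self.slots.len() - 1) as u32
            }
        };
        self.by_key.insert(key, index);
        PrimHandle { index, generation: self.slots[index as usize].generation }
    }

    /// Call once a display list has been processed: frees every instance
    /// that the list did not reference, and resets the `seen` flags.
    pub fn end_display_list(&mut self) {
        for (i, slot) in self.slots.iter_mut().enumerate() {
            if let Some(key) = slot.key {
                if !slot.seen {
                    slot.key = None;
                    slot.generation += 1; // invalidates outstanding handles
                    self.by_key.remove(&key);
                    self.free.push(i as u32);
                }
            }
            slot.seen = false;
        }
    }

    /// Look up an instance; stale handles return None.
    pub fn get_mut(&mut self, h: PrimHandle) -> Option<&mut PrimInstance> {
        self.slots
            .get_mut(h.index as usize)
            .filter(|s| s.generation == h.generation && s.key.is_some())
            .map(|s| &mut s.instance)
    }
}
```

With something like this, a primitive that appears in two consecutive display lists gets the same handle both times, so anything cached on its PrimInstance carries over, while primitives that disappear have their slots recycled.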

What this gives us:

(1) Since primitive instances can now be correlated (persisted) between display list updates, we can cache / store relevant information related to the primitive instance (e.g. cached clip chain state, spatial culling arrangement, (child) picture dependencies and tile assignments).

(2) We can cull at a much coarser granularity (most likely based on the per-tile assignments / spatial tree information attached to each primitive instance).

(3) We can store index buffers of primitive instances (and clip chains) that need to be updated when a particular dependency changes (e.g. the value of the positioning spatial node for a set of primitives).

(4) Process parts of the frame build in smaller, tight loops (i.e. data oriented style), since we have decoupled the draw ordering from the primitive (and clip chain) instances. For example, we could:

Update all clip chains that are out of date (dependent spatial node changed)
Re-assign to tiles any prims whose spatial nodes have moved
Coarse cull at per-tile level
(potentially parallel per virtual surface / tile):
- Prepare prims on tiles that are dirty (doing per-primitive work as required)
- Build batches for the dirty region of that tile
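
To make the shape of those passes a little more concrete, here is a small sketch of dependency-driven invalidation followed by per-tile preparation. The types and names (FrameState, PrimState, invalidate, prepare_dirty) are made up for illustration, and the "rebuild" step is only a stand-in for the real per-primitive work. It also filters the flat primitive array by tile for simplicity, where a real implementation would more likely keep per-tile index lists.

```rust
use std::collections::{HashMap, HashSet};

pub type SpatialNodeId = u32;
pub type PrimIndex = usize;
pub type TileId = u32;

/// Cached per-primitive state (illustrative fields only).
pub struct PrimState {
    pub spatial_node: SpatialNodeId,
    pub tile: TileId,
    pub clip_chain_valid: bool,
}

pub struct FrameState {
    pub prims: Vec<PrimState>,
    /// Spatial node -> indices of the primitives whose cached state depends on it.
    pub dependents: HashMap<SpatialNodeId, Vec<PrimIndex>>,
}

impl FrameState {
    /// Build the dependency index lists once, up front.
    pub fn new(prims: Vec<PrimState>) -> Self {
        let mut dependents: HashMap<SpatialNodeId, Vec<PrimIndex>> = HashMap::new();
        for (i, p) in prims.iter().enumerate() {
            dependents.entry(p.spatial_node).or_default().push(i);
        }
        FrameState { prims, dependents }
    }

    /// Pass 1: touch only the primitives attached to spatial nodes that moved,
    /// and collect the set of tiles that became dirty as a result.
    pub fn invalidate(&mut self, moved: &[SpatialNodeId]) -> HashSet<TileId> {
        let mut dirty = HashSet::new();
        for node in moved {
            if let Some(indices) = self.dependents.get(node) {
                for &i in indices {
                    self.prims[i].clip_chain_valid = false;
                    dirty.insert(self.prims[i].tile);
                }
            }
        }
        dirty
    }

    /// Pass 2: do per-primitive work only for primitives on dirty tiles.
    /// Returns how many primitives were processed.
    pub fn prepare_dirty(&mut self, dirty: &HashSet<TileId>) -> usize {
        let mut processed = 0;
        for p in self.prims.iter_mut().filter(|p| dirty.contains(&p.tile)) {
            p.clip_chain_valid = true; // stand-in for rebuilding the clip chain
            processed += 1;
        }
        processed
    }
}
```

The point of the split is that a frame where one spatial node moves only pays for the primitives and tiles that depend on that node, rather than for every primitive in the scene.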

Extra notes (reminders for writing up some extra detail / bits and pieces):

Sanitize clip API so that only one clip node is provided per clip-chain entry.
Introduce "clip set" - a collection of clips, so we don't need to walk clip-chain linked lists.
Remove Push/PopClipChain primitive instance (simplifies clip-chain/set caching).
Make box-shadow a normal primitive, instead of a clip type (performance win).
Simplify segment logic - use nine-patches, build once and persist.
Work out how shared_clips / clip chain stack levels can be simplified (gecko work?).
Need to work out how to handle segments for image tiling when simplifying segment building.
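
For the "clip set" note, one possible shape (purely a sketch, with hypothetical names such as ClipChainNode, ClipSet and ClipSetStore that do not correspond to existing WR types) is to flatten each linked clip chain once into a contiguous range of a shared array, so that later passes iterate a slice instead of chasing parent links:

```rust
pub type ClipId = u32;

/// Linked-list style chain node: a clip plus an optional parent link
/// (an index into the same node array).
pub struct ClipChainNode {
    pub clip: ClipId,
    pub parent: Option<usize>,
}

/// A clip set: a contiguous range in a shared, flat clip array.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ClipSet {
    start: usize,
    len: usize,
}

#[derive(Default)]
pub struct ClipSetStore {
    clips: Vec<ClipId>,
}

impl ClipSetStore {
    /// Walk the chain from `leaf` to its root once and record the clips
    /// in a contiguous run, ordered leaf first.
    pub fn flatten(&mut self, nodes: &[ClipChainNode], leaf: usize) -> ClipSet {
        let start = self.clips.len();
        let mut cur = Some(leaf);
        while let Some(i) = cur {
            self.clips.push(nodes[i].clip);
            cur = nodes[i].parent;
        }
        ClipSet { start, len: self.clips.len() - start }
    }

    /// All clips in the set, as a slice that can be scanned in a tight loop.
    pub fn clips(&self, set: ClipSet) -> &[ClipId] {
        &self.clips[set.start..set.start + set.len]
    }
}
```

This trades some duplicated clip ids (shared ancestors are copied into each set) for cache-friendly iteration. Whether that trade is worthwhile here is an open question rather than something this sketch settles.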
