Open Bug 1628530 Opened 4 years ago Updated 2 years ago

[meta] Scene/frame building performance improvements

Categories

(Core :: Graphics: WebRender, enhancement)

People

(Reporter: gw, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: meta)

Goal:

Make significant improvements to the CPU performance of WR, primarily in frame/scene building.

Background:

There are, roughly speaking, four areas of WR that we care about for performance:

  1. Scene building
  2. Frame building
  3. Rendering
  4. GPU

For (3), we have many ideas and options for improving it. For example:

  • Unifying shaders to improve batching performance (both CPU batch generation and driver time)
  • More efficient management of resources (e.g. UBOs instead of vertex textures, SSBO on ANGLE)
  • Simpler batching code (e.g. simplify segment support to nine-patches)
  • Texture atlas staging updates strategy
  • Mapping a pool of vertex buffers for batching to write instances directly into

For (4), we already have quite good performance in most cases, especially with picture caching. We can:

  • Fix cases where picture caching doesn't work as well as it could
  • Make use of more render task caching (e.g. box-shadows, gradients)
  • Simplify shaders (e.g. remove perspective requirements from most shader kinds)
  • Continue feature work to support native compositors (e.g. webgl into a compositor surface)

Improving performance for (1) and (2) is more interesting, and the topic of this meta-bug.

For (1) and (2), there is a range of micro-optimizations we can (and should) make to improve cache locality, reduce memory allocations, etc., but these will only improve things by a small percentage. There are, however, a number of optimizations we can make to how we process the scene and frame builds that have the potential to reduce the amount of work we need to do per frame. These changes could realize much larger performance gains than micro-optimizations.

Doing too much (redundant) work:

Frame building, and to a lesser extent scene building, do a lot of work per-frame that is mostly redundant (either the same result most frames, or done per-primitive rather than at a coarser granularity).

Some examples:

(1) Segment building is done during frame building. It is cached for each subsequent APZ frame, but still redone every time a new display list is received.

(2) Per-primitive culling. We check visibility of each primitive. For some cases, we need the per-primitive results (e.g. to work out an allocation size for an off-screen target), but for most purposes these are not required.

(3) Clip chain building. Clip chains are rebuilt on every frame build, because we don't currently know whether the result has changed: any spatial node that a clip in the chain is attached to may have moved. Most of the time, though, the result is the same.

How to fix:

There are two things that prevent us making a heap of optimizations to avoid doing all this redundant work:

(1) We currently store primitives in render order (by storing primitive instances directly inside the picture primitives).

(2) We don't have an easy way to correlate primitives in a new display list with existing primitives in the frame builder.

Solving those two issues would unlock a large number of optimizations, which could drastically reduce the per-frame work we need to do.

Roughly speaking, the plan is:

(a) Decouple storage of primitive instances from render order. We can do this by storing primitive instances in a custom storage container (similar to a freelist) and having the picture tree refer to these primitive instances by index / handle.

(b) Add a hashing method that allows us to correlate and remap new primitive instances in a new display list with existing primitive instances in the custom storage container (we can correlate by prim interning id + spatial id).
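
As a rough illustration of (a) and (b), here is a minimal sketch of a freelist-style store with generational handles, plus a map that correlates the primitives of a new display list with existing instances. All names here (`PrimStore`, `PrimHandle`, `PrimKey`, `apply_display_list`, `clip_chain_valid`) are hypothetical and are not WebRender's actual types. It also assumes that interning id + spatial id is a sufficient key, and that duplicates of a key within one display list can share an instance.

```rust
use std::collections::HashMap;

/// Stable handle; the picture tree would store these instead of owning
/// primitive instances directly. (Hypothetical type.)
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct PrimHandle {
    index: u32,
    generation: u32,
}

/// Hypothetical correlation key: prim interning id + spatial id.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct PrimKey {
    pub intern_id: u64,
    pub spatial_id: u32,
}

/// Placeholder for state we want to persist across display list updates.
#[derive(Debug, Default)]
pub struct PrimInstance {
    pub clip_chain_valid: bool,
}

struct Slot {
    generation: u32,
    value: Option<PrimInstance>,
}

#[derive(Default)]
pub struct PrimStore {
    slots: Vec<Slot>,
    free: Vec<u32>,
    by_key: HashMap<PrimKey, PrimHandle>,
}

impl PrimStore {
    /// Reuse a free slot if there is one, otherwise grow the storage.
    fn alloc(&mut self, value: PrimInstance) -> PrimHandle {
        if let Some(index) = self.free.pop() {
            let slot = &mut self.slots[index as usize];
            slot.value = Some(value);
            PrimHandle { index, generation: slot.generation }
        } else {
            self.slots.push(Slot { generation: 0, value: Some(value) });
            PrimHandle { index: (self.slots.len() - 1) as u32, generation: 0 }
        }
    }

    /// Returns None for stale handles (the slot was freed since).
    pub fn get(&self, h: PrimHandle) -> Option<&PrimInstance> {
        let slot = self.slots.get(h.index as usize)?;
        if slot.generation == h.generation { slot.value.as_ref() } else { None }
    }

    pub fn get_mut(&mut self, h: PrimHandle) -> Option<&mut PrimInstance> {
        let slot = self.slots.get_mut(h.index as usize)?;
        if slot.generation == h.generation { slot.value.as_mut() } else { None }
    }

    /// Remap a new display list (given as keys in render order) onto the
    /// store. Existing instances keep their handle and cached state; keys
    /// that disappeared are freed. Returns the handles in render order.
    pub fn apply_display_list(&mut self, keys: &[PrimKey]) -> Vec<PrimHandle> {
        let mut old = std::mem::take(&mut self.by_key);
        let mut handles = Vec::with_capacity(keys.len());
        for &key in keys {
            // A key seen earlier in this list shares its instance (assumption).
            let existing = self.by_key.get(&key).copied();
            let handle = match existing.or_else(|| old.remove(&key)) {
                Some(h) => h,
                None => self.alloc(PrimInstance::default()),
            };
            self.by_key.insert(key, handle);
            handles.push(handle);
        }
        // Whatever is left in `old` is not in the new display list.
        for (_, h) in old {
            let slot = &mut self.slots[h.index as usize];
            slot.value = None;
            slot.generation += 1;
            self.free.push(h.index);
        }
        handles
    }
}
```

The returned `Vec<PrimHandle>` stands in for the render order held by the picture tree. The instances themselves stay put, which is what lets cached per-instance state survive a display list update.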

What this gives us:

(1) Since primitive instances can now be correlated (persisted) between display list updates, we can cache / store relevant information related to the primitive instance (e.g. cached clip chain state, spatial culling arrangement, (child) picture dependencies and tile assignments).

(2) We can cull at a much coarser granularity (most likely based on the per-tile assignments / spatial tree information attached to each primitive instance).

(3) We can store index buffers of primitive instances (and clip chains) that need to be updated when a particular dependency changes (e.g. the value of the positioning spatial node for a set of primitives).

(4) Process parts of the frame build in smaller, tight loops (i.e. data oriented style), since we have decoupled the draw ordering from the primitive (and clip chain) instances. For example, we could:

  • Update all clip chains that are out of date (dependent spatial node changed)
  • Re-assign prims to tiles that have moved spatial nodes
  • Coarse cull at per-tile level
  • (potentially parallel per virtual surface / tile):
    • Prepare prims on tiles that are dirty (doing per-primitive work as required)
    • Build batches for the dirty region of that tile
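
To make (3) and the first step of (4) a bit more concrete, here is a minimal sketch of per-spatial-node dependency lists driving a tight update loop over clip chains. The types are hypothetical, and the "clip chain state" is reduced to a single x offset purely for illustration; real clip chain evaluation is far more involved.

```rust
pub type SpatialNodeIndex = usize;
pub type ClipChainIndex = usize;

/// Hypothetical cached clip chain state, reduced to one coordinate.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct ClipChainState {
    pub spatial_node: SpatialNodeIndex,
    pub local_x: f32,
    pub world_x: f32,
}

pub struct FrameState {
    /// World offset of each spatial node (stand-in for a full transform).
    pub node_offsets: Vec<f32>,
    pub clip_chains: Vec<ClipChainState>,
    /// The "index buffer": for each spatial node, the clip chains whose
    /// cached result depends on it.
    pub dependents: Vec<Vec<ClipChainIndex>>,
}

impl FrameState {
    /// `chains` lists (spatial node, local x) for each clip chain.
    pub fn new(node_offsets: Vec<f32>, chains: &[(SpatialNodeIndex, f32)]) -> Self {
        let mut dependents = vec![Vec::new(); node_offsets.len()];
        let mut clip_chains = Vec::with_capacity(chains.len());
        for (index, &(node, local_x)) in chains.iter().enumerate() {
            dependents[node].push(index);
            clip_chains.push(ClipChainState {
                spatial_node: node,
                local_x,
                world_x: local_x + node_offsets[node],
            });
        }
        FrameState { node_offsets, clip_chains, dependents }
    }

    /// Apply the spatial nodes that moved this frame and refresh only the
    /// clip chains that depend on them. Returns how many were refreshed.
    pub fn update(&mut self, moved: &[(SpatialNodeIndex, f32)]) -> usize {
        let mut updated = 0;
        for &(node, offset) in moved {
            self.node_offsets[node] = offset;
            for &chain in &self.dependents[node] {
                let state = &mut self.clip_chains[chain];
                state.world_x = state.local_x + offset;
                updated += 1;
            }
        }
        updated
    }
}
```

The cost of `update` scales with the number of clip chains attached to moved spatial nodes rather than with the total number of clip chains. The same pattern would apply to re-assigning prims to tiles.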

Extra notes (reminders for writing up some extra detail / bits and pieces):

  • Sanitize clip API so that only one clip node is provided per clip-chain entry.
  • Introduce "clip set" - a collection of clips, so we don't need to walk clip-chain linked lists.
  • Remove Push/PopClipChain primitive instance (simplifies clip-chain/set caching).
  • Make box-shadow a normal primitive, instead of a clip type (performance win).
  • Simplify segment logic - use nine-patches, build once and persist.
  • Work out how shared_clips / clip chain stack levels can be simplified (gecko work?).
  • Need to work out how to handle segments for image tiling when simplifying segment building.
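
For the "clip set" note above, one possible shape is to flatten each clip-chain linked list once into a contiguous range of a shared array, so later passes iterate a slice instead of chasing parent links. This is only a sketch under that assumption; `ClipChainNode`, `ClipSet` and `ClipSetStore` are hypothetical names, and clips are represented by plain `u32` ids.

```rust
use std::ops::Range;

/// Hypothetical linked-list style clip chain node.
pub struct ClipChainNode {
    /// Id of the single clip held by this node.
    pub clip: u32,
    /// Index of the parent node in the node array, if any.
    pub parent: Option<usize>,
}

/// A clip set: a contiguous range in a shared flat array of clip ids.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct ClipSet {
    pub range: Range<usize>,
}

#[derive(Default)]
pub struct ClipSetStore {
    pub clips: Vec<u32>,
}

impl ClipSetStore {
    /// Walk the linked list once, starting at `leaf`, and append its clips
    /// (leaf first, root last) to the flat array.
    pub fn flatten(&mut self, nodes: &[ClipChainNode], leaf: usize) -> ClipSet {
        let start = self.clips.len();
        let mut current = Some(leaf);
        while let Some(index) = current {
            let node = &nodes[index];
            self.clips.push(node.clip);
            current = node.parent;
        }
        ClipSet { range: start..self.clips.len() }
    }

    /// All clips of a set as one slice; no pointer chasing.
    pub fn get(&self, set: &ClipSet) -> &[u32] {
        &self.clips[set.range.clone()]
    }
}
```

This sketch duplicates shared ancestors across sets, trading memory for linear iteration. Whether that trade-off is right would depend on how much clip chains share in practice.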
Assignee: nobody → gwatson
Depends on: 1628564

There are also potential wins that we could get from smarter spatial partitioning, by:

  • splitting the document into tiles of primitives attached to the same spatial node (clusters with proper spatial partitioning),
  • not doing CPU culling at the primitive level, only culling clusters,
  • managing the liveness of resources at the cluster level instead of per primitive: when the cluster is expired, clear all of its resources from the caches. We could make it so that resources within a cluster, like GPU cache handles, are contiguous.

The advantage of this scheme is that there are a lot of places in frame building where cost scales with the number of primitives (up to the point where simply having too many primitives makes us miss the frame budget), and stronger spatial clustering would decouple a good chunk of this cost from scene complexity.
Also, knowing that resources are grouped per cluster would help us arrange them more efficiently in the cache for uploads.
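
A minimal sketch of the clustering idea: group primitives by spatial node and by the tile containing their origin, keep one bounding rect per cluster, and cull whole clusters against the viewport. The names and the tile-by-origin rule are assumptions made for illustration. The sketch also assumes all rects are already in a common space, which glosses over the per-spatial-node transforms.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Rect {
    pub x0: f32,
    pub y0: f32,
    pub x1: f32,
    pub y1: f32,
}

impl Rect {
    pub fn intersects(&self, o: &Rect) -> bool {
        self.x0 < o.x1 && o.x0 < self.x1 && self.y0 < o.y1 && o.y0 < self.y1
    }

    pub fn union(&self, o: &Rect) -> Rect {
        Rect {
            x0: self.x0.min(o.x0),
            y0: self.y0.min(o.y0),
            x1: self.x1.max(o.x1),
            y1: self.y1.max(o.y1),
        }
    }
}

/// Hypothetical cluster: prims on one spatial node within one tile.
#[derive(Debug)]
pub struct Cluster {
    pub spatial_node: usize,
    pub bounds: Rect,
    /// Indices into the input primitive list.
    pub prims: Vec<usize>,
}

/// Group (spatial node, rect) primitives into clusters keyed by spatial
/// node and by the tile that contains the rect's origin.
pub fn build_clusters(prims: &[(usize, Rect)], tile_size: f32) -> Vec<Cluster> {
    let mut map: BTreeMap<(usize, i32, i32), Cluster> = BTreeMap::new();
    for (index, &(spatial_node, rect)) in prims.iter().enumerate() {
        let key = (
            spatial_node,
            (rect.x0 / tile_size).floor() as i32,
            (rect.y0 / tile_size).floor() as i32,
        );
        let cluster = map.entry(key).or_insert_with(|| Cluster {
            spatial_node,
            bounds: rect,
            prims: Vec::new(),
        });
        cluster.bounds = cluster.bounds.union(&rect);
        cluster.prims.push(index);
    }
    map.into_values().collect()
}

/// Cull at cluster granularity only: every prim of a visible cluster is
/// kept, with no per-primitive visibility test.
pub fn visible_prims(clusters: &[Cluster], viewport: &Rect) -> Vec<usize> {
    clusters
        .iter()
        .filter(|c| c.bounds.intersects(viewport))
        .flat_map(|c| c.prims.iter().copied())
        .collect()
}
```

With this shape the culling cost scales with the number of clusters, and a cluster is also a natural unit for expiring cached resources together.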

See also bug 1611153

Pushing this tiling scheme further, we could update the display list on a per-tile basis, making it trivial to do invalidation across scenes and making incremental updates very cheap.

Depends on: 1629672
Depends on: 1632389
Depends on: 1632409
Depends on: 1634243
Depends on: 1636645
Depends on: 1720555
Depends on: 1720624
Depends on: 1749380
Severity: normal → --
Keywords: meta
Summary: [metabug] Scene/frame building performance improvements → [meta] Scene/frame building performance improvements
Depends on: 1772049
Depends on: 1773899
Depends on: 1773905
Depends on: 1775188, 1775189
Depends on: 1775369
Assignee: gwatson → nobody