Closed Bug 1559423 Opened 6 years ago Closed 4 years ago

Investigate pathfinder CPU tiling performance

Categories

(Core :: Graphics: WebRender, task, P3)

Tracking

RESOLVED FIXED

People

(Reporter: nical, Assigned: nical)

References

Details

Pathfinder's CPU-side tiling is where I think there is the most room for performance improvement.

The current algorithm is a sweep line with some "local" SIMD optimizations in places (by local I mean that some specific arithmetic functions have SIMD versions, rather than the algorithm operating on batches). The sweep line can optionally run in parallel with rayon, using one job per path.
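
To illustrate what I mean by "local" SIMD, here is a hypothetical sketch (not pathfinder's actual code): a single arithmetic helper, such as a point transform, gets an SSE version, but the sweep line still calls it one point at a time instead of being restructured to operate on batches of points.

    // Hypothetical sketch: one helper has a SIMD version, but callers still
    // invoke it point by point rather than in batches.
    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::*;

    #[derive(Clone, Copy)]
    pub struct Transform2D {
        pub m: [f32; 4], // row-major 2x2 matrix: [m00, m01, m10, m11]
        pub t: [f32; 2], // translation: [tx, ty]
    }

    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "sse3")]
    pub unsafe fn transform_point(xf: &Transform2D, x: f32, y: f32) -> (f32, f32) {
        let mat = _mm_loadu_ps(xf.m.as_ptr()); // lanes: [m00, m01, m10, m11]
        let xyxy = _mm_set_ps(y, x, y, x);     // lanes: [x, y, x, y]
        let prod = _mm_mul_ps(mat, xyxy);      // [m00*x, m01*y, m10*x, m11*y]
        let sums = _mm_hadd_ps(prod, prod);    // lane 0: m00*x + m01*y, lane 1: m10*x + m11*y
        let mut out = [0.0f32; 4];
        _mm_storeu_ps(out.as_mut_ptr(), sums);
        (out[0] + xf.t[0], out[1] + xf.t[1])
    }
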
Pathfinder's demo pipelines the CPU and GPU parts by starting GPU work before the CPU is done. Here I am ignoring this because WebRender is not set up to take advantage of it and already pipelines CPU and GPU processing in a different way. So I am looking at the sum of the timings reported for each CPU stage.

There are different options to improve upon what we have:

  • Improve the current algorithm.
  • Use a different algorithm on the CPU (I will come back to that).
  • Use a different algorithm on the GPU (a good example is the tiling implementation in piet-metal, which is done in a compute shader with shared memory, atomics and wave intrinsics). This one is worth keeping in mind, but I would like to start with a good solution that targets all webrender users, and then maybe try more advanced solutions targeting high-end configurations if need be.

Some initial notes:

  • Pathfinder converts strokes to fills on the CPU, and I am not sure whether this is done at load time or each frame. It doesn't stand out in the profiler, but it would be useful to see how much of the CPU time comes from this conversion.
  • As far as I can tell, pathfinder's parallel option with rayon only brings performance improvements on hardware with two physical cores (4 logical) or more. On a rather high-end desktop with 12 logical cores I get 5.4 ms for the tiling of the ghostscript tiger at the demo app's default resolution (1067x800) in multi-threaded mode, and about 5.8 ms in single-threaded mode. Interestingly, I get about 5.5 ms multi-threaded and 7.2 ms single-threaded on my much slower 2018 xps13 with 4 physical cores. According to perf I get twice as many instructions per cycle in single-threaded mode, so the current setup doesn't quite scale nicely with the number of threads.
  • Because our users tend to not have a lot of cores, and because Gecko already has a good amount of work to feed those cores, I am more interested in fast single-threaded performance than in fast multi-threaded performance, even if the latter ends up faster than the former. That said, once we have established a good single-threaded baseline, having the option to speed things up with multi-threading is a plus.
  • In order to make parallel processing easier, pathfinder currently does not cull mask tiles submitted to the stencil pass; it only culls the alpha tiles that read from them. I tried a naïve implementation that tiles paths front-to-back in single-threaded mode and reads the CPU-side z-buffer (*) to avoid generating edges occluded by solid tiles (see the sketch below). This culled about 500 tiles out of 5700, which is not bad, but (due to the rather hacky way I plumbed it in) the extra overhead of reading the z-buffer in a few places was more or less equivalent to the work avoided. Performance didn't move with this experiment, but I think that with a bit more work this optimization could yield improvements.

(*): The "z-buffer" here is not the GPU's z-buffer but an array of atomic integers storing, for each tile, the highest z-index among the paths that fully cover that tile.
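
As a rough sketch of how that kind of CPU-side z-buffer can be used for culling (hypothetical names and layout, not pathfinder's actual code):

    use std::sync::atomic::{AtomicU32, Ordering};

    // Hypothetical sketch of a CPU-side tile "z-buffer": one atomic integer
    // per tile holding the highest z-index of a path that fully covers it.
    pub struct TileZBuffer {
        width: usize,          // number of tile columns
        tiles: Vec<AtomicU32>, // width * height entries, initially 0
    }

    impl TileZBuffer {
        pub fn new(width: usize, height: usize) -> Self {
            TileZBuffer {
                width,
                tiles: (0..width * height).map(|_| AtomicU32::new(0)).collect(),
            }
        }

        // Record that the path with z-index `z` produces an opaque solid
        // tile at (tx, ty).
        pub fn mark_solid(&self, tx: usize, ty: usize, z: u32) {
            self.tiles[ty * self.width + tx].fetch_max(z, Ordering::Relaxed);
        }

        // When tiling front-to-back, a tile of the path with z-index `z` is
        // occluded if a path above it already fully covers the tile, in
        // which case its mask edges don't need to be generated at all.
        pub fn is_occluded(&self, tx: usize, ty: usize, z: u32) -> bool {
            self.tiles[ty * self.width + tx].load(Ordering::Relaxed) > z
        }
    }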

Trying out a different tiling algorithm

I have started prototyping another algorithm which is not based on a full sweep-line pass. It runs as follows:

  • The first pass applies a 2D transformation, then decomposes each path into either line segments or monotonic quadratic bézier segments, and assigns these segments to all of the tile rows they intersect (a sketch of this step follows the list).
  • In the second pass, for each row, the segments are sorted from left to right, then the row's tiles are traversed from left to right while maintaining an active list of the edges that affect the current tile (a sort of horizontal per-row sweep line, sketched further below).
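
As an illustration of the kind of work the first pass does, here is a rough sketch (hypothetical types and names, not the prototype's actual code) that splits a quadratic bézier at its vertical extremum so that each piece is y-monotonic, then bins each piece into the tile rows it spans:

    #[derive(Clone, Copy)]
    pub struct Point { pub x: f32, pub y: f32 }

    #[derive(Clone, Copy)]
    pub struct QuadBezier { pub from: Point, pub ctrl: Point, pub to: Point }

    impl QuadBezier {
        // Split at parameter t with de Casteljau's algorithm.
        pub fn split(&self, t: f32) -> (QuadBezier, QuadBezier) {
            let lerp = |a: Point, b: Point| Point {
                x: a.x + (b.x - a.x) * t,
                y: a.y + (b.y - a.y) * t,
            };
            let ab = lerp(self.from, self.ctrl);
            let bc = lerp(self.ctrl, self.to);
            let mid = lerp(ab, bc);
            (QuadBezier { from: self.from, ctrl: ab, to: mid },
             QuadBezier { from: mid, ctrl: bc, to: self.to })
        }

        // Parameter of the vertical extremum if it lies strictly inside
        // (0, 1), from solving dy/dt = 0 for
        // y(t) = (1-t)^2*y0 + 2t(1-t)*y1 + t^2*y2.
        pub fn y_extremum_t(&self) -> Option<f32> {
            let denom = self.from.y - 2.0 * self.ctrl.y + self.to.y;
            if denom.abs() < 1e-6 {
                return None;
            }
            let t = (self.from.y - self.ctrl.y) / denom;
            if t > 0.0 && t < 1.0 { Some(t) } else { None }
        }
    }

    // Bin a y-monotonic segment into every tile row it intersects. Because
    // the segment is y-monotonic, its endpoints bound its y range.
    fn assign_to_rows(seg: QuadBezier, tile_h: f32, rows: &mut Vec<Vec<QuadBezier>>) {
        let (y_min, y_max) = (seg.from.y.min(seg.to.y), seg.from.y.max(seg.to.y));
        let first = (y_min / tile_h).floor().max(0.0) as usize;
        let last = ((y_max / tile_h).floor().max(0.0) as usize)
            .min(rows.len().saturating_sub(1));
        for row in first..=last {
            rows[row].push(seg);
        }
    }

    // Pass 1 for one segment: make it y-monotonic, then bin the pieces.
    pub fn bin_segment(seg: QuadBezier, tile_h: f32, rows: &mut Vec<Vec<QuadBezier>>) {
        match seg.y_extremum_t() {
            Some(t) => {
                let (a, b) = seg.split(t);
                assign_to_rows(a, tile_h, rows);
                assign_to_rows(b, tile_h, rows);
            }
            None => assign_to_rows(seg, tile_h, rows),
        }
    }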

Paths are tiled front-to-back and a CPU-side z-buffer is used for culling, similarly to pathfinder's current tiler.
The output of the tiler is more or less in the same format as pathfinder's (a vector of solid tiles, a vector of alpha tiles and a vector of mask edges).
There is no multi-threading mode so far, but each row is processed independently in the second pass with zero shared mutable state, so it would be trivial to parallelize if need be.
The tiler can be configured to generate either only line segments or a mix of quadratic bézier and line segments. A padding can be configured to give tiles an overlap region (pathfinder currently does not support that, but it's useful enough that I wanted to see whether we could have it).
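
And a minimal sketch of the second pass for a single row, reusing the hypothetical types from the previous sketch (the tile classification and mask edge output that the real tiler performs are omitted):

    // Sketch of the second pass for one row: sort the row's segments by
    // their leftmost x, then walk the row's tiles from left to right while
    // maintaining the set of segments overlapping the current tile.
    pub fn process_row(mut segments: Vec<QuadBezier>, tile_w: f32, num_tiles: usize) {
        let min_x = |s: &QuadBezier| s.from.x.min(s.ctrl.x).min(s.to.x);
        let max_x = |s: &QuadBezier| s.from.x.max(s.ctrl.x).max(s.to.x);
        segments.sort_by(|a, b| min_x(a).partial_cmp(&min_x(b)).unwrap());

        let mut active: Vec<QuadBezier> = Vec::new();
        let mut next = 0; // first segment that hasn't entered the active list yet

        for tile_x in 0..num_tiles {
            let tile_left = tile_x as f32 * tile_w;
            let tile_right = tile_left + tile_w;

            // Activate segments whose leftmost point falls before the right
            // edge of this tile.
            while next < segments.len() && min_x(&segments[next]) < tile_right {
                active.push(segments[next]);
                next += 1;
            }
            // Retire segments that end to the left of this tile.
            active.retain(|s| max_x(s) >= tile_left);

            // `active` now holds every edge affecting this tile; this is
            // where the real tiler would classify the tile as solid, empty
            // or masked and emit mask edges (omitted here).
        }
    }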

On the xps13, so far I get 3.5 ms with cold caches and no pre-allocation (start the program, tile once, exit) when tiling the ghostscript tiger in a configuration similar to the pathfinder demo's. If I run the same code in a loop I get around 2.7 ms total.

Two thirds of the time is spent in the first pass; most of that time is in either the quadratic bézier approximation or the linear approximation, depending on the mode.
So that's a pretty encouraging starting point, since none of this code is optimized at all (no SIMD, no reuse of vector allocations, quadratic béziers are degree-elevated to cubics when performing intersection checks with row/tile boundaries, etc.). I think we can easily shave a millisecond or two off this benchmark.
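
For illustration, here is what a uniform linear approximation of a quadratic bézier can look like (a sketch under my own assumptions, reusing the types above; the prototype's actual approximation code may differ). With n uniform pieces the chord deviation is bounded by |P0 - 2*P1 + P2| / (4*n^2), which gives the segment count for a given tolerance:

    // Sketch of a uniform "flattening" of a quadratic bézier into line
    // segments, deriving the segment count n from the tolerance.
    pub fn flatten_quad(q: &QuadBezier, tolerance: f32, out: &mut Vec<Point>) {
        let dx = q.from.x - 2.0 * q.ctrl.x + q.to.x;
        let dy = q.from.y - 2.0 * q.ctrl.y + q.to.y;
        let deviation = (dx * dx + dy * dy).sqrt();
        let n = (0.5 * (deviation / tolerance).sqrt()).ceil().max(1.0) as usize;

        out.push(q.from);
        for i in 1..n {
            let t = i as f32 / n as f32;
            let mt = 1.0 - t;
            // Evaluate B(t) = (1-t)^2*P0 + 2t(1-t)*P1 + t^2*P2 directly.
            out.push(Point {
                x: mt * mt * q.from.x + 2.0 * t * mt * q.ctrl.x + t * t * q.to.x,
                y: mt * mt * q.from.y + 2.0 * t * mt * q.ctrl.y + t * t * q.to.y,
            });
        }
        out.push(q.to);
    }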

Take these numbers with a grain of salt since there may be bugs.

Blocks: pathfinder
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED