Gecko profile: https://perfht.ml/2RDnTwc
Looks like the Renderer thread is the bottleneck here, spending hundreds of milliseconds per frame on compositing. In particular, 1/3 of the thread time is spent in the init_fbos() call. This is unexpected: we should preserve the FBOs across frames and only create new ones when necessary.
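To illustrate what preserving FBOs across frames could look like, here is a minimal sketch of a cache keyed by the render target it wraps. It is not the actual WebRender code: `FboKey`, `FboId`, `FboCache`, and the `creations` counter are hypothetical names, and the real GL framebuffer creation is replaced by a plain counter so the example runs without a GL context.

```rust
use std::collections::HashMap;

/// Hypothetical key: everything that decides whether an existing FBO can be reused.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct FboKey {
    pub texture_id: u32,
    pub layer: u32,
    pub with_depth: bool,
}

/// Stand-in for a GL framebuffer object name.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct FboId(pub u32);

/// Cache that outlives a single frame, so FBOs are created once per target.
#[derive(Default)]
pub struct FboCache {
    fbos: HashMap<FboKey, FboId>,
    next_id: u32,
    /// Number of FBOs actually created; handy for checking reuse in a profile.
    pub creations: u32,
}

impl FboCache {
    /// Return the cached FBO for `key`, creating one only on a miss.
    pub fn get_or_create(&mut self, key: FboKey) -> FboId {
        if let Some(&id) = self.fbos.get(&key) {
            return id;
        }
        // Assumption: a real implementation would create and attach a GL FBO here.
        self.next_id += 1;
        self.creations += 1;
        let id = FboId(self.next_id);
        self.fbos.insert(key, id);
        id
    }

    /// Drop every FBO that wraps a texture which was resized or freed.
    pub fn invalidate_texture(&mut self, texture_id: u32) {
        self.fbos.retain(|key, _| key.texture_id != texture_id);
    }
}
```

With a cache like this, the per-frame cost is a hash lookup per target, and creation only happens when a texture is first seen or has been invalidated.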
A GPU capture shows us making 355 draw calls, which is more than we should aim for (up to 100, ideally). There are three contributing factors:
- all the blits necessary for picture caching. If caching is not feasible here, we should improve the heuristic that disables it, so that the blits can be avoided.
- there are a lot of draws of transformed primitives into scissored areas. There are ways to limit the transformed rendering to specific areas without breaking batches; we should look into them more.
- the frame is split across 7 passes, which seems a bit high. We need to investigate whether the task graph is deeper than it needs to be.
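As a rough way to reason about the pass count, the sketch below computes the minimum number of passes a task graph needs, under the assumption that a task must run in a later pass than every task it reads from, so the minimum equals the longest dependency chain. The `pass_count` function and the adjacency-list representation are hypothetical and not taken from the WebRender sources.

```rust
/// `deps[i]` lists the indices of the tasks that task `i` reads from.
/// Assumption: tasks are topologically ordered, so every dependency index
/// is smaller than the index of the task that uses it.
pub fn pass_count(deps: &[Vec<usize>]) -> usize {
    let mut depth = vec![0usize; deps.len()];
    for (task, inputs) in deps.iter().enumerate() {
        // A task with no inputs lands in pass 1; otherwise it goes one pass
        // after its deepest input.
        let deepest_input = inputs
            .iter()
            .map(|&d| {
                assert!(d < task, "tasks must be topologically ordered");
                depth[d]
            })
            .max()
            .unwrap_or(0);
        depth[task] = deepest_input + 1;
    }
    depth.into_iter().max().unwrap_or(0)
}
```

Comparing this lower bound against the number of passes seen in the capture would show whether the depth comes from the dependencies themselves or from how tasks get assigned to passes.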