Open Bug 1487864 (frame-scheduling) Opened 6 years ago Updated 1 month ago

Improve frame scheduling in WebRender - Strange couplets of animation frames on expensive main-thread-driven animation

Categories

(Core :: Graphics: WebRender, defect, P2)


Tracking


Tracking Status
firefox63 --- disabled

People

(Reporter: mstange, Unassigned)

References

(Depends on 10 open bugs, Blocks 3 open bugs)

Details

(Keywords: regression)

Like bug 1462503, I expect this to be a regression from bug 1371668.

Here's a profile I took of running https://output.jsbin.com/surane/quiet on my machine: https://perfht.ml/2NBnNUx

Edit 2023-09-20: Updated profile: https://share.firefox.dev/3POXUDu

It has 28ms refresh driver ticks but isn't keeping the main thread or the renderer thread busy. I think the intention here is to fall back to 30fps and do two refresh driver ticks in each 66ms window. That is what's happening in the tail end of the profile, but during the first half of the profile, there are many instances of two refresh ticks per 85ms.
It just seems like our refresh tick scheduling leaves too much idle time as soon as we can't keep up with 60fps.
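For reference, a back-of-envelope sketch (just the vsync arithmetic implied above, not data from the profile) of the ideal 30fps fallback cadence versus the observed pairing:

```python
# Rough vsync arithmetic for a 60Hz display (illustrative only).
VSYNC_MS = 1000 / 60  # ~16.7ms per vsync

# Ideal 30fps fallback: tick on every other vsync, so two refresh
# driver ticks land inside each 4-vsync (~66.7ms) window.
ideal_pair_window_ms = 4 * VSYNC_MS      # ~66.7ms
observed_pair_window_ms = 85             # from the first half of the profile

# The observed pairs span roughly 5 vsync intervals instead of 4,
# i.e. about one extra vsync of idle time per pair of ticks.
extra_vsyncs = observed_pair_window_ms / VSYNC_MS - 4
print(round(ideal_pair_window_ms, 1), round(extra_vsyncs, 1))
```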
Flags: needinfo?(matt.woodrow)
Priority: -- → P2
This is really hard to reason about! I think it would be useful to have some more markers to try to figure out what's happening.

I think part of the problem is that we can put events into the queue on vsync, but not actually process them until considerably later if the prior paint was slow. That can mean that even if DidComposite has been sent from the compositor, it could be sitting in the event queue behind the Vsync message, and we'll skip a paint unnecessarily.

The vsync code in the refresh driver has handling for squashing multiple vsyncs together when we're slow, as well as for dispatching a new runnable to the event queue if it thinks painting would be starving out other events.
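As a rough illustration of that vsync squashing (a minimal sketch with made-up names, not the actual refresh driver code): if several vsync notifications have piled up behind a slow paint, only the newest one should produce a tick.

```python
from collections import deque

class VsyncQueue:
    """Toy event queue that squashes stale vsync notifications."""

    def __init__(self):
        self.pending_vsyncs = deque()

    def notify_vsync(self, vsync_time):
        # Called from the vsync source; these may pile up while a paint runs.
        self.pending_vsyncs.append(vsync_time)

    def take_latest_vsync(self):
        # When we finally get to run, drop all but the newest vsync so a
        # slow paint doesn't cause a burst of back-to-back ticks.
        if not self.pending_vsyncs:
            return None
        latest = self.pending_vsyncs[-1]
        self.pending_vsyncs.clear()
        return latest
```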

We also have no marker for when we tried to paint, but decided to skip it because we're behind.

I think if we had markers for all those things happening (all tagged with the relevant paint id/ vsync time), as well as markers for the vsync thread then we might be able to understand this better.

I can try playing with that later this week if I get time.
Flags: needinfo?(matt.woodrow)
Looks like this also affects tscrollx: https://perfht.ml/2Q8X83b

We're getting quick pairs of main thread paints, which get coalesced into a single composite.

Non-WR doesn't seem to be affected for some reason.
(In reply to Matt Woodrow (:mattwoodrow) from comment #2)
> Looks like this also affects tscrollx: https://perfht.ml/2Q8X83b
> 
> We're getting quick pairs of main thread paints, which get coalesced into a
> single composite.
> 
> Non-WR doesn't seem to be affected for some reason.

Ok, so I think this is because WR has more threads than non-WR. Non-WR has only a single compositor thread, so we're blocked from processing incoming layer transactions while we composite. WR can process incoming transactions while we work on a slow composite, and we can coalesce them.

It seems like this issue shouldn't be restricted to ASAP mode; I think it could happen any time the Renderer thread is the slowest piece of the pipeline and takes more than one vsync interval.

It feels like we shouldn't be firing requestAnimationFrame that fast if we know we're just going to combine them on the compositor side.
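To illustrate the coalescing behaviour described above (hypothetical names, not WebRender's actual API): if a second transaction arrives while the previous one still hasn't been composited, the older one is effectively superseded, which is why quick pairs of main-thread paints collapse into a single composite.

```python
class ToyCompositor:
    """Toy model: pending transactions coalesce while a composite is running."""

    def __init__(self):
        self.pending_transaction = None

    def receive_transaction(self, txn):
        # A newer transaction supersedes any older one that hasn't been
        # composited yet -- the older paint's work is effectively wasted.
        self.pending_transaction = txn

    def composite_if_needed(self):
        if self.pending_transaction is None:
            return None
        txn, self.pending_transaction = self.pending_transaction, None
        return f"composited {txn}"
```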
No longer blocks: wr-snap
Priority: P2 → P3
Note that until recently we were building frames each time a display list was updated, in addition to each time we rendered. So in workloads where we rebuild the display list continuously we used to build the frame twice as often as necessary and that would also trigger some extra render work on the renderer. This was fixed in https://github.com/servo/webrender/pull/3092.
It sounds from the comments here that there are other things coming into play, but in any case it'll be good to take new profiles that don't have the redundant frame builds whenever someone looks into this again.
Priority: P3 → P2
Looking at that new profile, I think I have a good understanding of this.

With non-WR, we operate in a double-buffered mode, where there are two phases (content and compositor), each of which runs on vsync. This adds a frame of latency, but gives us up to ~32ms of total time to produce a frame while still hitting 60fps. I believe we have some code that relies on this 2-deep pipeline, like <canvas>.

The main thread only allows 2 outstanding frames at any one time to avoid overproduction.

WebRender is trying to do roughly the same thing, except that it runs across 4 threads instead of two. The content phase starts on the main thread with display list building, and then continues on the scene builder thread. The compositing phase starts on the render backend thread ('frame building') and continues on the Renderer thread ('generate frame').

The big difference here is that WebRender can actually be running four things in parallel, even though it's trying to pretend that the pipeline is only 2 deep.

In the profile, we can see that the end-to-end time of a given frame is ~60ms, with the combined content time at ~21ms and the combined compositing time at ~24ms.

Given that, we'd expect a composite phase to run every 32ms (2 vsync intervals), but we manage to initiate one every vsync (while there are frames waiting in the pipeline).

So we're starting two frames in quick succession, and then the main thread stops, waiting for a response. Since the end-to-end time is ~60ms, the total throughput is still 30fps, but our DidComposite messages come back very closely spaced (since the Renderer thread component is sub-16ms) and then we can paint twice again.

The simplest solution would probably be to not let frame building and frame generation run in parallel and act as if these two were a single thread.

Other alternatives include letting the gecko thread have more frames outstanding, and probably making the extra phases explicit (schedule things on vsync, and increase the latency).
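To make this concrete, here is a small simulation sketch (in the spirit of the frame-scheduling simulator mentioned further down; the per-stage costs are loosely based on the profile, everything else is assumed) of a greedy 4-stage pipeline with the main thread limited to 2 outstanding frames. The frame start times come out in closely spaced pairs followed by gaps, i.e. the couplet pattern.

```python
VSYNC = 16.7
# Rough per-thread costs in ms: ~21ms content, ~24ms compositing overall.
COSTS = {"content": 11, "scene": 10, "frame_build": 12, "render": 12}
MAX_IN_FLIGHT = 2
SIM_END = 400.0

free_at = {s: 0.0 for s in COSTS}   # when each thread finishes its current work
in_flight = []                      # (frame_id, composite_done_time) not yet retired
starts = []

t = 0.0
frame_id = 0
while t < SIM_END:
    # Retire frames whose DidComposite would have arrived by now.
    in_flight = [(f, done) for f, done in in_flight if done > t]
    if len(in_flight) < MAX_IN_FLIGHT:
        # Greedy: each stage starts as soon as both it and the frame are ready.
        ready = t
        for stage, cost in COSTS.items():
            begin = max(ready, free_at[stage])
            free_at[stage] = begin + cost
            ready = begin + cost
        in_flight.append((frame_id, ready))
        starts.append(round(t, 1))
        frame_id += 1
    t += VSYNC

print(starts)  # pairs of closely spaced starts followed by gaps (couplets)
```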
I think our choices are:

* Don't let frame building and frame generation run in parallel, go back to a true 2 phase mode like non-WR.
    Low latency, matches what we do now, but we lose the parallelism advantages of the extra threads.

* Maintain the current setup of 4 pseudo-phases, but with only two that run on vsync and two that run ASAP. Change Gecko throttling to allow 4 outstanding frames.
    Latency dynamically transitions between 2 and 4 frames depending on how long the work takes. Can coalesce and drop frames (janky?) when transitioning to lower latencies.
    Might need to update canvas/webgl to handle this.

* Switch to a true 4 phase model. Schedule all phases on their own vsync tick.
    Max latency of 4 frames at all times, but maximizes the time we get to spend on each. Would probably get this testcase to 60fps on my machine.
    Again with the canvas issues.
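Putting rough numbers on these options (a back-of-envelope sketch reusing the ~21ms content and ~24ms compositing costs from the profile; the per-thread split is assumed):

```python
VSYNC = 1000 / 60          # ~16.7ms
content_ms = 21            # display list building + scene building (profile)
composite_ms = 24          # frame building + rendering (profile)

# Option 1: strict 2-phase pipeline -- each phase must fit in one vsync.
hits_60fps_2phase = content_ms <= VSYNC and composite_ms <= VSYNC    # False here

# Option 3: 4 vsync-driven phases -- each individual stage gets a full vsync,
# at the cost of 4 frames of latency.
stages = [11, 10, 12, 12]  # rough per-thread split of the costs above
hits_60fps_4phase = all(s <= VSYNC for s in stages)                  # True here
latency_frames_4phase = 4

# Option 2 sits in between: latency varies from 2 to 4 frames with load.
print(hits_60fps_2phase, hits_60fps_4phase, latency_frames_4phase)
```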
None of these choices are particularly appealing. I'd like to think about this problem some more.
We'd eventually like to merge the compositor and render backend threads (run the render backend code on the compositor). I don't think this will dramatically change the equation but it does remove some of the parallelism.
We also talked about moving the renderer's command submission to the compositor thread.

Since the biggest unknown in terms of timing is the scene building phase, which happens asynchronously while we can scroll any number of frames, we may want to throttle the content thread based on that. Currently non-WR Gecko can't tick the refresh driver while it is painting (if painting takes a long time), so we might want to make it so that it can't tick frame N+1 until frame N is past the scene building phase.

The phases would become:

> [* display-list building / scene building] [* frame building ] [ rendering ]

where display-list building and scene building would happen on different threads but not overlap. The boxes marked with an asterisk are vsync-driven while the others run ASAP.

or if we move the renderer and frame building to the compositor thread:

> [* display-list building / scene building] [* frame building + rendering ]

I was hoping that we'd be able to not do this thread shuffling before the MVP though.
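A minimal sketch of the proposed throttling rule (hypothetical structure, not actual Gecko code): the refresh driver skips ticking frame N+1 until frame N has made it past scene building.

```python
class ToyRefreshDriver:
    """Toy model of 'do not tick frame N+1 until frame N is past scene building'."""

    def __init__(self):
        self.next_frame = 0
        self.scene_built_up_to = -1   # highest frame id whose scene is built

    def on_scene_built(self, frame_id):
        # Notification from the scene builder thread.
        self.scene_built_up_to = max(self.scene_built_up_to, frame_id)

    def tick(self):
        # Only start a new content frame if the previous one has cleared
        # the (vsync-driven) display-list building + scene-building phase.
        if self.next_frame - 1 > self.scene_built_up_to:
            return None               # throttled: skip this vsync
        frame = self.next_frame
        self.next_frame += 1
        return frame
```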
QA Contact: mreavy
Since it was too hard for me to think about the different options for making this work well I put together a script for simulating different scenarios at https://github.com/jrmuizel/frame-scheduling

With the current parameters it reproduces the couplets we see in the profile.
I had a breakthrough today that enabled me to get somewhat decent schedules in the simulator. Previously all threads were greedy: as soon as they had work they would start doing it. By being less greedy and starting work only when the next thread is not too busy, we seem to be able to avoid alternating between a burst of work and then no work.
(In reply to Jeff Muizelaar [:jrmuizel] from comment #13)
> I had a breakthrough today that enabled me to get somewhat decent
> schedules in the simulator. Previously all threads were greedy: as soon
> as they had work they would start doing it. By being less greedy and
> starting work only when the next thread is not too busy, we seem to be
> able to avoid alternating between a burst of work and then no work.

Can you please update the repo with this change, and/or upload the output file so we can look at it?
Assignee: nobody → dglastonbury
We met about this yesterday. The basic plan is as follows:
- Currently we only throttle on the main thread. If there are too many frames in flight we won't produce any more.
- We need to change to throttling between each thread queue. The general strategy we want is that we don't start new work if the next thread down the line already has work in its queue that it's not working on.

This should give us decent latency while keeping throughput reasonably high.
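A minimal sketch of that "throttle between each thread queue" rule (illustrative only, with made-up stage names): a stage only starts new work when the stage downstream doesn't already have unstarted work sitting in its queue.

```python
from collections import deque

class Stage:
    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream
        self.queue = deque()     # work handed to us but not yet started
        self.busy = False

    def can_start(self):
        # Backpressure rule: don't start if the next stage already has
        # work sitting in its queue that it hasn't picked up yet.
        if not self.queue or self.busy:
            return False
        return self.downstream is None or len(self.downstream.queue) == 0

    def start_next(self):
        if self.can_start():
            work = self.queue.popleft()
            self.busy = True
            return work
        return None

    def finish(self, work):
        self.busy = False
        if self.downstream is not None:
            self.downstream.queue.append(work)

# Pipeline wiring: main thread -> scene builder -> render backend -> renderer.
renderer = Stage("renderer")
render_backend = Stage("render backend", renderer)
scene_builder = Stage("scene builder", render_backend)
main_thread = Stage("main thread", scene_builder)
```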
See Also: → 1513254
We discussed this during the all-hands and the original plan has some problems, because things like canvas assume a 2 frame delay and adding more frames makes a mess of that. At this point, I think we're best off just shipping as is, or at most making our pipeline more synchronous so that it's more like the existing code.
Priority: P2 → P4

We need a new plan.

Assignee: dglastonbury → nobody
Flags: needinfo?(jmuizelaar)
Blocks: wr-68
No longer blocks: stage-wr-trains
Flags: needinfo?(jmuizelaar)
Priority: P4 → P2
Flags: needinfo?(jmuizelaar)

Short-term-ish things to investigate:

  • Increase the max number of in-flight frames (and see what breaks).
  • Throttle the refresh driver so that we don't start frame N+1 until the scene for frame N is built (see comment 11).

Long-term-ish things to investigate:

  • What Chrome does.
  • Make scene building, blob rasterization and frame building (at least the parts that happen in WebRender's process) fast enough that we can just run them all synchronously.

We also wanted to make it so that scene building starts without needing the RenderBackend.

Flags: needinfo?(jmuizelaar)
Blocks: wr-69
No longer blocks: wr-68
Blocks: wr-70
No longer blocks: wr-69
Alias: frame-scheduling
No longer blocks: wr-70
Summary: Strange couplets of animation frames on expensive main-thread-driven animations → Strange couplets of animation frames on expensive main-thread-driven animation
Summary: Strange couplets of animation frames on expensive main-thread-driven animation → Improve frame scheduling in WebRender - Strange couplets of animation frames on expensive main-thread-driven animation
Blocks: wr-perf
Depends on: 1612440
Depends on: 1612441
Depends on: 1612443

A few notes from the Berlin allhands discussion:

  • WebRender has too much pipelining; it causes more harm than good at this point.
  • Doing frame building and rendering on the same thread would help (bug 1612443).
  • We should avoid doing heavy work on the compositor thread, because a lot of messages (gfx and non-gfx related) go through it, so anything that hurts the responsiveness of the compositor thread will add latency to various things.
  • We can reduce some of the latency by sending scene building requests directly from the webrender api (compositor thread) to the scene builder thread (bug 1612440). This was complicated to do with ipc_channel but the latter is going away in bug 1612437.
  • Before we can do most of this, it is important to make hit testing queries not block on the render backend (bug 1580178).
Depends on: 1668339
Depends on: 1671490
Depends on: 1683294
Depends on: 1614734
Depends on: 1603453
Depends on: 1715317
Depends on: 1693205
Depends on: 1686358
Depends on: 1735391
Depends on: 1732144
Severity: normal → S3
Depends on: 1797975
Depends on: 1803387
Blocks: 1825391
Depends on: 1865542