Bugzilla

Comment 1

•

6 years ago

This is really hard to reason about! I think it would be useful to have some more markers to try figure out what's happening.

I think part of the problem is that we can put events into the queue on vsync, but not actual processes them until considerably later if the prior paint was slow. That can mean that even if DidComposite has been sent from the compositor, it could be in the event queue behind the Vsync message, and we'll skip a paint unnecessarily.

The vsync code in refresh driver has handling for squashing multiple vsyncs together when we're slow, as well as dispatching a new runnable to the event queue if it thinks painting would be starving out other events.

We also have no marker for when we tried to paint, but decided to skip it because we're behind.

I think if we had markers for all those things happening (all tagged with the relevant paint id/ vsync time), as well as markers for the vsync thread then we might be able to understand this better.

I can try play with that later this week if I get time.

Flags: needinfo?(matt.woodrow)

Marco Castelluccio [:marco]

Updated

•

6 years ago

Blocks: 1483610

Updated

•

6 years ago

Keywords: regression

David Durst [:ddurst]

Updated

•

6 years ago

status-firefox63: affected → disabled

Comment 2

•

6 years ago

Looks like this also affects tscrollx: https://perfht.ml/2Q8X83b

We're getting quick pairs of main thread paints, which get coalesced into a single composite.

Non-WR doesn't seem to be affected for some reason.

Comment 3

•

6 years ago

(In reply to Matt Woodrow (:mattwoodrow) from comment #2)
> Looks like this also affects tscrollx: https://perfht.ml/2Q8X83b
> 
> We're getting quick pairs of main thread paints, which get coalesced into a
> single composite.
> 
> Non-WR doesn't seem to be affected for some reason.

Ok, so I think this is because WR has extra threads than non-WR. Non-WR only has a single thread, so we're blocked from processing incoming layer transactions while we composite. WR can process incoming transactions while we work on a slow composite, and we can coalesce them.

It seems like this issue shouldn't be restricted to ASAP mode, but I think it could happen any time the Renderer thread is the slowest piece of the pipeline, and it takes more than one vsync interval.

It feels like we shouldn't be firing requestAnimationFrame that fast if we know we're just going to combine them on the compositor side.

Updated

•

6 years ago

No longer blocks: wr-snap

Maire Reavy [:mreavy]

Updated

•

6 years ago

Priority: P2 → P3

Comment 4

•

6 years ago

Note that until recently we were building frames each time a display list was updated, in addition to each time we rendered. So in workloads where we rebuild the display list continuously we used to build the frame twice as often as necessary and that would also trigger some extra render work on the renderer. This was fixed in https://github.com/servo/webrender/pull/3092.
It sounds from the comments here that there are other things coming into play, but in any case it'll be good to take new profiles that don't have the redundant frame builds whenever someone looks into this again.

Updated

•

6 years ago

Priority: P3 → P2

Comment 5

•

6 years ago

Newer profile: https://perfht.ml/2zQ3xde

Comment 6

•

6 years ago

Looking at that new profile, I think I have a good understanding of this.

With non-WR, we operate in a double-buffered mode, where there are two phases (content and compositor), each which run on vsync. This adds a frame of latency, but gives us up to ~32ms of total time to produce a frame while still hitting 60fps. I believe we have some code that relies on this 2-deep pipeline, like <canvas>.

The main thread only allows 2 outstanding frames at any one time to avoid overproduction.

WebRender is trying to do roughly the same thing, except that it runs across 4 threads instead of two. The content phase starts on the main thread with display list building, and then continues on the scene builder thread. The compositing phase starts on the render backend thread ('frame building') and continues on the Renderer thread ('generate frame').

The big difference here is that WebRender can actually be running four things in parallel, even though it's trying to pretend that the pipeline is only 2 deep.

In the profile, we can see that the end-to-end time of a given frame is ~60ms, with the combined content time at ~21ms and the combined compositing time at ~24ms.

Given that, we'd expect each a composite phase to run every 32ms (2 vsync intervals), but we manage to initiate one every vsync (while there are ones waiting in the pipeline).

So we're starting two frames in quick succession, and then the main thread stops, waiting for response. Since the end-to-end time is ~60ms, the total throughput is still 30fps, but our DidComposite messages come back very slowly spaced (since the Renderer thread component is sub-16ms) and then we can paint twice again.

The simplest solution would probably be to not let frame building and frame generation run in parallel and act as if these two were a single thread.

Other alternatives include letting the gecko thread have more frames outstanding, and probably making the extra phases explicit (schedule things on vsync, and increase the latency).

Comment hidden (offtopic)

heres a profile I took on windows with 2Core/4Thread laptop : https://perfht.ml/2ICYsIf
It appears to be different from yours,almost completely dominated by stylo stuff.

Comment hidden (offtopic)

nvm, that as because of ABP. Please ignore this and the previous comment.

Comment 9

•

6 years ago

I think our choices are:

* Don't let frame building and frame generation run in parallel, go back to a true 2 phase mode like non-WR.
    Low latency, matches what we do now, but we lose the parallelism advantages of the extra threads.

* Maintain the current setup of 4 pseudo-phases, but only 2 that run on vsync and two that run asap. Change gecko throttling to allow 4 outstanding frames.
    Latency dynamically transitions between 2 and 4 frames depending on how long the work takes. Can coalesce and drop frames (janky?) when transitioning to lower latencies.
    Might need to update canvas/webgl to handle this.

* Switch to a true 4 phase model. Schedule all phases on their own vsync tick.
    Max latency of 4 frames at all times, but maximizes time we get to spent on each. Would probably get this testcase to 60fps on my machine.
    Again with the canvas issues.

Comment 10

•

6 years ago

None of these choices are particularly appealing. I'd like to think about this problem some more.

Firefox Bug Husbandry Bot

Comment 11

•

6 years ago

We'd eventually like to merge the compositor and render backend threads (run the render backend code on the compositor). I don't think this will dramatically change the equation but it does remove some of the parallelism.
We also talked about moving the renderer's command submission to the compositor thread.

Since the biggest unknown in terms of timing is the scene building phase which happens asynchronously while we can scroll any number of frames, we may want to throttle the content thread based on that. Currently non-wr gecko can't tick the refresh driver while it is painting (if painting takes a long time) so we might want to make it so that it can't tick frame N+1 until frame N is past the scene building phase.

The phases would become:

> [* display-list building / scene building] [* frame building ] [ rendering ]

where display-list and scene building would happen on different threads but not overlap. The boxes with an asterisk* are vsync driven while the others run asap.

or if we move the renderer and frame building to the compositor thread:

> [* display-list building / scene building] [* frame building + rendering ]

I was hoping that we'd be able to not do this thread shuffling before the MVP though.

QA Contact: mreavy

Updated

•

6 years ago

QA Contact: mreavy

Comment 12

•

6 years ago

Since it was too hard for me to think about the different options for making this work well I put together a script for simulating different scenarios at https://github.com/jrmuizel/frame-scheduling

With the current parameters it reproduces the couplets we see in the profile.

Updated

•

6 years ago

Blocks: 1476368

Comment 13

•

6 years ago

I had a break through today that enabled me to get somewhat decent schedules in the simulator. Previously all threads were gready. As soon as they had work they would start doing it. By being less greedy and starting work only when the next thread is not too busy we seem to be able to avoid alternating between a hurd of work and then no work.

Comment 14

•

6 years ago

(In reply to Jeff Muizelaar [:jrmuizel] from comment #13)
> I had a break through today that enabled me to get somewhat decent schedules
> in the simulator. Previously all threads were gready. As soon as they had
> work they would start doing it. By being less greedy and starting work only
> when the next thread is not too busy we seem to be able to avoid alternating
> between a hurd of work and then no work.

Can you please update the repo with this change, and/or upload the output file so we can look at it?

Updated

•

6 years ago

Assignee: nobody → dglastonbury

Comment 15

•

6 years ago

We met about this yesterday. The basic plan is as follows:
- Currently we only throttle on the main thread. If there are two many frames in flight we won't produce anymore.
- We need to change to throttling between each thread queue. The general strategy that we want to have is that we don't start new work if the next thread down the line already has work in it's queue that it's not working on.

This should give us decent latency while keeping throughput reasonably high.

Updated

•

6 years ago

Comment 16

•

6 years ago

We discussed this during the all-hands and the original plan has some problems because things like canvas assume a 2 frame delay and adding more frames make a mess of that. At this point, I think we're best off just shipping as is or at most making our pipeline more synchronous so that it's more like the existing code.

Priority: P2 → P4

Comment 17

•

6 years ago

We need a new plan.

Assignee: dglastonbury → nobody

Flags: needinfo?(jmuizelaar)

Updated

•

5 years ago

Blocks: wr-68
No longer blocks: stage-wr-trains

Flags: needinfo?(jmuizelaar)

Priority: P4 → P2

Updated

•

5 years ago

Flags: needinfo?(jmuizelaar)

Comment 18

•

5 years ago

Short-term-ish things to investigate:

Increase the max number of in-flight frames (and see what breaks).
Throttle the refresh driver so that we don't start frame N+1 until the scene for Frame N is built (see comment 11).

Comment 19

•

5 years ago

Long-term-ish things to investigate:

What Chrome does.
Make scene building, blob rasterization and frame building (at least the parts that happen in WebRender's process) fast enough that we can just run them all synchronously.

Comment 20

•

5 years ago

We also wanted to make it so that scene building starts without needing the RenderBackend.

Flags: needinfo?(jmuizelaar)

Updated

•

5 years ago

Blocks: wr-69
No longer blocks: wr-68

Updated

•

5 years ago

Blocks: wr-70
No longer blocks: wr-69

Updated

•

5 years ago

Alias: frame-scheduling

Updated

•

5 years ago

Blocks: 1553575

Updated

•

5 years ago

No longer blocks: wr-70

Summary: Strange couplets of animation frames on expensive main-thread-driven animations → Strange couplets of animation frames on expensive main-thread-driven animation

Updated

•

5 years ago

Summary: Strange couplets of animation frames on expensive main-thread-driven animation → Improve frame scheduling in WebRender - Strange couplets of animation frames on expensive main-thread-driven animation

Updated

•

5 years ago

Blocks: wr-perf

Updated

•

5 years ago

Depends on: 1612440

Updated

•

5 years ago

Depends on: 1612441

Updated

•

5 years ago

Depends on: 1612443

Comment 21

•

5 years ago

A few notes from the Berlin allhands discussion:

WebRender has too much pipelining, it causes more harm than good at this point.
Doint frame building and rendering on the same thread would help (bug 1612443).
We should avoid doing heavy work on the compositor thread because a lot of messages (fx and non-gfx related) go through it so anything that hurts the responsiveness of the compositor thread will add latency to various things.
We can reduce some of the latency by sending scene building requests directly from the webrender api (compositor thread) to the scene builder thread (bug 1612440). This was complicated to do with ipc_channel but the latter is going away in bug 1612437.
Before we can do most of this, it is important to make hit testing queries not block on the render backend (bug 1580178).

Updated

•

5 years ago

Blocks: 1611582

Updated

•

4 years ago

Depends on: 1668339

Updated

•

4 years ago

Depends on: 1671490

Updated

•

4 years ago

Depends on: 1683294

Updated

•

3 years ago

Depends on: 1614734

Updated

•

3 years ago

Depends on: 1603453

Updated

•

3 years ago

Depends on: 1715317

Updated

•

3 years ago

Depends on: 1693205

Updated

•

3 years ago

Depends on: 1686358

Markus Stange [:mstange]

Reporter

Updated

•

3 years ago

Depends on: 1735391

Updated

•

3 years ago

Depends on: 1732144

BMO Automation

Updated

•

2 years ago

Severity: normal → S3

Updated

•

2 years ago

Depends on: 1797975

Updated

•

2 years ago

Depends on: 1803387

David Shin[:dshin]

Updated

•

1 year ago

Blocks: 1825391