Closed Bug 1391318 Opened 3 years ago Closed 2 years ago

Asynchronous RenderBackend processing for APZ

Categories

(Core :: Graphics: WebRender, enhancement, P1)


Tracking


RESOLVED FIXED
mozilla61
Tracking Status
firefox56 --- unaffected
firefox57 --- unaffected
firefox61 --- fixed

People

(Reporter: kvark, Assigned: kats)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [wr-reserve] [gfx-noted])

Attachments

(2 files)

https://github.com/servo/webrender/issues/1357

WR should process different parts of documents asynchronously, so that scrolling is not blocked by the scene building of the new layout.
Blocks: 1377291
Whiteboard: [gfx-noted]
Priority: P3 → P2
Whiteboard: [gfx-noted] → [wr-mvp] [gfx-noted]
Priority: P2 → P3
Whiteboard: [wr-mvp] [gfx-noted] → [wr-reserve] [gfx-noted]
Assignee: nobody → kvark
Priority: P3 → P1
== Background ==

Now that the async scene building stuff in WR is close to landing (servo/webrender#2362) I spent some more time thinking about how APZ fits into the model. The planned async scene building architecture is described in an easy-to-understand diagram at https://github.com/servo/webrender/pull/2362#issuecomment-365597637 - this shows that while WR is rendering and scrolling scene "A", it can build scene "B" on a scene builder thread, and once that's done, it gets swapped in and WR will start rendering and scrolling that.

In order for this to work, it means that while scene B is being built (before the swap), the APZ code on the C++ side needs to keep track of the state from scene A as the primary scene, because scroll animations/hit-testing etc should apply to that scene. The problem is that when the gecko main-thread sends over a display list transaction to the compositor, that contains both the WR display list and the APZ data for scene B. That step happens at the very start of the diagram, and is what triggers the step labelled "Txn:SetDL(B)", and it also currently triggers the APZ state update to scene B. So if we need to keep APZ on scene A, then we need to defer that APZ state update - we need to stick the APZ data for scene B into a queue or something and then apply it to APZ at the same time as the "Set Scene(B)" step happens. That's basically what I explained in the comment following the diagram.
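To make the deferral concrete, here's a minimal sketch of parking the APZ data for a pending scene, keyed by epoch, and applying it only when the corresponding "Set Scene" happens. All names here are hypothetical, not the actual gecko API:

```cpp
// Hypothetical sketch (not the real gecko code): the APZ data that arrives
// with Txn:SetDL(B) is parked here, and only applied to APZ when WR swaps
// in the scene for that epoch.
#include <cstdint>
#include <deque>
#include <utility>

struct ApzScrollData { /* scroll metadata for one scene */ };

class PendingApzUpdates {
 public:
  // Called when the display list transaction for a new scene arrives.
  void Defer(uint64_t aEpoch, ApzScrollData aData) {
    mPending.push_back({aEpoch, std::move(aData)});
  }

  // Called at the "Set Scene" step: applies everything up to and including
  // that epoch so APZ stays in lockstep with the scene being rendered.
  template <typename ApplyFn>
  void ApplyUpTo(uint64_t aEpoch, ApplyFn&& aApply) {
    while (!mPending.empty() && mPending.front().first <= aEpoch) {
      aApply(mPending.front().second);
      mPending.pop_front();
    }
  }

 private:
  std::deque<std::pair<uint64_t, ApzScrollData>> mPending;
};
```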

== Problem ==

The thing I didn't quite realize when I wrote that comment is this also affects hit-testing. Let's say we're in the state where the scene builder thread is building scene B, and the APZ data for scene B is stashed in a queue somewhere, waiting for the scene swap. In this state, if we get an input event, we will hit-test it against scene A in webrender, and APZ has the scroll data related to scene A so it can interpret the results fine. The problem is if the hit-test lands on a "dispatch-to-content" region - this might be an unlayerized scrollframe, or something with input event listeners, for example. In this case APZ needs to queue the input event, and wait for the gecko main-thread to process the event and tell it what to do with the hit-test. However at this point the gecko main-thread state is already on scene B, or might even have progressed to scene C or D. For the purposes of walking through an example, let's say that the input event triggers scrollframe layerization, which causes scene C to be generated. So the main thread sends scene C's display list to the compositor, followed by the hit-test result for the input event (which refers to a scrollframe id in scene C's display list).

Without WR this works just fine because the hit-test response arrives at APZ in order with the display list update, so by the time APZ processes the hit-test result it will already have processed scene C's APZ data, and so it will be "in sync" with the gecko main-thread with respect to interpreting the hit-test result. In the WR case, though, the scene C APZ data got stashed into the queue behind the scene B APZ data, because the scene builder thread is still trying to build scene B. So if we give the hit-test result to APZ it becomes out-of-order with respect to the scene data, and gets misinterpreted. In our example, the scrollframe id would refer to a non-existent scrollframe and APZ would throw it away.

== Solutions ==

One way to look at the problem is as an ordering problem of messages from the main thread to APZ, in which case the obvious solution is to put the hit-test result into the same queue that is holding scene B and scene C's APZ data, and only deliver it to APZ after scene C gets swapped in. The problem with this is that the latency to APZ increases quite a bit, and might actually negate the benefit of the async scene building in the first place.

Another way to look at this is that when the user created the input event, they were actually looking at scene A on the screen, and arguably their input belongs to scene A. That means if we get a dispatch-to-content hit-result in WR, and the main thread is already on scene B or scene C then we *should* just ignore it. However I don't think that's valid because scene B might be basically identical to scene C, at least from the user's point of view. From the user's point of view they might just be trying to scroll a scrollframe and they wouldn't be able to.

Other options? I had something else in mind when I was thinking about this yesterday but now I've forgotten and anyway this comment is too long so I'm just going to end it here.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #2)
> One way to look at the problem is as an ordering problem of messages from
> the main thread to APZ, in which case the obvious solution is to put the
> hit-test result into the same queue that is holding scene B and scene C's
> APZ data, and only deliver it to APZ after scene C gets swapped in. The
> problem with this is that the latency to APZ increases quite a bit, and
> might actually negate the benefit of the async scene building in the first
> place.

The way I see it, one of the premises of APZ is that compositing is fast. That is, the time from a layer tree arriving at the compositor, to that layer tree showing up on screen, is short (1 frame). 

Therefore, if APZ is already waiting for the layer tree to arrive before it can process an event (because it hit a dispatch-to-content region which triggered layerization), then additionally "waiting" until that layer tree is composited is not a big deal. (I put "waiting" in quotes because the layer tree will be composited on the next composite, so APZ doesn't actually need to wait further; it can process the event as soon as the layer tree and subsequent hit-test confirmation arrive.)

If WebRender achieves this property (that compositing is fast) often enough, I don't see a problem with keeping the hit-test confirmation in the same queue as the "layer tree" (display list) update. If it doesn't, then it seems like we're breaking one of the premises of APZ, and I don't think there are easy solutions.

(I realize that with WebRender we're doing more of the work on the compositing side, so we should probably take that into account. That is, while putting the hit-test confirmation into the queue may introduce latency compared to handling it as soon as the display list is sent over, does it introduce latency compared to handling it when the *layer tree* would have been sent over with non-WR (which would have meant also waiting for painting)? If not, we're not actually regressing latency compared to non-WR.)
Another mitigating factor here is that we should be able to only put the hit-test confirmation into the queue in the "layers dependent" confirmation case, just like how with non-WR the confirmation only waits for the next layer tree update in that case.
Here's a slightly different way of framing the issue, in case that helps:

When a main-thread hit test confirmation message is "layers dependent", on a conceptual level we're telling APZ "you can't handle this event until we're just about ready to present [as in, show on the screen] a new scene".

Pre-WR, sending the confirmation right after a layer tree transaction met this conceptual criterion, since by the time a layer transaction arrives at the compositor, we're "just about ready to present a new scene" (i.e. it will be composited on the next frame).

If WR can't consistently present a display list within one frame, then the arrival of a display list at the compositor no longer constitutes "we're just about ready to present a new scene", so we can't reasonably expect APZ to process the hit test confirmation as soon as the display list arrives. Rather, it's the "switch to this scene" message that constitutes "we're just about ready to present the scene", so it makes sense for the confirmation to be processed at that time as well.
(In reply to Botond Ballo [:botond] from comment #4)
> Another mitigating factor here is that we should be able to only put the
> hit-test confirmation into the queue in the "layers dependent" confirmation
> case, just like how with non-WR the confirmation only waits for the next
> layer tree update in that case.

I don't think this is the case - in my example even if the event didn't trigger a layerization on the main thread (and therefore didn't go into the "layers dependent" case), the response to APZ would be in terms of scene B. And since APZ is "currently on" scene A, the response would still need to go into the queue to be processed after APZ has been updated to scene B. I'm pretty sure this is going to be the case for any messages sent from the main thread to APZ. Or at least enough of them that queuing only those will expose subtle and pesky bugs where we were implicitly relying on the ordering of messages. So I'd rather just do them all.

Still mulling over some of your other comments.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #6)
> (In reply to Botond Ballo [:botond] from comment #4)
> > Another mitigating factor here is that we should be able to only put the
> > hit-test confirmation into the queue in the "layers dependent" confirmation
> > case, just like how with non-WR the confirmation only waits for the next
> > layer tree update in that case.
> 
> I don't think this is the case - in my example even if the event didn't
> trigger a layerization on the main thread (and therefore didn't go into the
> "layers dependent" case), the response to APZ would be in terms of scene B.
> And since APZ is "currently on" scene A, the response would still need to go
> into the queue to be processed after APZ has been updated to scene B. I'm
> pretty sure this is going to be the case for any messages sent from the main
> thread to APZ. Or at least enough of them that queuing only those will
> expose subtle and pesky bugs where we were implicitly relying on the
> ordering of messages. So I'd rather just do them all.

Isn't it also true today (pre-WR) that when we send a hit test confirmation that's not layers-dependent, that hit test was performed on a "scene" (display list) that may be newer than the one APZ's information is based on?
(In reply to Botond Ballo [:botond] from comment #7)
> Isn't it also true today (pre-WR) that when we send a hit test confirmation
> that's not layers-dependent, that hit test was performed on a "scene"
> (display list) that may be newer than the one APZ's information is based on?

Yes, but by the time the result of that hit-test is interpreted by APZ, APZ is guaranteed to be on the same scene that the hit-test was done against. The result of the hit-test can never overtake a layer tree on the path from the main-thread to APZ.
So we discussed this some more today and the main conclusions were:
- the latency should not really increase if we queue the main-thread messages in order with respect to the APZ updates, so that's not a concern
- we should move the APZ's "compositor thread" (i.e. the thread that gives it layer updates and that queries the transform) onto the render backend thread when WR is enabled
- messages from the main thread -> APZ should get delivered to APZ after the preceding display list's scene is swapped in, on the render backend thread.

One way of doing this would be to have a list of buckets, where each bucket contains a list of messages. Any message from the main thread -> APZ would go into the last bucket on the list, and we'd start a new bucket any time we get a new display list (or more generally, any time APZ would have gotten an UpdateHitTestingTree call). A pointer to the bucket that gets started would get attached to the display list we send to WR, and at the time of the scene swap on the render backend thread, we can call back into C++ code with the pointer, and all the messages in that bucket would get applied. We'd probably want a render backend thread API to re-read from the bucket as well, as we might get messages from the main thread during an otherwise "idle" state (no inflight scenes being built, and no pending messages).
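A rough sketch of the bucket list described above (hypothetical names, not actual gecko code):

```cpp
// Hypothetical sketch of the "bucket list": main-thread -> APZ messages go
// into the newest bucket; a bucket's messages are only run once the scene
// whose display list started that bucket has been swapped in.
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

using ApzMessage = std::function<void()>;

class ApzUpdaterQueue {
 public:
  // Called whenever a new display list is sent to WR; the returned id is
  // attached to the transaction so the scene swap can name the bucket.
  uint64_t StartNewBucket() {
    mBuckets.push_back({++mNextId, {}});
    return mNextId;
  }

  // Any main-thread -> APZ message goes into the most recent bucket.
  void Enqueue(ApzMessage aMsg) {
    if (mBuckets.empty()) {
      StartNewBucket();
    }
    mBuckets.back().messages.push_back(std::move(aMsg));
  }

  // Invoked (on the render backend thread) when the scene for bucket aId is
  // swapped in: run every queued message up to and including that bucket.
  void RunUpToBucket(uint64_t aId) {
    while (!mBuckets.empty() && mBuckets.front().id <= aId) {
      for (auto& msg : mBuckets.front().messages) {
        msg();
      }
      mBuckets.pop_front();
    }
  }

 private:
  struct Bucket {
    uint64_t id;
    std::vector<ApzMessage> messages;
  };
  std::deque<Bucket> mBuckets;
  uint64_t mNextId = 0;
};
```

The "re-read from the bucket" API mentioned above would just be another call into RunUpToBucket with the current bucket id during an otherwise idle state.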
Assignee: kvark → bugmail
Ran into another problem with this (or at least, the way I had it planned in my head).

Right now, we do hit tests against WR from the APZ controller thread while holding the APZ tree lock [1]. This hit-test blocks the controller thread on the render backend thread while it waits for a response [2]. Considering that we are now planning to run operations like UpdateHitTestingTree on the render backend thread, and UpdateHitTestingTree requires holding the tree lock [3], it will be possible to get into a deadlock situation.

For example, say there's a scene build in flight on the scene builder thread. And at the same time we get an input event on the controller thread, so we grab the tree lock and request a hit-test from WR. Before the hit-test request is processed by the RB thread, the scene build finishes and the new scene is delivered to the RB thread. This in turn will trigger the UpdateHitTestingTree call, but the tree lock is already held by the controller thread. So we have the RB thread blocked on acquiring the tree lock, which is held by the controller thread, which is blocked on a response to the hit-test from the RB thread --> deadlock.

I'll think about this some and see if I can figure out a way to resolve it.

[1] https://searchfox.org/mozilla-central/rev/b29daa46443b30612415c35be0a3c9c13b9dc5f6/gfx/layers/apz/src/APZCTreeManager.cpp#2360,2366
[2] https://searchfox.org/mozilla-central/rev/877c99c523a054419ec964d4dfb3f0cadac9d497/gfx/webrender_api/src/api.rs#911,923
[3] https://searchfox.org/mozilla-central/rev/b29daa46443b30612415c35be0a3c9c13b9dc5f6/gfx/layers/apz/src/APZCTreeManager.cpp#322
I guess the solution here is to acquire the tree lock from the scene builder thread *before* it sends the result back to the RB thread. This will effectively block until any in-flight hit-tests are completed, and once the lock is acquired, block any subsequent hit-tests from starting. With the tree lock held, it can do the scene swap and run UpdateHitTestingTree on the RB thread, and then release the lock. That sounds a little convoluted, but it should make the scene swap + UpdateHitTestingTree run atomically, which is the key characteristic we want in order to avoid the deadlock.
Comment 11 is only half-right. I wasn't thinking clearly when I wrote that, because we can't run UpdateHitTestingTree on the RB thread if we grabbed the lock from the scene builder. We actually need the scene builder thread to become the APZ sampler thread, rather than the render backend thread. I need to think about this a bit more to convince myself, though.
I discussed this with Botond today. The current plan (still needs some more thinking about) is to:

- Split the current APZ "sampler thread" notion (introduced in bug 1441916) into a more limited "sampler thread" (which actually has just the sampling functions) and an "updater thread" which does all the APZ updates. The sampler thread would be the render backend thread, and the updater thread would be the scene builder thread.
- Have callbacks into C++ code around the scene swap on the scene builder thread. The callback before the scene swap would lock the APZ tree, and the callback after would unlock it. One of the two callbacks would also do the hit-testing tree update (inside the lock).
- Ensure that the render backend thread never acquires the tree lock. All the sampling functions that are done on this thread would have to query APZ without picking up the tree lock. For now we can probably just continue using the current mechanism of "pushing" the APZ state to WR from the compositor thread as part of GenerateFrame, which accomplishes the goal of getting APZ data to the RB thread without holding the tree lock on RB. This has a problem, though - the set of APZCs for which the compositor thread sends state might be out-of-date by the time the GenerateFrame is processed on the RB thread. So a better long-term solution would be to create a secondary data structure in APZCTreeManager which is a map from ScrollableLayerGuid->APZC with its own lock. This lock can be acquired after the tree lock but not vice-versa; this will allow the map to be updated as part of the hit-testing tree update, and the map can still be queried from the render backend thread for sampling purposes.
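The guid->APZC map with its own lock could look something like this (all names hypothetical; ScrollableLayerGuid simplified to an integer for the sketch):

```cpp
// Hypothetical sketch of the secondary ScrollableLayerGuid -> APZC map.
// Its lock is only ever acquired *after* the tree lock (never the reverse),
// so the render backend thread can query it for sampling without ever
// touching the tree lock.
#include <cstdint>
#include <map>
#include <memory>
#include <mutex>
#include <utility>

struct AsyncPanZoomController { /* per-scrollframe APZ state */ };
using ScrollableLayerGuid = uint64_t;  // simplified stand-in

class ApzcMap {
 public:
  // Called during the hit-testing tree update on the updater thread, with
  // the tree lock already held; mMapLock nests inside the tree lock.
  void Update(std::map<ScrollableLayerGuid,
                       std::shared_ptr<AsyncPanZoomController>> aNew) {
    std::lock_guard<std::mutex> lock(mMapLock);
    mMap = std::move(aNew);
  }

  // Called from the render backend thread for sampling; only mMapLock is
  // taken, so no deadlock with the tree lock is possible.
  std::shared_ptr<AsyncPanZoomController> Get(ScrollableLayerGuid aGuid) {
    std::lock_guard<std::mutex> lock(mMapLock);
    auto it = mMap.find(aGuid);
    return it == mMap.end() ? nullptr : it->second;
  }

 private:
  std::mutex mMapLock;
  std::map<ScrollableLayerGuid,
           std::shared_ptr<AsyncPanZoomController>> mMap;
};
```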


For completeness, we also considered (and discarded) some other approaches. I forget all of them, here's one:
- Making the WR hit-test async, or doing it without holding the tree lock. These have the problem that by the time we actually use the guid returned by the hit-test, the hit-testing tree might have been updated by the other thread, and so the guid is garbage. We could associate a hit result with a scene generation and retry the hit-test if the generation doesn't match, but it's possible for this scenario to repeat over and over, which leads to increasingly hacky solutions.
Just posting this here; it outlines the relevant threads and locks that are going to be used with APZ and async scene building in WR, and the order in which locks can be acquired / which threads can block on which other threads. The "total dependency ordering" at the end has the final dependency chain: the UI main thread is allowed to block on the GPU main thread, which is allowed to block on the compositor thread, which is allowed to block on the scene builder thread, which is allowed to block on the render backend thread.
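Expressed as a rank check, the blocking order above boils down to something like this (sketch only, not real code):

```cpp
// Sketch only: the total dependency ordering expressed as thread ranks.
// A thread may block on another only if the target sits strictly further
// down the chain (higher rank); that makes a blocking cycle impossible.
enum ThreadRank {
  UIMain = 0,
  GPUMain = 1,
  Compositor = 2,
  SceneBuilder = 3,
  RenderBackend = 4,
};

bool MayBlockOn(ThreadRank aBlocked, ThreadRank aTarget) {
  return aTarget > aBlocked;
}
```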
Comment on attachment 8969362 [details]
Bug 1391318 - Automatically enable async scene building with webrender.all.

https://reviewboard.mozilla.org/r/238104/#review243866
Attachment #8969362 - Flags: review?(jmuizelaar) → review+
Pushed by kgupta@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c3da365d2916
Automatically enable async scene building with webrender.all. r=jrmuizel
https://hg.mozilla.org/mozilla-central/rev/c3da365d2916
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61