Bug 1737725 Comment 11 Edit History

Original comment by

on 2021-12-12 15:48:14 PST

Using Try pushes with added logging, I was able to arrive at a diagnosis here.

It appears that it's possible for WebRender to perform a scene build at an inopportune time, such that a hit-test manages to observe the state of the scene after the new scene (containing the latest **main-thread** scroll offsets) has been swapped in, but before async properties (containing the latest **APZ** scroll offsets) have been sampled. In effect, the hit test observes a state that is not ever composited.

I believe the relevant code is in `RenderBackend::process_transaction()`, the function that runs on the render backend thread to accept the result of a scene build and perform the scene swap.

* Coming into this function, the scene builder thread is holding the APZ tree lock. (That is acquired by the `pre_scene_swap` hook called [here](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/scene_builder_thread.rs#685), and remains held while the scene builder thread waits for the render backend thread to perform the scene swap [here](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/scene_builder_thread.rs#717).)
* The scene swap is then performed [here](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/render_backend.rs#792).
* The hit tester is rebuilt [here](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/render_backend.rs#824), but note that async properties have not been sampled and applied to the new scene yet (that will happen later, in `update_document()`).
* The render backend thread [signals](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/render_backend.rs#829) the scene builder thread that it can invoke the `post_scene_swap` hook, and [waits](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/render_backend.rs#834) for it to do so. When the render backend thread [calls](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/scene_builder_thread.rs#719) the `post_scene_swap` hook, the APZ tree lock is released.
* Finally, the scene builder thread calls `update_document()` [here](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/render_backend.rs#852), which samples async properties [here](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/render_backend.rs#1278) and rebuilds the hit tester again [here](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/render_backend.rs#1450).

Note that there is a period of time where the APZ tree lock has been released, but async properties have not yet been sampled nor the hit-tester rebuilt to account for them. A hit-test can occur in this interval, and observe the main-thread scroll offset.

I believe this is a regression from bug 1580178, which made hit-tests not block on the render thread. Prior to that bug, a hit test that did occur in the mentioned interval would be enqueued onto the render backend thread's event loop (as a `FrameMsg::HitTest`), and be processed only after `process_transaction()` (including the `update_document()` call and its async sampling) is complete.

I also believe this is a general correctness bug affecting production scenarios (not just tests), and may explain occasional hit testing misbehaviour that we've observed.

I thought a bit about how to fix this. Holding the APZ tree lock through the `update_document()` call doesn't work, because the render backend thread needs to acquire the same lock to perform the async sampling.

That leaves me with two ideas for potential fix approaches:

1. After the scene swap, put the hit tester into an invalid state where hit test requests are blocked until the hit tester is rebuilt subsequent to the async property sampling in `update_document()`. This would partially regress bug 1580178 in that it would make hit tests sometimes block again, but presumably the blocking would be considerably less frequent than before (it would only happen around a scene swap, and not when the render backend thread is busy for other reasons).

2. Remember the result of the last async sample before the scene swap, and apply it immediately to the new scene before releasing the APZ tree lock. We would have to take some care here as we're interpreting a sample that applied to a previous scene and applying it to the new scene, but e.g. async scroll offsets are attached to scroll ids which persist across scenes so that should work. That way, the hit test observes something that does take into account async scrolling, even if it's from the last frame (which is fine).

Nical, do you have any thoughts on this? Do you have a preference for one of the above approaches, or foresee any problems?

Glenn, if you have any thoughts I'd be curious to hear them as well.

Revision 1 by

Botond Ballo [:botond]

on 2021-12-12 17:11:27 PST

Using Try pushes with added logging, I was able to arrive at a diagnosis here.

It appears that it's possible for WebRender to perform a scene build at an inopportune time, such that a hit-test manages to observe the state of the scene after the new scene (containing the latest **main-thread** scroll offsets) has been swapped in, but before async properties (containing the latest **APZ** scroll offsets) have been sampled. In effect, the hit test observes a state that is not ever composited.

I believe the relevant code is in `RenderBackend::process_transaction()`, the function that runs on the render backend thread to accept the result of a scene build and perform the scene swap.

* Coming into this function, the scene builder thread is holding the APZ tree lock. (That is acquired by the `pre_scene_swap` hook called [here](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/scene_builder_thread.rs#685), and remains held while the scene builder thread waits for the render backend thread to perform the scene swap [here](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/scene_builder_thread.rs#717).)
* The scene swap is then performed [here](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/render_backend.rs#792).
* The hit tester is rebuilt [here](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/render_backend.rs#824), but note that async properties have not been sampled and applied to the new scene yet (that will happen later, in `update_document()`).
* The render backend thread [signals](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/render_backend.rs#829) the scene builder thread that it can invoke the `post_scene_swap` hook, and [waits](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/render_backend.rs#834) for it to do so. When the render backend thread [calls](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/scene_builder_thread.rs#719) the `post_scene_swap` hook, the APZ tree lock is released.
* Finally, the render backend thread calls `update_document()` [here](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/render_backend.rs#852), which samples async properties [here](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/render_backend.rs#1278) and rebuilds the hit tester again [here](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/render_backend.rs#1450).

Note that there is a period of time where the APZ tree lock has been released, but async properties have not yet been sampled nor the hit-tester rebuilt to account for them. A hit-test can occur in this interval, and observe the main-thread scroll offset.

I believe this is a regression from bug 1580178, which made hit-tests not block on the render thread. Prior to that bug, a hit test that did occur in the mentioned interval would be enqueued onto the render backend thread's event loop (as a `FrameMsg::HitTest`), and be processed only after `process_transaction()` (including the `update_document()` call and its async sampling) is complete.

I also believe this is a general correctness bug affecting production scenarios (not just tests), and may explain occasional hit testing misbehaviour that we've observed.

I thought a bit about how to fix this. Holding the APZ tree lock through the `update_document()` call doesn't work, because the render backend thread needs to acquire the same lock to perform the async sampling.

That leaves me with two ideas for potential fix approaches:

1. After the scene swap, put the hit tester into an invalid state where hit test requests are blocked until the hit tester is rebuilt subsequent to the async property sampling in `update_document()`. This would partially regress bug 1580178 in that it would make hit tests sometimes block again, but presumably the blocking would be considerably less frequent than before (it would only happen around a scene swap, and not when the render backend thread is busy for other reasons).

2. Remember the result of the last async sample before the scene swap, and apply it immediately to the new scene before releasing the APZ tree lock. We would have to take some care here as we're interpreting a sample that applied to a previous scene and applying it to the new scene, but e.g. async scroll offsets are attached to scroll ids which persist across scenes so that should work. That way, the hit test observes something that does take into account async scrolling, even if it's from the last frame (which is fine).

Nical, do you have any thoughts on this? Do you have a preference for one of the above approaches, or foresee any problems?

Glenn, if you have any thoughts I'd be curious to hear them as well.

Revision 2 by

Botond Ballo [:botond]

on 2021-12-12 17:11:59 PST

Using Try pushes with added logging, I was able to arrive at a diagnosis here.

It appears that it's possible for WebRender to perform a scene build at an inopportune time, such that a hit-test manages to observe the state of the scene after the new scene (containing the latest **main-thread** scroll offsets) has been swapped in, but before async properties (containing the latest **APZ** scroll offsets) have been sampled. In effect, the hit test observes a state that is not ever composited.

I believe the relevant code is in `RenderBackend::process_transaction()`, the function that runs on the render backend thread to accept the result of a scene build and perform the scene swap.

* Coming into this function, the scene builder thread is holding the APZ tree lock. (That is acquired by the `pre_scene_swap` hook called [here](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/scene_builder_thread.rs#685), and remains held while the scene builder thread waits for the render backend thread to perform the scene swap [here](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/scene_builder_thread.rs#717).)
* The scene swap is then performed [here](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/render_backend.rs#792).
* The hit tester is rebuilt [here](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/render_backend.rs#824), but note that async properties have not been sampled and applied to the new scene yet (that will happen later, in `update_document()`).
* The render backend thread [signals](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/render_backend.rs#829) the scene builder thread that it can invoke the `post_scene_swap` hook, and [waits](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/render_backend.rs#834) for it to do so. When the render backend thread [calls](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/scene_builder_thread.rs#719) the `post_scene_swap` hook, the APZ tree lock is released.
* Finally, the render backend thread calls `update_document()` [here](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/render_backend.rs#852), which samples async properties [here](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/render_backend.rs#1278) and rebuilds the hit tester again [here](https://searchfox.org/mozilla-central/rev/93c6fe08a4ef08d4146a30c2c9494bf0f3bb42e0/gfx/wr/webrender/src/render_backend.rs#1450).

Note that there is a period of time where the APZ tree lock has been released, but async properties have not yet been sampled nor the hit-tester rebuilt to account for them. A hit-test can occur in this interval, and observe the main-thread scroll offset.

I believe this is a regression from bug 1580178, which made hit-tests not block on the render backend thread. Prior to that bug, a hit test that did occur in the mentioned interval would be enqueued onto the render backend thread's event loop (as a `FrameMsg::HitTest`), and be processed only after `process_transaction()` (including the `update_document()` call and its async sampling) is complete.

I also believe this is a general correctness bug affecting production scenarios (not just tests), and may explain occasional hit testing misbehaviour that we've observed.

I thought a bit about how to fix this. Holding the APZ tree lock through the `update_document()` call doesn't work, because the render backend thread needs to acquire the same lock to perform the async sampling.

That leaves me with two ideas for potential fix approaches:

1. After the scene swap, put the hit tester into an invalid state where hit test requests are blocked until the hit tester is rebuilt subsequent to the async property sampling in `update_document()`. This would partially regress bug 1580178 in that it would make hit tests sometimes block again, but presumably the blocking would be considerably less frequent than before (it would only happen around a scene swap, and not when the render backend thread is busy for other reasons).

2. Remember the result of the last async sample before the scene swap, and apply it immediately to the new scene before releasing the APZ tree lock. We would have to take some care here as we're interpreting a sample that applied to a previous scene and applying it to the new scene, but e.g. async scroll offsets are attached to scroll ids which persist across scenes so that should work. That way, the hit test observes something that does take into account async scrolling, even if it's from the last frame (which is fine).

Nical, do you have any thoughts on this? Do you have a preference for one of the above approaches, or foresee any problems?

Glenn, if you have any thoughts I'd be curious to hear them as well.