Closed Bug 1630781 Opened 3 months ago Closed 2 months ago

Sometimes we skip frames during autoscroll without going over our frame budget

Categories

(Core :: Graphics: WebRender, defect, P3)

Desktop
Unspecified
defect

Tracking

()

RESOLVED FIXED
mozilla77
Tracking Status
firefox77 --- fixed

People

(Reporter: jrmuizel, Assigned: kats)

References

(Depends on 1 open bug, Blocks 2 open bugs)

Details

(Whiteboard: wr-planning)

Attachments

(7 files, 1 obsolete file)

47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review

During HD 4600 investigations I've seen the situation where we skip a frame because we seem to early out from frame building.

Here's what we know:

  • This seems to always happen two vsyncs after we do a scene build.
  • The scene build happens close to finish around the time of a vsync.
  • In the frame after the vsync we get two calls to update_document.
  • the second frame is render_frame=true and has build_frame = true as usual
  • In the next frame we get a call to update_document(render_frame=true) but this time build_frame is false but requested_frame is true and this is the frame where we fail to render anything causing the frame skip. We don't set build_frame because frame_is_valid=true
Blocks: 1576637

Here's a log of scroll messages:

DebugMessage: set_scroll_origin (0,-4706.684) (0,-4721.2964)

DebugMessage: set_scroll_origin (0,-4721.2964) (0,-4735.9087)

DebugMessage: set_scroll_origin (0,-4735.9087) (0,-4750.521)

DebugMessage: set_scroll_origin (0,-4622.521) (0,-4765.1333)
DebugMessage: set_scroll_origin (0,-4765.1333) (0,-4779.7456)

DebugMessage: set_scroll_origin nochange (0,-4779.7456)

DebugMessage: set_scroll_origin (0,-4779.7456) (0,-4794.358)

You can see we double up in one frame and then skip in the next frame.

Whiteboard: wr-planning

Based on this I expect the following to be happening:

  1. Scene build
  2. Vsync, which advances the LastComposeTime in the vsync dispatcher
  3. WebRenderBridgeParent::MaybeGenerateFrame runs which advances the APZ "current" sample time
  4. update_document runs which samples from APZ here. Because the APZ frame delay is enabled, this actually returns values corresponding to the previous sample time (i.e. the vsync preceding the scene build). Nonetheless, I would expect this to provide updated scroll offsets to WR which should cause an actual frame build instead of bailing out with frame_is_valid=true. So why that's happening is an open question. But this will also advance the APZ animations to the "current" sample time, which is the LastComposeTime from step 2.
  5. update_document runs again, samples from APZ again. This "current" sample time is the same as before, but the APZ animations were advanced at the end of step 4, so this will return different scroll offsets from what was produced in step 4. And it will not advance APZ animations any further, because the new sample time hasn't changed.
  6. Vsync, which advances the LastComposeTime in the vsync dispatcher
  7. WebRenderBridgeParent::MaybeGenerateFrame runs which again advances the APZ "current" sample time to the LastComposeTime in step 6.
  8. Another update_document happens. Since step 5 didn't advance animations, the scroll offsets here will be the same as in step 5. And APZ animations will get advanced to the timestamp from step 6/7. Since this step doesn't produce new scroll offsets, it makes sense that WR will not repaint anything.

I thought that if, in step 5, we prevent APZ from producing any scroll offsets at all (via the patch I sent :jrmuizel) then it would make step 8 render properly. But I guess we're still missing something there. Would be worth trying to confirm all this by logging the sample time passed to APZCTreeManager::SampleForWebRender and correlating it with the data posted previously. Would also be worth trying to answer the open question in step 4.

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #2)

But I guess we're still missing something there.

We're actually not missing anything, I misunderstood Jeff's comment. The patch does work. The answer to the "open question" in step 4 is that frame_is_valid does become false, but since render_frame is also false, it produces a false build_frame value here. And that in turn means that frame_is_valid remains false until the second call to update_document, at which point the frame build will actually happen.

The patch I wrote prevents APZ from emitting modified scroll offsets in step 5, and since frame_is_valid is still false from the previous update_document in step 4, we still go ahead and build a new frame in step 5 even without more data from APZ. So I think this produces all the desired results and the patch can be landed with some cleanup.

Blocks: wr-intel
Priority: -- → P3
Hardware: Unspecified → Desktop
Version: unspecified → Trunk
Assignee: nobody → kats
Status: NEW → ASSIGNED

There appear to be some test failures. I started some retriggers. If they're legit they probably have to do with sampling multiple times with the same test timestamp.

They seem legit. Here's another try push with some better handling of the test timestamp: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=07f176db1ceb112058bca5c0534cfafda7ddec3c

I reproduced the test failure locally and think I understand what's going on. The problem is that the transaction that's filled out by APZ has both scroll offsets and other animation transforms (in particular for scrollbars and zoom and stuff). And those properties need to be set unconditionally, otherwise the pending properties don't match the current properties here and the properties effectively get erased from the frame build. This can result in mispositioned scrollbars and such.

This kind of throws a wrench in the works because it means we need to emit sampling data for some previous timestamp that we no longer still have data for. I'm going to think about this some more.

I came up with something that I thought would work, but upon further consideration I think it will regress the APZ frame delay mechanism. It will still frame-delay animations, but not user input. Anyway I kicked off a try push with that here while I think of other options.

Not really making a lot of progress here. It seems like there's an assumption baked into the frame-delay implementation that we only sample from APZ (via ApplyAsyncContentTransformToTree or SampleForWebRender) once per vsync interval, so it's safe to resample into the mComposited* fields after that happens, in order to set up for the next vsync interval. I think this assumption holds for the layers codepath but not the WR codepath. It doesn't seem possible to rearrange the existing blocks of code to make this work so I might have to think of an alternative implementation for the frame-delay mechanism that doesn't make this assumption.

Currently, the flow is something like this (assuming t0, t1, t2 are evenly spaced vsync times):

  • Vsync (at time t=t1)
  • Do sampling of APZ transforms (top part of SampleForWebRender or in ApplyAsyncContentTransformToTree for non-WR). This emits values that include animations at t=t0 and user input at the last call to SampleCompositedAsyncTransform which was also roughly at t=t0.
  • Attempt to advance APZ animations to time t=t2. This trickles down to AsyncPanZoomController::UpdateAnimation which does:
    • Call SampleCompositedAsyncTransform, snapshots async state from animations at t=t1 and all received user input into the mComposited* fields.
    • Advance animations to time t=t2

If we loop this over and over once per vsync everything works out consistently. The problem is if we get a second sampling (i.e. call to SampleForWebrender) within the same vsync interval. Let's say that happens within the t=t1 vsync interval described above. Then:

  • Do sampling of APZ transforms. This emits values that include animations at t=t1 and user input at the last call to SampleCompositedAsyncTransform which was earlier in this vsync interval.
  • Attempt to advance APZ animations to time t=t2. This trickles down to AsyncPanZoomController::UpdateAnimation which does:
    • Early exit since timestamp is the same as the last time

So the result of the sampling is different in that it includes more recently sampled state. And then on the next vsync interval this happens:

  • Vsync (at time t=t2)
  • Do sampling of APZ transforms (top part of SampleForWebRender or in ApplyAsyncContentTransformToTree for non-WR). This emits values that include animations at t=t1 and user input at the last call to SampleCompositedAsyncTransform which was partway between t1 and t2.
  • Attempt to advance APZ animations to time t=t3. This trickles down to AsyncPanZoomController::UpdateAnimation which does:
    • Call SampleCompositedAsyncTransform, snapshots async state from animations at t=t2 and all received user input into the mComposited* fields.
    • Advance animations to time t=t3

And so the result emitted here is the same as the result for the second sampling in the previous vsync interval. This produces visual jank because we update two frames' worth in a single frame and then "skip" the next frame.

The constraints here are that (1) in order to allow for multiple calls to SampleForWebRender within a vsync interval, we need to preserve the mComposited* values all the way through to the end of the vsync interval, rather than clobbering them at the end of the first call to SampleForWebRender. And that (2) in order to respect frame-delay, we want to snapshot the Metrics() into the mComposited* fields pretty early in the vsync interval. If we do it late, then we're capturing user input that arrived later and defeating the frame-delay mechanism. These constraints cannot be simultaneously satisfied while only having one set of mComposited* fields, so I wrote a set of patches that encapsulates those fields and allows an AsyncPanZoomController to have another one. This way it can snapshot the async state for the next frame while still continue providing the older snapshot during the current vsync interval.

Note that I had considered other simplifications (such as just caching the values emitted by SampleForWebRender) but it's possible to get a layers update between two samplings within a single vsync interval, and handling that case properly defeats those simplification attempts (because in that case the second sampling might have a different set of APZCs, and the async transform may have been reset, etc.).

Here's a try push with the patches: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=536c7281c179a958215858257284153fadac242d - they have the patches from bug 1627716 as a dependency as deleting the dynamic toolbar also removes some fiddling of the mComposited* fields.

Depends on: 1627716

https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&selectedJob=298879406&revision=9af24ef0cc11d3acb29d53e762bb879e3f6c5ae8 is looking better, had a silly error in the last try push. There's still one test failure here but I believe that's a bug in the test. Will get the patches ready for review soon.

Having to think about multiple codepaths adds complexity and it doesn't seem
like we're going to turn this pref back off anytime soon.

No functional change, since everything was safe already. But this propagates
the proof-of-lock a little further to make it more obvious that fields are
safely accessed.

Depends on D72040

No functional changes intended here, just code getting moved into a helper
class. Note that this patch folds RecalculateCompositedLayoutViewport into
ClampCompositedScrollOffset since there are no longer any independent callers
of the former function (as of bug 1627716).

Depends on D72041

The deque always has size 1, so this patch is functionally a no-op. It sets up
the usage of front() and back() to allow holding more than one item in a future
patch.

Depends on D72042

Again, functionally this is a no-op since instead of replacing the back()
element, we do an emplace_back() followed by a pop_front().

Depends on D72043

We have had to apply this same fix in other tests in the past, for example in
helper_hittest_basic.html.

Depends on D72044

Refer to bug 1630781 comment 11 for a more detailed explanation of the problem.
A quick summary is that prior to this patch, the sampling operation involved
these steps:

  • sample stored APZ state for composition
  • advance animations to next vsync
  • update stored APZ state to the one just computed
    When sampling occured twice within a vsync interval, the stored APZ state
    calculated at the end of the first sampling would be produced in the second
    sampling, resulting in non-smooth scrolling. Note that the second and third
    steps would only run once per vsync interval, but that was sufficient to cause
    the problem.

With this patch, the order of the steps is reordered:

  • update stored APZ state to that computed in the last vsync interval
  • advance animations to next vsync and save it in the queue
  • sample stored APZ state for composition
    With this ordering (and with the first two steps only running once per vsync
    interval), the third step now emits consistent data even if the steps are run
    multiple times in a vsync interval. It does mean that the mSampledState deque
    can have up to two items in it (front() being the state for the current vsync
    interval, and back() being the state computed for the next vsync interval).

Although this problem only affected the WR codepath in practice, it could in
theory occur with the non-WR codepath too, if we for some reason ran the
TransformShadowTree code multiple times in a vsync interval. This patch updates
both codepaths with the new order of steps.

Depends on D72045

Attachment #9141360 - Attachment is obsolete: true
Pushed by kgupta@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0e01cc6c7ba9
Eliminate the frame_delay pref, assume it true everywhere. r=botond,mstange
https://hg.mozilla.org/integration/autoland/rev/a0a7f8bd9b39
Ensure the mComposited* fields are accessed safely across threads. r=botond
https://hg.mozilla.org/integration/autoland/rev/e83a1f37cafd
Encapsulate the mComposited* variables into a class. r=botond
https://hg.mozilla.org/integration/autoland/rev/84d718cc1c95
Expand mSampledState into a deque. r=botond
https://hg.mozilla.org/integration/autoland/rev/bbce34f0afe3
Allow mSampledState to temporarily hold more than one state. r=botond
https://hg.mozilla.org/integration/autoland/rev/a9d62cd638a6
Adjust test to account for extra composite required for WR hit-test. r=botond
https://hg.mozilla.org/integration/autoland/rev/a8048daebd7e
Reorder operations to increase idempotency of sampling APZ state during a vsync interval. r=botond
Regressions: 1640060
Depends on: 1640060
No longer regressions: 1640060
You need to log in before you can comment on or make changes to this bug.