Closed Bug 1737725 Opened 3 years ago Closed 2 years ago

Intermittent gfx/layers/apz/test/mochitest/test_group_hittest-2.html | helper_hittest_bug1730606-2.html | root scroller (after <second> async scroll) scrollid - got 189, expected 187

Categories

(Core :: Panning and Zooming, defect, P2)

defect

Tracking

()

RESOLVED FIXED
113 Branch
Tracking Status
firefox-esr91 --- wontfix
firefox-esr102 --- wontfix
firefox95 --- wontfix
firefox96 --- wontfix
firefox97 --- wontfix
firefox111 --- wontfix
firefox112 --- wontfix
firefox113 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: botond)

References

(Blocks 2 open bugs, Regression)

Details

(Keywords: intermittent-failure, regression)

Attachments

(1 file)

Filed by: imoraru [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=355997283&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/KrajjYP1T_CZ8Ew0u9DesA/runs/0/artifacts/public/logs/live_backing.log
Reftest URL: https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/KrajjYP1T_CZ8Ew0u9DesA/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1


[task 2021-10-25T22:45:46.104Z] 22:45:46     INFO - TEST-PASS | gfx/layers/apz/test/mochitest/test_group_hittest-2.html | helper_hittest_bug1730606-2.html | root scroller (after async scroll) hit info 
[task 2021-10-25T22:45:46.104Z] 22:45:46     INFO - Buffered messages finished
[task 2021-10-25T22:45:46.105Z] 22:45:46     INFO - TEST-UNEXPECTED-FAIL | gfx/layers/apz/test/mochitest/test_group_hittest-2.html | helper_hittest_bug1730606-2.html | root scroller (after async scroll) scrollid - got 189, expected 187
[task 2021-10-25T22:45:46.105Z] 22:45:46     INFO -     SimpleTest.is@SimpleTest/SimpleTest.js:500:14
[task 2021-10-25T22:45:46.105Z] 22:45:46     INFO -     runSubtestsSeriallyInFreshWindows/</advanceSubtestExecution/spawnTest/w.is@gfx/layers/apz/test/mochitest/apz_test_utils.js:467:18
[task 2021-10-25T22:45:46.105Z] 22:45:46     INFO -     checkHitResult@gfx/layers/apz/test/mochitest/apz_test_utils.js:877:5
[task 2021-10-25T22:45:46.105Z] 22:45:46     INFO -     test@gfx/layers/apz/test/mochitest/helper_hittest_bug1730606-2.html:117:17
[task 2021-10-25T22:45:46.106Z] 22:45:46     INFO - TEST-PASS | gfx/layers/apz/test/mochitest/test_group_hittest-2.html | helper_hittest_bug1730606-2.html | root scroller (after async scroll) layersid 
Summary: Intermittent TV-fis gfx/layers/apz/test/mochitest/test_group_hittest-2.html | helper_hittest_bug1730606-2.html | root scroller (after async scroll) scrollid - got 189, expected 187 → Intermittent [tier 2] gfx/layers/apz/test/mochitest/test_group_hittest-2.html | helper_hittest_bug1730606-2.html | root scroller (after async scroll) scrollid - got 189, expected 187
Summary: Intermittent [tier 2] gfx/layers/apz/test/mochitest/test_group_hittest-2.html | helper_hittest_bug1730606-2.html | root scroller (after async scroll) scrollid - got 189, expected 187 → Intermittent [tier 2] gfx/layers/apz/test/mochitest/test_group_hittest-2.html | helper_hittest_bug1730606-2.html | root scroller (after <second> async scroll) scrollid - got 189, expected 187

botond, looks like you've poked at this test in the past -- do you know what might be going on here?

Looks like this is a pretty-frequent failure (and some of the failures are hidden by virtue of the fact that the test is skipped on a handful of platforms per this annotation.

Flags: needinfo?(botond)

I can investigate.

Assignee: nobody → botond
Flags: needinfo?(botond)

Summary of my investigation so far:

Here's what the test is doing around the point where it fails:

  1. Sets a simulated async scroll offset using nsIDOMWindowUtils.setAsyncScrollOffset()
  2. Manually ticks the refresh driver once to give WebRender a chance to sample the async scroll offset
  3. Performs a hit test by synthesizing a native MozMouseHittest event, which causes APZ to ask WebRender to do a hit test and log the result into the APZ test data
  4. Checks the result of the hit test by reading the hit-result from the APZ test data

The failure is that the hit-result the test sees does not reflect the async scroll offset.

I've confirmed the following:

  • The async scroll offset is set successfully and sampled by WebRender before the MozMouseHittest arrives, so the issue isn't related to the test sending the MozMouseHittest event too soon.
  • The outcome of the hit test on the APZ side is indeed the incorrect outcome, so the issue isn't related to the result not making it back to the test in time.

Based on the above, the issue somehow seems to be in WebRender itself: it samples an async scroll offset, and yet produces a hit test result that does not reflect that sampled offset.

(In reply to Botond Ballo [:botond] from comment #8)

Based on the above, the issue somehow seems to be in WebRender itself: it samples an async scroll offset, and yet produces a hit test result that does not reflect that sampled offset.

Continued investigation is showing that in between the sample and the hit-test, WR rebuilds the scene, which results in the sampled async offset being discarded and replaced with the non-async offset from Gecko.

The fact that a hit-test observes a scroll offset coming from Gecko without the async offset having been sampled again from APZ is concerning.

Using Try pushes with added logging, I was able to arrive at a diagnosis here.

It appears that it's possible for WebRender to perform a scene build at an inopportune time, such that a hit-test manages to observe the state of the scene after the new scene (containing the latest main-thread scroll offsets) has been swapped in, but before async properties (containing the latest APZ scroll offsets) have been sampled. In effect, the hit test observes a state that is not ever composited.

I believe the relevant code is in RenderBackend::process_transaction(), the function that runs on the render backend thread to accept the result of a scene build and perform the scene swap.

  • Coming into this function, the scene builder thread is holding the APZ tree lock. (That is acquired by the pre_scene_swap hook called here, and remains held while the scene builder thread waits for the render backend thread to perform the scene swap here.)
  • The scene swap is then performed here.
  • The hit tester is rebuilt here, but note that async properties have not been sampled and applied to the new scene yet (that will happen later, in update_document()).
  • The render backend thread signals the scene builder thread that it can invoke the post_scene_swap hook, and waits for it to do so. When the render backend thread calls the post_scene_swap hook, the APZ tree lock is released.
  • Finally, the render backend thread calls update_document() here, which samples async properties here and rebuilds the hit tester again here.

Note that there is a period of time where the APZ tree lock has been released, but async properties have not yet been sampled nor the hit-tester rebuilt to account for them. A hit-test can occur in this interval, and observe the main-thread scroll offset.

I believe this is a regression from bug 1580178, which made hit-tests not block on the render backend thread. Prior to that bug, a hit test that did occur in the mentioned interval would be enqueued onto the render backend thread's event loop (as a FrameMsg::HitTest), and be processed only after process_transaction() (including the update_document() call and its async sampling) is complete.

I also believe this is a general correctness bug affecting production scenarios (not just tests), and may explain occasional hit testing misbehaviour that we've observed.

I thought a bit about how to fix this. Holding the APZ tree lock through the update_document() call doesn't work, because the render backend thread needs to acquire the same lock to perform the async sampling.

That leaves me with two ideas for potential fix approaches:

  1. After the scene swap, put the hit tester into an invalid state where hit test requests are blocked until the hit tester is rebuilt subsequent to the async property sampling in update_document(). This would partially regress bug 1580178 in that it would make hit tests sometimes block again, but presumably the blocking would be considerably less frequent than before (it would only happen around a scene swap, and not when the render backend thread is busy for other reasons).

  2. Remember the result of the last async sample before the scene swap, and apply it immediately to the new scene before releasing the APZ tree lock. We would have to take some care here as we're interpreting a sample that applied to a previous scene and applying it to the new scene, but e.g. async scroll offsets are attached to scroll ids which persist across scenes so that should work. That way, the hit test observes something that does take into account async scrolling, even if it's from the last frame (which is fine).

Nical, do you have any thoughts on this? Do you have a preference for one of the above approaches, or foresee any problems?

Glenn, if you have any thoughts I'd be curious to hear them as well.

Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(gwatson)
Regressed by: 1580178
Has Regression Range: --- → yes

I suspect {bug 1734206, bug 1738440, bug 1742356, bug 1743071} have the same failure mode.

See Also: → 1742356, 1743071, 1738440, 1734206

I can't claim to understand the details or the subtleties of how this code works, so take any opinion from me with a grain of salt.

My initial reaction was that (1) sounds safer, but now I'm thinking that (2) probably does sound like it should just work, and is probably more the "correct" thing to do anyway, in terms of maintaining consistent internal state just after a new scene swap?

Flags: needinfo?(gwatson)
Severity: S4 → S3
Priority: P5 → P2

It occurred to me that, while option (2) would improve correctness, it would not be fully correct. The reason is that the main thread scroll offset can sometimes clobber an async scroll offset.

So, suppose we have display list 1 with main thread scroll offset M1, and an async scroll delta of Delta1, resulting in a composited scroll offset of A1 = (M1 + Delta1).

Now suppose we get a new display list with a new main thread scroll offset M2. Depending on what's going on in that transaction (e.g. the type of scroll offset update), that may clobber the async delta (resulting in A2 = M2), or it may not (resulting in A2 = M2 + Delta1).

So, if WebRender is just remembering the Delta1 from the previous async sample and blindly applying it to M2, that may or may not be correct. In an extreme case where Delta1 is large, M2 clobbers it, and M2 is far away from M1, (M2 + Delta1) may be very different from both the scroll offset composited the previous frame (M1 + Delta1), and the scroll offset that will be composited the next frame (M2).

I guess WebRender could compute Delta2 = Delta1 + (M1 - M2), such that the scroll offset it uses for hit testing in the affected interval is A2 = M2 + Delta2 = M2 + Delta1 + (M1 - M2) = M1 + Delta1. That corresponds to the actual composited scroll offset of the previous frame, which is reasonable to use for hit testing until we get a new sample for the new scene.

My understanding of this is very fuzzy. Would it make sense to:

  • Not re-build the hit-tester during scene swap and only do it when we sample in update_document ? (So we basically observe the previous state a bit longer during the time we would observe an incomplete/invalid hit tester)
  • Or sample during scene swap so that we have up to date deltas at all times? (is there data to sample at this point that corresponds to the new scene?)

Or if the delta fixup you mention in comment 15 is correct then that sounds good to me.

I'm not a huge fan of adding blocking synchronization for hit testing. It sounds like it would add unpredictable timing-deoendent input latency and main thread jank that would be hard to reproduce and debug reliably.

Flags: needinfo?(nical.bugzilla)

(In reply to Nicolas Silva [:nical] from comment #17)

My understanding of this is very fuzzy. Would it make sense to:

  • Not re-build the hit-tester during scene swap and only do it when we sample in update_document ? (So we basically observe the previous state a bit longer during the time we would observe an incomplete/invalid hit tester)

We can't do this. The post-sceneswap hook updates APZ's hit-testing tree to reflect the new scene, and the hit-test operation uses both APZ's hit-testing tree and WR's hit-tester. If the APZ hit-testing tree is based on the new scene but WR's hit-tester is based on the old scene, the results will be inconsistent (we have an assertion for this, and avoiding this inconsistency is the primary purpose of holding the APZ tree lock during the scene swap).

  • Or sample during scene swap so that we have up to date deltas at all times? (is there data to sample at this point that corresponds to the new scene?)

This idea is interesting. The sampling would have to be done after the APZ hit-testing tree update in the post-sceneswap hook, but perhaps the result could be returned as a result of the post-sceneswap hook, and then sent to the render backend thread as a result of this recv() call?

(In reply to Botond Ballo [:botond] [away until Jan 12] from comment #14)

It occurred to me that, while option (2) would improve correctness, it would not be fully correct. The reason is that the main thread scroll offset can sometimes clobber an async scroll offset.

So, suppose we have display list 1 with main thread scroll offset M1, and an async scroll delta of Delta1, resulting in a composited scroll offset of A1 = (M1 + Delta1).

Now suppose we get a new display list with a new main thread scroll offset M2. Depending on what's going on in that transaction (e.g. the type of scroll offset update), that may clobber the async delta (resulting in A2 = M2), or it may not (resulting in A2 = M2 + Delta1).

So, if WebRender is just remembering the Delta1 from the previous async sample and blindly applying it to M2, that may or may not be correct. In an extreme case where Delta1 is large, M2 clobbers it, and M2 is far away from M1, (M2 + Delta1) may be very different from both the scroll offset composited the previous frame (M1 + Delta1), and the scroll offset that will be composited the next frame (M2).

Wouldn't this error be fixed by the APZ scroll generation introduced in bug 1571758? FWIW, I managed to write a mochitest for bug 1571758, but the test is flaky since the same reason that the main-thread scroll offset clobbers the APZ scroll offset. So anyway I'd like to keep having sampled APZ offset in new scroll frame updated by the new disply list.

This started appearing as tier 1 here

Summary: Intermittent [tier 2] gfx/layers/apz/test/mochitest/test_group_hittest-2.html | helper_hittest_bug1730606-2.html | root scroller (after <second> async scroll) scrollid - got 189, expected 187 → Intermittent gfx/layers/apz/test/mochitest/test_group_hittest-2.html | helper_hittest_bug1730606-2.html | root scroller (after <second> async scroll) scrollid - got 189, expected 187

(In reply to Hiroyuki Ikezoe (:hiro) from comment #22)

Wouldn't this error be fixed by the APZ scroll generation introduced in bug 1571758?

Sorry, I've been meaning to respond to this for a while but I kept alternately forgetting, or being distracted by something before I could think through it fully.

I believe the work in bug 1571758 does not solve this issue, nor otherwise affect it very much. The issue remains that a hit test can observe ScrollFrameInfo.offsets when it has this value (or this one) -- i.e. with just a single element, whose offset is the main thread offset without any async delta, before set_scroll_offsets has been called to give it the result of sampling APZ offsets.

Brain-dump of my current thoughts on this bug:


First, the high-level approach. I think two of the discussed approaches are potentially viable:

  1. Approach (2) from comment 11, refined by comment 15. Basically, carry over the most recent composited scroll offsets from the previous scene and use them until we can get a sample for the new scene.
  2. The second idea discussed in comment 18, i.e. get a sample for the new scene before releasing the tree lock.

The first one seems simpler to implement, so I suggest starting with that, and leaving the second approach as a fallback if the first doesn't work out for some reason.


Some notes on how the suggested approach could be implemented:

  • In RenderBackend::process_transaction(), before updating the spatial tree, walk it and pull the most recently sampled scroll offsets off of each scroll frame node. The result can be a map from ExternalScrollId to Vec<SampledScrollOffset>. (The value type is a vector because, as of bug 1571758, we retain multiple samples for each scroll frame. This is a slight complication but I think it actually helps us later.)
  • After updating the spatial tree but before rebuilding the hit tester, iterate through the map, and apply the old samples to the updated spatial tree.
    • As a first pass, I think the "apply" step can take the form of a simple Document::set_scroll_offsets() call. Note that this basically implements the intent of comment 15 without doing any calculation: the samples stored in SampledScrollOffset are already the "combined" scroll offsets (M_x + Delta_x), so if the intent is to use the combined scroll offset that was last sampled, that's what we're doing.
    • If we want to try and be a little more accurate, we could do the following instead:
      • Iterate the samples to see if there's one whose generation matches the scroll frame node's new ScrollFrameInfo::offset_generation.
      • If so, keep just that sample. This is the ideal case, as that's exactly what we want to be compositing in the next frame anyways.
      • If not, maybe keep just the most recent (last) sample? Given that the number of samples is typically two, and we typically use the older (one-frame-delayed) one for compositing, the newer one from the previous frame is most likely to be the older one from the next frame and therefore what will be composited the next frame.

(In reply to Botond Ballo [:botond] from comment #46)

  • If we want to try and be a little more accurate, we could do the following instead:
    • Iterate the samples to see if there's one whose generation matches the scroll frame node's new ScrollFrameInfo::offset_generation.
    • If so, keep just that sample. This is the ideal case, as that's exactly what we want to be compositing in the next frame anyways.

(I now better understand what Hiro meant in comment 22. The scroll generation added in bug 1571758 is actually helpful, for this reason.)

Blocks: 1776971

I have a WIP patch for this but at the moment it seems to just be making things worse: in addition to appearing not to fix the intermittent in bug 1776971, it's also causing test_wheel_transactions.html to fail. I'm continuing to investigate.

I think I figured out my problem here: my patch is trying to round-trip scroll offsets from the spatial tree to a temporary storage location and back, but the "setter" part of the round-trip is doing more than just setting the values.

When a new scene is swapped in on the render backend thread, its
scroll frames have scroll offsets that come from the main thread
and do not reflect async scroll deltas until such deltas are
sampled from APZ.

It's possible for hit-testing to observe the scene in this
temporary state, potentially leading to incorrect hit-test results.

To avoid this, save the async offsets from the previous scene
and apply them to the new scene until we can sample proper offsets
from APZ.

Pushed by bballo@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/22e246050235 Ensure hit-testing never observes scroll offsets that don't have async deltas applies. r=hiro,gw

Should be fixed in the updated version of the patch on Phabricator, but I pushed to Try for good measure.

Flags: needinfo?(botond)
Pushed by bballo@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/5ac55813344e Ensure hit-testing never observes scroll offsets that don't have async deltas applies. r=hiro,gw
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 113 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: