Intermittent gfx/layers/apz/test/mochitest/test_group_hittest-2.html | helper_hittest_bug1730606-2.html | root scroller (after <second> async scroll) scrollid - got 189, expected 187
Categories
(Core :: Panning and Zooming, defect, P2)
Tracking
()
People
(Reporter: intermittent-bug-filer, Assigned: botond)
References
(Blocks 2 open bugs, Regression)
Details
(Keywords: intermittent-failure, regression)
Attachments
(1 file)
Filed by: imoraru [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=355997283&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/KrajjYP1T_CZ8Ew0u9DesA/runs/0/artifacts/public/logs/live_backing.log
Reftest URL: https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/KrajjYP1T_CZ8Ew0u9DesA/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1
[task 2021-10-25T22:45:46.104Z] 22:45:46 INFO - TEST-PASS | gfx/layers/apz/test/mochitest/test_group_hittest-2.html | helper_hittest_bug1730606-2.html | root scroller (after async scroll) hit info
[task 2021-10-25T22:45:46.104Z] 22:45:46 INFO - Buffered messages finished
[task 2021-10-25T22:45:46.105Z] 22:45:46 INFO - TEST-UNEXPECTED-FAIL | gfx/layers/apz/test/mochitest/test_group_hittest-2.html | helper_hittest_bug1730606-2.html | root scroller (after async scroll) scrollid - got 189, expected 187
[task 2021-10-25T22:45:46.105Z] 22:45:46 INFO - SimpleTest.is@SimpleTest/SimpleTest.js:500:14
[task 2021-10-25T22:45:46.105Z] 22:45:46 INFO - runSubtestsSeriallyInFreshWindows/</advanceSubtestExecution/spawnTest/w.is@gfx/layers/apz/test/mochitest/apz_test_utils.js:467:18
[task 2021-10-25T22:45:46.105Z] 22:45:46 INFO - checkHitResult@gfx/layers/apz/test/mochitest/apz_test_utils.js:877:5
[task 2021-10-25T22:45:46.105Z] 22:45:46 INFO - test@gfx/layers/apz/test/mochitest/helper_hittest_bug1730606-2.html:117:17
[task 2021-10-25T22:45:46.106Z] 22:45:46 INFO - TEST-PASS | gfx/layers/apz/test/mochitest/test_group_hittest-2.html | helper_hittest_bug1730606-2.html | root scroller (after async scroll) layersid
Updated•3 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•3 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 6•3 years ago
|
||
botond, looks like you've poked at this test in the past -- do you know what might be going on here?
Looks like this is a pretty-frequent failure (and some of the failures are hidden by virtue of the fact that the test is skipped on a handful of platforms per this annotation.
Assignee | ||
Comment 7•3 years ago
|
||
I can investigate.
Assignee | ||
Comment 8•3 years ago
•
|
||
Summary of my investigation so far:
Here's what the test is doing around the point where it fails:
- Sets a simulated async scroll offset using
nsIDOMWindowUtils.setAsyncScrollOffset()
- Manually ticks the refresh driver once to give WebRender a chance to sample the async scroll offset
- Performs a hit test by synthesizing a native
MozMouseHittest
event, which causes APZ to ask WebRender to do a hit test and log the result into the APZ test data - Checks the result of the hit test by reading the hit-result from the APZ test data
The failure is that the hit-result the test sees does not reflect the async scroll offset.
I've confirmed the following:
- The async scroll offset is set successfully and sampled by WebRender before the
MozMouseHittest
arrives, so the issue isn't related to the test sending theMozMouseHittest
event too soon. - The outcome of the hit test on the APZ side is indeed the incorrect outcome, so the issue isn't related to the result not making it back to the test in time.
Based on the above, the issue somehow seems to be in WebRender itself: it samples an async scroll offset, and yet produces a hit test result that does not reflect that sampled offset.
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 10•3 years ago
|
||
(In reply to Botond Ballo [:botond] from comment #8)
Based on the above, the issue somehow seems to be in WebRender itself: it samples an async scroll offset, and yet produces a hit test result that does not reflect that sampled offset.
Continued investigation is showing that in between the sample and the hit-test, WR rebuilds the scene, which results in the sampled async offset being discarded and replaced with the non-async offset from Gecko.
The fact that a hit-test observes a scroll offset coming from Gecko without the async offset having been sampled again from APZ is concerning.
Assignee | ||
Comment 11•3 years ago
•
|
||
Using Try pushes with added logging, I was able to arrive at a diagnosis here.
It appears that it's possible for WebRender to perform a scene build at an inopportune time, such that a hit-test manages to observe the state of the scene after the new scene (containing the latest main-thread scroll offsets) has been swapped in, but before async properties (containing the latest APZ scroll offsets) have been sampled. In effect, the hit test observes a state that is not ever composited.
I believe the relevant code is in RenderBackend::process_transaction()
, the function that runs on the render backend thread to accept the result of a scene build and perform the scene swap.
- Coming into this function, the scene builder thread is holding the APZ tree lock. (That is acquired by the
pre_scene_swap
hook called here, and remains held while the scene builder thread waits for the render backend thread to perform the scene swap here.) - The scene swap is then performed here.
- The hit tester is rebuilt here, but note that async properties have not been sampled and applied to the new scene yet (that will happen later, in
update_document()
). - The render backend thread signals the scene builder thread that it can invoke the
post_scene_swap
hook, and waits for it to do so. When the render backend thread calls thepost_scene_swap
hook, the APZ tree lock is released. - Finally, the render backend thread calls
update_document()
here, which samples async properties here and rebuilds the hit tester again here.
Note that there is a period of time where the APZ tree lock has been released, but async properties have not yet been sampled nor the hit-tester rebuilt to account for them. A hit-test can occur in this interval, and observe the main-thread scroll offset.
I believe this is a regression from bug 1580178, which made hit-tests not block on the render backend thread. Prior to that bug, a hit test that did occur in the mentioned interval would be enqueued onto the render backend thread's event loop (as a FrameMsg::HitTest
), and be processed only after process_transaction()
(including the update_document()
call and its async sampling) is complete.
I also believe this is a general correctness bug affecting production scenarios (not just tests), and may explain occasional hit testing misbehaviour that we've observed.
I thought a bit about how to fix this. Holding the APZ tree lock through the update_document()
call doesn't work, because the render backend thread needs to acquire the same lock to perform the async sampling.
That leaves me with two ideas for potential fix approaches:
-
After the scene swap, put the hit tester into an invalid state where hit test requests are blocked until the hit tester is rebuilt subsequent to the async property sampling in
update_document()
. This would partially regress bug 1580178 in that it would make hit tests sometimes block again, but presumably the blocking would be considerably less frequent than before (it would only happen around a scene swap, and not when the render backend thread is busy for other reasons). -
Remember the result of the last async sample before the scene swap, and apply it immediately to the new scene before releasing the APZ tree lock. We would have to take some care here as we're interpreting a sample that applied to a previous scene and applying it to the new scene, but e.g. async scroll offsets are attached to scroll ids which persist across scenes so that should work. That way, the hit test observes something that does take into account async scrolling, even if it's from the last frame (which is fine).
Nical, do you have any thoughts on this? Do you have a preference for one of the above approaches, or foresee any problems?
Glenn, if you have any thoughts I'd be curious to hear them as well.
Updated•3 years ago
|
Assignee | ||
Comment 12•3 years ago
|
||
I suspect {bug 1734206, bug 1738440, bug 1742356, bug 1743071} have the same failure mode.
Comment 13•3 years ago
|
||
I can't claim to understand the details or the subtleties of how this code works, so take any opinion from me with a grain of salt.
My initial reaction was that (1) sounds safer, but now I'm thinking that (2) probably does sound like it should just work, and is probably more the "correct" thing to do anyway, in terms of maintaining consistent internal state just after a new scene swap?
Assignee | ||
Updated•3 years ago
|
Assignee | ||
Updated•3 years ago
|
Assignee | ||
Comment 14•3 years ago
|
||
It occurred to me that, while option (2) would improve correctness, it would not be fully correct. The reason is that the main thread scroll offset can sometimes clobber an async scroll offset.
So, suppose we have display list 1 with main thread scroll offset M1, and an async scroll delta of Delta1, resulting in a composited scroll offset of A1 = (M1 + Delta1).
Now suppose we get a new display list with a new main thread scroll offset M2. Depending on what's going on in that transaction (e.g. the type of scroll offset update), that may clobber the async delta (resulting in A2 = M2), or it may not (resulting in A2 = M2 + Delta1).
So, if WebRender is just remembering the Delta1 from the previous async sample and blindly applying it to M2, that may or may not be correct. In an extreme case where Delta1 is large, M2 clobbers it, and M2 is far away from M1, (M2 + Delta1) may be very different from both the scroll offset composited the previous frame (M1 + Delta1), and the scroll offset that will be composited the next frame (M2).
Assignee | ||
Comment 15•3 years ago
|
||
I guess WebRender could compute Delta2 = Delta1 + (M1 - M2), such that the scroll offset it uses for hit testing in the affected interval is A2 = M2 + Delta2 = M2 + Delta1 + (M1 - M2) = M1 + Delta1. That corresponds to the actual composited scroll offset of the previous frame, which is reasonable to use for hit testing until we get a new sample for the new scene.
Comment hidden (Intermittent Failures Robot) |
Updated•3 years ago
|
Comment 17•3 years ago
|
||
My understanding of this is very fuzzy. Would it make sense to:
- Not re-build the hit-tester during scene swap and only do it when we sample in
update_document
? (So we basically observe the previous state a bit longer during the time we would observe an incomplete/invalid hit tester) - Or sample during scene swap so that we have up to date deltas at all times? (is there data to sample at this point that corresponds to the new scene?)
Or if the delta fixup you mention in comment 15 is correct then that sounds good to me.
I'm not a huge fan of adding blocking synchronization for hit testing. It sounds like it would add unpredictable timing-deoendent input latency and main thread jank that would be hard to reproduce and debug reliably.
Assignee | ||
Comment 18•3 years ago
|
||
(In reply to Nicolas Silva [:nical] from comment #17)
My understanding of this is very fuzzy. Would it make sense to:
- Not re-build the hit-tester during scene swap and only do it when we sample in
update_document
? (So we basically observe the previous state a bit longer during the time we would observe an incomplete/invalid hit tester)
We can't do this. The post-sceneswap hook updates APZ's hit-testing tree to reflect the new scene, and the hit-test operation uses both APZ's hit-testing tree and WR's hit-tester. If the APZ hit-testing tree is based on the new scene but WR's hit-tester is based on the old scene, the results will be inconsistent (we have an assertion for this, and avoiding this inconsistency is the primary purpose of holding the APZ tree lock during the scene swap).
- Or sample during scene swap so that we have up to date deltas at all times? (is there data to sample at this point that corresponds to the new scene?)
This idea is interesting. The sampling would have to be done after the APZ hit-testing tree update in the post-sceneswap hook, but perhaps the result could be returned as a result of the post-sceneswap hook, and then sent to the render backend thread as a result of this recv() call?
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•3 years ago
|
Comment 22•3 years ago
|
||
(In reply to Botond Ballo [:botond] [away until Jan 12] from comment #14)
It occurred to me that, while option (2) would improve correctness, it would not be fully correct. The reason is that the main thread scroll offset can sometimes clobber an async scroll offset.
So, suppose we have display list 1 with main thread scroll offset M1, and an async scroll delta of Delta1, resulting in a composited scroll offset of A1 = (M1 + Delta1).
Now suppose we get a new display list with a new main thread scroll offset M2. Depending on what's going on in that transaction (e.g. the type of scroll offset update), that may clobber the async delta (resulting in A2 = M2), or it may not (resulting in A2 = M2 + Delta1).
So, if WebRender is just remembering the Delta1 from the previous async sample and blindly applying it to M2, that may or may not be correct. In an extreme case where Delta1 is large, M2 clobbers it, and M2 is far away from M1, (M2 + Delta1) may be very different from both the scroll offset composited the previous frame (M1 + Delta1), and the scroll offset that will be composited the next frame (M2).
Wouldn't this error be fixed by the APZ scroll generation introduced in bug 1571758? FWIW, I managed to write a mochitest for bug 1571758, but the test is flaky since the same reason that the main-thread scroll offset clobbers the APZ scroll offset. So anyway I'd like to keep having sampled APZ offset in new scroll frame updated by the new disply list.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 33•3 years ago
|
||
This started appearing as tier 1 here
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 45•2 years ago
|
||
(In reply to Hiroyuki Ikezoe (:hiro) from comment #22)
Wouldn't this error be fixed by the APZ scroll generation introduced in bug 1571758?
Sorry, I've been meaning to respond to this for a while but I kept alternately forgetting, or being distracted by something before I could think through it fully.
I believe the work in bug 1571758 does not solve this issue, nor otherwise affect it very much. The issue remains that a hit test can observe ScrollFrameInfo.offsets
when it has this value (or this one) -- i.e. with just a single element, whose offset is the main thread offset without any async delta, before set_scroll_offsets
has been called to give it the result of sampling APZ offsets.
Assignee | ||
Comment 46•2 years ago
|
||
Brain-dump of my current thoughts on this bug:
First, the high-level approach. I think two of the discussed approaches are potentially viable:
- Approach (2) from comment 11, refined by comment 15. Basically, carry over the most recent composited scroll offsets from the previous scene and use them until we can get a sample for the new scene.
- The second idea discussed in comment 18, i.e. get a sample for the new scene before releasing the tree lock.
The first one seems simpler to implement, so I suggest starting with that, and leaving the second approach as a fallback if the first doesn't work out for some reason.
Some notes on how the suggested approach could be implemented:
- In
RenderBackend::process_transaction()
, before updating the spatial tree, walk it and pull the most recently sampled scroll offsets off of each scroll frame node. The result can be a map fromExternalScrollId
toVec<SampledScrollOffset>
. (The value type is a vector because, as of bug 1571758, we retain multiple samples for each scroll frame. This is a slight complication but I think it actually helps us later.) - After updating the spatial tree but before rebuilding the hit tester, iterate through the map, and apply the old samples to the updated spatial tree.
- As a first pass, I think the "apply" step can take the form of a simple
Document::set_scroll_offsets()
call. Note that this basically implements the intent of comment 15 without doing any calculation: the samples stored inSampledScrollOffset
are already the "combined" scroll offsets (M_x + Delta_x
), so if the intent is to use the combined scroll offset that was last sampled, that's what we're doing. - If we want to try and be a little more accurate, we could do the following instead:
- Iterate the samples to see if there's one whose
generation
matches the scroll frame node's newScrollFrameInfo::offset_generation
. - If so, keep just that sample. This is the ideal case, as that's exactly what we want to be compositing in the next frame anyways.
- If not, maybe keep just the most recent (last) sample? Given that the number of samples is typically two, and we typically use the older (one-frame-delayed) one for compositing, the newer one from the previous frame is most likely to be the older one from the next frame and therefore what will be composited the next frame.
- Iterate the samples to see if there's one whose
- As a first pass, I think the "apply" step can take the form of a simple
Assignee | ||
Comment 47•2 years ago
|
||
(In reply to Botond Ballo [:botond] from comment #46)
- If we want to try and be a little more accurate, we could do the following instead:
- Iterate the samples to see if there's one whose
generation
matches the scroll frame node's newScrollFrameInfo::offset_generation
.- If so, keep just that sample. This is the ideal case, as that's exactly what we want to be compositing in the next frame anyways.
(I now better understand what Hiro meant in comment 22. The scroll generation added in bug 1571758 is actually helpful, for this reason.)
Assignee | ||
Comment 48•2 years ago
|
||
I have a WIP patch for this but at the moment it seems to just be making things worse: in addition to appearing not to fix the intermittent in bug 1776971, it's also causing test_wheel_transactions.html to fail. I'm continuing to investigate.
Assignee | ||
Comment 49•2 years ago
|
||
I think I figured out my problem here: my patch is trying to round-trip scroll offsets from the spatial tree to a temporary storage location and back, but the "setter" part of the round-trip is doing more than just setting the values.
Assignee | ||
Comment 50•2 years ago
|
||
When a new scene is swapped in on the render backend thread, its
scroll frames have scroll offsets that come from the main thread
and do not reflect async scroll deltas until such deltas are
sampled from APZ.
It's possible for hit-testing to observe the scene in this
temporary state, potentially leading to incorrect hit-test results.
To avoid this, save the async offsets from the previous scene
and apply them to the new scene until we can sample proper offsets
from APZ.
Comment 51•2 years ago
|
||
Comment 52•2 years ago
|
||
Backed out changeset 22e246050235 (Bug 1737725) for causing wrench bustages in fixed-position.yaml
Log: https://treeherder.mozilla.org/logviewer?job_id=409706230&repo=autoland&lineNumber=6418
Backout: https://hg.mozilla.org/integration/autoland/rev/1a7cd0fc32fb36d702e45718fea7cd5e93ac2755
Assignee | ||
Comment 53•2 years ago
|
||
Should be fixed in the updated version of the patch on Phabricator, but I pushed to Try for good measure.
Comment 54•2 years ago
|
||
Comment 55•2 years ago
|
||
bugherder |
Updated•2 years ago
|
Description
•