Open Bug 1774315 Opened 2 years ago Updated 13 days ago

Consider dropping ClampAndAlignWithPixels in nsGfxScrollFrame.cpp

Categories

(Core :: Layout: Scrolling and Overflow, task, P3)

task

Tracking

()

ASSIGNED

People

(Reporter: hiro, Assigned: emilio, NeedInfo)

References

(Depends on 1 open bug, Blocks 6 open bugs)

Details

Attachments

(2 files)

I am totally unsure whether ClampAndAlignWithPixels is still necessary or not. I suppose it was necessary for our old layer backend, but it's no longer needed to WebRender.

I believe this is the cause of wpt test failures with the patch for bug 1674687 in bug 1674687 comment 5. I actually tried to run a couple of failed tests locally, background-change-during-smooth-scroll.html and selection.html, they passed on local Android emulator without ClampAndAlignWithPixels.

To be precise dropping ClampAndAlignWithPixels entirely is wrong. We still need to clamp the given value to the scrollable bounds.

Anyway, without the AlignWithPixels there are two test failures.

One is browser_dbg-breakpoints-scroll-to-log.js

TEST-UNEXPECTED-FAIL | devtools/client/debugger/test/mochitest/browser_dbg-breakpoints-scroll-to-log.js | editor is scrolled to the log input line. - Got 870.5, expected 870

I am totally unsure how we calculate the expected 870 value, I suppose it was just picked from a success run.

Another failure is more interesting, it's helper_scroll_linked_effect_detector.html, we incorrectly detect scroll-linked effect by target.style.top = getComputedStyle(target).top. It's supposed to be a no-op, but in fact without ALignWithPixels the values, getComputedStyle one and the internal representation of top values are different.

What the test does is;

  1. Set window.scrollY to style.top
  2. In a later frame set getComputedStyle(target).top to style.top

So for example, window.scrollY is double, I got 82.16666412353516, then it was represented as 82.166664 in the style value, then its computed value was 82.1667, thus we consider the top value is going to be changed from 82.166664 to 82.1667. I believe the difference comes from here in dtoa-short. Presumably we should have only 6 decimal digits in the first place? Not just rounding it in the serializer.

See Also: → 1370779

I think the second test is probably worth just rounding to integers, it's not the point of the test to test these minor floating point rounding issues. We could restrict the floats at parse time to what we can serialize but that might be slightly unfortunate. Alternatively, just doing target.style.top = target.style.top; here should work too, right?

https://searchfox.org/mozilla-central/rev/230a641415c4212fe719279263e8ddf2a411aff1/gfx/layers/apz/test/mochitest/helper_scroll_linked_effect_detector.html#58

(In reply to Emilio Cobos Álvarez (:emilio) from comment #2)

Alternatively, just doing target.style.top = target.style.top; here should work too, right?

I actually tried it first, it didn't work. target.style.top is also rounded.

Ah! Ok, top is one of these properties that use used values from the computed style. So what you're describing is kind of expected.

Priority: -- → P3

During tracking down bug 1801098, I found one of other sources of rounding scroll position, here in ClipManager::DefineScrollLayers. Though I haven't followed what bug 1669865 (which introduced the rounding) tried to solve, at first glance it's odd because we've already aligned the scroll position in layer pixel by using ClampAndAlignWithPixels. So I suppose the change in bug 1669865 means ClampAndAlignWithPixels is useless for WebRender. And even with the change in bug 1669865 with APZ paint skipping there's a case that ClipManager::DefineScrollLayers isn't invoked.

I realized this bug is one of other sources of scroll jittering (such as bug 1782797).

In WebRender we have two type of scroll positions;

  1. a scroll position named `externa_scroll_offset which represents the scroll position for a given scroll container on the main-thread.
  2. scroll positions named offsets, which are sums of the external_scroll_offset and async scroll offsets

And when a paint happens, we use the external_scroll_offset, which means we use a rounded scroll position (rounded by ClampAndAlignWithPixels), not rounded by WebRender. That will result jittering scrolling.

See Also: → 1782797
See Also: → 1682203

This prevents scrollLeft = <integer> to return scrollLeft = <integer>,
which is right now papered by bug 1674687.

Assignee: nobody → emilio
Status: NEW → ASSIGNED
Pushed by ealvarez@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/7ece24d43282
Don't clamp target scroll position to layer scale. r=mstange
Pushed by emilio@crisal.io:
https://hg.mozilla.org/integration/autoland/rev/a3b5ec0e4306
Annotate a reftest as fuzzy in macOS for now.

Backed out for causing bc failures in browser_test_position_sticky.js

  • Backout link
  • Push with failures
  • Failure Log
  • Failure line: TEST-UNEXPECTED-FAIL | gfx/layers/apz/test/mochitest/browser_test_position_sticky.js | The position:sticky element should stay at the same place after scrolling on heavy load - Got "
Flags: needinfo?(emilio)
Regressions: 1829990

Hiro, it seems you wrote this test in bug 1730998, do you know why it's failing on macOS?

I can repro this failure but it fails at some resolutions even without the patch. It seems like some fractional overscroll / antialiasing or something? But the weird thing its that it's at the top...

It seem the test was intended to test whether the sticky element paints at all, so maybe I can just tweak it to ignore that fractional overflow like this?

diff --git a/gfx/layers/apz/test/mochitest/browser_test_position_sticky.js b/gfx/layers/apz/test/mochitest/browser_test_position_sticky.js
index c324440b61dc6..36316482e1f17 100644
--- a/gfx/layers/apz/test/mochitest/browser_test_position_sticky.js
+++ b/gfx/layers/apz/test/mochitest/browser_test_position_sticky.js
@@ -48,6 +48,12 @@ add_task(async () => {

       // Reduce the scrollbar width from the sticky area.
       stickyRect.width -= w.value;
+
+      // Reduce the rect a bit vertically to avoid antialiasing and overscroll
+      // artifacts.
+      stickyRect.height -= 2;
+      stickyRect.x += 1;
+
       return {
         rect: stickyRect,
         scrollbarWidth: w.value,

But it feels a bit weird... I'll try that tho (macOS try runs are super slow lately...).

Flags: needinfo?(emilio) → needinfo?(hikezoe.birchill)

If it's reproducible without the patch, filing a bug and ignoring 1px at the top would be fine, but allowing the tolerance should be only on Mac for now.

IIRC, when I wrote the test in bug 1730998, it failed to render the sticky element entirely (i.e. fully checkerboarding), or top 1px or 2px (or more) of the sticky element was scrolled out because the sticky element didn't catch up with the latest async scroll offset. So this failure is totally different from what I saw before, which makes me believe D176306 doesn't regress bug 1730998 at all.

My best guess is that the underlying issue is somewhere inside WebRender where we combine the sticky position on the main-thread and the async scorll offset.

Flags: needinfo?(hikezoe.birchill)
Attached image a failure snapshot

FWIW here is a snapshot in a failure case.

Failed to attach a data url. :/

See Also: → 1831655
Pushed by ealvarez@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/332b088b3141
Don't clamp target scroll position to layer scale. r=mstange

I've been thinking about this for a bit, after talking to Botond and Hiro.

I may be overcomplicating things, perhaps others have more clarity on what's needed / how to best fix this / other ideas / corrections.

As I understand it, we'd want to pass unsnapped values through to WR for scroll frames (and also sticky frames, I guess), and have WR handle the snapping correctly. This necessarily includes support for scroll frames that have APZ enabled.

We currently do pixel snapping in WR during scene building, which is prior to evaluating per-frame APZ values (scrolls and animated transforms). This mostly works, since we generally don't want to snap subpixel transforms for compositor animations.

But since the snapping occurs during scene build, we therefore currently require the scroll positions for APZ frames to be snapped values to get the expected result.

We could fix this by moving snapping to occur during frame building, which would allow us to snap correctly while taking into account subpixel offsets in APZ scroll (+ sticky) frames. We'd need to ensure we differentiate between scroll frames and transform animations (where we don't want to snap), but we have this information available in WR.

Snapping is conceptually simple, but difficult to change incrementally, I don't think there's anything particularly complicated about doing snapping during frame building, but it is tricky to get it correct, and can take time to debug each fuzziness failure.

Rough list of things we'd need to consider:

  • Correctly determine when we want to snap (scroll vs. transform animations). May need to apply partial transform, then snap, then apply rest of the transform in some cases.

  • Is there any way we could do parts of the work incrementally, to avoid landing all in one large change?

  • Does this have any impact on the existing hit-testing code?

  • We need to be able to snap on CPU, to correctly determine device-space bounding rects for render task allocations for off-screen surfaces. We could do all snapping on CPU, but that might produce a lot of extra per-frame GPU upload data. Perhaps we could do most of the snapping in vertex shader, and only do CPU snapping for render task allocations?

  • What impact does this have on GPU upload traffic for gpu_cache / gpu_buffer? Perhaps we should move more primitives to gpu_buffer first, and consider deprecating gpu_cache before making these changes anyway?

  • Need to correctly handle updated snapping behavior in both the new and legacy clip-mask sampling paths.

  • The way we currently hash and intern primitives for invalidation purposes may require changes. We already quantize the values, so it may not be an issue. However, we do rely on the scroll frame external_scroll_offset being snapped to get stable values for invalidation, so this may require some extra work.

Also, for what it's worth:

We don't currently handle snapping correctly for cases where there is an off-screen surface. Mostly this isn't noticeable since they are generally device pixel aligned for most content. However, if we were to fix the above, we could fix this at the same time, which should fix a number of existing fuzziness / correctness issues we have.

(In reply to Glenn Watson [:gw] from comment #17)

  • Is there any way we could do parts of the work incrementally, to avoid landing all in one large change?

I guess we could land the snapping part during frame building first, then we can land the remaining part, i.e., dropping the current snapping stuff.

I believe bug 1735008 will be fixed by the first part, it's another variant of bug 1801098 (it's still open but it was already fixed by backing out bug 1745969) but with zooming-in cases. Thus if we properly do snapping during frame building, i.e. after applying the async zoom (transform) value, then bug 1735008 should be disappeared.

Markus and I have been discussing this in #gfx-firefox, and Markus has an alternative proposal, which I'll describe here.

  • During the Gecko -> WR DL conversion step, we would "undo" the scroll positions set in display items. Conceptually this seems simple, perhaps there's some unknown complexity here though. This would result in WR display items always being in an unscrolled local space.

  • Snapping continues to occur during scene building, as it does now. This effectively means the snapping occurs assuming no scrolling has taken place, so it's unaffected by any changes to have a subpixel scroll offset.

  • The scroll offsets are supplied as complete scroll transforms by APZ, which can be snapped by WR internally.

In addition to solving the problems above, this has some side benefits:

  • We could remove the concept of external_scroll_offset from WR. This is a source of complexity, and we currently have to be very careful how we handle this to avoid redundant invalidations.

  • We could remove the extra overhead in WR that deals with undoing these scroll positions to enable correct content hash / intern values. This is significant as we need to operate on each glyph.

  • Making WR work this way would potentially simplify some of the planned changes for the proposed frame tree <-> WR work. By changing WR to work with unscrolled coordinates, we could produce those directly from the frame tree, and then eventually the Gecko -> WR DL conversion step itself gets removed as it's redundant.

A further benefit:

  • The snapping logic remains per-scene, rather than per-frame. This would avoid any performance regression during frame building, and any extra GPU upload traffic etc.
See Also: → 1851066

So is investigating the failures in comment 16 and getting this patch landed still worth it?

Flags: needinfo?(mstange.moz)
Flags: needinfo?(hikezoe.birchill)
Flags: needinfo?(emilio)

I'm not sure if we have a plan for what should happen in what order and who will work on which part and where the test changes slot into that plan.

Flags: needinfo?(mstange.moz)

My understanding (though it may be out of date) is that we will want the change in this patch (removing the alignment done in nsHTMLScrollFrame::ScrollToImpl), but fixing the failures will require Glenn's work on pixel alignment in WebRender.

Ok so it seems to me:

  • The failure in test_bug1253683, it seems we probably don't scroll enough (scroll less than one pixel because the scroll is smooth and we resolve too early). That seems fixable on the test side right?
  • The sticky failure seems rather harmless, maybe we can annotate it as todo_is, or ignore a few pixels on macOS? Will look into that one more deeply.
Flags: needinfo?(emilio)

Doh! I didn't submit below comment last night. :/ Posting now anyway.

Investigating the test failure reason is worth, I think. But I suspect landing the change will cause performance regressions such as bug 1825394 or cause jittering text/borders such as bug 1801098, or both. Though I may be too pessimistic, I'd suggest to land the change after Glenn's WR changes finished.

Flags: needinfo?(hikezoe.birchill)

For what it's worth - I don't currently have the WR work scheduled to get done in the near future, as some other high priority work has come up. I think Bob was going to talk to Kelsey and Tim about whether it's something that Tim may have time to work on?

Flags: needinfo?(tnikkel)
Flags: needinfo?(bhood)

...I think Bob was going to talk to Kelsey and Tim about whether it's something that Tim may have time to work on?

Yes, I spoke to Kelsey about this last week, and I will follow up today.

Flags: needinfo?(bhood)
See Also: → 1852884
Status: ASSIGNED → RESOLVED
Closed: 10 months ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---

I closed the bug in the wrong tab, sorry!

Status: REOPENED → ASSIGNED
Blocks: 1851066
See Also: 1851066
Blocks: 1831655

I did a little looking into what would be required for this. One issue that came to mind as this became more concrete was that (at least on the content side) we will basically be computing the coords of every single item twice (instead of once as currently). The existing coords as we compute and store in the display item, and then we'll have to compute it again as we convert from display items and send to web render. I think this will be a non-trivial cost.

Flags: needinfo?(tnikkel)

Glenn, will your current DisplayList work have any impact on Tim's analysis in comment 30?

Flags: needinfo?(gwatson)

Eventually it would intend to resolve the performance related issues that Tim has mentioned, yes.

However, my understanding of this bug is that the performance side of it is a tertiary concern, and it's primarily about fixing the behavior of how / when we snap in certain situations?

Flags: needinfo?(gwatson)

(In reply to Glenn Watson [:gw] from comment #32)

However, my understanding of this bug is that the performance side of it is a tertiary concern, and it's primarily about fixing the behavior of how / when we snap in certain situations?

Yes, however we'd have to pay the increased perf cost in all cases.

See Also: → 1860504
Blocks: 1556685

A rough plan for how we will approach this:

  • Add the Gecko side parts to remove the scrolling snap logic (either on a preference or in a branch).
  • Craft some Gecko tests that fail with that behaviour enabled.
  • Proof of concept in WR to make those tests pass.
  • Work out the unknowns and how to make the proof of concept landable, and enable the Gecko preference.
Depends on: 1887125

Could one of you include a brief outline of what the proposed fix was in wr? It had to do with when we snap vs when we add the external scroll offset.

Flags: needinfo?(mstange.moz)
Flags: needinfo?(gwatson)

The general idea is to apply the snapping with an assumption of zero-scroll offset.

During scene building, separate out the offsets from reference frames vs. scroll offsets, and do the snapping calculations only on the reference frame + local position offsets. Then, apply the (external) scroll offsets after the snapping logic occurs (for interning / hashing purposes).

The visual effect from the scroll offsets could actually be applied as part of the calculated transform matrix built in the spatial tree (which itself would be snapped after concatenating with the APZ scroll offset during frame building).

It remains to be seen if there's any gotchas or unexpected test failures / unknowns with that idea, but I think that's the general scope of the fix.

Flags: needinfo?(gwatson)
Blocks: 1878876
See Also: 1831655
See Also: → 1878247
Blocks: 1894400
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: