Closed Bug 1745969 Opened 3 years ago Closed 2 years ago

Overscroll sometimes grabs overflow:auto elements that don't have scrollbars (GitHub commit page)

Categories

(Core :: Panning and Zooming, defect, P3)

Desktop
All
defect

Tracking

()

RESOLVED FIXED
109 Branch
Tracking Status
firefox109 --- fixed

People

(Reporter: val, Assigned: hiro)

References

(Blocks 2 open bugs, Regressed 1 open bug, )

Details

(Keywords: perf-alert)

Attachments

(4 files)

Found this with code blocks on GitHub commit pages (e.g.) — sometimes scrolling with the touchpad with apz overscroll results in overscrolling these containers instead of scrolling the page. They do have overflow-x:auto set, but don't even actually overflow (no scrollbars are drawn) in my particular case — and I'm scrolling in the Y direction anyway. Seems like it does not happen when overflow-x:auto is removed from these elements.

Screen recording: https://dl.unrelenting.technology/overscroll-wtf.webm

Ah, I'm able to reproduce if I start with a horizontal swipe and then turn it vertical.

Status: UNCONFIRMED → NEW
Ever confirmed: true

Adding bug 1720240 to "See Also", I'm not sure if it's actually related but the symptoms are broadly similar.

At the moment I'm without a working device with a precision touchpad, but as that situation improves in the near future I'll try to repro and investigate.

See Also: → 1720240

The severity field is not set for this bug.
:botond, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(botond)
Severity: -- → S3
Flags: needinfo?(botond)
Priority: -- → P3
See Also: → 1751042

I am starting on this.

Assignee: nobody → hikezoe.birchill
Status: NEW → ASSIGNED

(In reply to Hiroyuki Ikezoe (:hiro) from comment #5)

I am starting on this.

I meant bug 1720240. :)

Now I think bug 1720240 is unrelated to this. Given that setting overflow-x: auto causes this bug;

  1. overflow-x:auto makes the target frame scrollable
  2. thus an APZC is created
  3. the APZC messes up it's overscrollable vertically because it doesn't check overflow-y property

that's my guess.

(In reply to Hiroyuki Ikezoe (:hiro) from comment #7)

Now I think bug 1720240 is unrelated to this. Given that setting overflow-x: auto causes this bug;

  1. overflow-x:auto makes the target frame scrollable
  2. thus an APZC is created
  3. the APZC messes up it's overscrollable vertically because it doesn't check overflow-y property

that's my guess.

My guess was wrong. It seems to be the correct behavior that setting overflow-x:auto makes overflow-y:auto too. I can see it on Chrome as well.

So a question here is "why the overflow-x:auto causes overflowing?". It's probably an issue in nsGfxScrollFrame.cpp.

Looking at the recording in comment 0, I can see overflow annotation on the element in the devtools panel, so the element is definitely overflowing. An interesting observation on my Linux box, on a local debug build, by default the element is not overflowing at all, i.e. no overflow annotation appears, with layout.css.devPixelsPerPx:1.1 it's overflowing but I see scroll annotation instead. I have no idea what's the difference is.

Though I haven't yet been able to reproduce this issue (bug 1751042 either), what I am suspecting is in a certain condition, this mY.CanScroll() call returns true for some reasons. The CanScoll implementation is;

bool Axis::CanScroll() const {                                                  
  return GetPageLength() - GetCompositionLength() > COORDINATE_EPSILON;         
} 

Can someone give me the metrics, GetPageLength() and GetCompositionLength() in the function when the unexpected overscroll happens?

Can you still reproduce this? I'm wondering if something on github changed because I can no longer reproduce.

Flags: needinfo?(greg)

(In reply to Timothy Nikkel (:tnikkel) from comment #11)

Can you still reproduce this? I'm wondering if something on github changed because I can no longer reproduce.

Yes, though the behavior seems slightly different in ./mach run of a build just built (I only get caught in the very last code file on the github page, but maaaybe more often/consistently?) vs. the 10 days old build I've been running (same as in the video, i.e. it can happen on any file).

(In reply to Hiroyuki Ikezoe (:hiro) from comment #10)

Can someone give me the metrics, GetPageLength() and GetCompositionLength() in the function when the unexpected overscroll happens?

Both when the overscroll happens and when it doesn't, it CanScroll() due to a difference of exactly 1 pixel like so:

[Parent 31389: Main Thread]: D/apz.axis 3976ec12ae00|Y PageLen1592.500000 - CompLen1591.500000 -> 1
Flags: needinfo?(greg)

(In reply to greg v [:unrelentingtech/myfreeweb] from comment #12)

(In reply to Hiroyuki Ikezoe (:hiro) from comment #10)

Can someone give me the metrics, GetPageLength() and GetCompositionLength() in the function when the unexpected overscroll happens?

Both when the overscroll happens and when it doesn't, it CanScroll() due to a difference of exactly 1 pixel like so:

[Parent 31389: Main Thread]: D/apz.axis 3976ec12ae00|Y PageLen1592.500000 - CompLen1591.500000 -> 1

Thank you, greg! This 0.5 fraction makes me suspect our pixel alignment on the main-thread causes this.

Now I am mostly sure the 1px difference comes from these SnapCoord calls.

I am unsure whether the bottom position snapping is still a thing we need to do in this WebRender era or not. It was introduced in bug 1012752 for the old layers backend. Markus, do you think it's still necessary for WebRender? Or can you please bounce to someone who is familiar with the area?

Flags: needinfo?(mstange.moz)
Attached patch no-op-snap-coordSplinter Review

greg, would you mind trying this patch to make sure the place I am suspecting is the root cause of this issue?

Flags: needinfo?(greg)

(In reply to Hiroyuki Ikezoe (:hiro) from comment #15)

greg, would you mind trying this patch to make sure the place I am suspecting is the root cause of this issue?

Yep, cannot reproduce with the patch, very nice.

Flags: needinfo?(greg)

I managed to create a simplified test case. I confirmed that I can see the undesirable overscroll effect on Mac.

STR;

  1. Open the attachment
  2. Click somewhere on the green area
  3. Try to scroll down

Actually result; you can see red area at the bottom of the green area, it means the green box is overscrolled

I am totally unsure this case works as expected on other environments, since the green box's height is 200px and there's another box whose height is 200.45px with red background color. In my theory with 200.5px inner box the issue should be appeared, but with 200.5px there appears the vertical scrollbar.

Now I am mostly 100% sure that this SnapCoord is unnecessary now. The biggest reason I believe it's unnecessary is this aosmond's comment in bug 1574493. From the comment;

  1. Round in layout space in Gecko. This effectively snaps any primitive rects and clip rects if there are no transforms applied. Most of the time, this is correct.

(snip)

  1. is fundamentally wrong. This is faking snapping, and works most of the time due to the relative rarity of fractional transforms and offsets.
    Proposed solution: Solve 2), 3) and 4), and then we can finally stop doing 1).

Given that bug 1574493 has solved the 2,3 and 4. I guess we can now safely drop the SnapCoord which is one of the parts of where we round metrics in Gecko.

:aosmond, do you think it's reasonable to drop this SnapCoord function which is used to snap the bottom edge of the scrolled rect?

A couple of caveats I've noticed.

  1. The SnapCoord function was introduced in bug 1012752, and in the bug two different type of reftests were added, one is checking rendering results, the other is checking frame layer invalidation stuff, the latter has been broken on WebRender unfortunately. The broken reftest uses reftest-no-paint but on WebRender the element in question will never be painted on the main-thread instead it does create a a WR command for border thus, the reftest-no-paint's element hasn't been worked as expected
  2. I guessed that dropping SnapCoord will fix bug 1736757 (now it's duped as bug 1735008) since the SnapCoord is called with the visualViewport size. But actually it doesn't fix
Flags: needinfo?(aosmond)

I am stuck at figuring out the reason why blob-fallback-clip.html reftest fails on Windows without the SnapCoord function. Without the SnapCoord, the testing image is rendered 1px up position. Here is a link to Reftest analyzer to see the difference.

Facts what I've found so far;

  1. the reference file, blob-fallback-clip-ref.html is not affected by the SnapCoord change, even though the return value of ScrollFrameHelper::GetScrolledRange (which is the only function using the SnapCoord) is affected by the change.
  2. the test file has a couple of rAF callbacks before rendering the result, whereas the reference doesn't have them. And in fact, with two rAF callbacks the reference starts being affected by the SnapCoord change for some reason.
    2-1) scrolled positions with/without two rAF callbacks are pretty much same
    2-2) I did double check 1) by adding linear gradient background color to <html> to tell the scroll position differences
    2-3) display list items with/without two rAF callbacks are also same

With these facts, I think this failure is more related to text rendering rather than scrolling. I am assuming that text rendering at the very beginning, in this test case, it's the first paint, the rendering result is slightly different from results after the second paint. That said, it's just a guess, it's hard to tell for me whether the test failure is indicating that dropping the SnapCoord function will regress something, or it's just caused by pre-existing issues or not.

Clearning NIs to both Markus and Andrew, I am going to ask Andrew in the next comment to get his attention.

Flags: needinfo?(mstange.moz)
Flags: needinfo?(aosmond)

Andrew, can you please shed some light on proceeding this bug?

There are some test failures caused by dropping the SnapCoord function, but other than the blob-fallback-clip reftest failure, I understand why they fail. They just need some tweaks in the test itself.

Flags: needinfo?(aosmond)
See Also: → 1774315

I meant to respond to this today (actually I wrote a response ages ago that was lost). Will do so tomorrow.

Text snapping is very complicated. I'm sure there will be interactions with changing the SnapCoord behaviour, although it isn't immediately obvious to me how the math will work out.

If we are dealing with anything other than a static simple 2d transform, we will snap excluding those effects to minimize the text movement during an animation:
https://searchfox.org/mozilla-central/rev/59f0bf3c13dd455d9f5415b89178de701ea6b850/gfx/wr/webrender/src/prim_store/text_run.rs#319

From the test case, I can only assume it will be snapping in device space, so it will take this path in the shader:
https://searchfox.org/mozilla-central/rev/59f0bf3c13dd455d9f5415b89178de701ea6b850/gfx/wr/webrender/res/ps_text_run.glsl#147

The reference_frame_relative_offset is just the current offset for the spatial node during scene building:
https://searchfox.org/mozilla-central/rev/59f0bf3c13dd455d9f5415b89178de701ea6b850/gfx/wr/webrender/src/scene_building.rs#699

We do a bunch of snapping inside of the spatial node, which I am not sure is relevant, but they do include the scroll offset (directly or indirectly):
https://searchfox.org/mozilla-central/rev/59f0bf3c13dd455d9f5415b89178de701ea6b850/gfx/wr/webrender/src/spatial_node.rs#450
https://searchfox.org/mozilla-central/rev/59f0bf3c13dd455d9f5415b89178de701ea6b850/gfx/wr/webrender/src/spatial_node.rs#533
https://searchfox.org/mozilla-central/rev/59f0bf3c13dd455d9f5415b89178de701ea6b850/gfx/wr/webrender/src/spatial_node.rs#539
https://searchfox.org/mozilla-central/rev/59f0bf3c13dd455d9f5415b89178de701ea6b850/gfx/wr/webrender/src/spatial_node.rs#471

The external scroll offset contribution to that calculation comes from the ClipManager, which does its own rounding:
https://searchfox.org/mozilla-central/rev/59f0bf3c13dd455d9f5415b89178de701ea6b850/gfx/layers/wr/ClipManager.cpp#360

My first thought would be if we are removing snapping from ScrollOffsetHelper, we may need to reconsider how we snap in the ClipManager and/or the spatial node. I'm less familiar generally with how APZ feeds changing scroll offsets into WebRender but there may be more to consider along those paths.

Flags: needinfo?(aosmond)

Thank you :aosmond for the detailed information. I will dive into places where you linked in the next week. Thank you!

(In reply to Andrew Osmond [:aosmond] (he/him) from comment #22)

I'm less familiar generally with how APZ feeds changing scroll offsets into WebRender but there may be more to consider along those paths.

For references/records, I know this bit. The async scroll offset comes from this offset() function and it will be snapped by this snap_offset IIRC.

Interestingly on the latest m-c the reftest in question doesn't fail on Windows 11 (at least on my Window 11 laptop), so I wondered it was fixed at some point recently, but unfortunately the reftest still fails on Windows 10 on the latest m-c. So I wondered the failure is caused by something timing specific, then I tried use "MozReftestInvalidate" event instead of onload, the result was pretty green. So I am going to tweak the reftest rather than diving into details of text snapping on Windows 10.

With onload event handler this reftest isn't in a good state, in other words
it's not stable on all platforms. Specifically on Windows 10 this testing html
is rendered at 1px upper than the reference html without pixel snapping on the
main-thread. Interestingly this reftest doesn't fail at least on my Windows 11
laptop without the pixel snapping. So there's something that metrics are timing
specific (e.g. font loading) to make this test flaky.

With doing SnapCoord we mis-consider non-scrollable frames as scrollable in APZ,
thus it causes unexpected overscroll gutters.

Depends on D161417

Pushed by hikezoe.birchill@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/aec681965be3 Use "MozReftestInvalidate" event instead of onload to wait for ready to be tested. r=tnikkel https://hg.mozilla.org/integration/autoland/rev/a23bb92cb6fe Stop doing SnapCoord. r=tnikkel
Blocks: 1715154
Pushed by hikezoe.birchill@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/6fd4ba66ac87 Use "MozReftestInvalidate" event instead of onload to wait for ready to be tested. r=tnikkel https://hg.mozilla.org/integration/autoland/rev/11b84155f96e Stop doing SnapCoord. r=tnikkel

There were more "UNEXPECTED-PASS" on android.

Flags: needinfo?(hikezoe.birchill)
Backout by sstanca@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/145c2af4cfe3 Backed out 2 changesets for causing mochitests failures on test_bug842853-2.html.CLOSED TREE

There were 3 test failures, test_bug842853-2.html, test_bug842853-2.html and test_bug851485.html. All tests use file_bug842853.html. The file_bug842853.html has an anchor node at the bottom of the root scroller, and all the tests try to go to the anchor then scroll back to 500px.

I was optimistic that the scroll position alignment (bug 1774315) will not be affected by dropping SnapCoord since the SnapCoord is just snapping the scroll container's bottom edge. But in fact in these specific test cases, it affects. The scroll position alignment (ClampAndAlignWithPixels) does;

    double currentLayerVal = (aRes * aCurrent) / aAppUnitsPerPixel;               
    double desiredLayerVal = (aRes * desired) / aAppUnitsPerPixel;                
    double delta = desiredLayerVal - currentLayerVal;                             
    double nearestLayerVal = NS_round(delta) + currentLayerVal;

aCurrent is the current scroll position which is in these cases it's the maximum scroll position (affected by dropping SnapCoord), so the result of ClampAndAlignWithPixels is also affected by dropping SnapCoord unfortunately.

Dropping ClampAndAlignWithPixel (bug 1774315) altogether is as hard as this bug since it requires changing our internal representation of style values (see bug 1774315 comment 1). Thus I am inclined to relax these test conditions to allow 0.5 difference.

Though this isn't directly related to this bug, the code in ClampAndAlignWithPixels was introduced in bug 772679, I did confirm that the STR in bug 772679 comment 0 doesn't reproduce the issue with/without ClampAndAlignWithPixels (without SnapCoord) both on Linux and Mac. So I am mostly sure that it was necessary for our old layer backend, but it's no longer necessary. Also note that I am quite unsure the NS_round(delta) is correct, my naive guess is that it should be delta > 0.0 ? ceil(delta + 0.5) : floor(delta - 0.5) and clamp it in the scrollable range, otherwise there may be unreachable 1px at the edges.

Pushed by hikezoe.birchill@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/c381bdd1fea3 Use "MozReftestInvalidate" event instead of onload to wait for ready to be tested. r=tnikkel https://hg.mozilla.org/integration/autoland/rev/248c025cb922 Stop doing SnapCoord. r=tnikkel
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 109 Branch

(In reply to Pulsebot from comment #33)

Pushed by hikezoe.birchill@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c381bdd1fea3
Use "MozReftestInvalidate" event instead of onload to wait for ready to be
tested. r=tnikkel
https://hg.mozilla.org/integration/autoland/rev/248c025cb922
Stop doing SnapCoord. r=tnikkel

== Change summary for alert #36103 (as of Wed, 16 Nov 2022 21:52:02 GMT) ==

Improvements:

Ratio Test Platform Options Absolute values (old vs new)
7% bing-search-restaurants ContentfulSpeedIndex android-hw-a51-11-0-aarch64-shippable-qr warm webrender 987.25 -> 913.50

For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=36103

Regressions: 1807890
Regressions: 1818967

Hello Val, can you please double-check that this bug doesn't happen on recent nightlies (March 26th and later I suppose)?

I did revert the change for fixing this bug in bug 1817126. And even without the change, I believe this original bug has been fixed by bug 1782339 in a different way. At least I can no longer reproduce this bug in the test case in comment 17, and as far as I tried, I can no longer see this bug on the github page either. Though I am unsure whether I could see the github issue locally before. :/

Flags: needinfo?(val)

Can't reproduce either, everything seems good. :)

Flags: needinfo?(val)

Glad to hear! Thank you!

See Also: → 1795784
Blocks: 1852884
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: