Closed Bug 1871760 Opened 11 months ago Closed 8 months ago

Scroll is jittery on sites with virtual scrolling

Categories

(Core :: Panning and Zooming, defect, P2)

Firefox 121
defect

Tracking

()

RESOLVED FIXED
125 Branch
Tracking Status
firefox-esr115 --- unaffected
firefox121 --- wontfix
firefox122 --- wontfix
firefox123 --- wontfix
firefox124 --- wontfix
firefox125 --- fixed

People

(Reporter: brawaru, Assigned: hiro)

References

(Regression)

Details

(Keywords: regression)

Attachments

(5 files, 2 obsolete files)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:120.0) Gecko/20100101 Firefox/120.0

Steps to reproduce:

  1. Open a site that heavily relies on virtual scrolling like https://bsky.app (can also be reproduced to a lesser degree on Mastodon, https://discord.com/app (channels / messages)).
  2. Scroll down or up

Actual results:

Scroll is very jittery compared to what it used to be in 120.

Expected results:

Page scrolls as normal and no jitter is observed or it's caused by the implementation of virtual scrolling itself, not the browser.

Notes:

  • Firefox <=120.0.1 had a different bug that affected scrolling on Bluesky, Firefox 121 it made so severe that the site is no longer usable in Firefox. I personally had to upgrade to 120.0.1 and hope this can be fixed soon. This bug affects all sites (to some degree) that use virtual scrolling.
  • The demonstration video was recorded in clear Firefox 121 profile, no settings were changed.

The Bugbug bot thinks this bug should belong to the 'Core::Panning and Zooming' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → Panning and Zooming
Product: Firefox → Core
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: regression
Regressed by: 1856088

As expected, setting layout.css.scroll-anchoring.absolute-update=false fixes the bug.

(In reply to Sasha Sorokin [:brawaru] from comment #1)

There are a lot of comments there so it wasn't immediately clear to me, does this older bug still exist (say in 120)? If so, are the steps to reproduce? Thanks.

Flags: needinfo?(dafri.nochiterov8+mozbugzilla)

Set release status flags based on info from the regressing bug 1856088

(In reply to Timothy Nikkel (:tnikkel) from comment #5)

does this older bug still exist (say in 120)? If so, are the steps to reproduce? Thanks.

I think it's a different bug, because it affected only Bluesky, and all other sites seemed to be alright. The reproduction steps are similar — you just scroll the feed down, eventually your scroll seems to jump randomly (mostly scroll you back a bit, as if you had a broken mouse wheel), it also may stick to the very bottom of the page, continuing loading of the feed (see video in my comment on GitHub).

And yes, I just reproduced this in 120.0.1. Important observation is that turning off smooth scrolling seems to make it occur less. It also happens slightly less often now after Bluesky removed the sliding feeds bar at the top (which was replaced with the fixed tabs bar for the current feed).

I personally believe this one may actually be Bluesky's own bug, but they rejected it on the merits of non-reproduction in Chromium, which only had the feed blinking (you can also see me trying to reproduce this in Chromium via above link). This is why I filled it in Webcompat rather than immediately heading to this tracker.

Flags: needinfo?(dafri.nochiterov8+mozbugzilla)

Updating and disabling layout.css.scroll-anchoring.absolute-update fixed the jitter, but not the the bug reported on Webcompat: there are still weird scrollbacks as described above.

:hiro, since you are the author of the regressor, bug 1856088, could you take a look? Also, could you set the severity field?

For more information, please visit BugBot documentation.

Flags: needinfo?(hikezoe.birchill)

I can't since I don't have accounts on bluesky. Can someone provide a local version of the site I can reproduce locally even if it's gigantic?

I can give you the invite code. Do I use your Bugzilla email to send it?

No worries. Timothy has already sent an invitation to me. Though as of now I am not going to use it since I don't want to have accounts on the site. I'd rather wait to local test cases. :)

That's said, it looks like https://github.com/bluesky-social/social-app is the source code of the bluesky web site? If so, it's worth trying?

Can we set a severity on this?

Flags: needinfo?(botond)

If we can, I think we should try to fix this during this cycle.

Discussed during today's APZ meeting, marking as S2 based on user impact.

Severity: -- → S2
Flags: needinfo?(botond)
Priority: -- → P2

(In reply to Timothy Nikkel (:tnikkel) from comment #14)

If we can, I think we should try to fix this during this cycle.

We are in the final week of beta for Fx122, do thing you'll have a fix in time or will this go with Fx123?

Flags: needinfo?(tnikkel)

We should at least flip the pref to nightly-only (or disabled for now...)

oh, we shipped this in 121...

(In reply to Donal Meehan [:dmeehan] from comment #16)

(In reply to Timothy Nikkel (:tnikkel) from comment #14)

If we can, I think we should try to fix this during this cycle.

We are in the final week of beta for Fx122, do thing you'll have a fix in time or will this go with Fx123?

Sorry, I'm not really in a position to answer this, I don't know the code affected by the regressing patch well enough.

Flags: needinfo?(tnikkel)

Sorry, redirecting my NI to :botond - if there were any discussions on the next steps, etc. re: comment 15
We are near the end of the beta cycle. 122.0b9 builds on 2024-01-12. At this point, it might be too late?

Flags: needinfo?(dafri.nochiterov8+mozbugzilla)

(In reply to Donal Meehan [:dmeehan] from comment #20)

Sorry, redirecting my NI to :botond - if there were any discussions on the next steps, etc. re: comment 15
We are near the end of the beta cycle. 122.0b9 builds on 2024-01-12. At this point, it might be too late?

(We'll discuss again during today's meeting and I'll post any updates here.)

Flags: needinfo?(dafri.nochiterov8+mozbugzilla)
Flags: needinfo?(botond)

Discussed this today:

  • Both Hiro and I have tried to reproduce this, with no success so far. We'll keep trying but for this reason an immediate fix is not likely to be forthcoming.
  • With regards to the suggestion in comment 17 to flip the pref added in bug 1856088 (the regressing bug): it's a tough call, since flipping the pref will bring back the regression fixed in bug 1856088. We think that may be even more impactful as it potentially affects a more widely used site (Facebook) and a more common scrolling method (touchpad, vs. autoscroll in this bug). For this reason, we are leaning towards leaving the pref as is, even if that means shipping this regression in 122.
Flags: needinfo?(botond)

(In reply to Botond Ballo [:botond] from comment #22)

vs. autoscroll in this bug

Just to clarify, this bug is perfectly reproducible with regular scrolling, and is even more noticeable by disabling smooth scrolling (pretty sure it's not worsened, just easier to see). I used auto-scrolling in demonstration to speed up the reproduction. Sorry for causing the confusion with that.

I was able to reproduce on the test bluesky account using a touchpad on macos. The str I used were:

scroll down a large number of pages
scroll up using a fling that travels about 2 pages.
observe the page during the momentum phase. after doing this several times I can observe a jump in the page contents.

Looks like there are two different bugs? Jittery scrolling and jumpy scrolling? Given that flipping layout.css.scroll-anchoring.absolute-update pref fixes jittery one, but not jumpy one, it would be nice to open a new bug for the jumpy one.

(In reply to Hiroyuki Ikezoe (:hiro) from comment #25)

Looks like there are two different bugs? Jittery scrolling and jumpy scrolling? Given that flipping layout.css.scroll-anchoring.absolute-update pref fixes jittery one, but not jumpy one, it would be nice to open a new bug for the jumpy one.

Why do you think there are two different bugs? Where did you get the information that the pref flip fixes one of the bugs but not the other one?

Whiteboard: [apz-needsdiagnosis]

(In reply to Timothy Nikkel (:tnikkel) from comment #26)

(In reply to Hiroyuki Ikezoe (:hiro) from comment #25)

Looks like there are two different bugs? Jittery scrolling and jumpy scrolling? Given that flipping layout.css.scroll-anchoring.absolute-update pref fixes jittery one, but not jumpy one, it would be nice to open a new bug for the jumpy one.

Why do you think there are two different bugs? Where did you get the information that the pref flip fixes one of the bugs but not the other one?

Oop, I forgot mentioning the information source. It's comment 8.

(In reply to Timothy Nikkel (:tnikkel) from comment #24)

I was able to reproduce on the test bluesky account using a touchpad on macos. The str I used were:

scroll down a large number of pages
scroll up using a fling that travels about 2 pages.
observe the page during the momentum phase. after doing this several times I can observe a jump in the page contents.

I can reproduce this on Linux with these STR. The issue is fixed by setting layout.css.scroll-anchoring.absolute-update=false.

I added the following code to log composited scroll offsets and their timestamps:

diff --git a/gfx/layers/apz/src/AsyncPanZoomController.cpp b/gfx/layers/apz/src/AsyncPanZoomController.cpp
--- a/gfx/layers/apz/src/AsyncPanZoomController.cpp
+++ b/gfx/layers/apz/src/AsyncPanZoomController.cpp
@@ -4819,16 +4819,26 @@ bool AsyncPanZoomController::AdvanceAnim
   // smooth. If an animation frame is requested, it is the compositor's
   // responsibility to schedule a composite.
   bool requestAnimationFrame = false;
   nsTArray<RefPtr<Runnable>> deferredTasks;
 
   {   
     RecursiveMutexAutoLock lock(mRecursiveMutex);
     {  // scope lock
+      float yOffset = GetEffectiveScrollOffset(
+                          AsyncTransformConsumer::eForCompositing, lock)
+                          .y.value;
+      if (yOffset != 0) {  // exclude things that aren't scrolling
+        printf_stderr("[SAMPLE] %f, %f\n",
+                      (aSampleTime.Time() - TimeStamp::ProcessCreation())
+                          .ToMilliseconds(),
+                      yOffset);
+      }
+
       CSSRect visibleRect = GetVisibleRect(lock);
       MutexAutoLock lock2(mCheckerboardEventLock);
       // Update RendertraceProperty before UpdateAnimation() call, since
       // the UpdateAnimation() updates effective ScrollOffset for next frame
       // if APZFrameDelay is enabled.
       if (mCheckerboardEvent) {
         mCheckerboardEvent->UpdateRendertraceProperty(
             CheckerboardEvent::UserVisible, visibleRect);

and graphed the results while reproducing Timothy's STR (from comment 24) with both values of layout.css.scroll-anchoring.absolute-updates. The jumps are fairly clear in the first graph.

This is just a guess though, in bug 1856088 we introduced ScrollUpdateType::MergeableAbsolute and we merged the delta into both sampled offsets, but we should merge the delta only in the latest offset?

Note sure if Comment 32 was intended for :botond? Adding needinfo

Flags: needinfo?(botond)

For regression tracking purposes, I'm going to suggest that this be marked firefox122: wontfix:

  • Not backing out the regressing change (including by flipping the default value of the layout.css.scroll-anchoring.absolute-update pref`) is probably still the right call. While we now understand this issue is not specific to autoscroll but also affects e.g. touchpad scrolling, we're still talking about potentially regressing a similar issue on another popular site (Facebook) vs. keeping this regression on Bluesky.
  • Given the tricky nature of the code involved, uplifting a fix (that's not just a backout) to 122 would be risky at this stage. We will continue pursuing a fix and hopefully have one to uplift early in the 123 cycle.

(Regarding the change described in comment 32, I did try it locally and it doesn't seem to fix the issue. A deeper investigation is needed to come up with a working fix.)

Flags: needinfo?(botond)

Sasha, would you mind providing a performance profile result while the jump happens?

My hypothesis is that when a scroll anchor adjustment caused by some DOM node deletions/additions happened if reflowing the contents and building display list for the change takes some amount of time, say 50ms or a couple of VSync frames, in the meantime APZ can scroll further, then when APZ received the updated contents, APZ updates the scroll position where the adjustment tries to set since bug 1856088, it will have some gap between them, so it causes a small jump.

Flags: needinfo?(hikezoe.birchill) → needinfo?(dafri.nochiterov8+mozbugzilla)

Like this? https://share.firefox.dev/3tYKLiO (scrolling back and forth causing jumps) / https://share.firefox.dev/3u07uLu (scrolling only down causing jumps)

Flags: needinfo?(dafri.nochiterov8+mozbugzilla)

Sasha, thanks for the profile results.

Both results have indeed some janky frames on the content's main-thread. Though it's hard to tell whether the frames happened just before the scroll jumps.

Anyways, I've implemented a mitigating way based on my hypothesis. Here is a try run with the way. You can download a Windows version of the binaries via the link of "target.zip" in the "Artifacts and Debugging Tools" pane when you clicked the "Windows 2012 opt B" symbol. (this is the direct link to the "target.zip", FWIW).

Sasha, would you mind trying the binary to see whether this bug is mitigated or not? Thanks!

Flags: needinfo?(dafri.nochiterov8+mozbugzilla)

I pushed the changes of the way for references.

I should note two interesting observations I experienced while I was trying it locally;

  1. I saw not-smooth scrolling while I was trying to see whether the change doesn't have any side-effects on bsky.app, it may be able to called "jump", but I saw it with enabling/disabling layout.css.scroll-anchoring.absolute-update, so I suppose it's something else
  2. I suspected the mitigation regresses the wpt added in bug 1856088, but it doesn't

(In reply to Hiroyuki Ikezoe (:hiro) from comment #37)

Here is a try run with the way.

I've tried this build and unfortunately the scrolling issues are still there. Here's the screen recording (not sure why it got recorded with lower FPS, but you can still see it being jerky and getting in the way a couple of times, as well as rubber banding during auto-scrolling).

Flags: needinfo?(dafri.nochiterov8+mozbugzilla)

I've debugged this a bit locally and I believe I understand what's causing the jumps that I'm seeing (which are shown in the graph in comment 29). The explanation is simpler than I thought:

  • Due to the asynchronicity between the web content main thread and APZ, the current scroll position seen by the main thread can be one or more frames behind the scroll position seen by APZ.
    • (The APZ frame delay mitigates this such that in the case where it's no more than 1 frame behind, scroll linked effects should remain in sync. However, besides scroll linked effects, other aspects of correctness (such as not having scroll position jumps like the ones observed in this bug) should not depend on the main thread's ability to keep up with APZ.)
  • Scroll anchoring happens on the main thread, and triggers a scroll position update calculated based on the main thread scroll position.
    • By the time this update reaches APZ, if there's a scroll animation in progress that updates the scroll offset every frame (such as a fling), the scroll position in APZ will have advanced by one or more frames relative to the one the main thread based its computation on.
    • Prior to bug 1856088, the scroll position update triggered by scroll anchoring was a relative one, so it got combined gracefully with the additional frames of scrolling that happen in APZ, i.e. the composited scroll offset reflected both the additional frames of the fling, and the scroll anchor adjustment.
    • Since bug 1856088, the scroll position update triggered by scroll anchoring is an absolute one, so it effectively "overwrites" the additional frames of scrolling that have happened in APZ, creating the appearance of "jump" backwards relative to the direction of scrolling.

Based on this diagnosis, I'm fairly sure the site isn't doing anything wrong and this is just an APZ bug; I'm accordingly removing the [apz-needsdiagnosis] flag. I think a fix will need to be along the lines of finding a way to fix the scenario in bug 1856088 without using absolute updates for scroll anchoring.

Whiteboard: [apz-needsdiagnosis]

Thanks Botond. I've started fixing bug 1856088 in a different way which is an alternative approach in bug 1856088 comment 2. That is "We might want to defer the clamp later?" in the comment. It means we mainly revert bug 1779404 and bug 1852818. There was a concern about the approach, but I can't recall it at all. So I'd take the approach here and "let's see what happens".

FWIW, here's a try run;https://treeherder.mozilla.org/jobs?repo=try&revision=14a6ab0edf0f3a14fd2af4491e683c8e80966db9

Assignee: nobody → hikezoe.birchill
Status: NEW → ASSIGNED

Two test cases attached in bug 1856088 now work with the previous commit and
work without MergeableAbsolute.

Attachment #9375923 - Attachment is obsolete: true
Attachment #9375922 - Attachment is obsolete: true
Pushed by hikezoe.birchill@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/4075109d54d0 Skip clamping the scroll offset while applying a relative ScrollPositionUpdate and clamp it after applying all ScrollPositionUpdates. r=botond https://hg.mozilla.org/integration/autoland/rev/0824e173f725 Revert bug 1856088. r=botond

Backed out for causing Wr failures on window-scrollBy-display-change.html

Backout link

Push with failures

Failure log

Flags: needinfo?(hikezoe.birchill)

The reason of the window-scrollBy-display-change.html failure is that it's the first paint case, and in D199840 we didn't clamp the scroll offset in the first paint case. I've updated the fixed version.

That's said, I realized that with D199840 and D199841 after-scrollable-range-shrinkage-004.html is flaky. I need to find out the flaky reason.

Flags: needinfo?(hikezoe.birchill)

(In reply to Hiroyuki Ikezoe (:hiro) from comment #48)

The reason of the window-scrollBy-display-change.html failure is that it's the first paint case, and in D199840 we didn't clamp the scroll offset in the first paint case. I've updated the fixed version.

That's said, I realized that with D199840 and D199841 after-scrollable-range-shrinkage-004.html is flaky. I need to find out the flaky reason.

Okay the reason is bug 1845646. With replacing these InvalidateFrame stuff with RecvDidUnsuppressPainting, I can no longer see failure locally. But still I have no idea why the original change for bug 1856088 could avoid the race.

That's unfortunate.

Because of bug 1845646, this scrollBy call (which is to scroll the anchor node initially) is omitted in APZ. This case also happens with the original change for bug 1856088, but it doesn't matter since the original change handles scroll position change by scroll anchoring as absolute position change.

I am going to add a workaround in the test just like we did in bug 1841305. And I will land it after the next soft freeze finished just in case.

I've started implementing bug 1845646 to fix the test failure since even with the workaround the test in question is quite flaky.

A good news is that fixing bug 1845646 wasn't so hard.

Depends on: 1845646

Oops, I forgot mentioning an important thing. Without fixing the race, the test, after-scrollable-range-shrinkage-004.html is stable only with the fix for bug 1856088.

In other words, we can't discard any relative scroll position updates to make the test stable. That means, even with the alternative approach Botond suggested in a review comment, the test is also flaky.

Pushed by hikezoe.birchill@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/1a5755ffb35b Skip clamping the scroll offset while applying a relative ScrollPositionUpdate and clamp it after applying all ScrollPositionUpdates. r=botond https://hg.mozilla.org/integration/autoland/rev/8be226e74ca5 Revert bug 1856088. r=botond
Status: ASSIGNED → RESOLVED
Closed: 8 months ago
Resolution: --- → FIXED
Target Milestone: --- → 125 Branch

Finally! Thanks Botond for your dedicated reviews and advice, etc.!

The patch landed in nightly and beta is affected.
:hiro, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.
  • If no, please set status-firefox124 to wontfix.

For more information, please visit BugBot documentation.

Flags: needinfo?(hikezoe.birchill)

This bug is important, but I'd bake it on nightly for a while.

Flags: needinfo?(hikezoe.birchill)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: