Closed Bug 1280667 Opened 3 years ago Closed 3 years ago

Dragging fake scrollbar on Facebook jumps around wildly

Categories

(Core :: Panning and Zooming, defect, P1)

48 Branch
All
macOS
defect

Tracking

()

VERIFIED FIXED
mozilla50
Tracking Status
firefox48 + verified
firefox49 + verified
firefox50 + verified

People

(Reporter: mstange, Assigned: kats)

References

()

Details

(Keywords: regression, Whiteboard: [gfx-noted])

Attachments

(2 files)

Facebook has a custom (fake) scrollbar in the mini newsfeed in the top right corner. Dragging that scrollbar shows behavior similar to the one reported in bug 1258851.
Regression from non-APZ, or did it work at some point with APZ and stopped?
Version: Trunk → 48 Branch
I can repro this on 48 as well if I click on the notifications icon on the FB top-toolbar and then drag the scrollbar inside to scroll through the notifications.
Priority: -- → P1
Turning off paint-skipping doesn't help here either.
Whiteboard: [gfx-noted]
I'm starting investigation on this one.
Assignee: nobody → bugmail.mozilla
Order of events in one example:

Main thread scrolls to y=3319
Main thread scrolls to y=3622
APZ gets layers update with y=3319
APZ sends repaint request with y=3319
Main thread rejects repaint request, sets callback transform y=-303
Input event goes through APZ (no untransform) and main thread (applies -303 callback transform)
The callback transform produces a discontinuity in the screenY/clientY coordinates of the event which throws off the "fake scrollbar" scrolling, and introduces further discontinuities and diverges.

This is basically the same problem we had with scrollbars in bug 1244549 and bug 1258851, the same thing botond is trying to fix in bug 1259781. I have a simple patch that extends the scrollbar hack to cover fixed-pos items as well which solves the problem on my reduced test case and Facebook. I think it should be safe enough to uplift to 48, whereas any other "more correct" solutions will carry more risk.

That being said I'm starting to wonder if we'll want to do the clone-refpoint thing I suggested in bug 1259781 comment 9 regardless. That would allow us to drop the scrollbar hack and move incrementally towards a more correct solution.
I wrote the hacky fix and wrote a test to go with it, try push at https://treeherder.mozilla.org/#/jobs?repo=try&revision=933878c938c1

I'll see if I can write a better fix with the clone-refpoint method.
New try push at https://treeherder.mozilla.org/#/jobs?repo=try&revision=55feac48ef58&group_state=expanded&selectedJob=22881553 looking better, there's still one test failure which I think I can fix easily.

I spent a while trying to come up with some mental model to decide which of these approaches is more correct. One thing I realized is that if the APZ scroll position is different from the main-thread scroll position, then there's going to be some inconsistencies exposed to web content, and there's no way to avoid that.

Consider the case where we have a page is scrolled to y=0 on the MT and y=100 in APZ. The user does a mousedown at y=100 on the screen, which is visually y=200 on the page because of the APZ scroll offset. Now when we dispatch that event to web content, assuming the event has a single refPoint (which is the case in trunk gecko), the pageY and clientY values on the event will be the same, because they are both computed from the same refpoint and the MT scroll offset is 0. In this case we will send a refpoint with y=100 from widget to APZ, APZ will untransform that to a refpoint of y=200 (because of the transient async transform), and so web content will report pageY of 200 and clientY of 200. However in reality the clientY should be 100 (distance from mouse position to top of client area, as per spec). In the general case, if the APZ scroll offset is different from the MT scroll offset, the pageY and clientY values on the event cannot both be correct as long as we are computing them from a single refpoint (I hereafter call this "the APZ dilemma").

A different aspect of the dilemma is that in the non-APZ world web content could generally assume that e.clientY + window.scrollY == e.pageY, with all the values being correct per spec, but this assumption no longer holds with APZ. In current gecko trunk, the e.clientY value comes out incorrect even though we maintain the equality. My patch tries to resolve the APZ dilemma by splitting the single refpoint into two, so that we can have correct (per spec) pageY and clientY values. However, it breaks the equality.

There are also inconsistencies with other properties. For example element.getBoundingClientRect() is computed on the MT using the MT scroll position. If we change e.clientY to be visually correct, it will be using the APZ scroll position. That means web content implementing hit-testing by comparing e.clientY and element.getBoundingClientRect() will fail, and I suspect that's a reasonably common thing that happens on the web. So I'm not sure going down this road is worthwhile, as it just exchanges one can of worms for another.

Need to think about this more still.
Up until recently, my understanding was that an APZ callback transform arose in two situations:

  1) When panning around within the initial scroll port of a zoomed-in overflow:hidden
     root scroll frame.

  2) When Gecko introduces a fractional difference between the APZ scroll position and
     the Gecko scroll position by rounding to an integer number of Layer pixels.

In case (1), we can probably live with the "APZ dilemma", because the whole scenario is a bit of an edge case (it only happens on mobile, and even then only on sites that are not mobile-friendly and need to be zoomed in). In case (2), the issue shouldn't be noticeable.

However, recently - in bug 1259781 and this bug - we seem to have encountered a third category of situations, where we sometimes have a large callback transform in spite of not being in the edge-case situation (1). I confess I don't fully understand yet the mechanism by which the callback transform arises in these cases, but I wanted to suggest one avenue of exploration: instead of solving the "APZ dilemma" for this case, could we avoid it by modifying our scroll offset synchronization mechanism in some way that prevents the callback transform from arising in the first place?
The two situations you describe are the only ones in which we have a "steady state" callback transform. We can get transient callback transforms winking in and out of existence in other scenarios, such as when the APZ sends a repaint request while the main thread has just changed the scroll position. That is the case here. The callback transform is transient but high-frequency during scrollbar dragging or the scenario in this bug, and if an event goes through while the callback transform is in place the clientX/Y on the event comes out wrong.
Can we detect the difference between a steady-state callback transform and a transient one? If so, I wonder what would happen if we just ignored the transient one, i.e. never applied it.
That sounds equivalent to just never setting a transient callback transform in the first place. I could probably construct a case where that would regress something, but I don't know how often that happens in practice.
I gave this some more thought, and I think your suggestion might be the best way forward. For one thing, we already have scenarios where there's a race condition between user input events and content changing the scroll position. i.e. consider if the user is clicking on something at the same time as a setTimeout is changing the scroll position. The mouse events could arrive just at the content process before or just after the scroll position change, and so it might click at the old or the new scroll position. The existence of the transient callback transform just changes the click from arriving at the "new" position to arriving at the "old" position, but fundamentally doesn't eliminate the race. So if we got rid of it, the overall behaviour should be no worse.

In terms of the APZ dilemma formulation, we would be changing the behaviour from having incorrect e.clientY and correct e.pageY to the reverse - a correct e.clientY and an incorrect e.pageY. While this sounds bad, the "incorrect e.pageY" scenario can already happen in the racy condition described in the previous paragraph, so not really a big deal.
I'm not super happy with the test because it's racy. Without the patch it's failed every single time I've run it, but in theory the test might still pass if the stars align just so. With the patch the test should never fail. So in theory if we regress this behaviour, the test might start failing intermittently and it might be hard to pin down. But in my local runs I've gotten a 100% failure rate so hopefully even on try the intermittent rate will be high enough.

I wasn't able to come up with a way to isolate the behaviour in question with our existing APIs exposed to testing infrastructure, and considering this is a P1 that needs uplifting to beta I thought it was better to get it landed in this form now.
Once this lands let's verify the fix - then please request uplift.
Flags: needinfo?(andrei.vaida)
Comment on attachment 8765497 [details]
Bug 1280667 - Don't update the callback transform if the APZ repaint request was rejected by a concurrent main-thread scroll position update.

https://reviewboard.mozilla.org/r/60848/#review57938

Nice! I assume this fixes the STR of bug 1259781, too?
Attachment #8765497 - Flags: review?(botond) → review+
(In reply to Botond Ballo [:botond] from comment #19)
> Nice! I assume this fixes the STR of bug 1259781, too?

It should in theory, but I haven't actually tested it. Bug 1259781 was pretty hard for me to consistently reproduce so it would be hard for me to verify with any degree of certainty.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #20)
> (In reply to Botond Ballo [:botond] from comment #19)
> > Nice! I assume this fixes the STR of bug 1259781, too?
> 
> It should in theory, but I haven't actually tested it. Bug 1259781 was
> pretty hard for me to consistently reproduce so it would be hard for me to
> verify with any degree of certainty.

I found the STR from bug 1259781 comment 10 to be fairly reliable; I'll test it when I get into the office tomorrow.
https://hg.mozilla.org/mozilla-central/rev/06685c0eba05
https://hg.mozilla.org/mozilla-central/rev/a5f7ad06f03b
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Comment on attachment 8765497 [details]
Bug 1280667 - Don't update the callback transform if the APZ repaint request was rejected by a concurrent main-thread scroll position update.

Approval Request Comment
[Feature/regressing bug #]: APZ, probably regressed by bug 1242690 but didn't verify that
[User impact if declined]: The fake scrollbars that Facebook has in some places (they're not platform scrollbars, they are reimplemented in HTML) don't behave properly when dragged with the mouse.
[Describe test coverage new/current, TreeHerder]: tested locally on Facebook, as well as added automated test for it
[Risks and why]: some risk, the code is a little tricky and may introduce regressions. QA cycles on this and/or a little bake time would probably be good.
[String/UUID change made/needed]: none
Attachment #8765497 - Flags: approval-mozilla-beta?
Attachment #8765497 - Flags: approval-mozilla-aurora?
Attachment #8765498 - Flags: approval-mozilla-beta?
Attachment #8765498 - Flags: approval-mozilla-aurora?
I will verify this tomorrow when the fix reaches Nightly. Currently latest Nightly 2016-06-29 doesn't have the fix.
Flags: needinfo?(andrei.vaida) → needinfo?(petruta.rasa)
QA Contact: petruta.rasa
(In reply to Botond Ballo [:botond] from comment #22)
> (In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #20)
> > (In reply to Botond Ballo [:botond] from comment #19)
> > > Nice! I assume this fixes the STR of bug 1259781, too?
> > 
> > It should in theory, but I haven't actually tested it. Bug 1259781 was
> > pretty hard for me to consistently reproduce so it would be hard for me to
> > verify with any degree of certainty.
> 
> I found the STR from bug 1259781 comment 10 to be fairly reliable; I'll test
> it when I get into the office tomorrow.

Yup, can't repro those STR with this fix any more! \o/
Comment on attachment 8765497 [details]
Bug 1280667 - Don't update the callback transform if the APZ repaint request was rejected by a concurrent main-thread scroll position update.

Taking it (impact a major website) in aurora to improve the coverage
Beta later.
Attachment #8765497 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8765498 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Duplicate of this bug: 1276122
Verified as fixed using latest Nightly 50.0a1 across platforms.
Status: RESOLVED → VERIFIED
Flags: needinfo?(petruta.rasa)
Duplicate of this bug: 1259781
Comment on attachment 8765497 [details]
Bug 1280667 - Don't update the callback transform if the APZ repaint request was rejected by a concurrent main-thread scroll position update.

This patch fixes a regression. Take it in 48 beta 6.
Attachment #8765497 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment on attachment 8765498 [details]
Bug 1280667 - Add a test.

Take it in 48 beta 6.
Attachment #8765498 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
has problems to apply to beta:

grafting 352141:7dca72fff37b "Bug 1280667 - Add a test. r=botond a=sylvestre"
remote changed gfx/layers/apz/test/mochitest/test_group_mouseevents.html which local deleted
use (c)hanged version, leave (d)eleted, or leave (u)nresolved? c
merging gfx/layers/apz/test/mochitest/mochitest.ini
warning: conflicts while merging gfx/layers/apz/test/mochitest/mochitest.ini! (edit, then use 'hg resolve --mark')
abort: unresolved conflicts, can't continue
Flags: needinfo?(bugmail)
Uplifting the test would require uplifting bug 1264017 and a bunch of other things, so I just skipped that patch. Uplifted the fix:

https://hg.mozilla.org/releases/mozilla-beta/rev/6ad50d4d6f7b
Flags: needinfo?(bugmail)
I reproduced this bug using Fx 48.0b1 build ID:20160507030302 , on OS X 10.9.5 .
Confirmed the fix on Fx 48.0b6, build ID  20160706215822 and Fx 49.0a2, build ID  20160708004052 on OS X 10.9.5 .
Pushed by kgupta@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/081472be4c0e
Followup to use the ok function from the testgroup, which logs the subtest name as well. r=me
You need to log in before you can comment on or make changes to this bug.