Click location is off when scrolling in nested browser

RESOLVED FIXED in Firefox 69

Status

()

defect
P2
normal
RESOLVED FIXED
3 months ago
4 days ago

People

(Reporter: mstriemer, Assigned: kats)

Tracking

(Blocks 2 bugs)

unspecified
mozilla69
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox69 fixed)

Details

Attachments

(6 attachments)

about:addons is being updated to HTML and has an HTML browser embedded in it. This browser can embed a remote browser with extension content. The extension's browser can be very tall and the HTML browser will scroll to show the full content.

When scrolling around and immediately clicking the click location can be off for a bit until something fixes it [1].

I flipped layers.async-pan-zoom.enabled to false, restarted and tried again. This seemed to fix the problem.

STR

  1. Apply [3] (depends on [2])
  2. Set extensions.htmlaboutaddons.enabled to true
  3. Install Tree Style Tab [4]
  4. Open about:addons
  5. Go to Extensions
  6. Click the ... menu then Preferences for Tree Style Tab
  7. Open the "Context Menu" section, scroll around and click things
  • This is just a good place to test since there are lots of checkboxes

Expected results: Clicking on a checkbox checks it
Actual results: The click sometimes checks a different checkbox

[1] https://www.dropbox.com/s/xewzdzv3b0t3ppe/weird-scrolling.mov.gif?dl=0
[2] https://hg.mozilla.org/try/rev/a78b221eb76367918c6e7c79f2f447b1f6d954a4
[3] https://hg.mozilla.org/try/rev/97d018b38f205eed63bd65b5dd58aae8601fa62f
[4] https://addons.mozilla.org/en-US/firefox/addon/tree-style-tab/

Blocks: 1532724

Hey kats, is it easy to tell if this is just the same type of breakage we get from having a "remote browser embedded inside a scrolling thing" (from https://bugzilla.mozilla.org/show_bug.cgi?id=1390445#c17 ) but the affects seem to be worse in this case? Or something else is going on as well as that?

Flags: needinfo?(kats)

It might be a similar problem, I can't tell just from the STR. I'll investigate this more tomorrow or early next week, leaving needinfo on me for now.

I worked around the issue for now by calling window.windowUtils.flushApzRepaints(); in a scroll handler.

There is another issue where when scrolling it seems to want to scroll the inner browser even though its height is the same as the content height. It only scrolls it by 1 pixel and shows the scrollbar on the container browser. I've worked around this by making the browser 1 pixel larger than it needs to be.

I recorded a video [1] and it's a little hard to tell, but the scrollbar becomes visible and you can see that the content is moving by 1px as I scroll. To make it scroll the containing browser you need to continue scrolling in the same direction, so when scrolling the inner browser finish by scrolling down, then scroll down again to scroll the containing browser. Scrolling up in that case would try to scroll the inner browser again.

[1] https://www.dropbox.com/s/p8do729tw5f7ppn/scroll-stuck.mov?dl=0

I can repro, will take a look.

Assignee: nobody → kats
Flags: needinfo?(kats)

P2 since it's blocking a P2

Priority: -- → P2

Actually, can you rebase the patches to m-c tip? The build I had made locally with the patches from your try push is suffering from the addon cert expiry problem and there's some nontrivial conflicts when I try rebasing to m-c tip.

Flags: needinfo?(mstriemer)
Blocks: 1549450

Ok so I applied Mark's rebased patches and debugged a bit. I think comment 1 is exactly right - we have a remote browser embedded in a scrolling thing, and when we send the mouse event to the remote browser it doesn't have the right coordinates. I need to spend some time refreshing my mental model of how this all should work.

It's a little confusing because it feels like there's two levels of "async"-ness with scrolling. One is the true asyncness, where the scroll offset is changed in APZ/compositor and has not yet been propagated to the main thread. The other (which I'll call "false asyncness") is due to paint skipping, and introduces a lag between the main thread scroll position and the layer's CSS transform (the latter will only update on a repaint, which paint skipping makes infrequent).

The bit that's relevant to this bug is that when the mouse event goes from the BrowserParent to the BrowserChild, it gets transformed by this code which is applying a transform derived from the CSS transform. And it's this "false asyncness" that is making this event transformation transiently incorrect. If paint skipping is disabled then everything works fine.

This seems to imply that the code added in bug 1530661 that updates this transform should actually be including another component, to account for this false asyncness, and that these matrices should get updated more often (~per RequestContentRepaint). But on the other hand I don't think that matrix was intended to account for scrolling deltas so maybe there's two bugs here, and one is partially cancelling out the other...

Flags: needinfo?(mstriemer)

introduces a lag between the main thread scroll position and the layer's CSS transform

This should really say "the layer's transform", not "the layer's CSS transform".

But anyway, I spent more time paging this all back in, specifically with how paint-skipping works (both when initiated by JS-driven scrolling and APZ scrolling). The quick summary is that with paint skipping mLastContentPaintMetrics remains at the last-painted scroll offset, while the APZ and the main thread scroll positions get updated to the most recent thing. If APZ is driving the changes, as it is in this case, the mExpectedGeckoMetrics gets updated so APZ knows what the main thread version of the scroll position is, and this causes the TransformToLastDispatchedPaint to correspondingly change as repaint requests get sent.

I also gave some thought as to what the parent-to-child transform in BrowserParent is supposed to represent and I think it does need to include scroll deltas, so I think this:

This seems to imply that the code added in bug 1530661 that updates this transform should actually be including another component, to account for this false asyncness, and that these matrices should get updated more often (~per RequestContentRepaint)

is the right approach here. The missing component is the TransformToLastDispatchedPaint, and needs to get multiplied into the transforms sent over by CollectTransformsForChromeMainThread. And that message needs to get sent more frequently. I'll try implementing this and see how it goes.

My WIP implementation for this does indeed fix the problem. Running into a bit of a problem with lock ordering though, since we need to call CollectTransformsForChromeMainThread which acquires the tree lock from inside RequestContentRepaint which holds the APZC lock. I'll contemplate this a bit more.

This bug will affect fission too. I was able to write a test that exercises it, and I have a more-or-less working patch that I need to clean up a bit.

Blocks: 1519412

Most of those failures are due to bug 1559259 which I included in my try push. That I fixed in bug 1559565 via partial revert.

The APZCTreeManager already knows its root layers id, so we don't need
to be passing it all over the place. I previously verified via a try
push with an assertion that the incoming aRootLayersId is always equal
to mRootLayersId.

In cases such as APZ scrolling with paint-skipping, there is an
additional transform component that needs to be included in the
transforms that APZ sends to the main thread for the purposes of
tracking OOP iframe positions. This component also updates more
frequently, generally on each call to RequestContentRepaint, as
mExpectedGeckoMetrics is updated.

This patch adds some machinery to send partial updates to the main
thread, for layers subtrees that are descendants of a given APZC. It
also includes the missing transform component in the calculated
matrices.

Depends on D35200

The case being fixed by this bug is currently relatively rare, in that
it requires the scrolling of a frame that contains OOP content. This
patch adds a bit of optimization to ensure that we only send the
affected transforms (i.e. for the scrolled OOP layers subtrees) in this
scenario, so that we don't cause unnecessary IPC overhead.

Depends on D35201

This scrolls the document containing the OOPIF and ensures that
hit-testing still works. Some of the fission hit-testing machinery is
still WR-only, so we have to restrict the subtest to only run when WR
is enabled.

Depends on D35203

Pushed by kgupta@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6fde36402c40
Remove unnecessary root layers id argument. r=botond
https://hg.mozilla.org/integration/autoland/rev/0454826b6047
Send OOP frame transforms to the main thread more often. r=botond
https://hg.mozilla.org/integration/autoland/rev/a6aa3a8b41b0
Optimization to reduce the number of IPC messages. r=botond
https://hg.mozilla.org/integration/autoland/rev/1ddcedc57f6f
Remove unused dangling argument. r=botond
https://hg.mozilla.org/integration/autoland/rev/a222c1b94540
Add mochitest to exercise scrolling an oopif's ancestor. r=botond
https://hg.mozilla.org/integration/autoland/rev/b8f38148ccbc
Rename functions per review comment suggestions. r=botond

Backed out 6 changesets (Bug 1548687) for browser chrome failure at gfx/layers/apz/test/mochitest/browser_test_group_fission.js.

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception&revision=b8f38148ccbc4bc094d685c296ad49f886a40689

Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=252501012&repo=autoland&lineNumber=35526

Backout link: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception&classifiedState=unclassified&revision=c05a24ea5373052bde91d5aeb736ad7a230dc9c7

20:59:02     INFO - TEST-PASS | gfx/layers/apz/test/mochitest/browser_test_group_fission.js | http://mochi.test:8888/browser/gfx/layers/apz/test/mochitest/helper_fission_scroll_oopif.html | code_for_oopif_to_run successfully installed - 
20:59:02     INFO - Buffered messages finished
20:59:02     INFO - TEST-UNEXPECTED-FAIL | gfx/layers/apz/test/mochitest/browser_test_group_fission.js | http://mochi.test:8888/browser/gfx/layers/apz/test/mochitest/helper_fission_scroll_oopif.html | window has scrolled by 10 pixels - Got 1, expected 10
20:59:02     INFO - Stack trace:
20:59:02     INFO - chrome://mochikit/content/browser-test.js:test_is:1324
20:59:02     INFO - chrome://mochitests/content/browser/gfx/layers/apz/test/mochitest/FissionTestHelperParent.jsm:receiveMessage:49
20:59:02     INFO - GECKO(10340) | Finished synthesizing click, waiting for OOPIF message...
20:59:02     INFO - GECKO(10340) | OOPIF got click at 48,47
20:59:02     INFO - GECKO(10340) | OOPIF response: {"x":48,"y":47}
20:59:02     INFO - TEST-PASS | gfx/layers/apz/test/mochitest/browser_test_group_fission.js | http://mochi.test:8888/browser/gfx/layers/apz/test/mochitest/helper_fission_scroll_oopif.html | x-coord of old and new match - 
20:59:02     INFO - Not taking screenshot here: see the one that was previously logged
20:59:02     INFO - TEST-UNEXPECTED-FAIL | gfx/layers/apz/test/mochitest/browser_test_group_fission.js | http://mochi.test:8888/browser/gfx/layers/apz/test/mochitest/helper_fission_scroll_oopif.html | y-coord of old and new match - Got 47, expected 48
20:59:02     INFO - Stack trace:
20:59:02     INFO - chrome://mochikit/content/browser-test.js:test_is:1324
20:59:02     INFO - chrome://mochitests/content/browser/gfx/layers/apz/test/mochitest/FissionTestHelperParent.jsm:receiveMessage:49
20:59:02     INFO - GECKO(10340) | [Child 3524, Main Thread] WARNING: No active window: file z:/build/build/src/js/xpconnect/src/XPCJSContext.cpp, line 662
20:59:02     INFO - GECKO(10340) | Finished test http://mochi.test:8888/browser/gfx/layers/apz/test/mochitest/helper_fission_scroll_oopif.html
20:59:02     INFO - GECKO(10340) | --DOCSHELL 000001F733B49800 == 0 [pid = 8244] [id = {ef443b0d-e88e-4ca9-89c1-3fd296685a21}] [url = about:blank]
20:59:02     INFO - GECKO(10340) | --DOCSHELL 00000120FA953000 == 4 [pid = 9808] [id = {1e6a72fa-3b22-49dc-899b-a5d5b4082232}] [url = moz-extension://88bd9b96-83de-4b22-b402-d01301f4f815/_generated_background_page.html]
20:59:02     INFO - GECKO(10340) | [GPU 10168, Compositor] WARNING: Possibly dropping task posted to updater thread: file z:/build/build/src/gfx/layers/apz/src/APZUpdater.cpp, line 428
20:59:02     INFO - GECKO(10340) | --DOCSHELL 000001AC5847B800 == 9 [pid = 2368] [id = {44c80c23-45bd-4523-9871-48f305e61732}] [url = about:blank]
20:59:02     INFO - GECKO(10340) | --DOCSHELL 000001AC58481800 == 8 [pid = 2368] [id = {73d315ad-56d7-4520-8d32-07c5ebd4dd59}] [url = about:blank]
20:59:02     INFO - Leaving test bound test_main
Flags: needinfo?(kats)

Weird, this is still passing consistently for me on Mac. I'll try on Windows.

Flags: needinfo?(kats)

I was able to repro on Windows. The scroll wheel input is starting a wheel animation so we need to make sure that runs all the way to the end before testing the scroll position. Also it only scrolls to y=8.5 instead of y=10 so I'm loosening the check to just check it's more than 5 pixels which should be sufficient for the purposes of the test.

Pushed by kgupta@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d93036e907e5
Remove unnecessary root layers id argument. r=botond
https://hg.mozilla.org/integration/autoland/rev/6d41cf61f1f8
Send OOP frame transforms to the main thread more often. r=botond
https://hg.mozilla.org/integration/autoland/rev/1171aa4d8d43
Optimization to reduce the number of IPC messages. r=botond
https://hg.mozilla.org/integration/autoland/rev/58fe308249b1
Remove unused dangling argument. r=botond
https://hg.mozilla.org/integration/autoland/rev/db1bd24129bc
Add mochitest to exercise scrolling an oopif's ancestor. r=botond
https://hg.mozilla.org/integration/autoland/rev/9bf150eba100
Rename functions per review comment suggestions. r=botond

Hey Mark!
Thanks for the detailed steps, they are clear and concise but I'm not sure they can be followed from the manual QA perspective. To be more specific I am talking about the first step where we need to apply specific patches (Apply [3] (depends on [2])).
Is there a workaround for that?

Flags: needinfo?(mstriemer)

I'm not sure my STR are the right way to verify this. We added a workaround until this issue was fixed so I'm not sure you'd be able to reproduce this in about:addons.

kats: should this get some QA time?

Flags: needinfo?(mstriemer) → needinfo?(kats)

I think it would be better to remove the workaround and have QA test builds after that to ensure the problem hasn't returned.

Flags: needinfo?(kats)

If no clear steps can be produced it would help us to know a general range for our regression tests. Where are the risk areas, in what context the old bug reproduced.
@Mark
In the first comment I have observed that the problem was related to the check boxes in the options menu in about:addons. Do you think that running some tests to make sure check boxes work correctly in about:addons is enough?

The STR from 3-7 would still work for testing that about:addons works as expected, although it still has a forced workaround. That workaround will be removed in bug 1565235, which could be a better place to test the about:addons page.

You need to log in before you can comment on or make changes to this bug.