[Android Fission] Testing MOZ_ASSERT(false) (Two layers that scroll together have different ancestor transforms)
Categories
(Core :: Panning and Zooming, defect, P2)
Tracking
()
People
(Reporter: owlish, Assigned: botond)
References
Details
(Whiteboard: [fission:android][fxdroid])
Attachments
(13 files)
119.72 KB,
text/plain
|
Details | |
248.92 KB,
text/plain
|
Details | |
14.37 KB,
text/plain
|
Details | |
30.03 KB,
text/plain
|
Details | |
26.06 KB,
text/plain
|
Details | |
403 bytes,
text/html
|
Details | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review |
mochitest-plain variant. Debug builds.
Comment 1•1 year ago
|
||
Failure log: https://treeherder.mozilla.org/logviewer?job_id=409186678&repo=try&lineNumber=5113
I've attached the snippet of the logcat log for this particular test run. Notably, I see that there were 2 assertion failures, both with the same message, may have caused a process to crash.
03-16 19:33:42.791 26894 26928 F MOZ_Assert: Assertion failure: false (Two layers that scroll together have different ancestor transforms), at /builds/worker/checkouts/gecko/gfx/layers/apz/src/APZCTreeManager.cpp:1359
This corresponds to this assertion (https://searchfox.org/mozilla-central/rev/11dbac7f64f509b78037465cbb4427ed71f8b565/gfx/layers/apz/src/APZCTreeManager.cpp#1356-1359).
Crash backtraces from failure log: https://treeherder.mozilla.org/logviewer?job_id=409186678&repo=try&lineNumber=11602
I think a crash like this could easily explain a test timeout like the one found here, so it seems somewhat likely that it is responsible for the failure here. Leaving a ni? for :botond to look into this specific failure.
Unfortunately, I was unable to reproduce this specific test failure when running just this single test locally with ./mach mochitest security/manager/ssl/tests/mochitest/mixedcontent/test_unsecureIframe2.html --setpref=fission.autostart=true
or when running the entire directory with ./mach mochitest security/manager/ssl/tests/mochitest/mixedcontent/ --setpref=fission.autostart=true
, so it may be difficult to test in isolation.
Comment 3•1 year ago
|
||
As I've removed the name of the test which originally caused the failure from the title, I'll note it was security/manager/ssl/tests/mochitest/mixedcontent/test_unsecureIframe2.html here. It appears this failure is not bound to a specific test, as a similar assertion failure was also found in dom/base/test/test_bug769117.html in bug 1827725, making fixing this a blocker for us being able to enable Android Fission mochitests.
Assignee | ||
Comment 4•1 year ago
|
||
Thanks for flagging this! I brought this up at our APZ team meeting and we're going to prioritize investigating this as an Android Fission blocker.
Comment 5•1 year ago
|
||
Comment 6•1 year ago
|
||
This test assertion failure blocks Android Fission.
Comment 7•11 months ago
|
||
Assignee | ||
Comment 8•11 months ago
|
||
My attempts to even try to reproduce this locally have been thwarted by bug 1803027. If anyone familiar with Android tests could aid me in further diagnosis of that bug, that would be appreciated.
Assignee | ||
Comment 9•11 months ago
|
||
Timothy, how did you obtain the logs attached in comment 5?
Comment 10•11 months ago
|
||
Hmm, it was a while ago but I believe I set the pref layout.display-list.dump-content in mochitest.ini for that test and then pushed to try and then searched all the logs I could find linked from the artifact view.
Assignee | ||
Comment 11•11 months ago
|
||
(In reply to Botond Ballo [:botond] from comment #8)
My attempts to even try to reproduce this locally have been thwarted by bug 1803027. If anyone familiar with Android tests could aid me in further diagnosis of that bug, that would be appreciated.
Update: I was able to reproduce this using an emulator build, which is unaffected by bug 1803027.
Assignee | ||
Comment 12•11 months ago
|
||
Assignee | ||
Comment 13•11 months ago
|
||
Assignee | ||
Comment 14•11 months ago
•
|
||
Dan and I had a look at the logs from test_unsecureIframe2.html. Summarizing our findings:
- The logs only show the relevant data structures from the toplevel content process. For a more complete picture, we should get equivalent logs from the nested content process.
- The WebRenderScrollData tree shows two scroll frames (scrollId=3, which is the root scroll frame of the root content document, and scrollId=4, which is the root scroll frame of an in-process iframe).
- scrollId=4 is present on two sibling nodes of the tree, only one of which has a transform. This in and of itself does not violate the constraint being asserted (the transform, being on the node itself, is not part of the scroll frame's ancestor transform), but it could play a role in a scroll frame in the nested content process violating the constraint. (For clarity, the assertion message should include the scrollId of the affected scroll frame.)
- It's unclear where the transform is coming from, as there are no
Transform
display items in the Gecko display list, nor is there an obvious source of the transform based on the page structure.
Next steps:
- Get more complete logs, including in particular logs from the nested content process.
- Investigate where the transform is coming from.
Assignee | ||
Comment 15•11 months ago
|
||
The transform is coming from nsDisplayRemote::UpdateScrollData().
Reporter | ||
Comment 16•11 months ago
|
||
Botond, would it be safe to assign the bug to you, since you seem to be actively working on it?
Assignee | ||
Comment 17•11 months ago
|
||
Yep, I can take this.
Assignee | ||
Comment 18•11 months ago
|
||
Attached are some better logs from test_unsecureIframe2.html. They include a compositor-side WebRenderScrollData dump (which show the complete tree from all processes).
Here's what I can tell from these logs:
- While the page does have a remote iframe, at the time the assertion fires the remote iframe's layer tree has not been sent to the compositor yet. So, even though I'm capturing logs from the iframe's content process, at the time of the assertion there are none to show yet.
- The contents of the remote iframe are therefore not relevant to the assertion. In particular, the scroll frame that's violating the constraint is not inside the remote iframe.
- The scroll frame violating the constraint is in fact
scrollId=4
, which seems to be the root scroll frame of an in-process iframe that contains the remote iframe.- My claim in comment 14 that
scrollId=4
was not violating the constraint was clearly wrong, though I don't yet know what part of my reasoning was mistaken. I will need to debug this in more detail.
- My claim in comment 14 that
(In reply to Botond Ballo [:botond] from comment #15)
The transform is coming from nsDisplayRemote::UpdateScrollData().
I did look into the reason we have a transform here, and it's simply that nodes coming from the remote iframe's content process have coordinates relative to the iframe's origin, and these need to be translated to be relative to the enclosing content process's origin. It's as if we had a Transform
display item there, but the translation is stored directly on the Remote
item instead.
As such, I don't think there's anything conceptually problematic with this page structure (which is in fact quite simple), we likely just have a bug somewhere in the pipeline that gives rise to this assertion when it shouldn't.
Assignee | ||
Comment 19•11 months ago
|
||
(In reply to Botond Ballo [:botond] from comment #18)
My claim in comment 14 that
scrollId=4
was not violating the constraint was clearly wrong, though I don't yet know what part of my reasoning was mistaken.
Ok, I figured out where I went wrong.
The assertion fires when the ancestor transforms of two nodes in the expanded tree bearing metadata for the same scrollframe do not match. The ancestor transform of a given node in the expanded tree is the product of the transforms of its (proper) ancestor nodes in the expanded tree.
I saw in the logs that the only transform in the unexpanded tree was a transform=[ 1 0; 0 1; 20 83; ]
, and reasoned that during expansion, this will go on the scrollId=4 node itself (because transforms go on the bottom layer), and therefore will not be part of the scrollId=4 node's ancestor transform.
I was correct that the transform=[ 1 0; 0 1; 20 83; ]
goes on the scrollId=4 node and is not part of that node's ancestor transform, but I overlooked that there is also a resolution=0.816327
on the same unexpanded node, and during expansion the resolution induces a scale transform on the expanded node as well.
Unlike the transform
, the resolution
goes on the top layer, and therefore does become part of the bottom node's ancestor transform.
The resolution
also comes from nsDisplayRemote::UpdateScrollData
.
The next thing to answer is: what is that resolution doing there? It looks like it was added (or rather, split out from the transform), in bug 1715187. The fact that the resolution is present on one of the two scrollId=4 nodes but not both, does seem wrong.
Assignee | ||
Comment 20•11 months ago
|
||
(For diagnosis purposes, at this stage we can be confident that this is an APZ bug, so I'm removing [apz-needsdiagnosis]
.)
Comment 21•11 months ago
|
||
(In reply to Botond Ballo [:botond] from comment #19)
The next thing to answer is: what is that resolution doing there? It looks like it was added (or rather, split out from the transform), in bug 1715187. The fact that the resolution is present on one of the two scrollId=4 nodes but not both, does seem wrong.
I'm only half remembering this stuff from last time I tangled with it, but I'll add a little here in hopes it helps you.
I think bug 1715187, comment 11 is relevant here. We don't put the resolution into these transforms except when we encounter an iframe. Which seems and feels weird. But, IIRC, we have to put the resolution on remote iframe items otherwise we can't target their process correctly. Perhaps the resolution can be incorporated into the transform that we set on the browsingcontext so that the process gets targeted correctly, but do not incoporate the resolution into the webrender scrolldata? Is that possible or would break things that webrenderscrolldata tree needs to have?
Assignee | ||
Comment 22•10 months ago
|
||
The only conditions necessary for triggering this scenario are:
- The root content document has a resolution (pinch-zoom scale) that's not 1. On Android this can happen on page load due to mobile viewport sizing, but this can also happen on desktop platforms if you pinch-zoom the page in.
- The page contains a remote iframe.
- There is an intermediate scroll frame in between the remote iframe's root scroll frame, and the root content doc's root scroll frame.
That means this issue is not specific to Android. Indeed, I can hit the assertion in a debug build on desktop, by loading the attached page and pinch-zooming in.
Assignee | ||
Comment 23•10 months ago
|
||
(In reply to Timothy Nikkel (:tnikkel) from comment #21)
I think bug 1715187, comment 11 is relevant here. We don't put the resolution into these transforms except when we encounter an iframe. Which seems and feels weird. But, IIRC, we have to put the resolution on remote iframe items otherwise we can't target their process correctly. Perhaps the resolution can be incorporated into the transform that we set on the browsingcontext so that the process gets targeted correctly, but do not incoporate the resolution into the webrender scrolldata? Is that possible or would break things that webrenderscrolldata tree needs to have?
These are very good questions.
I haven't arrived at a complete answer, but I wanted to record the current state of my understanding before going on PTO.
- Besides the "transform that we set on the browsingcontext so that the process gets targeted correctly" (this is the
mChildToParentConversionMatrix
stored onBrowserParent
andBrowserChild
; henceforth, "CTP transform"), the main additional place where the transforms on WebRenderScrollData nodes are propagated are the screen-to-apzc and apzc-to-gecko transforms. These are used to convert the coordinates of input events from their original Screen coordinates to the local "ParentLayer" coordinates that the target APZC processes the event in, and then onwards to coordinates suitable for Gecko processing (which are Screen coordinates adjusted to remove some transforms Gecko doesn't know about). - So, we could arrange for the resolution transform discussed in previous comments to be present in the CTP transform but not in the screen-to-apzc and apzc-to-gecko transforms, but this would mean the ParentLayer coordinates received by APZCs of descendant nodes would be numerically different.
- For many purposes, such a change to the magnitude of ParentLayer coordinates is a no-op; for example, coordinates would still round-trip from Screen to Gecko coordinates correctly (we'd be dropping one component of the screen-to-apzc transform, and a corresponding inverse from the apzc-to-gecko transform, such that if you apply both in the succession to event coordinates, you get the same result as before). However, there are some situations in which the numerical value of ParentLayer coordinates is important. Based on inspection, these include:
- Some scrollbar stuff, including async scrollbar dragging (ConvertScrollbarPoint calculations) and async transforms of scrollbars (ScrollThumbUtils calculations, though maybe not for the apz.scrollthumb.recalc codepath)
- Code that compares event coordinates in ParentLayer pixels to other ParentLayer quantities which are computed independently (not from event coordinates). Most notably,
mCompositionBounds
is such a quantity, and such comparisons are commonplace; basically, changes to the screen-to-apzc transform need to be made in sync with corresponding changes to code that calculates the composition bounds.
- Based on experimentation, when the page is zoomed in, the composition bounds of in-process subframes include the resolution but the composition bounds of out-of-process subframes do not. This seems to be a deliberate choice (it's discussed in this comment above
mCumulativeResolution
, which is an input to the composition bounds calculation for subframes), and it's consistent with the fact that the screen-to-apzc transform for in-process subframes does not include the resolution while the screen-to-apzc transform for out-of-process does (due to the resolution coming from nsDisplayRemote). - So, we can probably make the suggested change if we also change the cumulative resolution and composition bounds of OOP subframes, and make sure we keep the scrollbar-related codepaths working.
I also came across bug 1741333, which proposes to make sort of the opposite change (including the resolution in the screen-to-apzc transform of in-process subframes too, and excluding it from their composition bounds, in essence making the handling of in-process subframes more like the current handling of OOP subframes).
Assignee | ||
Comment 24•10 months ago
|
||
One important addendum to the previous comment:
As mentioned, for OOP subframes, we can either include or exclude the resolution from their screen-to-apzc transform, as long as we handle the composition bounds consistently.
However, the current implementation approach (having the resolution on the layer built for the Remote
display item, and emitting that on the topmost node for that layer) can result in the screen-to-apzc transforms of in-process subframes containing the OOP subframe also including the resolution -- and that is definitely wrong.
(The "two layers" assertion catches a subset of these cases, namely ones where the in-process subframe has multiple nodes in the expanded tree, and we notice that one of them (the one containing the OOP iframe as descendant) is subject to the resolution and the other is not.)
This can lead to observable misbehaviour. Here is one affected STR:
- Load the testcase in comment 22 on desktop
- Pinch-zoom in (the more you zoom in, the more noticeable the bug; I suggest at least 2x)
- Try to drag the scrollbar of the outer subframe (which is the in-process one). The drag misbehaves (there is an initial portion where the thumb does not move, and then when it starts moving it moves more slowly than the mouse cursor).
Notably, this is a case where the mere presence of an OOP iframe inside that outer subframe, messes up the ability to drag the scrollbar of the outer subframe.
Assignee | ||
Comment 25•10 months ago
|
||
(In reply to Botond Ballo [:botond] from comment #24)
However, the current implementation approach (having the resolution on the layer built for the
Remote
display item, and emitting that on the topmost node for that layer) can result in the screen-to-apzc transforms of in-process subframes containing the OOP subframe also including the resolution -- and that is definitely wrong.
Having looked over bug 1715187 in more detail, it seems that the reason for emitting the resolution on the topmost node is to make sure that it is applied after a CSS transform enclosing the OOP iframe.
This makes sense, since conceptually the resolution is at the level of the async zoom container. However, I don't think this solution is general enough, since we could have a page structure where a CSS transform ends up on an ancestor of the Remote
item's node in the unexpanded (WebRenderLayerScrollData) tree, in which case emitting the resolution on the topmost node during expansion will not achieve the desired effect of having the resolution be outside the transform.
Moreover, this implementation approach has the undesired side effect discussed in comment 24, that in cases where the Remote
item's node also carries metadata for an enclosing scroll frame, the resolution transform will end up part of the screen-to-apzc transform for that enclosing scroll frame as well.
To make this work, we need to meet two constraints simultaneously:
- The resolution needs to be applied at the level of the async zoom container, i.e. on top of any transforms that are in between the OOP iframe and the root.
- The resolution needs to apply only for scroll frames in the OOP iframe, not their in-process ancestors.
It's not possible to meet both of these constraints while also preserving the property that the screen-to-apzc and CTP transforms are computed recursively (i.e. your transform is (some local properties) * (your parent's transform)).
However, I don't think that latter property is actually important? I don't think anything stops us from conditionally including the resolution at the level of the async zoom container, based on whether or not the original input APZC is inside an OOP subframe or not.
I believe doing that would:
- Fix this issue (both in the sense of no longer triggering the assertion, and in fixing the observable misbehaviour such as the STR from comment 24)
- Fix other CSS transform edge cases that bug 1715187 left unfixed
- Avoid messing with composition bounds. (We can still explore messing with the composition bounds in order to make the handling of in-process and OOP subframes consistent, in a follow-up.)
Assignee | ||
Comment 26•9 months ago
|
||
I've got a rough draft of a fix up on Try: https://treeherder.mozilla.org/jobs?repo=try&revision=a41efc0dc36697f7f0d0863c1f9ec701f99a0815
Assignee | ||
Comment 27•9 months ago
•
|
||
Revised to fix an issue that was causing helper_hittest_bug1715187.html to fail: https://treeherder.mozilla.org/jobs?repo=try&revision=d4fa06e732998e4606bb6757de60e572cabea9da
Assignee | ||
Comment 28•9 months ago
|
||
Assignee | ||
Comment 29•9 months ago
|
||
Depends on D184906
Assignee | ||
Comment 30•9 months ago
|
||
This reverts older attempts to account for the resolution in input transforms
targeting OOP subdocuments (bug 1682200, bug 1715187), clearing the slate for
a more proper solution in the forthcoming patches.
Depends on D184907
Assignee | ||
Comment 31•9 months ago
|
||
Depends on D184908
Assignee | ||
Comment 32•9 months ago
|
||
Depends on D184909
Assignee | ||
Comment 33•9 months ago
|
||
Depends on D184910
Assignee | ||
Comment 34•9 months ago
|
||
Depends on D184911
Updated•9 months ago
|
Assignee | ||
Comment 35•9 months ago
|
||
I did my best to formulate an implementation that doesn't just sprinkle in matrices in the numerically correct places, but also tries to maintain a somewhat coherent set of interfaces. I'm not 100% satisfied with the result, but this may be as good as it gets given the inherent complexity of the situation. If at some point we get around to a more comprehensive refactoring like bug 1741333, that should clean up some of this code further.
Comment 36•9 months ago
|
||
Pushed by bballo@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/62c130806aad Log which scroll frame violates the 'two layers' assertion. r=dlrobertson https://hg.mozilla.org/integration/autoland/rev/8b299a2069f7 Add a scrollbar dragging mochitest. r=dlrobertson https://hg.mozilla.org/integration/autoland/rev/59e72e835c97 Do not add a resolution transform to the WebRenderLayerScrollData node for a Remote display item. r=dlrobertson,hiro https://hg.mozilla.org/integration/autoland/rev/b30d74f1f0c7 Introduce a GetPaintedResolutionTransform() helper. r=dlrobertson https://hg.mozilla.org/integration/autoland/rev/fda5ed4ebf22 Conditionally include the painted resolution transform in the screen-to-apzc and apzc-to-gecko transforms. r=dlrobertson,hiro https://hg.mozilla.org/integration/autoland/rev/ff7c77e60615 Conditionally include the painted resolution transform in HitTestingTreeNode::GetTransformToGecko(). r=dlrobertson,hiro https://hg.mozilla.org/integration/autoland/rev/07ddd8965e85 Re-enable tests that were marked as failing with Android Fission. r=dlrobertson https://hg.mozilla.org/integration/autoland/rev/ef2753ba331b apply code formatting via Lando
Comment 37•9 months ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/62c130806aad
https://hg.mozilla.org/mozilla-central/rev/8b299a2069f7
https://hg.mozilla.org/mozilla-central/rev/59e72e835c97
https://hg.mozilla.org/mozilla-central/rev/b30d74f1f0c7
https://hg.mozilla.org/mozilla-central/rev/fda5ed4ebf22
https://hg.mozilla.org/mozilla-central/rev/ff7c77e60615
https://hg.mozilla.org/mozilla-central/rev/07ddd8965e85
https://hg.mozilla.org/mozilla-central/rev/ef2753ba331b
Updated•8 months ago
|
Reporter | ||
Updated•8 months ago
|
Description
•