Closed Bug 1827330 Opened 1 year ago Closed 9 months ago

[Android Fission] Testing MOZ_ASSERT(false) (Two layers that scroll together have different ancestor transforms)

Categories

(Core :: Panning and Zooming, defect, P2)

All
Android
defect

Tracking

()

RESOLVED FIXED
118 Branch
Tracking Status
firefox117 --- wontfix
firefox118 --- fixed

People

(Reporter: owlish, Assigned: botond)

References

Details

(Whiteboard: [fission:android][fxdroid])

Attachments

(13 files)

119.72 KB, text/plain
Details
248.92 KB, text/plain
Details
14.37 KB, text/plain
Details
30.03 KB, text/plain
Details
26.06 KB, text/plain
Details
403 bytes, text/html
Details
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
Attached file logcat log

Failure log: https://treeherder.mozilla.org/logviewer?job_id=409186678&repo=try&lineNumber=5113

I've attached the snippet of the logcat log for this particular test run. Notably, I see that there were 2 assertion failures, both with the same message, may have caused a process to crash.

03-16 19:33:42.791 26894 26928 F MOZ_Assert: Assertion failure: false (Two layers that scroll together have different ancestor transforms), at /builds/worker/checkouts/gecko/gfx/layers/apz/src/APZCTreeManager.cpp:1359

This corresponds to this assertion (https://searchfox.org/mozilla-central/rev/11dbac7f64f509b78037465cbb4427ed71f8b565/gfx/layers/apz/src/APZCTreeManager.cpp#1356-1359).

Crash backtraces from failure log: https://treeherder.mozilla.org/logviewer?job_id=409186678&repo=try&lineNumber=11602

I think a crash like this could easily explain a test timeout like the one found here, so it seems somewhat likely that it is responsible for the failure here. Leaving a ni? for :botond to look into this specific failure.

Unfortunately, I was unable to reproduce this specific test failure when running just this single test locally with ./mach mochitest security/manager/ssl/tests/mochitest/mixedcontent/test_unsecureIframe2.html --setpref=fission.autostart=true or when running the entire directory with ./mach mochitest security/manager/ssl/tests/mochitest/mixedcontent/ --setpref=fission.autostart=true, so it may be difficult to test in isolation.

Flags: needinfo?(botond)
Duplicate of this bug: 1827725

As I've removed the name of the test which originally caused the failure from the title, I'll note it was security/manager/ssl/tests/mochitest/mixedcontent/test_unsecureIframe2.html here. It appears this failure is not bound to a specific test, as a similar assertion failure was also found in dom/base/test/test_bug769117.html in bug 1827725, making fixing this a blocker for us being able to enable Android Fission mochitests.

Type: task → defect
Component: Sandboxing → Panning and Zooming
Product: GeckoView → Core
Summary: Fix security/manager/ssl/tests/mochitest/mixedcontent/test_unsecureIframe2.html test on Android Fission → [Android Fission] Testing MOZ_ASSERT(false) (Two layers that scroll together have different ancestor transforms)

Thanks for flagging this! I brought this up at our APZ team meeting and we're going to prioritize investigating this as an Android Fission blocker.

Severity: -- → S3
Flags: needinfo?(botond)
Priority: -- → P2
Whiteboard: [apz-needsdiagnosis]

This test assertion failure blocks Android Fission.

Whiteboard: [apz-needsdiagnosis] → [apz-needsdiagnosis] [fission:android]

My attempts to even try to reproduce this locally have been thwarted by bug 1803027. If anyone familiar with Android tests could aid me in further diagnosis of that bug, that would be appreciated.

Depends on: 1803027

Timothy, how did you obtain the logs attached in comment 5?

Flags: needinfo?(tnikkel)

Hmm, it was a while ago but I believe I set the pref layout.display-list.dump-content in mochitest.ini for that test and then pushed to try and then searched all the logs I could find linked from the artifact view.

Flags: needinfo?(tnikkel)

(In reply to Botond Ballo [:botond] from comment #8)

My attempts to even try to reproduce this locally have been thwarted by bug 1803027. If anyone familiar with Android tests could aid me in further diagnosis of that bug, that would be appreciated.

Update: I was able to reproduce this using an emulator build, which is unaffected by bug 1803027.

No longer depends on: 1803027

Dan and I had a look at the logs from test_unsecureIframe2.html. Summarizing our findings:

  • The logs only show the relevant data structures from the toplevel content process. For a more complete picture, we should get equivalent logs from the nested content process.
  • The WebRenderScrollData tree shows two scroll frames (scrollId=3, which is the root scroll frame of the root content document, and scrollId=4, which is the root scroll frame of an in-process iframe).
  • scrollId=4 is present on two sibling nodes of the tree, only one of which has a transform. This in and of itself does not violate the constraint being asserted (the transform, being on the node itself, is not part of the scroll frame's ancestor transform), but it could play a role in a scroll frame in the nested content process violating the constraint. (For clarity, the assertion message should include the scrollId of the affected scroll frame.)
  • It's unclear where the transform is coming from, as there are no Transform display items in the Gecko display list, nor is there an obvious source of the transform based on the page structure.

Next steps:

  • Get more complete logs, including in particular logs from the nested content process.
  • Investigate where the transform is coming from.

The transform is coming from nsDisplayRemote::UpdateScrollData().

Botond, would it be safe to assign the bug to you, since you seem to be actively working on it?

Flags: needinfo?(botond)

Yep, I can take this.

Assignee: nobody → botond
Flags: needinfo?(botond)

Attached are some better logs from test_unsecureIframe2.html. They include a compositor-side WebRenderScrollData dump (which show the complete tree from all processes).

Here's what I can tell from these logs:

  • While the page does have a remote iframe, at the time the assertion fires the remote iframe's layer tree has not been sent to the compositor yet. So, even though I'm capturing logs from the iframe's content process, at the time of the assertion there are none to show yet.
  • The contents of the remote iframe are therefore not relevant to the assertion. In particular, the scroll frame that's violating the constraint is not inside the remote iframe.
  • The scroll frame violating the constraint is in fact scrollId=4, which seems to be the root scroll frame of an in-process iframe that contains the remote iframe.
    • My claim in comment 14 that scrollId=4 was not violating the constraint was clearly wrong, though I don't yet know what part of my reasoning was mistaken. I will need to debug this in more detail.

(In reply to Botond Ballo [:botond] from comment #15)

The transform is coming from nsDisplayRemote::UpdateScrollData().

I did look into the reason we have a transform here, and it's simply that nodes coming from the remote iframe's content process have coordinates relative to the iframe's origin, and these need to be translated to be relative to the enclosing content process's origin. It's as if we had a Transform display item there, but the translation is stored directly on the Remote item instead.

As such, I don't think there's anything conceptually problematic with this page structure (which is in fact quite simple), we likely just have a bug somewhere in the pipeline that gives rise to this assertion when it shouldn't.

(In reply to Botond Ballo [:botond] from comment #18)

My claim in comment 14 that scrollId=4 was not violating the constraint was clearly wrong, though I don't yet know what part of my reasoning was mistaken.

Ok, I figured out where I went wrong.

The assertion fires when the ancestor transforms of two nodes in the expanded tree bearing metadata for the same scrollframe do not match. The ancestor transform of a given node in the expanded tree is the product of the transforms of its (proper) ancestor nodes in the expanded tree.

I saw in the logs that the only transform in the unexpanded tree was a transform=[ 1 0; 0 1; 20 83; ], and reasoned that during expansion, this will go on the scrollId=4 node itself (because transforms go on the bottom layer), and therefore will not be part of the scrollId=4 node's ancestor transform.

I was correct that the transform=[ 1 0; 0 1; 20 83; ] goes on the scrollId=4 node and is not part of that node's ancestor transform, but I overlooked that there is also a resolution=0.816327 on the same unexpanded node, and during expansion the resolution induces a scale transform on the expanded node as well.

Unlike the transform, the resolution goes on the top layer, and therefore does become part of the bottom node's ancestor transform.

The resolution also comes from nsDisplayRemote::UpdateScrollData.

The next thing to answer is: what is that resolution doing there? It looks like it was added (or rather, split out from the transform), in bug 1715187. The fact that the resolution is present on one of the two scrollId=4 nodes but not both, does seem wrong.

(For diagnosis purposes, at this stage we can be confident that this is an APZ bug, so I'm removing [apz-needsdiagnosis].)

Whiteboard: [apz-needsdiagnosis] [fission:android] → [fission:android]

(In reply to Botond Ballo [:botond] from comment #19)

The next thing to answer is: what is that resolution doing there? It looks like it was added (or rather, split out from the transform), in bug 1715187. The fact that the resolution is present on one of the two scrollId=4 nodes but not both, does seem wrong.

I'm only half remembering this stuff from last time I tangled with it, but I'll add a little here in hopes it helps you.

I think bug 1715187, comment 11 is relevant here. We don't put the resolution into these transforms except when we encounter an iframe. Which seems and feels weird. But, IIRC, we have to put the resolution on remote iframe items otherwise we can't target their process correctly. Perhaps the resolution can be incorporated into the transform that we set on the browsingcontext so that the process gets targeted correctly, but do not incoporate the resolution into the webrender scrolldata? Is that possible or would break things that webrenderscrolldata tree needs to have?

The only conditions necessary for triggering this scenario are:

  • The root content document has a resolution (pinch-zoom scale) that's not 1. On Android this can happen on page load due to mobile viewport sizing, but this can also happen on desktop platforms if you pinch-zoom the page in.
  • The page contains a remote iframe.
  • There is an intermediate scroll frame in between the remote iframe's root scroll frame, and the root content doc's root scroll frame.

That means this issue is not specific to Android. Indeed, I can hit the assertion in a debug build on desktop, by loading the attached page and pinch-zooming in.

(In reply to Timothy Nikkel (:tnikkel) from comment #21)

I think bug 1715187, comment 11 is relevant here. We don't put the resolution into these transforms except when we encounter an iframe. Which seems and feels weird. But, IIRC, we have to put the resolution on remote iframe items otherwise we can't target their process correctly. Perhaps the resolution can be incorporated into the transform that we set on the browsingcontext so that the process gets targeted correctly, but do not incoporate the resolution into the webrender scrolldata? Is that possible or would break things that webrenderscrolldata tree needs to have?

These are very good questions.

I haven't arrived at a complete answer, but I wanted to record the current state of my understanding before going on PTO.

  • Besides the "transform that we set on the browsingcontext so that the process gets targeted correctly" (this is the mChildToParentConversionMatrix stored on BrowserParent and BrowserChild; henceforth, "CTP transform"), the main additional place where the transforms on WebRenderScrollData nodes are propagated are the screen-to-apzc and apzc-to-gecko transforms. These are used to convert the coordinates of input events from their original Screen coordinates to the local "ParentLayer" coordinates that the target APZC processes the event in, and then onwards to coordinates suitable for Gecko processing (which are Screen coordinates adjusted to remove some transforms Gecko doesn't know about).
  • So, we could arrange for the resolution transform discussed in previous comments to be present in the CTP transform but not in the screen-to-apzc and apzc-to-gecko transforms, but this would mean the ParentLayer coordinates received by APZCs of descendant nodes would be numerically different.
  • For many purposes, such a change to the magnitude of ParentLayer coordinates is a no-op; for example, coordinates would still round-trip from Screen to Gecko coordinates correctly (we'd be dropping one component of the screen-to-apzc transform, and a corresponding inverse from the apzc-to-gecko transform, such that if you apply both in the succession to event coordinates, you get the same result as before). However, there are some situations in which the numerical value of ParentLayer coordinates is important. Based on inspection, these include:
    • Some scrollbar stuff, including async scrollbar dragging (ConvertScrollbarPoint calculations) and async transforms of scrollbars (ScrollThumbUtils calculations, though maybe not for the apz.scrollthumb.recalc codepath)
    • Code that compares event coordinates in ParentLayer pixels to other ParentLayer quantities which are computed independently (not from event coordinates). Most notably, mCompositionBounds is such a quantity, and such comparisons are commonplace; basically, changes to the screen-to-apzc transform need to be made in sync with corresponding changes to code that calculates the composition bounds.
  • Based on experimentation, when the page is zoomed in, the composition bounds of in-process subframes include the resolution but the composition bounds of out-of-process subframes do not. This seems to be a deliberate choice (it's discussed in this comment above mCumulativeResolution, which is an input to the composition bounds calculation for subframes), and it's consistent with the fact that the screen-to-apzc transform for in-process subframes does not include the resolution while the screen-to-apzc transform for out-of-process does (due to the resolution coming from nsDisplayRemote).
  • So, we can probably make the suggested change if we also change the cumulative resolution and composition bounds of OOP subframes, and make sure we keep the scrollbar-related codepaths working.

I also came across bug 1741333, which proposes to make sort of the opposite change (including the resolution in the screen-to-apzc transform of in-process subframes too, and excluding it from their composition bounds, in essence making the handling of in-process subframes more like the current handling of OOP subframes).

One important addendum to the previous comment:

As mentioned, for OOP subframes, we can either include or exclude the resolution from their screen-to-apzc transform, as long as we handle the composition bounds consistently.

However, the current implementation approach (having the resolution on the layer built for the Remote display item, and emitting that on the topmost node for that layer) can result in the screen-to-apzc transforms of in-process subframes containing the OOP subframe also including the resolution -- and that is definitely wrong.

(The "two layers" assertion catches a subset of these cases, namely ones where the in-process subframe has multiple nodes in the expanded tree, and we notice that one of them (the one containing the OOP iframe as descendant) is subject to the resolution and the other is not.)

This can lead to observable misbehaviour. Here is one affected STR:

  1. Load the testcase in comment 22 on desktop
  2. Pinch-zoom in (the more you zoom in, the more noticeable the bug; I suggest at least 2x)
  3. Try to drag the scrollbar of the outer subframe (which is the in-process one). The drag misbehaves (there is an initial portion where the thumb does not move, and then when it starts moving it moves more slowly than the mouse cursor).

Notably, this is a case where the mere presence of an OOP iframe inside that outer subframe, messes up the ability to drag the scrollbar of the outer subframe.

(In reply to Botond Ballo [:botond] from comment #24)

However, the current implementation approach (having the resolution on the layer built for the Remote display item, and emitting that on the topmost node for that layer) can result in the screen-to-apzc transforms of in-process subframes containing the OOP subframe also including the resolution -- and that is definitely wrong.

Having looked over bug 1715187 in more detail, it seems that the reason for emitting the resolution on the topmost node is to make sure that it is applied after a CSS transform enclosing the OOP iframe.

This makes sense, since conceptually the resolution is at the level of the async zoom container. However, I don't think this solution is general enough, since we could have a page structure where a CSS transform ends up on an ancestor of the Remote item's node in the unexpanded (WebRenderLayerScrollData) tree, in which case emitting the resolution on the topmost node during expansion will not achieve the desired effect of having the resolution be outside the transform.

Moreover, this implementation approach has the undesired side effect discussed in comment 24, that in cases where the Remote item's node also carries metadata for an enclosing scroll frame, the resolution transform will end up part of the screen-to-apzc transform for that enclosing scroll frame as well.

To make this work, we need to meet two constraints simultaneously:

  1. The resolution needs to be applied at the level of the async zoom container, i.e. on top of any transforms that are in between the OOP iframe and the root.
  2. The resolution needs to apply only for scroll frames in the OOP iframe, not their in-process ancestors.

It's not possible to meet both of these constraints while also preserving the property that the screen-to-apzc and CTP transforms are computed recursively (i.e. your transform is (some local properties) * (your parent's transform)).

However, I don't think that latter property is actually important? I don't think anything stops us from conditionally including the resolution at the level of the async zoom container, based on whether or not the original input APZC is inside an OOP subframe or not.

I believe doing that would:

  1. Fix this issue (both in the sense of no longer triggering the assertion, and in fixing the observable misbehaviour such as the STR from comment 24)
  2. Fix other CSS transform edge cases that bug 1715187 left unfixed
  3. Avoid messing with composition bounds. (We can still explore messing with the composition bounds in order to make the handling of in-process and OOP subframes consistent, in a follow-up.)

Revised to fix an issue that was causing helper_hittest_bug1715187.html to fail: https://treeherder.mozilla.org/jobs?repo=try&revision=d4fa06e732998e4606bb6757de60e572cabea9da

This reverts older attempts to account for the resolution in input transforms
targeting OOP subdocuments (bug 1682200, bug 1715187), clearing the slate for
a more proper solution in the forthcoming patches.

Depends on D184907

Attachment #9346429 - Attachment description: Bug 1827330 - Conditionally include the painted resolution transform in the screen-to-apzc and gecko-to-apzc transforms. r=tnikkel,dlrobertson → Bug 1827330 - Conditionally include the painted resolution transform in the screen-to-apzc and apzc-to-gecko transforms. r=tnikkel,dlrobertson

I did my best to formulate an implementation that doesn't just sprinkle in matrices in the numerically correct places, but also tries to maintain a somewhat coherent set of interfaces. I'm not 100% satisfied with the result, but this may be as good as it gets given the inherent complexity of the situation. If at some point we get around to a more comprehensive refactoring like bug 1741333, that should clean up some of this code further.

Blocks: 1846902
Blocks: 1649440
See Also: → 1827327
Blocks: 1827327
Pushed by bballo@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/62c130806aad
Log which scroll frame violates the 'two layers' assertion. r=dlrobertson
https://hg.mozilla.org/integration/autoland/rev/8b299a2069f7
Add a scrollbar dragging mochitest. r=dlrobertson
https://hg.mozilla.org/integration/autoland/rev/59e72e835c97
Do not add a resolution transform to the WebRenderLayerScrollData node for a Remote display item. r=dlrobertson,hiro
https://hg.mozilla.org/integration/autoland/rev/b30d74f1f0c7
Introduce a GetPaintedResolutionTransform() helper. r=dlrobertson
https://hg.mozilla.org/integration/autoland/rev/fda5ed4ebf22
Conditionally include the painted resolution transform in the screen-to-apzc and apzc-to-gecko transforms. r=dlrobertson,hiro
https://hg.mozilla.org/integration/autoland/rev/ff7c77e60615
Conditionally include the painted resolution transform in HitTestingTreeNode::GetTransformToGecko(). r=dlrobertson,hiro
https://hg.mozilla.org/integration/autoland/rev/07ddd8965e85
Re-enable tests that were marked as failing with Android Fission. r=dlrobertson
https://hg.mozilla.org/integration/autoland/rev/ef2753ba331b
apply code formatting via Lando
Blocks: 1832914
Whiteboard: [fission:android] → [fission:android][fxdroid]
Blocks: 1855511
No longer blocks: 1855511
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: