Closed Bug 1208780 Opened 9 years ago Closed 9 years ago

More Linux e10s test failures with APZ enabled (handoff chain for nested scrollable frames doesn't get layerized properly)

Categories

(Core :: Panning and Zooming, defect)

Unspecified
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla45
Tracking Status
e10s + ---
firefox45 --- fixed

People

(Reporter: kats, Assigned: tnikkel)

References

Details

Attachments

(5 files, 3 obsolete files)

Bug 1205087 landed on inbound yesterday and caused a bunch of tests to fail if APZ is enabled. In [1] at the very least the M-4 failures (and most probably the reftest failures as well) are caused by the patches in bug 1205087; I verified the mochitest failure with a local bisection. I did another try push on tip of inbound [2] with those patches backed out and APZ enabled to see if that resolves all the failures.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=344c1090f7de
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=2ff080c48cb3
We sometimes change the AGR via RecomputeCurrentAnimatedGeometryRoot, so if a display item was init'd with an AGR, and then the AGR changed via RecomputeCurrentAnimatedGeometryRoot that might cause problems.
As far as I can tell the current call sites to RecomputeCurrentAnimatedGeometryRoot aren't the problem. I haven't fully confirmed this but I think the problem is that at [1] we set mShouldBuildScrollableLayer to true (and hence make the frame an animated geometry root) after we build the items for the children. This means all the children end up with a stale AGR and it never gets updated. Without matt's patches the AGR is computed more "dynamically" and so this isn't a problem. Arguably turning a scrollframe into an AGR after building its descendant display items isn't really that great but that's what we do here.

[1] http://mxr.mozilla.org/mozilla-central/source/layout/generic/nsGfxScrollFrame.cpp?rev=8270ee9be504#3120
Attached patch Hacky fixSplinter Review
This seems to do the job for the mochitest for me locally. Apparently just recomputing the mAnimatedGeometryRoot recursively on all the display items in |scrolledContent| is not sufficient, because (a) the display items that get built are actually different and (b) properties on the display items such as the layerBounds and visibleRegion are also different.
Btw here is a page that you can load in the browser (with APZ enabled) that replicates the failure seen in the test. It is a regression that affects real-world cases so unless we can get a good fix today I'm inclined to back out matt's patches since he's on PTO until next week.
Summary: More Linux e10s test failures with APZ enabled → More Linux e10s test failures with APZ enabled (handoff chain for nested scrollable frames doesn't get layerized properly)
Try push with the AGR cache backed out based on latest m-c:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b0b4f3859cac
Try push is quite green (except for W-e10s which is hidden on m-c). I think I'll back out Matt's patches once the tree reopens, turn on apz on Linux (once bug 1208622 is landed), and work to get Matt's patches relanded with appropriate adjustments so this bug doesn't happen.

My justification for the backout is that even as-is the patches cause a legitimate correctness issue (comment 4) on OSX and Windows; it's just not caught by the tests because we don't have APZ tests running on those platforms. If I could fix it in-place that would be ideal but I think it will take a couple of days to get to that point and I don't want to leave the regression in that long. I also don't want to wait too long before doing the backout because that increases the risk of other stuff creeping in between.

I'll reopen bug 1205087 with the backout, and use this bug to land any patches that address the stale AGR problem. My current plan to tackle it is to somehow walk up the scroll handoff chain around [1] and set a flag on the scrollframes there that force them to layerize. That way we will know which scrollframes are getting layerized before we do any displaylist building, and won't have to change the decision after building descendant display items.

[1] http://mxr.mozilla.org/mozilla-central/source/gfx/layers/apz/util/APZCCallbackHelper.cpp?rev=25e572cce005#653
I backed out the AGR cache bits of bug 1205087 in this cset:

https://hg.mozilla.org/integration/mozilla-inbound/rev/15ed42bf5d4c

This bug no longer blocks turning on APZ on Linux, but we still need to fix it to reland that work from bug 1205087.
No longer blocks: apz-linux
Based on the reftest failures I suspect we do need to do something about the "first encountered scrollframe" case as well :(
After looking at the code again I think that the "first encountered scrollframe" code shouldn't activate an ancestor scroll frame (because it would have had a display port created for it by the same code). I made a try push with that code disable (otherwise the same as your push) to see if that is correct.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=1bc16d6ce608
The green reftests (opt builds) on your push seem to indicate that the reftest failures are related to the "first encountered scrollframe" scenario, if I'm interpreting this correctly. I tried reproducing the failures locally and I can intermittently reproduce them, but haven't been able to capture it with rr. I'll try debugging it the old-fashioned way.
This is for deferred-anchor2.xhtml; in the case where the test fails it seems to be because the AGR for the SubDocument display item is the root-level Canvas frame rather than the subdocument's Viewport frame. Same goes for the LayerEventRegions item immediately inside the SubDocument displayitem. This causes that eventregions item to end up on a different layer than when the test passes.
So the SubDocument display item's bad AGR can be fixed by calling mAnimatedGeometryRootCache.Clear() at the top of RecomputeCurrentAnimatedGeometryRoot(). That way when the displayport is added to the root scroll frame, the AGR cache is cleared and later when the SubDocument display item is created it ends up with the right AGR. However the event regions for the subdocument is always created before calling BuildDisplayList on the children, so it ends up with a bad AGR and updating that one is a bit more tricky.
To get the right AGR on the event regions item I had to not only recompute them after building the children's display items but also get rid of the mCurrentAnimatedGeometryRoot field on the builder, because that is saved/restored via AutoBuildingDisplayList and gets restored to a stale value.

Try push at https://treeherder.mozilla.org/#/jobs?repo=try&revision=d887e0766d33

Even if it's green though I think this approach introduces huge footguns and I don't want to land it.
I discovered a problem with the patch in the above try push (mPrevAnimatedGeometryRoot is left uninitialized). Here's an updated try push which removes less code: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9ff4510bb807

I think this one will still trigger assertion failures on the debug build but hopefully the opt builds will give us some useful results.
I also started working on a patchset that splits out this code into a separate pass like we sort of discussed on IRC. I got it to the point where test_layerization.html was passing but there's probably other bugs. The patches are the top two at https://github.com/staktrace/gecko-dev/commits/0ca1cd06ec9a99e1cf214cdee874f2f41dbb33e4.

I would like to move all the GetOrMaybeCreateDisplayPort calls into the new pass but it's trickier than I expected because doing it right requires knowing things like the dirty rect (which the call in nsSubDocumentFrame needs) or things like "aBuilder->GetIgnoreScrollFrame() == mOuter || IsIgnoringViewportClipping()". To do that I might need to pull those variables into the new pass as well, and anything it depends on, and so on.
I discussed this with tn on Friday and we figured the best approach was probably the one I tried first (comment 14). I'll pursue that a bit more and see if I can figure out the remaining failures.
^ That push was pretty green except for a few reftest failures which might be addressed by tn's patches on bug 1210578. Try push to see if that works:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=03210bccd8f2
(Anybody should feel free to steal my patches from the try push above and do whatever needs to be done to get bug 1205087 relanded)
Thanks for looking into this Kats :)

Looks like that fixed the reftests!
Assignee: nobody → bugmail.mozilla
This patch replaces kats' patch which added a "force layerization" field to scroll frames. Instead we just set a 0 margin display port on ancestor scroll frames. Instead of adding more confusing state we just use the existing state, and everything should work how how we want.

There were two callsites that I don't _think_ need to call SetZeroMarginDisplayPortOnAsyncScrollableAncestors because (correct me if I'm wrong) they never create a new displayport, only update an existing one: APZCCallbackHelper::UpdateRootFrame and APZCCallbackHelper::UpdateSubFrame.

It would be nice if there was one API that combined SetDisplayPortMargins/CalculateAndSetDisplayPortMargins and SetZeroMarginDisplayPortOnAsyncScrollableAncestors but a solution didn't immediately come to mind.
Attachment #8670601 - Attachment is obsolete: true
Attachment #8670601 - Flags: review?(tnikkel)
Attachment #8672965 - Flags: review?(botond)
Comment on attachment 8672965 [details] [diff] [review]
Set a zero-margin display port on all async scrollable ancestors when we set a display port.

Review of attachment 8672965 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Timothy Nikkel (:tn) from comment #24)
> There were two callsites that I don't _think_ need to call
> SetZeroMarginDisplayPortOnAsyncScrollableAncestors because (correct me if
> I'm wrong) they never create a new displayport, only update an existing one:
> APZCCallbackHelper::UpdateRootFrame and APZCCallbackHelper::UpdateSubFrame.

UpdateSubFrame can create a new displayport if the layer was a scroll info layer; do we still use those in some edge cases?

::: layout/base/nsLayoutUtils.cpp
@@ +3107,5 @@
> +    if (nsLayoutUtils::AsyncPanZoomEnabled(frame) &&
> +        !nsLayoutUtils::GetDisplayPort(frame->GetContent())) {
> +      nsLayoutUtils::SetDisplayPortMargins(
> +        frame->GetContent(), frame->PresContext()->PresShell(), ScreenMargin(), 0,
> +        aRepaintMode);

I assume the multiple calls to SchedulePaint() that could happen if this function is called with aRepaintMode == Repaint are not a problem?
(In reply to Botond Ballo [:botond] from comment #25)
> UpdateSubFrame can create a new displayport if the layer was a scroll info
> layer; do we still use those in some edge cases?

I think so. I'll update the patch for UpdateSubFrame, and assert in UpdateRootFrame.

> I assume the multiple calls to SchedulePaint() that could happen if this
> function is called with aRepaintMode == Repaint are not a problem?

Not a problem in that it will only schedule one more paint no matter how many times it's called. The GetDisplayRootFrame call in SchedulePaint could theoretically lead to quadratic behavior (probably not a problem in practice), but I can change the patch to just do one SchedulePaint call.
Comment on attachment 8672965 [details] [diff] [review]
Set a zero-margin display port on all async scrollable ancestors when we set a display port.

Review of attachment 8672965 [details] [diff] [review]:
-----------------------------------------------------------------

Sounds good, r+ with the change to UpdateSubFrame. I'll leave the SchedulePaint quadratic performance issue up to your judgment.
Attachment #8672965 - Flags: review?(botond) → review+
(In reply to Timothy Nikkel (:tn) from comment #26)
> > I assume the multiple calls to SchedulePaint() that could happen if this
> > function is called with aRepaintMode == Repaint are not a problem?
> 
> Not a problem in that it will only schedule one more paint no matter how
> many times it's called. The GetDisplayRootFrame call in SchedulePaint could
> theoretically lead to quadratic behavior (probably not a problem in
> practice), but I can change the patch to just do one SchedulePaint call.

Actually GetDisplayRootFrame is fast: it just walks the root frames of documents (and not every frame up the frame tree), unless there is a frame state bit indicating we are inside of a xul popup (which generally contain content we control, not random content from in the wild). Since the number of nested scrollable scroll frames would have to be pretty large before we even notice this, and then the number of layers we create would be huge, and we have to deal with those every paint, not just the first time we set a displayport. I think it's best to not worry about this.
Attached patch interdiffSplinter Review
Just the changes I made to my patch.
Attachment #8674629 - Flags: review?(botond)
Attachment #8674629 - Flags: review?(botond) → review+
Comment on attachment 8670602 [details] [diff] [review]
Ensure that when we turn something into an AGR we update things properly

>@@ -2480,18 +2487,18 @@ nsIFrame::BuildDisplayListForChild(nsDis
>-        MOZ_ASSERT(buildingForChild.GetPrevAnimatedGeometryRoot() ==
>-                   aBuilder->FindAnimatedGeometryRootFor(child->GetParent()));
>+        //MOZ_ASSERT(buildingForChild.GetPrevAnimatedGeometryRoot() ==
>+        //           aBuilder->FindAnimatedGeometryRootFor(child->GetParent()));

Note sure why kats commented this assert out. I pushed updated patches to try with this assert active (ie not commented out) to see if the assert is firing on try to try to figure it out.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=88b66d81ce70
Had to change the assert that the return from GetNearestScrollable has WantAsyncScroll because we specifically allow it to return the root scroll frame whether or not it has WantAsyncScroll.

New try push

https://treeherder.mozilla.org/#/jobs?repo=try&revision=4209636a01d7
GetNearestScrollable with always match root works in a wierd way, so I wrote patches to fix that and pushed again
https://treeherder.mozilla.org/#/jobs?repo=try&revision=001ff1ed298c
(In reply to (away until Oct23) Kartikaya Gupta (email:kats@mozilla.com) from comment #13)
> So the SubDocument display item's bad AGR can be fixed by calling
> mAnimatedGeometryRootCache.Clear() at the top of
> RecomputeCurrentAnimatedGeometryRoot(). That way when the displayport is
> added to the root scroll frame, the AGR cache is cleared and later when the
> SubDocument display item is created it ends up with the right AGR. However
> the event regions for the subdocument is always created before calling
> BuildDisplayList on the children, so it ends up with a bad AGR and updating
> that one is a bit more tricky.

With my patches for bug 1210578 we should never turn something into an AGR after we've already created display items for it. That includes layer event regions and subdocument items. There are two types of frames that can be turned into an AGR by the creation of a displayport: the scrolled frame of the scroll frame with a displayport, and the root frame of a document who's root scroll frame has a displayport. In both cases the patches of bug 1210578 make it so that the displayport is already created (if one is going to get created at all) before we build any display items for the frame. For the scrolled frame case we just have to make sure the displayport is created in ScrollFrameHelper::BuildDisplayList before we build the display list for the scrolled frame. For the root frame case we need to make sure the displayport is created in nsLayoutUtils::PaintFrame and nsSubdocumentFrame::BuildDisplayList before we build any items. Bug 1210578 does this by calling DecideScrollableLayer in those places.

The mAnimatedGeometryRootCache might still be needed to be cleared, I'm not sure I can prove we don't at least ask for the AGR of some frame deeper in the frame tree (but it's easy to prove that we don't build display items for them).
Depends on: 1215974
Depends on: 1215977
Comment on attachment 8670601 [details] [diff] [review]
Force layerization on handoff chain of nested scrollables.

>+  while (true) {
>+    nsIFrame* frame = do_QueryFrame(scrollAncestor);
>+    nsIFrame* parent = nsLayoutUtils::GetCrossDocParentFrame(frame);
>+    parent = nsLayoutUtils::GetCrossDocParentFrame(parent);

I wondered why this patch got the parent twice in a row, seems it was not a mistake. It was to get around the problem that bite me with GetNearestScrollableFrame. I filed bug 1215977 to fix this.
(In reply to Timothy Nikkel (:tn) from comment #32)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=001ff1ed298c

I only included the mAnimatedGeometryRootCache.Clear() change in nsDisplayListBuilder::RecomputeCurrentAnimatedGeometryRoot() for this push because I think that is all that is needed.

The try run seems green, so I'm going to take that as validation that I haven't missed anything.
This is kats' patch, with my review comments address (ie removed the hunk that I think at unecessary).
Attachment #8675926 - Flags: review+
Attachment #8670602 - Attachment is obsolete: true
Attachment #8670602 - Flags: review?(tnikkel)
Assignee: bugmail.mozilla → tnikkel
Depends on: 1220020
Comment on attachment 8675926 [details] [diff] [review]
Ensure that when we turn something into an AGR we update things properly v2

Moved this patch to bug 1220020.
Attachment #8675926 - Attachment is obsolete: true
https://hg.mozilla.org/mozilla-central/rev/8d0a776c5d1e
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla45
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: