Closed Bug 1361970 Opened 7 years ago Closed 7 years ago

5.84 - 6.68% tp5o_scroll (windows8-64) regression on push 9fa3d6e54b5b59b88e6f63496d05601e454fa7a0 (Mon May 1 2017)

Categories

(Core :: Graphics: Layers, defect, P3)

53 Branch
defect

Tracking

()

RESOLVED FIXED
mozilla56
Tracking Status
firefox-esr52 --- unaffected
firefox54 --- unaffected
firefox55 --- wontfix
firefox56 --- fixed

People

(Reporter: jmaher, Assigned: mattwoodrow)

References

(Regression)

Details

(Keywords: perf, regression, talos-regression, Whiteboard: [gfx-noted])

Attachments

(2 files, 1 obsolete file)

Talos has detected a Firefox performance regression from push 9fa3d6e54b5b59b88e6f63496d05601e454fa7a0. As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

  7%  tp5o_scroll summary windows8-64 pgo e10s     1.86 -> 1.98
  6%  tp5o_scroll summary windows8-64 opt e10s     1.88 -> 1.99


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=6347

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
:mattwoodrow can you take a look at this regression and determine if we can fix this or need to back it out or accept it?
Component: Untriaged → Graphics: Layers
Flags: needinfo?(matt.woodrow)
Product: Firefox → Core
Priority: -- → P3
Whiteboard: [gfx-noted]
I'll investigate this. It looks like it's the APZ tests only for some reason.
I can't seem to get tp5o_scroll running locally. Unzipping the tp5n.zip file fails because the paths are too long (both when done automatically using mach talos-test and when unzipping by hand).

Has anyone seen this before?

Markus, does profiling on try still work?
Flags: needinfo?(mstange)
(In reply to Matt Woodrow (:mattwoodrow) from comment #3)
> Markus, does profiling on try still work?

It mostly doesn't; non-e10s tests have a higher probability of giving you useful profiles than e10s tests. (This is bug 1329132.)
And when loaded into perf.html, the call tree has many duplicated frames that are not combined into the same node, https://github.com/devtools-html/perf.html/issues/256 .
Flags: needinfo?(mstange)
the paths might be too long if you have a long base/root directory name to start with.  I unzip to c:\users\jmaher\mozilla-inbound\objdir, but it could be easy to have a much longer name.
I had the path as: c:\users\mattwoodrow\src\mozilla-central\objdir-debug without success. Switching to c:\users\mattwoodrow\m-c\o fixed things.

I guess we just don't have many characters to spare with the tp5 path names.
This is because APZ sets non-integer translations on scrolled layers.

Previously we just had the scrolling layer, but now we have the scrolling layer + the non scrolled ColorLayer behind it.

When scrolling via the main-thread, the scroll translation is always an integer, and PostProcessLayers (on the compositor side) is able to determine the scrolling layer occludes the ColorLayer, and we don't render it.

When scrolling via APZ, the scroll translations is (sometimes) non-integer, leading to the occlusion test failing, and us drawing an extra Layer across the whole screen.

The extra annoying part is that most Layer implementations snap their transform back to integers anyway (as part of their ComputeEffectiveTransforms impl), but this snapping is only reflected in mEffectiveTransform, not the other transform values.

The effective transform is the transform to the nearest intermediate surface (or the root), but PostProcessLayers only uses the local layer transforms, since it recurses down through all Layers and want to update visible/opaque regions on each.

We can't easily replicate the snapping on the local transform, since multiple accumulated non-integer translations might snap differently.

Possible solutions:

* Rewrite PostProcessLayers to work with effective transforms, not entirely sure how this would work.

* Allow non-integer translations to still occlude area within PostProcessLayers. I think we basically convert the int-region into a float-region, translate by the non-integer amounts, and then round to inside pixels.

* Make APZ snap its scroll offsets before setting them on Layers.

Kats, botond, is the 3rd option feasible? Neither of the first two are particularly appealing, so hopefully that one is easier.
Flags: needinfo?(matt.woodrow)
Flags: needinfo?(bugmail)
Flags: needinfo?(botond)
(In reply to Matt Woodrow (:mattwoodrow) from comment #7)
> * Make APZ snap its scroll offsets before setting them on Layers.
> 
> Kats, botond, is the 3rd option feasible? 

I think it should be doable. We just need to make sure we do the same snapping everywhere, because we use the async transform for fixed-pos layers, scrollbars, checkerboarding calculations, hit-testing, etc. A rule of thumb would be to do snapping in all the functions in AsyncPanZoomController that read the mTestAsyncScrollOffset field, and to do the snapping basically right after that field is combined with the scroll offset from mFrameMetrics.
Flags: needinfo?(bugmail)
Although I guess the snapping needs to be done in LayerPixel or ParentLayerPixel space, so there might be some conversions needed back and forth from CSS pixel space. Still, a helper function to do that would be relatively clean I think.
I have a vague recollection of us discussing APZ snapping scroll offsets to integers in the early days of APZ, and concluding that it was undesirable, but I can't remember why, nor can I find any record of that discussion.

Out of curiosity, though: 

(In reply to Matt Woodrow (:mattwoodrow) from comment #7)
> * Allow non-integer translations to still occlude area within
> PostProcessLayers. I think we basically convert the int-region into a
> float-region, translate by the non-integer amounts, and then round to inside
> pixels.

why is this option unappealing? Is it just due to the performance impact of the additional region operations?
Flags: needinfo?(botond)
(In reply to Botond Ballo [:botond] from comment #10)
> I have a vague recollection of us discussing APZ snapping scroll offsets to
> integers in the early days of APZ, and concluding that it was undesirable,
> but I can't remember why, nor can I find any record of that discussion.
> 
> Out of curiosity, though: 
> 
> (In reply to Matt Woodrow (:mattwoodrow) from comment #7)
> > * Allow non-integer translations to still occlude area within
> > PostProcessLayers. I think we basically convert the int-region into a
> > float-region, translate by the non-integer amounts, and then round to inside
> > pixels.
> 
> why is this option unappealing? Is it just due to the performance impact of
> the additional region operations?

I had a think about this more:

It seems that the normal transform on Layers (both mTransform and the shadow transform) aren't actually part of the coordinate space conversions, they are just inputs to the effective transform. The effective transform can differ from the local transform in ways that aren't predictable (thanks to snapping, and the accumulation of non-integer amounts across multiple transforms).

Given that, it seems that there isn't actually a defined conversion between a Layer and its parent, unless that parent happens to have an intermediate surface (or is the root).

We currently have visible regions on all Layer types, but it seems to me that it's only a meaningful representation for the root, containers with an intermediate surface, or leaf layers. ContainerLayers without an intermediate surface can't have a defined visible region, since we have no idea what the transform to their coordinate space actually is.

The right solution might be to make the 'official' traversal order of Layers be to jump directly down to Layers that have a defined coordinate space (we do this for preserve-3d already) rather than traversing through the intermediate ContainerLayers. We'd also want to do something to invalidate the visible region on those skipped ContainerLayers so that nobody tries to use it.

That's probably a lot of work, and not something I fancy doing, especially with all the WR changes happening right now.

Rounding the non-integer translations inwards to generate an approximate opaque regions feels like its propagating this weird state more, and I think that's the main reason I don't like it. The perf impact is a bit of pain too.
Comment on attachment 8869593 [details]
Bug 1361970 - Make PostProcessLayers occlusion culling work against the surface we will draw to rather than the parent layer.

https://reviewboard.mozilla.org/r/141172/#review148922

This looks plausible. I only have one very superficial comment.

::: layout/base/UnitTransforms.h:300
(Diff revision 1)
>  }
>  
> +
> +template <typename TargetUnits, typename SourceUnits>
> +static gfx::IntRectTyped<TargetUnits>
> +MoveBy(const gfx::IntRectTyped<SourceUnits>& aRect,

Maybe call this TransformByTranslation? Since that's what you have here: Usually, the two units require a transform to be applied to convert from one to the other, but you know that in this case, the transform is just an integer translation.
Attachment #8869593 - Flags: review?(mstange) → review+
Pushed by mwoodrow@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/4524bfde8583
Make PostProcessLayers occlusion culling work against the surface we will draw to rather than the parent layer. r=mstange
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/48e28c61a9dc for failures in transform-3d/backface-visibility-3.html and transform-3d/opacity-preserve3d-1.html, https://treeherder.mozilla.org/logviewer.html#?job_id=105048217&repo=mozilla-inbound, and maybe, not sure yet, Android-only async-scrolling/group-opacity-surface-size-1.html, https://treeherder.mozilla.org/logviewer.html#?job_id=105031192&repo=mozilla-inbound
Pushed by mwoodrow@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/14ac21b548cf
Make PostProcessLayers occlusion culling work against the surface we will draw to rather than the parent layer. r=mstange
Attached patch Return empty rect (obsolete) — Splinter Review
Returning None rather than an empty rect is a bit confusing, and actually wrong if the rect in question is a clip.
Assignee: nobody → matt.woodrow
Flags: needinfo?(matt.woodrow)
Attachment #8876558 - Flags: review?(bugmail)
Containerful scrolling does weird things when you have fixed position content within the scrolled container. The effective visible region of the container can increase when scrolled (something moves away from the fixed content), but it doesn't update the visible regions for this.

Using CalculateScissorRect included the current visible region of the ancestor with an intermediate surface in the clip, so we'd clip to the old size.

Getting rid of this and sticking to computing the clip rects ourselves lets PostProcessLayers correctly compute the increased visible area on the ancestors.
Comment on attachment 8876558 [details] [diff] [review]
Return empty rect

Review of attachment 8876558 [details] [diff] [review]:
-----------------------------------------------------------------

The UntransformBy function was added in http://searchfox.org/mozilla-central/commit/368ce48f75b8189bc0372d2749b4861f069ba9c1 - my understanding of that commit message is that the function should return Nothing() if the untransformed rect is entirely on the negative side of the w=0 plane. However the implementation seems to convert any empty rect into Nothing(), which is a different thing. Can we change the implementation to do the former? I feel like that would be a more correct fix, but it might be more involved as it would require changes to ProjectRectBounds.

Failing that, I'd still prefer to leave the UntransformBy function as returning a Maybe, and change the call site in LayerManagerComposite to convert it to an empty rect in case of Nothing() - something like:
  LayerRect insideClipFloat = UntransformBy(...).valueOr(LayerRect());
Presumably the other two call sites will not need to be touched at all if we do this, since they're not dealing with clip rects and the new code is functionally equivalent to the old code.

The reason I'd prefer this is because even though the implementation of UntransformBy currently doesn't match what it's supposed to do, the correct implementation would still want to return a Maybe, and so I'd like to keep the signature the same. That will make it easier to fix the function in a follow-up bug if we decide not to do it right away.

(Also, please add a commit message to the patch.)
Attachment #8876558 - Flags: review?(bugmail) → review-
Attachment #8876558 - Attachment is obsolete: true
Pushed by mwoodrow@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/68932b1a5b71
Make PostProcessLayers occlusion culling work against the surface we will draw to rather than the parent layer. r=mstange
https://hg.mozilla.org/mozilla-central/rev/68932b1a5b71
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
and a fix:
== Change summary for alert #7294 (as of June 14 2017 02:17 UTC) ==

Improvements:

  6%  tp5o_scroll summary osx-10-10 opt e10s     3.02 -> 2.85
  2%  tscrollx summary linux64 opt e10s          4.38 -> 4.28

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=7294
Depends on: 1373335
Depends on: 1373479
Depends on: 1373528
Depends on: 1373797
Depends on: 1373830
Depends on: 1374022
Depends on: 1375208
Depends on: 1375269
Matt, should we be backing out bug 1349418 from 55, or would that be worse than shipping with this bug here?
Flags: needinfo?(matt.woodrow)
This is a pretty small regression that likes affects talos a lot more (thanks to ASAP mode) than it would affect real webpages.

Given that, I'm not too worried, and think we can just live with it, but backing out is fine too.
Flags: needinfo?(matt.woodrow)
Depends on: 1378455
Depends on: 1373983
(In reply to Matt Woodrow (:mattwoodrow) from comment #27)
> This is a pretty small regression that likes affects talos a lot more
> (thanks to ASAP mode) than it would affect real webpages.
> 
> Given that, I'm not too worried, and think we can just live with it, but
> backing out is fine too.

Bug 1373983 seems kind of bad. Can you take a look? Or reconsider backout? Thanks.
Flags: needinfo?(matt.woodrow)
Depends on: 1423528
Depends on: 1424846
Dropping stale needinfo here. The remaining regressions already have needinfos on matt, so there's nothing he needs to do on this bug specifically.
Flags: needinfo?(matt.woodrow)
Depends on: 1442915
Regressions: 1623548
No longer blocks: 1349418
Has Regression Range: --- → yes
Regressed by: 1349418
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: