Closed Bug 1247049 Opened 4 years ago Closed 4 years ago

5.61 - 12% tp5o_scroll (linux64, osx-10-10, windows7-32, windows8-64) regression on push 7d6e0141de4f (Sun Feb 7 2016)

Categories

(Toolkit :: Notifications and Alerts, defect)

defect
Not set

Tracking

()

RESOLVED WONTFIX
mozilla47
Tracking Status
firefox47 --- fixed

People

(Reporter: wlach, Assigned: jfkthame)

References

Details

(Keywords: perf, regression, Whiteboard: [talos_regression])

Attachments

(1 file)

Talos has detected a Firefox performance regression from push 7d6e0141de4f. As author of one of the patches included in that push, we need your help to address this regression.

This is a list of all known regressions and improvements related to the push:

https://treeherder.allizom.org/perf.html#/alerts?id=27

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see:

https://wiki.mozilla.org/Buildbot/Talos/Tests#tp5o_scroll

Reproducing and debugging the regression:

If you would like to re-run this Talos test on a potential fix, use try with the following syntax:

try: -b o -p win32,macosx64 -u none -t g1-e10s[Windows 7,10.10] --rebuild 5  # add "mozharness: --spsProfile" to generate profile data

(we suggest --rebuild 5 to be more confident in the results)

To run the test locally and do a more in-depth investigation, first set up a local Talos environment:

https://wiki.mozilla.lorg/Buildbot/Talos/Running#Running_locally_-_Source_Code

Then run the following command from the directory where you set up Talos:

talos --develop -e <path>/firefox -a tp5o_scroll

(add --e10s to run tests in e10s mode)

Making a decision:

As the patch author we need your feedback to help us handle this regression.
*** Please let us know your plans by Friday, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations:

https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Flags: needinfo?(jfkthame)
Summary: 5.61 - 5.91% tp5o_scroll (osx-10-10, windows7-32) regression on push 7d6e0141de4f (Sun Feb 7 2016) → 5.61 - 12% tp5o_scroll (osx-10-10, windows7-32, windows8-64) regression on push 7d6e0141de4f (Sun Feb 7 2016)
We can add a code path for OS X and Win7 to only multiple the scale factor.
Summary: 5.61 - 12% tp5o_scroll (osx-10-10, windows7-32, windows8-64) regression on push 7d6e0141de4f (Sun Feb 7 2016) → 5.61 - 12% tp5o_scroll (linux64, osx-10-10, windows7-32, windows8-64) regression on push 7d6e0141de4f (Sun Feb 7 2016)
(In reply to Masatoshi Kimura [:emk] from comment #1)
> We can add a code path for OS X and Win7 to only multiple the scale factor.

I'm not sure what you mean by this suggestion -- could you explain a bit more, or post a suggested patch? Thanks.
Flags: needinfo?(jfkthame)
Flags: needinfo?(VYV03354)
Something like

  if (!PerMonitorDPIAware() || OSX) {
    return CSSIntPoint(DevToCSSIntPixels(x), DevToCSSIntPixels(y));
  }
  /* current code */

(I'm not confident that it works on OS X)
Flags: needinfo?(VYV03354)
Those try pushes ^^^ are SPS profiling runs of Talos, in case they're helpful for determining what's going on.
So it looks like this is only an issue with e10s. That seems to suggest that under e10s (but not otherwise), the tp5o_scroll test is hammering the window.screenX / window.screenY APIs really heavily. Perhaps something about how APZ works?

We can try to make those methods a bit cheaper, perhaps by short-circuiting in the case where per-monitor DPI is not enabled (although that's somewhat "cheating", in that it may hide the issue from talos but it won't help users who actually have mixed-DPI configurations). But I think the real question here is why that code is so hot under e10s.

:kats, any insight here? Does APZ scrolling cause heavy use of window.screenX/Y, and if so, could we reduce this?
Flags: needinfo?(bugmail.mozilla)
FWIW, the tp5o_scroll test does not scroll with APZ. It scrolls synchronously, typically with scrollBy for each step, and on each step also reads back the page position typically using pageYOffset.

See the implementation here http://hg.mozilla.org/mozilla-central/file/tip/testing/talos/talos/pageloader/chrome/tscroll.js
OK, thanks for the info. So APZ isn't relevant here, then.
Flags: needinfo?(bugmail.mozilla)
I see it does read win.screenY every time (within ensureScroll()). OK, so it makes sense that bug 1240085 could affect this; though I don't yet understand why it should be so different in e10s versus non-e10s.
(In reply to Jonathan Kew (:jfkthame) from comment #11)
> I see it does read win.screenY every time (within ensureScroll()). OK, so it
> makes sense that bug 1240085 could affect this; though I don't yet
> understand why it should be so different in e10s versus non-e10s.

With e10s, unfortunately screen queries are sent to the parent process using a sync message over the PScreenManager protocol. That's the big difference - if the parent process is busy doing other things when that sync message comes in, the content process will wait until the message gets served.

bug 1194751 tracks the work to switch away from using sync messages for screen.
OK, thanks. This also means that even if the parent process isn't busy, there's substantial added overhead on what would (in single-process) be a very cheap call.

In that case, the e10s regression here is totally understandable: the code in nsGlobalWindow::GetScreenXY needs to find the screen in order to do the appropriate coordinate mapping, and that's what is so expensive. (The screen lookup happens within nsDeviceContext::GetRect, called by GetScreenXY.)

I confirmed this is the root of the regression by adding a short-circuit codepath to avoid the screen lookup when per-monitor DPI is not enabled on Windows:

  https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=0c6e357a858a&newProject=try&newRevision=ec374ef18c4f&framework=1

This gives us a "win" of -6.59% (Win32) to -9.55% (Win64). However, I don't think we should take that patch; it's an ugly platform-specific hack, and it's working around an issue that should be addressed elsewhere (bug 1194751).

Note that the "regression" here is somewhat artificial: it only shows up on tp5_scroll because the scroll test also reads win.screenY, which shouldn't normally be needed for scrolling. There's a comment at:

  http://hg.mozilla.org/mozilla-central/file/tip/testing/talos/talos/pageloader/chrome/tscroll.js#l40

that says "screenY is for X11", though I don't know what that really means... why does X11 require us to read screenY? This looks like an artifact of the test, and not a regression that will normally impact users. I doubt that performance of real-world sites is usually dependent on the performance of window.screenX/Y queries.

Given that we need the change (including the screen query) from bug 1240085 in order to get correct behavior in real-world configurations; given that the regression reported here is an artifact of how the test works rather than a true regression in actual scrolling performance; and given that there's work planned (bug 1194751) that will completely change it anyway .... I think we should just accept this as a new baseline for now.

(We can get a very minor win, as it happens, by using the new nsDeviceContext::GetDesktopToDeviceScale method that was just added in bug 1246346 to simplify GetScreenXY slightly. This seems to gain a little over 1% on Windows:

  https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=0c6e357a858a&newProject=try&newRevision=7634ec298fd8&framework=1&showOnlyImportant=0&showOnlyConfident=1

I'll post that patch for review; but apart from that, I think we should just close this as wontfix and move on, pending the e10s reworking of the screen query.)
This is only a very small win, but it's also a nice code simplification.
Attachment #8719272 - Flags: review?(VYV03354)
Assignee: nobody → jfkthame
Status: NEW → ASSIGNED
wontfix is just fine!  Thanks for diving into this and figuring out the root cause.
Attachment #8719272 - Flags: review?(VYV03354) → review+
(In reply to Jonathan Kew (:jfkthame) from comment #13)
> Note that the "regression" here is somewhat artificial: it only shows up on
> tp5_scroll because the scroll test also reads win.screenY, which shouldn't
> normally be needed for scrolling. There's a comment at:
> 
>  
> http://hg.mozilla.org/mozilla-central/file/tip/testing/talos/talos/
> pageloader/chrome/tscroll.js#l40
> 
> that says "screenY is for X11", though I don't know what that really
> means... why does X11 require us to read screenY? This looks like an
> artifact of the test, and not a regression that will normally impact users.


I added that comment (and current implementation) more than two years ago, and TBH I don't recall the X11 case exactly, but I guess that without it it was not scrolling each step correctly, hence it's at EnsureScroll().

Also, tests are typically also affected by performance of things which are not strictly the subject they pretends to test. This alone is usually not a good enough reason to ignore those regressions if we believe the regression is still real.

But your assessment of the low real-life effect on users does seem valid to me, and combined with the expected future improvements of this subject I agree we should accept this specific regression.

However, there's still one thing which bugs me. The same identical scroll test code is also used at the tscrollx test (the only difference is that tscrollx scrolls some synthetic pages and tp5o-scroll scrolls some real-world pages), but this report does not mention tscrollx at all.

I would like to know why tsctollx seems unaffected before we wontfix this.
(In reply to Avi Halachmi (:avih) from comment #16)
> I would like to know why tsctollx seems unaffected before we wontfix this.

Or rather, according to this link from comment 0 https://treeherder.allizom.org/perf.html#/alerts?id=27 , tscrollx only regressed on linux64 (opt/pgo, e10s obviously), where its regressions are only slightly smaller than those of tp5o-scroll on the respective platforms.

It seems to me we should have noticed a similar tscrollx regressions also on other platforms, but the link/report does not seem to support this.
I don't know why we don't see tscrollx regression reports for other platforms.... but looking at the graphs, it seems that on Windows-e10s, tscrollx has some kind of bi-modal behavior, which makes me suspect some other factor entirely may be coming into play there.

And on OS X e10s, the graph

  http://graphs.mozilla.org/graph.html#tests=[[287,63,61]]&sel=none&displayrange=14&datatype=geo

*does* show a distinct regression on Feb 7th. But the test seems to be quite noisy; perhaps that's why it wasn't clear enough for the report to list it?
https://hg.mozilla.org/integration/mozilla-inbound/rev/dc0bf29f0354d1ea71cabb2854dc42756b583cf8
Bug 1247049 - Optimize GetScreenXY by using the new nsDeviceContext method to get desktop scale factor. r=emk
https://hg.mozilla.org/mozilla-central/rev/dc0bf29f0354
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 47
I think the resolution here should be WONTFIX rather than FIXED, as per comment 13. The patch that was landed here is only a marginal improvement that will likely be lost in the noise (see the end of comment 13), but aside from that the real issue here is the cost of synchronous cross-process screen queries under e10s, and we have bug 1194751 intending to address that.
Resolution: FIXED → WONTFIX
sounds great!  Thanks for digging in and doing what we can do.
Flags: needinfo?(jfkthame)
Component: Untriaged → Notifications and Alerts
Product: Firefox → Toolkit
Target Milestone: Firefox 47 → mozilla47
Version: unspecified → Trunk
You need to log in before you can comment on or make changes to this bug.