Closed Bug 1210376 Opened 9 years ago Closed 9 years ago

Large regressions in TART, CART, TResize on Linux e10s after APZ enabled

Categories

(Core :: Panning and Zooming, defect)

defect
Not set
normal

Tracking

()

RESOLVED WONTFIX
Tracking Status
e10s + ---

People

(Reporter: wlach, Assigned: botond)

References

Details

(Keywords: perf, regression, Whiteboard: [talos_regression][e10s])

Talos has detected a Firefox performance regression from your commit 30742281c223e2a7abb5931ee5383d890b5ce0e8 in bug 1143856.  We need you to address this regression.

This is a list of all known regressions and improvements related to your bug:
http://alertmanager.allizom.org:8080/alerts.html?rev=30742281c223e2a7abb5931ee5383d890b5ce0e8&showAll=1

On the page above you can see Talos alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing tests, please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

Reproducing and debugging the regression:
If you would like to re-run this Talos test on a potential fix, use try with the following syntax:
try: -b o -p linux64 -u none -t tp  # add "mozharness: --spsProfile" to generate profile data

To run the test locally and do a more in-depth investigation, first set up a local Talos environment:
https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code

Then run the following command from the directory where you set up Talos:
talos --develop -e <path>/firefox -a tp5o

Making a decision:
As the patch author we need your feedback to help us handle this regression.
*** Please let us know your plans by Monday, or the offending patch will be backed out! ***

Our wiki page oulines the common responses and expectations:
https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
(In reply to William Lachance (:wlach) from comment #1)
> Perfherder compare view:
> https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-
> inbound&originalRevision=19400ffcf469&newProject=mozilla-
> inbound&newRevision=30742281c223&e10s=1

This should actually be: https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=2258001851df&newProject=mozilla-inbound&newRevision=30742281c223&e10s=1

(perfherder seems to be having trouble picking the right revisions from the graphsview, I will investigate)
Flags: needinfo?(bugmail.mozilla)
Yeah, turning on APZ tends to have these sorts of effects. Some of these are going to stay because APZ inherently paints more than we did before. There is probably some room for improvement though.

The numbers in this regression seem a bit larger than what was reported for windows when it was enabled there (see bug 1189817). The tp5 scroll regression in particular is probably also picking up the regression from bug 1208636 comment 10 which landed just before this did. (i.e. on windows we enabled APZ first, then landed bug 1208636, so the regression was split across the two bugs; on linux we landed bug 1208636 first and then enabled it with APZ, so all the regression got attributed to APZ).

Nonetheless I'll kick off some try pushes with profiles before and after to see what stands out.
Blocks: apz-desktop
Flags: needinfo?(bugmail.mozilla)
See Also: → 1189817
Summary: Large regressions in TART, CART, TResize on Linux after APZ enabled → Large regressions in TART, CART, TResize on Linux e10s after APZ enabled
Assignee: nobody → bugmail.mozilla
Blocks: apz-e10s
tracking-e10s: --- → +
(In reply to William Lachance (:wlach) from comment #0)
> Reproducing and debugging the regression:
> If you would like to re-run this Talos test on a potential fix, use try with
> the following syntax:
> try: -b o -p linux64 -u none -t tp  # add "mozharness: --spsProfile" to
> generate profile data

This is a lie! I pushed https://treeherder.mozilla.org/#/jobs?repo=try&revision=d9a71d15d6af and it didn't do the talos run :(
... and using try-extender to get the right jobs doesn't seem to give me the SPS profile. Or maybe I'm not looking in the right place? Will - do you know if I need to redo the try push to get the profile info?
Flags: needinfo?(wlachance)
Blocks: 1207141
Keywords: perf, regression
Whiteboard: [talos_regression][e10s]
this seems to be the right compare view that shows just this change compared to the previous revision to it:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=077a6b1a1ba5&newProject=mozilla-inbound&newRevision=30742281c223&e10s=1

I found the previous tip via:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=2258001851df&tochange=8ad5cb2c028d&filter-searchStr=Ubuntu%20HW%2012.04%20x64%20mozilla-inbound%20talos%20g1

and the magic try syntax:
try: -b o -p linux64 -u none -t tp5o mozharness: --spsProfile
Flags: needinfo?(wlachance)
Quick update: I'm currently unable to get profile data from tryserver, see bug 1210521, and that is blocking any progress here.
Depends on: 1193838
No longer depends on: 1210967
Build failure :( I'll continue with this when I'm back from PTO.
is there any problem with backing this out until we can work on this?  with :kats on PTO for the week and probably another week to fix it, I would like to keep regressions out of the tree if they are not being worked on.  

I am not sure who to talk to, :mstange, can you help me find a contact
Flags: needinfo?(mstange)
Botond, how do you feel about this?
Flags: needinfo?(mstange) → needinfo?(botond)
(In reply to Joel Maher (:jmaher) from comment #10)
> is there any problem with backing this out until we can work on this?

We'd like to avoid disabling APZ on Linux, because some test suites like e10s mochitests only run on Linux, and with APZ disabled on Linux, APZ wouldn't have test coverage in automation for those suites; enabling APZ was an uphill battle as it is because, without automation coverage, tests kept regressing with APZ on.

I can pick up the investigation while Kats is on PTO.
Assignee: bugmail.mozilla → botond
Flags: needinfo?(botond)
Thanks Botond!
(In reply to (away until Oct23) Kartikaya Gupta (email:kats@mozilla.com) from comment #9)
> Build failure :( I'll continue with this when I'm back from PTO.

The build failure was caused by a missing file in the patches, which :mconley was kind enough to supply in updated patches to bug 1193838.

New Try push:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=d6c1e91151a5
(In reply to Botond Ballo [:botond] from comment #14)
> New Try push:
> 
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=d6c1e91151a5

This built locally but not in automation due to Werrors. Trying again:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=06e760888ac7
So I had a look at the magnitude of the regressions in bug 1189817, and compared them them to the ones here:

 Suite         |  Windows (bug 1189817)  |  Linux (this bug)
==============================================================
 tp5o_scroll   |  136-250%               |  127-159%
 tp5o          |  19%                    |  10-11%
 tsvgx         |  4.6-5.7%               |  4.2-4.5%
 tresize       |  15-39%                 |  4.8%
 TART          |  6.8-7.1%               |  12-15%
 CART          |  3.8%                   |  4.7%-5.2%
==============================================================

It looks like, for tp5o_scroll, tp5o, tsvgx, and tresize, the regressions in this bug are no greater than the regressions in bug 1189817; as the two bugs are about the same thing (enabling APZ) on different platforms, and we decided to allow them in bug 1189817, I think it makes sense to allow them here, too.

That said, I should note that, while APZ is expected to regress some benchmarks due to painting more than before, there are still efforts underway to improve performance. For example, bug 1204136, for which I did a speculative Talos run [1], looks like it will bring a 46% improvement to tp5o_scroll. Bug 1205025 and other dependencies of bug 1154825 are expected to help as well.

As for TART and CART, it does appear that we've regressed them more on Linux than we had on Windows (a bit more for CART, and significantly more for TART). Presumably this is what Kats was referring to when he said "the numbers in this regression seem a bit larger than what was reported on Windows" (comment 3).

I set out to investigate the TART numbers by comparing profiles before and after. I was able to use Mike Conley's patches from bug 1193838 (with a few fixes to get them to compile) to get profiles for current trunk [2]. I then got some with APZ disabled on Linux, to get a point of comparison [3]. The comparison can be seen here [4]. I'm currently doing some retriggers to gain more confidence in the numbers. I plan to continue my investigation in this direction.

Please let me know if my analysis of the situation with the other tests is reasonable; this is my first time investigating Talos regression, so I apologize if I'm overlooking something obvious.

[1] https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&originalRevision=7603d2860165&newProject=try&newRevision=44e909c5df4a&e10s=1
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=1314e51c0978
[3] https://treeherder.mozilla.org/#/jobs?repo=try&revision=83268070d3f3
[4] https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=1314e51c0978&newProject=try&newRevision=83268070d3f3&e10s=1
To try to zero in on the TART regression, I looked at the results broken down by subtest, and picked one that experienced a particularly large improvement when APZ was disabled - "icon-close-DPI2.all", which improved by 35%.

I compared runs on this subtest in Cleopatra, and found that while both sets of runs had the same overall duration, with APZ the refresh driver ticked about 45 times in a run, while without APZ it ticked about 60 times. As this is an ASAP mode test, where the refresh driver is ticked as fast as possible, this is consistent with a 35% regression in paint speed.

Unfortunately, as the duration of a test run is about 250 ms and the Talos harness sets the sampling interval of the profiler to 10 ms, there are only 25 samples in a run, which is not much information to go on to try to track down the cause of the regression.

To get some more detailed data, I reduced the sampling interval to 1 ms, and kicked off some new Try pushes. I also disabled TART tests other than icon-close-DPI2, because Cleopatra was really struggling as it is with the 10 ms profile of all test runs, and I fear that running all tests at 1 ms will make it unusable.

Here are the Try pushes:
  APZ enabled:  https://treeherder.mozilla.org/#/jobs?repo=try&revision=21a02bb81a49
  APZ disabled: https://treeherder.mozilla.org/#/jobs?repo=try&revision=6290463152aa
It might be interesting to see if we regress with DisplayPort == ViewPort. These test measure how fast we do transactions on the main thread and necessarily by doing more work on the main thread transaction we're going to be slower. If we don't regress when DP == VP then it's a sign that the regressions are strictly coming from the DP which would be a good sign.
(In reply to Benoit Girard (:BenWa) from comment #18)
> It might be interesting to see if we regress with DisplayPort == ViewPort.
> These test measure how fast we do transactions on the main thread and
> necessarily by doing more work on the main thread transaction we're going to
> be slower. If we don't regress when DP == VP then it's a sign that the
> regressions are strictly coming from the DP which would be a good sign.

Good idea! Here are some Try pushes with DP == VP:
  APZ enabled: https://treeherder.mozilla.org/#/jobs?repo=try&revision=892eeb685661
  APZ disabled: https://treeherder.mozilla.org/#/jobs?repo=try&revision=a76e0d710d05
(In reply to Botond Ballo [:botond] from comment #17)
> Here are the Try pushes:
>   APZ enabled:  https://treeherder.mozilla.org/#/jobs?repo=try&revision=21a02bb81a49
>   APZ disabled: https://treeherder.mozilla.org/#/jobs?repo=try&revision=6290463152aa

I should have mention that these Try pushes from comment 17 didn't work, because something ("graph server") didn't like the fact that some of the TART subtests didn't run. Therefore, I repeated the pushes with all subtests enabled and a 1 ms sampling interval, hoping that Cleopatra would handle it:

  APZ enabled:  https://treeherder.mozilla.org/#/jobs?repo=try&revision=329c9cc34454
  APZ disabled: https://treeherder.mozilla.org/#/jobs?repo=try&revision=0cf7f442b858

The pushes from comment 19 already had this fix.
The DP == VP runs showed pretty much the same TART regression with APZ enabled as the ones with our regular displayport heuristics, suggesting that, at least for TART, the issue is not with the extra painting (for some other tests like tscrollx it still probably is).

I looked at the profiles in Cleopatra to try to see what the problem is. (Thankfully, it opened the 1 ms sampling interval profiles successfully.) Here's what I noticed:

  - First, in both cases a lot of time is spent doing GC, leading to some very long
    refresh driver ticks where most of the time is GC. Not sure if this is expected,
    but it's the same in both cases (about 100 out of 250 samples for a single test
    run are in GC), so it's likely unrelated.

  - Looking at the rest of the refresh driver ticks, which are much shorter, the
    ones with APZ enabled are each a *bit* slower than the ones without APZ enabled.
    This suggests that APZ is adding a small amount of overhead to each paint.
    (Thanks to BenWa for pointing this out!) However, it's not that we're painting 
    a larger region, because then we would see an improvement with DP == VP.
    Rather, it's something else in the rendering pipeline.

Unfortunately, even at a 1 ms sampling interval, we only get about 2 samples per refresh for refreshes that don't include the long GC pause, which is not nearly enough to tell what inside that refresh is taking 10-15% more time.

One idea for what the cause of the overhead might be, is building event regions. To test this theory, I did a Try push with APZ disabled but the layout.event-regions.enabled pref set to true. If my theory is right, we should see a regression comparable to what we see from turning on APZ:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=9085031ac7b0
Enabling event-regions caused a 7.21% regression in TART [1], compared to enabling APZ (which activates a superset of the codepaths that enabling event-regions does) which caused a 10.21% regression [2]. This tells us that the bulk of the regression comes from extra time spent building event regions. Event regions are important for APZ's correctness, but it's possible that the code for building them could be more optimized.

[1] https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=0cf7f442b858&newProject=try&newRevision=9085031ac7b0&e10s=1
[2] https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=0cf7f442b858&newProject=try&newRevision=329c9cc34454&e10s=1
Depends on: 1215884
Thanks for the detailed investigation Botond! It does make sense that a large chunk of the regression is caused by building event regions, as this was something we knew would cause a perf hit. As you said it is needed for correctness so we can't really do away with it, but we should definitely look into optimizing it as much as possible.
Also given the above it might make sense to close this bug as a WONTFIX since there isn't any one change that will "fix" the regression. It will be more like a series of slow improvements as we find things to optimize. We can also leave it open to track the improvements that are specifically targeted at improving these tests. I'm not sure which is more appropriate from a "talos regression" process standpoint. Joel?
Flags: needinfo?(jmaher)
sounds good to me.  How do we track the work on improvements?
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(jmaher)
Resolution: --- → WONTFIX
We can use bug 1205461 for that, it seems appropriately broad.
You need to log in before you can comment on or make changes to this bug.