Bugzilla

Comment 3

•

9 years ago

Yeah, turning on APZ tends to have these sorts of effects. Some of these are going to stay because APZ inherently paints more than we did before. There is probably some room for improvement though.

The numbers in this regression seem a bit larger than what was reported for windows when it was enabled there (see bug 1189817). The tp5 scroll regression in particular is probably also picking up the regression from bug 1208636 comment 10 which landed just before this did. (i.e. on windows we enabled APZ first, then landed bug 1208636, so the regression was split across the two bugs; on linux we landed bug 1208636 first and then enabled it with APZ, so all the regression got attributed to APZ).

Nonetheless I'll kick off some try pushes with profiles before and after to see what stands out.

Blocks: apz-desktop

Flags: needinfo?(bugmail.mozilla)

Updated

•

9 years ago

Assignee: nobody → bugmail.mozilla

Tracy Walker [:tracy]

Updated

•

9 years ago

Blocks: apz-e10s

tracking-e10s: --- → +

Comment 4

•

9 years ago

(In reply to William Lachance (:wlach) from comment #0)
> Reproducing and debugging the regression:
> If you would like to re-run this Talos test on a potential fix, use try with
> the following syntax:
> try: -b o -p linux64 -u none -t tp  # add "mozharness: --spsProfile" to
> generate profile data

This is a lie! I pushed https://treeherder.mozilla.org/#/jobs?repo=try&revision=d9a71d15d6af and it didn't do the talos run :(

Comment 5

•

9 years ago

... and using try-extender to get the right jobs doesn't seem to give me the SPS profile. Or maybe I'm not looking in the right place? Will - do you know if I need to redo the try push to get the profile info?

Flags: needinfo?(wlachance)

Updated

•

9 years ago

Blocks: 1207141

Keywords: perf, regression

Whiteboard: [talos_regression][e10s]

Comment 6

•

9 years ago

this seems to be the right compare view that shows just this change compared to the previous revision to it:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=077a6b1a1ba5&newProject=mozilla-inbound&newRevision=30742281c223&e10s=1

I found the previous tip via:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=2258001851df&tochange=8ad5cb2c028d&filter-searchStr=Ubuntu%20HW%2012.04%20x64%20mozilla-inbound%20talos%20g1

and the magic try syntax:
try: -b o -p linux64 -u none -t tp5o mozharness: --spsProfile

Flags: needinfo?(wlachance)

Updated

•

9 years ago

Depends on: 1210521

Comment 7

•

9 years ago

Quick update: I'm currently unable to get profile data from tryserver, see bug 1210521, and that is blocking any progress here.

Updated

•

9 years ago

Depends on: 1210964

Updated

•

9 years ago

Depends on: 1210967

Updated

•

9 years ago

Depends on: 1193838
No longer depends on: 1210967

Comment 8

•

9 years ago

Now trying with mconley's patches from bug 1193838:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=8fbcb9d48e41

Comment 9

•

9 years ago

Build failure :( I'll continue with this when I'm back from PTO.

Comment 10

•

9 years ago

is there any problem with backing this out until we can work on this?  with :kats on PTO for the week and probably another week to fix it, I would like to keep regressions out of the tree if they are not being worked on.  

I am not sure who to talk to, :mstange, can you help me find a contact

Flags: needinfo?(mstange)

Markus Stange [:mstange]

Comment 11

•

9 years ago

Botond, how do you feel about this?

Flags: needinfo?(mstange) → needinfo?(botond)

Assignee

Comment 12

•

9 years ago

(In reply to Joel Maher (:jmaher) from comment #10)
> is there any problem with backing this out until we can work on this?

We'd like to avoid disabling APZ on Linux, because some test suites like e10s mochitests only run on Linux, and with APZ disabled on Linux, APZ wouldn't have test coverage in automation for those suites; enabling APZ was an uphill battle as it is because, without automation coverage, tests kept regressing with APZ on.

I can pick up the investigation while Kats is on PTO.

Assignee: bugmail.mozilla → botond

Flags: needinfo?(botond)

Comment 13

•

9 years ago

Thanks Botond!

Assignee

Comment 14

•

9 years ago

(In reply to (away until Oct23) Kartikaya Gupta (email:kats@mozilla.com) from comment #9)
> Build failure :( I'll continue with this when I'm back from PTO.

The build failure was caused by a missing file in the patches, which :mconley was kind enough to supply in updated patches to bug 1193838.

New Try push:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=d6c1e91151a5

Assignee

Comment 15

•

9 years ago

(In reply to Botond Ballo [:botond] from comment #14)
> New Try push:
> 
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=d6c1e91151a5

This built locally but not in automation due to Werrors. Trying again:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=06e760888ac7

Assignee

Comment 16

•

9 years ago

So I had a look at the magnitude of the regressions in bug 1189817, and compared them them to the ones here:

 Suite         |  Windows (bug 1189817)  |  Linux (this bug)
==============================================================
 tp5o_scroll   |  136-250%               |  127-159%
 tp5o          |  19%                    |  10-11%
 tsvgx         |  4.6-5.7%               |  4.2-4.5%
 tresize       |  15-39%                 |  4.8%
 TART          |  6.8-7.1%               |  12-15%
 CART          |  3.8%                   |  4.7%-5.2%
==============================================================

It looks like, for tp5o_scroll, tp5o, tsvgx, and tresize, the regressions in this bug are no greater than the regressions in bug 1189817; as the two bugs are about the same thing (enabling APZ) on different platforms, and we decided to allow them in bug 1189817, I think it makes sense to allow them here, too.

That said, I should note that, while APZ is expected to regress some benchmarks due to painting more than before, there are still efforts underway to improve performance. For example, bug 1204136, for which I did a speculative Talos run [1], looks like it will bring a 46% improvement to tp5o_scroll. Bug 1205025 and other dependencies of bug 1154825 are expected to help as well.

As for TART and CART, it does appear that we've regressed them more on Linux than we had on Windows (a bit more for CART, and significantly more for TART). Presumably this is what Kats was referring to when he said "the numbers in this regression seem a bit larger than what was reported on Windows" (comment 3).

I set out to investigate the TART numbers by comparing profiles before and after. I was able to use Mike Conley's patches from bug 1193838 (with a few fixes to get them to compile) to get profiles for current trunk [2]. I then got some with APZ disabled on Linux, to get a point of comparison [3]. The comparison can be seen here [4]. I'm currently doing some retriggers to gain more confidence in the numbers. I plan to continue my investigation in this direction.

Please let me know if my analysis of the situation with the other tests is reasonable; this is my first time investigating Talos regression, so I apologize if I'm overlooking something obvious.

[1] https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&originalRevision=7603d2860165&newProject=try&newRevision=44e909c5df4a&e10s=1
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=1314e51c0978
[3] https://treeherder.mozilla.org/#/jobs?repo=try&revision=83268070d3f3
[4] https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=1314e51c0978&newProject=try&newRevision=83268070d3f3&e10s=1

Assignee

Comment 17

•

9 years ago

To try to zero in on the TART regression, I looked at the results broken down by subtest, and picked one that experienced a particularly large improvement when APZ was disabled - "icon-close-DPI2.all", which improved by 35%.

I compared runs on this subtest in Cleopatra, and found that while both sets of runs had the same overall duration, with APZ the refresh driver ticked about 45 times in a run, while without APZ it ticked about 60 times. As this is an ASAP mode test, where the refresh driver is ticked as fast as possible, this is consistent with a 35% regression in paint speed.

Unfortunately, as the duration of a test run is about 250 ms and the Talos harness sets the sampling interval of the profiler to 10 ms, there are only 25 samples in a run, which is not much information to go on to try to track down the cause of the regression.

To get some more detailed data, I reduced the sampling interval to 1 ms, and kicked off some new Try pushes. I also disabled TART tests other than icon-close-DPI2, because Cleopatra was really struggling as it is with the 10 ms profile of all test runs, and I fear that running all tests at 1 ms will make it unusable.

Here are the Try pushes:
  APZ enabled:  https://treeherder.mozilla.org/#/jobs?repo=try&revision=21a02bb81a49
  APZ disabled: https://treeherder.mozilla.org/#/jobs?repo=try&revision=6290463152aa

Benoit Girard (:BenWa)

Comment 18

•

9 years ago

It might be interesting to see if we regress with DisplayPort == ViewPort. These test measure how fast we do transactions on the main thread and necessarily by doing more work on the main thread transaction we're going to be slower. If we don't regress when DP == VP then it's a sign that the regressions are strictly coming from the DP which would be a good sign.

Assignee

Comment 19

•

9 years ago

(In reply to Benoit Girard (:BenWa) from comment #18)
> It might be interesting to see if we regress with DisplayPort == ViewPort.
> These test measure how fast we do transactions on the main thread and
> necessarily by doing more work on the main thread transaction we're going to
> be slower. If we don't regress when DP == VP then it's a sign that the
> regressions are strictly coming from the DP which would be a good sign.

Good idea! Here are some Try pushes with DP == VP:
  APZ enabled: https://treeherder.mozilla.org/#/jobs?repo=try&revision=892eeb685661
  APZ disabled: https://treeherder.mozilla.org/#/jobs?repo=try&revision=a76e0d710d05

Assignee

Comment 20

•

9 years ago

(In reply to Botond Ballo [:botond] from comment #17)
> Here are the Try pushes:
>   APZ enabled:  https://treeherder.mozilla.org/#/jobs?repo=try&revision=21a02bb81a49
>   APZ disabled: https://treeherder.mozilla.org/#/jobs?repo=try&revision=6290463152aa

I should have mention that these Try pushes from comment 17 didn't work, because something ("graph server") didn't like the fact that some of the TART subtests didn't run. Therefore, I repeated the pushes with all subtests enabled and a 1 ms sampling interval, hoping that Cleopatra would handle it:

  APZ enabled:  https://treeherder.mozilla.org/#/jobs?repo=try&revision=329c9cc34454
  APZ disabled: https://treeherder.mozilla.org/#/jobs?repo=try&revision=0cf7f442b858

The pushes from comment 19 already had this fix.

Assignee

Comment 21

•

9 years ago

The DP == VP runs showed pretty much the same TART regression with APZ enabled as the ones with our regular displayport heuristics, suggesting that, at least for TART, the issue is not with the extra painting (for some other tests like tscrollx it still probably is).

I looked at the profiles in Cleopatra to try to see what the problem is. (Thankfully, it opened the 1 ms sampling interval profiles successfully.) Here's what I noticed:

- First, in both cases a lot of time is spent doing GC, leading to some very long
refresh driver ticks where most of the time is GC. Not sure if this is expected,
but it's the same in both cases (about 100 out of 250 samples for a single test
run are in GC), so it's likely unrelated.

- Looking at the rest of the refresh driver ticks, which are much shorter, the
ones with APZ enabled are each a *bit* slower than the ones without APZ enabled.
This suggests that APZ is adding a small amount of overhead to each paint.
(Thanks to BenWa for pointing this out!) However, it's not that we're painting
a larger region, because then we would see an improvement with DP == VP.
Rather, it's something else in the rendering pipeline.

Unfortunately, even at a 1 ms sampling interval, we only get about 2 samples per refresh for refreshes that don't include the long GC pause, which is not nearly enough to tell what inside that refresh is taking 10-15% more time.

One idea for what the cause of the overhead might be, is building event regions. To test this theory, I did a Try push with APZ disabled but the layout.event-regions.enabled pref set to true. If my theory is right, we should see a regression comparable to what we see from turning on APZ:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=9085031ac7b0

Assignee

Comment 22

•

9 years ago

Enabling event-regions caused a 7.21% regression in TART [1], compared to enabling APZ (which activates a superset of the codepaths that enabling event-regions does) which caused a 10.21% regression [2]. This tells us that the bulk of the regression comes from extra time spent building event regions. Event regions are important for APZ's correctness, but it's possible that the code for building them could be more optimized.

[1] https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=0cf7f442b858&newProject=try&newRevision=9085031ac7b0&e10s=1
[2] https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=0cf7f442b858&newProject=try&newRevision=329c9cc34454&e10s=1

Assignee

Updated

•

9 years ago

Depends on: 1215884

Comment 23

•

9 years ago

Thanks for the detailed investigation Botond! It does make sense that a large chunk of the regression is caused by building event regions, as this was something we knew would cause a perf hit. As you said it is needed for correctness so we can't really do away with it, but we should definitely look into optimizing it as much as possible.

Comment 24

•

9 years ago

Also given the above it might make sense to close this bug as a WONTFIX since there isn't any one change that will "fix" the regression. It will be more like a series of slow improvements as we find things to optimize. We can also leave it open to track the improvements that are specifically targeted at improving these tests. I'm not sure which is more appropriate from a "talos regression" process standpoint. Joel?

Flags: needinfo?(jmaher)

Comment 25

•

9 years ago

sounds good to me.  How do we track the work on improvements?

Status: NEW → RESOLVED

Closed: 9 years ago

Flags: needinfo?(jmaher)

Resolution: --- → WONTFIX