Closed Bug 1210351 Opened 4 years ago Closed 4 years ago

6-16% OSX regressions on mozilla-inbound (v.44) seen September 28, from rev 97809a7cd494

Categories

(Testing :: Talos, defect)

defect
Not set

Tracking

(firefox44 fixed)

RESOLVED FIXED
mozilla44
Tracking Status
firefox44 --- fixed

People

(Reporter: jmaher, Assigned: jnicol)

References

Details

(Keywords: perf, regression, Whiteboard: [talos_regression])

Attachments

(1 file)

a variety of regressions have shown up from push:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?changeset=97809a7cd494

There are a lot of patches in there, to save resource on try from doing a lot of pushes, I wanted to see if any of this is know.

the regressions:
* tscrollx - 6.4% (http://graphs.mozilla.org/graph.html#tests=[[287,64,55]]&sel=none&displayrange=30&datatype=geo)
* tp5 main rss (e10s) - 12% (http://graphs.mozilla.org/graph.html#tests=[[261,64,57]]&sel=none&displayrange=30&datatype=geo)
* tps e10s - 16% (http://graphs.mozilla.org/graph.html#tests=[[329,63,57]]&sel=none&displayrange=30&datatype=geo)
* tresize - ~5% (http://graphs.mozilla.org/graph.html#tests=[[254,63,55]]&sel=1443092129822,1443696929822&displayrange=7&datatype=geo)

using our famous compare view in perfherder, we can see the regressions:
https://treeherder.allizom.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=da63c513fe8c&newProject=mozilla-inbound&newRevision=97809a7cd494&showExcludedPlatforms=1

and e10s:
https://treeherder.allizom.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=da63c513fe8c&newProject=mozilla-inbound&newRevision=97809a7cd494&showExcludedPlatforms=1&e10s=1

I did retriggers on inbound:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=bf14f55ecc7e&tochange=9ed7f553c3de&filter-searchStr=talos%20osx

we have enough data to show the set of patches in rev 97809a7cd494 is the root cause.

on the positive side there are some improvements to android talos and tsvgx in general :)
:snorp, can you take a look at this and help us figure out what the root cause is and if we can fix it or need to live with it?
Flags: needinfo?(snorp)
Yeah, this is somewhat unexpected. I believe the only patch that could affect Mac in there is increasing the tile size. This is determined by screen size now, but it was hard coded at 512 before. We could be bumping it up to 1024 now, which could cause paints to be slower due to increased displayport size. Joel do you know what screen size gets reported on the Talos Macs?
Flags: needinfo?(snorp) → needinfo?(jmaher)
we have 1600x1200:
https://wiki.mozilla.org/Buildbot/Talos/Misc#Hardware_Profile_of_machines_used_in_automation

right now the 10.10.2 are running the rev5 minis, the same as 10.8 was running.  In a few weeks we will be running on r7 minis and running 10.10.5.

Would the 1600x1200 make a difference?
Flags: needinfo?(jmaher)
1600x1200 would result in a tile size of 1024, so these results make some sense. Nical, what do you think we should do? Change the heuristic to keep the tile size at 512 on those screens? Live with the perf impact? I would expect better compositor performance, but the slightly slower paint times seem plausible.
Flags: needinfo?(nical.bugzilla)
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #4)
> ... Nical, what do you think we should do? Change the heuristic to keep
> the tile size at 512 on those screens? Live with the perf impact? I would
> expect better compositor performance, but the slightly slower paint times
> seem plausible.

Just to make sure, the suggested change is not just to "satisfy" the test results, right?

The goal should be how it performs for users. If "those screens" is something which users would have as well, then it would make sense to consider improvements for those cases (and balance it with tradeoffs etc).

But if this case is not common among users, then I don't think it should be handled just to improve the talos test results. On such case, let's just document why the results show a regression and why it's not too relevant for end users, and leave it at that (and possibly try to improve the test to be more relevant as well).
I don't think that Compositor performance is our bottleneck on mac, so I would rather tweak the heuristic to land on something that doesn't regress talos, since we don't have any indication that the performance perceived by the user is improved on mac with the new tile size (or investigate whether we the perceived performance is actually improved (less checkerboarding when scrolling fast?).
Flags: needinfo?(nical.bugzilla)
(In reply to Nicolas Silva [:nical] from comment #6)
> I would rather tweak the heuristic to land on something that doesn't regress
> talos

I really think we shouldn't tweak anything to make the talos results "nicer". The only valid reason for such tweaks would be to improve the experience for users. If we know that a change benefits users but talos results still show a "regression", we'll deal with it later. But the first and only priority is the users.
I agree, we shouldn't try to game the test. It looks like this is most likely a real regression. I think maybe capping the tile size at 512 (which is the current setting for Mac anyway) might make sense for now, or at least requiring a much larger screen to get to 1024.
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #8)
> I agree, we shouldn't try to game the test. It looks like this is most
> likely a real regression. I think maybe capping the tile size at 512

That's what I mean by "tweaking the heuristic". My point is that if we don't have indications that 1024x1024 improves things (in any way that is not measured with talos), and just get talos regressions on Mac, the change that caused the talos regression should be modified since we don't have indications that we are trading the regression for anything good.
Thanks. I like this better, and talos can indeed be used as a goal to tune some parameters - if we trust that it represent user experience well in this specific regard.

That being said, talos tests don't cover all the use cases, and there's no substitute to actually testing on supposedly affected systems and observe with one's own eyes that one option behaves better than another option.

But that's already up to you.
what are the next steps on this bug?  Can we assign this to someone (:snorp) and come up with a plan to fix it in the next week?
Jamie is going to make the max tile size 512 for me, since it has come up with another bug too.
Assignee: nobody → jnicol
Attached patch Patch v1Splinter Review
Make tiles either 256 or 512
Attachment #8677532 - Flags: review?(snorp)
Attachment #8677532 - Flags: review?(snorp) → review+
Try run (with a few unrelated intermittent errors, but looks good): https://treeherder.mozilla.org/#/jobs?repo=try&revision=0fd0653c7858
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/51036998fbc8
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla44
thanks, verified on the graphs!
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #8)
> I think maybe capping the tile size at 512 (which
> is the current setting for Mac anyway)

It was only the default for non-HiDPI Mac. For HiDPI we were using 1024, and this patch has caused us to use 512 everywhere. I've filed bug 1221073 about this.
You need to log in before you can comment on or make changes to this bug.