Closed Bug 1244522 Opened 4 years ago Closed 4 years ago

5% winxp e10s PGO tart regression on Mozilla-Inbound on January 29, 2016 from push cc9bef95c7e1

Categories

(Firefox :: New Tab Page, defect, P3)

46 Branch
defect

Tracking

()

RESOLVED WORKSFORME
Tracking Status
e10s + ---

People

(Reporter: jmaher, Assigned: oyiptong)

References

Details

(Keywords: perf, regression, Whiteboard: [talos_regression])

Talos has detected a Firefox performance regression from your commit 33342c3d45efdb7610c9be83bf09d5664f898cd7 in bug 1178385.  We need you to address this regression.

This is a list of all known regressions and improvements related to your bug:
http://alertmanager.allizom.org:8080/alerts.html?rev=33342c3d45efdb7610c9be83bf09d5664f898cd7&showAll=1

On the page above you can see Talos alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test, please see: https://wiki.mozilla.org/Buildbot/Talos/Tests#TART.2FCART

Reproducing and debugging the regression:
If you would like to re-run this Talos test on a potential fix, use try with the following syntax:
try: -b o -p win32 -u none -t svgr-e10s[Windows XP]  # add "mozharness: --spsProfile" to generate profile data
* note, you need pgo: https://wiki.mozilla.org/ReleaseEngineering/TryChooser#What_if_I_want_PGO_for_my_build

To run the test locally and do a more in-depth investigation, first set up a local Talos environment:
https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code

Then run the following command from the directory where you set up Talos:
talos --develop -e <path>/firefox -a tart

Making a decision:
As the patch author we need your feedback to help us handle this regression.
*** Please let us know your plans by Monday, or the offending patch will be backed out! ***

Our wiki page outlines the common responses and expectations:
https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
please note this is PGO only, not opt.  

Looking at the subtests we have a few that stand out as the main regressed:
https://treeherder.allizom.org/perf.html#/comparesubtest?originalProject=mozilla-inbound&originalRevision=52ad2fa6b79b&newProject=mozilla-inbound&newRevision=d81da0ef528d&originalSignature=04f07130964ffd56db0e4039c3baf9c176e873af&newSignature=04f07130964ffd56db0e4039c3baf9c176e873af

looking higher level, we can see that svg opacity took a win:
https://treeherder.allizom.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=52ad2fa6b79b&newProject=mozilla-inbound&newRevision=d81da0ef528d&framework=1


In bisecting this down on try, I see the root cause of this as:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1ec780c2516c

:oyiptong, can you take a look at this and let us know by Wednesday (the automated bug filer above doesn't account for weekends)
Flags: needinfo?(oyiptong)
another issue with the initial comment, this is from bug 1240169 and push cc9bef95c7e1:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=cc9bef95c7e1
Summary: 5% winxp e10s PGO tart regression on Mozilla-Inbound on January 29, 2016 from push 33342c3d45efdb7610c9be83bf09d5664f898cd7 → 5% winxp e10s PGO tart regression on Mozilla-Inbound on January 29, 2016 from push cc9bef95c7e1
Thanks, I'll take a look
Flags: needinfo?(oyiptong)
Assignee: nobody → oyiptong
tracking-e10s: --- → +
Priority: -- → P3
My initial hypothesis for this regression was that bug 1218996 made tab loading faster by removing some code execution at new tab open, assuming that the time spent changing newtab urls would affect the tests.

bug 1240169 reverted that change back to how it was before bug 1218996, i.e bug 1096804.

Here are some talos runs:

There is indeed a ~4% drop in PGO tart between deaf7c6ca55b (bug 1096804, immediately before 1218996 ) and 8e0a8cdc62ad (bug 1218996):
https://treeherder.allizom.org/perf.html#/compare?originalProject=try&originalRevision=9e343fe7a3d4&newProject=try&newRevision=47fea4a5d050&framework=1&filter=tart&showOnlyImportant=0

Applying patches from bug 1240169 on top of 8e0a8cdc62ad, we see a decrease in PGO performance of 37%:

https://treeherder.allizom.org/perf.html#/compare?originalProject=try&originalRevision=47fea4a5d050&newProject=try&newRevision=d255a68decf8&framework=1&filter=tart&showOnlyImportant=0

Comparing 1096804 and 1240169 should've yielded about the same performance, but it shows a decrease in PGO performance of 30%:

https://treeherder.allizom.org/perf.html#/compare?originalProject=try&originalRevision=9e343fe7a3d4&newProject=try&newRevision=d255a68decf8&framework=1&filter=tart&showOnlyImportant=0

It is clear that those tests rely on the performance of newtab opens. Should they?

I need more time to investigate why the newtab page opens should affect those tests. We *are* doing more work when overriding new tab urls.
tart is a tab animation test, we open/close the new tab often.  In fact we depend on about newtab, look at what newtab prefs and references are in the talos source:
https://dxr.mozilla.org/mozilla-central/search?q=path%3Atesting%2Ftalos%2Ftalos+newtab&redirect=false&case=false

Just ask if you need help!  I appreciate the detailed work and what is mentioned in comment 4 makes sense.
Ack! I went back to my tests and took a look at the numbers. The tests in my analysis were not measuring the right thing: in push 47fea4a5d050, I was comparing the base PGO build to a non-PGO build. My hypothesis was actually correct and validated.

To restate, there is a performance drop of 4% PGO tart between deaf7c6ca55b (bug 1096804, immediately before 1218996) and 8e0a8cdc62ad (bug 1218996):

https://treeherder.allizom.org/perf.html#/compare?originalProject=try&originalRevision=9e343fe7a3d4&newProject=try&newRevision=47fea4a5d050&framework=1&filter=tart&showOnlyImportant=0

Expected perf drop of ~4%, observed perf drop of 3%:

Applying patches from bug 1240169 on top og 8e0a8cdc62ad, we see a decrease in PGO performance of 3%:
https://treeherder.allizom.org/perf.html#/compare?originalProject=try&originalRevision=47fea4a5d050&newProject=try&newRevision=59dc72af7490&framework=1&filter=tart&showOnlyImportant=0

Comparing 1096804 and 1240169, we see about the same performance.

https://treeherder.allizom.org/perf.html#/compare?originalProject=try&originalRevision=9e343fe7a3d4&newProject=try&newRevision=59dc72af7490&framework=1&filter=tart&showOnlyImportant=0

In other words, bug 1218996 made performance ~4% better by removing code. bug 1240169 put performance back to what it was prior to bug 1218996.
that seems fair- do you think there is work to do here to shoot for a win?  Otherwise, this seems as though we can accept this as a net zero change and move on.
While working on this patch, I tried to use caching but that didn't net any PGO build gains:

https://treeherder.allizom.org/perf.html#/compare?originalProject=try&originalRevision=9e343fe7a3d4&newProject=try&newRevision=cff6c22cfd3b&framework=1&filter=tart&showOnlyImportant=0

I think we should leave this as-is. A patch to but 1240408 may actually increase performance here.
this is great, marking as worksforme as this is a net zero regression.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WORKSFORME
Version: unspecified → 46 Branch
You need to log in before you can comment on or make changes to this bug.