Closed Bug 1483172 Opened Last year Closed 11 months ago

2.21 - 21.07% sessionrestore / sessionrestore_no_auto_restore / tsvgx (linux64, osx-10-10, windows10-64, windows7-32) regression on push e6005ad88309 (Fri Aug 10 2018)

Categories

(Firefox Build System :: General, defect)

63 Branch
defect
Not set

Tracking

(firefox-esr52 unaffected, firefox-esr60 unaffected, firefox61 unaffected, firefox62 unaffected, firefox63 wontfix, firefox64 wontfix)

RESOLVED WONTFIX

People

(Reporter: igoldan, Unassigned)

References

Details

(Keywords: perf, regression, talos-regression)

Talos has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/autoland/rev/e6005ad88309

You are the author of one of the patches included in that push, and we need your help to address this regression.

Regressions:

 21%  sessionrestore linux64 opt e10s stylo                       274.50 -> 332.33
 21%  sessionrestore_no_auto_restore linux64 opt e10s stylo       296.17 -> 358.08
 21%  sessionrestore_no_auto_restore linux64 pgo e10s stylo       281.08 -> 339.17
 21%  sessionrestore linux64 pgo e10s stylo                       263.00 -> 316.92
 17%  sessionrestore_no_auto_restore windows10-64 opt e10s stylo  265.17 -> 311.00
 17%  sessionrestore_no_auto_restore osx-10-10 opt e10s stylo     599.42 -> 701.42
 17%  sessionrestore windows7-32 opt e10s stylo                   238.08 -> 278.33
 17%  sessionrestore_no_auto_restore windows10-64 pgo e10s stylo  248.00 -> 289.33
 16%  sessionrestore windows7-32 pgo e10s stylo                   231.75 -> 269.92
 15%  sessionrestore_no_auto_restore windows7-32 pgo e10s stylo   265.67 -> 306.67
 15%  sessionrestore_no_auto_restore windows7-32 opt e10s stylo   276.50 -> 317.08
 14%  sessionrestore osx-10-10 opt e10s stylo                     562.08 -> 643.33
 10%  sessionrestore windows10-64 opt e10s stylo                  229.54 -> 253.00
 10%  sessionrestore windows10-64 pgo e10s stylo                  214.46 -> 235.67
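For readers checking the numbers: each percentage above is simply the relative change from the old median score to the new one. A minimal sketch in plain Python, using the values from the first row of the table:

```python
def regression_pct(before: float, after: float) -> float:
    """Relative regression, expressed as a percentage of the old score."""
    return (after - before) / before * 100

# First row above: sessionrestore linux64 opt e10s stylo
print(round(regression_pct(274.50, 332.33), 2))  # ~21.07
```

This matches the 21.07% upper bound quoted in the bug title.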


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=14867

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Product: Testing → Firefox Build System
Flags: needinfo?(aswan)
I'll post the Gecko profiles soon.
How did you attribute this to the push from comment 0? From a quick glance at some Perfherder graphs, this regression was present on autoland before my patch landed; it looks like it came from the previous merge from central.
Flags: needinfo?(aswan) → needinfo?(igoldan)
:aswan, it showed up on all the merges, so we did try pushes to bisect the merges, and this ended up being the culprit:
https://treeherder.mozilla.org/#/jobs?repo=try&author=igoldan@mozilla.com&filter-searchStr=talos%20other&fromchange=ebee17803f232bac510c2d8a1f60420ab5825995


you can look at the results:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=86400&series=try,1536960,1,1&zoom=1534227879872.951,1534238294403.6885,220,274.65855287701896

Since this was a try bisection where we removed one patch at a time individually (not cumulatively), we can easily see where the numbers differ; in this case it is this push:
https://hg.mozilla.org/try/pushloghtml?fromchange=8a26688268b167ffdea25c8e6dad0164ff1f7088&tochange=78c05420710573942c40d5e8a51f7db12c4b96b9

I am not sure why this was hard to detect; perhaps this patch interacts with another patch that landed, and the problems only began once they were merged together.
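The single-patch-backout logic described above can be sketched as follows. This is purely illustrative: the patch names, baseline, and scores below are made up, not taken from the actual try pushes.

```python
# Sketch of the non-cumulative bisection described above: each try push
# removes exactly one patch from the regressing range, and the patch whose
# removal restores the baseline score is the culprit.
# All names and numbers here are hypothetical.
BASELINE = 275.0   # pre-regression sessionrestore score (lower is better)
TOLERANCE = 0.05   # treat scores within 5% of baseline as "recovered"

# Score observed on the try push that removed each patch
try_results = {
    "patch-A": 332.0,  # still regressed -> not the culprit
    "patch-B": 331.5,  # still regressed -> not the culprit
    "patch-C": 276.1,  # back to baseline -> removing this patch fixed it
}

def find_culprit(results, baseline, tolerance):
    """Return the patches whose removal brought the score back to baseline."""
    return [
        patch for patch, score in results.items()
        if abs(score - baseline) / baseline <= tolerance
    ]

print(find_culprit(try_results, BASELINE, TOLERANCE))  # ['patch-C']
```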
Flags: needinfo?(igoldan)
(In reply to Joel Maher ( :jmaher ) (UTC+2) from comment #4)
> I am not sure why this was hard to detect; perhaps this patch interacts
> with another patch that landed, and the problems only began once they were
> merged together.

I suspect this was the combination of this patch and bug 1480610.  From eyeballing the profiles, with those two patches both present it looks like we're not doing any extra work, but that we are starting the extension process earlier.  This is a cost that I think we're going to have to bear at some point (unless we just stick with bootstrapped extensions running in the parent process indefinitely)
(In reply to Andrew Swan [:aswan] from comment #5)
> (In reply to Joel Maher ( :jmaher ) (UTC+2) from comment #4)
> > I am not sure why this was hard to detect; perhaps this patch interacts
> > with another patch that landed, and the problems only began once they
> > were merged together.
> 
> I suspect this was the combination of this patch and bug 1480610.  From
> eyeballing the profiles, with those two patches both present it looks like
> we're not doing any extra work, but that we are starting the extension
> process earlier.  This is a cost that I think we're going to have to bear at
> some point (unless we just stick with bootstrapped extensions running in the
> parent process indefinitely)

That's probably accurate. Also, this is probably a cost we're already paying for users who have extensions installed.

It is interesting, though, that the fxmonitor extension seems to be completely empty. That suggests that this overhead is all from just loading the extension framework. The profiles aren't especially illuminating, though.
(In reply to Andrew Swan [:aswan] from comment #5)
> (In reply to Joel Maher ( :jmaher ) (UTC+2) from comment #4)
> > I am not sure why this was hard to detect; perhaps this patch interacts
> > with another patch that landed, and the problems only began once they
> > were merged together.
> 
> I suspect this was the combination of this patch and bug 1480610.  From
> eyeballing the profiles, with those two patches both present it looks like
> we're not doing any extra work, but that we are starting the extension
> process earlier.  This is a cost that I think we're going to have to bear at
> some point (unless we just stick with bootstrapped extensions running in the
> parent process indefinitely)

FWIW, I decided to backout that patch: please see bug 1483430 comment 0.
(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #2)
> Gecko profiles from Windows 10:

Do you have profiles from Linux? The sampling rate on Windows is much lower, and these profiles don't have native symbols.
Flags: needinfo?(igoldan)
Please see bug 1449055: when that landed, there were no perf changes, but when it was backed out, it was as if all of the regressions on this bug were fixed. Now that it has landed again, the regressions are back. I suspect the code from this bug and the code from bug 1449055 are tightly coupled.
(In reply to Joel Maher ( :jmaher ) (UTC+2) from comment #9)
> Please see bug 1449055: when that landed, there were no perf changes, but
> when it was backed out, it was as if all of the regressions on this bug
> were fixed. Now that it has landed again, the regressions are back. I
> suspect the code from this bug and the code from bug 1449055 are tightly
> coupled.

That pretty much confirms my guess that this is just the overhead of loading the framework, and is not per-extension overhead.
Given that igoldan did backouts of each patch on try, I am not sure how to gather performance profiles. Maybe we can use the new regression:
https://treeherder.mozilla.org/perf.html#/graphs?series=mozilla-inbound,1640643,1,1&series=autoland,1646280,1,1&selected=mozilla-inbound,1640643,365881,528096659,1

:kmag, would that be useful?
This is the summary of alerts that came in when bug 1449055 was relanded:

== Change summary for alert #14867 (as of Fri, 10 Aug 2018 02:49:49 GMT) ==

Regressions:

 21%  sessionrestore linux64 opt e10s stylo                       274.50 -> 332.33
 21%  sessionrestore_no_auto_restore linux64 opt e10s stylo       296.17 -> 358.08
 21%  sessionrestore_no_auto_restore linux64 pgo e10s stylo       281.08 -> 339.17
 21%  sessionrestore linux64 pgo e10s stylo                       263.00 -> 316.92
 17%  sessionrestore_no_auto_restore windows10-64 opt e10s stylo  265.17 -> 311.00
 17%  sessionrestore_no_auto_restore osx-10-10 opt e10s stylo     599.42 -> 701.42
 17%  sessionrestore windows7-32 opt e10s stylo                   238.08 -> 278.33
 17%  sessionrestore_no_auto_restore windows10-64 pgo e10s stylo  248.00 -> 289.33
 16%  sessionrestore windows7-32 pgo e10s stylo                   231.75 -> 269.92
 15%  sessionrestore_no_auto_restore windows7-32 pgo e10s stylo   265.67 -> 306.67
 15%  sessionrestore_no_auto_restore windows7-32 opt e10s stylo   276.50 -> 317.08
 14%  sessionrestore osx-10-10 opt e10s stylo                     562.08 -> 643.33
 10%  sessionrestore windows10-64 opt e10s stylo                  229.54 -> 253.00
 10%  sessionrestore windows10-64 pgo e10s stylo                  214.46 -> 235.67
  2%  tsvgx linux64 opt e10s stylo                                179.95 -> 183.93

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=14867
I posted the wrong summary in comment 12; this is the new one:
== Change summary for alert #15086 (as of Thu, 16 Aug 2018 19:33:48 GMT) ==

Regressions:

 27%  sessionrestore linux64 pgo e10s stylo                       263.75 -> 333.75
 26%  sessionrestore linux64 opt e10s stylo                       272.00 -> 343.92
 26%  sessionrestore_no_auto_restore linux64 pgo e10s stylo       283.42 -> 357.67
 25%  sessionrestore_no_auto_restore linux64 opt e10s stylo       291.40 -> 365.67
 24%  sessionrestore windows10-64 pgo e10s stylo                  215.17 -> 267.83
 24%  sessionrestore_no_auto_restore windows10-64 opt e10s stylo  267.75 -> 332.92
 23%  sessionrestore windows10-64 opt e10s stylo                  234.67 -> 289.75
 23%  sessionrestore_no_auto_restore windows10-64 pgo e10s stylo  248.58 -> 304.92
 21%  sessionrestore windows7-32 opt e10s stylo                   241.83 -> 293.75
 21%  sessionrestore_no_auto_restore osx-10-10 opt e10s stylo     605.25 -> 729.67
 19%  sessionrestore osx-10-10 opt e10s stylo                     571.58 -> 681.08
 19%  sessionrestore windows7-32 pgo e10s stylo                   235.50 -> 280.58
 18%  sessionrestore_no_auto_restore windows7-32 pgo e10s stylo   269.92 -> 319.58
 16%  sessionrestore_no_auto_restore windows7-32 opt e10s stylo   288.17 -> 334.75

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=15086
(In reply to Joel Maher ( :jmaher ) (UTC+2) from comment #11)
> Given that igoldan did backouts of each patch on try, I am not sure how to
> gather performance profiles. Maybe we can use the new regression:
> https://treeherder.mozilla.org/perf.html#/graphs?series=mozilla-inbound,
> 1640643,1,1&series=autoland,1646280,1,1&selected=mozilla-inbound,1640643,
> 365881,528096659,1
> 
> :kmag, would that be useful?

I think so, yeah
Kris - are you going to take this bug?
Flags: needinfo?(kmaglione+bmo)
Kris, should we assign this bug to you?
I've been working on removing the legacy bootstrap.js from Screenshots, and have found similar regressions in the same two Talos tests; see bug 1491997.
If this is what we get for now with bootstrap removal, then this is the cost to bear until we specifically address the performance of loading the framework. And that would not happen for 63 or 64 (at this point).

I assume 64 would be affected as well.
Flags: needinfo?(kmaglione+bmo)
Depends on: 1451485
Blocks: 1451485
No longer depends on: 1451485
This bug is about measurements for sessionrestore in code that has already landed in mozilla-central; the measurements from the webcompat patches are additional changes on top of those tracked in this bug. The issues may be related, but I don't think there's a dependency between these bugs.
No longer blocks: 1451484, 1451485
Shouldn't we mark this as wontfix for Firefox 64 as well?
Flags: needinfo?(jaws)
At this point I believe we should mark it as wontfix for 64, and if we want it fixed for 65 we would need to find someone to work on this ASAP.
Flags: needinfo?(jaws)

Tim, can you find someone who can fix this regression? This is the last bug that's blocking bug 1470850.

Flags: needinfo?(tspurway)

:aswan, can you take this one?

Flags: needinfo?(tspurway) → needinfo?(aswan)

The cause here was identified back in comment 5 and apparently (?) confirmed in comment 10.
Note that we got a little bit of the original hit back with bug 1498027.

Status: NEW → RESOLVED
Closed: 11 months ago
Flags: needinfo?(aswan)
Resolution: --- → WONTFIX