Closed Bug 1405960 Opened 7 years ago Closed 7 years ago

2.42 - 250.58% damp / sessionrestore_many_windows / tabpaint / tart / tps (linux64, osx-10-10, windows10-64, windows7-32) regression on push 8d3464b1fd54fa48bc42ae51bdd365bbb40d1c1d (Mon Oct 2 2017)

Categories

(Firefox :: Untriaged, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking


Status: RESOLVED FIXED
Target Milestone: Firefox 58
Tracking Status
firefox-esr52 --- unaffected
firefox56 --- unaffected
firefox57 blocking fixed
firefox58 --- fixed

People

(Reporter: igoldan, Assigned: mrbkap)

References

Details

(Keywords: perf, regression, talos-regression)

Attachments

(1 file)

Talos has detected a Firefox performance regression from push:

https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?changeset=8d3464b1fd54fa48bc42ae51bdd365bbb40d1c1d

As the author of one of the patches included in that push, we need your help to address this regression.

Regressions:

251%  tabpaint summary osx-10-10 opt e10s stylo_disabled     64.38 -> 225.72
237%  tabpaint summary osx-10-10 opt e10s                    66.17 -> 222.68
157%  tabpaint summary windows10-64 opt e10s                 57.11 -> 146.76
130%  tabpaint summary windows7-32 opt e10s                  55.69 -> 128.21
123%  tabpaint summary linux64 opt e10s                      71.16 -> 158.99
 23%  sessionrestore_many_windows osx-10-10 opt e10s stylo_disabled  2,213.83 -> 2,725.42
 22%  sessionrestore_many_windows osx-10-10 opt e10s         2,242.50 -> 2,740.33
 15%  tart summary linux64 opt e10s                          4.24 -> 4.87
 10%  damp summary linux64 opt e10s                          209.76 -> 231.43
 10%  tart summary windows7-32 opt e10s                      3.72 -> 4.10
 10%  damp summary windows7-32 opt e10s                      192.85 -> 211.40
  8%  damp summary windows10-64 opt e10s                     198.56 -> 213.64
  8%  tart summary osx-10-10 opt e10s                        7.68 -> 8.26
  7%  tart summary osx-10-10 opt e10s stylo_disabled         7.69 -> 8.24
  7%  tps summary linux64 opt e10s                           24.63 -> 26.37
  6%  damp summary osx-10-10 opt e10s stylo_disabled         246.50 -> 260.77
  3%  sessionrestore_many_windows windows10-64 opt e10s      3,350.25 -> 3,437.00
  2%  sessionrestore_many_windows windows7-32 opt e10s       3,383.67 -> 3,465.67

Improvements:

 15%  tps summary windows7-32 opt e10s     22.38 -> 18.94


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=9813

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
These are very serious regressions which have been uplifted to Beta.

I did the bisection using OPT builds. Unfortunately, the results aren't clear enough to properly identify the regression's origin. I am now doing another set of bisections using only PGO builds, just like Firefox Beta uses (the opt builds above are actually PGO builds as well).

The push consisted of 13 changesets. Looking at the current bisection results, it seems the regressions don't necessarily originate from a single changeset, but rather from multiple ones.

The clearest results show that bug 1402493 caused the Tabpaint regressions.
The DAMP regression was probably caused by bug 1401943 or bug 1399504; not too confident though.
The Tps regression was probably caused by bug 1400219; again, not too confident.

I am afraid I have to ni? all of you, as this is already old and needs attention. My apologies for that.
Most likely, tomorrow the new bisection will properly identify the regressions' origin.
Flags: needinfo?(selee)
Flags: needinfo?(prathikshaprasadsuman)
Flags: needinfo?(mrbkap)
Flags: needinfo?(jjones)
Flags: needinfo?(jhofmann)
Flags: needinfo?(jaws)
Flags: needinfo?(hikezoe)
Flags: needinfo?(gijskruitbosch+bugs)
Flags: needinfo?(dvander)
Flags: needinfo?(droeh)
It's extremely unlikely that Bug 1402287 and Bug 1400219 caused this. Those are small style changes that only affect the permissions dialog in the preferences page. I think we can cross those two off the list.

Bug 1390885 should _in theory_ only affect lightweight themes on OSX, so I'm going with the assumption that it didn't cause any performance regression until there's strong indication otherwise.
Flags: needinfo?(prathikshaprasadsuman)
Flags: needinfo?(jhofmann)
Bug 1399504 only affects Android; I don't see how it could be responsible for any of these regressions.
Flags: needinfo?(droeh)
The previous numbers are wildly low; I think that is part of the problem here.

On my push to mozilla-beta (https://hg.mozilla.org/releases/mozilla-beta/rev/0fe06c5dd395), there is a link to the Talos perfherder. That link shows the Talos results from that single push compared to the previous push.

I retriggered the osx-10-10 tabpaint opt e10s test 10 times for both the base and new changesets. The results are:

base: 219.66 ± 7.02%
new: 219.66 ± 7.02%
runs: 11 / 11

I will admit it's mysterious how the base and new averages are exactly the same with the same margin of error: https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-beta&originalRevision=8d3464b1fd54fa48bc42ae51bdd365bbb40d1c1d&newProject=mozilla-beta&newRevision=5fba16e588e6616d20ae3ba638bd42a50a4e2f43&framework=1&filter=tabpaint

The individual test runs show different values, but the averages come out the same.
Flags: needinfo?(jaws)
Bug 1403279 is a pref'd off one-liner constant [1]; there's no way for it to cause a performance regression at this time.

[1] https://hg.mozilla.org/releases/mozilla-beta/rev/cfde20f68c1f
Flags: needinfo?(jjones)
The PGO-based bisection has finished. Unfortunately, its conclusion is even more ambiguous. I don't know to what degree the build configs on Try and Beta differ, but the results are not consistent and their magnitude is nowhere near the one spotted on Beta.
(In reply to Jared Wein [:jaws] (please needinfo? me) from comment #4)
> The previous numbers are wildly low, I think that is part the problem here.

Looking at the graphs I see that the previous numbers, while low, were consistent. The numbers are now consistently much higher.
Bug 1403433 seems plausible.

Ionuț, is it possible for you to run separate builds with f4e84e6ed76d and 0fe06c5dd395 backed out?
Flags: needinfo?(igoldan)
Odd, 8d3464b1fd54 shows a regression. I decided to push to try from mozilla-beta, at both 8d3464b1fd54 and the previous rev c7b47e1282c2:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=41e3dca82727&newProject=try&newRevision=c0ee66ff6bcfac7435429d50293878c3f1b2c634&framework=1

linux64 tabpaint went from ~75 -> ~160 on beta; on both of the above try pushes I get ~73, so I am not able to reproduce this. Did our builders or testers change? I am disappointed that we are not able to reliably reproduce beta failures on try.
The builds I see on try run in 26 or 42 minutes, whereas on beta they take 76 minutes. RyanVM, how do I push to try for mozilla-beta in a way that replicates the builds used in automation?
Flags: needinfo?(ryanvm)
I'm guessing you're running afoul of branch_specifics.py here. For the uplift simulations, we manually tell it to use the Beta mozconfigs on the Try repo, a la:
https://hg.mozilla.org/try/rev/adac0cbd1d664eb766189fd46a59c37468112a6c
Flags: needinfo?(ryanvm)
Bug 1403353 only affects Win7 pre-SP1, which I assume Talos does not use, so it's probably not that.
Flags: needinfo?(dvander)
Bug 1401411 only relates to the form submit process. I assume Talos doesn't exercise that code path, so it's probably not that.
Flags: needinfo?(selee)
I have tried :RyanVM's try patches and still get the old numbers, even after ensuring Talos is not using special try bits:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=cb8ef71e24c33843406f6ab5343f4a0e5a753237&selectedJob=135189808

My builds take ~55 minutes, still under 75 minutes, so possibly I am still getting the wrong builds.
Bug 1402981 only modifies customize mode, so it's definitely not that.

In theory, it could be bug 1401943, which modifies the initialization of the stop/reload buttons. But because it's the initialization, I wouldn't expect it to impact tabswitch or -paint times when opening/switching them repeatedly.


We really need an accurate regression window here. If it comes to it, could we do 2 pushes to beta to back out part of the csets, and then land them again (ensuring, of course, that we don't end up doing nightlies on one of those pushes)?

Do we have similar regressions on 58?

Ryan, do you have ideas about what else to try here given comment #14 ?
Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(ryanvm)
I don't have any good ideas or theories as to what's going on here either. Looking at the pushlog, nothing jumped out at me as to what might be behind it, which certainly makes one suspicious of the results. I agree that the change in numbers is really worrying, but I'm also not convinced there's a real in-product issue here :(
Flags: needinfo?(ryanvm)
Yeah, there might not be anything related to the patches; maybe the builds, or other packages the builds depend on, changed? Possibly some process for signing? I am retriggering a build to see if there is some way the build process on mozilla-beta is non-deterministic.
Adding catlee for visibility.
This is a blocking issue for beta57 until we know more.
From the separate email chain going about this:

Ritu notes that bug 1404098 at least could in theory be a patch explaining why we see a change on Beta but not on trunk. Is it possible we've accidentally changed something with e10s-multi?

Bug 1404098 could also explain why attempts to reproduce on Try haven't been successful if the builds we run Talos on have "default" as their update channel instead of "beta" and are therefore not hitting the logic changed by that patch. *IF* this all proves out correctly, I think that means we're enabling e10s-multi when we hadn't been previously and seeing big changes in the resulting perf results.

Which is terrifying if true :-(

Anyway, we're trying to build 57b6 today. We can certainly try backing that out tomorrow and see what happens.
Confirmed that the builds Talos ran on from Joel's push in comment 14 still have "default" as their update channel. So yeah, I would expect it to not be affected if this is an issue caused by the e10srollout addon looking at the "beta" update channel.
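To make that concrete: anything keyed off the update channel simply never runs on a Try build whose channel is "default". Below is a rough illustrative sketch of that kind of gating; it is not the e10srollout add-on's actual code, just the shape of the check being suspected.

// Illustrative sketch only -- not the real e10srollout logic.
Components.utils.import("resource://gre/modules/UpdateUtils.jsm");

// "beta" on a real Beta build, but "default" on a typical Try build.
const channel = UpdateUtils.getUpdateChannel(false);

if (channel === "beta") {
  // Beta-specific rollout decisions (e.g. whether to enable e10s-multi and
  // which dom.ipc.processCount to use) would happen here...
} else {
  // ...so a Try build with channel "default" skips them entirely, which would
  // explain why the Beta regression doesn't reproduce on Try.
}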
Bug 1403433 cut out a process for SVG, so ideally it should not make performance *worse*. That said, I understand the change is one of the suspects. Keep the ni? on me for now; I really want to see the regression window narrowed down.
I'm looking into this, starting with the tabpaint regression. It would be extremely surprising to have a massive performance regression for multi on beta but not on Nightly. Indeed, looking at the harness for tabpaint, I see [1], which disables multi for tabpaint. However, because we use e10srollout on Beta that line of code is a no-op. We didn't notice this before because the tabpaint harness is a bootstrap addon and therefore forced us to disable multi on its own. Now, we're setting dom.ipc.processCount to its default value and enabling multi.

Fixing that is easy -- we should just disable e10s-multi in a way that is cross-branch (at least until we kill e10srollout for good).

The reasoning for forcing a single process can be found in bug 1317312, comment 17 (tl;dr: tabpaint shouldn't include process launch time in its measurement).

I'm clearing the rest of the needinfos for now because I think we have a good handle on what happened.

[1] http://searchfox.org/mozilla-central/rev/7ba03cebf1dcf3b4c40c85a4efad42a5022cf1ec/testing/talos/talos/tests/tabpaint/bootstrap.js#76
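For clarity, here is a minimal sketch of the fix described above. This is not the actual attachment; in particular, the integer written to the opt-out pref is an assumption, since the real patch writes a constant provided by Gecko.

Components.utils.import("resource://gre/modules/Services.jsm");

const PROCESS_COUNT_PREF = "dom.ipc.processCount";
const MULTI_OPT_OUT_PREF = "dom.ipc.multiOptOut";

// Old approach: pin the content-process count directly. On Beta the
// e10srollout add-on overrides dom.ipc.processCount later, making this a no-op:
//   Services.prefs.setIntPref(PROCESS_COUNT_PREF, 1);

// New approach: opt out of e10s-multi via dom.ipc.multiOptOut, which is
// honored on every branch regardless of e10srollout. The value 1 is an
// assumption here; the real patch writes a Gecko-defined constant.
Services.prefs.setIntPref(MULTI_OPT_OUT_PREF, 1);

// ... run the tabpaint measurements in a single content process ...

// On shutdown, restore the default multi behavior.
Services.prefs.clearUserPref(MULTI_OPT_OUT_PREF);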
Assignee: nobody → mrbkap
Flags: needinfo?(mrbkap)
Flags: needinfo?(igoldan)
Flags: needinfo?(hikezoe)
Here's a trunk push with the patch to ensure it doesn't regress things for central. Ryan, can you try applying this patch to beta and pushing with the correct channel so we can ensure that this does fix the regression?

https://treeherder.mozilla.org/perf.html#/comparechooser?newProject=try&newRevision=f7fa7d7ad152d09c1ad23e2cf24fb63358803c61
Flags: needinfo?(ryanvm)
Some more thoughts:

The patch here will correct the tabpaint tests' results (including tps).

The sessionrestore test regression almost certainly also exists on Nightly (since we're doing the same work now as we were before), but we don't run that comparison. One reason this took us by surprise is that Talos sets extensions.e10sBlocksEnabling in order to enable e10s (which might actually penalize us because of shims and CPOWs, see bug 1230608 for more) but it doesn't set extensions.e10sMultiBlocksEnabling. We've eaten it there for as long as multi has been around and I don't see a reason to start worrying about it now.
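For illustration, that asymmetry boils down to something like the rough sketch below, using the pref names quoted above; the real Talos harness sets these through its own preference machinery, not literally like this.

Components.utils.import("resource://gre/modules/Services.jsm");

// Talos sets this so its legacy (bootstrapped) test add-ons don't block e10s:
Services.prefs.setBoolPref("extensions.e10sBlocksEnabling", false);

// ...but it never sets the multi-process counterpart, so those same add-ons
// can still keep e10s-multi disabled for the whole run:
//   Services.prefs.setBoolPref("extensions.e10sMultiBlocksEnabling", false);  // not set by Talos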

The tart and damp regressions already exist on central, see bug 1380385 and bug 1227663. Again, we have been living with these regressions for close to a year now and have mitigations and workarounds for real users (such as the preallocated process manager).

That explains all of the regressions here. Once we land the patch, I think we should be in a better place wrt consistency between Nightly and Beta/Release.

Also, for those of you playing along, bug 1406212 now tracks the removal of the e10srollout addon meaning that we will be entirely consistent between Nightly, Beta, and Release in terms of how e10s (multi) is enabled or disabled.
Comment on attachment 8915764 [details]
Bug 1405960 - Force e10s-multi off in a cross-branch way.

https://reviewboard.mozilla.org/r/186980/#review192258

::: testing/talos/talos/tests/tabpaint/bootstrap.js:85
(Diff revision 2)
>      for (let msgName of this.MESSAGES) {
>        Services.mm.removeMessageListener(msgName, this);
>      }
>  
>      Services.prefs.setBoolPref(ANIMATION_PREF, this.originalAnimate);
> -    Services.prefs.setIntPref(PROCESS_COUNT_PREF, this.originalProcessCount);
> +    Services.prefs.clearUserPref(MULTI_OPT_OUT_PREF);

I wonder if this line does what it should. But it seems like your patch turned off e10s-multi for cpstartup as well. I wonder if this line is the reason for that side effect or what happened there exactly...

https://treeherder.mozilla.org/perf.html#/comparechooser?newProject=try&newRevision=8a3b8491838e394e494fcb583b18ba80d4794d8f
Comment on attachment 8915764 [details]
Bug 1405960 - Force e10s-multi off in a cross-branch way.

https://reviewboard.mozilla.org/r/186980/#review192322

Assuming this isn't having some kind of leak-over effect with cpstartup, this looks good to me. TIL about dom.ipc.multiOptOut! Thanks, mrbkap.
Attachment #8915764 - Flags: review?(mconley) → review+
(In reply to Mike Conley (:mconley) (:⚙️) - Backlogged on reviews from comment #30)
> Comment on attachment 8915764 [details]
> Bug 1405960 - Force e10s-multi off in a cross-branch way.
> 
> https://reviewboard.mozilla.org/r/186980/#review192322
> 
> Assuming this isn't having some kind of leak-over effect with cpstartup,
> this looks good to me. TIL about dom.ipc.multiOptOut! Thanks, mrbkap.

I'll try to figure this out on Monday; I want to understand what causes that side effect in cpstartup. I've talked about this during the cross-functional meeting and the consensus was that it's OK to delay the landing by a couple of days. Thanks Mike for the review, and sorry Blake that I'm holding this up a little bit longer...
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #31)
> I'll try to figure this out on Monday, I want to understand what causes that
> side effect in cpstartup. And I've talked about this during the
> cross-functional meeting and the consensus was that it's OK to delay the
> landing with a couple of days. Thanks Mike for the review, and sorry Blake
> that I'm holding this up a little bit longer...

OK. This seems like a different issue: https://treeherder.mozilla.org/perf.html#/graphs?series=try,1558970,1,1&highlightedRevisions=8a3b8491838e

I'll investigate this issue in a different bug and will land this patch as it is.
Pushed by gkrizsanits@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ca1126dea7c8
Force e10s-multi off in a cross-branch way. r=mconley
https://hg.mozilla.org/mozilla-central/rev/ca1126dea7c8
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 58
Comment on attachment 8915764 [details]
Bug 1405960 - Force e10s-multi off in a cross-branch way.

Approval Request Comment
[Feature/Bug causing the regression]: n/a
[User impact if declined]: None, this only affects Talos numbers.
[Is this code covered by automated tests?]: Yes.
[Has the fix been verified in Nightly?]: n/a (this fix should have no effect on Nightly, though the Talos numbers on Nightly not changing should give us confidence that this fix will work on Beta).
[Needs manual test from QE? If yes, steps to reproduce]: n/a
[List of other uplifts needed for the feature/fix]: n/a
[Is the change risky?]: No.
[Why is the change risky/not risky?]: It only affects a Talos extension to use a different method to disable e10s-multi.
[String changes made/needed]: n/a
Attachment #8915764 - Flags: approval-mozilla-beta?
Comment on attachment 8915764 [details]
Bug 1405960 - Force e10s-multi off in a cross-branch way.

Hopefully will fix the talos issue, taking it.
Should be in 57b8
Attachment #8915764 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
I'll push this to Beta by itself so we can better isolate any Talos changes from it.
Flags: needinfo?(ryanvm)
I don't see improvements for any of the regressions. Shouldn't we reopen this bug?
Flags: needinfo?(sledru)
Flags: needinfo?(ryanvm)
Flags: needinfo?(mrbkap)
Yes, we should.
Blake, any idea what is going on?
Thanks
Flags: needinfo?(sledru)
Flags: needinfo?(ryanvm)
Most likely it's because on beta we still have to update the .xpi file as well, which might be difficult since it needs signing, and I'm not sure that's simple any more.

Joel, is it easier to update a Talos .xpi on beta or to uplift the change that deprecates .xpi files in Talos tests?
Flags: needinfo?(jmaher)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Good call, that was done in bug 1402426; here is the revision:
https://hg.mozilla.org/mozilla-central/rev/da9caebfed41

I am fairly confident we can uplift this without problems. :RyanVM, can you do the honors?
Flags: needinfo?(jmaher) → needinfo?(ryanvm)
Flags: needinfo?(ryanvm)
Flags: needinfo?(mrbkap)
After uplifting bug 1402426, *almost* all of the tabpaint regressions got fixed.

Unfortunately, the other regressions remained the same, and the Windows 7 tabpaint regression was only fixed halfway.
I am afraid we need to further investigate this bug.
(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #43)
> After uplifting bug 1402426, *almost* all of the tabpaint regressions got
> fixed.
> 
> Unfortunately, the other regressions remained the same and Windows 7
> tabpaint fixed only halfway.
> I am afraid we need to further investigate this bug.

See comment 28 about the other regressions. The Windows 7 halfway fix sounds like a problem... https://treeherder.mozilla.org/perf.html#/graphs?timerange=2592000&series=mozilla-beta,1482675,1,1&series=mozilla-beta,1504501,1,1&series=mozilla-beta,1534697,1,1 that seems like a new regression that sneaked in... Although it looks a lot less critical than the one reported in this bug, I think we should investigate it in a separate bug.
I'm re-closing this based on comment 44.
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Comment 43 follow up:

== Change summary for alert #10152 (as of October 19 2017 13:23 UTC) ==

Improvements:

 69%  tabpaint summary osx-10-10 opt e10s     213.54 -> 66.03
 69%  tabpaint summary osx-10-10 opt e10s stylo_disabled  210.99 -> 66.24
 55%  tabpaint summary linux64 opt e10s       159.88 -> 71.41

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=10152
And these are the improvements noticed for Windows:

== Change summary for alert #10152 (as of October 19 2017 13:23 UTC) ==

Improvements:

 56%  tabpaint summary windows10-64 opt e10s  148.05 -> 64.54
 51%  tabpaint summary windows7-32 opt e10s   131.37 -> 64.56

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=10152

As I said, they're partial, as we should have expected:

~57 for tabpaint summary windows10-64 opt e10s                 
~56 for tabpaint summary windows7-32 opt e10s

(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #44)
> See comment 28 about the other regressions. The windows 7 halfway fix sounds
> like a problem...
> https://treeherder.mozilla.org/perf.html#/
> graphs?timerange=2592000&series=mozilla-beta,1482675,1,1&series=mozilla-beta,
> 1504501,1,1&series=mozilla-beta,1534697,1,1 that seems like a new regression
> that sneaked in... Although it looks a lot less critical then the one
> reported in this bug, I think we should investigate it in a separate bug.
It doesn't actually look like a halfway fix, but I will look for the regression that sneaked in.
Thanks a lot for your help on resolving this bug!