Closed Bug 1268508 Opened 8 years ago Closed 8 years ago

3.12% sessionrestore/ts_paint (linux64 pgo only) regression on push 86e0ea42f55c (Mon Apr 25 2016)

Categories

(Firefox :: Pocket, defect)

47 Branch
defect
Not set
normal

Tracking

()

RESOLVED WONTFIX

People

(Reporter: jmaher, Unassigned)

References

Details

(Keywords: perf, regression, Whiteboard: [talos_regression])

Talos has detected a Firefox performance regression from push 86e0ea42f55c. As author of one of the patches included in that push, we need your help to address this regression.

A link to the push:
https://hg.mozilla.org/integration/fx-team/pushloghtml?fromchange=14952428292c90aa00c28fdfb1f1e79bd8257ce2&tochange=86e0ea42f55c34fc90dd9e32439783a9929f28f4

This is a list of all known regressions and improvements related to the push:
https://treeherder.mozilla.org/perf.html#/alerts?id=1038

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see:

https://wiki.mozilla.org/Buildbot/Talos/Tests#sessionrestore.2Fsessionrestore_no_auto_restore

Reproducing and debugging the regression:
If you would like to re-run this Talos test on a potential fix, use try with the following syntax:
try: -b o -p linux64 -u none -t other --rebuild 5  # add "mozharness: --spsProfile" to generate profile data

(we suggest --rebuild 5 to be more confident in the results)

To run the test locally and do a more in-depth investigation, first set up a local Talos environment:
https://wiki.mozilla.lorg/Buildbot/Talos/Running#Running_locally_-_Source_Code

Then run the following command from the directory where you set up Talos:
talos --develop -e [path]/firefox -a sessionrestore

Making a decision:
As the patch author we need your feedback to help us handle this regression.
*** Please let us know your plans by Monday, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations:
https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
:mixedpuppy, thanks for landing the code to fix the regression- this seemed to have caused a pgo (what we ship) only regression on linux64.  Can you look into ways to fix this performance regression or help explain what this code does and why it is causing a regression.
Flags: needinfo?(mixedpuppy)
oh, a new alert came in, ts_paint.  probably something in this code is causing startup regressions- I am not sure if we can offload that somehow.
Summary: 3.12% sessionrestore (linux64 pgo only) regression on push 86e0ea42f55c (Mon Apr 25 2016) → 3.12% sessionrestore/ts_paint (linux64 pgo only) regression on push 86e0ea42f55c (Mon Apr 25 2016)
If I had to guess, this is probably happening because we're force-loading CUI "early" in the particular circumstances of the x64 linux build.

We can workaround by tiptoeing around the issue: if the module is not loaded, CUI.windows is going to be empty anyway. So wrapping this in Cu.isModuleLoaded("resource:///modules/CustomizableUI.jsm") should get rid of it.

It'd be useful to get confirmation that that's the case from try, though.
hrm, this is weird.  

Gijs, I don't think checking the module load will work here (we already did things in CUI above the changed line, thus if we were causing CUI to load, we were doing so before this change).  

Since the prior iterator had failures that using CUI.windows fixes, we have to assume that merely adding 2 menuitems (sub-menu, not top level, not even visible on startup) causes the performance issues.
Flags: needinfo?(mixedpuppy)
(In reply to Shane Caraveo (:mixedpuppy) from comment #5)
> hrm, this is weird.  
> 
> Gijs, I don't think checking the module load will work here (we already did
> things in CUI above the changed line, thus if we were causing CUI to load,
> we were doing so before this change).  

Oh, d'oh, yes, you're right.

> Since the prior iterator had failures that using CUI.windows fixes, we have
> to assume that merely adding 2 menuitems (sub-menu, not top level, not even
> visible on startup) causes the performance issues.

that would be... surprising? Possible, I guess... but surprising.
So, what we do in onWindowOpened:

add a stylesheet
add 4 window properties (lazy getters)
add 2 submenu items that are not visible (CUI.getPlacementOfWidget is called here)

It feels like the most likely culprit is the spreadsheet, I wonder if it is being loaded after CUI adds the pocket button to the toolbar.  That could conceivably cause a repaint since the button icon is via the spreadsheet.  Maybe this change is affecting order here somehow.
Gijs, above try is the only obvious thing I can see as a cause of these regressions.  Thoughts?
Flags: needinfo?(gijskruitbosch+bugs)
(In reply to Shane Caraveo (:mixedpuppy) from comment #7)
> So, what we do in onWindowOpened:
> 
> add a stylesheet
> add 4 window properties (lazy getters)
> add 2 submenu items that are not visible (CUI.getPlacementOfWidget is called
> here)
> 
> It feels like the most likely culprit is the spreadsheet,

I'm assuming you mean the stylesheet... and sorry to hear you've been thinking about spreadsheets? :-)

> I wonder if it is
> being loaded after CUI adds the pocket button to the toolbar.  That could
> conceivably cause a repaint since the button icon is via the spreadsheet. 

Hm. Yes, that is possible.

> Maybe this change is affecting order here somehow.

So... let's think very carefully about this for a bit. The only change (relevant to startup) that we made was this loop at the bottom of bootstrap's startup.

We used to loop through all open windows, and now we're only looping through the ones CUI knows about. So if anything that loop is now doing less work. The windows are a subset of the ones we used to use rather than a superset.

If that loop is now doing less work, that means more work will happen once CUI fires onWindowOpened. But even that, looking at the code, should be happening before we stick the pocket button in the toolbar... so I don't really understand why that would have made a difference.

Unfortunately the patch seems to have not been pgo'd properly, or something. It's a bit of a faff trying to compare with fx-team because of the pgo-to-opt renaming that try does, so you have to compare the opt numbers from "new" with the pgo numbers from "old", but:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=fx-team&originalRevision=4dc48f3688f0&newProject=try&newRevision=a5b3b403b01c&framework=1&showOnlyImportant=0

shows 1222.33 for ts_paint on the opt non-e10s trypush, but 1046.67 for the pgo fx-team push. That makes no sense, because that's more like a 20% increase, much more than the 3% this bug is about... and the numbers are more in the right ballpark for a 'regular' opt build - but the trypush clearly has the mk_add_options mozconfig switch. Joel, can you check if that really worked correctly?

In the meantime, that push also has failing tests, so clearly something is not quite right...
Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(jmaher)
the build on try is pgo as far as I can tell (both the options, build times and what i see in the logs), why it is showing much different numbers than normal pgo and more like opt is really confusing.

I would recommend that since it seems as though all aspects of the changes have been scrutinized and we have attempted to fix them that we either push these fixes forward or ignore this.

If we want to look more at this on try- we could:
* push a blank changeset with just the pgo options
* compare that to the above one (assuming we have the same base revision)
Flags: needinfo?(jmaher)
The push in comment 12 is just a baseline fx-team pgo build.  in comment 13 is a fixed patch for the test failures in the previous try.

I did a ton of timing stuff yesterday, everything seems to always happen in the same order (with or without the patch that caused this bug) except that (prior to said patch) sometimes pocket initialized *earlier* than the window causing the failure to add the menu items.  Otherwise, the stylesheet was always added prior to the MozAfterPaint event and prior to CUI creating the widget.  

The only difference I could see is whether or not the menu items got added.  This patch moves adding the menu items to the browser-delayed-startup-finished observer.
oh, btw, the test failures was a typo which could have been the cause of the much faster times
unfortunately there is no difference here- I think both pushes showed the normal pgo range for ts_paint and sessionrestore, which is a good sign.  Possibly we don't have much we can do here?
one last try, reverting the changes from bug 1263599 to see if it stays in a normal range.
and this is showing a ts_paint and sessionrestore improvement- so there is something in the original changes, although it seems to be a mystery.
per irc with jmaher we'll wontfix this.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.