Closed Bug 1510568 Opened 6 years ago Closed 6 years ago

20.4 - 20.44% tp5n nonmain_startup_fileio (windows7-32) regression on push d05c6310d573658af4a4f2cee332bb01732fcabf (Tue Nov 27 2018)

Categories

(Firefox :: Tabbed Browser, defect)

x86
Windows 7
defect
Not set
normal

Tracking

()

RESOLVED WONTFIX

People

(Reporter: igoldan, Unassigned)

References

Details

(Keywords: perf, regression, talos-regression)

Talos has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=26598a115edd552ce42fe28561d7b71158f06298&tochange=d05c6310d573658af4a4f2cee332bb01732fcabf

As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

 20%  tp5n nonmain_startup_fileio windows7-32 pgo e10s stylo     2,142,473.04 -> 2,580,344.92
 20%  tp5n nonmain_startup_fileio windows7-32 opt e10s stylo     2,140,595.42 -> 2,577,274.17


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=17848

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Performance_sheriffing/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Performance_sheriffing/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Performance_sheriffing/Talos/RegressionBugsHandling
Component: General → Tabbed Browser
Product: Testing → Firefox
Flags: needinfo?(gijskruitbosch+bugs)
I'll post the Gecko profiles soon.
This is just the inverse of the "improvements" seen in bug 1503796 from landing bug 1472212.

It also seems this is sparingly bimodal, note the lower-again autoland value on https://treeherder.mozilla.org/perf.html#/graphs?timerange=1209600&series=autoland,1649799,1&series=mozilla-inbound,ca79aab9e2b34999b2ce07f5b5a59a2306c96d3e,0 .

Given that these are Windows-only, I suspect it's to do with timings of when we start the additional content process. If there was a significant problem I'd expect it to also show up on other platforms.

I don't think the profiles help here because this basically boils down to "did we try to load about:newtab and/or the privileged content process at some point", and looking for that happening in the 100 before/after profiles is a bit of a needle-in-haystack, where we're not 100% sure what the needle looks like. I'm a bit surprised to find this affecting tp5n at all, but if I had to guess it's additional IO in the privileged content process that we sometimes don't get because the run completes before we get to the point where that IO happens.

So I'm inclined to say we should just accept this. Going to ask for 2 needinfos:

1) Mike, does the above sound OK to you, ie can you sanity check me?
2) Ionuț, where can I find an overview of the before/after IO that's being reported here? That is, a list of files and sizes of reads/writes, ideally in a diff(able) format, ie so I can see the difference in what is being read/written? That'd help solidify my analysis.
Flags: needinfo?(mconley)
Flags: needinfo?(igoldan)
Flags: needinfo?(gijskruitbosch+bugs)
(In reply to :Gijs (he/him) from comment #3)
> I'm a bit surprised to find this affecting tp5n at all, but if I
> had to guess it's additional IO in the privileged content process that we
> sometimes don't get because the run completes before we get to the point
> where that IO happens.
> 

The xperf stuff we've got is pretty mysterious and not very well documented, so it took a little while to piece this together.

If I'm reading this right, the nonmain_startup_fileio reading specifically measures off-main-thread IO occurring in the _parent_ process before the sessionstore-windows-restored notification fires.

My guess is that with your patch, the Privileged Content process is starting sooner, which means that Activity Stream runs sooner, which means that it asks the parent process to query Places for Activity Stream data sooner, which I suspect is where the OMT IO is coming from. I suspect it's bi-modal because _sometimes_ Activity Stream is loading a little later, or the IO thread is a little bit busy doing something else, and so the IO leaks out of the "startup" stage into the "normal" stage.

I suspect we're not really noticing the change in the normal stage because it's so noisy in there, and the amount of IO in that stage is so much that the amount that we took out of there was just a drop in the bucket.

So I recommend:

1. We take this regression with the understanding that we're likely just moving OMT IO from one stage to another.
2. We look at bug 1497996 as an opportunity to avoid Activity Stream causing SQLite queries early during start-up.
Flags: needinfo?(mconley)
Per comment 3/4, I think we can just close this. Thanks!
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(igoldan)
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.