Closed Bug 1390325 Opened 2 years ago Closed 2 years ago

9.35% tp5n nonmain_startup_fileio (windows7-32) regression on push 87a67b8573623763cc882454fba1b9dc06317839 (Mon Aug 14 2017)

Categories

(Firefox :: Normandy Client, defect)


Tracking


RESOLVED FIXED
Tracking Status
firefox55 --- unaffected
firefox56 --- fixed
firefox57 --- fixed

People

(Reporter: jmaher, Assigned: osmose)

References

Details

(Keywords: perf, regression, talos-regression)

Talos has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?changeset=87a67b8573623763cc882454fba1b9dc06317839

You are the author of one of the patches included in that push, and we need your help to address this regression.

Regressions:

  9%  tp5n nonmain_startup_fileio windows7-32 pgo e10s     1,524,630.00 -> 1,667,109.58


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=8774

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
This is an increase in fileIO of about 140K during startup, but not on the main thread.  I see we updated some Shield code from GitHub to in-tree; is this expected?
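The "about 140K" figure can be sanity-checked against the before/after numbers in the alert summary above (a sketch; whether the counter represents bytes or events is clarified later in this bug):

```python
# Sanity-check the regression delta from the tp5n nonmain_startup_fileio alert.
before = 1_524_630.00
after = 1_667_109.58

delta = after - before            # absolute increase in the IO counter
pct = delta / before * 100        # relative increase

print(f"delta = {delta:,.2f}")    # ~142,480, i.e. the "about 140K" in question
print(f"pct   = {pct:.2f}%")      # ~9.35%, matching the bug title
```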
Flags: needinfo?(mcooper)
Component: Untriaged → Add-on
Product: Firefox → Shield
Version: 53 Branch → unspecified
I see the root cause is backed out, and we have an improvement:
== Change summary for alert #8782 (as of August 15 2017 00:00 UTC) ==

Improvements:

  8%  tp5n nonmain_startup_fileio windows7-32 pgo e10s     1,657,735.42 -> 1,531,120.33

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=8782
Is there some way to get a list of files or whatever that actually indicates what IO this was? Off-hand, 140k seems like way too much, whatever we're doing, but then again, I also don't know what it's measuring...
Flags: needinfo?(jmaher)
This is likely due to our new about: page, which runs off-main-thread. We are pulling in React to do the page rendering, which is about 140K. I'm not entirely sure why it is loading immediately instead of when the page is first loaded though.

How serious of a regression is this? This push is related to some time-sensitive things in Shield. If this is very important, I can work on lazily loading these assets as soon as I fix the test failures that got this patch backed out in the first place.
Flags: needinfo?(mcooper)
This didn't show any regressions in ts_paint or other startup measurements.  Can we get a bug on file to fix this and lazily load the React assets?
Flags: needinfo?(jmaher)
Assignee: nobody → mkelly
jmaher: I'd like to run these tests locally to help test out potential fixes, but I can't find any working links to tp5n.zip, which the wiki docs say I need. Would you be able to provide a working link?

Also, if you know the mach command to run the particular test that's failing I'd be very grateful to know it as well. :D
Flags: needinfo?(jmaher)
This data is generated from the xperf test, which only runs on Windows 7, and you need the Windows Performance Toolkit installed in the right path for it to work.  Given that, we don't have a supported mach command to run this; the closest would be |./mach talos-test -a xperf-e10s|, but I can pretty much guarantee that won't work for you locally.

Now that we don't have people.mozilla.org anymore, I need to find a new place to store the tp5n.zip file; thanks for reminding me.
Flags: needinfo?(jmaher)
Duplicate of this bug: 1391684
Did some further testing today (at rhelmer's suggestion):

- Added a 5 MiB file of random data to the add-on to test whether the entire XPI was being loaded at once and causing the startup file IO: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ca3dd0cc1062f7d92ff49f2104eff6a27eda68f9&selectedJob=124220719

  From what I can tell, the XPI files are correctly being only partially read, and just adding a file that isn't used during startup has no effect on the xperf tests.

- Took the current mozilla-central version of Shield and made the bootstrap.js file do nothing at all (no imports, empty startup/shutdown/etc): https://treeherder.mozilla.org/#/jobs?repo=try&revision=4c66e00fd7bb94d6f25f3d75dde0f7b25f2d7ce3&selectedJob=124285634

  Here's the comparison link I used for comparing it to mozilla-central: https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=4c66e00fd7bb94d6f25f3d75dde0f7b25f2d7ce3&framework=1&filter=fileio&showOnlyImportant=0&selectedTimeRange=172800

  If I'm doing this right, it looks like making the add-on do nothing has almost no effect on fileio in any case except main_normal_fileio, which sees a 21% drop. That implies that on startup, Shield isn't doing a lot of fileio anyway, and we are doing most of our fileio after startup.

I can't actually think of anything that would explain these two results both being true except that my method of measurement is wrong in some way.
After further reflection and help from mhowell, I think I've got a convincing explanation of where this came from, pending one question:

jmaher: What are these xperf tests measuring? The Talos wiki page says it's measuring "IO counters". I read through etlparser.py, and it looks like we log both the bytes read and the number of read-event rows in the xperf logs.

If it's write events, I'm still kind of stumped, because 140k write events seems like a HUGE amount of new writes to add regardless of what we're doing. If it's 140k bytes being read, however, then it makes more sense.

In order to add a new about: page, we create a frame script for each open tab and a process script for each process, which loads a JSM file. This would add ~10-15kb of data read for each content process. During my tests on a Windows VM and other Windows machines, I was observing between 4-6 content processes being created during startup, which would be in the area of ~90kb of extra data read. That seems in the ballpark for this regression, depending on exactly how the xperf tests are running and whether there are extra JSM imports happening that I'm missing.

As far as I know, there's not really a way around this, as it's just a requirement for registering a new about: page that runs in the content process from a system add-on. Normal about: pages don't see as much of a hit because they register themselves in a different way that doesn't require loading code into each content process on startup. This may point to a perf improvement to be made in the about: pages system.
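The back-of-envelope estimate above works out like this (a sketch; the per-process size and process counts are the estimates from this comment, not measurements):

```python
# Rough bounds on extra startup reads from loading the process/frame
# script JSM into every content process (figures estimated above).
per_process_kb = (10, 15)   # ~10-15 kB of JSM data read per content process
process_count = (4, 6)      # 4-6 content processes observed during startup

low = per_process_kb[0] * process_count[0]    # best case: 40 kB
high = per_process_kb[1] * process_count[1]   # worst case: 90 kB

print(f"extra read: ~{low}-{high} kB")
```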

Gijs: Does this sound right to you, based on your review of the about:studies patch?
Flags: needinfo?(jmaher)
Flags: needinfo?(gijskruitbosch+bugs)
(In reply to Michael Kelly [:mkelly,:Osmose] from comment #10)
> As far as I know there's not really a way around this

Actually, I bet I can at least influence the number by moving some of the parent-only code to a separate JSM. That'll also help prove out my theory.
(In reply to Michael Kelly [:mkelly,:Osmose] from comment #9)
> - Took the current mozilla-central version of Shield and made the
> bootstrap.js file do nothing at all (no imports, empty
> startup/shutdown/etc):
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=4c66e00fd7bb94d6f25f3d75dde0f7b25f2d7ce3&selectedJob=124285634
> 
>   Here's the comparison link I used for comparing it to mozilla-central:
> https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=4c66e00fd7bb94d6f25f3d75dde0f7b25f2d7ce3&framework=1&filter=fileio&showOnlyImportant=0&selectedTimeRange=172800
> 
>   If I'm doing this right, it looks like making the add-on do nothing has
> almost no effect on fileio in any case except main_normal_fileio, which sees
> a 21% drop. That implies that on startup, Shield isn't doing a lot of fileio
> anyway, and we are doing most of our fileio after startup.

For reasons unknown to me, that comparison link is now showing a comparable decrease in nonmain startup fileio, which is what I originally expected.
(In reply to Michael Kelly [:mkelly,:Osmose] from comment #11)
> (In reply to Michael Kelly [:mkelly,:Osmose] from comment #10)
> > As far as I know there's not really a way around this
> 
> Actually, I bet I can at least influence the number by moving some of the
> parent-only code to a separate JSM. That'll also help prove out my theory.

Wrote a patch that removed all the unnecessary code being imported into the content process scripts, here's the comparison to mozilla-central: https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=9a004d1de59806dfd30c36adb1c2537fba1f0b11&framework=1&filter=fileio&showOnlyImportant=0&selectedTimeRange=172800

As of this posting it's showing a 1% increase in nonmain startup fileio with very low confidence, so it probably didn't have much of an effect. Hrm.
(In reply to Michael Kelly [:mkelly,:Osmose] from comment #12)
> >   If I'm doing this right, it looks like making the add-on do nothing has
> > almost no effect on fileio in any case except main_normal_fileio, which sees
> > a 21% drop. That implies that on startup, Shield isn't doing a lot of fileio
> > anyway, and we are doing most of our fileio after startup.
> 
> For reasons unknown to me, that comparison link is now showing a comparable
> decrease in nonmain startup fileio, which is what I originally expected.

I mean, we do a lot of other things besides about:studies in this patch - AFAICT we also added the IndexedDB studies store, and we'll now read that on startup (because we call AddonStudies.init()). For all I know, that could explain it? Do we need to do that (that early) on startup? Is the amount of IO it's doing reasonable?

Basically, given that the about:studies change didn't help and the original "make bootstrap do nothing" change did, I think the easiest thing to do (if we still can't run the test pages locally, which would make this all go a lot quicker - jmaher?) would be to bisect which part of initialization is causing the extra IO, by effectively bisecting the contents of bootstrap: either by splitting up the merge so we push parts of it to try individually, or by pushing m-c to try with different parts of initialization commented out.


/me chalks one up for syncing github things to m-c more frequently so this is less needle-in-haystack-style work. :-(
Flags: needinfo?(gijskruitbosch+bugs)
I would be happy to do some debugging on try server and figure out which files are being read differently; it does take some hacking. Is there a patch I can push to show the regression?

As a note, we measure IO, which is read + write.
Flags: needinfo?(jmaher)
This is now on mozilla-beta.
Blocks: 1373320
FYI, the PR for bug 1371350 should address at least part of this; I measured a 4-5% decrease in startup fileio. The PR is currently under review: https://github.com/mozilla/normandy/pull/1024

After that lands we can evaluate the impact it has on this regression and whether we need more work or can accept whatever remaining regression exists.
Bug 1397415 comment 9 shows a 7% drop in startup fileio from the PR I mentioned above. Is that enough to call this fixed?
Flags: needinfo?(jmaher)
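As a rough check on whether that 7% closes the gap (a sketch; it assumes the drop applies to the regressed value from the original alert, which may not match exactly how Perfherder computed the 7%):

```python
# Estimate the residual regression after a 7% startup fileio drop.
baseline = 1_524_630.00      # pre-regression value from the original alert
regressed = 1_667_109.58     # post-regression value

after_fix = regressed * (1 - 0.07)          # apply the measured 7% drop
residual_pct = (after_fix - baseline) / baseline * 100

print(f"after fix: {after_fix:,.0f}")       # ~1,550,412
print(f"residual:  {residual_pct:.1f}%")    # ~1.7% above the old baseline
```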
Status: NEW → RESOLVED
Closed: 2 years ago
Flags: needinfo?(jmaher)
Resolution: --- → FIXED
Product: Shield → Firefox