Closed Bug 1390325 Opened 2 years ago Closed 2 years ago

9.35% tp5n nonmain_startup_fileio (windows7-32) regression on push 87a67b8573623763cc882454fba1b9dc06317839 (Mon Aug 14 2017)

Categories

(Firefox :: Normandy Client, defect)


Tracking


RESOLVED FIXED
Tracking Status
firefox55 --- unaffected
firefox56 --- fixed
firefox57 --- fixed

People

(Reporter: jmaher, Assigned: osmose)

References

Details

(Keywords: perf, regression, talos-regression)

Talos has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?changeset=87a67b8573623763cc882454fba1b9dc06317839

You are the author of one of the patches included in that push, and we need your help to address this regression.

Regressions:

  9%  tp5n nonmain_startup_fileio windows7-32 pgo e10s     1,524,630.00 -> 1,667,109.58


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=8774

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
This is an increase in fileIO of about 140K during startup, but not on the main thread.  I see we updated some Shield code from GitHub to in-tree; is this expected?
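The "about 140K" figure can be sanity-checked against the before/after numbers in the alert summary above (a sketch; whether the counter represents bytes or events is clarified later in this bug):

```python
# Sanity-check the regression delta from the tp5n nonmain_startup_fileio alert.
before = 1_524_630.00
after = 1_667_109.58

delta = after - before            # absolute increase in the IO counter
pct = delta / before * 100        # relative increase

print(f"delta = {delta:,.2f}")    # ~142,480, i.e. the "about 140K" in question
print(f"pct   = {pct:.2f}%")      # ~9.35%, matching the bug title
```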
Flags: needinfo?(mcooper)
Component: Untriaged → Add-on
Product: Firefox → Shield
Version: 53 Branch → unspecified
I see the root cause is backed out, and we have an improvement:
== Change summary for alert #8782 (as of August 15 2017 00:00 UTC) ==

Improvements:

  8%  tp5n nonmain_startup_fileio windows7-32 pgo e10s     1,657,735.42 -> 1,531,120.33

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=8782
Is there some way to get a list of files or whatever that actually indicates what IO this was? Off-hand, 140k seems like way too much, whatever we're doing, but then again, I also don't know what it's measuring...
Flags: needinfo?(jmaher)
This is likely due to our new about: page, which runs off-main-thread. We are pulling in React to do the page rendering, which is about 140K. I'm not entirely sure why it is loading immediately instead of when the page is first loaded though.

How serious of a regression is this? This push is related to some time-sensitive things in Shield. If this is very important, I can work on lazily loading these assets as soon as I fix the test failures that got this patch backed out in the first place.
Flags: needinfo?(mcooper)
This didn't show any regressions in ts_paint or other startup measurements.  Can we get a bug on file to fix this and lazily load the React assets?
Flags: needinfo?(jmaher)
Assignee: nobody → mkelly
jmaher: I'd like to run these tests locally to help test out potential fixes, but I can't find any working links to tp5n.zip, which the wiki docs say I need. Would you be able to provide a working link?

Also, if you know the mach command to run the particular test that's failing I'd be very grateful to know it as well. :D
Flags: needinfo?(jmaher)
This data is generated from the xperf test, which only runs on Windows 7, and you need the Windows Performance Toolkit installed in the right path for it to work.  Given that, we don't have a supported mach command to run this; the closest would be |./mach talos-test -a xperf-e10s|, but I can pretty much guarantee that won't work for you locally.

Now that we don't have people.mozilla.org anymore, I need to find a new place to store the tp5n.zip file; thanks for reminding me.
Flags: needinfo?(jmaher)
Duplicate of this bug: 1391684
Did some further testing today (at rhelmer's suggestion):

- Added a 5 MiB file of random data to the add-on to test whether the entire XPI was being loaded at once and causing the startup file IO: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ca3dd0cc1062f7d92ff49f2104eff6a27eda68f9&selectedJob=124220719

  From what I can tell, the XPI files are correctly being only partially read, and just adding a file that isn't used during startup has no effect on the xperf tests.

- Took the current mozilla-central version of Shield and made the bootstrap.js file do nothing at all (no imports, empty startup/shutdown/etc): https://treeherder.mozilla.org/#/jobs?repo=try&revision=4c66e00fd7bb94d6f25f3d75dde0f7b25f2d7ce3&selectedJob=124285634

  Here's the comparison link I used for comparing it to mozilla-central: https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=4c66e00fd7bb94d6f25f3d75dde0f7b25f2d7ce3&framework=1&filter=fileio&showOnlyImportant=0&selectedTimeRange=172800

  If I'm doing this right, it looks like making the add-on do nothing has almost no effect on fileio in any case except main_normal_fileio, which sees a 21% drop. That implies that on startup, Shield isn't doing a lot of fileio anyway, and we are doing most of our fileio after startup.

I can't actually think of anything that would explain these two results both being true except that my method of measurement is wrong in some way.
After further reflection and help from mhowell, I think I've got a convincing explanation of where this came from, pending one question:

jmaher: What are these xperf tests measuring? The Talos wiki page says it's measuring "IO counters". I read through etlparser.py, and it looks like we log both the bytes read and the number of read-event rows in the xperf logs.

If it's write events, I'm still kind of stumped, because 140k write events seems like a HUGE amount of new writes to add regardless of what we're doing. If it's 140k bytes being read, however, then it makes more sense.

In order to add a new about: page, we create a frame script for each open tab and a process script for each process, which loads a JSM file. This would add ~10-15kb of data read for each content process. During my tests on a Windows VM and other Windows machines, I was observing between 4-6 content processes being created during startup, which would be in the area of ~90kb of extra data read. That seems in the ballpark for this regression, depending on exactly how the xperf tests are running and whether there are extra JSM imports happening that I'm missing.

As far as I know, there's not really a way around this, as it's just a requirement for registering a new about: page that runs in the content process from a system add-on. Normal about: pages don't see as much of a hit because they register themselves in a different way that doesn't require loading code into each content process on startup. This may point to a perf improvement to be made in the about: pages system.
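The back-of-envelope estimate above works out like this (a sketch; the per-process size and process counts are the estimates from this comment, not measurements):

```python
# Rough bounds on extra startup reads from loading the process/frame
# script JSM into every content process (figures estimated above).
per_process_kb = (10, 15)   # ~10-15 kB of JSM data read per content process
process_count = (4, 6)      # 4-6 content processes observed during startup

low = per_process_kb[0] * process_count[0]    # best case: 40 kB
high = per_process_kb[1] * process_count[1]   # worst case: 90 kB

print(f"extra read: ~{low}-{high} kB")
```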

Gijs: Does this sound right to you, based on your review of the about:studies patch?
Flags: needinfo?(jmaher)
Flags: needinfo?(gijskruitbosch+bugs)
(In reply to Michael Kelly [:mkelly,:Osmose] from comment #10)
> As far as I know there's not really a way around this

Actually, I bet I can at least influence the number by moving some of the parent-only code to a separate JSM. That'll also help prove out my theory.
(In reply to Michael Kelly [:mkelly,:Osmose] from comment #9)
> - Took the current mozilla-central version of Shield and made the
> bootstrap.js file do nothing at all (no imports, empty
> startup/shutdown/etc):
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=4c66e00fd7bb94d6f25f3d75dde0f7b25f2d7ce3&selectedJob=124285634
> 
>   Here's the comparison link I used for comparing it to mozilla-central:
> https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=4c66e00fd7bb94d6f25f3d75dde0f7b25f2d7ce3&framework=1&filter=fileio&showOnlyImportant=0&selectedTimeRange=172800
> 
>   If I'm doing this right, it looks like making the add-on do nothing has
> almost no effect on fileio in any case except main_normal_fileio, which sees
> a 21% drop. That implies that on startup, Shield isn't doing a lot of fileio
> anyway, and we are doing most of our fileio after startup.

For reasons unknown to me, that comparison link is now showing a comparable decrease in nonmain startup fileio, which is what I originally expected.
(In reply to Michael Kelly [:mkelly,:Osmose] from comment #11)
> (In reply to Michael Kelly [:mkelly,:Osmose] from comment #10)
> > As far as I know there's not really a way around this
> 
> Actually, I bet I can at least influence the number by moving some of the
> parent-only code to a separate JSM. That'll also help prove out my theory.

Wrote a patch that removed all the unnecessary code being imported into the content process scripts, here's the comparison to mozilla-central: https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=9a004d1de59806dfd30c36adb1c2537fba1f0b11&framework=1&filter=fileio&showOnlyImportant=0&selectedTimeRange=172800

As of this posting it's showing a 1% increase in nonmain startup fileio with very low confidence, so it probably didn't have much of an effect. Hrm.
(In reply to Michael Kelly [:mkelly,:Osmose] from comment #12)
> >   If I'm doing this right, it looks like making the add-on do nothing has
> > almost no effect on fileio in any case except main_normal_fileio, which sees
> > a 21% drop. That implies that on startup, Shield isn't doing a lot of fileio
> > anyway, and we are doing most of our fileio after startup.
> 
> For reasons unknown to me, that comparison link is now showing a comparable
> decrease in nonmain startup fileio, which is what I originally expected.

I mean, we do a lot of other things besides about:studies in this patch - AFAICT we also added the IndexedDB studies store, and we'll now read that on startup (because we call AddonStudies.init()). For all I know, that could explain it? Do we need to do that (that early) on startup? Is the amount of IO it's doing reasonable?

Basically, given that the about:studies change didn't help and the original "make bootstrap do nothing" change did, I think the easiest thing to do (if we still can't run the test pages locally, which would make this all go a lot quicker - jmaher?) would be to bisect which part of initialization is causing the extra IO, by effectively bisecting the contents of bootstrap: either by splitting up the merge so we push parts of it to try individually, or by pushing m-c to try with different parts of initialization commented out.


/me chalks one up for syncing github things to m-c more frequently so this is less needle-in-haystack-style work. :-(
Flags: needinfo?(gijskruitbosch+bugs)
I would be happy to do some debugging on try server and figure out which files are being read differently; it does take some hacking. Is there a patch I can push to show the regression?

As a note, we measure IO, which is read + write.
Flags: needinfo?(jmaher)
This is now on mozilla-beta.
Blocks: 1373320
FYI, the PR for bug 1371350 should address at least part of this; I measured a 4-5% decrease in startup fileio. The PR is currently under review: https://github.com/mozilla/normandy/pull/1024

After that lands we can evaluate the impact it has on this regression and whether we need more work or can accept whatever remaining regression exists.
Bug 1397415 comment 9 shows a 7% drop in startup fileio from the PR I mentioned above. Is that enough to call this fixed?
Flags: needinfo?(jmaher)
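As a rough check on whether that 7% closes the gap (a sketch; it assumes the drop applies to the regressed value from the original alert, which may not match exactly how Perfherder computed the 7%):

```python
# Estimate the residual regression after a 7% startup fileio drop.
baseline = 1_524_630.00      # pre-regression value from the original alert
regressed = 1_667_109.58     # post-regression value

after_fix = regressed * (1 - 0.07)          # apply the measured 7% drop
residual_pct = (after_fix - baseline) / baseline * 100

print(f"after fix: {after_fix:,.0f}")       # ~1,550,412
print(f"residual:  {residual_pct:.1f}%")    # ~1.7% above the old baseline
```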
Status: NEW → RESOLVED
Closed: 2 years ago
Flags: needinfo?(jmaher)
Resolution: --- → FIXED
Product: Shield → Firefox