Closed Bug 1229731 Opened 8 years ago Closed 8 years ago

4% Win7 tp5n (startup file I/O bytes) regression on Fx-Team (v.45) on December 01, 2015 from push 29ce9059dc2c

Categories

(Hello (Loop) :: Client, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jmaher, Unassigned)

References

Details

(Keywords: perf, regression, Whiteboard: [talos_regression])

Attachments

(1 file)

I suspect this is expected, but I wanted to close the loop and officially document this. In the xperf suite, we measure file and network I/O with the xperf toolchain on Windows 7.

Both xperf and xperf-e10s have shown regressions:
http://alertmanager.allizom.org:8080/alerts.html?rev=a5ad0e73b4bf19c63eb157f4fd1bbaa98eb051d8&showAll=1&testIndex=0&platIndex=0

This isn't seen on non-PGO; it is only seen on PGO. Not sure why, probably an artifact of how we do PGO.

I strongly suspect this matches the bytes for the .xpi; I would like to confirm that and ensure that we intended for this to be on the main thread during startup.
Mark, can you take a look at this?
Flags: needinfo?(standard8)
I just had a look; on Windows, the size of the xpi in the application directory is 1,241,673 bytes in the latest nightly.

According to the emails, the increase in file I/O is 2,236,167 bytes.

That's a bit more than the xpi size, but I'm guessing that we might be registering multiple reads or something... it's a little hard to tell without logs.

I would expect a lot of this comes down to the fact that we've moved to the xpi; we haven't really changed anything that we do.

We might be loading style sheets a bit earlier as we're in bootstrap and not the browser window itself, but I can't see that we can do much about that, as we need that info here.

I'm going to tag Dave and see if he's got any concerns here as well.
Flags: needinfo?(standard8) → needinfo?(dtownsend)
Two things worth considering:

Are we compressing the XPI? That might save us file I/O and probably wouldn't impact perf much, but there is likely a tradeoff there (a quick way to check is sketched below).

How much did omni.ja shrink with the move? I'm surprised that we see so much additional file I/O when presumably we should have lost a bunch from omni.ja.
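
To check the first question, here is a quick sketch against the XPI (assuming it is a plain zip; the path is just an example from a local build layout, so adjust as needed):

import zipfile

# Example path to the system add-on XPI; point this at your own build.
XPI_PATH = r"C:\slave\test-pgo\build\application\firefox\browser\features\loop@mozilla.org.xpi"

with zipfile.ZipFile(XPI_PATH) as xpi:
    for info in xpi.infolist():
        # compress_type tells us whether an entry is stored or deflated.
        kind = "deflated" if info.compress_type == zipfile.ZIP_DEFLATED else "stored"
        print("%s: %s, %d -> %d bytes" % (info.filename, kind, info.file_size, info.compress_size))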
Flags: needinfo?(dtownsend)
In the logs we have a list of files and the bytes read (we also have writes, but I only did this for the reads).

old log: http://archive.mozilla.org/pub/firefox/tinderbox-builds/fx-team-win32-pgo/1449052255/fx-team_win7-ix_test-xperf-pgo-bm109-tests1-windows-build342.txt.gz

new log: http://archive.mozilla.org/pub/firefox/tinderbox-builds/fx-team-win32-pgo/1448814650/fx-team_win7-ix_test-xperf-pgo-bm109-tests1-windows-build338.txt.gz

I cleaned up the tmpdir names and sanitized the data into a spreadsheet:
https://docs.google.com/spreadsheets/d/1JFYUhcN8NGZAmZjcoGE5-Pw8uz6m9H9dXaokXok0Ues/edit?usp=sharing
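
For reference, the kind of per-file diff behind that spreadsheet, as a rough sketch assuming the sanitized data is reduced to simple "path,bytes" CSV rows (the raw talos log format is messier than this):

import csv
from collections import defaultdict

def read_totals(csv_path):
    # Sum bytes read per file from a sanitized "path,bytes" CSV (format assumed).
    totals = defaultdict(int)
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            if len(row) < 2:
                continue
            totals[row[0].strip()] += int(row[1])
    return totals

old = read_totals("old_reads.csv")   # hypothetical export of the old log
new = read_totals("new_reads.csv")   # hypothetical export of the new log
for path in sorted(set(old) | set(new)):
    delta = new.get(path, 0) - old.get(path, 0)
    if delta:
        print("%s, %+d bytes" % (path, delta))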

you can see these differences:
C:\slave\test-pgo\build\application\firefox\browser\features\loop@mozilla.org.xpi, 4980736 bytes
C:\slave\test-pgo\build\application\firefox\browser\omni.ja, -131072 bytes
C:\slave\test-pgo\build\application\firefox\nss3.dll, -131072 bytes

and sadly:
C:\slave\test-pgo\build\application\firefox\omni.ja remains the same at 4980736 bytes.

If these numbers look off, let's investigate more; it could be something odd on our machines or with the builds, and we can look at other data.
(In reply to Joel Maher (:jmaher) from comment #4)
> you can see these differences:
> C:\slave\test-pgo\build\application\firefox\browser\features\loop@mozilla.
> org.xpi, 4980736 bytes
> C:\slave\test-pgo\build\application\firefox\browser\omni.ja, -131072 bytes
> C:\slave\test-pgo\build\application\firefox\nss3.dll, -131072 bytes
> 
> and sadly:
> C:\slave\test-pgo\build\application\firefox\omni.ja remains the same at
> 4980736 bytes.

Hello is browser-specific, so I'd expect browser\omni.ja to be the one to shrink.

> if these numbers look off, lets investigate more- it could be something odd
> on our machines or with the builds- we can look at other data.

Is it odd that browser/omni.ja and nss3.dll dropped by exactly the same amount, and that the after numbers of omni.ja and loop@mozilla.org.xpi are exactly the same?

I wouldn't have expected reads of nss3.dll to drop.

If we believe the numbers, then it also means loop@mozilla.org.xpi was read in full 4 times (4980736 / 1241673 ≈ 4).
Mike, are xpis packed to the same compression standard as omni.ja?
Flags: needinfo?(mh+mozilla)
(In reply to Mark Banner (:standard8) from comment #6)
> Mike, are xpis packed to the same compression standard as omni.ja?

Yes.

Now, looking at the actual builds, this is what can be seen:
- before loop addon:
  browser/omni.ja 15304507

- with loop addon:
  browser/omni.ja 14101105
  browser/features/loop@mozilla.org.xpi 1241648

The sum of the above is 15342753, which is 38246 bytes more. That is not totally crazy; it might come from things like install.rdf and bootstrap.js.

Now, since those files come from PGO builds, they also have preloading information (which tells the jar reader in Gecko to read the first n bytes in the jar; you can read this yourself, as it's the first 4 bytes of the jar file, in little-endian format), which is the following:
- before loop addon:
  browser/omni.ja 1758427

- with loop addon:
  browser/omni.ja 1700114
  browser/features/loop@mozilla.org.xpi 1241626

The way the preloading information is computed is to record, when Firefox is started during the PGO profiling phase, what files are being accessed by Gecko's jar reader, and use that information when packaging.
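
If you want to read that preload header yourself, a minimal sketch (the paths are just examples; this assumes the optimized jar layout described above, with the size stored in the first 4 bytes, little-endian):

import struct

def read_preload_size(jar_path):
    # Return the preload size stored in the first 4 bytes (little-endian) of an optimized jar.
    with open(jar_path, "rb") as f:
        return struct.unpack("<I", f.read(4))[0]

for path in ("browser/omni.ja", "browser/features/loop@mozilla.org.xpi"):
    print(path, read_preload_size(path))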

If we are effectively reading the xpi 4 times, there are a few things that can come out of this:
- the xperf test does not care about what is in memory already. Re-reading a file, while suboptimal, is not so much of a problem.
- something (the addon manager?) is reading the entire xpi when Firefox starts

The difference for nss3.dll doesn't make much sense... except that the files have different sizes in the builds:
- before loop addon:
  nss3.dll 1722672

- with loop addon:
  nss3.dll 1700144

So yes, that could have that effect. As to why the file shrank, that could be due to a number of things, including PGO itself inducing variations in size for the same build, or there could have been some changes between the two PGO builds.
Flags: needinfo?(mh+mozilla)
(In reply to Mike Hommey [:glandium] from comment #7)
> - the xperf test does not care about what is in memory already. Re-reading a
> file, while suboptimal, is not so much of a problem.

Note this also means that the test could be mis-categorizing other things: a change that makes smaller parts of a file be read more often could be categorized as a regression even though the actual disk hit is smaller, and on the opposite end, a change that makes a file be read less often, but in bigger chunks, could be categorized as an improvement.
(In reply to Mike Hommey [:glandium] from comment #7)
> (In reply to Mark Banner (:standard8) from comment #6)
> > Mike, are xpis packed to the same compression standard as omni.ja?
> 
> Yes.
> 
> Now, looking at the actual builds, this is what can be seen:
> - before loop addon:
>   browser/omni.ja 15304507
> 
> - with loop addon:
>   browser/omni.ja 14101105
>   browser/features/loop@mozilla.org.xpi 1241648
> 
> The sum of the above is 15342753, which is 38246 bytes more, which is not
> totally crazy, it might come from things like install.rdf and bootstrap.js.

bootstrap.js alone is 31k so that seems reasonable.

> Now, since those files come from PGO builds, they also have preloading
> information (which tells the jar reader in gecko to read the first n bytes
> in the jar ; you can read this yourself, it's the 4 first bytes in the jar
> files, in little endian format), which is the following:
> - before loop addon:
>   browser/omni.ja 1758427
> 
> - with loop addon:
>   browser/omni.ja 1700114
>   browser/features/loop@mozilla.org.xpi 1241626
> 
> The way the preloading information is computed is to record, when Firefox is
> started during the PGO profiling phase, what files are being accessed by
> Gecko's jar reader, and use that information when packaging.

Does this profiling phase start from a clean profile or is it the second run of a profile? Same question for the file I/O check we're concerned with here?

> If we are effectively reading the xpi 4 times, there are few things that can
> come out of this:
> - the xperf test does not care about what is in memory already. Re-reading a
> file, while suboptimal, is not so much of a problem.
> - something (the addon manager?) is reading the entire xpi when Firefox
> starts

For a clean profile, the add-ons manager is going to trigger a signing check on the loop XPI, which means we're going to read the entire thing. That isn't normal, and we shouldn't really use that to determine preloading info (do we actually do that for regular XPIs?).
(In reply to Dave Townsend [:mossop] from comment #9)
> > If we are effectively reading the xpi 4 times, there are few things that can
> > come out of this:
> > - the xperf test does not care about what is in memory already. Re-reading a
> > file, while suboptimal, is not so much of a problem.
> > - something (the addon manager?) is reading the entire xpi when Firefox
> > starts
> 
> For a clean profile the add-ons manager is going to trigger a signing check
> on the loop XPI which means we're going to read the entire thing. That isn't
> normal and we shouldn't really use that to determine preloading info (do we
> actually do that for regular XPIs?)

Bleh, I am misspeaking here. The add-on isn't signed so we shouldn't be reading it all. We will read install.rdf and bootstrap.js though. On my machine these are right at the start and the end of the XPI. Is it possible we're just seeing some OS caching read the rest in that case?
(In reply to Dave Townsend [:mossop] from comment #9)
> Does this profiling phase start from a clean profile or is it the second run
> of a profile?

A clean profile, cf. build/pgo/profileserver.py
I am not clear on the next steps for this bug.  Can I help collect more data or aid in experimental fixes?
This is now seen on mozilla-aurora! I cannot determine what the next steps are here.
Sorry, missed the previous comment. I just realised that we should be able to look at the zip reader and see what's being loaded. I found there's an env var:

MOZ_JAR_LOG_FILE=`pwd`/test.txt /Applications/FirefoxNightly.app/Contents/MacOS/firefox -P local1 -purgecaches

I tested with my latest dev add-on and nightly, and this is what I got for just the loop xpi loads:

loop@mozilla.org.xpi chrome.manifest
loop@mozilla.org.xpi bootstrap.js
loop@mozilla.org.xpi chrome/content/preferences/prefs.js
loop@mozilla.org.xpi chrome/skin/shared/loop.css
loop@mozilla.org.xpi chrome/skin/osx/platform.css
loop@mozilla.org.xpi chrome/skin/osx/toolbar-yosemite.png
loop@mozilla.org.xpi chrome/content/modules/MozLoopService.jsm
loop@mozilla.org.xpi chrome/content/modules/LoopRooms.jsm
loop@mozilla.org.xpi chrome/skin/shared/loop.css
loop@mozilla.org.xpi chrome/skin/osx/platform.css
loop@mozilla.org.xpi chrome/content/modules/MozLoopPushHandler.jsm

The manifest & bootstrap I would guess are loaded on startup by the various bits that need them. The rest of the items there are triggered via the bootstrap startup() function. I suspect the skins are loaded twice due to the additional hidden window on Mac; we're certainly not loading them twice ourselves.
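
A quick way to spot those duplicates in a MOZ_JAR_LOG_FILE log, as a rough sketch assuming the two-column "archive member" lines shown above:

from collections import Counter

def tally_jar_loads(log_path, archive_filter="loop@mozilla.org.xpi"):
    # Count how often each (archive, member) pair appears in a MOZ_JAR_LOG_FILE log.
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue
            archive, member = parts[0], parts[1].strip()
            if archive_filter in archive:
                counts[(archive, member)] += 1
    return counts

for (archive, member), n in sorted(tally_jar_loads("test.txt").items()):
    marker = "  <-- loaded more than once" if n > 1 else ""
    print("%s %s x%d%s" % (archive, member, n, marker))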

What this doesn't show us is the loading that the add-on manager is doing; I guess it must be loading install.rdf from there?

Dave, any hints here?
Flags: needinfo?(dtownsend)
(In reply to Mark Banner (:standard8) from comment #14)
> Sorry, missed the previous comment. I just realised that we should be able
> to look at the zip reader and see what's being loaded. I found there's an
> env var:
> 
> MOZ_JAR_LOG_FILE=`pwd`/test.txt
> /Applications/FirefoxNightly.app/Contents/MacOS/firefox -P local1
> -purgecaches
> 
> I tested with my latest dev add-on and nightly, and this is what I got for
> just the loop xpi loads:
> 
> loop@mozilla.org.xpi chrome.manifest
> loop@mozilla.org.xpi bootstrap.js
> loop@mozilla.org.xpi chrome/content/preferences/prefs.js
> loop@mozilla.org.xpi chrome/skin/shared/loop.css
> loop@mozilla.org.xpi chrome/skin/osx/platform.css
> loop@mozilla.org.xpi chrome/skin/osx/toolbar-yosemite.png
> loop@mozilla.org.xpi chrome/content/modules/MozLoopService.jsm
> loop@mozilla.org.xpi chrome/content/modules/LoopRooms.jsm
> loop@mozilla.org.xpi chrome/skin/shared/loop.css
> loop@mozilla.org.xpi chrome/skin/osx/platform.css
> loop@mozilla.org.xpi chrome/content/modules/MozLoopPushHandler.jsm
> 
> The manifest & bootstrap I would guess are loaded on startup by the various
> bits that need them. The rest of the items there are triggered via the
> bootstrap startup() function. I suspect the skins are loaded twice due to
> the additional hidden window on Mac, we're certainly not loading them twice.
> 
> What this doesn't show us, is the loading that the add-on manager is doing
> as I guess it must be loading install.rdf from there?
> 
> Dave, any hints here?

If the add-on is already installed, the add-ons manager doesn't load install.rdf; it just keeps a cached version around in extensions.json. The add-ons manager is triggering the bootstrap.js and chrome.manifest loads.

We actually log how long it takes to load the manifest and call bootstrap.js in telemetry (http://mxr.mozilla.org/mozilla-central/source/toolkit/mozapps/extensions/internal/XPIProvider.jsm#4612). Can we have a look at what those numbers look like?
Flags: needinfo?(dtownsend)
This will have several impacts on PGO:
- Both the first run and the second run of Firefox will now be profiled, so there are more opportunities for the compiler to optimize code paths that normal users actually hit but that are currently not optimized because they are not exercised during profiling. This should have a positive outcome.
- We only look at the jar log from the second run, which will alter how much preloading we do when launching Firefox; this will affect first runs (and runs after upgrades, presumably) in a negative way, but the outcome should be positive for non-first runs.

The latter is what should fix the regression from using system add-ons.
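
Roughly, the idea is something like the following simplified sketch (not the actual build/pgo/profileserver.py change; the binary and profile paths are placeholders):

import os
import subprocess

def pgo_profile(firefox_bin, profile_dir, jar_log_path):
    # Sketch: run Firefox twice during PGO profiling, keeping the jar log from the
    # second run only, so preloading reflects a warm (non-first-run) startup.
    args = [firefox_bin, "-no-remote", "-profile", profile_dir]

    # First run: cold startup (profile creation, add-on registration, etc.).
    # It still contributes PGO profile data, but no jar log is recorded.
    subprocess.check_call(args)

    # Second run: record which jar members Gecko actually reads; the packager
    # then uses this log to decide what to preload.
    env = dict(os.environ, MOZ_JAR_LOG_FILE=jar_log_path)
    subprocess.check_call(args, env=env)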
Attachment #8702849 - Flags: review?(ted)
I made the mistake of pushing to try with the loop entry removed from testing/talos/talos/xtalos/xperf_whitelist.json, which turned the talos run orange, so I'm lacking the precious data to compare to m-c... That said, I at least know that it complains about 786432 bytes being read instead of the 4980736 it was when bug 1223573 first landed.
Attachment #8702849 - Flags: review?(ted) → review+
So, this gets us down from 62521112.00 after the landing of bug 1223573, to 56728412.00 with the patch, but the value before bug 1223573 was 50921276.00, so we're only halfway down. Someone else will have to look at this further.
Keywords: leave-open
The improvement doesn't cover the full regression, so someone will have to find out whether the remaining regression is wanted before this can be closed.
Ah, it seems I was looking at the wrong dip all along. The one this bug was about is a much smaller one in that interval:
https://hg.mozilla.org/integration/fx-team/pushloghtml?fromchange=c952179edbfc58e2e8db3df6782c943782229112&tochange=a5ad0e73b4bf19c63eb157f4fd1bbaa98eb051d8

Now, since bug 1231379 changed the parameters, it's hard to tell if we're fully covered or not. We'd need data with bug 1231379 backed out.
Rank: 19
Priority: -- → P1
(In reply to Mike Hommey (VAC: 31 Dec - 11 Jan) [:glandium] from comment #26)
> Ah, it seems I was looking at the wrong dip all along. The one this bug was
> about is a much smaller one in that interval:
> https://hg.mozilla.org/integration/fx-team/
> pushloghtml?fromchange=c952179edbfc58e2e8db3df6782c943782229112&tochange=a5ad
> 0e73b4bf19c63eb157f4fd1bbaa98eb051d8
> 
> Now, since bug 1231379 changed the parameters, it's hard to tell if we're
> fully covered or not. We'd need data with bug 1231379 backed out.

Joel, is this something you could try for us?
Flags: needinfo?(jmaher)
I have pushed to try:
https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&fromchange=0496289c57d0&tochange=3f5709dd4a15

These are PGO builds, so let's give it a couple of hours. We can click on the 'x' job in each of the two pushes and determine what the values are. In the second push I did, I forgot to change 'baseline', but you can see the backout patch in the queue.
Flags: needinfo?(jmaher)
The results are in: backing out bug 1231379 does reduce our numbers here:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=[try,79a8a84728f7936baecb7cb4a7e789468af24dc0,1]&series=[mozilla-central,9289b3016727523cced559336db000e7a58d0108,1]&highlightedRevisions=0496289c57d0&highlightedRevisions=3f5709dd4a15

The graph is misleading because non-PGO has much lower numbers, and to build PGO on try we use a hack, so it is reported as non-PGO. In raw numbers:

           non-e10s      e10s
baseline: 57312348.00  57496752.00
backout:  48954944.00  49017580.00

This would put us back to where we were prior to December 1st.
If we're back to the numbers from prior to Dec 1st with that backout, then that means the regression this bug was filed for is gone. Whether we want to address the one from bug 1231379 is another matter; maybe it has its own bug already? Should it?

Anyways, as far as this one goes, it's fixed. Thanks for the confirmation, Joel.
Status: NEW → RESOLVED
Closed: 8 years ago
Keywords: leave-open
Resolution: --- → FIXED
This is now on beta, but aurora has the fix.