Closed
Bug 1229731
Opened 8 years ago
Closed 8 years ago
4% Win7 tp5n(startup file I/O bytes) regression on Fx-Team (v.45) on December 01, 2015 from push 29ce9059dc2c
Categories
(Hello (Loop) :: Client, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jmaher, Unassigned)
References
Details
(Keywords: perf, regression, Whiteboard: [talos_regression])
Attachments
(1 file)
I suspect this is expected, but wanted to close the loop and officially document this. In the xperf suite, we measure file and network I/O with the xperf toolchain on Windows 7. Both xperf and xperf-e10s have shown regressions: http://alertmanager.allizom.org:8080/alerts.html?rev=a5ad0e73b4bf19c63eb157f4fd1bbaa98eb051d8&showAll=1&testIndex=0&platIndex=0

This isn't seen on non-pgo builds, only on pgo. Not sure why; probably an artifact of how we do pgo.

I strongly suspect this matches the bytes for the .xpi. I would like to confirm that and ensure that we intended for this to be on the main thread during startup.
Comment 2•8 years ago
I just had a look: on Windows, the size of the xpi in the application directory is 1,241,673 bytes in the latest nightly. According to the emails, the increase in file I/O is 2,236,167 bytes. That's a little more, but I'm guessing that we might be registering multiple reads or something; it's a little hard to tell without logs.

I would expect a lot of this is due to the fact that we've moved to the xpi. We haven't really changed anything that we do. We might be loading style sheets a bit earlier, as we're in bootstrap and not the browser window itself, but I can't see that we can do much about that, as we need that info here.

I'm going to tag Dave and see if he's got any concerns here as well.
Flags: needinfo?(standard8) → needinfo?(dtownsend)
Comment 3•8 years ago
Two things worth considering:

- Are we compressing the XPI? That might save us file I/O and probably wouldn't impact perf much, but there is likely a tradeoff there.
- How much did omni.ja shrink with the move? I'm surprised that we see so much additional file I/O when presumably we should have lost a bunch from omni.ja.
Flags: needinfo?(dtownsend)
Reporter | ||
Comment 4•8 years ago
In the logs we have a list of files and the bytes read (we also have writes, but I only did this for the reads).

old log: http://archive.mozilla.org/pub/firefox/tinderbox-builds/fx-team-win32-pgo/1449052255/fx-team_win7-ix_test-xperf-pgo-bm109-tests1-windows-build342.txt.gz
new log: http://archive.mozilla.org/pub/firefox/tinderbox-builds/fx-team-win32-pgo/1448814650/fx-team_win7-ix_test-xperf-pgo-bm109-tests1-windows-build338.txt.gz

After cleaning up the tmpdir names and sanitizing into a spreadsheet: https://docs.google.com/spreadsheets/d/1JFYUhcN8NGZAmZjcoGE5-Pw8uz6m9H9dXaokXok0Ues/edit?usp=sharing

you can see these differences:
C:\slave\test-pgo\build\application\firefox\browser\features\loop@mozilla.org.xpi, 4980736 bytes
C:\slave\test-pgo\build\application\firefox\browser\omni.ja, -131072 bytes
C:\slave\test-pgo\build\application\firefox\nss3.dll, -131072 bytes

and sadly, C:\slave\test-pgo\build\application\firefox\omni.ja remains the same at 4980736 bytes.

If these numbers look off, let's investigate more; it could be something odd on our machines or with the builds, and we can look at other data.
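The per-file comparison above can be reproduced with a short script. This is a minimal sketch, assuming each sanitized log line ends with "path, bytes" like the entries quoted; the raw xperf logs need more filtering than this:

```python
def parse_log(lines):
    # Assumed line format: "<path>, <bytes>"; sums repeated entries per path.
    reads = {}
    for line in lines:
        path, _, count = line.rpartition(",")
        path, count = path.strip(), count.strip()
        if path and count.isdigit():
            reads[path] = reads.get(path, 0) + int(count)
    return reads

def diff_reads(old, new):
    # Positive delta means more bytes read in the new build.
    paths = set(old) | set(new)
    return {p: new.get(p, 0) - old.get(p, 0)
            for p in paths if new.get(p, 0) != old.get(p, 0)}

# Illustrative values only, loosely based on the numbers in this comment:
old = parse_log([r"firefox\browser\omni.ja, 4980736"])
new = parse_log([r"firefox\browser\omni.ja, 4849664",
                 r"firefox\browser\features\loop@mozilla.org.xpi, 4980736"])
print(diff_reads(new=new, old=old))
```

With those inputs, omni.ja shows a -131072 delta and the xpi shows up as entirely new reads, matching the shape of the spreadsheet diff.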
Comment 5•8 years ago
(In reply to Joel Maher (:jmaher) from comment #4)
> you can see these differences:
> C:\slave\test-pgo\build\application\firefox\browser\features\loop@mozilla.org.xpi, 4980736 bytes
> C:\slave\test-pgo\build\application\firefox\browser\omni.ja, -131072 bytes
> C:\slave\test-pgo\build\application\firefox\nss3.dll, -131072 bytes
>
> and sadly:
> C:\slave\test-pgo\build\application\firefox\omni.ja remains the same at
> 4980736 bytes.

Hello is browser specific, so I'd expect browser\omni.ja to be the one to reduce.

> if these numbers look off, lets investigate more- it could be something odd
> on our machines or with the builds- we can look at other data.

Is it odd that browser/omni.ja and nss3.dll are exactly the same size before and after (with the same drop), and the same with the after numbers of omni.ja and loop@mozilla.org.xpi? I wouldn't have expected reads of nss3.dll to drop. If we believe the numbers, it also means loop@mozilla.org.xpi was read in full 4 times.
Comment 6•8 years ago
Mike, are xpis packed to the same compression standard as omni.ja?
Flags: needinfo?(mh+mozilla)
Comment 7•8 years ago
(In reply to Mark Banner (:standard8) from comment #6)
> Mike, are xpis packed to the same compression standard as omni.ja?

Yes. Now, looking at the actual builds, this is what can be seen:
- before loop addon:
  browser/omni.ja 15304507
- with loop addon:
  browser/omni.ja 14101105
  browser/features/loop@mozilla.org.xpi 1241648

The sum of the above is 15342753, which is 38246 bytes more. That is not totally crazy; it might come from things like install.rdf and bootstrap.js.

Now, since those files come from PGO builds, they also have preloading information (which tells the jar reader in Gecko to read the first n bytes in the jar; you can read this yourself, it's the first 4 bytes in the jar files, in little-endian format), which is the following:
- before loop addon:
  browser/omni.ja 1758427
- with loop addon:
  browser/omni.ja 1700114
  browser/features/loop@mozilla.org.xpi 1241626

The way the preloading information is computed is to record, when Firefox is started during the PGO profiling phase, what files are being accessed by Gecko's jar reader, and to use that information when packaging.

If we are effectively reading the xpi 4 times, there are a few things that can come out of this:
- the xperf test does not care about what is in memory already. Re-reading a file, while suboptimal, is not so much of a problem.
- something (the addon manager?) is reading the entire xpi when Firefox starts.

The difference for nss3.dll doesn't make much sense... except that the files have different sizes in the builds:
- before loop addon: nss3.dll 1722672
- with loop addon: nss3.dll 1700144

So yes, that could have that effect. As to why the file shrunk, that could be due to a number of things, including PGO itself inducing variations in size for the same build. Or there could have been some changes between both PGO builds.
Flags: needinfo?(mh+mozilla)
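Per the description above, the preload size lives in the first 4 bytes of a PGO-optimized omni.ja/.xpi, in little-endian format. A minimal sketch of reading it (the fake file path and the 1700114 value from the comment above are just for illustration):

```python
import struct

def read_preload_size(jar_path):
    # First 4 bytes of the jar hold the number of bytes Gecko's jar reader
    # preloads, stored little-endian (per comment 7).
    with open(jar_path, "rb") as f:
        header = f.read(4)
    return struct.unpack("<I", header)[0]

# Synthetic file standing in for a real browser/omni.ja:
with open("/tmp/fake_omni.ja", "wb") as f:
    f.write(struct.pack("<I", 1700114) + b"...rest of jar...")

print(read_preload_size("/tmp/fake_omni.ja"))  # prints 1700114
```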
Comment 8•8 years ago
(In reply to Mike Hommey [:glandium] from comment #7)
> - the xperf test does not care about what is in memory already. Re-reading a
> file, while suboptimal, is not so much of a problem.

Note this also means that the test could be mis-categorizing other things: you could have a change that makes smaller parts of a file read more often categorized as a regression while the actual disk hit would be smaller, or, on the opposite end, a change that makes a file read less often, but reading bigger parts of it, categorized as an improvement.
Comment 9•8 years ago
(In reply to Mike Hommey [:glandium] from comment #7)
> (In reply to Mark Banner (:standard8) from comment #6)
> > Mike, are xpis packed to the same compression standard as omni.ja?
>
> Yes.
>
> Now, looking at the actual builds, this is what can be seen:
> - before loop addon:
>   browser/omni.ja 15304507
> - with loop addon:
>   browser/omni.ja 14101105
>   browser/features/loop@mozilla.org.xpi 1241648
>
> The sum of the above is 15342753, which is 38246 bytes more, which is not
> totally crazy, it might come from things like install.rdf and bootstrap.js.

bootstrap.js alone is 31k, so that seems reasonable.

> Now, since those files come from PGO builds, they also have preloading
> information (which tells the jar reader in gecko to read the first n bytes
> in the jar; you can read this yourself, it's the 4 first bytes in the jar
> files, in little endian format), which is the following:
> - before loop addon:
>   browser/omni.ja 1758427
> - with loop addon:
>   browser/omni.ja 1700114
>   browser/features/loop@mozilla.org.xpi 1241626
>
> The way the preloading information is computed is to record, when Firefox is
> started during the PGO profiling phase, what files are being accessed by
> Gecko's jar reader, and use that information when packaging.

Does this profiling phase start from a clean profile, or is it the second run of a profile? Same question for the file I/O check we're concerned with here.

> If we are effectively reading the xpi 4 times, there are few things that can
> come out of this:
> - the xperf test does not care about what is in memory already. Re-reading a
> file, while suboptimal, is not so much of a problem.
> - something (the addon manager?) is reading the entire xpi when Firefox
> starts

For a clean profile, the add-ons manager is going to trigger a signing check on the loop XPI, which means we're going to read the entire thing. That isn't normal, and we shouldn't really use that to determine preloading info (do we actually do that for regular XPIs?)
Comment 10•8 years ago
(In reply to Dave Townsend [:mossop] from comment #9)
> > If we are effectively reading the xpi 4 times, there are few things that can
> > come out of this:
> > - the xperf test does not care about what is in memory already. Re-reading a
> > file, while suboptimal, is not so much of a problem.
> > - something (the addon manager?) is reading the entire xpi when Firefox
> > starts
>
> For a clean profile the add-ons manager is going to trigger a signing check
> on the loop XPI which means we're going to read the entire thing. That isn't
> normal and we shouldn't really use that to determine preloading info (do we
> actually do that for regular XPIs?)

Bleh, I am misspeaking here. The add-on isn't signed, so we shouldn't be reading it all. We will read install.rdf and bootstrap.js though. On my machine these are right at the start and the end of the XPI. Is it possible we're just seeing some OS caching read the rest in that case?
Comment 11•8 years ago
(In reply to Dave Townsend [:mossop] from comment #9)
> Does this profiling phase start from a clean profile or is it the second run
> of a profile?

A clean profile, cf. build/pgo/profileserver.py
Reporter | ||
Comment 12•8 years ago
I am not clear on the next steps for this bug. Can I help collect more data or aid in experimental fixes?
Reporter | ||
Comment 13•8 years ago
This is now seen on mozilla-aurora! I cannot determine what the next steps are here.
Comment 14•8 years ago
Sorry, missed the previous comment. I just realised that we should be able to look at the zip reader and see what's being loaded. I found there's an env var:

MOZ_JAR_LOG_FILE=`pwd`/test.txt /Applications/FirefoxNightly.app/Contents/MacOS/firefox -P local1 -purgecaches

I tested with my latest dev add-on and nightly, and this is what I got for just the loop xpi loads:

loop@mozilla.org.xpi chrome.manifest
loop@mozilla.org.xpi bootstrap.js
loop@mozilla.org.xpi chrome/content/preferences/prefs.js
loop@mozilla.org.xpi chrome/skin/shared/loop.css
loop@mozilla.org.xpi chrome/skin/osx/platform.css
loop@mozilla.org.xpi chrome/skin/osx/toolbar-yosemite.png
loop@mozilla.org.xpi chrome/content/modules/MozLoopService.jsm
loop@mozilla.org.xpi chrome/content/modules/LoopRooms.jsm
loop@mozilla.org.xpi chrome/skin/shared/loop.css
loop@mozilla.org.xpi chrome/skin/osx/platform.css
loop@mozilla.org.xpi chrome/content/modules/MozLoopPushHandler.jsm

The manifest & bootstrap I would guess are loaded on startup by the various bits that need them. The rest of the items there are triggered via the bootstrap startup() function. I suspect the skins are loaded twice due to the additional hidden window on Mac; we're certainly not loading them twice ourselves.

What this doesn't show us is the loading that the add-on manager is doing, as I guess it must be loading install.rdf from there? Dave, any hints here?
Flags: needinfo?(dtownsend)
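A quick way to spot entries loaded more than once in a MOZ_JAR_LOG_FILE log like the one above is standard sort/uniq counting; this sketch feeds in a few hypothetical sample lines instead of a real log:

```shell
# Build a small sample log (stand-in for the real MOZ_JAR_LOG_FILE output).
printf '%s\n' \
  'loop@mozilla.org.xpi chrome/skin/shared/loop.css' \
  'loop@mozilla.org.xpi bootstrap.js' \
  'loop@mozilla.org.xpi chrome/skin/shared/loop.css' > test.txt

# Count occurrences of each entry; duplicates (like the double loop.css load
# noted in this comment) sort to the top.
sort test.txt | uniq -c | sort -rn | head
```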
Comment 15•8 years ago
(In reply to Mark Banner (:standard8) from comment #14)
> Sorry, missed the previous comment. I just realised that we should be able
> to look at the zip reader and see what's being loaded. I found there's an
> env var:
>
> MOZ_JAR_LOG_FILE=`pwd`/test.txt
> /Applications/FirefoxNightly.app/Contents/MacOS/firefox -P local1
> -purgecaches
>
> I tested with my latest dev add-on and nightly, and this is what I got for
> just the loop xpi loads:
>
> loop@mozilla.org.xpi chrome.manifest
> loop@mozilla.org.xpi bootstrap.js
> loop@mozilla.org.xpi chrome/content/preferences/prefs.js
> loop@mozilla.org.xpi chrome/skin/shared/loop.css
> loop@mozilla.org.xpi chrome/skin/osx/platform.css
> loop@mozilla.org.xpi chrome/skin/osx/toolbar-yosemite.png
> loop@mozilla.org.xpi chrome/content/modules/MozLoopService.jsm
> loop@mozilla.org.xpi chrome/content/modules/LoopRooms.jsm
> loop@mozilla.org.xpi chrome/skin/shared/loop.css
> loop@mozilla.org.xpi chrome/skin/osx/platform.css
> loop@mozilla.org.xpi chrome/content/modules/MozLoopPushHandler.jsm
>
> The manifest & bootstrap I would guess are loaded on startup by the various
> bits that need them. The rest of the items there are triggered via the
> bootstrap startup() function. I suspect the skins are loaded twice due to
> the additional hidden window on Mac, we're certainly not loading them twice.
>
> What this doesn't show us, is the loading that the add-on manager is doing
> as I guess it must be loading install.rdf from there?
>
> Dave, any hints here?

If the add-on is already installed, the add-ons manager doesn't load install.rdf; it just keeps a cached version around in extensions.json. The add-ons manager is triggering the bootstrap.js and chrome.manifest loads. We actually log how long it takes to load the manifest and call bootstrap.js in telemetry (http://mxr.mozilla.org/mozilla-central/source/toolkit/mozapps/extensions/internal/XPIProvider.jsm#4612); can we have a look at what those numbers look like?
Flags: needinfo?(dtownsend)
Comment 16•8 years ago
This will have several impacts on PGO:

- Both the first run and second run of Firefox will now be profiled, so there are more occasions for the compiler to optimize code that is actually used by normal users and is not optimized at the moment because it is not exercised. This should have a positive outcome.
- We only look at the jar log on the second run, which will alter how much preloading we do when launching Firefox. This will affect first runs (and runs after upgrades, presumably) in a negative way, but the outcome should be positive for non-first runs.

The latter is what should fix the regression from using system addons.
Attachment #8702849 -
Flags: review?(ted)
Comment 17•8 years ago
I made the mistake of pushing to try while removing the loop entry from testing/talos/talos/xtalos/xperf_whitelist.json, which turned the talos run orange, leaving me without the precious data to compare to m-c... That said, I at least know that it complains about 786432 bytes being read instead of the 4980736 it was reading when bug 1223573 first landed.
Updated•8 years ago
Attachment #8702849 -
Flags: review?(ted) → review+
Comment 19•8 years ago
So, this gets us down from 62521112.00 after the landing of bug 1223573, to 56728412.00 with the patch, but the value before bug 1223573 was 50921276.00, so we're only halfway down. Someone else will have to look at this further.
Reporter | ||
Comment 20•8 years ago
A few other regressions have gone in since Dec 1; look at the growth here:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=[mozilla-inbound,79a8a84728f7936baecb7cb4a7e789468af24dc0,1]&series=[mozilla-central,79a8a84728f7936baecb7cb4a7e789468af24dc0,1]

I don't see any improvements there, nor here (the original link from the alert):
http://graphs.mozilla.org/graph.html#tests=[[242,64,47],[242,132,47],[242,131,47]]&sel=none&displayrange=30&datatype=geo
Updated•8 years ago
Keywords: leave-open
Comment 22•8 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/035431609d12
Reporter | ||
Comment 23•8 years ago
On the pgo graph I see an improvement from this:
http://graphs.mozilla.org/graph.html#tests=[[242,64,47],[242,132,47],[242,131,47]]&sel=none&displayrange=30&datatype=geo

Do we need to leave this open?
Comment 24•8 years ago
The improvement doesn't cover the regression, so someone will have to find out if it's wanted before this can be closed.
Comment 25•8 years ago
Looking at https://hg.mozilla.org/integration/fx-team/pushloghtml?fromchange=1e4c936c142d5d13da442e8a82d4f30a40971153&tochange=a9eaed26451a0381ed8a599597aa1c8629a2f3d1 the likely cause of the big dip is bug 1231379.
Comment 26•8 years ago
Ah, it seems I was looking at the wrong dip all along. The one this bug was about is a much smaller one in that interval: https://hg.mozilla.org/integration/fx-team/pushloghtml?fromchange=c952179edbfc58e2e8db3df6782c943782229112&tochange=a5ad0e73b4bf19c63eb157f4fd1bbaa98eb051d8 Now, since bug 1231379 changed the parameters, it's hard to tell if we're fully covered or not. We'd need data with bug 1231379 backed out.
Updated•8 years ago
Rank: 19
Priority: -- → P1
Comment 27•8 years ago
(In reply to Mike Hommey (VAC: 31 Dec - 11 Jan) [:glandium] from comment #26)
> Ah, it seems I was looking at the wrong dip all along. The one this bug was
> about is a much smaller one in that interval:
> https://hg.mozilla.org/integration/fx-team/pushloghtml?fromchange=c952179edbfc58e2e8db3df6782c943782229112&tochange=a5ad0e73b4bf19c63eb157f4fd1bbaa98eb051d8
>
> Now, since bug 1231379 changed the parameters, it's hard to tell if we're
> fully covered or not. We'd need data with bug 1231379 backed out.

Joel, is this something you could try for us?
Flags: needinfo?(jmaher)
Reporter | ||
Comment 28•8 years ago
I have pushed to try: https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&fromchange=0496289c57d0&tochange=3f5709dd4a15

These are pgo builds; let's give it a couple of hours. We can click on the 'x' job in each of the two pushes and determine what the values are. In the second push I did, I forgot to change 'baseline', but you can see the backout patch in the queue.
Flags: needinfo?(jmaher)
Reporter | ||
Comment 29•8 years ago
The results are in; backing out bug 1231379 does reduce our numbers here: https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=[try,79a8a84728f7936baecb7cb4a7e789468af24dc0,1]&series=[mozilla-central,9289b3016727523cced559336db000e7a58d0108,1]&highlightedRevisions=0496289c57d0&highlightedRevisions=3f5709dd4a15

The graph is misleading because non-pgo has much lower numbers, and to build pgo on try we use a hack, so it is reported as non-pgo. In raw numbers:

          non-e10s     e10s
baseline: 57312348.00  57496752.00
backout:  48954944.00  49017580.00

This would put us back to where we were prior to December 1st.
Comment 30•8 years ago
If we're back to the numbers prior to Dec 1st with that backout, then the regression this bug was filed for is gone. Whether we want to address the one from bug 1231379 is another matter; maybe it has its own bug already? Should it? Anyway, as far as this one goes, it's fixed. Thanks for the confirmation, Joel.
Reporter | ||
Comment 31•8 years ago
This is now on beta, but aurora has the fix.