Open Bug 1254767 Opened 9 years ago Updated 2 years ago

Assess Talos/performance impact of switching to Visual Studio 2015

Categories

(Firefox Build System :: General, defect)

defect

Tracking

(Not tracked)

People

(Reporter: gps, Unassigned)

References

Details

(Keywords: perf)

Before we switch Windows builds from Visual Studio 2013 to Visual Studio 2015, we need to assess the performance impact of switching toolchains. Worth calling out are differences with the Universal CRT (ucrt) in Visual Studio 2015 (https://blogs.msdn.microsoft.com/vcblog/2015/03/03/introducing-the-universal-crt/). Basically, there are now ~40 api-ms-win-*.dll files that ship with Firefox that constitute the ucrt. Many of these load or partially load at startup. Looking at system calls, there is file activity and some memory mapping for these files on process startup. I would not be surprised if process start time changed as a result, especially on cold boot. Aside from the ucrt fallout, I'm optimistic there won't be any major regressions. But you never know. The jobs are still in progress and there's a chance they will fail, but I'm optimistic the following Try pushes will go green and can be used for comparing Talos numbers. https://treeherder.mozilla.org/#/jobs?repo=try&revision=d67e4a1f3735 https://treeherder.mozilla.org/#/jobs?repo=try&revision=5c2a7b4e81d0 (PGO)
looking at the data for win7 opt (non pgo): https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&originalRevision=05c087337043&newProject=try&newRevision=d67e4a1f3735&framework=1&showOnlyImportant=0 we have plenty of data points for he base and the try push- I don't see any concerns there. Win8 might take a day or two. I also need to manually look at pgo since our tools don't work with try-pgo stuff.
looking at pgo: win7: * tcanvasmark: 7800 base, 7570 vs2015 - regression (higher is better) * tresize: 10.3 base, 10.5 vs2015 - regression * dromaeo_css: 6550 base, 6350 vs2015 - regression (higher is better) * kraken: 1595 base, 1600 vs2015 - slight regression, might be noise * glterrain: 5.16 base, 5.45 vs2015 - regression * tp5o_scroll: 2.9 base, 3.12 vs2015 - regression * damp: 193 base, 193 vs2015 * tps: 47 base, 47 vs2015 * tp5o: 226 base, 236 vs2015 - regression * tp5o_responsiveness: 38 vase, 39 vs2015 - slight regression we have 14 other tests, then all 24 are duplicated in e10s- I can look at the rest of these if you want, but if there is additional work, etc. then I would like to save some manual comparison time.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=46e456744f3a is the most recent PGO build I've performed in automation. Although, no test jobs for PGO builds were performed. The builds should be good, however. It shouldn't be materially different from the links in the initial comment.
Summary: Asses Talos/performance impact of switching to Visual Studio 2015 → Assess Talos/performance impact of switching to Visual Studio 2015
I wrote a script to pull data from perfherder: http://people.mozilla.org/~jmaher/pgo.py.txt Here is the output: test, base avg, base count, new avg, new count, % regress a11yr pgo,6,401.008333333,6,748.335,-46.41% a11yr pgo e10s,4,410.135,2,761.025,-46.11% tabpaint pgo e10s,4,50.64,4,79.2025,-36.06% tabpaint pgo,6,52.0066666667,6,78.5933333333,-33.83% cart pgo,6,23.6166666667,6,34.3866666667,-31.32% cart pgo e10s,6,24.9616666667,2,36.335,-31.30% tp5o pgo,6,229.381666667,6,333.856666667,-31.29% tp5o_scroll pgo,7,2.9414285714,6,4.2683333333,-31.09% tpaint pgo,2,189.15,4,272.8325,-30.67% tpaint pgo e10s,2,241.81,2,346.435,-30.20% tart pgo,6,3.7966666667,6,5.4283333333,-30.06% tart pgo e10s,2,4.21,2,5.965,-29.42% tp5o pgo e10s,2,225.2,2,317.99,-29.18% damp pgo,5,192.386,6,265.236666667,-27.47% sessionrestore pgo e10s,2,1580,4,2177,-27.42% sessionrestore pgo,2,1574,4,2164.25,-27.27% damp pgo e10s,2,195.805,2,266.41,-26.50% sessionrestore_no_auto_restore pgo,4,625.75,4,828,-24.43% sessionrestore_no_auto_restore pgo e10s,2,629.5,4,817.5,-23.00% ts_paint pgo,4,862.75,2,1107,-22.06% tsvgr_opacity pgo,6,530.566666667,6,679.945,-21.97% tps pgo,5,48.15,6,61.4533333333,-21.65% tsvgx pgo e10s,2,196.325,4,248.8575,-21.11% ts_paint pgo e10s,4,444.25,2,556,-20.10% tsvgr_opacity pgo e10s,6,893.391666667,4,1065.6025,-16.16% tsvgx pgo,6,387.685,6,451.788333333,-14.19% dromaeo_css pgo,6,6567.05666667,6,5788.805,-13.44% dromaeo_css pgo e10s,4,6626.59,2,5864.52,-12.99% tcanvasmark pgo,6,7771.41666667,6,6913.58333333,-12.41% tresize pgo e10s,2,10.455,2,11.805,-11.44% tcanvasmark pgo e10s,2,7866.5,2,7096,-10.86% tps pgo e10s,2,75.66,4,83.1525,-9.01% tresize pgo,4,9.765,4,10.6,-7.88% tp5o_scroll pgo e10s,2,2.855,3,3.0966666667,-7.80% glterrain pgo e10s,4,5.095,2,5.265,-3.23% kraken pgo e10s,4,1597.5,4,1606.2,-0.54% tscrollx pgo e10s,6,2.8633333333,2,2.87,-0.23% kraken pgo,6,1589.5,6,1589.13333333,0.02% tscrollx pgo,6,2.8716666667,6,2.86,0.41% glterrain pgo,7,5.29,6,5.2083333333,1.57% when comparing pgo to the opt numbers, they are for the most part identical. Are you sure we are actually running pgo here?
after chatting with :ted on irc, it appears we are actually running the talos tests against the opt binary. :gps, can you do a pgo build like this: https://wiki.mozilla.org/ReleaseEngineering/TryChooser#What_if_I_want_PGO_for_my_build
Flags: needinfo?(gps)
OK. Looking at logs, I think some wires got mixed between buildbot and TaskCluster. https://treeherder.mozilla.org/#/jobs?repo=try&revision=cc55caee292a is a fresh PGO push based on latest mozilla-central. Now we wait ~4 hours.
The first talos jobs from the Try push earlier just started running. It looks like there is a 5+ hour delay on the Windows testers right now :/ Hopefully there will be results by the morning for jmaher to pour over. Each Talos job is scheduled 5 times.
hey, now we are comparing pgo to pgo! this is looking better- outliers: win8: >=5% regression: a11yr pgo e10s 374.35 405.846 -7.76% dromaeo_css pgo e10s 6979.31777778 6498.764 -7.39% a11yr pgo 378.471666667 407.28 -7.07% dromaeo_css pgo 6851.935 6436.546 -6.45% >=2% improvement: sessionrestore_no_auto_restore pgo 637.666666667 623.6 2.26% tcanvasmark pgo 8495.66666667 8856 4.07% tpaint pgo 223.863333333 202.046 10.80% tresize pgo 34.2883333333 11.024 211.03% win7: >=5% regression: a11yr pgo e10s 401.921666667 450.768 -10.84% a11yr pgo 396.041666667 441.74 -10.35% tresize pgo e10s 9.825 10.598 -7.29% >=2% improvement: tresize pgo 10.305 9.056 13.79% what really concerns me is the win7 tresize- e10s is a regression and non e10s is a win. Overall less concerns, but we should investigate a11y, dromaeo_css, tresize in more detail. :avih, could you weigh in on the tresize stuff? :davidb, could you help us (or find the right person to) figure out the a11yr regression? :bz, could you help us (or find the right person to) understand the dromaeo_css regression for win8 (win7 regressions are 1.5%, and 3.5% for e10s)?
Flags: needinfo?(dbolter)
Flags: needinfo?(bzbarsky)
Flags: needinfo?(avihpit)
(In reply to Joel Maher (:jmaher) from comment #9) > win8: > >=5% regression: > a11yr pgo e10s 374.35 405.846 -7.76% > dromaeo_css pgo e10s 6979.31777778 6498.764 -7.39% > a11yr pgo 378.471666667 407.28 -7.07% > dromaeo_css pgo 6851.935 6436.546 -6.45% > :davidb, could you help us (or find the right person to) figure out the > a11yr regression? Can start with me except I'm not yet sure how one would dig in here. We are comparing performance guided optimization between VS 2013 and VS 2015 right? Ideally we would have some way of computing the difference between performance profiles of each build running on our Talos tests. I'm not sure I could catch interesting differences manually (by eyeballing).
Flags: needinfo?(dbolter)
(In reply to David Bolter [:davidb] from comment #10) > performance profiles of each build running on our Talos tests Note I'm talking about gecko profiler (or similar) results here, which could show hotpath differences etc.
How stable are the dromaeo_css numbers? That is, can I see something resembling error bars on those values? Past that, what does the per-subtest breakdown look like?
Flags: needinfo?(bzbarsky)
oh, good point here- let me figure out how to run this with --spsProfiling. Right now for subtests we don't have a clean way to do this- our limitation is the tools we have compare opt<->opt and pgo<->pgo, but not try-pgo (which is ran as opt)<->pgo. I can gather some sps profiling info and work on breaking down a set of raw data point for the subtests as well.
Assuming I read the data correctly (that tresize on win7 e10s regressed from ~11 to ~34 after moving from VC2013 to VC2015), it does seem weird. My hunch would be that switching compiler (and supposedly to a better one) should not make any test 3 times worse than before, therefore I'd suspect the test doesn't measure correctly what's it's set up to measure, and possibly depends on timing in some incorrect way, and switching compilers made it cross some timing threshold to jump to a new "bucket" of results. That being said, jimm suggested recently that the test should be good, maybe he has some ideas on what could be going on here?
Flags: needinfo?(avihpit) → needinfo?(jmathies)
thanks for all the input so far. profiling push with old compiler: https://treeherder.mozilla.org/#/jobs?repo=try&revision=feb9b7b467b2 profiling push with new compiler: https://treeherder.mozilla.org/#/jobs?repo=try&revision=4e3d5275f11a ---------- bz: some raw data from the dromaeo_css tests (all regression, dojo is highest, ext/jquery non e10s are the lowest regressions): test, base avg, new avg, %% regress dojo.html pgo, 5053.64833333, 4580.536, -10.33% base: [5093.39, 5106.74, 4794.13, 5104.69, 5106.47, 5116.47] new: [4580.8, 4588.35, 4626.16, 4526.49, 4580.88] dojo.html pgo e10s, 5261.65555556, 4740.696, -10.99% base: [5271.72, 5258.84, 5257.39, 5176.73, 5284.37, 5252.83, 5240.31, 5296.24, 5316.47] new: [4754.14, 4736.19, 4757.93, 4695.0, 4760.22] dromaeo_css summary pgo, 6851.935, 6436.546, -6.45% base: [6940.97, 6949.48, 6775.98, 6846.46, 6764.94, 6833.78] new: [6528.3, 6383.71, 6410.01, 6358.38, 6502.33] dromaeo_css summary pgo e10s, 6979.31777778, 6498.764, -7.39% base: [6935.53, 7022.68, 6955.49, 6948.29, 7027.87, 6910.9, 7017.29, 6959.38, 7036.43] new: [6385.81, 6565.75, 6568.9, 6412.25, 6561.11] ext.html pgo, 17780.09, 17303.964, -2.75% base: [17750.27, 17959.07, 17541.14, 18278.35, 17605.03, 17546.68] new: [17354.04, 17384.46, 17568.12, 17313.14, 16900.06] ext.html pgo e10s, 18398.2155556, 17486.542, -5.21% base: [17867.28, 18738.58, 18831.64, 18214.3, 18489.82, 17743.25, 18599.52, 18291.06, 18808.49] new: [16883.72, 17280.65, 17353.64, 18131.17, 17783.53] jquery.html pgo, 8419.24666667, 8165.48, -3.11% base: [9001.48, 8896.49, 8115.28, 8242.48, 7986.5, 8273.25] new: [8842.68, 7912.02, 7836.51, 7491.36, 8744.83] jquery.html pgo e10s, 8142.73111111, 7594.524, -7.22% base: [7903.69, 8182.32, 7873.17, 8191.39, 8330.56, 8303.67, 8041.38, 8409.3, 8049.1] new: [6963.5, 7913.15, 8043.35, 7411.55, 7641.07] mootools.html pgo, 4440.87666667, 4110.708, -8.03% base: [4479.24, 4425.0, 4500.61, 4378.73, 4469.24, 4392.44] new: [4132.14, 4074.99, 4084.16, 4137.76, 4124.49] mootools.html pgo e10s, 4459.91777778, 4189.552, -6.45% base: [4459.82, 4535.33, 4394.22, 4515.29, 4429.64, 4373.64, 4580.48, 4406.7, 4444.14] new: [4217.77, 4266.84, 4191.57, 3987.13, 4284.45] prototype.html pgo, 4749.6, 4396.164, -8.04% base: [4757.52, 4766.65, 4774.62, 4776.77, 4621.05, 4800.99] new: [4404.52, 4412.32, 4430.78, 4385.59, 4347.61] prototype.html pgo e10s, 5032.74888889, 4649.26, -8.25% base: [5075.95, 5032.42, 5012.3, 4959.18, 5085.36, 4970.49, 5072.73, 4968.03, 5118.28] new: [4660.75, 4648.88, 4669.67, 4624.91, 4642.09] yui.html pgo, 6493.155, 6091.792, -6.59% base: [6447.19, 6544.87, 6599.52, 6401.95, 6463.34, 6502.06] new: [6049.98, 5963.43, 6018.07, 6202.18, 6225.3] yui.html pgo e10s, 6535.79111111, 6152.268, -6.23% base: [6603.38, 6517.68, 6594.82, 6505.93, 6570.74, 6474.92, 6556.03, 6369.76, 6628.86] new: [6170.94, 6235.53, 6180.28, 5974.24, 6200.35]
OK, so those are pretty consistent. Looking at just dojo.html, are there particular subtests that are much slower, or just slowdown across the board? That test is pretty JS-heavy, iirc, so it might just be that we're not optimizing the SpiderMonkey interpreter that well, or it might be something else....
we only upload the result for dojo, in the raw logs I can get the sub-subtest data. I graphed the raw data in a google spreadsheet: https://docs.google.com/spreadsheets/d/1BqYEQ8siCUMgO-qCO29UAEDmeINq0e5s5YV9nWHZdXU/edit?usp=sharing I believe the lower value points in the middle of the graph refer to: https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/tests/dromaeo/tests/cssquery-dojo.html#85 for the most part all of the sub-subtests are equally regressed, slight differences between one and another, but the general trend is the same.
OK. Sounds like we should profile and see if anything comes up there, then. :(
Can someone post the perfherder compare link the numbers in comment 9 came from? I tried to piece one together but I don't see the same results - https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=46e456744f3a&newProject=try&newRevision=cc55caee292a&framework=1&filter=tresize&showOnlyImportant=0
:jimm, unfortunately there is no simple UI tool to make this comparison easy to view. The problem is PGO on try gets run on pgo builds, but end up posting results as opt. The compare view in perfherder compares opt<->opt and pgo<->pgo. I have a hacked up script that pulls the data for me: http://people.mozilla.org/~jmaher/pgo.py.txt
(In reply to Joel Maher (:jmaher) from comment #13) > Right now for subtests we don't have a clean way to do this- our limitation > is the tools we have compare opt<->opt and pgo<->pgo, but not try-pgo (which > is ran as opt)<->pgo. Note it is now possible to trigger actual pgo builds on try, albeit manually.
(In reply to Mike Hommey [:glandium] from comment #21) > Note it is now possible to trigger actual pgo builds on try, albeit manually. I've done this a few times in the past week. However, it doesn't appear to trigger derived jobs, even if the try syntax includes -t all or -u all.
(In reply to Joel Maher (:jmaher) from comment #20) > :jimm, unfortunately there is no simple UI tool to make this comparison easy > to view. The problem is PGO on try gets run on pgo builds, but end up > posting results as opt. The compare view in perfherder compares opt<->opt > and pgo<->pgo. I have a hacked up script that pulls the data for me: > http://people.mozilla.org/~jmaher/pgo.py.txt Ok, I was looking for a way to confirm the tresize regression here. I'll take your word on it. Resizing exercises gfx and ipc code associated with painting a basic content view - layer updates and ipc traffic plus layout in both processes. The regression here is ~.7 msec for a window resize operation of 2 pixels in bolth width and height. This shouldn't block rollout of the new toolchain but we should look for new bottlenecks in the builds.
Flags: needinfo?(jmathies)
runs: vs2015; https://treeherder.allizom.org/#/jobs?repo=try&revision=c2ddfa6f35c8 vs2013; https://treeherder.allizom.org/#/jobs?repo=try&revision=4e7913ad6329 some profiling info: dromaeo_css win8 non e10s: vs2013: http://people.mozilla.com/~bgirard/cleopatra/?zippedProfile=http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try/sha512/6564689587aa3a74bcb66f5764107bb5088276e923c7122b79c332ab98822239528358db4375ac1ee5d2b75e480261630f9bb51c8dd423c5da7892361721685a vs2015: http://people.mozilla.com/~bgirard/cleopatra/?zippedProfile=http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try/sha512/45b05a8d1956ad394f78fac7ea5f239c8ce5c4738b6362365aeb9522ed3fdb43c51caba7cd82a4a82c6fb168c0deede254c1915df6a752337e3693cf2dad0b28 a11y win8 e10s (still waiting on non e10s if that is better): vs2013: http://people.mozilla.com/~bgirard/cleopatra/?zippedProfile=http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try/sha512/8e9835ce3b071238ad30de144c43cd8f30ebdd051901b5d9e41c982391834489d3c1e60f2a0c247327ad47c9a0183a4843b53c9cb4150f39e6e976c35ef7959d vs2015: http://people.mozilla.com/~bgirard/cleopatra/?zippedProfile=http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try/sha512/90923b31c34470ec8a5de23b2a3f9814e2e7a64b46491ed829b8e31c34b8ef806d87ca42612dda527494c36edcf839e291c58eada74893daa644f70b2fce3c38 lastly tresize win7 e10s (regression): vs2013: still waiting, results will be found on this link ( https://treeherder.allizom.org/#/jobs?repo=try&revision=4e7913ad6329&selectedJob=18784207&filter-searchStr=Windows%207%20opt%20Talos%20Performance%20e10s%20Talos%20chrome%20e10s%20T-e10s(c) ) vs2015: http://people.mozilla.com/~bgirard/cleopatra/?zippedProfile=http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try/sha512/54695668ac03e7f745d528da60267f3f7c8384bea7f441dfa094903ad5f215d8a48f9a3cfaddea7981b597bafc9c7920c5919a60c9a11175ea677f13b0790438 tresize win7 non e10s (improvement): vs2013: http://people.mozilla.com/~bgirard/cleopatra/?zippedProfile=http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try/sha512/33dbccb6e3659c7540db081fdff9c56bc8e35666736e5b36d58ae2e2f9560b9f0531d9b887795d9f509e2d1ba3adc1812211b91899d4c77c086a6ae2cc595f8e vs2015: http://people.mozilla.com/~bgirard/cleopatra/?zippedProfile=http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try/sha512/800dd47486d200bcea221162db224fa580d71fc3aa8ef2a92121f85a0d5237964f54aa567345cedd58e749464e913d880a2ec6e89fa342d7838ec756dcc12587
Nothing's jumping out at me there. The number of samples is _smaller_ in the 2015 case. Is that just because of the manual start/stop procedure used? Makes it pretty hard to actually compare the profiles...
thanks for looking bz. I think we can keep iterating on this and let vs2015 land in the meantime. We have 3 tests to be concerned with out of many. :BenWa, could you help answer :bz's question about the samples? I am not a sps profiling expert
Flags: needinfo?(bgirard)
(In reply to Boris Zbarsky [:bz] from comment #25) > Nothing's jumping out at me there. The number of samples is _smaller_ in > the 2015 case. Is that just because of the manual start/stop procedure > used? Makes it pretty hard to actually compare the profiles... There's no expectation of sampling to be similar even if the execution is deterministic. We're profiling from user-space using a normal priority thread. To know how long something last you can drag a selection on the timeline and it will give you how long that range using the selected green samples or the markers as a guide. When comparing two profiles you ideally want to pick a similar time range using something like the test start/end markers and comparing execution % and looking for functions that appear/disappear or have a significant % change. With something like a toolchain change I don't expect something like the Gecko Profiler to be very useful if the problem is worse codegen. It's meant at picking out why IPC threads are waiting on each other, finding what we're doing when we're unresponsive and similar scenarios. It wouldn't be able to point out if we're getting hurt by evenly spread-out paper cuts if the codegen was worse. For that I suspect something like VTune would be more effective but it's difficult to collect a good profile with it and therefore might be hard to investigate. I did have a look at the dromaeo css dojo profile and there's nothing that I can see.
Flags: needinfo?(bgirard)
Thanks BenWa!
Product: Core → Firefox Build System
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.