Closed Bug 714455 Opened 8 years ago Closed 2 years ago
MSVC builds are being compiled with -O1 (unlike GCC, optimises for size)
User Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1 Build ID: 20111226044539 Steps to reproduce: MSVC builds are compiling with -O1 specified; unlike GCC, this will optimise for size which I don't think is really what one would want to be doing. A complete breakdown of the MSVC optimisation flags are provided here: http://msdn.microsoft.com/en-us/library/8f8h5cxt.aspx Actual results: -O1 builds are smaller than -O2 builds, but invariably slower due to using Expected results: -O2 should be used to optimise for speed rather than size.
Summary: MSVC builds are being compiled with -O1 (unlike GCC, opts 4 size) → MSVC builds are being compiled with -O1 (unlike GCC, optimises 4 size)
Summary: MSVC builds are being compiled with -O1 (unlike GCC, optimises 4 size) → MSVC builds are being compiled with -O1 (unlike GCC, optimises for size)
This may be true, but I'm not sure it matters in practice. All of our release builds are compiled with PGO, so important sections of code are likely to be optimized for speed anyway.
there is a noticeable size difference between -O1 and -O2 builds. Whether or not it has any tangible impact on performance is something that would need to be tested; nonetheless, I would be surprised if even with PGO this wasn't having an undesired effect.
So, in the hope someone will finally do something with this ticket, I've had a bit more of a think about it, and here's my conclusion: PGO relies on executing the first iteration of code and optimising it based on execution characteristics; the fact that this first iteration is optimised for size rather than speed is going to undoubtedly be a detriment to that process; PGO will *not* rework an entire function for speed but it *may* inline it in a frequently called function - nonetheless, you still end up with code that was optimised for size.
Your thoughts would be much more effective if expressed in the form of a patch. :)
there is nothing I can offer a patch for; build config does not exist in the source code, it is something you produce yourself.
This is untrue. We use the default optimization flags, which are specified in configure.in.
right you are, you learn something new every day, patch attached :)
I pushed this to the try server for you: https://tbpl.mozilla.org/?tree=Try&rev=9a0e97ebb2f0 It's set to do a Win32 PGO build and run all the Talos performance tests on that, so we should be able to see what kind of perf impact it has.
Assignee: nobody → olipro
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Comment on attachment 605849 [details] [diff] [review] tiny patch to MSVC opt flag Requesting review on behalf of Olipro.
Attachment #605849 - Flags: review?(ted.mielczarek)
Product: Firefox → Core
QA Contact: build.config → build-config
Comment on attachment 605849 [details] [diff] [review] tiny patch to MSVC opt flag FWIW, we have data from MSVC2005 that -O1 produces faster resulting than -O2 in our particular case: this is because we have lots of cold code, and the memory/cache effects dominate the runtime effects when we're not doing PGO. It's not clear to me why this patch would make any difference when running the first phase of a PGO build: the assertion "the fact that this first iteration is optimised for size rather than speed is going to undoubtedly be a detriment to that process; PGO will *not* rework an entire function for speed but it *may* inline it in a frequently called function" neigher seems undoubtable nor correct to me: MSVC with PGO and LTCG definitely does choose to optimize frequently-called functions for speed while optimizing cold functions or even function sections for space.
Interestingly, this change apparently made us fail TestDllInterceptor: https://tbpl.mozilla.org/php/getParsedLog.php?id=10072600&tree=Try
(In reply to Ted Mielczarek [:ted] from comment #12) > Interestingly, this change apparently made us fail TestDllInterceptor: > https://tbpl.mozilla.org/php/getParsedLog.php?id=10072600&tree=Try I'm not entirely surprised. The Dll interceptor has a limited support of x86 instructions, and changing the way the test is compiled changes the instructions in the rotatePayload function, making the interceptor possibly fail. Considering we're only really intercepting functions from system dlls, a way to get around the problem would be to force building that test with -O1, or add a pragma that changes optim flags for rotatePayload only.
Mike - I'm happy to pragma that section of code if you can point me to the file it lives in
(In reply to Olipro from comment #14) > Mike - I'm happy to pragma that section of code if you can point me to the > file it lives in ./toolkit/xre/test/win/TestDllInterceptor.cpp
Patch included, please test it - if you have no luck, I'll modify it to totally disable optimisation in this file.
This needs perf comparisons done. I had pushed it to try back in comment 9, and then realized I needed to push an effectively empty changeset to try to compare against, but that got lost in the shuffle. I can't r+ this unless we know what the perf difference is going to look like.
Both patches + the MOZ_PGO=1 change, with full tests requested... https://tbpl.mozilla.org/?tree=Try&rev=114cefd89ddd Unpatched + the MOZ_PGO=1 change, with full tests requested... https://tbpl.mozilla.org/?tree=Try&rev=532d94b0db0a Compare URL: http://perf.snarkfest.net/compare-talos/index.html?oldRevs=114cefd89ddd&newRev=532d94b0db0a&submit=true (Though I imagine we'll need a number of talos retriggers on both runs before the noise in the comparison is reduced)
Thanks for taking this on. I just lost track of it and haven't had time to pick it back up.
In my local testing this accounts for half of the Speedometer difference between clang-cl and MSVC builds of 32-bit Firefox. Speedometer v2 score and xul.dll size: 120.8 61968896 cl.exe with -O1 -Oi (no PGO) 131.7 71083520 cl.exe with -O2 (no PGO) 142.9 69440512 clang-cl.exe with -O1 -Oi (no PGO) We'll need to test whether this is still true under PGO -- I'll kick off some try builds.
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=5a0a5aed7d0e&newProject=try&newRevision=3f130f23ee9b After running the builds through PGO, things are less clear. Talos does show some potential gains (including a surprisingly large one on bloom_basic_singleton), but Speedometer doesn't benefit. Locally, I can't see a clear winner on 64-bit, and on 32-bit we maybe even _lose_ 0-3 points, but my machine is noisy. jmaher: Can I trust that bloom_basic_singleton test? The history graph shows that its typical value is in the 900s, with occasional anomalies, so I don't know if it's meaningful that we went from 360 to 328. ehsan: Do you have access to a reference machine that produces less noisy Speedometer numbers than my machine? It would be good to have a better understanding of whether we'd have to accept a loss in exchange for those Talos gains.
Also worth noting is that after PGO, there isn't really an appreciable code size difference between xul.dll's. (<200k)
I'm unfortunately out of town until next Thursday. Andrew, Naveed, any chance you can please compare the Speedometer score of the following builds on the reference machine? Before (Win64): https://queue.taskcluster.net/v1/task/PnRGJb00RIOAlI91zFrEgw/runs/0/artifacts/public/build/target.zip Before (Win32): https://queue.taskcluster.net/v1/task/Ug6IXO3-TYmLpmDnIhPpaw/runs/0/artifacts/public/build/target.zip After (Win64): https://queue.taskcluster.net/v1/task/b-sLkHB_Tji02Q9EiAyZeA/runs/0/artifacts/public/build/target.zip After (Win32): https://queue.taskcluster.net/v1/task/MavnEDf3RqGRZnZUVqXdnQ/runs/0/artifacts/public/build/target.zip
I wouldn't trust the data for bloom_basic_singleton, both the base and new revisions show much different data than what we see on autoland/inbound- yet they have many replicates and there is not a lot of noise in them.
(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from comment #23) > I'm unfortunately out of town until next Thursday. Andrew, Naveed, any > chance you can please compare the Speedometer score of the following builds > on the reference machine? > > Before (Win64): > https://queue.taskcluster.net/v1/task/PnRGJb00RIOAlI91zFrEgw/runs/0/ > artifacts/public/build/target.zip 72.49 (but the window height was likely too small) 79.08 (second run with better window size) > After (Win64): > https://queue.taskcluster.net/v1/task/b-sLkHB_Tji02Q9EiAyZeA/runs/0/ > artifacts/public/build/target.zip 80.87 78.46
(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from comment #23) > Before (Win32): > https://queue.taskcluster.net/v1/task/Ug6IXO3-TYmLpmDnIhPpaw/runs/0/ > artifacts/public/build/target.zip 83.57 (the little "choose what you share" thing popped up at some point during the test) 82.01 > After (Win32): > https://queue.taskcluster.net/v1/task/MavnEDf3RqGRZnZUVqXdnQ/runs/0/ > artifacts/public/build/target.zip 82.62 (this time I made sure I dismissed the "choose what you share" thing before pressing start) 81.88
Thanks. At this point I don't see a compelling reason to continue with this patch. Does anyone feel differently?
No, I agree. Changing compiler flags also adds risk, and I don't think this is necessarily worth doing since it is unclear what we would gain from doing it. Thanks for investigating it!
No longer blocks: Speedometer_V2
So I guess this is WONTFIX then ...
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.