So, bryner ran an experiment on the redwood firebird tbox (gcc 3.3.2) a few days ago, by switching the optimization flag to use -Os instead of -O2. codesize reduced by 1,545kb out of 14,387kb, a 10.7% reduction. the impact on perf metrics seems to be neutral overall - Ts remained about the same, Txul improved by ~1%, and Tp got larger by a barely measurable amount (maybe ~0.5%).

http://tinderbox.mozilla.org/showbuilds.cgi?tree=Phoenix&hours=24&maxdate=1068502523&legend=0

i've done a comparison between my local builds of seamonkey, between -O2 and -Os. for seamonkey, we save 9.3% in binary codesize. i don't have a -O comparison on-hand at the moment, but i'm rolling one to compare.

given that perf remains neutral, and we save about 10% in binary size, is this something we want to do for gcc builds of seamonkey nightlies/releases?
afaict we use -O2 for gcc release builds. it seems the contributed gtk2/xft linux builds use -O3... ;)
my -O build finished: binary size reduces by 2.7% relative to -O2. i'm assuming performance metrics are in between those of -Os and -O2, and hence are also neutral. so it appears -Os is the sweet spot here...
I'd prefer to see a bit more detailed performance analysis before we do this. However if things look good then I'm all for it. The reason why performance doesn't degrade could simply be that there is less code. So we'd end up swapping code less frequently and we'd hit instruction-caches more often.
what kind of performance analysis would you suggest? imo, for this kind of change we'd be interested in fairly broad metrics like the ones we have, Ts/Tp/Txul. so perhaps switching one of the tinderboxen to -Os (luna?) would be a good start (ignoring for the moment that it runs a slightly older gcc, 3.2). that said, i think the data we already have for firebird is perfectly applicable to seamonkey.

i've run some Ts/Txul tests locally on a p3-550, linux/gtk2, gcc 3.3.2. the Ts tests are not useful, because the standard deviation is far too high (~10%) for any changes to be visible:

              -Os       -O2
  Ts avg      3518.6    3505.15
  Ts stdev    280.6     299.2

however, my Txul tests show a larger improvement than the firebird tests did (most likely due to the different perf characteristics of the p3-550). these results have a pretty low standard deviation (< 0.5%), and so are statistically significant:

                -Os      -O2      improvement (Os relative to O2)
  Txul avg      970.4    998.2    2.8%
  Txul stdev    27.3     26.0

i'm unable to test Tp since i'm outside the firewall.
er, those standard deviations should read: Txul stdev 4.5 3.8
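For anyone re-checking the arithmetic behind these figures, here is a minimal sketch in C++. The function names are made up for illustration and are not part of any Mozilla test harness. It only assumes that lower timings are better, as with Ts/Txul.

```cpp
#include <cassert>

// Percent improvement of `candidate` over `baseline` for timings where
// lower is better (e.g. Txul in ms). Positive means candidate is faster.
double percent_improvement(double baseline, double candidate) {
    assert(baseline > 0.0);
    return (baseline - candidate) / baseline * 100.0;
}

// Standard deviation expressed as a percentage of the mean, which is
// the figure quoted above as "~10%" for Ts and "< 0.5%" for Txul.
double relative_stdev_percent(double mean, double stdev) {
    assert(mean > 0.0);
    return stdev / mean * 100.0;
}
```

Calling percent_improvement(998.2, 970.4) gives roughly 2.8, and relative_stdev_percent(970.4, 4.5) comes out a little under 0.5, matching the Txul numbers with the corrected standard deviations. The Ts spread (280.6 on 3518.6) works out to about 8%, far larger than any difference between the two Ts averages.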
I'd like to see at least Tp measured as well, so making the switch on one of the tinderboxen sounds like a good idea. Also, if you have any dhtml-tests or js-tests handy that would be good, but it's no requirement on my part (I know they exist but I don't know where, sorry).
There are some scattered in various bugs... (search for "dhtml perf").
Another thing that someone might want to investigate is tweaking gcc's inliner. Dropping the inline limit in half (-finline-limit=300) on gcc-3.3.2 reduced the code size by another 440K. More is probably achievable by changing this value or the underlying parameters (max-inline-*).
There are some large functions that we really do want to inline, since they're only used once or twice. I'd rather tweak inlining by finding the things that really shouldn't be inlined (probably in the string code) and making them not inline.
David: Are you sure these functions are really being inlined? MSVC has a pretty low limit for what it is willing to inline (for example some of the nsVoidArray functions aren't always inlined) and gcc too has a limit for what it will inline. So in general you shouldn't rely on having your functions inlined unless they are really small.
the only way to positively force inlining is by using the gcc __attribute__((always_inline)). having said that, i agree with dbaron's view, especially as applied to strings... the inlining model there is whacky. i'm sure we could do great things for both codesize and perf by fixing that.
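A rough sketch of what per-function inlining control looks like with gcc attributes, as opposed to a global -finline-limit. The SKETCH_* macro names and the functions are hypothetical and are not taken from the tree; always_inline and noinline are the gcc attributes, and the non-gcc fallback simply drops them.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__GNUC__)
#  define SKETCH_ALWAYS_INLINE inline __attribute__((always_inline))
#  define SKETCH_NEVER_INLINE  __attribute__((noinline))
#else
#  define SKETCH_ALWAYS_INLINE inline
#  define SKETCH_NEVER_INLINE
#endif

// Bulkier helper: kept out of line so each caller does not carry a copy.
SKETCH_NEVER_INLINE std::size_t slow_length(const char* s) {
    assert(s != nullptr);
    std::size_t n = 0;
    while (s[n] != '\0') ++n;
    return n;
}

// Tiny accessor: forced inline regardless of the inliner's size limits.
SKETCH_ALWAYS_INLINE bool is_empty(const char* s) {
    return s == nullptr || s[0] == '\0';
}

// Caller: the cheap check is inlined here, the loop stays a real call.
std::size_t length_or_zero(const char* s) {
    return is_empty(s) ? 0 : slow_length(s);
}
```

The idea is to mark the few functions where the choice matters and leave the rest to the compiler's heuristics. Which string functions would actually deserve either attribute is something only measurement would show.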
Note that -Os seems to trigger a bunch of compiler bugs; depending on the target CPU type you may see "simple" defects like bug 233497 (on x86/IA32, a simple |if()|/|else|-construct will only use the |else|-branch etc.) or totally defunct binaries (like on SPARC).
Of note is that while overall (compressed) tarball size does in fact drop by about 10%, the size of some libraries drops by more than that. gklayout and necko (both stripped) drop by about 20% here (-O2 compared to -Os, gcc 3.2). xpcom, docshell, and a few others drop by 10%. uconv drops by 2%. So on some libraries we're actually seeing a huge win from -Os (20% of gklayout is about 900KB). Frankly, I would be in favor of flipping the switch sometime in an alpha milestone (like now, say) for tinderbox and the nightlies and seeing what happens. Once we have nightlies with the change, we can put out a call to people who do DHTML stuff (most of whom don't build) to compare the new and old builds....
In other words, we have all these nightlies that are _supposed_ to be for testing purposes and we have people testing them. We should make use of that.
Compare bug 53486

> if you have any dhtml-tests
<http://www.world-direct.com/mozilla/dhtml/funo/domtestcases/index.htm>
firefox is using -Os, any reason not to switch comet (seamonkey release) or luna (seamonkey perf tests) over to doing -Os builds at this point, or do we want to wait for post 1.8?
Assignee: leaf → cmp
Priority: -- → P3
*** Bug 53486 has been marked as a duplicate of this bug. ***
granrose: switching now sounds entirely reasonable to me.
I think we should get dbaron's approval to change the tinderboxen; we generally prefer the historical comparison in the numbers by using the same build flags (which is why btek still uses egcs), even if this doesn't produce the most optimized builds.
FWIW, I'd expect -O2 builds to be faster than -Os, especially with newer gccs, thanks to basic block reordering. (We've tagged a few hotspots with NS_LIKELY / NS_UNLIKELY since comment 0 happened, so it could be worth re-measuring.) I'd rather not change tinderboxes that are generating performance data. I think we already have some with -O2 and some with -Os.
dbaron: the results in comment 4 (alas, Txul only, no Tp measurements) were done with 3.3.2... did block reordering come in recently (3.4), or are my results still representative?
IIRC, NS_LIKELY and NS_UNLIKELY are more recent than comment 4. From memory:
 * gcc 3.3.x does basic block reordering (-freorder-blocks) at -O2 but not -Os
 * gcc 3.4 also does ,pt / ,pf annotations on conditional jump instructions (which solves the branch prediction problem but not the cache miss problem that's solved by -freorder-blocks), but I'm not sure at what optimization levels.
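For readers who haven't met this kind of annotation, here is a minimal sketch of branch hints built on gcc's __builtin_expect. The real NS_LIKELY / NS_UNLIKELY definitions are not quoted in this bug, so the SKETCH_* macros below are an assumption about how such macros are commonly written, and sum_valid is a made-up example.

```cpp
#include <cassert>

#if defined(__GNUC__)
#  define SKETCH_LIKELY(x)   (__builtin_expect(!!(x), 1))
#  define SKETCH_UNLIKELY(x) (__builtin_expect(!!(x), 0))
#else
#  define SKETCH_LIKELY(x)   (!!(x))
#  define SKETCH_UNLIKELY(x) (!!(x))
#endif

// Sums the non-negative entries and counts the negative ones, which
// this sketch treats as the rare case.
int sum_valid(const int* values, int count, int* rejected) {
    assert(values != nullptr && rejected != nullptr);
    int total = 0;
    *rejected = 0;
    for (int i = 0; i < count; ++i) {
        if (SKETCH_UNLIKELY(values[i] < 0)) {
            ++*rejected;  // rare path; the hint marks it as unexpected
            continue;
        }
        total += values[i];  // expected path
    }
    return total;
}
```

The hint never changes the result, only the compiler's guess about which branch is hot. A pass that lays out code by expected frequency can use that guess to keep the common path contiguous, which is presumably why the tagged hotspots could shift the -O2 vs -Os comparison.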
Mass re-assign of bugs that aren't on the build team radar, so bugs assigned to firstname.lastname@example.org reflects reality. If there is a bug you really think we need to be looking at, please *email* email@example.com with a bug number and explanation.
Assignee: build → nobody
Apparently Linux releases on the 1.8 branch have been built -Os for a while; Chris Cooper added this in November as part of migrating tinderbox bits to the public repository, as seen in http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/tools/tinderbox-configs/firefox/linux/mozconfig&rev=MOZILLA_1_8_BRANCH_release . Mac is being built -O2 on trunk and branches.
Is this still something we're looking into or should it be closed in some way?
I still think this deserves investigation. At least, we should revisit some performance testing with newer gccs
At the very least we need to do a -Os/-O2 comparison on Macs.
Assignee: stanshebs → nobody
Component: Build Config → Build Config
Product: Mozilla Application Suite → Core
QA Contact: build-config
What's the relation to bug 409803 and possibly other bugs (cc'ing sayrer)? I can guess, but it would be great to have our story for 1.9/fx3 sorted out soon, so nominating blocking. /be
(In reply to comment #30)
> What's the relation to bug 409803 and possibly other bugs (cc'ing sayrer)? I
> can guess, but it would be great to have our story for 1.9/fx3 sorted out soon,
> so nominating blocking.

To recap:

We build -Os for release builds on linux.
We build -O2 for release builds on mac.
We build -O1 on msvc (it's somewhere between GCC's -Os and -O2, it does inline etc.)

I tried building mac at -Os, and saw a ~5% slowdown on Tdhtml and a 2-3% slowdown on Tp/Tp2. However, the code was quite a bit smaller. To me, that indicates certain parts of the tree are faster at -O2 and others at -Os. For example, we know spidermonkey is better at -Os.
the 5% slowdown could be due (in full or part) to bug 409803 - any data we can get on mac gcc4.0 regarding that would be gold, and might make it easier to figure out module-specific settings. (speculation here, but the bug mostly affects code that makes heavy use of c++ wrappers, e.g. string libs, which might explain why spidermonkey isn't affected?)
+ing so we figure out one way or another
Flags: blocking1.9? → blocking1.9+
Status: NEW → RESOLVED
Last Resolved: 11 years ago
Resolution: --- → WORKSFORME