Closed
Bug 409803
Opened 18 years ago
Closed 17 years ago
gcc 4.1/4.2 zealously avoid inlining at -Os
Categories
(Firefox Build System :: General, defect, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
mozilla1.9
People
(Reporter: dwitte, Assigned: dwitte)
References
()
Details
(Keywords: memory-footprint, perf)
Attachments
(1 file, 3 obsolete files)
4.41 KB,
patch
|
ted
:
review+
|
Details | Diff | Splinter Review |
it turns out that gcc 4.1.2 on linux, at our default optimization setting "-Os -freorder-blocks -fno-reorder-functions", avoids inlining even trivial functions (where the cost of doing so is less than even the fncall overhead). this is bad news for things like nsTArray, nsCOMPtr etc, which can result in many layers of wrapper calls if not inlined sensibly.
gcc has an option to control inlining, "-finline-limit=n", which will (roughly) inline functions up to length n pseudo-instructions. to give some sense for numbers, the default value of n at -O2 is 600. i ran some tests and found that with our current settings and -finline-limit=50 on a 32-bit linux build, which is enough to inline trivial (one or two line) wrapper methods but no more, we can get a codesize saving of 225kb (2%), a Ts win of 3%, a Txul win of 18%, and a Tp2 win of about 25% (!).
i also compared this to plain -O2: Txul is unchanged, Ts improves 3%, and Tp2 improves about 4%. however, codesize jumps 2,414kb (19%). maybe we can increase the inline limit at -Os to get back a bit of this perf, without exploding codesize. (we originally moved from -O2 to -Os on gcc 3.x, because it gave a huge codesize win and also a perf win of a few percent on Ts, Txul, and Tp. so, it seems gcc4.x behaves quite differently.)
it should be noted that our linux ref plat, which uses gcc4.1.1, shows a tiny codesize gain and no perf change with -finline-limit=50. this may be because it already inlines some at -Os, or because the -finline-limit option doesn't work the same. but in any case, with linux, distros are what matter - and they'll all be building on newer gcc's than 4.1.1 nowadays. (i haven't run numbers on gcc 4.2 yet.)
-finline-limit is implemented in gcc 4.0 also, so it might be worthwhile playing around with it on mac, since it's a simple way to alter the codesize/perf tradeoff.
Comment 1•18 years ago
|
||
Holy shit!
Assignee | ||
Comment 2•18 years ago
|
||
i ran some tests doubling -finline-limit to 100, which starts to inline bigger functions (3-5 lines or so, so probably a large fraction of constructors/destructors and such). Tp2 results are mixed, some slower and some faster, but overall they're a few percent faster than -Os with -finline-limit=50 and about on par with -O2. codesize is about 500k larger than at 50. so this is a good knob to tune for size/speed tradeoff.
Comment 3•18 years ago
|
||
__attribute__((always_inline)) in combination with declaring the function inline should do the trick, perhaps behind a macro:
#ifdef __GNUC__
#define ALWAYS_INLINE __attribute__((always_inline))
#else
#define ALWAYS_INLINE /* nothing */
#endif
Comment 5•18 years ago
|
||
I would prefer to avoid sprinkling always-inline markers all over our code (MSVC has an equivalent attribute). If we can make the compiler do the right thing automatically via optimization options that would be much better... I'm appalled but not terribly surprised that -Os doesn't do the right thing automatically.
![]() |
||
Comment 7•18 years ago
|
||
Have we filed a gcc bug to make -Os do the right thing?
Assignee | ||
Comment 8•18 years ago
|
||
not yet (assuming there isn't one already, and/or they haven't fixed it in 4.3). ubuntu compiles at -O2 and thus sidesteps the problem; fedora 8 uses 4.1.2 and compiles at -Os, so if they're prime for a win if they respin their build.
since 4.3 is where any fix will be (and caillon informs me redhat is about to stick that into rawhide), i'd like to get some data on that next.
Comment 9•18 years ago
|
||
(In reply to comment #8)
> not yet (assuming there isn't one already, and/or they haven't fixed it in
> 4.3). ubuntu compiles at -O2 and thus sidesteps the problem;
We need to make sure they aren't building -O2 with --enable-optimize="-O2", since that will build spidermonkey with -O2 and make it slower.
> fedora 8 uses
> 4.1.2 and compiles at -Os, so if they're prime for a win if they respin their
> build.
>
> since 4.3 is where any fix will be (and caillon informs me redhat is about to
> stick that into rawhide), i'd like to get some data on that next.
Assignee | ||
Comment 10•18 years ago
|
||
(In reply to comment #9)
> We need to make sure they aren't building -O2 with --enable-optimize="-O2",
> since that will build spidermonkey with -O2 and make it slower.
they are. they also stick a whole bunch of other options into --enable-optimize, since it's a convenient dumping ground to pass flags into the compiler.
what should we be telling them to do instead? patch their desired optimize setting into configure.in?
Assignee | ||
Comment 11•18 years ago
|
||
i should note that fedora, and probably every other distro, does this too.
for instance, my x86_64 fedora 7 install uses:
'--enable-optimize=-Os -g -pipe -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i386 -mtune=generic -fasynchronous-unwind-tables'
Comment 12•18 years ago
|
||
We should ask them what they want to happen. It's unlikely that the entire browser is optimally compiled with the same settings, so we should tell them to change to configure.in or we should change the way --enable-optimize works.
Comment 13•18 years ago
|
||
Ideally most of those options would just be passed in CFLAGS/CXXFLAGS:
CFLAGS="-D_FORTIFY_SOURCE=2 ..." ../configure --options here
although why they would pass -fexceptions and such is beyond me... it's just useless code generation.
Comment 14•18 years ago
|
||
In our case, those are simply the default optimization flags in a standard rpmbuild, though I did s,-O2,-Os, during the gcc3.x days when it actually did good.
![]() |
||
Comment 15•18 years ago
|
||
I should note that gcc 4.0.2 (what I have over here) certainly seems to be inlining things when compiling with -Os. So it sounds like this is a regression on the gcc end...
Comment 16•18 years ago
|
||
Have we tried similar knobs on the mac? Are we building with stack-protector on that (looks like it requires XCODE 3.0 so prob not).
Updated•18 years ago
|
Flags: blocking-firefox3+
Priority: -- → P2
Assignee | ||
Comment 17•18 years ago
|
||
i've tested gcc 4.3 a bit. to summarize, it looks like this pathological -Os behavior is specific to 4.1 branch, and possibly just 4.1.2. also, there are some substantial perf and codesize wins to be had with gcc 4.3.
gory details: tested with gcc 4.3 (20080104 pull). "stock configuration" is "-Os -freorder-blocks -fno-reorder-functions". some Tp2 numbers:
baseline: gcc 4.3, stock: 142.78 ms
stock, with -finline-limit=50: 146.89 ms (+2.9%)
-O2: 131.56 ms (-7.9%)
for comparison with previous results (comment 0):
gcc 4.1.2, stock: 199 ms (+39%)
stock, with -finline-limit=50: 149.33 ms (+4.6%)
-O2: 142.67 ms (even)
|size libxul.so|
gcc 4.3, stock: 12,387kb
stock, with -finline-limit=50: 12,325kb (-62kb)
-O2: 15,061kb (+2,674kb)
gcc 4.1.2, stock: 13,249kb (+862kb)
stock, with -finline-limit=50: 13,025kb (+638kb)
-O2: 15,440kb (+3,053kb)
a few points from this data:
1) -Os is very sane on 4.3 by default.
2) on 4.3, relative to -Os, -O2 has improved a lot (8% Tp win, although at a 2.7Mb codesize cost).
3) 4.3 is 5 - 8% faster on Tp2 than 4.1.2, depending on -Os/-O2.
4) 4.3 gives an 400-800k codesize saving over 4.1.2.
3 & 4) are probably the same thing - a result of the hidden visibility propagation improvements introduced in gcc 4.2. these are a major win for us.
Comment 18•18 years ago
|
||
(In reply to comment #5)
> I would prefer to avoid sprinkling always-inline markers all over our code
> (MSVC has an equivalent attribute).
I sort of agree, but how is it worse than adding the |inline| keyword to a method? I don't see any real difference in purity between |inline| and |ALWAYS_INLINE| or even just |INLINE|.
Comment 19•18 years ago
|
||
It's worse because you have to make an optimization decision in the code. Methods that look like they would be good inlines aren't always... the compiler should make that choice. Besides that the closer we stick to standard C++, the more approachable our codebase is.
Comment 20•18 years ago
|
||
So you test before doing so, and you only use it when tests show it's a win, which you should have been doing anyway. And while approachability is nice, I think it loses to actual perf gains for an obviously-named macro that's simple to understand.
Comment 21•18 years ago
|
||
(In reply to comment #20)
Any given change could be best for one platform but not for another. Or even one _version_ of a platform. Also, would sprinkling more inline keywords around mean that redistributors/embeddors need to test to make sure that whatever code level optimization decisions made are actually right for their supported compiler/hardware combinations?
No, I think the compiler's optimizer really should be doing this work.
Comment 22•18 years ago
|
||
The point of this bug is that the compiler's optimizer _isn't_ doing this work, no? "Should" doesn't improve the quality of code we ship to users, alas.
Comment 23•18 years ago
|
||
If our choice is "pick special optimizer options via -finline-limit" or "sprinkle our codebase with FORCE_INLINE" I will pick the former without a second thought... is that not the choice here?
Comment 24•18 years ago
|
||
Right, I meant that if the optimizer isn't doing the work with the default options we're giving it, we should find better options to let it do better work. I agree with bsmedberg.
Comment 25•18 years ago
|
||
also see bug 237041. gcc 3.3. not sure if these options are interesting or not. They were provided by a embedder.
Comment 26•18 years ago
|
||
The inline keyword makes sense if you have a situation where you have knowledge that the compiler can't possibly deduce; for instance, if there is an nsCOMPtr method that expands to larger than the -finline-limit, but we know it's used in some critical inner loops, and so it's a win to inline anyway. But each such keyword use should be accompanied with a long comment explaining why, including perf statistics, and many of those will want to be conditionalized per platform. As an example, on the Mac, received wisdom at Apple is that there is so much memory pressure (from all the frameworks, etc) that the reduced footprint of -Os is a net win, even it means losing out on some inlining opportunities.
Updated•17 years ago
|
Assignee: dwitte → nobody
Flags: blocking-firefox3+
Product: Firefox → Core
QA Contact: build.config → build-config
Target Milestone: --- → mozilla1.9
Version: unspecified → Trunk
Comment 28•17 years ago
|
||
This is really a core bug...
This matters on Linux, since the refplatform uses gcc 4.1.1, is it better to bump the compiler version?
I don't think this blocks at this point, but will leave to platform drivers to decide if we want to monkey with this.
Assignee: nobody → dwitte
Flags: blocking1.9?
If there's a 25% Tp win this definitely blocks.
Comment 30•17 years ago
|
||
I think since we can easily change this via a compiler setting, we should just change the setting and look at the effect on numbers. That's a lot easier than updating our refplatform. dwitte: can you submit a patch to just put a sensible -finline-limit in fx-linux-tbox's mozconfig? (CFLAGS/CXXFLAGS) That would let us assess the perf win, and we can move it into configure if it's good.
Flags: blocking1.9? → blocking1.9+
Assignee | ||
Comment 31•17 years ago
|
||
Attachment #308835 -
Flags: review?(ted.mielczarek)
Comment 32•17 years ago
|
||
Comment on attachment 308835 [details] [diff] [review]
tweak tinderbox config
CFLAGS/CXXFLAGS are already defined at the top of the file! Just append to those. Let's get this in and see what happens.
Attachment #308835 -
Flags: review?(ted.mielczarek) → review+
Assignee | ||
Comment 33•17 years ago
|
||
test landed on fx-linux-tbox... tried two settings, -finline-limit=50 (conservative) and -finline-limit=100 (more aggressive). the nightly went out with 50. results from qm-plinux-fast01:
baseline -finline-limit=50 -finline-limit=100
--------------------------------------------------------------------
Tp 231 190 (-17.7%) 183 (-20.8%)
Txul 246 214 (-13.0%) 210 (-14.6%)
Ts 1410 1319 (- 6.5%) 1310 (- 7.1%)
Tdhtml 982 943 (- 4.0%) 934 (- 4.9%)
Tsspider 388 355 (- 8.5%) 351 (- 9.5%)
codesize -86k (- 0.6%) +494k (+ 3.5%)
so at 50 we're good for a 17.7% Tp win and an 86k codesize drop!
question is do we want to do 100 - gets us an extra 3% Tp and a tiny bit on Ts, costs 579k in codesize above the 50 setting. 3% Tp for 4.1% codesize? this one isn't so obvious ;)
when we deploy this, changing the optimization settings in configure.in will be slightly different to what we did here (with CFLAGS/CXXFLAGS). we should see what other modules are using -Os and make sure to change them too. (what about NSPR/NSS?)
Comment 34•17 years ago
|
||
> so at 50 we're good for a 17.7% Tp win and an 86k codesize drop!
>
> question is do we want to do 100 - gets us an extra 3% Tp and a tiny bit on Ts,
> costs 579k in codesize above the 50 setting. 3% Tp for 4.1% codesize? this one
> isn't so obvious ;)
>
Take the easy sure win (50).
Assignee | ||
Comment 35•17 years ago
|
||
so, probably want to do this by default, so users & distros have better chance of getting things right out-of-box. this doesn't really hurt anything on newer gcc's, since -finline-limit=50 is about what -Os should be doing anyway. we could maxversion this flag, but there's something to be said for keeping it simple - especially since we need this flag in two other places, xpcom/io and js/src.
Attachment #308835 -
Attachment is obsolete: true
Attachment #309169 -
Flags: review?(ted.mielczarek)
Comment 36•17 years ago
|
||
Comment on attachment 309169 [details] [diff] [review]
tweak defaults
Please back out your changes to the fx-linux-tbox mozconfig when you check this in. No need to leave extra CFLAGS scattered all over the place.
Attachment #309169 -
Flags: review?(ted.mielczarek) → review+
Assignee | ||
Comment 38•17 years ago
|
||
checked mac on gcc4.0.1 briefly; -Os seems sane there, and we mostly use -O2 on mac, so nothing to do there.
Any chance we can get this in today? I think that the config flag change should ship with b5.
Comment 40•17 years ago
|
||
Comment on attachment 309175 [details] [diff] [review]
updated patch for js/src
I gave this r- over IRC with dwitte. We shouldn't set -finline-limit on anything other than GCC 4.1.x.
Attachment #309175 -
Flags: review-
Assignee | ||
Comment 41•17 years ago
|
||
doing so will require more changes to configure.in to sniff the GCC version. i don't know if i can get to this today, but i'll try.
Comment 42•17 years ago
|
||
(In reply to comment #39)
> Any chance we can get this in today? I think that the config flag change
> should ship with b5.
Vlad: we already enabled this on the tinderbox, so our nightly builds should be fine. This follow-on patch is just about making it the default in configure.
Updated•17 years ago
|
Summary: gcc zealously avoids inlining at -Os → gcc (4.1 only) zealously avoids inlining at -Os
Assignee | ||
Comment 43•17 years ago
|
||
this may also apply to gcc 4.0 and 4.2. can anyone with those compilers, on linux, test?
Assignee | ||
Comment 44•17 years ago
|
||
a stab at version checking. not sure how you'd prefer it done, so i kept it pretty centralized and simple. alternatively, could do a MOZ_OPTIMIZE_SIZE that rolls up "-Os -finline-limit=50"...
Attachment #309169 -
Attachment is obsolete: true
Attachment #309175 -
Attachment is obsolete: true
Attachment #309635 -
Flags: review?(ted.mielczarek)
Comment 45•17 years ago
|
||
Comment on attachment 309635 [details] [diff] [review]
tweak defaults v2
Looks reasonable. I don't think you need the bug number, isn't that what blame is for? (But I also don't care enough to make you change it if you really want it there.)
Attachment #309635 -
Flags: review?(ted.mielczarek) → review+
Assignee | ||
Comment 46•17 years ago
|
||
landed (minus bug#), and reverted tbox mozconfig flags. looks like the new patch dropped Ts another 1%, but that could be artifactual.
-> fixed, nothing to do on mac, and we don't have evidence that gcc4.2 needs this.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 47•17 years ago
|
||
i tested gcc 4.2.3 and found it suffered exactly the same problem, and that -finline-limit=50 solved it in an identical way to 4.1.x. i've added 4.2.x to the case statement in configure.in.
Assignee | ||
Updated•17 years ago
|
Summary: gcc (4.1 only) zealously avoids inlining at -Os → gcc 4.1/4.2 zealously avoid inlining at -Os
![]() |
||
Comment 48•17 years ago
|
||
The gcc version check introduced here has caused bug 423261, btw, only checking for "version" in the line is not ideal.
Depends on: 423261
Comment 49•17 years ago
|
||
So would any of -finline-limit=60, ...=70, ...=80, ...=90 (or maybe even ...=57) find us a place where the code size difference is 0 and the performance gains are up, or where performance is roughly the same (or better!) and we save even more on code size? Or does the limit only take multiples of 50?
Assignee | ||
Comment 50•17 years ago
|
||
no. i investigated several settings in between 50 and 100, didn't find any sweeter spots - problem is it's a pretty granular thing, e.g. you're either inlining one line of C code or two, which as a big effect applied across the whole codebase.
i think a more interesting thing to do is play around with gcc 4.3, with its 8% Tp win at -O2, and see which specific passes in -O2 are responsible for that win - then see if we can take just those and avoid the 2MB codesize bloat. clearly they've added some sweet new passes in 4.3 that make -O2 pull ahead.
Updated•7 years ago
|
Product: Core → Firefox Build System
You need to log in
before you can comment on or make changes to this bug.
Description
•