Closed Bug 409803 Opened 18 years ago Closed 17 years ago

gcc 4.1/4.2 zealously avoid inlining at -Os

Categories

(Firefox Build System :: General, defect, P2)

x86
Linux
defect

Tracking

(Not tracked)

RESOLVED FIXED
mozilla1.9

People

(Reporter: dwitte, Assigned: dwitte)

References

()

Details

(Keywords: memory-footprint, perf)

Attachments

(1 file, 3 obsolete files)

it turns out that gcc 4.1.2 on linux, at our default optimization setting "-Os -freorder-blocks -fno-reorder-functions", avoids inlining even trivial functions (where the cost of doing so is less than even the fncall overhead). this is bad news for things like nsTArray, nsCOMPtr etc, which can result in many layers of wrapper calls if not inlined sensibly. gcc has an option to control inlining, "-finline-limit=n", which will (roughly) inline functions up to length n pseudo-instructions. to give some sense for numbers, the default value of n at -O2 is 600. i ran some tests and found that with our current settings and -finline-limit=50 on a 32-bit linux build, which is enough to inline trivial (one or two line) wrapper methods but no more, we can get a codesize saving of 225kb (2%), a Ts win of 3%, a Txul win of 18%, and a Tp2 win of about 25% (!). i also compared this to plain -O2: Txul is unchanged, Ts improves 3%, and Tp2 improves about 4%. however, codesize jumps 2,414kb (19%). maybe we can increase the inline limit at -Os to get back a bit of this perf, without exploding codesize. (we originally moved from -O2 to -Os on gcc 3.x, because it gave a huge codesize win and also a perf win of a few percent on Ts, Txul, and Tp. so, it seems gcc4.x behaves quite differently.) it should be noted that our linux ref plat, which uses gcc4.1.1, shows a tiny codesize gain and no perf change with -finline-limit=50. this may be because it already inlines some at -Os, or because the -finline-limit option doesn't work the same. but in any case, with linux, distros are what matter - and they'll all be building on newer gcc's than 4.1.1 nowadays. (i haven't run numbers on gcc 4.2 yet.) -finline-limit is implemented in gcc 4.0 also, so it might be worthwhile playing around with it on mac, since it's a simple way to alter the codesize/perf tradeoff.
Holy shit!
i ran some tests doubling -finline-limit to 100, which starts to inline bigger functions (3-5 lines or so, so probably a large fraction of constructors/destructors and such). Tp2 results are mixed, some slower and some faster, but overall they're a few percent faster than -Os with -finline-limit=50 and about on par with -O2. codesize is about 500k larger than at 50. so this is a good knob to tune for size/speed tradeoff.
__attribute__((always_inline)) in combination with declaring the function inline should do the trick, perhaps behind a macro: #ifdef __GNUC__ #define ALWAYS_INLINE __attribute__((always_inline)) #else #define ALWAYS_INLINE /* nothing */ #endif
I would prefer to avoid sprinkling always-inline markers all over our code (MSVC has an equivalent attribute). If we can make the compiler do the right thing automatically via optimization options that would be much better... I'm appalled but not terribly surprised that -Os doesn't do the right thing automatically.
Have we filed a gcc bug to make -Os do the right thing?
not yet (assuming there isn't one already, and/or they haven't fixed it in 4.3). ubuntu compiles at -O2 and thus sidesteps the problem; fedora 8 uses 4.1.2 and compiles at -Os, so if they're prime for a win if they respin their build. since 4.3 is where any fix will be (and caillon informs me redhat is about to stick that into rawhide), i'd like to get some data on that next.
(In reply to comment #8) > not yet (assuming there isn't one already, and/or they haven't fixed it in > 4.3). ubuntu compiles at -O2 and thus sidesteps the problem; We need to make sure they aren't building -O2 with --enable-optimize="-O2", since that will build spidermonkey with -O2 and make it slower. > fedora 8 uses > 4.1.2 and compiles at -Os, so if they're prime for a win if they respin their > build. > > since 4.3 is where any fix will be (and caillon informs me redhat is about to > stick that into rawhide), i'd like to get some data on that next.
(In reply to comment #9) > We need to make sure they aren't building -O2 with --enable-optimize="-O2", > since that will build spidermonkey with -O2 and make it slower. they are. they also stick a whole bunch of other options into --enable-optimize, since it's a convenient dumping ground to pass flags into the compiler. what should we be telling them to do instead? patch their desired optimize setting into configure.in?
i should note that fedora, and probably every other distro, does this too. for instance, my x86_64 fedora 7 install uses: '--enable-optimize=-Os -g -pipe -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i386 -mtune=generic -fasynchronous-unwind-tables'
We should ask them what they want to happen. It's unlikely that the entire browser is optimally compiled with the same settings, so we should tell them to change to configure.in or we should change the way --enable-optimize works.
Ideally most of those options would just be passed in CFLAGS/CXXFLAGS: CFLAGS="-D_FORTIFY_SOURCE=2 ..." ../configure --options here although why they would pass -fexceptions and such is beyond me... it's just useless code generation.
In our case, those are simply the default optimization flags in a standard rpmbuild, though I did s,-O2,-Os, during the gcc3.x days when it actually did good.
I should note that gcc 4.0.2 (what I have over here) certainly seems to be inlining things when compiling with -Os. So it sounds like this is a regression on the gcc end...
Have we tried similar knobs on the mac? Are we building with stack-protector on that (looks like it requires XCODE 3.0 so prob not).
Flags: blocking-firefox3+
Priority: -- → P2
i've tested gcc 4.3 a bit. to summarize, it looks like this pathological -Os behavior is specific to 4.1 branch, and possibly just 4.1.2. also, there are some substantial perf and codesize wins to be had with gcc 4.3. gory details: tested with gcc 4.3 (20080104 pull). "stock configuration" is "-Os -freorder-blocks -fno-reorder-functions". some Tp2 numbers: baseline: gcc 4.3, stock: 142.78 ms stock, with -finline-limit=50: 146.89 ms (+2.9%) -O2: 131.56 ms (-7.9%) for comparison with previous results (comment 0): gcc 4.1.2, stock: 199 ms (+39%) stock, with -finline-limit=50: 149.33 ms (+4.6%) -O2: 142.67 ms (even) |size libxul.so| gcc 4.3, stock: 12,387kb stock, with -finline-limit=50: 12,325kb (-62kb) -O2: 15,061kb (+2,674kb) gcc 4.1.2, stock: 13,249kb (+862kb) stock, with -finline-limit=50: 13,025kb (+638kb) -O2: 15,440kb (+3,053kb) a few points from this data: 1) -Os is very sane on 4.3 by default. 2) on 4.3, relative to -Os, -O2 has improved a lot (8% Tp win, although at a 2.7Mb codesize cost). 3) 4.3 is 5 - 8% faster on Tp2 than 4.1.2, depending on -Os/-O2. 4) 4.3 gives an 400-800k codesize saving over 4.1.2. 3 & 4) are probably the same thing - a result of the hidden visibility propagation improvements introduced in gcc 4.2. these are a major win for us.
(In reply to comment #5) > I would prefer to avoid sprinkling always-inline markers all over our code > (MSVC has an equivalent attribute). I sort of agree, but how is it worse than adding the |inline| keyword to a method? I don't see any real difference in purity between |inline| and |ALWAYS_INLINE| or even just |INLINE|.
It's worse because you have to make an optimization decision in the code. Methods that look like they would be good inlines aren't always... the compiler should make that choice. Besides that the closer we stick to standard C++, the more approachable our codebase is.
So you test before doing so, and you only use it when tests show it's a win, which you should have been doing anyway. And while approachability is nice, I think it loses to actual perf gains for an obviously-named macro that's simple to understand.
(In reply to comment #20) Any given change could be best for one platform but not for another. Or even one _version_ of a platform. Also, would sprinkling more inline keywords around mean that redistributors/embeddors need to test to make sure that whatever code level optimization decisions made are actually right for their supported compiler/hardware combinations? No, I think the compiler's optimizer really should be doing this work.
The point of this bug is that the compiler's optimizer _isn't_ doing this work, no? "Should" doesn't improve the quality of code we ship to users, alas.
If our choice is "pick special optimizer options via -finline-limit" or "sprinkle our codebase with FORCE_INLINE" I will pick the former without a second thought... is that not the choice here?
Right, I meant that if the optimizer isn't doing the work with the default options we're giving it, we should find better options to let it do better work. I agree with bsmedberg.
also see bug 237041. gcc 3.3. not sure if these options are interesting or not. They were provided by a embedder.
The inline keyword makes sense if you have a situation where you have knowledge that the compiler can't possibly deduce; for instance, if there is an nsCOMPtr method that expands to larger than the -finline-limit, but we know it's used in some critical inner loops, and so it's a win to inline anyway. But each such keyword use should be accompanied with a long comment explaining why, including perf statistics, and many of those will want to be conditionalized per platform. As an example, on the Mac, received wisdom at Apple is that there is so much memory pressure (from all the frameworks, etc) that the reduced footprint of -Os is a net win, even it means losing out on some inlining opportunities.
dwitte - assigning to you...
Assignee: nobody → dwitte
Assignee: dwitte → nobody
Flags: blocking-firefox3+
Product: Firefox → Core
QA Contact: build.config → build-config
Target Milestone: --- → mozilla1.9
Version: unspecified → Trunk
This is really a core bug... This matters on Linux, since the refplatform uses gcc 4.1.1, is it better to bump the compiler version? I don't think this blocks at this point, but will leave to platform drivers to decide if we want to monkey with this.
Assignee: nobody → dwitte
Flags: blocking1.9?
If there's a 25% Tp win this definitely blocks.
I think since we can easily change this via a compiler setting, we should just change the setting and look at the effect on numbers. That's a lot easier than updating our refplatform. dwitte: can you submit a patch to just put a sensible -finline-limit in fx-linux-tbox's mozconfig? (CFLAGS/CXXFLAGS) That would let us assess the perf win, and we can move it into configure if it's good.
Flags: blocking1.9? → blocking1.9+
Attached patch tweak tinderbox config (obsolete) — Splinter Review
Attachment #308835 - Flags: review?(ted.mielczarek)
Comment on attachment 308835 [details] [diff] [review] tweak tinderbox config CFLAGS/CXXFLAGS are already defined at the top of the file! Just append to those. Let's get this in and see what happens.
Attachment #308835 - Flags: review?(ted.mielczarek) → review+
test landed on fx-linux-tbox... tried two settings, -finline-limit=50 (conservative) and -finline-limit=100 (more aggressive). the nightly went out with 50. results from qm-plinux-fast01: baseline -finline-limit=50 -finline-limit=100 -------------------------------------------------------------------- Tp 231 190 (-17.7%) 183 (-20.8%) Txul 246 214 (-13.0%) 210 (-14.6%) Ts 1410 1319 (- 6.5%) 1310 (- 7.1%) Tdhtml 982 943 (- 4.0%) 934 (- 4.9%) Tsspider 388 355 (- 8.5%) 351 (- 9.5%) codesize -86k (- 0.6%) +494k (+ 3.5%) so at 50 we're good for a 17.7% Tp win and an 86k codesize drop! question is do we want to do 100 - gets us an extra 3% Tp and a tiny bit on Ts, costs 579k in codesize above the 50 setting. 3% Tp for 4.1% codesize? this one isn't so obvious ;) when we deploy this, changing the optimization settings in configure.in will be slightly different to what we did here (with CFLAGS/CXXFLAGS). we should see what other modules are using -Os and make sure to change them too. (what about NSPR/NSS?)
> so at 50 we're good for a 17.7% Tp win and an 86k codesize drop! > > question is do we want to do 100 - gets us an extra 3% Tp and a tiny bit on Ts, > costs 579k in codesize above the 50 setting. 3% Tp for 4.1% codesize? this one > isn't so obvious ;) > Take the easy sure win (50).
Attached patch tweak defaults (obsolete) — Splinter Review
so, probably want to do this by default, so users & distros have better chance of getting things right out-of-box. this doesn't really hurt anything on newer gcc's, since -finline-limit=50 is about what -Os should be doing anyway. we could maxversion this flag, but there's something to be said for keeping it simple - especially since we need this flag in two other places, xpcom/io and js/src.
Attachment #308835 - Attachment is obsolete: true
Attachment #309169 - Flags: review?(ted.mielczarek)
Comment on attachment 309169 [details] [diff] [review] tweak defaults Please back out your changes to the fx-linux-tbox mozconfig when you check this in. No need to leave extra CFLAGS scattered all over the place.
Attachment #309169 - Flags: review?(ted.mielczarek) → review+
Attached patch updated patch for js/src (obsolete) — Splinter Review
updated for bitrot.
Attachment #309175 - Flags: review+
checked mac on gcc4.0.1 briefly; -Os seems sane there, and we mostly use -O2 on mac, so nothing to do there.
Any chance we can get this in today? I think that the config flag change should ship with b5.
Comment on attachment 309175 [details] [diff] [review] updated patch for js/src I gave this r- over IRC with dwitte. We shouldn't set -finline-limit on anything other than GCC 4.1.x.
Attachment #309175 - Flags: review-
doing so will require more changes to configure.in to sniff the GCC version. i don't know if i can get to this today, but i'll try.
(In reply to comment #39) > Any chance we can get this in today? I think that the config flag change > should ship with b5. Vlad: we already enabled this on the tinderbox, so our nightly builds should be fine. This follow-on patch is just about making it the default in configure.
Summary: gcc zealously avoids inlining at -Os → gcc (4.1 only) zealously avoids inlining at -Os
this may also apply to gcc 4.0 and 4.2. can anyone with those compilers, on linux, test?
a stab at version checking. not sure how you'd prefer it done, so i kept it pretty centralized and simple. alternatively, could do a MOZ_OPTIMIZE_SIZE that rolls up "-Os -finline-limit=50"...
Attachment #309169 - Attachment is obsolete: true
Attachment #309175 - Attachment is obsolete: true
Attachment #309635 - Flags: review?(ted.mielczarek)
Comment on attachment 309635 [details] [diff] [review] tweak defaults v2 Looks reasonable. I don't think you need the bug number, isn't that what blame is for? (But I also don't care enough to make you change it if you really want it there.)
Attachment #309635 - Flags: review?(ted.mielczarek) → review+
landed (minus bug#), and reverted tbox mozconfig flags. looks like the new patch dropped Ts another 1%, but that could be artifactual. -> fixed, nothing to do on mac, and we don't have evidence that gcc4.2 needs this.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
i tested gcc 4.2.3 and found it suffered exactly the same problem, and that -finline-limit=50 solved it in an identical way to 4.1.x. i've added 4.2.x to the case statement in configure.in.
Summary: gcc (4.1 only) zealously avoids inlining at -Os → gcc 4.1/4.2 zealously avoid inlining at -Os
The gcc version check introduced here has caused bug 423261, btw, only checking for "version" in the line is not ideal.
Depends on: 423261
So would any of -finline-limit=60, ...=70, ...=80, ...=90 (or maybe even ...=57) find us a place where the code size difference is 0 and the performance gains are up, or where performance is roughly the same (or better!) and we save even more on code size? Or does the limit only take multiples of 50?
no. i investigated several settings in between 50 and 100, didn't find any sweeter spots - problem is it's a pretty granular thing, e.g. you're either inlining one line of C code or two, which as a big effect applied across the whole codebase. i think a more interesting thing to do is play around with gcc 4.3, with its 8% Tp win at -O2, and see which specific passes in -O2 are responsible for that win - then see if we can take just those and avoid the 2MB codesize bloat. clearly they've added some sweet new passes in 4.3 that make -O2 pull ahead.
Product: Core → Firefox Build System
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: