Closed Bug 225433 Opened 21 years ago Closed 16 years ago

investigate -Os for nightly/release builds

Categories

(Firefox Build System :: General, defect, P3)

x86
Linux
defect

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: dwitte, Unassigned)

References

Details

(Keywords: memory-footprint)

So, bryner ran an experiment on the redwood firebird tbox (gcc 3.3.2) a few days
ago, by switching the optimization flag to use -Os instead of -O2. codesize
reduced by 1,545kb out of 14,387kb, a 10.7% reduction. the impact on perf
metrics seems to be neutral overall - Ts remained about the same, Txul improved
by ~1%, and Tp got larger by a barely measurable amount (maybe ~0.5%).

http://tinderbox.mozilla.org/showbuilds.cgi?tree=Phoenix&hours=24&maxdate=1068502523&legend=0
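
For anyone re-checking the size figures above, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the helper name `percent_reduction` is made up here and is not part of any tinderbox tooling, and the only inputs are the two firebird numbers quoted in this comment.

```python
def percent_reduction(saved_kb: float, original_kb: float) -> float:
    """Return the size saving as a percentage of the original size."""
    return saved_kb / original_kb * 100.0


# Firebird figures quoted above: 1,545kb saved out of 14,387kb at -O2.
original_kb = 14387
saved_kb = 1545

reduction = percent_reduction(saved_kb, original_kb)
remaining_kb = original_kb - saved_kb  # implied -Os code size

print(f"reduction: {reduction:.1f}%, remaining: {remaining_kb}kb")
```

The same helper applies to the seamonkey comparisons, given the raw sizes from a local build.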

i've done a comparison between my local builds of seamonkey, between -O2 and
-Os. for seamonkey, we save 9.3% in binary codesize. i don't have a -O
comparison on-hand at the moment, but i'm rolling one to compare.

given that perf remains neutral, and we save about 10% in binary size, is this
something we want to do for gcc builds of seamonkey nightlies/releases?
afaict we use -O2 for gcc release builds. it seems the contributed gtk2/xft
linux builds use -O3... ;)
my -O build finished: binary size reduces by 2.7% relative to -O2. i'm assuming
performance metrics are in between those of -Os and -O2, and hence are also neutral.

so it appears -Os is the sweet spot here...
I'd prefer to see a bit more detailed performance analysis before we do this.
However if things look good then I'm all for it.

The reason why performance doesn't degrade could simply be that there is less
code. So we'd end up swapping code less frequently and we'd hit
instruction-caches more often.
what kind of performance analysis would you suggest? imo, for this kind of
change we'd be interested in fairly broad metrics like the ones we have,
Ts/Tp/Txul. so perhaps switching one of the tinderboxen to -Os (luna?) would be
a good start (ignoring for the moment that it runs a slightly older gcc, 3.2).
that said, i think the data we already have for firebird is perfectly applicable
to seamonkey.

i've run some Ts/Txul tests locally on a p3-550, linux/gtk2, gcc 3.3.2. the Ts
tests are not useful, because the standard deviation is far too high (~10%) for
any changes to be visible:

                  -Os           -O2
Ts avg          3518.6        3505.15
Ts stdev         280.6         299.2

however, my Txul tests show a larger improvement than the firebird tests did
(most likely due to the different perf characteristics of the p3-550). these
results have a pretty low standard deviation (< 0.5%), and so are statistically
significant:

                  -Os           -O2        improvement (Os relative to O2)
Txul avg         970.4         998.2       2.8%
Txul stdev        27.3          26.0

i'm unable to test Tp since i'm outside the firewall.
er, those standard deviations should read:

Txul stdev         4.5           3.8
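
A rough way to see why the Ts numbers say nothing while the Txul numbers do is to compare the difference in means against the per-run spread. The sketch below uses only the figures posted above (with the corrected Txul standard deviations). The `Metric` class and its method names are invented for illustration, and since the number of runs isn't given in this bug, this is a signal-to-noise comparison rather than a formal significance test.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    name: str
    mean_os: float   # mean with -Os
    mean_o2: float   # mean with -O2
    stdev_os: float  # per-run standard deviation with -Os
    stdev_o2: float  # per-run standard deviation with -O2

    def improvement_pct(self) -> float:
        """Percent improvement of -Os relative to -O2 (positive means -Os is faster)."""
        return (self.mean_o2 - self.mean_os) / self.mean_o2 * 100.0

    def noise_pct(self) -> float:
        """Larger of the two relative standard deviations, in percent."""
        return max(self.stdev_os / self.mean_os,
                   self.stdev_o2 / self.mean_o2) * 100.0

    def signal_to_noise(self) -> float:
        """Absolute difference of the means divided by the larger stdev."""
        return abs(self.mean_o2 - self.mean_os) / max(self.stdev_os, self.stdev_o2)


# Values copied from the tables above (Txul stdevs are the corrected ones).
ts = Metric("Ts", 3518.6, 3505.15, 280.6, 299.2)
txul = Metric("Txul", 970.4, 998.2, 4.5, 3.8)

for m in (ts, txul):
    print(f"{m.name}: improvement {m.improvement_pct():+.1f}%, "
          f"noise {m.noise_pct():.1f}%, "
          f"signal/noise {m.signal_to_noise():.2f}")
```

For Ts the difference between the means is a small fraction of one standard deviation, so it is lost in the noise; for Txul the difference is several standard deviations wide.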
I'd like to see at least Tp measured as well so making the switch on one of the
tinderboxen sounds like a good idea. Also if you have any dhtml-tests or
js-tests handy that would be good, but that's not a requirement on my part (I know they
exist but I don't know where, sorry).
There are some scattered in various bugs... (search for "dhtml perf").
Another thing that someone might want to investigate is tweaking gcc's
inliner.  Dropping the inline limit in half (-finline-limit=300) on
gcc-3.3.2 reduced the code size by another 440K.  More is probably
achievable by changing this value or the underlying parameters
(max-inline-*).

There are some large functions that we really do want to inline, since they're
only used once or twice.  I'd rather tweak inlining by finding the things that
really shouldn't be inlined (probably in the string code) and making them not
inline.
David: Are you sure these functions are really being inlined? MSVC has a pretty
low limit for what it is willing to inline (for example some of the nsVoidArray
functions aren't always inlined) and gcc too has a limit for what it will inline.

So in general you shouldn't rely on having your functions inlined unless they
are really small.
the only way to positively force inlining is by using the gcc
__attribute__((always_inline)). having said that, i agree with dbaron's view,
especially as applied to strings... the inlining model there is whacky. i'm sure
we could do great things for both codesize and perf by fixing that.
Note that -Os seems to trigger a bunch of compiler bugs; depending on the target
CPU type you may see "simple" defects like bug 233497 (on x86/IA32, a simple
|if()|/|else|-construct will only use the |else|-branch etc.) or totally defunct
binaries (like on SPARC).
Of note is that while overall (compressed) tarball size does in fact drop by
about 10%, the size of some libraries drops by more than that.  gklayout and
necko (both stripped) drop by about 20% here (-O2 compared to -Os, gcc 3.2). 
xpcom, docshell, and a few others drop by 10%.  uconv drops by 2%.  So on some
libraries we're actually seeing a huge win from -Os (20% of gklayout is about 900KB).

Frankly, I would be in favor of flipping the switch sometime in an alpha
milestone (like now, say) for tinderbox and the nightlies and seeing what
happens.  Once we have nightlies with the change, we can put out a call to
people who do DHTML stuff (most of whom don't build) to compare the new and old
builds....
In other words, we have all these nightlies that are _supposed_ to be for testing
purposes and we have people testing them.  We should make use of that.
firefox is using -Os, any reason not to switch comet (seamonkey release) or luna
(seamonkey perf tests) over to doing -Os builds at this point, or do we want to
wait for post 1.8?
Assignee: leaf → cmp
Priority: -- → P3
*** Bug 53486 has been marked as a duplicate of this bug. ***
granrose: switching now sounds entirely reasonable to me.
I think we should get dbaron's approval to change the tinderboxen; we generally
prefer the historical comparison in the numbers by using the same build flags
(which is why btek still uses egcs), even if this doesn't produce the most
optimized builds.
FWIW, I'd expect -O2 builds to be faster than -Os, especially with newer gccs,
thanks to basic block reordering.  (We've tagged a few hotspots with NS_LIKELY /
NS_UNLIKELY since comment 0 happened, so it could be worth re-measuring.)

I'd rather not change tinderboxes that are generating performance data.  I think
we already have some with -O2 and some with -Os.
dbaron: the results in comment 4 (alas, Txul only, no Tp measurements) were done
with 3.3.2... did block reordering come in recently (3.4), or are my results
still representative?
IIRC, NS_LIKELY and NS_UNLIKELY are more recent than comment 4.

From memory:
 * gcc 3.3.x does basic block reordering (-freorder-blocks) at -O2 but not -Os
 * gcc 3.4 also does ,pt / ,pf annotations on conditional jump instructions
(which solves the branch prediction problem but not the cache miss problem
that's solved by -freorder-blocks), but I'm not sure at what optimization levels.
Product: Browser → Seamonkey
Mass reassign of open bugs for chase@mozilla.org to build@mozilla-org.bugs.
Assignee: chase → build
Mass re-assign of bugs that aren't on the build team radar, so bugs assigned to build@mozilla-org.bugs reflects reality.

If there is a bug you really think we need to be looking at, please *email* build@mozilla.org with a bug number and explanation.
Assignee: build → nobody
Assignee: nobody → stanshebs
Apparently Linux releases on the 1.8 branch have been built with -Os for a while; Chris Cooper added this in November as part of migrating tinderbox bits to the public repository, as seen in http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/tools/tinderbox-configs/firefox/linux/mozconfig&rev=MOZILLA_1_8_BRANCH_release .

Mac is being built -O2 on trunk and branches.
Perf?
Is this still something we're looking into or should it be closed in some way?
I still think this deserves investigation.  At least, we should revisit some performance testing with newer gccs.
At the very least we need to do a -Os/-O2 comparison on Macs.
Assignee: stanshebs → nobody
Product: Mozilla Application Suite → Core
QA Contact: build-config
What's the relation to bug 409803 and possibly other bugs (cc'ing sayrer)? I can guess, but it would be great to have our story for 1.9/fx3 sorted out soon, so nominating blocking.

/be
Flags: blocking1.9?
(In reply to comment #30)
> What's the relation to bug 409803 and possibly other bugs (cc'ing sayrer)? I
> can guess, but it would be great to have our story for 1.9/fx3 sorted out soon,
> so nominating blocking.

To recap:

We build -Os for release builds on linux.
We build -O2 for release builds on mac.
We build -O1 on msvc (it's somewhere between GCC's -Os and -O2, it does inline etc.)

I tried building mac at -Os, and saw a ~5% slowdown on Tdhtml and a 2-3% slowdown on Tp/Tp2. However, the code was quite a bit smaller.

To me, that indicates certain parts of the tree are faster at -O2 and others at -Os. For example, we know spidermonkey is better at -Os.
the 5% slowdown could be due (in full or part) to bug 409803 - any data we can get on mac gcc4.0 regarding that would be gold, and might make it easier to figure out module-specific settings. (speculation here, but the bug mostly affects code that makes heavy use of c++ wrappers, e.g. string libs, which might explain why spidermonkey isn't affected?)
+ing so we figure out one way or another
Flags: blocking1.9? → blocking1.9+
Status: NEW → RESOLVED
Closed: 16 years ago
Flags: tracking1.9+
Resolution: --- → WORKSFORME
Product: Core → Firefox Build System