Ok, so it appears that bug 709193 is back - Win PGO builds are failing again :-(
Has happened twice on inbound:
* https://tbpl.mozilla.org/php/getParsedLog.php?id=11348221&tree=Mozilla-Inbound - "fatal error C1002: compiler is out of heap space in pass 2"
* a retrigger of this rev completed fine: https://tbpl.mozilla.org/php/getParsedLog.php?id=11345300&tree=Mozilla-Inbound - "linker max virtual size: 3021185024"
* https://tbpl.mozilla.org/php/getParsedLog.php?id=11354573&tree=Mozilla-Inbound (same error)
I've filed this bug to track the failure and short-term mitigation; however, there is also:
* Bug 710840 (tracking the increase over time) - where I'm about to start bisecting.
* Bug 709480 (switching to win64 builders), which is the real long-term solution here.
Filtered inbound TBPL view showing just win PGO:
Failure rate is 2 out of 5 in the last 12 hours (bug 750611 caused a bit of a backlog, so still quite a few pending/running as I post this).
mozilla-central seems ok for now - khuey found that inbound was using 30MB more on the last green, so it appears there has been a significant rise since the last merge, which we'll bisect now.
I've collected the win PGO peak linker values for the last month for inbound in bug 710840 (attachment 619911).
The most relevant part being:
ac1504ff8740: 3016626176; 3021967360; + 1x failed
0831ce6ba72f: 3021553664; + 1x failed
f99cf2f41355: 2993434624; 2962083840; 2993422336
Created attachment 619919
Posting this just to keep everyone in the loop (seeing as the trees have now been closed as of 15:14 UTC+1).
mbrubeck kindly graphed the values from attachment 619911.
The large jump is in the range:
However ehsan's inspection of the changesets found mainly mobile/linux only or else very small patches :-(
The inbound win nightly has just failed too (three pushes after those listed in comment 0):
"e:\builds\moz2_slave\m-in-w32-ntly\build\xpfe\appshell\src\nswindowmediator.cpp(810) : fatal error C1001: An internal error has occurred in the compiler.
LINK : fatal error LNK1000: Internal error during IMAGE::BuildImage "
Also occurred on profiling branch, which doesn't yet have the mozilla-inbound changes:
WINNT 5.2 profiling nightly on 2012-04-30 04:02:17 PDT for push cf0acd702251
WINNT 5.2 profiling nightly on 2012-05-01 04:02:23 PDT for push 1fe40e6e26b0
"nswindowmediator.cpp(810) : fatal error C1001: An internal error has occurred in the compiler."
To summarise what's been discussed on IRC, for those waiting on the closed tree:
* There's nothing hugely obvious (that we've been able to find) that has caused an increase and could be backed out short term. It would seem we've been close to the limit for a while, but without bug 710840 there was no easy way to keep track. Also, the peak linker vsize values seem to be bimodal or even trimodal (see mbrubeck's attached graph), which makes finding ranges of increases a pain.
* Short term our options are yet again (bug 709193 déjà vu): remove dead code, split as much as we can out of libxul, turn off PGO for our trunk nightlies. Ehsan has filed a number of bugs for splitting things out - see dependants. I think we exhausted much of the obvious dead code removal last time - at least some of what's left still requires a fair amount of work before it can be removed, aiui (eg RDF, old parser).
* Longer term we're completely reliant on bug 709480.
(In reply to Ed Morley [:edmorley] from comment #6)
> * Longer term we're completely reliant on bug 709480.
Or something else I mentioned on irc: Try doing PGO on subparts of libxul when building the static libraries (maybe gklayout + the rest would do), and then link the static libraries together as libxul. Performance impact would need to be studied, though.
To try and unblock people a bit, mozilla-central has now been set to approval required for landings that do not affect any part of the Windows build. mozilla-inbound will remain closed for now.
For more info see:
Latest values from inbound:
(continuing on from comment 2, this time oldest first)
d2596504ce97: 3021832192, 3021963264
d60f77b10824: 2992345088 (test-only)
bfa638e5df16: 2962644992, 2992783360 (disabling graphite)
e1f1d4f79b2d: 2992783360, 2992779264 (NPOTB)
83ff77ce8d6c: 2992455680 (reenable graphite + move to libgkmedias)
c3813fbb1c9a: 2992640000, 2962808832, 2992627712, 2992619520 (bug 748343)
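Since the same changeset can report quite different peaks from one build to the next, it may help to look at the max and the spread per changeset rather than at individual builds. A minimal sketch, assuming the "rev: value, value (note)" layout used in this comment; the helper name `parse_peaks` is made up for illustration and is not part of any build tooling:

```python
import re

# Values copied from this comment; one "rev: value, value (note)" entry per line.
RAW = """\
d2596504ce97: 3021832192, 3021963264
d60f77b10824: 2992345088 (test-only)
bfa638e5df16: 2962644992, 2992783360 (disabling graphite)
e1f1d4f79b2d: 2992783360, 2992779264 (NPOTB)
83ff77ce8d6c: 2992455680 (reenable graphite + move to libgkmedias)
c3813fbb1c9a: 2992640000, 2962808832, 2992627712, 2992619520 (bug 748343)
"""

def parse_peaks(text):
    """Map each changeset to the list of vsize values recorded for it."""
    peaks = {}
    for line in text.splitlines():
        rev, _, rest = line.partition(":")
        # Only 10-digit runs are byte counts; this skips bug numbers in notes.
        values = [int(v) for v in re.findall(r"\b\d{10}\b", rest)]
        if values:
            peaks[rev.strip()] = values
    return peaks

MIB = 2**20
for rev, values in parse_peaks(RAW).items():
    spread = (max(values) - min(values)) / MIB
    print(f"{rev}: max {max(values) / MIB:.1f} MiB, "
          f"spread {spread:.1f} MiB over {len(values)} build(s)")
```

For c3813fbb1c9a the four builds differ by roughly 28 MiB, so a single build per changeset is not enough to tell a real regression from run-to-run variation.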
Now that bug 750717 has landed, the values show up in TBPL's middle stats panel (under linker max vsize) when the build is selected, no need to open the logs.
How far back does our linker virtual mem size data go?
It's in the build logs, so as far as logs go, 30 days.
(In reply to Phil Ringnalda (:philor) from comment #12)
> It's in the build logs, so as far as logs go, 30 days.
Darn. I would be curious to see 12 months worth :)
The trees are reopened now, I'm gonna call this fixed.
Since we'll lose the logs after 30 days, posting a few more peak linker usage values after Ehsan's awesome work, for future reference:
(continuing on from comment 10; old to new)
75de3dfde0bd: 2962407424 (libjpeg ripped out of libxul)
b60dc9ae8aae: 2992517120, 2962640896 (and libpng)
81f7513ed312: 2962550784 (qcms)
e15be411dff8: 2992361472, 2962223104 (expat)
a642269f01a2: 2962214912 (rm unused cairo debugging code)
e0d9d5a0987b: 2962190336, 2992058368 (bholley's CAPS pruning)
828281d69978: 2980945920 (cairo + pixman ripped out from libxul)
75c104703999: 2980843520 (tree now reopened, normal landings...)
5900fe7cd355: 2963116032, 2980843520
Subtracting the highest value after the libxul diet from the highest in comment 2 shows we now have ~39MB more headroom.
To give a rough idea of how long this may last us (obviously extremely dependent on what lands, but better than nothing), between 2012-02-27 (bug 710840 comment 6) and 2012-05-01 there was a ~96MB increase.
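For anyone wanting to redo the arithmetic later, here is a rough sketch. It assumes "highest value post diet" means 2980945920 (the cairo + pixman push) and "highest in comment 2" means 3021967360, and it assumes linear growth, which real landings won't follow:

```python
from datetime import date

MIB = 2**20

# Highest value in comment 2 (before the libxul diet).
before_peak = 3021967360
# Highest value after cairo + pixman were moved out of libxul.
after_peak = 2980945920

headroom_mib = (before_peak - after_peak) / MIB

# Growth reported above: ~96MB between these two dates.
days = (date(2012, 5, 1) - date(2012, 2, 27)).days
growth_per_day = 96 / days

# Assumption: growth is linear, which real landings will not follow exactly.
days_left = headroom_mib / growth_per_day
print(f"headroom: {headroom_mib:.1f} MiB")
print(f"growth: {growth_per_day:.2f} MB/day over {days} days")
print(f"roughly {days_left:.0f} days until the old peak is reached again")
```

On those assumptions that works out to about 1.5MB/day over 64 days, so the ~39MB would be used up in roughly 26 days. Treat that as an order-of-magnitude guess only.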
We keep logs for nightly builds FOREVAAAAAH, e.g.
linker max virtual size: 3028107264
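If someone wants to pull these values out of the archived nightly logs in bulk, something like the following might do. It is only a sketch: it assumes the log contains a line in exactly the form quoted above, and the function name `peak_linker_vsize` is hypothetical:

```python
import re

# Pattern for the line the Windows PGO build logs print, as quoted in this bug.
VSIZE_RE = re.compile(r"linker max virtual size:\s*(\d+)")

def peak_linker_vsize(log_text):
    """Return the largest 'linker max virtual size' value in a log, or None."""
    values = [int(m.group(1)) for m in VSIZE_RE.finditer(log_text)]
    return max(values) if values else None

sample_log = """\
... other build output ...
linker max virtual size: 3028107264
"""
peak = peak_linker_vsize(sample_log)
print(peak, "bytes =", round(peak / 2**20, 1), "MiB")
```

Fetching the logs themselves is left out here, since the log URL was not included above.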
(In reply to Mike Hommey [:glandium] from comment #7)
> (In reply to Ed Morley [:edmorley] from comment #6)
> > * Longer term we're completely reliant on bug 709480.
> Or something else I mentioned on irc: Try doing PGO on subparts of libxul
> when building the static libraries (maybe gklayout + the rest would do), and
> then link the static libraries together as libxul. Performance impact would
> need to be studied, though.
No cheese. lib.exe doesn't do anything really useful with /LTCG. It's still the final linkage doing all the work. The only possible way out with this technique would be to create a PGOed dll for gklayout, and convert it to a static library. It would require that 1. gklayout is compilable as a dll and that 2. we have something to convert a dll to a static library.
(In reply to Mike Hommey [:glandium] from comment #17)
> (In reply to Mike Hommey [:glandium] from comment #7)
> > (In reply to Ed Morley [:edmorley] from comment #6)
> > > * Longer term we're completely reliant on bug 709480.
> > Or something else I mentioned on irc: Try doing PGO on subparts of libxul
> > when building the static libraries (maybe gklayout + the rest would do), and
> > then link the static libraries together as libxul. Performance impact would
> > need to be studied, though.
> No cheese. lib.exe doesn't do anything really useful with /LTCG. It's still
> the final linkage doing all the work.
> The only possible way out with this
> technique would be create a PGOed dll for gklayout, and convert it to a
> static library. It would require that 1. gklayout is compilable as a dll and
> that 2. we have something to convert a dll to a static library.
I'm pretty sure that we've broken (1) since everything moved into libxul. Also, I don't know of a way to do (2) but that doesn't mean that it's not possible.