Closed Bug 750661 Opened 12 years ago Closed 12 years ago

Win PGO builds hitting 3GB virtual address space limit again, failing with: "nshtml5attributename.cpp(1977) : fatal error C1002: compiler is out of heap space in pass 2"

Product/Component: Firefox Build System :: General
Type: defect
Platform: x86, Windows Server 2003
Priority: Not set
Severity: blocker
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: emorley
Assigned: ehsan.akhgari
Attachments: 1 file

Ok, so it appears that bug 709193 is back - Win PGO builds are failing again :-(

Has happened twice on inbound:

rev 0831ce6ba72f:
*  https://tbpl.mozilla.org/php/getParsedLog.php?id=11348221&tree=Mozilla-Inbound - "fatal error C1002: compiler is out of heap space in pass 2"
* a retrigger of this rev completed fine: https://tbpl.mozilla.org/php/getParsedLog.php?id=11345300&tree=Mozilla-Inbound - "linker max virtual size: 3021185024"

rev ac1504ff8740:
*  https://tbpl.mozilla.org/php/getParsedLog.php?id=11354573&tree=Mozilla-Inbound (same error)
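
For anyone wanting to pull these numbers out of a saved log, here is a minimal Python sketch. It assumes only that the log contains the phrase quoted above ("linker max virtual size: N"); the rest of the log layout, the function name and the sample string are hypothetical. The 3 GiB constant is the nominal limit from the bug summary, and the failures above happen while successful runs of the same revision still report peaks below it, so treat the remaining figure as an upper bound at best.

```python
import re

# Nominal 3 GiB address space limit named in the bug summary.
ADDRESS_SPACE_LIMIT = 3 * 1024**3

# Matches the phrase quoted from the build log; everything else about the
# log layout is an assumption.
VSIZE_RE = re.compile(r"linker max virtual size:\s*(\d+)")


def peak_linker_vsize(log_text):
    """Return the largest linker vsize in bytes found in log_text, or None."""
    values = [int(m.group(1)) for m in VSIZE_RE.finditer(log_text)]
    return max(values) if values else None


# Hypothetical log excerpt containing only the quoted phrase.
sample_log = "linker max virtual size: 3021185024\n"

peak = peak_linker_vsize(sample_log)
remaining = ADDRESS_SPACE_LIMIT - peak
print(f"peak={peak} bytes, nominal remaining={remaining / 2**20:.1f} MiB")
```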


I've filed this to track the failure and the short-term mitigation; however, there is also:
* Bug 710840 (tracking the increase over time) - where I'm about to start bisecting.
* Bug 709480 (switching to win64 builders), which is the real long-term solution here.
Depends on: 709193
No longer depends on: msvc2010, 679352, 709721, 710473, 713169
Filtered inbound TBPL view showing just win PGO:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=WINNT%205.2%20mozilla-inbound%20pgo-build

Failure rate is 2 out of 5 in the last 12 hours (bug 750611 caused a bit of a backlog, so still quite a few pending/running as I post this).

mozilla-central seems ok for now - khuey found that inbound was using 30MB more on the last green, so it appears there has been a significant rise since the last merge, which we'll bisect now.
I've collected the win pgo peak linker values for the last month for inbound in bug 710840 (attachment 619911 [details]).

The most relevant part being:
ac1504ff8740: 3016626176; 3021967360; + 1x failed
221db28204cf: 3021996032
0831ce6ba72f: 3021553664; + 1x failed
32e001c1351b: 2962857984
c0822f99d850: 2962464768
f8c388f622f1: 3021185024
0e2658794e06: 2982510592
609aeba1b2fe: 2962075648
f99cf2f41355: 2993434624; 2962083840; 2993422336
043266d76bb3: 2992799744
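
A rough Python sketch for summarising the list above, in case it is useful to others. The lines are copied verbatim from this comment; the parsing rules (semicolon-separated byte counts, with "+ Nx failed" marking runs that produced no value) are my reading of that layout rather than any defined format, and the function name is made up for illustration.

```python
import re

# Lines copied from the list above: "<rev>: <bytes>; <bytes>; + 1x failed".
RAW = """\
ac1504ff8740: 3016626176; 3021967360; + 1x failed
221db28204cf: 3021996032
0831ce6ba72f: 3021553664; + 1x failed
32e001c1351b: 2962857984
c0822f99d850: 2962464768
f8c388f622f1: 3021185024
0e2658794e06: 2982510592
609aeba1b2fe: 2962075648
f99cf2f41355: 2993434624; 2962083840; 2993422336
043266d76bb3: 2992799744
"""


def parse_line(line):
    """Split one line into (revision, [vsize values in bytes], failed run count)."""
    rev, _, rest = line.partition(":")
    values, failures = [], 0
    for field in rest.split(";"):
        field = field.strip()
        failed = re.fullmatch(r"\+\s*(\d+)x failed", field)
        if failed:
            failures += int(failed.group(1))
        elif field:
            values.append(int(field))
    return rev.strip(), values, failures


rows = [parse_line(line) for line in RAW.splitlines() if line.strip()]
for rev, values, failures in rows:
    peak_mib = max(values) / 2**20
    print(f"{rev}  peak={peak_mib:7.1f} MiB  runs={len(values)}  failed={failures}")
```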
Depends on: 750717
Attached image Graph
Posting this just to keep everyone in the loop (seeing as the trees have now been closed as of 15:14 UTC+1).

mbrubeck kindly graphed the values from attachment 619911 [details].

The large jump is in the range:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=0e2658794e06&tochange=f8c388f622f1

However, ehsan's inspection of the changesets found mainly mobile/Linux-only or otherwise very small patches :-(
Depends on: 750728
The inbound win nightly has just failed too (three pushes after those listed in comment 0):

d2596504ce97
https://tbpl.mozilla.org/php/getParsedLog.php?id=11358759&tree=Mozilla-Inbound
"e:\builds\moz2_slave\m-in-w32-ntly\build\xpfe\appshell\src\nswindowmediator.cpp(810) : fatal error C1001: An internal error has occurred in the compiler.
LINK : fatal error LNK1000: Internal error during IMAGE::BuildImage "
Depends on: 750747
Also occurred on profiling branch, which doesn't yet have the mozilla-inbound changes:

WINNT 5.2 profiling nightly on 2012-04-30 04:02:17 PDT for push cf0acd702251
https://tbpl.mozilla.org/php/getParsedLog.php?id=11323765&tree=Profiling

and

WINNT 5.2 profiling nightly on 2012-05-01 04:02:23 PDT for push 1fe40e6e26b0
https://tbpl.mozilla.org/php/getParsedLog.php?id=11358377&tree=Profiling

Both being:
"nswindowmediator.cpp(810) : fatal error C1001: An internal error has occurred in the compiler."
To summarise what's been discussed on IRC, for those waiting on the closed tree:

* There's nothing hugely obvious (that we've been able to find) that has caused an increase and could be backed out short term. It would seem we've been close to the limit for a while, but without bug 710840 there was no easy way to keep track. Also, the peak linker vsize values seem to be bimodal or even trimodal (see mbrubeck's attached graph), which makes finding ranges of increases a pain.

* Short term, our options are yet again (bug 709193 déjà vu): remove dead code, split as much as we can out of libxul, or turn off PGO for our trunk nightlies. Ehsan has filed a number of bugs for splitting things out - see dependants. I think we exhausted much of the obvious dead code removal last time - at least some of what's left still requires a fair amount of work before it can be removed, aiui (eg RDF, old parser).

* Longer term we're completely reliant on bug 709480.
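
To illustrate the bimodal/trimodal point, here is a small Python sketch that groups the comment 2 values by splitting the sorted list wherever two neighbours are far apart. The 15 MiB gap threshold is an arbitrary assumption of mine (a smaller threshold splits the data differently), so this only shows why a single sample per push is a poor basis for bisecting, not a proper statistical analysis.

```python
MIB = 2**20
GAP_THRESHOLD = 15 * MIB  # arbitrary assumption; not derived from the data

# Peak linker vsize values (bytes) listed in comment 2.
values = [
    3016626176, 3021967360, 3021996032, 3021553664, 2962857984,
    2962464768, 3021185024, 2982510592, 2962075648, 2993434624,
    2962083840, 2993422336, 2992799744,
]


def split_into_modes(samples, gap):
    """Group sorted samples, starting a new group when the step exceeds gap."""
    clusters = []
    for value in sorted(samples):
        if clusters and value - clusters[-1][-1] <= gap:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return clusters


modes = split_into_modes(values, GAP_THRESHOLD)
for cluster in modes:
    low, high = cluster[0] / MIB, cluster[-1] / MIB
    print(f"{len(cluster)} samples between {low:.1f} and {high:.1f} MiB")
```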
(In reply to Ed Morley [:edmorley] from comment #6)
> * Longer term we're completely reliant on bug 709480.

Or something else I mentioned on irc: Try doing PGO on subparts of libxul when building the static libraries (maybe gklayout + the rest would do), and then link the static libraries together as libxul. Performance impact would need to be studied, though.
To try and unblock people a bit, mozilla-central has now been set to approved required for landings that do not affect any part of the windows build. mozilla-inbound will remain closed for now.

For more info see:
https://wiki.mozilla.org/Tree_Rules
Bah, s/approved/approval/
Depends on: 750867
Depends on: 750859
Depends on: 748343
Latest values from inbound:
(continuing on from comment 2, this time oldest first)

d2596504ce97: 3021832192, 3021963264
d60f77b10824: 2992345088 (test-only)
bfa638e5df16: 2962644992, 2992783360 (disabling graphite)
e1f1d4f79b2d: 2992783360, 2992779264 (NPOTB)
83ff77ce8d6c: 2992455680 (reenable graphite + move to libgkmedias)
c3813fbb1c9a: 2992640000, 2962808832, 2992627712, 2992619520 (bug 748343)

Now that bug 750717 has landed, the values show up in TBPL's middle stats panel (under linker max vsize) when the build is selected, no need to open the logs.
Depends on: 751151
Depends on: 751186
Depends on: 751201
No longer depends on: 751151
Depends on: 751273
How far back does our linker virtual mem size data go?
It's in the build logs, so as far as logs go, 30 days.
(In reply to Phil Ringnalda (:philor) from comment #12)
> It's in the build logs, so as far as logs go, 30 days.

Darn. I would be curious to see 12 months' worth :)
The trees are reopened now, I'm gonna call this fixed.
Assignee: nobody → ehsan
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Since we'll lose the logs after 30 days, posting a few more peak linker usage values after Ehsan's awesome work, for future reference:

(continuing on from comment 10; old to new)

75de3dfde0bd: 2962407424 (libjpeg ripped out of libxul)
b60dc9ae8aae: 2992517120, 2962640896 (and libpng)
81f7513ed312: 2962550784 (qcms)
e15be411dff8: 2992361472, 2962223104 (expat)
a642269f01a2: 2962214912 (rm unused cairo debugging code)
e0d9d5a0987b: 2962190336, 2992058368 (bholley's CAPS pruning)
828281d69978: 2980945920 (cairo + pixman ripped out from libxul)
75c104703999: 2980843520 (tree now reopened, normal landings...)
5900fe7cd355: 2963116032, 2980843520
807403a04a6a: 2980831232

Subtracting the highest value post libxul diet from the highest value in comment 2 shows we now have ~39MB more headroom.
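
The subtraction can be checked with a few lines of Python. This assumes "post libxul diet" means the values from 828281d69978 onward (after the last split landed), which is the reading that reproduces the ~39MB figure; the variable names are only for illustration.

```python
MIB = 2**20

# Highest peak linker vsize (bytes) listed in comment 2.
before_diet_max = max([
    3016626176, 3021967360, 3021996032, 3021553664, 2962857984,
    2962464768, 3021185024, 2982510592, 2962075648, 2993434624,
    2962083840, 2993422336, 2992799744,
])

# Values from 828281d69978 onward (assumed meaning of "post libxul diet").
after_diet_max = max([
    2980945920, 2980843520, 2963116032, 2980843520, 2980831232,
])

headroom_bytes = before_diet_max - after_diet_max
print(f"extra headroom: {headroom_bytes} bytes (~{headroom_bytes / MIB:.0f} MiB)")
```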

To give a rough idea of how long this may last us (obviously extremely dependent on what lands, but better than nothing), between 2012-02-27 (bug 710840 comment 6) and 2012-05-01 there was a ~96MB increase.
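
As a back-of-the-envelope sketch in Python: if the ~96MB growth over that period were to continue at a constant rate (a strong assumption of mine, since the growth depends entirely on what lands), the ~39MB of new headroom would be used up in under a month.

```python
from datetime import date

# Figures quoted in this comment.
growth_mb = 96    # approximate increase over the period
headroom_mb = 39  # approximate headroom gained from the libxul diet

period_days = (date(2012, 5, 1) - date(2012, 2, 27)).days

# Linear extrapolation; an assumption, not a prediction.
growth_per_day_mb = growth_mb / period_days
days_of_headroom = headroom_mb / growth_per_day_mb

print(f"{period_days} days, {growth_per_day_mb:.2f} MB/day, "
      f"headroom lasts about {days_of_headroom:.0f} days")
```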
(In reply to Mike Hommey [:glandium] from comment #7)
> (In reply to Ed Morley [:edmorley] from comment #6)
> > * Longer term we're completely reliant on bug 709480.
> 
> Or something else I mentioned on irc: Try doing PGO on subparts of libxul
> when building the static libraries (maybe gklayout + the rest would do), and
> then link the static libraries together as libxul. Performance impact would
> need to be studied, though.

No cheese. lib.exe doesn't do anything really useful with /LTCG. It's still the final linkage doing all the work. The only possible way out with this technique would be to create a PGOed dll for gklayout, and convert it to a static library. It would require that 1. gklayout is compilable as a dll and that 2. we have something to convert a dll to a static library.
(In reply to Mike Hommey [:glandium] from comment #17)
> (In reply to Mike Hommey [:glandium] from comment #7)
> > (In reply to Ed Morley [:edmorley] from comment #6)
> > > * Longer term we're completely reliant on bug 709480.
> > 
> > Or something else I mentioned on irc: Try doing PGO on subparts of libxul
> > when building the static libraries (maybe gklayout + the rest would do), and
> > then link the static libraries together as libxul. Performance impact would
> > need to be studied, though.
> 
> No cheese. lib.exe doesn't do anything really useful with /LTCG. It's still
> the final linkage doing all the work.

boo!

> The only possible way out with this
> technique would be create a PGOed dll for gklayout, and convert it to a
> static library. It would require that 1. gklayout is compilable as a dll and
> that 2. we have something to convert a dll to a static library.

I'm pretty sure that we've broken (1) since everything moved into libxul.  Also, I don't know of a way to do (2) but that doesn't mean that it's not possible.
No longer depends on: 711386
Product: Core → Firefox Build System