Closed Bug 906119 Opened 6 years ago Closed 6 years ago

Enable incremental linking with Visual C++

Categories

(Firefox Build System :: General, defect)

x86
macOS
defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED
mozilla26

People

(Reporter: ehsan, Assigned: ehsan)

References

Details

(Keywords: regression)

Attachments

(1 file, 1 obsolete file)

So we do two stupid things which breaks incremental linking on Windows: one is a hack from the days of VC8 and the other is passing in the -RELEASE flag to the linker which disables incremental linking for a mostly useless checksum field apparently somewhere in the PE header.  We need to stop doing both of these.
Note that the -RELEASE flag apparently made our PGO builds significantly faster (I don't know why), so we'll want to keep it on for those builds.
This used to work a while ago.
Keywords: regression
We could either stick it behind bug 904979 to keep it on for all official builds, or flip it on solely for PGO builds somehow.
Wait.  How does -RELEASE possibly make PGO builds faster?  I don't believe you.  :-)
I kind of don't either -- -LTCG implicitly disables -INCREMENTAL (from the docs: "/LTCG is not valid for use with /INCREMENTAL."), so it can't be that -RELEASE was disabling incremental linking which made PGO builds faster.  Other than that, -RELEASE only does the checksum goop as far as I can find.
I was similarly skeptical, but we had a big PGO build time drop and this was the only thing in the range that was remotely plausible:
http://people.mozilla.org/~catlee/sattap/2b3e3d61.png

I didn't do the sleuthing on it because I didn't have time and it didn't seem super-important to track down.
Attached patch Patch (v1) (obsolete) — Splinter Review
Just to put my patch where my mouth is.  ;-)
Assignee: nobody → ehsan
Status: NEW → ASSIGNED
Attachment #791440 - Flags: review?(ted)
Also, note that "-RELEASE" is only useful for device driver development, don't be mislead by its name.  It should really be called "-ADDUSELESSCHECKSUM" or something!
(In reply to comment #6)
> I was similarly skeptical, but we had a big PGO build time drop and this was
> the only thing in the range that was remotely plausible:
> http://people.mozilla.org/~catlee/sattap/2b3e3d61.png
> 
> I didn't do the sleuthing on it because I didn't have time and it didn't seem
> super-important to track down.

Hmm, still that doesn't make a lot of sense to me...  I guess we'll know for sure if we land this patch though.
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #8)
> Also, note that "-RELEASE" is only useful for device driver development,
> don't be mislead by its name.  It should really be called
> "-ADDUSELESSCHECKSUM" or something!

I know. The only reason I added it was to get rid of the annoying warnings in WinDBG about missing checksums. If I had known it disabled incremental linking I would not have bothered.

I'd like to see you push this to try and get some PGO builds to compare build times with and without before we land this. If it does actually have an impact then we should not regress that.
I'd rather see this "protected" behind a -z "$DEVELOPER_OPTIONS" test.
Let's see what it does on try to PGO build times first.  The fewer options the better, especially something like this that *shouldn't* have an impact. (The checksum thing is just a one-time warning in a debug session, right?  The debug info loading is noisy as it is, I don't think it matters that much.)
I'm not sure incremental linking won't add more noise to the already noisy talos results.
(In reply to Mike Hommey [:glandium] from comment #13)
> I'm not sure incremental linking won't add more noise to the already noisy
> talos results.

How would incrementally linking PGO builds even work?
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #14)
> (In reply to Mike Hommey [:glandium] from comment #13)
> > I'm not sure incremental linking won't add more noise to the already noisy
> > talos results.
> 
> How would incrementally linking PGO builds even work?

FWIW, <http://msdn.microsoft.com/en-us/library/xbf3tbeh.aspx> says that builds with /LTCG will not be incremental.  Which makes perfect sense to me, since the linker doesn't just link in those builds, it also code-gens.
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #14)
> (In reply to Mike Hommey [:glandium] from comment #13)
> > I'm not sure incremental linking won't add more noise to the already noisy
> > talos results.
> 
> How would incrementally linking PGO builds even work?

We're running talos on non-PGO builds.
(In reply to Mike Hommey [:glandium] from comment #17)
> We're running talos on non-PGO builds.

That's kind of a non-sequitur, but also interesting and sad!  We're not running our only performance tests using the builds that we specifically optimize for performance?
(In reply to Vladimir Vukicevic [:vlad] [:vladv] from comment #18)
> (In reply to Mike Hommey [:glandium] from comment #17)
> > We're running talos on non-PGO builds.
> 
> That's kind of a non-sequitur, but also interesting and sad!  We're not
> running our only performance tests using the builds that we specifically
> optimize for performance?

No, because we don't do pgo builds on every push.
Well, we *do* run talos on PGO builds, but we're *also* running it on non-PGO builds. Without that tracking regressions would be much harder.
Whether or not it makes sense (or more accurately, whether or not it is the result of a documented effect rather than an undocumented side-effect), -RELEASE dropped our runtime for debug browser-chrome on Windows from 120 minutes (which is the maximum runtime buildbot allows, so we were frequently turning it red from being too slow) to 100 minutes, so be careful you don't regress bug 864085 by getting rid of it.
I looked at the job execution time of Talos jobs on inbound for a single non-PGO build (https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=c21468186d3d):

	c	d	o	p	s	tp	x	Total
Linux32	9	25	21	12	28	15		110
Linux64	9	26	21	13	28	14		111
Win XP	10		20	13	29	16		88
Win 7	13	32	22	15	31	17	10	140
Win 8	12	30	22	14	30	17		125

Comparing build times (took longest build times from 3 recent m-c builds):

	Non-PGO	PGO	Difference
Linux32	117	172	55
Linux64	112	175	63
Win XP	N/A	N/A	
Win 7	147	276	129
Win 8	N/A	N/A	

So, if we performed PGO on *every* build and eliminated non-PGO builds and their corresponding Talos jobs, we'd use less overall machine time. Of course, if we got rid of non-PGO builds, we could also kill *all* non-PGO tests, which would free up yet more resources (277 minutes more for Linux32).

There are downsides to killing non-PGO builds, of course. But considering it would be a net machine time win - and that is something RelEng is always making a big deal about - it's certainly worth thinking about.
Latency of build results matters too.
Having to wait for 4 hours for builds to finish would certainly affect the inbound close times.
(In reply to Gregory Szorc [:gps] from comment #22)
> So, if we performed PGO on *every* build and eliminated non-PGO builds and
> their corresponding Talos jobs, we'd use less overall machine time. Of
> course

That's not true, either. PGO builds are always clobbers. Non-PGO builds aren't. Sure, we could make the PGO builds non-clobbers with some effort, but that won't make them magically as fast as Non-PGO non-clobber builds (the fact that relinking xul.dll alone takes 90 minutes with PGO, for instance, is certainly not going to help).
(In reply to Mike Hommey [:glandium] from comment #25)
> (In reply to Gregory Szorc [:gps] from comment #22)
> > So, if we performed PGO on *every* build and eliminated non-PGO builds and
> > their corresponding Talos jobs, we'd use less overall machine time. Of
> > course
> 
> That's not true, either. PGO builds are always clobbers. Non-PGO builds
> aren't. Sure, we could make the PGO builds non-clobbers with some effort,
> but that won't make them magically as fast as Non-PGO non-clobber builds
> (the fact that relinking xul.dll alone takes 90 minutes with PGO, for
> instance, is certainly not going to help).

How is it not true? Automation machine time for tests is much greater than automation time for builds. If we eliminate a whole class of builds, most of the savings come from eliminating the tests. That may change as test suites are parallelized. But for now it holds true.
(In reply to Gregory Szorc [:gps] from comment #22)
> So, if we performed PGO on *every* build and eliminated non-PGO builds and
> their corresponding Talos jobs, we'd use less overall machine time. Of
> course, if we got rid of non-PGO builds, we could also kill *all* non-PGO
> tests, which would free up yet more resources (277 minutes more for Linux32).
> 
> There are downsides to killing non-PGO builds, of course. But considering it
> would be a net machine time win - and that is something RelEng is always
> making a big deal about - it's certainly worth thinking about.

We explicitly made this tradeoff last year because of the latency of results. PGO builds are always going to take longer to finish, meaning that it's always going to take longer to get results out. We are using more machine time to do periodic PGO builds, but the tradeoff is that we get faster results for the non-PGO builds so we can more quickly assess test breakage. I don't think we want to revert that decision.
(In reply to Phil Ringnalda (:philor) from comment #21)
> Whether or not it makes sense (or more accurately, whether or not it is the
> result of a documented effect rather than an undocumented side-effect),
> -RELEASE dropped our runtime for debug browser-chrome on Windows from 120
> minutes (which is the maximum runtime buildbot allows, so we were frequently
> turning it red from being too slow) to 100 minutes, so be careful you don't
> regress bug 864085 by getting rid of it.

Oh, perhaps this was the thing I was misremembering in comment 1. It was debug test runtimes and not PGO build speed. Exciting either way! (That's right, it was my PGO-exclusion patch that made PGO builds faster.)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #27)
> We are using more
> machine time to do periodic PGO builds, but the tradeoff is that we get
> faster results for the non-PGO builds so we can more quickly assess test
> breakage. I don't think we want to revert that decision.

Indeed.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #28)
> (In reply to Phil Ringnalda (:philor) from comment #21)
> > Whether or not it makes sense (or more accurately, whether or not it is the
> > result of a documented effect rather than an undocumented side-effect),
> > -RELEASE dropped our runtime for debug browser-chrome on Windows from 120
> > minutes (which is the maximum runtime buildbot allows, so we were frequently
> > turning it red from being too slow) to 100 minutes, so be careful you don't
> > regress bug 864085 by getting rid of it.
> 
> Oh, perhaps this was the thing I was misremembering in comment 1. It was
> debug test runtimes and not PGO build speed. Exciting either way! (That's
> right, it was my PGO-exclusion patch that made PGO builds faster.)

OK, I'll do another try push for debug tests.  FWIW, the build times were no different with this patch.

Also, can everybody please take the parts of the conversation that are not related to this bug to a better forum, such as dev.platform?  This bug is *only* about enabling incremental linking with Visual C++, as suggested by the summary, discussions of things such as dropping non-PGO builds do not belong here!  Thanks.  :-)
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #30)
> OK, I'll do another try push for debug tests.  FWIW, the build times were no
> different with this patch.

https://tbpl.mozilla.org/?tree=Try&rev=863d070a5909
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #31)
> (In reply to :Ehsan Akhgari (needinfo? me!) from comment #30)
> > OK, I'll do another try push for debug tests.  FWIW, the build times were no
> > different with this patch.
> 
> https://tbpl.mozilla.org/?tree=Try&rev=863d070a5909

o_O

Comparing this to <https://tbpl.mozilla.org/?rev=764ac2c7cb0a>, we go from 88mins to 120mins running browser-chrome.  This is ridiculous.

I'll try to hide this behind $DEVELOPER_OPTIONS.
Attached patch Patch (v2)Splinter Review
https://tbpl.mozilla.org/?tree=Try&rev=60baca47112d
Attachment #791440 - Attachment is obsolete: true
Attachment #791440 - Flags: review?(ted)
Attachment #793140 - Flags: review?(ted)
Attachment #793140 - Flags: review?(mh+mozilla)
If you'd like to try to profile it that might be interesting. My best guess is the dbghelp-using NS_StackWalk implementation hits some pathological case without these checksums. (I didn't have the stomach for trying to figure it out at the time.)
I was going to write a script to parse out the times for each test in the two logs.  But then I realized I really shouldn't do that, not right now. :)  Someone else should though!
Comment on attachment 793140 [details] [diff] [review]
Patch (v2)

Review of attachment 793140 [details] [diff] [review]:
-----------------------------------------------------------------

::: configure.in
@@ +2075,5 @@
>          DYNAMIC_XPCOM_LIBS='$(LIBXUL_DIST)/lib/xul.lib $(LIBXUL_DIST)/lib/xpcom_core.lib $(LIBXUL_DIST)/lib/mozalloc.lib'
>          XPCOM_FROZEN_LDOPTS='$(LIBXUL_DIST)/lib/xul.lib $(LIBXUL_DIST)/lib/mozalloc.lib'
>          LIBXUL_LIBS='$(LIBXUL_DIST)/lib/xul.lib $(LIBXUL_DIST)/lib/mozalloc.lib'
>          MOZ_COMPONENT_NSPR_LIBS='$(NSPR_LIBS)'
> +        if test -n "$DEVELOPER_OPTIONS"; then

Would be better to do that as
LDFLAGS="$LDFLAGS -LARGEADDRESSAWARE -NXCOMPAT"
if test -z "$DEVELOPER_OPTIONS"; then
  LDFLAGS="$LDFLAGS -RELEASE"
fi
Attachment #793140 - Flags: review?(mh+mozilla) → review+
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #34)
> If you'd like to try to profile it that might be interesting. My best guess
> is the dbghelp-using NS_StackWalk implementation hits some pathological case
> without these checksums. (I didn't have the stomach for trying to figure it
> out at the time.)

Hmm, yeah that's plausible, but I don't hope that there is much that we can do about that.  At any rate, this patch gives us what _I_ was looking for (faster builds for developers), so I filed bug 907701 as a follow-up in case anybody is interested to do this for RelEng builds.
Attachment #793140 - Flags: review?(ted)
https://hg.mozilla.org/mozilla-central/rev/e36e3bec59da
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla26
Just for reference (for now): It's still possible to run into the ILK size limit of incremental linking (Bug 653662 disabled it back then) with VS2010 as the limit has only been increased to 384MB/768MB according to http://connect.microsoft.com/VisualStudio/feedback/details/283665/increase-linker-memory-to-avoid-linker-error-lnk1210-exceeded-internal-ilk-size-limit. Someone in #developers on IRC has run into this limit with a debug build/no opt with VS2010 for about a week now. But a new bug will be filed on this if necessary.
(In reply to Frank Wein [:mcsmurf] from comment #40)
> limit of incremental linking (Bug 653662 disabled it back then) with VS2010

First: Could someone please set a reference to those old bugs (I found this bug through the rev.-history of Makefile.in <SIC!>)

> Someone in #developers on IRC has run into this limit with a debug build/no
> opt with VS2010 for about a week now.

Confirmed! I'm just right now run into the Bug.
Alfred: Can you file a new bug for this via https://bugzilla.mozilla.org/enter_bug.cgi ? It looks like multiple people are affected by this. File it in Produce Core, Component "Build Config". Also include the info if you use a 32bit or a 64bit Windows.
(In reply to Frank Wein [:mcsmurf] from comment #42)
> Alfred: Can you file a new bug for this via

Done: bug 931371
Duplicate of this bug: 657571
Blocks: 703845
Product: Core → Firefox Build System
You need to log in before you can comment on or make changes to this bug.