Closed Bug 420320 Opened 12 years ago Closed 10 years ago

figure out a way to mitigate cycle time of PGO builds

Categories

(Release Engineering :: General, defect, P3)

Tracking

(Not tracked)

VERIFIED DUPLICATE of bug 545136

People

(Reporter: ted, Unassigned)

Details

Now that we've enabled PGO on fx-win32-tbox, the cycle time has gone way up. Clobber builds take about 2 hours 30 minutes, and depend builds take 1 hour 30 minutes. Turning it on for fx-linux-tbox will be even worse, since every build will be a clobber that will probably take about 2 hours. This is going to hurt us in terms of getting perf numbers quickly, and just generally having things cycle on tinderbox. A couple of "easy" fixes off the top of my head:
1) Make the tinderboxes faster--they're currently VMs; put them on real hardware. This has maintenance issues etc., but would probably help a lot. Note that more cores won't help the Windows build that much: the linker is single-threaded (at least for whatever it does during PGO), and that's where we're spending the most time.
2) Bring up another set of tinderboxes to do non-PGO builds, and perf-test them in parallel--this means having a lot of extra boxes, but would have the side benefit of allowing us to compare our PGO and non-PGO numbers side by side.

Harder fixes:
* Only do PGO on nightly/release builds--this kind of sucks since we won't get consistent perf numbers, but these are the builds that most people actually run, so it makes sense. Tinderbox doesn't have support for this right now; it might be easier with buildbot (see the sketch below).
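To make the nightly-only-PGO idea concrete, here's a minimal buildbot sketch of what the scheduler side could look like. The builder names, times, and branch handling are made up for illustration; this is not taken from our actual master.cfg.

```python
# Hypothetical master.cfg fragment: per-checkin builds stay non-PGO,
# PGO runs only on a nightly schedule (builder names are placeholders).
from buildbot.scheduler import Scheduler, Nightly

c = BuildmasterConfig = {}

c['schedulers'] = [
    # Per-checkin builds without PGO, so cycle time stays short.
    Scheduler(name='per-checkin', branch=None, treeStableTimer=3*60,
              builderNames=['win32-opt', 'linux-opt']),
    # PGO once a day, for the builds most people actually run.
    Nightly(name='nightly-pgo', hour=3, minute=0,
            builderNames=['win32-pgo', 'linux-pgo']),
]
```

The nice side effect is that the non-PGO per-checkin builds would keep giving us a consistent perf baseline, while the nightly PGO build shows the optimized numbers.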
Two other problems caused by the longer PGO builds are:

1) Builds now take longer than talos runs, so some talos slaves remain idle, and eventually drop off the waterfall page. We're trying bug#419071 in an attempt to reduce (but not fix) this problem.

2) Whenever a talos machine has problems, burns, and closes the tree, we fix/reboot the machine, and then have to wait for a *new* build to be available and tested before we know whether the tree can be reopened. This wait is now significantly longer, meaning the tree remains closed for longer. A recent example of this is detailed in tree closure bug#420183.
(sorry, forgot this)

3) nagios incorrectly assumed that the longer build times were errors. bug#420166 increased the nagios timeout settings to 5h15mins, and we need to remember to undo this once the PGO build times are reduced. In the meantime, it will take longer for us to notice if a build machine genuinely does hang.
Component: Release Engineering → Release Engineering: Future
QA Contact: build → release
Priority: -- → P3
I know we've talked about this during meetings, but I'm not sure we've ever implemented it. Can we get a box with as much RAM as we can fit in it and buy some RAM-drive software? I'd be curious to see if we could run a build almost entirely in memory and what that would do to build times.
As I mentioned in comment 0, Win32 builds are pretty much CPU bound in PGO.
The build tree fits in < 1G for a non-debug build, I'm sure, so we could try the ramdisk approach without new hardware, but I agree that it's not likely to make much of a difference.  An experiment on someone's desktop build machine would tell the tale, probably.
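For anyone who wants to run that desktop experiment, a rough timing harness along these lines would probably be enough; the source path, objdir paths, and the tmpfs mount point below are placeholders, not an existing setup, and normally MOZ_OBJDIR would live in the mozconfig rather than on the make command line.

```python
# Rough sketch: time the same clobber build with the objdir on disk vs.
# on a ramdisk.  Assumes something like
#   mount -t tmpfs -o size=2g tmpfs /mnt/ramdisk
# has already been done, and that the srcdir/objdir paths are adjusted.
import subprocess, time

def timed_build(objdir):
    start = time.time()
    subprocess.check_call(
        ['make', '-f', 'client.mk', 'build', 'MOZ_OBJDIR=%s' % objdir],
        cwd='/home/user/mozilla')  # placeholder srcdir
    return time.time() - start

for objdir in ['/home/user/objdir-on-disk', '/mnt/ramdisk/objdir']:
    print('%s: %.1f minutes' % (objdir, timed_build(objdir) / 60.0))
```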
ah, sorry for the noise, that was reading failure on my part.

We have seen improvement in the past from moving the unittest machines from VMs to hardware; that sped them up anywhere from 30% to 50%.

In response to Ted's suggestion about only doing PGO for the nightly runs, that actually would be interesting to try. We'd get a daily baseline of talos numbers on non-PGO builds followed by our nightly build showing optimized results.
Related to bug 472711?
Not really. I filed that bug to see if there was anything buggy about our build system causing dep builds to take longer. (There might have been, see bug 518107.) This bug just assumes that PGO builds will always take a long time, and suggests that we investigate ways to make life better in spite of that.
Mass move of bugs from Release Engineering:Future -> Release Engineering. See
http://coop.deadsquid.com/2010/02/kiss-the-future-goodbye/ for more details.
Component: Release Engineering: Future → Release Engineering
aiui, there are two items of work here:

1) investigate and, if possible, improve the efficiency of the Makefiles. This is being tracked in bug#472711.

2) do win32 builds on faster machines (i.e. physical hardware?). This is being tracked in bug#545136.

Closing as DUP, but please reopen if I missed anything.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 545136
Works for me!
Status: RESOLVED → VERIFIED
Product: mozilla.org → Release Engineering