Bug 479192 (Closed): opened 12 years ago, closed 11 years ago

tinderbox machines should use faster hardware

Categories

(Release Engineering :: General, defect, P3)

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 545136

People

(Reporter: dbaron, Unassigned)

Details

Over the past few years, build times on tinderbox have been increasing significantly.  Given that hardware has been getting more powerful (much faster than our source tree has been growing), one would expect the opposite.  In the old days, we invested in top-of-the-line hardware for the tinderbox machines:
http://groups.google.com/group/netscape.public.mozilla.builds/msg/4262317a44ec69e1?
and we should start doing so again.

Developers spend significant amounts of time waiting for tinderbox machines (whether it's mozilla-central, tracemonkey, mozilla-1.9.1, or the try server) to build and run tests.  While Talos machines need to be slow so that they have typical performance characteristics, the build machines, leak test machines, and unit test machines ought to be fast so that developers can stop wasting hours every day waiting for these slow machines to cycle.  (I think most of the tinderboxes today are significantly slower than the typical desktop machine a developer would have; in the old days they were faster.)

(Note that there are probably a whole bunch of other things we can do to improve cycle times other than getting faster hardware; it's worth getting additional bugs filed on those as well.)
See also my proposal in bug 468554.
A quick analysis of builds on mozilla-central gives these results:

Windows average build time: 03h02
Windows average debug build time: 01h07

Mac average build time: 00h27
Mac average debug build time: 00h24

Linux average build time: 00h33
Linux average debug build time: 00h29

Note that these times include the time it takes to pull from hg, do a build, and run some tests.  Also, we're doing PGO builds for the non-debug Windows builds, which significantly increases the build time.
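
(For reference, here is a minimal sketch of the averaging behind these numbers, assuming you have start/end timestamps scraped from the tinderbox logs; the sample data below is made up for illustration, not taken from the real logs.)

from datetime import datetime, timedelta

# Hypothetical (start, end) timestamp pairs for a set of builds;
# real tinderbox logs would need their own parsing step.
builds = [
    ("2009-02-15 03:10", "2009-02-15 06:14"),
    ("2009-02-16 03:05", "2009-02-16 06:01"),
    ("2009-02-17 03:12", "2009-02-17 06:18"),
]

def average_build_time(builds):
    fmt = "%Y-%m-%d %H:%M"
    total = sum(
        (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
         for start, end in builds),
        timedelta(),
    )
    return total / len(builds)

avg = average_build_time(builds)
# Prints in the same HHhMM style used above, e.g. "average: 03h02".
print("average: %02dh%02d" % (avg.seconds // 3600, (avg.seconds % 3600) // 60))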

We need data on how much time we could expect to gain from better hardware. Bug 468554 isn't set to be done this quarter, but if you can find somebody with a good machine we can run tests on, that could work too.
http://tinderbox.mozilla.org/MozillaTry/ says:

Approximate build times:
    * Linux: 1hr 30m
    * Mac OS X: 35m to 2h (depending on whether you get an Xserve or a mini)
    * Windows: 2hr
Is the try server using different build hardware?

The numbers in comment 2 seem to be more within a reasonable range (except for the Windows PGO build), but aren't most Mac and Linux builds depend builds?  I think commits that touch more of the tree cause longer builds... any idea how long a full build typically takes?
The Windows and Linux slaves on try server are on similar VMs to those used in the regular build environment.

The Mac slaves on try are dedicated hardware, like on regular build, and IIRC there's an Xserve and a mini there.  We also have a mix of Xserves and minis in the regular build pool.

Try server is doing full clobbers, but not PGO builds.

For our nightly builds, which are full rebuilds, here are the average build times for mozilla-central:
Windows: 3h34
Linux: 0h52
Mac: 1h27

I should mention that these averages, and those in comment #2, are based on data from Feb 15th to Feb 19th.
I do not believe hardware is the major limiting factor here. As catlee has already mentioned, dep build time on Linux and Mac is very low. Windows build times are very long because of PGO and how long it takes to link xul.lib. If we were not using PGO, build times would be roughly 30-45 minutes, as they are on the leak test machines.

Of course, Talos can't start until the builds finish, so that makes things especially bad on Windows.

I don't think it's fair to put all of this onus on us. 30 minute build times for dep builds are perfectly reasonable IMHO. We can certainly reduce the total run times for everything after we can run unittests on packaged builds, and perhaps by splitting up Talos test suites (random idea, I'm not committing us to that).

The fact is we're doing more: unittests, performance tests, PGO on windows. We can't do it in the same amount of time.
Why is dep build time so slow? 25-30 minutes is an awfully long time for a dep build on linux/mac. A 4-core machine should be able to do a rebuild-nothing dep build in 6-8 minutes, and an 8-core machine can do it in 3 minutes.

Perhaps we should consider making Windows hourlies non-PGO by default and do nightlies as PGO. This means the hourly Talos numbers will be different from the final numbers, but will at least identify code regressions quickly.
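
(As a back-of-the-envelope check on the numbers in that comment, here is a sketch applying an Amdahl-style estimate to dep-build wall time; the serial/parallel split is an assumption for illustration, not a measurement.)

def dep_build_estimate(serial_minutes, parallel_minutes, cores):
    # The serial part (configure, dependency scan, final link) doesn't
    # speed up with more cores; the compile part scales roughly linearly.
    return serial_minutes + parallel_minutes / cores

# Assumed split for a rebuild-nothing dep build: ~2 min of serial work
# plus ~16 min of parallelizable checking/compilation.
for cores in (1, 4, 8):
    print("%d cores: %.1f min" % (cores, dep_build_estimate(2.0, 16.0, cores)))
# 1 core: 18 min, 4 cores: 6 min, 8 cores: 4 min -- roughly the
# ballpark quoted above for 4- and 8-core machines.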
(In reply to comment #6)
> Why is dep build time so slow? 25-30 minutes is an awfully long time for a dep
> build on linux/mac. A 4-core machine should be able to do a rebuild-nothing dep
> build in 6-8 minutes, and an 8-core machine can do it in 3 minutes.

It's not as fast as it could be, admittedly, but I don't think it's unreasonable.

> Perhaps we should consider making Windows hourlies non-PGO by default and do
> nightlies as PGO. This means the hourly Talos numbers will be different from
> the final numbers, but will at least identify code regressions quickly.

The leak builds do a much shorter, non-PGO build which tests "does it compile" and "does it start up". I think you're talking about performance regressions though, which obviously require a completed build to even start.

Either way, this would do wonders for our "checkin happens" -> "all tests done" time and save a bunch of cycles.
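
(A toy sketch of the hourly/nightly split being discussed; the policy function is hypothetical, though MOZ_PGO is, to my knowledge, the mozconfig switch used to turn on PGO in Mozilla builds.)

def wants_pgo(build_type):
    # Hypothetical policy: hourlies skip PGO so builds finish and Talos
    # starts sooner; nightlies keep PGO so shipped performance is measured.
    return build_type == "nightly"

for build_type in ("hourly", "nightly"):
    flag = "mk_add_options MOZ_PGO=1" if wants_pgo(build_type) else "(non-PGO)"
    print(build_type, "->", flag)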
I guess I'll file specific bugs on any builds that take longer than 30 minutes, then.
(In reply to comment #6)
> Why is dep build time so slow? 25-30 minutes is an awfully long time for a dep
> build on linux/mac. A 4-core machine should be able to do a rebuild-nothing dep
> build in 6-8 minutes, and an 8-core machine can do it in 3 minutes.

Also remember that the 'build' consists of more than just running make.

We also need to get code updates from 3 repositories and make sure we have enough space to do a build before we actually start.  Then we do the build.  After that, we run things like codesighs (on Mac and Linux), create build symbols, create a package, and upload results to various servers.
Here's a breakdown of the time spent for one Windows build:

pre-update steps: ~1 minute
hg update: 7 minutes
compile: 2h22
post-compile steps: ~1 minute 

total elapsed: 2h31

The pre-update step can become much more than 1 minute if the machine needs to do a clobber or free up some disk space.
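
(Here is a sketch of how a per-step breakdown like this could be collected, timing each stage of the pipeline; the commands are placeholders, not the real buildbot steps.)

import subprocess
import time

# Placeholder commands standing in for the real build steps; each is
# timed individually so the total elapsed time can be broken down.
steps = [
    ("pre-update",   ["echo", "check disk space / clobber if needed"]),
    ("hg update",    ["echo", "hg pull -u"]),
    ("compile",      ["echo", "make -f client.mk build"]),
    ("post-compile", ["echo", "symbols, package, upload"]),
]

timings = {}
for name, cmd in steps:
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    timings[name] = time.monotonic() - start

for name, secs in timings.items():
    print("%-12s %7.1f s" % (name, secs))
print("total elapsed: %.1f s" % sum(timings.values()))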
I'm told we're not buying a bunch more ESX hosts right now, nor are we doing a wholesale switch to hardware. According to https://bugzilla.mozilla.org/show_bug.cgi?id=477885#c12 we are not in an overloaded state, which indicates that buying more ESX hosts may not even solve the problem.

Right now, we are trying to focus on reducing end-to-end time through other means, such as parallelizing tests.
Component: Release Engineering → Release Engineering: Future
Priority: -- → P3
(In reply to comment #11)
> I'm told we're not buying a bunch more ESX hosts right now, nor are we doing a
> wholesale switch to hardware. According to
> https://bugzilla.mozilla.org/show_bug.cgi?id=477885#c12 we are not in an
> overloaded state, which indicates that buying more ESX hosts may not even solve
> the problem.
> 
> Right now, we are trying to focus on reducing end-to-end time through other
> means, such as parallelizing tests.

Actually, I'd paraphrase this to say: switching to faster VMs, or even to faster hardware, will only get us an incremental perf gain. We can get a much bigger end-to-end performance gain by focusing on things like:
- running unittests without having to rebuild each time
- running each unittest suite in parallel rather than one after another (sketched below)
...so we'll continue working on that as a higher priority first.

Once we have those improvements in place, we can revisit this bug - hence we didn't WONTFIX this bug, and instead moved it to Future.
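
(A minimal sketch of the parallel-unittest idea from the list above, assuming the suites can all run against one packaged build; the suite commands are placeholders for the real harnesses.)

from concurrent.futures import ThreadPoolExecutor
import subprocess

# Placeholder suite commands; the real harnesses (xpcshell, reftest,
# mochitest) each have their own drivers.  Run concurrently, the wall
# time becomes roughly max(suite times) instead of their sum.
suites = {
    "xpcshell":  ["echo", "run xpcshell tests"],
    "reftest":   ["echo", "run reftests"],
    "mochitest": ["echo", "run mochitests"],
}

def run_suite(item):
    name, cmd = item
    result = subprocess.run(cmd, capture_output=True, text=True)
    return name, result.returncode

with ThreadPoolExecutor(max_workers=len(suites)) as pool:
    for name, rc in pool.map(run_suite, suites.items()):
        print(name, "passed" if rc == 0 else "failed")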
Mass move of bugs from Release Engineering:Future -> Release Engineering. See
http://coop.deadsquid.com/2010/02/kiss-the-future-goodbye/ for more details.
Component: Release Engineering: Future → Release Engineering
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 545136
Product: mozilla.org → Release Engineering