Closed Bug 468554 Opened 16 years ago Closed 14 years ago

Consider adding some hardware win32 slaves to the build pool (just for m-c PGO builds)

Categories

(Release Engineering :: General, defect, P4)

x86
Windows XP
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ted, Assigned: jhford)

Details

bug 428559 showed that a marked improvement in build time can be had from a hardware win32 slave vs. our current VMs: 60 minutes for a dep build vs. 2.5 hours is a huge improvement. We certainly can't support an all-hardware build pool, but could we support a few fast hardware machines mixed into the pool, allocated only for mozilla-central win32 PGO builds? As long as there are still VM builders in the pool, each hardware machine wouldn't need Tier 1 support.

I don't know what the cost/benefit ratio is here, but it's something to think about. nthomas asked about migrating our refplatform VM to a hardware machine. The URL points at a VMware doc describing how to do Virtual-to-Physical migrations, and notes that DL360s are among the supported platforms.
Could we accomplish the same thing (or at least some improvement) by reserving some VM resources for the win32 PGO buildslaves? We just did the same thing for production-master (https://bugzilla.mozilla.org/show_bug.cgi?id=467634#c7).
I don't know, it's worth testing, at least. Should be fairly easy to try it and see what the numbers look like.
This might be nice to try, but I don't see it happening soon.
Component: Release Engineering → Release Engineering: Future
It was tried in bug 428559, and showed a 60% reduction in build times.  When _do_ you see it happening, if not soon?  Late Q1? Q2? Never?

How can someone else help make it happen, if releng doesn't have cycles?  Build-feedback turnaround time is a _major_ issue for our development process right now.
John:

What would you think about a Q1 goal to get one hardware machine setup as a build slave and allocated to the m-c win32 build pool? It's not going to solve our build time problems, but it would give us a feel for how much better it would be, without committing to maintaining a bunch of machines. That way, if there turn out to be serious problems with this plan, we'll have them enumerated, and can figure out ways to fix them, instead of just pushing back because we know there will be problems, but we don't know what they are.

From IRC discussion, it seems like most of the concerns are with maintenance, since with VMs we can always throw away a VM and re-clone a new one from the refplatform. With a hardware machine, it's not so simple, so we run into this problem any time the refplatform gets updated. There is a RelEng bug on file about getting tools for maintaining software configuration across machines, so that would probably go a long way to alleviating those problems. Regardless, I think a pilot program would help both sides see what we can accomplish.
So I just ran 'make -f client.mk profiledbuild' on real hardware 6 times.  Between each run, I ran 'rm -rf obj-firefox'.

The first run was successful, but I don't know how long it took because I didn't run it with 'time'.

The next 3 runs hung at various points in the build process. These I ran with 'time make -f client.mk profiledbuild'. There were 3 or 4 make processes chewing up CPU, but they didn't seem to be making any progress.

Running as 'date; make -f client.mk profiledbuild &> build.log; date' also hung.

The last run I did with 'make -f client.mk profiledbuild; date', making a note of the start time elsewhere. This run completed successfully in 1 hour 41 minutes. Our VMs complete a full clobber build somewhere in the range of 2h30 to 3h30.

The physical hardware is therefore able to complete a nightly build at least ~50 minutes faster than our fastest VM builds, which is roughly a 33% improvement.

The machine I'm running the tests on is a quad-core Xeon 2.5 GHz with 3.5 GB of ram and a 5400 rpm drive.  I was using the standard nightly mozconfig, which uses -j4.

The build hangs are worrying; we used to think those were specific to VMs with multiple virtual cores, but now I'm wondering if it's a bug in make or a bad make/shell interaction.

I'll see if I can get an idea for depend builds tomorrow.
I've seen some hangs in make recently on my Windows 7 system. Might be just make badness. I think we should get bsmedberg's last pymake-compat patch landed and do some tests building with PyMake instead.

Also:
(In reply to comment #6)
> The machine I'm running the tests on is a quad-core Xeon 2.5 GHz with 3.5 GB of
> ram and a 5400 rpm drive.  I was using the standard nightly mozconfig, which
> uses -j4.

5400rpm? Why is that so slow compared to the rest of the beefy hardware?
(In reply to comment #7)
> Also:
> (In reply to comment #6)
> > The machine I'm running the tests on is a quad-core Xeon 2.5 GHz with 3.5 GB of
> > ram and a 5400 rpm drive.  I was using the standard nightly mozconfig, which
> > uses -j4.
> 
> 5400rpm? Why is that so slow compared to the rest of the beefy hardware?

I expect this is the standard drive for blades...it's a 2.5" drive. I'll ask mrz about it...I wonder if we can get a 10k rpm drive or an SSD in there.
What about running the build through a really simple buildbot setup (factory with just rm -rf and make -f client.mk profiledbuild)?  That would give you a way to time it.
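For illustration, a minimal buildbot factory along those lines might look like the sketch below; the workdir, timeout, and step descriptions are assumptions rather than an actual configuration:

  from buildbot.process.factory import BuildFactory
  from buildbot.steps.shell import ShellCommand

  # Clobber the objdir, then run the full PGO build as a single timed step.
  pgo_factory = BuildFactory()
  pgo_factory.addStep(ShellCommand(
      command=['rm', '-rf', 'obj-firefox'],
      workdir='build',                    # assumed checkout dir on the slave
      description='clobbering objdir'))
  pgo_factory.addStep(ShellCommand(
      command=['make', '-f', 'client.mk', 'profiledbuild'],
      workdir='build',
      timeout=4 * 3600,                   # PGO builds can run for hours
      description='PGO build'))

Each ShellCommand step records its own elapsed time in the build status, so attaching a factory like this to a builder would give per-build timing for free.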
I've been timing via:

date > build.log; make -f client.mk profiledbuild; date >> build.log

Also, changing the build directory seems to have fixed the hang for me. I had been running out of /builds/mozilla-central, which is an msys path to c:/mozilla-build/msys/builds/mozilla-central.

Running out of /c/builds/mozilla-central, aka c:/builds/mozilla-central, seems to have fixed it: 'date; make -f client.mk profiledbuild &> build.log; date' now works without a hang...at least twice in a row so far.
No-op builds take 1 hour 17 minutes on this machine, most of which is spent relinking xul (see bug 518107).
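As an aside, a rough Python equivalent of the 'date > build.log; make -f client.mk profiledbuild; date >> build.log' timing used above might look like this sketch (the build directory and log name are assumptions):

  import subprocess, time

  # Run the PGO build, capture all output in build.log, and record wall time.
  log = open('build.log', 'w')
  start = time.time()
  subprocess.call(['make', '-f', 'client.mk', 'profiledbuild'],
                  cwd='c:/builds/mozilla-central',  # assumed build dir
                  stdout=log, stderr=subprocess.STDOUT)
  log.write('elapsed: %.1f minutes\n' % ((time.time() - start) / 60.0))
  log.close()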
New machines should be arriving this week...
Assignee: nobody → catlee
Component: Release Engineering: Future → Release Engineering
Priority: -- → P4
John is working on this now...
Assignee: catlee → jhford
I am running timit.py on a Windows 2003 VM (win32-slave03) and a new ix systems machine (bm-win32-test02). I have them set to run 5 cycles, each with 3 dep builds. timit.py and the mozconfigs are versioned at http://hg.johnford.info/machine-timing. For the VM, I am using -j4 in my mozconfig; on the hardware I am using -j1 because of bug 524149.
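The actual timit.py is in the repository above; purely as a rough sketch, and assuming each cycle is one clobber PGO build followed by the dep builds, such a harness could look something like:

  import os, shutil, subprocess, time

  SRCDIR = 'c:/builds/mozilla-central'   # assumed checkout location
  CYCLES, DEP_BUILDS = 5, 3

  def timed_build():
      # Run the PGO build and return the elapsed time in minutes.
      start = time.time()
      subprocess.call(['make', '-f', 'client.mk', 'profiledbuild'], cwd=SRCDIR)
      return (time.time() - start) / 60.0

  for cycle in range(CYCLES):
      objdir = os.path.join(SRCDIR, 'obj-firefox')
      if os.path.exists(objdir):
          shutil.rmtree(objdir)            # clobber before the full PGO build
      print('cycle %d clobber build: %.1f min' % (cycle, timed_build()))
      for dep in range(DEP_BUILDS):        # dep builds reuse the objdir
          print('cycle %d dep build %d: %.1f min' % (cycle, dep, timed_build()))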
The hardware machine was able to complete 3 PGO clobber builds and 9 no-op dep builds before the VM finished its first PGO clobber build. The hardware was running with -j1 and the VM with -j4, because the hardware doesn't work with anything other than -j1.

I believe that we ordered some more ix systems machines.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
I guess we'll fix the remaining PyMake bugs and get that parallelism back on the hardware machines!
Product: mozilla.org → Release Engineering