Closed Bug 468554 Opened 16 years ago Closed 14 years ago

Consider adding some hardware win32 slaves to the build pool (just for m-c PGO builds)

Categories

(Release Engineering :: General, defect, P4)

x86
Windows XP
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ted, Assigned: jhford)

Details

bug 428559 showed that a marked improvement in build time can be had from a hardware win32 slave vs. our current VMs: 60 minutes for a dep build vs. 2.5 hours is a huge improvement. We certainly can't support an all-hardware build pool, but could we support a few fast hardware machines mixed into the pool, allocated only for mozilla-central win32 PGO builds? As long as there are still VM builders in the pool, each hardware machine wouldn't need Tier 1 support.

I don't know what the cost/benefit ratio is here, but it's something to think about. nthomas asked about migrating our refplatform VM to a hardware machine. The URL points at a VMware doc describing how to do Virtual-to-Physical migrations, and notes that DL360s are among the supported platforms.
Could we accomplish the same thing (or at least some improvement) by reserving some VM resources for the win32 PGO buildslaves? We just did the same thing for production-master (https://bugzilla.mozilla.org/show_bug.cgi?id=467634#c7).
I don't know, it's worth testing, at least. Should be fairly easy to try it and see what the numbers look like.
This might be nice to try, but I don't see it happening soon.
Component: Release Engineering → Release Engineering: Future
It was tried in bug 428559, and showed a 60% reduction in build times.  When _do_ you see it happening, if not soon?  Late Q1? Q2? Never?

How can someone else help make it happen, if releng doesn't have cycles?  Build-feedback turnaround time is a _major_ issue for our development process right now.
John:

What would you think about a Q1 goal to get one hardware machine setup as a build slave and allocated to the m-c win32 build pool? It's not going to solve our build time problems, but it would give us a feel for how much better it would be, without committing to maintaining a bunch of machines. That way, if there turn out to be serious problems with this plan, we'll have them enumerated, and can figure out ways to fix them, instead of just pushing back because we know there will be problems, but we don't know what they are.

From IRC discussion, it seems like most of the concerns are with maintenance, since with VMs we can always throw away a VM and re-clone a new one from the refplatform. With a hardware machine, it's not so simple, so we run into this problem any time the refplatform gets updated. There is a RelEng bug on file about getting tools for maintaining software configuration across machines, so that would probably go a long way to alleviating those problems. Regardless, I think a pilot program would help both sides see what we can accomplish.
So I just ran 'make -f client.mk profiledbuild' on real hardware 6 times.  Between each run, I ran 'rm -rf obj-firefox'.

The first run was successful, but I don't know how long it took because I didn't run it with 'time'.

The next 3 runs hung at various points in the build process. These I ran with 'time make -f client.mk profiledbuild'. There were 3 or 4 make processes chewing up CPU, but they didn't seem to be making any progress.

Running as 'date; make -f client.mk profiledbuild &> build.log; date' also hung.

The last run I did with 'make -f client.mk profiledbuild; date', making a note of the start time elsewhere. This run completed successfully in 1 hour 41 minutes. Our VMs complete a full clobber build somewhere in the range of 2h30 to 3h30.

The physical hardware is therefore able to complete a nightly build at least ~50 minutes faster than our fastest VM builds, which is roughly a 33% improvement.

The machine I'm running the tests on is a quad-core Xeon 2.5 GHz with 3.5 GB of ram and a 5400 rpm drive.  I was using the standard nightly mozconfig, which uses -j4.

The build hangs are worrying; we used to think those were specific to VMs with multiple virtual cores, but now I'm wondering if it's a bug in make or a bad make/shell interaction.

I'll see if I can get an idea for depend builds tomorrow.
I've seen some hangs in make recently on my Windows 7 system. Might be just make badness. I think we should get bsmedberg's last pymake-compat patch landed and do some tests building with PyMake instead.

Also:
(In reply to comment #6)
> The machine I'm running the tests on is a quad-core Xeon 2.5 GHz with 3.5 GB of
> ram and a 5400 rpm drive.  I was using the standard nightly mozconfig, which
> uses -j4.

5400rpm? Why is that so slow compared to the rest of the beefy hardware?
(In reply to comment #7)
> Also:
> (In reply to comment #6)
> > The machine I'm running the tests on is a quad-core Xeon 2.5 GHz with 3.5 GB of
> > ram and a 5400 rpm drive.  I was using the standard nightly mozconfig, which
> > uses -j4.
> 
> 5400rpm? Why is that so slow compared to the rest of the beefy hardware?

I expect this is the standard drive for blades...it's a 2.5" drive. I'll ask mrz about it...I wonder if we can get a 10k rpm drive or an SSD in there.
What about running the build through a really simple buildbot setup (factory with just rm -rf and make -f client.mk profiledbuild)?  That would give you a way to time it.
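For illustration, a minimal buildbot factory along those lines might look like the sketch below; the workdir, timeout, and step descriptions are assumptions rather than an actual configuration:

  from buildbot.process.factory import BuildFactory
  from buildbot.steps.shell import ShellCommand

  # Clobber the objdir, then run the full PGO build as a single timed step.
  pgo_factory = BuildFactory()
  pgo_factory.addStep(ShellCommand(
      command=['rm', '-rf', 'obj-firefox'],
      workdir='build',                    # assumed checkout dir on the slave
      description='clobbering objdir'))
  pgo_factory.addStep(ShellCommand(
      command=['make', '-f', 'client.mk', 'profiledbuild'],
      workdir='build',
      timeout=4 * 3600,                   # PGO builds can run for hours
      description='PGO build'))

Each ShellCommand step records its own elapsed time in the build status, so attaching a factory like this to a builder would give per-build timing for free.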
I've been timing via:

date > build.log; make -f client.mk profiledbuild; date >> build.log

Also, changing the build directory seems to have fixed the hang for me. I had been running out of /builds/mozilla-central, which is an msys path to c:/mozilla-build/msys/builds/mozilla-central.

Running out of /c/builds/mozilla-central, aka c:/builds/mozilla-central, seems to have fixed it: 'date; make -f client.mk profiledbuild &> build.log; date' now works without a hang...at least twice in a row so far.
No-op builds take 1 hour 17 minutes on this machine, most of which is spent relinking xul (see bug 518107).
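As an aside, a rough Python equivalent of the 'date > build.log; make -f client.mk profiledbuild; date >> build.log' timing used above might look like this sketch (the build directory and log name are assumptions):

  import subprocess, time

  # Run the PGO build, capture all output in build.log, and record wall time.
  log = open('build.log', 'w')
  start = time.time()
  subprocess.call(['make', '-f', 'client.mk', 'profiledbuild'],
                  cwd='c:/builds/mozilla-central',  # assumed build dir
                  stdout=log, stderr=subprocess.STDOUT)
  log.write('elapsed: %.1f minutes\n' % ((time.time() - start) / 60.0))
  log.close()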
New machines should be arriving this week...
Assignee: nobody → catlee
Component: Release Engineering: Future → Release Engineering
Priority: -- → P4
John is working on this now...
Assignee: catlee → jhford
I am running timit.py on a Windows 2003 VM (win32-slave03) and a new ix systems machine (bm-win32-test02). I have them set to run 5 cycles, each with 3 dep builds. timit.py and the mozconfigs are versioned at http://hg.johnford.info/machine-timing. For the VM, I am using -j4 in my mozconfig; on the hardware I am using -j1 because of bug 524149.
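The actual timit.py is in the repository above; purely as a rough sketch, and assuming each cycle is one clobber PGO build followed by the dep builds, such a harness could look something like:

  import os, shutil, subprocess, time

  SRCDIR = 'c:/builds/mozilla-central'   # assumed checkout location
  CYCLES, DEP_BUILDS = 5, 3

  def timed_build():
      # Run the PGO build and return the elapsed time in minutes.
      start = time.time()
      subprocess.call(['make', '-f', 'client.mk', 'profiledbuild'], cwd=SRCDIR)
      return (time.time() - start) / 60.0

  for cycle in range(CYCLES):
      objdir = os.path.join(SRCDIR, 'obj-firefox')
      if os.path.exists(objdir):
          shutil.rmtree(objdir)            # clobber before the full PGO build
      print('cycle %d clobber build: %.1f min' % (cycle, timed_build()))
      for dep in range(DEP_BUILDS):        # dep builds reuse the objdir
          print('cycle %d dep build %d: %.1f min' % (cycle, dep, timed_build()))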
The hardware machine was able to complete 3 PGO clobber builds and 9 no-op dep builds before the VM finished its first PGO clobber build. The hardware was running with -j1 and the VM with -j4, because the hardware doesn't work with anything other than -j1.

I believe that we ordered some more ix systems machines.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
I guess we'll fix the remaining PyMake bugs and get that parallelism back on the hardware machines!
Product: mozilla.org → Release Engineering