Closed
Bug 468554
Opened 16 years ago
Closed 14 years ago
Consider adding some hardware win32 slaves to the build pool (just for m-c PGO builds)
Categories
(Release Engineering :: General, defect, P4)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: ted, Assigned: jhford)
Details
bug 428559 showed that a marked improvement in build time could be had from a hardware win32 slave vs. our current VMs. 60 minutes for a dep build vs. 2.5 hours is a huge improvement. Now certainly we can't support all hardware builders, but could we support a few fast hardware machines mixed into the pool, allocated only for mozilla-central win32 PGO builds? If you still have VM builders in the pool, then each hardware machine doesn't need Tier 1 support. I don't know what the cost/benefit ratio is here, but it's something to think about.

nthomas asked about migrating our refplatform VM to a hardware machine. The URL points at a VMware doc describing how to do Virtual to Physical migrations, and notes that DL360s are one of the supported platforms.
Comment 1•16 years ago
Could we accomplish the same thing (or at least some improvement) by reserving some VM resources for the win32 PGO buildslaves? We just did the same thing for production-master (https://bugzilla.mozilla.org/show_bug.cgi?id=467634#c7).
Reporter
Comment 2•16 years ago
I don't know, it's worth testing, at least. Should be fairly easy to try it and see what the numbers look like.
Comment 3•16 years ago
This might be nice to try, but I don't see it happening soon.
Component: Release Engineering → Release Engineering: Future
Comment 4•16 years ago

It was tried in bug 428559, and showed a 60% reduction in build times. When _do_ you see it happening, if not soon? Late Q1? Q2? Never? How can someone else help make it happen, if releng doesn't have cycles? Build-feedback turnaround time is a _major_ issue for our development process right now.
Reporter
Comment 5•16 years ago
John: What would you think about a Q1 goal to get one hardware machine set up as a build slave and allocated to the m-c win32 build pool? It's not going to solve our build time problems, but it would give us a feel for how much better it would be, without committing to maintaining a bunch of machines. That way, if there turn out to be serious problems with this plan, we'll have them enumerated, and can figure out ways to fix them, instead of just pushing back because we know there will be problems but we don't know what they are.

From IRC discussion, it seems like most of the concerns are with maintenance, since with VMs we can always throw away a VM and re-clone a new one from the refplatform. With a hardware machine it's not so simple, so we run into this problem any time the refplatform gets updated. There is a RelEng bug on file about getting tools for maintaining software configuration across machines, so that would probably go a long way toward alleviating those problems.

Regardless, I think a pilot program would help both sides see what we can accomplish.
Comment 6•15 years ago
So I just ran 'make -f client.mk profiledbuild' on real hardware 6 times. Between each run, I ran 'rm -rf obj-firefox'.

The first run was successful, but I don't know how long it took because I didn't run it with 'time'. The next 3 runs hung at various points in the build process; these I ran with 'time make -f client.mk profiledbuild'. There were 3 or 4 make processes chewing up CPU, but they didn't seem to be making any progress. Running as 'date; make -f client.mk profiledbuild &> build.log; date' also hung.

The last run I did with 'make -f client.mk profiledbuild; date', making a note of the start time elsewhere. This run completed successfully in 1 hour 41 minutes. Our VMs complete a full clobber build somewhere in the range of 2h30-3h30, so the physical hardware is able to complete a nightly build at least 50 minutes faster, which is a 33% improvement.

The machine I'm running the tests on is a quad-core Xeon 2.5 GHz with 3.5 GB of RAM and a 5400 rpm drive. I was using the standard nightly mozconfig, which uses -j4.

The build hangs are worrying. We used to think those were specific to VMs with multiple virtual cores, but now I'm wondering if it's a bug in make, or bad make/shell interaction. I'll see if I can get an idea for depend builds tomorrow.
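The timing procedure described above can be sketched as a small shell helper. This is a hedged sketch: the function name, log file names, and example paths are illustrative, not taken from the actual slave setup.

```shell
#!/bin/sh
# timed_build CMD LOG -- wrap a build command with wall-clock
# timestamps, the way the comment above times each PGO run.
# CMD and LOG are illustrative parameters, not from the real slaves.
timed_build() {
    cmd="$1"
    log="$2"
    date > "$log"                   # start timestamp
    sh -c "$cmd" >> "$log" 2>&1     # the build itself, output captured
    date >> "$log"                  # end timestamp
}

# Example (hypothetical paths): clobber, then time a full PGO build.
# timed_build "rm -rf obj-firefox && make -f client.mk profiledbuild" build.log
```

The two `date` lines bracketing the command give the same wall-clock numbers as the 'date; make …; date' invocations quoted in the comments, while keeping the build output in the log.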
Reporter
Comment 7•15 years ago
I've seen some hangs in make recently on my Windows 7 system. Might be just make badness. I think we should get bsmedberg's last pymake-compat patch landed and do some tests building with PyMake instead.

Also: (In reply to comment #6)
> The machine I'm running the tests on is a quad-core Xeon 2.5 GHz with 3.5 GB of
> ram and a 5400 rpm drive. I was using the standard nightly mozconfig, which
> uses -j4.

5400rpm? Why is that so slow compared to the rest of the beefy hardware?
Comment 8•15 years ago
(In reply to comment #7)
> Also:
> (In reply to comment #6)
> > The machine I'm running the tests on is a quad-core Xeon 2.5 GHz with 3.5 GB of
> > ram and a 5400 rpm drive. I was using the standard nightly mozconfig, which
> > uses -j4.
>
> 5400rpm? Why is that so slow compared to the rest of the beefy hardware?

I expect this is the standard drive for blades... it's a 2.5" drive. I'll ask mrz about it... I wonder if we can get a 10k drive or an SSD in there.
Assignee
Comment 9•15 years ago
What about running the build through a really simple buildbot setup (factory with just rm -rf and make -f client.mk profiledbuild)? That would give you a way to time it.
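A factory like the one proposed above might look like the following, using the buildbot API of that era (`BuildFactory` plus `ShellCommand` steps, which the master times individually). This is a sketch: the builder name, slave name, and timeout value are assumptions, not from this bug.

```python
# Sketch of the minimal buildbot factory suggested above (master.cfg
# fragment). Builder/slave names here are made up for illustration.
from buildbot.process.factory import BuildFactory
from buildbot.steps.shell import ShellCommand

f = BuildFactory()
f.addStep(ShellCommand(command=['rm', '-rf', 'obj-firefox'],
                       description='clobbering'))
f.addStep(ShellCommand(command=['make', '-f', 'client.mk', 'profiledbuild'],
                       description='pgo build',
                       timeout=4 * 3600))  # PGO builds can take hours

# Hooked up to a builder in master.cfg, e.g.:
# c['builders'].append({'name': 'win32-pgo-timing',
#                       'slavename': 'bm-win32-test02',
#                       'builddir': 'win32-pgo-timing',
#                       'factory': f})
```

Since the master records elapsed time for every step, this gives per-build timing without any manual 'date'/'time' bookkeeping.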
Comment 10•15 years ago
I've been timing via: date > build.log; make -f client.mk profiledbuild; date >> build.log

Also, changing the build directory seems to have fixed the hang for me. I used to be running out of /builds/mozilla-central, which is an msys path to c:/mozilla-build/msys/builds/mozilla-central. Running out of /c/builds/mozilla-central (aka c:/builds/mozilla-central) seems to have fixed it, so now running 'date; make -f client.mk profiledbuild &> build.log; date' works without a hang... at least twice now, anyway.
Comment 11•15 years ago
No-op builds take 1 hour 17 minutes on this machine, most of which is spent relinking xul (see bug 518107).
Comment 12•15 years ago
New machines should be arriving this week...
Assignee: nobody → catlee
Component: Release Engineering: Future → Release Engineering
Priority: -- → P4
Assignee
Comment 14•15 years ago
I am running timit.py on a Windows 2003 VM (win32-slave03) and a new ix systems machine (bm-win32-test02). I have them set to run 5 cycles, each having 3 dep builds. timit.py and the mozconfigs are versioned in http://hg.johnford.info/machine-timing. For the VM, I am using -j4 in my mozconfig. I am using -j1 on the hardware because of bug 524149.
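The real timit.py lives in the linked repository; as a rough illustration of what such a cycle-timing harness does, a minimal sketch might look like this (function name, structure, and example commands are assumptions, not the actual script):

```python
import subprocess
import time

def time_command(cmd, cycles=1):
    """Run `cmd` repeatedly, returning wall-clock seconds per run.

    A simplified stand-in for a build-timing harness like timit.py;
    the real script is in the repository linked above.
    """
    durations = []
    for _ in range(cycles):
        start = time.time()
        subprocess.run(cmd, shell=True, check=True)
        durations.append(time.time() - start)
    return durations

# Example with illustrative commands, mirroring the 5-cycle setup above:
# clobber_times = time_command(
#     "rm -rf obj-firefox && make -f client.mk profiledbuild", cycles=5)
```

Collecting the per-run durations this way makes it easy to compare the VM and hardware numbers side by side.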
Assignee
Comment 15•14 years ago
These machines were able to complete 3 PGO clobber builds and 9 no-op dep builds before the VM was able to finish its first PGO clobber build. The hardware had -j1 and the VM had -j4, because the hardware doesn't work with anything other than -j1. I believe that we ordered some more ix systems machines.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Reporter
Comment 16•14 years ago
I guess we'll fix the remaining PyMake bugs and get that parallelism back on the hardware machines!
Updated•11 years ago
Product: mozilla.org → Release Engineering