An initial investigation into whether using distcc might reduce wall-clock time for builds, by parallelising compilations across machines. This is related to the idea of using ccache, and of having dedicated compilation machines, but it does not depend on either. If we can reduce wall-clock time without significantly increasing overall compute time, then we would not reduce our capacity, but would gain quicker feedback on build results. The initial idea would be to simply set up a test distcc environment, run some compilations, and compare wall-clock time against single-machine compilations. If this goes well, then we can look at the bigger challenge of working out a sensible architecture for binding this together with our buildbot infrastructure.
Hey catlee, mshal, gps, I created this a loooong time ago (over a year ago!) - but maybe it is not relevant now - I have the feeling some work has been done to use distcc already. Is this still worth looking into? I'm asking you as I believe you guys are the ones that know this stuff best! Pete
I think it would be worth trying out at least to get some numbers to look at, though it would add another level of complexity & networking to the build process. This would effectively parallelize our builds in a different direction than what we do now. For example, if we have 10 machines and 10 builds to do, we essentially use all of our computing power today anyway by having each machine handle one build. So as long as #builds > #available machines, it's possible that adding distcc would give us lower overall throughput due to networking delays and such (i.e. farming a job out from one machine to another machine that is already doing a local build will take longer). The advantage we'd get, as you stated, is that we could dedicate a pool of machines to one build to churn an individual build out quicker, and then use that pool to handle the next build, etc. So the wait time might go up, but turnaround time down. Coupled with some prioritization, it might mean we could turn around a chemspill or other urgent build much quicker than today. So, it'd be useful to get some numbers to see if it's really worthwhile. With sccache in the mix, I think we'd want to make sure that sccache happens first and distcc second. Otherwise we could have distcc send a job from node A to node B (networking!), only to have node B grab the object file from sccache (networking!) and send it back to node A (more networking!).
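To make the throughput vs. turnaround trade-off from earlier in this comment concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch under stated assumptions, not a measurement: all builds are queued at time zero, every build takes the same time on a single machine, a pooled build parallelises perfectly apart from a flat networking overhead, and the 60 minute build time and 20% overhead are made-up illustrative numbers. The function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Schedule:
    """Completion time (in minutes) of each queued build."""
    finish_times: List[float]

    @property
    def makespan(self) -> float:
        # Time until the whole queue is done (a proxy for overall throughput).
        return max(self.finish_times)

    @property
    def mean_turnaround(self) -> float:
        # Average time from queueing (t=0) to completion.
        return sum(self.finish_times) / len(self.finish_times)


def one_build_per_machine(builds: int, machines: int, build_minutes: float) -> Schedule:
    """Today's setup: each machine handles one whole build at a time.

    Builds run in waves of `machines`; every build in a wave finishes together.
    """
    return Schedule([(i // machines + 1) * build_minutes for i in range(builds)])


def pooled_builds(builds: int, machines: int, build_minutes: float,
                  overhead: float) -> Schedule:
    """Hypothetical distcc-style setup: the whole pool works on one build at a time.

    ASSUMPTION: perfect parallelism across the pool, inflated by a flat
    `overhead` fraction standing in for networking and distribution costs.
    """
    per_build = build_minutes * (1 + overhead) / machines
    return Schedule([(i + 1) * per_build for i in range(builds)])


if __name__ == "__main__":
    # Illustrative numbers only: 10 builds, 10 machines, 60 min per build, 20% overhead.
    local = one_build_per_machine(10, 10, 60.0)
    pooled = pooled_builds(10, 10, 60.0, 0.2)
    for name, sched in (("one build per machine", local), ("pooled", pooled)):
        print(f"{name}: first done at {min(sched.finish_times):.1f} min, "
              f"mean turnaround {sched.mean_turnaround:.1f} min, "
              f"queue drained at {sched.makespan:.1f} min")
```

Under these assumptions the pooled setup finishes the first build much sooner and lowers the mean turnaround, but takes longer to drain the whole queue, which is the lower overall throughput mentioned above. Real numbers would depend on how well the build parallelises and on the actual network cost, so this is no substitute for measuring.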
What mshal said. Furthermore, I'm a fan of icecream over distcc (it transfers a chroot of the build environment so results should not be dependent on the configuration of the remote node - think of it like Docker containers). We already have a bug about icecream (bug 927952) and it has a lot more comments and CCs. I think any discussion about distributed compilation should be had there.