Closed Bug 815219 Opened 7 years ago Closed 7 years ago

Default to building with all available cores

Categories

(Release Engineering :: General, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gps, Assigned: gps)

Details

(Keywords: dev-doc-complete)

Attachments

(2 files)

I pushed a specialized build to try which measures system resource usage when building. On the EC2 instance it hit, it never peaked above 50% CPU usage. It appears that the Linux mozconfigs are all running -j4. Since we peaked at 50% CPU, I'm guessing these machines have 8 cores and we should probably increase to -j12.

I'm not sure if it's safe to make this change globally or if we should conditionally increase it on just the EC2 builders.

https://tbpl.mozilla.org/php/getParsedLog.php?id=17338665&tree=Try&full=1 contains the raw data.
catlee told me what type of EC2 instance we were using but I forget now:
http://aws.amazon.com/ec2/instance-types/

I wonder if we shouldn't make the build (whether via mozconfig or other trickery) able to determine the number of cores and set -j to an appropriate value.
I wholeheartedly agree that -j should be chosen automatically. Early implementations of mach featured this. Unfortunately, it got lost when I transitioned to building through client.mk.

There are a number of solutions to this. Unfortunately, I think they are all somewhat dirty.

The one-liner you are looking for to obtain CPU count is:

  python -c 'import multiprocessing; print(multiprocessing.cpu_count() + 1)'

We can bikeshed about how many extra processes to add. I don't think any more than 2 extra is beneficial. My measurements show that even 1 extra doesn't do much, if anything.
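To make the idea concrete, here is a minimal sketch of what an automatic -j helper could look like, building on the one-liner above. The function name `choose_job_count` is hypothetical, not from the actual patch:

```python
import multiprocessing

def choose_job_count(extra=1):
    """Pick a make -j value: number of cores plus a small headroom.

    Per the measurements discussed here, more than 1-2 extra jobs buys
    little, so `extra` defaults to 1 (matching the one-liner above).
    """
    try:
        cores = multiprocessing.cpu_count()
    except NotImplementedError:
        # cpu_count() can raise on unusual platforms; fall back to serial.
        cores = 1
    return max(1, cores + extra)

print(choose_job_count())
```

The try/except matters because `multiprocessing.cpu_count()` is documented to raise `NotImplementedError` on platforms where the core count can't be determined.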
-j<#of cores> is the right first pass here. We can tweak it later if there's a more correct value.
What could go wrong?
Attachment #685759 - Flags: review?(ted)
Pretend that last patch doesn't contain the "+ 1"
Comment on attachment 685759 [details] [diff] [review]
Add -jN to MOZ_MAKE_FLAGS automatically, v1

Review of attachment 685759 [details] [diff] [review]:
-----------------------------------------------------------------

Sure, why not?
Attachment #685759 - Flags: review?(ted) → review+
Since we set -j automatically now and the default value is optimal, we should be able to remove its definition from the in-tree mozconfigs. This patch does that.

The only place where MOZ_MAKE_FLAGS is still referenced is Windows. If we're not using pymake, we explicitly use -j1. We should probably just have the driver bail if GNU make is used on Windows. But, I'm pretty sure we can't do that yet since not all Windows tree configs have been swung over to pymake.

Try at https://tbpl.mozilla.org/?tree=Try&rev=d4e98a1704ac
Assignee: nobody → gps
Status: NEW → ASSIGNED
Attachment #685769 - Flags: review?(ted)
Summary: Increase make -j on EC2 builders → Default to building with all available cores
Speak now or forever hold your peace.
gogo
Comment on attachment 685769 [details] [diff] [review]
Part 2: Remove -jN from in-tree mozconfigs, v1

Review of attachment 685769 [details] [diff] [review]:
-----------------------------------------------------------------

We should make sure this doesn't regress build speed on any of our current platforms.
Attachment #685769 - Flags: review?(ted) → review+
Yay!   If anyone objects to passing -j#ofcores automatically, please report them to the authorities.  Thank you!
coop was looking at updating the Build Faster dashboards. Did we ever get those back up and running?
Did we up our python version requirement enough for this?  If so, lets do it!
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #13)
> Did we up our python version requirement enough for this?  If so, lets do it!

multiprocessing was added in Python 2.6, which is our current minimum required Python version.
I object to -j#ofcores.

-j#ofcores*1.5 has my vote. :)
From 2 try builds I performed today. Before -> After times for just the compile buildbot step.

Linux64 Opt:     19:00 (try-linux64-ec2-618) -> 16:21 (bld-centos6-hp-027)
OS X 10.7 Opt:   21:38 (bld-lion-r5-021)     -> 21:05 (bld-lion-r5-009)
Win Opt:         39:50 (w64-ix-slave34)      -> 39:30 (w64-ix-slave28)
Android 2.2 Opt: 32:19 (try-linux64-ec2-317) -> 20:20 (bld-centos6-hp-032)
B2G ARM Opt:     23:31 (try-linux64-ec2-617) -> 17:38 (bld-centos6-hp-041)
B2G Panda Opt:   24:02 (try-linux64-ec2-325) -> 25:15 (bld-centos6-hp-024)
B2G Unagi Opt:   33:33 (try-linux64-ec2-609) -> 24:45 (bld-centos6-hp-040)

As comparisons, most of these numbers are worthless because A) the machines are different and B) ccache state significantly impacts build times.

The numbers do show that there are no significant regressions in performance (I would expect a 2-4x slowdown if the patches didn't work, for example).

Regardless of what this does for buildbot times, this makes individual development much nicer. If you are on a multi-core machine, you just run |./mach build| and you use all the cores without any extra configuration.
https://hg.mozilla.org/integration/mozilla-inbound/rev/ba730945bc6d
https://hg.mozilla.org/integration/mozilla-inbound/rev/7f5e2a9addff

(In reply to Mike Hommey [:glandium] from comment #15)
> I object to -j#ofcores.
> 
> -j#ofcores*1.5 has my vote. :)

Let's wait to discuss this until after data proves we regressed build times. I suspect we won't be having a discussion :)
We will need to scrub MDN of most references to -jN. I think about the only legitimate reference to it should be telling people how to slow builds down in case they are using too many system resources. I would trust our build system to make optimal decisions about the proper -j value. If this means building a lookup table or something or some kind of algorithm supported by data, we should do that in a follow-up bug. i.e. users should not need to take any action to ensure optimal build times.
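If the follow-up bug ever goes the lookup-table route, it could be as simple as the sketch below. The table values here are illustrative placeholders for the shape of the idea, not measured data:

```python
import multiprocessing

# Hypothetical tuning table: (max core count, -j value). These numbers
# are placeholders for illustration, not the result of measurements.
_J_TABLE = [(2, 3), (4, 5), (8, 9), (16, 16)]

def optimal_jobs(cores=None):
    """Map a core count to a -j value via the tuning table, falling
    back to cores + 1 for machines bigger than anything we tuned."""
    if cores is None:
        cores = multiprocessing.cpu_count()
    for limit, jobs in _J_TABLE:
        if cores <= limit:
            return jobs
    return cores + 1

print(optimal_jobs(4))
```

The point is that whatever the mapping is, it lives in one place in the build system, so users never have to set -j themselves to get optimal build times.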
Keywords: dev-doc-needed
(In reply to comment #17)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/ba730945bc6d
> https://hg.mozilla.org/integration/mozilla-inbound/rev/7f5e2a9addff
> 
> (In reply to Mike Hommey [:glandium] from comment #15)
> > I object to -j#ofcores.
> > 
> > -j#ofcores*1.5 has my vote. :)
> 
> Let's wait to discuss this until after data proves we regressed build times. I
> suspect we won't be having a discussion :)

Because we don't have the data?  :P
Should we backport this to aurora/beta as well?
(In reply to Ehsan Akhgari [:ehsan] from comment #20)
> Should we backport this to aurora/beta as well?

If we build those trees enough to warrant backport, sure. This change should be pretty harmless and a pretty easy candidate for backport.

Although, we should probably wait a day or two. I expect this patch to bring at least one house of cards down (my bet is on l10n nightlies).
(In reply to comment #21)
> (In reply to Ehsan Akhgari [:ehsan] from comment #20)
> > Should we backport this to aurora/beta as well?
> 
> If we build those trees enough to warrant backport, sure. This change should be
> pretty harmless and a pretty easy candidate for backport.

We're doing a lot of landings (I mean, more than usual) on aurora and beta because of b2g.

> Although, we should probably wait a day or two. I expect this patch to bring at
> least one house of cards down (my bet is on l10n nightlies).

Agreed.
I suppose ms2ger cc'ed me on this bug (thanks!) to confirm that it works on OpenBSD... and yes, it does :)

$ python2.7 -c 'import multiprocessing; print(multiprocessing.cpu_count())'
2

tip configures fine, and -j2 is indeed passed to gmake if I don't set it in .mozconfig.

 | | |-+= 02736 landry gmake -f client.mk
 | | | \-+- 12662 landry gmake -f /src/mozilla-central/client.mk realbuild
 | | |   \-+- 01171 landry gmake -j2 -C /usr/obj/m-c
multiprocessing itself should be safe on the BSDs. It's when you get into the true multiprocessing foo that uses locks that you run into trouble.
Product: mozilla.org → Release Engineering
Bah, I'm used to make -s -j3 -f client.mk, but now that Standard8 has ported this to comm-central I have to remember to switch to make -s -f client.mk MOZ_MAKE_FLAGS=-j3, otherwise client.mk ignores me.
(this is Windows where I have aliased make to pymake; I'm told that gmake ignores -j3 because client.mk contains .NOTPARALLEL)
(and the reason I add -j3 to the command line is because that way it affects all of my makes, not just the ones that go through client.mk)
I should try not to think about this after midnight, because pymake might be saving me anyway.
So, for gmake, the rule is that a submake with an explicit $(MAKE) -j setting does not trigger the normal parent/child job sharing. The parent makefile treats the submakefile as its own job, parallelising it with anything else in the same make, and then the submakefile creates its own parallel jobs. This means that gmake -j3 -f client.mk uses the override setting in client.mk to build.

For pymake, the way that builtins aren't multiprocessed really confused me, but I eventually figured out a workaround by using an external shell script. This shows that pymake ignores -j settings with conflicting values; only -j1 is correctly honoured, so that pymake -j3 -f client.mk ignores the override.