Closed Bug 888698 Opened 11 years ago Closed 10 years ago

Passing -jN to build.sh is not enough to make it use N cores

Categories

(Firefox OS Graveyard :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
2.0 S1 (9may)

People

(Reporter: reuben, Assigned: gsvelto)

References

Details

Attachments

(2 files, 6 obsolete files)

1.01 KB, patch
glandium
: review+
Details | Diff | Splinter Review
3.07 KB, patch
mwu
: review+
Details | Diff | Splinter Review
I've tried various combinations and I could only convince the build system not to use as many cores as it could but changing .config, gonk-misc/default-gecko-config *and* passing -jN to build.sh.

There should be a single option or command-line parameter that makes the build use only the number of scores specified.
I did some investigation on this issue because it has annoyed me quite a bit (-j1 builds are still mostly parallel). There's two separate problems causing the build system to ignore the -j parameter:

- First of all we explicitly set in the .config file the number of build jobs to the number of CPUs + 2, this parameter is passed in MAKE_FLAGS that is later picked up by the .mozconfig file. The relevant line is this: http://is.gd/qQbneb

- To make matters worse we blindly override this value in the default .mozconfig file with 16 jobs: http://is.gd/r1Cnkg
This produces huge build jobs which are particularly annoying because of the high load and huge memory consumption they involve

To fix these issues we need two changes: first of all we should remove the jobs override in the default .mozconfig file as it has no reason to be there. Then we need to detect when the '-j' option is passed to the build.sh script and override the value present in MAKE_FLAGS; this will be then picked up by the .mozconfig file ensuring that those jobs will be used in the gecko build.

I'll be happy to do the changes myself but I've still got to drain my backlog of koi+ bugs so if somebody else wants to pick this up in the meantime feel free to go ahead and fix it.
Assignee: nobody → gsvelto
Status: NEW → ASSIGNED
This patch filters out a -j option passed to the build.sh script and appends it to MAKE_FLAGS so that it overrides the existing -j parameter (if present). MAKE_FLAGS is then exported so that the default .mozconfig (provided as default-gecko-config in gonk-misc) will pick it up properly. This ensures that the number of jobs specified on the command-line overrides any other setting for both the Android build-system and the Gecko one.

For this to work Gecko needs another modification which I'll post shortly as gonk-misc/Android.mk is overriding MAKE_FLAGS with the contents of GECKO_MAKE_FLAGS (or -j16 if it was not specified).
Attachment #8339161 - Flags: review?(jhford)
gonk-misc/Android.mk is replacing MAKE_FLAGS with the contents of GECKO_MAKE_FLAGS if present or -j16 if not. I haven't seen GECKO_MAKE_FLAGS being used anywhere else (including outside of our codebase) so I'm removing it. The default .mozconfig file provided in gonk-misc (default-gecko-config) already picks up MAKE_FLAGS and uses it to set MOZ_MAKE_FLAGS which is then used when invoking the top-level build in Gecko.
Attachment #8339162 - Flags: review?(jhford)
Attachment #8339161 - Flags: review?(jhford) → review+
Attachment #8339162 - Flags: review?(jhford) → review+
Thanks for the reviews John! While giving a last look at the code I realized that the build.sh modifications were actually doing something *very* wrong as they were eating out all the parameters passed to the script so the -j part worked but passing a build target (such as gecko) didn't work anymore. Sorry for not noticing this before asking for review. I've modified my patch to use getopts to parse the options, this makes it much smaller and more importantly it's now working correctly.
Attachment #8339161 - Attachment is obsolete: true
Attachment #8347584 - Flags: review?(jhford)
Review ping.
Sorry to ping you again John, do you have time to re-review the second patch here? I've already encountered a few cases of people with low-spec machines being unable to build B2G w/o these patches applied so I'd like to land this ASAP as it enables boxes with 2G of RAM to reliably run a full build (although *very* slowly).
Flags: needinfo?(jhford)
This needs to land, it's not possible to build B2G on a Macbook Air anymore, and I think there are a lot of devs that have those.
Trying to build B2G on my Thinkpad X1 Carbon, I am also affected by this. 16 cc1plus processes running in parallel is causing heavy swapping and slowing the build down to a crawl. It would be great to have this fixed.
By the way, I applied the patched locally and they worked well for me.
So what effect does this have on people who have high end machines?
Comment on attachment 8347584 [details] [diff] [review]
[PATCH B2G v2] Extract the -j flag passed to the build script and ensure it's properly passed down to the gecko build system

Review of attachment 8347584 [details] [diff] [review]:
-----------------------------------------------------------------

::: build.sh
@@ +71,5 @@
>  configure_device &&
> +if [ -n "${build_jobs}" ] ; then
> +  MAKE_FLAGS="${MAKE_FLAGS} -j${build_jobs}"
> +fi &&
> +export MAKE_FLAGS="${MAKE_FLAGS}" &&

This could just be:

export MAKE_FLAGS

No need to do the assignment to itself.
(In reply to Dave Hylands [:dhylands] from comment #10)
> So what effect does this have on people who have high end machines?

It should use the value from .config, which is the number of cores + 2.  So if they have more than 14 cores, then their make concurrency will increase; and if they have more than 16 cores, then they won't be wasting CPU capacity the way they are now.
Comment on attachment 8347584 [details] [diff] [review]
[PATCH B2G v2] Extract the -j flag passed to the build script and ensure it's properly passed down to the gecko build system

mwu owns these files, I'll defer to him on the review, but it looks good to me.
Attachment #8347584 - Flags: review?(jhford) → review?(mwu)
Flags: needinfo?(jhford)
(In reply to Jed Davis [:jld] from comment #12)
> It should use the value from .config, which is the number of cores + 2.  So
> if they have more than 14 cores, then their make concurrency will increase;
> and if they have more than 16 cores, then they won't be wasting CPU capacity
> the way they are now.

Yes, that's what supposed to happen. With my patches applied you should get number_of_cpu_cores + 2 build jobs across the build system if you don't pass the -j parameter to the build.sh script and the number specified with -j if you do.

People with 24+ hardware threads monsters will get some benefit but ironically this is most beneficial to people with low-end machines. It enables builds on boxes with less than 2GiB of memory (if you run with -j2) and fixes the issue introduced by unified builds that made even 4GiB machines fail with OOM conditions.
Patch modified according to Dave's suggestion in comment 11
Attachment #8347584 - Attachment is obsolete: true
Attachment #8347584 - Flags: review?(mwu)
Attachment #8379740 - Flags: review?(mwu)
Comment on attachment 8379740 [details] [diff] [review]
[PATCH B2G v3] Extract the -j flag passed to the build script and ensure it's properly passed down to the gecko build system

r- because this isn't fixing the real problem - the gecko build system is not using the job server.

The reason that it's not using the job server can be found in $(GECKO)/client.mk - this code checks MOZ_MAKE_FLAGS for -j flags and if it finds none, it inserts its own. However, the jobserver configuration is passed via MAKEFLAGS, not MOZ_MAKE_FLAGS, so it doesn't see that -j is already specified. It also does another thing that breaks jobservers - it uses MOZ_MAKE, which breaks the magic of $(MAKE). If both of those issues are fixed, the gecko build automatically uses the right number of jobs without any hacking around in build.sh. This also means the issue is fixed for our partners who don't use build.sh.
Attachment #8379740 - Flags: review?(mwu) → review-
(In reply to Michael Wu [:mwu] from comment #16)
> The reason that it's not using the job server can be found in
> $(GECKO)/client.mk - this code checks MOZ_MAKE_FLAGS for -j flags and if it
> finds none, it inserts its own. However, the jobserver configuration is
> passed via MAKEFLAGS, not MOZ_MAKE_FLAGS, so it doesn't see that -j is
> already specified. It also does another thing that breaks jobservers - it
> uses MOZ_MAKE, which breaks the magic of $(MAKE). If both of those issues
> are fixed, the gecko build automatically uses the right number of jobs
> without any hacking around in build.sh. This also means the issue is fixed
> for our partners who don't use build.sh.

I've modified client.mk to not blindly override the number of jobs as it was currently doing via MOZ_MAKE_FLAGS but use MAKEFLAGS instead. I've also removed the use of $(MOZ_MAKE) to ensure that the job server works properly during recursion.

This fixes our problem but causes a slight change in the default behavior of a gecko build.

Currently if we build by invoking 'make -f client.mk' or './mach build' without specifying a number of jobs we get one job per core. Preserving this behavior is impossible because it would require client.mk to know how many jobs were passed to the make invocation and this information is not available via MAKEFLAGS. So basically it's impossible to tell between these two invocations:

make -f client.mk

and

make -j1 -f client.mk

What I did in my patch was to preserve the current mach behavior (one job per core if -jN was not passed to the command-line) but changed the behavior of the two invocations above. Both of them will now yield a serial build instead of the one-job-per-core build they previously yielded. Hopefully this shouldn't break anything as it's the most natural and unsurprising way to use make IMHO.
Attachment #8379740 - Attachment is obsolete: true
Attachment #8393527 - Flags: review?(gps)
Attachment #8393527 - Flags: feedback?(mwu)
Comment on attachment 8393527 [details] [diff] [review]
[PATCH] Make client.mk honor the number of jobs passed on the commandline

If we want to preserve the old behavior, we can add a small hack to gonk-misc/Android.mk which ensures we always pass -j1 to the gecko build system if -j isn't found in MAKEFLAGS.
Attachment #8393527 - Flags: feedback?(mwu) → feedback+
Comment on attachment 8393527 [details] [diff] [review]
[PATCH] Make client.mk honor the number of jobs passed on the commandline

Review of attachment 8393527 [details] [diff] [review]:
-----------------------------------------------------------------

I think this is mostly right. There were 3 or 4 things I thought this might break. But looking closer, I think the client.mk changes are sane.

Although, I have to wonder why you can't just export MOZ_PARALLEL_BUILD to your invocation of client.mk? That seems much simpler and doesn't require a change to mozilla-central.

FWIW, I recommend b2g's build system not call make or client.mk directly. We're going to be moving all of automation to call mach instead. See bug 978211. Please file a follow-up bug to get b2g to use mach.

::: python/mozbuild/mozbuild/base.py
@@ +424,5 @@
>                  args.append('-j%d' % multiprocessing.cpu_count())
>          elif num_jobs > 0:
>              args.append('MOZ_PARALLEL_BUILD=%d' % num_jobs)
> +        else:
> +            args.append('MOZ_PARALLEL_BUILD=%d' % multiprocessing.cpu_count())

This will now pass -jN > 1 if allow_parallel is False. That's not correct.
Attachment #8393527 - Flags: review?(gps) → feedback+
(In reply to Gregory Szorc [:gps] from comment #21)
> Although, I have to wonder why you can't just export MOZ_PARALLEL_BUILD to
> your invocation of client.mk? That seems much simpler and doesn't require a
> change to mozilla-central.

That's possible and I was doing something along the lines in my previous patch but it can only be done from b2g's build.sh script. Somebody invoking the top-level makefile won't be able to use it and since b2g relies on Android's build infrastructure there's no way to avoid it. See Michael's note in comment 16.

> FWIW, I recommend b2g's build system not call make or client.mk directly.
> We're going to be moving all of automation to call mach instead. See bug
> 978211. Please file a follow-up bug to get b2g to use mach.

That would involve changing gonk-misc/Android.mk, see here:

https://github.com/mozilla-b2g/gonk-misc/blob/601fb8ff2f00dbbd61cfa6b2ec6184cdce4e6afc/Android.mk#L288

I fear this might cause some problems; as I mentioned above we piggy-back on Android's build system which is make-based, if we simply invoke mach from that rule the jobserver won't know how many jobs are in use globally (it has no visibility within mach) and thus the problem we have here will manifest itself again though in a different form.

> ::: python/mozbuild/mozbuild/base.py
> @@ +424,5 @@
> >                  args.append('-j%d' % multiprocessing.cpu_count())
> >          elif num_jobs > 0:
> >              args.append('MOZ_PARALLEL_BUILD=%d' % num_jobs)
> > +        else:
> > +            args.append('MOZ_PARALLEL_BUILD=%d' % multiprocessing.cpu_count())
> 
> This will now pass -jN > 1 if allow_parallel is False. That's not correct.

Yeah, that's a hack but I'm not sure how to address it. The issue is that the current behavior is to do a parallel build if no -j parameter has been given to mach. That's not directly caused by mach but by client.mk instead, see this line:

http://hg.mozilla.org/mozilla-central/file/199e65efd08b/client.mk#l138

I didn't want to change this behavior as I would assume most people actually rely on it. The issue here is that this logic is split between mach, client.mk and MOZ_MAKE_FLAGS and keeping the current behavior under all possible combinations of those while fixing the problem is tricky.
(In reply to Gabriele Svelto [:gsvelto] from comment #22)
> I didn't want to change this behavior as I would assume most people actually
> rely on it. The issue here is that this logic is split between mach,
> client.mk and MOZ_MAKE_FLAGS and keeping the current behavior under all
> possible combinations of those while fixing the problem is tricky.

And this is why we want kill client.mk, MOZ_MAKE_FLAGS, and unify everything via mach. Make[files] do a decent job at executing things with dependencies. They suck as implementation for control flow (which is what client.mk and Android.mk do a lot of). We are actively porting all the make files that behave as shell scripts to a real programming language: Python. That has the extra benefit of allowing us the opportunity to reinvent the world with years of knowledge applied.
(In reply to Gregory Szorc [:gps] from comment #23)
> And this is why we want kill client.mk, MOZ_MAKE_FLAGS, and unify everything
> via mach. Make[files] do a decent job at executing things with dependencies.
> They suck as implementation for control flow (which is what client.mk and
> Android.mk do a lot of). We are actively porting all the make files that
> behave as shell scripts to a real programming language: Python. That has the
> extra benefit of allowing us the opportunity to reinvent the world with
> years of knowledge applied.

Yes, but this doesn't help us. The Android build system isn't going away unless you convince someone at Google to rewrite their build system. The only realistic option is to integrate things as best as we can.
I've got an updated patch in the pipe that should fix most of the remaining problems without altering mach's behavior. Since integration with the Android build system is important then there's a way we might build via mach while preserving the job server semantics: have mach accept the --jobserver-fds=x,y parameter and pass it along with -j to its own make invocation. If this sounds like a reasonable approach we can open a follow up and try to get it working.
This patch is a slight variation on the former that defines MOZ_PARALLEL_BUILD even if the -j parameter wasn't given to mach at all. In this case MOZ_PARALLEL_BUILD will be set to 0 and the makefile will use one job per core just as it did before. This makes all behavior consistent with what we had before (mach, make -f client.mk with and without MOZ_MAKE_FLAGS) and fixes our problems as client.mk now uses the job server properly.
Attachment #8393527 - Attachment is obsolete: true
Attachment #8394965 - Flags: review?(gps)
Comment on attachment 8394965 [details] [diff] [review]
[PATCH v2] Make client.mk honor the number of jobs passed on the commandline

Review of attachment 8394965 [details] [diff] [review]:
-----------------------------------------------------------------

I'm out of commission for a while. Reassigning review.
Attachment #8394965 - Flags: review?(gps) → review?(mh+mozilla)
Comment on attachment 8394965 [details] [diff] [review]
[PATCH v2] Make client.mk honor the number of jobs passed on the commandline

Review of attachment 8394965 [details] [diff] [review]:
-----------------------------------------------------------------

::: client.mk
@@ +144,5 @@
> +  # If MOZ_MAKE_FLAGS specifies a number of jobs pass it to MAKEFLAGS
> +  ifneq (,$(findstring -j,$(MOZ_MAKE_FLAGS)))
> +    MAKEFLAGS += $(filter -j%,$(MOZ_MAKE_FLAGS))
> +    MOZ_MAKE_FLAGS := $(filter-out -j%,$(MOZ_MAKE_FLAGS))
> +  endif

In practice, your patch makes "make -f client.mk" stop using parallel build. Yes, people are using that.

I'd rather not touch that logic, and make the b2g build system override MOZ_MAKE_FLAGS itself and set it to nothing. As for the MOZ_MAKE thing, adding + before it is enough.

That is, the following is enough on gecko's end:
diff --git a/client.mk b/client.mk
--- a/client.mk
+++ b/client.mk
@@ -390,24 +390,24 @@ ifdef MOZ_PREFLIGHT
          $(MAKE) -f $(TOPSRCDIR)/$$mkfile preflight TOPSRCDIR=$(TOPSRCDIR) OBJDIR=$(OBJDIR) MOZ_OBJDIR=$(MOZ_OBJDIR); \
        done
 endif
 
 ####################################
 # Build it
 
 realbuild::  $(OBJDIR)/Makefile $(OBJDIR)/config.status
-       $(MOZ_MAKE)
+       +$(MOZ_MAKE)
 
 ####################################
 # Other targets
 
 # Pass these target onto the real build system
 $(OBJDIR_TARGETS):: $(OBJDIR)/Makefile $(OBJDIR)/config.status
-       $(MOZ_MAKE) $@
+       +$(MOZ_MAKE) $@
 
 ####################################
 # Postflight
 
 realbuild postflight::
 ifdef MOZ_POSTFLIGHT
        set -e; \
        for mkfile in $(MOZ_POSTFLIGHT); do \


With this, if I create a foo.mk file with the following content:
   all:
           $(MAKE) -f client.mk MOZ_MAKE_FLAGS=

And run make -f foo.mk -j12, gecko builds with -j12 using the main jobserver.
Attachment #8394965 - Flags: review?(mh+mozilla) → review-
(In reply to Mike Hommey [:glandium] from comment #28)
> In practice, your patch makes "make -f client.mk" stop using parallel build.
> Yes, people are using that.

I feared that could be problematic, mostly for automation scripts.

> I'd rather not touch that logic, and make the b2g build system override
> MOZ_MAKE_FLAGS itself and set it to nothing.

The problem with that is this line:

http://hg.mozilla.org/mozilla-central/file/5ecd532a167e/client.mk#l138

If MOZ_MAKE_FLAGS is empty it will be set by the makefile itself to -j<number_of_cpu_cores>. That's what's making plain make -f client.mk builds parallel.

I've tried unconditionally setting MOZ_MAKE_FLAGS to "-j" in the B2G build machinery which works well for parallel builds (it will honor the top-level number of jobs) but doesn't work for -j1 as I think that doesn't get passed from the top-level make invocation and we get client.mk invoked with a plain -j flag which runs an unlimited number of jobs as per make's default behavior.

> As for the MOZ_MAKE thing, adding + before it is enough.

Thanks for the tip, I'll add that in the patch and look for a way around -jN being set automatically when MOZ_MAKE_FLAGS doesn't have a -j parameter already.
(In reply to Gabriele Svelto [:gsvelto] from comment #29)
> The problem with that is this line:
> 
> http://hg.mozilla.org/mozilla-central/file/5ecd532a167e/client.mk#l138
> 
> If MOZ_MAKE_FLAGS is empty it will be set by the makefile itself to
> -j<number_of_cpu_cores>. That's what's making plain make -f client.mk builds
> parallel.
> 
> I've tried unconditionally setting MOZ_MAKE_FLAGS to "-j" in the B2G build
> machinery which works well for parallel builds (it will honor the top-level
> number of jobs) but doesn't work for -j1 as I think that doesn't get passed
> from the top-level make invocation and we get client.mk invoked with a plain
> -j flag which runs an unlimited number of jobs as per make's default
> behavior.

Which is why I'm saying $(MAKE) -f client.mk MOZ_MAKE_FLAGS=
That overrides anything a makefile can do to the variable. Yes, even ?=, :=, and +=.
(In reply to Mike Hommey [:glandium] from comment #30)
> Which is why I'm saying $(MAKE) -f client.mk MOZ_MAKE_FLAGS=
> That overrides anything a makefile can do to the variable. Yes, even ?=, :=,
> and +=.

Ah, excellent! I wasn't aware of that behavior in make. Updated patch coming soon.
Updated patch using only '+' for forcing make to recognize the recursive invocations. I'll attach the adjusted gonk-misc patch for this to work properly when building FxOS.
Attachment #8394965 - Attachment is obsolete: true
Attachment #8412740 - Flags: review?(mh+mozilla)
Updated patch for gonk-misc, the changes are non-trivial so we need another round of review. I've tested this both invoking client.mk directly and via build.sh and it seems to be working fine for all combinations of -j parameters (or lack thereof).
Attachment #8339162 - Attachment is obsolete: true
Attachment #8412745 - Flags: review?(mwu)
Attachment #8412745 - Flags: review?(mwu) → review+
Attachment #8412740 - Flags: review?(mh+mozilla) → review+
I've pushed attachment 8412745 [details] [diff] [review] to gonk-misc/master 2f28398184d20480254ca275a441a1c79a1a9d8c

https://github.com/mozilla-b2g/gonk-misc/commit/2f28398184d20480254ca275a441a1c79a1a9d8c

Now we need to land attachment 8412740 [details] [diff] [review] and we'll be done here, the try run is here:

https://tbpl.mozilla.org/?tree=Try&rev=cf0a5602bb20
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/a97b0e833407
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → 2.0 S1 (9may)
It would be nice if we could uplift this to the b2g30_v1_4 branch.
You need to log in before you can comment on or make changes to this bug.