Closed Bug 31364 Opened 25 years ago Closed 24 years ago

parallel build dies of race condition xpidl<->mkdir in export

Categories

(SeaMonkey :: Build Config, defect, P3)

Sun
Solaris
defect

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: axel, Assigned: cls)

Details

Attachments

(4 files)

doing a parallel build with -j12 on a 12proc machine with sources in /tmp:
cvsco.log from Mar 10 20:12 MET

error messages like (this is not the first one, but they don't differ much)
(second run after nsprpub)

make[3]: Entering directory `/tmp/build/rdf/chrome'
Creating .deps
make[4]: Entering directory `/tmp/build/rdf/chrome/public'
Creating .deps
Creating _xpidlgen
nsIChromeRegistry.idl
../../../dist/bin/xpidl -m header -w -I ../../../dist/idl
-I/tmp/mozilla/rdf/chrome/public -o _xpidlgen/nsIChromeRegistry
/tmp/mozilla/rdf/chrome/public/nsIChromeRegistry.idl
nsIChromeEntry.idl
../../../dist/bin/xpidl -m header -w -I ../../../dist/idl
-I/tmp/mozilla/rdf/chrome/public -o _xpidlgen/nsIChromeEntry
/tmp/mozilla/rdf/chrome/public/nsIChromeEntry.idl
../../../config/nsinstall -R -m 444
/tmp/mozilla/rdf/chrome/public/nsIChromeRegistry.idl
/tmp/mozilla/rdf/chrome/public/nsIChromeEntry.idl ../../../dist/idl
error opening output file: No such file or directory
make[4]: *** [_xpidlgen/nsIChromeEntry.h] Error 1
make[4]: *** Waiting for unfinished jobs....
make[4]: Leaving directory `/tmp/build/rdf/chrome/public'
make[3]: *** [export] Error 2
make[3]: Leaving directory `/tmp/build/rdf/chrome'
Status: UNCONFIRMED → NEW
Ever confirmed: true
Status: NEW → ASSIGNED
Can one of you 3 that sees this problem try building setting XPIDL_GEN_DIR=. ? 
Our problems with parallel builds with the classic NSPR build stemmed from all
of the mkdir calls used to create the .OBJ dirs.   
As setting the environment didn't cut it, I changed rules.mk by hand to replace
XPIDL_GEN_DIR		= _xpidlgen
with
XPIDL_GEN_DIR		= .
in line 234
Then the race is fixed, but the build breaks with

make[1]: Entering directory `/tmp/build/widget/public'
nsIWidget.idl
../../dist/bin/xpidl -m header -w -I ../../dist/idl -I/tmp/mozilla/widget/public
-o .//tmp/mozilla/widget/public/nsIWidget
/tmp/mozilla/widget/public/nsIWidget.idl
error opening output file: No such file or directory
make[1]: *** [/tmp/mozilla/widget/public/nsIWidget.h] Error 1
make[1]: Leaving directory `/tmp/build/widget/public'
make: *** [export] Error 2

New bug? cls said he would work on it after getting up again.

Axel
How about generating the dirs when generating the makefiles? I made a hack to
acoutput-fast.pl that checks whether a makefile has XPIDLSCR and creates the
xpidl dirs.

I found some errors in allmakefiles when testing this. I am attaching a patch
that has the hacks to acoutput-fast.pl and the fixes to allmakefiles.sh

How about creating the .deps dirs the same way?

Because you still need to be able to make the directories on the fly after
someone does a 'make clean'.  Axel, which version of gnu make are you using?
I use GNU Make version 3.78.1.
I had a different idea, how about a dummy target, and add that to the
dependencies?

$(XPIDL_GEN_DIR)/%.h: %.idl $(XPIDL_COMPILE) dirs_target
	$(REPORT_BUILD)

dirs_target:
	@if test ! -d $(XPIDL_GEN_DIR); then echo Creating $(XPIDL_GEN_DIR); rm -rf $(XPIDL_GEN_DIR); mkdir $(XPIDL_GEN_DIR); else true; fi

This way, the headers depend on the existence test for the dir, but not on the
dir itself, right?
I've had nothing but problems doing parallel makes with gnu make > 3.77 .  I
don't know what Smith changed with the jobserver stuff but it doesn't work. 
Downgrade to 3.76.1 and let me know if the problem persists.
mass re-assign of all bugs where i was listed as the qa contact
QA Contact: cyeh → chofmann
Can anyone duplicate this using gnu make <= 3.77?

Target Milestone: --- → M18
After some digging thru the bug-make mail archive, I ran across a thread that
seems to indicate that there is a serious bug with at least make 3.78.1.  Look
at the '3.78.1 Error with "::" targets and "-j" option' thread.

http://www.geocrawler.com/archives/3/351/1999/11/0/

From experience, it doesn't appear to have been fixed with 3.79 but I don't see
anything about it one way or the other.
While make 3.79.1 does fix the bug mentioned in bug-make, it does not fix this.

I wonder if someone can come up with a small example to submit to the make people.
Also, btw, I tried make 3.77, but it seems to have another bug that makes it
fail immediately:

make[5]: Entering directory
`/mnt/proj/mozilla/mozilla/nsprpub/pr/include/obsolete'
../../../config/SunOS5.7_sparc_32_PTH_DBG.OBJ/nsinstall -R -m 444 
/mnt/proj/mozilla/mozilla/dist/include/obsolete
usage: ../../../config/SunOS5.7_sparc_32_PTH_DBG.OBJ/nsinstall [-C cwd] [-L
linkprefix] [-m mode] [-o owner] [-g group]
                                                               [-DdltR] file
[file ...] directory
make[5]: *** [export] Error 2
That is a known issue with the $(wildcard) feature & make 3.77 under solaris. 
You will need to downgrade to 3.76.1. :-/ 
Part of this looks like a basic test-and-create-is-not-atomic race. In many
places across the makefiles we have:

if test ! -d foo; then rm -rf foo; mkdir foo; else true; fi

If I just use "mkdir -p foo" instead, the xpidlgen problems go away (however I
still get errors building nspr; I'm looking into those). What's the reasoning
behind this test?
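
For illustration only (this is not the attached patch), the two idioms can be compared in plain shell. The helper names racy_mkdir and safe_mkdir and the scratch directory are hypothetical, and the sketch assumes a mkdir that supports -p. The first helper mirrors the makefile idiom: a second job can create the directory between the test and the mkdir, and the plain mkdir then fails. The second treats an existing directory as success, so it does not matter which job gets there first.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not taken from the Mozilla makefiles.

# Test-then-create: another job can create "$1" between the test and the
# mkdir, in which case the plain mkdir fails because the directory exists.
racy_mkdir() {
    if test ! -d "$1"; then rm -rf "$1"; mkdir "$1"; else true; fi
}

# mkdir -p succeeds when the directory already exists (assumes -p support).
safe_mkdir() {
    mkdir -p "$1"
}

scratch=$(mktemp -d)
failures=0
pids=""
# Start several jobs that all want the same directory, as make -j would.
for i in 1 2 3 4 5 6 7 8; do
    safe_mkdir "$scratch/_xpidlgen" &
    pids="$pids $!"
done
# Count the jobs that exited with an error.
for pid in $pids; do
    wait "$pid" || failures=$((failures + 1))
done
echo "failures: $failures"
rm -rf "$scratch"
```

Swapping safe_mkdir for racy_mkdir in the loop may or may not show a failure on a given run, since the window between the test and the mkdir is small; that intermittency matches the behaviour reported in this bug.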
My current patch is above. With this applied, the only error I can consistently
reproduce is one that also happens sometimes on non-SMP systems (well, Master_D
is getting it at least, and he's not SMP). I don't think it's 100% fixed though.
I believe the reason for the test is that the -p option is not supported on
mkdir on all platforms.  I'm wondering if we shouldn't just start using a
mkinstalldirs script like a number of projects do?
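
As a rough idea of what such a helper does, here is a minimal sketch in POSIX shell. It is not the real mkinstalldirs script; the function name make_dirs is hypothetical. It creates each path component with plain mkdir (no -p) and only reports an error if the component still does not exist afterwards, so losing a race against another job is harmless.

```shell
#!/bin/sh
# Hypothetical sketch of a mkinstalldirs-style helper; not the real script.
make_dirs() {
    status=0
    for target in "$@"; do
        # Keep the leading slash for absolute paths.
        case $target in
            /*) path=/ ;;
            *)  path= ;;
        esac
        rest=$target
        # Walk the path one component at a time.
        while [ -n "$rest" ]; do
            part=${rest%%/*}
            case $rest in
                */*) rest=${rest#*/} ;;
                *)   rest= ;;
            esac
            [ -n "$part" ] || continue
            path=$path$part
            if [ ! -d "$path" ]; then
                # Ignore mkdir's exit status: another job may have won the race.
                mkdir "$path" 2>/dev/null || true
                [ -d "$path" ] || { echo "cannot create $path" >&2; status=1; }
            fi
            path=$path/
        done
    done
    return $status
}

# Example use with made-up directory names under a scratch location.
scratch=$(mktemp -d)
make_dirs "$scratch/dist/idl" "$scratch/dist/include" && echo "created"
rm -rf "$scratch"
```

The sketch relies on POSIX parameter expansion (${var%%pattern}, ${var#pattern}), which a pre-POSIX /bin/sh may lack, so a version meant for every platform would need a more conservative way to split the path.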
Would it be possible to simply make sure the 'export' target gets built with -j1 all of the time? This is where all the problems are, so it would at least be a good workaround until we figure out mkinstalldirs (I'm not familiar with the details of that, unfortunately) or something else.
adding self to cc as our unix daily build systems are multicpu but aren't doing
parallel builds.
hmm, I just tried doing a non-parallel make export and a -j4 make install on
sol26 and cut the build time from about 5 hours down to about 4, but I got a lot of

gmake[2]: warning: -jN forced in submake: disabling jobserver mode.

Other than that, it seemed to complete without problems.  If it works on hpux
and linux I'll turn it on for the daily builds.
I think this is because all the submakes use -j4 .  Taken literally,
this would mean that each submake should start 4 jobs.  Since this is
obviously not what you want, it ignores that and coordinates the number
of jobs with the parent make. The warning is to tell you it's doing that.

It might go away if we could tell the submakes not to use -jN, but that's
probably not trivial.  So I think it can be ignored.
I finally got around to configuring the dual processor linux box for daily
verification builds.  Once I get the daily builds switched over to the new
system (test build going now) I'll be looking at turning this on again for the
daily builds...
If I make sure that the generated files have an actual dependency upon a target
that makes the XPIDL_GEN_DIR, then the problem goes away for me.  Can someone
with a hoss test box try this out?

Note: they cannot actually depend upon XPIDL_GEN_DIR as the timestamp of
XPIDL_GEN_DIR changes when its contents change.
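
The timestamp behaviour mentioned in the note can be seen with a small shell experiment. This is only a sketch: the helper dir_is_newer, the marker file, and the backdated times are made up for illustration. A directory's modification time moves forward whenever an entry is created in it, so anything that listed the directory as a normal prerequisite would look out of date after every new file.

```shell
#!/bin/sh
# Hypothetical demonstration; the names and dates are made up.

# Prints "yes" if directory $1 has a newer mtime than file $2, else "no".
dir_is_newer() {
    if [ -n "$(find "$1" -prune -newer "$2")" ]; then echo yes; else echo no; fi
}

scratch=$(mktemp -d)
mkdir "$scratch/_xpidlgen"
touch "$scratch/marker"

# Backdate both, with the directory one day older than the marker file.
touch -t 200001010000 "$scratch/_xpidlgen"
touch -t 200001020000 "$scratch/marker"
before=$(dir_is_newer "$scratch/_xpidlgen" "$scratch/marker")

# Creating a file inside the directory updates the directory's mtime.
touch "$scratch/_xpidlgen/nsIWidget.h"
after=$(dir_is_newer "$scratch/_xpidlgen" "$scratch/marker")

echo "newer before: $before, newer after: $after"
rm -rf "$scratch"
```

Here the marker file stands in for an already generated header: once a second header lands in the directory, the directory is newer than the first one, which is why the rule has to depend on something other than the directory itself.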
Hi,
I tested a (modified version of) cls' patch. The file in xpidlgen_ does the
trick.
I gave some facelifting to the patch by cls.
First, there were some security patches in there, removed those.
The XPIDL_GEN_DIR is not part of the MAKE_DIRS variables anymore, as we have 
the right dependency in there. no need to have it twice.
I rephrased the generating line a bit. Nothing much happened there.

I tested this on our machine, with a make -j6 export. The load is not 
particularly low at the moment, but 4 procs were free.
I figure I would have hit the race if this didn't work.
clobber worked out all right, too.

r=me

Axel
Patch has been checked in.  Marking fixed.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
Tested the patch on my SMP system here, works fine. Marking verified.
Status: RESOLVED → VERIFIED
Product: Browser → Seamonkey