Closed Bug 123423 Opened 21 years ago Closed 17 years ago

make -j4 randomly fails on NSPR

Categories

(NSPR :: NSPR, defect)

Type: defect, Priority: Not set, Severity: major

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: netscape, Assigned: mozbugs-build)

Attachments

(2 files, 2 obsolete files)

<I really, really, really, really hate the <ENTER> == form submit bug.>

We have some fast tinderboxes that are randomly dying while trying to build
libnspr4.so.  It appears as though we try to build a single object file twice at
the same time, once from pr/src and once from one of the subdirs of pr/src/.  

http://tinderbox.mozilla.org/showlog.cgi?log=SeaMonkey/1012862100.14800.gz&fulltext=1

There are two ways to fix this:
1) Avoid building in subdirs completely and build everything from pr/src.
2) Remove the subdir objs from OBJS and add dummy targets in the subdirs that
pr/src/Makefile.in will depend upon before it links.  Each dummy target will
just build the OBJS for that directory.
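
To make the failure mode concrete, here is a toy Python model of which directories have a rule to build each object. It is only a sketch: the directory layout and object lists are illustrative assumptions, not the real contents of OBJS, and `duplicate_owners` is a hypothetical helper. Any object claimed by more than one directory can be compiled twice at the same time under -jN; option 2 leaves the subdirectory as the only owner.

```python
from collections import defaultdict


def duplicate_owners(rules):
    """Return {object: sorted dirs} for objects buildable from more than one directory."""
    owners = defaultdict(set)
    for directory, objects in rules.items():
        for obj in objects:
            owners[obj].add(directory)
    return {obj: sorted(dirs) for obj, dirs in owners.items() if len(dirs) > 1}


# Hypothetical layout: pr/src lists subdirectory objects in its own OBJS.
before = {
    "pr/src": ["io/prfdcach.o", "linking/prlink.o"],
    "pr/src/io": ["io/prfdcach.o"],
    "pr/src/linking": ["linking/prlink.o"],
}

# Option 2 (assumed shape): pr/src no longer lists them and only waits
# for the subdirectory targets before linking.
after = {
    "pr/src": [],
    "pr/src/io": ["io/prfdcach.o"],
    "pr/src/linking": ["linking/prlink.o"],
}

print(duplicate_owners(before))  # both objects have two owners
print(duplicate_owners(after))   # no duplicates remain
```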
I am seeing this bug on a Sun Blade 1000 with two 750 MHz processors and 2 GB of RAM.

I am using Forte 7 EA and Solaris 9 EA; however, I can also confirm I was seeing
this on Solaris 8 10/01.

--dcran--
Status: NEW → ASSIGNED
Priority: -- → P2
Target Milestone: --- → 4.2
putting myself on the cc: list
*** Bug 127463 has been marked as a duplicate of this bug. ***
*** Bug 128301 has been marked as a duplicate of this bug. ***
Comment on attachment 73353 [details] [diff] [review]
filter -jN out of $(MAKE) and force to -j1

*sigh* Yeah...sure, that'll work.  I still want to make -jN work as NSPR isn't
*that* small and the time spent building NSPR is noticeable.
Attachment #73353 - Flags: review+
Oh, I agree.  This patch is definitely not a good long-term solution, but is it
good enough for now that it's worth checking in?  I'm sensitive to sleestack's
'better a slower build than a false failure' mode of thinking.  I've been
running it in the 'brad' tinderbox since I posted it here and it seems to have
done the job nicely.  It's also running in the MozillaTest 'brad-fast' build,
which is doing just a -j1, and it's working fine there too.
Comment on attachment 73353 [details] [diff] [review]
filter -jN out of $(MAKE) and force to -j1

a=asa (on behalf of drivers) for checkin to the 1.0 trunk
Attachment #73353 - Flags: approval+
The temp workaround has been checked into the client branch & the trunk.
Comment on attachment 73353 [details] [diff] [review]
filter -jN out of $(MAKE) and force to -j1

obsoleting checked in patch so it doesn't look like an approved checkin waiting
to land.
Attachment #73353 - Attachment is obsolete: true
Target Milestone: 4.2 → 4.2.1
Target Milestone: 4.2.1 → Future
Mass reassign to new default build assignee
Assignee: seawood → mozbugs-build
Status: ASSIGNED → NEW
Priority: P2 → --
Attached patch Possible fix for parallel builds (obsolete) — Splinter Review
This is an attempt to make a parallel build run for nspr.  It seemed to work on
my win32 box, but I only have a single processor.  Is it possible that just
specifying the subdirectorys as prerequisites of the OBJS will for the
directories to be built first?  This may bust something else, so please test.

Kevin
(In reply to comment #13)
> specifying the subdirectorys as prerequisites of the OBJS will for the
> directories to be built first?  This may bust something else, so please 

that's 'subdirectories' and 'OBJS will force the'

Sorry for the poor typing skills

Kevin
Kevin,

Your patch does not apply to the current NSPR tip. It's a good idea to write
patches against the latest version. I was able to hand-modify your patch and
apply it. The good news is that it doesn't seem to break my NSPR build on
Solaris, running with -j2, on my dual-CPU Blade 2000.  The bad news is that it
doesn't seem to truly do a concurrent build. I benchmarked the entire NSPR build
and it came to the same 65 seconds with or without -j2. top showed that for the
most part the CPU usage was identical in both cases, mostly about 50% idle,
which would confirm only one CPU is getting used.
(In reply to comment #15)
> Kevin,
> 
> Your patch does not apply to the current NSPR tip. It's a good idea to write 
> the patches against the latest version. 

D'oh.  I just used my build tree for thunderbird and forgot that it pulls a
branch of NSPR.

> The good news is that it doesn't seem to break my NSPR build on
> Solaris, running with -j2, on my dual CPU Blade 2000.

That's good to hear.

>  The bad news is that it doesn't seem to truly do a concurrent build. I 
> benchmarked the entire NSPR build and it came to the same 65 seconds with
> or without -j2. top showed that for the most part the CPU usage was 
> identical in both cases, mostly about 50% idle, which would confirm only
> one CPU is getting used.
> 
Hmm.  That doesn't make much sense.  Can you try with something like -j4 and see
if that makes a difference?
Kevin,

It just doesn't work. And actually I went back to older code without your patch,
and built NSPR with -j4 without any problem. This would seem to contradict the
other comments in this bug report.

Which Makefile are you building when you see a failure, and with what command?
I build from mozilla/dist/SunOS5.9_DBG.OBJ, and use "gmake -j4" there.

> It just doesn't work. And actually I went back to older code without your patch,
> and built NSPR with -j4 without any problem. This would seem to contradict the
> other comments in this bug report.

Did you back out wtc's patch, which forces make to use -j1, when you did your
testing?  When I remove that patch, I still see the build problem (usually when
creating prlink.o).
Same patch against the tip
Attachment #146109 - Attachment is obsolete: true
Julien:

As cls pointed out, if you backed out the patch, then you are back to the current
workaround, which is to force a non-parallel build (that is the line in the
top-level Makefile.in that is removed in the patch).

When I used this patch on my win32 machine with make -j4, I could see 4 jobs
get sent off to compile.  I don't have multiple processors, so I can't verify
that multiple CPUs will be used, but this patch does in fact spawn multiple make
processes and there are no errors in building (on either win32 or Solaris, per
your comment above).  Maybe there is a problem with gmake and using multiple CPUs?

Kevin
Chris,

Yes, it was the case that I hadn't disabled the workaround of -j1 when I backed
out Kevin's fix. I do see the build failure with -j4 on my OS/2 box when I do that.

Kevin,

Thanks for the patch update against the tip.
I applied it to my gcc OS/2 NSPR tree at home. I'm running on a dual Athlon MP
2800+.

The patch works, and both CPUs are pegged at 100% when running gmake -j4.
In fact, on that machine the NSPR build gets cut from 47 seconds to 19 seconds.

Now, more than a 2:1 improvement doesn't sound right, so I disabled the second
CPU on the OS/2 machine, and reran the build:
gmake takes 47.5s, basically the same amount as with two CPUs. The one CPU was
pegged the whole time.
gmake -j4 takes 31s. The one CPU was pegged the whole time.

This means even on my single-CPU machine, the build time was significantly cut
by running gmake -j4, from 31s to 19s, or 39%. All this without any
hyperthreading, since the Athlon MP CPU doesn't have that feature (and OS/2 SMP
doesn't support the Intel virtual CPUs anyway - only true physical CPUs).

I would be interested to find out if this large build time improvement with -j4
is seen on other single CPU machines and platforms too.

When I get to the office, I will try again with your latest patch on my Sun
Blade 2000. Perhaps I missed something and applied it incorrectly yesterday when
I converted it from the branch and applied it to my tip tree.
Just realized I messed up the numbers. Ignore the 39%.
On single CPU, gmake -j4 cut the build time from 47.5s to 31s, or by about 35%.
On dual CPU, gmake -j4 cut the build time from 47s to 19s, or by about 60%.
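
For reference, the corrected percentages can be rechecked with a few lines of Python. This is just the arithmetic on the timings quoted above; `reduction` is a hypothetical helper name. The last line reproduces the mistaken 39% figure, which compared the single-CPU -j4 time against the dual-CPU -j4 time.

```python
def reduction(before_s, after_s):
    """Percentage by which the build time dropped."""
    return 100.0 * (before_s - after_s) / before_s


print(f"single CPU, gmake vs gmake -j4: {reduction(47.5, 31):.1f}%")  # 34.7%
print(f"dual CPU, gmake vs gmake -j4:   {reduction(47, 19):.1f}%")    # 59.6%
print(f"mixed-up comparison (31s, 19s): {reduction(31, 19):.1f}%")    # 38.7%
```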
This patch doesn't work.  As this build log file
shows, many files are compiled twice.  For example,
search for "prfdcach.c".

I appreciate your help with this bug.  I'd like
any new patch to be accompanied by an explanation
of why parallel make breaks in the
mozilla/nsprpub/pr/src directory and how the
patch solves that problem.  Without the explanation,
I won't be convinced that the patch is correct.

I think this bug is not worth fixing, but I agree
that this bug is a rather interesting challenge.
This is why I haven't marked it WONTFIX.
Comment on attachment 146179 [details] [diff] [review]
Same patch against the tip

Here is the sequence to demonstrate another problem
with this patch.

1. Apply the patch to NSPR and do "gmake" in the
nsprpub directory.

2. cd pr/src/io

3. touch dummy

4. cd ../../..

5. gmake

See all the files get recompiled.

In step 2, you can use any subdirectory of pr/src.
The creation of the new file "dummy" changes the
timestamp on the subdirectory, which causes $(OBJS)
to be rebuilt.
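
The timestamp effect behind steps 2 and 3 can be seen in isolation with a small Python sketch. It assumes a POSIX-like filesystem and uses a temporary directory in place of a real pr/src subdirectory; it does not run make at all. It only shows that creating a file bumps the modification time of the containing directory, which is what make compares against when a directory is listed as a prerequisite.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as subdir:
    # Backdate the directory so the change is visible regardless of
    # timestamp granularity.
    os.utime(subdir, (0, 0))
    before = os.stat(subdir).st_mtime

    # Equivalent of "touch dummy" inside the subdirectory.
    open(os.path.join(subdir, "dummy"), "w").close()
    after = os.stat(subdir).st_mtime

    print("directory mtime advanced:", after > before)
```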
Attachment #146179 - Flags: review-
FYI, I have confirmed on my Sun that the patch doesn't work. The busiest the
dual-CPU machine gets is 52% CPU usage, i.e. about one full CPU is used. That
might have something to do with the old gmake 3.76.1, which is the one we have in
/usr/dist here.
Actually, I double-checked it on OS/2. I thought the -j4 build had completed,
because the final page of the build before returning to the shell showed no
error, and everything else above scrolled off the screen.

But today I noticed there were no NSPR DLLs produced. The build was in fact cut
short. I then clobbered, rebuilt, and redirected stderr and stdout to files, and
saw that in fact there were problems. No wonder it was so much quicker :-(

On the other hand, I ran two NSPR builds from different source trees in parallel
on my SMP machine, and they completed in the same 47s that it takes to do a
single build, so if we can get the parallel build to work, I think we will see
a good improvement.

However, given how small NSPR is and how little time it takes to build, I agree
with Wan-Teh that we shouldn't spend too much time on this. It would be
worthwhile to get this to work for NSS and the Mozilla browser (though I think it
already works for the latter).
The patch also broke the 64-bit Solaris build. See
http://bugzilla.mozilla.org/show_bug.cgi?id=241162 .
Marked the bug WONTFIX.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → WONTFIX
Target Milestone: Future → 4.6