Closed Bug 123423 Opened 21 years ago Closed 18 years ago
make -j4 randomly fails on NSPR
<I really, really, really, really hate the <ENTER> == form submit bug.> We have some fast tinderboxes that are randomly dying while trying to build libnspr4.so. It appears that we try to build a single object file twice at the same time, once from pr/src and once from one of the subdirs of pr/src/. http://tinderbox.mozilla.org/showlog.cgi?log=SeaMonkey/1012862100.14800.gz&fulltext=1 There are two ways to fix this:
1) Avoid building in the subdirs completely and build everything from pr/src.
2) Remove the subdir objs from OBJS and add dummy targets in the subdirs that pr/src/Makefile.in will depend upon before it links. Each dummy target will just build the OBJS for that directory.
I am seeing this bug on a Sun Blade 1000 with two 750 MHz processors and 2 GB of RAM, using Forte 7 EA and Solaris 9 EA; however, I can also confirm I was seeing this on Solaris 8 10/01. --dcran--
Status: NEW → ASSIGNED
Priority: -- → P2
Target Milestone: --- → 4.2
putting myself on the cc: list
*** Bug 127463 has been marked as a duplicate of this bug. ***
*** Bug 128301 has been marked as a duplicate of this bug. ***
Comment on attachment 73353 [details] [diff] [review] filter -jN out of $(MAKE) and force to -j1
*sigh* Yeah...sure, that'll work. I still want to make -jN work, as NSPR isn't *that* small and the time spent building NSPR is noticeable.
Attachment #73353 - Flags: review+
Oh, I agree. This patch is definitely not a good long-term solution, but is it good enough for now that it's worth checking in? I'm sensitive to sleestack's 'better a slower build than a false failure' mode of thinking. I've been running it in the 'brad' tinderbox since I posted it here and it seems to have done the job nicely. It's also running in the MozillaTest 'brad-fast' build, which is doing just a -j1, and it's also working fine.
Comment on attachment 73353 [details] [diff] [review] filter -jN out of $(MAKE) and force to -j1 a=asa (on behalf of drivers) for checkin to the 1.0 trunk
Attachment #73353 - Flags: approval+
The temp workaround has been checked into the client branch & the trunk.
Comment on attachment 73353 [details] [diff] [review] filter -jN out of $(MAKE) and force to -j1 obsoleting checked in patch so it doesn't look like an approved checkin waiting to land.
Attachment #73353 - Attachment is obsolete: true
Target Milestone: 4.2 → 4.2.1
Target Milestone: 4.2.1 → Future
Mass reassign to new default build assignee
Assignee: seawood → mozbugs-build
Status: ASSIGNED → NEW
Priority: P2 → --
This is an attempt to make a parallel build run for nspr. It seemed to work on my win32 box, but I only have a single processor. Is it possible that just specifying the subdirectorys as prerequisites of the OBJS will for the directories to be built first? This may bust something else, so please test. Kevin
(In reply to comment #13)
> specifying the subdirectorys as prerequisites of the OBJS will for the
> directories to be built first? This may bust something else, so please

That's 'subdirectories' and 'OBJS will force the'. Sorry for the poor typing skills.
Kevin
Kevin, Your patch does not apply to the current NSPR tip. It's a good idea to write the patches against the latest version. I was able to hand-modify your patch and apply it. The good news is that it doesn't seem to break my NSPR build on Solaris, running with -j2, on my dual CPU Blade 2000. The bad news is that it doesn't seem to truly do a concurrent build. I benchmarked the entire NSPR build and it came to the same 65 seconds with or without -j2. top showed that for the most part the CPU usage was identical in both cases, mostly about 50% idle, which would confirm only one CPU is getting used.
(In reply to comment #15)
> Kevin,
>
> Your patch does not apply to the current NSPR tip. It's a good idea to write
> the patches against the latest version.

D'oh. I just used my build tree for thunderbird and forgot that it pulls a branch of NSPR.

> The good news is that it doesn't seem to break my NSPR build on
> Solaris, running with -j2, on my dual CPU Blade 2000.

That's good to hear.

> The bad news is that it doesn't seem to truly do concurrent build. I
> benchmarked the entire NSPR build at it came to the same 65 seconds with
> or without -j2 . top showed that for the most part the CPU usage was
> identical in both cases, mostly about 50% idle, which would confirm only
> one CPU is getting used.

Hmm. That doesn't make much sense. Can you try with something like -j4 and see if that makes a difference?
Kevin, It just doesn't work. And actually I went back to older code without your patch, and built NSPR with -j4 without any problem. This would seem to contradict the other comments in this bug report. Which Makefile are you building when you see a failure, and with what command? I build from mozilla/dist/SunOS5.9_DBG.OBJ, and use "gmake -j4" there.
> It just doesn't work. And actually I went back to older code without your patch,
> and built NSPR with -j4 without nany problem. This would seem to contradict the
> other comments in this bug reports .

Did you back out wtc's patch, which forces make to use -j1, when you did your testing? When I remove that patch, I still see the build problem (usually when creating prlink.o).
Same patch against the tip
Attachment #146109 - Attachment is obsolete: true
Julien: As cls pointed out, if you backed out the patch, then you are back to the current workaround, which is to force a non-parallel build (that is the line in the top-level Makefile.in that is removed in the patch). When I used this patch on my win32 machine, using make -j4, I could see 4 jobs get sent off to compile. Now, I don't have multiple processors so I can't verify that multiple CPUs will be used, but this patch does in fact spawn multiple make processes and there are no errors in building (on either win32 or Solaris, per your comment above). Maybe there is a problem with gmake and using multiple CPUs? Kevin
Chris,

Yes, it was the case that I hadn't disabled the workaround of -j1 when I backed out Kevin's fix. I do see the build failure with -j4 on my OS/2 box when I do that.

Kevin,

Thanks for the patch update against the tip. I applied it to my gcc OS/2 NSPR tree at home. I'm running on a dual Athlon MP 2800+. The patch works, and both CPUs are pegged at 100% when running gmake -j4. In fact, on that machine the NSPR build gets cut from 47 seconds to 19 seconds.

Now, more than a 2:1 improvement doesn't sound right, so I disabled the second CPU on the OS/2 machine and reran the build: gmake takes 47.5s, basically the same amount as with two CPUs. The one CPU was pegged the whole time. gmake -j4 takes 31s. The one CPU was pegged the whole time.

This means even on my single-CPU machine, the build time was significantly cut by running gmake -j4, from 31s to 19s, or 39%. All this without any hyperthreading, since the Athlon MP CPU doesn't have that feature (and OS/2 SMP doesn't support the Intel virtual CPUs anyway, only true physical CPUs). I would be interested to find out if this large build time improvement with -j4 is seen on other single CPU machines and platforms too.

When I get to the office, I will try again with your latest patch on my Sun Blade 2000. Perhaps I missed something and applied it incorrectly yesterday when I converted it from the branch and applied it to my tip tree.
Just realized I messed up the numbers. Ignore the 39%. On single CPU, gmake -j4 cut the build time from 47.5s to 31s, or by about 35%. On dual CPU, gmake -j4 cut the build time from 47s to 19s, or by about 60%.
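For anyone re-checking the corrected figures in the comment above, here is a minimal sketch of the arithmetic. The helper name percent_reduction is made up for illustration; the timings are the ones quoted in this bug.

```python
def percent_reduction(before_s: float, after_s: float) -> float:
    """Return how much shorter after_s is than before_s, as a percentage."""
    return (before_s - after_s) / before_s * 100.0


# Timings reported in this bug (seconds): plain gmake vs. gmake -j4.
single_cpu = percent_reduction(47.5, 31.0)
dual_cpu = percent_reduction(47.0, 19.0)

print(f"single CPU: {single_cpu:.1f}% shorter")
print(f"dual CPU:   {dual_cpu:.1f}% shorter")
```

Both values round to the "about 35%" and "about 60%" quoted above.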
This patch doesn't work. As this build log file shows, many files are compiled twice. For example, search for "prfdcach.c". I appreciate your help with this bug. I'd like any new patch to be accompanied by an explanation of why parallel make breaks in the mozilla/nsprpub/pr/src directory and how the patch solves that problem. Without the explanation, I won't be convinced that the patch is correct. I think this bug is not worth fixing, but I agree that this bug is a rather interesting challenge. This is why I haven't marked it WONTFIX.
Comment on attachment 146179 [details] [diff] [review] Same patch against the tip
Here is the sequence to demonstrate another problem with this patch.
1. Apply the patch to NSPR and do "gmake" in the nsprpub directory.
2. cd pr/src/io
3. touch dummy
4. cd ../../..
5. gmake
See all the files get recompiled. In step 2, you can use any subdirectory of pr/src. The creation of the new file "dummy" changes the timestamp on the subdirectory, which causes $(OBJS) to be rebuilt.
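The timestamp effect behind that sequence can be seen without NSPR or make at all. The following is only a sketch of the filesystem behavior, assuming a POSIX-like filesystem; the directory name "io" and the function name are placeholders, and the directory is backdated by an hour so the comparison does not depend on timestamp granularity.

```python
import os
import tempfile
import time


def dir_mtime_changes_on_create() -> bool:
    """Create a file inside a directory and report whether the
    directory's own modification time moved forward."""
    with tempfile.TemporaryDirectory() as root:
        subdir = os.path.join(root, "io")  # stands in for pr/src/io
        os.mkdir(subdir)

        # Backdate the directory so the check is not sensitive to
        # coarse filesystem timestamps.
        an_hour_ago = time.time() - 3600
        os.utime(subdir, (an_hour_ago, an_hour_ago))
        before = os.stat(subdir).st_mtime

        # Equivalent of "touch dummy" in the subdirectory.
        with open(os.path.join(subdir, "dummy"), "w"):
            pass

        after = os.stat(subdir).st_mtime
        return after > before


print(dir_mtime_changes_on_create())
```

Since make compares a target's timestamp against each of its prerequisites, a directory listed as an ordinary prerequisite of $(OBJS) looks newer than every object as soon as any file is added to it.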
Attachment #146179 - Flags: review-
FYI, I have confirmed on my Sun that the patch doesn't work. The busiest the dual-CPU machine gets is 52% CPU usage, i.e. about one full CPU is used. That might have something to do with the old gmake 3.76.1, which is the one we have in /usr/dist here.
Actually, I double-checked it on OS/2. I thought the -j4 build had completed, because the final page of the build before returning to the shell showed no error, and everything else above scrolled off the screen. But today I noticed there were no NSPR DLLs produced. The build was in fact cut short. I then clobbered, rebuilt, and redirected stderr and stdout to files, and saw that in fact there were problems. No wonder it was so much quicker :-( On the other hand, I ran two NSPR builds from different source trees in parallel on my SMP machine, and they completed in the same 47s it takes to do a single build, so if we can get the parallel build to work, I think we will see a good improvement. However, given how small NSPR is and how little time it takes to build, I agree with Wan-Teh that we shouldn't spend too much time on this. It would be worthwhile to get this to work for NSS and the Mozilla browser (though I think it already works for the latter).
The patch also broke the 64-bit Solaris build. See http://bugzilla.mozilla.org/show_bug.cgi?id=241162.
Marked the bug WONTFIX.
Status: NEW → RESOLVED
Closed: 18 years ago
Resolution: --- → WONTFIX
Target Milestone: Future → 4.6