Closed
Bug 106009
Opened 23 years ago
Closed 22 years ago
PAC instantiation hangs Regxpcom Solaris nightly build packaging process
Categories
(Core :: XPCOM, defect)
Tracking
()
RESOLVED
FIXED
mozilla1.0
People
(Reporter: nbidwell, Assigned: dougt)
References
Details
(Keywords: helpwanted, Whiteboard: Needs to land on branch)
Attachments
(7 files)
209.10 KB, text/plain
1.51 KB, text/plain
242.12 KB, application/gzip
3.32 KB, text/plain
4.05 KB, text/plain
1012 bytes, patch
1.60 KB, patch (dougt: review+, brendan: superreview+, jesup: approval+)
As I write this, the latest Solaris nightly build on ftp.mozilla.org is from 10/15/2001. That was a week ago. (Not to mention that that build has a rather broken mail client for me...) Is this intentional?
Comment 1•23 years ago
|
||
Confirming and changing product to mozilla.org. CCing leaf@mozilla.org in hopes that more info might be out there. Related to bug 105981 or 105988?
Status: UNCONFIRMED → NEW
Component: Build Config → FTP - Staging
Ever confirmed: true
Product: Browser → mozilla.org
Comment 2•23 years ago
|
||
Nope, not related to those bugs. There hasn't been a sol26 nightly build log since Oct 15, which is weird. It's still in the crontabs on granite & aesir is up. Running the nightly script by hand to see what it turns up.
Comment 3•23 years ago
|
||
the 8am builds fail because IC is messing with the network and cvs hangs. the 8pm builds fail because the cvs process from 8am is still running. linux finishes because it starts at 4am before they start breaking things.
Comment 4•23 years ago
|
||
Well, I'm posting this with Solaris build 2001102222, so a nightly build was made last night.
Comment 5•23 years ago
|
||
Marking fixed (for lack of IC_screwed_us resolution).
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 6•23 years ago
|
||
And now there's an even newer nightly build up, so it seems everything is working correctly. Thank you. Now back to my regularly scheduled testing. BTW, what/who is IC?
Status: RESOLVED → VERIFIED
Comment 7•23 years ago
|
||
Reopening because Solaris nightlies aren't showing up again. The last available build seems to be 2001110210.
Status: VERIFIED → REOPENED
Resolution: FIXED → ---
Comment 8•23 years ago
|
||
John, aesir needs to be resurrected after the network massacre so that solaris nightlies can live again.
Assignee: seawood → antitux
Status: REOPENED → NEW
Comment 9•23 years ago
|
||
In addition to Solaris builds, source code isn't getting put onto the ftp server.
Comment 10•23 years ago
|
||
source balls are on branch, sol26 builds are on aesir, both down right now. antitux is supposed to get all our unix systems up by COB Friday so source tarballs and sol26 builds should start showing up Saturday morning at the latest.
Updated•23 years ago
|
Status: NEW → ASSIGNED
Comment 11•23 years ago
|
||
Closing since Solaris nightlies and source are both back on ftp. Thanks!
Status: ASSIGNED → RESOLVED
Closed: 23 years ago → 23 years ago
Resolution: --- → FIXED
Comment 12•23 years ago
|
||
Re-opening because the Solaris build on ftp is 2001122110. Happy holidays!
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 13•23 years ago
|
||
someone broke the solaris build in nsPluginModule.cpp a while back. I don't know why the tinderboxen are green... could be someone hacked configure, it's an official-only problem, or a parallel build problem, or ???
Component: FTP - Staging → Build Config
Priority: -- → P3
Product: mozilla.org → Browser
Target Milestone: --- → mozilla0.9.8
Comment 14•23 years ago
|
||
btw - you shouldn't keep reopening the same bug. this is a completely different problem from the original reported bug so it makes things confusing. In the future, should the Solaris builds fail to appear on ftp.mozilla.org, it should be a new bug since it may or may not be the same problem.
Status: REOPENED → ASSIGNED
Comment 15•23 years ago
|
||
Tinderboxes are green because they use a) a more recent version of gcc (speedracer's 2.95.3 vs aesir's 2.7.2.1) or b) Forte (nebiros). Since we've dropped support for gcc 2.7.2.x, we should upgrade the compiler on aesir.
Comment 16•23 years ago
|
||
John, what is the plugin problem on Solaris? There is currently an open bug that is tracking issues on a variety of platforms: http://bugzilla.mozilla.org/show_bug.cgi?id=106806. If this is also a Solaris problem, then this bug should be updated, and we should probably also change the platform to ALL Unix, since Linux suffers from this as well (the .so.1 issue).
Comment 17•23 years ago
|
||
*** Bug 117712 has been marked as a duplicate of this bug. ***
Comment 18•23 years ago
|
||
reassigning to asasaki since antitux is busy with other work right now. Aki - Can you upgrade gcc on aesir to 2.95.3? You'll need to coordinate with lpham to make sure the build automation is looking in the right place to find the new compiler once it's in place. Thanks.
Comment 19•23 years ago
|
||
actually reassigning this time...
Assignee: antitux → asasaki
Status: ASSIGNED → NEW
Comment 20•23 years ago
|
||
Installed in /opt/gcc-2.95.3 which is softlinked to /opt/gcc (so you shouldn't have to change anything in the env). Old /opt/gcc moved to /opt/gcc-2.95.2 which can be removed once everything is working. There are a lot of old cltbld processes on aesir... should I kill those?
Status: NEW → ASSIGNED
Comment 21•23 years ago
|
||
Yes, please kill them. Look like they are the old builds. Thanks. Loan
Comment 22•23 years ago
|
||
There is now a nightly build for Jan 7 - so thanks! But, the tar file is only 3.5MB :( I've raised bug#118701 for this.
Comment 23•23 years ago
|
||
Hm, looks like it's *building* fine (able to run the package from aesir), but regxpcom is hanging, and it doesn't get past the packaging phase unless I kill that... Want to rerun it but I may hit the 8pm build...? Meanwhile, there's a new build up...
Comment 24•23 years ago
|
||
Again this is the case... build finishes, regxpcom hangs indefinitely after the "registering smime account manager extension" line. Once I kill regxpcom, the rest of the packaging and the push to the ftp site happens. Anyone have any idea why regxpcom might be hanging on aesir?
Comment 25•23 years ago
|
||
IIRC regxpcom is hanging in smime, right? Paste the regxpcom output into the bug, then check lxr or ask on #mozilla for who's been doing smime and regxpcom work and cc them here to get their input. This will probably need to be reassigned to an engineer to fix one or the other.
Comment 26•23 years ago
|
||
truss output showed that regxpcom was hanging in poll(). It appeared to occur after all of the components have been registered and the components.reg file had already been created.
Comment 27•23 years ago
|
||
27077 ./run-mozilla.sh ./regxpcom
27078 *** Registering -venkman handler.
27079 *** Registering -chat handler.
27080 *** Registering x-application-irc handler.
27081 *** Registering irc protocol handler.
27082 *** Registering smime account manager extension.
27083 Terminated
Comment 28•23 years ago
|
||
dougt, kaie -- what are your thoughts on regxpcom hanging after smime registration? TIA.
Assignee | ||
Comment 29•23 years ago
|
||
could you attach some stacktraces of the threads (probably just one thread) involved?
Comment 30•23 years ago
|
||
output from truss, sleeping in poll().
Comment 31•23 years ago
|
||
I think that output in comment 27 does not give a hint that it could have something to do with smime. From what I have seen on Unix optimized builds, that's always the exact output of the first run of a new build.
Comment 32•23 years ago
|
||
Probably true, but it doesn't get to the "terminated" bit until I kill the regxpcom process, which could be many hours after the smime line appears in the log...
Comment 33•23 years ago
|
||
Aki, to find out whether it is indeed a problem with the smime extension, or something else, could you please do the following?

As a first test, when the build has finished, just remove
mailnews/extensions/smime/build/libmsgsmime.so

Another test, you could add another debugging output line to that module. In mailnews/extensions/smime/src/smime-service.js locate

SMIMEModule.registerSelf =
function (compMgr, fileSpec, location, type)
{
  dump("*** Registering smime account manager extension.\n");
  ....
}

And add some more dump output lines. For example, replace that complete function with:

SMIMEModule.registerSelf =
function (compMgr, fileSpec, location, type)
{
  dump("*** Registering smime account manager extension.\n");
  compMgr = compMgr.QueryInterface(Components.interfaces.nsIComponentManagerObsolete);
  dump("*** smime 2.\n");
  compMgr.registerComponentWithType(SMIME_EXTENSION_SERVICE_CID,
                                    "SMIME Account Manager Extension Service",
                                    SMIME_EXTENSION_SERVICE_CONTRACTID,
                                    fileSpec, location, true, true, type);
  dump("*** smime 3.\n");
  catman = Components.classes["@mozilla.org/categorymanager;1"].getService(nsICategoryManager);
  dump("*** smime 4.\n");
  catman.addCategoryEntry("mailnews-accountmanager-extensions",
                          "smime account manager extension",
                          SMIME_EXTENSION_SERVICE_CONTRACTID, true, true);
  dump("*** smime 5.\n");
}

If you build again and start, if we see the line with "*** smime 5", I think it can't be the smime extension.
Comment 34•23 years ago
|
||
Ok, done. Looks like it's regxpcom =) dougt -- do you want ownership of this bug? Or do you have recommendations as to who would be best able to fix it? thanks.
Assignee | ||
Comment 36•23 years ago
|
||
Does someone with a Sun build want to look at this? pavlov, do you have a build that I could peek at?
Keywords: helpwanted
Target Milestone: mozilla0.9.8 → ---
Comment 38•23 years ago
|
||
Since Solaris builds are back and working, the summary should involve the problem that is left.
Summary: Where did the Sun Solaris nightly builds go? → Regxpcom hanging Solaris nightly build packaging process
Comment 39•23 years ago
|
||
There have been no new Solaris nightlies for 5 days now.... so I think they aren't working again (or whatever workaround was in place has stopped working).
Reporter | ||
Comment 40•23 years ago
|
||
I'm assuming that this is still the bug holding up nightly builds. If so, could someone please massage things by hand for a build newer than 1-15-2002? Thank you.
Comment 41•23 years ago
|
||
*** Bug 122813 has been marked as a duplicate of this bug. ***
Comment 42•23 years ago
|
||
My work firewall stops me pulling from CVS and I don't have the time to pull regular tarballs, so building it myself is out. I'm sure I'm not alone in being dependent on nightlies for my Mozilla testing, so I'm guessing there are probably quite a few Solaris issues going unnoticed until the nightlies come back, simply by virtue of there being fewer users out there running the latest codebase. The longer this goes on, the bigger a deal it gets. Any feedback at all on progress to a fix would be welcome.
Comment 43•23 years ago
|
||
I just found out that there is a new nightly build available in http://ftp.mozilla.org/pub/mozilla/nightly/latest/, with build date 2002013122. At first it complained about not being able to find run-mozilla.sh, so I just copied the file from the previous nightly build (2002011510) and now it runs beautifully :)
Comment 44•23 years ago
|
||
The missing run-mozilla.sh is bug#122942, now fixed. Thanks to whoever did the manual update of the Solaris nightly (or fixed the problem) - now I can play with the new Page Info stuff :)
Comment 45•23 years ago
|
||
Before anyone gets too happy, it was the missing run-mozilla.sh that caused the nightly to get delivered. Since run-mozilla.sh didn't exist, regxpcom could not run, and therefore could not hang.
Updated•23 years ago
|
Target Milestone: --- → Future
Comment 46•23 years ago
|
||
With all due respect, why is this bug being "futured"? Do the drivers not care about testing on Solaris? Yes, it's a minority platform, but so are SGI IRIX and HPUX, both of which have up-to-date nightlies.
Comment 47•23 years ago
|
||
Even if the root problem with Regxpcom isn't worth the effort to fix at this point, it seems like it should be possible to make a workaround that would allow nightlies to be distributed. It seems like an ugly hack to the build process, like automatically killing the Regxpcom process (or removing run-mozilla.sh, which allowed 2002013122 to be built), would work.
Comment 48•23 years ago
|
||
Or just turn off --enable-crypto on the Solaris nightlies to see if that fixes it... --disable-crypto turns off MOZ_PSM, which turns off BUILD_SMIME in mozilla/mailnews/extensions/Makefile.in. All you have to do is set BUILD_PSM="FALSE" in ns/build/unix/verification/seamonkey-build in the sol26 stanza... line: If this fixes your problem, then someone needs to debug --enable-crypto and smime in the sol26 stanza. But in the meantime there will be nightly builds.
Comment 49•23 years ago
|
||
> Or just turn off --enable-crypto on the solaris nightly's to see if
> that fixes it...
I thought Aki confirmed in his comment 34 that it is NOT the crypto component?
Comment 50•23 years ago
|
||
I am not saying it is a crypto issue... I am saying if we want to get rid of smime from the build... the quickest and easiest way to do that is to turn off PSM in the nightlies... Since no PSM, no smime... and then regxpcom CAN'T have an issue trying to load it. btw that was line 869 for doing so. Then whoever is the champion of sol26 should figure out what is going on... just like I do when the darn hpux nightlies mess up.
Comment 51•23 years ago
|
||
Whatever the issue is, whether it's crypto or something else, this still should NOT be futured. Sure, it doesn't have the visibility of the wintel or linux platforms, but one of the main drivers for adopting a solution like Mozilla is that it is truly cross-platform. Inhibiting testing on a major unix platform does not bode well for this continuing. I wonder how many solaris bugs will go unreported between now and the 0.9.9 release if the nightlies remain hosed? Do we really want to suddenly see them all show up at that point rather than have them reported and fixed along the way from nightly trunk builds?
Comment 52•23 years ago
|
||
I am not saying FUTURED... I am trying to get you a nightly build. Granted I am suggesting turning off crypto to get you that (so you can test everything else). If I am reading this correctly you guys (who care about solaris) haven't had a nightly build in like forever. I will shut up, I don't care... I don't care if solaris nightly builds work or not. I don't care if crypto is on or not. I was just trying to suggest a way to get nightlies going again and to narrow down the problem and not leave it to AKI who hasn't touched a solaris build in "like forever". un-ccing myself, do whatever you want.
Comment 53•23 years ago
|
||
Don't know if this is relevant, but nebiros SunOS/sparc 5.7 Clobber seems to have been orange for ages. Is this the same problem? Also, i386 Solaris 2.6 nightlies seem to be being built fine, it's just the sparc ones (if that helps at all).
Comment 54•23 years ago
|
||
Pav - if there is too much on your plate right now to work on this bug, is there anyone else that could take a stab at it in the meantime?
Comment 55•23 years ago
|
||
We are going to try my suggestion for turning off BUILD_PSM in the sol26 builds. We are only going to do this for this weekend only. Hopefully we will get nightlies (remember they won't have PSM or smime) and then on Mon we will turn it back on. This will help us narrow down the issue
Comment 56•23 years ago
|
||
Umm, I think you're barking up the wrong tree with the PSM issue (but feel free to prove me wrong). regxpcom is hanging at the end of its run. components.reg has already been written out correctly. Removing smime from the components dir does not fix the hanging problem (tested manually on the day I ran truss). Does anyone know what is being poll'ed?
Comment 57•23 years ago
|
||
er, i'm sorry. i'm not sure why this bug is assigned to me. -> cls
Assignee: pavlov → seawood
Target Milestone: Future → ---
Comment 58•23 years ago
|
||
So, this is weird. Regxpcom works fine in a standalone xpcom build on sheep. If I build all of Mozilla (except crypto), I see the hang but according to the truss log, it's not hanging in poll any longer. It appeared to be hanging while processing some of the uconv libs. I used the following build options: --enable-extensions=default,irc --without-system-nspr --without-system-zlib --without-system-jpeg --without-system-png --without-system-mng --disable-debug --enable-optimize --disable-tests
Comment 59•23 years ago
|
||
With a debug build, I'm seeing the same hang when building all of Mozilla minus PSM. The trace shows that the poll() is coming from necko.

(gdb) bt
#0 0xfee9990c in _poll () from /usr/lib/libc.so.1
#1 0xfef1b22c in poll () from /usr/lib/libthread.so.1
#2 0xff08847c in PR_Poll (pds=0xcd138, npds=1, timeout=3500000) at ../../../../../mozilla/nsprpub/pr/src/pthreads/ptio.c:3963
#3 0xfdabd924 in nsSocketTransportService::Run (this=0xad068) at ../../../../mozilla/netwerk/base/src/nsSocketTransportService.cpp:469
#4 0xff224ae8 in nsThread::Main (arg=0x81500) at ../../../mozilla/xpcom/threads/nsThread.cpp:120
#5 0xff08a600 in _pt_root (arg=0xa6308) at ../../../../../mozilla/nsprpub/pr/src/pthreads/ptthread.c:214
(gdb)

Stepping thru gdb shows that the AutoRegister() call returned without any errors (ret = 0). The hang occurs during XPCOM shutdown. Or more specifically, after stepping thru NS_XPCOMShutdown, it's hanging in nsTimerImpl::Shutdown(). It appears to spawn 2 more LWP threads when this occurs. One of those threads is the one shown in the stacktrace above. The extra threads are spawned when mThread->Join() is called from TimerThread::Shutdown().

Pavlov, back to you.
Comment 60•23 years ago
|
||
Could someone clarify which version of Solaris has the problem? I am able to reproduce the regxpcom hang on one of the Solaris 8 boxes (and after removal of components.reg the problem can be repeated), but it is not reproducible on several other Solaris 7/8/9 boxes (note: I am testing the *same* build shared over NFS). This leads me to the idea that the problem may be solved by installation of appropriate Solaris patches. Did anyone try to investigate this? I am not sure which patches are necessary to fix the problem, but I would recommend trying patch 106541 for Solaris 7 http://sunsolve.sun.com/pub-cgi/retrieve.pl?doc=fpatches%2F106541&zone_32=libc.so.1 (it contains a fix for bug 4207080, hang in poll, application does not get notified of data on stream head). For Solaris 8, patch 108991 may be useful. It also has libc.so fixes; it is one of the installed patches on a system that does not have the problem, and too old a version of this patch is installed on the system that has the problem.
Comment 61•23 years ago
|
||
I wouldn't be surprised if our build system was in need of some patching. Aki - can you check the patch status on the system, and make sure it's up to the latest and greatest patch cluster, as well as the patches mentioned above? cls - do you know of any reason we shouldn't upgrade, or any particular patches we should avoid?
Comment 62•23 years ago
|
||
I've heard that the very latest patch cluster from Sun introduces some instabilities. Pavlov and/or Roland would know specifically which one.
Comment 63•23 years ago
|
||
comment 59 makes no sense to me -- why would Join spawn threads? cc'ing wtc. /be
Comment 64•23 years ago
|
||
No clue what's going on. Both Solaris 2.7+latest patches and Solaris 2.8+latest patches (except Xsun patch 108652-47, we are still using rev -46) are working here...
Comment 65•23 years ago
|
||
this is solaris 2.6. downloading the latest recommended patches... which are from 2/5/02. i've had decent luck with the recommended patch bundles, so i'll install these and just keep an eye out for news on any bad patches.
Comment 66•23 years ago
|
||
Brendan asked:
> comment 59 makes no sense to me -- why would Join spawn threads?
> cc'ing wtc
I have no idea either. Sorry.
Comment 67•23 years ago
|
||
patch cluster installed, aesir rebooted. we'll see if that fixes the 8pm build.
Comment 68•23 years ago
|
||
regxpcom is still hanging and the recommended patch cluster had a libc fix, can't find anything else in the 2.6 patchreport about it.
Comment 69•23 years ago
|
||
16 nights and new nightly build... could someone at least manually push one?
Comment 70•23 years ago
|
||
Grr. I meant: 16 nights and *no* new nightly build. Too many chocolate cookies for me today.
Comment 71•23 years ago
|
||
Can't we set up another machine for creating Solaris 2.7 or 2.8 nightly tarballs built with Sun Workshop?
Comment 72•23 years ago
|
||
Comment on attachment 64581 [details]
output of `rm component.reg; truss ./regxpcom 2>&1 | tee > regxpcom.log`
Can someone provide a log from a hang with `rm component.reg; truss -u ::
./regxpcom 2>&1 | tee > regxpcom.log`, please ?
Comment 73•23 years ago
|
||
killed regxpcom, should be another package available on the site. there is no -u option available in our version of truss... did you want the same output, but more recent, or different output? also, I believe I accidentally added the ">" to the tee comment, which shouldn't be there... ignore it.
Comment 74•23 years ago
|
||
Thanks for the new nightly build. Any chance of putting "kill regxpcom" in a cron job?
Comment 75•23 years ago
|
||
no. this bug has to be fixed.
Comment 76•23 years ago
|
||
roland: even on a non-hanging node, truss -u results in a 500M+ log file. On a node with problems it never stops growing. Back to the idea about Solaris patches: I tried one more Solaris 2.8 system that also did not have the regxpcom hang. However, it has a strict subset of the patches installed on the node with problems :( Therefore either the problem is introduced by one of the additional patches or it is somewhere else in the environment.
Comment 77•23 years ago
|
||
There's a new Solaris build: 2002-02-25-21-trunk/mozilla-sparc-sun-solaris2.6.tar.gz but it doesn't start because of bug 127817. If 127817 is related to security code as suggested in 127817 comment #3, does it mean that it's security stuff that's stopping Solaris builds normally?
Comment 78•23 years ago
|
||
May be a red herring, but take a look at the gzipped truss output file I attached to bug 129567 - Is this related? If it looks similar, then maybe we can compare patch revs or something... see if we can find a patch that if applied causes the problem and can be backed out to make it go away?
Comment 79•23 years ago
|
||
Any possibility of manually getting another nightly Solaris build uploaded (or at least deleting the current broken one)? The most recent Solaris nightly is still the build from 20020225, which is broken due to bug 129749. We've had a couple of duplicates of that long since fixed bug because that build is the only Solaris nightly available.
Comment 80•23 years ago
|
||
we build with gcc (2.95.3).
Comment 81•23 years ago
|
||
I have compressed the file because it is rather large.
Comment 82•23 years ago
|
||
not sure if my attempt earlier in the week to get a new build up worked well or not (killed all the regxpcom procs and i think the various build procs interfered with each other), but there's one from today.
Comment 83•22 years ago
|
||
We are building RPMs on FreeBSD 4.3, RedHat 7.1 & 7.2 and Solaris 2.6, 7 & 8, and I have seen the problem with the hanging regxpcom many times on Solaris. I have been starting regxpcom through truss and strace and got huge logfiles, so if anyone is interested let me know.

Workaround: Our SPEC file (RPM) places this script in the '.../mozilla/dist/bin' directory. I tested it today when building 0.9.9: regxpcom ran from 7 to 42s on FreeBSD, RedHat and Solaris 2.6, and ended normally. On Solaris 7 & 8 it was killed after the timeout and ran about 3s the second time.

#!/app/cueshell/bin/cueshell
# This is a bash-alias
dist_bin=`dirname $0`
MOZILLA_FIVE_HOME=$dist_bin
LD_LIBRARY_PATH=$dist_bin:$LD_LIBRARY_PATH
export MOZILLA_FIVE_HOME LD_LIBRARY_PATH
case `uname -s` in
SunOS)
    echo "`date`: Starting regxpcom"
    ( $dist_bin/regxpcom; echo "`date`: regxpcom done.") &
    waiting=0
    while [ $waiting -lt 1800 ]; do
        if ps -p $! >/dev/null ; then
            waiting=`expr $waiting + 30`
            sleep 30
            echo "`date`: Waited $waiting seconds for regxpcom"
        else
            echo "`date`: Waiting done."
            waiting=1800
        fi
    done
    if ps -p $! >/dev/null ; then
        echo "`date`: Kills regxpcom "
        /usr/sbin/fuser -k $dist_bin/regxpcom
        echo "`date`: Restarting regxpcom"
        $dist_bin/regxpcom; echo "`date`: regxpcom done."
    fi
    ;;
*)
    echo "`date`: Starting regxpcom"
    $dist_bin/regxpcom; echo "`date`: regxpcom done."
    ;;
esac
$dist_bin/regchrome
touch $dist_bin/chrome/user-skins.rdf $dist_bin/chrome/user-locales.rdf
Comment 84•22 years ago
|
||
I've been troubleshooting this using the 20020315xx nightly, and I think I have some useful information.

First of all, if components/nsProxyAutoConfig.js is removed from an installed copy of mozilla, then regxpcom will run to completion and exit as it should. However, regxpcom isn't hanging while registering this component; it's hanging in the call to NS_ShutdownXPCOM() just before regxpcom exits. It appears that nsProxyAutoConfig.js causes an nsDNSService thread to be created, which in turn creates a TimerThread. Later at shutdown time, xpcom tries to kill the timer thread, but it isn't dying.

I'm going to attach a copy of the /usr/proc/bin/pstack output for a well-hung regxpcom instance. You'll note the following:

1) thread #1 is performing NS_ShutdownXPCOM() and is waiting for a _thrp_join() call to complete. This is actually a pthread_join() call in the source. I think the '6' in the _thrp_join() argument list means thread #6.
2) lwp #1/thread #6 is within a TimerThread::Run call, blissfully waiting for a call to pthread_cond_wait() to complete.
3) thread #5 is within a nsDNSService::Run call. According to truss, thread 5 was spawned by thread 4, which is inside an nsSocketTransportService::Run call.

I have trusses from running regxpcom with and without the proxy autoconfig component present. When it's not present, regxpcom never gets beyond four threads; #5 and #6 are never created. The trusses are quite large so I won't attach them.
Comment 85•22 years ago
|
||
Comment 86•22 years ago
|
||
my bet is that the problem is: 235 var PacMan = new nsProxyAutoConfig() ;
Assignee: pavlov → gagan
Component: Build Config → Networking
QA Contact: granrose → benc
Summary: Regxpcom hanging Solaris nightly build packaging process → PAC instantiation hangs Regxpcom Solaris nightly build packaging process
Keywords: helpwanted,
qawanted
I doubt it, since that will just call this nothing function: 55 function nsProxyAutoConfig() {}; Is it possible that some other component is causing network activity, which is in turn causing the proxyautoconfig stuff to get kicked off? If that's the case, we probably need regxpcom to do more mozilla-like things in its shutdown process. Cc:ing Jud, because embedders on Solaris might well run into this problem as well, if they don't do the shutdown perfectly.
Comment 88•22 years ago
|
||
FWIW: PAC download is triggered whenever the PAC preference is modified. see nsProtocolProxyService::PrefsChanged.
Comment 89•22 years ago
|
||
(wonders if the bug on nsDNSshutdown leaking, which he can't find, is related)
Assignee | ||
Comment 90•22 years ago
|
||
regxpcom and InitXPCOM do not create any event queue for the main thread. /me wonders if the timer or DNS threads require one to be present? Attaching a hack to test this theory.
Comment 91•22 years ago
|
||
A detailed truss suggests that there may be a race condition within TimerThread (xpcom/threads/TimerThread.cpp). TimerThread::Shutdown() is running before TimerThread::Run(). This is breaking the method that Shutdown() uses to tell Run() to exit. Shutdown() checks a condition variable and a flag:

  // notify the cond var so that Run() can return
  if (mCondVar && mWaiting)
    PR_NotifyCondVar(mCondVar);

but Run() hasn't been called yet, so the test fails. Shutdown() falls through and eventually calls nsThread::Join() to harvest the Run() thread. Some time later, the Run() thread starts executing, and eventually goes to sleep on PR_WaitCondVar(mCondVar). Deadlock.

I'm attaching a truss clip which illustrates the problem; the truss includes calls to libxpcom and libnspr4. I apologize for not including more data; these truss runs take a long time to complete and produce huge amounts of output. The one I'm excerpting is 41MB, for example.
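To make the ordering problem above concrete, here is a minimal, single-threaded model of it. It is only a sketch: it uses no real threads and no NSPR calls, and the names `TimerThreadModel`, `RunReachesWait`, `RunWakes`, and `Outcome` are hypothetical, invented for illustration. It assumes, as the comment describes, that Shutdown() notifies only when Run() is already waiting.

```cpp
#include <cassert>
#include <string>

// Hypothetical model of the flags described in the comment; not Mozilla code.
struct TimerThreadModel {
    bool waiting = false;       // models mWaiting: true only while Run() sleeps
    bool notified = false;      // true if a notify reached a sleeping Run()
    bool shutdownDone = false;  // true once Shutdown() has been called

    // Models Shutdown(): the notify is sent only if Run() is already waiting.
    void Shutdown() {
        if (waiting) notified = true;
        shutdownDone = true;
    }

    // Models Run() reaching its wait on the condition variable.
    void RunReachesWait() { waiting = true; }

    // Run() wakes only if a notify arrived while it was waiting.
    bool RunWakes() const { return waiting && notified; }
};

// Returns "exits" if Run() would wake up and return, or "deadlock" if it
// would sleep forever while Shutdown() blocks in Join().
std::string Outcome(bool shutdownFirst) {
    TimerThreadModel m;
    if (shutdownFirst) {
        m.Shutdown();        // the notify is skipped: nobody is waiting yet
        m.RunReachesWait();  // Run() then goes to sleep with no one left to wake it
    } else {
        m.RunReachesWait();
        m.Shutdown();        // the notify is delivered to the sleeping Run()
    }
    assert(m.shutdownDone);
    return m.RunWakes() ? "exits" : "deadlock";
}
```

Calling `Outcome(false)` models the ordering the original code expected, and `Outcome(true)` models the ordering seen in the truss, where the notification is lost.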
Comment 92•22 years ago
|
||
Assignee | ||
Comment 93•22 years ago
|
||
If this works, we should fix the problem much cleaner by having InitXPCOM startup the event queue directory and Shutdown clean it up. See 135531.
Comment 94•22 years ago
|
||
I tried adding "sleep(1)" to TimerThread::Shutdown() just before the timerthread lock is acquired. This has the desired effect; the Shutdown() thread gives up its timeslice, giving the OS time to schedule the Run() thread. By the time Shutdown() wakes up, the Run() thread is in the state that Shutdown() expects. But of course this is just a hack, not a proper solution.

TimerThread uses a flag "mProcessing" to indicate whether TimerThread::Run() should keep going or not, but the logic isn't quite right. The flag is initialized false. Run() sets it true on entry, then keeps looping until it sees the flag become false. Shutdown() sets the flag back to false when it wants Run() to return. But if Shutdown() runs before Run(), then Run() can't tell that Shutdown() has already been called and already written to the flag.

The attached patch replaces the mProcessing flag with an mShutdown flag. This flag is initialized to false. It's set to true in Shutdown(). Run() never writes to this flag, but it keeps looping as long as the flag is false.

With either the added sleep() call or the mShutdown patch, regxpcom no longer hangs shutting down xpcom. Instead, the last few lines that it prints are as follows:

*** Registering irc protocol handler.
nNCL: registering deferred (0)
nNCL: registering deferred (0)
Getting service on shutdown. Denied.
ContractID: @mozilla.org/js/xpc/ContextStack;1
IID: {a1339ae0-05c1-11d4-8f92-0010a4e73d9a}
###!!! ASSERTION: Component Manager being held past XPCOM shutdown.: 'cnt == 0', file nsXPComInit.cpp, line 582
###!!! Break: at file nsXPComInit.cpp, line 582

As far as I can tell, this is an unrelated problem. It may be bug 135330 rearing its head; the source distribution I'm using is from 4/3/2002.
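The mShutdown idea described in this comment can be sketched with standard C++ threading primitives. This is not the attached patch and not NSPR code: `SketchTimerThread`, `Start`, `RunReturned`, and `StartThenShutdownImmediately` are hypothetical names, and `std::mutex` / `std::condition_variable` stand in for the NSPR lock and condition variable. The point it tries to show is that a flag written only by Shutdown() and checked by Run() under the lock stays correct whichever of the two runs first.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a timer thread using a sticky shutdown flag.
class SketchTimerThread {
public:
    // Start the worker; Run() may begin executing at any later time.
    void Start() { mThread = std::thread([this] { Run(); }); }

    // Safe to call before Run() has been scheduled: the flag stays set.
    void Shutdown() {
        {
            std::lock_guard<std::mutex> lock(mLock);
            mShutdown = true;  // written only here, never by Run()
        }
        mCondVar.notify_one();
        if (mThread.joinable()) mThread.join();
        assert(!mThread.joinable());
    }

    // Valid to read after Shutdown() has joined the worker.
    bool RunReturned() const { return mRunReturned; }

private:
    void Run() {
        std::unique_lock<std::mutex> lock(mLock);
        // The flag is checked under the lock before every wait, so a
        // Shutdown() that already ran is observed and Run() never sleeps.
        while (!mShutdown) {
            mCondVar.wait(lock);
        }
        mRunReturned = true;
    }

    std::mutex mLock;
    std::condition_variable mCondVar;
    std::thread mThread;
    bool mShutdown = false;
    bool mRunReturned = false;
};

// Exercises the ordering from the bug: shut down right after starting,
// possibly before Run() has executed a single instruction.
bool StartThenShutdownImmediately() {
    SketchTimerThread t;
    t.Start();
    t.Shutdown();
    return t.RunReturned();
}
```

Whether Run() is scheduled before or after Shutdown() here depends on the OS, which is the same uncertainty the sleep(1) hack papered over; with the sticky flag both orderings terminate.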
Comment 95•22 years ago
|
||
The "Getting service on shutdown" message is bug 134728. In general, shutdown problems have *many* bugs, although searching for bugs filed by me is a good start.
Assignee | ||
Comment 97•22 years ago
|
||
Comment on attachment 78085 [details] [diff] [review] Proposed TimerThread.cpp, TimerThread.h patch r=dougt. Thanks for fixing this.
Attachment #78085 -
Flags: review+
Assignee | ||
Comment 98•22 years ago
|
||
brendan, can you super-review? You're on cvsblame for a lot of this code.
Target Milestone: --- → mozilla1.0
Comment 99•22 years ago
|
||
I've applied this patch to my mozilla 0.9.9 tree, and regxpcom no longer hangs on package creation.
Comment 100•22 years ago
|
||
Was it hanging without this patch? We've had solaris nightlies for the past few days (since the 12th apparently). So either the problem resolved itself or someone added a workaround to the build automation, which I don't see.
Comment 101•22 years ago
|
||
Yes, it would consistently hang on building 0.9.8 and 0.9.9 without this patch on solaris 7
Comment 102•22 years ago
|
||
Comment on attachment 78085 [details] [diff] [review] Proposed TimerThread.cpp, TimerThread.h patch sr=brendan@mozilla.org dougt: I took cvsblame in making fixes to pavlov's busted threading code, but I won't take all blame here. I do feel pretty foolish for taking this stuff so close to 1.0 (0.9.8, IIRC -- at least I made pav wait till then, instead of checking in on the last day of 0.9.7 as he wanted to). /be
Attachment #78085 -
Flags: superreview+
Updated•22 years ago
|
Keywords: mozilla1.0+,
nsbeta1
Assignee | ||
Comment 103•22 years ago
|
||
Checked into the trunk:

Checking in TimerThread.cpp;
/cvsroot/mozilla/xpcom/threads/TimerThread.cpp,v <-- TimerThread.cpp
new revision: 1.12; previous revision: 1.11
done
Checking in TimerThread.h;
/cvsroot/mozilla/xpcom/threads/TimerThread.h,v <-- TimerThread.h
new revision: 1.4; previous revision: 1.3
done
Status: NEW → ASSIGNED
Whiteboard: Needs to land on branch
Comment 104•22 years ago
|
||
Comment on attachment 78085 [details] [diff] [review] Proposed TimerThread.cpp, TimerThread.h patch a=rjesup@wgate.com for branch checkin
Attachment #78085 -
Flags: approval+
Assignee | ||
Comment 105•22 years ago
|
||
Checked into branch.

Checking in TimerThread.cpp;
/cvsroot/mozilla/xpcom/threads/TimerThread.cpp,v <-- TimerThread.cpp
new revision: 1.6.4.4; previous revision: 1.6.4.3
done
Checking in TimerThread.h;
/cvsroot/mozilla/xpcom/threads/TimerThread.h,v <-- TimerThread.h
new revision: 1.3.4.2; previous revision: 1.3.4.1
done

Kenneth, thank you for the patch.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago → 22 years ago
Resolution: --- → FIXED
Comment 106•22 years ago
|
||
adding fixed1.0.0 keyword (branch resolution). This bug has comments saying it was fixed on the 1.0 branch and a bonsai checkin comment that agrees. To verify the bug has been fixed on the 1.0 branch please replace the fixed1.0.0 keyword with verified1.0.0.
Keywords: fixed1.0.0
Comment 107•22 years ago
|
||
updating component and qa... From reading carefully, it seems like this goes to XPCOM Registry. Also, the summary seems out of date; is PAC really the root cause of this?
Component: Networking → XPCOM Registry
QA Contact: benc → dougt