Closed
Bug 106009
Opened 23 years ago
Closed 23 years ago
PAC instantiation hangs Regxpcom Solaris nightly build packaging process
Categories
(Core :: XPCOM, defect)
Tracking
()
RESOLVED
FIXED
mozilla1.0
People
(Reporter: nbidwell, Assigned: dougt)
References
Details
(Keywords: helpwanted, Whiteboard: Needs to land on branch)
Attachments
(7 files)
209.10 KB, text/plain
1.51 KB, text/plain
242.12 KB, application/gzip
3.32 KB, text/plain
4.05 KB, text/plain
1012 bytes, patch
1.60 KB, patch (dougt: review+, brendan: superreview+, jesup: approval+)
As I write this, the latest Solaris nightly build on ftp.mozilla.org is from
10/15/2001. That was a week ago. (Not to mention that that build has a rather
broken mail client for me...) Is this intentional?
Comment 1•23 years ago
|
||
Confirming and changing product to mozilla.org.
CCing leaf@mozilla.org in hopes that more info might be out there.
Related to bug 105981 or 105988?
Status: UNCONFIRMED → NEW
Component: Build Config → FTP - Staging
Ever confirmed: true
Product: Browser → mozilla.org
Comment 2•23 years ago
|
||
Nope, not related to those bugs. There hasn't been a sol26 nightly build log
since Oct 15, which is weird. It's still in the crontabs on granite & aesir is
up. Running the nightly script by hand to see what it turns up.
Comment 3•23 years ago
|
||
the 8am builds fail because IC is messing with the network and cvs hangs. the
8pm builds fail because the cvs process from 8am is still running. linux
finishes because it starts at 4am before they start breaking things.
Comment 4•23 years ago
|
||
Well, I'm posting this with Solaris build 2001102222, so a nightly build was
made last night.
Comment 5•23 years ago
|
||
Marking fixed (for lack of IC_screwed_us resolution).
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 6•23 years ago
|
||
And now there's an even newer nightly build up, so it seems everything is
working correctly. Thank you. Now back to my regularly scheduled testing.
BTW, what/who is IC?
Status: RESOLVED → VERIFIED
Comment 7•23 years ago
|
||
Reopening because Solaris nightlies aren't showing up again. The last available
build seems to be 2001110210.
Status: VERIFIED → REOPENED
Resolution: FIXED → ---
Comment 8•23 years ago
|
||
John, aesir needs to be resurrected after the network massacre so that solaris
nightlies can live again.
Assignee: seawood → antitux
Status: REOPENED → NEW
Comment 9•23 years ago
|
||
In addition to Solaris builds, source code isn't getting put onto the ftp server.
Comment 10•23 years ago
|
||
source balls are on branch, sol26 builds are on aesir, both down right now.
antitux is supposed to get all our unix systems up by COB Friday so source
tarballs and sol26 builds should start showing up Saturday morning at the latest.
Updated•23 years ago
|
Status: NEW → ASSIGNED
Comment 11•23 years ago
|
||
Closing since Solaris nightlies and source are both back on ftp. Thanks!
Status: ASSIGNED → RESOLVED
Closed: 23 years ago → 23 years ago
Resolution: --- → FIXED
Comment 12•23 years ago
|
||
Re-opening because the Solaris build on ftp is 2001122110.
Happy holidays!
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 13•23 years ago
|
||
someone broke the solaris build in nsPluginModule.cpp a while back. I don't
know why the tinderboxen are green... could be someone hacked configure, it's an
official-only problem, or a parallel build problem, or ???
Component: FTP - Staging → Build Config
Priority: -- → P3
Product: mozilla.org → Browser
Target Milestone: --- → mozilla0.9.8
Comment 14•23 years ago
|
||
btw - you shouldn't keep reopening the same bug. this is a completely different
problem from the original reported bug so it makes things confusing. In the
future, should the Solaris builds fail to appear on ftp.mozilla.org, it should
be a new bug since it may or may not be the same problem.
Status: REOPENED → ASSIGNED
Comment 15•23 years ago
|
||
Tinderboxes are green because they use a) a more recent version of gcc
(speedracer's 2.95.3 vs aesir's 2.7.2.1) or b) Forte (nebiros). Since we've
dropped support for gcc 2.7.2.x, we should upgrade the compiler on aesir.
Comment 16•23 years ago
|
||
John what is the plugin problem on solaris?
There is currently an open bug that is tracking
issues on a variety of platforms
http://bugzilla.mozilla.org/show_bug.cgi?id=106806
if this is also a solaris problem, then this bug
should be updated and we should probably also make
the platform to ALL unix, since linux suffers from
this as well (the .so.1 issue)
Comment 17•23 years ago
|
||
*** Bug 117712 has been marked as a duplicate of this bug. ***
Comment 18•23 years ago
|
||
reassigning to asasaki since antitux is busy with other work right now.
Aki - Can you upgrade gcc on aesir to 2.95.3? You'll need to coordinate with
lpham to make sure the build automation is looking in the right place to find
the new compiler once it's in place. Thanks.
Comment 19•23 years ago
|
||
actually reassigning this time...
Assignee: antitux → asasaki
Status: ASSIGNED → NEW
Comment 20•23 years ago
|
||
Installed in /opt/gcc-2.95.3 which is softlinked to /opt/gcc (so you shouldn't
have to change anything in the env). Old /opt/gcc moved to /opt/gcc-2.95.2
which can be removed once everything is working.
There are a lot of old cltbld processes on aesir... should I kill those?
Status: NEW → ASSIGNED
Comment 21•23 years ago
|
||
Yes, please kill them. Look like they are the old builds. Thanks. Loan
Comment 22•23 years ago
|
||
There is now a nightly build for Jan 7 - so thanks!
But, the tar file is only 3.5MB :(
I've raised bug#118701 for this.
Comment 23•23 years ago
|
||
Hm, looks like it's *building* fine (able to run the package from aesir), but
regxpcom is hanging, and it doesn't get past the packaging phase unless I kill
that... Want to rerun it but I may hit the 8pm build...? Meanwhile, there's a
new build up...
Comment 24•23 years ago
|
||
Again this is the case... build finishes, regxpcom hangs indefinitely after the
"registering smime account manager extension" line. Once I kill regxpcom, the
rest of the packaging and the push to the ftp site happens.
Anyone have any idea why regxpcom might be hanging on aesir?
Comment 25•23 years ago
|
||
IIRC regxpcom is hanging in smime, right? Paste the regxpcom output into the
bug, then check lxr or ask on #mozilla for who's been doing smime and regxpcom
work and cc them here to get their input. This will probably need to be
reassigned to an engineer to fix one or the other.
Comment 26•23 years ago
|
||
truss output showed that regxpcom was hanging in poll(). It appeared to occur
after all of the components have been registered and the components.reg file had
already been created.
Comment 27•23 years ago
|
||
27077 ./run-mozilla.sh ./regxpcom
27078 *** Registering -venkman handler.
27079 *** Registering -chat handler.
27080 *** Registering x-application-irc handler.
27081 *** Registering irc protocol handler.
27082 *** Registering smime account manager extension.
27083 Terminated
Comment 28•23 years ago
|
||
dougt, kaie -- what are your thoughts on regxpcom hanging after smime
registration? TIA.
Assignee | ||
Comment 29•23 years ago
|
||
could you attach some stacktraces of the threads (probably just one thread)
involved?
Comment 30•23 years ago
|
||
output from truss, sleeping in poll().
Comment 31•23 years ago
|
||
I think that output in comment 27 does not give a hint that it could have
something to do with smime. From what I have seen on Unix optimized builds,
that's always the exact output of the first run of a new build.
Comment 32•23 years ago
|
||
Probably true, but it doesn't get to the "terminated" bit until I kill the
regxpcom process, which could be many hours after the smime line appears in the
log...
Comment 33•23 years ago
|
||
Aki, to find out whether it is indeed a problem with the smime extension, or
something else, could you please do the following?
As a first test, when the build has finished, just remove
mailnews/extensions/smime/build/libmsgsmime.so
Another test, you could add another debugging output line to that module.
In
mailnews/extensions/smime/src/smime-service.js
locate
SMIMEModule.registerSelf =
function (compMgr, fileSpec, location, type)
{
  dump("*** Registering smime account manager extension.\n");
  ....
}
And add some more dump output lines. For example, replace that complete function
with:
SMIMEModule.registerSelf =
function (compMgr, fileSpec, location, type)
{
  dump("*** Registering smime account manager extension.\n");
  compMgr =
    compMgr.QueryInterface(Components.interfaces.nsIComponentManagerObsolete);
  dump("*** smime 2.\n");
  compMgr.registerComponentWithType(SMIME_EXTENSION_SERVICE_CID,
                                    "SMIME Account Manager Extension Service",
                                    SMIME_EXTENSION_SERVICE_CONTRACTID, fileSpec,
                                    location, true, true, type);
  dump("*** smime 3.\n");
  catman =
    Components.classes["@mozilla.org/categorymanager;1"].getService(nsICategoryManager);
  dump("*** smime 4.\n");
  catman.addCategoryEntry("mailnews-accountmanager-extensions",
                          "smime account manager extension",
                          SMIME_EXTENSION_SERVICE_CONTRACTID, true, true);
  dump("*** smime 5.\n");
}
If you build again and start, if we see the line with "*** smime 5", I think it
can't be the smime extension.
Comment 34•23 years ago
|
||
Ok, done. Looks like it's regxpcom =)
dougt -- do you want ownership of this bug? Or do you have recommendations as
to who would be best able to fix it?
thanks.
Assignee | ||
Comment 36•23 years ago
|
||
Does someone with a Sun build want to look at this? pavlov, do you have a build
that I could peek at?
Keywords: helpwanted
Target Milestone: mozilla0.9.8 → ---
Comment 38•23 years ago
|
||
Since Solaris builds are back and working, the summary should involve the
problem that is left.
Summary: Where did the Sun Solaris nightly builds go? → Regxpcom hanging Solaris nightly build packaging process
Comment 39•23 years ago
|
||
There have been no new Solaris nightlies for 5 days now....
so I think they aren't working again (or whatever workaround was
in place has stopped working).
Reporter | ||
Comment 40•23 years ago
|
||
I'm assuming that this is still the bug holding up nightly builds. If so, could
someone please massage things by hand for a build newer than 1-15-2002? Thank you.
Comment 41•23 years ago
|
||
*** Bug 122813 has been marked as a duplicate of this bug. ***
Comment 42•23 years ago
|
||
My work firewall stops me pulling from CVS and I don't have the time to pull
regular tarballs, so building it myself is out.
I'm sure I'm not alone in being dependent on nightlies for my Mozilla testing, so
I'm guessing there are probably quite a few Solaris issues going unnoticed until
the nightlies come back simply by virtue of there being fewer users out there
running the latest codebase.
The longer this goes on the bigger a deal it gets. Any feedback at all on
progress to a fix would be welcome.
Comment 43•23 years ago
|
||
I just found out that there is a new nightly build available in
http://ftp.mozilla.org/pub/mozilla/nightly/latest/, with build date 2002013122.
At first it complained about not being able to find run-mozilla.sh, so I just
copied the file from the previous nightly build (2002011510) and now it runs
beautifully :)
Comment 44•23 years ago
|
||
The missing run-mozilla.sh is bug#122942, now fixed.
Thanks to whoever did the manual update of the Solaris nightly (or fixed the
problem) - now I can play with the new Page Info stuff :)
Comment 45•23 years ago
|
||
Before anyone gets too happy, it was the missing run-mozilla.sh that caused the
nightly to get delivered. Since run-mozilla.sh didn't exist, regxpcom could not
run and then hang.
Updated•23 years ago
|
Target Milestone: --- → Future
Comment 46•23 years ago
|
||
With all due respect, why is this bug being "futured"? Do the drivers not care
about testing on Solaris? Yes, it's a minority platform, but so are SGI IRIX
and HPUX, both of which have up-to-date nightlies.
Comment 47•23 years ago
|
||
Even if the root problem with Regxpcom isn't worth the effort to fix at this
point, it seems like it should be possible to make a work around that would
allow nightlies to be distributed. It seems like an ugly hack to the build
process like automatically killing the Regxpcom process (or removing
run-mozilla.sh which allowed 2002013122 to be built) would work.
Comment 48•23 years ago
|
||
Or just turn off --enable-crypto on the solaris nightly's to see if
that fixes it...
--disable-crypto turns off MOZ_PSM which turns off BUILD_SMIME
in mozilla/mailnews/extensions/Makefile.in
All you have to do is set BUILD_PSM="FALSE" in
ns/build/unix/verification/seamonkey-build
in the sol26 stanza... line:
if this fixes your problem, then someone needs to debug
--enable-crypto and smime in the sol26 stanza.
But in the meantime there will be nightly builds.
Comment 49•23 years ago
|
||
> Or just turn off --enable-crypto on the solaris nightly's to see if
> that fixes it...
I thought Aki confirmed in his comment 34 that it is NOT the crypto component?
Comment 50•23 years ago
|
||
I am not saying it is a crypto issue...
I am saying if we want to get rid of smime from
the build... the quickest and easiest way to do
that is to turn off PSM in the nightlies...
Since no PSM, no smime... and then regxpcom
CAN'T have an issue trying to load it.
btw that was line 869 for doing so.
Then whoever is the champion of sol26 should
figure out what is going on... just like I
do when the darn hpux nightlies mess up.
Comment 51•23 years ago
|
||
Whatever the issue is, whether it's crypto or something else, this still should
NOT be futured. Sure, it doesn't have the visibility of the wintel or linux
platforms but one of the main drivers for adopting a solution like Mozilla is
that it is truly cross-platform. Inhibiting testing on a major unix platform
does not bode well for this continuing. I wonder how many solaris bugs will go
unreported between now and the 0.9.9 release if the nightlies remain hosed? Do
we really want to suddenly see them all show up at that point rather than have
them reported and fixed along the way from nightly trunk builds?
Comment 52•23 years ago
|
||
I am not saying FUTURED... I am trying to get you
a nightly build. Granted I am suggesting turning
off crypto to get you that (so you can test everything
else). If I am reading this correctly you guys
(who care about solaris) haven't had a nightly
build in like forever.
I will shut up, I don't care...
I don't care if solaris nightly builds work or not.
I don't care if crypto is on or not.
I was just trying to suggest a way to get
nightlies going again and to narrow down the
problem and not leave it to AKI who hasn't touched
a solaris build in "like forever".
un-ccing myself, do whatever you want.
Comment 53•23 years ago
|
||
Don't know if this is relevant, but nebiros SunOS/sparc 5.7 Clobber seems to have been orange for ages. Is this the same problem?
Also, i386 Solaris 2.6 nightlies seem to be being built fine, it's just the sparc ones (if that helps at all).
Comment 54•23 years ago
|
||
Pav - if there is too much on your plate right now to work on this bug, is there
anyone else that could take a stab at it in the meantime?
Comment 55•23 years ago
|
||
We are going to try my suggestion for turning off BUILD_PSM
in the sol26 builds. We are only going to do this for
this weekend only. Hopefully we will get nightlies (remember
they won't have PSM or smime) and then on Mon we will turn
it back on. This will help us narrow down the issue
Comment 56•23 years ago
|
||
Umm, I think you're barking up the wrong tree with the PSM issue (but feel free
to prove me wrong). regxpcom is hanging at the end of its run. components.reg
has already been written out correctly. Removing smime from the components dir
does not fix the hanging problem (tested manually on the day I ran truss).
Does anyone know what is being poll'ed?
Comment 57•23 years ago
|
||
er, i'm sorry. i'm not sure why this bug is assigned to me.
-> cls
Assignee: pavlov → seawood
Target Milestone: Future → ---
Comment 58•23 years ago
|
||
So, this is weird. Regxpcom works fine in a standalone xpcom build on sheep.
If I build all of Mozilla (except crypto), I see the hang but according to the
truss log, it's not hanging in poll any longer. It appeared to be hanging while
processing some of the uconv libs.
I used the following build options:
--enable-extensions=default,irc
--without-system-nspr
--without-system-zlib
--without-system-jpeg
--without-system-png
--without-system-mng
--disable-debug
--enable-optimize
--disable-tests
Comment 59•23 years ago
|
||
With a debug build, I'm seeing the same hang when building all of Mozilla minus
PSM. The trace shows that the poll() is coming from necko.
(gdb) bt
#0 0xfee9990c in _poll () from /usr/lib/libc.so.1
#1 0xfef1b22c in poll () from /usr/lib/libthread.so.1
#2 0xff08847c in PR_Poll (pds=0xcd138, npds=1, timeout=3500000)
at ../../../../../mozilla/nsprpub/pr/src/pthreads/ptio.c:3963
#3 0xfdabd924 in nsSocketTransportService::Run (this=0xad068)
at ../../../../mozilla/netwerk/base/src/nsSocketTransportService.cpp:469
#4 0xff224ae8 in nsThread::Main (arg=0x81500)
at ../../../mozilla/xpcom/threads/nsThread.cpp:120
#5 0xff08a600 in _pt_root (arg=0xa6308)
at ../../../../../mozilla/nsprpub/pr/src/pthreads/ptthread.c:214
(gdb)
Stepping thru gdb shows that the AutoRegister() call returned without any errors
(ret = 0). The hang occurs during XPCOM shutdown. Or more specifically, after
stepping thru NS_XPCOMShutdown, it's hanging in nsTimerImpl::Shutdown(). It
appears to spawn 2 more LWP threads when this occurs. One of those threads is
the one shown in the stacktrace above. The extra threads are spawned when
mThread->Join() is called from TimerThread::Shutdown() .
Pavlov, back to you.
Comment 60•23 years ago
|
||
Could someone clarify which version of Solaris has the problem?
I am able to reproduce the regxpcom hang on one of the Solaris 8 boxes
(and after removal of components.reg the problem can be repeated),
but it is not reproducible on several other Solaris 7/8/9 boxes
(note: I am testing the *same* build shared over NFS).
This leads me to the idea that the problem may be solved by installing the appropriate
Solaris patches. Did anyone try to investigate this?
I am not sure which patches are necessary to fix the problem,
but I would recommend trying patch 106541 for Solaris 7
http://sunsolve.sun.com/pub-cgi/retrieve.pl?doc=fpatches%2F106541&zone_32=libc.so.1
(it contains the fix for bug
4207080 hang in poll, application does not get notified of data on stream head)
For Solaris 8, patch 108991 may be useful.
It also has libc.so fixes, and it is one of the patches installed on the
system that does not have the problem, while too old a version of this patch is
installed on the system that has the problem.
Comment 61•23 years ago
|
||
I wouldn't be surprised if our build system was in need of some patching.
Aki - can you check the patch status on the system, and make sure it's up the
latest and greatest patch cluster, as well as the patches mentioned above?
cls - do you know of any reason we shouldn't upgrade, or any particular patches
we should avoid?
Comment 62•23 years ago
|
||
I've heard that the very latest patch cluster from Sun introduces some
instabilities. Pavlov and/or Roland would know specifically which one.
Comment 63•23 years ago
|
||
comment 59 makes no sense to me -- why would Join spawn threads? cc'ing wtc.
/be
Comment 64•23 years ago
|
||
No clue what's going on.
Both Solaris 2.7+latest patches and Solaris 2.8+latest patches (except Xsun
patch 108652-47, we are still using rev -46) are working here...
Comment 65•23 years ago
|
||
this is solaris 2.6. downloading the latest recommended patches... which are
from 2/5/02. i've had decent luck with the recommended patch bundles, so i'll
install these and just keep an eye out for news on any bad patches.
Comment 66•23 years ago
|
||
Brendan asked:
> comment 59 makes no sense to me -- why would Join spawn threads?
> cc'ing wtc
I have no idea either. Sorry.
Comment 67•23 years ago
|
||
patch cluster installed, aesir rebooted. we'll see if that fixes the 8pm build.
Comment 68•23 years ago
|
||
regxpcom is still hanging and the recommended patch cluster had a libc fix,
can't find anything else in the 2.6 patchreport about it.
Comment 69•23 years ago
|
||
16 nights and new nightly build... could someone at least manually push one?
Comment 70•23 years ago
|
||
Grr. I meant: 16 nights and *no* new nightly build. Too many chocolate cookies
for me today.
Comment 71•23 years ago
|
||
Can't we set up another machine for creating Solaris 2.7 or 2.8 nightly tarballs
built with Sun Workshop?
Comment 72•23 years ago
|
||
Comment on attachment 64581 [details]
output of `rm component.reg; truss ./regxpcom 2>&1 | tee > regxpcom.log`
Can someone provide a log from a hang with `rm component.reg; truss -u ::
./regxpcom 2>&1 | tee > regxpcom.log`, please?
Comment 73•23 years ago
|
||
killed regxpcom, should be another package available on the site.
there is no -u option available in our version of truss... did you want the same
output, but more recent, or different output?
also, I believe I accidentally added the ">" to the tee comment, which shouldn't
be there... ignore it.
Comment 74•23 years ago
|
||
Thanks for the new nightly build. Any chance of putting "kill regxpcom" in a
cron job?
Comment 75•23 years ago
|
||
no. this bug has to be fixed.
Comment 76•23 years ago
|
||
roland: even on a non-hanging node, truss -u results in a 500M+ log file.
On the node with problems it never stops growing.
Back to the idea about Solaris patches - I tried one more Solaris 2.8 system
that also did not have the regxpcom hang. However, it has a
strict subset of the patches installed on the node with problems :(
Therefore either the problem is introduced by one of the additional patches
or it is somewhere else in the environment.
Comment 77•23 years ago
|
||
There's a new Solaris build:
2002-02-25-21-trunk/mozilla-sparc-sun-solaris2.6.tar.gz
but it doesn't start because of bug 127817. If 127817 is related to security code
as suggested in 127817 comment #3, does it mean that it's security stuff that's
stopping Solaris builds normally?
Comment 78•23 years ago
|
||
May be a red herring, but take a look at the gzipped truss output file I
attached to bug 129567 - Is this related? If it looks similar, then maybe we can
compare patch revs or something... see if we can find a patch that if applied
causes the problem and can be backed out to make it go away?
Comment 79•23 years ago
|
||
Any possibility of manually getting another nightly Solaris build uploaded (or
at least deleting the current broken one)? The most recent Solaris nightly is
still the build from 20020225, which is broken due to bug 129749. We've had a
couple of duplicates of that long since fixed bug because that build is the only
Solaris nightly available.
Comment 80•23 years ago
|
||
we build with gcc (2.95.3).
Comment 81•23 years ago
|
||
I have compressed the file because it is rather large.
Comment 82•23 years ago
|
||
not sure if my attempt earlier in the week to get a new build up worked well or
not (killed all the regxpcom procs and i think the various build procs
interfered with each other), but there's one from today.
Comment 83•23 years ago
|
||
We are building RPM:s on FreeBSD 4.3, RedHat 7.1 & 7.2 and Solaris 2.6, 7 & 8
and I have seen the problem with the hanging regxpcom many times on Solaris.
I have been starting regxpcom through truss and strace and got huge logfiles, so if
anyone is interested, let me know.
Workaround:
Our SPEC-file (RPM) places this script in the '.../mozilla/dist/bin'-directory.
I tested it today, when building 0.9.9, regxpcom ran from 7 to 42s on FreeBSD,
RedHat and Solaris 2.6, and ended normally.
On Solaris 7 & 8 it was killed after the timeout and ran about 3s the second
time.
#!/app/cueshell/bin/cueshell # This is a bash-alias
dist_bin=`dirname $0`
MOZILLA_FIVE_HOME=$dist_bin
LD_LIBRARY_PATH=$dist_bin:$LD_LIBRARY_PATH
export MOZILLA_FIVE_HOME LD_LIBRARY_PATH
case `uname -s` in
  SunOS)
    echo "`date`: Starting regxpcom"
    ( $dist_bin/regxpcom; echo "`date`: regxpcom done.") &
    waiting=0
    while [ $waiting -lt 1800 ]; do
      if ps -p $! >/dev/null ; then
        waiting=`expr $waiting + 30`
        sleep 30
        echo "`date`: Waited $waiting seconds for regxpcom"
      else
        echo "`date`: Waiting done."
        waiting=1800
      fi
    done
    if ps -p $! >/dev/null ; then
      echo "`date`: Kills regxpcom "
      /usr/sbin/fuser -k $dist_bin/regxpcom
      echo "`date`: Restarting regxpcom"
      $dist_bin/regxpcom; echo "`date`: regxpcom done."
    fi
    ;;
  *)
    echo "`date`: Starting regxpcom"
    $dist_bin/regxpcom; echo "`date`: regxpcom done."
    ;;
esac
$dist_bin/regchrome
touch $dist_bin/chrome/user-skins.rdf $dist_bin/chrome/user-locales.rdf
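For packaging setups that cannot rely on a shell wrapper, the same "wait, then kill on timeout" idea can be sketched in C++ with plain POSIX calls. This is only an illustration of the workaround, not part of any Mozilla build script: `RunResult` and `runWithTimeout` are hypothetical names, and the 50 ms polling interval is an arbitrary assumption.

```cpp
#include <chrono>
#include <string>
#include <thread>
#include <vector>

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical result type for this sketch.
struct RunResult {
    bool timedOut;   // true if the child had to be killed
    int exitStatus;  // exit code on a normal exit, otherwise -1
};

// Hypothetical helper: run argv[0] with the given arguments and kill it
// if it is still alive once `limit` has elapsed.
RunResult runWithTimeout(const std::vector<std::string>& argv,
                         std::chrono::milliseconds limit) {
    // Build the NULL-terminated argument array before forking.
    std::vector<char*> args;
    for (const std::string& a : argv) {
        args.push_back(const_cast<char*>(a.c_str()));
    }
    args.push_back(nullptr);

    pid_t pid = fork();
    if (pid < 0) {
        return {false, -1};  // fork failed
    }
    if (pid == 0) {
        execvp(args[0], args.data());
        _exit(127);  // only reached if exec failed
    }

    const auto deadline = std::chrono::steady_clock::now() + limit;
    int status = 0;
    for (;;) {
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid) {
            // Child finished on its own.
            return {false, WIFEXITED(status) ? WEXITSTATUS(status) : -1};
        }
        if (r < 0) {
            return {false, -1};  // waitpid failed
        }
        if (std::chrono::steady_clock::now() >= deadline) {
            break;  // give up waiting
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }

    // Still running after the limit: kill it and reap it.
    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);
    return {true, -1};
}
```

A caller could invoke something like `runWithTimeout({"./regxpcom"}, std::chrono::minutes(30))` and, as the script above does, retry once when `timedOut` is true. Like the script, this only hides the hang; it does not fix it.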
Comment 84•23 years ago
|
||
I've been troubleshooting this using the 20020315xx nightly, and I think I have
some useful information.
First of all, if components/nsProxyAutoConfig.js is removed from an installed
copy of mozilla, then regxpcom will run to completion and exit as it should.
However, regxpcom isn't hanging while registering this component; it's hanging
in the call to NS_ShutdownXPCOM() just before regxpcom exits.
It appears that nsProxyAutoConfig.js causes an nsDNSService thread to be
created, which in turn creates a TimerThread. Later at shutdown time, xpcom
tries to kill the timer thread, but it isn't dying.
I'm going to attach a copy of the /usr/proc/bin/pstack output for a well-hung
regxpcom instance. You'll note the following:
1) thread #1 is performing NS_ShutdownXPCOM() and is waiting for a _thrp_join()
call to complete. This is actually a pthread_join() call in the source. I think
the '6' in the _thrp_join() argument list means thread #6.
2) lwp #1/thread #6 is within a TimerThread::Run call, blissfully waiting for a
call to pthread_cond_wait() to complete.
3) thread #5 is within a nsDNSService::Run call. According to truss, thread 5
was spawned by thread 4, which is inside an nsSocketTransportService::Run call.
I have trusses from running regxpcom with and without the proxy autoconfig
component present. When it's not present, regxpcom never gets beyond four
threads; #5 and #6 are never created. The trusses are quite large so I won't
attach them.
Comment 85•23 years ago
|
||
Comment 86•23 years ago
|
||
my bet is that the problem is:
235 var PacMan = new nsProxyAutoConfig() ;
Assignee: pavlov → gagan
Component: Build Config → Networking
QA Contact: granrose → benc
Summary: Regxpcom hanging Solaris nightly build packaging process → PAC instantiation hangs Regxpcom Solaris nightly build packaging process
Keywords: helpwanted,
qawanted
Comment 87•23 years ago
|
||
I doubt it, since that will just call this nothing function:
55 function nsProxyAutoConfig() {};
Is it possible that some other component is causing network activity, which is
in turn causing the proxyautoconfig stuff to get kicked off?
If that's the case, we probably need regxpcom to do more mozilla-like things in
its shutdown process. Cc:ing Jud, because embedders on Solaris might well run
into this problem as well, if they don't do the shutdown perfectly.
Comment 88•23 years ago
|
||
FWIW: PAC download is triggered whenever the PAC preference is modified. see
nsProtocolProxyService::PrefsChanged.
Comment 89•23 years ago
|
||
(wonders if the bug on nsDNSshutdown leaking, which he can't find, is related)
Assignee | ||
Comment 90•23 years ago
|
||
regxpcom and InitXPCOM does not create any event queue for the main thread. /me
wonders if the timer or DNS threads require one present?
attaching hack to test this theory.
Comment 91•23 years ago
|
||
A detailed truss suggests that there may be a race condition within TimerThread
(xpcom/threads/TimerThread.cpp). TimerThread::Shutdown() is running before
TimerThread::Run(). This is breaking the method that Shutdown() uses to tell
Run() to exit.
Shutdown() checks a condition variable and a flag:
  // notify the cond var so that Run() can return
  if (mCondVar && mWaiting)
    PR_NotifyCondVar(mCondVar);
but Run() hasn't been called yet so the test fails. Shutdown() falls through and
eventually calls nsThread::Join() to harvest the Run() thread.
Some time later, the Run() thread starts executing, and eventually goes to sleep
on PR_WaitCondVar(mCondVar). Deadlock.
I'm attaching a truss clip which illustrates the problem; the truss includes
calls to libxpcom and libnspr4. I apologize for not including more data; these
truss runs take a long time to complete and produce huge amounts of output. The
one I'm excerpting is 41MB, for example.
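The ordering described in this comment can be replayed in a small stand-alone model. The sketch below is not the Mozilla code: it uses standard C++ primitives instead of NSPR, the names `LostNotifyModel`, `shutdownStep`, `runStepTimesOut` and `earlyShutdownIsLost` are made up for illustration, and the wait is bounded so the sketch returns instead of hanging the way the real Run() thread would.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical model of "notify only if the other side is already waiting".
struct LostNotifyModel {
    std::mutex lock;
    std::condition_variable condVar;
    bool waiting = false;   // stands in for mWaiting
    bool notified = false;  // records whether a notification was ever sent

    // Shutdown's signalling step: it notifies only when Run() is
    // already waiting, mirroring `if (mCondVar && mWaiting)`.
    void shutdownStep() {
        std::lock_guard<std::mutex> guard(lock);
        if (waiting) {
            notified = true;
            condVar.notify_one();
        }
    }

    // Run's wait, bounded by `limit` so the model terminates.
    // Returns true if no notification arrived before the limit expired.
    bool runStepTimesOut(std::chrono::milliseconds limit) {
        std::unique_lock<std::mutex> guard(lock);
        waiting = true;
        bool woke = condVar.wait_for(guard, limit, [this] { return notified; });
        waiting = false;
        return !woke;
    }
};

// Replays the problematic order: Shutdown's check first, Run's wait second.
bool earlyShutdownIsLost() {
    LostNotifyModel model;
    model.shutdownStep();  // `waiting` is still false, so nothing is sent
    return model.runStepTimesOut(std::chrono::milliseconds(100));
}
```

In this model `earlyShutdownIsLost()` reports that the waiter never hears about the shutdown. With an unbounded wait and a join on the other side, that is the deadlock described above.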
Comment 92•23 years ago
|
||
Assignee | ||
Comment 93•23 years ago
|
||
If this works, we should fix the problem much cleaner by having InitXPCOM
startup the event queue directory and Shutdown clean it up. See 135531.
Comment 94•23 years ago
|
||
I tried adding "sleep(1)" to TimerThread::Shutdown() just before the
timerthread lock is acquired. This has the desired effect; the Shutdown()
thread gives up its timeslice, giving the OS time to schedule the Run() thread.
By the time Shutdown() wakes up, the Run() thread is in the state that
Shutdown() expects. But of course this is just a hack, not a proper solution.
TimerThread uses a flag "mProcessing" to indicate whether TimerThread::Run()
should keep going or not, but the logic isn't quite right. The flag is
initialized false. Run() sets it true on entry, then keeps looping until it
sees the flag become false. Shutdown() sets the flag back to false when it
wants Run() to return. But if Shutdown() runs before Run(), then Run() can't
tell that Shutdown() has already been called and already written to the flag.
The attached patch replaces the mProcessing flag with an mShutdown flag. This
flag is initialized to false. It's set to true in Shutdown(). Run() never
writes to this flag, but it keeps looping as long as the flag is false.
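The idea behind the mShutdown flag can be sketched with standard C++ threads. This is an assumption-laden illustration rather than the attached patch: `TimerThreadSketch`, `requestShutdown` and `waitsWhenShutdownComesFirst` are hypothetical names, and `std::mutex` / `std::condition_variable` stand in for the NSPR lock and condition variable.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a worker that must tolerate shutdown-before-run.
class TimerThreadSketch {
public:
    // Worker loop: only reads the flag, never writes it.
    // Returns how many times it blocked before seeing the request.
    int run() {
        std::unique_lock<std::mutex> guard(lock_);
        int waits = 0;
        while (!shutdown_) {
            ++waits;
            condVar_.wait(guard);
        }
        return waits;
    }

    // Sets the one-way flag under the lock, then wakes any waiter.
    void requestShutdown() {
        {
            std::lock_guard<std::mutex> guard(lock_);
            shutdown_ = true;
        }
        condVar_.notify_all();
    }

private:
    std::mutex lock_;
    std::condition_variable condVar_;
    bool shutdown_ = false;  // false until shutdown is requested, never reset
};

// Requests shutdown before the worker thread even exists, then joins it.
int waitsWhenShutdownComesFirst() {
    TimerThreadSketch timer;
    timer.requestShutdown();
    int waits = -1;
    std::thread worker([&] { waits = timer.run(); });
    worker.join();  // returns promptly: run() sees the flag already set
    return waits;
}
```

Because the worker never writes the flag, a shutdown request made before the worker starts cannot be overwritten, so the join completes in either order.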
With either the added sleep() call or the mShutdown patch, regxpcom no longer
hangs shutting down xpcom. Instead, the last few lines that it prints are as
follows:
*** Registering irc protocol handler.
nNCL: registering deferred (0)
nNCL: registering deferred (0)
Getting service on shutdown. Denied.
ContractID: @mozilla.org/js/xpc/ContextStack;1
IID: {a1339ae0-05c1-11d4-8f92-0010a4e73d9a}
###!!! ASSERTION: Component Manager being held past XPCOM shutdown.: 'cnt ==
0', file nsXPComInit.cpp, line 582
###!!! Break: at file nsXPComInit.cpp, line 582
As far as I can tell, this is an unrelated problem. It may be bug 135330
rearing its head; the source distribution I'm using is from 4/3/2002.
Comment 95•23 years ago
|
||
the getting @ shutdown is bug 134728
in general, shutdown problems have *many* bugs, although searching for bugs
filed by me is a good start.
Assignee | ||
Comment 97•23 years ago
|
||
Comment on attachment 78085 [details] [diff] [review]
Proposed TimerThread.cpp, TimerThread.h patch
r=dougt. Thanks for fixing this.
Attachment #78085 -
Flags: review+
Assignee | ||
Comment 98•23 years ago
|
||
brendan, can you super review? You have blame on a lot of this code.
Target Milestone: --- → mozilla1.0
Comment 99•23 years ago
|
||
I've applied this patch to my mozilla 0.9.9 tree, and regxpcom no longer hangs
on package creation.
Comment 100•23 years ago
|
||
Was it hanging without this patch? We've had solaris nightlies for the past few
days (since the 12th apparently). So either the problem resolved itself or
someone added a workaround to the build automation, which I don't see.
Comment 101•23 years ago
|
||
Yes, it would consistently hang on building 0.9.8 and 0.9.9 without this patch
on solaris 7
Comment 102•23 years ago
|
||
Comment on attachment 78085 [details] [diff] [review]
Proposed TimerThread.cpp, TimerThread.h patch
sr=brendan@mozilla.org
dougt: I took cvsblame in making fixes to pavlov's busted threading code, but I
won't take all blame here. I do feel pretty foolish for taking this stuff so
close to 1.0 (0.9.8, IIRC -- at least I made pav wait till then, instead of
checking in on the last day of 0.9.7 as he wanted to).
/be
Attachment #78085 -
Flags: superreview+
Updated•23 years ago
|
Keywords: mozilla1.0+,
nsbeta1
Assignee | ||
Comment 103•23 years ago
|
||
Checked into the trunk:
Checking in TimerThread.cpp;
/cvsroot/mozilla/xpcom/threads/TimerThread.cpp,v <-- TimerThread.cpp
new revision: 1.12; previous revision: 1.11
done
Checking in TimerThread.h;
/cvsroot/mozilla/xpcom/threads/TimerThread.h,v <-- TimerThread.h
new revision: 1.4; previous revision: 1.3
done
Status: NEW → ASSIGNED
Whiteboard: Needs to land on branch
Comment 104•23 years ago
|
||
Comment on attachment 78085 [details] [diff] [review]
Proposed TimerThread.cpp, TimerThread.h patch
a=rjesup@wgate.com for branch checkin
Attachment #78085 -
Flags: approval+
Assignee | ||
Comment 105•23 years ago
|
||
Checked into branch.
Checking in TimerThread.cpp;
/cvsroot/mozilla/xpcom/threads/TimerThread.cpp,v <-- TimerThread.cpp
new revision: 1.6.4.4; previous revision: 1.6.4.3
done
Checking in TimerThread.h;
/cvsroot/mozilla/xpcom/threads/TimerThread.h,v <-- TimerThread.h
new revision: 1.3.4.2; previous revision: 1.3.4.1
done
Kenneth, thank you for the patch.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago → 23 years ago
Resolution: --- → FIXED
Comment 106•23 years ago
|
||
adding fixed1.0.0 keyword (branch resolution). This bug has comments saying it
was fixed on the 1.0 branch and a bonsai checkin comment that agrees. To verify
the bug has been fixed on the 1.0 branch please replace the fixed1.0.0 keyword
with verified1.0.0.
Keywords: fixed1.0.0
Comment 107•22 years ago
|
||
updating component and qa...
From reading carefully, it seems like this goes to XPCOM Registry. Also, the
summary seems out of date; is PAC really the root cause of this?
Component: Networking → XPCOM Registry
QA Contact: benc → dougt