Closed Bug 106009 Opened 23 years ago Closed 22 years ago

PAC instantiation hangs Regxpcom Solaris nightly build packaging process

Categories: Core :: XPCOM, defect; Sun / SunOS; priority not set; severity critical
Tracking: RESOLVED FIXED, mozilla1.0
People: Reporter: nbidwell, Assigned: dougt
Details: Keywords: helpwanted, Whiteboard: Needs to land on branch
Attachments: 7 files

As I write this, the latest Solaris nightly build on ftp.mozilla.org is from
10/15/2001.  That was a week ago.  (Not to mention that that build has a rather
broken mail client for me...)  Is this intentional?
Confirming and changing product to mozilla.org.

CCing leaf@mozilla.org in hopes that more info might be out there.

Related to bug 105981 or 105988?
Status: UNCONFIRMED → NEW
Component: Build Config → FTP - Staging
Ever confirmed: true
Product: Browser → mozilla.org
Nope, not related to those bugs.  There hasn't been a sol26 nightly build log
since Oct 15, which is weird.   It's still in the crontabs on granite & aesir is
up. Running the nightly script by hand to see what it turns up.
the 8am builds fail because IC is messing with the network and cvs hangs.  the
8pm builds fail because the cvs process from 8am is still running.  linux
finishes because it starts at 4am before they start breaking things.
Well, I'm posting this with Solaris build 2001102222, so a nightly build was
made last night. 
Marking fixed (for lack of IC_screwed_us resolution).
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
And now there's an even newer nightly build up, so it seems everything is
working correctly.  Thank you.  Now back to my regularly scheduled testing.

BTW, what/who is IC?
Status: RESOLVED → VERIFIED
Reopening because Solaris nightlies aren't showing up again.  The last available
build seems to be 2001110210.
Status: VERIFIED → REOPENED
Resolution: FIXED → ---
John, aesir needs to be resurrected after the network massacre so that solaris
nightlies can live again.
Assignee: seawood → antitux
Status: REOPENED → NEW
In addition to Solaris builds, source code isn't getting put onto the ftp server.
source balls are on branch, sol26 builds are on aesir, both down right now. 
antitux is supposed to get all our unix systems up by COB Friday so source
tarballs and sol26 builds should start showing up Saturday morning at the latest.
Status: NEW → ASSIGNED
Closing since Solaris nightlies and source are both back on ftp.  Thanks!
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Re-opening because the Solaris build on ftp is 2001122110.

Happy holidays!
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
someone broke the solaris build in nsPluginModule.cpp a while back.  I don't
know why the tinderboxen are green... could be someone hacked configure, it's an
official-only problem, or a parallel build problem, or ???
Component: FTP - Staging → Build Config
Priority: -- → P3
Product: mozilla.org → Browser
Target Milestone: --- → mozilla0.9.8
btw - you shouldn't keep reopening the same bug.  this is a completely different
problem from the original reported bug so it makes things confusing.  In the
future, should the Solaris builds fail to appear on ftp.mozilla.org, it should
be a new bug since it may or may not be the same problem.
Status: REOPENED → ASSIGNED
Tinderboxes are green because they use a) a more recent version of gcc
(speedracer's 2.95.3 vs aesir's 2.7.2.1) or b) Forte (nebiros).  Since we've
dropped support for gcc 2.7.2.x, we should upgrade the compiler on aesir.


John what is the plugin problem on solaris?
There is currently an open bug that is tracking
issues on a variety of platforms
http://bugzilla.mozilla.org/show_bug.cgi?id=106806

if this is also a solaris problem, then this bug
should be updated and we should probably also make
the platform to ALL unix, since linux suffers from
this as well (the .so.1 issue)
*** Bug 117712 has been marked as a duplicate of this bug. ***
reassigning to asasaki since antitux is busy with other work right now.

Aki - Can you upgrade gcc on aesir to 2.95.3?  You'll need to coordinate with
lpham to make sure the build automation is looking in the right place to find
the new compiler once it's in place.  Thanks.
actually reassigning this time...
Assignee: antitux → asasaki
Status: ASSIGNED → NEW
Installed in /opt/gcc-2.95.3 which is softlinked to /opt/gcc (so you shouldn't
have to change anything in the env).  Old /opt/gcc moved to /opt/gcc-2.95.2
which can be removed once everything is working.

There are a lot of old cltbld processes on aesir... should I kill those?
Status: NEW → ASSIGNED
Yes, please kill them.  Look like they are the old builds.  Thanks.  Loan
There is now a nightly build for Jan 7 - so thanks!

But, the tar file is only 3.5MB :(

I've raised bug#118701 for this.
Hm, looks like it's *building* fine (able to run the package from aesir), but
regxpcom is hanging, and it doesn't get past the packaging phase unless I kill
that...  Want to rerun it but I may hit the 8pm build...?  Meanwhile, there's a
new build up...
Again this is the case... build finishes, regxpcom hangs indefinitely after the
"registering smime account manager extension" line.  Once I kill regxpcom, the
rest of the packaging and the push to the ftp site happens.

Anyone have any idea why regxpcom might be hanging on aesir?
IIRC regxpcom is hanging in smime, right?  Paste the regxpcom output into the
bug, then check lxr or ask on #mozilla for who's been doing smime and regxpcom
work and cc them here to get their input.  This will probably need to be
reassigned to an engineer to fix one or the other.
truss output showed that regxpcom was hanging in poll().  It appeared to occur
after all of the components have been registered and the components.reg file had
already been created.
 27077  ./run-mozilla.sh ./regxpcom
 27078  *** Registering -venkman handler.
 27079  *** Registering -chat handler.
 27080  *** Registering x-application-irc handler.
 27081  *** Registering irc protocol handler.
 27082  *** Registering smime account manager extension.
 27083  Terminated
dougt, kaie -- what are your thoughts on regxpcom hanging after smime
registration? TIA.
could you attach some stacktraces of the threads (probably just one thread)
involved?
output from truss, sleeping in poll().
I think that output in comment 27 does not give a hint that it could have
something to do with smime. From what I have seen on Unix optimized builds,
that's always the exact output of the first run of a new build.
Probably true, but it doesn't get to the "terminated" bit until I kill the
regxpcom process, which could be many hours after the smime line appears in the
log...
Aki, to find out whether it is indeed a problem with the smime extension, or
something else, could you please do the following?

As a first test, when the build has finished, just remove
  mailnews/extensions/smime/build/libmsgsmime.so

Another test, you could add another debugging output line to that module.
In
  mailnews/extensions/smime/src/smime-service.js
locate
  SMIMEModule.registerSelf =
  function (compMgr, fileSpec, location, type)
  {
    dump("*** Registering smime account manager extension.\n");
    ....
  }

And add some more dump output lines. For example, replace that complete function
with:

SMIMEModule.registerSelf =
function (compMgr, fileSpec, location, type)
{
  dump("*** Registering smime account manager extension.\n");
  compMgr =
compMgr.QueryInterface(Components.interfaces.nsIComponentManagerObsolete);
  dump("*** smime 2.\n");
  compMgr.registerComponentWithType(SMIME_EXTENSION_SERVICE_CID,
                                      "SMIME Account Manager Extension Service",
                                      SMIME_EXTENSION_SERVICE_CONTRACTID, fileSpec,
                                      location, true, true, type);
  dump("*** smime 3.\n");
  catman =
Components.classes["@mozilla.org/categorymanager;1"].getService(nsICategoryManager);
  dump("*** smime 4.\n");
  catman.addCategoryEntry("mailnews-accountmanager-extensions",
                            "smime account manager extension",
                            SMIME_EXTENSION_SERVICE_CONTRACTID, true, true);
  dump("*** smime 5.\n");
}

If you build again and start, if we see the line with "*** smime 5", I think it
can't be the smime extension.
Ok, done.  Looks like it's regxpcom =)
dougt -- do you want ownership of this bug?  Or do you have recommendations as
to who would be best able to fix it?
thanks.
reassigning.
Assignee: asasaki → dougt
Status: ASSIGNED → NEW
Does someone with a Sun build want to look at this?  pavlov, do you have a build
that I could peek at?
Keywords: helpwanted
Target Milestone: mozilla0.9.8 → ---
over to pavlov.  
Assignee: dougt → pavlov
Since Solaris builds are back and working, the summary should involve the
problem that is left.
Summary: Where did the Sun Solaris nightly builds go? → Regxpcom hanging Solaris nightly build packaging process
There have been no new Solaris nightlies for 5 days now....
so I think they aren't working again (or whatever workaround was
in place has stopped working).
I'm assuming that this is still the bug holding up nightly builds.  If so, could
someone please massage things by hand for a build newer than 1-15-2002?  Thank you.
*** Bug 122813 has been marked as a duplicate of this bug. ***
My work firewall stops me pulling from CVS and I don't have the time to pull
regular tarballs, so building it myself is out.
I'm sure I'm not alone in being dependent on nightlies for my mozilla testing, so
I'm guessing there are probably quite a few solaris issues going unnoticed until
the nightlies come back, simply by virtue of there being fewer users out there
running the latest codebase.
The longer this goes on the bigger a deal it gets. Any feedback at all on
progress to a fix would be welcome.
I just found out that there is a new nightly build available in
http://ftp.mozilla.org/pub/mozilla/nightly/latest/, with build date 2002013122.

At first it complained about not being able to find run-mozilla.sh, so I just
copied the file from the previous nightly build (2002011510) and now it runs
beautifully :)

The missing run-mozilla.sh is bug#122942, now fixed.

Thanks to whoever did the manual update of the Solaris nightly (or fixed the
problem) - now I can play with the new Page Info stuff :)
Before anyone gets too happy, it was the missing run-mozilla.sh that caused the
nightly to get delivered.  Since run-mozilla.sh didn't exist, regxpcom could not
run and then hang. 
Target Milestone: --- → Future
With all due respect, why is this bug being "futured"?  Do the drivers not care
about testing on Solaris?  Yes, it's a minority platform, but so are SGI IRIX
and HPUX, both of which have up-to-date nightlies.  
Even if the root problem with Regxpcom isn't worth the effort to fix at this
point, it seems like it should be possible to make a work around that would
allow nightlies to be distributed. It seems like an ugly hack to the build
process like automatically killing the Regxpcom process (or removing
run-mozilla.sh which allowed 2002013122 to be built) would work.
Or just turn off --enable-crypto on the solaris nightly's to see if 
that fixes it... 

--disable-crypto turns off MOZ_PSM which turns off BUILD_SMIME
in mozilla/mailnews/extensions/Makefile.in

All you have to do is set BUILD_PSM="FALSE" in
ns/build/unix/verification/seamonkey-build
in the sol26 stanza... line:

if this fixes your problem, then someone needs to debug
--enable-crypto and smime in the sol26 stanza.  
But in the meantime there will be nightly builds.
> Or just turn off --enable-crypto on the solaris nightly's to see if 
> that fixes it... 


I thought Aki confirmed in his comment 34 that it is NOT the crypto component?
I am not saying it is a crypto issue...
I am saying if we want to get rid of smime from 
the build... the quickest and easiest way to do 
that is to turn off PSM in the nightlies...

Since no PSM, no smime... and then regxpcom
CAN'T have an issue trying to load it.

btw that was line 869 for doing so.

Then whoever is the champion of sol26 should
figure out what is going on... just like I
do when the darn hpux nightlies mess up.
Whatever the issue is, whether it's crypto or something else, this still should
NOT be futured. Sure, it doesn't have the visibility of the wintel or linux
platforms but one of the main drivers for adopting a solution like Mozilla is
that it is truly cross-platform. Inhibiting testing on a major unix platform
does not bode well for this continuing. I wonder how many solaris bugs will go
unreported between now and the 0.9.9 release if the nightlies remain hosed? Do
we really want to suddenly see them all show up at that point rather than have
them reported and fixed along the way from nightly trunk builds?
I am not saying FUTURED... I am trying to get you
a nightly build.  Granted I am suggesting turning
off crypto to get you that (so you can test everything
else).  If I am reading this correctly you guys
(who care about solaris) haven't had a nightly
build in like forever.  

I will shut up, I don't care...
I don't care if solaris nightly builds work or not.
I don't care if crypto is on or not.
I was just trying to suggest a way to get
nightlies going again and to narrow down the
problem and not leave it to AKI who hasn't touched
a solaris build in "like forever".

un-ccing myself, do whatever you want.
Don't know if this is relevant, but nebiros SunOS/sparc 5.7 Clobber seems to have been orange for ages. Is this the same problem?

Also, i386 Solaris 2.6 nightlies seem to be being built fine, it's just the sparc ones (if that helps at all).
Pav - if there is too much on your plate right now to work on this bug, is there
anyone else that could take a stab at it in the meantime?
We are going to try my suggestion for turning off BUILD_PSM
in the sol26 builds.  We are only going to do this for
this weekend only.  Hopefully we will get nightlies (remember
they won't have PSM or smime) and then on Mon we will turn
it back on.  This will help us narrow down the issue
Umm, I think you're barking up the wrong tree with the PSM issue (but feel free
to prove me wrong).  regxpcom is hanging at the end of its run.  components.reg
has already been written out correctly.  Removing smime from the components dir
does not fix the hanging problem (tested manually on the day I ran truss). 

Does anyone know what is being poll'ed?


er, i'm sorry.  i'm not sure why this bug is assigned to me.
-> cls
Assignee: pavlov → seawood
Target Milestone: Future → ---
So, this is weird.  Regxpcom works fine in a standalone xpcom build on sheep. 
If I build all of Mozilla (except crypto), I see the hang but according to the
truss log, it's not hanging in poll any longer.  It appeared to be hanging while
processing some of the uconv libs.

I used the following build options:
--enable-extensions=default,irc
--without-system-nspr
--without-system-zlib
--without-system-jpeg
--without-system-png
--without-system-mng
--disable-debug
--enable-optimize
--disable-tests
With a debug build, I'm seeing the same hang when building all of Mozilla minus
PSM.  The trace shows that the poll() is coming from necko.

(gdb) bt 
#0  0xfee9990c in _poll () from /usr/lib/libc.so.1
#1  0xfef1b22c in poll () from /usr/lib/libthread.so.1
#2  0xff08847c in PR_Poll (pds=0xcd138, npds=1, timeout=3500000)
    at ../../../../../mozilla/nsprpub/pr/src/pthreads/ptio.c:3963
#3  0xfdabd924 in nsSocketTransportService::Run (this=0xad068)
    at ../../../../mozilla/netwerk/base/src/nsSocketTransportService.cpp:469
#4  0xff224ae8 in nsThread::Main (arg=0x81500)
    at ../../../mozilla/xpcom/threads/nsThread.cpp:120
#5  0xff08a600 in _pt_root (arg=0xa6308)
    at ../../../../../mozilla/nsprpub/pr/src/pthreads/ptthread.c:214
(gdb) 

Stepping thru gdb shows that the AutoRegister() call returned without any errors
(ret = 0).  The hang occurs during XPCOM shutdown.   Or more specifically, after
stepping thru NS_XPCOMShutdown, it's hanging in nsTimerImpl::Shutdown().  It
appears to spawn 2 more LWP threads when this occurs.  One of those threads is
the one shown in the stacktrace above.  The extra threads are spawned when
mThread->Join() is called from TimerThread::Shutdown() .

Pavlov, back to you.





Assignee: seawood → pavlov
Severity: normal → critical
Keywords: helpwanted
Priority: P3 → --
Could someone clarify which version of Solaris has the problem?

I am able to reproduce the regxpcom hang on one of our Solaris 8 boxes
(and after removal of components.reg the problem can be repeated),
but it is not reproducible on several other Solaris 7/8/9 boxes
(note: I am testing the *same* build shared over NFS).

This leads me to think that the problem may be solved by installing the appropriate
Solaris patches. Did anyone try to investigate this?

I am not sure which patches are necessary to fix the problem,
but I would recommend trying patch 106541 for Solaris 7:
 
http://sunsolve.sun.com/pub-cgi/retrieve.pl?doc=fpatches%2F106541&zone_32=libc.so.1

(it contains a fix for bug
4207080, "hang in poll, application does not get notified of data on stream head")

For Solaris 8, patch 108991 may be useful.
It also has libc.so fixes, and it is one of the patches installed on the
system that does not have the problem; a much older revision of this patch is
installed on the system that does have the problem.

 
I wouldn't be surprised if our build system was in need of some patching.

Aki - can you check the patch status on the system, and make sure it's up the
latest and greatest patch cluster, as well as the patches mentioned above?

cls - do you know of any reason we shouldn't upgrade, or any particular patches
we should avoid?
I've heard that the very latest patch cluster from Sun introduces some
instabilities.  Pavlov and/or Roland would know specifically which one.
comment 59 makes no sense to me -- why would Join spawn threads?  cc'ing wtc.

/be
No clue what's going on.
Both Solaris 2.7+latest patches and Solaris 2.8+latest patches (except Xsun
patch 108652-47, we are still using rev -46) are working here...
this is solaris 2.6.  downloading the latest recommended patches... which are
from 2/5/02.  i've had decent luck with the recommended patch bundles, so i'll
install these and just keep an eye out for news on any bad patches.
Brendan asked:
> comment 59 makes no sense to me -- why would Join spawn threads?
> cc'ing wtc

I have no idea either.  Sorry.

patch cluster installed, aesir rebooted.  we'll see if that fixes the 8pm build.
regxpcom is still hanging and the recommended patch cluster had a libc fix,
can't find anything else in the 2.6 patchreport about it.
16 nights and new nightly build... could someone at least manually push one?
Grr.  I meant: 16 nights and *no* new nightly build.  Too many chocolate cookies
for me today.
Can't we set up another machine for creating Solaris 2.7 or 2.8 nightly tarballs
built with Sun Workshop?
Comment on attachment 64581 [details]
output of `rm component.reg; truss ./regxpcom 2>&1 | tee > regxpcom.log`

Can someone provide a log from a hang with `rm component.reg; truss -u ::
./regxpcom 2>&1 | tee > regxpcom.log`, please?
killed regxpcom, should be another package available on the site.
there is no -u option available in our version of truss... did you want the same
output, but more recent, or different output?
also, I believe I accidentally added the ">" to the tee comment, which shouldn't
be there... ignore it.
Thanks for the new nightly build.  Any chance of putting "kill regxpcom" in a
cron job?
no.  this bug has to be fixed.
roland: even on a non-hanging node, truss -u results in a 500M+ log file.
         On the node with problems it never stops growing.

Back to the idea about Solaris patches - I tried one more Solaris 2.8 system
that also did not have the regxpcom hang. However, it has a
strict subset of the patches installed on the node with problems :(
Therefore either the problem is introduced by one of the additional patches
or it is somewhere else in the environment.
There's a new Solaris build:

2002-02-25-21-trunk/mozilla-sparc-sun-solaris2.6.tar.gz

but it doesn't start because of bug 127817. If 127817 is related to security code
as suggested in 127817 comment #3, does it mean that it's security stuff that's 
stopping Solaris builds normally?
May be a red herring, but take a look at the gzipped truss output file I
attached to bug 129567 - Is this related? If it looks similar, then maybe we can
compare patch revs or something... see if we can find a patch that if applied
causes the problem and can be backed out to make it go away?
Any possibility of manually getting another nightly Solaris build uploaded (or
at least deleting the current broken one)?  The most recent Solaris nightly is
still the build from 20020225, which is broken due to bug 129749.  We've had a
couple of duplicates of that long since fixed bug because that build is the only
Solaris nightly available.
not sure if my attempt earlier in the week to get a new build up worked well or
not (killed all the regxpcom procs and i think the various build procs
interfered with each other), but there's one from today.
We are building RPMs on FreeBSD 4.3, RedHat 7.1 & 7.2 and Solaris 2.6, 7 & 8
and I have seen the problem with the hanging regxpcom many times on Solaris.
I have been starting regxpcom through truss and strace and got huge logfiles, so if
anyone is interested, let me know.
Workaround:
Our SPEC-file (RPM) places this script in the '.../mozilla/dist/bin'-directory. 
I tested it today, when building 0.9.9, regxpcom ran from 7 to 42s on FreeBSD, 
RedHat and Solaris 2.6, and ended normally. 
On Solaris 7 & 8 it was killed after the timeout and ran about 3s the second
time. 
#!/app/cueshell/bin/cueshell  # This is a bash-alias
dist_bin=`dirname $0`
MOZILLA_FIVE_HOME=$dist_bin
LD_LIBRARY_PATH=$dist_bin:$LD_LIBRARY_PATH
export MOZILLA_FIVE_HOME LD_LIBRARY_PATH
case `uname -s` in 
  SunOS)
    echo "`date`: Starting regxpcom"
    ( $dist_bin/regxpcom; echo "`date`: regxpcom done.") &
    waiting=0
    while [ $waiting -lt 1800 ]; do
      if ps -p $! >/dev/null ; then 
        waiting=`expr $waiting + 30`
        sleep 30
        echo "`date`: Waited $waiting seconds for regxpcom"
      else
        echo "`date`: Waiting done."
        waiting=1800
      fi
    done
    if ps -p $! >/dev/null ; then 
      echo "`date`: Kills regxpcom "
      /usr/sbin/fuser -k $dist_bin/regxpcom
      echo "`date`: Restarting regxpcom"
       $dist_bin/regxpcom; echo "`date`: regxpcom done." 
    fi 
    ;;
  *)
    echo "`date`: Starting regxpcom"
     $dist_bin/regxpcom; echo "`date`: regxpcom done." 
    ;;
esac
$dist_bin/regchrome 
touch $dist_bin/chrome/user-skins.rdf $dist_bin/chrome/user-locales.rdf
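
For comparison, the same "wait, then kill on timeout" idea can be sketched in C++ with plain POSIX calls. This is only an illustration of the workaround pattern, not part of the build automation: `RunWithTimeout` and `RunResult` are hypothetical names, the poll interval is arbitrary, and unlike the script above it neither uses fuser nor restarts the program after killing it.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical result type for this sketch.
enum class RunResult { Exited, KilledAfterTimeout, Error };

// Runs argv[0] with the given NULL-terminated argument vector and waits for
// it to finish. If it is still running when the timeout expires, it is killed.
// Note: a failed exec shows up as Exited (the child exits with status 127).
RunResult RunWithTimeout(char* const argv[], std::chrono::seconds timeout) {
    assert(argv != nullptr && argv[0] != nullptr);

    pid_t child = fork();
    if (child < 0) return RunResult::Error;
    if (child == 0) {
        execv(argv[0], argv);
        _exit(127);  // only reached if exec failed
    }

    const auto deadline = std::chrono::steady_clock::now() + timeout;
    int status = 0;
    while (std::chrono::steady_clock::now() < deadline) {
        pid_t done = waitpid(child, &status, WNOHANG);  // non-blocking check
        if (done == child) return RunResult::Exited;
        if (done < 0) return RunResult::Error;
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }

    // Last check, so a child that finished during the final sleep is not
    // reported as killed.
    if (waitpid(child, &status, WNOHANG) == child) return RunResult::Exited;

    kill(child, SIGKILL);
    waitpid(child, &status, 0);  // reap the killed child
    return RunResult::KilledAfterTimeout;
}
```

A caller would pass something like the path to regxpcom and a 1800 second limit, then decide whether to rerun the program when the result is `KilledAfterTimeout`.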
I've been troubleshooting this using the 20020315xx nightly, and I think I have
some useful information.

First of all, if components/nsProxyAutoConfig.js is removed from an installed
copy of mozilla, then regxpcom will run to completion and exit as it should.
However, regxpcom isn't hanging while registering this component; it's hanging
in the call to NS_ShutdownXPCOM() just before regxpcom exits.

It appears that nsProxyAutoConfig.js causes an nsDNSService thread to be
created, which in turn creates a TimerThread. Later at shutdown time, xpcom
tries to kill the timer thread, but it isn't dying.

I'm going to attach a copy of the /usr/proc/bin/pstack output for a well-hung
regxpcom instance. You'll note the following:

1) thread #1 is performing NS_ShutdownXPCOM() and is waiting for a _thrp_join()
call to complete. This is actually a pthread_join() call in the source. I think
the '6' in the _thrp_join() argument list means thread #6.

2) lwp #1/thread #6 is within a TimerThread::Run call, blissfully waiting for a
call to pthread_cond_wait() to complete.

3) thread #5 is within a nsDNSService::Run call. According to truss, thread 5
was spawned by thread 4, which is inside an nsSocketTransportService::Run call.

I have trusses from running regxpcom with and without the proxy autoconfig
component present.  When it's not present, regxpcom never gets beyond four
threads; #5 and #6 are never created. The trusses are quite large so I won't
attach them.
my bet is that the problem is:
235                  var PacMan = new nsProxyAutoConfig() ;
Assignee: pavlov → gagan
Component: Build Config → Networking
QA Contact: granrose → benc
Summary: Regxpcom hanging Solaris nightly build packaging process → PAC instantiation hangs Regxpcom Solaris nightly build packaging process
Keywords: helpwanted, qawanted
I doubt it, since that will just call this nothing function:
55   function nsProxyAutoConfig() {};

Is it possible that some other component is causing network activity, which is
in turn causing the proxyautoconfig stuff to get kicked off?

If that's the case, we probably need regxpcom to do more mozilla-like things in
its shutdown process.  Cc:ing Jud, because embedders on Solaris might well run
into this problem as well, if they don't do the shutdown perfectly.
FWIW: PAC download is triggered whenever the PAC preference is modified.  see
nsProtocolProxyService::PrefsChanged.
(wonders if the bug on nsDNSshutdown leaking, which he can't find, is related)
regxpcom and InitXPCOM does not create any event queue for the main thread.  /me
wonders if the timer or DNS threads require one present?


attaching hack to test this theory.
A detailed truss suggests that there may be a race condition within TimerThread
(xpcom/threads/TimerThread.cpp). TimerThread::Shutdown() is running before
TimerThread::Run(). This is breaking the method that Shutdown() uses to tell
Run() to exit.

Shutdown() checks a condition variable and a flag:

    // notify the cond var so that Run() can return
    if (mCondVar && mWaiting)
      PR_NotifyCondVar(mCondVar);

but Run() hasn't been called yet so the test fails. Shutdown() falls through and
eventually calls nsThread::Join() to harvest the Run() thread.

Some time later, the Run() thread starts executing, and eventually goes to sleep
on PR_WaitCondVar(mCondVar). Deadlock.
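
The ordering described above can be modelled without real threads. The sketch below is a hypothetical single-threaded model, not Mozilla code: `LostWakeupModel` and its members are illustrative names, and the only behaviour it borrows from the quoted snippet is "notify only if Run() is already waiting". It simply replays the two possible orderings and reports whether Run() is left asleep with nobody to wake it.

```cpp
#include <cassert>

// Hypothetical model of the notify-if-waiting logic; not the real TimerThread.
struct LostWakeupModel {
    bool waiting = false;       // plays the role of mWaiting
    bool run_sleeping = false;  // Run() is blocked and needs a notify to wake

    // Shutdown(): notify only if Run() is already waiting, then go on to Join().
    void Shutdown() {
        if (waiting) {
            run_sleeping = false;  // the notify wakes Run()
            waiting = false;
        }
    }

    // Run(): reaches its wait on the condition variable and goes to sleep.
    void RunUntilFirstWait() {
        waiting = true;
        run_sleeping = true;
    }

    // True if Run() is asleep after Shutdown() has already moved on.
    bool Deadlocked() const { return run_sleeping; }
};

// The ordering seen in the truss output: Shutdown() first, Run() second.
bool DeadlocksWhenShutdownRunsFirst() {
    LostWakeupModel m;
    m.Shutdown();           // no waiter yet, so no notify is sent
    m.RunUntilFirstWait();  // sleeps; nobody is left to notify
    return m.Deadlocked();
}

// The ordering the code expects: Run() is already waiting when Shutdown() runs.
bool DeadlocksWhenRunWaitsFirst() {
    LostWakeupModel m;
    m.RunUntilFirstWait();
    m.Shutdown();
    return m.Deadlocked();
}

// Convenience check of both orderings.
void CheckOrderings() {
    assert(DeadlocksWhenShutdownRunsFirst());
    assert(!DeadlocksWhenRunWaitsFirst());
}
```

The point of the model is that a condition-variable notify is not remembered: if it is skipped or sent before the waiter arrives, the waiter has to learn about the shutdown some other way.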

I'm attaching a truss clip which illustrates the problem; the truss includes
calls to libxpcom and libnspr4. I apologize for not including more data; these
truss runs take a long time to complete and produce huge amounts of output. The
one I'm excerpting is 41MB, for example.


If this works, we should fix the problem much more cleanly by having InitXPCOM
start up the event queue directory and Shutdown clean it up.  See 135531.
I tried adding "sleep(1)" to TimerThread::Shutdown() just before the
timerthread lock is acquired. This has the desired effect; the Shutdown()
thread gives up its timeslice, giving the OS time to schedule the Run() thread.
By the time Shutdown() wakes up, the Run() thread is in the state that
Shutdown() expects. But of course this is just a hack, not a proper solution.

TimerThread uses a flag "mProcessing" to indicate whether TimerThread::Run()
should keep going or not, but the logic isn't quite right. The flag is
initialized false.  Run() sets it true on entry, then keeps looping until it
sees the flag become false. Shutdown() sets the flag back to false when it
wants Run() to return. But if Shutdown() runs before Run(), then Run() can't
tell that Shutdown() has already been called and already written to the flag.

The attached patch replaces the mProcessing flag with an mShutdown flag. This
flag is initialized to false. It's set to true in Shutdown(). Run() never
writes to this flag, but it keeps looping as long as the flag is false.
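
A minimal sketch of that flag discipline, written with standard C++ threads rather than NSPR, might look like the following. It is not the attached patch: `TimerThreadSketch`, `ShutdownImmediatelyAfterInit` and the member names are hypothetical, and the real TimerThread does timer work inside its loop that is omitted here. What it shows is a shutdown flag that starts false, is written only by Shutdown(), and is re-checked by Run() before every wait.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for TimerThread; names are illustrative only.
class TimerThreadSketch {
public:
    ~TimerThreadSketch() { Shutdown(); }

    // Starts the worker thread; it may not be scheduled immediately.
    void Init() {
        assert(!worker_.joinable());  // Init() is expected to be called once
        worker_ = std::thread(&TimerThreadSketch::Run, this);
    }

    // Only Shutdown() ever writes shutdown_, and it does so under the lock.
    void Shutdown() {
        {
            std::lock_guard<std::mutex> guard(lock_);
            shutdown_ = true;
        }
        cond_.notify_all();  // harmless if Run() is not waiting yet
        if (worker_.joinable()) worker_.join();
    }

    bool Finished() const { return finished_; }

private:
    void Run() {
        std::unique_lock<std::mutex> guard(lock_);
        // The flag is checked before every wait, so a Shutdown() that ran
        // before this thread was scheduled is still observed.
        while (!shutdown_) {
            cond_.wait(guard);
        }
        finished_ = true;
    }

    std::mutex lock_;
    std::condition_variable cond_;
    std::thread worker_;
    bool shutdown_ = false;  // initialized false, set true once, never reset
    bool finished_ = false;  // written by Run(), read only after the join
};

// Starts the worker and shuts it down straight away, which is the ordering
// that hung regxpcom. Returns true once the join has completed.
bool ShutdownImmediatelyAfterInit() {
    TimerThreadSketch t;
    t.Init();
    t.Shutdown();
    return t.Finished();
}
```

Because Run() never writes the flag, it cannot overwrite a shutdown request that arrived early, so the join completes whichever thread gets scheduled first.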

With either the added sleep() call or the mShutdown patch, regxpcom no longer
hangs shutting down xpcom.  Instead, the last few lines that it prints are as
follows:

*** Registering irc protocol handler.
nNCL: registering deferred (0)
nNCL: registering deferred (0)
Getting service on shutdown. Denied.
  ContractID: @mozilla.org/js/xpc/ContextStack;1
	 IID: {a1339ae0-05c1-11d4-8f92-0010a4e73d9a}
###!!! ASSERTION: Component Manager being held past XPCOM shutdown.: 'cnt ==
0', file nsXPComInit.cpp, line 582
###!!! Break: at file nsXPComInit.cpp, line 582

As far as I can tell, this is an unrelated problem. It may be bug 135330
rearing its head; the source distribution I'm using is from 4/3/2002.
the getting @ shutdown is bug 134728

in general, shutdown problems have *many* bugs, although searching for bugs 
filed by me is a good start.
->Doug
Assignee: gagan → dougt
Comment on attachment 78085 [details] [diff] [review]
Proposed TimerThread.cpp, TimerThread.h patch

r=dougt.  Thanks for fixing this.
Attachment #78085 - Flags: review+
brendan, can you super-review?  You're in the cvs blame for a lot of this code.
Target Milestone: --- → mozilla1.0
I've applied this patch to my mozilla 0.9.9 tree, and regxpcom no longer hangs
on package creation.
Was it hanging without this patch?  We've had solaris nightlies for the past few
days (since the 12th apparently). So either the problem resolved itself or
someone added a workaround to the build automation, which I don't see.
Yes, it would consistently hang on building 0.9.8 and 0.9.9 without this patch
on solaris 7
Comment on attachment 78085 [details] [diff] [review]
Proposed TimerThread.cpp, TimerThread.h patch

sr=brendan@mozilla.org

dougt: I took cvsblame in making fixes to pavlov's busted threading code, but I
won't take all blame here.  I do feel pretty foolish for taking this stuff so
close to 1.0 (0.9.8, IIRC -- at least I made pav wait till then, instead of
checking in on the last day of 0.9.7 as he wanted to).

/be
Attachment #78085 - Flags: superreview+
Checked into the trunk:

Checking in TimerThread.cpp;
/cvsroot/mozilla/xpcom/threads/TimerThread.cpp,v  <--  TimerThread.cpp
new revision: 1.12; previous revision: 1.11
done
Checking in TimerThread.h;
/cvsroot/mozilla/xpcom/threads/TimerThread.h,v  <--  TimerThread.h
new revision: 1.4; previous revision: 1.3
done
Status: NEW → ASSIGNED
Whiteboard: Needs to land on branch
Comment on attachment 78085 [details] [diff] [review]
Proposed TimerThread.cpp, TimerThread.h patch

a=rjesup@wgate.com for branch checkin
Attachment #78085 - Flags: approval+
Checked into branch.

Checking in TimerThread.cpp;
/cvsroot/mozilla/xpcom/threads/TimerThread.cpp,v  <--  TimerThread.cpp
new revision: 1.6.4.4; previous revision: 1.6.4.3
done
Checking in TimerThread.h;
/cvsroot/mozilla/xpcom/threads/TimerThread.h,v  <--  TimerThread.h
new revision: 1.3.4.2; previous revision: 1.3.4.1
done

Kenneth, thank you for the patch.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago → 22 years ago
Resolution: --- → FIXED
adding fixed1.0.0 keyword (branch resolution). This bug has comments saying it
was fixed on the 1.0 branch and a bonsai checkin comment that agrees. To verify
the bug has been fixed on the 1.0 branch please replace the fixed1.0.0 keyword
with verified1.0.0.
Keywords: fixed1.0.0
updating component and qa...
From reading carefully, it seems like this goes to XPCOM Registry. Also, the
summary seems out of date; is PAC really the root cause of this?
Component: Networking → XPCOM Registry
QA Contact: benc → dougt
Component: XPCOM Registry → XPCOM
QA Contact: doug.turner → xpcom
Keywords: qawanted