Bug 21556: crash on SMP systems: socket transport in load group
Status: VERIFIED FIXED (Opened 25 years ago, Closed 25 years ago)
Categories: Core :: Networking, defect, P3
Tracking: M14
People: Reporter: bsemrad, Assigned: warrensomebody
Details: Keywords: crash, Whiteboard: [PDT+] w/b minus on 3/7 [have fixes!]
Attachments: 2 files
System specs:
Dual PII 400 running Windows NT 4.0 Service Pack 6a, 128 Meg RAM, Tons of hard
drive. This is also a fairly young install of NT (about 4 months). Communicator
4.7 never crashes on this machine. This particular mozilla build is from
12-10-99 but Mozilla has been acting this way for me for at least a month or so.
I always remove the mozregistry.dat file, the user account directory that gets
created and the entire Moz directory every time I re-install a new daily build.
I have not modified the bookmarks or any other configuration other than to
accept the defaults at initial system startup. I'm not sure if this makes much
difference but I'm accessing the net on my NT box through a linux masquerading
system attached to a cable modem.
Problem:
Mozilla seems very unstable on my SMP system (PC specs above). Besides crashing
about 50% of the time on startup, I usually (90% of the time) get an exception
within 60 seconds of browser startup. Occasionally I can just start Mozilla and
let it sit for a minute or so and it will get a read exception while displaying
the initial mozilla.org web site. I tried this just now but couldn't get it to
reproduce within a few minutes or so. To get it to crash I can usually just type
in "http://www.slashdot.org" or "http://www.linuxworld.com" into the url bar and
press enter. Then, during the display of the home page of either of these sites
Mozilla will usually get an exception before the main page is completely
displayed. In my experience either of these sites will crash Mozilla about
30%-50% of the time.
Reproducing the crash:
Edit the url to be one of the above websites and hit enter. If Mozilla doesn't
crash put the cursor on the url bar and hit enter again. I can usually get it to
crash within the first couple of tries on either web site. I noticed that it
seems much more likely to crash the first few times I visited the site but it
may be my imagination. I went to each site about 10 times just now and got it to
crash about 4 times on each one. The dialog that popped up notifying me of the
exception seemed to be somewhat consistent in that it seemed to be crashing and
displaying the same exception message about every other time.
Comment 1•25 years ago
Adding some multi-threading gurus/perps to the cc list.
/be
Comment 2•25 years ago
One problem is 18110. (Jan, I think that this is your reproducible testcase)
Comment 3•25 years ago
Service Pack 6a. Wow.
We should try and find a developer with that service pack to see where we're
crashing.
Comment 4•25 years ago
bsemrad@adsoft.net: could you attach a Dr Watson log from Windows NT?
Comment 5•25 years ago (Reporter)
Here is an excerpt from an email that I sent to dougt@netscape.com about the
crash on my machine.
I went ahead and downloaded the source for Mozilla dated on 12-13-99 and
compiled it and then ran it. Below is a copy of the stack trace of the crash
when I tried to go to www.slashdot.org.
nsCOMPtr<nsProxyObject>::assign_with_AddRef(nsISupports * 0x02f69060) line 759 +
9 bytes
nsCOMPtr<nsProxyObject>::operator=(nsProxyObject * 0x02f69060) line 516
nsProxyObjectCallInfo::nsProxyObjectCallInfo(nsProxyObject * 0x02f69060,
nsXPTMethodInfo * 0x021ed670, unsigned int 3, nsXPTCVariant * 0x02f6a3d0,
unsigned int 4, PLEvent * 0x02f6a890) line 65
nsProxyObject::Post(unsigned int 3, nsXPTMethodInfo * 0x021ed670,
nsXPTCMiniVariant * 0x02d1fe18, nsIInterfaceInfo * 0x02f6e060) line 340 + 57
bytes
nsProxyEventObject::CallMethod(nsProxyEventObject * const 0x02f6f810, unsigned
short 3, const nsXPTMethodInfo * 0x021ed670, nsXPTCMiniVariant *
0x02d1fe18) line 391 + 55 bytes
PrepareAndDispatch(nsXPTCStubBase * 0x02f6f810, unsigned int 3, unsigned int *
0x02d1fecc, unsigned int * 0x02d1feb8) line 100 + 31 bytes
SharedStub() line 125
------------------------------------------
Doug then emailed me with the following:
Thanks for the great work. This indeed is bug 18110.
I told Doug that I might have a go at fixing it but it has been several days
since I told him that and I haven't yet had time to look at it seriously so you
should probably not count on me for this one.
Updated•25 years ago
Assignee: leger → dp
Component: Browser-General → XPCOM
Comment 6•25 years ago
Updated•25 years ago
Assignee: dp → dougt
Comment 7•25 years ago
I've been seeing crashes in assign_with_AddRef under SMP Linux as well. RH6.0 +
gcc 2.95.2 + binutils 2.9.1.0.25 + gtk 1.2.5 + glibc 2.1.2 (from RH6.1).
Kernels 2.2.12-2.2.14pre17. On some pages (http://userfriendly.org/,
http://cnn.com/), I can just let the main page load, not touch the browser,
switch to another workspace (using WindowMaker) and the browser will crash
within 30secs. This is from last night's testing with the M12 fullcircle build.
Comment 9•25 years ago
I've crashed about 90% of the time loading http://userfriendly.org/
getting one of these stacks.
Redhat 6.1, dual pentium II 450
Linux localhost.localdomain 2.2.12-20smp #1 SMP Mon Sep 27 10:34:45 EDT 1999
i686 unknown
gtk+-1.2.5-2
gcc version egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)
binutils-2.9.1.0.23-6
glibc-2.1.2-11
#0 0x3f in ?? ()
#1 0x40529bf1 in nsOnStopRequestEvent::~nsOnStopRequestEvent ()
from /home/endico/mozilla/mozilla/dist/bin/components/libnecko.so
#2 0x4052962c in nsStreamListenerEvent::DestroyPLEvent ()
from /home/endico/mozilla/mozilla/dist/bin/components/libnecko.so
#3 0x40176c6d in PL_DestroyEvent ()
from /home/endico/mozilla/mozilla/dist/bin/libplds3.so
#4 0x40176c46 in PL_HandleEvent ()
from /home/endico/mozilla/mozilla/dist/bin/libplds3.so
#5 0x40176b86 in PL_ProcessPendingEvents ()
from /home/endico/mozilla/mozilla/dist/bin/libplds3.so
#6 0x401471ce in ?? () from /home/endico/mozilla/mozilla/dist/bin/libxpcom.so
#7 0x405b6ac4 in event_processor_callback ()
from /home/endico/mozilla/mozilla/dist/bin/libwidget_gtk.so
#8 0x405b680f in our_gdk_io_invoke ()
from /home/endico/mozilla/mozilla/dist/bin/libwidget_gtk.so
#9 0x4086052a in g_io_unix_dispatch () from /usr/lib/libglib-1.2.so.0
#10 0x40861be6 in g_main_dispatch () from /usr/lib/libglib-1.2.so.0
#11 0x408621a1 in g_main_iterate () from /usr/lib/libglib-1.2.so.0
#12 0x40862341 in g_main_run () from /usr/lib/libglib-1.2.so.0
#13 0x4078c209 in gtk_main () from /usr/lib/libgtk-1.2.so.0
#14 0x405b7067 in nsAppShell::Run ()
from /home/endico/mozilla/mozilla/dist/bin/libwidget_gtk.so
#15 0x404d0c41 in nsAppShellService::Run ()
from /home/endico/mozilla/mozilla/dist/bin/libnsappshell.so
#16 0x804adf1 in main1 ()
#17 0x804b225 in main ()
#18 0x4025e1eb in ?? () from /lib/libc.so.6
#0 0x40333238 in main_arena () from /lib/libc.so.6
#1 0x4003af66 in ?? ()
from /home/endico/mozilla/mozilla/dist/bin/libraptorgfx.so
#2 0x4014ec84 in nsCOMPtr_base::assign_with_AddRef ()
from /home/endico/mozilla/mozilla/dist/bin/libxpcom.so
#3 0x4139d18f in nsCOMPtr<nsIChannel>::operator= ()
from /home/endico/mozilla/mozilla/dist/bin/components/libnecko_http.so
#4 0x41392de8 in nsHTTPRequest::~nsHTTPRequest ()
from /home/endico/mozilla/mozilla/dist/bin/components/libnecko_http.so
#5 0x41392ea0 in nsHTTPRequest::Release ()
from /home/endico/mozilla/mozilla/dist/bin/components/libnecko_http.so
#6 0x4138cba5 in nsHTTPChannel::~nsHTTPChannel ()
from /home/endico/mozilla/mozilla/dist/bin/components/libnecko_http.so
#7 0x4138cd63 in nsHTTPChannel::Release ()
from /home/endico/mozilla/mozilla/dist/bin/components/libnecko_http.so
#8 0x4052954e in nsStreamListenerEvent::~nsStreamListenerEvent ()
from /home/endico/mozilla/mozilla/dist/bin/components/libnecko.so
#9 0x40529bf1 in nsOnStopRequestEvent::~nsOnStopRequestEvent ()
from /home/endico/mozilla/mozilla/dist/bin/components/libnecko.so
#10 0x4052962c in nsStreamListenerEvent::DestroyPLEvent ()
from /home/endico/mozilla/mozilla/dist/bin/components/libnecko.so
#11 0x40176c6d in PL_DestroyEvent ()
from /home/endico/mozilla/mozilla/dist/bin/libplds3.so
#12 0x40176c46 in PL_HandleEvent ()
from /home/endico/mozilla/mozilla/dist/bin/libplds3.so
#13 0x40176b86 in PL_ProcessPendingEvents ()
from /home/endico/mozilla/mozilla/dist/bin/libplds3.so
#14 0x401471ce in nsEventQueueImpl::ProcessPendingEvents ()
from /home/endico/mozilla/mozilla/dist/bin/libxpcom.so
#15 0x405b6ac4 in event_processor_callback ()
from /home/endico/mozilla/mozilla/dist/bin/libwidget_gtk.so
#16 0x405b680f in our_gdk_io_invoke ()
from /home/endico/mozilla/mozilla/dist/bin/libwidget_gtk.so
#17 0x4086052a in g_io_unix_dispatch () from /usr/lib/libglib-1.2.so.0
#18 0x40861be6 in g_main_dispatch () from /usr/lib/libglib-1.2.so.0
#19 0x408621a1 in g_main_iterate () from /usr/lib/libglib-1.2.so.0
#20 0x40862341 in g_main_run () from /usr/lib/libglib-1.2.so.0
#21 0x4078c209 in gtk_main () from /usr/lib/libgtk-1.2.so.0
#22 0x405b7067 in nsAppShell::Run ()
from /home/endico/mozilla/mozilla/dist/bin/libwidget_gtk.so
#23 0x404d0c41 in nsAppShellService::Run ()
from /home/endico/mozilla/mozilla/dist/bin/libnsappshell.so
#24 0x804adf1 in main1 ()
#25 0x804b225 in main ()
#26 0x4025e1eb in __libc_start_main (main=0x804b044 <main>, argc=1,
argv=0xbffffac4, init=0x80493a4 <_init>, fini=0x804d6d8 <_fini>,
rtld_fini=0x4000a610, stack_end=0xbffffabc)
at ../sysdeps/generic/libc-start.c:90
Comment 10•25 years ago
Comment 11•25 years ago
added test case with 22 gif images. There was a theory that this problem
was due to animated gifs but reducing the test case to just two animated
gifs didn't cause a crash after 2 tries. I'm guessing that it has more to
do with having lots of threads running on different processors. The problem
may also have to do with one of the gifs being a lot bigger than the others.
It seemed like the userfriendly page had been done loading for a long time
but the throbber was still spinning. Apparently the extra time was being
spent loading extra frames on one of the animated gifs.
Comment 12•25 years ago
This is a dup of 18110 [dogfood] XPCOM/Proxy needs to be threadsafe!!
Comment 13•25 years ago
dougt, jband just whacked XPConnect to be threadsafe and otherwise refactored it
for correct thread-local vs. process-global, etc. considerations. Since the
xpcom proxy code sprang from the brow of XPConnect, perhaps his changes could
help safen xpcom/proxy. What's the prognosis?
/be
Updated•25 years ago
Status: NEW → ASSIGNED
Comment 14•25 years ago
Many of his changes can be massaged into xpcom/proxy. However, because of the
very nature of xpcom/proxy, we must do a really good job of protecting ourselves.
Simply applying his changes is not good enough.
Comment 15•25 years ago
*** Bug 22648 has been marked as a duplicate of this bug. ***
Updated•25 years ago
Summary: Mozilla crashes often on SMP systems. → [Dogfood] Mozilla crashes often on SMP systems.
Comment 16•25 years ago
ugh! Mozilla is pretty unusable for me any more at home on my smp box.
It crashes too much. Please please please get dougt an smp box to debug
with.
I noticed that looking at slashdot.org is causing problems too. It too
has lots of images/page and often uses animated gifs. I got this stack
after loading mozilla to home page, then loading slashdot.org, then
loading mozillazine and staying there a while. This is with this morning's
build.
#0 0x4017451a in nsProxyObject::Post (this=0x41d585f8, methodIndex=4,
methodInfo=0x407bbd0c, params=0xbf5ffa5c, interfaceInfo=0x41d00870)
at nsProxyEvent.cpp:423
#1 0x40176747 in nsProxyEventObject::CallMethod (this=0x419dbd28,
methodIndex=4, info=0x407bbd0c, params=0xbf5ffa5c)
at nsProxyEventObject.cpp:391
#2 0x40181924 in PrepareAndDispatch (self=0x419dbd28, methodIndex=4,
args=0xbf5ffb14) at xptcstubs_unixish_x86.cpp:92
#3 0x40181a4a in nsXPTCStubBase::Stub4 (this=0x419dbd28)
at ../../../../../../dist/include/xptcstubsdef.inc:6
#4 0x4061211b in ?? ()
from /home/endico/mozilla/mozilla/dist/bin/components/libnecko.so
#5 0x4060f4a0 in ?? ()
from /home/endico/mozilla/mozilla/dist/bin/components/libnecko.so
#6 0x40613307 in ?? ()
from /home/endico/mozilla/mozilla/dist/bin/components/libnecko.so
#7 0x401716b5 in nsThread::Main (arg=0x41b452c0) at nsThread.cpp:83
#8 0x402138fb in _pt_root (arg=0x41b16d98) at ptthread.c:157
#9 0x4022deca in pthread_start_thread (arg=0xbf5ffe60) at manager.c:213
Comment 17•25 years ago
*** Bug 18659 has been marked as a duplicate of this bug. ***
Comment 18•25 years ago
yet another stack trace. looked at mozilla.org, slashdot.org, www.benews.com.
It crashed at benews.com in the middle of scrolling after having sat there a
while. Maybe this is timer based. It seems like this morning's crashes are
happening after 10 minutes or so and some happen while the browser is idle.
#0 0x81e48e1 in ?? ()
#1 0x4019d670 in nsCOMPtr<nsProxyObject>::Assert_NoQueryNeeded (
this=0x4210bb30) at ../../../dist/include/nsCOMPtr.h:444
#2 0x4019d630 in nsCOMPtr<nsProxyObject>::operator= (this=0x4210bb30,
rhs=0x81ebd20) at ../../../dist/include/nsCOMPtr.h:516
#3 0x40173448 in nsProxyObjectCallInfo::nsProxyObjectCallInfo (
this=0x4210bb10, owner=0x81ebd20, methodInfo=0x407bbcbc, methodIndex=4,
parameterList=0x4210bad8, parameterCount=3, event=0x4210bab8)
at nsProxyEvent.cpp:63
#4 0x401743e0 in nsProxyObject::Post (this=0x81ebd20, methodIndex=4,
methodInfo=0x407bbcbc, params=0xbf5ffa5c, interfaceInfo=0x824a378)
at nsProxyEvent.cpp:374
#5 0x40176747 in nsProxyEventObject::CallMethod (this=0x81a4708,
methodIndex=4, info=0x407bbcbc, params=0xbf5ffa5c)
at nsProxyEventObject.cpp:391
#6 0x40181924 in PrepareAndDispatch (self=0x81a4708, methodIndex=4,
args=0xbf5ffb14) at xptcstubs_unixish_x86.cpp:92
#7 0x40181a4a in nsXPTCStubBase::Stub4 (this=0x81a4708)
at ../../../../../../dist/include/xptcstubsdef.inc:6
#8 0x4061211b in nsSocketTransport::fireStatus (this=0x81a73c8, aCode=5)
at nsSocketTransport.cpp:1897
#9 0x4060f4a0 in nsSocketTransport::Process (this=0x81a73c8, aSelectFlags=0)
at nsSocketTransport.cpp:539
#10 0x40613307 in nsSocketTransportService::Run (this=0x41b6c3b8)
at nsSocketTransportService.cpp:467
#11 0x401716b5 in nsThread::Main (arg=0x41b48728) at nsThread.cpp:83
#12 0x402138fb in _pt_root (arg=0x41b48eb0) at ptthread.c:157
#13 0x4022deca in pthread_start_thread (arg=0xbf5ffe60) at manager.c:213
Comment 19•25 years ago
Doug, can this be fixed in M13? Or at least get a Target Milestone.
/be
Comment 20•25 years ago
Here's another stack very similar to one before. It crashed sitting at
a slashdot article while i was away. Maybe it was reloading an ad? I
forget if their ads refresh themselves.
#0 0x85b579c in ?? ()
#1 0x4060e7ab in nsSocketTransport::~nsSocketTransport (this=0x85d75a8,
__in_chrg=3) at nsSocketTransport.cpp:223
#2 0x40610760 in nsSocketTransport::Release (this=0x85d75a8)
at nsSocketTransport.cpp:1191
#3 0x416b9eae in nsCOMPtr<nsIChannel>::assign_assuming_AddRef (
this=0x862bce8, newPtr=0x0) at ../../../../dist/include/nsCOMPtr.h:415
#4 0x416bea8c in nsCOMPtr<nsIChannel>::assign_with_AddRef (this=0x862bce8,
rawPtr=0x0) at ../../../../dist/include/nsCOMPtr.h:760
#5 0x416bf7e3 in nsCOMPtr<nsIChannel>::operator= (this=0x862bce8, rhs=0x0)
at ../../../../dist/include/nsCOMPtr.h:515
#6 0x416b05cf in nsHTTPRequest::~nsHTTPRequest (this=0x862bcd0, __in_chrg=3)
at nsHTTPRequest.cpp:140
#7 0x416b0720 in nsHTTPRequest::Release (this=0x862bcd0)
at nsHTTPRequest.cpp:151
#8 0x416a9cc5 in nsHTTPChannel::~nsHTTPChannel (this=0x85b7af8, __in_chrg=3)
at nsHTTPChannel.cpp:117
#9 0x416a9f22 in nsHTTPChannel::Release (this=0x85b7af8)
at nsHTTPChannel.cpp:127
#10 0x4060b78e in nsStreamListenerEvent::~nsStreamListenerEvent (
this=0x83fe668, __in_chrg=3) at nsAsyncStreamListener.cpp:77
#11 0x4060c091 in nsOnStopRequestEvent::~nsOnStopRequestEvent (this=0x83fe668,
__in_chrg=3) at nsAsyncStreamListener.cpp:257
#12 0x4060b8bf in nsStreamListenerEvent::DestroyPLEvent (aEvent=0x84ba438)
at nsAsyncStreamListener.cpp:104
#13 0x401d841b in PL_DestroyEvent (self=0x84ba438) at plevent.c:545
#14 0x401d83b9 in PL_HandleEvent (self=0x84ba438) at plevent.c:532
#15 0x401d827c in PL_ProcessPendingEvents (self=0x80aa1f8) at plevent.c:483
#16 0x4016faa9 in nsEventQueueImpl::ProcessPendingEvents (this=0x80aa1d0)
at nsEventQueue.cpp:201
#17 0x40830da4 in event_processor_callback (data=0x80aa1d0, source=6,
condition=GDK_INPUT_READ) at nsAppShell.cpp:141
#18 0x40830a2f in our_gdk_io_invoke (source=0x8156560, condition=G_IO_IN,
data=0x81f3308) at nsAppShell.cpp:54
#19 0x406d752a in g_io_unix_dispatch () from /usr/lib/libglib-1.2.so.0
#20 0x406d8be6 in g_main_dispatch () from /usr/lib/libglib-1.2.so.0
#21 0x406d91a1 in g_main_iterate () from /usr/lib/libglib-1.2.so.0
#22 0x406d9341 in g_main_run () from /usr/lib/libglib-1.2.so.0
#23 0x40907209 in gtk_main () from /usr/lib/libgtk-1.2.so.0
#24 0x408313a7 in nsAppShell::Run (this=0x8095350) at nsAppShell.cpp:304
#25 0x4058ffbd in nsAppShellService::Run (this=0x80a9fd0)
at nsAppShellService.cpp:465
#26 0x804bf3d in main1 (argc=1, argv=0xbffffba4) at nsAppRunner.cpp:609
#27 0x804c3c7 in main (argc=1, argv=0xbffffba4) at nsAppRunner.cpp:697
Comment 21•25 years ago
yet another stack trace. note the assert this time. It crashed while
loading http://www.mozilla.org/banners/ It seemed like it was done loading
but the throbber kept going.
Document http://www.mozilla.org/ loaded successfully
Document: Done (9.162 secs)
WEBSHELL+ = 4
Opening file signon.tbl failed
FindShortcut: in='http://www.mozilla.org/banners/' out='null'
###!!! ASSERTION: You can't dereference a NULL nsCOMPtr with operator->().:
'mRawPtr != 0', file ../../dist/include/nsCOMPtr.h, line 569
###!!! Break: at file ../../dist/include/nsCOMPtr.h, line 569
[Switching to Thread 16561]
Program received signal SIGSEGV, Segmentation fault.
0x4017451a in ?? () from /home/endico/mozilla/mozilla/dist/bin/libxpcom.so
(gdb) where
#0 0x4017451a in nsProxyObject::Post (this=0x407575f0, methodIndex=4,
methodInfo=0x816b65c, params=0xbf5ffa5c, interfaceInfo=0x8523ae8)
at nsProxyEvent.cpp:423
#1 0x40176747 in nsProxyEventObject::CallMethod (this=0x40705a90,
methodIndex=4, info=0x816b65c, params=0xbf5ffa5c)
at nsProxyEventObject.cpp:391
#2 0x40181924 in PrepareAndDispatch (self=0x40705a90, methodIndex=4,
args=0xbf5ffb14) at xptcstubs_unixish_x86.cpp:92
#3 0x40181a4a in nsXPTCStubBase::Stub4 (this=0x40705a90)
at ../../../../../../dist/include/xptcstubsdef.inc:6
#4 0x4061211b in nsSocketTransport::fireStatus (this=0x4073aed8, aCode=5)
at nsSocketTransport.cpp:1897
#5 0x4060f4a0 in nsSocketTransport::Process (this=0x4073aed8, aSelectFlags=0)
at nsSocketTransport.cpp:539
#6 0x40613307 in nsSocketTransportService::Run (this=0x407375e0)
at nsSocketTransportService.cpp:467
#7 0x401716b5 in nsThread::Main (arg=0x40738488) at nsThread.cpp:83
#8 0x402138fb in ?? () from /home/endico/mozilla/mozilla/dist/bin/libnspr3.so
#9 0x4022deca in pthread_start_thread (arg=0xbf5ffe60) at manager.c:213
Comment 22•25 years ago
Putting on PDT+ radar.
Comment 23•25 years ago
Comment 24•25 years ago
A single animated gif image is enough to cause a crash although it may take
a while. Load the image and wait. Eventually mozilla will crash. Sometimes
it crashes immediately, sometimes it takes an hour or more. Oddly, I found
that i get good stacks when i view an html file with a link to an animated
gif but the stack is corrupted if I type the url of the gif image directly
into the location bar.
It looks like a networking problem that happens to be exercised by animated
gifs because they aren't cached, and have to be downloaded from the source
over and over.
------------
viewing animated gif
------------
Document http://userfriendly.org/images/buttons/ufbook.gif loaded successfully
Document: Done (35.863 secs)
[Switching to Thread 25851]
Program received signal SIGSEGV, Segmentation fault.
0x84dc660 in ?? ()
(gdb) where
#0 0x84dc660 in ?? ()
#1 0x19d8f2bf in ?? ()
Cannot access memory at address 0x5ff8e808.
---------------
loaded test case 2, an html file that displays an animated gif
----------------
(gdb) where
#0 0x40d00138 in ?? ()
#1 0x416c8a9c in nsCOMPtr<nsIChannel>::assign_with_AddRef (this=0x40d22ad8,
rawPtr=0x0) at ../../../../dist/include/nsCOMPtr.h:760
#2 0x416c97f3 in nsCOMPtr<nsIChannel>::operator= (this=0x40d22ad8, rhs=0x0)
at ../../../../dist/include/nsCOMPtr.h:515
#3 0x416ba5df in nsHTTPRequest::~nsHTTPRequest (this=0x40d22ac0, __in_chrg=3)
at nsHTTPRequest.cpp:140
#4 0x416ba730 in nsHTTPRequest::Release (this=0x40d22ac0)
at nsHTTPRequest.cpp:151
#5 0x416b3cd5 in nsHTTPChannel::~nsHTTPChannel (this=0x418a2588, __in_chrg=3)
at nsHTTPChannel.cpp:117
#6 0x416b3f32 in nsHTTPChannel::Release (this=0x418a2588)
at nsHTTPChannel.cpp:127
#7 0x4060ca1e in nsStreamListenerEvent::~nsStreamListenerEvent (
this=0x853efc0, __in_chrg=3) at nsAsyncStreamListener.cpp:77
#8 0x4060d321 in nsOnStopRequestEvent::~nsOnStopRequestEvent (this=0x853efc0,
__in_chrg=3) at nsAsyncStreamListener.cpp:257
#9 0x4060cb4f in nsStreamListenerEvent::DestroyPLEvent (aEvent=0x84bae80)
at nsAsyncStreamListener.cpp:104
#10 0x401d941b in PL_DestroyEvent (self=0x84bae80) at plevent.c:545
#11 0x401d93b9 in PL_HandleEvent (self=0x84bae80) at plevent.c:532
#12 0x401d927c in PL_ProcessPendingEvents (self=0x80ab660) at plevent.c:483
#13 0x4016fc3c in nsEventQueueImpl::ProcessPendingEvents (this=0x80ab638)
at nsEventQueue.cpp:228
#14 0x406c1064 in event_processor_callback (data=0x80ab638, source=7,
condition=GDK_INPUT_READ) at nsAppShell.cpp:141
#15 0x406c0cef in our_gdk_io_invoke (source=0x811fda8, condition=G_IO_IN,
data=0x82344d0) at nsAppShell.cpp:54
#16 0x4087352a in g_io_unix_dispatch () from /usr/lib/libglib-1.2.so.0
#17 0x40874be6 in g_main_dispatch () from /usr/lib/libglib-1.2.so.0
#18 0x408751a1 in g_main_iterate () from /usr/lib/libglib-1.2.so.0
#19 0x40875341 in g_main_run () from /usr/lib/libglib-1.2.so.0
#20 0x4079c209 in gtk_main () from /usr/lib/libgtk-1.2.so.0
#21 0x406c1667 in nsAppShell::Run (this=0x808d038) at nsAppShell.cpp:304
#22 0x4059107d in nsAppShellService::Run (this=0x80ab438)
at nsAppShellService.cpp:465
#23 0x804bf3d in main1 (argc=1, argv=0xbffffba4) at nsAppRunner.cpp:609
#24 0x804c3c7 in main (argc=1, argv=0xbffffba4) at nsAppRunner.cpp:697
(gdb) print *this
No symbol "this" in current context.
(gdb) print this
No symbol "this" in current context.
(gdb) up
#1 0x416c8a9c in nsCOMPtr<nsIChannel>::assign_with_AddRef (this=0x40d22ad8,
rawPtr=0x0) at ../../../../dist/include/nsCOMPtr.h:760
760 assign_assuming_AddRef(NS_REINTERPRET_CAST(T*, rawPtr));
(gdb) print this
$1 = (nsCOMPtr<nsIChannel> *) 0x0
(gdb) print *this
Cannot access memory at address 0x0.
(gdb) up
#2 0x416c97f3 in nsCOMPtr<nsIChannel>::operator= (this=0x40d22ad8, rhs=0x0)
at ../../../../dist/include/nsCOMPtr.h:515
515 assign_with_AddRef(rhs);
(gdb) print this
$2 = (nsCOMPtr<nsIChannel> *) 0x40d22ad8
(gdb) print *this
$3 = {mRawPtr = 0x0}
(gdb) up
#3 0x416ba5df in nsHTTPRequest::~nsHTTPRequest (this=0x40d22ac0, __in_chrg=3)
at nsHTTPRequest.cpp:140
140 mTransport = null_nsCOMPtr();
(gdb) print this
$4 = (nsHTTPRequest *) 0x40d22ac0
(gdb) print *this
$5 = {<nsIStreamObserver> = {<nsISupports> = {
_vptr. = 0x416cfcc0 <nsHTTPRequest virtual table>}, <No data fields>},
<nsIRequest> = {<nsISupports> = {
_vptr. = 0x416cfc80 <nsHTTPRequest::nsIRequest virtual table>}, <No data
fields>}, mRefCnt = 1, mMethod = HM_GET, mURI = {mRawPtr = 0x40d693b8},
mVersion = HTTP_ONE_ZERO, mTransport = {mRawPtr = 0x0},
mConnection = 0x418a2588, mHeaders = {mHTTPHeaders = {
mRawPtr = 0x40d7dca8}}, mUsingProxy = 0, mRequestBuffer = {<nsStr> = {
mLength = 0, mCapacity = 128, mCharSize = eOneByte, mOwnsBuffer = 1, {
mStr = 0x40da2140 "", mUStr = 0x40da2140}},
_vptr. = 0x401b0084 <nsCString virtual table>}, mPostDataStream = {
mRawPtr = 0x0}}
(gdb) quit
Updated•25 years ago
Whiteboard: [PDT+] → [PDT+] need SMP machine
Comment 25•25 years ago
Doug is not looking at this until he gets his hands on a machine that exhibits
the problem. Anyone else want to take it? Anyone want to get another processor
for Doug?
Comment 26•25 years ago
Brian, do we have a machine that dougt can use to debug this? One of those
new solaris machines? (with purify?)
Comment 27•25 years ago (Assignee)
Here's the info I dug up on SMP boxes (for the ambitious):
Bill Law and dp have dual processor machines, but they're 200mhz and dp thinks
that's too slow. Rickg has a 733mhz (?!) machine but I'm not sure if
it's here or in san diego. Cyeh may be able to whip something up too. Alec says
he sees deadlocks running an MP kernel on a single-processor
machine.
Warren
Comment 28•25 years ago
If we think the problem is in xpcom/proxy, we could try a code review, even
before dougt sits in front of a fast SMP machine (or whatever). I'm up for it,
and jband would probably be willing to help.
/be
Comment 29•25 years ago
Brendan, et al.: we (jband and I) have already done this. I need to protect my
hash tables and proxyCallInfo class. I could merely just code these fixes and
check it in, but I would rather be able to verify (for myself) that the problem
does go away when I do this.
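A minimal sketch of what "protecting a hash table" can look like, assuming a plain mutex around every access. `GuardedTable` and `RunTableDemo` are hypothetical names for illustration; this uses the C++ standard library rather than the NSPR locks and tables that xpcom/proxy actually uses, and it is not the patch discussed in this bug.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a shared proxy lookup table (not an XPCOM type).
class GuardedTable {
public:
    // Insert or overwrite an entry while holding the lock.
    void Put(const std::string& key, int value) {
        std::lock_guard<std::mutex> guard(mLock);
        mMap[key] = value;
    }

    // Copy the value out under the lock; returns false if the key is absent.
    bool Get(const std::string& key, int* out) const {
        assert(out != nullptr);
        std::lock_guard<std::mutex> guard(mLock);
        auto it = mMap.find(key);
        if (it == mMap.end()) return false;
        *out = it->second;
        return true;
    }

    std::size_t Size() const {
        std::lock_guard<std::mutex> guard(mLock);
        return mMap.size();
    }

private:
    mutable std::mutex mLock;
    std::unordered_map<std::string, int> mMap;
};

// Several threads insert distinct keys concurrently; returns the final count.
std::size_t RunTableDemo(int threads, int perThread) {
    GuardedTable table;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&table, t, perThread] {
            for (int i = 0; i < perThread; ++i) {
                table.Put(std::to_string(t) + ":" + std::to_string(i), i);
            }
        });
    }
    for (auto& w : workers) w.join();
    return table.Size();
}
```

Calling `RunTableDemo(4, 100)` inserts 400 distinct keys from four threads. Without the lock, concurrent inserts into the same map are a data race and can corrupt it, which is easier to hit when the threads really run in parallel on an SMP machine.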
Comment 30•25 years ago
This is silly: you know of thread-safety bugs (MP or Uniprocessor, I dunno),
there are people in the Mozilla community being bitten by these bugs (including
endico@mozilla.org), but you don't wanna code the fixes until you can test 'em
yourself?
This is not the way of the Mozilla bazaar. Can you hack up fixes to the current
revs of the files, and attach cvs diff -u output to this bug, so others can at
least help test for ya? Thanks.
/be
Comment 31•25 years ago
An even faster way to reproduce this bug is to use mail. Opening a folder
with 2k messages took 5 tries because it made mozilla crash.
Comment 32•25 years ago
Attach a patch and i'll be happy to test it.
(And let me know what testing needs to be done)
Comment 33•25 years ago
Rebuilt from the top with dougt's changes and still crashed.
#0 0x4019eb33 in nsCOMPtr<nsProxyObject>::assign_with_AddRef (
this=0x40767028, rawPtr=0x8665848) at ../../../dist/include/nsCOMPtr.h:759
#1 0x4019ef67 in nsCOMPtr<nsProxyObject>::operator= (this=0x40767028,
rhs=0x8665848) at ../../../dist/include/nsCOMPtr.h:515
#2 0x40174b44 in nsProxyObjectCallInfo::nsProxyObjectCallInfo (
this=0x40767008, owner=0x8665848, methodInfo=0x816bfe0, methodIndex=3,
parameterList=0x40746e38, parameterCount=4, event=0x40766f90)
at nsProxyEvent.cpp:70
#3 0x40175b00 in nsProxyObject::Post (this=0x8665848, methodIndex=3,
methodInfo=0x816bfe0, params=0xbf5ffadc, interfaceInfo=0x86831c8)
at nsProxyEvent.cpp:384
#4 0x40177ff7 in nsProxyEventObject::CallMethod (this=0x8680668,
methodIndex=3, info=0x816bfe0, params=0xbf5ffadc)
at nsProxyEventObject.cpp:394
#5 0x40183184 in PrepareAndDispatch (self=0x8680668, methodIndex=3,
args=0xbf5ffb94) at xptcstubs_unixish_x86.cpp:92
#6 0x4018325e in nsXPTCStubBase::Stub3 (this=0x8680668)
at ../../../../../../dist/include/xptcstubsdef.inc:5
#7 0x4061343e in nsSocketTransport::doRead (this=0x868e328, aSelectFlags=1)
at nsSocketTransport.cpp:976
#8 0x40612755 in nsSocketTransport::Process (this=0x868e328, aSelectFlags=1)
at nsSocketTransport.cpp:512
#9 0x406166d7 in nsSocketTransportService::Run (this=0x40768b88)
at nsSocketTransportService.cpp:467
#10 0x40172d05 in nsThread::Main (arg=0x4073b4f8) at nsThread.cpp:83
#11 0x402158fb in _pt_root (arg=0x407469a0) at ptthread.c:157
#12 0x4022feca in pthread_start_thread (arg=0xbf5ffe60) at manager.c:213
Comment 34•25 years ago
I'm still crashing but things don't seem as fragile as before. I was able to
download my mailbox headers twice in a row without crashing. Last time it
crashed 4/5 times. Doug's changes seem to have made an improvement.
Comment 35•25 years ago (Assignee)
Probably the extra locks just slowed down the timing of things, shrinking the
window of vulnerability. Dawn -- sounds like we should get a debug build/env on
your machine so that we can diagnose the problem when it happens. Can you set
that up?
Comment 36•25 years ago
I just got a crash with a fresh tree on a dual 350 PII running linux, here's a
stack trace.
Program received signal SIGSEGV, Segmentation fault.
0x40175c3a in nsProxyObject::Post (this=0x860ff28, methodIndex=4,
methodInfo=0x812ac44, params=0xbf5ffa38, interfaceInfo=0x849b158)
at nsProxyEvent.cpp:433
433 mDestQueue->PostEvent(event);
(gdb) bt
#0 0x40175c3a in nsProxyObject::Post (this=0x860ff28, methodIndex=4,
methodInfo=0x812ac44, params=0xbf5ffa38, interfaceInfo=0x849b158)
at nsProxyEvent.cpp:433
#1 0x40177ff7 in nsProxyEventObject::CallMethod (this=0x862c7f0,
methodIndex=4, info=0x812ac44, params=0xbf5ffa38)
at nsProxyEventObject.cpp:394
#2 0x40183184 in PrepareAndDispatch (self=0x862c7f0, methodIndex=4,
args=0xbf5ffaf0) at xptcstubs_unixish_x86.cpp:92
#3 0x401832aa in nsXPTCStubBase::Stub4 (this=0x862c7f0)
at ../../../../../../dist/include/xptcstubsdef.inc:6
#4 0x4060a4eb in nsSocketTransport::fireStatus (this=0x862c900, aCode=3)
at nsSocketTransport.cpp:1903
#5 0x40607860 in nsSocketTransport::Process (this=0x862c900, aSelectFlags=0)
at nsSocketTransport.cpp:539
#6 0x4060b0c6 in nsSocketTransportService::ProcessWorkQ (this=0x84f64d0)
at nsSocketTransportService.cpp:259
#7 0x4060b794 in nsSocketTransportService::Run (this=0x84f64d0)
at nsSocketTransportService.cpp:493
#8 0x40172d05 in nsThread::Main (arg=0x84f6810) at nsThread.cpp:83
#9 0x402158fb in _pt_root (arg=0x85bf110) at ptthread.c:157
#10 0x4022feca in pthread_start_thread (arg=0xbf5ffe60) at manager.c:213
(gdb) print this
$2 = (nsProxyObject *) 0x860ff28
(gdb) print *this
$3 = {<nsISupports> = {_vptr. = 0x883ce90}, mRefCnt = 140573992,
mProxyType = 6, mDestQueue = {mRawPtr = 0x0},
mRealObject = {<nsCOMPtr_base> = {mRawPtr = 0x0}, <No data fields>},
mLock = 0x882a7b8}
As far as I can tell, "this" was destroyed while one thread was executing
this->Post(), since there's a check for !mDestQueue at the beginning of
nsProxyObject::Post(), so this should not happen...
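The failure mode described above (an object destroyed while one of its own methods is still running) is commonly avoided by taking a strong reference for the duration of the call. The sketch below illustrates that idea with `std::shared_ptr` instead of XPCOM's `AddRef`/`Release`. `Proxy` and `RunKeepAliveDemo` are hypothetical names, and the example is deliberately single-threaded so that it is deterministic; in the reported crash the last release would come from another thread.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical proxy type for illustration only; not nsProxyObject.
class Proxy : public std::enable_shared_from_this<Proxy> {
public:
    explicit Proxy(bool* destroyedFlag) : mDestroyed(destroyedFlag) {
        assert(destroyedFlag != nullptr);
    }
    ~Proxy() { *mDestroyed = true; }

    // Runs `work`, which may drop the caller's last reference to this object.
    // Returns true if the object was still alive after `work` finished.
    bool Post(const std::function<void()>& work) {
        // Hold a strong reference for the whole call so `this` cannot be
        // destroyed underneath us.
        std::shared_ptr<Proxy> keepAlive = shared_from_this();
        work();
        return !*mDestroyed;  // members are still safe to touch here
    }

private:
    bool* mDestroyed;
};

// Returns true if the proxy survived until Post() returned and was
// destroyed only afterwards.
bool RunKeepAliveDemo() {
    bool destroyed = false;
    std::shared_ptr<Proxy> owner = std::make_shared<Proxy>(&destroyed);
    Proxy* raw = owner.get();
    // The work item releases the only external reference mid-call.
    bool aliveDuringCall = raw->Post([&owner] { owner.reset(); });
    return aliveDuringCall && destroyed;
}
```

Without the `keepAlive` local, `owner.reset()` would run the destructor in the middle of `Post()`, and the later member access would touch freed memory, which matches the garbage `mRefCnt` and null `mDestQueue` in the dump above.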
Comment 37•25 years ago
Doug,
Looking at EventHandler (shouldn't this be static or something?)...
http://lxr.mozilla.org/seamonkey/source/xpcom/proxy/src/nsProxyEvent.cpp#460
...I see that you are holding a per object lock while invoking
XPTC_InvokeByIndex. This seems excessive and/or dangerous. Aren't you then
precluding reentrant calls via the proxy on the proxied object? Do you really
need to protect more than your shared tables of information about the proxies
and the refcount management of the proxies themselves?
I think that you should limit the scope of all locks to the bare minimum that is
absolutely required so that you decrease the chance of deadlocks or NSPR
assertions on attempts to reenter a non-reentrant lock.
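The suggestion above can be sketched as follows: hold the lock only while copying shared state, then make the call with the lock released so that reentrant calls do not self-deadlock. `Dispatcher` and `RunReentrantDemo` are hypothetical names, and `std::mutex` stands in for an NSPR lock; this is an illustration of the pattern, not the actual nsProxyEvent.cpp code.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical dispatcher for illustration; not the real proxy event handler.
class Dispatcher {
public:
    using Handler = std::function<int(Dispatcher&, int)>;

    void SetHandler(Handler handler) {
        std::lock_guard<std::mutex> guard(mLock);
        mHandler = std::move(handler);
    }

    // Invokes the handler with the lock released, so the handler may call
    // back into Invoke() or SetHandler() on the same thread.
    int Invoke(int depth) {
        Handler handler;
        {
            // Narrow scope: the lock protects only the copy of shared state.
            std::lock_guard<std::mutex> guard(mLock);
            handler = mHandler;
        }
        return handler ? handler(*this, depth) : 0;
    }

private:
    std::mutex mLock;  // non-recursive: relocking on the same thread is an error
    Handler mHandler;
};

// Reentrant use: the handler calls Invoke() again until depth reaches zero,
// and the result counts how many nested calls were made.
int RunReentrantDemo(int depth) {
    assert(depth >= 0);
    Dispatcher d;
    d.SetHandler([](Dispatcher& self, int n) {
        return n == 0 ? 0 : 1 + self.Invoke(n - 1);
    });
    return d.Invoke(depth);
}
```

If `Invoke()` kept `mLock` held across the handler call, the first nested `Invoke()` would try to lock a non-recursive mutex it already owns, which is the kind of deadlock or lock assertion the comment warns about.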
Updated•25 years ago
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → DUPLICATE
Comment 38•25 years ago
Good catch, both event handlers need to be static. The scope of the locks needs
to be reduced.
marking this bug as a dup of 18110
*** This bug has been marked as a duplicate of 18110 ***
Comment 39•25 years ago
On a Linux SMP machine Mozilla M13 crashes almost immediately. It also crashes
while you are doing nothing.
Status: RESOLVED → REOPENED
Comment 40•25 years ago
anssi@bigfoot.com, why was this reopened if it is in fact a duplicate of 18110?
Your comments don't argue that it is a separate bug from 18110, so I don't see
the point in reopening. Resolving it as a duplicate doesn't mean that the bug
it describes, duplicated by an earlier bugzilla report, is fixed -- it just
means we know that the newer bug is a dup.
/be
Comment 41•25 years ago
Clearing DUPLICATE resolution due to reopen.
Comment 42•25 years ago
closing. see other bug.
Status: REOPENED → RESOLVED
Closed: 25 years ago → 25 years ago
Comment 43•25 years ago
I brought my box in to work today and let dougt hack.
He thinks this may actually be a duplicate of 24711.
Comment 44•25 years ago
Re-opening because this bug turned out not to be a duplicate
of 18110. Marking as dependent on 24711 and removing dependency on 18110.
Assignee | ||
Comment 47•25 years ago
Moving to m14.
Updated•25 years ago
Summary: [Dogfood] Mozilla crashes often on SMP systems. → Mozilla crashes often on SMP systems.
Assignee | ||
Comment 50•25 years ago
Why is this considered Networking now? It's purely a proxy problem, isn't it? It
could affect anything.
And why is this owned by Gagan?
Comment 51•25 years ago
No. This is the problem with having socket transports in the load group.
The second onStop() crashes SMP machines.
Assignee | ||
Comment 52•25 years ago
Changing summary from: Mozilla crashes often on SMP systems.
To: crash on SMP systems: socket transport in load group
Reassigning to Rick Potts because I think he's working on this now.
Assignee: gagan → rpotts
Summary: Mozilla crashes often on SMP systems. → crash on SMP systems: socket transport in load group
Comment 53•25 years ago
hey doug,
are you sure that there is a SocketTransport sitting in a load group? I would
have thought that that was not possible...
-- rick
Comment 54•25 years ago
gagan and jud are in the know.
Comment 55•25 years ago
This is not Windows-only; I've been seeing this on Linux for a while too. Changing
OS and Platform...
OS: Windows NT → All
Hardware: PC → All
Comment 56•25 years ago
Status whiteboard says you need an SMP machine. Hasn't dougt's arrived yet?
Mozilla is pretty useless for me at home until this bug gets fixed. I could
bring the machine in again but the last time I tried that the motherboard
fried.
Comment 57•25 years ago
Hey Rick; I'm seeing these crashes _constantly_ on my home machine. Almost any
page I visit will eventually end up in this state. Sometimes it's just visiting
the page, sometimes it's when I leave the page, sometimes it's just sitting idle
(so to speak). I'll start forwarding stack traces.
Comment 58•25 years ago
Here's an *all-too-typical* stack trace on my SMP/NT box...
nsStreamListenerEvent::~nsStreamListenerEvent() line 77 + 24 bytes
nsOnStopRequestEvent::~nsOnStopRequestEvent() line 258 + 8 bytes
nsOnStopRequestEvent::`scalar deleting destructor'(unsigned int 1) + 15 bytes
nsStreamListenerEvent::DestroyPLEvent(PLEvent * 0x02fe63e0) line 104 + 30 bytes
PL_DestroyEvent(PLEvent * 0x02fe63e0) line 549 + 10 bytes
PL_HandleEvent(PLEvent * 0x02fe63e0) line 536 + 9 bytes
PL_ProcessPendingEvents(PLEventQueue * 0x02382cd0) line 487 + 9 bytes
_md_EventReceiverProc(HWND__ * 0x003e0550, unsigned int 49342, unsigned int 0,
long 37235920) line 975 + 9 bytes
USER32! 77e71820()
02382cd0()
I'm certainly willing to drive this machine remotely if someone wants to try to
debug this problem.
Assignee | ||
Comment 59•25 years ago
Line 77 looks like the release of mContext or possibly mChannel, the line above
it. Rickg: Can you see if one of these looks like it has already been deleted?
Maybe we've got a race between an addref on one thread and a release on this one.
Comment 60•25 years ago
For that particular stack trace, it is possible that the crash is happening on
the NS_RELEASE(mContext) because mContext has already been deleted!
It turns out that mContext is really an nsHTTPChannel. Unfortunately,
nsHTTPChannel *does not* have thread-safe implementations of AddRef() and
Release()...
Since these methods are called on multiple threads (i.e. socket transport and UI)
there can be problems :-)
I'll check in a fix to make AddRef() and Release() thread-safe and we'll see if
things get any better...
Are you seeing any other stack traces?
Comment 61•25 years ago
I've just checked in thread-safe AddRef/Release implementations for
nsHTTPChannel, nsHTTPResponseListener, nsHTTPRequest and nsHTTPEncodeStream.
I suspect that other nsIInputStream implementations (besides nsHTTPEncodeStream)
will need thread-safe Addref/Release implementations... In particular the
"string stream"
Assignee | ||
Comment 62•25 years ago
Rick,
I've never understood how making addref and release threadsafe really solved
things. If one thread might be doing the last release while another is trying to
addref, there's obviously some higher-level synchronization needed, isn't there?
Or maybe it's just that the thread doing the release shouldn't have been the
final release -- but the refcount got tromped somewhere along the way. It still
seems like more than the refcount needs to be protected in this case.
Warren
Comment 63•25 years ago
One way it can help is that this threadsafety code makes the manipulation of the
refcount atomic. If you have one release happening when another addref is going
on then the release *might* set the refcount to a lower number than it should be
- ignoring the addref's change; i.e. --refcnt is really (get, decrement, store).
If another thread changes the refcount in the middle of that non-atomic set of
actions then you can stomp its change. Only later does that get you when the
'final' release comes when the refcount should really not be zero yet.
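The (get, decrement, store) interleaving described above can be replayed deterministically on one thread. This is only an illustrative sketch: `LostUpdateReplay` and `AtomicReplay` are hypothetical helpers, and `std::atomic` stands in for whatever atomic primitive the threadsafe macros actually use.

```cpp
#include <atomic>
#include <cassert>

// Replays, on one thread and in a fixed order, a non-atomic Release
// (load, decrement, store) with an AddRef landing between its load and
// its store.
inline int LostUpdateReplay(int initial) {
    int refcnt = initial;

    int releaseTmp = refcnt;      // thread A: load for --refcnt
    int addrefTmp = refcnt;       // thread B: load for ++refcnt
    addrefTmp = addrefTmp + 1;    // thread B: increment
    refcnt = addrefTmp;           // thread B: store
    releaseTmp = releaseTmp - 1;  // thread A: decrement its stale copy
    refcnt = releaseTmp;          // thread A: store, overwriting B's AddRef

    return refcnt;
}

// The same two operations as indivisible read-modify-write steps.
inline int AtomicReplay(int initial) {
    std::atomic<int> refcnt(initial);
    refcnt.fetch_add(1);  // AddRef
    refcnt.fetch_sub(1);  // Release
    return refcnt.load();
}
```

Starting from 2, the replayed race ends at 1 instead of 2, so a later "final" release arrives one step early.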
Comment 64•25 years ago
Warren,
The race you worried about is not a problem. The only time folks should be
messing with an object is IF they already have done an addref. There is no
chance that a thread is "about to do an addref" on an object unless that thread
*has* an outstanding addref ahead of time. Hence there is no risk from some
other thread doing a decref (the count is already at least 2, one for each
thread handling the object).
On some platforms, you can get some guarantees about atomic actions for some
class of integers. Waldemar looked into this a LOT for multiprocessor machines,
and can probably chime in with potential answers. If the action is not atomic
(as pointed out by jband), then there is a big risk of losing either an
increment, or a decrement :-(.
Adding Waldemar to this thread in case he has suggestions.
Assignee | ||
Comment 65•25 years ago
My point is that if 2 threads are manipulating the same channel, then the
channel better be protecting the state for other operations, not just
addref/release.
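To illustrate the point, here is a small sketch in which an atomic refcount and a separate lock for the rest of the state live side by side. `DemoChannel` and its members are hypothetical and are not the real nsHTTPChannel; the only claim is that the refcount's atomicity does nothing for `mHeaders`, which needs its own protection.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical channel-like class; not the real nsHTTPChannel.
class DemoChannel {
public:
    void AddRef() { mRefCnt.fetch_add(1); }
    int RefCount() const { return mRefCnt.load(); }

    // Other state is guarded separately from the refcount.
    void AppendHeader(const std::string& header) {
        std::lock_guard<std::mutex> guard(mStateLock);
        mHeaders.push_back(header);
    }

    std::size_t HeaderCount() {
        std::lock_guard<std::mutex> guard(mStateLock);
        return mHeaders.size();
    }

private:
    std::atomic<int> mRefCnt{1};
    std::mutex mStateLock;              // protects mHeaders
    std::vector<std::string> mHeaders;
};

// Two threads mutate the same channel's state concurrently.
inline std::size_t AppendFromTwoThreads(int perThread) {
    DemoChannel channel;
    auto worker = [&channel, perThread] {
        for (int i = 0; i < perThread; ++i) {
            channel.AppendHeader("X-Demo");
        }
    };
    std::thread a(worker);
    std::thread b(worker);
    a.join();
    b.join();
    return channel.HeaderCount();
}
```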
Comment 66•25 years ago
The issue that I've seen in the past with non-threadsafe Addref/Release is that
the refcount can prematurely go to zero. For example, if an object has a
reference count of two and two threads call Release() simultaneously, there is a
chance that the --mRefCount will be executed on each thread *before* either one
checks for 0. In this case, both threads will see (mRefCount == 0) and delete
the object.
This double deletion was the whole reason that I added the NS_IMPL_THREADSAFE
macros to nsISupportsUtils.h
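A sketch of how an atomic decrement avoids the double delete, assuming standard C++ atomics instead of the NS_IMPL_THREADSAFE macros (whose bodies are not shown in this bug). `Counted` and `ReleaseFromManyThreads` are hypothetical. The key is that each caller decides from the value returned by its own decrement and never re-reads the shared count.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical refcounted object; the destructor bumps a counter so a
// caller can observe how many times deletion happened.
class Counted {
public:
    Counted(std::atomic<int>* deletions, int initialRefs)
        : mRefCnt(initialRefs), mDeletions(deletions) {}

    void AddRef() { mRefCnt.fetch_add(1, std::memory_order_relaxed); }

    // fetch_sub returns the value *before* the decrement, so only the one
    // caller that saw 1 may delete.
    void Release() {
        if (mRefCnt.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete this;
        }
    }

private:
    ~Counted() { mDeletions->fetch_add(1); }

    std::atomic<int> mRefCnt;
    std::atomic<int>* mDeletions;
};

// Each thread owns one reference and drops it; returns the delete count.
inline int ReleaseFromManyThreads(int threads) {
    std::atomic<int> deletions(0);
    Counted* obj = new Counted(&deletions, threads);
    std::vector<std::thread> pool;
    for (int i = 0; i < threads; ++i) {
        pool.emplace_back([obj] { obj->Release(); });
    }
    for (auto& t : pool) {
        t.join();
    }
    return deletions.load();
}
```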
Assignee | ||
Comment 67•25 years ago
Ok. What other channel implementations need this same fix?
Comment 68•25 years ago
I'm not seeing crashes at home, but as of a day or two ago, I can no longer load
any remote pages on my machine at home (SMP machine).
Comment 69•25 years ago
warren,
I think that we should examine the File Transport as well... Basically, any
pointer that is Addref/Released on another thread requires thread-safe ISupports
implementations...
Typically, these are the internal nsIStreamListener implementations and the
streams...
I was thinking of adding some assertions to the non-threadsafe AddRef/Release
macros which assert if they are ever called on multiple threads... Do you think
this would be useful ?
I used to have some debug macros, along the lines of NS_ENSURE_THREADSAFE(...)
which could be used to verify that method arguments were threadsafe, but they
required using an NS_IMPL_THREADSAFE_QI macro... Troy whined endlessly about
that so I removed it :-(
However, I could make the checking completely transparent if I added an 'owning
thread' pointer as a data member in NS_DECL_ISUPPORTS (for debug only)
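A rough sketch of the 'owning thread' idea, assuming standard C++ threads. `OwningThreadCheck` and `NonThreadsafeObject` are hypothetical names and do not reflect how NS_DECL_ISUPPORTS is actually defined. The object remembers the thread that created it, and the non-threadsafe AddRef reports misuse when it is called from anywhere else.

```cpp
#include <cassert>
#include <thread>

// Hypothetical debug helper that records the creating thread.
class OwningThreadCheck {
public:
    OwningThreadCheck() : mOwner(std::this_thread::get_id()) {}
    bool CalledOnOwner() const {
        return std::this_thread::get_id() == mOwner;
    }

private:
    std::thread::id mOwner;
};

class NonThreadsafeObject {
public:
    // Returns false instead of asserting so the misuse can be observed;
    // a debug build would more likely assert here.
    bool AddRef() {
        if (!mCheck.CalledOnOwner()) {
            return false;
        }
        ++mRefCnt;
        return true;
    }
    int RefCount() const { return mRefCnt; }

private:
    OwningThreadCheck mCheck;
    int mRefCnt = 0;
};

// Tries an AddRef from a second thread and reports whether it was allowed.
inline bool AddRefFromOtherThread(NonThreadsafeObject& obj) {
    bool ok = true;
    std::thread t([&] { ok = obj.AddRef(); });
    t.join();
    return ok;
}
```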
Comment 70•25 years ago
I'm nominating bug #24642 and #26686 as dups of this bug. What do people think?
Comment 71•25 years ago
Need to fix by 03/03 for beta1 train.
QA Contact: leger → tever
Whiteboard: [PDT+] need SMP machine → [PDT+] Must fix by 03/03 need SMP machine
Comment 72•25 years ago
*** Bug 24642 has been marked as a duplicate of this bug. ***
Comment 73•25 years ago
Rick's comment about two threads doing simultaneous decrefs, and then both thinking
it was their job to do the delete (because they checked non-atomically for
a zero after the decref), is really scary :-(.
Do we have this problem with many classes of objects, or is there a small set
that generally faces this evil handling on multiple threads?
Comment 74•25 years ago
...another question... if this is a problem on SMP, why are we not hitting it
on a single processor machine? Considering that task switching between threads
is pre-emptive, I'd expect a similar amount of risk of a conflict. What am I
missing? Is there a way to mark an executable to NOT use more than one
processor?? Would that give us a wimpy work-around for now???
Comment 75•25 years ago
hey jim,
I think that we *are* seeing this problem on single processor machines. Take
a look at bug #26686 and bug #24642. They both have very similar stack
traces...
I think that we are seeing it *more* on SMP boxes because we get more
concurrency... But the problem still exists on single processors...
Comment 76•25 years ago
Damn.... this is sounding more and more scary. I need to look at how other
systems deal with this while doing ref-counting. Ugh... this looks hard (but at
least that makes it interesting!!!! :-) ).
Whiteboard: [PDT+] Must fix by 03/03 need SMP machine → [PDT+] w/b minus on 03/03- need SMP machine
Comment 77•25 years ago
After some analysis, I've identified the following classes as being
un-threadsafe in their usage of Addref/Release. This analysis was *only* for
bringing up the browser - there are definitely more in FTP and IMAP :-(
For each of these classes, at least one instance is created on one thread and
then Addref/Released by another.
nsThread
nsLocalFileSystem
nsFileTransport
nsLocalFile
nsGenericModule
nsFileTransportService
nsProxyObject
nsInterfaceInfo
nsMIMEService
nsMIMEInfoImpl
nsBasicStringImpl
nsDNSService
nsIOService
nsEventQueueImpl
nsSupportsArray
AtomImpl
nsGenericFactory
Each of these classes needs to be analyzed to determine the extent of
un-threadsafe behavior beyond Addref() and Release()!!
Comment 78•25 years ago
Should we file separate bugs on each of these? Are you going to change the
above to use the thread safe version of addref/release?
Comment 79•25 years ago
Does anyone have a proposed patch to fix this? Maybe some changes to the
addref/release macros for everything? I don't crash here at home... I just
can't load pages. I can run mozilla remotely to my xserver at work if anyone
wants me to test this out.
Comment 80•25 years ago
The fix for Addref/Release is trivial. You simply need to use the:
NS_IMPL_THREADSAFE_ADDREF(...)
NS_IMPL_THREADSAFE_RELEASE(...)
macros.
The bigger question is if Addref/Release are being accessed on multiple threads,
what other members are also accessed - and not threadsafe!
I think that as we migrate these classes to use the THREADSAFE macros, we must
*also* do a careful analysis to determine the overall thread safety (and thread
exposure) of each class...
Comment 81•25 years ago
Pavlov, FYI: I'm running Linux at home on a dual 350MHz PII, I've never had
problems with loading remote pages (over a modem line) in mozilla (I update and
test almost daily), and mozilla hasn't crashed in a while either...
Assignee | ||
Comment 82•25 years ago
Rick: I'd eventually like to get your assertions for this into the tree too so
that the problem doesn't come up in the future (after we've analyzed and fixed
all these). Good work figuring out how to spot this.
Pavlov: What do you say we build us an SMP box out of our Dell 210s? I want to
make sure somebody has a machine in house that will exhibit these problems.
Don/Peter/dp: Do any of you have a spare Dell 210 that you can give up for a
while to make a multiprocessor out of? That would let me keep mine for
development. Thanks.
Comment 83•25 years ago
Damn, I shouldn't have said that! Now, I'm seeing a crash again, and I was able
to get a stack trace, the stacktrace is different from all the other ones in
this bug but I still think it belongs here.
#0 0x4059b090 in main_arena () from /lib/libc.so.6
#1 0x40042f7e in nsCOMPtr<nsIChannel>::assign_assuming_AddRef (
this=0x89d8150, newPtr=0x0) at ../../dist/include/nsCOMPtr.h:416
#2 0x41c183ac in nsCOMPtr<nsIChannel>::assign_with_AddRef (this=0x89d8150,
rawPtr=0x0) at ../../../../dist/include/nsCOMPtr.h:787
#3 0x41c18de3 in nsCOMPtr<nsIChannel>::operator= (this=0x89d8150, rhs=0x0)
at ../../../../dist/include/nsCOMPtr.h:526
#4 0x41c05ac0 in nsHTTPRequest::~nsHTTPRequest (this=0x89d8138, __in_chrg=3)
at nsHTTPRequest.cpp:146
#5 0x41c05c25 in nsHTTPRequest::Release (this=0x89d8138)
at nsHTTPRequest.cpp:154
#6 0x41bfe9b5 in nsHTTPChannel::~nsHTTPChannel (this=0x81c5660, __in_chrg=3)
at nsHTTPChannel.cpp:127
#7 0x41bfec33 in nsHTTPChannel::Release (this=0x81c5660)
at nsHTTPChannel.cpp:142
#8 0x40043a74 in nsCOMPtr<nsIChannel>::~nsCOMPtr (this=0xbffff2c4,
__in_chrg=2) at ../../dist/include/nsCOMPtr.h:434
#9 0x40e4fcc7 in nsDocLoaderImpl::DocLoaderIsEmpty (this=0x85c3918, aStatus=0)
at nsDocLoader.cpp:495
#10 0x40e4fb18 in nsDocLoaderImpl::OnStopRequest (this=0x85c3918,
aChannel=0x8c87db8, aCtxt=0x0, aStatus=0, aMsg=0x0) at nsDocLoader.cpp:437
#11 0x40706b52 in nsLoadGroup::RemoveChannel (this=0x85c3970,
channel=0x8c87db8, ctxt=0x0, status=0, errorMsg=0x0) at nsLoadGroup.cpp:535
#12 0x407405bb in nsFileChannel::OnStopRequest (this=0x8c87db8,
transportChannel=0x8c87ec8, context=0x0, aStatus=0, aMsg=0x0)
at nsFileChannel.cpp:450
#13 0x406efb0d in nsOnStopRequestEvent::HandleEvent (this=0x408ea618)
at nsAsyncStreamListener.cpp:282
#14 0x406ef1e7 in nsStreamListenerEvent::HandlePLEvent (aEvent=0x41dd6560)
at nsAsyncStreamListener.cpp:97
(More stack frames follow...)
Here's what it crashed on:
#1 0x40042f7e in nsCOMPtr<nsIChannel>::assign_assuming_AddRef (
this=0x89d8150, newPtr=0x0) at ../../dist/include/nsCOMPtr.h:416
416 NSCAP_RELEASE(oldPtr);
(gdb) print oldPtr
$6 = (nsIChannel *) 0x88346f4
(gdb) print *oldPtr
$7 = {<nsIRequest> = {<nsISupports> = {
_vptr. = 0x4059b088}, <No data fields>}, <No data fields>}
Still no problems loading remote pages tho...
Comment 84•25 years ago
I've got a 210. It isn't spare, but I could loan it out for a short time,
especially over the weekend.
Comment 85•25 years ago
For what it is worth, the xpcom log might help. It is enabled for release
builds too. Here is how you get it:
set env NSPR_LOG_MODULES=nsComponentManager:5
set env NSPR_LOG_FILE=xpcom.log
mozilla
Now you should have an xpcom.log. There is a sufficiently large chance that we
might be able to tell what is happening from the log.
Comment 86•25 years ago
It looks like the last stack trace is slightly different...
In this case, the last URL of the document has finished and the LoadGroup is
releasing its reference to the "document channel" (which is a nsHTTPChannel).
The nsHTTPChannel (this=0x81c5660) releases its nsHTTPRequest (this=0x89d8138),
which in turn releases its reference to the nsSocketTransport (0x88346f4)
- which is an nsIChannel. Unfortunately, the nsSocketTransport instance has
already been deleted :-(
Comment 87•25 years ago
That class of problem (release on an already deleted object) is exactly the sort
of thing that would be expected from the problem you isolated. When the ref
count on the object is down-counted to zero, and the hit to zero is felt by
*two* threads, then *both* threads will delete and clean up that object. When
both threads start to "clean up," some related objects will be deleted on one
thread, and then later the other thread will come along to "clean up" and do
additional releases on a collected object.
This all seems to fit... or am I missing something??
Comment 88•25 years ago
The good news is that I no longer crash when sitting there viewing slashdot.org
or the test case. It appears that animated gifs are now cached instead of being
downloaded over and over.
The bad news is that I still get random crashes with the same stack traces.
I'll try gagan's xpcom logging suggestion.
Comment 89•25 years ago
holy cow! I ran mozilla for 15 or so minutes with NSPR_LOG_MODULES and
NSPR_LOG_FILE set. I generated a 56mb log file filled with millions of
these
1024[8058968]: found lib:libgfx_gtk.so as 812b930 in factory cache.
1024[8058968]: Factory CreateInstance() succeeded.
1024[8058968]: nsComponentManager:
FindFactory({e12752f0-ee9a-11d1-a82a-0040959a28c9})
These spewed out at the rate of 1 or 2 per second even just sitting
at http://www.mozilla.org/
Eventually after reloading my mailbox and loading some other pages it crashed.
#0 0x40800149 in ?? ()
#1 0x41c373ac in nsCOMPtr<nsIChannel>::assign_with_AddRef (this=0x420a8ac0,
rawPtr=0x0) at ../../../../dist/include/nsCOMPtr.h:787
#2 0x41c37de3 in nsCOMPtr<nsIChannel>::operator= (this=0x420a8ac0, rhs=0x0)
at ../../../../dist/include/nsCOMPtr.h:526
#3 0x41c24ac0 in nsHTTPRequest::~nsHTTPRequest (this=0x420a8aa8, __in_chrg=3)
at nsHTTPRequest.cpp:146
#4 0x41c24c25 in nsHTTPRequest::Release (this=0x420a8aa8)
at nsHTTPRequest.cpp:154
#5 0x41c1d9b5 in nsHTTPChannel::~nsHTTPChannel (this=0x42029b80, __in_chrg=3)
at nsHTTPChannel.cpp:127
#6 0x41c1dc33 in nsHTTPChannel::Release (this=0x42029b80)
at nsHTTPChannel.cpp:142
#7 0x406e40ee in nsStreamListenerEvent::~nsStreamListenerEvent (
this=0x82efb48, __in_chrg=3) at nsAsyncStreamListener.cpp:81
#8 0x406e4a01 in nsOnStopRequestEvent::~nsOnStopRequestEvent (this=0x82efb48,
__in_chrg=3) at nsAsyncStreamListener.cpp:261
#9 0x406e421f in nsStreamListenerEvent::DestroyPLEvent (aEvent=0x84689c8)
at nsAsyncStreamListener.cpp:108
#10 0x40189c5b in PL_DestroyEvent (self=0x84689c8) at plevent.c:549
#11 0x40189bf9 in PL_HandleEvent (self=0x84689c8) at plevent.c:536
#12 0x40189abc in PL_ProcessPendingEvents (self=0x812cf78) at plevent.c:487
#13 0x4018b5fc in nsEventQueueImpl::ProcessPendingEvents (this=0x812cf50)
at nsEventQueue.cpp:298
#14 0x40935a64 in event_processor_callback (data=0x812cf50, source=9,
condition=GDK_INPUT_READ) at nsAppShell.cpp:141
#15 0x409356ef in our_gdk_io_invoke (source=0x4159f368, condition=G_IO_IN,
data=0x415b2988) at nsAppShell.cpp:54
#16 0x407cc52a in g_io_unix_dispatch () from /usr/lib/libglib-1.2.so.0
#17 0x407cdbe6 in g_main_dispatch () from /usr/lib/libglib-1.2.so.0
#18 0x407ce1a1 in g_main_iterate () from /usr/lib/libglib-1.2.so.0
#19 0x407ce341 in g_main_run () from /usr/lib/libglib-1.2.so.0
#20 0x40a12209 in gtk_main () from /usr/lib/libgtk-1.2.so.0
#21 0x40936067 in nsAppShell::Run (this=0x40812e38) at nsAppShell.cpp:304
#22 0x4064eaad in ?? ()
from /home/endico/mozilla/mozilla/dist/bin/components/libnsappshell.so
#23 0x804e60e in main1 (argc=1, argv=0xbffff9e4, splashScreen=0x0)
at nsAppRunner.cpp:763
#24 0x804eba0 in main (argc=1, argv=0xbffff9e4) at nsAppRunner.cpp:883
from the end of xpcom.log:
1024[8058968]: found rel:libnecko.so as 807ac80 in factory cache.
1024[8058968]: Factory CreateInstance() succeeded.
1024[8058968]: nsComponentManager:
ProgIDToClassID(component://netscape/image/decoder&type=image/gif)->{0d471b70-baf5-11d2-802c-0060088f91a3}
1024[8058968]: nsComponentManager:
FindFactory({0d471b70-baf5-11d2-802c-0060088f91a3})
1024[8058968]: found rel:libnsgif.so as 8085720 in factory cache.
1024[8058968]: Factory CreateInstance() succeeded.
1024[8058968]: nsComponentManager:
FindFactory({6049b261-c1e6-11d1-a827-0040959a28c9})
1024[8058968]: found lib:libgfx_gtk.so as 812b1d8 in factory cache.
1024[8058968]: Factory CreateInstance() succeeded.
1024[8058968]: nsComponentManager:
FindFactory({e12752f0-ee9a-11d1-a82a-0040959a28c9})
1024[8058968]: found lib:libgfx_gtk.so as 812b930 in factory cache.
1024[8058968]: Factory CreateInstance() succeeded.
1024[8058968]: nsComponentManager:
FindFactory({e12752f0-ee9a-11d1-a82a-0040959a28c9})
1024[8058968]: found lib:libgfx_gtk.so as 812b930 in factory cache.
1024[8058968]: Factory CreateInstance() succeeded.
1024[8058968]: nsComponentManager:
FindFactory({e12752f0-ee9a-11d1-a82a-0040959a28c9})
1024[8058968]: found lib:libgfx_gtk.so as 812b930 in factory cache.
1024[8058968]: Factory CreateInstance() succeeded.
1024[8058968]: nsComponentManager:
FindFactory({e12752f0-ee9a-11d1-a82a-0040959a28c9})
1024[8058968]: found lib:libgfx_gtk.so as 812b930 in factory cache.
1024[8058968]: Factory CreateInstance() succeeded.
1024[8058968]: nsComponentManager:
FindFactory({e12752f0-ee9a-11d1-a82a-0040959a28c9})
1024[8058968]: found lib:libgfx_gtk.so as 812b930 in factory cache.
1024[8058968]: Factory CreateInstance() succeeded.
1024[8058968]: nsComponentManager:
FindFactory({e12752f0-ee9a-11d1-a82a-0040959a28c9})
1024[8058968]: found lib:libgfx_gtk.so as 812b930 in factory cache.
1024[8058968]: Factory CreateInstance() succeeded.
1024[8058968]: nsComponentManager:
FindFactory({e12752f0-ee9a-11d1-a82a-0040959a28c9})
1024[8058968]: found lib:libgfx_gtk.so as 812b930 in factory cache.
1024[8058968]: Factory CreateInstance() succeeded.
1024[8058968]: nsComponentManager:
ProgIDToClassID(component://netscape/network/protocol?name=http)->{52a30880-dd95-11d3-a1a7-0050041caf44}
1024[8058968]: nsComponentManager:
FindFactory({90012125-1616-4fa1-ae14-4e7fa5766eb6})
1024[8058968]: found rel:libnecko.so as 807b070 in factory cache.
1024[8058968]: Factory CreateInstance() succeeded.
1024[8058968]: nsComponentManager:
FindFactory({de9472d0-8034-11d3-9399-00104ba0fd40})
1024[8058968]: found rel:libnecko.so as 807a890 in factory cache.
1024[8058968]: nsComponentManager:
FindFactory({dbf72351-4fd8-46f0-9dbc-fa5ba60a305c})
1024[8058968]: found rel:libnecko.so as 807afc8 in factory cache.
1024[8058968]: Factory CreateInstance() succeeded.
1024[8058968]: Factory CreateInstance() succeeded.
1024[8058968]: nsComponentManager:
ProgIDToClassID(component://netscape/scriptsecuritymanager)->{7ee2a4c0-4b93-17d3-ba18-0060b0f199a2}
1024[8058968]: nsComponentManager:
ProgIDToClassID(component://netscape/network/protocol?name=http)->{52a30880-dd95-11d3-a1a7-0050041caf44}
1024[8058968]: nsComponentManager:
ProgIDToClassID(component://netscape/network/cache?name=manager)->{2030f0b0-9567-11d3-90d3-0040056a906e}
1024[8058968]: nsComponentManager:
FindFactory({60047bb2-91c0-11d3-8cd9-0060b0fc14a3})
1024[8058968]: found rel:libnecko.so as 807ac80 in factory cache.
1024[8058968]: Factory CreateInstance() succeeded.
1024[8058968]: nsComponentManager:
ProgIDToClassID(component://netscape/image/decoder&type=image/gif)->{0d471b70-baf5-11d2-802c-0060088f91a3}
1024[8058968]: nsComponentManager:
FindFactory({0d471b70-baf5-11d2-802c-0060088f91a3})
1024[8058968]: found rel:libnsgif.so as 8085720 in factory cache.
1024[8058968]: Factory CreateInstance() succeeded.
Assignee | ||
Comment 90•25 years ago
Dawn: Try setting nsSocketTransport:5 instead of nsComponentManager:5. I think
that would be more helpful.
Still working on an SMP machine for Rick. Pavlov agreed to pool his machine
with mine... if I could only find him!
Comment 91•25 years ago
Now that animated gifs don't constantly reload, my new test case is browser
buster. It broke for me at about the 3rd url. Here's a new stack and the
last part of xpcom.log. I have 150K or so of log file with random.yahoo.com and
esta.org messages if anyone is interested.
using nsComponentManager:5.
#0 0x0 in ?? ()
#1 0x406e40ee in nsStreamListenerEvent::~nsStreamListenerEvent (
this=0x8793948, __in_chrg=3) at nsAsyncStreamListener.cpp:81
#2 0x406e4a01 in nsOnStopRequestEvent::~nsOnStopRequestEvent (this=0x8793948,
__in_chrg=3) at nsAsyncStreamListener.cpp:261
#3 0x406e421f in nsStreamListenerEvent::DestroyPLEvent (aEvent=0x8788988)
at nsAsyncStreamListener.cpp:108
#4 0x40189c5b in PL_DestroyEvent (self=0x8788988) at plevent.c:549
#5 0x40189bf9 in PL_HandleEvent (self=0x8788988) at plevent.c:536
#6 0x40189abc in PL_ProcessPendingEvents (self=0x812b798) at plevent.c:487
#7 0x4018b5fc in nsEventQueueImpl::ProcessPendingEvents (this=0x812b770)
at nsEventQueue.cpp:298
#8 0x40935a64 in event_processor_callback (data=0x812b770, source=9,
condition=GDK_INPUT_READ) at nsAppShell.cpp:141
#9 0x409356ef in our_gdk_io_invoke (source=0x8338070, condition=G_IO_IN,
data=0x81c6f08) at nsAppShell.cpp:54
#10 0x407cc52a in g_io_unix_dispatch () from /usr/lib/libglib-1.2.so.0
#11 0x407cdbe6 in g_main_dispatch () from /usr/lib/libglib-1.2.so.0
#12 0x407ce1a1 in g_main_iterate () from /usr/lib/libglib-1.2.so.0
#13 0x407ce341 in g_main_run () from /usr/lib/libglib-1.2.so.0
#14 0x40a12209 in ?? () from /usr/lib/libgtk-1.2.so.0
#15 0x40936067 in nsAppShell::Run (this=0x8130de8) at nsAppShell.cpp:304
#16 0x4064eaad in nsAppShellService::Run (this=0x812b570)
at nsAppShellService.cpp:399
#17 0x804e60e in main1 (argc=1, argv=0xbffff9e4, splashScreen=0x0)
at nsAppRunner.cpp:763
#18 0x804eba0 in main (argc=1, argv=0xbffff9e4) at nsAppRunner.cpp:883
1026[812d2c8]: --- Leaving nsSocketTransport::Process() [www.esta.org:80
41ccdd20]. mStatus = 80470007. CurrentState = 5
1026[812d2c8]: +++ Entering nsSocketTransport::Process() [www.esta.org:80
41ccdd20]. aSelectFlags = 1. CurrentState = 5
1026[812d2c8]: +++ Entering nsSocketTransport::doRead() [www.esta.org:80
41ccdd20]. aSelectFlags = 1.
1026[812d2c8]: nsReadFromSocket [fd=40805220]. rv = 0. Buffer space = 239.
Bytes read =239
1026[812d2c8]: nsReadFromSocket [fd=40805220]. rv = 0. Buffer space = 2048.
Bytes read =261
1026[812d2c8]: nsReadFromSocket [fd=40805220]. rv = 80470007. Buffer space =
1787. Bytes read =0
1026[812d2c8]: nsSocketTransport::OnWrite() [www.esta.org:80 41ccdd20].
nsIPipe=408fe088 Count=500
1026[812d2c8]: WriteSegments [fd=40805220]. rv = 0. Bytes read =500
1026[812d2c8]: --- Leaving nsSocketTransport::doRead() [www.esta.org:80
41ccdd20]. rv = 80470007. Total bytes read: 500
1026[812d2c8]: --- Leaving nsSocketTransport::Process() [www.esta.org:80
41ccdd20]. mStatus = 80470007. CurrentState = 5
1026[812d2c8]: +++ Entering nsSocketTransport::Process() [www.esta.org:80
41ccdd20]. aSelectFlags = 1. CurrentState = 5
1026[812d2c8]: +++ Entering nsSocketTransport::doRead() [www.esta.org:80
41ccdd20]. aSelectFlags = 1.
1026[812d2c8]: nsReadFromSocket [fd=40805220]. rv = 0. Buffer space = 1787.
Bytes read =500
1026[812d2c8]: nsReadFromSocket [fd=40805220]. rv = 80470007. Buffer space =
1287. Bytes read =0
1026[812d2c8]: nsSocketTransport::OnWrite() [www.esta.org:80 41ccdd20].
nsIPipe=408fe088 Count=500
1026[812d2c8]: WriteSegments [fd=40805220]. rv = 0. Bytes read =500
1026[812d2c8]: --- Leaving nsSocketTransport::doRead() [www.esta.org:80
41ccdd20]. rv = 80470007. Total bytes read: 500
1026[812d2c8]: --- Leaving nsSocketTransport::Process() [www.esta.org:80
41ccdd20]. mStatus = 80470007. CurrentState = 5
1026[812d2c8]: +++ Entering nsSocketTransport::Process() [www.esta.org:80
41ccdd20]. aSelectFlags = 1. CurrentState = 5
1026[812d2c8]: +++ Entering nsSocketTransport::doRead() [www.esta.org:80
41ccdd20]. aSelectFlags = 1.
1026[812d2c8]: nsReadFromSocket [fd=40805220]. rv = 0. Buffer space = 1287.
Bytes read =46
1026[812d2c8]: nsReadFromSocket [fd=40805220]. rv = 0. Buffer space = 1241.
Bytes read =0
1026[812d2c8]: nsSocketTransport::OnWrite() [www.esta.org:80 41ccdd20].
nsIPipe=408fe088 Count=46
1026[812d2c8]: WriteSegments [fd=40805220]. rv = 0. Bytes read =46
1026[812d2c8]: --- Leaving nsSocketTransport::doRead() [www.esta.org:80
41ccdd20]. rv = 80470007. Total bytes read: 46
1026[812d2c8]: --- Leaving nsSocketTransport::Process() [www.esta.org:80
41ccdd20]. mStatus = 80470007. CurrentState = 5
1026[812d2c8]: +++ Entering nsSocketTransport::Process() [www.esta.org:80
41ccdd20]. aSelectFlags = 20. CurrentState = 5
1026[812d2c8]: Operation failed via PR_POLL_HUP. [www.esta.org:80 41ccdd20].
1026[812d2c8]: Transport [www.esta.org:80 41ccdd20] is in error state.
1026[812d2c8]: Transport [www.esta.org:80 41ccdd20] is in done state.
1026[812d2c8]: --- Leaving nsSocketTransport::Process() [www.esta.org:80
41ccdd20]. mStatus = 0. CurrentState = 3
1024[8058968]: Deleting nsSocketTransport [komodo.mozilla.org:80 877f0a8].
1024[8058968]: Deleting nsSocketTransport [random.yahoo.com:80 408c83e8].
1024[8058968]: Deleting nsSocketTransport [www.esta.org:80 41ccdd20].
Comment 92•25 years ago
hey dawn,
This last bit of logging info is starting to look useful :-) can you try it
again with NSPR_LOG_MODULES=nsHTTPProtocol:5,nsSocketTransport:5
This will give info about how/when the HTTP objects are destroyed too.
Thanks,
-- rick
Comment 93•25 years ago
*** Bug 26686 has been marked as a duplicate of this bug. ***
Comment 94•25 years ago
So, I've been trying to reproduce these crashes most of the night on a 2
processor NT machine without any luck :-(
I'll try Linux tomorrow...
Is anyone else still seeing these crashes on SMP NT boxes? Or is Linux the only
platform now?
Comment 95•25 years ago
I did what rick asked and mailed him another stack trace and log file
rather than pasting it all here. Here's the end of the log.
1024[8058968]: Canceling nsSocketTransport [dspace.dial.pipex.com:80 42564550].
rv = 0
1024[8058968]: Canceling nsSocketTransport [dspace.dial.pipex.com:80 42578428].
rv = 0
1024[8058968]: Deleting nsHTTPChannel [this=8164f10].
1024[8058968]: Deleting nsHTTPRequest [this=814c7b0].
1024[8058968]: Deleting nsSocketTransport [komodo.mozilla.org:80 42093698].
1024[8058968]: Deleting nsHTTPChannel [this=4201e630].
1024[8058968]: Deleting nsHTTPRequest [this=41ec6930].
Comment 96•25 years ago
I would be happy to do any testing on Linux that is needed. I saw some ideas of
how to get the appropriate info earlier in this bug. I did notice that mozilla
nightly from last night/this morning was very unstable compared to 48 hours ago
in linux/smp.
Comment 97•25 years ago
rpotts@netscape.com asked if anyone still was seeing this on NT: yes. I've
sent the full dump directly, this was on "latest nightly": 2000022908 on a dual
PII 450 running NT. I had been browsing for about an hour or so, /., UF,
mozilla.org, oreily.com, nothing serious when it crashed... took most of NT
with it... I had to logout and kill most of my active processes in order to
get realtime control back... there's a line in the stack trace for the active
thread that might explain that.... (dnetc was running in background, ending
that from a command line helped, but didn't restore full usability. what ever
the crash did it resulted in normal processes only getting time (even to
repaint) when dnetc was IO bound to disk - that is NOT the normal behaviour
of dnetc, it is usually very well behaved. After it shut down it only took a
minute for the start menu to appear, another minute for the shutdown menu
option to select.... before that it took ten minutes to get the start->run
dialog.)
here's the top of the active thread:
jsdom!nsGetInterface::operator=
gkhtml!NS_NewEventListenerManager
gkhtml!NS_NewPresShell
gkview!nsCreateInstanceByProgID::operator=
gkview!nsCreateInstanceByProgID::operator=
[...]
mozilla!nsGetInterface::operator=
kernel32!GetProcessPriorityBoost
mozilla!<nosymbols>
Comment 98•25 years ago
adding link to bug 25910 which most likely is a duplicate
Assignee | ||
Comment 99•25 years ago
I'll have to take this over now that Rick has gone on sabbatical, but in some
sense it's probably Dougt's bug.
Status: We worked on this all day yesterday on Dawn's machine and saw numerous
crashes. For necko they were often in the proxy code used to post OnStatus and
OnProgress notifications back to the mozilla thread. However, we also saw
other problems, such as the gfx toolkit going away, so solving just the
necko issue won't make us completely stable on MP machines.
Possible solutions: (a) don't deliver status/progress at all (disable them
in the socket transport and just rel-note it), (b) don't use the proxy code to
deliver status/progress (implement the event delivery/thread-switch by hand),
(c) get Doug to track down what's going on with proxies.
Last night we augmented the TestSocketTransport test program to receive
status/progress notifications so that it might also exhibit this problem, and
left it running on the machine but didn't see the same failure by the time we
went home. :-(
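To make option (b) above concrete, here is a rough sketch of what a hand-rolled thread switch could look like. It is not the Necko or proxy code: EventQueue, Post, ProcessPending and DeliverProgress are hypothetical names, and standard C++ threads stand in for NSPR. The socket thread only enqueues; the progress state is touched only by the thread that drains the queue.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical minimal event queue; illustrative only, not an XPCOM API.
class EventQueue {
public:
    EventQueue() : mOwner(std::this_thread::get_id()) {}

    // May be called from any thread: enqueue work for the owning thread.
    void Post(std::function<void()> event) {
        std::lock_guard<std::mutex> lock(mLock);
        mEvents.push(std::move(event));
    }

    // Must be called on the owning thread: run everything queued so far.
    int ProcessPending() {
        assert(std::this_thread::get_id() == mOwner);
        std::queue<std::function<void()>> batch;
        {
            // Hold the lock only while taking the batch, not while running it.
            std::lock_guard<std::mutex> lock(mLock);
            std::swap(batch, mEvents);
        }
        int handled = 0;
        while (!batch.empty()) {
            batch.front()();
            batch.pop();
            ++handled;
        }
        return handled;
    }

private:
    std::thread::id mOwner;
    std::mutex mLock;
    std::queue<std::function<void()>> mEvents;
};

// Simulates a socket thread reporting progress to the UI thread.
// Returns the last progress value the UI thread observed.
int DeliverProgress(int steps) {
    EventQueue uiQueue;
    int lastProgress = 0;  // written only while the UI thread drains the queue
    std::thread socketThread([&uiQueue, &lastProgress, steps] {
        for (int i = 1; i <= steps; ++i) {
            uiQueue.Post([&lastProgress, i] { lastProgress = i; });
        }
    });
    socketThread.join();
    uiQueue.ProcessPending();
    return lastProgress;
}
```

The only shared state is the queue itself, and it is guarded by one lock, which is the property we would want from any replacement for the proxy path.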
Assignee: rpotts → warren
Assignee | ||
Comment 100•25 years ago
|
||
Found it! NS_MT_SUPPORTED was not defined for Linux (!) and a bunch of classes
weren't thread safe.
See news://news.mozilla.org/38BF7E94.3CA715DA%40netscape.com for details.
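For anyone following along, a sketch of what "thread safe" means for the refcount is below. This is not the actual patch: RefCounted and HammerRefCount are hypothetical names, and std::atomic stands in for whatever atomic increment/decrement the platform provides. With a plain integer counter, concurrent AddRef/Release calls on an SMP box can lose updates and lead to a premature delete or a leak.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-in for an XPCOM-style refcounted object; not Mozilla code.
class RefCounted {
public:
    // Atomic increment: safe when several threads AddRef concurrently.
    unsigned AddRef() {
        return mRefCnt.fetch_add(1, std::memory_order_relaxed) + 1;
    }

    // Atomic decrement; the thread that drops the last reference deletes.
    unsigned Release() {
        assert(mRefCnt.load() > 0);
        unsigned count = mRefCnt.fetch_sub(1, std::memory_order_acq_rel) - 1;
        if (count == 0) {
            delete this;
        }
        return count;
    }

    unsigned RefCount() const { return mRefCnt.load(); }

private:
    ~RefCounted() = default;  // destroyed only through Release()
    std::atomic<unsigned> mRefCnt{0};
};

// Hammers one object from several threads and returns the count left over
// once every thread has balanced its AddRef/Release pairs.
unsigned HammerRefCount(unsigned threads, unsigned iterations) {
    RefCounted* obj = new RefCounted();
    obj->AddRef();  // the owner's reference keeps the object alive
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([obj, iterations] {
            for (unsigned i = 0; i < iterations; ++i) {
                obj->AddRef();
                obj->Release();
            }
        });
    }
    for (auto& worker : pool) {
        worker.join();
    }
    unsigned remaining = obj->RefCount();
    obj->Release();  // drops the owner's reference and deletes the object
    return remaining;
}
```

With the atomic counter, the count left after all threads finish is exactly the owner's single reference, no matter how the threads interleave.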
Assignee | ||
Updated•25 years ago
|
Whiteboard: [PDT+] w/b minus on 03/03- need SMP machine → [PDT+] w/b minus on 03/03 [have fixes!]
Comment 101•25 years ago
|
||
The landing is in progress, so I'm extending this to w/b minus on 3/7
Whiteboard: [PDT+] w/b minus on 03/03 [have fixes!] → [PDT+] w/b minus on 3/7 [have fixes!]
Assignee | ||
Comment 102•25 years ago
|
||
Here's the list of classes I'm having to make threadsafe:
AtomImpl
BasicStringImpl
CacheOutputStream
InterceptStreamListener
MemCacheWriteStreamWrapper
TestConnection
nsAppShellService
nsCacheEntryChannel
nsCharsetConverterManager
nsConverterFactory
nsDNSService
nsDateTimeFormatWin
nsDocShell
nsDocumentOpenInfo
nsEventQueueImpl
nsEventQueueServiceImpl
nsFTPDirListingConv
nsFileSpecImpl
nsFileTransport
nsFileTransportService
nsGenericFactory
nsGenericModule
nsHTTPIndexParser
nsIOService
nsImapFlagAndUidState
nsImapMailCopyState
nsImapMockChannel
nsInputStreamChannel
nsInputStreamFileSystem
nsInterfaceInfoManager
nsLocalFile
nsLocalFileSystem
nsLocale
nsLocaleService
nsMIMEInfoImpl
nsMIMEService
nsMemCacheChannel
nsMemCacheRecord
nsMsgAccountManager
nsMsgIncomingServer
nsMsgMailNewsUrl
nsMsgStatusFeedback
nsMsgWindow
nsObserverService
nsPref
nsPrefMigration
nsProxyEventClass
nsProxyEventObject
nsProxyObjectManager
nsRDFResource
nsRunner
nsSocketTransport
nsSocketTransportService
nsStdURLParser
nsStorageStream
nsStreamConverterService
nsSupportsArray
nsThread
nsThreadPool
nsWalletlibService
Comment 103•25 years ago
|
||
On what evidence are you basing the need to make the imap classes
thread-safe? (by which I assume you mean adding threadsafe add and release refs)
Inspection, or actual evidence of CONCURRENT access to add and release ref from
multiple threads? The imap code uses BLOCKING proxy calls between threads so
that while one thread may be manipulating the ref count, the other thread is
blocked.
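To illustrate the blocking-proxy argument, here is a sketch under simplifying assumptions. It is not the imap or proxy code: PlainRefCounted and BlockingProxyRoundTrips are hypothetical, and std::promise/std::future stand in for the proxy's wait. The counter is deliberately non-atomic; it stays consistent only because the caller does not touch the object until the other thread signals completion.

```cpp
#include <cassert>
#include <future>
#include <thread>

// Hypothetical object with a deliberately NON-atomic refcount.
struct PlainRefCounted {
    int refCnt = 1;
    void AddRef() { ++refCnt; }
    void Release() { --refCnt; }
};

// Each iteration models one blocking proxy call: the target thread uses the
// object while the calling thread waits, so the two never run concurrently.
int BlockingProxyRoundTrips(int calls) {
    PlainRefCounted obj;
    for (int i = 0; i < calls; ++i) {
        obj.AddRef();  // reference held on behalf of the call
        std::promise<void> done;
        std::future<void> finished = done.get_future();
        std::thread target([&obj, &done] {
            obj.AddRef();
            obj.Release();
            done.set_value();  // publishes the target's writes to the waiter
        });
        finished.wait();  // caller is blocked here while the target runs
        assert(obj.refCnt >= 1);
        target.join();
        obj.Release();
    }
    return obj.refCnt;
}
```

This holds only while every cross-thread call really is synchronous; a single asynchronous post that touches the same object would reintroduce the race.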
Assignee | ||
Comment 104•25 years ago
|
||
These changes went in moments ago, along with Andreas' changes.
David: These classes were determined experimentally. I hadn't thought about the
case where only synchronous proxy code was used, and consequently making
AddRef/Release threadsafe _shouldn't_ be necessary (I'd have to really study the
proxy code to determine whether that's really true), but I think making these
classes threadsafe is mostly harmless -- just a little more overhead in the
AddRef/Release which will hopefully be insignificant. Let's see if anything
shows up during profiling.
Status: NEW → RESOLVED
Closed: 25 years ago → 25 years ago
Resolution: --- → FIXED
Comment 105•25 years ago
|
||
Warren, I was playing around on my machine today in the tree you were
working on and found lots of other thread safety assertions and crashes
in the mail account wizard and while loading my inbox. Do you need that
tree any more or is it safe to update to the tip? I don't want to blow
away your changes but I don't want to report the crashes if they are
unique to my tree.
Assignee | ||
Comment 106•25 years ago
|
||
You can update to the tip. Tons of other fixes went in after that. It would be
great if you could verify that the thread safety asserts you mentioned have gone
away now. If not, you can send them to me, or file new bugs. Thanks.
Comment 107•25 years ago
|
||
Dawn, could you help once again in verifying this bug. I have been told that
you were able to reproduce this. Thanks.
Comment 108•25 years ago
|
||
Oops, I did this the other day and mailed warren but forgot to comment
in the bug. After I updated from the tip things worked great. I got no
assertions and didn't crash after several hours. Marking verified.
Status: RESOLVED → VERIFIED
Comment 109•25 years ago
|
||
I'm running on a Quad Sun UE450 (Solaris 2.6) and have been experiencing quite
a lot of instability. If I run the exact same code on a UP machine with the
exact same OS etc., it's almost perfectly stable.
I bet the quad will trigger SMP bugs more than a dual...
I'm running current CVS (tip) with gtk/glib 1.2.6, compiled with gcc 2.95.2
(-O -msupersparc).
Here is a stacktrace from searching for 'Mozilla' in the search sidebar and
waiting a few seconds (repeatable sometimes 8):
#0 0xef1d66b8 in pthread_mutex_lock () from /usr/lib/libthread.so.1
#1 0xef5614c8 in PR_Lock ()
from /scratch/mozilla/mozilla/dist/bin/./libnspr4.so
#2 0xedefff6c in nsSocketTransport::Process ()
from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#3 0xedf029f4 in nsSocketTransportService::ProcessWorkQ ()
from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#4 0xedf02f30 in nsSocketTransportService::Run ()
from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#5 0xef68da64 in nsThread::Main ()
from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#6 0xef566208 in _pt_root ()
from /scratch/mozilla/mozilla/dist/bin/./libnspr4.so
Loading a page with a bunch of images resulted in:
#0 0xedefd850 in nsStreamListenerEvent::~nsStreamListenerEvent ()
from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#1 0xedefdef4 in nsOnStopRequestEvent::~nsOnStopRequestEvent ()
from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#2 0xedefd908 in nsStreamListenerEvent::DestroyPLEvent ()
from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#3 0xef68b698 in PL_DestroyEvent ()
from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#4 0xef68b674 in PL_HandleEvent ()
from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#5 0xef68b584 in PL_ProcessPendingEvents ()
from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#6 0xef68c328 in nsEventQueueImpl::ProcessPendingEvents ()
from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#7 0xee630a74 in event_processor_callback ()
from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#8 0xee630794 in our_gdk_io_invoke ()
from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#9 0xee251d0c in g_main_dispatch () from /usr/local/lib/libglib-1.2.so.0
#10 0xee252444 in g_main_iterate () from /usr/local/lib/libglib-1.2.so.0
#11 0xee252634 in g_main_run () from /usr/local/lib/libglib-1.2.so.0
#12 0xee429814 in gtk_main () from /usr/local/lib/libgtk-1.2.so.0
#13 0xee630f78 in nsAppShell::Run ()
from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#14 0xee6cfa30 in nsAppShellService::Run ()
from /scratch/mozilla/mozilla/dist/bin/components/libnsappshell.so
#15 0x139f0 in main1 ()
#16 0x13ddc in main ()
Assignee | ||
Comment 110•25 years ago
|
||
stric: Is this the latest build? Debug or optimized? We're still finding
thread-safety assertions that we're tracking down, so we know this isn't 100%
fixed yet, but we closed this bug because we know that the assertions will help
us resolve them over time. I'm wondering if you've seen any assertions, and/or
whether you think we should reopen this bug.
Comment 111•25 years ago
|
||
Note that crashing on the tip build this past weekend (or today) is no big
deal. There is a lot of instability at this moment.
Do you crash when you pull last friday's evening build? Try picking that up
from Mozilla. That was when we branched for beta, but before the giant landings
began.
If you are building your own binary, you should try to induce this bug using the
Netscape beta1 branch. That would be the interesting (sad? surprising?) test.
Thanks,
Jim
Comment 112•25 years ago
|
||
I hate to be a broken record, but the asserts only catch lack of thread safety
on addref and release - there could be all sorts of other thread-safety issues.
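A sketch of the limitation, with hypothetical names (OwnedByOneThread and the helper functions are not Mozilla code, and returning false stands in for where a debug build would assert): the owning-thread check lives in AddRef, so a call to any other method from the wrong thread goes through unnoticed.

```cpp
#include <cassert>
#include <thread>

// Hypothetical class whose refcount methods check the calling thread.
class OwnedByOneThread {
public:
    OwnedByOneThread() : mOwner(std::this_thread::get_id()) {}

    // Returns false (a debug build would assert) when called off the owner.
    bool AddRef() {
        if (std::this_thread::get_id() != mOwner) {
            return false;
        }
        ++mRefCnt;
        return true;
    }

    // No check here: a call from another thread is not detected.
    void SetValue(int value) { mValue = value; }
    int Value() const { return mValue; }

private:
    std::thread::id mOwner;
    int mRefCnt = 0;
    int mValue = 0;
};

// AddRef on the thread that created the object: accepted.
bool AddRefOnOwner() {
    OwnedByOneThread obj;
    return obj.AddRef();
}

// AddRef from a different thread: the check fires.
bool AddRefOffOwner() {
    OwnedByOneThread obj;
    bool accepted = true;
    std::thread other([&obj, &accepted] { accepted = obj.AddRef(); });
    other.join();
    return accepted;
}

// SetValue from a different thread: nothing complains.
int UncheckedWriteOffOwner() {
    OwnedByOneThread obj;
    std::thread other([&obj] { obj.SetValue(42); });
    other.join();
    return obj.Value();
}
```

In this toy the unchecked write is harmless because the thread is joined before the read, but in real code it is exactly the kind of access the refcount asserts will never flag.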
Comment 113•25 years ago
|
||
ftp://ftp.mozilla.org/pub/mozilla/nightly/2000-03-10-08-M15/mozilla-source.tar.gz
this is the source tarball from last friday that jar mentioned.
I don't see a source tarball for the netscape beta branch. You
can pull it from cvs if you use the proper tag. The tag should
be listed on the builds or seamonkey newsgroup.
Comment 114•25 years ago
|
||
I don't think mozilla.org is doing any building of tarballs based on the netscape
branch (although you could ask for 'em!! :-) ). That was why the best build I
could point at was late in the day on last Friday. Thanks to endico for adding
the pointer.
Bienvenu is quite correct that other bugs can/will exist in/around
multi-threading. There is a good chance that the nature of the thread-induced
problem will not be memory-centric (re: double frees, etc.), and hence I
personally would be more surprised to see a stack trace that looked consistently
like the ones we had been seeing on this bug. Another bug... yes... but I was
hoping we were free of this particular class of threading errors. Perhaps we
never will be... but a guy can hope! :-)
Again, please tell us how you do with the "relatively" stable build that endico
identified.
Comment 115•25 years ago
|
||
Warren: I was running current (by then) CVS source from CVS HEAD, optimized build.
I just updated and now I get crashes when I resize (a bunch) the window when
viewing slashdot.org for example.. I get a 120-130 step backtrace.. here's a snip:
#0 0x0 in ?? ()
#1 0xedad7300 in nsInlineFrame::ReflowFrames ()
from /scratch/mozilla/mozilla/dist/bin/components/libraptorhtml.so
#2 0xedad719c in nsInlineFrame::Reflow ()
from /scratch/mozilla/mozilla/dist/bin/components/libraptorhtml.so
#3 0xedadaa9c in nsLineLayout::ReflowFrame ()
from /scratch/mozilla/mozilla/dist/bin/components/libraptorhtml.so
#4 0xedab8b38 in nsBlockFrame::ReflowInlineFrame ()
from /scratch/mozilla/mozilla/dist/bin/components/libraptorhtml.so
...
#87 0xedae75fc in PresShell::ResizeReflow ()
from /scratch/mozilla/mozilla/dist/bin/components/libraptorhtml.so
#88 0xed6dec54 in nsViewManager2::SetWindowDimensions ()
from /scratch/mozilla/mozilla/dist/bin/components/libraptorview.so
#89 0xed6e0420 in nsViewManager2::DispatchEvent ()
from /scratch/mozilla/mozilla/dist/bin/components/libraptorview.so
#90 0xed6ced54 in HandleEvent ()
from /scratch/mozilla/mozilla/dist/bin/components/libraptorview.so
#91 0xeea3bc98 in nsWidget::DispatchEvent ()
from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#92 0xeea3bba8 in nsWidget::DispatchWindowEvent ()
from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#93 0xeea3aa8c in nsWidget::OnResize ()
from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#94 0xeea42ff4 in nsWindow::Resize ()
from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#95 0xed6d0a30 in nsView::SetDimensions ()
from /scratch/mozilla/mozilla/dist/bin/components/libraptorview.so
#96 0xed6dec24 in nsViewManager2::SetWindowDimensions ()
from /scratch/mozilla/mozilla/dist/bin/components/libraptorview.so
Here's a dump from loading a page with a bunch of png/jpg/gif images:
(gdb) bt
#0 0xee2ed9d0 in nsStreamListenerEvent::~nsStreamListenerEvent ()
from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#1 0xee2ee074 in nsOnStopRequestEvent::~nsOnStopRequestEvent ()
from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#2 0xee2eda88 in nsStreamListenerEvent::DestroyPLEvent ()
from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#3 0xefa8b650 in PL_DestroyEvent ()
from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#4 0xefa8b62c in PL_HandleEvent ()
from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#5 0xefa8b53c in PL_ProcessPendingEvents ()
from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#6 0xefa8c2e0 in nsEventQueueImpl::ProcessPendingEvents ()
from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#7 0xeea2c40c in event_processor_callback ()
from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#8 0xeea2c12c in our_gdk_io_invoke ()
from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#9 0xee651d0c in g_main_dispatch () from /usr/local/lib/libglib-1.2.so.0
#10 0xee652444 in g_main_iterate () from /usr/local/lib/libglib-1.2.so.0
#11 0xee652634 in g_main_run () from /usr/local/lib/libglib-1.2.so.0
#12 0xee829814 in gtk_main () from /usr/local/lib/libgtk-1.2.so.0
#13 0xeea2c910 in nsAppShell::Run ()
from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#14 0xeeacfb30 in nsAppShellService::Run ()
from /scratch/mozilla/mozilla/dist/bin/components/libnsappshell.so
#15 0x139f0 in main1 ()
#16 0x13ddc in main ()
How do I update for the beta1 branch? If it's getting stable on this quad I
could try it on a 10 cpu onyx2 for some more concurrency 8)
With the current code I would not classify it as fixed. Maybe on dual boxes,
but not on a quad.