crash on SMP systems: socket transport in load group

VERIFIED FIXED in M14

Status

defect
P3
critical
VERIFIED FIXED
20 years ago
19 years ago

People

(Reporter: bsemrad, Assigned: warrensomebody)

Tracking

({crash})

Trunk
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [PDT+] w/b minus on 3/7 [have fixes!])

Attachments

(2 attachments)

(Reporter)

Description

20 years ago
System specs:

Dual PII 400 running Windows NT 4.0 Service Pack 6a, 128 Meg RAM, Tons of hard
drive. This is also a fairly young install of NT (about 4 months). Communicator
4.7 never crashes on this machine. This particular mozilla build is from
12-10-99 but Mozilla has been acting this way for me for at least a month or so.
I always remove the mozregistry.dat file, the user account directory that gets
created and the entire Moz directory every time I re-install a new daily build.
I have not modified the bookmarks or any other configuration other than to
accept the defaults at initial system startup. I'm not sure if this makes much
difference but I'm accessing the net on my NT box through a linux masquerading
system attached to a cable modem.


Problem:

Mozilla seems very unstable on my SMP system (PC specs above). Besides crashing
about 50% of the time on startup, I usually (90% of the time) get an exception
within 60 seconds of browser startup. Occasionally I can just start Mozilla and
let it sit for a minute or so and it will get a read exception while displaying
the initial mozilla.org web site. I tried this just now but couldn't get it to
reproduce within a few minutes or so. To get it to crash I can usually just type
in "http://www.slashdot.org" or "http://www.linuxworld.com" into the url bar and
press enter. Then, during the display of the home page of either of these sites
Mozilla will usually get an exception before the main page is completely
displayed. In my experience either of these sites will crash Mozilla about
30%-50% of the time.


Reproducing the crash:

Edit the url to be one of the above websites and hit enter. If Mozilla doesn't
crash put the cursor on the url bar and hit enter again. I can usually get it to
crash within the first couple of tries on either web site. I noticed that it
seems much more likely to crash the first few times I visited the site but it
may be my imagination. I went to each site about 10 times just now and got it to
crash about 4 times on each one. The dialog that popped up notifying me of the
exception seemed to be somewhat consistent in that it seemed to be crashing and
displaying the same exception message about every other time.
Adding some multi-threading gurus/perps to the cc list.

/be
One problem is 18110.  (Jan, I think that this is your reproducible testcase)
Service Pack 6a.  Wow.

We should try and find a developer with that service pack to see where we're
crashing.
Depends on: 18110

Comment 4

20 years ago
bsemrad@adsoft.net: could you attach a Dr Watson log from Windows NT?
(Reporter)

Comment 5

20 years ago
Here is an excerpt from an email that I sent to dougt@netscape.com about the
crash on my machine.

I went ahead and downloaded the source for Mozilla dated on 12-13-99 and
compiled it and then ran it. Below is a copy of the stack trace of the crash
when I tried to go to www.slashdot.org.


nsCOMPtr<nsProxyObject>::assign_with_AddRef(nsISupports * 0x02f69060) line 759 +
9 bytes

nsCOMPtr<nsProxyObject>::operator=(nsProxyObject * 0x02f69060) line 516

nsProxyObjectCallInfo::nsProxyObjectCallInfo(nsProxyObject * 0x02f69060,
nsXPTMethodInfo * 0x021ed670, unsigned int 3, nsXPTCVariant * 0x02f6a3d0,
unsigned int 4, PLEvent * 0x02f6a890) line 65

nsProxyObject::Post(unsigned int 3, nsXPTMethodInfo * 0x021ed670,
nsXPTCMiniVariant * 0x02d1fe18, nsIInterfaceInfo * 0x02f6e060) line 340 + 57
bytes

nsProxyEventObject::CallMethod(nsProxyEventObject * const 0x02f6f810, unsigned
short 3, const nsXPTMethodInfo * 0x021ed670, nsXPTCMiniVariant *
0x02d1fe18) line 391 + 55 bytes

PrepareAndDispatch(nsXPTCStubBase * 0x02f6f810, unsigned int 3, unsigned int *
0x02d1fecc, unsigned int * 0x02d1feb8) line 100 + 31 bytes


SharedStub() line 125

------------------------------------------

Doug then emailed me with the following:


Thanks for the great work.  This indeed is bug 18110.

I told Doug that I might have a go at fixing it but it has been several days
since I told him that and I haven't yet had time to look at it seriously so you
should probably not count on me for this one.

Updated

20 years ago
Assignee: leger → dp
Component: Browser-General → XPCOM

Updated

20 years ago
Assignee: dp → dougt

Comment 8

20 years ago
I've been seeing crashes in assign_with_AddRef under SMP Linux as well.  RH6.0 +
gcc 2.95.2 + binutils 2.9.1.0.25 + gtk 1.2.5 + glibc 2.1.2 (from RH6.1).
Kernels 2.2.12-2.2.14pre17.  On some pages (http://userfriendly.org/,
http://cnn.com/), I can just let the main page load, not touch the browser,
switch to another workspace (using WindowMaker) and the browser will crash
within 30secs.  This is from last night's testing with the M12 fullcircle build.

Comment 9

20 years ago
I've crashed about 90% of the time loading http://userfriendly.org/
getting one of these stacks.
Redhat 6.1, dual pentium II 450
Linux localhost.localdomain 2.2.12-20smp #1 SMP Mon Sep 27 10:34:45 EDT 1999
i686 unknown

gtk+-1.2.5-2
gcc version egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)
binutils-2.9.1.0.23-6
glibc-2.1.2-11

#0  0x3f in ?? ()
#1  0x40529bf1 in nsOnStopRequestEvent::~nsOnStopRequestEvent ()
   from /home/endico/mozilla/mozilla/dist/bin/components/libnecko.so
#2  0x4052962c in nsStreamListenerEvent::DestroyPLEvent ()
   from /home/endico/mozilla/mozilla/dist/bin/components/libnecko.so
#3  0x40176c6d in PL_DestroyEvent ()
   from /home/endico/mozilla/mozilla/dist/bin/libplds3.so
#4  0x40176c46 in PL_HandleEvent ()
   from /home/endico/mozilla/mozilla/dist/bin/libplds3.so
#5  0x40176b86 in PL_ProcessPendingEvents ()
   from /home/endico/mozilla/mozilla/dist/bin/libplds3.so
#6  0x401471ce in ?? () from /home/endico/mozilla/mozilla/dist/bin/libxpcom.so
#7  0x405b6ac4 in event_processor_callback ()
   from /home/endico/mozilla/mozilla/dist/bin/libwidget_gtk.so
#8  0x405b680f in our_gdk_io_invoke ()
   from /home/endico/mozilla/mozilla/dist/bin/libwidget_gtk.so
#9  0x4086052a in g_io_unix_dispatch () from /usr/lib/libglib-1.2.so.0
#10 0x40861be6 in g_main_dispatch () from /usr/lib/libglib-1.2.so.0
#11 0x408621a1 in g_main_iterate () from /usr/lib/libglib-1.2.so.0
#12 0x40862341 in g_main_run () from /usr/lib/libglib-1.2.so.0
#13 0x4078c209 in gtk_main () from /usr/lib/libgtk-1.2.so.0
#14 0x405b7067 in nsAppShell::Run ()
   from /home/endico/mozilla/mozilla/dist/bin/libwidget_gtk.so
#15 0x404d0c41 in nsAppShellService::Run ()
   from /home/endico/mozilla/mozilla/dist/bin/libnsappshell.so
#16 0x804adf1 in main1 ()
#17 0x804b225 in main ()
#18 0x4025e1eb in ?? () from /lib/libc.so.6

#0  0x40333238 in main_arena () from /lib/libc.so.6
#1  0x4003af66 in ?? ()
   from /home/endico/mozilla/mozilla/dist/bin/libraptorgfx.so
#2  0x4014ec84 in nsCOMPtr_base::assign_with_AddRef ()
   from /home/endico/mozilla/mozilla/dist/bin/libxpcom.so
#3  0x4139d18f in nsCOMPtr<nsIChannel>::operator= ()
   from /home/endico/mozilla/mozilla/dist/bin/components/libnecko_http.so
#4  0x41392de8 in nsHTTPRequest::~nsHTTPRequest ()
   from /home/endico/mozilla/mozilla/dist/bin/components/libnecko_http.so
#5  0x41392ea0 in nsHTTPRequest::Release ()
   from /home/endico/mozilla/mozilla/dist/bin/components/libnecko_http.so
#6  0x4138cba5 in nsHTTPChannel::~nsHTTPChannel ()
   from /home/endico/mozilla/mozilla/dist/bin/components/libnecko_http.so
#7  0x4138cd63 in nsHTTPChannel::Release ()
   from /home/endico/mozilla/mozilla/dist/bin/components/libnecko_http.so
#8  0x4052954e in nsStreamListenerEvent::~nsStreamListenerEvent ()
   from /home/endico/mozilla/mozilla/dist/bin/components/libnecko.so
#9  0x40529bf1 in nsOnStopRequestEvent::~nsOnStopRequestEvent ()
   from /home/endico/mozilla/mozilla/dist/bin/components/libnecko.so
#10 0x4052962c in nsStreamListenerEvent::DestroyPLEvent ()
   from /home/endico/mozilla/mozilla/dist/bin/components/libnecko.so
#11 0x40176c6d in PL_DestroyEvent ()
   from /home/endico/mozilla/mozilla/dist/bin/libplds3.so
#12 0x40176c46 in PL_HandleEvent ()
   from /home/endico/mozilla/mozilla/dist/bin/libplds3.so
#13 0x40176b86 in PL_ProcessPendingEvents ()
   from /home/endico/mozilla/mozilla/dist/bin/libplds3.so
#14 0x401471ce in nsEventQueueImpl::ProcessPendingEvents ()
   from /home/endico/mozilla/mozilla/dist/bin/libxpcom.so
#15 0x405b6ac4 in event_processor_callback ()
   from /home/endico/mozilla/mozilla/dist/bin/libwidget_gtk.so
#16 0x405b680f in our_gdk_io_invoke ()
   from /home/endico/mozilla/mozilla/dist/bin/libwidget_gtk.so
#17 0x4086052a in g_io_unix_dispatch () from /usr/lib/libglib-1.2.so.0
#18 0x40861be6 in g_main_dispatch () from /usr/lib/libglib-1.2.so.0
#19 0x408621a1 in g_main_iterate () from /usr/lib/libglib-1.2.so.0
#20 0x40862341 in g_main_run () from /usr/lib/libglib-1.2.so.0
#21 0x4078c209 in gtk_main () from /usr/lib/libgtk-1.2.so.0
#22 0x405b7067 in nsAppShell::Run ()
   from /home/endico/mozilla/mozilla/dist/bin/libwidget_gtk.so
#23 0x404d0c41 in nsAppShellService::Run ()
   from /home/endico/mozilla/mozilla/dist/bin/libnsappshell.so
#24 0x804adf1 in main1 ()
#25 0x804b225 in main ()
#26 0x4025e1eb in __libc_start_main (main=0x804b044 <main>, argc=1,
    argv=0xbffffac4, init=0x80493a4 <_init>, fini=0x804d6d8 <_fini>,
    rtld_fini=0x4000a610, stack_end=0xbffffabc)
    at ../sysdeps/generic/libc-start.c:90

Comment 11

20 years ago
added test case with 22 gif images. There was a theory that this problem
was due to animated gifs but reducing the test case to just two animated
gifs didn't cause a crash after 2 tries. I'm guessing that it has more to
do with having lots of threads running on different processors. The problem
may also have to do with one of the gifs being a lot bigger than the others.
It seemed like the userfriendly page had been done loading for a long time
but the throbber was still spinning. Apparently the extra time was being
spent loading extra frames on one of the animated gifs.
This is a dup of 18110 [dogfood] XPCOM/Proxy needs to be threadsafe!!
dougt, jband just whacked XPConnect to be threadsafe and otherwise refactored it
for correct thread-local vs. process-global, etc. considerations.  Since the
xpcom proxy code sprang from the brow of XPConnect, perhaps his changes could
help safen xpcom/proxy.  What's the prognosis?

/be

Updated

20 years ago
Status: NEW → ASSIGNED
Many of his changes can be massaged into xpcom/proxy.  However, because of the
very nature of xpcom/proxy, we must do a really good job at protecting ourselves.
Simply applying his changes is not good enough.
*** Bug 22648 has been marked as a duplicate of this bug. ***

Updated

20 years ago
Summary: Mozilla crashes often on SMP systems. → [Dogfood] Mozilla crashes often on SMP systems.

Comment 16

20 years ago
ugh! Mozilla is pretty unusable for me any more at home on my smp box.
It crashes too much. Please please please get dougt an smp box to debug
with.

I noticed that looking at slashdot.org is causing problems too. It too
has lots of images/page and often uses animated gifs. I got this stack
after loading mozilla to home page, then loading slashdot.org, then
loading mozillazine and staying there a while. This is with this morning's
build.


#0  0x4017451a in nsProxyObject::Post (this=0x41d585f8, methodIndex=4,
    methodInfo=0x407bbd0c, params=0xbf5ffa5c, interfaceInfo=0x41d00870)
    at nsProxyEvent.cpp:423
#1  0x40176747 in nsProxyEventObject::CallMethod (this=0x419dbd28,
    methodIndex=4, info=0x407bbd0c, params=0xbf5ffa5c)
    at nsProxyEventObject.cpp:391
#2  0x40181924 in PrepareAndDispatch (self=0x419dbd28, methodIndex=4,
    args=0xbf5ffb14) at xptcstubs_unixish_x86.cpp:92
#3  0x40181a4a in nsXPTCStubBase::Stub4 (this=0x419dbd28)
    at ../../../../../../dist/include/xptcstubsdef.inc:6
#4  0x4061211b in ?? ()
   from /home/endico/mozilla/mozilla/dist/bin/components/libnecko.so
#5  0x4060f4a0 in ?? ()
   from /home/endico/mozilla/mozilla/dist/bin/components/libnecko.so
#6  0x40613307 in ?? ()
   from /home/endico/mozilla/mozilla/dist/bin/components/libnecko.so
#7  0x401716b5 in nsThread::Main (arg=0x41b452c0) at nsThread.cpp:83
#8  0x402138fb in _pt_root (arg=0x41b16d98) at ptthread.c:157
#9  0x4022deca in pthread_start_thread (arg=0xbf5ffe60) at manager.c:213
*** Bug 18659 has been marked as a duplicate of this bug. ***

Comment 18

20 years ago
yet another stack trace. looked at mozilla.org, slashdot.org, www.benews.com.
It crashed at benews.com in the middle of scrolling after having sat there a
while. Maybe this is timer based. It seems like this morning's crashes are
happening after 10 minutes or so and some happen while the browser is idle.

#0  0x81e48e1 in ?? ()
#1  0x4019d670 in nsCOMPtr<nsProxyObject>::Assert_NoQueryNeeded (
    this=0x4210bb30) at ../../../dist/include/nsCOMPtr.h:444
#2  0x4019d630 in nsCOMPtr<nsProxyObject>::operator= (this=0x4210bb30,
    rhs=0x81ebd20) at ../../../dist/include/nsCOMPtr.h:516
#3  0x40173448 in nsProxyObjectCallInfo::nsProxyObjectCallInfo (
    this=0x4210bb10, owner=0x81ebd20, methodInfo=0x407bbcbc, methodIndex=4,
    parameterList=0x4210bad8, parameterCount=3, event=0x4210bab8)
    at nsProxyEvent.cpp:63
#4  0x401743e0 in nsProxyObject::Post (this=0x81ebd20, methodIndex=4,
    methodInfo=0x407bbcbc, params=0xbf5ffa5c, interfaceInfo=0x824a378)
    at nsProxyEvent.cpp:374
#5  0x40176747 in nsProxyEventObject::CallMethod (this=0x81a4708,
    methodIndex=4, info=0x407bbcbc, params=0xbf5ffa5c)
    at nsProxyEventObject.cpp:391
#6  0x40181924 in PrepareAndDispatch (self=0x81a4708, methodIndex=4,
    args=0xbf5ffb14) at xptcstubs_unixish_x86.cpp:92
#7  0x40181a4a in nsXPTCStubBase::Stub4 (this=0x81a4708)
    at ../../../../../../dist/include/xptcstubsdef.inc:6
#8  0x4061211b in nsSocketTransport::fireStatus (this=0x81a73c8, aCode=5)
    at nsSocketTransport.cpp:1897
#9  0x4060f4a0 in nsSocketTransport::Process (this=0x81a73c8, aSelectFlags=0)
    at nsSocketTransport.cpp:539
#10 0x40613307 in nsSocketTransportService::Run (this=0x41b6c3b8)
    at nsSocketTransportService.cpp:467
#11 0x401716b5 in nsThread::Main (arg=0x41b48728) at nsThread.cpp:83
#12 0x402138fb in _pt_root (arg=0x41b48eb0) at ptthread.c:157
#13 0x4022deca in pthread_start_thread (arg=0xbf5ffe60) at manager.c:213
Doug, can this be fixed in M13?  Or at least get a Target Milestone.

/be

Comment 20

20 years ago
Here's another stack very similar to one before. It crashed sitting at
a slashdot article while i was away. Maybe it was reloading an ad? I
forget if their ads refresh themselves.

#0  0x85b579c in ?? ()
#1  0x4060e7ab in nsSocketTransport::~nsSocketTransport (this=0x85d75a8,
    __in_chrg=3) at nsSocketTransport.cpp:223
#2  0x40610760 in nsSocketTransport::Release (this=0x85d75a8)
    at nsSocketTransport.cpp:1191
#3  0x416b9eae in nsCOMPtr<nsIChannel>::assign_assuming_AddRef (
    this=0x862bce8, newPtr=0x0) at ../../../../dist/include/nsCOMPtr.h:415
#4  0x416bea8c in nsCOMPtr<nsIChannel>::assign_with_AddRef (this=0x862bce8,
    rawPtr=0x0) at ../../../../dist/include/nsCOMPtr.h:760
#5  0x416bf7e3 in nsCOMPtr<nsIChannel>::operator= (this=0x862bce8, rhs=0x0)
    at ../../../../dist/include/nsCOMPtr.h:515
#6  0x416b05cf in nsHTTPRequest::~nsHTTPRequest (this=0x862bcd0, __in_chrg=3)
    at nsHTTPRequest.cpp:140
#7  0x416b0720 in nsHTTPRequest::Release (this=0x862bcd0)
    at nsHTTPRequest.cpp:151
#8  0x416a9cc5 in nsHTTPChannel::~nsHTTPChannel (this=0x85b7af8, __in_chrg=3)
    at nsHTTPChannel.cpp:117
#9  0x416a9f22 in nsHTTPChannel::Release (this=0x85b7af8)
    at nsHTTPChannel.cpp:127
#10 0x4060b78e in nsStreamListenerEvent::~nsStreamListenerEvent (
    this=0x83fe668, __in_chrg=3) at nsAsyncStreamListener.cpp:77
#11 0x4060c091 in nsOnStopRequestEvent::~nsOnStopRequestEvent (this=0x83fe668,
    __in_chrg=3) at nsAsyncStreamListener.cpp:257
#12 0x4060b8bf in nsStreamListenerEvent::DestroyPLEvent (aEvent=0x84ba438)
    at nsAsyncStreamListener.cpp:104
#13 0x401d841b in PL_DestroyEvent (self=0x84ba438) at plevent.c:545
#14 0x401d83b9 in PL_HandleEvent (self=0x84ba438) at plevent.c:532
#15 0x401d827c in PL_ProcessPendingEvents (self=0x80aa1f8) at plevent.c:483
#16 0x4016faa9 in nsEventQueueImpl::ProcessPendingEvents (this=0x80aa1d0)
    at nsEventQueue.cpp:201
#17 0x40830da4 in event_processor_callback (data=0x80aa1d0, source=6,
    condition=GDK_INPUT_READ) at nsAppShell.cpp:141
#18 0x40830a2f in our_gdk_io_invoke (source=0x8156560, condition=G_IO_IN,
    data=0x81f3308) at nsAppShell.cpp:54
#19 0x406d752a in g_io_unix_dispatch () from /usr/lib/libglib-1.2.so.0
#20 0x406d8be6 in g_main_dispatch () from /usr/lib/libglib-1.2.so.0
#21 0x406d91a1 in g_main_iterate () from /usr/lib/libglib-1.2.so.0
#22 0x406d9341 in g_main_run () from /usr/lib/libglib-1.2.so.0
#23 0x40907209 in gtk_main () from /usr/lib/libgtk-1.2.so.0
#24 0x408313a7 in nsAppShell::Run (this=0x8095350) at nsAppShell.cpp:304
#25 0x4058ffbd in nsAppShellService::Run (this=0x80a9fd0)
    at nsAppShellService.cpp:465
#26 0x804bf3d in main1 (argc=1, argv=0xbffffba4) at nsAppRunner.cpp:609
#27 0x804c3c7 in main (argc=1, argv=0xbffffba4) at nsAppRunner.cpp:697

Comment 21

20 years ago
yet another stack trace. note the assert this time. It crashed while
loading http://www.mozilla.org/banners/ It seemed like it was done loading
but the throbber kept going.

Document http://www.mozilla.org/ loaded successfully
Document: Done (9.162 secs)
WEBSHELL+ = 4
Opening file signon.tbl failed
FindShortcut: in='http://www.mozilla.org/banners/'  out='null'
###!!! ASSERTION: You can't dereference a NULL nsCOMPtr with operator->().:
'mRawPtr != 0', file ../../dist/include/nsCOMPtr.h, line 569
###!!! Break: at file ../../dist/include/nsCOMPtr.h, line 569
[Switching to Thread 16561]

Program received signal SIGSEGV, Segmentation fault.
0x4017451a in ?? () from /home/endico/mozilla/mozilla/dist/bin/libxpcom.so
(gdb) where
#0  0x4017451a in nsProxyObject::Post (this=0x407575f0, methodIndex=4,
    methodInfo=0x816b65c, params=0xbf5ffa5c, interfaceInfo=0x8523ae8)
    at nsProxyEvent.cpp:423
#1  0x40176747 in nsProxyEventObject::CallMethod (this=0x40705a90,
    methodIndex=4, info=0x816b65c, params=0xbf5ffa5c)
    at nsProxyEventObject.cpp:391
#2  0x40181924 in PrepareAndDispatch (self=0x40705a90, methodIndex=4,
    args=0xbf5ffb14) at xptcstubs_unixish_x86.cpp:92
#3  0x40181a4a in nsXPTCStubBase::Stub4 (this=0x40705a90)
    at ../../../../../../dist/include/xptcstubsdef.inc:6
#4  0x4061211b in nsSocketTransport::fireStatus (this=0x4073aed8, aCode=5)
    at nsSocketTransport.cpp:1897
#5  0x4060f4a0 in nsSocketTransport::Process (this=0x4073aed8, aSelectFlags=0)
    at nsSocketTransport.cpp:539
#6  0x40613307 in nsSocketTransportService::Run (this=0x407375e0)
    at nsSocketTransportService.cpp:467
#7  0x401716b5 in nsThread::Main (arg=0x40738488) at nsThread.cpp:83
#8  0x402138fb in ?? () from /home/endico/mozilla/mozilla/dist/bin/libnspr3.so
#9  0x4022deca in pthread_start_thread (arg=0xbf5ffe60) at manager.c:213

Updated

20 years ago
Whiteboard: [PDT+]
Target Milestone: M13

Comment 22

20 years ago
Putting on PDT+ radar.

Comment 23

20 years ago
Posted file single gif image

Comment 24

20 years ago
A single animated gif image is enough to cause a crash although it may take
a while. Load the image and wait. Eventually mozilla will crash. Sometimes
it crashes immediately, sometimes it takes an  hour or more. Oddly, I found
that i get good stacks when i view an html file with a link to an animated
gif but the stack is corrupted if I type the url of the gif image directly
into the location bar.

It looks like a networking problem that happens to be exercised by animated
gifs because they aren't cached, and have to be downloaded from the source
over and over.


------------
viewing animated gif
------------
Document http://userfriendly.org/images/buttons/ufbook.gif loaded successfully
Document: Done (35.863 secs)
[Switching to Thread 25851]

Program received signal SIGSEGV, Segmentation fault.
0x84dc660 in ?? ()
(gdb) where
#0  0x84dc660 in ?? ()
#1  0x19d8f2bf in ?? ()
Cannot access memory at address 0x5ff8e808.

---------------
loaded test case 2, an html file that displays an animated gif
----------------
(gdb) where
#0  0x40d00138 in ?? ()
#1  0x416c8a9c in nsCOMPtr<nsIChannel>::assign_with_AddRef (this=0x40d22ad8,
    rawPtr=0x0) at ../../../../dist/include/nsCOMPtr.h:760
#2  0x416c97f3 in nsCOMPtr<nsIChannel>::operator= (this=0x40d22ad8, rhs=0x0)
    at ../../../../dist/include/nsCOMPtr.h:515
#3  0x416ba5df in nsHTTPRequest::~nsHTTPRequest (this=0x40d22ac0, __in_chrg=3)
    at nsHTTPRequest.cpp:140
#4  0x416ba730 in nsHTTPRequest::Release (this=0x40d22ac0)
    at nsHTTPRequest.cpp:151
#5  0x416b3cd5 in nsHTTPChannel::~nsHTTPChannel (this=0x418a2588, __in_chrg=3)
    at nsHTTPChannel.cpp:117
#6  0x416b3f32 in nsHTTPChannel::Release (this=0x418a2588)
    at nsHTTPChannel.cpp:127
#7  0x4060ca1e in nsStreamListenerEvent::~nsStreamListenerEvent (
    this=0x853efc0, __in_chrg=3) at nsAsyncStreamListener.cpp:77
#8  0x4060d321 in nsOnStopRequestEvent::~nsOnStopRequestEvent (this=0x853efc0,
    __in_chrg=3) at nsAsyncStreamListener.cpp:257
#9  0x4060cb4f in nsStreamListenerEvent::DestroyPLEvent (aEvent=0x84bae80)
    at nsAsyncStreamListener.cpp:104
#10 0x401d941b in PL_DestroyEvent (self=0x84bae80) at plevent.c:545
#11 0x401d93b9 in PL_HandleEvent (self=0x84bae80) at plevent.c:532
#12 0x401d927c in PL_ProcessPendingEvents (self=0x80ab660) at plevent.c:483
#13 0x4016fc3c in nsEventQueueImpl::ProcessPendingEvents (this=0x80ab638)
    at nsEventQueue.cpp:228
#14 0x406c1064 in event_processor_callback (data=0x80ab638, source=7,
    condition=GDK_INPUT_READ) at nsAppShell.cpp:141
#15 0x406c0cef in our_gdk_io_invoke (source=0x811fda8, condition=G_IO_IN,
    data=0x82344d0) at nsAppShell.cpp:54
#16 0x4087352a in g_io_unix_dispatch () from /usr/lib/libglib-1.2.so.0
#17 0x40874be6 in g_main_dispatch () from /usr/lib/libglib-1.2.so.0
#18 0x408751a1 in g_main_iterate () from /usr/lib/libglib-1.2.so.0
#19 0x40875341 in g_main_run () from /usr/lib/libglib-1.2.so.0
#20 0x4079c209 in gtk_main () from /usr/lib/libgtk-1.2.so.0
#21 0x406c1667 in nsAppShell::Run (this=0x808d038) at nsAppShell.cpp:304
#22 0x4059107d in nsAppShellService::Run (this=0x80ab438)
    at nsAppShellService.cpp:465
#23 0x804bf3d in main1 (argc=1, argv=0xbffffba4) at nsAppRunner.cpp:609
#24 0x804c3c7 in main (argc=1, argv=0xbffffba4) at nsAppRunner.cpp:697
(gdb) print *this
No symbol "this" in current context.
(gdb) print this
No symbol "this" in current context.
(gdb) up
#1  0x416c8a9c in nsCOMPtr<nsIChannel>::assign_with_AddRef (this=0x40d22ad8,
    rawPtr=0x0) at ../../../../dist/include/nsCOMPtr.h:760
760	    assign_assuming_AddRef(NS_REINTERPRET_CAST(T*, rawPtr));
(gdb) print this
$1 = (nsCOMPtr<nsIChannel> *) 0x0
(gdb) print *this
Cannot access memory at address 0x0.
(gdb) up
#2  0x416c97f3 in nsCOMPtr<nsIChannel>::operator= (this=0x40d22ad8, rhs=0x0)
    at ../../../../dist/include/nsCOMPtr.h:515
515	          assign_with_AddRef(rhs);
(gdb) print this
$2 = (nsCOMPtr<nsIChannel> *) 0x40d22ad8
(gdb) print *this
$3 = {mRawPtr = 0x0}
(gdb) up
#3  0x416ba5df in nsHTTPRequest::~nsHTTPRequest (this=0x40d22ac0, __in_chrg=3)
    at nsHTTPRequest.cpp:140
140	    mTransport = null_nsCOMPtr();
(gdb) print this
$4 = (nsHTTPRequest *) 0x40d22ac0
(gdb) print *this
$5 = {<nsIStreamObserver> = {<nsISupports> = {
      _vptr. = 0x416cfcc0 <nsHTTPRequest virtual table>}, <No data fields>},
<nsIRequest> = {<nsISupports> = {
      _vptr. = 0x416cfc80 <nsHTTPRequest::nsIRequest virtual table>}, <No data
fields>}, mRefCnt = 1, mMethod = HM_GET, mURI = {mRawPtr = 0x40d693b8},
  mVersion = HTTP_ONE_ZERO, mTransport = {mRawPtr = 0x0},
  mConnection = 0x418a2588, mHeaders = {mHTTPHeaders = {
      mRawPtr = 0x40d7dca8}}, mUsingProxy = 0, mRequestBuffer = {<nsStr> = {
      mLength = 0, mCapacity = 128, mCharSize = eOneByte, mOwnsBuffer = 1, {
        mStr = 0x40da2140 "", mUStr = 0x40da2140}},
    _vptr. = 0x401b0084 <nsCString virtual table>}, mPostDataStream = {
    mRawPtr = 0x0}}
(gdb) quit
Whiteboard: [PDT+] → [PDT+] need SMP machine
Doug is not looking at this until he gets his hands on a machine that exhibits
the problem. Anyone else want to take it? Anyone want to get another processor
for Doug?

Comment 26

20 years ago
Brian, do we have a machine that dougt can use to debug this? One of those
new solaris machines? (with purify?)
(Assignee)

Comment 27

20 years ago
Here's the info I dug up on SMP boxes (for the ambitious):

Bill Law and dp have dual processor machines, but they're 200mhz and dp thinks
that's too slow. Rickg has a 733mhz (?!) machine but I'm not sure if
it's here or in san diego. Cyeh may be able to whip something up too. Alec says
he sees deadlocks running an MP kernel on a single-processor
machine.

Warren
If we think the problem is in xpcom/proxy, we could try a code review, even
before dougt sits in front of a fast SMP machine (or whatever).  I'm up for it,
and jband would probably be willing to help.

/be
Brendan, et al.: we (jband and I) have already done this.  I need to protect my
hash tables and proxyCallInfo class.  I could merely just code these fixes and
check it in, but I would rather be able to verify (for myself) that the problem
does go away when I do this.
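As a minimal sketch of what "protecting a hash table" can look like, assuming a table shared between threads: every read and write takes the same mutex. The class `ProxyTable` and its methods are hypothetical names for illustration and are not taken from xpcom/proxy.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a lookup table shared between threads.
// Names are illustrative only; this is not Mozilla's code.
class ProxyTable {
public:
    // Returns the id registered for `key`, assigning the next free id
    // if the key has not been seen before. The whole find-or-insert
    // sequence runs under one lock so two threads cannot both insert.
    int GetOrCreate(const std::string& key) {
        std::lock_guard<std::mutex> guard(mLock);
        auto it = mEntries.find(key);
        if (it != mEntries.end()) {
            return it->second;
        }
        int id = mNextId++;
        mEntries.emplace(key, id);
        return id;
    }

    // Reads take the same lock as writes.
    std::size_t Size() const {
        std::lock_guard<std::mutex> guard(mLock);
        return mEntries.size();
    }

private:
    mutable std::mutex mLock;
    std::unordered_map<std::string, int> mEntries;
    int mNextId = 0;
};
```

The point of the sketch is only that lookup and insertion form one critical section; whether the real tables need finer-grained locking is not something this thread establishes.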
This is silly: you know of thread-safety bugs (MP or Uniprocessor, I dunno),
there are people in the Mozilla community being bitten by these bugs (including
endico@mozilla.org), but you don't wanna code the fixes until you can test 'em
yourself?

This is not the way of the Mozilla bazaar.  Can you hack up fixes to the current
revs of the files, and attach cvs diff -u output to this bug, so others can at
least help test for ya?  Thanks.

/be

Comment 31

20 years ago
An even faster way to reproduce this bug is to use mail. Opening a folder
with 2k messages took 5 tries because it made mozilla crash.

Comment 32

20 years ago
Attach a patch and i'll be happy to test it.
(And let me know what testing needs to be done)

Comment 33

20 years ago
Rebuilt from the top with dougt's changes and still crashed.

#0  0x4019eb33 in nsCOMPtr<nsProxyObject>::assign_with_AddRef (
    this=0x40767028, rawPtr=0x8665848) at ../../../dist/include/nsCOMPtr.h:759
#1  0x4019ef67 in nsCOMPtr<nsProxyObject>::operator= (this=0x40767028,
    rhs=0x8665848) at ../../../dist/include/nsCOMPtr.h:515
#2  0x40174b44 in nsProxyObjectCallInfo::nsProxyObjectCallInfo (
    this=0x40767008, owner=0x8665848, methodInfo=0x816bfe0, methodIndex=3,
    parameterList=0x40746e38, parameterCount=4, event=0x40766f90)
    at nsProxyEvent.cpp:70
#3  0x40175b00 in nsProxyObject::Post (this=0x8665848, methodIndex=3,
    methodInfo=0x816bfe0, params=0xbf5ffadc, interfaceInfo=0x86831c8)
    at nsProxyEvent.cpp:384
#4  0x40177ff7 in nsProxyEventObject::CallMethod (this=0x8680668,
    methodIndex=3, info=0x816bfe0, params=0xbf5ffadc)
    at nsProxyEventObject.cpp:394
#5  0x40183184 in PrepareAndDispatch (self=0x8680668, methodIndex=3,
    args=0xbf5ffb94) at xptcstubs_unixish_x86.cpp:92
#6  0x4018325e in nsXPTCStubBase::Stub3 (this=0x8680668)
    at ../../../../../../dist/include/xptcstubsdef.inc:5
#7  0x4061343e in nsSocketTransport::doRead (this=0x868e328, aSelectFlags=1)
    at nsSocketTransport.cpp:976
#8  0x40612755 in nsSocketTransport::Process (this=0x868e328, aSelectFlags=1)
    at nsSocketTransport.cpp:512
#9  0x406166d7 in nsSocketTransportService::Run (this=0x40768b88)
    at nsSocketTransportService.cpp:467
#10 0x40172d05 in nsThread::Main (arg=0x4073b4f8) at nsThread.cpp:83
#11 0x402158fb in _pt_root (arg=0x407469a0) at ptthread.c:157
#12 0x4022feca in pthread_start_thread (arg=0xbf5ffe60) at manager.c:213

Comment 34

20 years ago
I'm still crashing but things don't seem as fragile as before. I was able to
download my mailbox headers twice in a row without crashing. Last time it
crashed 4/5 times. Doug's changes seem to have made an improvement.
(Assignee)

Comment 35

20 years ago
Probably the extra locks just slowed down the timing of things, shrinking the
window of vulnerability. Dawn -- sounds like we should get a debug build/env on
your machine so that we can diagnose the problem when it happens. Can you set
that up?

Comment 36

20 years ago
I just got a crash with a fresh tree on a dual 350 PII running linux, here's a
stack trace.

Program received signal SIGSEGV, Segmentation fault.
0x40175c3a in nsProxyObject::Post (this=0x860ff28, methodIndex=4,
    methodInfo=0x812ac44, params=0xbf5ffa38, interfaceInfo=0x849b158)
    at nsProxyEvent.cpp:433
433             mDestQueue->PostEvent(event);
(gdb) bt
#0  0x40175c3a in nsProxyObject::Post (this=0x860ff28, methodIndex=4,
    methodInfo=0x812ac44, params=0xbf5ffa38, interfaceInfo=0x849b158)
    at nsProxyEvent.cpp:433
#1  0x40177ff7 in nsProxyEventObject::CallMethod (this=0x862c7f0,
    methodIndex=4, info=0x812ac44, params=0xbf5ffa38)
    at nsProxyEventObject.cpp:394
#2  0x40183184 in PrepareAndDispatch (self=0x862c7f0, methodIndex=4,
    args=0xbf5ffaf0) at xptcstubs_unixish_x86.cpp:92
#3  0x401832aa in nsXPTCStubBase::Stub4 (this=0x862c7f0)
    at ../../../../../../dist/include/xptcstubsdef.inc:6
#4  0x4060a4eb in nsSocketTransport::fireStatus (this=0x862c900, aCode=3)
    at nsSocketTransport.cpp:1903
#5  0x40607860 in nsSocketTransport::Process (this=0x862c900, aSelectFlags=0)
    at nsSocketTransport.cpp:539
#6  0x4060b0c6 in nsSocketTransportService::ProcessWorkQ (this=0x84f64d0)
    at nsSocketTransportService.cpp:259
#7  0x4060b794 in nsSocketTransportService::Run (this=0x84f64d0)
    at nsSocketTransportService.cpp:493
#8  0x40172d05 in nsThread::Main (arg=0x84f6810) at nsThread.cpp:83
#9  0x402158fb in _pt_root (arg=0x85bf110) at ptthread.c:157
#10 0x4022feca in pthread_start_thread (arg=0xbf5ffe60) at manager.c:213
(gdb) print this
$2 = (nsProxyObject *) 0x860ff28
(gdb) print *this
$3 = {<nsISupports> = {_vptr. = 0x883ce90}, mRefCnt = 140573992,
  mProxyType = 6, mDestQueue = {mRawPtr = 0x0},
  mRealObject = {<nsCOMPtr_base> = {mRawPtr = 0x0}, <No data fields>},
  mLock = 0x882a7b8}

As far as I can tell "this" was destroyed while one thread was executing
this->Post(), since there's a check for !mDestQueue at the beginning of
nsProxyObject::Post(), so this should not happen...
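A null check at the top of a method does not help if another thread can clear the member between the check and the use. One common way to close that window, sketched here under assumptions, is to copy the member into a local strong reference while holding a lock and then use only the local copy. `Proxy`, `EventQueue`, and `Disconnect` are hypothetical names using `std::shared_ptr`, not the actual nsProxyObject code.

```cpp
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical destination queue; illustrative only.
struct EventQueue {
    std::mutex lock;
    std::vector<int> events;
    void PostEvent(int e) {
        std::lock_guard<std::mutex> g(lock);
        events.push_back(e);
    }
};

// Hypothetical proxy that forwards events to a queue.
class Proxy {
public:
    explicit Proxy(std::shared_ptr<EventQueue> q) : mDestQueue(std::move(q)) {}

    // Returns false if the proxy has already been disconnected.
    bool Post(int event) {
        std::shared_ptr<EventQueue> queue;
        {
            // Take a local strong reference under the lock, so a
            // concurrent Disconnect() cannot free the queue under us.
            std::lock_guard<std::mutex> g(mLock);
            queue = mDestQueue;
        }
        if (!queue) {
            return false;
        }
        queue->PostEvent(event);  // uses the local copy, not the member
        return true;
    }

    void Disconnect() {
        std::lock_guard<std::mutex> g(mLock);
        mDestQueue.reset();
    }

private:
    std::mutex mLock;
    std::shared_ptr<EventQueue> mDestQueue;
};
```

This only protects the member. The caller must also hold its own strong reference to the `Proxy` for the duration of the call, otherwise the object itself can still be destroyed mid-call, which is what the dump above appears to show.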

Comment 37

20 years ago
Doug,
Looking at EventHandler (shouldn't this be static or something?)...
http://lxr.mozilla.org/seamonkey/source/xpcom/proxy/src/nsProxyEvent.cpp#460
...I see that you are holding a per object lock while invoking
XPTC_InvokeByIndex. This seems excessive and/or dangerous. Aren't you then
precluding reentrant calls via the proxy on the proxied object? Do you really
need to protect more than your shared tables of information about the proxies
and the refcount management of the proxies themselves?

I think that you should limit the scope of all locks to the bare minimum that is
absolutely required so that you decrease the chance of deadlocks or nspr
assertions on attempts to reenter a non-reentrant lock.
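The narrower locking suggested here can be sketched as follows, assuming the lock only needs to guard the proxy's own state: copy what the call needs while holding the lock, release it, and only then invoke the target. `CallProxy` is a hypothetical class and `std::function` stands in for the XPTC_InvokeByIndex dispatch.

```cpp
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical proxy; std::function stands in for the real dispatch.
class CallProxy {
public:
    void SetTarget(std::function<int(int)> target) {
        std::lock_guard<std::mutex> g(mLock);
        mTarget = std::move(target);
    }

    // Returns the target's result, or -1 if no target is set.
    int Invoke(int arg) {
        std::function<int(int)> target;
        {
            // The lock covers only the copy of shared state.
            std::lock_guard<std::mutex> g(mLock);
            target = mTarget;
        }
        // The lock is released here, so the target may call back into
        // Invoke() without re-entering a non-recursive mutex.
        return target ? target(arg) : -1;
    }

private:
    std::mutex mLock;
    std::function<int(int)> mTarget;
};
```

If `Invoke` held `mLock` across the call, a target that called back into the same proxy would try to lock a `std::mutex` the thread already owns, which is undefined behavior and in practice usually a deadlock.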

Updated

20 years ago
Status: ASSIGNED → RESOLVED
Last Resolved: 20 years ago
Resolution: --- → DUPLICATE
good catch, both event handlers need to be static.  The scope of the locks needs
to be reduced.

marking this bug as a dup of 18110

*** This bug has been marked as a duplicate of 18110 ***

Comment 39

20 years ago
On a Linux SMP machine, Mozilla M13 crashes almost immediately. It also crashes
while you are doing nothing.
Status: RESOLVED → REOPENED
anssi@bigfoot.com, why was this reopened if it is in fact a duplicate of 18110? 
Your comments don't argue that it is a separate bug from 18110, so I don't see 
the point in reopening.  Resolving it as a duplicate doesn't mean that the bug 
it describes, duplicated by an earlier bugzilla report, is fixed -- it just 
means we know that the newer bug is a dup.

/be

Comment 41

20 years ago
Clearing DUPLICATE resolution due to reopen.
closing.  see other bug.
Status: REOPENED → RESOLVED
Last Resolved: 20 years ago

Comment 43

20 years ago
I brought my box in to work today and let dougt hack.
He thinks this may actually be a duplicate of 24711.

Comment 44

20 years ago
Re-opening because this bug turned out not to be a duplicate
of 18110. Marking as dependent on 24711 and removing dependency on 18110.
Status: RESOLVED → REOPENED
Depends on: 24711
No longer depends on: 18110

Comment 45

20 years ago
assigning to http god.
Assignee: dougt → gagan
Status: REOPENED → NEW

Comment 46

20 years ago
Clearing Duplicate resolution due to reopen.
Resolution: DUPLICATE → ---
(Assignee)

Comment 47

20 years ago
Moving to m14.
Keywords: beta1, crash
Target Milestone: M13 → M14

Comment 48

20 years ago
Putting dogfood in the keyword field.
Keywords: dogfood

Updated

20 years ago
Summary: [Dogfood] Mozilla crashes often on SMP systems. → Mozilla crashes often on SMP systems.
Putting in correct component.
Component: XPCOM → Networking
(Assignee)

Comment 50

20 years ago
Why is this considered Networking now? It's purely a proxy problem, isn't it? It 
could affect anything.

And why is this owned by Gagan?
No.  This is the problem with having socket transports in the load group.
The second onStop() crashes SMP machines.
(Assignee)

Comment 52

20 years ago
Changing summary from: Mozilla crashes often on SMP systems.
To: crash on SMP systems: socket transport in load group

Reassigning to Rick Potts because I think he's working on this now.
Assignee: gagan → rpotts
Summary: Mozilla crashes often on SMP systems. → crash on SMP systems: socket transport in load group

Comment 53

20 years ago
hey doug,

are you sure that there is a SocketTransport sitting in a load group?  I would 
have thought that that was not possible...

-- rick
gagan and jud are in the know.  

Comment 55

19 years ago
This is not Windows only, I've been seeing this on Linux for a while too, changing
OS and Platform...
OS: Windows NT → All
Hardware: PC → All

Comment 56

19 years ago
Status whiteboard says you need an SMP machine. Hasn't dougt's arrived yet?
Mozilla is pretty useless for me at home until this bug gets fixed. I could
bring the machine in again but the last time I tried that the motherboard
fried.

Comment 57

19 years ago
Hey Rick; I'm seeing these crashes _constantly_ on my home machine. Almost any 
page I visit will eventually end up in this state. Sometimes it's just visiting 
the page, sometimes it's when I leave the page, sometimes it's just sitting idle 
(so to speak). I'll start forwarding stack traces.

Comment 58

19 years ago
Here's an *all-too-typical* stack trace on my SMP/NT box...

nsStreamListenerEvent::~nsStreamListenerEvent() line 77 + 24 bytes
nsOnStopRequestEvent::~nsOnStopRequestEvent() line 258 + 8 bytes
nsOnStopRequestEvent::`scalar deleting destructor'(unsigned int 1) + 15 bytes
nsStreamListenerEvent::DestroyPLEvent(PLEvent * 0x02fe63e0) line 104 + 30 bytes
PL_DestroyEvent(PLEvent * 0x02fe63e0) line 549 + 10 bytes
PL_HandleEvent(PLEvent * 0x02fe63e0) line 536 + 9 bytes
PL_ProcessPendingEvents(PLEventQueue * 0x02382cd0) line 487 + 9 bytes
_md_EventReceiverProc(HWND__ * 0x003e0550, unsigned int 49342, unsigned int 0, 
long 37235920) line 975 + 9 bytes
USER32! 77e71820()
02382cd0()

I'm certainly willing to drive this machine remotely if someone wants to try to 
debug this problem.
(Assignee)

Comment 59

19 years ago
Line 77 looks like the release of mContext or possibly mChannel, the line above 
it. Rickg: Can you see if one of these looks like it has already been deleted? 
Maybe we've got a race between an addref on one thread and a release on this one.

Comment 60

19 years ago
For that particular stack trace, it is possible that the crash is happening on 
the NS_RELEASE(mContext) because mContext has already been deleted!

It turns out that mContext is really an nsHTTPChannel.  Unfortunately, 
nsHTTPChannel *does not* have thread-safe implementations of AddRef() and 
Release()...

Since these methods are called on multiple threads (i.e. socket transport and UI) 
there can be problems :-)

I'll check in a fix to make AddRef() and Release() thread-safe and we'll see if 
things get any better...

Are you seeing any other stack traces?

Comment 61

19 years ago
I've just checked in thread-safe AddRef/Release implementations for 
nsHTTPChannel, nsHTTPResponseListener, nsHTTPRequest and nsHTTPEncodeStream.

I suspect that other nsIInputStream implementations (besides nsHTTPEncodeStream) 
will need thread-safe Addref/Release implementations...  In particular the 
"string stream" 
(Assignee)

Comment 62

19 years ago
Rick,

I've never understood how making addref and release threadsafe really solved 
things. If one thread might be doing the last release while another is trying to 
addref, there's obviously some higher-level synchronization needed, isn't there?

Or maybe it's just that the thread doing the release shouldn't have been the 
final release -- but the refcount got tromped somewhere along the way. It still 
seems like more than the refcount needs to be protected in this case.

Warren

Comment 63

19 years ago
One way it can help is that this threadsafety code makes the manipulation of the 
refcount atomic. If you have one release happening when another addref is going 
on then the release *might* set the refcount to a lower number than it should be 
- ignoring the addref's change; i.e --refcnt is really (get, decrement, store). 
If another thread changes the refcount in the middle of that non-atomic set of 
actions then you can stomp its change. Only later does that get you when the 
'final' release comes when the refcount should really not be zero yet.
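
To make the lost update concrete, here is a small sketch that replays the bad interleaving on a single thread so the outcome is deterministic. It is an illustration only, not the NS_IMPL_THREADSAFE code; PlainCount and both helper functions are hypothetical names.

```cpp
#include <atomic>
#include <cassert>

// Non-atomic refcount: --refcnt is really three steps (get, decrement, store).
struct PlainCount {
    int value;
    int Get() const { return value; }
    void Store(int v) { value = v; }
};

// Replays the interleaving described above: a Release and an AddRef both
// read the old count before either one stores its result.
inline int InterleavedReleaseAndAddRef(PlainCount& rc) {
    int seenByRelease = rc.Get();   // thread A: get
    int seenByAddRef  = rc.Get();   // thread B: get (same stale value)
    rc.Store(seenByAddRef + 1);     // thread B: store incremented value
    rc.Store(seenByRelease - 1);    // thread A: store, stomping B's change
    return rc.Get();
}

// With an atomic counter each read-modify-write is indivisible, so no
// interleaving of the two calls can lose an update.
inline int AtomicReleaseAndAddRef(std::atomic<int>& rc) {
    rc.fetch_add(1);   // AddRef
    rc.fetch_sub(1);   // Release
    return rc.load();
}
```

Starting from a count of 2, one AddRef plus one Release should leave 2. The interleaved non-atomic version leaves 1, which is the "lower number than it should be" that only bites when the next Release arrives.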

Comment 64

19 years ago
Warren,
The race you worried about is not a problem.  The only time folks should be 
messing with an object is IF they already have done an addref.  There is no 
chance that a thread is "about to do an addref" on an object unless that thread 
*has* an outstanding addref ahead of time.  Hence there is no risk from some 
other thread doing a decref (the count is already at least 2, one for each 
thread handling the object).

On some platforms, you can get some guarantees about atomic actions for some 
class of integers.  Waldemar looked into this a LOT for multiprocessor machines, 
and can probably chime in with potential answers.  If the action is not atomic 
(as pointed out by jband), then there is a big risk of losing either an 
increment, or a decrement :-(.  

Adding Waldemar to this thread in case he has suggestions.
(Assignee)

Comment 65

19 years ago
My point is that if 2 threads are manipulating the same channel, then the 
channel better be protecting the state for other operations, not just 
addref/release. 

Comment 66

19 years ago
The issue that I've seen in the past with non-threadsafe Addref/Release is that 
the refcount can prematurely go to zero.  For example, if an object has a 
reference count of two and two threads call Release() simultaneously, there is a 
chance that the --mRefCount will be executed on each thread *before* either one 
checks for 0.  In this case, both threads will see (mRefCount == 0) and delete 
the object.

This double deletion was the whole reason that I added the NS_IMPL_THREADSAFE 
macros to nsISupportsUtils.h
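
A sketch of that double-deletion scenario, again replayed on one thread so the result is deterministic. It assumes nothing about how the NS_IMPL_THREADSAFE macros are actually written; Counted and the two helpers are hypothetical. The key point is that the safe version tests the value returned by its own atomic decrement instead of re-reading the shared counter.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical refcounted object; `deletions` counts how often the
// "delete this" path would run, so a double delete is observable.
struct Counted {
    int plainCount = 2;
    std::atomic<int> atomicCount{2};
    int deletions = 0;
};

// Unsafe pattern: decrement and zero-check are separate steps. Replayed
// here in the bad order: both "threads" decrement, then both check.
inline int UnsafeDoubleRelease(Counted& obj) {
    --obj.plainCount;                           // thread A decrements
    --obj.plainCount;                           // thread B decrements
    if (obj.plainCount == 0) ++obj.deletions;   // thread A sees 0
    if (obj.plainCount == 0) ++obj.deletions;   // thread B sees 0 too
    return obj.deletions;
}

// Safe pattern: each caller tests the result of its own atomic decrement,
// so exactly one caller observes the transition to zero.
inline int SafeDoubleRelease(Counted& obj) {
    int afterA = obj.atomicCount.fetch_sub(1) - 1;  // thread A: 2 -> 1
    int afterB = obj.atomicCount.fetch_sub(1) - 1;  // thread B: 1 -> 0
    if (afterA == 0) ++obj.deletions;
    if (afterB == 0) ++obj.deletions;
    return obj.deletions;
}
```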
(Assignee)

Comment 67

19 years ago
Ok. What other channel implementations need this same fix?

Comment 68

19 years ago
I'm not seeing crashes at home, but as of a day or two ago, I can no longer load
any remote pages on my machine at home (SMP machine).

Comment 69

19 years ago
warren,
I think that we should examine the File Transport as well...  Basically, any 
pointer that is Addref/Released on another thread requires thread-safe ISupports 
implementations...

Typically, these are the internal nsIStreamListener implementations and the 
streams...

I was thinking of adding some assertions to the non-threadsafe AddRef/Release 
macros which assert if they are ever called on multiple threads...  Do you think 
this would be useful ?  

I used to have some debug macros, along the lines of NS_ENSURE_THREADSAFE(...) 
which could be used to verify that method arguments were threadsafe, but they 
required using an NS_IMPL_THREADSAFE_QI macro...  Troy whined endlessly about 
that so I removed it :-(

However, I could make the checking completely transparent if I added an 'owning 
thread' pointer as a data member in NS_DECL_ISUPPORTS (for debug only)
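
Roughly what such an 'owning thread' check could look like, sketched in standard C++. This is an assumption about the shape of the idea, not the code that was proposed for NS_DECL_ISUPPORTS; OwningThreadChecker and Checked are hypothetical names, and a real debug build would assert rather than count violations.

```cpp
#include <cassert>
#include <thread>

// Hypothetical debug-only helper: remembers the thread that created the
// object and counts AddRef/Release calls arriving from any other thread.
class OwningThreadChecker {
public:
    OwningThreadChecker() : mOwner(std::this_thread::get_id()) {}

    // Returns true when `caller` is the owning thread; otherwise records a
    // violation (a real debug build would assert here instead).
    bool Check(std::thread::id caller = std::this_thread::get_id()) {
        if (caller == mOwner) return true;
        ++mViolations;
        return false;
    }

    int Violations() const { return mViolations; }

private:
    std::thread::id mOwner;
    int mViolations = 0;
};

// Non-threadsafe refcount that runs the check on every AddRef/Release.
class Checked {
public:
    int AddRef()  { mChecker.Check(); return ++mRefCnt; }
    int Release() { mChecker.Check(); return --mRefCnt; }
    OwningThreadChecker mChecker;
private:
    int mRefCnt = 0;
};
```

Because the checker lives inside the object, classes using the non-threadsafe AddRef/Release would get the check without any change to their QueryInterface implementation.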

Comment 70

19 years ago
I'm nominating bug #24642 and #26686 as dups of this bug.  What do people think?

Comment 71

19 years ago
Need to fix by 03/03 for beta1 train.
QA Contact: leger → tever
Whiteboard: [PDT+] need SMP machine → [PDT+] Must fix by 03/03 need SMP machine

Comment 72

19 years ago
*** Bug 24642 has been marked as a duplicate of this bug. ***

Comment 73

19 years ago
Rick's comment about two threads doing simultaneous decrefs, and then both 
thinking it was their job to do the delete (because they checked non-atomically 
for a zero after the decref), is really scary :-(.
Do we have this problem with many classes of objects, or is there a small set 
that generally faces this evil handling on multiple threads?

Comment 74

19 years ago
...another question... if this is a problem on SMP, why are we not hitting it 
on a single processor machine?  Considering that task switching between threads 
is pre-emptive, I'd expect a similar amount of risk of a conflict.  What am I 
missing?  Is there a way to mark an executable to NOT use more than one 
processor??  Would that give us a wimpy work-around for now???

Comment 75

19 years ago
hey jim,

I think that we *are* seeing this problem on single processor machines.  Take 
a look at bug #26686 and bug #24642.  They both have very similar stack 
traces...
I think that we are seeing it *more* on SMP boxes because we get more 
concurrency...  But the problem still exists on single processors...

Comment 76

19 years ago
Damn.... this is sounding more and more scary.  I need to look at how other 
systems deal with this while doing ref-counting.  Ugh... this looks hard (but at 
least that makes it interesting!!!! :-)  ).

Updated

19 years ago
Whiteboard: [PDT+] Must fix by 03/03 need SMP machine → [PDT+] w/b minus on 03/03- need SMP machine

Comment 77

19 years ago
After some analysis, I've identified the following classes as being 
un-threadsafe in their usage of Addref/Release.  This analysis was *only* for 
bringing up the browser - there are definitely more in FTP and IMAP :-(
For each of these classes, at least one instance is created on one thread and 
then Addref/Released by another.
    nsThread
    nsLocalFileSystem
    nsFileTransport
    nsLocalFile
    nsGenericModule
    nsFileTransportService
    nsProxyObject
    nsInterfaceInfo
    nsMIMEService
    nsMIMEInfoImpl
    nsBasicStringImpl
    nsDNSService
    nsIOService
    nsEventQueueImpl
    nsSupportsArray
    AtomImpl
    nsGenericFactory

Each of these classes needs to be analyzed to determine the extent of 
un-threadsafe usage beyond Addref() and Release()!!
Should we file separate bugs on each of these?  Are you going to change the 
above to use the thread-safe version of addref/release?

Comment 79

19 years ago
Does anyone have a proposed patch to fix this?  Maybe some changes to the
addref/release macros for everything?  I don't crash here at home... I just
can't load pages.  I can run mozilla remotely to my xserver at work if anyone
wants me to test this out.

Comment 80

19 years ago
The fix for Addref/Release is trivial.  You simply need to use the:
  NS_IMPL_THREADSAFE_ADDREF(...)
  NS_IMPL_THREADSAFE_RELEASE(...)
macros.
The bigger question is if Addref/Release are being accessed on multiple threads, 
what other members are also accessed - and not threadsafe!

I think that as we migrate these classes to use the THREADSAFE macros, we must 
*also* do a careful analysis to determine the overall threadsafety (and thread 
exposure) of each class...

Comment 81

19 years ago
Pavlov, FYI: I'm running Linux at home on a dual 350MHz PII, I've never had
problems with loading remote pages (over a modem line) in mozilla (I update and
test almost daily), and mozilla hasn't crashed in a while either...
(Assignee)

Comment 82

19 years ago
Rick: I'd eventually like to get your assertions for this into the tree too so 
that the problem doesn't come up in the future (after we've analyzed and fixed 
all these). Good work figuring out how to spot this.

Pavlov: What do you say we build us an SMP box out of our Dell 210s? I want to 
make sure somebody has a machine in house that will exhibit these problems.

Don/Peter/dp: Do any of you have a spare Dell 210 that you can give up for a 
while to make a multiprocessor out of? That would let me keep mine for 
development. Thanks.

Comment 83

19 years ago
Damn, I shouldn't have said that! Now, I'm seeing a crash again, and I was able
to get a stack trace, the stacktrace is different from all the other ones in
this bug but I still think it belongs here.

#0  0x4059b090 in main_arena () from /lib/libc.so.6
#1  0x40042f7e in nsCOMPtr<nsIChannel>::assign_assuming_AddRef (
    this=0x89d8150, newPtr=0x0) at ../../dist/include/nsCOMPtr.h:416
#2  0x41c183ac in nsCOMPtr<nsIChannel>::assign_with_AddRef (this=0x89d8150, 
    rawPtr=0x0) at ../../../../dist/include/nsCOMPtr.h:787
#3  0x41c18de3 in nsCOMPtr<nsIChannel>::operator= (this=0x89d8150, rhs=0x0)
    at ../../../../dist/include/nsCOMPtr.h:526
#4  0x41c05ac0 in nsHTTPRequest::~nsHTTPRequest (this=0x89d8138, __in_chrg=3)
    at nsHTTPRequest.cpp:146
#5  0x41c05c25 in nsHTTPRequest::Release (this=0x89d8138)
    at nsHTTPRequest.cpp:154
#6  0x41bfe9b5 in nsHTTPChannel::~nsHTTPChannel (this=0x81c5660, __in_chrg=3)
    at nsHTTPChannel.cpp:127
#7  0x41bfec33 in nsHTTPChannel::Release (this=0x81c5660)
    at nsHTTPChannel.cpp:142
#8  0x40043a74 in nsCOMPtr<nsIChannel>::~nsCOMPtr (this=0xbffff2c4, 
    __in_chrg=2) at ../../dist/include/nsCOMPtr.h:434
#9  0x40e4fcc7 in nsDocLoaderImpl::DocLoaderIsEmpty (this=0x85c3918, aStatus=0)
    at nsDocLoader.cpp:495
#10 0x40e4fb18 in nsDocLoaderImpl::OnStopRequest (this=0x85c3918, 
    aChannel=0x8c87db8, aCtxt=0x0, aStatus=0, aMsg=0x0) at nsDocLoader.cpp:437
#11 0x40706b52 in nsLoadGroup::RemoveChannel (this=0x85c3970, 
    channel=0x8c87db8, ctxt=0x0, status=0, errorMsg=0x0) at nsLoadGroup.cpp:535
#12 0x407405bb in nsFileChannel::OnStopRequest (this=0x8c87db8, 
    transportChannel=0x8c87ec8, context=0x0, aStatus=0, aMsg=0x0)
    at nsFileChannel.cpp:450
#13 0x406efb0d in nsOnStopRequestEvent::HandleEvent (this=0x408ea618)
    at nsAsyncStreamListener.cpp:282
#14 0x406ef1e7 in nsStreamListenerEvent::HandlePLEvent (aEvent=0x41dd6560)
    at nsAsyncStreamListener.cpp:97
(More stack frames follow...)

Here's what it crashed on:

#1  0x40042f7e in nsCOMPtr<nsIChannel>::assign_assuming_AddRef (
    this=0x89d8150, newPtr=0x0) at ../../dist/include/nsCOMPtr.h:416
416                 NSCAP_RELEASE(oldPtr);
(gdb) print oldPtr
$6 = (nsIChannel *) 0x88346f4
(gdb) print *oldPtr
$7 = {<nsIRequest> = {<nsISupports> = {
      _vptr. = 0x4059b088}, <No data fields>}, <No data fields>}

Still no problems loading remote pages tho...

Comment 84

19 years ago
I've got a 210.  It isn't spare, but I could loan it out for a short time,
especially over the weekend.
For what it is worth, the xpcom log might help. It is enabled for release 
builds too. Here is how you get it:

set env NSPR_LOG_MODULES=nsComponentManager:5
set env NSPR_LOG_FILE=xpcom.log
mozilla

now you should have a xpcom.log There is a sufficiently large chance that we 
might be able to tell what is happening from the log.

Comment 86

19 years ago
It looks like the last stack trace is slightly different...  

In this case, the last URL of the document has finished and the LoadGroup is 
releasing its reference to the "document channel" (which is a nsHTTPChannel).

The nsHTTPChannel (this=0x81c5660) releases its nsHTTPRequest (this=0x89d8138), 
which in turn releases its reference to the nsSocketTransport (0x88346f4) 
- which is an nsIChannel.  Unfortunately, the nsSocketTransport instance has 
already been deleted :-(

Comment 87

19 years ago
That class of problem (release on an already deleted object) is exactly the sort
of thing that would be expected from the problem you isolated.  When the ref
count on the object is down-counted to zero, and the hit to zero is felt by
*two* threads, then *both* threads will delete and clean up that object. When
both threads start to "clean up," some related objects will be deleted on one
thread, and then later the other thread will come along to "clean up" and do
additional releases on a collected object.

This all seems to fit... or am I missing something??

Comment 88

19 years ago
The good news is that I no longer crash when sitting there viewing slashdot.org
or the test case. It appears that animated gifs are now cached instead of being
downloaded over and over.

The bad news is that I still get random crashes with the same stack traces.
I'll try gagan's xpcom logging suggestion.

Comment 89

19 years ago
holy cow! I ran mozilla for 15 or so minutes with NSPR_LOG_MODULES  and 
NSPR_LOG_FILE set. I generated a 56mb log file filled with millions of
these

1024[8058968]: 		found lib:libgfx_gtk.so as 812b930 in factory cache.
1024[8058968]: 		Factory CreateInstance() succeeded.
1024[8058968]: nsComponentManager:
FindFactory({e12752f0-ee9a-11d1-a82a-0040959a28c9})

These spewed out at the rate of 1 or 2 per second even just sitting
at http://www.mozilla.org/

Eventually after reloading my mailbox and loading some other pages it crashed.

#0  0x40800149 in ?? ()
#1  0x41c373ac in nsCOMPtr<nsIChannel>::assign_with_AddRef (this=0x420a8ac0, 
    rawPtr=0x0) at ../../../../dist/include/nsCOMPtr.h:787
#2  0x41c37de3 in nsCOMPtr<nsIChannel>::operator= (this=0x420a8ac0, rhs=0x0)
    at ../../../../dist/include/nsCOMPtr.h:526
#3  0x41c24ac0 in nsHTTPRequest::~nsHTTPRequest (this=0x420a8aa8, __in_chrg=3)
    at nsHTTPRequest.cpp:146
#4  0x41c24c25 in nsHTTPRequest::Release (this=0x420a8aa8)
    at nsHTTPRequest.cpp:154
#5  0x41c1d9b5 in nsHTTPChannel::~nsHTTPChannel (this=0x42029b80, __in_chrg=3)
    at nsHTTPChannel.cpp:127
#6  0x41c1dc33 in nsHTTPChannel::Release (this=0x42029b80)
    at nsHTTPChannel.cpp:142
#7  0x406e40ee in nsStreamListenerEvent::~nsStreamListenerEvent (
    this=0x82efb48, __in_chrg=3) at nsAsyncStreamListener.cpp:81
#8  0x406e4a01 in nsOnStopRequestEvent::~nsOnStopRequestEvent (this=0x82efb48, 
    __in_chrg=3) at nsAsyncStreamListener.cpp:261
#9  0x406e421f in nsStreamListenerEvent::DestroyPLEvent (aEvent=0x84689c8)
    at nsAsyncStreamListener.cpp:108
#10 0x40189c5b in PL_DestroyEvent (self=0x84689c8) at plevent.c:549
#11 0x40189bf9 in PL_HandleEvent (self=0x84689c8) at plevent.c:536
#12 0x40189abc in PL_ProcessPendingEvents (self=0x812cf78) at plevent.c:487
#13 0x4018b5fc in nsEventQueueImpl::ProcessPendingEvents (this=0x812cf50)
    at nsEventQueue.cpp:298
#14 0x40935a64 in event_processor_callback (data=0x812cf50, source=9, 
    condition=GDK_INPUT_READ) at nsAppShell.cpp:141
#15 0x409356ef in our_gdk_io_invoke (source=0x4159f368, condition=G_IO_IN, 
    data=0x415b2988) at nsAppShell.cpp:54
#16 0x407cc52a in g_io_unix_dispatch () from /usr/lib/libglib-1.2.so.0
#17 0x407cdbe6 in g_main_dispatch () from /usr/lib/libglib-1.2.so.0
#18 0x407ce1a1 in g_main_iterate () from /usr/lib/libglib-1.2.so.0
#19 0x407ce341 in g_main_run () from /usr/lib/libglib-1.2.so.0
#20 0x40a12209 in gtk_main () from /usr/lib/libgtk-1.2.so.0
#21 0x40936067 in nsAppShell::Run (this=0x40812e38) at nsAppShell.cpp:304
#22 0x4064eaad in ?? ()
   from /home/endico/mozilla/mozilla/dist/bin/components/libnsappshell.so
#23 0x804e60e in main1 (argc=1, argv=0xbffff9e4, splashScreen=0x0)
    at nsAppRunner.cpp:763
#24 0x804eba0 in main (argc=1, argv=0xbffff9e4) at nsAppRunner.cpp:883
from the end of xpcom.log:

1024[8058968]: 		found rel:libnecko.so as 807ac80 in factory cache.
1024[8058968]: 		Factory CreateInstance() succeeded.
1024[8058968]: nsComponentManager:
ProgIDToClassID(component://netscape/image/decoder&type=image/gif)->{0d471b70-baf5-11d2-802c-0060088f91a3}
1024[8058968]: nsComponentManager:
FindFactory({0d471b70-baf5-11d2-802c-0060088f91a3})
1024[8058968]: 		found rel:libnsgif.so as 8085720 in factory cache.
1024[8058968]: 		Factory CreateInstance() succeeded.
1024[8058968]: nsComponentManager:
FindFactory({6049b261-c1e6-11d1-a827-0040959a28c9})
1024[8058968]: 		found lib:libgfx_gtk.so as 812b1d8 in factory cache.
1024[8058968]: 		Factory CreateInstance() succeeded.
1024[8058968]: nsComponentManager:
FindFactory({e12752f0-ee9a-11d1-a82a-0040959a28c9})
1024[8058968]: 		found lib:libgfx_gtk.so as 812b930 in factory cache.
1024[8058968]: 		Factory CreateInstance() succeeded.
1024[8058968]: nsComponentManager:
FindFactory({e12752f0-ee9a-11d1-a82a-0040959a28c9})
1024[8058968]: 		found lib:libgfx_gtk.so as 812b930 in factory cache.
1024[8058968]: 		Factory CreateInstance() succeeded.
1024[8058968]: nsComponentManager:
FindFactory({e12752f0-ee9a-11d1-a82a-0040959a28c9})
1024[8058968]: 		found lib:libgfx_gtk.so as 812b930 in factory cache.
1024[8058968]: 		Factory CreateInstance() succeeded.
1024[8058968]: nsComponentManager:
FindFactory({e12752f0-ee9a-11d1-a82a-0040959a28c9})
1024[8058968]: 		found lib:libgfx_gtk.so as 812b930 in factory cache.
1024[8058968]: 		Factory CreateInstance() succeeded.
1024[8058968]: nsComponentManager:
FindFactory({e12752f0-ee9a-11d1-a82a-0040959a28c9})
1024[8058968]: 		found lib:libgfx_gtk.so as 812b930 in factory cache.
1024[8058968]: 		Factory CreateInstance() succeeded.
1024[8058968]: nsComponentManager:
FindFactory({e12752f0-ee9a-11d1-a82a-0040959a28c9})
1024[8058968]: 		found lib:libgfx_gtk.so as 812b930 in factory cache.
1024[8058968]: 		Factory CreateInstance() succeeded.
1024[8058968]: nsComponentManager:
FindFactory({e12752f0-ee9a-11d1-a82a-0040959a28c9})
1024[8058968]: 		found lib:libgfx_gtk.so as 812b930 in factory cache.
1024[8058968]: 		Factory CreateInstance() succeeded.
1024[8058968]: nsComponentManager:
ProgIDToClassID(component://netscape/network/protocol?name=http)->{52a30880-dd95-11d3-a1a7-0050041caf44}
1024[8058968]: nsComponentManager:
FindFactory({90012125-1616-4fa1-ae14-4e7fa5766eb6})
1024[8058968]: 		found rel:libnecko.so as 807b070 in factory cache.
1024[8058968]: 		Factory CreateInstance() succeeded.
1024[8058968]: nsComponentManager:
FindFactory({de9472d0-8034-11d3-9399-00104ba0fd40})
1024[8058968]: 		found rel:libnecko.so as 807a890 in factory cache.
1024[8058968]: nsComponentManager:
FindFactory({dbf72351-4fd8-46f0-9dbc-fa5ba60a305c})
1024[8058968]: 		found rel:libnecko.so as 807afc8 in factory cache.
1024[8058968]: 		Factory CreateInstance() succeeded.
1024[8058968]: 		Factory CreateInstance() succeeded.
1024[8058968]: nsComponentManager:
ProgIDToClassID(component://netscape/scriptsecuritymanager)->{7ee2a4c0-4b93-17d3-ba18-0060b0f199a2}
1024[8058968]: nsComponentManager:
ProgIDToClassID(component://netscape/network/protocol?name=http)->{52a30880-dd95-11d3-a1a7-0050041caf44}
1024[8058968]: nsComponentManager:
ProgIDToClassID(component://netscape/network/cache?name=manager)->{2030f0b0-9567-11d3-90d3-0040056a906e}
1024[8058968]: nsComponentManager:
FindFactory({60047bb2-91c0-11d3-8cd9-0060b0fc14a3})
1024[8058968]: 		found rel:libnecko.so as 807ac80 in factory cache.
1024[8058968]: 		Factory CreateInstance() succeeded.
1024[8058968]: nsComponentManager:
ProgIDToClassID(component://netscape/image/decoder&type=image/gif)->{0d471b70-baf5-11d2-802c-0060088f91a3}
1024[8058968]: nsComponentManager:
FindFactory({0d471b70-baf5-11d2-802c-0060088f91a3})
1024[8058968]: 		found rel:libnsgif.so as 8085720 in factory cache.
1024[8058968]: 		Factory CreateInstance() succeeded.
(Assignee)

Comment 90

19 years ago
Dawn: Try setting nsSocketTransport:5 instead of nsComponentManager:5. I think 
that would be more helpful. 

Still working on an SMP machine for Rick. Pavlov agreed to pool his machine 
with mine... if I could only find him!

Comment 91

19 years ago
now that animated gifs don't constantly reload my new test case is browser
buster. it broke for me at about the 3rd url. Here's a new stack and the
last part of xpcom.log. I have 150K or so of log file with random.yahoo.com and
esta.org messages if anyone is interested.

using nsComponentManager:5.

#0  0x0 in ?? ()
#1  0x406e40ee in nsStreamListenerEvent::~nsStreamListenerEvent (
    this=0x8793948, __in_chrg=3) at nsAsyncStreamListener.cpp:81
#2  0x406e4a01 in nsOnStopRequestEvent::~nsOnStopRequestEvent (this=0x8793948, 
    __in_chrg=3) at nsAsyncStreamListener.cpp:261
#3  0x406e421f in nsStreamListenerEvent::DestroyPLEvent (aEvent=0x8788988)
    at nsAsyncStreamListener.cpp:108
#4  0x40189c5b in PL_DestroyEvent (self=0x8788988) at plevent.c:549
#5  0x40189bf9 in PL_HandleEvent (self=0x8788988) at plevent.c:536
#6  0x40189abc in PL_ProcessPendingEvents (self=0x812b798) at plevent.c:487
#7  0x4018b5fc in nsEventQueueImpl::ProcessPendingEvents (this=0x812b770)
    at nsEventQueue.cpp:298
#8  0x40935a64 in event_processor_callback (data=0x812b770, source=9, 
    condition=GDK_INPUT_READ) at nsAppShell.cpp:141
#9  0x409356ef in our_gdk_io_invoke (source=0x8338070, condition=G_IO_IN, 
    data=0x81c6f08) at nsAppShell.cpp:54
#10 0x407cc52a in g_io_unix_dispatch () from /usr/lib/libglib-1.2.so.0
#11 0x407cdbe6 in g_main_dispatch () from /usr/lib/libglib-1.2.so.0
#12 0x407ce1a1 in g_main_iterate () from /usr/lib/libglib-1.2.so.0
#13 0x407ce341 in g_main_run () from /usr/lib/libglib-1.2.so.0
#14 0x40a12209 in ?? () from /usr/lib/libgtk-1.2.so.0
#15 0x40936067 in nsAppShell::Run (this=0x8130de8) at nsAppShell.cpp:304
#16 0x4064eaad in nsAppShellService::Run (this=0x812b570)
    at nsAppShellService.cpp:399
#17 0x804e60e in main1 (argc=1, argv=0xbffff9e4, splashScreen=0x0)
    at nsAppRunner.cpp:763
#18 0x804eba0 in main (argc=1, argv=0xbffff9e4) at nsAppRunner.cpp:883


1026[812d2c8]: --- Leaving nsSocketTransport::Process() [www.esta.org:80
41ccdd20]. mStatus = 80470007.	CurrentState = 5

1026[812d2c8]: +++ Entering nsSocketTransport::Process() [www.esta.org:80
41ccdd20].	aSelectFlags = 1.	CurrentState = 5
1026[812d2c8]: +++ Entering nsSocketTransport::doRead() [www.esta.org:80
41ccdd20].	aSelectFlags = 1.	
1026[812d2c8]: nsReadFromSocket [fd=40805220].  rv = 0. Buffer space = 239. 
Bytes read =239
1026[812d2c8]: nsReadFromSocket [fd=40805220].  rv = 0. Buffer space = 2048. 
Bytes read =261
1026[812d2c8]: nsReadFromSocket [fd=40805220].  rv = 80470007. Buffer space =
1787.  Bytes read =0
1026[812d2c8]: nsSocketTransport::OnWrite() [www.esta.org:80 41ccdd20].
nsIPipe=408fe088 Count=500
1026[812d2c8]: WriteSegments [fd=40805220].  rv = 0. Bytes read =500
1026[812d2c8]: --- Leaving nsSocketTransport::doRead() [www.esta.org:80
41ccdd20]. rv = 80470007.	Total bytes read: 500

1026[812d2c8]: --- Leaving nsSocketTransport::Process() [www.esta.org:80
41ccdd20]. mStatus = 80470007.	CurrentState = 5

1026[812d2c8]: +++ Entering nsSocketTransport::Process() [www.esta.org:80
41ccdd20].	aSelectFlags = 1.	CurrentState = 5
1026[812d2c8]: +++ Entering nsSocketTransport::doRead() [www.esta.org:80
41ccdd20].	aSelectFlags = 1.	
1026[812d2c8]: nsReadFromSocket [fd=40805220].  rv = 0. Buffer space = 1787. 
Bytes read =500
1026[812d2c8]: nsReadFromSocket [fd=40805220].  rv = 80470007. Buffer space =
1287.  Bytes read =0
1026[812d2c8]: nsSocketTransport::OnWrite() [www.esta.org:80 41ccdd20].
nsIPipe=408fe088 Count=500
1026[812d2c8]: WriteSegments [fd=40805220].  rv = 0. Bytes read =500
1026[812d2c8]: --- Leaving nsSocketTransport::doRead() [www.esta.org:80
41ccdd20]. rv = 80470007.	Total bytes read: 500

1026[812d2c8]: --- Leaving nsSocketTransport::Process() [www.esta.org:80
41ccdd20]. mStatus = 80470007.	CurrentState = 5

1026[812d2c8]: +++ Entering nsSocketTransport::Process() [www.esta.org:80
41ccdd20].	aSelectFlags = 1.	CurrentState = 5
1026[812d2c8]: +++ Entering nsSocketTransport::doRead() [www.esta.org:80
41ccdd20].	aSelectFlags = 1.	
1026[812d2c8]: nsReadFromSocket [fd=40805220].  rv = 0. Buffer space = 1287. 
Bytes read =46
1026[812d2c8]: nsReadFromSocket [fd=40805220].  rv = 0. Buffer space = 1241. 
Bytes read =0
1026[812d2c8]: nsSocketTransport::OnWrite() [www.esta.org:80 41ccdd20].
nsIPipe=408fe088 Count=46
1026[812d2c8]: WriteSegments [fd=40805220].  rv = 0. Bytes read =46
1026[812d2c8]: --- Leaving nsSocketTransport::doRead() [www.esta.org:80
41ccdd20]. rv = 80470007.	Total bytes read: 46

1026[812d2c8]: --- Leaving nsSocketTransport::Process() [www.esta.org:80
41ccdd20]. mStatus = 80470007.	CurrentState = 5

1026[812d2c8]: +++ Entering nsSocketTransport::Process() [www.esta.org:80
41ccdd20].	aSelectFlags = 20.	CurrentState = 5
1026[812d2c8]: Operation failed via PR_POLL_HUP. [www.esta.org:80 41ccdd20].
1026[812d2c8]: Transport [www.esta.org:80 41ccdd20] is in error state.
1026[812d2c8]: Transport [www.esta.org:80 41ccdd20] is in done state.
1026[812d2c8]: --- Leaving nsSocketTransport::Process() [www.esta.org:80
41ccdd20]. mStatus = 0.	CurrentState = 3

1024[8058968]: Deleting nsSocketTransport [komodo.mozilla.org:80 877f0a8].
1024[8058968]: Deleting nsSocketTransport [random.yahoo.com:80 408c83e8].
1024[8058968]: Deleting nsSocketTransport [www.esta.org:80 41ccdd20].

Comment 92

19 years ago
hey dawn,

This last bit of logging info is starting to look useful :-) can you try it 
again with NSPR_LOG_MODULES=nsHTTPProtocol:5,nsSocketTransport:5

This will give info about how/when the HTTP objects are destroyed too.

Thanks,
-- rick

Comment 93

19 years ago
*** Bug 26686 has been marked as a duplicate of this bug. ***

Comment 94

19 years ago
So, I've been trying to reproduce these crashes most of the night on a 2
processor NT machine without any luck :-(

I'll try Linux tomorrow...

Is anyone else still seeing these crashes on SMP NT boxes?  Or is Linux the only
platform now?

Comment 95

19 years ago
I did what rick asked and mailed him another stack trace and log file
rather than pasting it all here. Here's the end of the log.

1024[8058968]: Canceling nsSocketTransport [dspace.dial.pipex.com:80 42564550]. 
rv = 0
1024[8058968]: Canceling nsSocketTransport [dspace.dial.pipex.com:80 42578428]. 
rv = 0
1024[8058968]: Deleting nsHTTPChannel [this=8164f10].
1024[8058968]: Deleting nsHTTPRequest [this=814c7b0].
1024[8058968]: Deleting nsSocketTransport [komodo.mozilla.org:80 42093698].
1024[8058968]: Deleting nsHTTPChannel [this=4201e630].
1024[8058968]: Deleting nsHTTPRequest [this=41ec6930].

Comment 96

19 years ago
I would be happy to do any testing on Linux that is needed.  I saw some ideas of
how to get the appropriate info earlier in this bug.  I did notice that the mozilla
nightly from last night/this morning was very unstable compared to 48 hours ago
in linux/smp.

Comment 97

19 years ago
rpotts@netscape.com asked if anyone still was seeing this on NT: yes. I've
sent the full dump directly, this was on "latest nightly": 2000022908 on a dual
PII 450 running NT. I had been browsing for about an hour or so, /., UF,
mozilla.org, oreily.com, nothing serious when it crashed... took most of NT
with it... I had to logout and kill most of my active processes in order to
get realtime control back... there's a line in the stack trace for the active
thread that might explain that.... (dnetc was running in background, ending
that from a command line helped, but didn't restore full usability. what ever
the crash did it resulted in normal processes only getting time (even to
repaint) when dnetc was IO bound to disk - that is NOT the normal behaviour
of dnetc, it is usually very well behaved. after it shutdown it only took a
minute for the start menu to appear, another minute for the shutdown menu
option to select.... before that it took ten minutes to get the start->run
dialog.)

here's the top of the active thread:
jsdom!nsGetInterface::operator=
gkhtml!NS_NewEventListenerManager
gkhtml!NS_NewPresShell
gkview!nsCreateInstanceByProgID::operator=
gkview!nsCreateInstanceByProgID::operator=
[...]
mozilla!nsGetInterface::operator=
kernel32!GetProcessPriorityBoost
mozilla!<nosymbols>


Comment 98

19 years ago
adding link to bug 25910 which most likely is a duplicate
(Assignee)

Comment 99

19 years ago
I'll have to take this over now that Rick has gone on sabbatical, but in some 
sense it's probably Dougt's bug.

Status: We worked on this all day yesterday on Dawn's machine and saw numerous 
crashes. For necko they were often in using the proxy code to post OnStatus and 
OnProgress notifications back to the mozilla thread. However, we also saw 
problems where the gfx toolkit would go away and others, so solving just the 
necko issue won't make us completely stable on MP machines.

Possible solutions: (a) don't deliver status/progress at all (disable them 
in the socket transport and just rel-note it) (b) don't use the proxy code to 
deliver status/progress (implement the event delivery/thread-switch by hand), 
(c) get Doug to track down what's going on with proxies.

Last night we augmented the TestSocketTransport test program to receive 
status/progress notifications so that it might also exhibit this problem, and 
left it running on the machine but didn't see the same failure by the time we 
went home. :-(
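
For reference, option (b) could look roughly like the sketch below: the transport thread posts closures onto a locked queue and the owning thread runs them, so listener state is only ever touched from one thread. This is a minimal illustration, assuming standard C++ primitives rather than PLEvent or the proxy code; EventQueue, ProcessEvents and DeliverProgress are hypothetical names, not anything in the tree.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical hand-rolled event queue; a stand-in for the idea only.
class EventQueue {
public:
    // Called from any thread: enqueue an event for the owning thread.
    void Post(std::function<void()> ev) {
        {
            std::lock_guard<std::mutex> lock(mMutex);
            mEvents.push_back(std::move(ev));
        }
        mCond.notify_one();
    }

    // Called on the owning thread: block until `count` events have run.
    void ProcessEvents(int count) {
        for (int i = 0; i < count; ++i) {
            std::function<void()> ev;
            {
                std::unique_lock<std::mutex> lock(mMutex);
                mCond.wait(lock, [this] { return !mEvents.empty(); });
                ev = std::move(mEvents.front());
                mEvents.pop_front();
            }
            ev();  // run outside the lock, on the owning thread
        }
    }

private:
    std::mutex mMutex;
    std::condition_variable mCond;
    std::deque<std::function<void()>> mEvents;
};

// A "transport" thread posts progress notifications of 512 bytes each;
// the calling thread handles them and returns the total it observed.
int DeliverProgress(int notifications) {
    assert(notifications >= 0);
    EventQueue queue;
    int bytesSeen = 0;  // only ever modified on the calling thread
    std::thread transport([&queue, &bytesSeen, notifications] {
        for (int i = 0; i < notifications; ++i) {
            queue.Post([&bytesSeen] { bytesSeen += 512; });
        }
    });
    queue.ProcessEvents(notifications);
    transport.join();
    return bytesSeen;
}
```

The point of the design is that the only shared state is the queue itself, which is protected by one mutex; everything the events touch stays on the owning thread.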
Assignee: rpotts → warren
(Assignee)

Comment 100

19 years ago
Found it! NS_MT_SUPPORTED was not defined for Linux (!) and a bunch of classes 
weren't thread safe. 
See news://news.mozilla.org/38BF7E94.3CA715DA%40netscape.com for details.
(Assignee)

Updated

19 years ago
Whiteboard: [PDT+] w/b minus on 03/03- need SMP machine → [PDT+] w/b minus on 03/03 [have fixes!]

Comment 101

19 years ago
The landing is in progress, so I'm extending this to w/b minus on 3/7
Whiteboard: [PDT+] w/b minus on 03/03 [have fixes!] → [PDT+] w/b minus on 3/7 [have fixes!]
(Assignee)

Comment 102

19 years ago
Here's the list of classes I'm having to make threadsafe:

AtomImpl
BasicStringImpl
CacheOutputStream
InterceptStreamListener
MemCacheWriteStreamWrapper
TestConnection
nsAppShellService
nsCacheEntryChannel
nsCharsetConverterManager
nsConverterFactory
nsDNSService
nsDateTimeFormatWin
nsDocShell
nsDocumentOpenInfo
nsEventQueueImpl
nsEventQueueServiceImpl
nsFTPDirListingConv
nsFileSpecImpl
nsFileTransport
nsFileTransportService
nsGenericFactory
nsGenericModule
nsHTTPIndexParser
nsIOService
nsImapFlagAndUidState
nsImapMailCopyState
nsImapMockChannel
nsInputStreamChannel
nsInputStreamFileSystem
nsInterfaceInfoManager
nsLocalFile
nsLocalFileSystem
nsLocale
nsLocaleService
nsMIMEInfoImpl
nsMIMEService
nsMemCacheChannel
nsMemCacheRecord
nsMsgAccountManager
nsMsgIncomingServer
nsMsgMailNewsUrl
nsMsgStatusFeedback
nsMsgWindow
nsObserverService
nsPref
nsPrefMigration
nsProxyEventClass
nsProxyEventObject
nsProxyObjectManager
nsRDFResource
nsRunner
nsSocketTransport
nsSocketTransportService
nsStdURLParser
nsStorageStream
nsStreamConverterService
nsSupportsArray
nsThread
nsThreadPool
nsWalletlibService

Comment 103

19 years ago
On what evidence are you basing the need to make the imap classes
thread-safe? (by which I assume you mean adding threadsafe add and release refs) 
Inspection, or actual evidence of CONCURRENT access to add and release ref from 
multiple threads? The imap code uses BLOCKING proxy calls between threads so 
that while one thread may be manipulating the ref count, the other thread is 
blocked.
(Assignee)

Comment 104

19 years ago
These changes went in moments ago, along with Andreas' changes.

David: These classes were determined experimentally. I hadn't thought about the 
case where only synchronous proxy code was used, and consequently making 
AddRef/Release threadsafe _shouldn't_ be necessary (I'd have to really study the 
proxy code to determine whether that's really true), but I think making these 
classes threadsafe is mostly harmless -- just a little more overhead in the 
AddRef/Release which will hopefully be insignificant. Let's see if anything 
shows up during profiling.
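
To illustrate what "threadsafe AddRef/Release" amounts to, here is a minimal sketch assuming std::atomic in place of the actual XPCOM macros. RefCounted and HammerRefCount are hypothetical names, and unlike a real Release this one does not delete the object when the count reaches zero.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for a refcounted object; not the Mozilla classes.
class RefCounted {
public:
    // Atomic increment; returns the new count.
    std::uint32_t AddRef() {
        return mRefCnt.fetch_add(1, std::memory_order_relaxed) + 1;
    }

    // Atomic decrement; returns the new count. A real implementation
    // would destroy the object when this reaches zero.
    std::uint32_t Release() {
        return mRefCnt.fetch_sub(1, std::memory_order_acq_rel) - 1;
    }

    std::uint32_t RefCount() const { return mRefCnt.load(); }

private:
    std::atomic<std::uint32_t> mRefCnt{1};  // starts owned by the creator
};

// Hammer AddRef/Release from several threads and return the final count.
// With a plain integer counter the increments can be lost on an MP box;
// with the atomic counter the result is always the initial count.
std::uint32_t HammerRefCount(int threads, int iterations) {
    assert(threads > 0 && iterations >= 0);
    RefCounted obj;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&obj, iterations] {
            for (int i = 0; i < iterations; ++i) {
                obj.AddRef();
                obj.Release();
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }
    return obj.RefCount();
}
```

The extra overhead mentioned above is the cost of the atomic read-modify-write on each AddRef/Release compared with a plain increment.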
Status: NEW → RESOLVED
Last Resolved: 20 years ago → 19 years ago
Resolution: --- → FIXED

Comment 105

19 years ago
Warren, I was playing around on my machine today in the tree you were
working on and found lots of other thread safety assertions and crashes
in the mail account wizard and while loading my inbox. Do you need that
tree any more or is it safe to update to the tip? I don't want to blow
away your changes but I don't want to report the crashes if they are
unique to my tree.
(Assignee)

Comment 106

19 years ago
You can update to the tip. Tons of other fixes went in after that. It would be 
great if you could verify that the thread safety asserts you mentioned have gone 
away now. If not, you can send them to me, or file new bugs. Thanks.

Comment 107

19 years ago
Dawn, could you help once again in verifying this bug.  I have been told that 
you were able to reproduce this.  Thanks.

Comment 108

19 years ago
Oops, I did this the other day and mailed warren but forgot to comment 
in the bug. After I updated from the tip things worked great. I got no
assertions and didn't crash after several hours. Marking verified.
Status: RESOLVED → VERIFIED

Comment 109

19 years ago
I'm running on a Quad Sun UE450 (Solaris 2.6) and have been experiencing quite
a lot of instability.. if I run the exact same code on an UP machine with the
exact same OS etc it's almost perfectly stable.
I bet the quad will trigger smp bugs more than a dual...

I'm running current CVS (tip) with gtk/glib 1.2.6, compiled with gcc 2.95.2
(-O -msupersparc).

Here is a stacktrace from searching for 'Mozilla' in the search sidebar and
waiting a few seconds (repeatable sometimes 8):

#0  0xef1d66b8 in pthread_mutex_lock () from /usr/lib/libthread.so.1
#1  0xef5614c8 in PR_Lock ()
   from /scratch/mozilla/mozilla/dist/bin/./libnspr4.so
#2  0xedefff6c in nsSocketTransport::Process ()
   from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#3  0xedf029f4 in nsSocketTransportService::ProcessWorkQ ()
   from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#4  0xedf02f30 in nsSocketTransportService::Run ()
   from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#5  0xef68da64 in nsThread::Main ()
   from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#6  0xef566208 in _pt_root ()
   from /scratch/mozilla/mozilla/dist/bin/./libnspr4.so

Loading a page with a bunch of images resulted in:
#0  0xedefd850 in nsStreamListenerEvent::~nsStreamListenerEvent ()
   from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#1  0xedefdef4 in nsOnStopRequestEvent::~nsOnStopRequestEvent ()
   from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#2  0xedefd908 in nsStreamListenerEvent::DestroyPLEvent ()
   from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#3  0xef68b698 in PL_DestroyEvent ()
   from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#4  0xef68b674 in PL_HandleEvent ()
   from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#5  0xef68b584 in PL_ProcessPendingEvents ()
   from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#6  0xef68c328 in nsEventQueueImpl::ProcessPendingEvents ()
   from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#7  0xee630a74 in event_processor_callback ()
   from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#8  0xee630794 in our_gdk_io_invoke ()
   from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#9  0xee251d0c in g_main_dispatch () from /usr/local/lib/libglib-1.2.so.0
#10 0xee252444 in g_main_iterate () from /usr/local/lib/libglib-1.2.so.0
#11 0xee252634 in g_main_run () from /usr/local/lib/libglib-1.2.so.0
#12 0xee429814 in gtk_main () from /usr/local/lib/libgtk-1.2.so.0
#13 0xee630f78 in nsAppShell::Run ()
   from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#14 0xee6cfa30 in nsAppShellService::Run ()
   from /scratch/mozilla/mozilla/dist/bin/components/libnsappshell.so
#15 0x139f0 in main1 ()
#16 0x13ddc in main ()

(Assignee)

Comment 110

19 years ago
stric: Is this the latest build? Debug or optimized? We're still finding 
thread-safety assertions that we're tracking down, so we know this isn't 100% 
fixed yet, but we closed this bug because we know that the assertions will help 
us resolve them over time. I'm wondering if you've seen any assertions, and/or 
whether you think we should reopen this bug.

Comment 111

19 years ago
Note that crashing on the tip build this past weekend (or today) is no big
deal.  There is a lot of instability at this moment.
Do you crash when you pull last friday's evening build?  Try picking that up
from Mozilla.  That was when we branched for beta, but before the giant landings
began.
If you are building your own binary, you should try to induce this bug using the
Netscape beta1 branch.  That would be the interesting (sad? surprising?) test.
Thanks,
Jim

Comment 112

19 years ago
I hate to be a broken record, but the asserts only catch lack of thread safety 
on addref and release - there could be all sorts of other thread-safety issues.
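
A rough sketch of the limitation, assuming a check that only compares the calling thread against the thread that created the object. OwnerChecked and ViolationsAfterForeignUse are hypothetical names, and the check counts violations instead of asserting so the example can run to completion.

```cpp
#include <cassert>
#include <thread>

// Hypothetical object whose AddRef/Release check the calling thread,
// while its other methods do not.
class OwnerChecked {
public:
    OwnerChecked() : mOwner(std::this_thread::get_id()) {}

    void AddRef()  { NoteThread(); ++mRefCnt; }
    void Release() { NoteThread(); --mRefCnt; }

    // No thread check here: misuse from another thread goes unnoticed.
    void SetState(int value) { mState = value; }

    int Violations() const { return mViolations; }

private:
    // Stand-in for the assertion: record a violation on a foreign thread.
    void NoteThread() {
        if (std::this_thread::get_id() != mOwner) {
            ++mViolations;
        }
    }

    std::thread::id mOwner;
    int mRefCnt = 1;
    int mState = 0;
    int mViolations = 0;
};

// Use the object from a second thread and report how many calls the
// check flagged. The threads run one after the other, so the sketch
// itself has no data race.
int ViolationsAfterForeignUse() {
    OwnerChecked obj;
    obj.AddRef();
    obj.Release();
    assert(obj.Violations() == 0);  // owning-thread calls are not flagged

    std::thread other([&obj] {
        obj.SetState(42);  // not flagged
        obj.AddRef();      // flagged
        obj.Release();     // flagged
    });
    other.join();
    return obj.Violations();
}
```

Only the two refcount calls are counted; the SetState call from the foreign thread passes silently, which is the class of problem the asserts cannot see.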

Comment 113

19 years ago
ftp://ftp.mozilla.org/pub/mozilla/nightly/2000-03-10-08-M15/mozilla-source.tar.gz
this is the source tarball from last friday that jar mentioned.
I don't see a source tarball for the netscape beta branch. You
can pull it from cvs if you use the proper tag. The tag should
be listed on the builds or seamonkey newsgroup.

Comment 114

19 years ago
I don't think mozilla.org is doing any building of tarballs based on the netscape 
branch (although you could ask for 'em!! :-)  ).  That was why the best build I 
could point at was late in the day on last Friday.  Thanks to endico for adding 
the pointer.

Bienvenu is quite correct that other bugs can/will exist in/around 
multi-threading.  There is a good chance that the nature of the thread-induced 
problem will not be memory-centric (re: double frees, etc.), and hence I 
personally would be more surprised to see a stack trace that looked consistently 
like the ones we had been seeing on this bug.  Another bug... yes... but I was 
hoping we were free of this particular class of threading errors.  Perhaps we 
never will be... but a guy can hope! :-)

Again, please tell us how you do with the "relatively" stable build that endico 
identified.

Comment 115

19 years ago
Warren: I was running current (by then) CVS source from CVS HEAD, optimized build.

I just updated and now I get crashes when I resize (a bunch) the window when
viewing slashdot.org for example.. I get a 120-130 step backtrace.. here's a snip:

#0  0x0 in ?? ()
#1  0xedad7300 in nsInlineFrame::ReflowFrames ()
   from /scratch/mozilla/mozilla/dist/bin/components/libraptorhtml.so
#2  0xedad719c in nsInlineFrame::Reflow ()
   from /scratch/mozilla/mozilla/dist/bin/components/libraptorhtml.so
#3  0xedadaa9c in nsLineLayout::ReflowFrame ()
   from /scratch/mozilla/mozilla/dist/bin/components/libraptorhtml.so
#4  0xedab8b38 in nsBlockFrame::ReflowInlineFrame ()
   from /scratch/mozilla/mozilla/dist/bin/components/libraptorhtml.so
...
#87 0xedae75fc in PresShell::ResizeReflow ()
   from /scratch/mozilla/mozilla/dist/bin/components/libraptorhtml.so
#88 0xed6dec54 in nsViewManager2::SetWindowDimensions ()
   from /scratch/mozilla/mozilla/dist/bin/components/libraptorview.so
#89 0xed6e0420 in nsViewManager2::DispatchEvent ()
   from /scratch/mozilla/mozilla/dist/bin/components/libraptorview.so
#90 0xed6ced54 in HandleEvent ()
   from /scratch/mozilla/mozilla/dist/bin/components/libraptorview.so
#91 0xeea3bc98 in nsWidget::DispatchEvent ()
   from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#92 0xeea3bba8 in nsWidget::DispatchWindowEvent ()
   from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#93 0xeea3aa8c in nsWidget::OnResize ()
   from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#94 0xeea42ff4 in nsWindow::Resize ()
   from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#95 0xed6d0a30 in nsView::SetDimensions ()
   from /scratch/mozilla/mozilla/dist/bin/components/libraptorview.so
#96 0xed6dec24 in nsViewManager2::SetWindowDimensions ()
   from /scratch/mozilla/mozilla/dist/bin/components/libraptorview.so

Here's a dump from loading a page with a bunch of png/jpg/gif images:
(gdb) bt
#0  0xee2ed9d0 in nsStreamListenerEvent::~nsStreamListenerEvent ()
   from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#1  0xee2ee074 in nsOnStopRequestEvent::~nsOnStopRequestEvent ()
   from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#2  0xee2eda88 in nsStreamListenerEvent::DestroyPLEvent ()
   from /scratch/mozilla/mozilla/dist/bin/components/libnecko.so
#3  0xefa8b650 in PL_DestroyEvent ()
   from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#4  0xefa8b62c in PL_HandleEvent ()
   from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#5  0xefa8b53c in PL_ProcessPendingEvents ()
   from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#6  0xefa8c2e0 in nsEventQueueImpl::ProcessPendingEvents ()
   from /scratch/mozilla/mozilla/dist/bin/./libxpcom.so
#7  0xeea2c40c in event_processor_callback ()
   from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#8  0xeea2c12c in our_gdk_io_invoke ()
   from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#9  0xee651d0c in g_main_dispatch () from /usr/local/lib/libglib-1.2.so.0
#10 0xee652444 in g_main_iterate () from /usr/local/lib/libglib-1.2.so.0
#11 0xee652634 in g_main_run () from /usr/local/lib/libglib-1.2.so.0
#12 0xee829814 in gtk_main () from /usr/local/lib/libgtk-1.2.so.0
#13 0xeea2c910 in nsAppShell::Run ()
   from /scratch/mozilla/mozilla/dist/bin/./libwidget_gtk.so
#14 0xeeacfb30 in nsAppShellService::Run ()
   from /scratch/mozilla/mozilla/dist/bin/components/libnsappshell.so
#15 0x139f0 in main1 ()
#16 0x13ddc in main ()

How do I update for the beta1 branch? If it's getting stable on this quad I
could try it on a 10 cpu onyx2 for some more concurrency 8)

With the current code I would not classify it as fixed.. Maybe on dual boxes,
but not on a quad..