Closed Bug 234620 Opened 21 years ago Closed 20 years ago

Unknown random SEGV/seg fault/core dumps/crashes, only thing on is Mail/IMAP [@ 0x00000001 - nsSupportsArray::ElementAt][@ nsSupportsArray::Clear][@ NSS_CMSArray_Sort][@ nsSupportsArray::Clear][@ nsSupportsArray::DeleteArray]

Categories

(MailNews Core :: Networking: IMAP, defect)

x86
Linux
defect
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED
mozilla1.8beta1

People

(Reporter: jerry.lundstrom, Assigned: darin.moz)

References

Details

(4 keywords, Whiteboard: [not fixed in firefox1.0])

Crash Data

Attachments

(6 files, 3 obsolete files)

User-Agent:       
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7a) Gecko/20040215

Mozilla randomly crashes for me when I just leave it running. The only thing it's
running is IMAP to one mail account, checking mail once a minute. I get a lot
of mail and have about 30 filters and 15 IMAP boxes. All mail is on the server;
nothing is moved to local disk. I'm running the nightly from 15/2, and this has
been going on since 1.6 (1.6 hangs instead of crashing). I'm only using the
complete install, no other modules. The only plugin I have is Flash Player.

Reproducible: Always
Steps to Reproduce:
1. Just leave it running

Actual Results:  
It crashes, core dumps, SEGV's.


Expected Results:  
Keep running =)

One thing to think about is that my IMAP server is a round-robin to 2-3 backends;
maybe that's the problem.

Here is the backtrace from gdb:

(gdb) bt
#0  0x403f9bf1 in kill () from /lib/libc.so.6
#1  0x400e783d in pthread_kill () from /lib/libpthread.so.0
#2  0x400e7b5b in raise () from /lib/libpthread.so.0
#3  0x40e7d408 in NSGetModule ()
   from /home/prox/mozilla/components/libprofile.so
#4  0x400ea905 in __pthread_sighandler () from /lib/libpthread.so.0
#5  <signal handler called>
#6  0x41c657d9 in NSGetModule ()
   from /home/prox/mozilla/components/libxpinstall.so
#7  0x41c658bc in NSGetModule ()
   from /home/prox/mozilla/components/libxpinstall.so
#8  0x40624028 in nsSupportsArray::Clear() ()
   from /home/prox/mozilla/libxpcom.so
#9  0x40623988 in nsSupportsArray::DeleteArray() ()
   from /home/prox/mozilla/libxpcom.so
#10 0x406233fa in nsSupportsArray::~nsSupportsArray() ()
   from /home/prox/mozilla/libxpcom.so
#11 0x4062365c in nsSupportsArray::Release() ()
   from /home/prox/mozilla/libxpcom.so
#12 0x0807665b in nsCOMPtr_base::~nsCOMPtr_base() ()
#13 0x4061e54f in nsObserverList::~nsObserverList() ()
   from /home/prox/mozilla/libxpcom.so
#14 0x4061efe6 in nsObserverService::Create(nsISupports*, nsID const&, void**)
    () from /home/prox/mozilla/libxpcom.so
#15 0x4061c606 in nsHashtable::Enumerate(int (*)(nsHashKey*, void*, void*),
void*) () from /home/prox/mozilla/libxpcom.so
#16 0x4061715f in PL_DHashTableEnumerate () from /home/prox/mozilla/libxpcom.so
#17 0x4061c6ac in nsHashtable::Reset(int (*)(nsHashKey*, void*, void*), void*)
    () from /home/prox/mozilla/libxpcom.so
#18 0x4061ddcd in nsObjectHashtable::Reset() ()
   from /home/prox/mozilla/libxpcom.so
#19 0x4061dc5d in nsObjectHashtable::~nsObjectHashtable() ()
   from /home/prox/mozilla/libxpcom.so
#20 0x4061ef48 in nsObserverService::~nsObserverService() ()
   from /home/prox/mozilla/libxpcom.so
#21 0x4061ed8c in nsObserverService::Release() ()
   from /home/prox/mozilla/libxpcom.so
#22 0x080766ae in nsCOMPtr_base::assign_with_AddRef(nsISupports*) ()
#23 0x4065514b in nsComponentManagerImpl::CreateInstanceByContractID(char
const*, nsISupports*, nsID const&, void**) () from /home/prox/mozilla/libxpcom.so
#24 0x4061715f in PL_DHashTableEnumerate () from /home/prox/mozilla/libxpcom.so
#25 0x406551d3 in nsComponentManagerImpl::FreeServices() ()
   from /home/prox/mozilla/libxpcom.so
#26 0x4061627d in NS_ShutdownXPCOM () from /home/prox/mozilla/libxpcom.so
#27 0x08077029 in NS_ShutdownXPCOM ()
#28 0x080774d0 in GRE_Shutdown ()
#29 0x0805b7c5 in main ()
confirmed on mozilla build id: 2004021608 also.
that stack trace looks like the app is trying to shut down. Is that possible?
Hmmm, shutdown? I don't think so; all it does is fetch my mail. I'll pay closer
attention to this (checking whether the Mail window is up when I close browser
windows), but I doubt that's the problem.

My current distro is lunar-linux (www.lunar-linux.org); most things are compiled
with pentium4, sse/mmx, fpu=both. Maybe the problem is that libc is compiled
with pentium4 under gcc 3.2.3, but these kinds of crashes happened on my
debian/unstable system also, just not as frequently.

I'll start dumping core and see if it crashes at the same place every time.
more backtraces:

(gdb) bt
#0  0x403f9bf1 in kill () from /lib/libc.so.6
#1  0x400e783d in pthread_kill () from /lib/libpthread.so.0
#2  0x400e7b5b in raise () from /lib/libpthread.so.0
#3  0x40e7d408 in NSGetModule ()
   from /home/prox/mozilla/components/libprofile.so
#4  0x400ea905 in __pthread_sighandler () from /lib/libpthread.so.0
#5  <signal handler called>
#6  0x00000011 in ?? ()
#7  0x40623aa6 in nsSupportsArray::ElementAt(unsigned) ()
   from /home/prox/mozilla/libxpcom.so
#8  0x406243d8 in nsSupportsArray::GetElementAt(unsigned, nsISupports**) ()
   from /home/prox/mozilla/libxpcom.so
#9  0x4061ec5d in ObserverListEnumerator::GetNext(nsISupports**) ()
   from /home/prox/mozilla/libxpcom.so
#10 0x4061f332 in nsObserverService::NotifyObservers(nsISupports*, char const*,
unsigned short const*) () from /home/prox/mozilla/libxpcom.so
#11 0x4065f81f in nsEventQueueImpl::NotifyObservers(char const*) ()
   from /home/prox/mozilla/libxpcom.so
#12 0x4065f3cb in nsEventQueueImpl::InitFromPRThread(PRThread*, int) ()
   from /home/prox/mozilla/libxpcom.so
#13 0x40660bf9 in nsEventQueueServiceImpl::MakeNewQueue(PRThread*, int,
nsIEventQueue**) () from /home/prox/mozilla/libxpcom.so
#14 0x40660c97 in nsEventQueueServiceImpl::CreateEventQueue(PRThread*, int) ()
   from /home/prox/mozilla/libxpcom.so
#15 0x40660ad3 in nsEventQueueServiceImpl::CreateMonitoredThreadEventQueue() ()
   from /home/prox/mozilla/libxpcom.so
#16 0x406659cb in nsProxyObject::PostAndWait(nsProxyObjectCallInfo*) ()
   from /home/prox/mozilla/libxpcom.so
#17 0x40665ce7 in nsProxyObject::Post(unsigned, nsXPTMethodInfo*,
nsXPTCMiniVariant*, nsIInterfaceInfo*) () from /home/prox/mozilla/libxpcom.so
#18 0x40667bd6 in nsProxyEventObject::CallMethod(unsigned short, nsXPTMethodInfo
const*, nsXPTCMiniVariant*) () from /home/prox/mozilla/libxpcom.so
#19 0x4067a4b7 in XPTC_InvokeByIndex () from /home/prox/mozilla/libxpcom.so
#20 0x417737d6 in NSGetModule ()
   from /home/prox/mozilla/components/libmsgimap.so
#21 0x41777c64 in NSGetModule ()
   from /home/prox/mozilla/components/libmsgimap.so
#22 0x41769fb9 in NSGetModule ()
   from /home/prox/mozilla/components/libmsgimap.so
#23 0x41769694 in NSGetModule ()
   from /home/prox/mozilla/components/libmsgimap.so
#24 0x41768cb0 in NSGetModule ()
   from /home/prox/mozilla/components/libmsgimap.so
#25 0x406617ab in nsThread::Main(void*) () from /home/prox/mozilla/libxpcom.so
#26 0x400c6639 in PR_Select () from mozilla/libnspr4.so
#27 0x400e4d03 in pthread_start_thread () from /lib/libpthread.so.0
#28 0x404b1d97 in clone () from /lib/libc.so.6

Here you can clearly see that it's a SEGV in nsSupportsArray::ElementAt().
Another core, looks the same as before. I'm going to update to the 20040217
nightly now.

(gdb) bt
#0  0x403f9bf1 in kill () from /lib/libc.so.6
#1  0x400e783d in pthread_kill () from /lib/libpthread.so.0
#2  0x400e7b5b in raise () from /lib/libpthread.so.0
#3  0x40e7d408 in NSGetModule ()
   from /home/prox/mozilla/components/libprofile.so
#4  0x400ea905 in __pthread_sighandler () from /lib/libpthread.so.0
#5  <signal handler called>
#6  0x00000011 in ?? ()
#7  0x40623aa6 in nsSupportsArray::ElementAt(unsigned) ()
   from /home/prox/mozilla/libxpcom.so
#8  0x406243d8 in nsSupportsArray::GetElementAt(unsigned, nsISupports**) ()
   from /home/prox/mozilla/libxpcom.so
#9  0x4061ec5d in ObserverListEnumerator::GetNext(nsISupports**) ()
   from /home/prox/mozilla/libxpcom.so
#10 0x4061f332 in nsObserverService::NotifyObservers(nsISupports*, char const*,
unsigned short const*) () from /home/prox/mozilla/libxpcom.so
#11 0x4065f81f in nsEventQueueImpl::NotifyObservers(char const*) ()
   from /home/prox/mozilla/libxpcom.so
#12 0x4065f3cb in nsEventQueueImpl::InitFromPRThread(PRThread*, int) ()
   from /home/prox/mozilla/libxpcom.so
#13 0x40660bf9 in nsEventQueueServiceImpl::MakeNewQueue(PRThread*, int,
nsIEventQueue**) () from /home/prox/mozilla/libxpcom.so
#14 0x40660c97 in nsEventQueueServiceImpl::CreateEventQueue(PRThread*, int) ()
   from /home/prox/mozilla/libxpcom.so
#15 0x40660ad3 in nsEventQueueServiceImpl::CreateMonitoredThreadEventQueue() ()
   from /home/prox/mozilla/libxpcom.so
#16 0x406659cb in nsProxyObject::PostAndWait(nsProxyObjectCallInfo*) ()
   from /home/prox/mozilla/libxpcom.so
#17 0x40665ce7 in nsProxyObject::Post(unsigned, nsXPTMethodInfo*,
nsXPTCMiniVariant*, nsIInterfaceInfo*) () from /home/prox/mozilla/libxpcom.so
#18 0x40667bd6 in nsProxyEventObject::CallMethod(unsigned short, nsXPTMethodInfo
const*, nsXPTCMiniVariant*) () from /home/prox/mozilla/libxpcom.so
#19 0x4067a4b7 in XPTC_InvokeByIndex () from /home/prox/mozilla/libxpcom.so
#20 0x419d07d6 in NSGetModule ()
   from /home/prox/mozilla/components/libmsgimap.so
#21 0x419d4c64 in NSGetModule ()
   from /home/prox/mozilla/components/libmsgimap.so
#22 0x419c6fb9 in NSGetModule ()
   from /home/prox/mozilla/components/libmsgimap.so
#23 0x419c6694 in NSGetModule ()
   from /home/prox/mozilla/components/libmsgimap.so
#24 0x419c5cb0 in NSGetModule ()
   from /home/prox/mozilla/components/libmsgimap.so
#25 0x406617ab in nsThread::Main(void*) () from /home/prox/mozilla/libxpcom.so
#26 0x400c6639 in PR_Select () from mozilla/libnspr4.so
#27 0x400e4d03 in pthread_start_thread () from /lib/libpthread.so.0
#28 0x404b1d97 in clone () from /lib/libc.so.6
I guess it would be nice to have symbols for the imap part of the stack trace.
Why does xpcom have symbols and not imap? We're supposed to work fine mixing
debug and non-debug components, but it always makes me nervous.

The stack trace itself points to a problem in the observer service, or a
ref-counting problem with the observers. 
Components are compiled with only the necessary symbols exported.  (See
mozilla/build/unix/gnu-ld-scripts/.)  In optimized builds, this means they don't
have any symbol data other than NSGetModule (or equivalent).  This is the way
we've distributed builds for years.  It's not a mix of debug and non-debug
components -- it's just that libraries that are linked against need symbols to
link against, but component libraries only need a single symbol as an entry point.
more bt's. Also, if someone could build me a debug nightly, I'll be happy to wait
for it to core dump =)

(gdb) bt
#0  0x403f9bf1 in kill () from /lib/libc.so.6
#1  0x400e783d in pthread_kill () from /lib/libpthread.so.0
#2  0x400e7b5b in raise () from /lib/libpthread.so.0
#3  0x40e58408 in NSGetModule ()
   from /home/prox/mozilla/components/libprofile.so
#4  0x400ea905 in __pthread_sighandler () from /lib/libpthread.so.0
#5  <signal handler called>
#6  0x00000011 in ?? ()
#7  0x405feaa6 in nsSupportsArray::ElementAt(unsigned) ()
   from /home/prox/mozilla/libxpcom.so
#8  0x405ff3d8 in nsSupportsArray::GetElementAt(unsigned, nsISupports**) ()
   from /home/prox/mozilla/libxpcom.so
#9  0x405f9c5d in ObserverListEnumerator::GetNext(nsISupports**) ()
   from /home/prox/mozilla/libxpcom.so
#10 0x405fa332 in nsObserverService::NotifyObservers(nsISupports*, char const*,
unsigned short const*) () from /home/prox/mozilla/libxpcom.so
#11 0x4063a81f in nsEventQueueImpl::NotifyObservers(char const*) ()
   from /home/prox/mozilla/libxpcom.so
#12 0x4063a3cb in nsEventQueueImpl::InitFromPRThread(PRThread*, int) ()
   from /home/prox/mozilla/libxpcom.so
#13 0x4063bbf9 in nsEventQueueServiceImpl::MakeNewQueue(PRThread*, int,
nsIEventQueue**) () from /home/prox/mozilla/libxpcom.so
#14 0x4063bc97 in nsEventQueueServiceImpl::CreateEventQueue(PRThread*, int) ()
   from /home/prox/mozilla/libxpcom.so
#15 0x4063bad3 in nsEventQueueServiceImpl::CreateMonitoredThreadEventQueue() ()
   from /home/prox/mozilla/libxpcom.so
#16 0x406409cb in nsProxyObject::PostAndWait(nsProxyObjectCallInfo*) ()
   from /home/prox/mozilla/libxpcom.so
#17 0x40640ce7 in nsProxyObject::Post(unsigned, nsXPTMethodInfo*,
nsXPTCMiniVariant*, nsIInterfaceInfo*) () from /home/prox/mozilla/libxpcom.so
#18 0x40642bd6 in nsProxyEventObject::CallMethod(unsigned short, nsXPTMethodInfo
const*, nsXPTCMiniVariant*) () from /home/prox/mozilla/libxpcom.so
#19 0x406554b7 in XPTC_InvokeByIndex () from /home/prox/mozilla/libxpcom.so
#20 0x418997d6 in NSGetModule ()
   from /home/prox/mozilla/components/libmsgimap.so
#21 0x4189dc64 in NSGetModule ()
   from /home/prox/mozilla/components/libmsgimap.so
#22 0x4188ffb9 in NSGetModule ()
   from /home/prox/mozilla/components/libmsgimap.so
#23 0x4188f694 in NSGetModule ()
   from /home/prox/mozilla/components/libmsgimap.so
#24 0x4188ecb0 in NSGetModule ()
   from /home/prox/mozilla/components/libmsgimap.so
#25 0x4063c7ab in nsThread::Main(void*) () from /home/prox/mozilla/libxpcom.so
#26 0x400c6639 in PR_Select () from mozilla/libnspr4.so
#27 0x400e4d03 in pthread_start_thread () from /lib/libpthread.so.0
#28 0x404b1d97 in clone () from /lib/libc.so.6
The stack traces have a slightly higher chance of being useful if you also
include the output of /proc/<pid>/maps , where <pid> is the process ID of the
process that crashed.  (Slightly higher means that it becomes possible to
extract the necessary information given:
 * the stack
 * the exact nightly you were using
 * the maps file
but it's still quite difficult.)

Also, if you attach further stacks, it's probably better to attach them (see the
"Create an attachment" link above) so that the bug stays more readable.
How can I copy the maps file? Doesn't it disappear after the process
crashes?

My nightly id right now is 2004021708.
You do need to get the map file before the process exits, but it doesn't need to
be immediately before -- anytime after it's fully started up should be fine.  (I
was thinking you had the crash in gdb rather than debugging a core file, in
which case the map file would still have been there.)

Also, do you have a dual CPU machine?
(Note that any of those three pieces of information other than the stack isn't
useful without having all of them for the same crash.  Also, don't worry too
much about getting them, because it's not all that likely they'll lead to
anything useful.)
It's worth noting that the observer being notified here is probably the appshell
service, but that if the problem is a refcounting error it would be a
refcounting error on the weak reference object and not the appshell service
itself.  (Both the appshell service code and the observer code do a bunch of
rather nasty things, but nothing obvious that would crash.)
Yes, it's a dual P4 2.4GHz with HyperThreading on, so it says there are 4 CPUs.

No, I don't run mozilla via gdb; I just gdb the core. I don't have enough time
to spend running it via gdb.
bump, what has happened? Has anyone found anything? It still crashes for me
(build id: 2004030109).
Is there a debug version of the nightly somewhere that I can use to get a better
dump? Or can I find the build configuration for the nightly somewhere?
this is the builtin stacktrace from nightly built with debug
Please look at the attached stack trace, I can replicate this if you want to
have other type of information like the map file etc etc.
thx, that stack trace is much more useful. I suspect it's a race condition
exacerbated by your cpu setup. I also suspect that you're encountering a lot of
different problems, from all your stack traces. I see a race condition that
could result in m_transport getting cleared between the time that it's checked
for null and the time it's used. I can try to fix that...
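
The check-then-use race described above can be sketched as follows. This is only a minimal illustration in standard C++, not the IMAP code itself: `Connection`, `Transport`, `IsAlive`, and the `std::mutex` are hypothetical stand-ins for the protocol object, `m_transport`, and whatever lock an actual fix would use.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

// Hypothetical stand-in for the object stored in m_transport.
struct Transport {
  bool IsAlive() const { return true; }
};

// Hypothetical stand-in for the object that owns m_transport.
class Connection {
public:
  Connection() : m_transport(std::make_shared<Transport>()) {}

  // May be called from another thread when the connection is torn down.
  void ClearTransport() {
    std::lock_guard<std::mutex> guard(m_lock);
    m_transport.reset();
  }

  // Racy pattern: "if (m_transport) m_transport->IsAlive()" lets another
  // thread clear the member between the null check and the use.
  // Safer pattern: copy the strong reference under the lock, then check
  // and use only the local copy.
  bool TransportIsAlive() {
    std::shared_ptr<Transport> local;
    {
      std::lock_guard<std::mutex> guard(m_lock);
      local = m_transport;
    }
    return local && local->IsAlive();
  }

private:
  std::mutex m_lock;
  std::shared_ptr<Transport> m_transport;
};
```

Because the local copy keeps the transport alive, a concurrent `ClearTransport()` can no longer pull the object out from under the caller.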
Status: UNCONFIRMED → NEW
Ever confirmed: true
Attached file more stack traces
This is the most common stack trace; it breaks at nsSupportsArray::ElementAt().

I have not seen any other array breaks.
btw, if you want me to test a patch, just send it to me and I will, but
include the build configuration (.mozconfig).
this is just a possibility - but we should be protecting the clearing of
m_transport with a monitor in case the code that's checking the non-nullness of
the m_transport has it cleared out from under it. I'm not sure why you need a
.mozconfig from me - the one you have should be fine - do you have a tree that
builds? If so, you can just apply the patch and rebuild. But as I said before,
I think you have a lot of different problems, that we'll have to try to knock
off one at a time. I'll look at the stack you just posted, but I have a fear
it's not in the imap code...
Re the event queue stuff, I stepped through the code a bit. Is it possible that
the app shell event queue stuff isn't thread-safe? The stack in
http://bugzilla.mozilla.org/attachment.cgi?id=142984&action=view is from the
imap thread. I wonder what happens when the observers array gets changed while
we're iterating over it. The observer service uses a lock when things are added
or removed, but I don't see any locking when we're iterating over the list of
observers via an enumerator...Also, does the linux code use native event queues?
It's not clear to me if the native event queue code path modifies the array
we're enumerating over or not, but if it did, that could cause more
possibilities of race conditions...
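
One common way to make iteration safe against concurrent add/remove is to copy the list under the lock and iterate over the copy. The sketch below assumes that approach; it is not how nsObserverService is implemented, and `ObserverList`, `Add`, `Remove`, and `Notify` are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical observer list; not the real nsObserverList.
class ObserverList {
  struct Entry {
    int id;
    std::function<void()> fn;
  };

public:
  // Registers an observer and returns an id that can be used to remove it.
  int Add(std::function<void()> fn) {
    std::lock_guard<std::mutex> guard(lock_);
    int id = next_id_++;
    observers_.push_back(Entry{id, std::move(fn)});
    return id;
  }

  void Remove(int id) {
    std::lock_guard<std::mutex> guard(lock_);
    observers_.erase(
        std::remove_if(observers_.begin(), observers_.end(),
                       [id](const Entry& e) { return e.id == id; }),
        observers_.end());
  }

  // Copies the list while holding the lock, then calls the copies without
  // holding it, so Add/Remove during notification cannot invalidate the
  // iteration. Returns the number of observers called.
  std::size_t Notify() {
    std::vector<Entry> snapshot;
    {
      std::lock_guard<std::mutex> guard(lock_);
      snapshot = observers_;
    }
    for (auto& entry : snapshot) entry.fn();
    return snapshot.size();
  }

private:
  std::mutex lock_;
  std::vector<Entry> observers_;
  int next_id_ = 0;
};
```

The trade-off is that an observer removed during a notification may still be called once from the snapshot, so observers need to tolerate a late call.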
Jerry, what's the date of the most recent build you've been running? Darin says
that Brendan fixed some crash in nsSupportsArray, though I doubt that's involved
here, since I believe you've crashed before and after his checkin of 02/25/04
I don't know if this has anything to do with the race, but I just got this:

###!!! ASSERTION: nsTDependentString must wrap only null-terminated strings:
'mData[mLength] == 0', file ../../../dist/include/string/nsTDependentString.h,
line 67
Break: at file ../../../dist/include/string/nsTDependentString.h, line 67

Other than that, I've been running the latest cvs with the patch you added for a
few hours now.
Attached file stack-20040309-1.txt
stack with the patch :/ seems like it's still racing.
As I said, I think you're running into several different race conditions. My fix
has nothing to do with the event queue race conditions, but rather an internal
race condition in the imap code (the stack trace with CanHandleUrl in it, -
http://bugzilla.mozilla.org/attachment.cgi?id=142864&action=view ). 

The string assertion is probably just because of some new string changes and is
most likely not related.

I could take some stabs at using locks in the event queue code, for you to try,
but it would be just a stab in the dark. It's also possible that it's a
ref-counting problem, as dbaron points out, but the fact that you've got a 4 cpu
system makes me suspect a race condition (though race conditions can expose
ref-counting problems too).
I'm wondering if it could be HyperThreading too: if Linux treats it as just 2
more CPUs when it really isn't, maybe that has to do with the instability. The
only other thing I notice about my machine is that it can sometimes lock up for
a second or two if it's doing MASSIVE memory swapping.

I will reboot and disable HyperThreading and see if that's the problem.
Turning off HT makes it more stable but it still races. It has crashed two times
since yesterday, both at the place shown in
http://bugzilla.mozilla.org/attachment.cgi?id=142984&action=view .
Attached file stack of 20040313
This stack is from 3 processes that crashed at the same time. Before the segv you
will see 1 2 3 4; they are printfs in nsSupportsArray::ElementAt:

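NS_IMETHODIMP_(nsISupports*)
nsSupportsArray::ElementAt(PRUint32 aIndex)
{
  // Debug tracing: the numbered printfs show how far each call gets.
  printf("1 %lu %lu\n", (unsigned long)aIndex, (unsigned long)mCount);
  if (aIndex < mCount) {
    printf("2 %p %p\n", (void*)mArray, (void*)mAutoArray);
    nsISupports*  element = mArray[aIndex];
    printf("3 %p\n", (void*)element);
    NS_IF_ADDREF(element);
    printf("4\n");
    return element;
    printf("5\n");  // unreachable, kept from the original instrumentation
  }
  printf("6\n");
  return 0;
}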

As you can see, it clearly crashes between 3 and 4, doing the ref count. And as
you can see from my other gdb backtrace, the value is 0x00000011. So something
sets the element to 0x11.

I currently am running HT again, using gcc 3.3.3 and all things (except
mozilla) are optimized with pentium4, mmx/sse/sse2, fpu=x387/sse -O2 .
I can't see that someone's set it to 11 from the stack trace - am I missing
something? It definitely seems that multiple threads are accessing the array,
though that's not necessarily a problem (though it's not protected by a monitor,
so if someone's altering the queue at the same time, maybe bad things could
happen). You might try adding printfs in the code that removes elements from the
nsSupportsArray...
Jerry: if you think memory is getting overwritten, one of the best ways to track
that down is valgrind: http://valgrind.kde.org/
It handles threading, but I'm not sure how well.
Keywords: crash
I believe this bug still exists, at least in mozilla 1.7. A customer of mine
reported a crash with the same stack trace as comment #4.
Following is my investigation based on the core file I got. The crash also
happened on a machine with 2 AMD CPUs running solaris. HIH

the crash happened at nsSupportsArray::ElementAt(PRUint32 aIndex) which is:
NS_IMETHODIMP_(nsISupports*)
nsSupportsArray::ElementAt(PRUint32 aIndex)
{
  if (aIndex < mCount) {
    nsISupports*  element = mArray[aIndex];
    NS_IF_ADDREF(element);  //return expr ? expr->AddRef() : 0;
    return element;
  }
  return 0;
}
In the core file, besides the sighandler, the top of the call stack is:
libxpcom.so`__1cPnsSupportsArrayJElementAt6MI_pnLnsISupports__+0x27(81e6b78, 0)
0xcd250831(81e6b78, 0, cb21f7bc)
checking the assembly code:
: pushl  %ebp
+1: movl   %esp,%ebp
+3: pushl  %ebx
+4: call   +0x5 <libxpcom.so`__1cPnsSupportsArrayJElementAt6MI_pnLnsISupports__+9>
+9: popl   %ebx
+0xa: addl   $0x85b7f,%ebx
+0x10: movl   0xc(%ebp),%ecx
+0x13: movl   0x8(%ebp),%eax
+0x16: cmpl   0x10(%eax),%ecx
+0x19: jae    +0x19
<libxpcom.so`__1cPnsSupportsArrayJElementAt6MI_pnLnsISupports__+0x32>
+0x1b: movl   0x8(%eax),%eax
+0x1e: movl   (%eax,%ecx,4),%ebx
+0x21: testl  %ebx,%ebx
+0x23: je     +0x11
<libxpcom.so`__1cPnsSupportsArrayJElementAt6MI_pnLnsISupports__+0x34>
+0x25: movl   (%ebx),%eax
+0x27: movl   0xc(%eax),%eax
+0x2a: pushl  %ebx
+0x2b: call   *%eax
+0x2d: addl   $0x4,%esp
+0x30: jmp    +0x4
<libxpcom.so`__1cPnsSupportsArrayJElementAt6MI_pnLnsISupports__+0x34>
+0x32: xorl   %ebx,%ebx
+0x34: movl   %ebx,%eax
+0x36: popl   %ebx
+0x37: movl   %ebp,%esp
+0x39: popl   %ebp
+0x3a:    ret
We can see that at +0x25, where %ebx already holds the "element", %eax gets
the vtable of the object. Checking the registers and memory, we get:
$r
%cs = 0x0017            %eax = 0x00000000
%ds = 0x001f            %ebx = 0xceba8000
%ss = 0x001f            %ecx = 0xcb21f408
%es = 0x001f            %edx = 0xd362fa00
%fs = 0x0000            %esi = 0x0000000b
%gs = 0x012f            %edi = 0xcb21f480
0xceba8000/X
0xceba8000:     c8b18
so %eax is supposed to be c8b18. However, we found %eax = 0x00000000, and that
caused the crash.
I think the reason for the crash in
http://bugzilla.mozilla.org/attachment.cgi?id=143388&action=view might be that
ObserverListEnumerator is not thread safe. It may need to share the lock with
the nsObserverList that the enumerator is obtained from.
Can this be verified to exist in thunderbird also? I'm running thunderbird now;
it's a bit more stable but it still crashes sometimes.
Jerry, can you try these (assuming you are using bash)?
1. export NSPR_LOG_MODULES=ObserverService:5
   export NSPR_LOG_FILE=nspr.log
2. run mozilla mail as you usually do until it crashes
3. post the file nspr.log here

Thanks
I think the root cause may be in nsObserverService::EnumerateObservers(). I
found that two threads access this method. One thread's call stack is:
 nsWeakReference::AddRef()
 nsSupportsArray::ElementAt()
 nsSupportsArray::GetElementAt()
 ObserverListEnumerator::GetNext()
 ObserverService::NotifyObservers()
 nsEventQueueImpl::NotifyObservers()
 nsEventQueueImpl::~nsEventQueueImpl()
 nsEventQueueServiceImpl::PopThreadEventQueue()
 ...

The other thread's call stack is:

 nsWeakReference::AddRef()
 nsSupportsArray::ElementAt()
 nsSupportsArray::GetElementAt()
 ObserverListEnumerator::GetNext()
 ObserverService::NotifyObservers()
 nsEventQueueImpl::NotifyObservers()
 nsEventQueueImpl::InitFromPRThread()
 nsEventQueueServiceImpl::MakeNewQueue()
 nsEventQueueServiceImpl::CreateEventQueue()
 nsProxyObject::PostAndWait()
 ...
Attached patch add a monitor (obsolete) — Splinter Review
Comment on attachment 164365 [details] [diff] [review]
add a monitor 

Can you give r? Thanks
Attachment #164365 - Flags: review?(bienvenu)
Hi, sorry for the delay.

As of now I'm running with the suggested NSPR_* env variables, but I'm not running
mozilla any longer. I'm running thunderbird 0.7.3 and I don't have the opportunity
to run mozilla because it would interfere with my work. Although thunderbird is
more stable, it too crashes from time to time.
Darin, biesi, this is the same issue as I uncovered in bug 266873 - the global
observer events for nsIEventQueueCreated and nsIEventQueueDestroyed are being
fired on multiple threads: I presume the appshellservice doesn't even want those
notifications for non-main-thread event queues. In this case things are being
compounded by the weak reference, which appears to be racing to a dual-release
or something like that.
Comment on attachment 164365 [details] [diff] [review]
add a monitor 

No, I'm not a module owner - dougt or darin would be your best bets...
Attachment #164365 - Flags: review?(bienvenu) → review?(darin)
Blocks: 266873
Jerry, this crash would be pretty much just as likely to happen in Thunderbird.
And when it's fixed in Mozilla, it will be fixed in thunderbird at the same time...
Comment on attachment 164365 [details] [diff] [review]
add a monitor 

Don't use a monitor where a lock will do.  Do use a lock, or if possible,
atomic instructions in AddRef and Release, which is what
NS_IMPL_THREADSAFE_ISUPPORTS will give you.

Looks like nsObserverList is thread-safe but ObserverListEnumerator is not,
which is a bug too.

It's not clear to me that there's a double-release bug too, but let's fix the
above two bugs and see what we can see.
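
As a rough illustration of what atomic AddRef and Release look like, here is a sketch in standard C++ using `std::atomic`. It is not the expansion of NS_IMPL_THREADSAFE_ISUPPORTS; the class name `RefCounted` and the `destroyed_flag` parameter are hypothetical and exist only so the behavior can be observed.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical reference-counted object with atomic counting.
class RefCounted {
public:
  // destroyed_flag is set to true when the object deletes itself.
  explicit RefCounted(bool* destroyed_flag) : destroyed_(destroyed_flag) {}

  // fetch_add is a single atomic read-modify-write, so two threads calling
  // AddRef at the same time cannot lose an increment.
  unsigned AddRef() { return ref_count_.fetch_add(1) + 1; }

  // fetch_sub returns the previous value; exactly one caller observes the
  // transition to zero and deletes the object.
  unsigned Release() {
    unsigned remaining = ref_count_.fetch_sub(1) - 1;
    if (remaining == 0) delete this;
    return remaining;
  }

private:
  // Private destructor: the object may only be destroyed through Release().
  ~RefCounted() { *destroyed_ = true; }

  std::atomic<unsigned> ref_count_{0};
  bool* destroyed_;
};
```

With a plain `++`/`--` on an ordinary integer, two threads can read the same old value and the count drifts, which is one way to end up with a premature or double delete.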

This would be good to get for thunderbird 1.0.

/be
Attachment #164365 - Flags: superreview-
Flags: blocking-aviary1.0?
Attached patch v2 patch - Splinter Review
Here's a better patch. It makes no sense to invoke the observer service from a
background thread.  The observers don't expect to be called on the background
thread, and there is no contract that requires them to be threadsafe. 
Moreover, we don't make any effort to proxy notifications from a background
thread over to the "right" thread.  Lastly, the only consumer of this
particular notification expects to be called on the main thread and definitely
has no interest in non-native event queues such as the ones created by IMAP.
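
The idea of only notifying from the main thread can be sketched like this. It is a simplified model in standard C++ and not the contents of the v2 patch: `MainThreadOnlyNotifier`, the `"queue-created"` topic string, and the explicit `caller` parameter are all hypothetical.

```cpp
#include <cassert>
#include <string>
#include <thread>
#include <vector>

// Hypothetical notifier that refuses to deliver from background threads.
class MainThreadOnlyNotifier {
public:
  // Remembers which thread counts as the "main" thread.
  explicit MainThreadOnlyNotifier(std::thread::id main_thread)
      : main_thread_(main_thread) {}

  // Delivers the topic only when called on the main thread. Returns true
  // if the notification was delivered, false if it was skipped. The caller
  // id defaults to the current thread; it is a parameter only so the
  // behavior can be exercised without spawning threads.
  bool NotifyObservers(const std::string& topic,
                       std::thread::id caller = std::this_thread::get_id()) {
    if (caller != main_thread_) return false;
    delivered_.push_back(topic);
    return true;
  }

  const std::vector<std::string>& delivered() const { return delivered_; }

private:
  std::thread::id main_thread_;
  std::vector<std::string> delivered_;
};
```

Since background threads never reach the observer list in this model, the observers themselves do not need to be thread-safe.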
Assignee: bienvenu → darin
Attachment #143299 - Attachment is obsolete: true
Attachment #164365 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #164484 - Flags: superreview?(bienvenu)
Attachment #164484 - Flags: review?(bsmedberg)
Target Milestone: --- → mozilla1.8beta
Attachment #164484 - Flags: superreview?(bienvenu) → superreview+
erm, all observers must be able to live on the main thread? what if my observer
doesn't want to?
> erm, all observers must be able to live on the main thread? what if my observer
> doesn't want to?

timeless: i don't know... you may be SOL, or perhaps the observer service will
work properly if your observers and the guy calling NotifyObservers all live on
the same thread.  clearly, there is no code to support notifying observers from
a background thread and having those observers execute on the main thread (or
whatever appropriate thread).

in the long run, the observer service should either build proxies or partition
the observers by thread such that any notifications for topic "foo" on thread 1
will only affect observers registered for topic "foo" on thread 1.
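
The "partition the observers by thread" option could look roughly like the sketch below. It is only an illustration of the design idea in standard C++; `PartitionedObserverService` and its methods are hypothetical, and nothing like this exists in nsObserverService.cpp as far as this bug shows.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical observer service that keys observers by (thread, topic).
class PartitionedObserverService {
public:
  using Callback = std::function<void()>;

  // Registers cb for topic on the owner thread (defaults to the caller).
  void AddObserver(const std::string& topic, Callback cb,
                   std::thread::id owner = std::this_thread::get_id()) {
    std::lock_guard<std::mutex> guard(lock_);
    observers_[std::make_pair(owner, topic)].push_back(std::move(cb));
  }

  // Calls only the observers that were registered for this topic on the
  // caller's thread. Returns how many observers were called.
  std::size_t NotifyObservers(
      const std::string& topic,
      std::thread::id caller = std::this_thread::get_id()) {
    std::vector<Callback> snapshot;
    {
      std::lock_guard<std::mutex> guard(lock_);
      auto it = observers_.find(std::make_pair(caller, topic));
      if (it != observers_.end()) snapshot = it->second;
    }
    for (auto& cb : snapshot) cb();
    return snapshot.size();
  }

private:
  using Key = std::pair<std::thread::id, std::string>;
  std::mutex lock_;
  std::map<Key, std::vector<Callback>> observers_;
};
```

The lock protects only the registry; each callback still runs on the thread that registered it, so observers are never dragged across threads.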

note: my patch only affects nsEventQueue.cpp... it leaves nsObserverService.cpp
completely untouched.
I think we should take darin's minimal patch for the branches, and leave this
bug open for a bigger trunk patch that removes bogus threadsafe-isupports
wallpaper in observer-service and -list land, instead asserting or testing
is-main-thread and enforcing single-threadedness.

Timeless: do you have any real requirements, or were you just wondering whether
the o.s. might not be MT?  It's reasonable to want it that way, but we need a
new design and interface contracts.

/be
Flags: blocking-aviary1.0? → blocking-aviary1.0+
*** Bug 245820 has been marked as a duplicate of this bug. ***
Attachment #164365 - Flags: review?(darin)
fixed-on-trunk

brendan: i'd rather file a new bug for the enhancements to observer service
since this bug has the crash keyword :)
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
i'm pretty sure i have real requirements. we have js components which want to
(and do) live on other threads, they want to be able to observe things, but they
don't want to be dragged across threads. keep in mind that the "proxy" object
offered by xpcom will *drag* the object across threads, resulting in
threadsafety errors.

this isn't urgent, and having working imap is more important to me as an end
user, but the fix does rub me the wrong way and is quite likely to hose me
eventually.

as long as someone files a bug about that, i'll be ok, i suppose.
Comment on attachment 164484 [details] [diff] [review]
v2 patch

This is good for the 1.7 branch as well as the aviary 1.0 branch.  It's a very
safe fix.  Should only apply to IMAP since that's our only consumer of
nsEventQueue on background threads.
Attachment #164484 - Flags: approval1.7.x?
Attachment #164484 - Flags: approval-aviary?
i just looked at the patch (after reading darin's note). am i understanding that
as saying that we're changing the contract for eventqueue push/pop? that's very
unsettling.
timeless: the observer events are a private backdoor mechanism used to enable
native event queues.  background threads don't need to use native event queues.
 UI in mozilla only runs from the main event queue.  nothing changes for event
queues managed on the UI thread.
I'd like to plus this for the aviary 1.0 branch. However, I understand if Ben
would rather wait and have us check this in after Firefox 1.0 is out the door to
minimize risk, since the problem only affects Thunderbird.
Comment on attachment 164484 [details] [diff] [review]
v2 patch

a=mkaply for 1.7
Attachment #164484 - Flags: approval1.7.x? → approval1.7.x+
Comment on attachment 164484 [details] [diff] [review]
v2 patch

a=asa for aviary checkin but it would be nice if we could wait until after
firefox 1.0 ships.
Attachment #164484 - Flags: approval-aviary? → approval-aviary+
yeah I already told Darin we'd wait until after firefox 1.0 is out the door...
*** Bug 264935 has been marked as a duplicate of this bug. ***
*** Bug 268313 has been marked as a duplicate of this bug. ***
Darin, you can go ahead and check this into the aviary 1.0 branch now. I can do
it for you if you want too. Thanks again.
fixed1.7.x, fixed-aviary1.0
Whiteboard: [not fixed in firefox1.0]
Hey Darin,

I think this patch may have introduced a crash on linux builds that seems to
affect Firefox, Thunderbird and mozilla.

See crash reports in:

https://bugzilla.mozilla.org/show_bug.cgi?id=269076
https://bugzilla.mozilla.org/show_bug.cgi?id=269585
https://bugzilla.mozilla.org/show_bug.cgi?id=268402
https://bugzilla.mozilla.org/show_bug.cgi?id=270064

They all seem to die in event_process_queue, on both branch and trunk, and they
popped up around the time this fix went into the branch and trunk.

(In reply to comment #64)
> I think this patch may have introduced a crash on linux builds that seems to
> affect Firefox, Thunderbird and mozilla.

this is being tracked in bug 269585, which is a topcrasher (and also affects
aviary1.0-tbird bits).
note to self: I temporarily backed this out of the aviary branch until we fix
Bug #269585 (sounds like Darin is getting close)
Keywords: fixed-aviary1.0
Depends on: 269585
It's as if nsIThread::IsMainThread is lying to us :(
Attached patch v3 patch (obsolete) — Splinter Review
alternate patch.  this version bypasses NotifyObservers when the event queue is
not native.  that should solve this bug, and should hopefully avoid the crashes
in event_processor_callback that seem to have resulted from the v2 patch.
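
A simplified model of "bypass NotifyObservers when the event queue is not native" is sketched below. It is not the v3 patch: `EventQueueModel`, the `is_native` flag, and the topic strings are hypothetical names chosen for the illustration.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical model of an event queue; not the real nsEventQueueImpl.
class EventQueueModel {
public:
  using Notifier = std::function<void(const std::string&)>;

  EventQueueModel(bool is_native, Notifier notifier)
      : is_native_(is_native), notifier_(std::move(notifier)) {}

  // Lifecycle hooks that would announce the queue to observers.
  void Init() { NotifyObservers("queue-created"); }
  void Shutdown() { NotifyObservers("queue-destroyed"); }

private:
  // Non-native queues (such as those on background threads) skip the
  // notification entirely, so the observer list is never touched for them.
  void NotifyObservers(const std::string& topic) {
    if (!is_native_) return;
    notifier_(topic);
  }

  bool is_native_;
  Notifier notifier_;
};
```

Keying the decision on a property of the queue rather than on the calling thread avoids depending on a correct answer to "am I on the main thread?".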
Attachment #166379 - Flags: superreview?(bienvenu)
Attachment #166379 - Flags: review?(dbaron)
Attachment #166379 - Flags: superreview?(bienvenu) → superreview+
Comment on attachment 166379 [details] [diff] [review]
v3.1 patch - same thing, but with an assertion about being on the main thread

I would like to try out this fix on the trunk. If all goes well, it should fix
the topcrasher, bug 269585 (which is blocking 1.8a5)
Attachment #166379 - Flags: approval1.8a5?
Attachment #166379 - Flags: review+ → review?(dbaron)
Comment on attachment 166379 [details] [diff] [review]
v3.1 patch - same thing, but with an assertion about being on the main thread

a=asa for 1.8a5 checkin.
Attachment #166379 - Flags: approval1.8a5? → approval1.8a5+
v3.1 patch fixed-on-trunk:

Checking in nsEventQueue.cpp;
/cvsroot/mozilla/xpcom/threads/nsEventQueue.cpp,v  <--  nsEventQueue.cpp
new revision: 3.43; previous revision: 3.42
done
Someone going to check this into aviary then?
Product: MailNews → Core
I just checked the alternate fix into the aviary 1.0 branch since talkback shows
it fixed the crash regression. 
Keywords: fixed-aviary1.0
What about the 1.7 branch?
Attachment #166379 - Flags: approval1.7.x?
Adding topcrash info from duped bug 264935 for tracking. 
Keywords: topcrash
Summary: Unknown random SEGV/seg fault/core dumps/crashes, only thing on is Mail/IMAP → Unknown random SEGV/seg fault/core dumps/crashes, only thing on is Mail/IMAP [@ 0x00000001 - nsSupportsArray::ElementAt][@ nsSupportsArray::Clear][@ NSS_CMSArray_Sort][@ nsSupportsArray::Clear][@ nsSupportsArray::DeleteArray]
Comment on attachment 166379 [details] [diff] [review]
v3.1 patch - same thing, but with an assertion about being on the main thread

a=mkaply
Attachment #166379 - Flags: approval1.7.x? → approval1.7.x+
v3.1 patch fixed1.7.x
(In reply to comment #74)
> I just checked the alternate fix into the aviary 1.0 branch since talkback shows
> it fixed the crash regression. 

thunderbird built on 11/23 has been running since 11/23, where previously it
was crashing every couple of hours. I'm immensely happy. I think thunderbird
1.0 should be released now :)

Thanks all!
As my originally reported bug (Bug 268313) was closed as a duplicate, I'm noting
here that there are still some crashes when closing thunderbird (version 0.9+
(20041129)):
TB2275925H,TB2266547Q
Product: Core → MailNews Core
Crash Signature: [@ 0x00000001 - nsSupportsArray::ElementAt] [@ nsSupportsArray::Clear] [@ NSS_CMSArray_Sort] [@ nsSupportsArray::Clear] [@ nsSupportsArray::DeleteArray]