Closed Bug 18005 Opened 21 years ago Closed 20 years ago

[DOGFOOD] Leave mail window for a long time, GetMsg, crash

Categories

(MailNews Core :: Networking, defect, P3, critical)

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: trudelle, Assigned: dougt)

Details

(Whiteboard: [PDT+] Verified for all the platforms)

Attachments

(1 file)

Today's opt build (yesterday too)
Launch Apprunner
Task>Mail
Open IMAP server
select inbox
read a message (possibly extraneous step)
Leave mail window sitting there for a while without using it.
Click GetMsg
Crash, log available.

Saw several times on Mac, once on Linux, will try on Win98
Assignee: phil → bienvenu
Summary: Crash on GetMsg → Leave mail window for a long time, GetMsg, crash
Here's the stack trace. Peter, is this really a Seamonkey stack trace? It has
all sorts of names which look like 4.x.

Calling chain using A6/R1 links

   Back chain  ISA   Caller
   00000000    PPC   16C7DF28
   068428C0    PPC   16C7E07C
   06842870    PPC   174BDE6C LApplication::Run()+000B8
   06842800    PPC   17099704 XP_GetNonGridContext+285EC
   068427A0    PPC   17447B64 LPeriodical::DevoteTimeToRepeaters(const EventRecord&)+0004C
   06842740    PPC   16D2F6D8 CFrontApp::GetApplication()+010D4
   068426F0    PPC   16D2FE3C CFrontApp::GetApplication()+01838
   06842660    PPC   16D31948 SSL_DataPending+01390
   06842610    PPC   16E05C94 CACHE_FindURLInCache+0317C
   068425C0    PPC   16EB2568 NET_CacheConverter+01560
   06842560    PPC   16E92420 NET_DeregisterContentTypeConverter+087C0
   06842510    PPC   17057630 FE_DefaultDocCharSetID+3AAD8
   068424A0    PPC   171AD594 XP_Confirm+178D0
   06842450    PPC   17071BC4 XP_GetNonGridContext+00AAC
   068423B0    PPC   1707292C XP_GetNonGridContext+01814
   06842360    PPC   16DCEA64 XP_TempDirName+0C0B4
   068422F0    PPC   16DCF5A8 XP_TempDirName+0CBF8
   068422B0    PPC   16DD0344 XP_TempDirName+0D994
   06842270    PPC   16DD0344 XP_TempDirName+0D994
   06842230    PPC   16DD0344 XP_TempDirName+0D994
   068421F0    PPC   16DD0358 XP_TempDirName+0D9A8
   068421B0    PPC   16D96644 SOB_get_error+019E4
   06842170    PPC   174F3EDC Flush_Free+0000C

Return addresses on the stack

   Stack Addr  Frame Addr  ISA   Caller
   068424D8                PPC   16D18A18 XP_PlatformFileToURL+0A1D4
   068424CC                68K   0636DE42
   068424C8                PPC   16CF8804 INTL_DefaultWinCharSetID+004F0
   068424B8                68K   17445482 LBroadcaster::BroadcastMessage(long, void*)+0008A
   068424A8                PPC   17057630 FE_DefaultDocCharSetID+3AAD8
   0684248C                68K   065AA29E
   06842458    06842450    PPC   171AD594 XP_Confirm+178D0
   06842408    06842400    PPC   174F3E88 Flush_Allocate+0001C
   068423B8    068423B0    PPC   17071BC4 XP_GetNonGridContext+00AAC
   06842388                PPC   17133678 UGraphicGizmos::BevelRect(const Rect&, short, short , short)+05EE4
   06842368    06842360    PPC   1707292C XP_GetNonGridContext+01814
   0684235C    06842358    68K   063E7ACA
   06842308    06842300    PPC   16F04378 XP_ProgressText+20950
   068422F8    068422F0    PPC   16DCEA64 XP_TempDirName+0C0B4
   068422EC                68K   063E7ACA
   068422DE                68K   0003FFFE
   068422D8    068422D0    PPC   1732C8B8 PR_ExitMonitor+00098
   068422CC                68K   0635866A
   068422B8    068422B0    PPC   16DCF5A8 XP_TempDirName+0CBF8
   06842298                68K   063E7ACA
   06842288                68K   063DA9CE
   06842278    06842270    PPC   16DD0344 XP_TempDirName+0D994
   06842258    06842250    PPC   16D7B890 ET_moz_CallFunction+003C0
   06842238    06842230    PPC   16DD0344 XP_TempDirName+0D994
   06842218    06842210    PPC   16D7BB04 ET_moz_CallFunction+00634
   068421F8    068421F0    PPC   16DD0344 XP_TempDirName+0D994
   068421D8    068421D0    PPC   16D7C044 ET_moz_CallFunction+00B74
   068421CC                68K   0635866A
   068421C8    068421C0    PPC   16C7F3D4
   068421B8    068421B0    PPC   16DD0358 XP_TempDirName+0D9A8
   068421A8    068421A0    PPC   17043C44 FE_DefaultDocCharSetID+270EC
   06842194                68K   063DA9CE
   06842188    06842180    PPC   174F3EDC Flush_Free+0000C
   06842178    06842170    PPC   16D96644 SOB_get_error+019E4
   0684215C                68K   0635866A
   06842158    06842150    PPC   16DCF51C XP_TempDirName+0CB6C
   06842148                68K   063E7ACA
   06842138    06842130    PPC   174F3EDC Flush_Free+0000C
   06842118    06842110    PPC   174F3EDC Flush_Free+0000C
   06842108    06842100    PPC   16DCF020 XP_TempDirName+0C670
   068420F8    068420F0    PPC   174F3EDC Flush_Free+0000C
   068420F4    068420F0    68K   063E7ACA
Severity: normal → critical
QA Contact: lchiang → esther
Summary: Leave mail window for a long time, GetMsg, crash → [DOGFOOD] Leave mail window for a long time, GetMsg, crash
is this a seamonkey crash, or a 4.5 crash? was 4.5 running at the time?
Did I send the wrong file?  Sorry, I'll try it again.
Looks like there were two logs in the file I sent, and only the first (a 4.7
crash) got pasted. I deleted that log from the file and attached the apprunner
log only.
OK, here's the stack trace from the attachment. Looks like a problem shutting
down the thread, especially with the proxy event code. I'm assuming biff is not
turned on, or we wouldn't have timed out.

   04F29908    04F29900    PPC   1791E650 PR_CSetOnMonitorRecycle+00050
   04F298C8    04F298C0    PPC   16BBB294 nsThread::Exit(void*)+0001C
   04F29888    04F29880    PPC   16BBB438 nsThread::Release()+00040
   04F29848                68K   16BBB19E nsThread::~nsThread()+00036
   04F29808    04F29800    PPC   16B85690 nsCOMPtr_base::~nsCOMPtr_base()+00030
   04F297C8    04F297C0    PPC   163289F4 nsImapProtocol::Release()+289F4
   04F297A8    04F297A0    PPC   1791E474 PR_CExitMonitor+00074
   04F29788    04F29780    PPC   16329C14 nsImapProtocol::~nsImapProtocol()+29C14
   04F29768    04F29760    PPC   16C86950 operator delete(void*)+00014
   04F29758    04F29750    PPC   17922580 PR_ExitMonitor+00054
   04F29748    04F29740    PPC   17922408 PR_DestroyMonitor+0001C
   04F29730                68K   05BA264E
   04F29728    04F29720    PPC   16C877F8 free+00030
   04F29708    04F29700    PPC   1792405C PR_DestroyLock+00018
   04F296E8    04F296E0    PPC   16BC83FC nsProxyEventObject::~nsProxyEventObject()+000F0
   04F296D8    04F296D0    PPC   16C8956C nsLargeHeapAllocator::AllocatorFreeBlock(void*)+00020
   04F296C8    04F296C0    PPC   1791DE94 PR_Free+00014
   04F296B8    04F296B0    PPC   16B886AC nsAllocator::Free(void*)+00054
   04F296A8    04F296A0    PPC   16BC8480 nsProxyEventObject::RootRemoval()+00034
   04F29688    04F29680    PPC   16C86950 operator delete(void*)+00014
I tried this on windows. It seemed fine. I'll try linux next.
Right, no biff.
I can't reproduce this on Win98, but I just reproduced it on Linux again.
Do we have a dangling connection to a timed-out thread?
I reproduced the crash on linux. We get the following stack trace. This is
probably some symptom of our screwed-up event handling. Perhaps DougT's proxy
event changes will help, though I doubt it.

#0  0x40368888 in main_arena ()
#1  0x68403688 in ?? ()
#2  0x408e37ea in nsStreamListenerEvent::HandlePLEvent (aEvent=0x83fec48) at nsAsyncStreamListener.cpp:169
#3  0x4019a36b in PL_HandleEvent (self=0x83fec48) at plevent.c:537
#4  0x4019a27c in PL_ProcessPendingEvents (self=0x8736020) at plevent.c:498
#5  0x401599e9 in nsEventQueueImpl::ProcessPendingEvents (this=0x8735ff8) at nsEventQueue.cpp:190
#6  0x405181ec in event_processor_callback (data=0x8735ff8, source=21, condition=GDK_INPUT_READ) at nsAppShell.cpp:228
#7  0x40517aff in our_gdk_io_invoke (source=0x8736080, condition=G_IO_IN, data=0x8722e98) at nsAppShell.cpp:49
#8  0x406b23ca in g_io_unix_dispatch ()
#9  0x406b3a86 in g_main_dispatch ()
#10 0x406b4041 in g_main_iterate ()
#11 0x406b41e1 in g_main_run ()
#12 0x405dd7a9 in gtk_main ()
#13 0x405186ff in nsAppShell::Run (this=0x80a2ce8) at nsAppShell.cpp:395
#14 0x4039d351 in nsAppShellService::Run (this=0x80a1f60) at nsAppShellService.cpp:480
More likely we have a proxy event in the event queue, and it refers to a deleted
object, like the protocol, or thread. Since linux event handling seems fairly
messed up, at least as far as IMAP is concerned, this doesn't surprise me too
much.
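
To make the suspected hazard concrete, here is a minimal standalone sketch in plain C++. It is not the Mozilla code and not the fix that was eventually checked in; Protocol, EventQueue and PostData are hypothetical names invented for illustration. The idea is that a queued event holds only a weak reference to its target, so an event that outlives the target is dropped instead of calling into freed memory.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a protocol object that receives queued events.
struct Protocol {
    std::vector<std::string> log;
    void OnData(const std::string& chunk) { log.push_back(chunk); }
};

// Hypothetical single-threaded event queue (no locking in this sketch).
class EventQueue {
public:
    void Post(std::function<void()> ev) { events_.push(std::move(ev)); }

    // Runs every pending event and returns how many were run.
    int ProcessPending() {
        int handled = 0;
        while (!events_.empty()) {
            std::function<void()> ev = std::move(events_.front());
            events_.pop();
            ev();
            ++handled;
        }
        return handled;
    }

private:
    std::queue<std::function<void()>> events_;
};

// Queues an event that holds only a weak reference to its target.
// If the target has been destroyed by the time the event runs, the event
// counts itself as dropped instead of touching the dead object.
void PostData(EventQueue& q, const std::shared_ptr<Protocol>& target,
              std::string chunk, int& dropped) {
    std::weak_ptr<Protocol> weak = target;
    q.Post([weak, chunk, &dropped] {
        if (std::shared_ptr<Protocol> strong = weak.lock()) {
            strong->OnData(chunk);
        } else {
            ++dropped;
        }
    });
}
```

With a raw pointer captured in place of the weak reference, releasing the target before ProcessPending runs would leave the queued event pointing at a deleted object, which is the failure mode guessed at above.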
Whiteboard: [PDT+]
Putting on PDT+ radar.
If you turn on biff at an interval less than 29 minutes, you won't have this
problem.
What's happening, I bet, is that we're removing the timed-out connection,
attempting to logout, and releasing the imap protocol instance. This eventually
causes the imap thread to be destroyed. On windows, this happens later on the
thread in question, but it looks like on the mac, it happens immediately on the
ui thread. On linux, it looks like the proxy event stuff isn't noticing that the
event queue has gone away.
Thanks David, I thought that (30 min. connection drop) might be the case, and
the workaround is good enough for dogfood.
It turns out that if we really did drop the connection, everything would be
fine. Unfortunately, we try to gracefully close and logout. If I comment out
those calls, we don't crash.  My gdb/linux skills are pretty marginal - all I
can guess is that the vtbl for the StreamListenerEvent is horked, but the object
doesn't look deleted. I'll keep poking around but I suspect this will take a few
days.
Oy, gevalt. The nsImapProtocol object is definitely getting destroyed before the
event queue is finished, which is not good. But what's worse is that I put in a
call to StopAcceptingEvents after our thread has stopped running to see if that
helps. It didn't help, but it allowed me to discover that on linux (but not
windows), our imap event queue is somehow marked the "elder" event queue. (I
suspect this should be "eldest"). This seems wrong.
The above is partly wrong - the elder assert happens on windows as well, so
perhaps it's not the problem. But, we are executing the
onDataAvailableEvent::HandleEvent on the wrong thread on Linux (i.e., the main
thread), just like 17065 - my gut tells me this is the root of our problem.
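
As a rough illustration of the "wrong thread" symptom, the sketch below shows an event queue that remembers which thread created it and refuses to run events for any other caller. This is only an assumption-laden model in plain C++, not the nsEventQueue API; OwnedEventQueue and its methods are hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical event queue that records the thread that created it.
// Post is not synchronized here; a real cross-thread queue would need a lock.
class OwnedEventQueue {
public:
    OwnedEventQueue() : owner_(std::this_thread::get_id()) {}

    void Post(std::function<void()> ev) { events_.push(std::move(ev)); }

    // Runs pending events only when the caller is the owning thread.
    // Returns the number of events run, or -1 (running nothing) otherwise.
    // The caller id is a parameter so the check can be exercised directly.
    int ProcessPending(std::thread::id caller = std::this_thread::get_id()) {
        if (caller != owner_) {
            return -1;
        }
        int handled = 0;
        while (!events_.empty()) {
            std::function<void()> ev = std::move(events_.front());
            events_.pop();
            ev();
            ++handled;
        }
        return handled;
    }

private:
    std::thread::id owner_;
    std::queue<std::function<void()>> events_;
};
```

A guard like this would not fix the dispatch problem by itself, but it would turn "IMAP event handled on the UI thread" into an immediate, visible refusal rather than a crash later on.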
I've verified that if I stop gtk from calling into imap code from the UI thread,
this crash doesn't happen. I did this by disabling the
nsAppShell::ListenToEventQueue call, which prevents us from getting called from
the ui thread. Unfortunately, it also breaks the password prompt, presumably
because that's why this event queue listener hack is there in the first place.

I believe this is an xpapps problem, so I'm reassigning it back to you, Peter. I
truly believe that we should be called from the correct thread.
Assignee: bienvenu → trudelle
David, I think we're all agreed that the source of this problem is the same
problem for 17065. Brendan is going to help me find someone to help us figure
out what's going on with event processing on linux. I'm hesitant to mark this a
dup but the problem is probably the same even though the symptoms are different.
Assignee: trudelle → brendan
Reassigning to brendan for triage.

Let's not forget, this also happened on Mac, as did 17065.
Yep, and I believe they both do hacky things with event dispatching to get modal
dialogs to work.  I believe these two bugs have the same cause, and Scott and I
spent a lot of time discovering that in both cases, our events are getting
processed by the wrong thread.
Status: NEW → ASSIGNED
Dan is gonna help me fix this on all platforms, yes he is.

/be
Blocks: 18471
Target Milestone: M12
17065 is M12, so should this one be.

/be
Blocks: 18951
Blocks: 20203
*** Bug 20247 has been marked as a duplicate of this bug. ***
Can also be seen when using "an imap server that only allows a single connection
to a folder, and kills previous connections (like the UW server)" as bienvenu
mentions in the duplicate bug.
QA Contact: esther → huang
Changing QA Contact to me since this is an IMAP bug. Cc: Esther.
Same occurs for me: I'm using UW-IMAP, a non-Mozilla-Biff checking the INBOX
every 30 seconds and "check mail every 1 min." in Mozilla.
While having a normal subfolder (not INBOX and not under INBOX) open, I get
"Document: Done (0.21 secs) In OnFolderLoader" every min. or so. Mozilla (debug
build) crashed after 20 min. w/o any notice. HTH.
Brendan, what's projected fix date for this bug?
the better question to get started is who is going to tackle this hairy problem?
did we find a porkjockey owner?
Assignee: brendan → dougt
Status: ASSIGNED → NEW
dougt has been fixing bugs in event-loop land and kindly offers to take this
one. he's gonna dig into this tomorrow.

/be
Status: NEW → ASSIGNED
Whiteboard: [PDT+] → [PDT+] 12/9
Sent workaround to mscott to verify.  Still tracking down real problem.
Whiteboard: [PDT+] 12/9 → [PDT+] Fix ready, patch sent for review.
Blocks: 21564
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
fix checked in.
I have not been able to reproduce on linux 6.0, NT 4.0 or Mac OS 8.5.1 using
12-16-12m12 commercial build.  I was indeed seeing it often on my mac and linux
machines prior to this week's builds (fixed this week).

I will let huang or someone else who'd seen this double-check before marking it
verified.
This bug requires leaving the PC idle for a while... I will test it later, since I
need to continue the Basic Functionality Test for M12....
Blocks: 22176
Status: RESOLVED → VERIFIED
Whiteboard: [PDT+] Fix ready, patch sent for review. → [PDT+] Verified for all the platforms
Verified on the Linux 12-20-23-M12 final commercial build
Verified on the Mac 12-21-11-M12 final commercial build
Verified on the Linux 12-21-00-M12 final commercial build
I have idled for more than 30 minutes without a crash on all the platforms!!
Marking as Verified.
No longer blocks: 18471
No longer blocks: 18951
No longer blocks: 20203
No longer blocks: 21564
No longer blocks: 22176
Product: MailNews → Core
Product: Core → MailNews Core