Closed
Bug 312775
(imapdeadlock)
Opened 19 years ago
Closed 18 years ago
deadlock between the ui thread and the imap code is causing a hang
Categories
(MailNews Core :: Networking: IMAP, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: Bienvenu, Assigned: Bienvenu)
References
Details
(Keywords: fixed1.8.1)
Attachments
(13 files, 2 obsolete files)
1.73 KB,
patch
|
mscott
:
superreview+
mscott
:
approval1.8rc1+
|
Details | Diff | Splinter Review |
2.05 KB,
patch
|
Details | Diff | Splinter Review | |
15.17 KB,
text/plain
|
Details | |
4.13 KB,
patch
|
Details | Diff | Splinter Review | |
15.72 KB,
text/plain
|
Details | |
1.22 KB,
patch
|
Details | Diff | Splinter Review | |
21.16 KB,
patch
|
Details | Diff | Splinter Review | |
16.55 KB,
text/plain
|
Details | |
767 bytes,
patch
|
Details | Diff | Splinter Review | |
7.90 KB,
patch
|
mscott
:
superreview+
|
Details | Diff | Splinter Review |
14.64 KB,
text/plain
|
Details | |
14.48 KB,
text/plain
|
Details | |
7.08 KB,
text/plain
|
Details |
Neil is reporting a hang in the imap code when his laptop gets out of range of
his wireless network. The stack traces are roughly as follows:
NTDLL! 77f88f13()
NTDLL! 77f87f26()
nsImapIncomingServer::ConnectionTimeOut(nsImapIncomingServer * const
0x01010101, nsIImapProtocol * 0x064dd070) line 671 + 21 bytes
nsImapIncomingServer::GetImapConnection(nsImapIncomingServer * const
0x01010101, nsIEventQueue * 0x00f07f10, nsIImapUrl * 0x032f0470, nsIImapProtocol
* * 0x0012fd6c) line 738 + 10 bytes
nsImapIncomingServer::RetryUrl(nsImapIncomingServer * const 0x04bfbebc,
nsIImapUrl * 0x032f0470) line 485 + 40 bytes
XPTC_InvokeByIndex(nsISupports * 0x04bfbebc, unsigned int 16, unsigned int 1,
nsXPTCVariant * 0x032f40a8) line 102
EventHandler(PLEvent * 0x07671028) line 562 + 22 bytes
PL_HandleEvent(PLEvent * 0x07671028) line 689
PL_ProcessPendingEvents(PLEventQueue * 0x10044af9) line 624
_md_TimerProc(HWND__ * 0x002a00e0, unsigned int 275, unsigned int 0, unsigned
long 654316727) line 1013 + 6 bytes
USER32! 77e4158f()
USER32! 77e3c01a()
USER32! 77e41e7e()
nsAppStartup::Run(nsAppStartup * const 0x016f0188) line 207 + 48 bytes
main1(int 0, char * * 0x00264398, nsISupports * 0x00000000) line 1251
main(int 1, char * * 0x00264398) line 1736 + 22 bytes
mainCRTStartup() line 338 + 17 bytes
KERNEL32! 7c598989()
and the other thread with IMAP in the stack:
NTDLL! 77f88f13()
NTDLL! 77f87f26()
nsImapProtocol::TellThreadToDie(nsImapProtocol * const 0x00000000, int 0) line
1039
nsImapProtocol::CreateNewLineFromSocket(nsImapProtocol * const 0x7ffd5000)
line 4294
nsImapServerResponseParser::GetNextLineForParser(nsImapServerResponseParser *
const 0x7ffd5000, char * * 0x064dd1fc) line 130 + 12 bytes
nsIMAPGenericParser::AdvanceToNextLine(nsIMAPGenericParser * const 0x7ffd5000)
line 204
nsImapServerResponseParser::ParseIMAPServerResponse(nsImapServerResponseParser
* const 0x7ffd5000, const char * 0x00000000, int 1162626121) line 244
4d4c544e()
[
Assignee | ||
Comment 1•19 years ago
|
||
Neil, can you try this patch out? I've removed the acquiring of the server
monitor in nsImapIncomingServer::ConnectionTimeOut - it was only being used to
protect the connection cache, I believe. I switched to using
RemoveConnection(), which grabs the server monitor for a much shorter time.
Assignee | ||
Updated•19 years ago
|
Attachment #199885 -
Flags: superreview?(mscott)
Assignee | ||
Comment 2•19 years ago
|
||
Scott, can you try running with this patch as well? If it fixes Neil's problem,
I'm going to want to land this on the branch as well (Neil says the problem
occurs on the branch as well)
Status: NEW → ASSIGNED
Comment 3•19 years ago
|
||
will do.
Comment 4•19 years ago
|
||
I'm not actually sure how long a dropout I need to trigger this bug (my biff
check interval is one minute), but I had one overnight and one while I was away
from the PC each for at least two minutes, with no ill effects with this patch.
Assignee | ||
Comment 5•19 years ago
|
||
Great, thx, Neil. Scott, can you sr this so I can get it into the trunk -
should I just check it into the branch as well, at the same time?
Updated•19 years ago
|
Attachment #199885 -
Flags: superreview?(mscott)
Attachment #199885 -
Flags: superreview+
Attachment #199885 -
Flags: approval1.8rc1+
Assignee | ||
Comment 6•19 years ago
|
||
fixed on trunk.
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
Comment 7•19 years ago
|
||
Sorry, my patched build hung this morning, same stacks :-(
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 8•19 years ago
|
||
Neil, can you try this patch as well? The previous patch was checked in. With
this patch on top of that one, I would at least expect the stack trace to
change...
the change is to only set the m_lastActiveTime from the UI thread, so we don't
need a monitor, and thus can't get blocked. But I suspect we might see a stack
trace with CanHandleUrl on it instead.
Comment 9•19 years ago
|
||
Comment on attachment 200459 [details] [diff] [review]
additional patch to try
Sorry for the delay, but my PC crashed 20 times in the last month, so it's only now that I'm sure that this attachment fixes the given stack. I know this because I've now got a new hang ;-) Stack trace is:
NTDLL! 77f88f13()
NTDLL! 77f87f26()
nsImapIncomingServer::GetImapConnection(nsImapIncomingServer * const 0x042482a8, nsIEventQueue * 0x00ea4f10, nsIImapUrl * 0x041b48d8, nsIImapProtocol * * 0x0012fd6c) line 744 + 50 bytes
nsImapIncomingServer::RetryUrl(nsImapIncomingServer * const 0x02f2236c, nsIImapUrl * 0x041b48d8) line 485 + 40 bytes
XPTC_InvokeByIndex(nsISupports * 0x02f2236c, unsigned int 16, unsigned int 1, nsXPTCVariant * 0x04221100) line 102
EventHandler(PLEvent * 0x032640c8) line 562 + 22 bytes
PL_HandleEvent(PLEvent * 0x032640c8) line 689
PL_ProcessPendingEvents(PLEventQueue * 0x10044af9) line 624
_md_TimerProc(HWND__ * 0x00010088, unsigned int 275, unsigned int 0, unsigned long 28605342) line 1013 + 6 bytes
USER32! 77e4158f()
USER32! 77e3c01a()
USER32! 77e41e7e()
nsAppStartup::Run(nsAppStartup * const 0x01a7c378) line 207 + 48 bytes
main1(int 0, char * * 0x00262638, nsISupports * 0x00000000) line 1251
main(int 3, char * * 0x00262638) line 1736 + 22 bytes
mainCRTStartup() line 338 + 17 bytes
KERNEL32! 7c598989()
The line of code in GetImapConnection is
rv = connection->CanHandleUrl(aImapUrl, &canRunUrlImmediately, &canRunButBusy);
The IMAP thread is still TellThreadToDie() trying to call CloseStreams().
Assignee | ||
Comment 10•19 years ago
|
||
the second patch is on the 1.8.1 branch, as well as on the trunk.
Assignee | ||
Comment 11•19 years ago
|
||
*** Bug 326314 has been marked as a duplicate of this bug. ***
Updated•19 years ago
|
Alias: imapdeadlock
OS: Windows XP → All
Comment 12•19 years ago
|
||
Just ran into this again and thought I'd post another detailed backtrace.
Assignee | ||
Comment 13•19 years ago
|
||
is it possible that backtrace is missing a thread? I don't see any two threads blocked on the same monitor - with Neil's stacks, I do, so I'll try to work on his case.
Assignee | ||
Comment 14•19 years ago
|
||
nsImapProtocol::TellThreadToDie grabs an AutoMonitor on itself first thing, and holds onto it until it returns. It causes to be called functions that need a monitor on the server (nsImapIncomingServer::RemoveConnection, in particular). Meanwhile, functions that have a monitor on the server call functions that require a monitor on the protocol object. nsImapIncomingServer::GetImapConnection has a monitor on itself, and then iterates over the protocol objects calling ::CanHandleUrl, which grabs a monitor on the protocol objects. Deadly embrace-a-gogo. Lesson: nsCAutoMonitor considered harmful, and we need to limit the scope of the AutoMonitors.
Assignee | ||
Comment 15•19 years ago
|
||
Kai, this is a patch against the tip of the trunk that might help - please try it out and let me know...I've run with it for a bit and things seem to work ok, but I was never able to reproduce the problem. There are more changes I can make, but this one is easier and safer than the other changes I have in mind.
Comment 16•19 years ago
|
||
> is it possible that backtrace is missing a thread? I don't see any two threads
> blocked on the same monitor - with Neil's stacks, I do, so I'll try to work on
> his case.
Maybe I missed one. Note the stacktrace in the other bug 326314 I filed had the threads blocked on the same monitor.
> Kai, this is a patch against the tip of the trunk that might help - please try
> it out and let me know...
I'm currently working with the 1.8 branch, but your patch applies there, too!
I just rebuilt and will use it over the next days.
Today I ran into the problem only once, so we'll see how soon I'm confident this patch helps.
Comment 17•19 years ago
|
||
No, this patch does not fix it.
At least it does not on the 1.8 branch.
David, does it make sense that I try a trunk build with this patch applied?
Assignee | ||
Comment 18•19 years ago
|
||
either the trunk or the 1.8.1 branch - the 1.8 branch is missing the second patch above, so another option would be to try applying that patch as well to the 1.8 branch.
Whichever way you go, if it still happens with all the patches applied, if you could attach backtraces, I can continue from there. Thx!
Comment 19•19 years ago
|
||
I was wrong. I made a mistake, and when I posted my previous comment, my build did not yet contain your patch!
Now I'm sure, I'm running a MOZILLA_1_8_BRANCH build, and it contains all the patches attached to this bug.
I will report once I know more.
Comment 20•19 years ago
|
||
It just happened again, here are new stacks.
This one looks more obvious, 3 threads blocked on same lock.
Assignee | ||
Comment 21•19 years ago
|
||
Yes, the 3 threads are trying to get the lock on the imap incoming server, and nsImapProtocol::CreateNewLineFromSocket() grabs a monitor on the protocol object before calling TellThreadToDie(). Then nsImapIncomingServer::GetImapConnection has a lock on the server and calls nsImapProtocol::CanHandleUrl, which tries to get a lock on the protocol object. I'm not sre if nsImapProtocol::CreateNewLineFromSocket needs to grab the monitor it does, for as long as it does. I'll try to figure that out...
Status: REOPENED → ASSIGNED
Assignee | ||
Comment 22•19 years ago
|
||
Kai, here's a patch to try, on top of the previous ones. It should apply to the 1.8 branch as well. I've tested it briefly w/o a problem, but I don't hit the changed code very often. Let me know how it works, and attach a new stack trace if it hangs, thx!
Comment 23•19 years ago
|
||
Yes, your additional patch applies fine.
Comment 24•19 years ago
|
||
With your latest patch applied in addition, it ran stable for a while, until I switched my router off, and it hung again.
These are new stacks.
They are not as obvious to me as the previous ones, hopefully it makes sense to you.
I will continue to run and update more stacks.
Assignee | ||
Comment 25•19 years ago
|
||
we're in Idle() on the same protocol object that's trying to do a pseudo interrupt message load. Idle() is blocked wating for data, and hold a monitor for the protocol object while it does so. I'm not sure how to fix that...I'll have to think about it. Idle() is blocked way down in the pipe code waiting for data that's not going to come...it should get killed because the socket is dead but that doesn't always happen.
Comment 26•19 years ago
|
||
Assignee | ||
Comment 27•19 years ago
|
||
Kai, here's an additional change to try. It has real potential to re-introduce a hang on shutdown, however - https://bugzilla.mozilla.org/show_bug.cgi?id=246909
Assignee | ||
Comment 28•19 years ago
|
||
It definitely re-introduces bug 246909 so I'm going to need to figure out how not to regress that - but for now, if you don't have empty trash or cleanup inbox on exit set, you probably won't see bug 246909.
Comment 29•19 years ago
|
||
> but for now, if you don't have empty trash or cleanup
> inbox on exit set
I use neither, and with all the patches applied, this build has been running very well until now!!!
Assignee | ||
Comment 30•19 years ago
|
||
Kai, can you try this cumulative fix to make sure things still work for you with it? It shouldn't regress bug 246909 because I made it so we won't try to go idle when shutting down, and I think it should still work for you.
Scott, I had to add to an interface, but it's not one I would expect any extensions to touch with a 10 foot pole.
Attachment #212783 -
Flags: superreview?(mscott)
Comment 31•19 years ago
|
||
Comment on attachment 212783 [details] [diff] [review]
fix that doesn't regress exiting
>+ attribute boolean shuttingDown;
I don't actually see anyone setting this...
>+ PRBool shuttingDown;
>+ (void) imapServer->GetShuttingDown(&shuttingDown);
>+ if (!shuttingDown)
>+ (void) imapServer->GetUseIdle(&m_useIdle);
>+ else
>+ m_useIdle = PR_FALSE;
Would it be worth moving the shutting down check into GetUseIdle, thus avoiding the interface change?
Assignee | ||
Comment 32•19 years ago
|
||
Thx, Neil, forgot the base/src part of the patch. Since the account manager only has an nsIImapIncomingServer, the interface change is still required (though I could just have a setter...). I thought about putting it in GetUseIdle, but since I'd still need an interface change, this seemed about the same.
Attachment #212847 -
Flags: superreview?(mscott)
Comment 33•19 years ago
|
||
Some more feedback about the previous set of patches, prior to 2006-02-22.
I am running two instances of SeaMonkey all the time, two profiles. One work work, one for private stuff. Both profiles have a different IMAP+SSL account on separate servers open all the time.
The work profile did not hang since you added the latest patch on 2006-02-15. However, when my network trouble occurred, it could happen the application ended up in a state, where it still was responsive to user input, but loading data from the network did not work (after restoring the network). Even switching the application to offline and back online did not work. I was required to restart the application, but that worked fine, no hang.
In my private profile I have "cleanup inbox on exit" enabled, and I ran into your predicted hang once, but no more often.
I am now building a new optimized build of the latest 1.8 branch plus both patches from 2006-02-22 applied and will use that as my new primary application.
Comment 34•19 years ago
|
||
There was a typo in your patch, I fixed that and produced a combined patch.
Attachment #212783 -
Attachment is obsolete: true
Attachment #212847 -
Attachment is obsolete: true
Attachment #212865 -
Flags: review?(mscott)
Attachment #212783 -
Flags: superreview?(mscott)
Attachment #212847 -
Flags: superreview?(mscott)
Updated•19 years ago
|
Attachment #212865 -
Flags: review?(mscott) → superreview?(mscott)
Assignee | ||
Comment 35•19 years ago
|
||
Thx, Kai, what was the typo?
Comment 36•19 years ago
|
||
There was a ; after IMPL_GET_SET and gcc bailed out with a syntax error.
Assignee | ||
Comment 37•19 years ago
|
||
thx, I'm always doing that. Fixed in my tree.
Re networking activity stopping, does that include non-imap networking activity, like http in messages, or rss feeds? If so, I bet one of the necko threads is blocked, like the socket transport thread or the dns lookup thread - backtraces might be interesting.
Comment 38•19 years ago
|
||
I think the network hang was limited to imap, and other networking still working. But I might be wrong. If I see it again, I'll test.
Assignee | ||
Comment 39•19 years ago
|
||
if limited to imap, it would be most likely limited to a particular folder or folders, so that a folder you hadn't opened should be OK.
Comment 40•19 years ago
|
||
Yes, I confirm. The stalled network is limited to some imap folders only.
I just encountered this situation again while the application has still been running from yesterday.
I restored the broken network connection after the night, the new mail check automatically started. Multiple folders had new messages. I could click some and got a message display, but the hourglass remained, not being able to click messages. In another folder I didn't get an updated message list display. I was able to open new folders and click, read and delete messages.
Then I attached with the debugger, produced the attached stack listing, and detached from the application, keeping it running. To my surprise, after another while, all folders have now recovered and work correctly. The application is still running and I'm posting this bugzilla comment with it.
Comment 41•19 years ago
|
||
(In reply to comment #32)
>Thx, Neil, forgot the base/src part of the patch. Since the account manager
>only has an nsIImapIncomingServer, the interface change is still required
Ah, that explains things, thanks!
Updated•19 years ago
|
Attachment #212865 -
Flags: superreview?(mscott) → superreview+
Comment 42•19 years ago
|
||
I'm running 1.8 branch from yesterday, with the latest patch from this bug (and my patch from bug 111384).
I just run into a cleanupOnExit related hang, stack attached.
Assignee | ||
Comment 43•19 years ago
|
||
yep, I'm seeing that too, which is why I haven't got this stuff checked in yet :-( I think the url to cleanup the inbox is failing but the account manager code never gets notified that the url finished/failed.
Comment 44•19 years ago
|
||
There seems to be a general wake-up problem somewhere.
I just had a situation when I saw two new mail messages, which were junk, but not yet marked as such. I clicked them, but they didn't load, mail stalled, although the application was still responsive.
Then I sent a mail message. This suddenly released the stalled situation, my junk messages got downloaded, marked as junked, and moved to the junk folder.
Comment 45•19 years ago
|
||
I've been getting a lot of hangs in Thunderbird lately, and the symptoms sound similar to this bug, except that my UI is totally unresponsive once it hangs (can't bring backgrounded windows to the front or anything when it happens). I'm on OS X.
gdb backtraces attached.
Assignee | ||
Comment 46•18 years ago
|
||
fixed on trunk and branch (or at least, it should be much better). My worry now is about the hang on exit...
Status: ASSIGNED → RESOLVED
Closed: 19 years ago → 18 years ago
Keywords: fixed1.8.1
Resolution: --- → FIXED
Updated•16 years ago
|
Product: Core → MailNews Core
You need to log in
before you can comment on or make changes to this bug.
Description
•