Closed Bug 312775 (imapdeadlock) Opened 15 years ago Closed 15 years ago

deadlock between the ui thread and the imap code is causing a hang

Categories

(MailNews Core :: Networking: IMAP, defect)

x86
All
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Bienvenu, Assigned: Bienvenu)

References

Details

(Keywords: fixed1.8.1)

Attachments

(13 files, 2 obsolete files)

1.73 KB, patch
mscott
: superreview+
Details | Diff | Splinter Review
2.05 KB, patch
Details | Diff | Splinter Review
15.17 KB, text/plain
Details
4.13 KB, patch
Details | Diff | Splinter Review
15.72 KB, text/plain
Details
1.22 KB, patch
Details | Diff | Splinter Review
21.16 KB, patch
Details | Diff | Splinter Review
16.55 KB, text/plain
Details
767 bytes, patch
Details | Diff | Splinter Review
7.90 KB, patch
mscott
: superreview+
Details | Diff | Splinter Review
14.64 KB, text/plain
Details
14.48 KB, text/plain
Details
7.08 KB, text/plain
Details
Neil is reporting a hang in the imap code when his laptop gets out of range of
his wireless network. The stack traces are roughly as follows:

NTDLL! 77f88f13()
  NTDLL! 77f87f26()
  nsImapIncomingServer::ConnectionTimeOut(nsImapIncomingServer * const
0x01010101, nsIImapProtocol * 0x064dd070) line 671 + 21 bytes
  nsImapIncomingServer::GetImapConnection(nsImapIncomingServer * const
0x01010101, nsIEventQueue * 0x00f07f10, nsIImapUrl * 0x032f0470, nsIImapProtocol
* * 0x0012fd6c) line 738 + 10 bytes
  nsImapIncomingServer::RetryUrl(nsImapIncomingServer * const 0x04bfbebc,
nsIImapUrl * 0x032f0470) line 485 + 40 bytes
  XPTC_InvokeByIndex(nsISupports * 0x04bfbebc, unsigned int 16, unsigned int 1,
nsXPTCVariant * 0x032f40a8) line 102
  EventHandler(PLEvent * 0x07671028) line 562 + 22 bytes
  PL_HandleEvent(PLEvent * 0x07671028) line 689
  PL_ProcessPendingEvents(PLEventQueue * 0x10044af9) line 624
  _md_TimerProc(HWND__ * 0x002a00e0, unsigned int 275, unsigned int 0, unsigned
long 654316727) line 1013 + 6 bytes
  USER32! 77e4158f()
  USER32! 77e3c01a()
  USER32! 77e41e7e()
  nsAppStartup::Run(nsAppStartup * const 0x016f0188) line 207 + 48 bytes
  main1(int 0, char * * 0x00264398, nsISupports * 0x00000000) line 1251
  main(int 1, char * * 0x00264398) line 1736 + 22 bytes
  mainCRTStartup() line 338 + 17 bytes
  KERNEL32! 7c598989()

and the other thread with IMAP in the stack:

NTDLL! 77f88f13()
  NTDLL! 77f87f26()
  nsImapProtocol::TellThreadToDie(nsImapProtocol * const 0x00000000, int 0) line
1039
  nsImapProtocol::CreateNewLineFromSocket(nsImapProtocol * const 0x7ffd5000)
line 4294
  nsImapServerResponseParser::GetNextLineForParser(nsImapServerResponseParser *
const 0x7ffd5000, char * * 0x064dd1fc) line 130 + 12 bytes
  nsIMAPGenericParser::AdvanceToNextLine(nsIMAPGenericParser * const 0x7ffd5000)
line 204
  nsImapServerResponseParser::ParseIMAPServerResponse(nsImapServerResponseParser
* const 0x7ffd5000, const char * 0x00000000, int 1162626121) line 244
  4d4c544e()
[
Attached patch proposed fixSplinter Review
Neil, can you try this patch out? I've removed the acquiring of the server
monitor in nsImapIncomingServer::ConnectionTimeOut - it was only being used to
protect the connection cache, I believe. I switched to using
RemoveConnection(), which grabs the server monitor for a much shorter time.
Attachment #199885 - Flags: superreview?(mscott)
Scott, can you try running with this patch as well? If it fixes Neil's problem,
I'm going to want to land this on the branch as well (Neil says the problem
occurs on the branch as well)
Status: NEW → ASSIGNED
will do.
I'm not actually sure how long a dropout I need to trigger this bug (my biff
check interval is one minute), but I had one overnight and one while I was away
from the PC each for at least two minutes, with no ill effects with this patch.
Great, thx, Neil.  Scott, can you sr this so I can get it into the trunk -
should I just check it into the branch as well, at the same time?
Attachment #199885 - Flags: superreview?(mscott)
Attachment #199885 - Flags: superreview+
Attachment #199885 - Flags: approval1.8rc1+
fixed on trunk.
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Sorry, my patched build hung this morning, same stacks :-(
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Neil, can you try this patch as well? The previous patch was checked in. With
this patch on top of that one, I would at least expect the stack trace to
change...

the change is to only set the m_lastActiveTime from the UI thread, so we don't
need a monitor, and thus can't get blocked. But I suspect we might see a stack
trace with CanHandleUrl on it instead.
Comment on attachment 200459 [details] [diff] [review]
additional patch to try

Sorry for the delay, but my PC crashed 20 times in the last month, so it's only now that I'm sure that this attachment fixes the given stack. I know this because I've now got a new hang ;-) Stack trace is:
NTDLL! 77f88f13()
NTDLL! 77f87f26()
nsImapIncomingServer::GetImapConnection(nsImapIncomingServer * const 0x042482a8, nsIEventQueue * 0x00ea4f10, nsIImapUrl * 0x041b48d8, nsIImapProtocol * * 0x0012fd6c) line 744 + 50 bytes
nsImapIncomingServer::RetryUrl(nsImapIncomingServer * const 0x02f2236c, nsIImapUrl * 0x041b48d8) line 485 + 40 bytes
XPTC_InvokeByIndex(nsISupports * 0x02f2236c, unsigned int 16, unsigned int 1, nsXPTCVariant * 0x04221100) line 102
EventHandler(PLEvent * 0x032640c8) line 562 + 22 bytes
PL_HandleEvent(PLEvent * 0x032640c8) line 689
PL_ProcessPendingEvents(PLEventQueue * 0x10044af9) line 624
_md_TimerProc(HWND__ * 0x00010088, unsigned int 275, unsigned int 0, unsigned long 28605342) line 1013 + 6 bytes
USER32! 77e4158f()
USER32! 77e3c01a()
USER32! 77e41e7e()
nsAppStartup::Run(nsAppStartup * const 0x01a7c378) line 207 + 48 bytes
main1(int 0, char * * 0x00262638, nsISupports * 0x00000000) line 1251
main(int 3, char * * 0x00262638) line 1736 + 22 bytes
mainCRTStartup() line 338 + 17 bytes
KERNEL32! 7c598989()
The line of code in GetImapConnection is
rv = connection->CanHandleUrl(aImapUrl, &canRunUrlImmediately, &canRunButBusy);

The IMAP thread is still TellThreadToDie() trying to call CloseStreams().
the second patch is on the 1.8.1 branch, as well as on the trunk.
*** Bug 326314 has been marked as a duplicate of this bug. ***
Alias: imapdeadlock
OS: Windows XP → All
Attached file Another backtrace
Just ran into this again and thought I'd post another detailed backtrace.
is it possible that backtrace is missing a thread? I don't see any two threads blocked on the same monitor - with Neil's stacks, I do, so I'll try to work on his case.
nsImapProtocol::TellThreadToDie grabs an AutoMonitor on itself first thing, and holds onto it until it returns. It causes to be called functions that need a monitor on the server (nsImapIncomingServer::RemoveConnection, in particular). Meanwhile, functions that have a monitor on the server call functions that require a monitor on the protocol object. nsImapIncomingServer::GetImapConnection has a monitor on itself, and then iterates over the protocol objects calling ::CanHandleUrl, which grabs a monitor on the protocol objects. Deadly embrace-a-gogo. Lesson: nsCAutoMonitor considered harmful, and we need to limit the scope of the AutoMonitors.
Attached patch trunk patchSplinter Review
Kai, this is a patch against the tip of the trunk that might help - please try it out and let me know...I've run with it for a bit and things seem to work ok, but I was never able to reproduce the problem. There are more changes I can make, but this one is easier and safer than the other changes I have in mind.
> is it possible that backtrace is missing a thread? I don't see any two threads
> blocked on the same monitor - with Neil's stacks, I do, so I'll try to work on
> his case.

Maybe I missed one. Note the stacktrace in the other bug 326314 I filed had the threads blocked on the same monitor.

> Kai, this is a patch against the tip of the trunk that might help - please try
> it out and let me know...

I'm currently working with the 1.8 branch, but your patch applies there, too!
I just rebuilt and will use it over the next days.
Today I ran into the problem only once, so we'll see how soon I'm confident this patch helps.
No, this patch does not fix it.
At least it does not on the 1.8 branch.

David, does it make sense that I try a trunk build with this patch applied?
either the trunk or the 1.8.1 branch - the 1.8 branch is missing the second patch above, so another option would be to try applying that patch as well to the 1.8 branch. 

Whichever way you go, if it still happens with all the patches applied, if you could attach backtraces, I can continue from there. Thx!
I was wrong. I made a mistake, and when I posted my previous comment, my build did not yet contain your patch!

Now I'm sure, I'm running a MOZILLA_1_8_BRANCH build, and it contains all the patches attached to this bug.

I will report once I know more.
Attached file New backtrace
It just happened again, here are new stacks.
This one looks more obvious, 3 threads blocked on same lock.
Yes, the 3 threads are trying to get the lock on the imap incoming server, and nsImapProtocol::CreateNewLineFromSocket() grabs a monitor on the protocol object before calling TellThreadToDie(). Then nsImapIncomingServer::GetImapConnection has a lock on the server and calls nsImapProtocol::CanHandleUrl, which tries to get a lock on the protocol object. I'm not sre if nsImapProtocol::CreateNewLineFromSocket needs to grab the monitor it does, for as long as it does. I'll try to figure that out...
Status: REOPENED → ASSIGNED
Kai, here's a patch to try, on top of the previous ones. It should apply to the 1.8 branch as well. I've tested it briefly w/o a problem, but I don't hit the changed code very often. Let me know how it works, and attach a new stack trace if it hangs, thx!
Yes, your additional patch applies fine.
With your latest patch applied in addition, it ran stable for a while, until I switched my router off, and it hung again.

These are new stacks.
They are not as obvious to me as the previous ones, hopefully it makes sense to you.
I will continue to run and update more stacks.
we're in Idle() on the same protocol object that's trying to do a pseudo interrupt message load. Idle() is blocked wating for data, and hold a monitor for the protocol object while it does so. I'm not sure how to fix that...I'll have to think about it. Idle() is blocked way down in the pipe code waiting for data that's not going to come...it should get killed because the socket is dead but that doesn't always happen.
Kai, here's an additional change to try. It has real potential to re-introduce a hang on shutdown, however - https://bugzilla.mozilla.org/show_bug.cgi?id=246909
It definitely re-introduces bug 246909 so I'm going to need to figure out how not to regress that - but for now, if you don't have empty trash or cleanup inbox on exit set, you probably won't see bug 246909.
> but for now, if you don't have empty trash or cleanup
> inbox on exit set

I use neither, and with all the patches applied, this build has been running very well until now!!!
Attached patch fix that doesn't regress exiting (obsolete) — Splinter Review
Kai, can you try this cumulative fix to make sure things still work for you with it? It shouldn't regress bug 246909 because I made it so we won't try to go idle when shutting down, and I think it should still work for you.

Scott, I had to add to an interface, but it's not one I would expect any extensions to touch with a 10 foot pole.
Attachment #212783 - Flags: superreview?(mscott)
Comment on attachment 212783 [details] [diff] [review]
fix that doesn't regress exiting

>+  attribute boolean shuttingDown;
I don't actually see anyone setting this...

>+    PRBool shuttingDown;
>+    (void) imapServer->GetShuttingDown(&shuttingDown);
>+    if (!shuttingDown)
>+      (void) imapServer->GetUseIdle(&m_useIdle);
>+    else
>+      m_useIdle = PR_FALSE;
Would it be worth moving the shutting down check into GetUseIdle, thus avoiding the interface change?
Attached patch account manager part of change (obsolete) — Splinter Review
Thx, Neil, forgot the base/src part of the patch. Since the account manager only has an nsIImapIncomingServer, the interface change is still required (though I could just have a setter...). I thought about putting it in GetUseIdle, but since I'd still need an interface change, this seemed about the same.
Attachment #212847 - Flags: superreview?(mscott)
Some more feedback about the previous set of patches, prior to 2006-02-22.

I am running two instances of SeaMonkey all the time, two profiles. One work work, one for private stuff. Both profiles have a different IMAP+SSL account on separate servers open all the time.

The work profile did not hang since you added the latest patch on 2006-02-15. However, when my network trouble occurred, it could happen the application ended up in a state, where it still was responsive to user input, but loading data from the network did not work (after restoring the network). Even switching the application to offline and back online did not work. I was required to restart the application, but that worked fine, no hang.

In my private profile I have "cleanup inbox on exit" enabled, and I ran into your predicted hang once, but no more often.

I am now building a new optimized build of the latest 1.8 branch plus both patches from 2006-02-22 applied and will use that as my new primary application.
There was a typo in your patch, I fixed that and produced a combined patch.
Attachment #212783 - Attachment is obsolete: true
Attachment #212847 - Attachment is obsolete: true
Attachment #212865 - Flags: review?(mscott)
Attachment #212783 - Flags: superreview?(mscott)
Attachment #212847 - Flags: superreview?(mscott)
Attachment #212865 - Flags: review?(mscott) → superreview?(mscott)
Thx, Kai, what was the typo?
There was a ; after IMPL_GET_SET and gcc bailed out with a syntax error.
thx, I'm always doing that. Fixed in my tree.

Re networking activity stopping, does that include non-imap networking activity, like http in messages, or rss feeds? If so, I bet one of the necko threads is blocked, like the socket transport thread or the dns lookup thread - backtraces might be interesting.
I think the network hang was limited to imap, and other networking still working. But I might be wrong. If I see it again, I'll test.
if limited to imap, it would be most likely limited to a particular folder or folders, so that a folder you hadn't opened should be OK.
Attached file backtrace 2006-02-24
Yes, I confirm. The stalled network is limited to some imap folders only.
I just encountered this situation again while the application has still been running from yesterday.
I restored the broken network connection after the night, the new mail check automatically started. Multiple folders had new messages. I could click some and got a message display, but the hourglass remained, not being able to click messages. In another folder I didn't get an updated message list display. I was able to open new folders and click, read and delete messages.
Then I attached with the debugger, produced the attached stack listing, and detached from the application, keeping it running. To my surprise, after another while, all folders have now recovered and work correctly. The application is still running and I'm posting this bugzilla comment with it.
(In reply to comment #32)
>Thx, Neil, forgot the base/src part of the patch. Since the account manager
>only has an nsIImapIncomingServer, the interface change is still required
Ah, that explains things, thanks!
Attachment #212865 - Flags: superreview?(mscott) → superreview+
Attached file backtrace 2006-03-14
I'm running 1.8 branch from yesterday, with the latest patch from this bug (and my patch from bug 111384).

I just run into a cleanupOnExit related hang, stack attached.
yep, I'm seeing that too, which is why I haven't got this stuff checked in yet :-( I think the url to cleanup the inbox is failing but the account manager code never gets notified that the url finished/failed.
There seems to be a general wake-up problem somewhere.
I just had a situation when I saw two new mail messages, which were junk, but not yet marked as such. I clicked them, but they didn't load, mail stalled, although the application was still responsive.

Then I sent a mail message. This suddenly released the stalled situation, my junk messages got downloaded, marked as junked, and moved to the junk folder.
Attached file Backtrace 2006-05-18
I've been getting a lot of hangs in Thunderbird lately, and the symptoms sound similar to this bug, except that my UI is totally unresponsive once it hangs (can't bring backgrounded windows to the front or anything when it happens).  I'm on OS X.

gdb backtraces attached.
fixed on trunk and branch (or at least, it should be much better). My worry now is about the hang on exit...
Status: ASSIGNED → RESOLVED
Closed: 15 years ago15 years ago
Keywords: fixed1.8.1
Resolution: --- → FIXED
No longer blocks: 283823
Product: Core → MailNews Core
You need to log in before you can comment on or make changes to this bug.