698882 - Deadlock in nsSocketTransportService [tbird] PR_SetPollableEvent

:Irving Reid (No longer working on Firefox)

Reporter

Description

14 years ago

Caught a difficult-to-reproduce Thunderbird hang in nsSocketTransportService. Here are the backtraces of blocked threads; most of the interesting info is in the second trace (the necko thread).

I was running under Xcode / GDB when this locked up, so I spent a pile of time poking around and trying to understand what was up. Unfortunately after a while Xcode crashed (yay! blunt tools!) so I no longer have access to the state as of the lock up. Next time I'll try to make a core dump...

As far as I could tell, there was an issue between the PollableEvent loopback pipe and the nsSocketTransportService - nsSocketTransportService::OnDispatchedEvent was blocked trying to write to the pipe, which implies that the pipe was full, but the poll performed up the stack in DoPollIteration() did not show the other end of the pipe as readable.

Main UI thread:

#2 0x000000010008b7dd in PR_Lock at /Users/ireid/tbird/comm-central/mozilla/nsprpub/pr/src/pthreads/ptsynch.c:206
#3 0x0000000102df3d16 in mozilla::Mutex::Lock() at /Users/ireid/tbird/objdir-comm-central-permissions/mozilla/xpcom/build/BlockingResourceBase.cpp:261
#4 0x0000000101250c84 in mozilla::MutexAutoLock::MutexAutoLock(mozilla::Mutex&, mozilla::GuardObjectNotifier const&) ()
#5 0x00000001012bea2f in nsSocketTransportService::GetThreadSafely() ()
#6 0x00000001012beb68 in nsSocketTransportService::Dispatch(nsIRunnable*, unsigned int) ()
#7 0x0000000102e3d46e in nsAStreamCopier::PostContinuationEvent_Locked() ()
#8 0x0000000102e3dd65 in nsAStreamCopier::PostContinuationEvent() ()
#9 0x0000000102e3deac in nsAStreamCopier::Start(nsIInputStream*, nsIOutputStream*, nsIEventTarget*, void (*)(void*, unsigned int), void*, unsigned int, bool, bool) ()
#10 0x0000000102e3d19f in NS_AsyncCopy(nsIInputStream*, nsIOutputStream*, nsIEventTarget*, nsAsyncCopyMode, unsigned int, void (*)(void*, unsigned int), void*, bool, bool, nsISupports**) ()
#11 0x00000001012badff in nsSocketTransport::OpenInputStream(unsigned int, unsigned int, unsigned int, nsIInputStream**) ()
#12 0x0000000102a3c9d2 in nsImapProtocol::SetupWithUrl(nsIURI*, nsISupports*) ()
#13 0x0000000102a3d921 in nsImapProtocol::LoadImapUrl(nsIURI*, nsISupports*) ()
#14 0x00000001029e9361 in nsImapIncomingServer::GetImapConnectionAndLoadUrl(nsIEventTarget*, nsIImapUrl*, nsISupports*) ()
#15 0x0000000102a58971 in nsImapService::GetImapConnectionAndLoadUrl(nsIEventTarget*, nsIImapUrl*, nsISupports*, nsIURI**) ()
#16 0x0000000102a667f6 in nsImapService::SelectFolder(nsIEventTarget*, nsIMsgFolder*, nsIUrlListener*, nsIMsgWindow*, nsIURI**) ()
#17 0x0000000102a0ff1a in nsImapMailFolder::UpdateFolderWithListener(nsIMsgWindow*, nsIUrlListener*) ()
#18 0x00000001029ebc25 in nsImapMailFolder::UpdateFolder(nsIMsgWindow*) ()

the "necko" thread

#1 0x0000000100096145 in poll ()
#2 0x000000010008d7b0 in pt_poll_now ()
#3 0x000000010008db6a in pt_Continue ()
#4 0x000000010008e96e in pt_Write ()
#5 0x000000010006b422 in PR_Write ()
#6 0x000000010006ef02 in PR_SetPollableEvent ()
#7 0x00000001012bec62 in nsSocketTransportService::OnDispatchedEvent(nsIThreadInternal*) at /Users/ireid/tbird/comm-central/mozilla/netwerk/base/src/nsSocketTransportService2.cpp:588
#8 0x0000000102e61360 in nsThread::PutEvent(nsIRunnable*) at /Users/ireid/tbird/comm-central/mozilla/xpcom/threads/nsThread.cpp:397
#9 0x0000000102e62830 in nsThread::Dispatch(nsIRunnable*, unsigned int) at /Users/ireid/tbird/comm-central/mozilla/xpcom/threads/nsThread.cpp:435
#10 0x00000001012bebe0 in nsSocketTransportService::Dispatch(nsIRunnable*, unsigned int) at /Users/ireid/tbird/comm-central/mozilla/netwerk/base/src/nsSocketTransportService2.cpp:140
#11 0x0000000102e3d46e in nsAStreamCopier::PostContinuationEvent_Locked() at /Users/ireid/tbird/comm-central/mozilla/xpcom/io/nsStreamUtils.cpp:467
#12 0x0000000102e3dd65 in nsAStreamCopier::PostContinuationEvent() at /Users/ireid/tbird/comm-central/mozilla/xpcom/io/nsStreamUtils.cpp:458
#13 0x0000000102e3dd9b in nsAStreamCopier::OnOutputStreamReady(nsIAsyncOutputStream*) at /Users/ireid/tbird/comm-central/mozilla/xpcom/io/nsStreamUtils.cpp:428
#14 0x00000001012b8525 in nsSocketOutputStream::OnSocketReady(unsigned int) at /Users/ireid/tbird/comm-central/mozilla/netwerk/base/src/nsSocketTransport2.cpp:514
#15 0x00000001012b894d in nsSocketTransport::OnSocketReady(PRFileDesc*, short) at /Users/ireid/tbird/comm-central/mozilla/netwerk/base/src/nsSocketTransport2.cpp:1531
#16 0x00000001012c035c in nsSocketTransportService::DoPollIteration(bool) ()
#17 0x00000001012c068a in nsSocketTransportService::Run() ()
#18 0x0000000102e6188e in nsThread::ProcessNextEvent(bool, bool*) ()
#19 0x0000000102defe26 in NS_ProcessNextEvent_P(nsIThread*, bool) ()
#20 0x0000000102e623c5 in nsThread::ThreadFunc(void*) at /Users/ireid/tbird/comm-central/mozilla/xpcom/threads/nsThread.cpp:272

IMAP thread for host I was copying to

#1 0x00007fff870cb881 in _pthread_cond_wait ()
#2 0x000000010008c094 in PR_WaitCondVar ()
#3 0x000000010008c813 in PR_Wait ()
#4 0x0000000102df3137 in mozilla::ReentrantMonitor::Wait(unsigned int) at /Users/ireid/tbird/objdir-comm-central-permissions/mozilla/xpcom/build/BlockingResourceBase.cpp:346
#5 0x0000000101361e5e in mozilla::ReentrantMonitorAutoEnter::Wait(unsigned int) ()
#6 0x0000000102e3a09f in nsPipeInputStream::Wait() at /Users/ireid/tbird/comm-central/mozilla/xpcom/io/nsPipe3.cpp:653
#7 0x0000000102e3b697 in nsPipeInputStream::ReadSegments(unsigned int (*)(nsIInputStream*, void*, char const*, unsigned int, unsigned int, unsigned int*), void*, unsigned int, unsigned int*) ()
#8 0x0000000102e38f91 in nsPipeInputStream::Read(char*, unsigned int, unsigned int*) ()
#9 0x0000000102779837 in nsMsgLineStreamBuffer::ReadNextLine(nsIInputStream*, unsigned int&, bool&, unsigned int*, bool) ()
#10 0x0000000102a2cb26 in nsImapProtocol::CreateNewLineFromSocket() ()
#11 0x0000000102a3fa2c in nsImapProtocol::EstablishServerConnection() ()
#12 0x0000000102a44a7b in nsImapProtocol::ProcessCurrentURL() ()
#13 0x0000000102a35a06 in nsImapProtocol::ImapThreadMainLoop() ()
#14 0x0000000102a3d189 in nsImapProtocol::Run() ()
#15 0x0000000102e6188e in nsThread::ProcessNextEvent(bool, bool*) ()
#16 0x0000000102defe26 in NS_ProcessNextEvent_P(nsIThread*, bool) ()
#17 0x0000000102e623c5 in nsThread::ThreadFunc(void*) ()
#18 0x000000010009304b in _pt_root at /Users/ireid/tbird/comm-central/mozilla/nsprpub/pr/src/pthreads/ptthread.c:187
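
To make the "pipe was full" diagnosis concrete, here is a minimal sketch of how an unread pipe stops accepting one-byte wakeup writes. It uses a raw POSIX pipe rather than NSPR's pollable event, and the function name CountWritesUntilPipeFull is hypothetical. The write end is made non-blocking so the demo reports the limit instead of hanging; with a blocking write end, the final write would stall the way PR_Write does in the necko trace above.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: counts how many one-byte writes a fresh pipe accepts
// while nobody reads from it. Returns -1 if the pipe cannot be set up.
long CountWritesUntilPipeFull() {
  int fds[2];
  if (pipe(fds) != 0) return -1;

  // Non-blocking, so a full pipe yields EAGAIN instead of blocking forever.
  int flags = fcntl(fds[1], F_GETFL);
  if (flags == -1 || fcntl(fds[1], F_SETFL, flags | O_NONBLOCK) == -1) {
    close(fds[0]);
    close(fds[1]);
    return -1;
  }

  long count = 0;
  const char byte = 1;
  for (;;) {
    ssize_t n = write(fds[1], &byte, 1);
    if (n == 1) { ++count; continue; }
    if (n == -1 && errno == EINTR) continue;  // interrupted, retry
    // The only failure expected here is a full pipe.
    assert(errno == EAGAIN || errno == EWOULDBLOCK);
    break;
  }

  close(fds[0]);
  close(fds[1]);
  return count;
}
```

The returned count is the platform's pipe capacity in bytes and differs between operating systems, so no specific number is assumed here.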

Jason Duell

Comment 1

14 years ago

Thanks a lot for taking the time to investigate this. Is this a hang you've seen frequently, or you just saw once?

:Irving Reid (No longer working on Firefox)

Reporter

Comment 2

14 years ago

(In reply to Jason Duell (:jduell) from comment #1)
> Thanks a lot for taking the time to investigate this.
>
> Is this a hang you've seen frequently, or you just saw once?

I've only caught this exact backtrace once, but :bienvenu has hung Thunderbird reliably in a few other ways; I'm not sure if his hangs have a thread stopped in exactly the same place or not.

Wayne Mery (:wsmwk)

Updated

14 years ago

Blocks: 713253

:Irving Reid (No longer working on Firefox)

Reporter

Comment 3

14 years ago

Sid0 is now hitting this pretty reliably on Windows. Confirmed it's the same problem by looking at stack traces.

OS: Mac OS X → All

Honza Bambas (:mayhemer)

Comment 4

14 years ago

Please check on bug 711787. It (or the cause of it) might be related if this is something new.

:Irving Reid (No longer working on Firefox)

Reporter

Comment 6

13 years ago

Firefox is seeing quite a few crashes in PR_SetPollableEvent(); worth keeping an eye on these in case they are related:

https://bugzilla.mozilla.org/show_bug.cgi?id=709847
https://bugzilla.mozilla.org/show_bug.cgi?id=662330

Also, Bug 649323 appears to be another dupe. It's older; does anyone else have a preference as to which direction we set the duplication?

Updated

13 years ago

Blocks: 672913, 649323

Severity: major → critical

Keywords: hang

Wayne Mery (:wsmwk)

Updated

13 years ago

Blocks: 535070

Honza Bambas (:mayhemer)

Comment 7

13 years ago

We may want to make the pollable event's socket or pipe buffer a bit larger, and in PR_WaitForPollableEvent read more than just 1024 bytes.
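
The "read more than just 1024 bytes" idea can be sketched as a loop that keeps reading until the descriptor is empty, instead of doing a single fixed-size read per wakeup. This is an illustration on a plain POSIX non-blocking pipe, not the NSPR implementation; DrainAll and MakeNonBlockingPipe are hypothetical names.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: creates a pipe with both ends non-blocking.
bool MakeNonBlockingPipe(int fds[2]) {
  if (pipe(fds) != 0) return false;
  for (int i = 0; i < 2; ++i) {
    int flags = fcntl(fds[i], F_GETFL);
    if (flags == -1 || fcntl(fds[i], F_SETFL, flags | O_NONBLOCK) == -1) {
      close(fds[0]);
      close(fds[1]);
      return false;
    }
  }
  return true;
}

// Discards everything currently queued on a non-blocking descriptor.
// Returns the number of bytes discarded, or -1 on an unexpected error.
long DrainAll(int fd) {
  assert(fd >= 0);
  char buf[1024];
  long total = 0;
  for (;;) {
    ssize_t n = read(fd, buf, sizeof buf);
    if (n > 0) { total += n; continue; }  // more may be queued, keep going
    if (n == 0) return total;             // write end closed
    if (errno == EINTR) continue;         // interrupted, retry
    if (errno == EAGAIN || errno == EWOULDBLOCK) return total;  // empty
    return -1;
  }
}
```

Draining fully means one wakeup clears every queued notification, however many bytes piled up since the last pass. It does not explain why so many piled up in the first place.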

:Irving Reid (No longer working on Firefox)

Reporter

Comment 8

13 years ago

There are a number of ways we could make the problem more survivable, but I think they'd just end up covering over a more important underlying bug. As far as I can tell, we're already getting into a situation where there are 16k more notifications posted to the network thread than it has been able to read; since we *should* be reading from the notification socket once every time we wake up from select() I'm concerned that there's a nasty loop happening somewhere below the socket thread's main loop, or a path through the main loop that doesn't get to the place where we handle the notification socket.

Honza Bambas (:mayhemer)

Comment 9

13 years ago

If we are inside the event handling loop, we may not need to call PR_SetPollableEvent at all. If we are inside that loop [1], then we will always keep looping while events are in the queue. When events are detected prior to the call to PR_Poll, we exit immediately from PR_Poll regardless of whether the pollable event has been set or not.

Though, a simpler option would be to prevent duplicate calls to PR_SetPollableEvent between calls to PR_WaitForPollableEvent (this may need one more lock enter near PR_WaitForPollableEvent, but precise atomicity here is not that important).

[1] http://hg.mozilla.org/mozilla-central/annotate/8ea5c983743f/netwerk/base/src/nsSocketTransportService2.cpp#l635
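
A rough sketch of the "prevent duplicate calls" option, assuming an atomic flag is an acceptable way to track whether a wakeup byte is already in flight. It is written against a raw POSIX pipe, not NSPR, and the class name CoalescingEvent is hypothetical. Signal() stands in for PR_SetPollableEvent and Clear() for PR_WaitForPollableEvent.

```cpp
#include <atomic>
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical wakeup primitive that keeps at most one byte in the pipe
// between two calls to Clear().
class CoalescingEvent {
 public:
  CoalescingEvent() {
    int rc = pipe(fds_);
    assert(rc == 0 && "pipe() failed");
    (void)rc;
    for (int fd : fds_) {
      int flags = fcntl(fd, F_GETFL);
      assert(flags != -1);
      rc = fcntl(fd, F_SETFL, flags | O_NONBLOCK);
      assert(rc != -1);
      (void)rc;
    }
  }
  ~CoalescingEvent() {
    close(fds_[0]);
    close(fds_[1]);
  }
  CoalescingEvent(const CoalescingEvent&) = delete;
  CoalescingEvent& operator=(const CoalescingEvent&) = delete;

  // Returns true only when a byte was actually written to the pipe.
  bool Signal() {
    if (pending_.exchange(true)) return false;  // a wakeup is already queued
    const char byte = 1;
    ssize_t n;
    do {
      n = write(fds_[1], &byte, 1);
    } while (n == -1 && errno == EINTR);
    return n == 1;
  }

  // Called by the consumer after it wakes up and before it processes its
  // queue. The flag is reset first; a Signal() racing with the drain may
  // have its byte swallowed, which is harmless as long as the consumer
  // processes all pending events after this call.
  void Clear() {
    pending_.store(false);
    char buf[64];
    for (;;) {
      ssize_t n = read(fds_[0], buf, sizeof buf);
      if (n > 0) continue;
      if (n == -1 && errno == EINTR) continue;
      break;  // empty (EAGAIN) or error
    }
  }

  int read_fd() const { return fds_[0]; }  // descriptor to poll on

 private:
  int fds_[2];
  std::atomic<bool> pending_{false};
};
```

With this shape, any number of Signal() calls between two Clear() calls costs a single byte, so the pipe cannot fill up from repeated notifications alone.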

:Irving Reid (No longer working on Firefox)

Reporter

Comment 10

13 years ago

I agree with both of those points; we don't need to write a byte into the pollable event pipe if we're already in the networking thread, because we'll always process any pending events before we do a blocking poll. We also don't need more than one byte in the pipe for each time around the networking main loop, because we'll process all events no matter how many times PR_SetPollableEvent() is called. However, I don't feel comfortable with fixing the problem in either of these ways until I understand *why* so many bytes are being written into the event pipe during a single pass through the main networking loop. I'm nervous that there is some underlying problem that will be even harder to isolate and fix if we make the pollable event mechanism less sensitive. And if we do find the reason that the pipe is getting clogged, we may not need to make the pollable event more complicated after all.
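
For reference, the first point (no wakeup needed when dispatching from the networking thread itself) might look roughly like the sketch below. The class SketchEventTarget and its members are hypothetical, the wakeup is reduced to a counter standing in for PR_SetPollableEvent, and the real dispatch path in nsSocketTransportService is more involved.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical event target owned by one thread (the "networking" thread).
class SketchEventTarget {
 public:
  explicit SketchEventTarget(std::thread::id owner) : owner_(owner) {}

  void Dispatch(std::function<void()> event) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push_back(std::move(event));
    }
    // The owner thread drains the queue before it blocks in poll, so it
    // only needs a wakeup when the event comes from another thread.
    if (std::this_thread::get_id() != owner_) {
      Wake();
    }
  }

  // Runs every queued event, including ones queued while running.
  // Must be called on the owner thread. Returns the number of events run.
  std::size_t ProcessPending() {
    assert(std::this_thread::get_id() == owner_);
    std::size_t ran = 0;
    for (;;) {
      std::deque<std::function<void()>> batch;
      {
        std::lock_guard<std::mutex> lock(mutex_);
        batch.swap(queue_);
      }
      if (batch.empty()) return ran;
      for (auto& event : batch) {
        event();
        ++ran;
      }
    }
  }

  int wake_count() const { return wake_count_.load(); }

 private:
  void Wake() { ++wake_count_; }  // stand-in for writing to the event pipe

  const std::thread::id owner_;
  std::mutex mutex_;
  std::deque<std::function<void()>> queue_;
  std::atomic<int> wake_count_{0};
};
```

As the comment above cautions, a change like this makes the mechanism less sensitive and could hide whatever is flooding the pipe, so it is shown only to illustrate the idea.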

Wayne Mery (:wsmwk)

Updated

13 years ago

Blocks: 726432

newpollable 9 years ago Patrick McManus [:mcmanus] 16.58 KB, patch dragana : feedback+
mozilla::net::PollableEvent 9 years ago Patrick McManus [:mcmanus] 15.96 KB, patch
mozilla::net::PollableEvent 9 years ago Patrick McManus [:mcmanus] 23.25 KB, patch
mozilla::net::PollableEvent 9 years ago Patrick McManus [:mcmanus] 23.35 KB, patch mayhemer : review+ mayhemer : review+
mozilla::net::PollableEvent 9 years ago Patrick McManus [:mcmanus] 27.42 KB, patch mcmanus : review+
mozilla::net::PollableEvent 9 years ago Patrick McManus [:mcmanus] 29.03 KB, patch
old.pcap 9 years ago Dragana Damjanovic [:dragana] 1007 bytes, application/octet-stream
new.pcap 9 years ago Dragana Damjanovic [:dragana] 27.08 KB, application/octet-stream
mozilla::net::PollableEvent 9 years ago Patrick McManus [:mcmanus] 30.86 KB, patch
mozilla::net::PollableEvent 9 years ago Patrick McManus [:mcmanus] 31.59 KB, patch
mozilla::net::PollableEvent 9 years ago Patrick McManus [:mcmanus] 30.98 KB, patch dragana : review+
bug_698882_aurora.patch 9 years ago Dragana Damjanovic [:dragana] 31.14 KB, patch ritu : approval-mozilla-aurora+
bug_698882_aurora_reverse.patch 9 years ago Dragana Damjanovic [:dragana] 30.98 KB, patch
bug_698882_aurora_reverse.patch 9 years ago Dragana Damjanovic [:dragana] 31.00 KB, patch
bug_698882_aurora.patch 9 years ago Dragana Damjanovic [:dragana] 31.10 KB, patch ritu : approval-mozilla-beta-
bug_698882_aurora_reverse.patch 9 years ago Dragana Damjanovic [:dragana] 30.95 KB, patch