Closed Bug 588698 Opened 15 years ago Closed 15 years ago

SSL deadlock (seen in Thunderbird)

Categories

(NSS :: Libraries, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED
3.12.8

People

(Reporter: Usul, Assigned: wtc)

References

Details

(Keywords: hang, Whiteboard: [test full duplex SSL mode][tb31wants])

Attachments

(4 files, 1 obsolete file)

Attached file Shark sample
I noticed wanted to go back to Thunderbird (which was a background task), I couldn't because it was beach-balling like hell. David from the shark sample this might be imap related, what do you think ? Then I asked nicely to gdb to give me a stack and I got : (gdb) bt #0 0x941742ae in semaphore_wait_signal_trap () #1 0x9417bd85 in pthread_mutex_lock () #2 0x02793032 in PR_Lock () #3 0x027931a4 in PR_EnterMonitor () #4 0x028028c1 in ssl3_GatherCompleteHandshake () #5 0x0280295b in ssl3_GatherAppDataRecord () #6 0x0280aa51 in ssl_SecureRecv () #7 0x02810506 in ssl_Recv () #8 0x00ab0eb7 in nsSSLThread::requestRecvMsgPeek () #9 0x00ac1145 in PSMRecv () #10 0x0005ccf3 in nsSocketTransport::IsAlive () #11 0x00c9c6b1 in nsImapProtocol::CanHandleUrl () #12 0x00c71958 in nsImapIncomingServer::GetImapConnection () #13 0x00c71ff8 in nsImapIncomingServer::GetImapConnectionAndLoadUrl () #14 0x00cb7e13 in nsImapService::GetImapConnectionAndLoadUrl () #15 0x00cbb66e in nsImapService::SelectFolder () #16 0x00c7aea1 in nsImapMailFolder::UpdateFolderWithListener () #17 0x00c76a83 in nsImapMailFolder::UpdateFolder () #18 0x00c7bd54 in nsImapMailFolder::OnNewIdleMessages () #19 0x0268bb78 in NS_InvokeByIndex_P () #20 0x0267fe6d in nsProxyObjectCallInfo::Run () #21 0x0267a63c in nsThread::ProcessNextEvent () #22 0x02637307 in NS_ProcessPendingEvents_P () #23 0x00203a61 in nsBaseAppShell::NativeEventCallback () #24 0x001cb757 in nsAppShell::ProcessGeckoEvents () #25 0x906c53c5 in CFRunLoopRunSpecific () #26 0x906c5aa8 in CFRunLoopRunInMode () #27 0x968bf2ac in RunCurrentEventLoopInMode () #28 0x968bf0c5 in ReceiveNextEventCommon () #29 0x968bef39 in BlockUntilNextEventMatchingListInMode () #30 0x95d8b6d5 in _DPSNextEvent () #31 0x95d8af88 in -[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] () #32 0x95d83f9f in -[NSApplication run] () #33 0x001cae18 in nsAppShell::Run () #34 0x00a86377 in nsAppStartup::Run () #35 0x00006fd4 in XRE_main () #36 0x00002f50 in main ()
Are there any other stacks holding onto an ssl monitor?
(In reply to comment #1) > Are there any other stacks holding onto an ssl monitor? How would I get that (ain't sure I get what you are asking for ) ?
gdb lets you look at other threads (I assume), and I'd like to see any stacks of those threads that look like they might be holding onto the monitor.
So got the hang again and here are the Threads that had ssl or monitor in them : Thread 3 (process 2012 thread 0x2007): #0 0x95eb12ae in semaphore_wait_signal_trap () #1 0x95eb8d85 in pthread_mutex_lock () #2 0x027ae032 in PR_Lock () #3 0x027ae1a4 in PR_EnterMonitor () #4 0x02824a19 in SSL_DataPending () #5 0x0282a839 in ssl_Poll () #6 0x00ab715f in nsSSLThread::requestPoll () #7 0x00ac73d7 in nsSSLIOLayerPoll () #8 0x027afb80 in PR_Poll () #9 0x0005f4d4 in nsSocketTransportService::Poll () #10 0x000601b2 in nsSocketTransportService::DoPollIteration () #11 0x00060740 in nsSocketTransportService::OnProcessNextEvent () #12 0x026965e1 in nsThread::ProcessNextEvent () #13 0x02653307 in NS_ProcessPendingEvents_P () #14 0x0005fdc8 in nsSocketTransportService::Run () #15 0x0269663c in nsThread::ProcessNextEvent () #16 0x026533aa in NS_ProcessNextEvent_P () #17 0x02696810 in nsThread::ThreadFunc () #18 0x027b3892 in _pt_root () #19 0x95ee2155 in _pthread_start () #20 0x95ee2012 in thread_start () and Thread 1 hread 1 (process 2012 thread 0x20b): #0 0x95eb12ae in semaphore_wait_signal_trap () #1 0x95eb8d85 in pthread_mutex_lock () #2 0x027ae032 in PR_Lock () #3 0x027ae1a4 in PR_EnterMonitor () #4 0x0281d8c1 in ssl3_GatherCompleteHandshake () #5 0x0281d95b in ssl3_GatherAppDataRecord () #6 0x02825a51 in ssl_SecureRecv () #7 0x0282b506 in ssl_Recv () #8 0x00ab7377 in nsSSLThread::requestRecvMsgPeek () #9 0x00ac7655 in PSMRecv () #10 0x0005d773 in nsSocketTransport::IsAlive () #11 0x00ca33a1 in nsImapProtocol::CanHandleUrl () #12 0x00c78628 in nsImapIncomingServer::GetImapConnection () #13 0x00c78cc8 in nsImapIncomingServer::GetImapConnectionAndLoadUrl () #14 0x00cbeb43 in nsImapService::GetImapConnectionAndLoadUrl () #15 0x00cc239e in nsImapService::SelectFolder () #16 0x00c89fc1 in nsImapMailFolder::UpdateFolderWithListener () #17 0x00c7d753 in nsImapMailFolder::UpdateFolder () #18 0x00c82834 in nsImapMailFolder::OnNewIdleMessages () #19 0x026a7b78 in NS_InvokeByIndex_P () #20 0x0269be6d in nsProxyObjectCallInfo::Run () #21 0x0269663c in nsThread::ProcessNextEvent () #22 0x02653307 in NS_ProcessPendingEvents_P () #23 0x00204bf1 in nsBaseAppShell::NativeEventCallback () #24 0x001cc787 in nsAppShell::ProcessGeckoEvents () #25 0x96eab3c5 in CFRunLoopRunSpecific () #26 0x96eabaa8 in CFRunLoopRunInMode () #27 0x905d52ac in RunCurrentEventLoopInMode () #28 0x905d50c5 in ReceiveNextEventCommon () #29 0x905d4f39 in BlockUntilNextEventMatchingListInMode () #30 0x9389c6d5 in _DPSNextEvent () #31 0x9389bf88 in -[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] () #32 0x93894f9f in -[NSApplication run] () #33 0x001cbe48 in nsAppShell::Run () #34 0x00a8caf7 in nsAppStartup::Run () #35 0x00007584 in XRE_main () #36 0x000034b0 in main () So that makes it two threads Anything else you need david ?
I think I need someone who knows about PSM to say if this is a PSM bug or not...If I can't call nsSocketTransport::IsAlive, that's going to break a lot of things.
Can you please say which Thunderbird version you're using? This will tell us which NSPR and NSS versions are used.
Assignee: nobody → nobody
Component: General → Libraries
Product: Thunderbird → NSS
QA Contact: general → libraries
Version: Trunk → trunk
Summary: Hanged while in the background → Thunderbird SSL deadlock while in the background
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.5; rv:2.0b5pre) Gecko/20100824 Shredder/3.2a1pre
If you can reproduce again, it would be interesting whether both stacks try to lock the same or different monitors, and same or different locks. (They'll probably be both the same, but would be good to confirm). I propose, in gdb, use "up" to go to PR_EnterMonitor (up 3), then use the following commands to print some values: print mon print mon->lock print mon->entryCount Do this for both threads, please Also, it would be interesting whether both refer to the same socket, or different sockets. In "thread 3" (or your future equivalent) please go to the frame with SSL_DataPending (use "up") and print ss In "thread 1" (or your future equivalent) please go to the frame with ssl3_GatherCompleteHandshake (use "up") and print ss
(In reply to comment #7) > Mozilla/5.0 (Macintosh; Intel Mac OS X 10.5; rv:2.0b5pre) Gecko/20100824 > Shredder/3.2a1pre Thanks. I believe Shredder 3.2 uses mozilla-central. You're using the latest nightly build, so you're probably using NSS 3.12.8 beta 2.
Version: trunk → 3.12.8
Summary: Thunderbird SSL deadlock while in the background → Speculation: Thunderbird SSL deadlock while in the background
Using the two thread stacks in comment 4, I see it is possible to get into a deadlock due to locking order, in the new SSL false start code. Ludovic, do you know how to set the security.ssl.enable_false_start preference to false in Thunderbird? agl: the two locks involved are RecvBufLock and SSL3HandshakeLock. The new SSL false start code in ssl3_GatherCompleteHandshake may lock them in the wrong order, assuming the lock order in SSL_DataPending is correct.
Assignee: nobody → wtc
Status: NEW → ASSIGNED
I found that the order in which ssl3_GatherCompleteHandshake aquires RecvBufLock and SSL3HandshakeLock is correct, based on inspection of existing code and this comment before ssl3_HandleRecord: http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/security/nss/lib/ssl/ssl3con.c&rev=1.145&mark=8807-8811,8814#8807 So the locking order in SSL_DataPending is wrong: http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/security/nss/lib/ssl/sslsecur.c&rev=1.44&mark=1353,1363,1365#1348 Since SSL_DataPending accesses ss->gs only, it only needs to get RecvBufLock. Looking at CVS history, I believe it gets 1stHandshakeLock and SSL3HandshakeLock because it used to call ssl_CreateSecurityInfo: http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/security/nss/lib/ssl/sslsecur.c&rev=1.15&mark=1199,1208-1209,1211-1212#1194 That ssl_CreateSecurityInfo call was removed in rev. 1.16 of sslsecur.c, leaving behind the ssl_Get1stHandshakeLock and ssl_GetSSL3HandshakeLock calls. The sslSecurityInfo structure is documented to be protected by those two locks: http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/security/nss/lib/ssl/sslimpl.h&rev=1.78&mark=942-943,945#939 So I propose that we simply remove the two unnecessary locks from SSL_DataPending.
Attachment #469243 - Flags: superreview?(nelson)
Attachment #469243 - Flags: review?(rrelyea)
I found the locking order is documented in mozilla/security/nss/lib/ssl/notes.txt, but that file hasn't been updated since year 2000, so it may contain stale info. Adam Langley sent me a patch for discovering the prevalent locking order in the current code base, and fixing some violations. I cleaned it up a bit and omitted some white space changes to make the patch smaller. Adam, please make sure my edits are fine.
Attachment #469273 - Flags: superreview?(nelson)
Attachment #469273 - Flags: review?(agl)
Target Milestone: --- → 3.12.8
(In reply to comment #10) > Using the two thread stacks in comment 4, I see it is possible > to get into a deadlock due to locking order, in the new SSL > false start code. > > Ludovic, do you know how to set the security.ssl.enable_false_start > preference to false in Thunderbird? Yes. Shall I set it ?
Ludovic: we now believe setting the security.ssl.enable_false_start preference to false should not matter, because the new SSL false start code uses the same locking order as the existing code. But please do this experiment anyway. Thanks.
Comment on attachment 469273 [details] [diff] [review] Patch to discover and document current locking order, by Adam Langley I checked that PORT_Assert turns into a no-op in optimised builds, so I hope the additional asserts don't hurt the performance of debug builds too much. I think they are fairly helpful to prevent these issues in the future. Note: I'm sure that I didn't exercise all the code paths when testing these asserts. Having the code explode when the locking order is wrong may well uncover a few more issues in the near future so that's sometime to watch out for.
Attachment #469273 - Flags: review?(agl) → review+
Comment on attachment 469243 [details] [diff] [review] SSL_DataPending only needs to get recvBufLock (checked in) I agree. DataPending was acquiring more locks than necessary.
Attachment #469243 - Flags: superreview?(nelson) → superreview+
Priority: -- → P1
Summary: Speculation: Thunderbird SSL deadlock while in the background → SSL deadlock (seen in Thunderbird)
Comment on attachment 469273 [details] [diff] [review] Patch to discover and document current locking order, by Adam Langley r=nelson. Thanks for updating notes.txt. I think the fixes for this bug may also fix Bug 53479. Please check. Please check and make sure that "full duplex" SSL mode still works with your locking changes. Two threads should be able to simultaneously read and write on the same SSL socket without mutual exclusion due to some (any) locks on that socket. NSS's test programs have command line options to test this capability. Please make sure it doesn't get broken.
Attachment #469273 - Flags: superreview?(nelson) → superreview+
Comment on attachment 469243 [details] [diff] [review] SSL_DataPending only needs to get recvBufLock (checked in) Checked in on the NSS trunk (NSS 3.13) and NSS_3_12_BRANCH (NSS 3.12.8). Checking in sslsecur.c; /cvsroot/mozilla/security/nss/lib/ssl/sslsecur.c,v <-- sslsecur.c new revision: 1.45; previous revision: 1.44 done Checking in sslsecur.c; /cvsroot/mozilla/security/nss/lib/ssl/sslsecur.c,v <-- sslsecur.c new revision: 1.43.2.2; previous revision: 1.43.2.1 done
Attachment #469243 - Attachment description: SSL_DataPending only needs to get recvBufLock → SSL_DataPending only needs to get recvBufLock (checked in)
Attachment #469243 - Flags: review?(rrelyea)
In the weekly NSS conference call this morning, I said I understood why this deadlock became much more likely after the addition of the false start code to ssl3_GatherCompleteHandshake. I made that statement based only on memory. Now that I actually looked at the ssl3_GatherCompleteHandshake code, I realized I misunderstood it. If the socket is idle, the ssl3_GatherData call will return -1 with PR_WOULD_BLOCK_ERROR, and ssl3_GatherCompleteHandshake will return immediately, skipping the new false start code. So I cannot explain how the new false start code can make this dead lock more likely, and I'm looking forward to the results of Ludovic's experiment. Note: my analysis above assumes Thunderbird calls nsSocketTransport::IsAlive only on idle sockets, to see if they're still alive. David, if that's not the case, please let me know.
(In reply to comment #17) > > I think the fixes for this bug may also fix Bug 53479. Please check. Hmm... I think this bug is actually a duplicate of bug 53479!
OS: Mac OS X → Windows 7
OS: Windows 7 → Mac OS X
(In reply to comment #19) > So I cannot explain how the new false start code can make this > dead lock more likely, and I'm looking forward to the results of > Ludovic's experiment. I've set the pref to false this morning. I'm now waiting for the issue to reoccur.
I had to move the !ssl_HaveFooLock(ss) assertions into the "if (!ss->opt.noLocks)" blocks. I fixed the locking order problem in SSL_ResetHandshake in a different way, by getting SSL3HandshakeLock only when we're about to modify ss->sec. Patch checked in on the NSS trunk (NSS 3.13). Checking in notes.txt; /cvsroot/mozilla/security/nss/lib/ssl/notes.txt,v <-- notes.txt new revision: 1.3; previous revision: 1.2 done Checking in ssl3con.c; /cvsroot/mozilla/security/nss/lib/ssl/ssl3con.c,v <-- ssl3con.c new revision: 1.146; previous revision: 1.145 done Checking in sslcon.c; /cvsroot/mozilla/security/nss/lib/ssl/sslcon.c,v <-- sslcon.c new revision: 1.41; previous revision: 1.40 done Checking in sslimpl.h; /cvsroot/mozilla/security/nss/lib/ssl/sslimpl.h,v <-- sslimpl.h new revision: 1.79; previous revision: 1.78 done Checking in sslsecur.c; /cvsroot/mozilla/security/nss/lib/ssl/sslsecur.c,v <-- sslsecur.c new revision: 1.46; previous revision: 1.45 done
Attachment #469273 - Attachment is obsolete: true
The remaining work is to test "full duplex" SSL mode.
OS: Mac OS X → All
Hardware: x86 → All
Whiteboard: [test full duplex SSL mode]
Version: 3.12.8 → 3.0
The lock order assertions may be too strict in some cases because the locks in libSSL are reentrant (PRMonitors). On some DBG (debug build) NSS tinderboxes, tstclnt hits an assertion failure in the ssl_Get1stHandshakeLock(ss) call in SSL_SetURL(), because tstclnt calls SSL_SetURL() in the handshakeCallback, which already holds 1stHandshakeLock, recvBufLock, and SSL3HandshakeLock. Here is an example of the assertion failure. ssl.sh: TLS Server hello response with SNI: Change name on 2d HS ---- selfserv starting at Sat Aug 28 08:12:45 PDT 2010 selfserv -D -p 8111 -d ../server -n attic.red.iplanet.com -B -s \ -e attic.red.iplanet.com-ec -w nss -r -r -r -a attic-sni.red.iplanet.com -i ../tests_pid.22063 & trying to connect to selfserv at Sat Aug 28 08:12:45 PDT 2010 tstclnt -p 8111 -h attic.red.iplanet.com -q \ -d ../client -v < /export/tinderlight/data/attic_32_DBG/mozilla/security/nss/tests/ssl/sslreq.dat tstclnt: connecting to attic.red.iplanet.com:8111 (address=127.0.0.1) kill -0 27869 >/dev/null 2>/dev/null selfserv with PID 27869 found at Sat Aug 28 08:12:45 PDT 2010 selfserv with PID 27869 started at Sat Aug 28 08:12:45 PDT 2010 tstclnt -p 8111 -h attic.red.iplanet.com -f -d ../client -v \ -2 -w nss -n TestUser -a attic-sni.red.iplanet.com -a attic.red.iplanet.com < /export/tinderlight/data/attic_32_DBG/mozilla/security/nss/tests/ssl/sslreq.dat ./all.sh: line 997: 27891 Aborted (core dumped) ${PROFTOOL} ${BINDIR}/tstclnt -p ${PORT} -h ${HOSTADDR} -f ${cparam} ${CLIENT_OPTIONS} -d ${P_R_CLIENTDIR} -v <${REQUEST_FILE} >${TMP}/$HOST.tmp.$$ 2>&1 tstclnt: connecting to attic.red.iplanet.com:8111 (address=127.0.0.1) tstclnt: connect: Operation is still in progress (probably a non-blocking connect). tstclnt: about to call PR_Poll for connect completion! tstclnt: PR_Poll returned 0x02 for socket out_flags. tstclnt: ready... tstclnt: about to call PR_Poll ! tstclnt: PR_Poll returned! tstclnt: PR_Poll returned 0x01 for stdin out_flags. tstclnt: PR_Poll returned 0x00 for socket out_flags. tstclnt: stdin read 18 bytes tstclnt: Writing 18 bytes to server tstclnt: about to call PR_Poll on writable socket ! tstclnt: PR_Poll returned with writable socket ! tstclnt: about to call PR_Poll on writable socket ! tstclnt: PR_Poll returned with writable socket ! Assertion failure: !ssl_HaveRecvBufLock(ss), at sslsecur.c:1301 ssl.sh: #1635: TLS Server hello response with SNI: Change name on 2d HS produced a returncode of 1, expected is 1 - Core file is detected - FAILED trying to kill selfserv with PID 27869 at Sat Aug 28 08:12:46 PDT 2010 kill -USR1 27869 selfserv: 0 cache hits; 1 cache misses, 0 cache not reusable 0 stateless resumes, 0 ticket parse failures selfserv: normal termination selfserv -b -p 8111 2>/dev/null; selfserv with PID 27869 killed at Sat Aug 28 08:12:46 PDT 2010 I will disable the assertion in ssl_Get1stHandshakeLock for now. It seems that we can only assert when getting a lock for the first time, but not when reentering the lock.
Checked in on the NSS trunk (NSS 3.13). Checking in sslimpl.h; /cvsroot/mozilla/security/nss/lib/ssl/sslimpl.h,v <-- sslimpl.h new revision: 1.80; previous revision: 1.79 done
for the record I'm didn't see the issue over the last two days.
We're now seeing assertion failures turning tinderbox orange. It's always the same assertion message: tstclnt: PR_Poll returned with writable socket ! Assertion failure: !ssl_HaveRecvBufLock(ss), at sslsecur.c:1301
I'd like to discuss my recollections about the lock ranking with Wan-Teh and Adam by phone if possible. I remember some things to tell you, but I really don't want to take the time to type them up. I'm too slow a typist.
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Wan-Teh, was anything done to stop the assertion botches reported in comment 28 above? Did they just "go away"? Do they still occur?
I commented out that assertion. The assertion needs to be relaxed to assert !ssl_HaveRecvBufLock(ss) only when we acquire firstHandshakeLock for the first time (as opposed to re-entering it), like this: PORT_Assert(PR_GetMonitorEntryCount((ss)->firstHandshakeLock) != 0 || !ssl_HaveRecvBufLock(ss));
I filed bug 600438 to fix the assertion.
Have only trunk builds picked up this version of NSS? We may want this for Thunderbird 3.1.x if possible, but I'm not sure which version of NSS that uses...
The target milestone is set to 3.12.8, and usually (often? coincidentally?) that means that's the fixed version. But the check-in comments all reference 3.13 Thunderbird 3.1 and 3.0 should be using 3.12.8 already http://mxr.mozilla.org/mozilla1.9.2/source/security/nss/lib/nss/nss.h#69 http://mxr.mozilla.org/mozilla1.9.1/source/security/nss/lib/nss/nss.h#69 However, I still see deadlocks in Tbird 3.1.6 so I think the comments about 3.13 must be correct and my understanding of the target milestone field use by the NSS product incorrect.
thx, Dan. yes, I believe 3.1.6 still has this bug.
Whiteboard: [test full duplex SSL mode] → [test full duplex SSL mode][tb31wants]
NSS 3.12.8 has the bug fix (see comment 18). The NSS 3.13 only checkins are lock-order assertions.
Blocks: 602521
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: