Closed Bug 704984 Opened 13 years ago Closed 12 years ago

Thunderbird Daily/Earlybird (11.0a1 20111122-20111122) Crashes or hangs in LDAP Address Book Search

Categories

(Thunderbird :: Address Book, defect)

11 Branch
defect
Not set
critical

Tracking

(thunderbird11+ fixed)

RESOLVED DUPLICATE of bug 712363
Tracking Status
thunderbird11 + fixed

People

(Reporter: ohghoo8n.mzl.al, Assigned: mayhemer)

References

Details

(Keywords: crash, dogfood, hang, Whiteboard: [has stack trace])

Crash Data

Attachments

(4 files, 3 obsolete files)

When I run Thunderbird Daily 11.0a1 on Windows 7 SP1 32-bit on the latest of 22 and 23 November, I open the Address Book and try to run a query against the university LDAP server (over LDAPS on port 636).  The Address Book is greyed out after further attempts at user interaction. The main Thunderbird window subsequently does the same, and the windows associated with the Thunderbird process are registered as "Not Responding" according to Windows Task Manager.  Generated a debug log with a short script I generated (see for logging parameters: https://gist.github.com/1388881).  Trying to get the build tools in place since I do not see a crash report on any of multiple attempts.
See Also: → 351642
Not sure if it is an indentical situation, but updated the See Also field because this is the closest I can get to the issue I face in the Bug Tracker.  Please update with what I can do for better debugging information and techniques.
Just created a clean profile and configured just the LDAP server.  After first test search, I was still able to recreate the issue. Crashed immediately.
exactly how is your ldap configured in thunderbird?

and just be clear, you are experiencing a hang, not a crash (thunderbird terminates), correct?

(over to ludo, et al)
Severity: normal → critical
Keywords: hang
Can we get a crash ID ? (see http://support.mozillamessaging.com/en-US/kb/mozilla-crash-reporter) or does the crash reporter not show up ?
Keywords: hangcrash
(In reply to Ludovic Hirlimann [:Usul] from comment #4)
> Can we get a crash ID ? (see
> http://support.mozillamessaging.com/en-US/kb/mozilla-crash-reporter) or does
> the crash reporter not show up ?

The crash never shows up.  To be clear, this is not an installed version.  I unzip and run it from a directory with other alpha and beta builds in my user profile.  Regardless, I have manually had to force close the application, or opt to close it once the Windows detects the application hang.

Log Name:      Application
Source:        Application Hang
Date:          11/24/2011 2:03:17 AM
Event ID:      1002
Task Category: (101)
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      COMPNAME.ad-domain.domain.tld
Description:
The program thunderbird.exe version 11.0.0.4344 stopped interacting with Windows and was closed. To see if more information about the problem is available, check the problem history in the Action Center control panel.
 Process ID: c4c
 Start Time: 01ccaa33d4e7a672
 Termination Time: 36
 Application Path: C:\Users\ajs67\QA\thunderbird-10.0a1.en-US.win32\thunderbird\thunderbird.exe
 Report Id: 3ff5b8ca-1627-11e1-8e6e-90f55a0cdfe3

Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Application Hang" />
    <EventID Qualifiers="0">1002</EventID>
    <Level>2</Level>
    <Task>101</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2011-11-23T23:03:17.000000000Z" />
    <EventRecordID>13863</EventRecordID>
    <Channel>Application</Channel>
    <Computer>COMPNAME.ad-domain.domain.tld</Computer>
    <Security />
  </System>
  <EventData>
    <Data>thunderbird.exe</Data>
    <Data>11.0.0.4344</Data>
    <Data>c4c</Data>
    <Data>01ccaa33d4e7a672</Data>
    <Data>36</Data>
    <Data>C:\Users\Username\QA\thunderbird-10.0a1.en-US.win32\thunderbird\thunderbird.exe</Data>
    <Data>3ff5b8ca-1627-11e1-8e6e-90f55a0cdfe3</Data>
    <Binary>55006E006B006E006F0077006E0000000000</Binary>
  </EventData>
</Event>
(In reply to Wayne Mery (:wsmwk) from comment #3)
> exactly how is your ldap configured in thunderbird?

What do you mean?  My client:

Name:           Georgetown Directory
Hostname:       directory.georgetown.edu
Base DN:        dc=georgetown,dc=edu
Port number:    636
Bind DN:        id=username,ou=people,dc=georgetown,dc=edu

[x] Use secure connection (SSL)

Don't return more than 300 results.

Scope:          [-] One Level [x] Subtree
Search filter:  (objectclass=*)
Login method:   Simple

As for the server, I think it is an aging iPlanet SDK implementation.  How much detail do you want?  Just to be clear, the same configuration on the client works against the university server normally.  I can try without LDAPS/LDAP-SSL on the same config for testing if you want.

> and just be clear, you are experiencing a hang, not a crash (thunderbird
> terminates), correct?
> 
> (over to ludo, et al)
(In reply to alharaka from comment #5)
> (In reply to Ludovic Hirlimann [:Usul] from comment #4)
> > Can we get a crash ID ? (see
> > http://support.mozillamessaging.com/en-US/kb/mozilla-crash-reporter) or does
> > the crash reporter not show up ?
> 
> The crash never shows up.  To be clear, this is not an installed version.  I
> unzip and run it from a directory with other alpha and beta builds in my
> user profile.  Regardless, I have manually had to force close the
> application, or opt to close it once the Windows detects the application
> hang.

We need a stack trace to be able to diagnose what is going on. Could you follow the instructions at https://developer.mozilla.org/en/How_to_get_a_stacktrace_with_WinDbg (replace firefox , by thunderbird) and attach the stack trace here ?
(In reply to Ludovic Hirlimann [:Usul] from comment #7)
> (In reply to alharaka from comment #5)
> > (In reply to Ludovic Hirlimann [:Usul] from comment #4)
> > > Can we get a crash ID ? (see
> > > http://support.mozillamessaging.com/en-US/kb/mozilla-crash-reporter) or does
> > > the crash reporter not show up ?
> > 
> > The crash never shows up.  To be clear, this is not an installed version.  I
> > unzip and run it from a directory with other alpha and beta builds in my
> > user profile.  Regardless, I have manually had to force close the
> > application, or opt to close it once the Windows detects the application
> > hang.
> 
> We need a stack trace to be able to diagnose what is going on. Could you
> follow the instructions at
> https://developer.mozilla.org/en/How_to_get_a_stacktrace_with_WinDbg
> (replace firefox , by thunderbird) and attach the stack trace here ?

Working on it now.  Sorry for the delay.
I am having an issue(In reply to alharaka from comment #8)
> (In reply to Ludovic Hirlimann [:Usul] from comment #7)
> > (In reply to alharaka from comment #5)
> > > (In reply to Ludovic Hirlimann [:Usul] from comment #4)
> > > > Can we get a crash ID ? (see
> > > > http://support.mozillamessaging.com/en-US/kb/mozilla-crash-reporter) or does
> > > > the crash reporter not show up ?
> > > 
> > > The crash never shows up.  To be clear, this is not an installed version.  I
> > > unzip and run it from a directory with other alpha and beta builds in my
> > > user profile.  Regardless, I have manually had to force close the
> > > application, or opt to close it once the Windows detects the application
> > > hang.
> > 
> > We need a stack trace to be able to diagnose what is going on. Could you
> > follow the instructions at
> > https://developer.mozilla.org/en/How_to_get_a_stacktrace_with_WinDbg
> > (replace firefox , by thunderbird) and attach the stack trace here ?
> 
> Working on it now.  Sorry for the delay.

I am having an issue.  When I follow the instructions, I get up a "Debugee is running" message and the Thunderbird interface becomes inaccessible after I go back to WinDbg, having tried the LDAP query that causes this issue.  When I try to go back to the Thunderbird interface, either the main window or the address book window, I can not get it again.  I can see thumbnails when I highlight over them, but clicking ot Alt+Tab operations just open the WinDbg interface again.  It is very unusual.

I read the wiki page.  I see this message, which the wiki suggests a workaround for ntdll!LdrpDoDebuggerBreak+0x2c or whatever.

> ModLoad: 736a0000 7373b000   C:\Windows\WinSxS
> \x86_microsoft.vc80.crt_1fc8b3b9a1e18e3b_8.0.50727.6195_none_d09154e044272b9a
> \MSVCR80.dll
> (918.14e4): Break instruction exception - code 80000003 (first chance)
> eax=00000000 ebx=00000000 ecx=001df838 edx=770370b4 esi=fffffffe edi=00000000
> eip=770904f6 esp=001df854 ebp=001df880 iopl=0         nv up ei pl zr na pe nc
> cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246
> ntdll!LdrpDoDebuggerBreak+0x2c:
> 770904f6 cc              int     3

This is before I run the debugger (Go).  This is from initial startup of WinDbg.  When I try the wiki suggestion to set different breakpoints at these addresses.  Then, running the process in WinDbg, it crashes immediately.  It is still hanging while I write this message, and I recreated the LDAP issue about ~15 minutes ago.  Any pointers?
Forced close the process, and uploaded the log I described in https://bugzilla.mozilla.org/show_bug.cgi?id=704984#c9.
Whiteboard: [has stack trace]
Anything else I can do to help you guys with this? Just checking it.
Just repro'd this on Daily on Linux - the following showed up just before the hang in the terminal:

###!!! ASSERTION: nsAbLDAPMessageBase::OnLDAPInit(): failed to perform bind operation: 'Error', file /media/Projects/mozilla/thunderbird/mailnews/addrbook/src/nsAbLDAPListenerBase.cpp, line 305
Status: UNCONFIRMED → NEW
Ever confirmed: true
(In reply to alharaka from comment #12)
> Anything else I can do to help you guys with this? Just checking it.

yes! expanding on mconley's edit, you can determine the regression range, i.e find the first nightly build which crashes. ftp://ftp.mozilla.org/pub/thunderbird/nightly/
Status: NEW → UNCONFIRMED
Ever confirmed: false
grr, wrong button
Status: UNCONFIRMED → NEW
Ever confirmed: true
Mike/alharaka: are you guys sure your password is stored correctly? I could reproduce this with an out-of-date password for my ldap server, but without, I can't.
Based on the investigation I was able to get done before my debugger crashed, the UI thread (thread 0) is holding the UI event loop lock and closing the LDAPS socket (actually happening deep in the destructor for the UI event object). This is trying to grab the NSS library's SSL3 Handshake Lock.

In the mean time, thread 4 is negotiating an SSL connection. It grabs the SSL3 Handshake Lock in ssl3_RestartHandshakeAfterAuthCertificate() and then calls down to nsNSSSocketInfo::GetPreviousCert() which tries to send an event back over to the main thread - this is blocking trying to get the SyncRunnableBase monitor in SyncRunnableBase::DispatchToMainThreadAndWait() - note that it's blocking on trying to get its own monitor *before* dispatching the event to the main thread, not on waiting for the main thread to reply that it has finished running the event. This looks very strange. As far as I can see that Runnable is created right before the call, so nobody else should be able to have grabbed its lock.

I'll try to repro under the debugger again.
OK, I've reproduced this under the debugger and at least one point is wrong in my previous note - the DispatchToMainThreadAndWait() call *is* waiting for the main thread to respond.

So it looks like the deadlock is:

background thread grabs SSL3 handshake lock, dispatches an event to the UI thread, and waits for the UI thread to respond

UI thread calls the destructor on a Runnable it's executing, which blocks grabbing the SSL3 handshake lock

More to come...
Changing platform from Win7 to All, since I'm getting it on Mac OS X. From the original reporter's traces, it looks like he's hitting the Windows "assertions throw into the debugger" trap. On OS X, the assertion failure gets printed but the program keeps executing (and hangs soon after).

Here's the tiny bit of info printed by the ldap:5 debug trace and the assertion I'm seeing:

2011-12-14 20:39:57.075409 UTC - 1886665920[100319260]: nsLDAPOperation::SimpleBind(): called; bindName = 'mail=ireid@mozilla.com,o=com,dc=mozilla'; 
2011-12-14 20:39:57.312133 UTC - 1886665920[100319260]: ###!!! ASSERTION: nsAbLDAPMessageBase::OnLDAPInit(): failed to perform bind operation: 'Error', file /Users/ireid/tbird/comm-central/mailnews/addrbook/src/nsAbLDAPListenerBase.cpp, line 305
2011-12-14 20:39:57.349016 UTC - 1886665920[100319260]: unbinding
OS: Windows 7 → All
(In reply to Irving Reid (:irving) from comment #18)
> OK, I've reproduced this under the debugger and at least one point is wrong
> in my previous note - the DispatchToMainThreadAndWait() call *is* waiting
> for the main thread to respond.
> 
> So it looks like the deadlock is:
> 
> background thread grabs SSL3 handshake lock, dispatches an event to the UI
> thread, and waits for the UI thread to respond
> 
> UI thread calls the destructor on a Runnable it's executing, which blocks
> grabbing the SSL3 handshake lock
> 
> More to come...

Have confirmed in the debugger that the background thread and the UI thread are both working on the *same* SSL socket.

I'm not quite sure yet, but I don't think the UI thread is working on exactly the same Runnable that the background thread posted. True to form, XCode crashed again as I was trying to figure that out...
In my (Windows) build, LDAPS calls through to PSM which calls DispatchToMainThreadAndWait ... on the main thread. Oops.
Is this because of the psm/ssl thread reworking to stop using xpcom proxies? I forget if those patches landed on the trunk. 

I see similar hangs with SSL IMAP on the trunk as well.
Irving seems to be correct about the deadlock.

I think the bug is that the socket is being closed on the main thread. Instead, AFAICT, all sockets must be closed only on the socket transport thread. Honza, Kai, Patrick: please confirm or refute.

Compare nsHTTPConnection::~nsHTTPConnection(). There is no Close() call there.

PSM's NSS socket layer is not thread safe (see bug 427948 for example). AFAICT, this has always been the case, even before the XPCOM-removal patches. Thus, based on my limited knowledge of comm-central-specific code, it seems like there has always been a race condition here.

I have made the assumption in the XPCOM-removal patch that only the socket transport thread is allowed to perform operations on sockets that are managed by the socket transport thread.
(In reply to neil@parkwaycc.co.uk from comment #21)
> In my (Windows) build, LDAPS calls through to PSM which calls
> DispatchToMainThreadAndWait ... on the main thread. Oops.

In my XPCOM removal patches, I was very careful to ensure that DispatchToMainThreadAndWait is never called by any method that is expected/allowed to be called on the main thread. I specifically did not implement any logic like
    
    if (NS_IsMainThread() { 
        runnable->Run();
    } else {
        runnable->DispatchToMainThreadAndWait();
    }

specifically to catch these kinds of bugs.

So, either:
(1) My assumptions about which threads are allowed to do what operations is wrong, or
(2) The LDAPS code that is operating on the file descriptors is doing so on the wrong thread.
The hang I'm seeing is calling nsSocketTransport::IsAlive from the UI thread. Is that not allowed anymore?
Confirming Brian's words: sockets must be completely managed only on the socket thread.  This is result of both XPCOM proxy removal and SSL thread removal patches that we now see this bug in LDAP implementation.



This is how LDAP creates the SSL socket, this happens on the main thread:
 	xul.dll!nsSSLIOLayerAddToSocket(int family=2, const char * host=0x0481deb8, int port=636, const char * proxyHost=0x00000000, int proxyPort=0, PRFileDesc * fd=0x086f6160, nsISupports * * info=0x00296fac, bool forSTARTTLS=true, bool anonymousLoad=false)  Line 3914	C++
 	xul.dll!nsTLSSocketProvider::AddToSocket(int family=2, const char * host=0x0481deb8, int port=636, const char * proxyHost=0x00000000, int proxyPort=0, unsigned int flags=0, PRFileDesc * aSocket=0x086f6160, nsISupports * * securityInfo=0x00296fac)  Line 96 + 0x30 bytes	C++
	xul.dll!nsLDAPSSLConnect(const char * hostlist=0x04819738, int defport=636, int timeout=-1, unsigned long options=0, lextiof_session_private * sessionarg=0x04826138, lextiof_socket_private * * socketargp=0x08c16e0c)  Line 217 + 0x47 bytes	C++
 	nsldap32v60.dll!nsldapi_connect_to_host(ldap * ld=0x08bfa9f8, sockbuf * sb=0x08c16ca8, const char * hostlist=0x04819738, int defport=636, int secure=1, char * * krbinstancep=0x049a8e10)  Line 510 + 0x35 bytes	C
 	nsldap32v60.dll!nsldapi_new_connection(ldap * ld=0x08bfa9f8, ldap_server * * srvlistp=0x00297490, int use_ldsb=1, int connect=1, int bind=0)  Line 651 + 0x33 bytes	C
 	nsldap32v60.dll!nsldapi_open_ldap_defconn(ldap * ld=0x08bfa9f8)  Line 716 + 0x13 bytes	C
 	nsldap32v60.dll!nsldapi_send_server_request(ldap * ld=0x08bfa9f8, berelement * ber=0x08dd2e60, int msgid=1, ldapreq * parentreq=0x00000000, ldap_server * srvlist=0x00000000, ldap_conn * lc=0x00000000, char * bindreqdn=0x07421eb8, int bind=0)  Line 195 + 0x9 bytes	C
 	nsldap32v60.dll!nsldapi_send_initial_request(ldap * ld=0x08bfa9f8, int msgid=1, unsigned long msgtype=96, char * dn=0x07421eb8, berelement * ber=0x08dd2e60)  Line 148 + 0x2b bytes	C
 	nsldap32v60.dll!simple_bind_nolock(ldap * ld=0x08bfa9f8, const char * dn=0x07421eb8, const char * passwd=0x04823f40, int unlock_permitted=1)  Line 153 + 0x17 bytes	C
 	nsldap32v60.dll!ldap_simple_bind(ldap * ld=0x08bfa9f8, const char * dn=0x07421eb8, const char * passwd=0x04823f40)  Line 82 + 0x13 bytes	C
 	xul.dll!nsLDAPOperation::SimpleBind(const nsACString_internal & passwd={...})  Line 322 + 0x30 bytes	C++
 	xul.dll!nsAbLDAPListenerBase::OnLDAPInit(nsILDAPConnection * aConn=0x04a04de8, unsigned int aStatus=0)  Line 302 + 0x35 bytes	C++
 	xul.dll!nsLDAPConnection::OnLookupComplete(nsICancelable * aRequest=0x049a8d94, nsIDNSRecord * aRecord=0x08eb6710, unsigned int aStatus=0)  Line 663	C++


The ldap code is not simple and I really don't understand well how it works and how it even poll sockets.

Adding some folks that might know more.
(In reply to David :Bienvenu from comment #25)
> The hang I'm seeing is calling nsSocketTransport::IsAlive from the UI
> thread. Is that not allowed anymore?

It seems like in this instance, my assumption is wrong. nsISocketTransport's documentation says: "* NOTE: This is a free-threaded interface, meaning that the methods on this interface may be called from any thread."

The implementation of nsSocketTransport::IsAlive calls PR_Recv(fd, &c, 1, PR_MSG_PEEK, 0);

I believe there are two issues:

(1) The SSL code now assumes that PR_Recv is being called on the socket transport thread. I guess this worked previously (before bug 674147) because the I/O operations were never being done on the calling thread, but were instead proxied to the SSL thread.

(2) This PR_Recv() is driving the SSL handshake forward, perhaps in a way that the NPN code does not expect. I see that nsHttpConnection is already aware that IsAlive() drives the handshake forward, based on a comment in nsHttpConnection::IsAlive. However, does nsHttpConnection correctly take into account the fact that IsAlive() could have been called from outside of Necko, on another thread?

It seems like in mozilla-central and comm-central, there are very few callers to IsAlive().

Still investigating...
(In reply to Brian Smith (:bsmith) from comment #27)

> (2) This PR_Recv() is driving the SSL handshake forward, perhaps in a way
> that the NPN code does not expect. I see that nsHttpConnection is already
> aware that IsAlive() drives the handshake forward, based on a comment in
> nsHttpConnection::IsAlive. However, does nsHttpConnection correctly take
> into account the fact that IsAlive() could have been called from outside of
> Necko, on another thread?

I don't see a problem for NPN - it has to be called on the http/spdy socket and nothing outside of necko has access to that. That http socket is allocated by necko I don't see how code outside of necko could operate on it or have access to it. (ldap code acting on a different socket won't impact this socket/npn/psm state).. The comment is referring to an isalive call that necko's nshttpconnectionmgr can make (also on the socket thread).
(In reply to David :Bienvenu from comment #25)
> The hang I'm seeing is calling nsSocketTransport::IsAlive from the UI
> thread. Is that not allowed anymore?

I don't see IsAlive call in https://bugzilla.mozilla.org/attachment.cgi?id=581743.  Does your comment belong to a different bug?

LDAP is not using nsSocketTransport.  It's only using the NSPR socket layer directly.
Interesting.  nsSocketTransport::IsAlive has always been implemented as PR_Read(1, MSG_PEEK).  It's interesting to allow it be called on any thread.  There can easily be IMO race between read on the socket thread and e.g the main thread.

We might start thinking of making IsAlive just a passive check, not active (check state variables only).  IMO it is better to claim the socket is alive and let consumers figure the state out by call of read or write on the socket streams.

Could also make bug 631801 patch simpler.

Looks like there is more then one issue:
- LDAP implementation accesses NSPR sockets on the main thread (fix it by moving to the socket thread, if possible)
- IMAP is using socket transport and accesses some methods on the main (non-networking) thread (fix it by making the socket transport implementation thread safe)
(In reply to Honza Bambas (:mayhemer) from comment #31)
> Interesting.  nsSocketTransport::IsAlive has always been implemented as
> PR_Read(1, MSG_PEEK).  It's interesting to allow it be called on any thread.
> There can easily be IMO race between read on the socket thread and e.g the
> main thread.
>
> We might start thinking of making IsAlive just a passive check, not active
> (check state variables only).  IMO it is better to claim the socket is alive
> and let consumers figure the state out by call of read or write on the
> socket streams.

I filed bug 711786 for this.

> - LDAP implementation accesses NSPR sockets on the main thread (fix it by
> moving to the socket thread, if possible)

That is this bug.

> - IMAP is using socket transport and accesses some methods on the main
> (non-networking) thread (fix it by making the socket transport
> implementation thread safe)

IMO, it is better to change this IMAP code to work on the socket transport thread. I filed I filed bug 711787 for this. But, it may also be possible to fix it by making IsAlive thread safe.
Wen we ensure that nsLDAPConnection::Close() (or at least ldap_unbind) is always called on the socket transport thread? At least, we would have to change the LDAP code so that it always closes the LDAP connection before the destructor is called, so that the Close() in nsLDAPConnection::~nsLDAPConnection() can be removed.

That is the approach I would like to try first. However, I do not know how much other code is expecting to be able to close connections off the socket transport thread, or what other I/O-related threading issues there may be in Mail/News.
Attached patch attampt 1 (obsolete) — Splinter Review
This is an attempt to fix the immediate crash with close (try: http://build.mozillamessaging.com/tinderboxpushlog/?tree=ThunderbirdTry&rev=79bf9e6f566c)

Here https://ftp.mozilla.org/pub/mozilla.org/thunderbird/try-builds/honzab.moz@firemni.cz-79bf9e6f566c/ will soon be modified Thunderbird Daily build for try.

Anyone able to reproduce the issue, please install the build for your platform from the above URL (if not working, try later, it will work!) and try to reproduce the crash.
Assignee: nobody → honzab.moz
Status: NEW → ASSIGNED
So far, tried the Windows build and was able to reproduce the crash.

https://ftp.mozilla.org/pub/mozilla.org/thunderbird/try-builds/honzab.moz@firemni.cz-79bf9e6f566c/try-comm-central-win32/thunderbird-11.0a1.en-US.win32.zip

What was it you wanted exactly?  It is a little unclear to me from your message, sorry.
Honza:

So I'm having a few issues with this build on Windows 7:

1)  On TB open, if I immediately open a compose window, and enter fragments of addresses into the To field, I get "LDAP Server Search problem" displayed in an autocomplete dropdown.  Clicking on that item reveals:

"Error code 0x80070057: Unknown error

Please contact your System Administrator."

2)  I'm able to open the address book, and do a search on an LDAP address book, and have it return results.  Doing so seems to clear the issue in (1).

No hangs though, which is good.

-Mike
(In reply to Honza Bambas (:mayhemer) from comment #31)

> We might start thinking of making IsAlive just a passive check, not active
> (check state variables only). 

checking for a fin queued in the kernel (without having to worry about buffering any goodput) is the major reason for IsAlive() usage, so that would severely harm its utility imo.
(In reply to alharaka from comment #35)
> What was it you wanted exactly?  It is a little unclear to me from your
> message, sorry.

Thanks for checking on this.  I forgot to add that if you experience another crash, to please report it and add a link here to that report.  It might a very different issue, or the patch just moves the bug to a different place of the code.

(In reply to Mike Conley (:mconley) from comment #36)
> Honza:
> 
> So I'm having a few issues with this build on Windows 7:
> 
> 1)  On TB open, if I immediately open a compose window, and enter fragments
> of addresses into the To field, I get "LDAP Server Search problem" displayed
> in an autocomplete dropdown.  Clicking on that item reveals:
> 
> "Error code 0x80070057: Unknown error
> 
> Please contact your System Administrator."

As I've no test environment unfortunately, I cannot tell you what that is :(  Are you 100% sure this is caused by this patch?  Doesn't the current Daily build do the same?

> 
> 2)  I'm able to open the address book, and do a search on an LDAP address
> book, and have it return results.  Doing so seems to clear the issue in (1).
> 
> No hangs though, which is good.

This fix is about preventing a crash, not a hang.  Hang we know about is in IMAP, bug 711787.
With the patch, I no longer see a deadlock on OS X, but I'm still seeing an assertion failing just before the deadlock would have happened. I'm not 100% sure, but the observer service assertion might be new as well.

2011-12-19 16:25:46.013227 UTC - 1886665920[100319260]: nsLDAPOperation::SimpleBind(): called; bindName = 'mail=ireid@mozilla.com,o=com,dc=mozilla'; 
2011-12-19 16:25:46.308532 UTC - 1886665920[100319260]: ###!!! ASSERTION: nsAbLDAPMessageBase::OnLDAPInit(): failed to perform bind operation: 'Error', file /Users/ireid/tbird/comm-central/mailnews/addrbook/src/nsAbLDAPListenerBase.cpp, line 305
2011-12-19 16:25:46.344750 UTC - 411570176[1003194a0]: ###!!! ASSERTION: Using observer service off the main thread!: 'Error', file /Users/ireid/tbird/comm-central/mozilla/xpcom/ds/nsObserverService.cpp, line 143
2011-12-19 16:25:46.344771 UTC - 411570176[1003194a0]: unbinding
2011-12-19 16:25:46.345153 UTC - 411570176[1003194a0]: unbound
(In reply to Irving Reid (:irving) from comment #39)
> With the patch, I no longer see a deadlock on OS X, but I'm still seeing an
> assertion failing just before the deadlock would have happened. I'm not 100%
> sure, but the observer service assertion might be new as well.

Irving, so with an unmodified Daily build, you hang or you crash?

The "Using observer service off the main thread!" is new, caused by my patch, I have to fix that too.
(In reply to Honza Bambas (:mayhemer) from comment #40)
> (In reply to Irving Reid (:irving) from comment #39)
> > With the patch, I no longer see a deadlock on OS X, but I'm still seeing an
> > assertion failing just before the deadlock would have happened. I'm not 100%
> > sure, but the observer service assertion might be new as well.
> 
> Irving, so with an unmodified Daily build, you hang or you crash?
> 
> The "Using observer service off the main thread!" is new, caused by my
> patch, I have to fix that too.

On Mac OS X I hang, see comments 16,17 on this bug. Based on comments 5 and 9, I think the original submitter was also seeing either a hang, or a crash caused by the breakpoint instruction we execute after assertion failures on Windows debug builds.
Attached patch attempt 2 (obsolete) — Splinter Review
So, this is an updated patch, the assertion about observer service should be gone, the assertion about LDAP operation failure should remain (that is expected), deadlocks and crashes should be gone.

Anyone interested (so far thanks for all the help!) please check this try build again (remove the previous one or install this new one over the old one): https://ftp.mozilla.org/pub/mozilla.org/thunderbird/try-builds/honzab.moz@firemni.cz-e281fd419c4a/
Attachment #582731 - Attachment is obsolete: true
Comment on attachment 582878 [details] [diff] [review]
attempt 2

(In reply to Honza Bambas (:mayhemer) from comment #42)
> Created attachment 582878 [details] [diff] [review]
> attempt 2

Pressed enter too soon...

I managed to have an environment reproducing the deadlock (seems 100% reliable STR).

This patch DOESN'T fix it.  So please forget about this try build now...
Attachment #582878 - Attachment is obsolete: true
(In reply to Honza Bambas (:mayhemer) from comment #43)
> This patch DOESN'T fix it.  So please forget about this try build now...

And for Patrick and David: this patch is also a wrong patch file, the fix was actually in LDAP code and was about posting unbind to the socket thread.
Have done a bunch of debugging to try and isolate the OnLDAPInit() assertion failure

2011-12-19 16:25:46.308532 UTC - 1886665920[100319260]: ###!!! ASSERTION: nsAbLDAPMessageBase::OnLDAPInit(): failed to perform bind operation: 'Error', file /Users/ireid/tbird/comm-central/mailnews/addrbook/src/nsAbLDAPListenerBase.cpp, line 305

The underlying failure is happening in the SSL handshake, which is all taking place on the main thread as far as I can tell. Looking back, I guess this is what Honza saw in comment 26.

Do we have any options other than reworking Thunderbird to move the LDAP calls in address autocomplete (if not everywhere else) off the main thread?
So,

I have a patch that makes our PSM runnables more generic and thread independent.  All seems to work well for PSM and Firefox and fixes the deadlock.

However, removing deadlocks revealed another issue: SSL thread removal patch somehow broke support for blocking sockets.  I believe the fix should be fairly simple hopefully, investigating right now.
More patches will follow.

Cause of why we return WOULD_BLOCK even for a blocking socket is a pending cert verification.  LDAP lib cannot handle WOULD_BLOCK.

My original assumption about a simple fix is now gone, this won't be that simple to fix :(

Solutions:
- let the socket be blocked during cert verification (PSM changes)
- or, make LDAP handle blocking sockets (LDAP changes)
Attachment #582975 - Flags: feedback?(bsmith)
LDAP has some support for non-blocking IO.  But it is disabled for windows and mac: [1], [2].

Suggestions:
- unrealistic in one day: allow async api for all platforms
- unrealistic in one day: make cert verification blocking on blocking sockets (would deadlock on OCSP requests)
- move binding to socket thread (sounds realistic but needs good testing)
- simply disable LDAPS and announce LDAPS is broken and will be fixed ASAP



[1] http://bonsai.mozilla.org/cvsview2.cgi?diff_mode=context&whitespace_mode=show&root=/cvsroot&subdir=mozilla/directory/c-sdk/ldap/include&command=DIFF_FRAMESET&file=portable.h&rev2=5.0&rev1=1.2
[2] http://hg.mozilla.org/projects/ldap-sdks/annotate/2da94bd1028c/c-sdk/ldap/include/portable.h#l168
Comment on attachment 582975 [details] [diff] [review]
fixes deadlock but leaves LDAPS non-functional

I would also rename RunOnTargetThread to RanOnMainThread to be more distinctive.
Attachment #582975 - Attachment description: the PSM runnable patch → fixes deadlock but leaves LDAPS non-functional
No longer deadlocks on Mac OS X with the latest patch. Address autocomplete over LDAPS still fails. Looking up addresses by searching in the addressbook does return results, but it still logs some assertion failures:

###!!! ASSERTION: nsAbLDAPMessageBase::OnLDAPInit(): failed to perform bind operation: 'Error', file /Users/ireid/tbird/comm-central/mailnews/addrbook/src/nsAbLDAPListenerBase.cpp, line 305
###!!! ASSERTION: nsNSSSocketInfo::GetPreviousCert called on the main thread: 'Error', file /Users/ireid/tbird/comm-central/mozilla/security/manager/ssl/src/nsNSSIOLayer.cpp, line 944
###!!! ASSERTION: nsNSSSocketInfo::GetPreviousCert called on the main thread: 'Error', file /Users/ireid/tbird/comm-central/mozilla/security/manager/ssl/src/nsNSSIOLayer.cpp, line 944
Irving, which of the two try builds did you test with?  From comment 34 or comment 42?  (Comment 42 is not working when connection with the server succeeded, I didn't test the patch properly before submit.)
(In reply to Honza Bambas (:mayhemer) from comment #51)
> Irving, which of the two try builds did you test with?  From comment 34 or
> comment 42?  (Comment 42 is not working when connection with the server
> succeeded, I didn't test the patch properly before submit.)

I was running my own dev build patched with https://bug704984.bugzilla.mozilla.org/attachment.cgi?id=582975 (comment 47)
(In reply to Irving Reid (:irving) from comment #52)
> I was running my own dev build patched with
> https://bug704984.bugzilla.mozilla.org/attachment.cgi?id=582975 (comment 47)

Awesome!  Thanks.

Re:
###!!! ASSERTION: nsNSSSocketInfo::GetPreviousCert called on the main thread: 'Error', file /Users/ireid/tbird/comm-central/mozilla/security/manager/ssl/src/nsNSSIOLayer.cpp, line 944

These are still to be fixed...
Full patch that fixes the issue (deadlock and non-working LDAPS).

- PSMRunnable (SyncRunnableBase) allowed to be dispatched also from the main thread (calls its operation method directly)
- SSLServerCertVerificationJob remembers the thread it has to dispatch the result to (= the thread where the socket is being called on)
- NS_IsMainThread() special handling in PK11PasswordPrompt no longer necessary
- nsNSSSocketInfo::GetPreviousCert can now be called on the main thread too
- when send operation on a blocking socket returns would-block then if we are waiting for cert verify, block by looping the event queue and then retry the send again



The true fix should be made in NSS (not for today): when the socket is set blocking, we must never return would-block since it will be handled as a fatal error by the calling layer.  No idea how to achieve this  This presumption might be wrong for the ssl socket, though..
Attachment #582975 - Attachment is obsolete: true
Attachment #582975 - Flags: feedback?(bsmith)
Attachment #583184 - Flags: review?(bsmith)
Irving, please check this patch and let me know.  All should work now.
(In reply to Honza Bambas (:mayhemer) from comment #55)
> Irving, please check this patch and let me know.  All should work now.

On Mac OS X, attachment 583184 [details] [diff] [review] fixes both address autocomplete in the compose window and search in the address book. No error or warning messages.
No longer blocks: 712363
Depends on: 712363
I think that making SyncRunnableBase::RunOnTargetThread special-case the main thread will not prevent this deadlock, at least in all cases.

The LDAP code manages of queue of LDAP operations, and those operations are executed on the LDAP connections background thread:
http://mxr.mozilla.org/comm-central/source/ldap/xpcom/src/nsLDAPConnection.cpp#366

Some of these LDAP operations perform I/O, right? Assuming so, then there will still be a deadlock where PR_Close is called on the main thread when an LDAP operation is executing on the background thread.

It seems to me that, at least, Close() needs to be modified so that it shuts down the background thread (finishes and/or cancels all LDAP operations) before closing the connection. Currently, the background thread is shut down after the connection is closed (note: ldap_unbind eventually calls PR_Close):
http://mxr.mozilla.org/comm-central/source/ldap/xpcom/src/nsLDAPConnection.cpp#216

Also, isn't there a race between AddPendingOperation and Close? If we Close() and shutdown the background thread, then AddPendingOperation would just re-create it. Maybe it is implicitly assumed that AddPendingOperation and the destructor for nsLDAPConnection will both only be called on the main thread? If so, that would be a good thing to document and check in both (using NS_IsMainThread() and NS_ERROR).

Anyway, it is very important that the main thread waiting for LDAP operations, or the same kind of deadlock will occur, even with Honza's patch.
(In reply to Brian Smith (:bsmith) from comment #58)
> I think that making SyncRunnableBase::RunOnTargetThread special-case the
> main thread will not prevent this deadlock, at least in all cases.

Can you provide a POC ?

> Some of these LDAP operations perform I/O, right? Assuming so, then there
> will still be a deadlock where PR_Close is called on the main thread when an
> LDAP operation is executing on the background thread.

I had a patch for it, but I've later threw it away.  That patch wasn't it self enough to fix.  

Also, only unbind call of the LDAP sdk should be posted to a background thread.

Close is called only from two places:
- destructor
  - called from OnLookupComplete throwing the last ref to |self| on the main thread when sending the initial request fails
  - when the object is released by JS GC
  - I don't think the destructor is called sooner then all background ops finish
- "profile-change-net-teardown", here could potentially be a problem


As I can see now, I didn't enclose the deadlock analyzes:
Without any patches:
- main thread is using a blocking PSM socket to establish an SSL connection trying to send the primary request
- the operation fails because we return WOULD_BLOCK from PSMSend (first flaw)
- main thread throws the connection away, calls close (this happens in case of a failure to send the initial request)
- however, the cert verification background thread finished its work and is posting the SSLServerCertVerificationResult to the network thread, which is here not involved at all (seconds flaw)
- on the network thread, SSL_RestartHandshakeAfterAuthCertificate calls nsNSSSocketInfo::GetPreviousCert that waits for the main thread
=> ssl mutex is held on the socket thread and cannot be entered on the main when closing the connection thread -> deadlock


I've also seen deadlock when main thread was calling ClientAuthDataRunnable::DispatchToMainThreadAndWait on the main thread.


> Anyway, it is very important that the main thread waiting for LDAP
> operations, or the same kind of deadlock will occur, even with Honza's patch.

I don't understand this sentence.


(In reply to Honza Bambas (:mayhemer) from comment #54)
> The true fix should be made in NSS (not for today): when the socket is set
> blocking, we must never return would-block since it will be handled as a
> fatal error by the calling layer.  No idea how to achieve this  This
> presumption might be wrong for the ssl socket, though..

Adding some NSS people to have more detailed statement how NSS has to behave with blocking sockets.
(In reply to Honza Bambas (:mayhemer) from comment #54)
>
> The true fix should be made in NSS (not for today): when the socket is set
> blocking, we must never return would-block since it will be handled as a
> fatal error by the calling layer.

Perhaps we should add a new error code to indicate that PSMSend is taking
a break to allow the calling layer to verify the server certificate?

But we will still need to update the calling layer to know that this new
error code is not a fatal error.
In a long term, we should rather fix LDAP code to work on either a background thread since the beginning (i.e. connect and send the initial request) or use non-blocking sockets.  The former seems simpler.

We also need to introduce tests that involve:
- LDAP secure server
- the server cert has a valid OCSP AIA
- optionally, OCSP server for the LDAP server cert CA validation
Patches failed to apply on the try server, so the build didn't go through. I also get hunks failing when I try to manually apply the patches to my own working copy.
Comment on attachment 583184 [details] [diff] [review]
[ONLY FOR REFERENCE] v1, PSM fix

Canceling review request since we have another patch in bug 712363 that seems to solve this.
Attachment #583184 - Flags: review?(bsmith)
Summary: Thunderbird Daily (11.0a1 20111122-20111122) Crashes in LDAP Address Book Search → Thunderbird Daily (11.0a1 20111122-20111122) Crashes or hangs in LDAP Address Book Search
I hit this once yesterday and once today, using up-to-date Earlybird. I used kill -11 to interrupt the hang and generate a crash report -- the crash reports are:
 bp-700fd806-deee-4bdc-974a-b63f72120106
 bp-e9bcec56-bd1f-4051-abcb-7dec62120105
Mozilla/5.0 (X11; Linux x86_64; rv:11.0a2) Gecko/20120106 Thunderbird/11.0a2
Crash Signature: [@ libpthread-2.13.so@0xebe4]
Keywords: hang
Hardware: x86 → All
Summary: Thunderbird Daily (11.0a1 20111122-20111122) Crashes or hangs in LDAP Address Book Search → Thunderbird Daily/Earlybird (11.0a1 20111122-20111122) Crashes or hangs in LDAP Address Book Search
One unfortunate aspect of this bug that merits mentioning -- at least on my Linux system, the hang happens while Thunderbird has locked-in focus, so it locks up my whole windowing system.  AFAICT, the only way to regain control is by getting to a Ctrl+Alt+F1 TTY or connecting via SSH and running "killall thunderbird-bin".  (For the majority of users who don't know about those options, this is effectively a full-system lockup that requires a hard reboot to fix. :-/ )

(This seems to happen to me every time I type in a "To" field in a compose window now :-/  Just bit me 3 times in a row.  Adding "dogfood" keyword.)
Keywords: dogfood
The (not terribly satisfying) workaround is to go into Preferences / Composition / Addressing and un-check the address autocompletion from directory servers, if you're using any LDAPS address books. Hopefully we'll get a fix landed soon and you'll be able to turn it back on.

Might be worth filing a separate bug for the fact that TB uses locked-in focus while it's doing address autocomplete. Doesn't make a lot of sense that you can't switch to another application at that time.
I opened bug 718131 with this issue, it was not only ldap address book searches.  Daily was hanging even when I was auto-completing from Collected Addresses.
(In reply to Guillermo Huerta [:guillermo] from comment #71)
> I opened bug 718131 with this issue, it was not only ldap address book
> searches.  Daily was hanging even when I was auto-completing from Collected
> Addresses.

The bug didn't specify about the collected addresses. If you've turned LDAP off completely (check in both global prefs & account specific prefs) and it is still hanging, please file a bug on just the collected addresses issue.
attempting to download an offline copy of the address book also hangs the app
Just tested shredded/daily and the problem is no longer there.
Attachment #583184 - Attachment description: v1, PSM fix → [ONLY FOR REFERENCE] v1, PSM fix
Marking fixed.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → DUPLICATE
From a tracking perspective, we shipped this in TB 11 with a relbranch in mozilla-*. All is fine, and the fix is completely in 12 & onwards. Updating flags.
blocking-b2g: --- → 2.0M?
Please don't play with the flags.
blocking-b2g: 2.0M? → ---
I'm still experiencing this bug in TB 45.2.0 on OSX 10.11.5. As soon as I start typing some characters it hangs if it can't reach the LDAP server (on SSL port 636). The only way to fix it is to disable the LDAP server in the settings.
To clarify: it hangs = it must be killed (force quit). Then restarted and LDAP disabled. The LDAP server is the university's and doesn't allow connection from outside the university network. So my TB is happy while it's on the university network, but hangs everywhere else unless I disable LDAP.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: