Closed Bug 79509 Opened 23 years ago Closed 2 years ago

avoid stalling out if ldap C SDK hangs during connect() (LDAP server/service not available/unresponsive causes hang/freeze or causes mozilla to stop responding)

Categories

(MailNews Core :: LDAP Integration, defect)

defect
Not set
critical

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: dmosedale, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: hang, perf, Whiteboard: problems found while testing)

Attachments

(2 files, 11 obsolete files)

6.58 KB, patch
Details | Diff | Splinter Review
1.55 KB, patch
Details | Diff | Splinter Review
There is an "aync reconnect" option in the C SDK which can be set.  We should
use this.  Last time I checked, using it at least doesn't break anything on
Windows and Mac for the success case, but does on Linux.  So we'll probably need
to whack the C SDK a bit for this, and then test to make sure it things work in
both the success case and the case where there is actually a problem during
connect().
Blocks: 17880
Severity: normal → major
Status: NEW → ASSIGNED
Keywords: nsbeta1
Priority: -- → P2
Target Milestone: --- → mozilla0.9.2
reassign to Olga as QA contact
QA Contact: yulian → olgac
Blocks: 82317
No longer blocks: 17880
Target Milestone: mozilla0.9.2 → mozilla0.9.3
Priority: P2 → P3
Marking P2, 0.9.4.
Priority: P3 → P2
Target Milestone: mozilla0.9.3 → mozilla0.9.4
Keywords: nsenterprise
The reason async connects were often causing crashes under linux is that in the
asynchronous case, wait4msg() in the LDAP C SDK depends on the compiler to
initialize automatics variables to 0.  This isn't guaranteed by the C language,
although it apparently is done by the Windows and Mac compilers.  I'm attaching
a one-liner patch for the C SDK problem.  mhein or mcs, can you review?

Another patch to turn async I/O on in the LDAP XPCOM SDK will be forthcoming...
The patch looks fine, so you certainly have module owner approval.

But the code in result.c:wait2msg() that uses the potentially unitialized lr 
variable seems incomplete to me.  The code looks like this:

 else if (ld->ld_options & LDAP_BITOPT_ASYNC) {
    if ( lr
            && lc->lconn_status == LDAP_CONNST_CONNECTING
            && nsldapi_iostatus_is_write_ready( ld,
            lc->lconn_sb ) ) {
         rc = nsldapi_ber_flush( ld, lc->lconn_sb, lr->lr_ber 0, 1 );
         ...
    }
 }

The problem I see is that lr is set earlier in the function ONLY if you request 
a specific message id (i.e., when you call ldap_result() with positive integer 
as the second parameter).  But the code above is trying to flush any pending 
requests to a server for which the TCP connect just completed, so it seems like 
it should loop through all of the outstanding requests that are associated with 
that connection and call nsldapi_ber_flush().  Of course if you never initiate 
more than one (or any) LDAP operations before the connect occurs, you will never 
run into problems because of this.  Should we open a separate bug?
Attached patch patch for C SDK, v2 (obsolete) — Splinter Review
OK, I think this is a reasonable patch.  It attempts to free all requests
associated with a connection when something goes wrong during the flush.  Also
worth nothing is that in my previous fix, this code path actually NEVER got
followed in the LDAP_MSG_ANY case, because lr was always 0.  So
LDAP_CONNST_CONNECTED was only ever set in request.c, never here in result.c.
Whiteboard: have patches; need r=, sr=
Attached patch patch for XPCOM SDK, v1 (obsolete) — Splinter Review
The above patch turns on async connections in the XPCOM SDK.  leif, can you  r=
that one?
With regard to the C SDK v2 patch:

The connection status should be set to "connected" regardless of whether the 
flushes succeed.  I think.

nsldapi_ber_flush() will return -2 if the I/O would block but the proposed 
flush_connection_requests() function treats that as a fatal error.  I think the 
there is a general problem in the C SDK LDAP_OPT_ASYNC code where a request that 
is partially flushed will never be completely flushed (I just do not see where 
that would happen, since nsldapi_ber_flush() is only called:

1) When a connection completes (in result.c; the subject of this patch)

2) When an LDAP operation is first issued (inside 
request.c:nsldapi_send_server_request())

3) When an unbind request is sent.

The logical place to do the necessary nsldapi_ber_flush() calls is inside 
result.c:wait4msg(), since that is where we poll() for the socket state.  It 
seems like the right strategy would be:

a) If a ber_flush() call returns "would block", call 
nsldapi_iostatus_interest_write() for the connection.

b) Check for "write ready" status inside wait4msg() regardless of whether the 
connection state is "connecting" or "connected."  If it is "connecting", do the 
things in the patch.  If it is "connected", just call 
flush_connection_requests(). 
Whiteboard: have patches; need r=, sr= → incorporating r= feedback into patch
Target Milestone: mozilla0.9.4 → mozilla0.9.5
really adding chofmann to the cc and marking nsbranch+
Keywords: nsbranch+
*** Bug 99423 has been marked as a duplicate of this bug. ***
Attachment #45886 - Attachment is obsolete: true
Attachment #45999 - Attachment is obsolete: true
mcs: I don't see anything in the 4.0 SDK with a name resembling
nsldapi_iostatus_interest_write().  Is there another way this is done with the
4.0 code?
Sorry.  The function is called nsldapi_is_write_ready() in the 4.x SDK.
*** Bug 101040 has been marked as a duplicate of this bug. ***
Attached patch patch to C-SDK, v3 (obsolete) — Splinter Review
Attachment #45976 - Attachment is obsolete: true
mcs: OK, here's a new version of the patch that does what you suggest.  I assume
that when you said

> a) If a ber_flush() call returns "would block", call 
> nsldapi_iostatus_interest_write() for the connection.
>
> [...]
>
> Sorry.  The function is called nsldapi_is_write_ready() in the 4.x SDK.

that you really meant to say that I should call nsldapi_mark_select_write() each
time flushing would block, and the patch is written with that assumption.

There is an XXXdmose comment in the code, I assume I don't really need to call
nsldapi_mark_select_read() in the case where we're already connected, but I just
wanted to be sure.

Interestingly, in the case where the LDAP connect is working, I see lots of
"flush_connection_requests would block" messages in the debug log, though they
eventually go away.  The thing that's odd is that LDAP autocomplete worked just
fine even with the previous rev of this patch, which didn't have the flushing,
so given this debug message, I would have expected previous versions of the
patch not to work.  One conceivable explanation for this is that
nsldapi_ber_flush() will get called the next time a request gets sent out, but
in that case it would get called with a different BerElement arg. And the LDAP
autocomplete code only sends its requests to the C SDK one at a time anyway, so
I'm not really convinced that that's what's going on.

Thoughts appreciated....


Depends on: 101252
No longer depends on: 101252
Comment on attachment 50582 [details] [diff] [review]
threadsafety patch to nsLDAPAutoCompleteSession, v1

This patch changes the LDAP autocomplete session to use the threadsafe ISUPPORTS macros, as per the discussion in bug 101252.
Your assumption about nsldapi_mark_select_write() is correct.  Sorry.

I think you do need to call nsldapi_mark_select_read() in the connected case, 
because the code inside nsldapi_send_server_request() does not call same if it 
encounters a "would block" situation.

I am not sure why you see the "flush_connection_requests would block" messages 
and things still work.  Does not quite add up, as you say.  Unfortunately I 
don't have time today to puzzle it out (too many meetings).

Your patch looks fine except for one thing:  I think free_connection_requests() 
needs to save the lr->lr_next pointer into a local variable before it calls 
nsldapi_free_request() (otherwise it will access freed memory the next time 
around the loop when it dereferences lr to grab lr->lr_next, won't it?)  See the 
code in bind.c/ldap_ld_free() for an example.
Pls provide an ETA for this one.
Whiteboard: incorporating r= feedback into patch → incorporating r= feedback into patch,[ETA?]
Attachment #50435 - Attachment is obsolete: true
Attached patch patch to C-SDK, v4 (obsolete) — Splinter Review
The latest C-SDK patch (v4) looks good.  r=mcs
Comment on attachment 50179 [details] [diff] [review]
LDAP XPCOM SDK patch to turn on async connections by default, v2

r=leif
Comment on attachment 50582 [details] [diff] [review]
threadsafety patch to nsLDAPAutoCompleteSession, v1

r=leif
Whiteboard: incorporating r= feedback into patch,[ETA?] → incorporating r= feedback into patch,[ETA 09.26]
Whiteboard: incorporating r= feedback into patch,[ETA 09.26] → problems found while testing; at-risk
Reassign QA contact to yulian@netscape.com
QA Contact: olgac → yulian
Attachment #50807 - Attachment is obsolete: true
Attached patch C SDK changes, v5 (obsolete) — Splinter Review
OK, the latest C-SDK patch makes the following changes:

* turns on the async functionality for Windows

* adds an nsldapi_mark_select_write()call to nsldapi_new_connection().  This
causes the appropriate if() clause in wait4msg() to notice if an error happens
while in the connecting state (eg connection refused).  I don't think it should
case any other ill-effects, but I'd love confirmation of that.  Also, mcs, do I
need to clear out the read and write fdsets ever, or is that taken care of? 
Turns out the exception fdset isn't very interesting, because it's only used for
TCP out-of-band data and some pseudo-TTY junk.
Interestingly, the nsldapi_mark_select_write change seems to only fix the
"connection refused" not being noticed problem on Linux.  On windows, the
problem is still there.
Turns out that on Windows, unlike Unix, I _do_ need to use the exception fdset
to notice a connect error.  New patch in progress...
Depends on: 102270
Too risky for 0.9.5 at this time; pushing out to 0.9.6. Removing nsbranch+ 
nomination, setting back to nsenterprise.
Target Milestone: mozilla0.9.5 → mozilla0.9.6
marking nsenterprise-; will be reevaluated for nsenterprise in future release.

Keywords: nsenterprise-
Blocks: 104166
*** Bug 105112 has been marked as a duplicate of this bug. ***
*** Bug 90563 has been marked as a duplicate of this bug. ***
Attachment #51167 - Attachment is obsolete: true
Attached patch C SDK changes, v6 (obsolete) — Splinter Review
While testing the previous version of the patch, I made
flush_connection_requests return -1 always in order to test that code path, and
I could make it race and crash about every 3rd or 4th time I used LDAP autocomplete.

What appeared to be going on was this: 

The LDAP connection thread would call ldap_result(), which would end up down in
wait4msg() which would see the error in flush_connection_requests, and then
would call nsldapi_free_connection(), which actually decrements the refcount on
the connection, and frees itif appropriate.  wait4msg never actually incremented
the refcount, so I believe the reason that it was non-zero in wait4msg is
because this happened to be the default connection for the session.
Anyway, the connection would get freed from wait4msg, and then when an operation
would be started from another thread, the other thread would increment the
refcount (through a free()d pointer!) attempt to use the connection, hit an
error, and then try and free the connection pointer again.

So I've attached a new version of the patch, which simply doesn't call
nsldapi_free_connection from inside wait4msg when there's an error.  See the
XXXdmose comments in the patch for some details about possible problems.

I'm not particularly convinced this patch is the right fix, but it's one
possibility.  I sort of suspect that a better fix would be to make wait4msg
increment the connection refcount before use.  

Also, should nsldapi_free_connection() check to see if the connection is the
default connection for a session, and if so, set the default connection to
either NULL or another connection?

The real problem I'm having here is that it's not entirely clear from reading
the code what the refcounting model is supposed to be: do only LDAPRequests hold
references to a connection (except for the "never go away property of the
default connection")?  Or can/should other things (eg ldap_result/wait4msg) get
references to a connection?

mcs/mhein: any light you can shed here would be greatly appreciated.
Moving to 0.9.7 since it looks like there's still some work left to do.
Target Milestone: mozilla0.9.6 → mozilla0.9.7
Sorry for the delay in looking at the recent comments from Dan.

lconn_refcnt is supposed to be incremented when a new LDAP operation (request)
is sent on a connection and decremented when the LDAPResult message is received
for a request.  Functions like wait4msg() should not need to increment the
refcnt, at least in theory.  Obviously there is a bug somewhere.

With the previous patch, were there more calls to nsldapi_free_connection() than
to nsldapi_send_server_request()?

Or, is it possible that the call to flush_connection_requests() and related code
in wait4msg() was being executed when no outstanding requests were present?

Also, I can't remember why we didn't just call nsldapi_free_connection() once
for each request we free inside free_connection_requests().
*** Bug 90815 has been marked as a duplicate of this bug. ***
Priority: P2 → P1
Target Milestone: mozilla0.9.7 → mozilla0.9.8
Mass change: pushing out 0.9.8 bugs, as I'm concentrating feature work now.
Target Milestone: mozilla0.9.8 → mozilla0.9.9
Keywords: nsenterprise
Keywords: nsbeta1
Keywords: nsbeta1nsbeta1+
Target Milestone: mozilla0.9.9 → mozilla1.0
*** Bug 126283 has been marked as a duplicate of this bug. ***
ADT needs more info. as to what the impact of this bug will be
Whiteboard: problems found while testing; at-risk → [ADT NEED INFO] problems found while testing; at-risk
Attachment #50179 - Attachment is obsolete: true
Comment on attachment 50582 [details] [diff] [review]
threadsafety patch to nsLDAPAutoCompleteSession, v1

Already checked in.
Attachment #50582 - Attachment is obsolete: true
Comment on attachment 56833 [details] [diff] [review]
C SDK changes, v6

This patch was only relevant to the C-SDK v4.
Attachment #56833 - Attachment is obsolete: true
ADT: the deal is that any LDAP operation (autocomplete, addressbook search,
replication) against a server which is unreachable (eg because a laptop has been
taken outside a firewall) will cause the entire app-suite to freeze for the
length of the TCP timeout (typically long enough that most users will give up
and kill the program).

I've got a patch which fixes this for cleartext connections, but breaks SSL
connections entirely.  I'll attach it as a work-in-progress; worst case it will
be possible to confine this fix to only be used on cleartext connections.  I'll
keep working on the SSL-piece.
*** Bug 130967 has been marked as a duplicate of this bug. ***
clarifying summary ... it's kind of weird but hopefully it will make this bug
easier to find. 
Summary: avoid stalling out if ldap C SDK hangs during connect() → avoid stalling out if ldap C SDK hangs during connect() (LDAP server/service not available/unresponsive causes hang/freeze or causes mozilla to stop responding)
I am starting to see why someone put the LDAP_OPT_ASYNC_CONNECT in
ldap-deprecated.h. But maybe we can fix it. I tried enabling that option with a
simple LDAP client (outside of the Mozilla client) that uses the LDAP C SDK's
libssldap layer. It failed with "timeout" errors. I debugged a bit and found a
few problems. The one I have not yet solved is that libldap needs to know if the
connection really completed or not. I will attach a patch that shows what I mean.

Also, note that you should be able to reduce the connect timeout when you are
NOT using the LDAP_OPT_ASYNC_CONNECT option by making a call like this:

int ms = 10000; /* 10 seconds, in milliseconds */
if ( ldap_set_option( 0, LDAP_X_OPT_CONNECT_TIMEOUT, *ms ) != 0 ) {
      /* error */
}
Attached patch experimental libldap changes (obsolete) — Splinter Review
3 changes:

1) Fix a silly bug in nsldapi_iostatus_poll() that causes it to mistakenly
generate timeout errors.

2) Add a hack that says "the connection did not yet complete." This is a real
hack, because we should not do this unconditionally. I need to figure out a way
to pass back the "not yet connected" status via libldap's (and libprldap's)
CONNECT callback function.

3) Set rc to -2 so we loop waiting for LDAP results after a connect completes.
I vaguely remember Dan fixing this before. Or at least talking about it. I
don't understand at all why we would set rc to LDAP_RES_BIND (I didn't write
this particular code ;-)

With these changes in place, my drop deal simple LDAP over SSL async test
program works.
Woot!  mcs' changes made async work with SSL also.  I've incorporated his
changes into V8 of the patch.  Additionally, I think I'm gonna have to
incorporate the (not-yet-entirely-complete) patches that I was working on for
the LDAP C SDK v4 (ie patch v6 in this bug) before checking in, or this change
will just expose other problems.  That's my next step.
Attachment #79224 - Attachment is obsolete: true
Attachment #79877 - Attachment is obsolete: true
Comment on attachment 79905 [details] [diff] [review]
async patc, v8 (work-in-progress)

Yes, you will need some of the changes from patch v6. Let me know if you need
help sorting out the unknown issues.

As you and I discussed via AIM yesterday, the changes you propose for
nsldapi_new_connection() to always set the state to LDAP_CONNST_CONNECTING
should be safe and are not as much of a hack as I originally thought. But it
would be a little better to put the code inside nsldapi_connect_to_host(); that
way, we can return -2 (connect not yet completed) for async connects that are
processed using the extended I/O callbacks only (because in the case where
there are no is no connect I/O callback, the return code is already set to -2
if appropriate).
*** Bug 50074 has been marked as a duplicate of this bug. ***
I spent yesterday and today wrapping my head around the various codepaths taken
when making LDAP connections.  It turns out that the code in the tree (and in
patch v8 here, I think) never actually returns -2 for EINPROGRESS, even though
there's some code to handle that case and it's documented that way.

The deal is that both the extended connect command and the nsldapi_try_each_host
return a socket file descriptor (which may or may not be used, depending on the
particular call-stack) or -1 on error.  So it's not possible to return -2,
because that would mean not returning the socket fd itself, which is required in
some cases (in the USE_NATIVE_OS_SOCKETS case, as well as the
nsldapi_ext_compat_connect / old IO callbacks case).

The reason that we get away with it in the current patch is that
LDAP_CONNST_CONNECTING is unconditionally set when we're using LDAP_BITOPT_ASYNC.

Presumably a better fix would be to somehow tweak the extended connect and
nsldapi_try_each_host APIs to pass back more information.  Another option might
be to say that LDAP_BITOPT_ASYNC isn't supported for those cases where the
socket fd is actually used, and then change those two functions to return -2 for
EINPROGRESS.

Thoughts?
Hmmm, there may still be a problem with the above explanation, because it
doesn't explain why patch v7 worked in the non-SSL case... if the above
explanation were exactly correct, there's no way LDAP_CONNST_CONNECTING could
ever be set.
Maybe the v7 patch worked by chance (if the connect happens to complete before
and writes or reads are attempted on the socket, the LDAP code will work fine).

But you are right about the -2... I probably broke that when I added the
extended I/O callbacks. I don't think we should bother changing any APIs right
now though. If we assume that the connect may be in progress whenever
LDAP_OPT_ASYNC is set, then I believe everything will work fine. We can clean up
the other things later.
I think gayatri, laurel and yulian were seeing this bug this morning, as
nsdirectory.netscape.com:389 was not responding.

D:\builds\trunk\mozilla>telnet nsdirectory.netscape.com 389
Connecting To nsdirectory.netscape.com...Could not open a connection to host on
port 389 : Connect failed

they were doing ldap autocomplete.  what they were seeing was the UI thread
hang, and then a while later (a long while, in some cases) the message about
"ldap connection failed" would appear in the address field.
Well, I never saw the "ldap connection failed" message but maybe I didn't wait
long enough.  In task manager all the netscape proceses were showing "not
responding". However, it does sound like the same problem.
Comment #34:
Bug 105112 reported that Linux hangs too long to get <LDAP server connection
failed> error back. 

On Win2K, it took about 1.5 minutes to get the error message.
It would be very easy to implement a consistent and short timeout for
autocomplete on all platforms.  See the second half of my comment # 50. IMHO a
timeout of 10 seconds seems more than long enough for autocomplete (of course it
should be configurable).
mcs: thanks for pointing that out; I had totally spaced that part of your
comment.  I've spun off bug 143172 for this in the hopes of getting it into
Mozilla 1.0 since it's such a small fix.
Discussed at mail news bug meeting.  Decided to minus this bug.
Blocks: 122274
Keywords: nsbeta1+nsbeta1-
Target Milestone: mozilla1.0 → mozilla1.2
LDAP hangs for much longer than 10 seconds for me. I've waited HOURS on two
different occasions. Both times it was after typing REALLY LONG emails that I
eventually lost when I had to kill and restart Mozilla. :( I don't know how to
repro it though.
I have been working on various async. I/O improvements inside the LDAP library
code. I just attached a patch (work on progress) to bug 140182 (attachment
111518 [details] [diff] [review]). I have not yet tried the new code inside the Mozilla client.
The fix for bug 140182 has been committed.  It includes the changes in this bug
that are under mozilla/directory/c-sdk (and many more fixes too).
This bug was a showstopper in the project of migrating Outlook users to Mozilla
Mail I was involved.
Is there any particular reason why voting on this bug is disabled?
*** Bug 306963 has been marked as a duplicate of this bug. ***
Does anyone have time to try the nsLDAPProtocolModule.cpp portion of dmose's last patch?  Does it solve this problem?
Assigning bugs that I'm not actively working on back to nobody; use SearchForThis as a search term if you want to delete all related bugmail at once.
Assignee: dmose → nobody
Status: ASSIGNED → NEW
Filter on "Nobody_NScomTLD_20080620"
Assignee: nobody → dmose
QA Contact: yulian → xpcom
Blocks: 373167
Assignee: dmose → nobody
Keywords: perf
Whiteboard: [ADT NEED INFO] problems found while testing; at-risk → problems found while testing
Target Milestone: mozilla1.2 → ---
Anniversary!
This unresolved bug is now more than 10 years old and still causes problems especially in corporate environments.
We should light a candle...
... could someone please at least set the timeout to 300ms and display a warning: e.g. "LDAP timeaout"
I have found the LDAP Swapping extension useful for turning off LDAP when I am not on the corporate network - https://bugzilla.mozilla.org/show_bug.cgi?id=79509. Still would like to see this just quickly timeout though.
I applied the nsLDAPProtocolModule.cpp portion of dmose's last patch as Mark Smith suggests above, and it fixes the hanging issue in OSX 10.9. I've only put it through minimal testing, however. Optimally the LDAP query resultant names are appended to the pre-existing entries in the combobox when the query returns, though I haven't stuck around long enough for it to execute. Hopefully I can test that at a later time.

I'm a remote worker with big corporate ldap, and each name query was taking forever for me (10+ minutes each name lookup when composing). My current workaround is move abook.mab to abook.mab.bak, then ln -s ldap.mab to abook.mab. This makes everything work, however, I'd have to update my offline mab regularly for new hires/people that leave. Optimally, thunderbird would immediately grab from the offline mab for the combo box, then when the query returns both update the combo box and update the mab. I'd be elated just with updated the combo box.

Will add my modified patch, based on dmose's but cut out what seems to have already been patched in.

The bug is open 18 years ago (!) and it is still annoying as of today. Could please someone please fix this?

2019-07-09 20:50:07.152000 UTC - [49228:Main Thread]: D/LDAP nsLDAPOperation::SimpleBind(): called; bindName = '';
2019-07-09 20:51:11.468000 UTC - [49228:Main Thread]: D/LDAP unbinding
2019-07-09 20:51:11.468000 UTC - [49228:Main Thread]: D/LDAP unbound

Move this to a secondary thread, make it async .. something!

It's even weirder to the user eye that it freezes, even if there is local cache / address book matches (and those matches are already visible).

I updated the Product and Component so the patch from comment 77 may get some attention from Thunderbird engineers.

Component: LDAP XPCOM SDK → LDAP Integration
Product: Directory → MailNews Core
Version: other → unspecified
Severity: major → critical
Keywords: hang
Priority: P1 → --

To those of you who were able to reproduce this issue...

Can you create a new profile https://support.mozilla.org/en-US/kb/using-multiple-profiles to see if this reproduces with beta https://archive.mozilla.org/pub/thunderbird/releases/90.0b2/ ?

beta implements new ldap code https://www-stage.thunderbird.net/en-US/thunderbird/90.0beta/releasenotes/

LDAP c-sdk has been removed in bug 1729862.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.