avoid stalling out if ldap C SDK hangs during connect() (LDAP server/service not available/unresponsive causes hang/freeze or causes mozilla to stop responding)
Categories
(MailNews Core :: LDAP Integration, defect)
Tracking
(Not tracked)
People
(Reporter: dmosedale, Unassigned)
References
(Depends on 1 open bug)
Details
(Keywords: hang, perf, Whiteboard: problems found while testing)
Attachments
(2 files, 11 obsolete files)
6.58 KB,
patch
|
Details | Diff | Splinter Review | |
1.55 KB,
patch
|
Details | Diff | Splinter Review |
There is an "aync reconnect" option in the C SDK which can be set. We should use this. Last time I checked, using it at least doesn't break anything on Windows and Mac for the success case, but does on Linux. So we'll probably need to whack the C SDK a bit for this, and then test to make sure it things work in both the success case and the case where there is actually a problem during connect().
Reporter | ||
Updated•24 years ago
|
Updated•24 years ago
|
Reporter | ||
Updated•24 years ago
|
Updated•24 years ago
|
Comment 2•24 years ago
|
||
Marking P2, 0.9.4.
Reporter | ||
Comment 3•23 years ago
|
||
The reason async connects were often causing crashes under linux is that in the asynchronous case, wait4msg() in the LDAP C SDK depends on the compiler to initialize automatics variables to 0. This isn't guaranteed by the C language, although it apparently is done by the Windows and Mac compilers. I'm attaching a one-liner patch for the C SDK problem. mhein or mcs, can you review? Another patch to turn async I/O on in the LDAP XPCOM SDK will be forthcoming...
Reporter | ||
Comment 4•23 years ago
|
||
Comment 5•23 years ago
|
||
The patch looks fine, so you certainly have module owner approval. But the code in result.c:wait2msg() that uses the potentially unitialized lr variable seems incomplete to me. The code looks like this: else if (ld->ld_options & LDAP_BITOPT_ASYNC) { if ( lr && lc->lconn_status == LDAP_CONNST_CONNECTING && nsldapi_iostatus_is_write_ready( ld, lc->lconn_sb ) ) { rc = nsldapi_ber_flush( ld, lc->lconn_sb, lr->lr_ber 0, 1 ); ... } } The problem I see is that lr is set earlier in the function ONLY if you request a specific message id (i.e., when you call ldap_result() with positive integer as the second parameter). But the code above is trying to flush any pending requests to a server for which the TCP connect just completed, so it seems like it should loop through all of the outstanding requests that are associated with that connection and call nsldapi_ber_flush(). Of course if you never initiate more than one (or any) LDAP operations before the connect occurs, you will never run into problems because of this. Should we open a separate bug?
Reporter | ||
Comment 6•23 years ago
|
||
Reporter | ||
Comment 7•23 years ago
|
||
OK, I think this is a reasonable patch. It attempts to free all requests associated with a connection when something goes wrong during the flush. Also worth nothing is that in my previous fix, this code path actually NEVER got followed in the LDAP_MSG_ANY case, because lr was always 0. So LDAP_CONNST_CONNECTED was only ever set in request.c, never here in result.c.
Reporter | ||
Comment 8•23 years ago
|
||
Reporter | ||
Comment 9•23 years ago
|
||
The above patch turns on async connections in the XPCOM SDK. leif, can you r= that one?
Comment 10•23 years ago
|
||
With regard to the C SDK v2 patch: The connection status should be set to "connected" regardless of whether the flushes succeed. I think. nsldapi_ber_flush() will return -2 if the I/O would block but the proposed flush_connection_requests() function treats that as a fatal error. I think the there is a general problem in the C SDK LDAP_OPT_ASYNC code where a request that is partially flushed will never be completely flushed (I just do not see where that would happen, since nsldapi_ber_flush() is only called: 1) When a connection completes (in result.c; the subject of this patch) 2) When an LDAP operation is first issued (inside request.c:nsldapi_send_server_request()) 3) When an unbind request is sent. The logical place to do the necessary nsldapi_ber_flush() calls is inside result.c:wait4msg(), since that is where we poll() for the socket state. It seems like the right strategy would be: a) If a ber_flush() call returns "would block", call nsldapi_iostatus_interest_write() for the connection. b) Check for "write ready" status inside wait4msg() regardless of whether the connection state is "connecting" or "connected." If it is "connecting", do the things in the patch. If it is "connected", just call flush_connection_requests().
Updated•23 years ago
|
Reporter | ||
Updated•23 years ago
|
Reporter | ||
Updated•23 years ago
|
Reporter | ||
Comment 12•23 years ago
|
||
*** Bug 99423 has been marked as a duplicate of this bug. ***
Reporter | ||
Updated•23 years ago
|
Reporter | ||
Updated•23 years ago
|
Reporter | ||
Comment 13•23 years ago
|
||
Reporter | ||
Comment 14•23 years ago
|
||
mcs: I don't see anything in the 4.0 SDK with a name resembling nsldapi_iostatus_interest_write(). Is there another way this is done with the 4.0 code?
Comment 15•23 years ago
|
||
Sorry. The function is called nsldapi_is_write_ready() in the 4.x SDK.
Reporter | ||
Comment 16•23 years ago
|
||
*** Bug 101040 has been marked as a duplicate of this bug. ***
Reporter | ||
Comment 17•23 years ago
|
||
Reporter | ||
Updated•23 years ago
|
Reporter | ||
Comment 18•23 years ago
|
||
mcs: OK, here's a new version of the patch that does what you suggest. I assume that when you said > a) If a ber_flush() call returns "would block", call > nsldapi_iostatus_interest_write() for the connection. > > [...] > > Sorry. The function is called nsldapi_is_write_ready() in the 4.x SDK. that you really meant to say that I should call nsldapi_mark_select_write() each time flushing would block, and the patch is written with that assumption. There is an XXXdmose comment in the code, I assume I don't really need to call nsldapi_mark_select_read() in the case where we're already connected, but I just wanted to be sure. Interestingly, in the case where the LDAP connect is working, I see lots of "flush_connection_requests would block" messages in the debug log, though they eventually go away. The thing that's odd is that LDAP autocomplete worked just fine even with the previous rev of this patch, which didn't have the flushing, so given this debug message, I would have expected previous versions of the patch not to work. One conceivable explanation for this is that nsldapi_ber_flush() will get called the next time a request gets sent out, but in that case it would get called with a different BerElement arg. And the LDAP autocomplete code only sends its requests to the C SDK one at a time anyway, so I'm not really convinced that that's what's going on. Thoughts appreciated....
Reporter | ||
Comment 19•23 years ago
|
||
Reporter | ||
Comment 20•23 years ago
|
||
Comment on attachment 50582 [details] [diff] [review] threadsafety patch to nsLDAPAutoCompleteSession, v1 This patch changes the LDAP autocomplete session to use the threadsafe ISUPPORTS macros, as per the discussion in bug 101252.
Comment 21•23 years ago
|
||
Your assumption about nsldapi_mark_select_write() is correct. Sorry. I think you do need to call nsldapi_mark_select_read() in the connected case, because the code inside nsldapi_send_server_request() does not call same if it encounters a "would block" situation. I am not sure why you see the "flush_connection_requests would block" messages and things still work. Does not quite add up, as you say. Unfortunately I don't have time today to puzzle it out (too many meetings). Your patch looks fine except for one thing: I think free_connection_requests() needs to save the lr->lr_next pointer into a local variable before it calls nsldapi_free_request() (otherwise it will access freed memory the next time around the loop when it dereferences lr to grab lr->lr_next, won't it?) See the code in bind.c/ldap_ld_free() for an example.
Comment 22•23 years ago
|
||
Pls provide an ETA for this one.
Reporter | ||
Updated•23 years ago
|
Reporter | ||
Comment 23•23 years ago
|
||
Comment 24•23 years ago
|
||
The latest C-SDK patch (v4) looks good. r=mcs
Comment 25•23 years ago
|
||
Comment on attachment 50179 [details] [diff] [review] LDAP XPCOM SDK patch to turn on async connections by default, v2 r=leif
Comment 26•23 years ago
|
||
Comment on attachment 50582 [details] [diff] [review] threadsafety patch to nsLDAPAutoCompleteSession, v1 r=leif
Updated•23 years ago
|
Reporter | ||
Updated•23 years ago
|
Reporter | ||
Updated•23 years ago
|
Reporter | ||
Comment 28•23 years ago
|
||
Reporter | ||
Comment 29•23 years ago
|
||
OK, the latest C-SDK patch makes the following changes: * turns on the async functionality for Windows * adds an nsldapi_mark_select_write()call to nsldapi_new_connection(). This causes the appropriate if() clause in wait4msg() to notice if an error happens while in the connecting state (eg connection refused). I don't think it should case any other ill-effects, but I'd love confirmation of that. Also, mcs, do I need to clear out the read and write fdsets ever, or is that taken care of? Turns out the exception fdset isn't very interesting, because it's only used for TCP out-of-band data and some pseudo-TTY junk.
Reporter | ||
Comment 30•23 years ago
|
||
Interestingly, the nsldapi_mark_select_write change seems to only fix the "connection refused" not being noticed problem on Linux. On windows, the problem is still there.
Reporter | ||
Comment 31•23 years ago
|
||
Turns out that on Windows, unlike Unix, I _do_ need to use the exception fdset to notice a connect error. New patch in progress...
Comment 32•23 years ago
|
||
Too risky for 0.9.5 at this time; pushing out to 0.9.6. Removing nsbranch+ nomination, setting back to nsenterprise.
Updated•23 years ago
|
Comment 33•23 years ago
|
||
marking nsenterprise-; will be reevaluated for nsenterprise in future release.
Reporter | ||
Comment 34•23 years ago
|
||
*** Bug 105112 has been marked as a duplicate of this bug. ***
Reporter | ||
Comment 35•23 years ago
|
||
*** Bug 90563 has been marked as a duplicate of this bug. ***
Reporter | ||
Updated•23 years ago
|
Reporter | ||
Comment 36•23 years ago
|
||
Reporter | ||
Comment 37•23 years ago
|
||
While testing the previous version of the patch, I made flush_connection_requests return -1 always in order to test that code path, and I could make it race and crash about every 3rd or 4th time I used LDAP autocomplete. What appeared to be going on was this: The LDAP connection thread would call ldap_result(), which would end up down in wait4msg() which would see the error in flush_connection_requests, and then would call nsldapi_free_connection(), which actually decrements the refcount on the connection, and frees itif appropriate. wait4msg never actually incremented the refcount, so I believe the reason that it was non-zero in wait4msg is because this happened to be the default connection for the session. Anyway, the connection would get freed from wait4msg, and then when an operation would be started from another thread, the other thread would increment the refcount (through a free()d pointer!) attempt to use the connection, hit an error, and then try and free the connection pointer again. So I've attached a new version of the patch, which simply doesn't call nsldapi_free_connection from inside wait4msg when there's an error. See the XXXdmose comments in the patch for some details about possible problems. I'm not particularly convinced this patch is the right fix, but it's one possibility. I sort of suspect that a better fix would be to make wait4msg increment the connection refcount before use. Also, should nsldapi_free_connection() check to see if the connection is the default connection for a session, and if so, set the default connection to either NULL or another connection? The real problem I'm having here is that it's not entirely clear from reading the code what the refcounting model is supposed to be: do only LDAPRequests hold references to a connection (except for the "never go away property of the default connection")? Or can/should other things (eg ldap_result/wait4msg) get references to a connection? mcs/mhein: any light you can shed here would be greatly appreciated.
Comment 38•23 years ago
|
||
Moving to 0.9.7 since it looks like there's still some work left to do.
Comment 39•23 years ago
|
||
Sorry for the delay in looking at the recent comments from Dan. lconn_refcnt is supposed to be incremented when a new LDAP operation (request) is sent on a connection and decremented when the LDAPResult message is received for a request. Functions like wait4msg() should not need to increment the refcnt, at least in theory. Obviously there is a bug somewhere. With the previous patch, were there more calls to nsldapi_free_connection() than to nsldapi_send_server_request()? Or, is it possible that the call to flush_connection_requests() and related code in wait4msg() was being executed when no outstanding requests were present? Also, I can't remember why we didn't just call nsldapi_free_connection() once for each request we free inside free_connection_requests().
Reporter | ||
Comment 40•23 years ago
|
||
*** Bug 90815 has been marked as a duplicate of this bug. ***
Reporter | ||
Updated•23 years ago
|
Reporter | ||
Updated•23 years ago
|
Reporter | ||
Comment 41•23 years ago
|
||
Mass change: pushing out 0.9.8 bugs, as I'm concentrating feature work now.
Updated•23 years ago
|
Updated•23 years ago
|
Reporter | ||
Comment 42•23 years ago
|
||
*** Bug 126283 has been marked as a duplicate of this bug. ***
Comment 43•23 years ago
|
||
ADT needs more info. as to what the impact of this bug will be
Reporter | ||
Updated•23 years ago
|
Reporter | ||
Comment 44•23 years ago
|
||
Comment on attachment 50582 [details] [diff] [review] threadsafety patch to nsLDAPAutoCompleteSession, v1 Already checked in.
Reporter | ||
Comment 45•23 years ago
|
||
Comment on attachment 56833 [details] [diff] [review] C SDK changes, v6 This patch was only relevant to the C-SDK v4.
Reporter | ||
Comment 46•23 years ago
|
||
ADT: the deal is that any LDAP operation (autocomplete, addressbook search, replication) against a server which is unreachable (eg because a laptop has been taken outside a firewall) will cause the entire app-suite to freeze for the length of the TCP timeout (typically long enough that most users will give up and kill the program). I've got a patch which fixes this for cleartext connections, but breaks SSL connections entirely. I'll attach it as a work-in-progress; worst case it will be possible to confine this fix to only be used on cleartext connections. I'll keep working on the SSL-piece.
Reporter | ||
Comment 47•23 years ago
|
||
Comment 48•23 years ago
|
||
*** Bug 130967 has been marked as a duplicate of this bug. ***
Comment 49•23 years ago
|
||
clarifying summary ... it's kind of weird but hopefully it will make this bug easier to find.
Comment 50•23 years ago
|
||
I am starting to see why someone put the LDAP_OPT_ASYNC_CONNECT in ldap-deprecated.h. But maybe we can fix it. I tried enabling that option with a simple LDAP client (outside of the Mozilla client) that uses the LDAP C SDK's libssldap layer. It failed with "timeout" errors. I debugged a bit and found a few problems. The one I have not yet solved is that libldap needs to know if the connection really completed or not. I will attach a patch that shows what I mean. Also, note that you should be able to reduce the connect timeout when you are NOT using the LDAP_OPT_ASYNC_CONNECT option by making a call like this: int ms = 10000; /* 10 seconds, in milliseconds */ if ( ldap_set_option( 0, LDAP_X_OPT_CONNECT_TIMEOUT, *ms ) != 0 ) { /* error */ }
Comment 51•23 years ago
|
||
3 changes: 1) Fix a silly bug in nsldapi_iostatus_poll() that causes it to mistakenly generate timeout errors. 2) Add a hack that says "the connection did not yet complete." This is a real hack, because we should not do this unconditionally. I need to figure out a way to pass back the "not yet connected" status via libldap's (and libprldap's) CONNECT callback function. 3) Set rc to -2 so we loop waiting for LDAP results after a connect completes. I vaguely remember Dan fixing this before. Or at least talking about it. I don't understand at all why we would set rc to LDAP_RES_BIND (I didn't write this particular code ;-) With these changes in place, my drop deal simple LDAP over SSL async test program works.
Reporter | ||
Comment 52•23 years ago
|
||
Woot! mcs' changes made async work with SSL also. I've incorporated his changes into V8 of the patch. Additionally, I think I'm gonna have to incorporate the (not-yet-entirely-complete) patches that I was working on for the LDAP C SDK v4 (ie patch v6 in this bug) before checking in, or this change will just expose other problems. That's my next step.
Comment 53•23 years ago
|
||
Comment on attachment 79905 [details] [diff] [review] async patc, v8 (work-in-progress) Yes, you will need some of the changes from patch v6. Let me know if you need help sorting out the unknown issues. As you and I discussed via AIM yesterday, the changes you propose for nsldapi_new_connection() to always set the state to LDAP_CONNST_CONNECTING should be safe and are not as much of a hack as I originally thought. But it would be a little better to put the code inside nsldapi_connect_to_host(); that way, we can return -2 (connect not yet completed) for async connects that are processed using the extended I/O callbacks only (because in the case where there are no is no connect I/O callback, the return code is already set to -2 if appropriate).
Reporter | ||
Comment 54•23 years ago
|
||
*** Bug 50074 has been marked as a duplicate of this bug. ***
Reporter | ||
Comment 55•23 years ago
|
||
I spent yesterday and today wrapping my head around the various codepaths taken when making LDAP connections. It turns out that the code in the tree (and in patch v8 here, I think) never actually returns -2 for EINPROGRESS, even though there's some code to handle that case and it's documented that way. The deal is that both the extended connect command and the nsldapi_try_each_host return a socket file descriptor (which may or may not be used, depending on the particular call-stack) or -1 on error. So it's not possible to return -2, because that would mean not returning the socket fd itself, which is required in some cases (in the USE_NATIVE_OS_SOCKETS case, as well as the nsldapi_ext_compat_connect / old IO callbacks case). The reason that we get away with it in the current patch is that LDAP_CONNST_CONNECTING is unconditionally set when we're using LDAP_BITOPT_ASYNC. Presumably a better fix would be to somehow tweak the extended connect and nsldapi_try_each_host APIs to pass back more information. Another option might be to say that LDAP_BITOPT_ASYNC isn't supported for those cases where the socket fd is actually used, and then change those two functions to return -2 for EINPROGRESS. Thoughts?
Reporter | ||
Comment 56•23 years ago
|
||
Hmmm, there may still be a problem with the above explanation, because it doesn't explain why patch v7 worked in the non-SSL case... if the above explanation were exactly correct, there's no way LDAP_CONNST_CONNECTING could ever be set.
Comment 57•23 years ago
|
||
Maybe the v7 patch worked by chance (if the connect happens to complete before and writes or reads are attempted on the socket, the LDAP code will work fine). But you are right about the -2... I probably broke that when I added the extended I/O callbacks. I don't think we should bother changing any APIs right now though. If we assume that the connect may be in progress whenever LDAP_OPT_ASYNC is set, then I believe everything will work fine. We can clean up the other things later.
Comment 58•23 years ago
|
||
I think gayatri, laurel and yulian were seeing this bug this morning, as nsdirectory.netscape.com:389 was not responding. D:\builds\trunk\mozilla>telnet nsdirectory.netscape.com 389 Connecting To nsdirectory.netscape.com...Could not open a connection to host on port 389 : Connect failed they were doing ldap autocomplete. what they were seeing was the UI thread hang, and then a while later (a long while, in some cases) the message about "ldap connection failed" would appear in the address field.
Comment 59•23 years ago
|
||
Well, I never saw the "ldap connection failed" message but maybe I didn't wait long enough. In task manager all the netscape proceses were showing "not responding". However, it does sound like the same problem.
Comment 60•23 years ago
|
||
Comment #34: Bug 105112 reported that Linux hangs too long to get <LDAP server connection failed> error back. On Win2K, it took about 1.5 minutes to get the error message.
Comment 61•23 years ago
|
||
It would be very easy to implement a consistent and short timeout for autocomplete on all platforms. See the second half of my comment # 50. IMHO a timeout of 10 seconds seems more than long enough for autocomplete (of course it should be configurable).
Reporter | ||
Comment 62•23 years ago
|
||
mcs: thanks for pointing that out; I had totally spaced that part of your comment. I've spun off bug 143172 for this in the hopes of getting it into Mozilla 1.0 since it's such a small fix.
Comment 63•23 years ago
|
||
Discussed at mail news bug meeting. Decided to minus this bug.
Comment 64•22 years ago
|
||
LDAP hangs for much longer than 10 seconds for me. I've waited HOURS on two different occasions. Both times it was after typing REALLY LONG emails that I eventually lost when I had to kill and restart Mozilla. :( I don't know how to repro it though.
Comment 65•22 years ago
|
||
I have been working on various async. I/O improvements inside the LDAP library code. I just attached a patch (work on progress) to bug 140182 (attachment 111518 [details] [diff] [review]). I have not yet tried the new code inside the Mozilla client.
Comment 66•21 years ago
|
||
The fix for bug 140182 has been committed. It includes the changes in this bug that are under mozilla/directory/c-sdk (and many more fixes too).
Comment 67•21 years ago
|
||
This bug was a showstopper in the project of migrating Outlook users to Mozilla Mail I was involved.
Comment 68•21 years ago
|
||
Is there any particular reason why voting on this bug is disabled?
Comment 69•19 years ago
|
||
*** Bug 306963 has been marked as a duplicate of this bug. ***
Comment 70•18 years ago
|
||
Does anyone have time to try the nsLDAPProtocolModule.cpp portion of dmose's last patch? Does it solve this problem?
Reporter | ||
Comment 71•18 years ago
|
||
Assigning bugs that I'm not actively working on back to nobody; use SearchForThis as a search term if you want to delete all related bugmail at once.
Comment 72•17 years ago
|
||
Filter on "Nobody_NScomTLD_20080620"
Updated•14 years ago
|
Comment 74•13 years ago
|
||
Anniversary! This unresolved bug is now more than 10 years old and still causes problems especially in corporate environments. We should light a candle... ... could someone please at least set the timeout to 300ms and display a warning: e.g. "LDAP timeaout"
Comment 75•12 years ago
|
||
I have found the LDAP Swapping extension useful for turning off LDAP when I am not on the corporate network - https://bugzilla.mozilla.org/show_bug.cgi?id=79509. Still would like to see this just quickly timeout though.
Comment 76•10 years ago
|
||
I applied the nsLDAPProtocolModule.cpp portion of dmose's last patch as Mark Smith suggests above, and it fixes the hanging issue in OSX 10.9. I've only put it through minimal testing, however. Optimally the LDAP query resultant names are appended to the pre-existing entries in the combobox when the query returns, though I haven't stuck around long enough for it to execute. Hopefully I can test that at a later time. I'm a remote worker with big corporate ldap, and each name query was taking forever for me (10+ minutes each name lookup when composing). My current workaround is move abook.mab to abook.mab.bak, then ln -s ldap.mab to abook.mab. This makes everything work, however, I'd have to update my offline mab regularly for new hires/people that leave. Optimally, thunderbird would immediately grab from the offline mab for the combo box, then when the query returns both update the combo box and update the mab. I'd be elated just with updated the combo box. Will add my modified patch, based on dmose's but cut out what seems to have already been patched in.
Comment 77•10 years ago
|
||
Comment 78•5 years ago
|
||
The bug is open 18 years ago (!) and it is still annoying as of today. Could please someone please fix this?
2019-07-09 20:50:07.152000 UTC - [49228:Main Thread]: D/LDAP nsLDAPOperation::SimpleBind(): called; bindName = '';
2019-07-09 20:51:11.468000 UTC - [49228:Main Thread]: D/LDAP unbinding
2019-07-09 20:51:11.468000 UTC - [49228:Main Thread]: D/LDAP unbound
Move this to a secondary thread, make it async .. something!
It's even weirder to the user eye that it freezes, even if there is local cache / address book matches (and those matches are already visible).
Comment 79•5 years ago
•
|
||
I updated the Product and Component so the patch from comment 77 may get some attention from Thunderbird engineers.
Updated•5 years ago
|
Comment 80•3 years ago
|
||
To those of you who were able to reproduce this issue...
Can you create a new profile https://support.mozilla.org/en-US/kb/using-multiple-profiles to see if this reproduces with beta https://archive.mozilla.org/pub/thunderbird/releases/90.0b2/ ?
beta implements new ldap code https://www-stage.thunderbird.net/en-US/thunderbird/90.0beta/releasenotes/
Comment 81•3 years ago
|
||
LDAP c-sdk has been removed in bug 1729862.
Description
•