Open Bug 286598 Opened 20 years ago Updated 10 years ago

Access violation on unloading of the 5.08 Netscape Directory SDK for C when using SSL

Categories

(Directory :: LDAP C SDK, defect)

Other
Other
defect
Not set
critical

Tracking

(Not tracked)

People

(Reporter: amanda.bortolin, Assigned: mcs)

Details

Attachments

(3 files)

User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)
Build Identifier:

I'm having an issue on Solaris: when we unload our shared library that uses the Netscape SDK, the Netscape SDK shared libraries are still accessing each other while they are being unloaded. For example, we unload our library, then libprldap50.so is unloaded by the OS, and then we get an access violation. After analyzing the core, we find that the nspr4 lib is trying to access libprldap50 after it has been unloaded. We only seem to see this behaviour when SSL is being used.

Here is a bit more info. I did a pmap before our shared component is unloaded, a pmap of the core when the access violation happened, and a pstack of the core when the access violation happened. Here's what you can deduce from the logs.

pmap_before.txt:
libnspr4.so is loaded at 71630000
libprldap50.so is loaded at 70D60000
libsecLDAP.so is loaded at 6C280000 <- this is our shared lib that uses the Netscape LDAP SDK

pmap_access_violation.txt:
libsecLDAP.so and libprldap50.so have been unloaded.
libnspr4.so is still loaded at 71630000

pstack_access_violation.txt:
70d62930 ???????? (f71f80, 0, 1, 0, 14, 7166fa58)
716570e8 _pt_thread_death (f71f80, 71657064, 71670094, 7f4523ac, 0, 82) + 84
7fb653d0 tsd_exit (0, 0, 7fb15254, 70ee1860, 15a60, 0) + 70
7fb58bdc _thrp_exit (727f0800, 203dc, 0, 0, 14e54, 7fb152f4) + 68
7fb57c4c _t_cancel (0, 7fb7a074, 0, 7fb657b4, 1, 7fb937f8) + 2c
7fb657b4 _lwp_start (0, 0, 0, 0, 0, 0)

It looks like the nspr4 lib is trying to access something where the prldap50 lib USED to be loaded (70d62930 falls in the range where libprldap50.so was mapped at 70D60000).

I assume what is happening is that all the Netscape SDK shared libs are loaded by the OS when our libsecLDAP.so lib is loaded (since we link against them). When our shared lib is unloaded, the OS starts unloading the Netscape SDK libs, but they don't realize they are being unloaded and continue doing things.

I have also seen this on Windows, but it is harder to reproduce there. Remember, this only *seems* to happen with SSL turned on.

Reproducible: Sometimes

Steps to Reproduce:
1. Load a program that uses the Netscape SDK (5.08) and initialize SSL.
2. Unload the program.
3. See access violation on unloading.

Actual Results: access violation

Expected Results: all SDK components unload successfully.
I added Wan-Teh and Rich to the bug CC (Wan-Teh is one of the leads for NSPR and NSS issues and Rich is one of the leads on Netscape->Red Hat Directory Server).

I agree with your analysis, but I am not sure why this is happening. Neither libprldap nor libldap creates any threads. Perhaps you are unloading the libraries before all LDAP-related threads have exited? I think that would cause this kind of crash, because libprldap installs an NSPR thread private data destructor function which gets called when a thread exits. See:

http://lxr.mozilla.org/mozilla/source/directory/c-sdk/ldap/libraries/libprldap/ldappr-threads.c#390

and:

http://lxr.mozilla.org/mozilla/source/directory/c-sdk/ldap/libraries/libprldap/ldappr-threads.c#610

Also make sure you have exactly one ldap_unbind() call for each successful call to ldap_init().
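For readers unfamiliar with the mechanism described above, here is a minimal sketch using plain POSIX threads rather than NSPR. It is an analogy, not the libprldap code: the names demo_key, demo_destructor and run_worker_and_count_destructor_calls are hypothetical. It only illustrates that a thread-specific-data destructor runs when the thread exits, so the code of that destructor must still be mapped at that moment.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical names; this is a POSIX analogue of an NSPR
 * thread-private-data destructor, not the SDK's own code. */
static pthread_key_t demo_key;
static int destructor_calls = 0;   /* read only after pthread_join() */

/* Runs in the exiting thread, at thread exit, if the key's value is non-NULL. */
static void demo_destructor(void *value)
{
    (void)value;
    destructor_calls++;
}

static void *worker(void *arg)
{
    /* Storing a non-NULL value arms the destructor for this thread. */
    pthread_setspecific(demo_key, arg);
    return NULL;
}

/* Returns how many times the destructor ran (-1 on setup failure). */
int run_worker_and_count_destructor_calls(void)
{
    static int payload = 42;
    pthread_t tid;

    destructor_calls = 0;
    if (pthread_key_create(&demo_key, demo_destructor) != 0)
        return -1;
    if (pthread_create(&tid, NULL, worker, &payload) != 0) {
        pthread_key_delete(demo_key);
        return -1;
    }
    /* After the join, the thread has fully exited and its destructor has run. */
    pthread_join(tid, NULL);
    pthread_key_delete(demo_key);
    return destructor_calls;
}
```

If the library containing demo_destructor had been unloaded before the worker exited, the call at thread exit would jump into unmapped memory, which matches the shape of the reported stack (tsd_exit calling _pt_thread_death, then an address with no symbol).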
I have added tracing to my code and have confirmed that we are doing exactly one successful unbind for every successful ldapssl_init before we unload.

I've also narrowed this down: it only happens when SSL is configured with mutual authentication. It does not seem to happen when we are using SSL configured with server authentication only.

Not sure where to go from here... I can't think of anything else to try. Help?

Amanda

(In reply to comment #4)
> I added Wan-Teh and Rich to the bug CC (Wan-Teh is one of the leads for NSPR and
> NSS issues and Rich is one of the leads on Netscape->Red Hat Directory Server).
>
> I agree with your analysis, but I am not sure why this is happening. Neither
> libprldap nor libldap create any threads. Perhaps you are unloading the
> libraries before all LDAP-related threads have exited? I think that would cause
> this kind of crash because libprldap installs a NSPR thread private data
> destructor function which gets called when a thread exits. See:
>
> http://lxr.mozilla.org/mozilla/source/directory/c-sdk/ldap/libraries/libprldap/ldappr-threads.c#390
>
> and:
>
> http://lxr.mozilla.org/mozilla/source/directory/c-sdk/ldap/libraries/libprldap/ldappr-threads.c#610
>
> Also make sure you have exactly one ldap_unbind() call for each successful call
> to ldap_init().
Have you tried calling PR_Cleanup() before unloading the LDAP/NSPR/NSS libraries? See: http://lxr.mozilla.org/seamonkey/source/nsprpub/pr/src/misc/prinit.c#363
No, because I'm not using NSPR directly (I don't link against it or include any of its headers). It looks like it is the LDAP directory libraries that are using NSPR. Shouldn't they call PR_Cleanup() on shutdown?

(In reply to comment #6)
> Have you tried calling PR_Cleanup() before unloading the LDAP/NSPR/NSS
> libraries? See:
> http://lxr.mozilla.org/seamonkey/source/nsprpub/pr/src/misc/prinit.c#363
Yes, in theory libprldap or libssldap could call PR_Cleanup(). The problem is that many applications (such as the Mozilla browsers and e-mail clients) use NSPR for other things as well, so it would be bad if the LDAP code called PR_Cleanup() unexpectedly. We could add a prldap_cleanup() call that calls PR_Cleanup() but I do not at the moment see a safe way to automatically call PR_Cleanup() for you. In any case it would be really good to know if calling PR_Cleanup() fixes the problem you are seeing.
A known workaround for this problem is to leak the NSPR shared object by making an extra explicit dlopen() call on it without a corresponding dlclose(). This will cause NSPR to never be unloaded, and will hide this problem.
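A minimal sketch of that workaround on a POSIX system, assuming the library names from the report (libnspr4.so, libprldap50.so). The helper name pin_shared_object is hypothetical and is not part of NSPR or the LDAP SDK; on Windows an analogous extra LoadLibrary() call would be needed instead.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical helper: take an extra reference on a shared object and
 * never release it, so the loader keeps it mapped until process exit.
 * Returns 0 on success, -1 if the object could not be opened. */
int pin_shared_object(const char *soname)
{
    void *handle = dlopen(soname, RTLD_NOW | RTLD_GLOBAL);

    if (handle == NULL)
        return -1;   /* dlerror() describes the failure */

    /* The handle is deliberately dropped without dlclose(): this is the leak. */
    return 0;
}
```

An application would call pin_shared_object("libnspr4.so") once after the SDK is loaded, and possibly pin_shared_object("libprldap50.so") as well if pinning NSPR alone is not enough. Some platforms also offer an RTLD_NODELETE flag for dlopen() with a similar effect; whether it is available depends on the system.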
Mark, PR_Cleanup currently has several problems. I consider it incomplete: it doesn't do a full cleanup of threads and other resources allocated by NSPR. Therefore, even if you call it in the LDAP shutdown function, I don't think it will help much with this problem. See bugs 254987, 254983, and 255452 for more info. Until those are fixed, I think it is best never to call PR_Cleanup and to keep the NSPR shared object loaded for the life of the process.
Is this a known issue only on Solaris, or on all platforms (Unix and Windows)? I'm not very keen on leaking the shared object. Is there a fix planned for this?

(In reply to comment #9)
> A known workaround for this problem is to leak the NSPR shared object by making
> an extra explicit dlopen() on it without a corresponding dlclose(). This will
> cause NSPR to never be unloaded, and will hide this problem.
It's an issue for all platforms. Fixing it would require:
1) completing the implementation of PR_Cleanup
2) exposing APIs through the LDAP SDK to decide when to call PR_Cleanup
Currently, I'm not aware of plans to fix 1), and 2) depends on 1).
I assume you do not have control over the unloading process (in which case you could just not unload your library). Or maybe you find that to be unacceptable for VM footprint reasons or some other reason. Another option would be to avoid NSPR, which will be some work if you want to support SSL (I think you would have to avoid NSS in that case as well and use something like OpenSSL).
Mark, I admit I don't know much about the LDAP SDK code, but doesn't it rely on NSPR/NSS for things other than SSL sockets? E.g., OS abstraction of threading, locking, etc. I can't imagine how you would be able to rid yourself of NSPR unless much of the code was rewritten specifically for a particular OS, which IMO would be a step backwards.
I'm a little bit confused. How will leaking the NSPR shared object fix this problem? The core dump occurs when NSPR calls into prldap50 after prldap50 has been unloaded. Does this mean I have to leak prldap50 as well?

(In reply to comment #9)
> A known workaround for this problem is to leak the NSPR shared object by making
> an extra explicit dlopen() on it without a corresponding dlclose(). This will
> cause NSPR to never be unloaded, and will hide this problem.
You may have to leak both. Try just leaking NSPR, and if that doesn't work, leaking the SDK DLL as well. The stack of your crash doesn't show an actual callback into the SDK so it may not be necessary. Another possible workaround is to terminate all the threads in your application that used NSPR and the SDK before you unload your shared library. This would prevent the NSPR thread termination callback code from being invoked after NSPR or the SDK are unloaded.
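The second workaround comes down to ordering: join every thread that touched NSPR or the SDK first, and only then release the library. A rough sketch with POSIX threads follows; run_workers_then_unload, worker_main and WORKER_COUNT are hypothetical names, and the worker body stands in for whatever LDAP work the real threads do.

```c
#include <dlfcn.h>
#include <pthread.h>
#include <stddef.h>

#define WORKER_COUNT 4   /* hypothetical pool size */

/* Stand-in for a thread that would use the LDAP SDK. */
static void *worker_main(void *arg)
{
    int *done = arg;
    *done = 1;
    return NULL;
}

/* Starts the workers, waits for all of them to exit, and only then
 * releases lib_handle (a handle from dlopen(), or NULL to skip unloading).
 * Returns the number of workers that finished, or -1 on error. */
int run_workers_then_unload(void *lib_handle)
{
    pthread_t tids[WORKER_COUNT];
    int done[WORKER_COUNT] = {0};
    int started = 0, finished = 0, failed = 0;

    for (int i = 0; i < WORKER_COUNT; i++) {
        if (pthread_create(&tids[i], NULL, worker_main, &done[i]) != 0) {
            failed = 1;
            break;
        }
        started++;
    }

    /* Join first: per-thread exit callbacks have run once this loop ends. */
    for (int i = 0; i < started; i++) {
        if (pthread_join(tids[i], NULL) == 0)
            finished += done[i];
    }

    /* Unload last, when no thread can call back into the library. */
    if (lib_handle != NULL && dlclose(lib_handle) != 0)
        return -1;

    return failed ? -1 : finished;
}
```

This only helps if the application owns all the threads involved; threads created by other components that used the SDK would have to be stopped the same way.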
Regarding Comment #14, the core libldap code does not directly link with NSPR or NSS. Two separate shared libraries, libprldap (which links with NSPR) and libssldap (links with NSS), are used if an application wants to use NSPR or SSL. libssldap has a dependency on libprldap. But the core libldap simply allows optional callback functions to be installed to handle things like I/O and thread safety. We have maintained the core libldap library this way because some applications do not want to or can't use NSPR or NSS.
Amanda, any update on this bug?
Any update? We have a window of opportunity now to work on ldap csdk issues, so please let us know.