Created attachment 110015 [details] [diff] [review] rev 1 This patch will intentially leak some memory (part of a global state) in order to prevent crashes when an application has not freed all cert references before calling shutdown.
Created attachment 110016 [details] [diff] [review] rev 2 oops - that patch had a "!=" where it meant to have a "|=".
Comment on attachment 110015 [details] [diff] [review] rev 1 Three suggested changes. 1. We should not set the SEC_ERROR_BUSY error code in NSS_Shutdown. We should set the error code in nssCertificateStore_Destroy and nssTrustDomain_DestroyCache when the hash count is greater than 0. 2. It is safer to call nssHash_Count while holding the lock on the hash table. 3. Important: if nssCertificateStore_Destroy fails, NSSCryptoContext_Destroy should fail immediately without destroying the crypto context. Otherwise, c->object.cryptocontext will be pointing to freed memory. Similarly, if nssTrustDomain_DestroyCache fails, NSSTrustDomain_Destroy should fail immediately without destroying the trust domain.
Created attachment 110017 [details] [diff] [review] rev 3 1 - would involve an NSS 3.X error code in Stan code, so I added an equivalent Stan error. 2 - nssHash has a lock for the count 3 - done
Created attachment 110029 [details] [diff] [review] rev 4 Ian, I made three changes to your patch, rev 3. 1. I moved the setting of the NSS_ERROR_BUSY error code to nssCertificateStore_Destroy and nssTrustDomain_DestroyCache, where we test to see if nssHash_Count() is > 0. 2. I added a missing return statement to the end of nssCertificateStore_Destroy. 3. In nssTrustDomain_DestroyCache I replaced the test for !td->cache by an assertion for td->cache because td->cache could never be NULL. Please review this revision of the patch. Thanks.
Comment on attachment 110029 [details] [diff] [review] rev 4 Prior to this patch, nssTrustDomain_DestroyCache will not crash if you call it with a "td" such that td->cache is NULL. With this patch, in non-DEBUG builds, nssTrustDomain_DestroyCache will crash in that circumstance.
Nelson, Ian told me that td->cache could not be NULL, which is why I changed the test to an assertion. If this is your only objection to this patch, I can undo that change. That change is not part of the fix for this bug.
Besides the td->cache issue, this patch also brought to my attention the fact that Stan is using a global external set of integers for error numbers, rather than an enumerated type or a set of #defines. This is undesirable, IMO, for several reasons, including the fact that external references to data in other shared libraries is a problem on Windows. I also wish that the numerous places where a line was added that reads: extern const NSSError NSS_ERROR_BUSY; were instead #includes of a header file that defines those error symbols.
I raised both issues when I first reviewed Ian's patch. Ian told me that these are both design decisions made by the original Stan code and he's following them in the Stan code underneath NSS 3.x. However, he has abandoned this design in the new NSS 4.0 code he's working on for exactly the reasons you cited (difficulty in exporting data from Windows DLLs and the need to manually declare the error symbols). The lack of a header file that declares the error symbols is intentional in the original Stan design. It forces the author of a function to explicitly declare the error symbols that may be returned by a function. This is supposed to make the code self-documenting. This patch is merely following the style of existing Stan code in NSS 3.x. I hope this addresses your concerns.
Created attachment 110047 [details] [diff] [review] rev 5 How does this one look, Nelson? The assertion on td->cache has been replaced by a failure return with the NSS_ERROR_INTERNAL_ERROR error. The purpose of this bug is to provide a workaround for the crash in nssCertificateStore_Lock reported in bug 182803. The real bug is that Mozilla is destroying CERTCertificates after NSS has been shut down. I will ask Kai to file bug(s) against Mozilla for that application programming error. But it is useful for NSS to avoid a crash under that condition.
r=nelsonb What are the release target and priority of this bug?
The release target is 3.7.1 and the priority is P1. We should get this fix into Mozilla 1.3. I just realized that the latest patch does not cause a subsequent NSS initialization to fail. I think I know how to do that, but I am afraid that I'll need to wait until both Ian and Bob get back to check in a fix.
Created attachment 110102 [details] [diff] [review] rev 6 This revision of the patch has new code to cause a subsequent Stan initialization to fail with the new Stan error code NSS_ERROR_ALREADY_INITIALIZED if a previous Stan shutdown failed. I am not sure if my change to nss_Init is the right way to do that and whether I need to undo anything done by nss_Init if the STAN_LoadDefaultNSS3TrustDomain call fails. In order to detect a failed Stan shutdown, I am setting g_default_trust_domain and g_default_crypto_context to NULL if they are successfully destroyed. Your review is invited.
Kai, The fix of this bug will avoid the crash reported in bug 182803. However, the real bug is that Mozilla is destroying CERTCertificates after NSS has been shut down, which is why I do not make this bug block bug 182803. When the fix for this bug is checked in, you should remove the workaround for bug 182803 (that is, restore the NSS_Shutdown call). Furthermore, you should check the return status of NSS_Shutdown. I want Bob, Ian, and Nelson to review this patch, so I estimate that this patch won't be ready for checkin until Jan. 6. You are welcome to test this patch (with the NSS_Shutdown call in PSM restored).
Comment on attachment 110102 [details] [diff] [review] rev 6 That looks like the correct way to prevent reinitialization of Stan when a shutdown has failed.
The patch has been checked into the tip and NSS_3_7_BRANCH of NSS. Note that I needed to add a SSL_ClearSessionCache() call to strsclnt.c, right before the NSS_Shutdown() call, so that NSS_Shutdown would not fail.
The patch has been checked into the NSS_CLIENT_TAG. Kai, you can add the NSS_Shutdown() call back to PSM now. You should check the return status of NSS_Shutdown().
Summary: With this patch, profile switching does no longer work. I should have commented earlier. I was ok with your plan to make NSS avoid this crash, so I didn't have much to say. Meanwhile I got back to working on profile switching and have again a testing environment. For testing I use a very special test build. It is the latest trunk build, which includes the patch from this bug; the test build also has the profile switching UI patch from bug 97622; it has a general fix from Darin for bug 182230 which ensures all sockets are getting shut down prior to profile switching; I have re-enabled the call to NSS_Shutdown. Even more, we have agreed to use the early-shutdown-of-NSS resources approach that I initially posted in bug 177260, and my test build also has this logic. Therefore I'm rather confident that my test build behaves good wrt NSS resources. My test scenario is: I use IMAP/SSL in a Mozilla session. Next I try to switch the profile. This involves shutting down Mozilla. I traced the SSL connection is fully closed before attempting to shut down. The return code from NSS_Shutdown is checked and a failure is reported. If I back out this patch, shutting down NSS reports success, and initializing NSS again does work. I reopen this bug for discussion whether to back out this patch or to find another solution.
Error code on shutdown failure is SEC_ERROR_BUSY. Another leak?
Sigh, the cause was indeed another leak in PSM. Luckily I was able to find it rather quickly. Obviously this patch increased NSS' intolerance about leaks. Marking bug fixed again. I'll file a separate bug on the new leak.
Kai, thank you for testing the patch in this bug. Yes, this patch made NSS detect more leaks at shutdown and therefore increased its intolerance about leaks. The new SEC_ERROR_BUSY error code is intended to report the error condition that NSS is shut down while there are still open NSS objects. I just realized that I forgot to add a new SEC_ERROR_ALREADY_INITIALIZED error code and the code to map the new Stan error code NSS_ERROR_ALREADY_INITIALIZED to it. I will submit a patch. This new error code is intended to report the condition that one attempts to re-initialize NSS after a failed shutdown. Since NSS init is allowed to be called multiple times, SEC_ERROR_ALREADY_INITIALIZED sounds contradictory to what is allowed. Any suggestion for a better name for this error code?
Marked the bug fixed again. I decided to open a new bug (bug 189024) for the remaining minor problem that no error code is set on the attempt to re-initialize NSS after a failed NSS shutdown.
> Yes, this patch made NSS detect more leaks at > shutdown and therefore increased its intolerance > about leaks. Oops. I thought NSS would already report all the problems it cares about. This new behaviour could be a problem for me. By doing some more testing, I was able to find yet another scenario where shutdown fails - sending an encrypted message without any SSL involved. I fear I will run into more leaks that all need to get tracked down, whether they are in NSS or PSM. Is it absolutely necessary that a shutdown problem reported by STAN_Shutdown results into the inability to re-initialize NSS?
Kai, I understand you are disappointed to learn that you now have more leaks that you need to track down. To answer your question it is useful to review how we got here. In NSS 3.6.1, we checked in Ian's fix for bug 177366. That fix made NSS do reference counting better, but it also made CERT_DestroyCertificate depend on some internal Stan data structures. Those data structures are destroyed by STAN_Shutdown, which is why CERT_DestroyCertificate would crash if STAN_Shutdown had been called. This resulted in various crashes of Mozilla on exit (bug 182803). You were forced to remove the NSS_Shutdown call from PSM. The patch in this bug caused STAN_Shutdown to not destroy those data structures if it detects that there are still open certs. Ian's fix for bug 177366 made it easy to detect open certs at NSS shutdown. So, your question becomes: Is it absolutely necessary that the existence of open certs at NSS shutdown results in the inability to re-initialize NSS?
Wan-Teh: I understand, thanks for the explanation.