Last Comment Bug 186586 - NSS should not crash when shutdown fails due to open references
: NSS should not crash when shutdown fails due to open references
Status: RESOLVED FIXED
:
Product: NSS
Classification: Components
Component: Libraries (show other bugs)
: 3.6
: All All
: P1 normal (vote)
: 3.7.1
Assigned To: Ian McGreer
: Bishakha Banerjee
Mentors:
Depends on:
Blocks: 187501
  Show dependency treegraph
 
Reported: 2002-12-23 11:59 PST by Ian McGreer
Modified: 2003-01-16 07:18 PST (History)
5 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
rev 1 (5.97 KB, patch)
2002-12-23 12:01 PST, Ian McGreer
wtc: review-
Details | Diff | Splinter Review
rev 2 (5.97 KB, patch)
2002-12-23 12:04 PST, Ian McGreer
no flags Details | Diff | Splinter Review
rev 3 (7.21 KB, patch)
2002-12-23 12:33 PST, Ian McGreer
no flags Details | Diff | Splinter Review
rev 4 (7.47 KB, patch)
2002-12-23 14:43 PST, Wan-Teh Chang
bugz: review+
nelson: superreview-
Details | Diff | Splinter Review
rev 5 (7.56 KB, patch)
2002-12-23 18:02 PST, Wan-Teh Chang
nelson: superreview+
Details | Diff | Splinter Review
rev 6 (8.63 KB, patch)
2002-12-24 12:33 PST, Wan-Teh Chang
bugz: review+
Details | Diff | Splinter Review

Description Ian McGreer 2002-12-23 11:59:44 PST
patch coming.
Comment 1 Ian McGreer 2002-12-23 12:01:04 PST
Created attachment 110015 [details] [diff] [review]
rev 1


This patch will intentially leak some memory (part of a global state) in order
to prevent crashes when an application has not freed all cert references before
calling shutdown.
Comment 2 Ian McGreer 2002-12-23 12:04:18 PST
Created attachment 110016 [details] [diff] [review]
rev 2


oops - that patch had a "!=" where it meant to have a "|=".
Comment 3 Wan-Teh Chang 2002-12-23 12:09:52 PST
Comment on attachment 110015 [details] [diff] [review]
rev 1

Three suggested changes.

1. We should not set the SEC_ERROR_BUSY error code
in NSS_Shutdown.  We should set the error code in
nssCertificateStore_Destroy and
nssTrustDomain_DestroyCache when the hash count is
greater than 0.

2. It is safer to call nssHash_Count while holding
the lock on the hash table.

3. Important: if nssCertificateStore_Destroy fails,
NSSCryptoContext_Destroy should fail immediately
without destroying the crypto context.	Otherwise,
c->object.cryptocontext will be pointing to freed
memory.

Similarly, if nssTrustDomain_DestroyCache fails,
NSSTrustDomain_Destroy should fail immediately
without destroying the trust domain.
Comment 4 Ian McGreer 2002-12-23 12:33:15 PST
Created attachment 110017 [details] [diff] [review]
rev 3


1 - would involve an NSS 3.X error code in Stan code, so I added an equivalent
Stan error.

2 - nssHash has a lock for the count

3 - done
Comment 5 Wan-Teh Chang 2002-12-23 14:43:33 PST
Created attachment 110029 [details] [diff] [review]
rev 4

Ian, I made three changes to your patch, rev 3.

1. I moved the setting of the NSS_ERROR_BUSY error code
to nssCertificateStore_Destroy and nssTrustDomain_DestroyCache,
where we test to see if nssHash_Count() is > 0.

2. I added a missing return statement to the end of
nssCertificateStore_Destroy.

3. In nssTrustDomain_DestroyCache I replaced the test for
!td->cache by an assertion for td->cache because td->cache
could never be NULL.

Please review this revision of the patch.  Thanks.
Comment 6 Nelson Bolyard (seldom reads bugmail) 2002-12-23 15:35:39 PST
Comment on attachment 110029 [details] [diff] [review]
rev 4

Prior to this patch, nssTrustDomain_DestroyCache will not crash if you call it
with a "td" such that td->cache is NULL.  With this patch, in non-DEBUG builds,
nssTrustDomain_DestroyCache will crash in that circumstance.
Comment 7 Wan-Teh Chang 2002-12-23 15:41:34 PST
Nelson, Ian told me that td->cache could not be NULL,
which is why I changed the test to an assertion.

If this is your only objection to this patch, I can
undo that change.  That change is not part of the
fix for this bug.
Comment 8 Nelson Bolyard (seldom reads bugmail) 2002-12-23 16:15:21 PST
Besides the td->cache issue, this patch also brought to my attention the 
fact that Stan is using a global external set of integers for error numbers, 
rather than an enumerated type or a set of #defines.  This is undesirable, 
IMO, for several reasons, including the fact that external references to data 
in other shared libraries is a problem on Windows.  

I also wish that the numerous places where a line was added that reads:
   extern const NSSError NSS_ERROR_BUSY;
were instead #includes of a header file that defines those error symbols.
Comment 9 Wan-Teh Chang 2002-12-23 16:29:50 PST
I raised both issues when I first reviewed Ian's patch.
Ian told me that these are both design decisions made
by the original Stan code and he's following them in
the Stan code underneath NSS 3.x.  However, he has
abandoned this design in the new NSS 4.0 code he's
working on for exactly the reasons you cited (difficulty
in exporting data from Windows DLLs and the need to
manually declare the error symbols).

The lack of a header file that declares the error
symbols is intentional in the original Stan design.
It forces the author of a function to explicitly
declare the error symbols that may be returned by
a function.  This is supposed to make the code
self-documenting.  This patch is merely following
the style of existing Stan code in NSS 3.x.

I hope this addresses your concerns.
Comment 10 Wan-Teh Chang 2002-12-23 18:02:05 PST
Created attachment 110047 [details] [diff] [review]
rev 5

How does this one look, Nelson?

The assertion on td->cache has been replaced by a
failure return with the NSS_ERROR_INTERNAL_ERROR
error.

The purpose of this bug is to provide a workaround
for the crash in nssCertificateStore_Lock reported
in bug 182803.	The real bug is that Mozilla is
destroying CERTCertificates after NSS has been
shut down.  I will ask Kai to file bug(s) against
Mozilla for that application programming error.  But
it is useful for NSS to avoid a crash under that
condition.
Comment 11 Nelson Bolyard (seldom reads bugmail) 2002-12-23 18:53:30 PST
r=nelsonb
What are the release target and priority of this bug?
Comment 12 Wan-Teh Chang 2002-12-24 11:25:07 PST
The release target is 3.7.1 and the priority is P1.
We should get this fix into Mozilla 1.3.

I just realized that the latest patch does not cause
a subsequent NSS initialization to fail.  I think I
know how to do that, but I am afraid that I'll need
to wait until both Ian and Bob get back to check in
a fix.
Comment 13 Wan-Teh Chang 2002-12-24 12:33:38 PST
Created attachment 110102 [details] [diff] [review]
rev 6

This revision of the patch has new code to cause
a subsequent Stan initialization to fail with the
new Stan error code NSS_ERROR_ALREADY_INITIALIZED
if a previous Stan shutdown failed.

I am not sure if my change to nss_Init is the right
way to do that and whether I need to undo anything done
by nss_Init if the STAN_LoadDefaultNSS3TrustDomain
call fails.

In order to detect a failed Stan shutdown, I am
setting g_default_trust_domain and
g_default_crypto_context to NULL if they are
successfully destroyed.

Your review is invited.
Comment 14 Wan-Teh Chang 2002-12-24 12:44:09 PST
Kai,

The fix of this bug will avoid the crash reported in
bug 182803.  However, the real bug is that Mozilla is
destroying CERTCertificates after NSS has been shut
down, which is why I do not make this bug block bug
182803.

When the fix for this bug is checked in, you should
remove the workaround for bug 182803 (that is, restore
the NSS_Shutdown call).  Furthermore, you should check
the return status of NSS_Shutdown.

I want Bob, Ian, and Nelson to review this patch, so
I estimate that this patch won't be ready for checkin
until Jan. 6.  You are welcome to test this patch (with
the NSS_Shutdown call in PSM restored).
Comment 15 Ian McGreer 2003-01-02 10:44:27 PST
Comment on attachment 110102 [details] [diff] [review]
rev 6

That looks like the correct way to prevent reinitialization of Stan when a
shutdown has failed.
Comment 16 Robert Relyea 2003-01-08 10:33:49 PST
Comment on attachment 110102 [details] [diff] [review]
rev 6

r=relyea
Comment 17 Wan-Teh Chang 2003-01-08 18:16:15 PST
The patch has been checked into the tip and NSS_3_7_BRANCH
of NSS.

Note that I needed to add a SSL_ClearSessionCache() call
to strsclnt.c, right before the NSS_Shutdown() call, so
that NSS_Shutdown would not fail.
Comment 18 Wan-Teh Chang 2003-01-10 16:40:38 PST
The patch has been checked into the NSS_CLIENT_TAG.

Kai, you can add the NSS_Shutdown() call back to PSM
now.  You should check the return status of NSS_Shutdown().
Comment 19 Kai Engert (:kaie) 2003-01-14 02:54:56 PST
Summary: With this patch, profile switching does no longer work.


I should have commented earlier. I was ok with your plan to make NSS avoid this
crash, so I didn't have much to say.

Meanwhile I got back to working on profile switching and have again a testing
environment. For testing I use a very special test build. It is the latest trunk
build, which includes the patch from this bug; the test build also has the
profile switching UI patch from bug 97622; it has a general fix from Darin for
bug 182230 which ensures all sockets are getting shut down prior to profile
switching; I have re-enabled the call to NSS_Shutdown. Even more, we have agreed
to use the early-shutdown-of-NSS resources approach that I initially posted in
bug 177260, and my test build also has this logic. Therefore I'm rather
confident that my test build behaves good wrt NSS resources.


My test scenario is: I use IMAP/SSL in a Mozilla session. Next I try to switch
the profile. This involves shutting down Mozilla. I traced the SSL connection is
fully closed before attempting to shut down. The return code from NSS_Shutdown
is checked and a failure is reported.


If I back out this patch, shutting down NSS reports success, and initializing
NSS again does work.


I reopen this bug for discussion whether to back out this patch or to find
another solution.
Comment 20 Kai Engert (:kaie) 2003-01-14 03:14:02 PST
Error code on shutdown failure is SEC_ERROR_BUSY.

Another leak?

Comment 21 Kai Engert (:kaie) 2003-01-14 03:21:20 PST
Sigh, the cause was indeed another leak in PSM. Luckily I was able to find it
rather quickly.

Obviously this patch increased NSS' intolerance about leaks.

Marking bug fixed again. I'll file a separate bug on the new leak.
Comment 22 Wan-Teh Chang 2003-01-14 05:53:48 PST
Kai, thank you for testing the patch in this bug.
Yes, this patch made NSS detect more leaks at
shutdown and therefore increased its intolerance
about leaks.  The new SEC_ERROR_BUSY error code
is intended to report the error condition that
NSS is shut down while there are still open NSS
objects.

I just realized that I forgot to add a new
SEC_ERROR_ALREADY_INITIALIZED error code and
the code to map the new Stan error code
NSS_ERROR_ALREADY_INITIALIZED to it.  I will
submit a patch.  This new error code is intended
to report the condition that one attempts to
re-initialize NSS after a failed shutdown.
Since NSS init is allowed to be called multiple
times, SEC_ERROR_ALREADY_INITIALIZED sounds
contradictory to what is allowed.  Any suggestion
for a better name for this error code?
Comment 23 Wan-Teh Chang 2003-01-14 06:05:30 PST
Marked the bug fixed again.

I decided to open a new bug (bug 189024) for
the remaining minor problem that no error code
is set on the attempt to re-initialize NSS after
a failed NSS shutdown.
Comment 24 Kai Engert (:kaie) 2003-01-14 07:33:52 PST
> Yes, this patch made NSS detect more leaks at
> shutdown and therefore increased its intolerance
> about leaks.

Oops. I thought NSS would already report all the problems it cares about. This
new behaviour could be a problem for me. By doing some more testing, I was able
to find yet another scenario where shutdown fails - sending an encrypted message
without any SSL involved.

I fear I will run into more leaks that all need to get tracked down, whether
they are in NSS or PSM.

Is it absolutely necessary that a shutdown problem reported by STAN_Shutdown
results into the inability to re-initialize NSS?
Comment 25 Wan-Teh Chang 2003-01-14 08:48:11 PST
Kai, I understand you are disappointed to learn
that you now have more leaks that you need to
track down.  To answer your question it is useful
to review how we got here.

In NSS 3.6.1, we checked in Ian's fix for bug
177366.  That fix made NSS do reference counting
better, but it also made CERT_DestroyCertificate
depend on some internal Stan data structures.
Those data structures are destroyed by
STAN_Shutdown, which is why CERT_DestroyCertificate
would crash if STAN_Shutdown had been called.
This resulted in various crashes of Mozilla on exit
(bug 182803).  You were forced to remove the
NSS_Shutdown call from PSM.

The patch in this bug caused STAN_Shutdown to not
destroy those data structures if it detects that
there are still open certs.

Ian's fix for bug 177366 made it easy to detect
open certs at NSS shutdown.  So, your question
becomes: Is it absolutely necessary that the
existence of open certs at NSS shutdown results
in the inability to re-initialize NSS?
Comment 26 Kai Engert (:kaie) 2003-01-16 07:18:03 PST
Wan-Teh: I understand, thanks for the explanation.

Note You need to log in before you can comment on or make changes to this bug.