Closed Bug 186586 Opened 22 years ago Closed 22 years ago

NSS should not crash when shutdown fails due to open references

Categories

(NSS :: Libraries, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bugz, Assigned: bugz)

References

Details

Attachments

(1 file, 5 obsolete files)

patch coming.
Attached patch rev 1 (obsolete) — Splinter Review
This patch will intentially leak some memory (part of a global state) in order
to prevent crashes when an application has not freed all cert references before
calling shutdown.
Attached patch rev 2 (obsolete) — Splinter Review
oops - that patch had a "!=" where it meant to have a "|=".
Attachment #110015 - Attachment is obsolete: true
Comment on attachment 110015 [details] [diff] [review]
rev 1

Three suggested changes.

1. We should not set the SEC_ERROR_BUSY error code
in NSS_Shutdown.  We should set the error code in
nssCertificateStore_Destroy and
nssTrustDomain_DestroyCache when the hash count is
greater than 0.

2. It is safer to call nssHash_Count while holding
the lock on the hash table.

3. Important: if nssCertificateStore_Destroy fails,
NSSCryptoContext_Destroy should fail immediately
without destroying the crypto context.	Otherwise,
c->object.cryptocontext will be pointing to freed
memory.

Similarly, if nssTrustDomain_DestroyCache fails,
NSSTrustDomain_Destroy should fail immediately
without destroying the trust domain.
Attachment #110015 - Attachment is obsolete: false
Attachment #110015 - Flags: review-
Attached patch rev 3 (obsolete) — Splinter Review
1 - would involve an NSS 3.X error code in Stan code, so I added an equivalent
Stan error.

2 - nssHash has a lock for the count

3 - done
Attachment #110015 - Attachment is obsolete: true
Attachment #110016 - Attachment is obsolete: true
Attached patch rev 4 (obsolete) — Splinter Review
Ian, I made three changes to your patch, rev 3.

1. I moved the setting of the NSS_ERROR_BUSY error code
to nssCertificateStore_Destroy and nssTrustDomain_DestroyCache,
where we test to see if nssHash_Count() is > 0.

2. I added a missing return statement to the end of
nssCertificateStore_Destroy.

3. In nssTrustDomain_DestroyCache I replaced the test for
!td->cache by an assertion for td->cache because td->cache
could never be NULL.

Please review this revision of the patch.  Thanks.
Attachment #110017 - Attachment is obsolete: true
Attachment #110029 - Flags: superreview?(nelsonb)
Attachment #110029 - Flags: review+
Comment on attachment 110029 [details] [diff] [review]
rev 4

Prior to this patch, nssTrustDomain_DestroyCache will not crash if you call it
with a "td" such that td->cache is NULL.  With this patch, in non-DEBUG builds,
nssTrustDomain_DestroyCache will crash in that circumstance.
Attachment #110029 - Flags: superreview?(nelsonb) → superreview-
Nelson, Ian told me that td->cache could not be NULL,
which is why I changed the test to an assertion.

If this is your only objection to this patch, I can
undo that change.  That change is not part of the
fix for this bug.
Besides the td->cache issue, this patch also brought to my attention the 
fact that Stan is using a global external set of integers for error numbers, 
rather than an enumerated type or a set of #defines.  This is undesirable, 
IMO, for several reasons, including the fact that external references to data 
in other shared libraries is a problem on Windows.  

I also wish that the numerous places where a line was added that reads:
   extern const NSSError NSS_ERROR_BUSY;
were instead #includes of a header file that defines those error symbols.
I raised both issues when I first reviewed Ian's patch.
Ian told me that these are both design decisions made
by the original Stan code and he's following them in
the Stan code underneath NSS 3.x.  However, he has
abandoned this design in the new NSS 4.0 code he's
working on for exactly the reasons you cited (difficulty
in exporting data from Windows DLLs and the need to
manually declare the error symbols).

The lack of a header file that declares the error
symbols is intentional in the original Stan design.
It forces the author of a function to explicitly
declare the error symbols that may be returned by
a function.  This is supposed to make the code
self-documenting.  This patch is merely following
the style of existing Stan code in NSS 3.x.

I hope this addresses your concerns.
Attached patch rev 5 (obsolete) — Splinter Review
How does this one look, Nelson?

The assertion on td->cache has been replaced by a
failure return with the NSS_ERROR_INTERNAL_ERROR
error.

The purpose of this bug is to provide a workaround
for the crash in nssCertificateStore_Lock reported
in bug 182803.	The real bug is that Mozilla is
destroying CERTCertificates after NSS has been
shut down.  I will ask Kai to file bug(s) against
Mozilla for that application programming error.  But
it is useful for NSS to avoid a crash under that
condition.
r=nelsonb
What are the release target and priority of this bug?
The release target is 3.7.1 and the priority is P1.
We should get this fix into Mozilla 1.3.

I just realized that the latest patch does not cause
a subsequent NSS initialization to fail.  I think I
know how to do that, but I am afraid that I'll need
to wait until both Ian and Bob get back to check in
a fix.
Priority: -- → P1
Target Milestone: --- → 3.7.1
Attached patch rev 6Splinter Review
This revision of the patch has new code to cause
a subsequent Stan initialization to fail with the
new Stan error code NSS_ERROR_ALREADY_INITIALIZED
if a previous Stan shutdown failed.

I am not sure if my change to nss_Init is the right
way to do that and whether I need to undo anything done
by nss_Init if the STAN_LoadDefaultNSS3TrustDomain
call fails.

In order to detect a failed Stan shutdown, I am
setting g_default_trust_domain and
g_default_crypto_context to NULL if they are
successfully destroyed.

Your review is invited.
Attachment #110029 - Attachment is obsolete: true
Attachment #110047 - Attachment is obsolete: true
Kai,

The fix of this bug will avoid the crash reported in
bug 182803.  However, the real bug is that Mozilla is
destroying CERTCertificates after NSS has been shut
down, which is why I do not make this bug block bug
182803.

When the fix for this bug is checked in, you should
remove the workaround for bug 182803 (that is, restore
the NSS_Shutdown call).  Furthermore, you should check
the return status of NSS_Shutdown.

I want Bob, Ian, and Nelson to review this patch, so
I estimate that this patch won't be ready for checkin
until Jan. 6.  You are welcome to test this patch (with
the NSS_Shutdown call in PSM restored).
Comment on attachment 110102 [details] [diff] [review]
rev 6

That looks like the correct way to prevent reinitialization of Stan when a
shutdown has failed.
Attachment #110102 - Flags: review+
Blocks: 187501
Comment on attachment 110102 [details] [diff] [review]
rev 6

r=relyea
The patch has been checked into the tip and NSS_3_7_BRANCH
of NSS.

Note that I needed to add a SSL_ClearSessionCache() call
to strsclnt.c, right before the NSS_Shutdown() call, so
that NSS_Shutdown would not fail.
Status: NEW → ASSIGNED
The patch has been checked into the NSS_CLIENT_TAG.

Kai, you can add the NSS_Shutdown() call back to PSM
now.  You should check the return status of NSS_Shutdown().
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Summary: With this patch, profile switching does no longer work.


I should have commented earlier. I was ok with your plan to make NSS avoid this
crash, so I didn't have much to say.

Meanwhile I got back to working on profile switching and have again a testing
environment. For testing I use a very special test build. It is the latest trunk
build, which includes the patch from this bug; the test build also has the
profile switching UI patch from bug 97622; it has a general fix from Darin for
bug 182230 which ensures all sockets are getting shut down prior to profile
switching; I have re-enabled the call to NSS_Shutdown. Even more, we have agreed
to use the early-shutdown-of-NSS resources approach that I initially posted in
bug 177260, and my test build also has this logic. Therefore I'm rather
confident that my test build behaves good wrt NSS resources.


My test scenario is: I use IMAP/SSL in a Mozilla session. Next I try to switch
the profile. This involves shutting down Mozilla. I traced the SSL connection is
fully closed before attempting to shut down. The return code from NSS_Shutdown
is checked and a failure is reported.


If I back out this patch, shutting down NSS reports success, and initializing
NSS again does work.


I reopen this bug for discussion whether to back out this patch or to find
another solution.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Error code on shutdown failure is SEC_ERROR_BUSY.

Another leak?

Sigh, the cause was indeed another leak in PSM. Luckily I was able to find it
rather quickly.

Obviously this patch increased NSS' intolerance about leaks.

Marking bug fixed again. I'll file a separate bug on the new leak.
Status: REOPENED → RESOLVED
Closed: 22 years ago22 years ago
Resolution: --- → FIXED
Kai, thank you for testing the patch in this bug.
Yes, this patch made NSS detect more leaks at
shutdown and therefore increased its intolerance
about leaks.  The new SEC_ERROR_BUSY error code
is intended to report the error condition that
NSS is shut down while there are still open NSS
objects.

I just realized that I forgot to add a new
SEC_ERROR_ALREADY_INITIALIZED error code and
the code to map the new Stan error code
NSS_ERROR_ALREADY_INITIALIZED to it.  I will
submit a patch.  This new error code is intended
to report the condition that one attempts to
re-initialize NSS after a failed shutdown.
Since NSS init is allowed to be called multiple
times, SEC_ERROR_ALREADY_INITIALIZED sounds
contradictory to what is allowed.  Any suggestion
for a better name for this error code?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Marked the bug fixed again.

I decided to open a new bug (bug 189024) for
the remaining minor problem that no error code
is set on the attempt to re-initialize NSS after
a failed NSS shutdown.
Status: REOPENED → RESOLVED
Closed: 22 years ago22 years ago
Resolution: --- → FIXED
> Yes, this patch made NSS detect more leaks at
> shutdown and therefore increased its intolerance
> about leaks.

Oops. I thought NSS would already report all the problems it cares about. This
new behaviour could be a problem for me. By doing some more testing, I was able
to find yet another scenario where shutdown fails - sending an encrypted message
without any SSL involved.

I fear I will run into more leaks that all need to get tracked down, whether
they are in NSS or PSM.

Is it absolutely necessary that a shutdown problem reported by STAN_Shutdown
results into the inability to re-initialize NSS?
Kai, I understand you are disappointed to learn
that you now have more leaks that you need to
track down.  To answer your question it is useful
to review how we got here.

In NSS 3.6.1, we checked in Ian's fix for bug
177366.  That fix made NSS do reference counting
better, but it also made CERT_DestroyCertificate
depend on some internal Stan data structures.
Those data structures are destroyed by
STAN_Shutdown, which is why CERT_DestroyCertificate
would crash if STAN_Shutdown had been called.
This resulted in various crashes of Mozilla on exit
(bug 182803).  You were forced to remove the
NSS_Shutdown call from PSM.

The patch in this bug caused STAN_Shutdown to not
destroy those data structures if it detects that
there are still open certs.

Ian's fix for bug 177366 made it easy to detect
open certs at NSS shutdown.  So, your question
becomes: Is it absolutely necessary that the
existence of open certs at NSS shutdown results
in the inability to re-initialize NSS?
Wan-Teh: I understand, thanks for the explanation.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: