Closed Bug 177260 Opened 22 years ago Closed 22 years ago

Fix known leaks in PSM, track blocking PSM UI, track open SSL sockets

Categories

(Core Graveyard :: Security: UI, defect, P1)

Other Branch

Tracking

(Not tracked)

VERIFIED FIXED
psm2.4

People

(Reporter: KaiE, Assigned: KaiE)

References

Details

(Keywords: topembed+)

Attachments

(2 files, 6 obsolete files)

We tried to make PSM succeed in shutting down NSS, by fixing leaks and shutting
down all network activity prior to attempting a shutdown or profile switch.

I still think that shutting down all network activity first is a good idea,
because it should at least ensure, that at the time we are going doing our
cleanup, all activity happening on other threads should have stopped.

However, our assumption that shutting down the network is sufficient, in order
to bring down all resources to PSM, was wrong.

I can see, when the profile switch or shutdown notification is received, there
is still a HTML object alive, which references a HTTP channel, holding a
reference to a SSL status info, and by that way to a certificate.

When I saw that, I was finally convinced that PSM should not depend on external
Mozilla code to always behave correct, and go with a failsafe approached, that
also has been suggested by Darin.

The idea is, all living PSM C++ wrapper instances, carrying references to NSS
data structures, will be tracked in a list. When a shutdown or profile switch is
requested, all entries in the list will free their NSS references. The C++
instances itself will stay alive, in order to avoid crashes, in case external
Mozilla code will do future calls to PSM's wrapper objects. However, all those
objects will switch to an AlreadyShutDown mode, and will fail gracefully when
interfaces are called after the shutdown.

This approach was necessary, and it alone fixed two bugs I was able to reproduce.

However, besides this we still must make sure that we don't have leaks internal
in PSM. Freeing the references being held externally to PSM still doesn't help,
if PSM or NSS have leaks on their own.

I have been spending a couple of days tracing PSM and NSS. I found 20+ leaks.

The attached patch implements the approach described above, and all the leaks I
have found so far. I have found many leaks using the new code from bug 175320,
and it helped me to find shutdown failure bugs, as they can currently be seen as
dependent bugs of bug 157937. The attached patch fixes all of them and some more
bug that I have found (but not filed yet).

While the patch works for me on Linux, it is yet untested on other platforms. It
also contains some changes to NSS that still need discussion. I guess we will
probably want to track the NSS changes in (a) separate bug(s), but as a starter
I'm including them here.

In addition, the patch contains several sections marked with "???". If you could
have a look at those places, maybe you could give me some answers to those
questions.

Reviews and comments welcome.
Attached patch Patch v1 (obsolete) — Splinter Review
Blocks: psmleaks
Severity: normal → major
Status: NEW → ASSIGNED
Keywords: nsbeta1+
Priority: -- → P1
Target Milestone: --- → 2.4
Blocks: 175320
Depends on: 177366
Keywords: topembed+
Depends on: 176667, 176799
Attached patch Patch v2 (obsolete) — Splinter Review
New patch:
- still depends on NSS cleanup, but does no longer contain changes to NSS, they
are now being tracked in external bug 176667, bug 176799, 177366
- works around bug 177201 as suggested in that bug
- has some typecasts added that are needed for compilation on other platforms
- adds the changes required for the Mac buildsystem
Attachment #104474 - Attachment is obsolete: true
Blocks: 94557
Comment on attachment 105076 [details] [diff] [review]
Patch v2

Here are my comments.  Only the first comment I consider a must fix.
Everything else is take it or leave it.

Fix the first issue and r=javi

>+  void set(PRUint32 i, CERTCertificate *c)
>+  {
>+    if (i >= mSize)
>+      return;
>+    
>+    mCerts[i] = c;
>+  }
Depending on how one uses this class, this method can easily lead to leaks
or releasing a reference on a cert too many times.

In your destructor, you call CERT_DestroyCertficate which will decrement
the refCount, but you do not augment the refCount here.  You should
either not release the reference in the destructor or add a reference here
when assigning (making sure to release the reference to the old cert if there
was already a cert assigned to that index.

>+  
>+  CERTCertificate *get(PRUint32 i)
>+  {
>+    if (i >= mSize)
>+      return nsnull;
>+    
>+    return mCerts[i];
>+  }
>+
I guess this is fine if you make sure none of the callers then call
CERT_DestroyCertificate on the value returned.

>+void nsCMSDecoder::destructorSafeDestroyNSSReference()
>+{
>+  if (isAlreadyShutDown())
>+    return;
>+
>+  if (m_dcx)
>+    NSS_CMSDecoder_Cancel(m_dcx);
> }

hm, I'm not sure if this is the right call.  I seem to remember having crashes 
in other code when I tried using NSS_CMSDecoder_Cancel.  Will a call to 
NSS_CMSDecoder_Finish suffice here?

But since you use NSS_CMSDecoder_Cancel later on in the patch, I'll 
assume it's OK.

Regardless, should we NULL out m_dcx in this case?




>     } else if (!PK11_IsReadOnly(mCert->slot)) {
>+      SEC_DeletePermCertificate(mCert);
>+      CERT_DestroyCertificate(mCert);
>+    }
>+
>+      // ??? kaie: The following comment is old, and confuses me.
>+      //           It doesn't match what the code is doing.
>+      //           I don't see code that changes something to untrusted.
>+
>+      // -----[ old comment starting ]-------------
>       // If the cert isn't a user cert and it is on an external token, 
>       // then we'll just leave it as untrusted, but won't delete it 
>       // from the cert db.
>-      SEC_DeletePermCertificate(mCert);
>-    }
>+      // -----[ end of old comment  ]--------------

Reading the code closely, I believe this uses the semantic of NSS where
if a cert shows up as untrusted in your cert db and then it finds the same cert 
on an external token (ie, the built-ins), then NSS uses the cached trust 
setting to over-ride what the module (the built-ins in this case) says.  So 
if I take the CA cert from Evil Corp that is bundled in the built-ins and 
"delete" it from the Cert Manager, under the covers we mark it as untrusted 
and don't delete it from the permanent cert db so that next time you launch 
the cert manager, the cert will show up as untrusted again.

This is how we used to "delete" certs that were in the built-ins module.  Not 
sure if that's how it still works.

I believe your code still does that.

> NS_IMETHODIMP
> nsNSSCertificate::GetNickname(nsAString &_nickname)
> {
>+  if (isAlreadyShutDown())
>+    return NS_ERROR_NOT_AVAILABLE;
>+
>   const char *nickname = (mCert->nickname) ? mCert->nickname : "(no nickname)";
>   _nickname = NS_ConvertUTF8toUCS2(nickname);
off-topic, but this isn't localizable.


>+// ??? How should the result from this get freed?
>+// There is no function in NSS to free data structure CERTDERCerts
>+// Any iteration required?
>+// However, this appears to be a memory leak, not a NSS reference count leak
The CERTDERCerts returned from this method are allocated from the arena 
passed in.  So the CERTDERCerts go away when the arena goes away.

>   serviceURL = CERT_GetOCSPAuthorityInfoAccessLocation(aCert);
>+// Is it necessary to free serviceURL ??? I think it is
You ar right.  CERT_GetOCSPAuthorityInfoAccessLocation allocates
a new char* buffer and copies the location into it.

>   if (serviceURL) {
> 	url = ToNewUnicode(NS_ConvertUTF8toUCS2(serviceURL));
>   }
>@@ -914,6 +928,7 @@
>   nickname = aCert->nickname;
>   nn = ToNewUnicode(NS_ConvertUTF8toUCS2(nickname));
> 
>+// are we leaking nn and url here??? I think we are.
Not sure since I don't know if the ToNewUnicode will return you a new buffer
that lives past it's currect scope.

>-    ::NSS_Shutdown();
>+    mShutdownObjectList->shutdownAll();
>+    if (SECSuccess != ::NSS_Shutdown()) {
>+printf("=> NSS SHUTDOWN FAILURE\n");
>+    }
>+    else {
>+printf("=> nss shutdown =====>> OK <<=====\n");
>+    }
>   }
> 
Perhaps these print statements should be wrapped in debug blocks or assing
the appropriate logging macro?

>+// do we need to call   SECMOD_DestroyModuleList(list); ??? But it is not exported???
This is a question for someone on the NSS team like relyea.
Attachment #105076 - Flags: review+
*** Bug 176670 has been marked as a duplicate of this bug. ***
Blocks: 176666
Blocks: 175161
Blocks: 175159
Blocks: 175158
Blocks: 175157
Blocks: 175155
Blocks: 175154
Blocks: 175153
Attached patch Patch v3 (obsolete) — Splinter Review
> >+  void set(PRUint32 i, CERTCertificate *c)
> >+  {
> >+	if (i >= mSize)
> >+	  return;
> >+	
> >+	mCerts[i] = c;
> >+  }
> Depending on how one uses this class, this method can easily lead to leaks
> or releasing a reference on a cert too many times.
> 
> In your destructor, you call CERT_DestroyCertficate which will decrement
> the refCount, but you do not augment the refCount here.  You should
> either not release the reference in the destructor or add a reference here
> when assigning (making sure to release the reference to the old cert if there

> was already a cert assigned to that index.

Oops. Good catch, thanks.
I made the class failsafe now.
If you want, you could have another look at the class code.
If not I'll assume r=javi.


> >+  CERTCertificate *get(PRUint32 i)
> >+  {
> >+	if (i >= mSize)
> >+	  return nsnull;
> >+	
> >+	return mCerts[i];
> >+  }
> >+
> I guess this is fine if you make sure none of the callers then call
> CERT_DestroyCertificate on the value returned.

For enhanced safety, I now changed the function to addref and make the callers
destroy.


> >+void nsCMSDecoder::destructorSafeDestroyNSSReference()
> >+{
> >+  if (isAlreadyShutDown())
> >+	return;
> >+
> >+  if (m_dcx)
> >+	NSS_CMSDecoder_Cancel(m_dcx);
> > }
> 
> hm, I'm not sure if this is the right call.  I seem to remember having
crashes 
> in other code when I tried using NSS_CMSDecoder_Cancel.  Will a call to 
> NSS_CMSDecoder_Finish suffice here?
> But since you use NSS_CMSDecoder_Cancel later on in the patch, I'll 
> assume it's OK.

I think NSS code should not crash. NSS offers this function to cancel the
decoding.
I would prefer to not use finish in this situation, since finish produces 
additional objects that would require cleanup.

Let's assume NSS code works and does not crash.


> Regardless, should we NULL out m_dcx in this case?

Can't hurt. Done.


> >	} else if (!PK11_IsReadOnly(mCert->slot)) {
> >+	  SEC_DeletePermCertificate(mCert);
> >+	  CERT_DestroyCertificate(mCert);
> >+	}
> >+
> >+	  // ??? kaie: The following comment is old, and confuses me.
> >+	  //	       It doesn't match what the code is doing.
> >+	  //	       I don't see code that changes something to untrusted.
> >+
> >+	  // -----[ old comment starting ]-------------
> >	  // If the cert isn't a user cert and it is on an external token, 
> >	  // then we'll just leave it as untrusted, but won't delete it 
> >	  // from the cert db.
> >-	  SEC_DeletePermCertificate(mCert);
> >-	}
> >+	  // -----[ end of old comment	]--------------
> 
> Reading the code closely, I believe this uses the semantic of NSS where
> if a cert shows up as untrusted in your cert db and then it finds the same
cert 
> on an external token (ie, the built-ins), then NSS uses the cached trust 
> setting to over-ride what the module (the built-ins in this case) says.  So 
> if I take the CA cert from Evil Corp that is bundled in the built-ins and 
> "delete" it from the Cert Manager, under the covers we mark it as untrusted 
> and don't delete it from the permanent cert db so that next time you launch 
> the cert manager, the cert will show up as untrusted again.
> 
> This is how we used to "delete" certs that were in the built-ins module.  Not

> sure if that's how it still works.
> 
> I believe your code still does that.

You are right, and our code is indeed still doing that.
With your comments, I slowly begin to understand the comment,
but I think this is confusing.
I have replaced it with the new comment:
  // If the list of built-ins does contain a non-removable
  // copy of this certificate, our call will not remove
  // the certificate permanently, but rather remove all trust.


> > NS_IMETHODIMP
> > nsNSSCertificate::GetNickname(nsAString &_nickname)
> > {
> >+  if (isAlreadyShutDown())
> >+	return NS_ERROR_NOT_AVAILABLE;
> >+
> >   const char *nickname = (mCert->nickname) ? mCert->nickname : "(no
nickname)";
> >   _nickname = NS_ConvertUTF8toUCS2(nickname);
> off-topic, but this isn't localizable.

Good point, I have filed bug 178881.


> >+// ??? How should the result from this get freed?
> >+// There is no function in NSS to free data structure CERTDERCerts
> >+// Any iteration required?
> >+// However, this appears to be a memory leak, not a NSS reference count
leak
> The CERTDERCerts returned from this method are allocated from the arena 
> passed in.  So the CERTDERCerts go away when the arena goes away.

Ah, ok, thanks. Removing comment.


> >   serviceURL = CERT_GetOCSPAuthorityInfoAccessLocation(aCert);
> >+// Is it necessary to free serviceURL ??? I think it is
> You ar right.  CERT_GetOCSPAuthorityInfoAccessLocation allocates
> a new char* buffer and copies the location into it.

Ok. Freeing with PORT_Free as mentioned in the implementation of the function.


> >@@ -914,6 +928,7 @@
> >   nickname = aCert->nickname;
> >   nn = ToNewUnicode(NS_ConvertUTF8toUCS2(nickname));
> > 
> >+// are we leaking nn and url here??? I think we are.
> Not sure since I don't know if the ToNewUnicode will return you a new buffer
> that lives past it's currect scope.

Yeah, ToNewUnicode needs to convert and will give you a completely new buffer.
I now have added code to destroy nn and url, according to comments in the
implementation of ToNewUnicode.


> >-	::NSS_Shutdown();
> >+	mShutdownObjectList->shutdownAll();
> >+	if (SECSuccess != ::NSS_Shutdown()) {
> >+printf("=> NSS SHUTDOWN FAILURE\n");
> >+	}
> >+	else {
> >+printf("=> nss shutdown =====>> OK <<=====\n");
> >+	}
> >   }
> > 
> Perhaps these print statements should be wrapped in debug blocks or assing
> the appropriate logging macro?

Right. Changed.


> >+// do we need to call   SECMOD_DestroyModuleList(list); ??? But it is not
exported???
> This is a question for someone on the NSS team like relyea.

I removed the comment and sent the question to the NSS team.
Attachment #105076 - Attachment is obsolete: true
Comment on attachment 105464 [details] [diff] [review]
Patch v3

Carrying forward r=javi since we agree.
Attachment #105464 - Flags: review+
+  void shutdown(calledFromType calledFrom)
+  {
+    if (!mAlreadyShutDown) {
+      if (calledFromObject == calledFrom) {
+        nsNSSShutDownList::forget(this);
+      }
+      if (calledFromList == calledFrom) {
+        virtualDestroyNSSReference();
+      }
+      mAlreadyShutDown = PR_TRUE;
+    }
+  }

so, what if shutdown is called from the main application thread, and some other
thread is trying to access one of these objects?  suppose the other thread finds
that the object is not yet shutdown?

take nsHash for example.  suppose one thread is calling nsHash::Begin while on
the main thread nsHash::virtualDestroyNSSReference() is called.  now, there
doesn't appear to be any mutex protecting m_ctxt.  so, where's the thread safety
in all of this?  is it there, and i'm just missing it?
is it that all of these objects inheriting from nsNSSShutDownObject are only
accessed from the main thread?
> is it that all of these objects inheriting from nsNSSShutDownObject are only
> accessed from the main thread?

They are accessed from multiple threads.


> so, what if shutdown is called from the main application thread, and some other
> thread is trying to access one of these objects?  suppose the other thread finds
> that the object is not yet shutdown?

In that scenario we crash. My assumption was, this would not happen, because
when we shut down at a time when all other activity has already stopped.


If after switching the network to offline, I still must assume that mutliple
threads execute our data structures, we indeed need much more locking.
kai: ok, so provided necko kills off all the sockets before NSS shutdown, things
will work.  but, i thought part of this work was to remove the dependency on
necko doing the right thing.  though that would greatly increase the complexity
of this patch.  so which way do you want to go? ;-)
Well, there are two things that I can try to protect PSM against the outside world

The patch as it is in here does a first step. It protects PSM from references
that do not get freed.

But you are right, it does not protect from ongoing activity during shutdown, it
actually adds the new requirement that no other activity is going on during
shutdown, which I assumed would be already true.


I'm slowly getting convinced that I indeed should work on the solution that is
more complex and adds more locking.
maybe.. i'm not entirely convinced either.  so, is the plan to land this patch
and then decide what to do about the multithread issues?
After thinking more about it, I want to do the additional locking.

For a while I was thinking about an approach that adds a lock to each of PSM's
tracked crypto instances, being unhappy with the overhead and complexity. But
now I think I have a simpler idea.

I would like to change the lock in nsNSSShutdownList from a simple lock to a
recursive monitor.

Each implementation method uses a nsAutoMonitor instance to lock the monitor.
The monitor implementation has the nice property that it is not necessary to
check whether the lock is already held by the current thread. Instead it will
automatically increment the lock count.

This makes the locking approach pretty straight forward, as we don't need to
check whether a function could be called from another function.

It also makes sure there will be no deadlocks if the implementation of tracked
class A causes tracked class B to get destroyed. Since it happens on the same
thread, the removal from the shutdown tracking list will succeed thanks to the
recursive monitor.

In short, as long as crypto activity is momentarily going on, a global shutdown
request will wait unless it's over.

One more thing that needs attention is the possibility that NSS functions can
callback into Mozilla to show up a password prompt. Unfortunately those prompts
are window modal, not application modal. While we show a password prompt, I fear
the user could switch to another window and select shutdown/switch profile.
That's maybe an unlikely user action, but it would result in a deadlock.
Therefore I think we need to retest the "is shut down" condition whenever we
return from a function that could have possibly invoked the callback.
Actually, I see a problem. I think we should not try to pull down NSS while it
is currently calling back into us. That looks tricky.
Patch v4 increased by 35%, but I think the complexity is not much higher.

We now have two independent things that need protection.


1) The list of all tracked objects

This is what already existed in the previous patch and the logic is mostly
unchanged.
A lock it used to make sure access to list is serialized.
The lock is only being acquired for very short moments, during which the list is
modified, but released for any other actions.


2) The NSS activity state

New in this patch.

When the shutdown iteration is about to start, it must make sure that no
activity is going on on other threads while it NSS frees resources.

Initially I thought a "monitor" would be an appropriate synchronization
mechanism, but I don't like the property that at any given time only one thread
can have the monitor locked.

Ideally, it should be known whether any thread is currently executing NSS activity,
but activity on one thread should not block activity on other threads.

So while a monitor, locked from any tracked PSM activity, would allow to find
out whether NSS activity is going on,
it would also prevent simultaenous PSM activity on separate threads.

It would be nice to have a counter for the activity tracking.

Whenever activity is going on, the counter would get increased,
and decreased when a thread leaves the activity.

When the shutdown list is about to execute, it would wait until the counter
reaches zero,
meaning no activity is going on.
When the count arrives at zero, it would make sure that while the shutdown is in
progress, 
activity on other threads is not possible - except removing objects from the list.

The new patch implements what I describe above. I will paste the pseudo-code
logic in the next comment.


3) Cancel open SSL connections

I realized that our new locking and tracking approach will allow us to solve bug
94557.
When the user interactively makes the decision that a security token should be
switched to the "logged out" state,
we could do the same as for the NSS shutdown preparation. We iterate over all
crypto objects, which might have
a cached authentication attached.

Ideally, we could fine tune this later, and only clean up those crypto objects
we actually need to clean,
potentially only the list of open SSL socket file descriptors.

However, I think the current approach to pull down everything is a good first step,
and in addition it is a very good way to test the correctness of this patch,
because we can force the logout while there is still crypto acitivty going on!


4) User Interface deadlock risk

This refers to the problem I mentioned in the previous comment 14.

The following can happen:
- a  mail window is open
- a browser window is open
- the user starts to fetch mail over SSL, and possibly the "security master
password" prompt comes up
- this dialog blocks the mail window
- now the user selects the browser window and uses the menu to instruct a token
logout

With the current patch, this produces a deadlock.

I know how I can circumvent the deadlock, we can use a maximum timeout time for
waiting for the used condition variable,
and if nobody objects, I will implement that rather simple change for the next
follow up patch.

But first, I would like to mention my motivation:
While the mail window is blocked, we are effectively in the middle of NSS activity.
NSS has called back into us.
I believe, we should never try to bring down all of NSS while is has called back
into us.

Instead, I would like to make the shutdown attempt wait for at most 2 seconds
for the "all NSS activity has stopped".

I think this timeout value can be flexible, when called from the UI, 2 seconds
seems reasonable, while during the shutdown, 
we might want to allow a larger value, maybe 10 seconds.

I think we will see this kind of problem only if the user ignores a callback
from NSS, like the master password prompt,
or the selection of a SSL certificate for client authentication.


5) Services not yet disabled

While I worked on the patch I realized that there is one more thing that we
should possibly solve better.
Depending on the application, there might be a short or longer phase during
which NSS is brought to the "down" state.

During this state, the application must not call any crypto services to iterate
over certificates or access modules etc.
If it breaks the rule and does anyway, we will crash.

So, in addition to what is now implemented, the activity tracking and tracking
of objects that carry NSS resources,
we should extend the implementation functions of services, to also check the
current global state of NSS "up or down",
and refuse fulfilling the service while we are down.
Pseudo code of the tracking and locking logic:

lock PSMActivityStateLock protects
  int ActivityCounter, which is initially 0
  PRThread* PSMRestrictedThread, which is initially nsnull
event PSMActivityChanged
this is a condition variable protected by a lock and the ability to post the
event when the counter or allowance state changes


lock ListOfTrackedPSMObjectsLock protects
  list ListOfTrackedPSMObjects


Whenever code enters into an implementation function of a tracked PSM object:
  if the method is a ctor
    call nsNSSShutDownObject ctor, to register with the list
    (no explicit called needed, base class init happens first anyway)

  acquire PSMActivityStateLock
  if PSMRestrictedThread != null && PSMRestrictedThread != PR_GetCurrentThread()
    wait for PSMActivityChanged
      (this releases PSMActivityStateLock during the wait)
    retest condition
  once the condition is reached, we have permission to go ahead
  increment ActivityCounter
  release PSMActivityStateLock

  check self's alreadyShutDown flag, and return if it is already shut down

  if this method is a ctor
    do the init
  else if this method is a dtor
    call destructorSafeDestroyNSSReference
    call shutdown(calledFromObject)
  else
    do the actual activity


Whenever code leaves an implementation function or a ctor or a dtor of a tracked
PSM object
  acquire PSMActivityStateLock
  decrement ActivityCounter
  if ActivityCounter reaches zero, 
    NotifyAll PSMActivityChanged
  release PSMActivityStateLock


destructorSafeDestroyNSSReference
  check self's alreadyShutDown flag, and return if it is already shut down
  does not do any locking
  this function should only be called from the destructor, which must have done
the locking already
  or coming from virtualDestroy, originating from the list, which has also done
the locking already  
  !no calls to NSS are allowed that could possibly result in UI callback to PSM
    (because of that, I removed the delayed deletion of certificates in
nsNSSCertificate and change it to happen immediately)


virtualDestroyNSSReference
  is only called from the list
  do not do any locking, only call destructorSafeDestroyNSSReference


Whenever a tracked object gets created,
  acquire ListOfTrackedPSMObjectsLock
  add to ListOfTrackedPSMObjects
  release ListOfTrackedPSMObjectsLock


Whenever a tracked object gets destroyed
  acquire ListOfTrackedPSMObjectsLock
  remove from ListOfTrackedPSMObjects
  release ListOfTrackedPSMObjectsLock


When it is time to shut down NSS, or the application requests explicitly to
reset the security state, and it is necessary to mark all PSM objects as
"already shut down"
  acquire PSMActivityStateLock
  if 0 < ActivityCounter
    wait for PSMActivityChanged
      (this releases PSMActivityStateLock during the wait)
    retest condition
  once the condition is reached, we have the exclusive lock
  set PSMRestrictedThread = PR_GetCurrentThread()
    this will cause all other threads to idle that want to do
    PSM activity, while we are busy with cleaning up one object
  release PSMActivityStateLock
  do
    acquire ListOfTrackedPSMObjectsLock
    entry = pull first entry from ListOfTrackedPSMObjects
    release ListOfTrackedPSMObjectsLock
    shutdown(entry)
      which calls virtualDestroyNSSReference
  while 
    were able to pull an entry from ListOfTrackedPSMObjects
  acuire PSMActivityStateLock
  set PSMRestrictedThread = nsnull
  NotifyAll PSMActivityChanged
  release PSMActivityStateLock


Optionally, for this last block (not yet implemented):
  pass in a timeout
  give up and fail to clean up if the "no more activity going on" state was not
reached during the timeout period
Attached patch Patch v4 (my snapshot 23) (obsolete) — Splinter Review
Patch v4 as explained in the previous two comments.
Attachment #105464 - Attachment is obsolete: true
To solve the problem with the UI being active in a callback from NSS, making the
requested shutdown impossible, we could do the following:

Let's assume the request to bring up a callback UI arrives on thread A (which
might either be the UI thread or the socket transport thread, or...).

The function that is going to bring up the UI is implemented in PSM.

Instead of making a blocking call to show the UI, PSM could spawn a new thread B.

B would have the information where to store the data obtained from the user with
the dialog. Once the dialog closes, B notifies a condition variable (to wake up A).

Under normal circumstances, A would idle and wait for the condition variable
signal to arrive from B.

However, the same condition variable would be used to signal another event.

While the UI is shown and A is blocking, let's say the NSS shutdown request
occurs (which happens on thread C).

Right now, with patch v4, C would block forever. With the suggestion of a
timeout, C would eventually fail to do the shutdown.

With the new suggestion, thread C would be able to detect that currently a
blocking UI operation is in progress.

It would set some flag that we do not longer want the result from the UI.
It would talk to the variables maintained by B and make sure, that any
information returned from the UI will get ignored, and that thread B will quit
silently, once the UI closes.

Next, C would set another flag that NSS shutdown is initiated, and that all
callabck UI is currently prohibited.

Next, C notifies the condition variable, waking up A. A will detect that B is
still running, but the NSS shutdown requested flag has been set. It will detach
itself from waiting for B. It will return, which goes back the road of clearing
all NSS activity counters.

C can continue to wait for all activity to stop, and that will soon succeed, as
A canceled the operation.
I implemented what I proposed in the previous comment - but it results in a lot
of trouble, namely deadlocks. I no longer like the idea, but have a new idea.

I suggest we disallow profile switching while the user has not yet completed
such a security prompt.

The problem only occurs when we show the prompt to enter the master password
while the user is attempting a crypto operation, and the user switches to
another open window to switch the profile or shut down the application.

I suggest, when the user attempts to switch the profile, we disallow it. We can
bring up a message and inform the user with a message like "the selected
operation can not be completed at the current time, because the application
needs your attention in another window".

I'm not yet sure what we should do when the user requests to quit the
application. We could show the same message and not allow to quit. Or, if we can
detect the application is going down for sure, we could decide to not shut down
the crypto library.
Depends on: 181230
After having talked to Darin yesterday, here is my current understanding:

My testcases to test the patch for this bug include a scenario, where the SSL
sockets have not been closed at the time we receive the request to switch the
profile.

I tried to work around it, using an idea from Nelson, to swap the contents of
the lower NSPR data structure.

However, Darin warned me correctly that this swapping is a risk, since we have
trouble to lock access to the socket from outside the socket transport thread.

So we filed bug 181230 and Darin wants to help us to ensure that all sockets
will have received their close call at the time we are asked to switch the profile.


So, I'm a bit blocked right now waiting for the fix to 181230, and I wonder what
this fix will mean for the approach suggested to fix this bug.


It will of course mean we can avoid the hack to the NSPR socket.

In a perfect world, where all components outside of PSM release their references
to PSM prior to the request to switch profile, we would not have to do anything
but fix our leaks (those changes also included in this bug).

However, PSM will not be able to control whether the outside world is behaving
perfectly and free references to PSM in time.

Because of that, I would like to suggest we still try to get this tracking and
early cleanup in, given the fact that I have already worked on it some time :)


It has been suggested that we should test carefully.

I will definitively produce test builds for all three major platforms once we
have the fix for 181230, but I think we should wait for it first.

I will prepare a patch that excludes the early socket shutdown using the NSPR
hack, assuming we don't have to care for that, but will do everything else for
early resources cleanup.
John, this information is mostly for you and for everybody who wants to test.

I have a patch that is in its final state, as far as I think changes are
necessary to PSM.
The patch is large and makes some tricky changes to PSM.

Without the patch for bug 181230, the intention of this bug has not yet reached:
There are still scenarios where profile switching fails.

However, it might make sense to do some early testing whether the large patch to PSM
introduces other regressions.

John, could you please give the test builds a try?
  http://www.kuix.de/mozilla/177260/

The test build has a new feature - from 97622.
The Tools menu has a new command "switch profile".

Known problems with the test build:
- if you have used https, and you click very soon after having used it,
  Mozilla might report that it detected an internal failure
  (namely that sockets have not been cleaned up yet, bug 181230).
  
  You can provoke this situation by using two windows, make sure
  your warnings when entering a secure site are on,
  and go to a https site. When the warning shows up, go
  to another window and try to switch the profile.
  
  If, however, you close all shown prompts, you wait until traffic
  seems to be finished, and wait 20 seconds after using https, 
  shutting down and profile switching should work.

- if you have used imap/ssl during your session, shutting down or
  profile switching does very often fail for me, even if you wait a long time 
  after closing mail. (It seems mail is not cleaning its sockets up at all, 
  hopefully will also be fixed by 181230?)

So I'd suggest the following for testing:
- use https
- for now do not use IMAP/SSL or SMTP/SSL, use the plain mail protocols
- use any other kind of crypto operation
- have 2 open browser windows and in one of them,
  try to open any kinds of UI prompts that are triggered by PSM,
  like password prompts, p12 import confirmation messages,
  https client auth certificate selectors, etc.
  Try to go to the other open browser window and tell it to switch
  profile.

Whatever you do, you should never see the application hanging.

An expected new behaviour is: While you have a PSM password prompt etc.
in one window, and you try to switch profile, PSM will tell you that
switching is currently not possible, because the operation in the other
window must be completed first.

You can also test whether profile switching works realiable
(currently only possible to test when not having used SSL with mail).

If you have used https in your session, wait for > 20 seconds.
Then try to switch the profile.

If PSM allows you to do the switch, that's good.

This bug 177260 blocks a couple of bugs, describing crypto operations,
that cause a later attempt to switch the profile to fail
(in my test environment without this patch).
You could try whether you are able to switch profile after those operations.

Whatever you do, you should never see that the profile switch operation
is about to start but fails with a message like "failed to initialize the
security component".

If you never see a hang, this first phase of testing would be succesfull.

The primary risk scenario for producing hangs is doing multiple operations
at once, like having a password prompt open in one window, 
and trying to do another operation in another window.


One more thing for testing. I have enhanced the command
  Tools/Password manager/Logout
When you execute that (regardless whether any I/O activity is currently active
or any prompts are shown), not only will the token get logged out, but also
will it mark all sockets as invalidated, and I/O should stop very soon.
(Note that it doesn't help for profile switching, because the underlying
resources might still be held).
Terry, this is the subset you asked me to provide.
Attached patch Patch v5 (Kai's snapshot 43) (obsolete) — Splinter Review
Full patch
Attachment #106240 - Attachment is obsolete: true
For reference: The test builds were created using a trunk build as of two days
ago, patch v5 from this bug, plus the code from bug 97622 patch v8, plus the
fixes for NSS bugs 177366, 176667, 179799.
Comment on attachment 107151 [details] [diff] [review]
Subset of patch that contains only the tracking helper classes

I found a bug
Attachment #107151 - Attachment is obsolete: true
Attachment #107152 - Attachment is obsolete: true
Blocks: 149834
Blocks: 182803
This is a newer snapshot of the patch. It has all the logic from the previous
patches, it works around bug 181230  by using a NSPR trick that should be
threadsafe and it fixes crasher bug 182803. However, there are concerns that it
adds complexity to PSM that we should avoid.

Plan A is: We wait for Darin's general fix for bug 181230, and once we have it,
we'll test whether the complexity proposed with this patch can be avoided.

It is my understanding that it does not make sense to do full testing of this
patch or reviewing this patch, unless we know whether a fix for bug 181230 is
sufficient or not.

If a fix for 181230 is not sufficient, plan B would be: take this patch.
This is an alternative patch that was suggested. It does not introduce the
track-all-nss-resources-and-allow-shutdown-early complexity. I only fixes the
leaks, tracks whether we have open sockets or not, whether we have active
blocking UI or not, and vetos profile changes if there are.
This patch would assume that bug 181230 gets fixed, tearing down sockets, which
Darin is working on.

However, this patch still does not solve all the problems (as the complex patch
v6 would do).

I can reproduce a crash using this patch, without accessing any SSL.
It only requires having open certificate manager or having read a signed email
message.

The problem is the JavaScript environment still holds a reference to a
nsNSSCertificate object, which gets cleaned up after we have shut down NSS,
which is no longer allowed at this time, and we are crashing.
Comment on attachment 108029 [details] [diff] [review]
Delayed - Maybe Later - Hammer Approach but very complex Patch v6  (snapshot 47)

I'm renaming this patch and delaying it. As of today, I don't see support to
get this landed. We can consider it again at a later time, if alternative
solutions don't work.
Attachment #108029 - Attachment description: Patch v6 (snapshot 47) → Delayed - Maybe Later - Hammer Approach but very complex Patch v6 (snapshot 47)
Comment on attachment 108846 [details] [diff] [review]
Partial Patch / Leaks and UI tracking only / v1

I'm renaming this patch and want to try to get it checked it in. From my
understanding, it reflects the portions of the changes everybody agrees on. (It
does not contain the "track all NSS resources as used by PSM" nor the "early
shutdown" complexity).
Attachment #108846 - Attachment description: sidetrack patch v1 → Partial Patch / Leaks and UI tracking only / v1
In comment 27 I mention that a fix for bug 181230 is still required, but will
not be sufficient. However, this "is not sufficient" statement applies to the
profile switching approach as implemented in bug 97622.

I think it makes sense to give separate status updates for the Mozilla browser
and an embedding environment.


Let's look at a simpler embedding environment first:

By landing the leak fixes and UI tracking (attachment 108846 [details] [diff] [review]) in combination
with a complete fix for bug 181230, we should have a working base environment
for profile switching.

With having those fixes, if the embedding application releases all references to
PSM resources prior to attempting the profile switch, everything should be fine.


However, profile switching in Mozilla will be more difficult. As I understand
it, the patch in bug 97622 is not sufficient, because JavaScript and delayed
destruction of references is involved. In my tests, as I explained in comment
27, I saw resources hold by the UI (security status) are getting destructed
after the profile switch was requested. I don't see a simple solution to this.
We either need to find a way to tear down completely all resources held by
Mozilla to PSM prior to switching the profile, or we need to take a complex
patch like the one I suggested in attachment 108029 [details] [diff] [review].


From my point of view this means the following:
- let's land the leak fixes now
- let's work on bug 181230
- if the above is done, reenable NSS shutdown, tracked in bug 187501
- if all of the above is done, we are ready to go back to profile switching
testing - however, we can do testing only in those (embedding) environments
where we don't have to deal with delayed JavaScript destruction


Being unable to do profile switching correctness testing with Mozilla is
inconvinient. It would be great if we could find ways to enhance the patch in
bug 97622 to be independent of the delayed JavaScript destruction, i.e., enhance
the patch to tear down all open windows and execute Garbage Collection in all
active JavaScript contexts prior to the profile switching attempt.
Comment on attachment 108846 [details] [diff] [review]
Partial Patch / Leaks and UI tracking only / v1

As stated in the previous comments, I would like to land this patch.

Javi, Darin, can you please review?
Attachment #108846 - Flags: superreview?(darin)
Attachment #108846 - Flags: review?(javi)
Comment on attachment 108846 [details] [diff] [review]
Partial Patch / Leaks and UI tracking only / v1

r=javi
Attachment #108846 - Flags: review?(javi) → review+
Blocks: 187501
Comment on attachment 108846 [details] [diff] [review]
Partial Patch / Leaks and UI tracking only / v1

sr=darin
Attachment #108846 - Flags: superreview?(darin) → superreview+
Removing "make PSM leak tolerant" from bug summary.
Summary: Make PSM XPCOM leak tolerant + fix leaks → Fix known leaks in PSM, track blocking PSM UI, track open SSL sockets
Patch checked in, marking fixed.
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Verified.
Status: RESOLVED → VERIFIED
Product: PSM → Core
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: