Closed
Bug 177260
Opened 22 years ago
Closed 22 years ago
Fix known leaks in PSM, track blocking PSM UI, track open SSL sockets
Categories
(Core Graveyard :: Security: UI, defect, P1)
Tracking
(Not tracked)
VERIFIED
FIXED
psm2.4
People
(Reporter: KaiE, Assigned: KaiE)
References
Details
(Keywords: topembed+)
Attachments
(2 files, 6 obsolete files)
153.92 KB,
patch
|
Details | Diff | Splinter Review | |
57.97 KB,
patch
|
javi
:
review+
darin.moz
:
superreview+
|
Details | Diff | Splinter Review |
We tried to make PSM succeed in shutting down NSS, by fixing leaks and shutting down all network activity prior to attempting a shutdown or profile switch. I still think that shutting down all network activity first is a good idea, because it should at least ensure, that at the time we are going doing our cleanup, all activity happening on other threads should have stopped. However, our assumption that shutting down the network is sufficient, in order to bring down all resources to PSM, was wrong. I can see, when the profile switch or shutdown notification is received, there is still a HTML object alive, which references a HTTP channel, holding a reference to a SSL status info, and by that way to a certificate. When I saw that, I was finally convinced that PSM should not depend on external Mozilla code to always behave correct, and go with a failsafe approached, that also has been suggested by Darin. The idea is, all living PSM C++ wrapper instances, carrying references to NSS data structures, will be tracked in a list. When a shutdown or profile switch is requested, all entries in the list will free their NSS references. The C++ instances itself will stay alive, in order to avoid crashes, in case external Mozilla code will do future calls to PSM's wrapper objects. However, all those objects will switch to an AlreadyShutDown mode, and will fail gracefully when interfaces are called after the shutdown. This approach was necessary, and it alone fixed two bugs I was able to reproduce. However, besides this we still must make sure that we don't have leaks internal in PSM. Freeing the references being held externally to PSM still doesn't help, if PSM or NSS have leaks on their own. I have been spending a couple of days tracing PSM and NSS. I found 20+ leaks. The attached patch implements the approach described above, and all the leaks I have found so far. I have found many leaks using the new code from bug 175320, and it helped me to find shutdown failure bugs, as they can currently be seen as dependent bugs of bug 157937. The attached patch fixes all of them and some more bug that I have found (but not filed yet). While the patch works for me on Linux, it is yet untested on other platforms. It also contains some changes to NSS that still need discussion. I guess we will probably want to track the NSS changes in (a) separate bug(s), but as a starter I'm including them here. In addition, the patch contains several sections marked with "???". If you could have a look at those places, maybe you could give me some answers to those questions. Reviews and comments welcome.
Assignee | ||
Comment 1•22 years ago
|
||
Assignee | ||
Updated•22 years ago
|
Assignee | ||
Updated•22 years ago
|
Assignee | ||
Comment 2•22 years ago
|
||
New patch: - still depends on NSS cleanup, but does no longer contain changes to NSS, they are now being tracked in external bug 176667, bug 176799, 177366 - works around bug 177201 as suggested in that bug - has some typecasts added that are needed for compilation on other platforms - adds the changes required for the Mac buildsystem
Attachment #104474 -
Attachment is obsolete: true
Assignee | ||
Updated•22 years ago
|
Blocks: profile-switching
Comment 3•22 years ago
|
||
Comment on attachment 105076 [details] [diff] [review] Patch v2 Here are my comments. Only the first comment I consider a must fix. Everything else is take it or leave it. Fix the first issue and r=javi >+ void set(PRUint32 i, CERTCertificate *c) >+ { >+ if (i >= mSize) >+ return; >+ >+ mCerts[i] = c; >+ } Depending on how one uses this class, this method can easily lead to leaks or releasing a reference on a cert too many times. In your destructor, you call CERT_DestroyCertficate which will decrement the refCount, but you do not augment the refCount here. You should either not release the reference in the destructor or add a reference here when assigning (making sure to release the reference to the old cert if there was already a cert assigned to that index. >+ >+ CERTCertificate *get(PRUint32 i) >+ { >+ if (i >= mSize) >+ return nsnull; >+ >+ return mCerts[i]; >+ } >+ I guess this is fine if you make sure none of the callers then call CERT_DestroyCertificate on the value returned. >+void nsCMSDecoder::destructorSafeDestroyNSSReference() >+{ >+ if (isAlreadyShutDown()) >+ return; >+ >+ if (m_dcx) >+ NSS_CMSDecoder_Cancel(m_dcx); > } hm, I'm not sure if this is the right call. I seem to remember having crashes in other code when I tried using NSS_CMSDecoder_Cancel. Will a call to NSS_CMSDecoder_Finish suffice here? But since you use NSS_CMSDecoder_Cancel later on in the patch, I'll assume it's OK. Regardless, should we NULL out m_dcx in this case? > } else if (!PK11_IsReadOnly(mCert->slot)) { >+ SEC_DeletePermCertificate(mCert); >+ CERT_DestroyCertificate(mCert); >+ } >+ >+ // ??? kaie: The following comment is old, and confuses me. >+ // It doesn't match what the code is doing. >+ // I don't see code that changes something to untrusted. >+ >+ // -----[ old comment starting ]------------- > // If the cert isn't a user cert and it is on an external token, > // then we'll just leave it as untrusted, but won't delete it > // from the cert db. >- SEC_DeletePermCertificate(mCert); >- } >+ // -----[ end of old comment ]-------------- Reading the code closely, I believe this uses the semantic of NSS where if a cert shows up as untrusted in your cert db and then it finds the same cert on an external token (ie, the built-ins), then NSS uses the cached trust setting to over-ride what the module (the built-ins in this case) says. So if I take the CA cert from Evil Corp that is bundled in the built-ins and "delete" it from the Cert Manager, under the covers we mark it as untrusted and don't delete it from the permanent cert db so that next time you launch the cert manager, the cert will show up as untrusted again. This is how we used to "delete" certs that were in the built-ins module. Not sure if that's how it still works. I believe your code still does that. > NS_IMETHODIMP > nsNSSCertificate::GetNickname(nsAString &_nickname) > { >+ if (isAlreadyShutDown()) >+ return NS_ERROR_NOT_AVAILABLE; >+ > const char *nickname = (mCert->nickname) ? mCert->nickname : "(no nickname)"; > _nickname = NS_ConvertUTF8toUCS2(nickname); off-topic, but this isn't localizable. >+// ??? How should the result from this get freed? >+// There is no function in NSS to free data structure CERTDERCerts >+// Any iteration required? >+// However, this appears to be a memory leak, not a NSS reference count leak The CERTDERCerts returned from this method are allocated from the arena passed in. So the CERTDERCerts go away when the arena goes away. > serviceURL = CERT_GetOCSPAuthorityInfoAccessLocation(aCert); >+// Is it necessary to free serviceURL ??? I think it is You ar right. CERT_GetOCSPAuthorityInfoAccessLocation allocates a new char* buffer and copies the location into it. > if (serviceURL) { > url = ToNewUnicode(NS_ConvertUTF8toUCS2(serviceURL)); > } >@@ -914,6 +928,7 @@ > nickname = aCert->nickname; > nn = ToNewUnicode(NS_ConvertUTF8toUCS2(nickname)); > >+// are we leaking nn and url here??? I think we are. Not sure since I don't know if the ToNewUnicode will return you a new buffer that lives past it's currect scope. >- ::NSS_Shutdown(); >+ mShutdownObjectList->shutdownAll(); >+ if (SECSuccess != ::NSS_Shutdown()) { >+printf("=> NSS SHUTDOWN FAILURE\n"); >+ } >+ else { >+printf("=> nss shutdown =====>> OK <<=====\n"); >+ } > } > Perhaps these print statements should be wrapped in debug blocks or assing the appropriate logging macro? >+// do we need to call SECMOD_DestroyModuleList(list); ??? But it is not exported??? This is a question for someone on the NSS team like relyea.
Attachment #105076 -
Flags: review+
Assignee | ||
Comment 4•22 years ago
|
||
*** Bug 176670 has been marked as a duplicate of this bug. ***
Assignee | ||
Comment 5•22 years ago
|
||
> >+ void set(PRUint32 i, CERTCertificate *c) > >+ { > >+ if (i >= mSize) > >+ return; > >+ > >+ mCerts[i] = c; > >+ } > Depending on how one uses this class, this method can easily lead to leaks > or releasing a reference on a cert too many times. > > In your destructor, you call CERT_DestroyCertficate which will decrement > the refCount, but you do not augment the refCount here. You should > either not release the reference in the destructor or add a reference here > when assigning (making sure to release the reference to the old cert if there > was already a cert assigned to that index. Oops. Good catch, thanks. I made the class failsafe now. If you want, you could have another look at the class code. If not I'll assume r=javi. > >+ CERTCertificate *get(PRUint32 i) > >+ { > >+ if (i >= mSize) > >+ return nsnull; > >+ > >+ return mCerts[i]; > >+ } > >+ > I guess this is fine if you make sure none of the callers then call > CERT_DestroyCertificate on the value returned. For enhanced safety, I now changed the function to addref and make the callers destroy. > >+void nsCMSDecoder::destructorSafeDestroyNSSReference() > >+{ > >+ if (isAlreadyShutDown()) > >+ return; > >+ > >+ if (m_dcx) > >+ NSS_CMSDecoder_Cancel(m_dcx); > > } > > hm, I'm not sure if this is the right call. I seem to remember having crashes > in other code when I tried using NSS_CMSDecoder_Cancel. Will a call to > NSS_CMSDecoder_Finish suffice here? > But since you use NSS_CMSDecoder_Cancel later on in the patch, I'll > assume it's OK. I think NSS code should not crash. NSS offers this function to cancel the decoding. I would prefer to not use finish in this situation, since finish produces additional objects that would require cleanup. Let's assume NSS code works and does not crash. > Regardless, should we NULL out m_dcx in this case? Can't hurt. Done. > > } else if (!PK11_IsReadOnly(mCert->slot)) { > >+ SEC_DeletePermCertificate(mCert); > >+ CERT_DestroyCertificate(mCert); > >+ } > >+ > >+ // ??? kaie: The following comment is old, and confuses me. > >+ // It doesn't match what the code is doing. > >+ // I don't see code that changes something to untrusted. > >+ > >+ // -----[ old comment starting ]------------- > > // If the cert isn't a user cert and it is on an external token, > > // then we'll just leave it as untrusted, but won't delete it > > // from the cert db. > >- SEC_DeletePermCertificate(mCert); > >- } > >+ // -----[ end of old comment ]-------------- > > Reading the code closely, I believe this uses the semantic of NSS where > if a cert shows up as untrusted in your cert db and then it finds the same cert > on an external token (ie, the built-ins), then NSS uses the cached trust > setting to over-ride what the module (the built-ins in this case) says. So > if I take the CA cert from Evil Corp that is bundled in the built-ins and > "delete" it from the Cert Manager, under the covers we mark it as untrusted > and don't delete it from the permanent cert db so that next time you launch > the cert manager, the cert will show up as untrusted again. > > This is how we used to "delete" certs that were in the built-ins module. Not > sure if that's how it still works. > > I believe your code still does that. You are right, and our code is indeed still doing that. With your comments, I slowly begin to understand the comment, but I think this is confusing. I have replaced it with the new comment: // If the list of built-ins does contain a non-removable // copy of this certificate, our call will not remove // the certificate permanently, but rather remove all trust. > > NS_IMETHODIMP > > nsNSSCertificate::GetNickname(nsAString &_nickname) > > { > >+ if (isAlreadyShutDown()) > >+ return NS_ERROR_NOT_AVAILABLE; > >+ > > const char *nickname = (mCert->nickname) ? mCert->nickname : "(no nickname)"; > > _nickname = NS_ConvertUTF8toUCS2(nickname); > off-topic, but this isn't localizable. Good point, I have filed bug 178881. > >+// ??? How should the result from this get freed? > >+// There is no function in NSS to free data structure CERTDERCerts > >+// Any iteration required? > >+// However, this appears to be a memory leak, not a NSS reference count leak > The CERTDERCerts returned from this method are allocated from the arena > passed in. So the CERTDERCerts go away when the arena goes away. Ah, ok, thanks. Removing comment. > > serviceURL = CERT_GetOCSPAuthorityInfoAccessLocation(aCert); > >+// Is it necessary to free serviceURL ??? I think it is > You ar right. CERT_GetOCSPAuthorityInfoAccessLocation allocates > a new char* buffer and copies the location into it. Ok. Freeing with PORT_Free as mentioned in the implementation of the function. > >@@ -914,6 +928,7 @@ > > nickname = aCert->nickname; > > nn = ToNewUnicode(NS_ConvertUTF8toUCS2(nickname)); > > > >+// are we leaking nn and url here??? I think we are. > Not sure since I don't know if the ToNewUnicode will return you a new buffer > that lives past it's currect scope. Yeah, ToNewUnicode needs to convert and will give you a completely new buffer. I now have added code to destroy nn and url, according to comments in the implementation of ToNewUnicode. > >- ::NSS_Shutdown(); > >+ mShutdownObjectList->shutdownAll(); > >+ if (SECSuccess != ::NSS_Shutdown()) { > >+printf("=> NSS SHUTDOWN FAILURE\n"); > >+ } > >+ else { > >+printf("=> nss shutdown =====>> OK <<=====\n"); > >+ } > > } > > > Perhaps these print statements should be wrapped in debug blocks or assing > the appropriate logging macro? Right. Changed. > >+// do we need to call SECMOD_DestroyModuleList(list); ??? But it is not exported??? > This is a question for someone on the NSS team like relyea. I removed the comment and sent the question to the NSS team.
Attachment #105076 -
Attachment is obsolete: true
Assignee | ||
Comment 6•22 years ago
|
||
Comment on attachment 105464 [details] [diff] [review] Patch v3 Carrying forward r=javi since we agree.
Attachment #105464 -
Flags: review+
Comment 7•22 years ago
|
||
+ void shutdown(calledFromType calledFrom) + { + if (!mAlreadyShutDown) { + if (calledFromObject == calledFrom) { + nsNSSShutDownList::forget(this); + } + if (calledFromList == calledFrom) { + virtualDestroyNSSReference(); + } + mAlreadyShutDown = PR_TRUE; + } + } so, what if shutdown is called from the main application thread, and some other thread is trying to access one of these objects? suppose the other thread finds that the object is not yet shutdown? take nsHash for example. suppose one thread is calling nsHash::Begin while on the main thread nsHash::virtualDestroyNSSReference() is called. now, there doesn't appear to be any mutex protecting m_ctxt. so, where's the thread safety in all of this? is it there, and i'm just missing it?
Comment 8•22 years ago
|
||
is it that all of these objects inheriting from nsNSSShutDownObject are only accessed from the main thread?
Assignee | ||
Comment 9•22 years ago
|
||
> is it that all of these objects inheriting from nsNSSShutDownObject are only > accessed from the main thread? They are accessed from multiple threads. > so, what if shutdown is called from the main application thread, and some other > thread is trying to access one of these objects? suppose the other thread finds > that the object is not yet shutdown? In that scenario we crash. My assumption was, this would not happen, because when we shut down at a time when all other activity has already stopped. If after switching the network to offline, I still must assume that mutliple threads execute our data structures, we indeed need much more locking.
Comment 10•22 years ago
|
||
kai: ok, so provided necko kills off all the sockets before NSS shutdown, things will work. but, i thought part of this work was to remove the dependency on necko doing the right thing. though that would greatly increase the complexity of this patch. so which way do you want to go? ;-)
Assignee | ||
Comment 11•22 years ago
|
||
Well, there are two things that I can try to protect PSM against the outside world The patch as it is in here does a first step. It protects PSM from references that do not get freed. But you are right, it does not protect from ongoing activity during shutdown, it actually adds the new requirement that no other activity is going on during shutdown, which I assumed would be already true. I'm slowly getting convinced that I indeed should work on the solution that is more complex and adds more locking.
Comment 12•22 years ago
|
||
maybe.. i'm not entirely convinced either. so, is the plan to land this patch and then decide what to do about the multithread issues?
Assignee | ||
Comment 13•22 years ago
|
||
After thinking more about it, I want to do the additional locking. For a while I was thinking about an approach that adds a lock to each of PSM's tracked crypto instances, being unhappy with the overhead and complexity. But now I think I have a simpler idea. I would like to change the lock in nsNSSShutdownList from a simple lock to a recursive monitor. Each implementation method uses a nsAutoMonitor instance to lock the monitor. The monitor implementation has the nice property that it is not necessary to check whether the lock is already held by the current thread. Instead it will automatically increment the lock count. This makes the locking approach pretty straight forward, as we don't need to check whether a function could be called from another function. It also makes sure there will be no deadlocks if the implementation of tracked class A causes tracked class B to get destroyed. Since it happens on the same thread, the removal from the shutdown tracking list will succeed thanks to the recursive monitor. In short, as long as crypto activity is momentarily going on, a global shutdown request will wait unless it's over. One more thing that needs attention is the possibility that NSS functions can callback into Mozilla to show up a password prompt. Unfortunately those prompts are window modal, not application modal. While we show a password prompt, I fear the user could switch to another window and select shutdown/switch profile. That's maybe an unlikely user action, but it would result in a deadlock. Therefore I think we need to retest the "is shut down" condition whenever we return from a function that could have possibly invoked the callback.
Assignee | ||
Comment 14•22 years ago
|
||
Actually, I see a problem. I think we should not try to pull down NSS while it is currently calling back into us. That looks tricky.
Assignee | ||
Comment 15•22 years ago
|
||
Patch v4 increased by 35%, but I think the complexity is not much higher. We now have two independent things that need protection. 1) The list of all tracked objects This is what already existed in the previous patch and the logic is mostly unchanged. A lock it used to make sure access to list is serialized. The lock is only being acquired for very short moments, during which the list is modified, but released for any other actions. 2) The NSS activity state New in this patch. When the shutdown iteration is about to start, it must make sure that no activity is going on on other threads while it NSS frees resources. Initially I thought a "monitor" would be an appropriate synchronization mechanism, but I don't like the property that at any given time only one thread can have the monitor locked. Ideally, it should be known whether any thread is currently executing NSS activity, but activity on one thread should not block activity on other threads. So while a monitor, locked from any tracked PSM activity, would allow to find out whether NSS activity is going on, it would also prevent simultaenous PSM activity on separate threads. It would be nice to have a counter for the activity tracking. Whenever activity is going on, the counter would get increased, and decreased when a thread leaves the activity. When the shutdown list is about to execute, it would wait until the counter reaches zero, meaning no activity is going on. When the count arrives at zero, it would make sure that while the shutdown is in progress, activity on other threads is not possible - except removing objects from the list. The new patch implements what I describe above. I will paste the pseudo-code logic in the next comment. 3) Cancel open SSL connections I realized that our new locking and tracking approach will allow us to solve bug 94557. When the user interactively makes the decision that a security token should be switched to the "logged out" state, we could do the same as for the NSS shutdown preparation. We iterate over all crypto objects, which might have a cached authentication attached. Ideally, we could fine tune this later, and only clean up those crypto objects we actually need to clean, potentially only the list of open SSL socket file descriptors. However, I think the current approach to pull down everything is a good first step, and in addition it is a very good way to test the correctness of this patch, because we can force the logout while there is still crypto acitivty going on! 4) User Interface deadlock risk This refers to the problem I mentioned in the previous comment 14. The following can happen: - a mail window is open - a browser window is open - the user starts to fetch mail over SSL, and possibly the "security master password" prompt comes up - this dialog blocks the mail window - now the user selects the browser window and uses the menu to instruct a token logout With the current patch, this produces a deadlock. I know how I can circumvent the deadlock, we can use a maximum timeout time for waiting for the used condition variable, and if nobody objects, I will implement that rather simple change for the next follow up patch. But first, I would like to mention my motivation: While the mail window is blocked, we are effectively in the middle of NSS activity. NSS has called back into us. I believe, we should never try to bring down all of NSS while is has called back into us. Instead, I would like to make the shutdown attempt wait for at most 2 seconds for the "all NSS activity has stopped". I think this timeout value can be flexible, when called from the UI, 2 seconds seems reasonable, while during the shutdown, we might want to allow a larger value, maybe 10 seconds. I think we will see this kind of problem only if the user ignores a callback from NSS, like the master password prompt, or the selection of a SSL certificate for client authentication. 5) Services not yet disabled While I worked on the patch I realized that there is one more thing that we should possibly solve better. Depending on the application, there might be a short or longer phase during which NSS is brought to the "down" state. During this state, the application must not call any crypto services to iterate over certificates or access modules etc. If it breaks the rule and does anyway, we will crash. So, in addition to what is now implemented, the activity tracking and tracking of objects that carry NSS resources, we should extend the implementation functions of services, to also check the current global state of NSS "up or down", and refuse fulfilling the service while we are down.
Assignee | ||
Comment 16•22 years ago
|
||
Pseudo code of the tracking and locking logic: lock PSMActivityStateLock protects int ActivityCounter, which is initially 0 PRThread* PSMRestrictedThread, which is initially nsnull event PSMActivityChanged this is a condition variable protected by a lock and the ability to post the event when the counter or allowance state changes lock ListOfTrackedPSMObjectsLock protects list ListOfTrackedPSMObjects Whenever code enters into an implementation function of a tracked PSM object: if the method is a ctor call nsNSSShutDownObject ctor, to register with the list (no explicit called needed, base class init happens first anyway) acquire PSMActivityStateLock if PSMRestrictedThread != null && PSMRestrictedThread != PR_GetCurrentThread() wait for PSMActivityChanged (this releases PSMActivityStateLock during the wait) retest condition once the condition is reached, we have permission to go ahead increment ActivityCounter release PSMActivityStateLock check self's alreadyShutDown flag, and return if it is already shut down if this method is a ctor do the init else if this method is a dtor call destructorSafeDestroyNSSReference call shutdown(calledFromObject) else do the actual activity Whenever code leaves an implementation function or a ctor or a dtor of a tracked PSM object acquire PSMActivityStateLock decrement ActivityCounter if ActivityCounter reaches zero, NotifyAll PSMActivityChanged release PSMActivityStateLock destructorSafeDestroyNSSReference check self's alreadyShutDown flag, and return if it is already shut down does not do any locking this function should only be called from the destructor, which must have done the locking already or coming from virtualDestroy, originating from the list, which has also done the locking already !no calls to NSS are allowed that could possibly result in UI callback to PSM (because of that, I removed the delayed deletion of certificates in nsNSSCertificate and change it to happen immediately) virtualDestroyNSSReference is only called from the list do not do any locking, only call destructorSafeDestroyNSSReference Whenever a tracked object gets created, acquire ListOfTrackedPSMObjectsLock add to ListOfTrackedPSMObjects release ListOfTrackedPSMObjectsLock Whenever a tracked object gets destroyed acquire ListOfTrackedPSMObjectsLock remove from ListOfTrackedPSMObjects release ListOfTrackedPSMObjectsLock When it is time to shut down NSS, or the application requests explicitly to reset the security state, and it is necessary to mark all PSM objects as "already shut down" acquire PSMActivityStateLock if 0 < ActivityCounter wait for PSMActivityChanged (this releases PSMActivityStateLock during the wait) retest condition once the condition is reached, we have the exclusive lock set PSMRestrictedThread = PR_GetCurrentThread() this will cause all other threads to idle that want to do PSM activity, while we are busy with cleaning up one object release PSMActivityStateLock do acquire ListOfTrackedPSMObjectsLock entry = pull first entry from ListOfTrackedPSMObjects release ListOfTrackedPSMObjectsLock shutdown(entry) which calls virtualDestroyNSSReference while were able to pull an entry from ListOfTrackedPSMObjects acuire PSMActivityStateLock set PSMRestrictedThread = nsnull NotifyAll PSMActivityChanged release PSMActivityStateLock Optionally, for this last block (not yet implemented): pass in a timeout give up and fail to clean up if the "no more activity going on" state was not reached during the timeout period
Assignee | ||
Comment 17•22 years ago
|
||
Patch v4 as explained in the previous two comments.
Attachment #105464 -
Attachment is obsolete: true
Assignee | ||
Comment 18•22 years ago
|
||
To solve the problem with the UI being active in a callback from NSS, making the requested shutdown impossible, we could do the following: Let's assume the request to bring up a callback UI arrives on thread A (which might either be the UI thread or the socket transport thread, or...). The function that is going to bring up the UI is implemented in PSM. Instead of making a blocking call to show the UI, PSM could spawn a new thread B. B would have the information where to store the data obtained from the user with the dialog. Once the dialog closes, B notifies a condition variable (to wake up A). Under normal circumstances, A would idle and wait for the condition variable signal to arrive from B. However, the same condition variable would be used to signal another event. While the UI is shown and A is blocking, let's say the NSS shutdown request occurs (which happens on thread C). Right now, with patch v4, C would block forever. With the suggestion of a timeout, C would eventually fail to do the shutdown. With the new suggestion, thread C would be able to detect that currently a blocking UI operation is in progress. It would set some flag that we do not longer want the result from the UI. It would talk to the variables maintained by B and make sure, that any information returned from the UI will get ignored, and that thread B will quit silently, once the UI closes. Next, C would set another flag that NSS shutdown is initiated, and that all callabck UI is currently prohibited. Next, C notifies the condition variable, waking up A. A will detect that B is still running, but the NSS shutdown requested flag has been set. It will detach itself from waiting for B. It will return, which goes back the road of clearing all NSS activity counters. C can continue to wait for all activity to stop, and that will soon succeed, as A canceled the operation.
Assignee | ||
Comment 19•22 years ago
|
||
I implemented what I proposed in the previous comment - but it results in a lot of trouble, namely deadlocks. I no longer like the idea, but have a new idea. I suggest we disallow profile switching while the user has not yet completed such a security prompt. The problem only occurs when we show the prompt to enter the master password while the user is attempting a crypto operation, and the user switches to another open window to switch the profile or shut down the application. I suggest, when the user attempts to switch the profile, we disallow it. We can bring up a message and inform the user with a message like "the selected operation can not be completed at the current time, because the application needs your attention in another window". I'm not yet sure what we should do when the user requests to quit the application. We could show the same message and not allow to quit. Or, if we can detect the application is going down for sure, we could decide to not shut down the crypto library.
Assignee | ||
Comment 20•22 years ago
|
||
After having talked to Darin yesterday, here is my current understanding: My testcases to test the patch for this bug include a scenario, where the SSL sockets have not been closed at the time we receive the request to switch the profile. I tried to work around it, using an idea from Nelson, to swap the contents of the lower NSPR data structure. However, Darin warned me correctly that this swapping is a risk, since we have trouble to lock access to the socket from outside the socket transport thread. So we filed bug 181230 and Darin wants to help us to ensure that all sockets will have received their close call at the time we are asked to switch the profile. So, I'm a bit blocked right now waiting for the fix to 181230, and I wonder what this fix will mean for the approach suggested to fix this bug. It will of course mean we can avoid the hack to the NSPR socket. In a perfect world, where all components outside of PSM release their references to PSM prior to the request to switch profile, we would not have to do anything but fix our leaks (those changes also included in this bug). However, PSM will not be able to control whether the outside world is behaving perfectly and free references to PSM in time. Because of that, I would like to suggest we still try to get this tracking and early cleanup in, given the fact that I have already worked on it some time :) It has been suggested that we should test carefully. I will definitively produce test builds for all three major platforms once we have the fix for 181230, but I think we should wait for it first. I will prepare a patch that excludes the early socket shutdown using the NSPR hack, assuming we don't have to care for that, but will do everything else for early resources cleanup.
Assignee | ||
Comment 21•22 years ago
|
||
John, this information is mostly for you and for everybody who wants to test. I have a patch that is in its final state, as far as I think changes are necessary to PSM. The patch is large and makes some tricky changes to PSM. Without the patch for bug 181230, the intention of this bug has not yet reached: There are still scenarios where profile switching fails. However, it might make sense to do some early testing whether the large patch to PSM introduces other regressions. John, could you please give the test builds a try? http://www.kuix.de/mozilla/177260/ The test build has a new feature - from 97622. The Tools menu has a new command "switch profile". Known problems with the test build: - if you have used https, and you click very soon after having used it, Mozilla might report that it detected an internal failure (namely that sockets have not been cleaned up yet, bug 181230). You can provoke this situation by using two windows, make sure your warnings when entering a secure site are on, and go to a https site. When the warning shows up, go to another window and try to switch the profile. If, however, you close all shown prompts, you wait until traffic seems to be finished, and wait 20 seconds after using https, shutting down and profile switching should work. - if you have used imap/ssl during your session, shutting down or profile switching does very often fail for me, even if you wait a long time after closing mail. (It seems mail is not cleaning its sockets up at all, hopefully will also be fixed by 181230?) So I'd suggest the following for testing: - use https - for now do not use IMAP/SSL or SMTP/SSL, use the plain mail protocols - use any other kind of crypto operation - have 2 open browser windows and in one of them, try to open any kinds of UI prompts that are triggered by PSM, like password prompts, p12 import confirmation messages, https client auth certificate selectors, etc. Try to go to the other open browser window and tell it to switch profile. Whatever you do, you should never see the application hanging. An expected new behaviour is: While you have a PSM password prompt etc. in one window, and you try to switch profile, PSM will tell you that switching is currently not possible, because the operation in the other window must be completed first. You can also test whether profile switching works realiable (currently only possible to test when not having used SSL with mail). If you have used https in your session, wait for > 20 seconds. Then try to switch the profile. If PSM allows you to do the switch, that's good. This bug 177260 blocks a couple of bugs, describing crypto operations, that cause a later attempt to switch the profile to fail (in my test environment without this patch). You could try whether you are able to switch profile after those operations. Whatever you do, you should never see that the profile switch operation is about to start but fails with a message like "failed to initialize the security component". If you never see a hang, this first phase of testing would be succesfull. The primary risk scenario for producing hangs is doing multiple operations at once, like having a password prompt open in one window, and trying to do another operation in another window. One more thing for testing. I have enhanced the command Tools/Password manager/Logout When you execute that (regardless whether any I/O activity is currently active or any prompts are shown), not only will the token get logged out, but also will it mark all sockets as invalidated, and I/O should stop very soon. (Note that it doesn't help for profile switching, because the underlying resources might still be held).
Assignee | ||
Comment 22•22 years ago
|
||
Terry, this is the subset you asked me to provide.
Assignee | ||
Comment 24•22 years ago
|
||
For reference: The test builds were created using a trunk build as of two days ago, patch v5 from this bug, plus the code from bug 97622 patch v8, plus the fixes for NSS bugs 177366, 176667, 179799.
Assignee | ||
Comment 25•22 years ago
|
||
Comment on attachment 107151 [details] [diff] [review] Subset of patch that contains only the tracking helper classes I found a bug
Attachment #107151 -
Attachment is obsolete: true
Assignee | ||
Updated•22 years ago
|
Attachment #107152 -
Attachment is obsolete: true
Assignee | ||
Comment 26•22 years ago
|
||
This is a newer snapshot of the patch. It has all the logic from the previous patches, it works around bug 181230 by using a NSPR trick that should be threadsafe and it fixes crasher bug 182803. However, there are concerns that it adds complexity to PSM that we should avoid. Plan A is: We wait for Darin's general fix for bug 181230, and once we have it, we'll test whether the complexity proposed with this patch can be avoided. It is my understanding that it does not make sense to do full testing of this patch or reviewing this patch, unless we know whether a fix for bug 181230 is sufficient or not. If a fix for 181230 is not sufficient, plan B would be: take this patch.
Assignee | ||
Comment 27•22 years ago
|
||
This is an alternative patch that was suggested. It does not introduce the track-all-nss-resources-and-allow-shutdown-early complexity. I only fixes the leaks, tracks whether we have open sockets or not, whether we have active blocking UI or not, and vetos profile changes if there are. This patch would assume that bug 181230 gets fixed, tearing down sockets, which Darin is working on. However, this patch still does not solve all the problems (as the complex patch v6 would do). I can reproduce a crash using this patch, without accessing any SSL. It only requires having open certificate manager or having read a signed email message. The problem is the JavaScript environment still holds a reference to a nsNSSCertificate object, which gets cleaned up after we have shut down NSS, which is no longer allowed at this time, and we are crashing.
Assignee | ||
Comment 28•22 years ago
|
||
Comment on attachment 108029 [details] [diff] [review] Delayed - Maybe Later - Hammer Approach but very complex Patch v6 (snapshot 47) I'm renaming this patch and delaying it. As of today, I don't see support to get this landed. We can consider it again at a later time, if alternative solutions don't work.
Attachment #108029 -
Attachment description: Patch v6 (snapshot 47) → Delayed - Maybe Later - Hammer Approach but very complex Patch v6 (snapshot 47)
Assignee | ||
Comment 29•22 years ago
|
||
Comment on attachment 108846 [details] [diff] [review] Partial Patch / Leaks and UI tracking only / v1 I'm renaming this patch and want to try to get it checked it in. From my understanding, it reflects the portions of the changes everybody agrees on. (It does not contain the "track all NSS resources as used by PSM" nor the "early shutdown" complexity).
Attachment #108846 -
Attachment description: sidetrack patch v1 → Partial Patch / Leaks and UI tracking only / v1
Assignee | ||
Comment 30•22 years ago
|
||
In comment 27 I mention that a fix for bug 181230 is still required, but will not be sufficient. However, this "is not sufficient" statement applies to the profile switching approach as implemented in bug 97622. I think it makes sense to give separate status updates for the Mozilla browser and an embedding environment. Let's look at a simpler embedding environment first: By landing the leak fixes and UI tracking (attachment 108846 [details] [diff] [review]) in combination with a complete fix for bug 181230, we should have a working base environment for profile switching. With having those fixes, if the embedding application releases all references to PSM resources prior to attempting the profile switch, everything should be fine. However, profile switching in Mozilla will be more difficult. As I understand it, the patch in bug 97622 is not sufficient, because JavaScript and delayed destruction of references is involved. In my tests, as I explained in comment 27, I saw resources hold by the UI (security status) are getting destructed after the profile switch was requested. I don't see a simple solution to this. We either need to find a way to tear down completely all resources held by Mozilla to PSM prior to switching the profile, or we need to take a complex patch like the one I suggested in attachment 108029 [details] [diff] [review]. From my point of view this means the following: - let's land the leak fixes now - let's work on bug 181230 - if the above is done, reenable NSS shutdown, tracked in bug 187501 - if all of the above is done, we are ready to go back to profile switching testing - however, we can do testing only in those (embedding) environments where we don't have to deal with delayed JavaScript destruction Being unable to do profile switching correctness testing with Mozilla is inconvinient. It would be great if we could find ways to enhance the patch in bug 97622 to be independent of the delayed JavaScript destruction, i.e., enhance the patch to tear down all open windows and execute Garbage Collection in all active JavaScript contexts prior to the profile switching attempt.
Assignee | ||
Comment 31•22 years ago
|
||
Comment on attachment 108846 [details] [diff] [review] Partial Patch / Leaks and UI tracking only / v1 As stated in the previous comments, I would like to land this patch. Javi, Darin, can you please review?
Attachment #108846 -
Flags: superreview?(darin)
Attachment #108846 -
Flags: review?(javi)
Comment 32•22 years ago
|
||
Comment on attachment 108846 [details] [diff] [review] Partial Patch / Leaks and UI tracking only / v1 r=javi
Attachment #108846 -
Flags: review?(javi) → review+
Comment 33•22 years ago
|
||
Comment on attachment 108846 [details] [diff] [review] Partial Patch / Leaks and UI tracking only / v1 sr=darin
Attachment #108846 -
Flags: superreview?(darin) → superreview+
Assignee | ||
Comment 34•22 years ago
|
||
Removing "make PSM leak tolerant" from bug summary.
Summary: Make PSM XPCOM leak tolerant + fix leaks → Fix known leaks in PSM, track blocking PSM UI, track open SSL sockets
Assignee | ||
Comment 35•22 years ago
|
||
Patch checked in, marking fixed.
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Updated•8 years ago
|
Product: Core → Core Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•