101329 - Make sure switching profiles is safe for PSM / NSS

Reporter

Description

•

24 years ago

As we saw from bug 98182 and bug 99566, PSM needs to be enhanced. When a profile is changed, the security component must behave failsafe. We have different approaches, which I want to summarize. Assumptions =========== NSS flushes all persistent data to disk immediately, no real need to call NSS_Shutdown PSM is different from most components of Mozilla. It uses NSS for storage of personal security information. NSS is using global data. This global data belongs to exactly one profile, i.e. cert/key/token files on disk. To make SSL work, PSM creates links from network objects to NSS objects. As long as the glued network objects live, NSS must not be shut down, or we might crash. PSM, i.e. the nsNSSComponent, is the only module that calls NSS directly. Solution 1 ========== In my opinion, in order to switch a profile, shutting down and restarting the application is the easiest way to get complete crash safety. This could be automated by starting the application again, create a communication channel between old and new instance, delay loading of the new instance in the very beginning, until the old instance says it is down. Solution 2 ========== If a complete restart is not approriate, we must carefully monitor all references to NSS objects we create. It is very likely, that NSS objects are still referenced, if the user asks to switch a profile while there is still SSL activity. Therefore, as a minimum, all network activity should be shut down synchronously before an attempt to switch a profile is being made. This will result in all sockets to close, which will clear references to NSS. When the event "profile will change, flush data", we do nothing. When the event "new profile selected, prepare for using it" is seen, PSM frees its own references to NSS. Nothing else has be to done. As we have required, all network activity is already down, and the socket close code already free the open references. We call NSS_Shutdown, and call NSS_Init for the new profile. Solution 3 ========== If we can't shut down the network completely before switching profile, then we must make PSM completely failsafe. This includes serializing all NSS access with the NSS re-init code. To make sure, NSS data from the old session is not used after the re-init, PSM needs to shut down internally. In one way or another, we must invalidate all PSM/NSS in-memory data. This could be made possible by always keeping a list of all open SSL sockets / file descriptors. When we receive the "Profile going down" event, we iterate over all descriptors, marking them as invalid. When we receive the "New profile coming up" event, we free all additional references to NSS, call NSS_Shutdown, and call NSS_Init for the new profile In addition, all PSM code that is asynchronously called by other threads, must check whether the data has become invalid, before trying to call NSS functions (this includes the functions that are hooked into the file descriptior I/O layer), and simply fail gracefully. Question ======== I fear solution 1 is too difficult to implement on all platforms, and the logic needed to be rewritten for every embedding platform. From what I've learned from bug 98182 and bug 99566, I guess we need to go with Solution 3. Do you agree?

Conrad Carlen (not reading bugmail)

Comment 1

•

24 years ago

I agree that Solution #3 is the best. How much work is it?

(Don't use, old account) - Kai Engert

Reporter

Comment 2

•

24 years ago

If I do it, I need to review most of the PSM code, to find out what needs to be done exactly, before I can make a good guess. I suggest, to minimize reviewing work, the review for this bug should be done with the requirements for bug 101005 in mind, and to first develop a design, how to fix both issues. Combining the work for both bugs makes sense, as both will require adding locks for serialization.

Stephane Saux

Updated

•

24 years ago

Priority: -- → P1

Target Milestone: --- → Future

Stephane Saux

Comment 3

•

24 years ago

*** Bug 98182 has been marked as a duplicate of this bug. ***

(Don't use, old account) - Kai Engert

Reporter

Comment 4

•

24 years ago

Bob, can it cause trouble if NSS_Shutdown is called, without ever having called a NSS_Init function? I just saw that this happens. If a user has multiple profiles, during application startup, the nsNSSComponent is constructed and destructed at least once prior to the real start. The destructor of PSM/nsNSSComponent unconditionally calls NSS_Shutdown.

Conrad Carlen (not reading bugmail)

Comment 5

•

24 years ago

This happens (failed construction of NSS component) for every DOMWindow which is created before the profile is set. If you create a new profile, it will happen several more times as well - once for each of those windows. Whether it's hazzardous to call shutdown before init, there's extra work going on - string bundles are loaded before it fails due to not having a profile. You might want to determine that up front.

Robert Relyea

Comment 6

•

24 years ago

Oop,s Sorry, I need to post to the bug, not reply to bugzilla... Kai, It is safe to call shutdown even if you haven't called init. Shutdown checks each pointer type before it frees it. The only worry is shutdown and init are not thread safe, you will have bad results if call either multiple times on different threads at the same time. bob

(Don't use, old account) - Kai Engert

Reporter

Comment 7

•

24 years ago

Implementation design for making PSM safe for profile switching --------------------------------------------------------------- Definitions =========== I define the term "NSS session" as the time between a call to NSS_Init and NSS_Shutdown. A profile switch will result in terminating a NSS session, and starting a new one. I define the term "NSS object" as an instance of a structure defined by NSS, which PSM has the ownership for, and which must be destroyed using a NSS function. An example is a CERTCertificate*. Design Idea =========== PSM already stores references to NSS objects (inside other C++ objects), and those might by referenced by code all over Mozilla (e.g. the network code). Just before NSS_Shutdown is called, PSM must invalidate all references to underlying NSS objects and at least mark them as invalid, ideally free/release them, too. The application must make sure that the NSS objects are not accessed in a later NSS session, i.e. after they have been marked invalid (because they might refer to a token or cert database or whatever that has been destroyed by NSS_Shutdown). Implementation Idea =================== To store this valid/invalid state, we need to associate each NSS object with user data. This is something NSS does not allow for. I therefore suggest that the application creates a little wrapper object for every owned NSS object. This just contains the pointer to the NSS object, a valid/invalid flag, and a type information, to allow the PSM code that terminates the NSS session to call the correct NSS cleanup function (e.g. CERT_DestroyCerticicate). NSS_Shutdown does not cleanup everything. It cleans some global NSS data structures, but references obtained by the application keep some objects alive. Therefore the code that terminates a NSS session must destroy all those objects (to avoid leaks). To allow for this, PSM needs to have a global list of all NSS objects created during a NSS session (there is currently no such list). We must make sure that no NSS function is active when NSS_Shutdown is called. The application might even make simultaneous calls to NSS functions from different threads. We need to introduce a lock combined with a counter, a condition variable. Whenever a NSS function is going to be called, we lock and increment the "NSS calls active counter" condition variable. While we lock for incrementing the counter, we need to check for the valid/invalid flag as well. After the call, we lock and decrement. If we reach zero, we need to signal the condition variable. The code that is called on profile switch must not call NSS_Shutdown until it is able to obtain the lock with the counter having reached zero. By using a condition variable we can wait for the condition (which might block until no NSS functions are active). Implementation Details ====================== To implement the above, he following work is required: - define the wrapper class for NSS objects - define and introduce the global list - wherever NSS objects are created/stored/referenced/freed, change the code to use the new wrapper class and update the global list - wherever NSS functions are called, add the lock/check/condition code, and the code to fail gracefully - create the NSS session shutdown code I discussed this with Bob Relyea, and he agreed that the above seems to be the safest approach.

Status: NEW → ASSIGNED

(Don't use, old account) - Kai Engert

Reporter

Comment 8

•

24 years ago

I just counted how many times PSM makes a call to a NSS function, and the count of 460 different calls frightens me. While my design in theory sounds nice, it means very much work. I feel tempted to ask again for synchronous network shutdown, i.e. solution 2.

(Don't use, old account) - Kai Engert

Reporter

Comment 9

•

24 years ago

What about the following, much simpler idea: The code that initiates profile switch brings down all network connections (which will happen asynchronously), maybe by going offline as suggested. The network component could provide a status information, whether it is currently completely idle. As long as this flag does not report idle, the profile switching code sleeps a few microseconds and then retries. It blocks. I think it is acceptable to block there, the profile switching code should make sure that it is safe to go down. After that has happened, the profile switch can continue and post the events. For every class in PSM that stores references to NSS objects, we introduce a class instance counter. When the profile switch event is seen in PSM, we add assertion checks to see whether the instance counts of all PSM classes have reached zero. If we arrive here without the counts having reached zero, we have bugs in PSM, and/or need to make sure that more referencing components are shut down. When all counts have reached zero (and PSM has freed the global references that PSM keeps), we can safely call NSS_Shutdown and NSS_Init. When the profile switch event notification returns, online made can be reactivated. Conrad, what do you think?

Conrad Carlen (not reading bugmail)

Comment 10

•

24 years ago

Kai, I was thinking along similar lines - writing up in between your comments and then the phone rang. Anyway, here's what I was thinking at that time. > it means very much work Alright, plan 2 is sounding more realistic. Is synchronous shut down of *all* network activity needed if only PSM is this difficult to shut down? Is it possible to, on receiving a profile shutdown notification to check and see if PSM has any outstanding transactions? And then, only in that case, block for some time until things have completed? If we went offline as the first step of shutting down the profile, the stopping of net activity is not synchronous. But, can any new activity be started after shutdown? The reason I ask is that if before going offline you could know that no PSM was "idle" and then you go offline, can you be sure that no new PSM activity will be started as a result of the asycc completion of other things?

(Don't use, old account) - Kai Engert

Reporter

Comment 11

•

24 years ago

Generally I think we should decrease complexity, and just shutting down all network activity, maybe even shutting down all application activity whatsoever sounds very safe. > Is synchronous shut down of *all* network activity needed if only PSM is this > difficult to shut down? Yes, see the final paragraph. > Is it possible to, on receiving a profile shutdown > notification to check and see if PSM has any outstanding transactions? And then, > only in that case, block for some time until things have completed? Yes. I thought about checking internally whether there are no open SSL sockets, no referenced certificates for page information, etc. And I could provide this boolean status to you over an interface to the NSS component. However, I fear it is not safe enough, see next answer. > If we went offline as the first step of shutting down the profile, the stopping > of net activity is not synchronous. But, can any new activity be started after > shutdown? The reason I ask is that if before going offline you could know that > no PSM was "idle" and then you go offline, can you be sure that no new PSM > activity will be started as a result of the asycc completion of other things? That is possible. For example, if the user has configured a proxy, this involves network activity that first communicates in plain text, and once the handshake with the proxy is finished, the SSL handshake is initiated over the already connected socket. The same happens, when a user uses SSL/SMTP.If there is still application activity at the time you query for the "PSM idle state", it might indeed report idle, but a moment later the application logic that requests a https page over a proxy or sends a mail message can initiate the SSL mode.

(Don't use, old account) - Kai Engert

Reporter

Comment 12

•

24 years ago

Changing my prefered e-mail address.

Assignee: kai.engert → kaie

Status: ASSIGNED → NEW

Kai Engert [:KaiE:]

Assignee

Updated

•

24 years ago

Status: NEW → ASSIGNED

Kai Engert [:KaiE:]

Assignee

Comment 13

•

24 years ago

Bob: Darin, Conrad and Kai had some chats outside Bugzilla. We currently favorize shutting down all network activity before we do the change, and before we re-init NSS. We already know how to make the shutdown, but as shutting down happens asynchronous with a small delay, we need to wait until the state has been reached, where we know that all sockets are down. I want to do this by using a counter of currently open sockets and we want to wait until this has reached zero. (We switch the application to an offline mode, so it is guaranteed that no other sockets will get opened as soon as we are in offline mode) I see hat PSM uses a call to SSL_ImportFD. I haven't looked into the details of this function, but I guess that this causes NSS to layer itself on this FD. Although currently PSM does its own additional layering, I would prefer not to do the reference counting there. Nelson created bug 99303 where he requests to eliminate that layer. So here is my question: Is there a function we could use to query NSS: "How many sockets are currently in use by SSL?". We want to either poll this function periodically or wait for a signaled condition variable, whatever is easier. If there is currently no such function in NSS, should we: - add such a function? - or stick with PSM layering? - or is there any other approach we could use inside PSM to monitor the "socket closed" event to do the bookkeeping?

Robert Relyea

Comment 14

•

24 years ago

The best people to answer your questions on how SSL sockets and NSPR sockets relate and are layered on each other are wan-teh and nelson, I'm adding them to the CC list.

Wan-Teh Chang

Comment 15

•

24 years ago

Solution 3 doesn't make sense. (Name one application that can switch user profiles before all open objects that belong to the previous user have been closed.) I am glad that you decided not to implement solution 3. > Is there a function we could use to query NSS: "How many > sockets are currently in use by SSL?". No. If all SSL sockets are opened and closed by PSM, you can easily do the bookkeeping and monitor the "SSL socket closed" event with an integer counter, a PRLock, and a PRCondVar. But it seems that you should wait until all sockets (SSL or non-SSL) have been closed before you can switch user profiles. Therefore, I think it is probably Necko that should be doing the bookkeeping and monitoring the "socket closed" event. Here is some pseudocode. 1. Variable declaration and initialization: PRLock *lock; PRCondVar *cv; unsigned int nsock; lock = PR_NewLock(); cv = PR_NewCondVar(lock); nsock = 0; 2. When a new socket is opened: Open a new socket; PR_Lock(lock); nsock++; PR_Unlock(lock); 3. When a socket is closed: Close a socket; PR_Lock(lock); nsock--; if (nsock == 0) { PR_NotifyCondVar(cv); } PR_Unlock(lock); 4. The thread that monitors the "socket closed" event would do: PR_Lock(lock); while (nsock != 0) { PR_WaitCondVar(cv, PR_INTERVAL_NO_TIMEOUT); } PR_Unlock(lock);

Kai Engert [:KaiE:]

Assignee

Comment 16

•

24 years ago

wtc: Thanks for your comments and suggestions. CC'ing Darin. Conrad, Darin: wtc suggests to have necko doing bookkeeping to monitor the network down event. I guess, as the condition variable can't be shared between components directly, we needed to introduce a new interface, implemented by necko, for querying and monitoring the network state. We could add a blocking (no timeout) method to that interface, which waits for the condition variable. That would mean in detail: - necko always makes the "active socket count" bookkepping, together with locking, increasing and decreasing the count as sockets are closed and opened - necko implements the new interface, e.g. nsINetworkDownWatcher - necko triggers the condition variable whenever the count reaches zero - profile shutdown requests offline mode - profile shutdown code sends out the events - PSM receives the "profile up" event and calls the blocking method nsINetworkDownWatch::WaitUntilAllSocketsDown - PSM does NSS shutdown and init. - PSM returns and profile event posting can continue. What do you think about that suggestion?

Darin Fisher

Comment 17

•

24 years ago

necko already uses the observer service to notify observers that necko is going offline ("network:offline-status-changed"). either this notification should be delayed until necko is really offline, or there should be an additional notification when necko is really offline. so, i don't see any need to a nsINetworkDownWatcher type interface... or, do you really really need a condition variable to wait on? is an async callback not sufficient?

Kai Engert [:KaiE:]

Assignee

Comment 18

•

24 years ago

You say you are using the observer service. That is the same method that the profile code is using for its notification. Is my assumption correct, that a queue is used for all events, and PSM will receive the events "one after another", i.e. if we are currently processing an event, we will not receive the next event until we return from processing the previous event? If that is the case, we can't use that notifiation. Because when we see that profile change event, we need to block within that method until the network is down. If we can assume that mutliple events can be processed at the same time by PSM, then we can use your event "all sockets really down".

Darin Fisher

Comment 19

•

24 years ago

what thread do you intend to block?

Kai Engert [:KaiE:]

Assignee

Comment 20

•

24 years ago

I'm not sure on which thread the profile switch events do arrive. I'd have to debug this to find out. Easiest would be if someone knows already. Is it the UI/main thread? If nobody knows, I will debug.

Conrad Carlen (not reading bugmail)

Comment 21

•

24 years ago

> Because when we see that profile change event, we need to block within that > method until the network is down. Not nescesarily. If the profile mgr could, as the first step of switching the profile, request the offline state and idle until it knew that a quiet state had been reached (possible, Darin?), by the time it sent out the "profile-before-change", anybody observing that event could know for a fact that all network activity had stopped. By putting the responsibility on the profile mgr, PSM wouldn't really have to block at all.

Darin Fisher

Comment 22

•

24 years ago

conrad, that sounds like a good idea, though it does make the profile manager suddenly dependent on necko, right?

Conrad Carlen (not reading bugmail)

Comment 23

•

24 years ago

If I was doing this in profile mgr (waiting for I/O to finish after going offline), could I query the inUseTransportCount attribute on nsISocketTransportService in order to know when we're really offline? But, is it easier and without dependencies at all to have nsIIOService::SetOffline() do the idling? Then, it's all within necko. SetOffline() could have a boolean param with which you could specify whether or not to idle and wait. Even though, as far as dependencies go, depending on necko is not so bad since necko is such core code, it would be good to avoid any dependencies.

Darin Fisher

Comment 24

•

24 years ago

conrad: SetOffline cannot block waiting for the socket transport service to shutdown completely. it is possible, however, to have necko notify an observer topic when the socket transport service is shutdown. this means making delaying profile change notification until this notification is fired, which should be fine. who is calling SetOffline(TRUE) in the profile change case? is it the profile manager? if so, then it makes perfect sense for the profile manager to register as an observer for notification of this event being completed.

Conrad Carlen (not reading bugmail)

Comment 25

•

24 years ago

> who is calling SetOffline(TRUE) in the profile change case? That's the problem - nobody is. > if so, then it makes perfect sense for the profile manager to > register as an observer for notification of this event being completed. That's what I'll do. Does this bug need to be split up into bugs which block this one? I'll do the profile mgr part of this. In order to do that, sounds like something needs to be done to necko to send out notification that the offline state is fully offline.

Darin Fisher

Comment 26

•

24 years ago

yes, file a bug against me to add support for that.

Conrad Carlen (not reading bugmail)

Comment 27

•

24 years ago

Necko bug for this is bug 104020 and Profile Mgr part is bug 104021.

Kai Engert [:KaiE:]

Assignee

Comment 28

•

24 years ago

Making dependent on new bugs.

Depends on: 104020, 104021

Judson Valeski

Updated

•

24 years ago

Blocks: 70229

Stephane Saux

Comment 29

•

24 years ago

target->2.2

Target Milestone: Future → 2.2

Kai Engert [:KaiE:]

Assignee

Comment 30

•

24 years ago

I realize the patch that I just attached to bug 99566 will fix this bug, too. Conrad, do you want to review the profile relevant code in that patch?

Peter Trudelle

Updated

•

24 years ago

Blocks: 108795

Kai Engert [:KaiE:]

Assignee

Updated

•

24 years ago

Blocks: 107624

Kai Engert [:KaiE:]

Assignee

Comment 31

•

24 years ago

I think this is now fixed by todays checkin for bug 75947. Please reopen if you find additional problems.

Status: ASSIGNED → RESOLVED

Closed: 24 years ago

Resolution: --- → FIXED

John Unruh

Comment 32

•

24 years ago

Verified per kaie's comment.

Status: RESOLVED → VERIFIED

Serge Gautherie (:sgautherie)

Updated

•

21 years ago

Blocks: 75599

timeless

Updated

•

20 years ago

Product: PSM → Core

Nobody; OK to take it and work on it

Updated

•

9 years ago

Product: Core → Core Graveyard