Closed Bug 101329 Opened 23 years ago Closed 23 years ago

Make sure switching profiles is safe for PSM / NSS

Categories

(Core Graveyard :: Security: UI, defect, P1)

Other Branch

Tracking

(Not tracked)

VERIFIED FIXED
psm2.2

People

(Reporter: inactive-mailbox, Assigned: KaiE)

References

Details

As we saw from bug 98182 and bug 99566, PSM needs to be enhanced.

When a profile is changed, the security component must behave failsafe. We have
different approaches, which I want to summarize.


Assumptions
===========
NSS flushes all persistent data to disk immediately, no real need to call
NSS_Shutdown

PSM is different from most components of Mozilla.
It uses NSS for storage of personal security information.
NSS is using global data.
This global data belongs to exactly one profile, i.e. cert/key/token files on disk.

To make SSL work, PSM creates links from network objects to NSS objects.

As long as the glued network objects live, NSS must not be shut down, or we
might crash.

PSM, i.e. the nsNSSComponent, is the only module that calls NSS directly.


Solution 1
==========
In my opinion, in order to switch a profile, shutting down and restarting the
application is the easiest way to get complete crash safety. This could be
automated by starting the application again, create a communication channel
between old and new instance, delay loading of the new instance in the very
beginning, until the old instance says it is down.


Solution 2
==========
If a complete restart is not approriate, we must carefully monitor all
references to NSS objects we create.

It is very likely, that NSS objects are still referenced, if the user asks to
switch a profile while there is still SSL activity.

Therefore, as a minimum, all network activity should be shut down synchronously
before an attempt to switch a profile is being made.
This will result in all sockets to close, which will clear references to NSS.

When the event "profile will change, flush data", we do nothing.

When the event "new profile selected, prepare for using it" is seen, PSM frees
its own references to NSS. Nothing else has be to done. As we have required, all
network activity is already down, and the socket close code already free the
open references. We call NSS_Shutdown, and call NSS_Init for the new profile.


Solution 3
==========
If we can't shut down the network completely before switching profile, then we
must make PSM completely failsafe.

This includes serializing all NSS access with the NSS re-init code.

To make sure, NSS data from the old session is not used after the re-init, PSM
needs to shut down internally.

In one way or another, we must invalidate all PSM/NSS in-memory data. This could
be made possible by always keeping a list of all open SSL sockets / file
descriptors. 

When we receive the "Profile going down" event, we iterate over all descriptors,
marking them as invalid.

When we receive the "New profile coming up" event, we free all additional
references to NSS, call NSS_Shutdown, and call NSS_Init for the new profile

In addition, all PSM code that is asynchronously called by other threads, must
check whether the data has become invalid, before trying to call NSS functions
(this includes the functions that are hooked into the file descriptior I/O
layer), and simply fail gracefully.


Question
========
I fear solution 1 is too difficult to implement on all platforms, and the logic
needed to be rewritten for every embedding platform.

From what I've learned from bug 98182 and bug 99566, I guess we need to go with
Solution 3.

Do you agree?
I agree that Solution #3 is the best. How much work is it?
If I do it, I need to review most of the PSM code, to find out what needs to be
done exactly, before I can make a good guess.

I suggest, to minimize reviewing work, the review for this bug should be done
with the requirements for bug 101005 in mind, and to first develop a design, how
to fix both issues.

Combining the work for both bugs makes sense, as both will require adding locks
for serialization.
Priority: -- → P1
Target Milestone: --- → Future
*** Bug 98182 has been marked as a duplicate of this bug. ***
Bob, can it cause trouble if NSS_Shutdown is called, without ever having called
a NSS_Init function?

I just saw that this happens. If a user has multiple profiles, during
application startup, the nsNSSComponent is constructed and destructed at least
once prior to the real start.

The destructor of PSM/nsNSSComponent unconditionally calls NSS_Shutdown.

This happens (failed construction of NSS component) for every DOMWindow which is
created before the profile is set. If you create a new profile, it will happen
several more times as well - once for each of those windows. Whether it's
hazzardous to call shutdown before init, there's extra work going on - string
bundles are loaded before it fails due to not having a profile. You might want
to determine that up front.
Oop,s Sorry, I need to post to the bug, not reply to bugzilla...

Kai, It is safe to call shutdown even if you haven't called init. Shutdown
checks each pointer type before it frees it. The only worry is shutdown and init
are not thread safe, you will have bad results if call either multiple times on
different threads at the same time.

bob
Implementation design for making PSM safe for profile switching
---------------------------------------------------------------

Definitions
===========
I define the term "NSS session" as the time between a call to NSS_Init and
NSS_Shutdown.

A profile switch will result in terminating a NSS session, and starting a new one.

I define the term "NSS object" as an instance of a structure defined by NSS,
which PSM has the ownership for, and which must be destroyed using a NSS
function. An example is a CERTCertificate*.

Design Idea
===========
PSM already stores references to NSS objects (inside other C++ objects), and
those might by referenced by code all over Mozilla (e.g. the network code).

Just before NSS_Shutdown is called, PSM must invalidate all references to
underlying NSS objects and at least mark them as invalid, ideally free/release
them, too.

The application must make sure that the NSS objects are not accessed in a later
NSS session, i.e. after they have been marked invalid (because they might refer
to a token or cert database or whatever that has been destroyed by NSS_Shutdown).

Implementation Idea
===================
To store this valid/invalid state, we need to associate each NSS object with
user data. This is something NSS does not allow for.

I therefore suggest that the application creates a little wrapper object for
every owned NSS object. This just contains the pointer to the NSS object, a
valid/invalid flag, and a type information, to allow the PSM code that
terminates the NSS session to call the correct NSS cleanup function (e.g.
CERT_DestroyCerticicate).

NSS_Shutdown does not cleanup everything. It cleans some global NSS data
structures, but references obtained by the application keep some objects alive.
Therefore the code that terminates a NSS session must destroy all those objects
(to avoid leaks).

To allow for this, PSM needs to have a global list of all NSS objects created
during a NSS session (there is currently no such list).

We must make sure that no NSS function is active when NSS_Shutdown is called.
The application might even make simultaneous calls to NSS functions from
different threads.

We need to introduce a lock combined with a counter, a condition variable.

Whenever a NSS function is going to be called, we lock and increment the "NSS
calls active counter" condition variable. While we lock for incrementing the
counter, we need to check for the valid/invalid flag as well. After the call, we
lock and decrement. If we reach zero, we need to signal the condition variable.

The code that is called on profile switch must not call NSS_Shutdown until it is
able to obtain the lock with the counter having reached zero. By using a
condition variable we can wait for the condition (which might block until no NSS
functions are active).

Implementation Details
======================
To implement the above, he following work is required:
- define the wrapper class for NSS objects
- define and introduce the global list
- wherever NSS objects are created/stored/referenced/freed, change the code to
use the new wrapper class and update the global list
- wherever NSS functions are called, add the lock/check/condition code, and the
code to fail gracefully
- create the NSS session shutdown code


I discussed this with Bob Relyea, and he agreed that the above seems to be the
safest approach.
Status: NEW → ASSIGNED
I just counted how many times PSM makes a call to a NSS function, and the count
of 460 different calls frightens me.

While my design in theory sounds nice, it means very much work.

I feel tempted to ask again for synchronous network shutdown, i.e. solution 2.
What about the following, much simpler idea:

The code that initiates profile switch brings down all network connections
(which will happen asynchronously), maybe by going offline as suggested.

The network component could provide a status information, whether it is
currently completely idle.

As long as this flag does not report idle, the profile switching code sleeps a
few microseconds and then retries. It blocks. I think it is acceptable to block
there, the profile switching code should make sure that it is safe to go down.

After that has happened, the profile switch can continue and post the events.

For every class in PSM that stores references to NSS objects, we introduce a
class instance counter. When the profile switch event is seen in PSM, we add
assertion checks to see whether the instance counts of all PSM classes have
reached zero. If we arrive here without the counts having reached zero, we have
bugs in PSM, and/or need to make sure that more referencing components are shut
down.

When all counts have reached zero (and PSM has freed the global references that
PSM keeps), we can safely call NSS_Shutdown and NSS_Init.
When the profile switch event notification returns, online made can be reactivated.
Conrad, what do you think?
Kai, I was thinking along similar lines - writing up in between your comments
and then the phone rang. Anyway, here's what I was thinking at that time.


> it means very much work

Alright, plan 2 is sounding more realistic.

Is synchronous shut down of *all* network activity needed if only PSM is this
difficult to shut down? Is it possible to, on receiving a profile shutdown
notification to check and see if PSM has any outstanding transactions? And then,
only in that case, block for some time until things have completed?

If we went offline as the first step of shutting down the profile, the stopping
of net activity is not synchronous. But, can any new activity be started after
shutdown? The reason I ask is that if before going offline you could know that
no PSM was "idle" and then you go offline, can  you be sure that no new PSM
activity will be started as a result of the asycc completion of other things?
Generally I think we should decrease complexity, and just shutting down all
network activity, maybe even shutting down all application activity whatsoever
sounds very safe.


> Is synchronous shut down of *all* network activity needed if only PSM is this
> difficult to shut down?

Yes, see the final paragraph.


> Is it possible to, on receiving a profile shutdown
> notification to check and see if PSM has any outstanding transactions? And then,
> only in that case, block for some time until things have completed?


Yes. I thought about checking internally whether there are no open SSL sockets,
no referenced certificates for page information, etc. And I could provide this
boolean status to you over an interface to the NSS component. However, I fear it
is not safe enough, see next answer.


> If we went offline as the first step of shutting down the profile, the stopping
> of net activity is not synchronous. But, can any new activity be started after
> shutdown? The reason I ask is that if before going offline you could know that
> no PSM was "idle" and then you go offline, can  you be sure that no new PSM
> activity will be started as a result of the asycc completion of other things?

That is possible.

For example, if the user has configured a proxy, this involves network activity
that first communicates in plain text, and once the handshake with the proxy is
finished, the SSL handshake is initiated over the already connected socket.
The same happens, when a user uses SSL/SMTP.If there is still application
activity at the time you query for the "PSM idle state", it might indeed report
idle, but a moment later the application logic that requests a https page over a
proxy or sends a mail message can initiate the SSL mode.
Changing my prefered e-mail address.
Assignee: kai.engert → kaie
Status: ASSIGNED → NEW
Status: NEW → ASSIGNED
Bob:

Darin, Conrad and Kai had some chats outside Bugzilla.

We currently favorize shutting down all network activity before we do the
change, and before we re-init NSS.

We already know how to make the shutdown, but as shutting down happens
asynchronous with a small delay, we need to wait until the state has been
reached, where we know that all sockets are down. I want to do this by using a
counter of currently open sockets and we want to wait until this has reached
zero. (We switch the application to an offline mode, so it is guaranteed that no
other sockets will get opened as soon as we are in offline mode)

I see hat PSM uses a call to SSL_ImportFD. I haven't looked into the details of
this function, but I guess that this causes NSS to layer itself on this FD.

Although currently PSM does its own additional layering, I would prefer not to
do the reference counting there. Nelson created bug 99303 where he requests to
eliminate that layer.

So here is my question: Is there a function we could use to query NSS: "How many
sockets are currently in use by SSL?".

We want to either poll this function periodically or wait for a signaled
condition variable, whatever is easier.

If there is currently no such function in NSS, should we:
- add such a function?
- or stick with PSM layering?
- or is there any other approach we could use inside PSM to monitor the "socket
closed" event to do the bookkeeping?

The best people to answer your questions on how SSL sockets and NSPR sockets
relate and are layered on each other are wan-teh and nelson, I'm adding them to
the CC list.
Solution 3 doesn't make sense.  (Name one application that
can switch user profiles before all open objects that
belong to the previous user have been closed.)  I am glad
that you decided not to implement solution 3.

> Is there a function we could use to query NSS: "How many
> sockets are currently in use by SSL?".

No.  If all SSL sockets are opened and closed by PSM, you
can easily do the bookkeeping and monitor the "SSL socket
closed" event with an integer counter, a PRLock, and a
PRCondVar.  But it seems that you should wait until all
sockets (SSL or non-SSL) have been closed before you can
switch user profiles.  Therefore, I think it is probably
Necko that should be doing the bookkeeping and monitoring
the "socket closed" event.

Here is some pseudocode.

1. Variable declaration and initialization:

PRLock *lock;
PRCondVar *cv;
unsigned int nsock;

lock = PR_NewLock();
cv = PR_NewCondVar(lock);
nsock = 0;

2. When a new socket is opened:

Open a new socket;
PR_Lock(lock);
nsock++;
PR_Unlock(lock);

3. When a socket is closed:

Close a socket;
PR_Lock(lock);
nsock--;
if (nsock == 0) {
    PR_NotifyCondVar(cv);
}
PR_Unlock(lock);

4. The thread that monitors the "socket closed" event would do:

PR_Lock(lock);
while (nsock != 0) {
    PR_WaitCondVar(cv, PR_INTERVAL_NO_TIMEOUT);
}
PR_Unlock(lock);

wtc: Thanks for your comments and suggestions.

CC'ing Darin.

Conrad, Darin: wtc suggests to have necko doing bookkeeping to monitor the
network down event. I guess, as the condition variable can't be shared between
components directly, we needed to introduce a new interface, implemented by
necko, for querying and monitoring the network state.

We could add a blocking (no timeout) method to that interface, which waits for
the condition variable.

That would mean in detail:
- necko always makes the "active socket count" bookkepping, together with
locking, increasing and decreasing the count as sockets are closed and opened
- necko implements the new interface, e.g. nsINetworkDownWatcher
- necko triggers the condition variable whenever the count reaches zero
- profile shutdown requests offline mode
- profile shutdown code sends out the events
- PSM receives the "profile up" event and calls the blocking method
nsINetworkDownWatch::WaitUntilAllSocketsDown
- PSM does NSS shutdown and init.
- PSM returns and profile event posting can continue.

What do you think about that suggestion?
necko already uses the observer service to notify observers that necko is going
offline ("network:offline-status-changed").  either this notification should be
delayed until necko is really offline, or there should be an additional
notification when necko is really offline.  so, i don't see any need to a
nsINetworkDownWatcher type interface... or, do you really really need a
condition variable to wait on?  is an async callback not sufficient?
You say you are using the observer service. That is the same method that the
profile code is using for its notification.

Is my assumption correct, that a queue is used for all events, and PSM will
receive the events "one after another", i.e. if we are currently processing an
event, we will not receive the next event until we return from processing the
previous event?

If that is the case, we can't use that notifiation. Because when we see that
profile change event, we need to block within that method until the network is down.
If we can assume that mutliple events can be processed at the same time by PSM,
then we can use your event "all sockets really down".
what thread do you intend to block?
I'm not sure on which thread the profile switch events do arrive. I'd have to
debug this to find out. Easiest would be if someone knows already. Is it the
UI/main thread? If nobody knows, I will debug.
> Because when we see that profile change event, we need to block within that
> method until the network is down.

Not nescesarily. If the profile mgr could, as the first step of switching the
profile, request the offline state and idle until it knew that a quiet state had
been  reached (possible, Darin?), by the time it sent out the
"profile-before-change", anybody observing that event could know for a fact that
all network activity had  stopped. By putting the responsibility on the profile
mgr, PSM wouldn't really have to block at all.
conrad, that sounds like a good idea, though it does make the profile manager
suddenly dependent on necko, right?
If I was doing this in profile mgr (waiting for I/O to finish after going
offline), could I query the inUseTransportCount attribute on
nsISocketTransportService in order to know when we're really offline?

But, is it easier and without dependencies at all to have
nsIIOService::SetOffline() do the idling? Then, it's all within necko.
SetOffline() could have a boolean param with which you could specify whether or
not to idle and wait.

Even though, as far as dependencies go, depending on necko is not so bad since
necko is such core code, it would be good to avoid any dependencies.
conrad: SetOffline cannot block waiting for the socket transport service to
shutdown completely.  it is possible, however, to have necko notify an observer
topic when the socket transport service is shutdown.  this means making delaying
profile change notification until this notification is fired, which should be
fine.  who is calling SetOffline(TRUE) in the profile change case?  is it the
profile manager?  if so, then it makes perfect sense for the profile manager to
register as an observer for notification of this event being completed.
> who is calling SetOffline(TRUE) in the profile change case?

That's the problem - nobody is. 

> if so, then it makes perfect sense for the profile manager to
> register as an observer for notification of this event being completed.

That's what I'll do.

Does this bug need to be split up into bugs which block this one? I'll do the
profile mgr part of this. In order to do that, sounds like something needs to be
done to necko to send out notification that the offline state is fully offline.
yes, file a bug against me to add support for that.
Necko bug for this is bug 104020 and Profile Mgr part is bug 104021. 
Making dependent on new bugs.
Depends on: 104020, 104021
Blocks: 70229
target->2.2
Target Milestone: Future → 2.2
I realize the patch that I just attached to bug 99566 will fix this bug, too.

Conrad, do you want to review the profile relevant code in that patch?
Blocks: 108795
Blocks: 107624
I think this is now fixed by todays checkin for bug 75947.
Please reopen if you find additional problems.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Verified per kaie's comment.
Status: RESOLVED → VERIFIED
Product: PSM → Core
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.