Closed Bug 145322 Opened 22 years ago Closed 22 years ago

NSS does not manage PKCS#11 sessions well

Categories

(NSS :: Libraries, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bugz, Assigned: bugz)

References

Details

Attachments

(6 files, 6 obsolete files)

There is a related bug, bug 126840.  That bug is currently targeted at 3.5; I'm
going to make this one 4.0.

Using the PKCS#11 logging tool, I have tested a server doing full handshakes,
client auth on both handshakes, with 20 threads.  Hitting it with strsclnt, I
have determined:

* Each connection uses an average of 60-70 PKCS#11 sessions
* About 25% of total execution time is spent in C_OpenSession/C_CloseSession

If I'm not mistaken, NSSCryptoContext was created to solve exactly this problem.
A single context, holding a single PKCS#11 session, should suffice for multiple
crypto operations, as long as they are done sequentially.  Currently,
PK11Contexts are created for a single operation only.

Opening bug to track issue during 4.0 design/development.
setting 4.0
Priority: -- → P1
Target Milestone: --- → 4.0
Attached file profile data from selfserv (obsolete) —
This data was collected from the PKCS#11 logger (bug 98926).  Notice how much
time was spent in C_CloseSession, and the number of calls overall.

In this case, the server ran on 20 threads and received 1000 connections.  The
number of sessions created was about 16 per connection (I made a mistake in the
opening comment; it was the *client* that used about 60-70 PKCS#11 sessions per
SSL connection).
Attached file profile data from strsclnt (obsolete) —
This is the client side data (I profiled the client against a non-logging
server, then the server against a non-logging client with the same arguments).
Thanks for writing this bug to track this issue, Ian.
Assignee: wtc → ian.mcgreer
Looking at this closer, I am wondering if we should just
make this bug a duplicate of bug 126840 and target it 4.0.
Or do we want to use bug 126840 for what we can do in NSS
3.x and use this bug for 4.0?
> Or do we want to use bug 126840 for what we can do in NSS
> 3.x and use this bug for 4.0?

That's what I intended.  I think there is only so much that can be done in 3.x,
and this bug is a tracking bug to make sure this issue is addressed in the
design of 4.0.
Ian proposed a solution by email.  Ian, could you attach
your proposal to this bug and find out how many PKCS#11
sessions per SSL connection will be required with your
proposal?  Thanks.
I need to modify items 2 and 3 from Part III, in which I discussed a couple of
areas where session use may be reduced.

Starting with (3), I opined that one session could suffice for both MD5 and SHA1
hashing.  Since the hashing being done is multipart, this is completely wrong.

Now, for (2), I wondered why we needed to copy both hashing sessions in every
call to ssl3_ComputeHandshakeHashes.  At first glance, I thought
PK11_CloneContext just returned a context "like" the one passed to it, e.g., a
new MD5 context.  But, of course, it actually copies over all of the data from
the previous context, which is needed to continue a multipart operation. 
However, I still do not believe that new sessions are necessary.  Here is the
current algorithm:

1. save the session state
2. create a new session
3. restore the old session's state into the new session
    (at this point, the two are identical)
4. continue multipart digest operations with the new session
5. finalize the new session (that is, get the digest output)
6. close the new session

Steps 1-3 are done in PK11_CloneContext.

I believe we can achieve exactly the same effect, without creating a new
session.  Here is what I would propose:

1. save the session state
2. continue the multipart digest with the session
3. finalize the session
4. restore the saved state into the same session
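To make the comparison concrete, here is a rough sketch of both flows in terms
of the existing PK11 calls (md5 stands in for one of the in-progress handshake
hash contexts, the 256-byte state buffer is only an illustrative size rather
than a value from any patch, and error checks are omitted for brevity):

    unsigned char state[256];       /* illustrative size only */
    unsigned char digest[MD5_LENGTH];
    unsigned int digestLen;
    int stateLen;
    SECStatus rv;

    /* current flow (steps 1-6 above) */
    PK11Context *md5copy = PK11_CloneContext(md5);   /* 1-3: new session,
                                                      * state copied over */
    rv = PK11_DigestFinal(md5copy, digest, &digestLen,
                          sizeof digest);            /* 4-5: finish the copy */
    PK11_DestroyContext(md5copy, PR_TRUE);           /* 6: close its session */

    /* proposed flow (no new session) */
    rv = PK11_SaveContext(md5, state, &stateLen,
                          sizeof state);             /* 1: save the state */
    rv = PK11_DigestFinal(md5, digest, &digestLen,
                          sizeof digest);            /* 2-3: finish in place */
    rv = PK11_RestoreContext(md5, state, stateLen);  /* 4: put the state back */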
Retargeting this bug to 3.6.  I have found that most of my proposed changes can
happen in 3.6.  Since this bug has more info, using it.
Target Milestone: 4.0 → 3.6
The function ssl3_GenerateSessionKeys makes four calls to the function
PK11_SymKeyFromHandle, to obtain { client_write, client_mac, server_write,
server_mac }.

That function obtains a new PK11SymKey by one of the following steps:

1.  The key is obtained from the freelist.  But the key does not have a
session, so a new session is created.

2.  The key is obtained from the freelist.  The key has a valid session.

3.  A new key is created, with a new session.

The end result is always a key with a valid session.  However, the session is
immediately destroyed, and the key adopts a session from a "parent".  In this
case, the parent is the master_secret key.  If the parent owns its session, the
first key to adopt the session also becomes the owner.

This means that in cases (1) and (3), a new session is created and immediately
destroyed.  Additionally, only one of the five keys (including the
master_secret) "owns" a session.

After returning from ssl3_GenerateSessionKeys, each key is passed to
PK11_CreateContextBySymKey.  This function _always_ creates a new session.

The end result is that the process of generating contexts for 4 symmetric keys
uses between 4 and 8 sessions, depending on the state of the freelist.  Of the 4
resultant PK11SymKeys, at most one will end up with a valid session.  If the 4
keys all end up on the freelist, only one at most will provide a valid session
to the next consumer.

My patch makes the following changes:

*  PK11_SymKeyFromHandle does _not_ free the session and adopt the
parent's session.

*  PK11_CreateContextBySymKey _does_ adopt the symkey's session for the
context, if the symkey owns the session.

*  PK11_DestroyContext returns a valid session to the symkey if the context
owns the session.

The effect of these changes is to cut the number of sessions required to create
a context for a symkey in half.  Additionally, _all_ symkeys end up with valid
sessions when they are destroyed.  Instead of < 25% of the freelist keys having
a valid session, ~100% do (it may be less than 100% if other parts of the code
shuffle around the key's session ownership).
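A minimal sketch of the third change, the "return the session to the symkey"
step in PK11_DestroyContext (a hypothetical paraphrase, not the literal patch;
context->key, context->slot, and pk11_CloseSession are assumed pk11wrap
field/helper names, while ownSession and sessionOwner are the flags discussed
in this bug):

    /* If this context adopted the symkey's session, hand the session back to
     * the key instead of closing it; otherwise close it as before. */
    if (context->ownSession && context->key && !context->key->sessionOwner) {
        context->key->session      = context->session;
        context->key->sessionOwner = PR_TRUE;
    } else {
        pk11_CloseSession(context->slot, context->session,
                          context->ownSession);
    }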
This proposal was outlined in comment #9.

One problem is that PK11_SaveContext does not return a buffer, but instead
requires that a static buffer is passed to it.	I would prefer to have a
function that allocated as much space as was required, but there isn't one.
Compare to tables in attachment #85660 [details].

The average number of session/connection went from:

[no client auth]  18 to 4
[auth 1st hs   ]  28 to 12
[auth 2nd hs   ]  44 to 15

The effect of improved freelisting is quite noticeable -- in the long run, most
of the new symkeys do not need new sessions.
Attachment #84091 - Attachment is obsolete: true
Attachment #84093 - Attachment is obsolete: true
*** Bug 126840 has been marked as a duplicate of this bug. ***
Ian-

My selfserv stress testing on soupnazi showed a solid 10% gain on full
handshakes by applying the 2 patches attached to this bug! 

I've attached an html file.  All of the data was gathered today.  Full and
Restart runs with and without the zone allocator, and a matching set of
runs after doing nothing more than applying the patches and rebuilding
on soupnazi, a 4x400 MHz Solaris 5.8 machine.  I hit it with 8 clients
running under 3.3.1 with 100 threads each.  

The runs with your patches have a "-ian" suffix.

Previous to applying the patches to my tree on soupnazi, I ran the
collector under dbx for an array of threads/ops on rsaperf (the test
program that exercises the session code exclusively).  The results
are in /u/kirke/nss/145322/analyzer.  I have yet to analyze the
results.

I'll do another set of collector runs with the patched tree,
and see where contention has fallen.

Good work.

kirk
Ian,  

First, 6% improvement on FULL RSA handshakes with zones is amazing!  
That's great!

I'm guessing that these changes cause NSS to have a higher number 
of concurrent PKCS#11 sessions than before.  Do you have any statistics
that measure the maximum number of concurrent sessions with and without
your changes?

I seem to vaguely recall that some of the behaviors being modified here 
were specifically done to reduce the number of concurrent sessions.  
Bob will remember that better.  
> I'm guessing that these changes cause NSS to have a higher number 
> of concurrent PKCS#11 sessions than before.  Do you have any statistics
> that measure the maximum number of concurrent sessions with and without
> your changes?

No, but I agree that would be useful, and easy to add.

> I seem to vaguely recall that some of the behaviors being modified here 
> were specifically done to reduce the number of concurrent sessions.  
> Bob will remember that better.

From looking at the code, I believe that was the intent.  However, the existing
code simply reduced the lifetime of certain sessions, not their overall number.
The idea seemed to be to coalesce 5 sessions for 5 keys into 1 session for
the lifetime of the keys.  But these same keys will belong to contexts later,
which also need a session.  So the end result is 6 sessions for 5 keys and 4
contexts (the master_secret key does not have a persistent context).  With this
patch, there are 5 sessions for 5 keys and 4 contexts, without the drawback of
creating/destroying sessions constantly.

But I think the real win was the freelist.  The numbers show that each
connection is actually creating *fewer* than the required number of sessions,
because in the long run most sessions are coming off the freelist, so we don't
need new ones.
Kirk,

Something is wrong with your restart results; with zones they should be in the
500-600 ops/s range, not 200.  Your zone results should also be higher.  Are
you sure you enabled the zone allocator in your tests?
My previous attachment utilized an older tip (May 23), and the zone
allocator was not engaged.  I created a new workarea and repeated all
the runs.  I also saw no problems stressing this patch overnight.

	full	15154/139.27 = 108.81	( 8.81% speedup)
	restart 20534/180.74 = 113.61	(13.61% speedup)

The gain is less dramatic on full handshakes with the zone allocator
engaged.  I've repeated these runs several times with similar results.

	full	21427/211.05 = 101.52	( 1.52% speedup)
	restart 52215/486.46 = 107.34	( 7.34% speedup)

It's interesting that restart-zones-ian was 12% idle.  All the
other restart runs were 8% idle.  Also, full-ian is 2% compared
to 1%.  In both of the restart runs with the zone allocator
(with and without the patch) selfserv seems to "cough"; ops/second
fluctuates rather than settling.

With and without the patch, restart runs with the zone allocator
also yield timeouts and down-throttling on the clients:

+/var/tmp/workarea-nss331/mozilla/dist/SunOS5.8_OPT.OBJ/bin/strsclnt
	-C c -t 100 -D -p 12344 -c 2147483647 -d
	/u/kirke/nss/selfserv/certs soupnazi
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: PR_Connect returned error -5990:
I/O operation timed out.

strsclnt: -- SSL: Server Certificate Validated.
max_threads set down to 16

The patch seems to accentuate this preexisting anomaly.
The peak run in both is about 15 seconds into the run.

55468/522.15 = 106.23	(average was 6.23% off peak)
50447/486.46 = 103.70	(average was 3.7% off peak)
Attachment #86268 - Attachment is obsolete: true
Attachment #86268 - Attachment is obsolete: false
Attachment #86301 - Attachment is obsolete: true

Seeing the results from Kirk's analyzer runs of opening/closing sessions should
be interesting.  One curious thing I noticed was that while the overall number
of calls to C_OpenSession went down after applying the patches, the percent time
spent in those functions did not change.  Compare the numbers in attachment
#85660 [details] and attachment #86268 [details].

Looking at the data more, I can see that the overall time spent in the softoken
decreased by about 10% (consistent with improved performance).  But
C_CloseSession is still costly.  Here are results from a server with 100 threads:

Before the patches:

Function                     # Calls   Time (ms)  Avg. (ms)     % Time

C_CancelFunction                   0          0       0.00      0.00
C_CloseAllSessions                 1          0       0.00      0.00
C_CloseSession                 18015     335107      18.60     13.94
...
                   Totals     170620    2403601

After:


Function                     # Calls   Time (ms)  Avg. (ms)     % Time

C_CancelFunction                   0          0       0.00      0.00
C_CloseAllSessions                 1          0       0.00      0.00
C_CloseSession                  4015      26266       6.54     12.76
...
                   Totals     144809     205884


You can see the ~10% decrease in total time, and that the number of calls to
C_CloseSession decreased by a factor of 4+, and that the time spent in each call
decreased by a factor of nearly 3.  Yet, the percent time did not change
significantly.

If one looks at the same results for a single-threaded server handling the same
number of connections, the percent of time spent in C_CloseSession is less than
1%.  I believe this shows there are still lock contention issues involved in
session handling, which may also explain why Kirk sees an increase in idle time
after the patches.
Regarding Nelson's question in comment #16 about the number of concurrent
sessions: these patches drop the number slightly.  I used a client with 100
threads doing 1000 connections, and got the following numbers:

# server threads / max concurrent sessions before patch / after patch

1    /  33  /  30
10   /  79  /  78
100  / 651  / 633

Again, I think the patches affect the lifetime of sessions more than the actual
number.  For example, if a key retrieved from the freelist had a valid session,
that session was closed, then a new one opened immediately after.  Now, the
session is kept.  So there's less busy work, but the same overall number.
I like the idea here, but I would suggest a couple of modifications:

1) Right now, if the context doesn't own the session, it always saves its data.
That means every operation will do a save state/restore state. These functions
will fail on many tokens, and will produce unnecessary work in our token. I
suggest the following: Add a variable in the context that says we are using the
key's session. Add a variable in the key that says it's session is in use by the
context. If we try to create a new context for a key whose session is already in
use, then we use the standard 'NewSession' call to create a new session. Ian's
current code would use the new 'using key's session' flag to determine if it
should 'close' the session or not.

2) I agree we should add a new 'saveState' function which returns allocated
data, rather than deals with a fixed buffer. The underlying calls already
support this, so the PK11_SaveStateAlloc() would be relatively easy to write.

bob
Bob,

1.  I believe this case is already taken care of.  Note that when a new context
takes a session away from a symkey, it sets symkey->sessionOwner to false.  And
it only takes the session if symkey->sessionOwner is true.  If a symkey has
already released its session to a context, any new contexts using the symkey
will create a new session.

2.  Agreed.
The big issue with #1 is what happens if the Context doesn't own its session.

If the 'own session' flag is false for the context, then the context will assume
that it needs to save and restore its state every time. In this case I don't think we
should 'overload' the own session flag here.

On the key side, I think it's cleaner if we make it explicit that the key is
sharing the session with the context (as opposed to sharing the session with the
global session). I'm a little worried about the case where the key is freed
before the context. I'm not sure that that ever happens (and I think other
things won't work if it does), but I wouldn't want the session to get leaked in
that case.

bob
Regarding Bob's second suggestion, I'd prefer to avoid the extra malloc/free
calls for such a short-lived buffer.  
This patch does not change the context's session ownership properties at all. 
Here is the code where the context gets a session:

-    context->session = pk11_GetNewSession(slot,&context->ownSession);
+    if (symKey && symKey->sessionOwner) {
+        /* The symkey owns a session.  Adopt that session. */
+        context->session = symKey->session;
+        context->ownSession = symKey->sessionOwner;
+        symKey->sessionOwner = PR_FALSE;
+    } else {
+        context->session = pk11_GetNewSession(slot, &context->ownSession);
+    }

The only change is that *if* the key owns a session *then* the context takes it.
In all other cases, the context still attempts to get a new session.  I don't
see how this change would create a case (that didn't exist before) where the
context does not own a session.

As for freeing, PK11SymKeys are refcounted, and the context owns a reference.
Thus it is safe to free the context before or after freeing the key.
Ian adapted a program called rsaperf (which tests cryptoki) to work
with NSS.  The idea is to exercise libsoftokn3.so exclusively, calling
the Session functions directly without the NSS layer.  Ian expected to
see lock contention in NSC_OpenSession and NSC_CloseSession.

I collected Solaris analyzer data running rsaperf under dbx for an
array of threads and operations.  Considering Julien's input, the single
operation run with many threads is the most relevant run to look for lock
contention (make us do mostly opens and closes).  I was surprised to see
"synchronization wait time" was big in decrypt.  NSC_CloseSession is
off the map.

191.291		<Total>
173.369		mutex_lock		(104.187 in malloc, 68.159 in free)
158.251		NSC_Decrypt		(157.564 in RSA_DecryptBlock)
 28.083		NSC_OpenSession	(28.037 in pk11_NewSession)

The big eater was rsa_PrivateKeyOp with 154.859.  Of the wait time
attributed to NSC_OpenSession, almost all is spent at the PORT_Alloc
call at the beginning of pk11_NewSession.  The remainder is spent at
PR_NewLock in pk11_NewSession.

With 1000 threads and 100 operations, results were similar.
The time in rsa_PrivateKeyOp is attributed to the PR_Lock call in
generate_blinding_params.

    /* Acquire the list lock */
    PZ_Lock(blindingParamsList.lock);				<-----here
	    /* Walk the list looking for the private key */
	    for (el = PR_NEXT_LINK(&blindingParamsList.head);
"mozilla/security/nss/lib/freebl/rsa.c" line 581 of 930 --62%-- col 8

Results were similar with the zone allocator engaged.  Running with
a single operation revealed about 13.925 of a total of 220.141 secs of
wait time was in NSC_OpenSession.  187.511 was attributed to NSC_Decrypt.

Apparently, generate_blinding_params() in decrypt is where rsaperf
does almost all of its waiting.  The time spent in NSC_OpenSession is
small compared to NSC_Decrypt, even doing a single operation (making
the loop after the open small).

About 80% of the wait time in NSC_OpenSession is at the PORT_Alloc
call, the remainder is at the PR_NewLock call.

Ian, you can explore this data with the Solaris analyzer:

	<under Solaris>
	cd /u/kirke/nss/145322/analyzer-patched
	analyzer rsaperf.1000-1.er
	cd /u/kirke/nss/145322/analyzer-patched-zones
	analyzer rsaperf.1000-1.er
Kirk,

I thought you were going to remove the crypto calls from that program, and just
have it open/close sessions.  If the program is doing RSA ops, there's no way
any other functions will show up :)
Ian,

I've attached my version of rsaperf.c for reference.  I removed (ifdef'd
away) the Encrypt and Decrypt calls, and now call C_CloseSession instead.

cd /u/kirke/nss/145322/analyzer-session-calibrate
analyzer rsaperf.1000-1.er

NSC_OpenSession comes out on top, with PR_Lock, at .406 of
the total .750 wait time.  It's spending all of that calling PR_Lock.

There are two PR_Lock calls in NSC_OpenSession.  Unfortunately,
the analyzer doesn't distinguish between them.  One locks the slot,
and the other locks the list of sessions while verifying that the
session ID we're about to return is unique.
Kirk-

This bug is getting rather lengthy.  I think we should restrict discussion here
to the number of sessions used by NSS.  I've opened bug 149756 to address
performance issues related to opening/closing sessions.
There are two issues open with this bug:

1.  Does the change to PK11_CreateContextBySymKey create situations where
contexts do not own sessions, and therefore save/restore state for every
operation?  I'm not convinced; see comment #26.

2.  Should we create a new function that allocs enough space for saving digest
contexts, or attempt to use stack memory (comment #22 and comment #25)?  The
current patch uses a fixed-size stack buffer; my worry is that while this
buffer is always large enough for the softoken, it may not be for other tokens.
PKCS#11 says nothing about how large the state of any operation may be.

Currently (before these patches), we do alloc space for the state.  This occurs
in PK11_CloneContext.  Therefore, if we create a new function that saves the
state by allocating a heap buffer for it, we would not be creating any new
allocations.

The only alternative I consider failsafe is to attempt to use a static buffer,
and alloc if it is not large enough.  This takes care of the 90% case (the
hashing is done in the softoken, and so we can easily have a stack buffer large
enough for the state) without breaking the 10% case (anything else).
Just to clarify, I was thinking of a function like this:

unsigned char *
PK11_SaveStateAlloc(PK11Context *cx, unsigned char *staticBuf, 
                    unsigned int staticBufLen, unsigned int *len,
                    PRBool *freeit)

If staticBuf is not NULL, and staticBufLen is >= the state's length, the
length goes into *len, and staticBuf is returned.  *freeit is PR_FALSE.

If staticBuf is NULL, or staticBufLen is less than the state's length, a new
buffer is allocated from the heap, of size *len.  The new buffer is returned,
and *freeit is PR_TRUE.

If an error occurs, NULL is returned.
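A minimal usage sketch under those semantics (hypothetical: cx is a
PK11Context holding an in-progress digest, the 2048-byte stack buffer is only
an illustration, and the function that eventually landed is the
PK11_SaveContextAlloc variant described later in this bug):

    unsigned char stackBuf[2048];
    unsigned char *state;
    unsigned int stateLen;
    PRBool freeit;
    SECStatus rv;

    state = PK11_SaveStateAlloc(cx, stackBuf, sizeof stackBuf,
                                &stateLen, &freeit);
    if (state == NULL)
        return SECFailure;
    /* ... finalize the digest, then put the saved state back ... */
    rv = PK11_RestoreContext(cx, state, (int)stateLen);
    if (freeit)
        PORT_Free(state);            /* only free a heap-allocated buffer */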
Ian, I like your suggestion.  

I think the PRBool is unnecessary, because the application can simply
compare the returned value against the value passed in as the staticBuf 
argument and free the buffer if they're not equal.

Also, I wouldn't call it a "static" buf, because it's not static (as that
term is defined by the C language).
How about "callerBuf" or "preallocatedBuf"?

Issue 1: I agree. You are basically adopting ownership of the session, so my
complaint is moot.

Issue 2: comment #32 looks good with Nelson's modification in comment #33.

bob
Weird.  My brain was thinking "stackBuf", and my fingers were typing
"staticBuf".  Must (really) be Friday.

At any rate, I think "preAllocBuf" is more clear.
This patch combines both patches.  It adds a function PK11_SaveContextAlloc, as
described in this bug.  Also, it includes a bugfix for NSC_GetOperationState:
that function would not return CKR_BUFFER_TOO_SMALL when the provided buffer
was not large enough, and that had to be fixed for this patch to work.
Attachment #86264 - Attachment is obsolete: true
Attachment #86266 - Attachment is obsolete: true
Last one was wrong patch.  Please review this one.
Attachment #87107 - Attachment is obsolete: true
Comment on attachment 87112 [details] [diff] [review]
combined patch again

The patch looks fine as is, though I do have some comments:

pk11_saveContextHelper should be a static helper function, so it's ok to adjust
its semantics with additional parameters so it doesn't automatically free
preAllocBuf if it is too small. That could simplify PK11_SaveContextAlloc() a
bit.
Attachment #87112 - Flags: review+
That function could use a lot of cleanup.  First of all, no caller of it ever
uses a value for staticBuffer besides PR_FALSE, so I'm not sure what the point
of that argument is.  Also, the function recurses when recurse == PR_FALSE,
which is counterintuitive to say the least.

I didn't want to confuse this patch with a lot of extra code.  What I'll do
instead is provide a patch-on-patch, so that we can take care of this here.
This patch fixes a potential memory leak, removes two unnecessary arguments,
and removes a recursive call (which eliminates a superfluous call to
C_GetOperationState).
Depends on: 150704
Comment on attachment 87132 [details] [diff] [review]
cleanup pk11_saveContextHelper

This seems the safest patch, though I would have preferred to just pass the
buffer in and look for a CKR_BUFFER_TOO_SMALL error return.
Attachment #87132 - Flags: review+
marking fixed.
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Comment on attachment 87112 [details] [diff] [review]
combined patch again

This patch introduced two regressions: bug 210660
and bug 202179.  Portions of this patch should be
removed.  See those two bugs for more information.