Closed Bug 329072 Opened 19 years ago Closed 19 years ago

client sometimes fails to authenticate despite having cert

Categories

(NSS :: Libraries, defect, P1)

3.11

Tracking

(Not tracked)

RESOLVED FIXED
3.11.1

People

(Reporter: nelson, Assigned: rrelyea)

References

Details

Attachments

(1 file)

Reportedly, a multi-threaded NSS-based https test client sometimes fails to send a client auth certificate in response to a server's cert request in a full handshake. I've seen ssldump output that seems to support this story. This is not NSS's strsclnt test program, but another test program. The server is requesting, but not requiring, TLS client authentication in every full handshake. The client behaves as if its cache is disabled, sending an SSL2-style TLS client hello in every connection, and doing a full handshake on every connection. The server requests client auth in every full handshake. Most times the client successfully authenticates. But occasionally the client sends an empty certificate message (which is how a TLS client indicates it has no cert with which to honor the client auth request) and also does not send a cert verify message (further confirming that it has no cert to send). Given that all requests for client auth are the same, the client should authenticate to all requests whether required or not. Although the test server is not requiring client authentication, it treats any handshake that fails to authenticate as an error. That's how this came to our attention. The problem is reported against 3.11, but I am told that it has been seen in 3.9 and 3.10 also, and is not a new problem. I'm leaving this bug in UNCONFIRMED state until I can review the test client source code. I think it likely that either a) there is a race in the client code somewhere that sometimes causes it to erroneously conclude that it has no cert, or b) an application bug somewhere has this effect. One thing is clear: NSS QA has NO client auth stress tests at all. So, we have no basis on which to say "it always works for us". This is the subject of bug 86179 and bug 220380.
Another apparent CERT DB race problem. Must be resolved for 3.11.1. Should see if we can reproduce this with a smaller test client and test server program than the ones that now demonstrate it. Also, need to double check that this is not due to an error in the present test client program.
Assignee: nelson → alexei.volkov.bugs
Status: UNCONFIRMED → NEW
Ever confirmed: true
Priority: -- → P1
Target Milestone: --- → 3.11.1
I have been looking at the web server test client. Some changes were made to it recently. It does use the session cache now - it doesn't set the SSL_NO_CACHE option on any sockets anymore. It uses the SSL_SetSockPeerID function to differentiate tests using different client cets in different threads. The test case I was given does 2 client auth requests from 2 threads, for a total of 4 operations. One of the 4 requests doesn't send a client cert at least 20% of the time. The problem never happens when running with a single-thread, so we have a race condition. I put assertions in the client cert selection callback. I found that it always finds the certificate using PK11_FindCertFromNickname, but it sometimes fails to get the private key - PK11_FindKeyByAnyCert sometimes returns NULL. I tested the client with our official 3.9.5 and 3.10.1 shared libraries, and the problem is also reproducible, so this is not a regression in NSS. I'm going to do some further investigation to find out why the key is not always found.
Assignee: alexei.volkov.bugs → julien.pierre.bugs
In comment 2 I meant that this was not a regression in NSS 3.11 . I went back to older versions of NSS : 3.3.12, 3.4.2 and 3.8.3 . I didn't see this problem with any of them . So, this is an NSS regression that was introduced between NSS 3.8.3 and 3.9.5 .
Status: NEW → ASSIGNED
The problem is in the function pk11_LoginStillRequired in pk11auth.c . This function is invoked by PK11_FindKeyByAnyCert to find out if a token login is needed. /* * Returns true if the token is needLogin and isn't logged in. * This function is used to determine if authentication is needed * before attempting a potentially privelleged operation. */ PRBool pk11_LoginStillRequired(PK11SlotInfo *slot, void *wincx) { return slot->needLogin && !PK11_IsLoggedIn(slot,wincx); } We have 2 threads trying to login at the same time to do the client authentication. One of them logs in first, and the login bit is set in the PK11SlotInfo structure. In the meantime the other thread tests the bit and it is already set, and that causes PK11_FindKeyByAnyCert to not try to find the key again, and fail. See the code sequence below : keyHandle = PK11_MatchItem(slot,certHandle,CKO_PRIVATE_KEY); error = PORT_GetError(); if ((keyHandle == CK_INVALID_HANDLE) && (error == SSL_ERROR_NO_CERTIFICATE) && pk11_LoginStillRequired(slot,wincx)) { /* authenticate and try again */ rv = PK11_Authenticate(slot, PR_TRUE, wincx); if (rv == SECSuccess) { keyHandle = PK11_MatchItem(slot,certHandle,CKO_PRIVATE_KEY); PORT_Assert(keyHandle != CK_INVALID_HANDLE); } else { PORT_Assert(0); } } The conditional block is never entered due to the race. So, keyHandle remains CK_INVALID_HANDLE . The proper fix should be to have a lock around the PK11SlotInfo and use it within pk11_LoginStillRequired. If we don't want to pay the price of an additional lock, we can just not call that function and always try again and authenticate. But this will reduce performance in legitimate error cases where we are already logged in and the key is missing. This regression was introduced in NSS 3.9.2 as part of Bob's checkins for bug 244907 and bug 244914 . I'm reassigning this to Bob.
Assignee: julien.pierre.bugs → rrelyea
Status: ASSIGNED → NEW
OS: Solaris → All
Hardware: Sun → All
Actually forget what I said about the lock in pk11_loginStillRequired. It wouldn't help, because the login status changes between the time PK11_MatchItem returns and pk11_loginStillRequired is invoked. We may have no choice but to eliminate the call and try again.
So this is not likely the explanation of mysterious server failures, since servers authenticate all their tokens before the start doing operations (that is login state is either always logged in or never logged in). The fix is simply acquire the needLogin state before issuing the MatchItem. That way we know if we needed to login sometime before. If we race between the early call and the MatchItem, the worst that will happen is we will call PK11_Authenticate again (noop) and MatchItem again (returning the same failure as before). Logout is a 'human' scale event. Users expect things to fail if the logout (pull a token, etc) in a middle of an operation, so there's no need to check pk11_LoginStillRequired() after the MatchItem call.
I have a patch, I'm checking for regressions now in all.sh, If I find none, I'll attach the patch and ask julien to test and review it.
There is concern here about breaking compatibility with the device ( & PKCS11 module) for which this code was originally developed, specifically a device known as "house key" that requires no login ever, not even as a prerequisite to use its private key. The objective IIRC was to completely avoid all logins for that devices. It's unclear to some of us how to test for regressions with respect to devices such as that. On thing that could use a little clarification (and documentation) is whether slot->needLogin means "this token sometimes requires login" (is not a house key), or whether it means "this token requires login AT THIS TIME before it can do any sensitive operations. I think that is equivalent to asking whether the value of needLogin ever changes for a token, except at token insertion/ removal time. This is YA case where the code itself does not suffice to define the intended behavbior. The question "what is this bit intended to mean" is not equivalent to "how does the code use this bit now?"
slot->needLogin is a cached value describing the attribute of the token itself. Either the token needs to log in or it doesn't. That value will only change if the token itself changes. My patch does not change this semantic. The race is between the is logged in state and the key search. If the key search fails because the token is not logged in, but the token becomes logged in (through the actions of another thread) before we check to see if it's logged in, the function will erroneously conclude that the key does not exist. The patch (which I'm attaching just now). Move the check before we key search. If the token is subsequently logged in, either the key search will succeed, or it will succeed in the 2nd check (if the key exists), or in the worst case, the key doesn't exist, but we will check for it twice.
see bug for description of this patch
Attachment #215431 - Flags: superreview?(nelson)
Attachment #215431 - Flags: review?(julien.pierre.bugs)
Comment on attachment 215431 [details] [diff] [review] Check login state before we try to find the key to prevent race r=nelson, provided a comment is fixed. This comment appears 3 times in this patch, in 3 different functions: >+ * prevent a login race condition. If le->slot is logged in between Trouble is, the "le" pointer exists in only one of the 3 functions. In the other two functions, the variable pointer is just "slot", not le->slot. So, the comment needs to be fixed in two of the 3 places it appears.
Attachment #215431 - Flags: superreview?(nelson) → review+
checked in on tip: Checking in pk11cert.c; /cvsroot/mozilla/security/nss/lib/pk11wrap/pk11cert.c,v <-- pk11cert.c new revision: 1.145; previous revision: 1.144 done I'd still like to see if julien's test case still fails. (and need second review for 3.11).
Comment on attachment 215431 [details] [diff] [review] Check login state before we try to find the key to prevent race The test case passes with this change.
Attachment #215431 - Flags: review?(julien.pierre.bugs) → review+
Bob, these question are not specifically about any patch, but seek explanation for some aspects of NSS's approach to token logins. (In reply to comment #9) > slot->needLogin is a cached value describing the attribute of the token itself. An attribute of the token's current dynamic state? Or of the token's static definition? > Either the token needs to log in or it doesn't. That is of course always true at any instant in time, whether the value is static or dynamic. > That value will only change if the token itself changes. Do you mean if the token changes state, such as being logged in, or logged out? Or do you mean if the token is physically replaced in the slot? Let me make this multiple-choice. Please choose A or B. Does the bit mean: a) this token's policy requires that it must (in general) be logged in before its private keys may be used. It may or may not be logged in at this moment, but it always requires that it be logged in when private keys are to be used. (This definition is a statement about the token's static policy, not about its present dynamic state. By this definition, the value for house key would be false, and for most other devices the value would always be true.) b) This token is not logged in at this moment, and must be logged in before its private keys can be used? (This definition is a statement about the token's present dynamic state. If the bit is true, it means that the token is not presently logged in, and it *implies* that the token has a static policy that requires a logged-in state to use private keys. If the bit is false, this has no implication about the token's policy.) I suspect the answer is a, because PK11_IsLoggedIn() would seem to be unnecessary otherwise. Here's a related question. I'm asking it here so that the answer can be recorded here for posterity. It seems that there might be two strategies for dealing with token logins. a) the "optimistic" strategy: Don't try to determine if login is needed before doing an operation. Just try the operation. If it fails and returns a CKR_USER_NOT_LOGGED_IN, then login and retry the operation. b) the "error avoiding" strategy: Try to determine if login is needed before doing every operation. If so, do the login attempt first. Then after the login succeeds, or if the login was not needed, attempt the operation once and do not retry if you get CKR_USER_NOT_LOGGED_IN. The latter seems like it would be more complicated to implement. I think the code we're doing now uses the "error avoiding" strategy. Why does it not use the simpler optimistic strategy? (There's probably a good reason. I simply do not recall it.) Thanks.
need loging is static to the token. it means that the token is one that requires login to do operations. If a given token has needLogin set, it will have needLogin set even if the token is logged in. (as you surmised, or at least trying to pull out, it is a static attribute of the token). It basically caches CKF_LOGIN_REQUIRED. needLogin can only change if the token itself is changed (such as the card is physically removed). Not by the act of logining in or out. So that would be answer 'A'. ------------------------ Yes there are two strategies. The first strategy is what we used to do. (look for the key, then log in if we failed to find it). The problem is the case of mixed tokens. Ones which have a private key that is readable, and a private key that is not. If you look for the readable private key, then you want to find it without logging in. This is the essence of the change I made which introduced this problem (that is the patch was changing the code from authenticate then look to look then authenticate). bob
Checking in pk11cert.c; /cvsroot/mozilla/security/nss/lib/pk11wrap/pk11cert.c,v <-- pk11cert.c new revision: 1.143.2.1; previous revision: 1.143 done checked into 3.11
Status: NEW → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: