Closed
Bug 290121
Opened 19 years ago
Closed 19 years ago
NSS doesn't fetch CRLs during the first minute of program execution on AIX
Categories
(NSS :: Libraries, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
3.10
People
(Reporter: julien.pierre, Assigned: julien.pierre)
Details
Attachments
(1 file, 1 obsolete file)
10.47 KB,
patch
|
nelson
:
review+
|
Details | Diff | Splinter Review |
The new CRL tests that were added on monday passed on every OS except for AIX . The failure was tracked down to a problem with a CRL stored in the cert8.db database. The problem was that the SSL client auth connections with tstclnt went through for 2 certs that were present in the CRL that was supposed to be stored in the server's database. The connections should have failed with a revocation error reported by the server . The output log showed no failure when crlutil imported the CRL . I'm suspecting a problem with the CRL lookup from the token. In addition, I have observed that the crlutil -E command , which is used to erase all CRLs in the database, fails on AIX with the tip. This is a regression from 3.9.5 . Again, this problem is only reproducible on AIX .
Assignee | ||
Updated•19 years ago
|
Priority: -- → P1
Target Milestone: --- → 3.10
Assignee | ||
Comment 1•19 years ago
|
||
After getting a copy of the newest gmake and making a few NSPR and NSS build changes (see bugzilla 290120 and 290122), I was able to get the excellent idebug IBM GUI debugger working on AIX. It is nearly identical to the OS/2 version . It was still quite hard to trace the problem because it was very timing dependent. I finally traced the issue to a problem with the CRL cache not refreshing itself upon first use. This is a regression introduced by the fix for bug 243585, which is the CRL cache rewrite for 3.10 that allows NSS to use CRLs from outside of a PKCS#11 token. The first time a CRL query is made, either by the cert validation code from SSL or by SEC_FindCrlByName from crlutil, the CRL cache runs its regular update routine in DPCache_GetUpToDate, which includes an "up-to-date" test involving a PRIntervalTime : (now - cache->lastfetch > CRLCache_Empty_TokenFetch_Interval) now is the value of PR_IntervalNow(), and cache->lastfetch is initialized to zero. CRLCache_Empty_TokenFetch_Interval is an interval of 60 seconds. The above test was presumed to be true upon initial entry, and cause the cache to refresh itself, including fetching CRLs from the softoken. However, on AIX, an NSPR interval time value of zero represents the NSPR initialization time, which is start time of the program. This is different than on any other platform. Thus, the above test will not become true until the program has run for at least one minute. Since our QA tests don't run for that long, CRLs are unavailable to them. This is obviously a security issue for servers - it would mean that revocation wouldn't be enabled during the first minute. The fix is to add a test for (0 == cache->lastfetch), which will always be true the first time DPCache_GetUpToDate is invoked, and will force the CRL cache to refresh itself initially, regardless of the way NSPR implements intervals. It may also cause an extra refresh every 2**64 cache lookups when the interval value wraps if it happens to be zero again, but that's not a problem.
Summary: CRL issues on AIX → NSS doesn't fetch CRLs during the first minute of program execution on AIX
Assignee | ||
Comment 2•19 years ago
|
||
Attachment #180566 -
Flags: review?(nelson)
Comment 3•19 years ago
|
||
Comment on attachment 180566 [details] [diff] [review] always fill CRL cache upon initial entry I think there may be a race among threads in this code. But if so, it is a pre-existing condition, and this patch doesn't cause it. So, r=nelson.bolyard
Attachment #180566 -
Flags: review?(nelson) → review+
Comment 4•19 years ago
|
||
In the DPCache_GetUpToDate code I saw many misunderstandings of PRIntervalTime. 1. There is no invalid value for PRIntervalTime. In particular, 0 and (PRIntervalTime)-1 are both valid values. So using 0 or (PRIntervalTime)-1 to indicate an invalid or uninitialized PRIntervalTime is wrong. (BTW, our token cache has the same problem.) Instead of initializing cache->lastfetch to 0, you can initialize it to PR_IntervalNow() - (CRLCache_Empty_TokenFetch_Interval+1) to cause the cache to refresh itself upon initial entry. If you don't like this solution, you can use a boolean flag to indicate whether we have ever fetched a CRL. Since 0 is not a special or invalid value for PRIntervalTime, initializing PRIntervalTime variables to 0, for example, PRIntervalTime now = 0; PRIntervalTime lastfetch = 0; is meaningless in that you could initialize them to any other value; the only thing you achieve is deterministic code execution. 2. PRIntervalTime can be used for both time instants and time intervals. Since PRIntervalTime wraps around, you must not compare two PRIntervalTime time instants. Instead, you should only compare two PRIntervalTime time intervals. So this is wrong (comparing two time instants): now < cache->lastfetch but this is correct (comparing two time intervals): now - cache->lastfetch > CRLCache_Empty_TokenFetch_Interval The code also has thread safety issues. I don't understand the code well enough to point them out. But this one is definitely a problem: multiple threads read cache->lastfetch without lock protection.
Assignee | ||
Comment 5•19 years ago
|
||
Nelson, Thanks for the review. I checked in the fix to the NSS tip. Checking in crl.c; /cvsroot/mozilla/security/nss/lib/certdb/crl.c,v <-- crl.c new revision: 1.48; previous revision: 1.47 done I did this seconds before Wan-Teh's comments (got a mid-air collision). Leaving bug opened to discuss the possible race .
Assignee | ||
Comment 6•19 years ago
|
||
Wan-Teh, 1. Nelson suggested the PR_IntervalNow() - CRLCache_Empty_TokenFetch_Interval+1) expression, but I really didn't like it. As I said in comment #1, using zero will also ensure the cache is filled on initial entry, and has a very low probability of side effects - which would be a harmless, very occasional extra token object refetch, if PR_IntervalNow() happened to return exactly zero again, which can happen at most every 6 hours due to the definition of PRIntervalTime, and usually less often than that. The goal was deterministic code execution. Adding a flag indeed accomplishes that too, without the side effect. 2. DPCache_GetUpToDate always owns either a read lock or a write lock to the DPCache object. Please read the logic in the AcquireDPCache function to convince yourself that this is the case. The read lock was acquired in the calling function in order to avoid successive locks and unlocks, as would have been required if it was in DPCache_GetUpToDate . Based on this, do you agree that there is no race condition in this code ?
Assignee | ||
Comment 7•19 years ago
|
||
Wan-Teh, In response to your other issue in 2. about PRIntervalTime : The comparison of two intervals was an attempt to detect a wraparound in case the interval is greater than the maximum allowed by PRIntervalTime. I see now that this is unnecessary because PRIntervalTime is actually defined as PRUInt32 . So the subtraction of now - lastfetch would not evaluate to a negative number which may be less than the CRLCache_Empty_TokenFetchInterval, but rather to a very large positive number, which would exceed it, and thus trigger the first test. Therefore, the second test, (now < cache->lastfetch), is dead code. This leaves a one minute (the fetchinterval) period every 6hrs during which the CRL cache will not refresh its CRL. Unfortunately, the only way to fix that would be to use the much costlier PRTime and PR_Now() functions.
Comment 8•19 years ago
|
||
Sorry, I didn't know a read lock is acquired by the caller. Please ignore my comment "The code also has thread safety issues". I confirm that PR_IntervalNow has a specialized implementation on AIX. So it makes sense that PR_IntervalNow's return value on AIX is different on other Unix platforms. I also see some code in the AIX implementation of PR_IntervalNow that subtracts NSPR's initialization time instant from the time, which would explain the 0 return value at NSPR initialization. I don't have time to beat on the abuse of 0 with PRIntervalTime, but you really should fix comparisons of time instants like now < cache->lastfetch Given your comment, seems that you can just delete those comparisons. You can also add a PRIntervalTime cast to the subtractions, which is what our code examples do, see http://www.mozilla.org/projects/nspr/reference/html/prinrvl.html. If your time intervals may be longer than 6 hours, you must use PRTime and PR_Now. An incorrect program has no performance to speak of.
Assignee | ||
Comment 9•19 years ago
|
||
This patch allows intervals of greater than 6 hours, which are possible between two queries to the CRL cache. I believe the two constructs that were questioned by Wan-Teh with PRIntervalTime become valid with PRTime. The now < cache->last_fetched test is an attempt to detect the clock being turned back manually by the administrator, and forces a refetch. Using 0 as the initial value for PRTime and testing for it is OK, since PRTime won't wrap to january 1st 1970 . There would only be an extra refetch is somebody reset the machine clock back to that date and managed to get a query through at a particular microsecond , but that's acceptable.
Assignee | ||
Updated•19 years ago
|
Attachment #180566 -
Attachment is obsolete: true
Comment 10•19 years ago
|
||
Comment on attachment 180624 [details] [diff] [review] replace PRIntervalTime with PRTime I disagree with this patch, replacing interval times with PRtimes. I think this is too high a price to pay to compensate for the fact that there may be an extra one-minute delay every 6 hours. But I won't go so far as to mark it r-. I also don't think there's a problem with using zero as a magic number. If we do one extra update every 6 hours, it hurts nothing. I do believe that all time comparisons should be a single expression of the form (now - lastime) > desiredInterval where lasttime is *never* computed to be a time in the future (e.g. never computed a (now + interval). As faar as correctness goes, the only incorrect behavior is one that suppressed an update for a time much longer than a minute. Any computaational result that causes updates to occasionally occur more frequently than inteded does not cause incorrect behavior but only suboptimal.
Assignee | ||
Comment 11•19 years ago
|
||
Nelson, Consider a server with very little usage. Somebody logs in at time = 0 . Cache is created The next user logs in between time = 6 hours and 6 hours + 1 minute (the refresh interval ). In this case, when using PRIntervalTime the CRL cache may not be refreshed, even though it should be due for an update. The same user could log in again another 6 hours later within a 1 minute time frame, and again the CRL cache wouldn't update. This could continue forever, as long as there aren't other users logging to the server in between, which would cause the cache to refresh. I'm not sure how to solve this problem without switching to PRTime.
Comment 12•19 years ago
|
||
Well, 3.10 isn't a "performance" release, so we can go with a correct (but possibly inefficient) solution now, and work on efficiency for 3.11. But please do ensure that all time comparisons are done in the form of (now - previousTime) > interval (or < interval) and no time comparisons of the form (previousTime + interval) > now (or < now) I have not reviewed your most recent patch. I am just withdrawing my objection in principle to the use of PRTime in this code.
Assignee | ||
Comment 13•19 years ago
|
||
Nelson, Even with the use of PRTime, this code will still be more efficient than the previous CRL cache implementation that was in the 3.9 branch. In that version, the cache performed a token CRL fetch for every revocation check if it was empty. There was no interval to control the frequency of fetches. This meant that in 99% of the cases, in which there are no CRLs available, it would do a fetch. This may not sound efficient, but it was no less efficient than the prior implementation in 3.5, before there was a cache. Also, in cases in which the cache contained a CRL, the cache checked that the CRL still existed in the token - by checking for the subject attribute - on every revocation check as well. So, the addition of both of these intervals are optimizations. It would be great if these optimizations could use the most efficient PRIntervalTime rather than PRTime, but there is no guarantee about the frequency of certificate revocation check in an application, so they can't.
Assignee | ||
Updated•19 years ago
|
Attachment #180624 -
Flags: superreview?(wtchang)
Attachment #180624 -
Flags: review?(nelson)
Comment 14•19 years ago
|
||
Comment on attachment 180624 [details] [diff] [review] replace PRIntervalTime with PRTime This code appears to be correct (or, as correct as the code that preceeds it) on all platforms for which the NSPR feature test macro HAVE_LONG_LONG is true. On those platforms, PRTime is a 64-bit primitive integer type, and not a struct of two PRUint32s. On platforms where HAVE_LONG_LONG is not true, this code would not build, but would complain about assigning integer values to structure types. To work around this, NSPR supports a set of LL_ macros that do 64-bit arithmetic (and assignment and comparison correctly on both platforms with and platforms without HAVE_LONG_LONG. Some NSS code uses the LL_ macros, other NSS code assumes (as this patch does) that all platforms support primitive 64-bit integers. I think NSS should be consistent, and either always use LL_ macros for 64-bit operations in code that is supposedly platform independent, or NEVER use those macros. It bothers me that NSS is inconsistent about this. Having said that, I observe that there is already code in NSS elsewhere that makes this same assumption, so this concern need not hold up this patch. But I'd like to hear Wan-Teh's opinion on all this. r=nelson
Attachment #180624 -
Flags: review?(nelson) → review+
Assignee | ||
Comment 15•19 years ago
|
||
Nelson, thanks for the review. I believe all platforms now have HAVE_LONG_LONG set to true, so the macros aren't necessary. Otherwise, the places that don't use the macros would cause build failures on some platforms, and we would have heard about the issue. I checked in this patch to the tip. Checking in certi.h; /cvsroot/mozilla/security/nss/lib/certdb/certi.h,v <-- certi.h new revision: 1.16; previous revision: 1.15 done Checking in crl.c; /cvsroot/mozilla/security/nss/lib/certdb/crl.c,v <-- crl.c new revision: 1.49; previous revision: 1.48 done
Status: NEW → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
Comment 16•19 years ago
|
||
In the 1999 revision of the C Standard, the 'long long' type was added and is required to be at least 64 bits. It is 2005 now, so it is okay to assume HAVE_LONG_LONG is defined on all platforms in NSS 3.9 or later. In reality, because GCC is widely used and supports long long, we could have made that assumption a few years ago.
Comment 17•19 years ago
|
||
Comment on attachment 180624 [details] [diff] [review] replace PRIntervalTime with PRTime You should remove the unnessary comparisons like (now < cache->lastfetch) and (now < cache->lastcheck). You should also remove the (0 == lastfetch) comparison you recently added to fix the AIX problem. For better style you can use the PR_USEC_PER_SEC macro instead of 1000000. I don't have time to review your patch carefully. It looks correct.
Attachment #180624 -
Flags: superreview?(wtchang)
Assignee | ||
Comment 18•19 years ago
|
||
Wan-Teh, Thanks for mentioning PR_USEC_PER_SEC, I'll use it. I believe the other checks are still wanted . As I explained in comment #9 : - the test for (now < cache->lastfetch) is an attempt to detect the clock being turned back by the system administrator, and forces a refetch in that case . - the test (0 == lastcheck) still needs to be done in order to force a deterministic fetch on the first update of the cache .
You need to log in
before you can comment on or make changes to this bug.
Description
•