Closed Bug 290121 Opened 19 years ago Closed 19 years ago

NSS doesn't fetch CRLs during the first minute of program execution on AIX

Categories

(NSS :: Libraries, defect, P1)

3.9.5
Other
AIX
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: julien.pierre, Assigned: julien.pierre)

Details

Attachments

(1 file, 1 obsolete file)

The new CRL tests that were added on Monday passed on every OS except AIX.
The failure was tracked down to a problem with a CRL stored in the cert8.db
database.

The problem was that the SSL client auth connections with tstclnt succeeded
for two certs that were listed in the CRL that was supposed to be stored in the
server's database. The connections should have failed with a revocation error
reported by the server.

The output log showed no failure when crlutil imported the CRL. I suspect
a problem with the CRL lookup from the token.

In addition, I have observed that the crlutil -E command, which is used to
erase all CRLs in the database, fails on AIX with the tip. This is a regression
from 3.9.5. Again, this problem is only reproducible on AIX.
Priority: -- → P1
Target Milestone: --- → 3.10
After getting a copy of the newest gmake and making a few NSPR and NSS build
changes (see bugs 290120 and 290122), I was able to get IBM's excellent idebug
GUI debugger working on AIX. It is nearly identical to the OS/2 version. It
was still quite hard to trace the problem because it was very timing-dependent.

I finally traced the issue to a problem with the CRL cache not refreshing itself
upon first use. This is a regression introduced by the fix for bug 243585, which
is the CRL cache rewrite for 3.10 that allows NSS to use CRLs from outside of a
PKCS#11 token.

The first time a CRL query is made, either by the cert validation code from SSL
or by SEC_FindCrlByName from crlutil, the CRL cache runs its regular update
routine in DPCache_GetUpToDate, which includes an "up-to-date" test involving a
PRIntervalTime:

 (now - cache->lastfetch > CRLCache_Empty_TokenFetch_Interval)

now is the value of PR_IntervalNow(), and cache->lastfetch is initialized to
zero. CRLCache_Empty_TokenFetch_Interval is an interval of 60 seconds.
The above test was presumed to be true upon initial entry, causing the cache
to refresh itself, including fetching CRLs from the softoken.

However, on AIX, an NSPR interval time value of zero represents the NSPR
initialization time, which is the start time of the program. This is different
from any other platform. Thus, the above test does not become true until the
program has run for at least one minute. Since our QA tests don't run that
long, CRLs are unavailable to them. This is obviously a security issue for
servers - it means that revocation checking is disabled during the first minute.

The fix is to add a test for (0 == cache->lastfetch), which will always be true
the first time DPCache_GetUpToDate is invoked, and will force the CRL cache to
refresh itself initially, regardless of the way NSPR implements intervals. It
may also cause an extra refresh on the rare occasion that the interval value
wraps around to exactly zero again, but that's not a problem.
Summary: CRL issues on AIX → NSS doesn't fetch CRLs during the first minute of program execution on AIX
Attachment #180566 - Flags: review?(nelson)
Comment on attachment 180566 [details] [diff] [review]
always fill CRL cache upon initial entry

I think there may be a race among threads in this code.
But if so, it is a pre-existing condition, and this patch doesn't cause it.
So, r=nelson.bolyard
Attachment #180566 - Flags: review?(nelson) → review+
In the DPCache_GetUpToDate code I saw many misunderstandings
of PRIntervalTime.

1. There is no invalid value for PRIntervalTime.  In particular,
0 and (PRIntervalTime)-1 are both valid values.  So using 0 or
(PRIntervalTime)-1 to indicate an invalid or uninitialized
PRIntervalTime is wrong.  (BTW, our token cache has the same
problem.)

Instead of initializing cache->lastfetch to 0, you can initialize
it to PR_IntervalNow() - (CRLCache_Empty_TokenFetch_Interval+1)
to cause the cache to refresh itself upon initial entry.  If you
don't like this solution, you can use a boolean flag to indicate
whether we have ever fetched a CRL.

Since 0 is not a special or invalid value for PRIntervalTime,
initializing PRIntervalTime variables to 0, for example,

    PRIntervalTime now = 0;
    PRIntervalTime lastfetch = 0;

is meaningless in that you could initialize them to any other
value; the only thing you achieve is deterministic code execution.

2. PRIntervalTime can be used for both time instants and time
intervals.  Since PRIntervalTime wraps around, you must not
compare two PRIntervalTime time instants.  Instead, you should
only compare two PRIntervalTime time intervals.

So this is wrong (comparing two time instants):

    now < cache->lastfetch

but this is correct (comparing two time intervals):

    now - cache->lastfetch > CRLCache_Empty_TokenFetch_Interval

The code also has thread safety issues.  I don't understand the
code well enough to point them out.  But this one is definitely
a problem: multiple threads read cache->lastfetch without lock
protection.
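The instants-versus-intervals point above can be illustrated with a standalone sketch, with uint32_t standing in for PRIntervalTime; both function names here are invented:

```c
#include <stdint.h>

/* WRONG pattern: comparing two time instants directly. Once the
   counter wraps, an older instant can compare as "later". */
int instant_compare_says_future(uint32_t now, uint32_t last)
{
    return now < last;
}

/* RIGHT pattern: compare the elapsed interval against the desired
   interval. Unsigned subtraction is modulo 2^32, so it yields the
   true elapsed ticks across a single wrap of the counter. */
int elapsed_exceeds(uint32_t now, uint32_t last, uint32_t interval)
{
    return (uint32_t)(now - last) > interval;
}
```

For example, with last near the top of the 32-bit range and now just past the wrap, the instant comparison wrongly reports that last lies in the future, while the interval form still computes the small true elapsed time.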
Nelson,

Thanks for the review. I checked in the fix to the NSS tip. 
Checking in crl.c;
/cvsroot/mozilla/security/nss/lib/certdb/crl.c,v  <--  crl.c
new revision: 1.48; previous revision: 1.47
done

I did this seconds before Wan-Teh's comments (got a mid-air collision). Leaving
the bug open to discuss the possible race.
Wan-Teh,

1. Nelson suggested the PR_IntervalNow() - (CRLCache_Empty_TokenFetch_Interval + 1)
expression, but I really didn't like it.
As I said in comment #1, using zero also ensures the cache is filled on
initial entry, and has a very low probability of side effects - which would be a
harmless, very occasional extra token object refetch, if PR_IntervalNow()
happened to return exactly zero again, which can happen at most every 6 hours
due to the definition of PRIntervalTime, and usually less often than that.
The goal was deterministic code execution. Adding a flag indeed accomplishes
that too, without the side effect.

2. DPCache_GetUpToDate always owns either a read lock or a write lock to the
DPCache object. Please read the logic in the AcquireDPCache function to convince
yourself that this is the case. The read lock was acquired in the calling
function in order to avoid successive locks and unlocks, as would have been
required if it was in DPCache_GetUpToDate.

Based on this, do you agree that there is no race condition in this code?
Wan-Teh,

In response to your other issue in 2. about PRIntervalTime:

The comparison of the two instants was an attempt to detect a wraparound in case
the interval is greater than the maximum allowed by PRIntervalTime.
I see now that this is unnecessary because PRIntervalTime is actually defined as
PRUint32. So the subtraction now - lastfetch would not evaluate to a
negative number that might be less than CRLCache_Empty_TokenFetch_Interval,
but rather to a very large positive number, which would exceed it and thus
trigger the first test.

Therefore, the second test, (now < cache->lastfetch), is dead code.

This leaves a one-minute (the fetch interval) window every 6 hours during which
the CRL cache will not refresh its CRL. Unfortunately, the only way to fix that
would be to use the much costlier PRTime type and PR_Now() function.
Sorry, I didn't know a read lock is acquired by
the caller.  Please ignore my comment "The code
also has thread safety issues".

I confirm that PR_IntervalNow has a specialized
implementation on AIX.  So it makes sense that
PR_IntervalNow's return value on AIX is different
from that on other Unix platforms.  I also see some
code in the AIX implementation of PR_IntervalNow that
subtracts NSPR's initialization time instant from
the current time, which would explain the 0 return value at
NSPR initialization.

I don't have time to beat on the abuse of 0 with
PRIntervalTime, but you really should fix
comparisons of time instants like

    now < cache->lastfetch

Given your comment, it seems that you can just delete
those comparisons.  You can also add a PRIntervalTime
cast to the subtractions, which is what our code
examples do, see
http://www.mozilla.org/projects/nspr/reference/html/prinrvl.html.

If your time intervals may be longer than 6 hours,
you must use PRTime and PR_Now.  An incorrect program
has no performance to speak of.
This patch allows intervals greater than 6 hours, which are possible between
two queries to the CRL cache.

I believe the two constructs that were questioned by Wan-Teh with
PRIntervalTime become valid with PRTime.

The now < cache->last_fetched test is an attempt to detect the clock being
turned back manually by the administrator, and forces a refetch.

Using 0 as the initial value for PRTime and testing for it is OK, since PRTime
won't wrap back to January 1st, 1970. There would only be an extra refetch if
somebody reset the machine clock back to that date and managed to get a query
through at a particular microsecond, but that's acceptable.
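The logic described in this comment can be sketched with int64_t microseconds standing in for PRTime. This is a hedged standalone model, not the actual patch; the function and constant names are invented:

```c
#include <stdint.h>

/* Sketch of a PRTime-style refresh decision: 0 means "never fetched",
   an earlier 'now' means the clock was turned back, and otherwise the
   elapsed microseconds are compared against the fetch interval. */
#define USEC_PER_SEC      1000000LL
#define FETCH_INTERVAL_US (60 * USEC_PER_SEC)  /* one minute */

int prtime_refresh_due(int64_t now_us, int64_t lastfetch_us)
{
    if (lastfetch_us == 0)      /* never fetched yet: force a refresh */
        return 1;
    if (now_us < lastfetch_us)  /* clock turned back: force a refetch */
        return 1;
    return now_us - lastfetch_us > FETCH_INTERVAL_US;
}
```

Because a 64-bit microsecond counter does not wrap in practice, the instant comparison (now_us < lastfetch_us) is meaningful here, unlike with the 32-bit PRIntervalTime.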
Attachment #180566 - Attachment is obsolete: true
Comment on attachment 180624 [details] [diff] [review]
replace PRIntervalTime with PRTime

I disagree with this patch, replacing interval times with PRTimes.  I think
this is too high a price to pay to compensate for the fact that there may be an
extra one-minute delay every 6 hours.  But I won't go so far as to mark it r-.
I also don't think there's a problem with using zero as a magic number.  If we
do one extra update every 6 hours, it hurts nothing. 
I do believe that all time comparisons should be a single expression of the
form 
(now - lasttime) > desiredInterval
where lasttime is *never* computed to be a time in the future (e.g. never
computed as (now + interval)).  
As far as correctness goes, the only incorrect behavior is one that suppresses
an update for a time much longer than a minute.  Any computational result that
causes updates to occasionally occur more frequently than intended does not
cause incorrect behavior, only suboptimal behavior.
Nelson,

Consider a server with very little usage.

Somebody logs in at time = 0. The cache is created.
The next user logs in between time = 6 hours and 6 hours + 1 minute (the refresh
interval).
In this case, when using PRIntervalTime, the CRL cache may not be refreshed, even
though it is due for an update.
The same user could log in again another 6 hours later within the same 1-minute
window, and again the CRL cache wouldn't update. This could continue forever, as
long as no other users log in to the server in between, which would
cause the cache to refresh.
I'm not sure how to solve this problem without switching to PRTime.
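This scenario can be simulated with a standalone sketch (uint32_t standing in for the wrapping tick counter; the function name, the 60-tick interval, and the exact 2^32-tick wrap period are all invented for illustration):

```c
#include <stdint.h>

/* Simulate a low-traffic server: queries arrive 'true_gap_ticks' of
   real time apart, but the cache only sees a 32-bit counter. A gap
   that is an exact multiple of 2^32 ticks appears as 0 elapsed ticks,
   so the refresh is skipped on every login. */
#define FETCH_INTERVAL 60u

int skipped_refreshes_over(int logins, uint64_t true_gap_ticks)
{
    uint64_t true_now = 0;               /* real elapsed ticks */
    uint32_t lastfetch = 0;              /* what the cache stores */
    int skipped = 0;
    for (int i = 0; i < logins; i++) {
        true_now += true_gap_ticks;
        uint32_t now = (uint32_t)true_now;   /* counter wraps mod 2^32 */
        if ((uint32_t)(now - lastfetch) > FETCH_INTERVAL)
            lastfetch = now;                 /* cache refreshed */
        else
            skipped++;                       /* refresh wrongly skipped */
    }
    return skipped;
}
```

When the gap is exactly one wrap period, every refresh is skipped; a gap even slightly longer than the wrap period plus the interval refreshes normally, which is the aliasing this comment describes.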
Well, 3.10 isn't a "performance" release, so we can go with a correct 
(but possibly inefficient) solution now, and work on efficiency for 3.11.  

But please do ensure that all time comparisons are done in the form of  
   (now - previousTime) > interval     (or < interval) 
and no time comparisons of the form
   (previousTime + interval) > now     (or < now)

I have not reviewed your most recent patch.  I am just withdrawing my 
objection in principle to the use of PRTime in this code.
Nelson,

Even with the use of PRTime, this code will still be more efficient than the
previous CRL cache implementation that was in the 3.9 branch.

In that version, the cache performed a token CRL fetch for every revocation
check if it was empty. There was no interval to control the frequency of
fetches. This meant that in 99% of the cases, in which there are no CRLs
available, it would do a fetch. This may not sound efficient, but it was no less
efficient than the prior implementation in 3.5, before there was a cache.

Also, in cases in which the cache contained a CRL, the cache checked that the
CRL still existed in the token - by checking for the subject attribute - on
every revocation check as well.

So, both of these intervals are optimizations. It would be great
if these optimizations could use the more efficient PRIntervalTime rather than
PRTime, but there is no guarantee about the frequency of certificate revocation
checks in an application, so they can't.
Attachment #180624 - Flags: superreview?(wtchang)
Attachment #180624 - Flags: review?(nelson)
Comment on attachment 180624 [details] [diff] [review]
replace PRIntervalTime with PRTime

This code appears to be correct (or, as correct as the code
that precedes it) on all platforms for which the NSPR feature
test macro HAVE_LONG_LONG is true.  

On those platforms, PRTime is a 64-bit primitive integer type, 
not a struct of two PRUint32s.  On platforms where 
HAVE_LONG_LONG is not true, this code would not build; the compiler
would complain about assigning integer values to structure types. 
To work around this, NSPR supplies a set of LL_ macros that 
do 64-bit arithmetic (and assignment and comparison) correctly 
on platforms both with and without HAVE_LONG_LONG.  

Some NSS code uses the LL_ macros, other NSS code assumes 
(as this patch does) that all platforms support primitive 
64-bit integers.  I think NSS should be consistent, and 
either always use LL_ macros for 64-bit operations in code
that is supposedly platform independent, or NEVER use those
macros.  It bothers me that NSS is inconsistent about this.

Having said that, I observe that there is already code in 
NSS elsewhere that makes this same assumption, so this 
concern need not hold up this patch.  But I'd like to hear
Wan-Teh's opinion on all this.

r=nelson
Attachment #180624 - Flags: review?(nelson) → review+
Nelson, thanks for the review.
I believe all platforms now have HAVE_LONG_LONG set to true, so the macros
aren't necessary. Otherwise, the places that don't use the macros would cause
build failures on some platforms, and we would have heard about the issue.

I checked in this patch to the tip.

Checking in certi.h;
/cvsroot/mozilla/security/nss/lib/certdb/certi.h,v  <--  certi.h
new revision: 1.16; previous revision: 1.15
done
Checking in crl.c;
/cvsroot/mozilla/security/nss/lib/certdb/crl.c,v  <--  crl.c
new revision: 1.49; previous revision: 1.48
done
Status: NEW → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
In the 1999 revision of the C Standard, the 'long long'
type was added and is required to be at least 64 bits.
It is 2005 now, so it is okay to assume HAVE_LONG_LONG
is defined on all platforms in NSS 3.9 or later.

In reality, because GCC is widely used and supports long
long, we could have made that assumption a few years ago.
Comment on attachment 180624 [details] [diff] [review]
replace PRIntervalTime with PRTime

You should remove the unnecessary comparisons like
(now < cache->lastfetch) and (now < cache->lastcheck).
You should also remove the (0 == lastfetch) comparison
you recently added to fix the AIX problem.

For better style you can use the PR_USEC_PER_SEC
macro instead of 1000000.

I don't have time to review your patch carefully.
It looks correct.
Attachment #180624 - Flags: superreview?(wtchang)
Wan-Teh,

Thanks for mentioning PR_USEC_PER_SEC, I'll use it.

I believe the other checks are still wanted. As I explained in comment #9:

- the test for (now < cache->lastfetch) is an attempt to detect the clock being
turned back by the system administrator, and forces a refetch in that case.

- the test (0 == lastcheck) still needs to be done in order to force a
deterministic fetch on the first update of the cache.