Closed Bug 380784 Opened 17 years ago Closed 15 years ago

PK11MODE in non FIPS mode failed.

Categories

(NSS :: Libraries, defect)

3.12.3
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
3.12.3

People

(Reporter: slavomir.katuscak+mozilla, Assigned: glenbeasley)

Details

(Whiteboard: FIPS)

Attachments

(1 file)

fips.sh: Run PK11MODE in Non FIPSMODE  -----------------
pk11mode -d ../fips -p nonfips- -f ../tests.fipspw.8874 -n
NON FIPS MODE PKM_Error: DSA domain parameter generation failed with 0x00000030, CKR_DEVICE_ERROR                    
NON FIPS MODE PKM_Error: PKM_PublicKey failed with 0x00000030, CKR_DEVICE_ERROR                    
loaded C_GetFunctionList for NON FIPS MODE; slotID 1 
fips.sh: Run PK11MODE in Non FIPS mode (pk11mode -n) . FAILED

Occurred in nightly testing on simplify SunOS5.10_64_OPT.OBJ securityjes5 
20070514.1.
We need to find out immediately if this is a new regression, or was a 
one-time error.  This could be a stopper for 3.11.7 if it is a new 
regression and is repeatable.
Version: 3.11 → 3.11.7
It was suggested that the network may have caused the problem. Here is the code involved:

    PKM_LogIt("Generate DSA PQG domain parameters ... \n");
    /* Generate DSA domain parameters PQG */
    crv = pFunctionList->C_GenerateKey(hRwSession, &dsaParamGenMech,
                                       dsaParamGenTemplate,
                                       1,
                                       &hDsaParams);
    if (crv == CKR_OK) {
        PKM_LogIt("DSA domain parameter generation succeeded\n");
    } else {
        PKM_Error( "DSA domain parameter generation failed "
                   "with 0x%08X, %-26s\n", crv, PKM_CK_RVtoStr(crv));
        return crv;
    }

I'm not sure if C_GenerateKey would hit the database and hence the network in this case. If so, then that could be a possible explanation.
Julien, I checked timestamps in the log. At the time when this problem occurred the network was probably OK. This test ran at around 23:31. Network problems started after 03:00.
The bug didn't show up for the last 2 nightly builds. So the error was triggered only once.
Do we know why this happened?
Is it still a stopper for NSS 3.11.7?
CC'ing the author of pk11mode and Bob. I don't think we have a good explanation for this error yet. 
The last change to pk11mode was on February 6, which I believe was before 3.11.6.
If this is a real problem, it is more likely a rare issue in the softoken. I know we have had some occasional keygen failures in the past if we get unlucky.
Softoken returns CKR_DEVICE_ERROR when either 1) the underlying database fails, or 2) the crypto engine fails. In this case I would be hard pressed to explain why we are accessing the database (these are all session objects). The only place where CKR_DEVICE_ERROR could have been returned for a non-token object is if the return from PQG_ParamGen() indicated a failure.

PQG_ParamGen() can fail itself if the parameters are bad (unlikely in this case unless there was a cosmic 'zap'). It calls PQG_ParamGenSeedLen.

PQG_ParamGenSeedLen can fail for the following reasons:
1) out of memory condition.
2) Iteration count was hit.
3) random number generator failure.
4) some mpi primitive failure (init, add, divide, etc.).

I suspect that we aren't dealing with (1), or we would see other failures on the box.
(3) is also extremely unlikely.
(4) is unlikely unless we are hitting a divide by zero condition randomly. My quick look at the code turned up several divides, but only one where the divisor is not directly calculated from the input. That one case is where we are dividing p-1 by q. If q is zero, or even 1, then our primality tests are really broken;).

That leaves hitting the iteration count. It too is unlikely, but not beyond the realm of possibility. I'll go back to my prime book and look at the theoretical density of primes in the 160-bit range and see what the chances are that we would fail to find such a prime in 600 attempts.

In any case, all the possible failures are of the 'once in a blue moon' kind. It's quite possible the problem we have has been in the code for a long time, and it is certainly not a regression. We could systematically call PQG_ParamGen in a loop on a fast machine and see how many iterations it takes to fail.

bob
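The looping experiment suggested above could be structured roughly as follows. This is a minimal sketch, not a test that exists in NSS: `stress_keygen`, `gen_fn`, `stress_result` and `fake_gen` are hypothetical names. The callback is assumed to wrap one parameter generation (for example a single PQG_ParamGen() call plus cleanup) and to return 0 on success.

```c
#include <stddef.h>

/* Hypothetical callback: performs one parameter generation and
 * returns 0 on success, nonzero on failure. */
typedef int (*gen_fn)(void *ctx);

/* Hypothetical tally for one stress run. */
typedef struct {
    unsigned long runs;          /* calls actually made */
    unsigned long failures;      /* calls that returned nonzero */
    unsigned long first_failure; /* 1-based index of first failure, 0 if none */
} stress_result;

/* Call gen() up to max_runs times and tally the failures. If stop_on_first
 * is nonzero, return as soon as one failure has been recorded. */
stress_result
stress_keygen(gen_fn gen, void *ctx, unsigned long max_runs, int stop_on_first)
{
    stress_result r = { 0, 0, 0 };
    unsigned long i;

    for (i = 1; i <= max_runs; i++) {
        r.runs = i;
        if (gen(ctx) != 0) {
            r.failures++;
            if (r.first_failure == 0) {
                r.first_failure = i;
            }
            if (stop_on_first) {
                break;
            }
        }
    }
    return r;
}

/* Stand-in generator for trying the harness without NSS. ctx points to two
 * unsigned longs {calls_so_far, n}; every n-th call reports a failure. */
int
fake_gen(void *ctx)
{
    unsigned long *state = ctx;

    state[0]++;
    return (state[0] % state[1] == 0) ? 1 : 0;
}
```

Recording the index of the first failure, rather than only a count, gives a direct estimate of how many generations it takes to hit the rare case.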
Hmmm. The density of primes in the 2^160 range is 1/(160*ln(2)), or about 1 in 111 (1/111). Since our iteration limit is 600, it means that we have a just-less-than 6 in 1 chance of succeeding (more exactly, we have a 1 in 6 chance of failing).

There must be something wrong with my math, because that would indicate shlibsign should fail 1 out of 6 times.

bob
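For reference, the estimate above can be redone as below. This is a sketch under a simplifying assumption: each candidate is treated as prime independently with the prime number theorem density 1/(bits*ln(2)), which ignores both prime gaps and the later search for P. `prime_density` and `prob_no_prime` are illustrative names, not NSS functions.

```c
#define SKETCH_LN2 0.6931471805599453 /* natural logarithm of 2 */

/* Approximate density of primes near 2^bits (prime number theorem). */
double
prime_density(unsigned int bits)
{
    return 1.0 / ((double)bits * SKETCH_LN2);
}

/* Probability that none of `tries` candidates of the given size is prime,
 * assuming independent trials. If odd_only is nonzero, only odd candidates
 * are drawn, which doubles the per-try success rate. */
double
prob_no_prime(unsigned int bits, unsigned int tries, int odd_only)
{
    double p = prime_density(bits) * (odd_only ? 2.0 : 1.0);
    double miss = 1.0;
    unsigned int i;

    for (i = 0; i < tries; i++) {
        miss *= (1.0 - p); /* one more candidate that is not prime */
    }
    return miss;
}
```

Under these assumptions, 600 tries at 160 bits all miss with probability (1 - 1/111)^600, which is well under 1% rather than 1 in 6. Dividing 600 by 111 gives the expected number of primes found (about 5.4), not the odds of success.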
Bob, your density of primes calculation is correct.  Trying random 160 bit
numbers, we should find a prime one out of every 111 values tried, ON AVERAGE.
Since we're trying only odd numbers, our average is actually twice that good,
or one prime out of every 55 tries, ON AVERAGE.  

However, primes are not uniformly distributed.  There are gaps in the number 
space in which there are no primes.  The length of these gaps can be 
as long as ~86% of the square of the average distance!  For 160 bit numbers,
there could be a gap as large as 10550.  I believe these gaps are the reason
that we don't walk the number space sequentially from a random starting point 
looking for primes.  So IINM, the probability of finding a 160 bit prime in 
55 tries is 50%, and the probability of finding a prime within 5275 tries is 
much closer to 100%, but still not quite there.  

IF & when we get a 160 bit prime number, Q, we then try to generate a prime 
P that is L bits long (L is usually 1024).  We use SHA1 to generate W, a 
pseudo random (actually very formulaic) L bit number from Q and a counter. 
W is modified so that W-1 is divisible by 2Q.  Then we test W for primality, 
and stop and declare it to be prime P if it is prime.  If it is not prime, 
we continue to bump the counter and generate another W and test it.  We 
repeat this until we have found a prime W or until we have tried 4096 values.  
Note that for 1024 bit numbers, we expect to find a prime one out of 709 
tries, ON AVERAGE.  But even trying up to 4K tries, we're not guaranteed
to find one.  I don't know the probability of finding a prime P from the 
prime Q values using this method.  I suspect it's not actually very high.

If we don't find a prime P within 4K tries starting from a given prime Q,
we abandon Q, and start over, trying random 160 bit values for primality.
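
The two nested limits described above can be illustrated with a toy model. This is only a sketch of the control flow, not the pqg.c code: it uses 16-bit Q and 32-bit P, a xorshift generator in place of the SHA-1 construction, and trial division in place of the real primality tests. All names (`toy_pqg_gen`, `toy_next`, `toy_is_prime`, the `TOY_*` constants) are made up for the example.

```c
#include <stdint.h>

#define TOY_Q_BITS 16          /* stands in for the 160-bit Q */
#define TOY_P_BITS 32          /* stands in for the L-bit P */
#define TOY_MAX_ITERATIONS 600 /* outer limit on Q candidates */
#define TOY_MAX_COUNTER 4096   /* inner limit on W candidates per Q */

/* xorshift64 step; stands in for SHA-1 based candidate generation.
 * The state must be nonzero. */
uint64_t
toy_next(uint64_t *state)
{
    uint64_t x = *state;

    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Trial-division primality test; adequate only for toy sizes. */
int
toy_is_prime(uint64_t n)
{
    uint64_t d;

    if (n < 2) return 0;
    if (n % 2 == 0) return n == 2;
    for (d = 3; d * d <= n; d += 2) {
        if (n % d == 0) return 0;
    }
    return 1;
}

/* Returns 0 and fills *p and *q on success; returns -1 if the outer
 * iteration limit is exhausted without finding a (P, Q) pair. */
int
toy_pqg_gen(uint64_t seed, uint64_t *p, uint64_t *q)
{
    uint64_t state = seed ? seed : 1;
    unsigned int iter, counter;

    for (iter = 0; iter < TOY_MAX_ITERATIONS; iter++) {
        /* Candidate Q: random bits with the top and bottom bits forced to 1. */
        uint64_t cq = toy_next(&state) >> (64 - TOY_Q_BITS);
        cq |= ((uint64_t)1 << (TOY_Q_BITS - 1)) | 1;
        if (!toy_is_prime(cq)) {
            continue; /* not prime: this outer iteration is used up */
        }
        for (counter = 0; counter < TOY_MAX_COUNTER; counter++) {
            /* Candidate W, adjusted so that W-1 is divisible by 2Q. */
            uint64_t w = toy_next(&state) >> (64 - TOY_P_BITS);
            w |= (uint64_t)1 << (TOY_P_BITS - 1);
            w = w - (w % (2 * cq)) + 1;
            if (w < ((uint64_t)1 << (TOY_P_BITS - 1))) {
                continue; /* adjustment dropped W below the target size */
            }
            if (toy_is_prime(w)) {
                *p = w;
                *q = cq;
                return 0;
            }
        }
        /* No P found for this Q: abandon it and try another Q. */
    }
    return -1;
}
```

At these toy sizes primes are dense enough that the limits are essentially never reached. The point is only the shape: a non-prime Q candidate and an abandoned prime Q both consume one of the outer iterations.
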

We limit the TOTAL number of attempts to find a prime Q from a 160-bit 
random "seed" to 600 tries.  Given that we expect to find a prime Q one 
out of every 55 tries ON AVERAGE, then we expect to find ~11 prime numbers
(on average) out of those 600 trials.  And for each of those ~11 primes Q,
we will try up to 4K times to find a prime P.  So before we fail, we have 
tried an average of 600+(11*4096) = 45656 primality tests, each of which 
does numerous modular exponentiations.  So you can see why these sometimes 
take a long time.
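
The arithmetic in the paragraph above can be written out as a small helper. It is a rough model only: it assumes odd candidates are prime with density 2/(bits*ln(2)) and that every prime Q found uses up all of its P attempts. `expected_primes` and `tests_before_failure` are illustrative names.

```c
#define WORK_LN2 0.6931471805599453 /* natural logarithm of 2 */

/* Expected number of primes among `tries` random odd candidates of the
 * given bit length, using the density 2 / (bits * ln 2) for odd numbers. */
double
expected_primes(unsigned int bits, unsigned int tries)
{
    return (double)tries * 2.0 / ((double)bits * WORK_LN2);
}

/* Expected primality tests performed in a run that ultimately fails: one
 * test per Q candidate, plus p_tries tests for every prime Q found. */
double
tests_before_failure(unsigned int max_iterations, unsigned int q_bits,
                     unsigned int p_tries)
{
    return (double)max_iterations +
           expected_primes(q_bits, max_iterations) * (double)p_tries;
}
```

With 600 iterations, 160-bit Q and 4096 tries per Q, this gives about 10.8 expected primes Q and roughly 44,900 primality tests, in line with the rounded figure of 45656 above.
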

This whole algorithm comes right out of FIPS 186-2 Appendix 2.2, except
perhaps for the number 600.  I don't recall where or how we got that number.
But I suspect it was chosen, back in 2001, to bound the maximum time spent 
trying, to not tax browser users' patience beyond reason.  Maybe with today's
faster CPUs, we should raise it.

I guess the only solution to this is to raise the limit on the number of attempts to find a prime Q up to a higher number than 600.  Unfortunately, 
this code is part of the FIPS evaluation, so changing it means ...  
If our evaluation is correct, we could fix failures for NSS applications by retrying the keygen on DSA results. This won't fix PK11Mode issues, since it calls softoken directly.

We certainly can fix it right for 3.12;).

I suggest we heuristically find out how frequent these failures are, so we can decide how much we want to increase the retry count.

bob
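An application-level retry of the kind mentioned above might look like the following. It is a hedged sketch, not NSS code: `keygen_with_retry`, `keygen_fn`, `flaky_keygen` and the `SKETCH_*` constants are hypothetical, and the local constants stand in for the PKCS #11 return codes so the example compiles without any PKCS #11 headers.

```c
/* Local stand-ins for PKCS #11 return values. 0x30 matches the
 * CKR_DEVICE_ERROR code shown in the logs above. */
typedef unsigned long sketch_rv;
#define SKETCH_OK           0x00000000UL
#define SKETCH_DEVICE_ERROR 0x00000030UL

/* Hypothetical callback that performs one key or parameter generation. */
typedef sketch_rv (*keygen_fn)(void *ctx);

/* Call keygen() until it returns something other than a device error or
 * max_attempts calls have been made. Returns the last result and, if
 * attempts_used is not NULL, stores the number of calls made. */
sketch_rv
keygen_with_retry(keygen_fn keygen, void *ctx, unsigned int max_attempts,
                  unsigned int *attempts_used)
{
    sketch_rv rv = SKETCH_DEVICE_ERROR;
    unsigned int n = 0;

    while (n < max_attempts) {
        n++;
        rv = keygen(ctx);
        if (rv != SKETCH_DEVICE_ERROR) {
            break; /* success, or an error that a retry will not fix */
        }
    }
    if (attempts_used) {
        *attempts_used = n;
    }
    return rv;
}

/* Stand-in keygen for trying the wrapper: ctx points to an unsigned int
 * holding how many device errors to report before succeeding. */
sketch_rv
flaky_keygen(void *ctx)
{
    unsigned int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return SKETCH_DEVICE_ERROR;
    }
    return SKETCH_OK;
}
```

Since CKR_DEVICE_ERROR is also what a database failure returns, a blind retry like this could mask a real problem, so a small attempt limit seems prudent.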
Similar error in FIPS mode today (build 20070920.1, simplify, SunOS5.10_64_DBG.OBJ):

fips.sh: Run PK11MODE in FIPSMODE  -----------------
pk11mode -d ../fips -p fips- -f ../tests.fipspw.13905

FIPS MODE PKM_Error: DSA domain parameter generation failed with 0x00000030, CKR_DEVICE_ERROR                    

FIPS MODE PKM_Error: PKM_KeyTest failed with 0x00000030, CKR_DEVICE_ERROR                    
Loaded FC_GetFunctionList for FIPS MODE; slotID 0 
fips.sh: Run PK11MODE in FIPS mode (pk11mode) . FAILED
Assignee: nobody → glen.beasley
Slavo noted that this happened again on April 26, 2008 on Sparc 64 bit. This is the same platform that it happened on before. This may be worth creating a looping test case to see if it can be reproduced and how often.

I ran two instances of the DSA PQG domain parameters in a loop on a dual CPU machine overnight. Each process is at about 10,800 iterations and no failures have been seen yet.
This bug was seen again on 8/14.
Blocks: FIPS2008
No longer blocks: FIPS2008
Occurred again, nightly build securityjes5/20081028.1/HP-UXB.11.11_DBG/charm.1.
OS: Solaris → All
Hardware: Sun → All
Target Milestone: --- → 3.11.10
Since we're about to begin another FIPS evaluation, we should tackle this
bug now.  I think it's as simple as raising the number 600 to something
higher.  Maybe 1000.
Whiteboard: FIPS
Target Milestone: 3.11.10 → 3.12.3
The patch would be to file security/nss/lib/freebl/pqg.c near line 54
and would be something like this:

-#define MAX_ITERATIONS 600  /* Maximum number of iterations of primegen */
+#define MAX_ITERATIONS 1000 /* Maximum number of iterations of primegen */
Found similar bug today in nightly tests:
securitytip/20081118.1/charm.1/HP-UXB.11.11_DBG.OBJ:

Log:
---
fips.sh: Run PK11MODE in FIPSMODE  -----------------
pk11mode -d ../fips -p fips- -f ../tests.fipspw.879
Loaded FC_GetFunctionList for FIPS MODE; slotID 0 
Loaded FC_GetFunctionList for FIPS MODE; slotID 0 
Loaded FC_GetFunctionList for FIPS MODE; slotID 0 
---

These are the last lines in the logfile, written at around 9:11am. The cleanup script was started at 7:30pm and testing was killed this time. It seems that the pk11mode command hung and blocked other testing on the machine.
Attachment #369172 - Flags: review?(rrelyea)
Attachment #369172 - Flags: review?(rrelyea) → review+
Comment on attachment 369172 [details] [diff] [review]
increase max MAX_ITERATIONS for primegen

r+ rrelyea
Checking in pqg.c;
/cvsroot/mozilla/security/nss/lib/freebl/pqg.c,v  <--  pqg.c
new revision: 1.17; previous revision: 1.16
done
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Version: 3.11.7 → 3.12.3