fips.sh: Run PK11MODE in Non FIPSMODE -----------------
pk11mode -d ../fips -p nonfips- -f ../tests.fipspw.8874 -n
NON FIPS MODE PKM_Error: DSA domain parameter generation failed with 0x00000030, CKR_DEVICE_ERROR
NON FIPS MODE PKM_Error: PKM_PublicKey failed with 0x00000030, CKR_DEVICE_ERROR
loaded C_GetFunctionList for NON FIPS MODE; slotID 1
fips.sh: Run PK11MODE in Non FIPS mode (pk11mode -n) . FAILED

Occurred in nightly testing on simplify SunOS5.10_64_OPT.OBJ securityjes5 20070514.1.

We need to find out immediately if this is a new regression, or was a one-time error. This could be a stopper for 3.11.7 if it is a new regression and is repeatable.

It was suggested that the network may have caused the problem. Here is the code involved:

    PKM_LogIt("Generate DSA PQG domain parameters ... \n");
    /* Generate DSA domain parameters PQG */
    crv = pFunctionList->C_GenerateKey(hRwSession, &dsaParamGenMech,
                                       dsaParamGenTemplate, 1, &hDsaParams);
    if (crv == CKR_OK) {
        PKM_LogIt("DSA domain parameter generation succeeded\n");
    } else {
        PKM_Error("DSA domain parameter generation failed "
                  "with 0x%08X, %-26s\n", crv, PKM_CK_RVtoStr(crv));
        return crv;
    }

I'm not sure if C_GenerateKey would hit the database, and hence the network, in this case. If so, then that could be a possible explanation.

Julien, I checked the timestamps in the log. At the time this problem occurred, the network was probably OK. This test ran at about 23:31. Network problems started after 03:00.

The bug didn't show up in the last 2 nightly builds, so the error was triggered only once. Do we know why this happened? Is it still a stopper for NSS 3.11.7?

CC'ing the author of pk11mode and Bob. I don't think we have a good explanation for this error yet. The last change to pk11mode was on February 6, which I believe was before 3.11.6. If this is a real problem, it is more likely a rare issue in the softoken. I know we have had some occasional keygen failures in the past if we get unlucky.

Softoken returns CKR_DEVICE_ERROR when either 1) the underlying database fails, or 2) the crypto engine fails. In this case I would be hard pressed to explain why we would be accessing the database (these are all session objects). The only place where CKR_DEVICE_ERROR could be returned for a non-token object is if the return from PQG_ParamGen() indicated a failure.

PQG_ParamGen() can fail itself if the parameters are bad (unlikely in this case unless there was a cosmic 'zap'). It calls PQG_ParamGenSeedLen. PQG_ParamGenSeedLen can fail for the following reasons:

1) out of memory condition.
2) iteration count was hit.
3) random number generator failure.
4) some mpi primitive failure (init, add, divide, etc.).

I suspect that we aren't dealing with (1), or we would see other failures on the box. (3) is also extremely unlikely. (4) is unlikely unless we are hitting a divide by zero condition randomly. My quick look at the code turned up several divides, but only one where the divisor is not directly calculated from the input. That one case is where we are dividing p-1 by q. If q is zero, or even 1, then our primality tests are really broken ;).

That leaves (2), hitting the iteration count. It too is unlikely, but not beyond the realm of possibility. I'll go back to my prime book and look at the theoretical density of primes in the 160-bit range and see what the chances are that we would fail to find such a prime in 600 attempts.

In any case, all the possible failures are of the 'once in a blue moon' kind. It's quite possible the problem has been in the code for a long time, and it is certainly not a regression. We could systematically call PQG_ParamGen in a loop on a fast machine and see how many iterations it takes to fail.

bob

Hmmm. The density of primes in the 2^160 range is 1/(160*ln(2)), or about 1 in 111. Since our limit is 600 attempts, it means that we have just under a 6 to 1 chance of succeeding (more exactly, we have a 1 in 6 chance of failing). There must be something wrong with my math, because that would indicate shlibsign should fail 1 out of 6 times.

bob
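For what it's worth, here is a minimal sketch of the arithmetic in question. It assumes each candidate is an independent trial that is prime with probability 1/(bits*ln 2), which is a simplification and not how pqg.c actually walks the number space. The function names prime_density and prob_all_composite are made up for this illustration. Under that assumption, 600/111 (about 5.4) is the expected number of primes found in 600 tries, not the odds of success, and the chance of finding none comes out at roughly 0.4% rather than 1 in 6.

```c
#include <assert.h>

/* ln(2), written out so the sketch does not need libm. */
#define LN2 0.6931471805599453

/* Approximate density of primes near 2^bits, from the prime number
 * theorem: 1 / ln(2^bits) = 1 / (bits * ln 2). */
double prime_density(unsigned bits)
{
    assert(bits > 0);
    return 1.0 / ((double)bits * LN2);
}

/* Probability that `tries` independent candidates, each prime with
 * probability `density`, all turn out to be composite.
 * ASSUMPTION: independent trials; real candidates are not independent. */
double prob_all_composite(double density, unsigned tries)
{
    double miss = 1.0 - density;
    double p = 1.0;
    unsigned i;

    assert(density >= 0.0 && density <= 1.0);
    for (i = 0; i < tries; i++) {
        p *= miss;
    }
    return p;
}
```

Calling prob_all_composite(prime_density(160), 600) gives the failure estimate for the 600-attempt limit under this model.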

*** Bug 360264 has been marked as a duplicate of this bug. ***

Bob, your density of primes calculation is correct. Trying random 160 bit numbers, we should find a prime one out of every 111 values tried, ON AVERAGE. Since we're trying only odd numbers, our average is actually twice that good, or one prime out of every 55 tries, ON AVERAGE.

However, primes are not uniformly distributed. There are gaps in the number space in which there are no primes. The length of these gaps can be as long as ~86% of the square of the average distance! For 160 bit numbers, there could be a gap as large as 10550. I believe these gaps are the reason that we don't walk the number space sequentially from a random starting point looking for primes. So IINM, the probability of finding a 160 bit prime in 55 tries is 50%, and the probability of finding a prime within 5275 tries is much closer to 100%, but still not quite there.

If and when we get a 160 bit prime number, Q, we then try to generate a prime P that is L bits long (L is usually 1024). We use SHA1 to generate W, a pseudo random (actually very formulaic) L bit number from Q and a counter. W is modified so that W-1 is divisible by 2Q. Then we test W for primality, and stop and declare it to be prime P if it is prime. If it is not prime, we continue to bump the counter and generate another W and test it. We repeat this until we have found a prime W or until we have tried 4096 values.

Note that for 1024 bit numbers, we expect to find a prime one out of 709 tries, ON AVERAGE. But even trying up to 4K tries, we're not guaranteed to find one. I don't know the probability of finding a prime P from the prime Q values using this method. I suspect it's not actually very high. If we don't find a prime P within 4K tries starting from a given prime Q, we abandon Q, and start over, trying random 160 bit values for primality.

We limit the TOTAL number of attempts to find a prime Q from a 160-bit random "seed" to 600 tries. Given that we expect to find a prime Q one out of every 55 tries ON AVERAGE, we expect to find ~11 prime numbers (on average) out of those 600 trials. And for each of those ~11 primes Q, we will try up to 4K times to find a prime P. So before we fail, we have tried an average of 600+(11*4096) = 45656 primality tests, each of which does numerous modular exponentiations. So you can see why these sometimes take a long time.

This whole algorithm comes right out of FIPS 186-2 Appendix 2.2, except perhaps for the number 600. I don't recall where or how we got that number. But I suspect it was chosen, back in 2001, to bound the maximum time spent trying, so as not to tax browser users' patience beyond reason. Maybe with today's faster CPUs, we should raise it.

I guess the only solution to this is to raise the limit on the number of attempts to find a prime Q to a number higher than 600. Unfortunately, this code is part of the FIPS evaluation, so changing it means ...
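The control flow described above can be sketched roughly as follows. This is not the pqg.c code: the names pq_search, search_stats, q_test_fn and p_test_fn are hypothetical, the two callbacks stand in for the real primality tests, and the stubs q_every_55th and p_never exist only to reproduce the worst-case count from the comment.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_Q_TRIES 600   /* total attempts to find a prime Q (from the thread) */
#define MAX_P_TRIES 4096  /* counter values tried per prime Q (from the thread) */

/* Hypothetical callbacks: return nonzero if the candidate is prime. */
typedef int (*q_test_fn)(unsigned seed_index, void *ctx);
typedef int (*p_test_fn)(unsigned seed_index, unsigned counter, void *ctx);

typedef struct {
    unsigned q_tests;  /* primality tests spent on Q candidates */
    unsigned p_tests;  /* primality tests spent on P candidates */
} search_stats;

/* Returns 1 if a (Q, P) pair was found, 0 if the Q limit was exhausted. */
int pq_search(q_test_fn q_is_prime, p_test_fn p_is_prime, void *ctx,
              search_stats *stats)
{
    unsigned s, c;

    assert(q_is_prime != NULL && p_is_prime != NULL && stats != NULL);
    stats->q_tests = 0;
    stats->p_tests = 0;
    for (s = 0; s < MAX_Q_TRIES; s++) {
        stats->q_tests++;
        if (!q_is_prime(s, ctx)) {
            continue;             /* not a prime Q, try the next seed */
        }
        for (c = 0; c < MAX_P_TRIES; c++) {
            stats->p_tests++;
            if (p_is_prime(s, c, ctx)) {
                return 1;         /* found prime P for this Q */
            }
        }
        /* No P within the counter limit: abandon this Q and go on. */
    }
    return 0;
}

/* Demo stub: pretend every 55th seed yields a prime Q. */
int q_every_55th(unsigned seed_index, void *ctx)
{
    (void)ctx;
    return seed_index % 55 == 0;
}

/* Demo stub: pretend no P candidate is ever prime (worst case). */
int p_never(unsigned seed_index, unsigned counter, void *ctx)
{
    (void)seed_index; (void)counter; (void)ctx;
    return 0;
}
```

With these two stubs the search fails after 600 Q tests and 11*4096 P tests, which is the 45656 figure quoted above.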

If our evaluation is correct, we could fix failures for NSS applications by retrying the keygen when DSA parameter generation fails. This won't fix the pk11mode issues, since it calls softoken directly. We can certainly fix it properly for 3.12 ;). I suggest we find out empirically how frequent these failures are, so we can decide how much we want to increase the retry count.

bob
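A retry at the application level might look something like the sketch below. It is only an illustration under stated assumptions: paramgen_fn is a hypothetical callback standing in for whatever call the application uses to generate DSA parameters (it is not an NSS or PKCS #11 signature), paramgen_with_retry and flaky_paramgen are made-up names, and the two return codes are the values shown in the pk11mode log.

```c
#include <assert.h>
#include <stddef.h>

/* PKCS #11 return codes as they appear in the pk11mode log. */
#define CKR_OK           0x00000000UL
#define CKR_DEVICE_ERROR 0x00000030UL

/* HYPOTHETICAL callback: wraps the application's DSA parameter
 * generation call and returns its CK_RV-style result. */
typedef unsigned long (*paramgen_fn)(void *ctx);

/* Call gen up to max_attempts times, retrying only on CKR_DEVICE_ERROR.
 * Returns the last return code; *attempts_used receives the number of
 * calls made. With max_attempts == 0 nothing is called and
 * CKR_DEVICE_ERROR is returned. */
unsigned long paramgen_with_retry(paramgen_fn gen, void *ctx,
                                  unsigned max_attempts,
                                  unsigned *attempts_used)
{
    unsigned long rv = CKR_DEVICE_ERROR;
    unsigned n = 0;

    assert(gen != NULL);
    while (n < max_attempts) {
        rv = gen(ctx);
        n++;
        if (rv != CKR_DEVICE_ERROR) {
            break;  /* success, or an error that a retry will not fix */
        }
    }
    if (attempts_used != NULL) {
        *attempts_used = n;
    }
    return rv;
}

/* Demo stub: fails *ctx times with CKR_DEVICE_ERROR, then succeeds. */
unsigned long flaky_paramgen(void *ctx)
{
    unsigned *failures_left = (unsigned *)ctx;

    if (*failures_left > 0) {
        (*failures_left)--;
        return CKR_DEVICE_ERROR;
    }
    return CKR_OK;
}
```

Retrying only on CKR_DEVICE_ERROR keeps other errors (bad template, session problems) visible to the caller instead of masking them behind repeated attempts.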

Similar error in FIPS mode today (build 20070920.1, simplify, SunOS5.10_64_DBG.OBJ):

fips.sh: Run PK11MODE in FIPSMODE -----------------
pk11mode -d ../fips -p fips- -f ../tests.fipspw.13905
FIPS MODE PKM_Error: DSA domain parameter generation failed with 0x00000030, CKR_DEVICE_ERROR
FIPS MODE PKM_Error: PKM_KeyTest failed with 0x00000030, CKR_DEVICE_ERROR
Loaded FC_GetFunctionList for FIPS MODE; slotID 0
fips.sh: Run PK11MODE in FIPS mode (pk11mode) . FAILED

Slavo noted that this happened again on April 26, 2008 on Sparc 64 bit. This is the same platform that it happened on before. This may be worth creating a looping test case to see if it can be reproduced and how often.

I ran two instances of the DSA PQG domain parameter generation in a loop on a dual CPU machine overnight. Each process is at about 10,800 iterations and no failures have been seen yet.

This bug was seen again on 8/14.

Occurred again, in nightly build securityjes5/20081028.1/HP-UXB.11.11_DBG/charm.1.

Since we're about to begin another FIPS evaluation, we should tackle this bug now. I think it's as simple as raising the number 600 to something higher. Maybe 1000.

The patch would be to file security/nss/lib/freebl/pqg.c near line 54 and would be something like this:

-#define MAX_ITERATIONS 600 /* Maximum number of iterations of primegen */
+#define MAX_ITERATIONS 1000 /* Maximum number of iterations of primegen */

Found a similar bug today in the nightly tests: securitytip/20081118.1/charm.1/HP-UXB.11.11_DBG.OBJ:

Log:
---
fips.sh: Run PK11MODE in FIPSMODE -----------------
pk11mode -d ../fips -p fips- -f ../tests.fipspw.879
Loaded FC_GetFunctionList for FIPS MODE; slotID 0
Loaded FC_GetFunctionList for FIPS MODE; slotID 0
Loaded FC_GetFunctionList for FIPS MODE; slotID 0
---

These are the last lines in the logfile, written at about 9:11am. The cleanup script was started at 7:30pm and testing was killed this time. It seems that the pk11mode command hung and blocked other testing on the machine.

Created attachment 369172
increase max MAX_ITERATIONS for primegen

Comment on attachment 369172
increase max MAX_ITERATIONS for primegen

r+ rrelyea

Checking in pqg.c;
/cvsroot/mozilla/security/nss/lib/freebl/pqg.c,v <-- pqg.c
new revision: 1.17; previous revision: 1.16
done