Closed Bug 342795 Opened 18 years ago Closed 18 years ago

PR_Read returns -1 but PR_GetError returns 0 when ECDHE keygen fails

Categories

(NSS :: Libraries, defect, P2)

3.11.2
defect

Tracking

(Not tracked)

RESOLVED FIXED
3.11.5

People

(Reporter: julien.pierre, Assigned: wtc)

References

Details

Attachments

(1 file)

In my tree, I just got a core file in selfserv in the "errWarn" function : 241 static const char * 242 errWarn(char * funcString) 243 { 244 PRErrorCode perr = PR_GetError(); 245 const char * errString = SECU_Strerror(perr); 246 247 fprintf(stderr, "selfserv: %s returned error %d:\n%s\n", 248 funcString, perr, errString); 249 return errString; 250 } The problem is that perr is zero, and errString is NULL . errString is not checked for NULL, and the fprintf crashes while calling strlen(errString). The stack is : [1] strlen(0x8068b3e, 0xfe64b220, 0xfec2b090, 0x0), at 0xfeb84acc [2] _fprintf(0xfec2b090, 0x8068b1c, 0x8080c08, 0x0, 0x0), at 0xfebdcc0d =>[3] errWarn(funcString = 0x8080c08 "HDX PR_Read"), line 248 in "selfserv.c" [4] handle_connection(tcp_sock = 0x81862e0, model_sock = 0x8083800, requestCert = 0), line 968 in "selfserv.c" [5] jobLoop(a = (nil), b = (nil), c = 0), line 518 in "selfserv.c" [6] thread_wrapper(arg = 0x816d3f8), line 486 in "selfserv.c" [7] _pt_root(arg = 0x81788b8), line 220 in "ptthread.c" [8] _thr_setup(0xfe650400), at 0xfebff6f8 [9] _lwp_start(), at 0xfebff9e0 The caller of errWarn is : 960 rv = PR_Read(ssl_sock, pBuf, bufRem - 1); 961 if (rv == 0 || 962 (rv < 0 && PR_END_OF_FILE_ERROR == PR_GetError())) { 963 if (verbose) 964 errWarn("HDX PR_Read hit EOF"); 965 break; 966 } 967 if (rv < 0) { 968 errWarn("HDX PR_Read"); 969 goto cleanup; 970 } It is the second errWarn that is invoked . rv is -1 . The command-line that caused selfserv to get in this condition is below : (dbx) p argc argc = 16 (dbx) p argv[0..15] argv[0..15] = [0] = 0x80469bc "selfserv" [1] = 0x80469c5 "-D" [2] = 0x80469c8 "-p" [3] = 0x80469cb "8443" [4] = 0x80469d0 "-d" [5] = 0x80469d3 "../ext_server" [6] = 0x80469e1 "-n" [7] = 0x80469e4 "monstre.red.iplanet.com" [8] = 0x80469fc "-e" [9] = 0x80469ff "monstre.red.iplanet.com-ec" [10] = 0x8046a1a "-w" [11] = 0x8046a1d "nss" [12] = 0x8046a21 "-c" [13] = 0x8046a24 "ABCDEF:C001:C002:C003:C004:C005:C006:C007:C008:C009:C00A:C00B:C00C:C00D:C00E:C00F:C010:C011:C012:C013:C014cdefgijklmnvyz" [14] = 0x8046a9d "-i" [15] = 0x8046aa0 "../tests_pid.26949" (dbx)
I got 10 cores in ext_client and 10 cores in client all with this same stack - crash in fprintf/strlen due to the NULL error string. We should fix errWarn to check for the NULL string before calling fprintf . But we also need to figure out why PR_Read returned -1 and no error.
Julien, Instead of fixing errWarn to check for NULL, it may be better to fix SECU_Strerror so that it never returns NULL. - map 0 to "no error" - map an error code not in the errStrings table to "unknown error" Note that an error code not in the errStrings table also makes SECU_Strerror return NULL. The first errWarn call should also be fixed: 961 if (rv == 0 || 962 (rv < 0 && PR_END_OF_FILE_ERROR == PR_GetError())) { 963 if (verbose) 964 errWarn("HDX PR_Read hit EOF"); 965 break; 966 } errWarn calls PR_GetError(). We should only call PR_GetError() when an NSPR function failed. So for the rv == 0 case, the return value of PR_GetError() in errWarn is meaningless. (NSPR function don't need to clear the error code when they succeed.)
This problem is reproducible by pulling the tip, setting NSS_ENABLE_ECC and NSS_ECC_MORE_THAN_SUITE_B to 1, and running all.sh . Re: comment 2, Good catch, Wan-Teh. The rv==0 case should be changed in selfserv to not invoke errWarn . I'm not as confident about changing SECU_Strerror to never return NULL . I think it's useful to detect a zero error when one is expected. I am tempted to make that an assertion case, if not in SECU_Strerror, then perhaps in errWarn . Anyway, the main bug here is not about the tools crashing but the PR_Read API returning -1 and not setting any error code.
Here is the test case for the server side : selfserv -D -p 8443 -d ../../../tests_results/security/monstre.2/ext_server -n monstre.red.iplanet.com -e monstre.red.iplanet.com-ec -w nss -c :C010 -t 1 Here is the test case for the client side : tstclnt -p 8443 -h monstre.red.iplanet.com -c :C010 -T -B -s -f -d ../../../tests_results/security/monstre.2/client < /export/home/nss/tip2/mozilla/security/nss/tests/ssl/sslreq.dat I have to run tstclnt *twice* before the server reports the zero error. The first time around, the client reports : tstclnt: write to SSL socket failed: Encountered end of file. And the server reports : selfserv: HDX PR_Read returned error -8092: Unable to generate public/private key pair. The second time, the server aborts with the stack previously reported due to the zero error code. It appears that the server does an ECDHE keygen the first time, and that fails. libssl ends up correctly setting an NSPR error, and then selfserv reports it. But the second time, libssl sets no error.
The problem is in ssl3ecc.c : 613 status = PR_CallOnceWithArg(&gECDHEKeyPairs[ec_curve].once, 614 ssl3_CreateECDHEphemeralKeyPair, 615 (void *)ec_curve); 616 if (status != PR_SUCCESS) { 617 return SECFailure; 618 } The first time PR_CallOnceWithArg is called, it invokes ssl3_CreateECDHEphemeralKeyPair, which ends up setting up the NSPR error code. But the second and subsequent times, this NSPR function just returns PR_FAILURE, without setting any error code. And nothing else in the SSL stack sets it either all the way to the return from PR_Read .
Assignee: nobody → nelson
Summary: PR_Read returns -1 but PR_GetError returns 0 → PR_Read returns -1 but PR_GetError returns 0 when ECDHE keygen fails
Priority: -- → P2
Version: trunk → 3.11.2
Julien, since PRCallOnceType is a public type, we can't change it. Otherwise, we can store the error code in a PRCallOnceType structure. We can consider storing the error code in the "PRStatus status" field. But I am worried that some NSPR clients have started to use the fields of PRCallOnceType directly. (For example, there are patches for lib/freebl that zero a PRCallOnceType structure to fix memory leaks.)
I don't think we necessarily have to change the PRCallOnceType structure. The simplest fix is to have the caller of PR_CallOnceWithArg always set the NSPR error whenever there is a failure. But of course, that will lose the details of the reason for the failure. Another possibility is to set the NSPR error code to zero before the call, and then query the error code after the call, and if it is zero, set the value. This is equivalent to my first proposal except for the first call, during which we would get the detailed error from the actual call.
IINM, this bug is paired with bug 341708. In that bug, we found that the client was receiving a server key exchange with a bogus ECDHE key, and then was aborting the handshake without sending an alert. The implication is that the server did not detect that it had failed to generate a valid ECDHE key, and sent the server key exchange message anyway. That needs to be fixed. I think that bug is essentially the same as this one. I suspect that the code shown in comment 5 above is returning PR_SUCCESS in the second call, even though the first call failed, or alternatively, the function that calls this one ignores the error return. That hypothesis needs investigation.
Nelson, These two bugs are unrelated. In bug 341708, the server "succeeds" at doing the keygen, but the key ends up being invalid, and only the client detects that problem. Bug 341708 is paired with bug 341573 - the combination of the -fsimple=2 compiler option and the floating point ECL library cause the bad ECC computations. On the other hand, this bug occurs when building with NSS_ECC_MORE_THAN_SUITE_B set to 1, and checking out the ecl-curves.h file from the tip as of yesterday, which only contains 3 curves, then running all.sh again with this environment variable set. I verified through debugging that PR_CallOnceWithArg returns PR_FAILURE both the first and second times. The only difference is that the first time, the NSPR error code is set by the function it invokes, and the second time, the function is not invoked, and no NSPR error code is set.
Blocks: 353896
Depends on: 353899
Assignee: nelson → julien.pierre.bugs
Target Milestone: --- → 3.11.4
Attached patch Proposed patchSplinter Review
Although this bug is caused by a design flaw of PR_CallOnce's error reporting (bug 353899), we can't change PR_CallOnce to report meaningful error codes. Therefore, this patch is the only way to get a meaningful error code rather than a fixed error code for all PR_CallOnce failures. With PR_CallOnce's limitation, call-once functions are responsible for storing their error codes so that the error codes can be retrieved later. So I store the error code in the 'error' member of the ECDHEKeyPair structure.
Attachment #246034 - Flags: superreview?(alexei.volkov)
Attachment #246034 - Flags: review?(nelson)
Nelson, please set the target milestone for this bug. For now I left it targeted at a 3.11.x patch release.
Assignee: bugzilla → wtchang
Target Milestone: 3.11.4 → 3.11.5
Attachment #246034 - Flags: superreview?(alexei.volkov) → superreview?(alexei.volkov.bugs)
Comment on attachment 246034 [details] [diff] [review] Proposed patch Patch is good for fixing reported problem.
Attachment #246034 - Flags: superreview?(alexei.volkov.bugs) → superreview+
Comment on attachment 246034 [details] [diff] [review] Proposed patch If there is no platform on which this change creates a binary incompatibility then r=nelson. OTOH, if there is no platform on which the change creates a binary incompatiblity, then why change it? > typedef struct ECDHEKeyPairStr { > ssl3KeyPair * pair; >- PRInt32 flag; >+ int error; /* error code of the call-once function */ > PRCallOnceType once; > } ECDHEKeyPair;
Attachment #246034 - Flags: review?(nelson) → review+
Comment on attachment 246034 [details] [diff] [review] Proposed patch Thanks for the code review. Before I answer Nelson's question, I just wanted to make sure that you still want to fix this bug in NSS 3.11.x. (Julien set the target milestone to 3.11.4 on 2006-09-22.) Nelson, since the ECDHEKeyPairStr structure is private (it's defined inside a .c file), changing the type of a member won't break binary compatibility on any platform. It turns out that the original 'PRInt32 flag' member was unused. I renamed it 'error' because I'm storing the error code in it. Then I changed the type to 'int' because that's the type of error codes in NSS (used by PORT_GetError and PORT_SetError). Does this answer address your concern?
Wan-Teh, thanks for your answer.
I checked in the patch on the NSS trunk (NSS 3.12) and NSS_3_11_BRANCH (NSS 3.11.5). Checking in ssl3ecc.c; /cvsroot/mozilla/security/nss/lib/ssl/ssl3ecc.c,v <-- ssl3ecc.c new revision: 1.16; previous revision: 1.15 done Checking in ssl3ecc.c; /cvsroot/mozilla/security/nss/lib/ssl/ssl3ecc.c,v <-- ssl3ecc.c new revision: 1.3.2.10; previous revision: 1.3.2.9 done
Status: NEW → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
No longer depends on: 353899
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: