Closed Bug 341573 Opened 19 years ago Closed 18 years ago

ECDHE SSL tests fail on UltraSparc with Studio 11 and -fsimple=2 option

Categories

(NSS :: Libraries, defect, P1)

3.11
Sun
SunOS
defect

Tracking

(Not tracked)

RESOLVED FIXED
3.11.2

People

(Reporter: julien.pierre, Assigned: julien.pierre)

Details

Attachments

(1 file, 1 obsolete file)

This problem was found on the tip by our bigbelly tinderbox, which is an Ultra 10 with an UltraSparc II CPU. But I believe it may exist on the NSS_3_11_BRANCH as well.

All ECDHE tests report failures on the client side:

    strsclnt: -- SSL: Server Certificate Validated.
    strsclnt: PR_Send returned error -12219:
    Unspecified failure while processing SSL Client Key Exchange handshake.

I verified with a known good build on another platform that this was a problem with the server running with NSS on the UltraSparc II.

I made the following experiment:

    cp libfreebl_32int_3.so libfreebl_32fpu_3.so

This causes NSS to use the "32-bit integer" version of our freebl crypto library, instead of the "32-bit floating point" version. When I did this, the ECDHE tests succeeded, both locally and with the remote client.

This tells me that libfreebl_32fpu_3.so as we build it today is not compatible with the UltraSparc II chip. This might be a problem with the compiler flags. I'm experimenting with them now. The next possibility is a compiler bug. And of course, we might have another bug in freebl that is somehow triggered only on that CPU.
Here is a quick test case for this bug after you run cert.sh. It is to be run from mozilla/dist/$OBJDIR/bin:

Server:

    #!/bin/tcsh
    selfserv -D -p 8443 -d ../../../tests_results/security/bigbelly.1/ext_server -n bigbelly.red.iplanet.com -B -s \
        -e bigbelly.red.iplanet.com-ec -w nss -c :C013

Client:

    #!/bin/tcsh
    ./strsclnt -q -p 8443 -d ../../../tests_results/security/bigbelly.1/ext_client -w nss -c 1000 -C :C013 -T bigbelly.red.iplanet.com
Attached patch: A working patch (obsolete)
I took the extreme approach of removing a number of flags. When I built with this combination, the ECDHE errors disappeared. I'm going to put them back one at a time to figure out which causes the problem. Unfortunately the machine is very slow to build.
I have confirmed that this is a problem on NSS_3_11_BRANCH by doing my own build and running it on bigbelly.
Priority: -- → P1
Target Milestone: --- → 3.11.2
I tracked the problem down to the -fsimple=2 compiler flag. The man page for the Sun compiler states:

    -fsimple[=n]
        Allows the optimizer to make simplifying assumptions concerning
        floating-point arithmetic.

        The compiler defaults to -fsimple=0. Specifying -fsimple is
        equivalent to -fsimple=1.

        If n is present, it must be 0, 1, or 2.

        -fsimple=0
            Permits no simplifying assumptions. Preserves strict IEEE 754
            conformance.

        -fsimple=1
            Allows conservative simplifications. The resulting code does
            not strictly conform to IEEE 754, but numeric results of most
            programs are unchanged.

            With -fsimple=1, the optimizer can assume the following:

            o The IEEE 754 default rounding/trapping modes do not change
              after process initialization.
            o Computations producing no visible result other than
              potential floating-point exceptions may be deleted.
            o Computations with Infinity or NaNs as operands need not
              propagate NaNs to their results. For example, x*0 may be
              replaced by 0.
            o Computations do not depend on sign of zero.

            With -fsimple=1, the optimizer is not allowed to optimize
            completely without regard to roundoff or exceptions. In
            particular, a floating-point computation cannot be replaced
            by one that produces different results with rounding modes
            held constant at run time.

        -fsimple=2
            Enables use of SIMD instructions to compute reductions when
            -xvector=simd is in effect.

            Permits aggressive floating point optimizations that may
            cause many programs to produce different numeric results due
            to changes in rounding. For example, -fsimple=2 permits the
            optimizer to attempt replace all computations of x/y in a
            given loop with x*z, where x/y is guaranteed to be evaluated
            at least once in the loop, z=1/y, and the values of y and z
            are known to have constant values during execution of the
            loop.

            Even with -fsimple=2, the optimizer still is not permitted to
            introduce a floating point exception in a program that
            otherwise produces none.

        See Also: Techniques for Optimizing Applications: High
        Performance Computing written by Rajat Garg and Ilya Sharapov for
        a more detailed explanation of how optimization can impact
        precision.

So, -fsimple=2 enables the use of SIMD instructions. On the UltraSparc chips, that would probably be the VIS or VIS2 instructions. The UltraSparc II chip on bigbelly supports VIS, but not VIS2.

    % isalist
    sparcv9+vis sparcv9 sparcv8plus+vis sparcv8plus sparcv8 sparcv8-fsmuld sparcv7 sparc

libfreebl_32fpu_3.so is marked as requiring VIS, but not VIS2. So, I would expect it to work on the UltraSparc II. This library works properly on UltraSparc III, which supports both VIS and VIS2.
This problem exists on both 32-bit and 64-bit builds. Both fail this ECC test on the US-II.

In the 32-bit build, it is enough to recompile ecp_fp.c with -fsimple=1 to pass the test from comment 1. In the 64-bit build, I need to recompile both ecp_fp.c and ecp_fp192.c with -fsimple=1 to pass the test from comment 1. I didn't run the full set of tests. Other files may be affected as well.

My preliminary recommendation is to change -fsimple=2 down to -fsimple=1. I will verify that all.sh passes with that setting. I will also talk to our compiler team to find out why -fsimple=2 might cause an issue.

Based on the man page I quoted in comment 4, SIMD code (i.e. VIS) would only be generated if -xvector=simd is also present. But we don't have that. So, this problem may not be related to VIS.
I have confirmed that all.sh passes when NSS is built with this fix using Studio 11: Sun C 5.8 2005/10/13. It seems that builds made with Studio 10 may not have this bug. We are currently shipping only bits made with Studio 10. If that's confirmed, I will change the target milestone for this bug to 3.12.
Attachment #225633 - Attachment is obsolete: true
I have confirmed that this bug only exists when building with the Studio 11 compiler. Since we are building 3.11.x with Studio 10, there is no urgent need to make this change on NSS_3_11_BRANCH. But we build our tip with Studio 11, so this change is still necessary. Retargeting to 3.12.
Target Milestone: 3.11.2 → 3.12
Version: 3.11.2 → 3.12
There are several issues with this bug, many of them discussed in our meeting today.

1) We don't have low-level EC tests (bltest) that test the functionality that failed here. I will open an RFE for that. Bob suggested that I use ecperf in the meantime and see if it reports any failure. I will try that.

2) Bob pointed out that it doesn't make sense for ECDHE to fail but ECDH to pass. I will try to use dtrace to collect the sequence of all freebl calls made by the server during one request to see what the difference is. This would help create the proper bltest tests that would be able to catch this problem.

3) Nelson pointed out yesterday that selfserv should be reporting some sort of error, since the handshake does not complete. But it doesn't report anything at all. Only the client reports the error. I have already determined that the server is at fault, by using a client from a known good build (on a different platform) against the server from the bad build. The good client reports the same error reported in the initial description of this bug. There may be another bug in selfserv or libssl here about the lack of error reporting.

4) The code that is causing the problem is in the floating point EC library. We are only compiling that on Sparc, in the 32fpu and 64fpu flavors of freebl. There might be a bug in that code that is only triggered with the -fsimple=2 compiler option on Sparc. The functions that cause the problem haven't been identified, but the source files have: they are ecp_fp.c and ecp_fp192.c. I'm adding Douglas and Vipul to the cc list since they might have some idea about that.
Some NSS developers (including me) build the NSS_3_11_BRANCH with Sun Studio 11. This bug should not block any NSS 3.11.x release, but when you check in your patch, please check it in on the NSS_3_11_BRANCH as well. If you need to use -fsimple=2 with Studio 10, you will need to add a check for the compiler version to your patch.
Wan-Teh, -fsimple=1 makes the code work for both Studio 10 and 11. I don't think we "require" -fsimple=2 for Studio 10. But that flag was part of the list that produced the best RSA code last summer. If I find some cycles I'll see how much of a difference that individual flag makes on rsaperf.
Target Milestone: 3.12 → 3.11.2
Version: 3.12 → 3.11.2
I didn't mean to change the target.
Target Milestone: 3.11.2 → 3.12
Version: 3.11.2 → 3.12
I think floating point performance doesn't matter in freebl. The mpi bignum library does have some floating point code for Solaris SPARC, but that is an assembly code file, so the compiler optimization isn't relevant.
Update from comment 8: I had long chats with Bob and Nelson today.

1) I filed bug 341704 about more low-level EC QA. Bob's ecperf program showed the following failures on the US-II machine with the broken build:

    Testing NIST-P192 using freebl implementation...
    ECDH_Derive count: 100 sec: 0.50 op/sec: 199.12
    ECDSA_Sign count: 100 sec: 0.54 op/sec: 186.88
    ECDHE max rate = 96.50
    Error:: ECDSA_Verify: Peer's certificate has an invalid signature.
    ... okay.
    Testing NIST-P224 using freebl implementation...
    ECDH_Derive count: 100 sec: 0.77 op/sec: 129.39
    ECDSA_Sign count: 100 sec: 0.81 op/sec: 122.93
    ECDHE max rate = 63.08
    Error:: ECDSA_Verify: Peer's certificate has an invalid signature.
    ... okay.

SSL was negotiating a key with the P-192 curve. When using a SUITE B client, which supports neither this curve nor the P-224 curve and uses the P-256 curve instead, the connection is successful. But the SUITE B client has to have TLS enabled in order to send the proper ClientHello extension. My strsclnt had TLS disabled (-T argument). With SSL3 only, NSS does not send the curve ClientHello extension. So, it took a while to figure out why the SUITE B client still failed. See bug 341707 that Nelson filed about that problem.

2) We also know why ECDH doesn't fail here. In the test case, the server cert uses a P-256 curve, and the CA cert uses P-521. These curves don't run into the problem.

3) I filed bug 341708 about the selfserv silence.

4) I would like some assistance from Douglas and/or Vipul about where to start looking for the failure source. Even if it is a compiler bug, we will need to isolate the source which shows the problem for a test case.

For now, I have checked in the change to use -fsimple=1 on both the tip and the branch.
NSS_3_11_BRANCH:

    Checking in Makefile;
    /cvsroot/mozilla/security/nss/lib/freebl/Makefile,v <-- Makefile
    new revision: 1.70.2.10; previous revision: 1.70.2.9
    done

Tip:

    Checking in Makefile;
    /cvsroot/mozilla/security/nss/lib/freebl/Makefile,v <-- Makefile
    new revision: 1.85; previous revision: 1.84
    done
(In reply to comment #13)
> 4) I would like some assistance from Douglas and/or Vipul about where to start
> looking for the failure source. Even if it is a compiler bug, we will need to
> isolate the source which shows the problem for a test case.

Can you build the standalone ECC libraries and report if the tests fail? To do so, run the following:

    % cd nss/lib/freebl/mpi
    % gmake libs
    % cd ../ecl
    % gmake tests
    % ./ecp_test
    % ./ec2_test
    % ./ec_naft

If any of the last three tests fail, please rerun them with the parameter --print:

    % ./ecp_test --print

and post the output here. That may help in narrowing down the problem in the elliptic curve code.

Douglas
Douglas,

Unfortunately, when building ECL the way you suggested in comment 14, the set of compiler flags used is totally different from the set used when ECL is built as part of a freebl shared library. So, I wouldn't expect to find the same problems. But anyway, I did as you suggested, and here are the results:

    [jp96085@bigbelly]/h/monstre/export/home/nss/tip/mozilla/security/nss/lib/freebl/ecl 163 % ./ecp_test
    Testing SECP-160R1 using generic implementation...
    Error: could not construct params.
    Error: exiting with error value -1
    [jp96085@bigbelly]/h/monstre/export/home/nss/tip/mozilla/security/nss/lib/freebl/ecl 164 % ./ec2_test
    Testing SECT-131R1 using generic implementation...
    Error: could not construct params.
    Error: exiting with error value -1
    [jp96085@bigbelly]/h/monstre/export/home/nss/tip/mozilla/security/nss/lib/freebl/ecl 165 % ./ec_naft
    Segmentation fault (core dumped)
    [jp96085@bigbelly]/h/monstre/export/home/nss/tip/mozilla/security/nss/lib/freebl/ecl 166 % file core
    core: ELF 32-bit MSB core file SPARC Version 1, from 'ec_naft'
    [jp96085@bigbelly]/h/monstre/export/home/nss/tip/mozilla/security/nss/lib/freebl/ecl 167 % dbx ec_naft core
    For information about new features see `help changes'
    To remove this message, put `dbxenv suppress_startup_message 7.5' in your .dbxrc
    Reading ec_naft
    core file header read successfully
    Reading ld.so.1
    Reading libm.so.2
    Reading libc.so.1
    program terminated by signal SEGV (no mapping at the fault address)
    0x0001d3d4: mp_init_copy+0x001c: ld [%i1 + 4], %o0
    (dbx) w
    =>[1] mp_init_copy(0xffbfefa8, 0x2c, 0x4, 0x10, 0x3, 0xffbfefa8), at 0x1d3d4
      [2] ec_compute_wNAF(0xffbff48c, 0xa0, 0x2c, 0x5, 0x0, 0xffbfefa8), at 0x17508
      [3] main(0x1, 0xffbff594, 0xffbff59c, 0x0, 0xffbff48c, 0x2c), at 0x12d5c
    (dbx) q
I have determined that the problem had no relation to the UltraSparc II chip on our tinderbox machine. The problem is purely the code generated by Studio 11 with the -fsimple=2 option on the Sparc architecture, in the libfreebl_32fpu_3.so library. I tested the bits on an UltraSparc III and they failed too. We just didn't have any other Sparc machine building with Studio 11. I think that's one less mystery: the generated code runs the same on all UltraSparc chips. But we still need to find out if the problem is somewhere in our source or strictly in the compiler.
Summary: ECDHE SSL tests fail on UltraSparc II CPU → ECDHE SSL tests fail on UltraSparc with Studio 11 and -fsimple=2 option
Julien, we at Red Hat have started to build with Studio 11. So please consider Studio 11 a supported compiler. On the other hand, it should be okay to drop Forte 6 update 2 soon.
Group: security
Assignee: bugzilla → neil.williams
Group: security
CC list accessible: false
Not accessible to reporter
This problem was worked around for 3.11.2. I will open another bug to figure out if this was a compiler issue or an NSS issue.
Assignee: neil.williams → julien.pierre.boogz
Target Milestone: 3.12 → 3.11.2
Version: 3.12 → 3.11
Status: NEW → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED