Closed
Bug 341573
Opened 19 years ago
Closed 18 years ago
ECDHE SSL tests fail on UltraSparc with Studio 11 and -fsimple=2 option
Categories
(NSS :: Libraries, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
3.11.2
People
(Reporter: julien.pierre, Assigned: julien.pierre)
Details
Attachments
(1 file, 1 obsolete file)
2.01 KB,
patch
This problem was found on the tip by our bigbelly tinderbox, which is an Ultra 10 with an UltraSparc II CPU. But I believe it may exist on the NSS_3_11_BRANCH as well. All ECDHE tests report failures on the client side:
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: PR_Send returned error -12219:
Unspecified failure while processing SSL Client Key Exchange handshake.
I verified with a known good build on another platform that this was a problem with the server running with NSS on the UltraSparc-II.
I ran the following experiment:
cp libfreebl_32int_3.so libfreebl_32fpu_3.so
This causes NSS to use the "32-bit integer" version of our freebl crypto library, instead of the "32-bit floating point" version.
When I did this, the ECDHE tests succeeded, both locally and with the remote client.
This tells me that libfreebl_32fpu_3.so as we build it today is not compatible with the UltraSparc II chip. This might be a problem with the compiler flags. I'm experimenting with them now. The next possibility is a compiler bug. And of course, we might have another bug in freebl that is somehow triggered only on that CPU.
Comment 1•19 years ago (Assignee)
Here is a quick test case for this bug. Run it after cert.sh, from mozilla/dist/$OBJDIR/bin:
Server :
#!/bin/tcsh
selfserv -D -p 8443 -d ../../../tests_results/security/bigbelly.1/ext_server -n bigbelly.red.iplanet.com -B -s \
-e bigbelly.red.iplanet.com-ec -w nss -c :C013
Client :
#!/bin/tcsh
./strsclnt -q -p 8443 -d ../../../tests_results/security/bigbelly.1/ext_client -w nss -c 1000 -C :C013 -T bigbelly.red.iplanet.com
Comment 2•19 years ago (Assignee)
I took the extreme approach of removing a number of flags. When I built with this combination, the ECDHE errors disappeared. I'm going to put them back one at a time to figure out which causes the problem. Unfortunately the machine is very slow to build.
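The procedure described here (start from a reduced flag set that passes, then restore the removed flags one at a time) can be sketched as follows. This is only an illustration: `find_culprit_flags` and `build_passes` are hypothetical names, the placeholder flags are not real compiler options, and the fake oracle stands in for a real rebuild plus a run of the ECDHE test. The only real flag used is -fsimple=2, which comment 4 later identifies.

```python
def find_culprit_flags(known_good, removed, build_passes):
    """Restore removed flags one at a time on top of a passing flag set.

    build_passes(flags) is a caller-supplied (hypothetical) predicate that
    rebuilds with the given flags and reports whether the tests pass.
    Returns (culprits, final_flags).
    """
    current = list(known_good)
    culprits = []
    for flag in removed:
        trial = current + [flag]
        if build_passes(trial):
            current = trial          # flag is harmless, keep it
        else:
            culprits.append(flag)    # flag breaks the build, leave it out
    return culprits, current


# Fake oracle for demonstration only: the build "fails" when -fsimple=2 is present.
def fake_build_passes(flags):
    return "-fsimple=2" not in flags


# Placeholder flag names; only -fsimple=2 comes from this bug.
known_good = ["-placeholder-base"]
removed = ["-placeholder-a", "-fsimple=2", "-placeholder-b"]

culprits, final_flags = find_culprit_flags(known_good, removed, fake_build_passes)
print("culprits:", culprits)
print("flags kept:", final_flags)
```

Each oracle call corresponds to a full rebuild, so on a slow machine the cost is one build per removed flag; a binary search over the removed set would need fewer builds when only one flag is at fault.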
Comment 3•19 years ago (Assignee)
I have confirmed that this is a problem on NSS_3_11_BRANCH through doing my own build and running it on bigbelly.
Priority: -- → P1
Target Milestone: --- → 3.11.2
Comment 4•19 years ago (Assignee)
I tracked the problem down to the -fsimple=2 compiler flag.
The man page for the Sun compiler states:
-fsimple[=n]
Allows the optimizer to make simplifying assumptions
concerning floating-point arithmetic.
The compiler defaults to -fsimple=0. Specifying -fsimple
is equivalent to -fsimple=1.
If n is present, it must be 0, 1, or 2.
-fsimple=0
Permits no simplifying assumptions. Preserves strict
IEEE 754 conformance.
-fsimple=1
Allows conservative simplifications. The resulting code
does not strictly conform to IEEE 754, but numeric
results of most programs are unchanged.
With -fsimple=1, the optimizer can assume the following:
o The IEEE 754 default rounding/trapping modes do not
change after process initialization.
o Computations producing no visible result other than
potential floating-point exceptions may be deleted.
o Computations with Infinity or NaNs as operands need
not propagate NaNs to their results. For example, x*0
may be replaced by 0.
o Computations do not depend on sign of zero.
With -fsimple=1, the optimizer is not allowed to optimize
completely without regard to roundoff or exceptions.
In particular, a floating-point computation cannot
be replaced by one that produces different results
with rounding modes held constant at run time.
-fsimple=2
Enables use of SIMD instructions to compute reductions
when -xvector=simd is in effect.
Permits aggressive floating point optimizations that
may cause many programs to produce different numeric
results due to changes in rounding. For example, -fsimple=2
permits the optimizer to attempt replace all computations
of x/y in a given loop with x*z, where x/y is
guaranteed to be evaluated at least once in the loop,
z=1/y, and the values of y and z are known to have constant
values during execution of the loop.
Even with -fsimple=2, the optimizer still is not permitted
to introduce a floating point exception in a
program that otherwise produces none.
See Also:
Techniques for Optimizing Applications: High Performance
Computing written by Rajat Garg and Ilya Sharapov
for a more detailed explanation of how optimization can
impact precision.
So, -fsimple=2 enables the use of SIMD instructions.
On the UltraSparc chips, that would probably be the VIS or VIS2 instructions.
The UltraSparc II chip on bigbelly supports VIS, but not VIS2.
% isalist
sparcv9+vis sparcv9 sparcv8plus+vis sparcv8plus sparcv8 sparcv8-fsmuld sparcv7 sparc
libfreebl_32fpu_3.so is marked as requiring VIS, but not VIS2. So, I would expect it to work on the UltraSparc II. This library works properly on UltraSparc III, which supports both VIS and VIS2.
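As a side illustration of the x/y to x*z rewrite that the quoted man page permits under -fsimple=2: in IEEE 754 double precision the two forms do not always round to the same value. The sketch below uses Python floats (which are IEEE doubles on common platforms) and a hypothetical helper name, `reciprocal_mismatches`. It does not reproduce the NSS failure and makes no claim about which transformation Studio 11 actually applied to ecp_fp.c; it only shows that the substitution is not value-preserving.

```python
def reciprocal_mismatches(y, xs):
    """Return the x values for which x / y differs from x * (1 / y)."""
    z = 1.0 / y  # the loop-invariant reciprocal the optimizer may hoist
    return [x for x in xs if x / y != x * z]


# Dividing by 3: 1/3 is not exactly representable, so some products round differently.
by_three = reciprocal_mismatches(3.0, [float(i) for i in range(1, 11)])
print("x/3 != x*(1/3) for:", by_three)

# Dividing by a power of two: the reciprocal is exact, so nothing changes.
by_two = reciprocal_mismatches(2.0, [float(i) for i in range(1, 11)])
print("x/2 != x*(1/2) for:", by_two)
```

Code that relies on doubles holding exact intermediate values, as floating-point bignum arithmetic may, has no tolerance for a one-ulp difference of this kind.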
Comment 5•19 years ago (Assignee)
This problem exists on both 32-bit and 64-bit builds. Both fail this ECC test on the US-II.
In the 32-bit build, it is enough to recompile ecp_fp.c with -fsimple=1 to pass the test from comment 1.
In the 64-bit build, I need to recompile both ecp_fp.c and ecp_fp192.c with -fsimple=1 to pass the test from comment 1.
I didn't run the full set of tests. Other files may be affected as well.
My preliminary recommendation is to change -fsimple=2 down to -fsimple=1. I will verify that all.sh passes with that setting.
I will also talk to our compiler team to find out why -fsimple=2 might cause an issue. Based on the man page I quoted in comment 4, SIMD code (i.e. VIS) would only be generated if -xvector=simd is also present. But we don't have that. So, this problem may not be related to VIS.
Comment 6•19 years ago (Assignee)
I have confirmed that all.sh passes when NSS is built with this fix using Studio 11: Sun C 5.8 2005/10/13.
It seems that builds made with Studio 10 may not have this bug. We are currently shipping bits only made with Studio 10. If that's confirmed, I will change the target milestone for this bug to 3.12.
Attachment #225633 -
Attachment is obsolete: true
Comment 7•19 years ago (Assignee)
I have confirmed that this bug only exists when building with the Studio 11 compiler. Since we are building 3.11.x with Studio 10, there is no urgent need to make this change on NSS_3_11_BRANCH. But we build our tip with Studio 11, so this change is still necessary. Retargeting to 3.12.
Target Milestone: 3.11.2 → 3.12
Version: 3.11.2 → 3.12
Comment 8•19 years ago (Assignee)
There are several issues with this bug, many of them discussed in our meeting today.
1) We don't have low-level EC tests (bltest) that test the functionality that failed here. I will open an RFE for that. Bob suggested that I use ecperf in the meantime and see if it reports any failure. I will try that.
2) Bob pointed out that it doesn't make sense for ECDHE to fail but ECDH to pass. I will try to use dtrace to collect the sequence of all freebl calls made by the server during one request to see what the difference is. This would help create the proper tests with bltest above that would be able to catch this problem.
3) Nelson pointed out yesterday that selfserv should be reporting some sort of error, since the handshake does not complete. But it doesn't report anything at all. Only the client reports the error. I have already determined that the server is at fault, by using a client from a known good build (on a different platform) against the server from the bad build. The good client reports the same error reported in the initial description of this bug. There may be another bug in selfserv or libssl here about the lack of error reporting.
4) The code that is causing the problem is in the floating point EC library. We are only compiling that on Sparc, in the 32fpu and 64fpu flavors of freebl. There might be a bug in that code that is only triggered with the -fsimple=2 compiler option on Sparc. The functions that cause the problem haven't been identified, but the source files have: they are ecp_fp.c and ecp_fp192.c. I'm adding Douglas and Vipul to the cc list since they might have some idea about that.
Comment 9•19 years ago
Some NSS developers (including me) build the NSS_3_11_BRANCH
with Sun Studio 11. This bug should not block any NSS 3.11.x
release, but when you check in your patch, please check it in
on the NSS_3_11_BRANCH as well. If you need to use -fsimple=2
with Studio 10, you will need to add a check for the compiler
version to your patch.
Comment 10•19 years ago (Assignee)
Wan-Teh,
-fsimple=1 makes the code work for both Studio 10 and 11. I don't think we "require" -fsimple=2 for Studio 10. But that flag was part of the list that produced the best RSA code last summer. If I find some cycles I'll see how much of a difference that individual flag makes on rsaperf.
Target Milestone: 3.12 → 3.11.2
Version: 3.12 → 3.11.2
Comment 11•19 years ago (Assignee)
I didn't mean to change the target.
Target Milestone: 3.11.2 → 3.12
Version: 3.11.2 → 3.12
Comment 12•19 years ago
I think floating point performance doesn't matter in
freebl. The mpi bignum library does have some floating
point code for Solaris SPARC, but that is an assembly
code file, so the compiler optimization isn't relevant.
Comment 13•19 years ago (Assignee)
Update from comment 8:
I had long chats with Bob and Nelson today.
1) I filed bug 341704 about more low-level EC QA.
Bob's ecperf program showed the following failures on the US-II machine with the broken build :
Testing NIST-P192 using freebl implementation...
ECDH_Derive count: 100 sec: 0.50 op/sec: 199.12
ECDSA_Sign count: 100 sec: 0.54 op/sec: 186.88
ECDHE max rate = 96.50
Error:: ECDSA_Verify: Peer's certificate has an invalid signature.
... okay.
Testing NIST-P224 using freebl implementation...
ECDH_Derive count: 100 sec: 0.77 op/sec: 129.39
ECDSA_Sign count: 100 sec: 0.81 op/sec: 122.93
ECDHE max rate = 63.08
Error:: ECDSA_Verify: Peer's certificate has an invalid signature.
... okay.
SSL was negotiating a key with the P-192 curve.
When using a SUITE B client, which supports neither this curve nor the P-224 curve and uses the P-256 curve instead, the connection is successful. But the SUITE B client has to have TLS enabled in order to send the proper hello extension. My strsclnt had TLS disabled (-T argument). With SSL3 only, NSS does not send the curve hello extension. So, it took a while to figure out why the SUITE B client still failed. See bug 341707 that Nelson filed about that problem.
2) We also know why ECDH doesn't fail here. In the test case, the server cert uses a P-256 curve. And the CA cert uses P-521. And these curves don't run into the problem.
3) I filed bug 341708 about the selfserv silence.
4) I would like some assistance from Douglas and/or Vipul about where to start looking for the failure source. Even if it is a compiler bug, we will need to isolate the source which shows the problem for a test case.
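For item 1, the failing curves can be pulled out of ecperf output mechanically. The sketch below assumes only the line shapes visible in the output quoted above ("Testing <curve> using ..." followed by lines that may start with "Error:"); `curves_with_errors` is a hypothetical helper, not part of ecperf or NSS.

```python
import re

# Sample copied from the ecperf output quoted in this comment.
SAMPLE = """\
Testing NIST-P192 using freebl implementation...
ECDH_Derive count: 100 sec: 0.50 op/sec: 199.12
ECDSA_Sign count: 100 sec: 0.54 op/sec: 186.88
ECDHE max rate = 96.50
Error:: ECDSA_Verify: Peer's certificate has an invalid signature.
... okay.
Testing NIST-P224 using freebl implementation...
ECDH_Derive count: 100 sec: 0.77 op/sec: 129.39
ECDSA_Sign count: 100 sec: 0.81 op/sec: 122.93
ECDHE max rate = 63.08
Error:: ECDSA_Verify: Peer's certificate has an invalid signature.
... okay.
"""


def curves_with_errors(output):
    """Return, in order of appearance, the curves whose section has an Error line."""
    failing = []
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        header = re.match(r"Testing (\S+) using", line)
        if header:
            current = header.group(1)      # start of a new curve section
        elif line.startswith("Error:") and current is not None:
            if current not in failing:
                failing.append(current)    # record each curve once
    return failing


print(curves_with_errors(SAMPLE))
```

Note that ecperf prints "... okay." even after an Error line in this output, so the trailing status alone is not a reliable pass indicator.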
For now, I have checked in the change to use -fsimple=1 on both the tip and the branch.
NSS_3_11_BRANCH :
Checking in Makefile;
/cvsroot/mozilla/security/nss/lib/freebl/Makefile,v <-- Makefile
new revision: 1.70.2.10; previous revision: 1.70.2.9
done
Tip :
Checking in Makefile;
/cvsroot/mozilla/security/nss/lib/freebl/Makefile,v <-- Makefile
new revision: 1.85; previous revision: 1.84
done
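The curve observations in items 1 and 2 of this comment can be summarized with a toy model. This is not NSS's actual curve selection logic: the server preference order, the fallback to the first preferred curve when no curve extension is received, and the names `negotiated_curve`, `SERVER_PREFERENCE` and `AFFECTED_CURVES` are all assumptions made for illustration. Only the affected curves (P-192, P-224) and the working P-256 case come from this bug.

```python
# Curves for which ecperf reported errors on the broken build (from this comment).
AFFECTED_CURVES = {"P-192", "P-224"}

# Assumed server preference order, for illustration only.
SERVER_PREFERENCE = ["P-192", "P-224", "P-256", "P-384", "P-521"]


def negotiated_curve(server_preference, client_curves):
    """Pick the ECDHE curve in a toy model.

    client_curves is None when the client sent no curve extension
    (for example an SSL3-only client); otherwise it is the set of
    curves the client offered.
    """
    if client_curves is None:
        return server_preference[0]        # assumed default when nothing is offered
    for curve in server_preference:
        if curve in client_curves:
            return curve                   # first server-preferred curve the client offers
    return None                            # no curve in common


# SSL3-only client: no extension, so the toy server falls back to P-192.
ssl3_choice = negotiated_curve(SERVER_PREFERENCE, None)
# TLS client that offers P-256 but neither P-192 nor P-224.
tls_choice = negotiated_curve(SERVER_PREFERENCE, {"P-256"})

print(ssl3_choice, "affected" if ssl3_choice in AFFECTED_CURVES else "ok")
print(tls_choice, "affected" if tls_choice in AFFECTED_CURVES else "ok")
```

Under these assumptions the model matches what was observed: the handshake only reaches the broken code path when the negotiated ECDHE curve is one of the affected ones.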
Comment 14•19 years ago
(In reply to comment #13)
> 4) I would like some assistance from Douglas and/or Vipul about where to start
> looking for the failure source. Even if it is a compiler bug, we will need to
> isolate the source which shows the problem for a test case.
Can you build the standalone ECC libraries and report if the tests fail? To do so, run the following:
% cd nss/lib/freebl/mpi
% gmake libs
% cd ../ecl
% gmake tests
% ./ecp_test
% ./ec2_test
% ./ec_naft
If any of the last three tests fail, please rerun them with the parameter --print:
% ./ecp_test --print
and post the output here. That may help in narrowing down the problem in the elliptic curve code.
Douglas
Comment 15•19 years ago (Assignee)
Douglas,
Unfortunately, when building ECL the way you suggested in comment 14, the set of compiler flags used is totally different than when ECL is built as part of a freebl shared library. So, I wouldn't expect to find the same problems.
But anyway, I did as you suggested, and here are the results :
[jp96085@bigbelly]/h/monstre/export/home/nss/tip/mozilla/security/nss/lib/freebl/ecl 163 % ./ecp_test
Testing SECP-160R1 using generic implementation...
Error: could not construct params.
Error: exiting with error value -1
[jp96085@bigbelly]/h/monstre/export/home/nss/tip/mozilla/security/nss/lib/freebl/ecl 164 % ./ec2_test
Testing SECT-131R1 using generic implementation...
Error: could not construct params.
Error: exiting with error value -1
[jp96085@bigbelly]/h/monstre/export/home/nss/tip/mozilla/security/nss/lib/freebl/ecl 165 % ./ec_naft
Segmentation fault (core dumped)
[jp96085@bigbelly]/h/monstre/export/home/nss/tip/mozilla/security/nss/lib/freebl/ecl 166 % file core
core: ELF 32-bit MSB core file SPARC Version 1, from 'ec_naft'
[jp96085@bigbelly]/h/monstre/export/home/nss/tip/mozilla/security/nss/lib/freebl/ecl 167 % dbx ec_naft core
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.5' in your .dbxrc
Reading ec_naft
core file header read successfully
Reading ld.so.1
Reading libm.so.2
Reading libc.so.1
program terminated by signal SEGV (no mapping at the fault address)
0x0001d3d4: mp_init_copy+0x001c: ld [%i1 + 4], %o0
(dbx) w
=>[1] mp_init_copy(0xffbfefa8, 0x2c, 0x4, 0x10, 0x3, 0xffbfefa8), at 0x1d3d4
[2] ec_compute_wNAF(0xffbff48c, 0xa0, 0x2c, 0x5, 0x0, 0xffbfefa8), at 0x17508
[3] main(0x1, 0xffbff594, 0xffbff59c, 0x0, 0xffbff48c, 0x2c), at 0x12d5c
(dbx) q
Comment 16•19 years ago (Assignee)
I have determined that the problem has no relation to the UltraSparc II chip on our tinderbox machine. The problem is purely the code generated by Studio 11 with the -fsimple=2 option on the Sparc architecture, in the libfreebl_32fpu_3.so library. I tested the bits on an UltraSparc III and they failed too. We just didn't have any other Sparc machine building with Studio 11.
I think that's one less mystery: the generated code runs the same on all UltraSparc chips. But we still need to find out if the problem is somewhere in our source or strictly in the compiler.
Summary: ECDHE SSL tests fail on UltraSparc II CPU → ECDHE SSL tests fail on UltraSparc with Studio 11 and -fsimple=2 option
Comment 17•19 years ago
Julien, we at Red Hat have started to build with
Studio 11. So please consider Studio 11 a supported
compiler. On the other hand, it should be okay to
drop Forte 6 update 2 soon.
Updated•18 years ago
Group: security
Updated•18 years ago
Assignee: bugzilla → neil.williams
Group: security
CC list accessible: false
Not accessible to reporter
Comment 18•18 years ago (Assignee)
This problem was worked around for 3.11.2. I will open another bug to figure out if this was a compiler issue or an NSS issue.
Assignee: neil.williams → julien.pierre.boogz
Target Milestone: 3.12 → 3.11.2
Version: 3.12 → 3.11
Updated•18 years ago (Assignee)
Status: NEW → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED