Closed
Bug 831006
Opened 8 years ago
Closed 8 years ago
P-256 performance on ARM is poor
Categories
(NSS :: Libraries, enhancement, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
3.15
People
(Reporter: agl, Assigned: agl)
Details
Attachments
(2 files, 6 obsolete files)
54.13 KB,
patch
|
wtc
:
checked-in+
|
Details | Diff | Splinter Review |
879 bytes,
patch
|
agl
:
review+
wtc
:
checked-in+
|
Details | Diff | Splinter Review |
From a previous email, where I'm doing testing on a Galaxy Nexus: "Firstly I tweaked the Makefile to change -O to -O3. I also found significant differences in speed between when the screen is on and when it's off. I assume in the latter case that the CPU frequency is reduced to save power. Below, I report times in the format "16/20ms" where the time with the screen on is 16ms and the time when the screen is off is 20ms. The original code is 20.59/35.29ms and 20.62/35.38ms for a base-point scalar mult and arbitrary point scalar mult respectively. ECDHE involves one of each and the sum (41.21/70.67ms) is pretty close to the estimate that wtc reported (~80ms) for running on some ARM device, at least with the screen off. I found that NSS's optimised NIST routines are disabled in the standard build. By enabling those I get 13.64/23.09ms and 13.59/23.11ms (sum: 27.23/46ms) wtc also reported that he found ARM assembly that was disabled by default and I enabled that (I think: I set TARGET=armLINUX in freebl/mpi), this brought the times down to 10.27/17.64ms and 10.28/17.61ms (sum: 20.55/35.25ms) Finally, I wrote a reimplementation in plain C, which didn't go as well as I had hoped, but the times are 1.24/2.09ms and 3.21/5.49ms (sum: 4.45/7.58ms). This means that I didn't quite get my 10x improvement on the original times that I expected."
Assignee | ||
Comment 1•8 years ago
|
||
Note that the new version may be slower on 64-bit systems because it cannot use the large mults on those processors. However, it is constant time.
Updated•8 years ago
|
Assignee: nobody → agl
Status: NEW → ASSIGNED
Priority: -- → P1
Target Milestone: --- → 3.14.3
Comment 2•8 years ago
|
||
I created this NSS patch from the previous two attachments. I only made some minor edits such as removing unnecessary standard library headers from ecp_256_32.c and adding #ifdef IS_LITTLE_ENDIAN to the new code in ecp_256.c. The corresponding makefile changes for determining when to use ecp_256_32.c (i.e., detecting little-endian CPUs) still need to be written. Adam, should we use ecp_256_32.c in 64-bit builds?
Attachment #702526 -
Attachment is obsolete: true
Attachment #702527 -
Attachment is obsolete: true
Assignee | ||
Comment 3•8 years ago
|
||
> Adam, should we use ecp_256_32.c in 64-bit builds?
The ecp_256_32 code is nicely constant time, so I'd be tempted to say yes. It's tough to beat the 64x64 full-mult in 64-bit mode however so it may be slower. If I get a chance, I'll try to benchmark the different on an amd64 box and get numbers.
If anyone is suffering from slow P-256 performance on amd64, I could port the work that I did for OpenSSL to NSS, but it only really applies to servers doing lots of ECDHE or ECDSA.
Comment 4•8 years ago
|
||
This new patch does not use ecp_256.c unless NSS_ECC_MORE_THAN_SUITE_B is set. I can't make ECGroup_consGFp_mont work with the point multiplication functions defined in ecp_256_32.c, so I am using ECGroup_consGFp with ecp_256_32.c. The remaining issue is to figure out the makefile changes that use ecp_256_32.c only for little-endian CPUs. In the worst case I can use a white list (arm, x86, and x86_64).
Attachment #706155 -
Attachment is obsolete: true
Comment 5•8 years ago
|
||
(In reply to Wan-Teh Chang from comment #4) > This new patch does not use ecp_256.c unless NSS_ECC_MORE_THAN_SUITE_B > is set. Just to be clear: does this mean that the performance will be better with NSS_ECC_MORE_THAN_SUITE_B defined or without NSS_ECC_MORE_THAN_SUITE_B defined?
Assignee | ||
Comment 6•8 years ago
|
||
I tested this using `make tests` in freebl/ecl. Prior to that, I had to build the standalone libmpi.a and made the following changes to the freebl/mpi/target.mk to try and enable the amd64 assembly and give the existing code a fighting chance: +AS_OBJS = mpi_amd64.o mpi_amd64_gas.o +MP_CONFIG = -DMP_ASSEMBLY_MULTIPLY -DMPI_AMD64 +MP_CONFIG += -DMP_IS_LITTLE_ENDIAN +CFLAGS = -O3 -I. -DMP_API_COMPATIBLE -DMP_IOFUNC $(MP_CONFIG) +MPICMN += $(MP_CONFIG) + +mpi_amd64_asm.o: mpi_amd64_gas.s + gcc -o mpi_amd64_asm.o -D_ASM -c mpi_amd64_gas.s In freebl/ecl/tests/ecp_test.c I upped the test iterations to 1000 and commented out the code for everything except P-256. The test machine has TurboBoost and Hyperthreading disabled. Unpatched, NSS 3.14 resulted in this: MP_CHECKOK (ECPoint_mul(group, &n, NULL, NULL, &rx, &ry)) k: 1000, t: 0.71 sec MP_CHECKOK (ECPoint_mul(group, &n, &gx, &gy, &rx, &ry)) k: 1000, t: 0.71 sec MP_CHECKOK (ECPoints_mul (group, &n, &n, &gx, &gy, &rx, &ry)) k: 1000, t: 1.39 sec With the P-256 v2 patch, attached: MP_CHECKOK (ECPoint_mul(group, &n, NULL, NULL, &rx, &ry)) k: 1000, t: 0.17 sec MP_CHECKOK (ECPoint_mul(group, &n, &gx, &gy, &rx, &ry)) k: 1000, t: 0.49 sec MP_CHECKOK (ECPoints_mul (group, &n, &n, &gx, &gy, &rx, &ry)) k: 1000, t: 0.64 sec With v2 and NSS_ECC_MORE_THAN_SUITE_B: P_CHECKOK (ECPoint_mul(group, &n, NULL, NULL, &rx, &ry)) k: 1000, t: 0.19 sec MP_CHECKOK (ECPoint_mul(group, &n, &gx, &gy, &rx, &ry)) k: 1000, t: 0.49 sec MP_CHECKOK (ECPoints_mul (group, &n, &n, &gx, &gy, &rx, &ry)) k: 1000, t: 0.63 sec I conclude that we should enable this P-256 code for all platforms, but that NSS_ECC_MORE_THAN_SUITE_B hardly matters.
Comment 7•8 years ago
|
||
Comment on attachment 706698 [details] [diff] [review] Patch by Adam Langley, v2 Review of attachment 706698 [details] [diff] [review]: ----------------------------------------------------------------- ::: mozilla/security/nss/lib/freebl/ecl/ecp_256_32.c @@ +1464,5 @@ > + if (name == ECCurve_NIST_P256) { > + group->base_point_mul = &ec_GFp_nistp256_base_point_mul; > + group->point_mul = &ec_GFp_nistp256_point_mul; > + group->points_mul = &ec_GFp_nistp256_points_mul_vartime; > + } Adam, could you measure the performance with the P-256 v2 patch applied but with this function being a no-op (simply returning MP_OKAY) and with NSS_ECC_MORE_THAN_SUITE_B undefined? This will allow us to compare plain ECGroup_consGFp (without additional method overrides) with ECGroup_consGFp_mont. Thanks. Rationale: on big-endian processors or processors that don't right-shift a signed integer in the way the NON_ZERO_TO_ALL_ONES macro requires, it will simplify the patch if we can just use plain ECGroup_consGFp for curve P-256.
Comment 8•8 years ago
|
||
This patch is ready for review. There are three potential concerns with this patch. I address them below. 1. Use of ECGroup_consGFp() when NSS_ECC_MORE_THAN_SUITE_B is *not* defined. I inspected the differences between ECGroup_consGFp_mont() (which we are currently using when NSS_ECC_MORE_THAN_SUITE_B is *not* defined) and ECGroup_consGFp(). They only differ in three methods: meth->field_mul, meth->field_sqr, and meth->field_div. The implementations of those three methods in ECGroup_consGFp() do generic bignum arithmetic: http://mxr.mozilla.org/mozilla-central/ident?i=ec_GFp_mul http://mxr.mozilla.org/mozilla-central/ident?i=ec_GFp_sqr http://mxr.mozilla.org/mozilla-central/ident?i=ec_GFp_div 2. The ecp_256_32.c code requires a little-endian processor. I figured out a simple solution. I define ec_group_set_gfp256_32() as a simple no-op if IS_LITTLE_ENDIAN is not defined. Please search for "IS_LITTLE_ENDIAN" in ecp_256_32.c to see my solution. 3. The ecp_256_32.c code requires right-shift of a negative signed integer does an arithmetic shift. The C Standard says the resulting value of right-shifting a negative signed integer is implementation-defined. Both GCC and Visual C++ document they do an arithmetic shift in that case: http://gcc.gnu.org/onlinedocs/gcc-4.3.3/gcc/Integers-implementation.html#Integers-implementation http://msdn.microsoft.com/en-us/library/336xbhcz.aspx So it is safe to assume all the compilers we are using (GCC, Clang, and Visual C++) behave this way. If necessary I can add a compile-time compiler type check to ensure this.
Attachment #706698 -
Attachment is obsolete: true
Attachment #707750 -
Flags: superreview?(rrelyea)
Attachment #707750 -
Flags: review?(bsmith)
Attachment #707750 -
Flags: feedback?(agl)
Assignee | ||
Comment 9•8 years ago
|
||
(In reply to Wan-Teh Chang from comment #7) > Adam, could you measure the performance with the P-256 v2 patch applied > but with this function being a no-op (simply returning MP_OKAY) and > with NSS_ECC_MORE_THAN_SUITE_B undefined? It is slightly slower: MP_CHECKOK (ECPoint_mul(group, &n, NULL, NULL, &rx, &ry)) k: 1000, t: 0.78 sec MP_CHECKOK (ECPoint_mul(group, &n, &gx, &gy, &rx, &ry)) k: 1000, t: 0.79 sec MP_CHECKOK (ECPoints_mul (group, &n, &n, &gx, &gy, &rx, &ry)) k: 1000, t: 1.43 sec
Assignee | ||
Updated•8 years ago
|
Attachment #707750 -
Flags: feedback?(agl) → feedback+
Comment 10•8 years ago
|
||
agl: thanks for the measurement. I think that kind of slowdown for big-endian processors is an acceptable trade-off for code simplicity. Just curious: how much work would it be to port the code to big-endian processors? In this comment I summarize my new analysis of your code and suggest two optional future changes. 1. I did some code tracing last night. The current patch does not seem to be an ideal way to integrate your P-256 implementation with lib/freebl/ecl. For example, I believe the to_montgomery and from_montgomery functions should become group->meth->field_enc and group->meth->field_dec, and the group->meth->field_xxx field arithmetic functions could call your functions. But I think it is reasonable to only intercept the calls to the group->point_xxx point arithmetic functions. In my code tracing, the only non-intercepted point arithmetic function used during TLS is group->validate_point. (NSS validates the peer's public key before doing ECDH.) If we also intercept group->validate_point, then it will be a better (more complete) integration with lib/freebl/ecl. 2. The field arithmetic functions used during TLS are (with the current patch): * group->meth->field_mod: used by to_montgomery and from_montgomery only * group->meth->{field_sqr,field_mul,field_add,field_sub}: used by group->validate_point only If we also intercept group->validate_point, then group->meth->field_mod becomes the only field arithmetic function used. group->meth->field_mod can be easily replaced by mp_mod if we don't care about optimizing it. Does ecp_256_32.c not contain a P-256 specific field_mod function?
Comment 11•8 years ago
|
||
This patch seems to work. I tested it on Mac OS X PowerPC. I identified the places that need fixing by searching for "char", "int8", "u8", "endian", "memcpy", and "*)" (the end of a pointer cast). I found only two places. I also looked into implementing group->validate_point, but I can't find the a and b coefficients of P-256 in ecp_256_32.c. It seems that they are not needed for point multiplication. Looking at ec_GFp_validate_point closer, I think the only expensive step is ECPoint_mul, which is already using the optimized code.
Attachment #708457 -
Flags: review?(agl)
Assignee | ||
Comment 12•8 years ago
|
||
(In reply to Wan-Teh Chang from comment #10) > 1. I did some code tracing last night. The current patch does not > seem to be an ideal way to integrate your P-256 implementation > with lib/freebl/ecl. For example, I believe the to_montgomery > and from_montgomery functions should become group->meth->field_enc > and group->meth->field_dec, and the group->meth->field_xxx > field arithmetic functions could call your functions. But > I think it is reasonable to only intercept the calls to the > group->point_xxx point arithmetic functions. field_enc and field_dec could point to the field_enc/dec functions I believe. The only reason that I did't do that was because point multiplication is the interface that I'm used to implementing. I've no strong feelings either way about doing this. However, I don't believe that the other functions could point to ecp_256_32.c functions because I didn't implement contraction: the reduction is only partial, which works for the purposes of the intermediate calculations in the group, but a full reduction is needed at the end and the code depends on the standard bigint functions for that. > If we also intercept group->validate_point, then group->meth->field_mod > becomes the only field arithmetic function used. group->meth->field_mod > can be easily replaced by mp_mod if we don't care about optimizing it. > Does ecp_256_32.c not contain a P-256 specific field_mod function? I don't think that validate_point is performance critical enough to bother with. As noted, the reduction function in ecp_256_32.c is not suitable to use as field_mod as it's only partial.
Assignee | ||
Comment 13•8 years ago
|
||
Comment on attachment 708457 [details] [diff] [review] Add big-endian support LGTM, thanks! Without a big-endian machine to hand I didn't believe that I'd get it right.
Attachment #708457 -
Flags: review?(agl) → review+
Comment 14•8 years ago
|
||
I updated the patch for the current NSS tree in the hg repository, and use the latest ecp_256_32.c file that has been tested in Chromium for a while. I had to remove #include "secport.h" and a PORT_Assert call from ecp_256_32.c because it is inconvenient to call PORT_Assert in the lib/freebl/ecl directory -- we'd need to include "stubs.h", which is in the parent directory lib/freebl. Patch pushed to the NSS hg repository (NSS 3.15): https://hg.mozilla.org/projects/nss/rev/49984bef0706
Attachment #707750 -
Attachment is obsolete: true
Attachment #708457 -
Attachment is obsolete: true
Attachment #707750 -
Flags: superreview?(rrelyea)
Attachment #707750 -
Flags: review?(bsmith)
Attachment #730377 -
Flags: review?(rrelyea)
Attachment #730377 -
Flags: checked-in+
Updated•8 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: 3.14.3 → 3.15
Comment 15•8 years ago
|
||
ECC tests failed on Linux PowerPC. I will look into that. I tested the code on Mac OS X PowerPC before.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 16•8 years ago
|
||
Comment on attachment 730377 [details] [diff] [review] Patch by Adam Langley, v4 Review of attachment 730377 [details] [diff] [review]: ----------------------------------------------------------------- ::: lib/freebl/ecl/ecp_256_32.c @@ +1264,5 @@ > + > +#ifdef MP_USE_UINT_DIGIT > +#define BYTESWAP_MP_DIGIT_TO_LE(x) BYTESWAP32(BYTESWAP32(x)) > +#else > +#define BYTESWAP_MP_DIGIT_TO_LE(x) BYTESWAP64(BYTESWAP64(x)) The double BYTESWAP32 and BYTESWAP64 look wrong. I wonder if I did that to test the big-endian code paths on a little-endian processor, and forgot to change it back.
Comment 17•8 years ago
|
||
https://hg.mozilla.org/projects/nss/rev/b8c1ffd325c2
Attachment #730521 -
Flags: review?(agl)
Attachment #730521 -
Flags: checked-in+
Assignee | ||
Updated•8 years ago
|
Attachment #730521 -
Flags: review?(agl) → review+
Comment 18•8 years ago
|
||
Removing the extra BYTESWAP32 and BYTESWAP64 fixed the test failures on the Linux PowerPC build bots.
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•