Closed Bug 831006 Opened 13 years ago Closed 12 years ago

P-256 performance on ARM is poor

Tracking

(Not tracked)

Status:

RESOLVED FIXED

Milestone:

3.15

People

(Reporter: agl, Assigned: agl)

Details

Attachments

(2 files, 6 obsolete files)

32-bit, P-256 implementation. 13 years ago Adam Langley 48.07 KB, text/x-csrc		Details
patch to wire up the new implementation. 13 years ago Adam Langley 5.93 KB, patch		Details \| Diff \| Splinter Review
Patch by Adam Langley 13 years ago Wan-Teh Chang 55.12 KB, patch		Details \| Diff \| Splinter Review
Patch by Adam Langley, v2 13 years ago Wan-Teh Chang 54.56 KB, patch		Details \| Diff \| Splinter Review
Patch by Adam Langley, v3 13 years ago Wan-Teh Chang 54.65 KB, patch	agl : feedback+	Details \| Diff \| Splinter Review
Add big-endian support 13 years ago Wan-Teh Chang 2.46 KB, patch	agl : review+	Details \| Diff \| Splinter Review
Patch by Adam Langley, v4 12 years ago Wan-Teh Chang 54.13 KB, patch	wtc : review? rrelyea wtc : checked-in+	Details \| Diff \| Splinter Review
Remove the extra BYTESWAP32 and BYTESWAP64 for big-endian processors 12 years ago Wan-Teh Chang 879 bytes, patch	agl : review+ wtc : checked-in+	Details \| Diff \| Splinter Review

Adam Langley

Assignee

Description

•

13 years ago

Attached file 32-bit, P-256 implementation. (obsolete) — Details

From a previous email, where I'm doing testing on a Galaxy Nexus: "Firstly I tweaked the Makefile to change -O to -O3. I also found significant differences in speed between when the screen is on and when it's off. I assume in the latter case that the CPU frequency is reduced to save power. Below, I report times in the format "16/20ms" where the time with the screen on is 16ms and the time when the screen is off is 20ms. The original code is 20.59/35.29ms and 20.62/35.38ms for a base-point scalar mult and arbitrary point scalar mult respectively. ECDHE involves one of each and the sum (41.21/70.67ms) is pretty close to the estimate that wtc reported (~80ms) for running on some ARM device, at least with the screen off. I found that NSS's optimised NIST routines are disabled in the standard build. By enabling those I get 13.64/23.09ms and 13.59/23.11ms (sum: 27.23/46ms) wtc also reported that he found ARM assembly that was disabled by default and I enabled that (I think: I set TARGET=armLINUX in freebl/mpi), this brought the times down to 10.27/17.64ms and 10.28/17.61ms (sum: 20.55/35.25ms) Finally, I wrote a reimplementation in plain C, which didn't go as well as I had hoped, but the times are 1.24/2.09ms and 3.21/5.49ms (sum: 4.45/7.58ms). This means that I didn't quite get my 10x improvement on the original times that I expected."

Adam Langley

Assignee

Comment 1

•

13 years ago

Attached patch patch to wire up the new implementation. (obsolete) — Details — Splinter Review

Note that the new version may be slower on 64-bit systems because it cannot use the large mults on those processors. However, it is constant time.

Wan-Teh Chang

Updated

•

13 years ago

Assignee: nobody → agl

Status: NEW → ASSIGNED

Priority: -- → P1

Target Milestone: --- → 3.14.3

Wan-Teh Chang

Comment 2

•

13 years ago

Attached patch Patch by Adam Langley (obsolete) — Details — Splinter Review

I created this NSS patch from the previous two attachments. I only made some minor edits such as removing unnecessary standard library headers from ecp_256_32.c and adding #ifdef IS_LITTLE_ENDIAN to the new code in ecp_256.c. The corresponding makefile changes for determining when to use ecp_256_32.c (i.e., detecting little-endian CPUs) still need to be written. Adam, should we use ecp_256_32.c in 64-bit builds?

Attachment #702526 - Attachment is obsolete: true

Attachment #702527 - Attachment is obsolete: true

Adam Langley

Assignee

Comment 3

•

13 years ago

> Adam, should we use ecp_256_32.c in 64-bit builds? The ecp_256_32 code is nicely constant time, so I'd be tempted to say yes. It's tough to beat the 64x64 full-mult in 64-bit mode however so it may be slower. If I get a chance, I'll try to benchmark the different on an amd64 box and get numbers. If anyone is suffering from slow P-256 performance on amd64, I could port the work that I did for OpenSSL to NSS, but it only really applies to servers doing lots of ECDHE or ECDSA.

Wan-Teh Chang

Comment 4

•

13 years ago

Attached patch Patch by Adam Langley, v2 (obsolete) — Details — Splinter Review

This new patch does not use ecp_256.c unless NSS_ECC_MORE_THAN_SUITE_B is set. I can't make ECGroup_consGFp_mont work with the point multiplication functions defined in ecp_256_32.c, so I am using ECGroup_consGFp with ecp_256_32.c. The remaining issue is to figure out the makefile changes that use ecp_256_32.c only for little-endian CPUs. In the worst case I can use a white list (arm, x86, and x86_64).

Attachment #706155 - Attachment is obsolete: true

Brian Smith (:briansmith, :bsmith, use NEEDINFO?)

Comment 5

•

13 years ago

(In reply to Wan-Teh Chang from comment #4) > This new patch does not use ecp_256.c unless NSS_ECC_MORE_THAN_SUITE_B > is set. Just to be clear: does this mean that the performance will be better with NSS_ECC_MORE_THAN_SUITE_B defined or without NSS_ECC_MORE_THAN_SUITE_B defined?

Adam Langley

Assignee

Comment 6

•

13 years ago

I tested this using `make tests` in freebl/ecl. Prior to that, I had to build the standalone libmpi.a and made the following changes to the freebl/mpi/target.mk to try and enable the amd64 assembly and give the existing code a fighting chance: +AS_OBJS = mpi_amd64.o mpi_amd64_gas.o +MP_CONFIG = -DMP_ASSEMBLY_MULTIPLY -DMPI_AMD64 +MP_CONFIG += -DMP_IS_LITTLE_ENDIAN +CFLAGS = -O3 -I. -DMP_API_COMPATIBLE -DMP_IOFUNC $(MP_CONFIG) +MPICMN += $(MP_CONFIG) + +mpi_amd64_asm.o: mpi_amd64_gas.s + gcc -o mpi_amd64_asm.o -D_ASM -c mpi_amd64_gas.s In freebl/ecl/tests/ecp_test.c I upped the test iterations to 1000 and commented out the code for everything except P-256. The test machine has TurboBoost and Hyperthreading disabled. Unpatched, NSS 3.14 resulted in this: MP_CHECKOK (ECPoint_mul(group, &n, NULL, NULL, &rx, &ry)) k: 1000, t: 0.71 sec MP_CHECKOK (ECPoint_mul(group, &n, &gx, &gy, &rx, &ry)) k: 1000, t: 0.71 sec MP_CHECKOK (ECPoints_mul (group, &n, &n, &gx, &gy, &rx, &ry)) k: 1000, t: 1.39 sec With the P-256 v2 patch, attached: MP_CHECKOK (ECPoint_mul(group, &n, NULL, NULL, &rx, &ry)) k: 1000, t: 0.17 sec MP_CHECKOK (ECPoint_mul(group, &n, &gx, &gy, &rx, &ry)) k: 1000, t: 0.49 sec MP_CHECKOK (ECPoints_mul (group, &n, &n, &gx, &gy, &rx, &ry)) k: 1000, t: 0.64 sec With v2 and NSS_ECC_MORE_THAN_SUITE_B: P_CHECKOK (ECPoint_mul(group, &n, NULL, NULL, &rx, &ry)) k: 1000, t: 0.19 sec MP_CHECKOK (ECPoint_mul(group, &n, &gx, &gy, &rx, &ry)) k: 1000, t: 0.49 sec MP_CHECKOK (ECPoints_mul (group, &n, &n, &gx, &gy, &rx, &ry)) k: 1000, t: 0.63 sec I conclude that we should enable this P-256 code for all platforms, but that NSS_ECC_MORE_THAN_SUITE_B hardly matters.

Wan-Teh Chang

Comment 7

•

13 years ago

Comment on attachment 706698 [details] [diff] [review] Patch by Adam Langley, v2 Review of attachment 706698 [details] [diff] [review]: ----------------------------------------------------------------- ::: mozilla/security/nss/lib/freebl/ecl/ecp_256_32.c @@ +1464,5 @@ > + if (name == ECCurve_NIST_P256) { > + group->base_point_mul = &ec_GFp_nistp256_base_point_mul; > + group->point_mul = &ec_GFp_nistp256_point_mul; > + group->points_mul = &ec_GFp_nistp256_points_mul_vartime; > + } Adam, could you measure the performance with the P-256 v2 patch applied but with this function being a no-op (simply returning MP_OKAY) and with NSS_ECC_MORE_THAN_SUITE_B undefined? This will allow us to compare plain ECGroup_consGFp (without additional method overrides) with ECGroup_consGFp_mont. Thanks. Rationale: on big-endian processors or processors that don't right-shift a signed integer in the way the NON_ZERO_TO_ALL_ONES macro requires, it will simplify the patch if we can just use plain ECGroup_consGFp for curve P-256.

Wan-Teh Chang

Comment 8

•

13 years ago

Attached patch Patch by Adam Langley, v3 (obsolete) — Details — Splinter Review

This patch is ready for review. There are three potential concerns with this patch. I address them below. 1. Use of ECGroup_consGFp() when NSS_ECC_MORE_THAN_SUITE_B is *not* defined. I inspected the differences between ECGroup_consGFp_mont() (which we are currently using when NSS_ECC_MORE_THAN_SUITE_B is *not* defined) and ECGroup_consGFp(). They only differ in three methods: meth->field_mul, meth->field_sqr, and meth->field_div. The implementations of those three methods in ECGroup_consGFp() do generic bignum arithmetic: http://mxr.mozilla.org/mozilla-central/ident?i=ec_GFp_mul http://mxr.mozilla.org/mozilla-central/ident?i=ec_GFp_sqr http://mxr.mozilla.org/mozilla-central/ident?i=ec_GFp_div 2. The ecp_256_32.c code requires a little-endian processor. I figured out a simple solution. I define ec_group_set_gfp256_32() as a simple no-op if IS_LITTLE_ENDIAN is not defined. Please search for "IS_LITTLE_ENDIAN" in ecp_256_32.c to see my solution. 3. The ecp_256_32.c code requires right-shift of a negative signed integer does an arithmetic shift. The C Standard says the resulting value of right-shifting a negative signed integer is implementation-defined. Both GCC and Visual C++ document they do an arithmetic shift in that case: http://gcc.gnu.org/onlinedocs/gcc-4.3.3/gcc/Integers-implementation.html#Integers-implementation http://msdn.microsoft.com/en-us/library/336xbhcz.aspx So it is safe to assume all the compilers we are using (GCC, Clang, and Visual C++) behave this way. If necessary I can add a compile-time compiler type check to ensure this.

Attachment #706698 - Attachment is obsolete: true

Attachment #707750 - Flags: superreview?(rrelyea)

Attachment #707750 - Flags: review?(bsmith)

Attachment #707750 - Flags: feedback?(agl)

Adam Langley

Assignee

Comment 9

•

13 years ago

(In reply to Wan-Teh Chang from comment #7) > Adam, could you measure the performance with the P-256 v2 patch applied > but with this function being a no-op (simply returning MP_OKAY) and > with NSS_ECC_MORE_THAN_SUITE_B undefined? It is slightly slower: MP_CHECKOK (ECPoint_mul(group, &n, NULL, NULL, &rx, &ry)) k: 1000, t: 0.78 sec MP_CHECKOK (ECPoint_mul(group, &n, &gx, &gy, &rx, &ry)) k: 1000, t: 0.79 sec MP_CHECKOK (ECPoints_mul (group, &n, &n, &gx, &gy, &rx, &ry)) k: 1000, t: 1.43 sec

Adam Langley

Assignee

Updated

•

13 years ago

Attachment #707750 - Flags: feedback?(agl) → feedback+

Wan-Teh Chang

Comment 10

•

13 years ago

agl: thanks for the measurement. I think that kind of slowdown for big-endian processors is an acceptable trade-off for code simplicity. Just curious: how much work would it be to port the code to big-endian processors? In this comment I summarize my new analysis of your code and suggest two optional future changes. 1. I did some code tracing last night. The current patch does not seem to be an ideal way to integrate your P-256 implementation with lib/freebl/ecl. For example, I believe the to_montgomery and from_montgomery functions should become group->meth->field_enc and group->meth->field_dec, and the group->meth->field_xxx field arithmetic functions could call your functions. But I think it is reasonable to only intercept the calls to the group->point_xxx point arithmetic functions. In my code tracing, the only non-intercepted point arithmetic function used during TLS is group->validate_point. (NSS validates the peer's public key before doing ECDH.) If we also intercept group->validate_point, then it will be a better (more complete) integration with lib/freebl/ecl. 2. The field arithmetic functions used during TLS are (with the current patch): * group->meth->field_mod: used by to_montgomery and from_montgomery only * group->meth->{field_sqr,field_mul,field_add,field_sub}: used by group->validate_point only If we also intercept group->validate_point, then group->meth->field_mod becomes the only field arithmetic function used. group->meth->field_mod can be easily replaced by mp_mod if we don't care about optimizing it. Does ecp_256_32.c not contain a P-256 specific field_mod function?

Wan-Teh Chang

Comment 11

•

13 years ago

Attached patch Add big-endian support (obsolete) — Details — Splinter Review

This patch seems to work. I tested it on Mac OS X PowerPC. I identified the places that need fixing by searching for "char", "int8", "u8", "endian", "memcpy", and "*)" (the end of a pointer cast). I found only two places. I also looked into implementing group->validate_point, but I can't find the a and b coefficients of P-256 in ecp_256_32.c. It seems that they are not needed for point multiplication. Looking at ec_GFp_validate_point closer, I think the only expensive step is ECPoint_mul, which is already using the optimized code.

Attachment #708457 - Flags: review?(agl)

Adam Langley

Assignee

Comment 12

•

13 years ago

(In reply to Wan-Teh Chang from comment #10) > 1. I did some code tracing last night. The current patch does not > seem to be an ideal way to integrate your P-256 implementation > with lib/freebl/ecl. For example, I believe the to_montgomery > and from_montgomery functions should become group->meth->field_enc > and group->meth->field_dec, and the group->meth->field_xxx > field arithmetic functions could call your functions. But > I think it is reasonable to only intercept the calls to the > group->point_xxx point arithmetic functions. field_enc and field_dec could point to the field_enc/dec functions I believe. The only reason that I did't do that was because point multiplication is the interface that I'm used to implementing. I've no strong feelings either way about doing this. However, I don't believe that the other functions could point to ecp_256_32.c functions because I didn't implement contraction: the reduction is only partial, which works for the purposes of the intermediate calculations in the group, but a full reduction is needed at the end and the code depends on the standard bigint functions for that. > If we also intercept group->validate_point, then group->meth->field_mod > becomes the only field arithmetic function used. group->meth->field_mod > can be easily replaced by mp_mod if we don't care about optimizing it. > Does ecp_256_32.c not contain a P-256 specific field_mod function? I don't think that validate_point is performance critical enough to bother with. As noted, the reduction function in ecp_256_32.c is not suitable to use as field_mod as it's only partial.

Adam Langley

Assignee

Comment 13

•

13 years ago

Comment on attachment 708457 [details] [diff] [review] Add big-endian support LGTM, thanks! Without a big-endian machine to hand I didn't believe that I'd get it right.

Attachment #708457 - Flags: review?(agl) → review+

Wan-Teh Chang

Comment 14

•

12 years ago

Attached patch Patch by Adam Langley, v4 — Details — Splinter Review

I updated the patch for the current NSS tree in the hg repository, and use the latest ecp_256_32.c file that has been tested in Chromium for a while. I had to remove #include "secport.h" and a PORT_Assert call from ecp_256_32.c because it is inconvenient to call PORT_Assert in the lib/freebl/ecl directory -- we'd need to include "stubs.h", which is in the parent directory lib/freebl. Patch pushed to the NSS hg repository (NSS 3.15): https://hg.mozilla.org/projects/nss/rev/49984bef0706

Attachment #707750 - Attachment is obsolete: true

Attachment #708457 - Attachment is obsolete: true

Attachment #707750 - Flags: superreview?(rrelyea)

Attachment #707750 - Flags: review?(bsmith)

Attachment #730377 - Flags: review?(rrelyea)

Attachment #730377 - Flags: checked-in+

Wan-Teh Chang

Updated

•

12 years ago

Status: ASSIGNED → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Target Milestone: 3.14.3 → 3.15

Wan-Teh Chang

Comment 15

•

12 years ago

ECC tests failed on Linux PowerPC. I will look into that. I tested the code on Mac OS X PowerPC before.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Wan-Teh Chang

Comment 16

•

12 years ago

Comment on attachment 730377 [details] [diff] [review] Patch by Adam Langley, v4 Review of attachment 730377 [details] [diff] [review]: ----------------------------------------------------------------- ::: lib/freebl/ecl/ecp_256_32.c @@ +1264,5 @@ > + > +#ifdef MP_USE_UINT_DIGIT > +#define BYTESWAP_MP_DIGIT_TO_LE(x) BYTESWAP32(BYTESWAP32(x)) > +#else > +#define BYTESWAP_MP_DIGIT_TO_LE(x) BYTESWAP64(BYTESWAP64(x)) The double BYTESWAP32 and BYTESWAP64 look wrong. I wonder if I did that to test the big-endian code paths on a little-endian processor, and forgot to change it back.

Wan-Teh Chang

Comment 17

•

12 years ago

Attached patch Remove the extra BYTESWAP32 and BYTESWAP64 for big-endian processors — Details — Splinter Review

https://hg.mozilla.org/projects/nss/rev/b8c1ffd325c2

Attachment #730521 - Flags: review?(agl)

Attachment #730521 - Flags: checked-in+

Adam Langley

Assignee

Updated

•

12 years ago

Attachment #730521 - Flags: review?(agl) → review+

Wan-Teh Chang

Comment 18

•

12 years ago

Removing the extra BYTESWAP32 and BYTESWAP64 fixed the test failures on the Linux PowerPC build bots.

Status: REOPENED → RESOLVED

Closed: 12 years ago → 12 years ago

Resolution: --- → FIXED

You need to log in before you can comment on or make changes to this bug.