The default bug view has changed. See this FAQ.

P-256 performance on ARM is poor

RESOLVED FIXED in 3.15

Status

NSS
Libraries
P1
enhancement
RESOLVED FIXED
4 years ago
4 years ago

People

(Reporter: Adam Langley, Assigned: Adam Langley)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments, 6 obsolete attachments)

(Assignee)

Description

4 years ago
Created attachment 702526 [details]
32-bit, P-256 implementation.

From a previous email, where I'm doing testing on a Galaxy Nexus:

"Firstly I tweaked the Makefile to change -O to -O3. I also found
significant differences in speed between when the screen is on and
when it's off. I assume in the latter case that the CPU frequency is
reduced to save power. Below, I report times in the format "16/20ms"
where the time with the screen on is 16ms and the time when the screen
is off is 20ms.

The original code is 20.59/35.29ms and 20.62/35.38ms for a base-point
scalar mult and arbitrary point scalar mult respectively. ECDHE
involves one of each and the sum (41.21/70.67ms) is pretty close to
the estimate that wtc reported (~80ms) for running on some ARM device,
at least with the screen off.

I found that NSS's optimised NIST routines are disabled in the
standard build. By enabling those I get 13.64/23.09ms and
13.59/23.11ms (sum: 27.23/46ms)

wtc also reported that he found ARM assembly that was disabled by
default and I enabled that (I think: I set TARGET=armLINUX in
freebl/mpi), this brought the times down to 10.27/17.64ms and
10.28/17.61ms (sum: 20.55/35.25ms)

Finally, I wrote a reimplementation in plain C, which didn't go as
well as I had hoped, but the times are 1.24/2.09ms and 3.21/5.49ms
(sum: 4.45/7.58ms). This means that I didn't quite get my 10x
improvement on the original times that I expected."
(Assignee)

Comment 1

4 years ago
Created attachment 702527 [details] [diff] [review]
patch to wire up the new implementation.

Note that the new version may be slower on 64-bit systems because it cannot use the large mults on those processors. However, it is constant time.

Updated

4 years ago
Assignee: nobody → agl
Status: NEW → ASSIGNED
Priority: -- → P1
Target Milestone: --- → 3.14.3

Comment 2

4 years ago
Created attachment 706155 [details] [diff] [review]
Patch by Adam Langley

I created this NSS patch from the previous two attachments. I only
made some minor edits such as removing unnecessary standard library
headers from ecp_256_32.c and adding #ifdef IS_LITTLE_ENDIAN to the
new code in ecp_256.c.

The corresponding makefile changes for determining when to use
ecp_256_32.c (i.e., detecting little-endian CPUs) still need to be
written.

Adam, should we use ecp_256_32.c in 64-bit builds?
Attachment #702526 - Attachment is obsolete: true
Attachment #702527 - Attachment is obsolete: true
(Assignee)

Comment 3

4 years ago
> Adam, should we use ecp_256_32.c in 64-bit builds?

The ecp_256_32 code is nicely constant time, so I'd be tempted to say yes. It's tough to beat the 64x64 full-mult in 64-bit mode however so it may be slower. If I get a chance, I'll try to benchmark the different on an amd64 box and get numbers.

If anyone is suffering from slow P-256 performance on amd64, I could port the work that I did for OpenSSL to NSS, but it only really applies to servers doing lots of ECDHE or ECDSA.

Comment 4

4 years ago
Created attachment 706698 [details] [diff] [review]
Patch by Adam Langley, v2

This new patch does not use ecp_256.c unless NSS_ECC_MORE_THAN_SUITE_B
is set.

I can't make ECGroup_consGFp_mont work with the point multiplication
functions defined in ecp_256_32.c, so I am using ECGroup_consGFp with
ecp_256_32.c.

The remaining issue is to figure out the makefile changes that
use ecp_256_32.c only for little-endian CPUs. In the worst case
I can use a white list (arm, x86, and x86_64).
Attachment #706155 - Attachment is obsolete: true
(In reply to Wan-Teh Chang from comment #4)
> This new patch does not use ecp_256.c unless NSS_ECC_MORE_THAN_SUITE_B
> is set.

Just to be clear: does this mean that the performance will be better with NSS_ECC_MORE_THAN_SUITE_B defined or without NSS_ECC_MORE_THAN_SUITE_B defined?
(Assignee)

Comment 6

4 years ago
I tested this using `make tests` in freebl/ecl. Prior to that, I had to build the standalone libmpi.a and made the following changes to the freebl/mpi/target.mk to try and enable the amd64 assembly and give the existing code a fighting chance:

+AS_OBJS = mpi_amd64.o mpi_amd64_gas.o
+MP_CONFIG = -DMP_ASSEMBLY_MULTIPLY -DMPI_AMD64
+MP_CONFIG += -DMP_IS_LITTLE_ENDIAN
+CFLAGS = -O3 -I. -DMP_API_COMPATIBLE -DMP_IOFUNC $(MP_CONFIG)
+MPICMN += $(MP_CONFIG)
+
+mpi_amd64_asm.o: mpi_amd64_gas.s
+       gcc -o mpi_amd64_asm.o -D_ASM -c mpi_amd64_gas.s

In freebl/ecl/tests/ecp_test.c I upped the test iterations to 1000 and commented out the code for everything except P-256.

The test machine has TurboBoost and Hyperthreading disabled.

Unpatched, NSS 3.14 resulted in this:

MP_CHECKOK (ECPoint_mul(group, &n, NULL, NULL, &rx, &ry)) k:   1000, t:   0.71 sec 
MP_CHECKOK (ECPoint_mul(group, &n, &gx, &gy, &rx, &ry)) k:   1000, t:   0.71 sec
MP_CHECKOK (ECPoints_mul (group, &n, &n, &gx, &gy, &rx, &ry)) k:   1000, t:   1.39 sec

With the P-256 v2 patch, attached:

MP_CHECKOK (ECPoint_mul(group, &n, NULL, NULL, &rx, &ry)) k:   1000, t:   0.17 sec 
MP_CHECKOK (ECPoint_mul(group, &n, &gx, &gy, &rx, &ry)) k:   1000, t:   0.49 sec
MP_CHECKOK (ECPoints_mul (group, &n, &n, &gx, &gy, &rx, &ry)) k:   1000, t:   0.64 sec

With v2 and NSS_ECC_MORE_THAN_SUITE_B:

P_CHECKOK (ECPoint_mul(group, &n, NULL, NULL, &rx, &ry)) k:   1000, t:   0.19 sec 
MP_CHECKOK (ECPoint_mul(group, &n, &gx, &gy, &rx, &ry)) k:   1000, t:   0.49 sec
MP_CHECKOK (ECPoints_mul (group, &n, &n, &gx, &gy, &rx, &ry)) k:   1000, t:   0.63 sec

I conclude that we should enable this P-256 code for all platforms, but that NSS_ECC_MORE_THAN_SUITE_B hardly matters.

Comment 7

4 years ago
Comment on attachment 706698 [details] [diff] [review]
Patch by Adam Langley, v2

Review of attachment 706698 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozilla/security/nss/lib/freebl/ecl/ecp_256_32.c
@@ +1464,5 @@
> +    if (name == ECCurve_NIST_P256) {
> +        group->base_point_mul = &ec_GFp_nistp256_base_point_mul;
> +        group->point_mul = &ec_GFp_nistp256_point_mul;
> +        group->points_mul = &ec_GFp_nistp256_points_mul_vartime;
> +    }

Adam, could you measure the performance with the P-256 v2 patch applied
but with this function being a no-op (simply returning MP_OKAY) and
with NSS_ECC_MORE_THAN_SUITE_B undefined?

This will allow us to compare plain ECGroup_consGFp (without additional
method overrides) with ECGroup_consGFp_mont. Thanks.

Rationale: on big-endian processors or processors that don't right-shift
a signed integer in the way the NON_ZERO_TO_ALL_ONES macro requires,
it will simplify the patch if we can just use plain ECGroup_consGFp
for curve P-256.

Comment 8

4 years ago
Created attachment 707750 [details] [diff] [review]
Patch by Adam Langley, v3

This patch is ready for review.

There are three potential concerns with this patch. I address them below.

1. Use of ECGroup_consGFp() when NSS_ECC_MORE_THAN_SUITE_B is *not* defined.

I inspected the differences between ECGroup_consGFp_mont() (which we are
currently using when NSS_ECC_MORE_THAN_SUITE_B is *not* defined) and
ECGroup_consGFp(). They only differ in three methods: meth->field_mul,
meth->field_sqr, and meth->field_div. The implementations of those
three methods in ECGroup_consGFp() do generic bignum arithmetic:
  http://mxr.mozilla.org/mozilla-central/ident?i=ec_GFp_mul
  http://mxr.mozilla.org/mozilla-central/ident?i=ec_GFp_sqr
  http://mxr.mozilla.org/mozilla-central/ident?i=ec_GFp_div

2. The ecp_256_32.c code requires a little-endian processor.

I figured out a simple solution. I define ec_group_set_gfp256_32()
as a simple no-op if IS_LITTLE_ENDIAN is not defined. Please
search for "IS_LITTLE_ENDIAN" in ecp_256_32.c to see my solution.

3. The ecp_256_32.c code requires right-shift of a negative signed
integer does an arithmetic shift.

The C Standard says the resulting value of right-shifting a negative
signed integer is implementation-defined. Both GCC and Visual C++
document they do an arithmetic shift in that case:

http://gcc.gnu.org/onlinedocs/gcc-4.3.3/gcc/Integers-implementation.html#Integers-implementation

http://msdn.microsoft.com/en-us/library/336xbhcz.aspx

So it is safe to assume all the compilers we are using (GCC, Clang,
and Visual C++) behave this way. If necessary I can add a compile-time
compiler type check to ensure this.
Attachment #706698 - Attachment is obsolete: true
Attachment #707750 - Flags: superreview?(rrelyea)
Attachment #707750 - Flags: review?(bsmith)
Attachment #707750 - Flags: feedback?(agl)
(Assignee)

Comment 9

4 years ago
(In reply to Wan-Teh Chang from comment #7)
> Adam, could you measure the performance with the P-256 v2 patch applied
> but with this function being a no-op (simply returning MP_OKAY) and
> with NSS_ECC_MORE_THAN_SUITE_B undefined?

It is slightly slower:

MP_CHECKOK (ECPoint_mul(group, &n, NULL, NULL, &rx, &ry)) k:   1000, t:   0.78 sec
MP_CHECKOK (ECPoint_mul(group, &n, &gx, &gy, &rx, &ry)) k:   1000, t:   0.79 sec
MP_CHECKOK (ECPoints_mul (group, &n, &n, &gx, &gy, &rx, &ry)) k:   1000, t:   1.43 sec
(Assignee)

Updated

4 years ago
Attachment #707750 - Flags: feedback?(agl) → feedback+

Comment 10

4 years ago
agl: thanks for the measurement. I think that kind of slowdown
for big-endian processors is an acceptable trade-off for code
simplicity. Just curious: how much work would it be to port the
code to big-endian processors?

In this comment I summarize my new analysis of your code and
suggest two optional future changes.

1. I did some code tracing last night. The current patch does not
seem to be an ideal way to integrate your P-256 implementation
with lib/freebl/ecl. For example, I believe the to_montgomery
and from_montgomery functions should become group->meth->field_enc
and group->meth->field_dec, and the group->meth->field_xxx
field arithmetic functions could call your functions. But
I think it is reasonable to only intercept the calls to the
group->point_xxx point arithmetic functions.

In my code tracing, the only non-intercepted point arithmetic
function used during TLS is group->validate_point. (NSS validates
the peer's public key before doing ECDH.) If we also intercept
group->validate_point, then it will be a better (more complete)
integration with lib/freebl/ecl.

2. The field arithmetic functions used during TLS are (with the
current patch):

* group->meth->field_mod: used by to_montgomery and from_montgomery
  only

* group->meth->{field_sqr,field_mul,field_add,field_sub}: used by
  group->validate_point only

If we also intercept group->validate_point, then group->meth->field_mod
becomes the only field arithmetic function used. group->meth->field_mod
can be easily replaced by mp_mod if we don't care about optimizing it.
Does ecp_256_32.c not contain a P-256 specific field_mod function?

Comment 11

4 years ago
Created attachment 708457 [details] [diff] [review]
Add big-endian support

This patch seems to work. I tested it on Mac OS X PowerPC.

I identified the places that need fixing by searching for
"char", "int8", "u8", "endian", "memcpy", and "*)" (the end
of a pointer cast). I found only two places.

I also looked into implementing group->validate_point, but
I can't find the a and b coefficients of P-256 in ecp_256_32.c.
It seems that they are not needed for point multiplication.

Looking at ec_GFp_validate_point closer, I think the only
expensive step is ECPoint_mul, which is already using the
optimized code.
Attachment #708457 - Flags: review?(agl)
(Assignee)

Comment 12

4 years ago
(In reply to Wan-Teh Chang from comment #10)
> 1. I did some code tracing last night. The current patch does not
> seem to be an ideal way to integrate your P-256 implementation
> with lib/freebl/ecl. For example, I believe the to_montgomery
> and from_montgomery functions should become group->meth->field_enc
> and group->meth->field_dec, and the group->meth->field_xxx
> field arithmetic functions could call your functions. But
> I think it is reasonable to only intercept the calls to the
> group->point_xxx point arithmetic functions.

field_enc and field_dec could point to the field_enc/dec functions I believe. The only reason that I did't do that was because point multiplication is the interface that I'm used to implementing. I've no strong feelings either way about doing this.

However, I don't believe that the other functions could point to ecp_256_32.c functions because I didn't implement contraction: the reduction is only partial, which works for the purposes of the intermediate calculations in the group, but a full reduction is needed at the end and the code depends on the standard bigint functions for that.

> If we also intercept group->validate_point, then group->meth->field_mod
> becomes the only field arithmetic function used. group->meth->field_mod
> can be easily replaced by mp_mod if we don't care about optimizing it.
> Does ecp_256_32.c not contain a P-256 specific field_mod function?

I don't think that validate_point is performance critical enough to bother with.

As noted, the reduction function in ecp_256_32.c is not suitable to use as field_mod as it's only partial.
(Assignee)

Comment 13

4 years ago
Comment on attachment 708457 [details] [diff] [review]
Add big-endian support

LGTM, thanks! Without a big-endian machine to hand I didn't believe that I'd get it right.
Attachment #708457 - Flags: review?(agl) → review+

Comment 14

4 years ago
Created attachment 730377 [details] [diff] [review]
Patch by Adam Langley, v4

I updated the patch for the current NSS tree in the hg repository,
and use the latest ecp_256_32.c file that has been tested in
Chromium for a while.

I had to remove #include "secport.h" and a PORT_Assert call from
ecp_256_32.c because it is inconvenient to call PORT_Assert in the
lib/freebl/ecl directory -- we'd need to include "stubs.h", which
is in the parent directory lib/freebl.

Patch pushed to the NSS hg repository (NSS 3.15):
https://hg.mozilla.org/projects/nss/rev/49984bef0706
Attachment #707750 - Attachment is obsolete: true
Attachment #708457 - Attachment is obsolete: true
Attachment #707750 - Flags: superreview?(rrelyea)
Attachment #707750 - Flags: review?(bsmith)
Attachment #730377 - Flags: review?(rrelyea)
Attachment #730377 - Flags: checked-in+

Updated

4 years ago
Status: ASSIGNED → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Target Milestone: 3.14.3 → 3.15

Comment 15

4 years ago
ECC tests failed on Linux PowerPC. I will look into that.

I tested the code on Mac OS X PowerPC before.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Comment 16

4 years ago
Comment on attachment 730377 [details] [diff] [review]
Patch by Adam Langley, v4

Review of attachment 730377 [details] [diff] [review]:
-----------------------------------------------------------------

::: lib/freebl/ecl/ecp_256_32.c
@@ +1264,5 @@
> +
> +#ifdef MP_USE_UINT_DIGIT
> +#define BYTESWAP_MP_DIGIT_TO_LE(x) BYTESWAP32(BYTESWAP32(x))
> +#else
> +#define BYTESWAP_MP_DIGIT_TO_LE(x) BYTESWAP64(BYTESWAP64(x))

The double BYTESWAP32 and BYTESWAP64 look wrong.
I wonder if I did that to test the big-endian code
paths on a little-endian processor, and forgot to
change it back.

Comment 17

4 years ago
Created attachment 730521 [details] [diff] [review]
Remove the extra BYTESWAP32 and BYTESWAP64 for big-endian processors

https://hg.mozilla.org/projects/nss/rev/b8c1ffd325c2
Attachment #730521 - Flags: review?(agl)
Attachment #730521 - Flags: checked-in+
(Assignee)

Updated

4 years ago
Attachment #730521 - Flags: review?(agl) → review+

Comment 18

4 years ago
Removing the extra BYTESWAP32 and BYTESWAP64 fixed the test
failures on the Linux PowerPC build bots.
Status: REOPENED → RESOLVED
Last Resolved: 4 years ago4 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.