Last Comment Bug 831006 - P-256 performance on ARM is poor
: P-256 performance on ARM is poor
Status: RESOLVED FIXED
:
Product: NSS
Classification: Components
Component: Libraries (show other bugs)
: trunk
: ARM All
: P1 enhancement (vote)
: 3.15
Assigned To: Adam Langley
:
Mentors:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2013-01-15 14:25 PST by Adam Langley
Modified: 2013-03-28 08:06 PDT (History)
4 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
32-bit, P-256 implementation. (48.07 KB, text/x-csrc)
2013-01-15 14:25 PST, Adam Langley
no flags Details
patch to wire up the new implementation. (5.93 KB, patch)
2013-01-15 14:27 PST, Adam Langley
no flags Details | Diff | Review
Patch by Adam Langley (55.12 KB, patch)
2013-01-24 16:21 PST, Wan-Teh Chang
no flags Details | Diff | Review
Patch by Adam Langley, v2 (54.56 KB, patch)
2013-01-25 18:04 PST, Wan-Teh Chang
no flags Details | Diff | Review
Patch by Adam Langley, v3 (54.65 KB, patch)
2013-01-29 11:39 PST, Wan-Teh Chang
agl: feedback+
Details | Diff | Review
Add big-endian support (2.46 KB, patch)
2013-01-30 21:30 PST, Wan-Teh Chang
agl: review+
Details | Diff | Review
Patch by Adam Langley, v4 (54.13 KB, patch)
2013-03-27 15:21 PDT, Wan-Teh Chang
wtc: review? (rrelyea)
wtc: checked‑in+
Details | Diff | Review
Remove the extra BYTESWAP32 and BYTESWAP64 for big-endian processors (879 bytes, patch)
2013-03-27 21:16 PDT, Wan-Teh Chang
agl: review+
wtc: checked‑in+
Details | Diff | Review

Description Adam Langley 2013-01-15 14:25:27 PST
Created attachment 702526 [details]
32-bit, P-256 implementation.

From a previous email, where I'm doing testing on a Galaxy Nexus:

"Firstly I tweaked the Makefile to change -O to -O3. I also found
significant differences in speed between when the screen is on and
when it's off. I assume in the latter case that the CPU frequency is
reduced to save power. Below, I report times in the format "16/20ms"
where the time with the screen on is 16ms and the time when the screen
is off is 20ms.

The original code is 20.59/35.29ms and 20.62/35.38ms for a base-point
scalar mult and arbitrary point scalar mult respectively. ECDHE
involves one of each and the sum (41.21/70.67ms) is pretty close to
the estimate that wtc reported (~80ms) for running on some ARM device,
at least with the screen off.

I found that NSS's optimised NIST routines are disabled in the
standard build. By enabling those I get 13.64/23.09ms and
13.59/23.11ms (sum: 27.23/46ms)

wtc also reported that he found ARM assembly that was disabled by
default and I enabled that (I think: I set TARGET=armLINUX in
freebl/mpi), this brought the times down to 10.27/17.64ms and
10.28/17.61ms (sum: 20.55/35.25ms)

Finally, I wrote a reimplementation in plain C, which didn't go as
well as I had hoped, but the times are 1.24/2.09ms and 3.21/5.49ms
(sum: 4.45/7.58ms). This means that I didn't quite get my 10x
improvement on the original times that I expected."
Comment 1 Adam Langley 2013-01-15 14:27:16 PST
Created attachment 702527 [details] [diff] [review]
patch to wire up the new implementation.

Note that the new version may be slower on 64-bit systems because it cannot use the large mults on those processors. However, it is constant time.
Comment 2 Wan-Teh Chang 2013-01-24 16:21:50 PST
Created attachment 706155 [details] [diff] [review]
Patch by Adam Langley

I created this NSS patch from the previous two attachments. I only
made some minor edits such as removing unnecessary standard library
headers from ecp_256_32.c and adding #ifdef IS_LITTLE_ENDIAN to the
new code in ecp_256.c.

The corresponding makefile changes for determining when to use
ecp_256_32.c (i.e., detecting little-endian CPUs) still need to be
written.

Adam, should we use ecp_256_32.c in 64-bit builds?
Comment 3 Adam Langley 2013-01-25 08:06:56 PST
> Adam, should we use ecp_256_32.c in 64-bit builds?

The ecp_256_32 code is nicely constant time, so I'd be tempted to say yes. It's tough to beat the 64x64 full-mult in 64-bit mode however so it may be slower. If I get a chance, I'll try to benchmark the different on an amd64 box and get numbers.

If anyone is suffering from slow P-256 performance on amd64, I could port the work that I did for OpenSSL to NSS, but it only really applies to servers doing lots of ECDHE or ECDSA.
Comment 4 Wan-Teh Chang 2013-01-25 18:04:43 PST
Created attachment 706698 [details] [diff] [review]
Patch by Adam Langley, v2

This new patch does not use ecp_256.c unless NSS_ECC_MORE_THAN_SUITE_B
is set.

I can't make ECGroup_consGFp_mont work with the point multiplication
functions defined in ecp_256_32.c, so I am using ECGroup_consGFp with
ecp_256_32.c.

The remaining issue is to figure out the makefile changes that
use ecp_256_32.c only for little-endian CPUs. In the worst case
I can use a white list (arm, x86, and x86_64).
Comment 5 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2013-01-26 19:24:51 PST
(In reply to Wan-Teh Chang from comment #4)
> This new patch does not use ecp_256.c unless NSS_ECC_MORE_THAN_SUITE_B
> is set.

Just to be clear: does this mean that the performance will be better with NSS_ECC_MORE_THAN_SUITE_B defined or without NSS_ECC_MORE_THAN_SUITE_B defined?
Comment 6 Adam Langley 2013-01-28 09:21:23 PST
I tested this using `make tests` in freebl/ecl. Prior to that, I had to build the standalone libmpi.a and made the following changes to the freebl/mpi/target.mk to try and enable the amd64 assembly and give the existing code a fighting chance:

+AS_OBJS = mpi_amd64.o mpi_amd64_gas.o
+MP_CONFIG = -DMP_ASSEMBLY_MULTIPLY -DMPI_AMD64
+MP_CONFIG += -DMP_IS_LITTLE_ENDIAN
+CFLAGS = -O3 -I. -DMP_API_COMPATIBLE -DMP_IOFUNC $(MP_CONFIG)
+MPICMN += $(MP_CONFIG)
+
+mpi_amd64_asm.o: mpi_amd64_gas.s
+       gcc -o mpi_amd64_asm.o -D_ASM -c mpi_amd64_gas.s

In freebl/ecl/tests/ecp_test.c I upped the test iterations to 1000 and commented out the code for everything except P-256.

The test machine has TurboBoost and Hyperthreading disabled.

Unpatched, NSS 3.14 resulted in this:

MP_CHECKOK (ECPoint_mul(group, &n, NULL, NULL, &rx, &ry)) k:   1000, t:   0.71 sec 
MP_CHECKOK (ECPoint_mul(group, &n, &gx, &gy, &rx, &ry)) k:   1000, t:   0.71 sec
MP_CHECKOK (ECPoints_mul (group, &n, &n, &gx, &gy, &rx, &ry)) k:   1000, t:   1.39 sec

With the P-256 v2 patch, attached:

MP_CHECKOK (ECPoint_mul(group, &n, NULL, NULL, &rx, &ry)) k:   1000, t:   0.17 sec 
MP_CHECKOK (ECPoint_mul(group, &n, &gx, &gy, &rx, &ry)) k:   1000, t:   0.49 sec
MP_CHECKOK (ECPoints_mul (group, &n, &n, &gx, &gy, &rx, &ry)) k:   1000, t:   0.64 sec

With v2 and NSS_ECC_MORE_THAN_SUITE_B:

P_CHECKOK (ECPoint_mul(group, &n, NULL, NULL, &rx, &ry)) k:   1000, t:   0.19 sec 
MP_CHECKOK (ECPoint_mul(group, &n, &gx, &gy, &rx, &ry)) k:   1000, t:   0.49 sec
MP_CHECKOK (ECPoints_mul (group, &n, &n, &gx, &gy, &rx, &ry)) k:   1000, t:   0.63 sec

I conclude that we should enable this P-256 code for all platforms, but that NSS_ECC_MORE_THAN_SUITE_B hardly matters.
Comment 7 Wan-Teh Chang 2013-01-28 12:42:16 PST
Comment on attachment 706698 [details] [diff] [review]
Patch by Adam Langley, v2

Review of attachment 706698 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozilla/security/nss/lib/freebl/ecl/ecp_256_32.c
@@ +1464,5 @@
> +    if (name == ECCurve_NIST_P256) {
> +        group->base_point_mul = &ec_GFp_nistp256_base_point_mul;
> +        group->point_mul = &ec_GFp_nistp256_point_mul;
> +        group->points_mul = &ec_GFp_nistp256_points_mul_vartime;
> +    }

Adam, could you measure the performance with the P-256 v2 patch applied
but with this function being a no-op (simply returning MP_OKAY) and
with NSS_ECC_MORE_THAN_SUITE_B undefined?

This will allow us to compare plain ECGroup_consGFp (without additional
method overrides) with ECGroup_consGFp_mont. Thanks.

Rationale: on big-endian processors or processors that don't right-shift
a signed integer in the way the NON_ZERO_TO_ALL_ONES macro requires,
it will simplify the patch if we can just use plain ECGroup_consGFp
for curve P-256.
Comment 8 Wan-Teh Chang 2013-01-29 11:39:19 PST
Created attachment 707750 [details] [diff] [review]
Patch by Adam Langley, v3

This patch is ready for review.

There are three potential concerns with this patch. I address them below.

1. Use of ECGroup_consGFp() when NSS_ECC_MORE_THAN_SUITE_B is *not* defined.

I inspected the differences between ECGroup_consGFp_mont() (which we are
currently using when NSS_ECC_MORE_THAN_SUITE_B is *not* defined) and
ECGroup_consGFp(). They only differ in three methods: meth->field_mul,
meth->field_sqr, and meth->field_div. The implementations of those
three methods in ECGroup_consGFp() do generic bignum arithmetic:
  http://mxr.mozilla.org/mozilla-central/ident?i=ec_GFp_mul
  http://mxr.mozilla.org/mozilla-central/ident?i=ec_GFp_sqr
  http://mxr.mozilla.org/mozilla-central/ident?i=ec_GFp_div

2. The ecp_256_32.c code requires a little-endian processor.

I figured out a simple solution. I define ec_group_set_gfp256_32()
as a simple no-op if IS_LITTLE_ENDIAN is not defined. Please
search for "IS_LITTLE_ENDIAN" in ecp_256_32.c to see my solution.

3. The ecp_256_32.c code requires right-shift of a negative signed
integer does an arithmetic shift.

The C Standard says the resulting value of right-shifting a negative
signed integer is implementation-defined. Both GCC and Visual C++
document they do an arithmetic shift in that case:

http://gcc.gnu.org/onlinedocs/gcc-4.3.3/gcc/Integers-implementation.html#Integers-implementation

http://msdn.microsoft.com/en-us/library/336xbhcz.aspx

So it is safe to assume all the compilers we are using (GCC, Clang,
and Visual C++) behave this way. If necessary I can add a compile-time
compiler type check to ensure this.
Comment 9 Adam Langley 2013-01-30 07:15:20 PST
(In reply to Wan-Teh Chang from comment #7)
> Adam, could you measure the performance with the P-256 v2 patch applied
> but with this function being a no-op (simply returning MP_OKAY) and
> with NSS_ECC_MORE_THAN_SUITE_B undefined?

It is slightly slower:

MP_CHECKOK (ECPoint_mul(group, &n, NULL, NULL, &rx, &ry)) k:   1000, t:   0.78 sec
MP_CHECKOK (ECPoint_mul(group, &n, &gx, &gy, &rx, &ry)) k:   1000, t:   0.79 sec
MP_CHECKOK (ECPoints_mul (group, &n, &n, &gx, &gy, &rx, &ry)) k:   1000, t:   1.43 sec
Comment 10 Wan-Teh Chang 2013-01-30 12:51:11 PST
agl: thanks for the measurement. I think that kind of slowdown
for big-endian processors is an acceptable trade-off for code
simplicity. Just curious: how much work would it be to port the
code to big-endian processors?

In this comment I summarize my new analysis of your code and
suggest two optional future changes.

1. I did some code tracing last night. The current patch does not
seem to be an ideal way to integrate your P-256 implementation
with lib/freebl/ecl. For example, I believe the to_montgomery
and from_montgomery functions should become group->meth->field_enc
and group->meth->field_dec, and the group->meth->field_xxx
field arithmetic functions could call your functions. But
I think it is reasonable to only intercept the calls to the
group->point_xxx point arithmetic functions.

In my code tracing, the only non-intercepted point arithmetic
function used during TLS is group->validate_point. (NSS validates
the peer's public key before doing ECDH.) If we also intercept
group->validate_point, then it will be a better (more complete)
integration with lib/freebl/ecl.

2. The field arithmetic functions used during TLS are (with the
current patch):

* group->meth->field_mod: used by to_montgomery and from_montgomery
  only

* group->meth->{field_sqr,field_mul,field_add,field_sub}: used by
  group->validate_point only

If we also intercept group->validate_point, then group->meth->field_mod
becomes the only field arithmetic function used. group->meth->field_mod
can be easily replaced by mp_mod if we don't care about optimizing it.
Does ecp_256_32.c not contain a P-256 specific field_mod function?
Comment 11 Wan-Teh Chang 2013-01-30 21:30:12 PST
Created attachment 708457 [details] [diff] [review]
Add big-endian support

This patch seems to work. I tested it on Mac OS X PowerPC.

I identified the places that need fixing by searching for
"char", "int8", "u8", "endian", "memcpy", and "*)" (the end
of a pointer cast). I found only two places.

I also looked into implementing group->validate_point, but
I can't find the a and b coefficients of P-256 in ecp_256_32.c.
It seems that they are not needed for point multiplication.

Looking at ec_GFp_validate_point closer, I think the only
expensive step is ECPoint_mul, which is already using the
optimized code.
Comment 12 Adam Langley 2013-01-31 13:25:08 PST
(In reply to Wan-Teh Chang from comment #10)
> 1. I did some code tracing last night. The current patch does not
> seem to be an ideal way to integrate your P-256 implementation
> with lib/freebl/ecl. For example, I believe the to_montgomery
> and from_montgomery functions should become group->meth->field_enc
> and group->meth->field_dec, and the group->meth->field_xxx
> field arithmetic functions could call your functions. But
> I think it is reasonable to only intercept the calls to the
> group->point_xxx point arithmetic functions.

field_enc and field_dec could point to the field_enc/dec functions I believe. The only reason that I did't do that was because point multiplication is the interface that I'm used to implementing. I've no strong feelings either way about doing this.

However, I don't believe that the other functions could point to ecp_256_32.c functions because I didn't implement contraction: the reduction is only partial, which works for the purposes of the intermediate calculations in the group, but a full reduction is needed at the end and the code depends on the standard bigint functions for that.

> If we also intercept group->validate_point, then group->meth->field_mod
> becomes the only field arithmetic function used. group->meth->field_mod
> can be easily replaced by mp_mod if we don't care about optimizing it.
> Does ecp_256_32.c not contain a P-256 specific field_mod function?

I don't think that validate_point is performance critical enough to bother with.

As noted, the reduction function in ecp_256_32.c is not suitable to use as field_mod as it's only partial.
Comment 13 Adam Langley 2013-01-31 13:26:52 PST
Comment on attachment 708457 [details] [diff] [review]
Add big-endian support

LGTM, thanks! Without a big-endian machine to hand I didn't believe that I'd get it right.
Comment 14 Wan-Teh Chang 2013-03-27 15:21:56 PDT
Created attachment 730377 [details] [diff] [review]
Patch by Adam Langley, v4

I updated the patch for the current NSS tree in the hg repository,
and use the latest ecp_256_32.c file that has been tested in
Chromium for a while.

I had to remove #include "secport.h" and a PORT_Assert call from
ecp_256_32.c because it is inconvenient to call PORT_Assert in the
lib/freebl/ecl directory -- we'd need to include "stubs.h", which
is in the parent directory lib/freebl.

Patch pushed to the NSS hg repository (NSS 3.15):
https://hg.mozilla.org/projects/nss/rev/49984bef0706
Comment 15 Wan-Teh Chang 2013-03-27 20:43:50 PDT
ECC tests failed on Linux PowerPC. I will look into that.

I tested the code on Mac OS X PowerPC before.
Comment 16 Wan-Teh Chang 2013-03-27 20:57:17 PDT
Comment on attachment 730377 [details] [diff] [review]
Patch by Adam Langley, v4

Review of attachment 730377 [details] [diff] [review]:
-----------------------------------------------------------------

::: lib/freebl/ecl/ecp_256_32.c
@@ +1264,5 @@
> +
> +#ifdef MP_USE_UINT_DIGIT
> +#define BYTESWAP_MP_DIGIT_TO_LE(x) BYTESWAP32(BYTESWAP32(x))
> +#else
> +#define BYTESWAP_MP_DIGIT_TO_LE(x) BYTESWAP64(BYTESWAP64(x))

The double BYTESWAP32 and BYTESWAP64 look wrong.
I wonder if I did that to test the big-endian code
paths on a little-endian processor, and forgot to
change it back.
Comment 17 Wan-Teh Chang 2013-03-27 21:16:03 PDT
Created attachment 730521 [details] [diff] [review]
Remove the extra BYTESWAP32 and BYTESWAP64 for big-endian processors

https://hg.mozilla.org/projects/nss/rev/b8c1ffd325c2
Comment 18 Wan-Teh Chang 2013-03-28 08:06:03 PDT
Removing the extra BYTESWAP32 and BYTESWAP64 fixed the test
failures on the Linux PowerPC build bots.

Note You need to log in before you can comment on or make changes to this bug.