Last Comment Bug 886222 - [NSS PATCH] Efficient and constant time 1024-bit and 2048-bit modular exponentiation for AVX2 capable x86_64 platforms
: [NSS PATCH] Efficient and constant time 1024-bit and 2048-bit modular exponen...
Status: NEW
:
Product: NSS
Classification: Components
Component: Libraries (show other bugs)
: trunk
: x86_64 Linux
: -- normal with 2 votes (vote)
: ---
Assigned To: Shay Gueron
:
:
Mentors:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2013-06-23 23:27 PDT by Shay Gueron
Modified: 2015-07-14 11:02 PDT (History)
18 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
[NSS PATCH] Efficient and constant time 1024-bit and 2048-bit modular exponentiation for AVX2 capable x86_64 platforms (148.75 KB, patch)
2013-06-23 23:27 PDT, Shay Gueron
no flags Details | Diff | Splinter Review
intel_ModExp_AVX2_patch_v02.patch (143.51 KB, patch)
2013-07-27 05:09 PDT, Shay Gueron
rrelyea: review-
Details | Diff | Splinter Review

Description Shay Gueron 2013-06-23 23:27:32 PDT
Created attachment 766545 [details] [diff] [review]
[NSS PATCH] Efficient and constant time 1024-bit and 2048-bit modular exponentiation for AVX2 capable x86_64 platforms

Hello all,

This patch is a contribution to NSS. 

It provides an efficient and constant-time implementation of modular exponent function, using the AVX2 instructions set, and achieves high performance on Intel 4th Generation Core Processors.

The details of the algorithm behind this implementation can be found in:
S. Gueron, V. Krasnov: "Software implementation of modular exponentiation, using advanced vector instructions architectures", WAIFI'12 Proceedings of the 4th international conference on Arithmetic of Finite Fields Pages 119-135.

Applying this “vectorized” algorithm to modular exponentiation improves the performance of: DH1024, DH2048, RSA2048 sign and verify, RSA1024 verify, JPAKE and DSA1024 with DSA2048 sign and verify.

We measured performance of the patch on the new Intel® 4th Generation Core™ i7-4770K, running at 3.5 GHz, and achieved the following performance speedup, compared with current version of NSS:

Algorithm – speedup factor
RSA2048 sign – 2.20x
RSA1024 verify (with e= 65537) – 1.25x
RSA2048 verify (with e = 65537) – 1.44x
DSA1024 sign – 1.86x
DSA2048 sign – 2.00x
DSA1024 verify – 1.57x
DSA2048 verify – 2.00x


Developers and authors:
***************************************************************************
Shay Gueron (1, 2), and Vlad Krasnov (1)
(1) Intel Corporation, Israel Development Center, Haifa, Israel
(2) University of Haifa, Israel
***************************************************************************
Copyright(c) 2013, Intel Corp.
Comment 1 Jeff Muizelaar [:jrmuizel] 2013-06-24 19:13:07 PDT
(In reply to Shay Gueron from comment #0)
> Created attachment 766545 [details] [diff] [review]
> [NSS PATCH] Efficient and constant time 1024-bit and 2048-bit modular
> exponentiation for AVX2 capable x86_64 platforms
> 

I've only quickly glanced at the patch but it looks like there's a lot of manual code duplication that could be cleaned up by using macros.
Comment 2 Shay Gueron 2013-07-07 00:03:35 PDT
Indeed. A more compact (easier to maintain) will be posted soon.
Comment 3 Shay Gueron 2013-07-27 05:09:20 PDT
Created attachment 782142 [details] [diff] [review]
intel_ModExp_AVX2_patch_v02.patch

Here is the patch - a version using with macros. 

Performance, compared to cvs 7/22/2013:

DSA1024 enc: 2.04 X
DSA1024 dec: 2.11 X
DSA2048 enc: 2.13 X
DSA2048 dec: 2.14 X

RSA2048 dec: 2.37 X
RSA4096 dec: 2.02 X


Developers and authors:
***************************************************************************
Shay Gueron (1, 2), and Vlad Krasnov (1)
(1) Intel Corporation, Israel Development Center, Haifa, Israel
(2) University of Haifa, Israel
***************************************************************************
Copyright(c) 2013, Intel Corp.
Comment 4 Emanuel Hoogeveen [:ehoogeveen] 2013-08-28 17:08:19 PDT
Shay, did you mean to set that patch as |review?| or |feedback?|? Jeff can probably review or at least point you at the right person :) In addition, you should mark attachment 766545 [details] [diff] [review] as obsolete.
Comment 5 Shay Gueron 2013-08-28 17:15:30 PDT
The second revision of the algorithm is merely a more compact code version, coming to replace the original post. I will mark the older one as obsolete. 
This is intended for a review and possible integration. 
Shay
Comment 6 Shay Gueron 2013-08-28 17:24:06 PDT
Comment on attachment 766545 [details] [diff] [review]
[NSS PATCH] Efficient and constant time 1024-bit and 2048-bit modular exponentiation for AVX2 capable x86_64 platforms

The version of the patch (posted originally on 2013-06-23 23:27 PDT) is replaced by a newer one. 

Please use the version from the 2013-07-27 05:09 PDT posting for a review.

Thank you, Shay Gueron
Comment 7 Emanuel Hoogeveen [:ehoogeveen] 2013-08-28 17:36:16 PDT
Comment on attachment 782142 [details] [diff] [review]
intel_ModExp_AVX2_patch_v02.patch

Jeff, tentatively asking you for review.
Comment 8 Jeff Muizelaar [:jrmuizel] 2013-09-03 12:11:46 PDT
Comment on attachment 782142 [details] [diff] [review]
intel_ModExp_AVX2_patch_v02.patch

I'm really not a great candidate to review this and don't have time to right now.
Comment 9 Kai Engert (:kaie) 2013-09-06 08:31:21 PDT
Comment on attachment 782142 [details] [diff] [review]
intel_ModExp_AVX2_patch_v02.patch

Sorry, I'm unable to review this. Bob might.
Comment 10 SpeciesX 2014-01-24 07:24:37 PST
Ping!
Comment 11 Virtual_ManPL [:Virtual] - (ni? me) 2014-01-24 07:25:34 PST
Any progress?
Comment 12 Julien Pierre 2014-04-07 18:45:21 PDT
A few questions :

1) Are these constant-time implementations still vulnerable to timing attacks that blinding protects against ? If not, should we disable blinding for this case ?

2) Is there any possibility of implementing the 4096 bit key size as well ?
RSA key sizes of less than 2048 bits are not really recommended anymore.

If this patch gets approved on Linux, I would like to contribute a patch make it work on Solaris x64 as well.
Comment 13 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2014-04-07 18:56:49 PDT
(In reply to Shay Gueron from comment #0)
> Applying this “vectorized” algorithm to modular exponentiation improves the
> performance of: DH1024, DH2048, RSA2048 sign and verify, RSA1024 verify,
> JPAKE and DSA1024 with DSA2048 sign and verify.

I believe that we should concentrate on DH2048 and RSA2048 in this bug. I think JPAKE should go in its own bug. I suggest we WONTFIX RSA1024, DSA1024, and DH1024. Maybe DSA2048 should be in its own bug too.
Comment 14 Robert Relyea 2015-05-21 11:57:27 PDT
Comment on attachment 782142 [details] [diff] [review]
intel_ModExp_AVX2_patch_v02.patch

Review of attachment 782142 [details] [diff] [review]:
-----------------------------------------------------------------

This review has been pining basically because of the daunting task of digging into the the massive .s files.

I've decided that that shouldn't hold up a reasonable review of the patch. I have a whole lot of nittly comments that really should be taken care of, thus the r-, but the over structure and the bulk of this patch is fine.

I'm trusting Shay's assembler implementation for the .s files, mostly because shay knows how AVX2 works better than anyone on the planet.

NOTE: to support RSA 2048, we need both 1024 and 2048 bit implementations here. Why you ask? because we use CRT for RSA signing, thus we do 2 1024 bit operations rather than 2 2048 bit operation for RSA private key operations, so Brian's comment doesn't apply. Note that here is no speed up for RSA 1024 bit signing because this code isn't invoked in that case (we will also get a speed up on RSA 4096 signing from this).

Also NOTE that this is linux Only. I'm OK with that for now, but others should be aware. Solarix x86 should be relatively easy to add. Windows and Mac would require more work.

My one issue may be that this might need an upgrade in assembler to build (we had that issue with gsm).

bob

::: lib/freebl/Makefile
@@ +193,5 @@
>      MPI_SRCS += mpi_amd64.c mp_comba.c
> +    # comment the next two lines to turn off AVX2 Modular Exponent acceleration
> +    DEFINES += -DMODEXP_AVX2_ASM
> +    EXTRA_SRCS += mpi_mod_exp_redundant_WW.c
> +    ASFILES += red2norm.s redundant_AVX2_AMM1024_asm.s redundant_AVX2_AMM2048_asm.s redundant_AVX2_AMS1024_asm.s redundant_AVX2_AMS2048_asm.s

Note: this only adds support in Linux. I don't deem this fatal, but it should be made clear.

::: lib/freebl/mpi/mpi_mod_exp_redundant_WW.c
@@ +20,5 @@
> +
> +#include "mpi_RSAZ.h"
> +#include <string.h>
> +#include <malloc.h>
> +

Below are a bunch of inline functions, but we don't use them inline, we use set function pointers to them and use them through function pointers. We should remove the inline label in that case.

@@ +100,5 @@
> +        out+=36;
> +        in+=period;
> +    }
> +}
> +

So we have a bunch of unsigned long *outs, but in this case we explicitly say the array size. We probably should be consistent here.
(It also leads to a type mismatch when we assign a pointer, so it should probably be and unsigned long *out. A comment on the expected size would be reasonable for all these functions.

@@ +237,5 @@
> +
> +    out[4]=in[step*2];
> +    out[5]=in[step*3];
> +}
> +

see comment on norm2red1024()

@@ +286,5 @@
> +    
> +    out[74] = 0;
> +    out[75] = 0;
> +}
> +

Unlike the above functions where the inline statement isn't all that useful, it's fine for these functions (though gcc is smart, it would have inlined these already.. still it's a good use of inline here).

@@ +307,5 @@
> +const unsigned long one[74] = {1};
> +const unsigned long conv2048[74] = {0,0,0,1<<12};
> +const unsigned long conv1024[36] = {0,0,1<<22};
> +
> +int RSAZ_redundant_mod_exponent(mp_digit result[16], mp_digit base_norm[16], mp_digit exponent[16], mp_digit m_norm[16], mp_digit RR[16], mp_digit k0, int exponent_bits, int mod_bits, int const_time)

These should be mp_digit *'s not arrays. Theyre size varies between 1024 and 2048 bits values.

@@ +406,5 @@
> +        conv_help = conv1024;
> +    }
> +    else
> +    {
> +        /* This function should never be called with other modulii, should be checked externally */

We should assert here, and return an error.

::: lib/freebl/mpi/mpmontg.c
@@ +1088,5 @@
> +    freebl_cpuid(1, &eax, &ebx, &ecx, &edx);
> +    has_avx = (ecx & (1 << 28)) != 0 ? 1 : -1;
> +    freebl_cpuid(7, &eax, &ebx, &ecx, &edx);
> +    has_avx2 = (ebx & (1 << 5)) != 0 ? 1 : -1;
> +    use_avx2 = (has_avx > 0) && (has_avx2 > 0);

We should be able to override this to off with an environment variable (see the AES code in rijndael.c for an example).

@@ +1101,5 @@
> +        mp_digit a_locl[32] = {0};
> +
> +        mp_init(&RR);
> +        mp_set(&RR, 1);  
> +        ( s_mp_lshd( &RR, modulus->used*2 ));

Please use the MP macro for thie (MP_USED()).

@@ +1105,5 @@
> +        ( s_mp_lshd( &RR, modulus->used*2 ));
> +        ( mp_mod(&RR, modulus, &RR) );
> +        n0 = 0 - s_mp_invmod_radix( MP_DIGIT(modulus, 0) );
> +        
> +        memcpy(a_locl, base->dp, base->used*8);

Please use the MP macros MP_DIGITS() and MP_USED().

@@ +1108,5 @@
> +        
> +        memcpy(a_locl, base->dp, base->used*8);
> +        
> +        if(!RSAZ_redundant_mod_exponent(result->dp, a_locl, exponent->dp, modulus->dp, RR.dp, n0, mpl_significant_bits(exponent), mpl_significant_bits(modulus), mp_using_cache_safe_exp))
> +        {

Same macro comment here.

@@ +1114,5 @@
> +        }
> +        
> +        result->used = modulus->used;
> +        
> +        while((result->dp[result->used - 1] == 0) && result->used>0) result->used--;

MP_USED here and line 1116 as well.

::: lib/freebl/mpi/redundant_AVX2_AMM1024_asm.s
@@ +1,1 @@
> +

need mozilla header here.

::: lib/freebl/mpi/redundant_AVX2_AMM2048_asm.s
@@ +1,1 @@
> +

need Mozilla header here.

::: lib/freebl/mpi/redundant_AVX2_AMS1024_asm.s
@@ +1,1 @@
> +

Need Mozilla header here.

::: lib/freebl/mpi/redundant_AVX2_AMS2048_asm.s
@@ +1,2 @@
> +
> +

Need mozilla header here.
Comment 15 Robert Relyea 2015-05-21 12:03:52 PDT
> I believe that we should concentrate on DH2048 and RSA2048 in this bug. I think JPAKE should go in its own bug.
> I suggest we WONTFIX RSA1024, DSA1024, and DH1024. Maybe DSA2048 should be in its own bug too.

Not really relevant. The patch is for the generic modular exponentiation code. It's not RSA specific, and should work for JPAKE, DSA, RSA, and DH once applied.

The patch doesn't help RSA1024 bit signing, but accidentally helps 1024 bit RSA verify. It also accidentally helps DSA1024, DH1024 and JPAKE1024, but that's just an accident. That's because RSA does to smaller mod exp (one for p and one for q) operations rather than one (for n). Since mod exp scales as the square, we get a 2x performance improvement doing this.

Note that this patch will also help RSA4096 signing/decryption as well, but that's an accident.

Basically this is the minimum patch to accelerate RSA2048 signing and verification. It will also help all those others as well for free. Thus no separate bugs are needed.

bob

Note You need to log in before you can comment on or make changes to this bug.