Closed Bug 1566126 Opened 5 years ago Closed 5 years ago

freebl: POWER GHASH Vector Acceleration

Categories

(NSS :: Libraries, enhancement, P5)

Other
Unspecified
enhancement

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: johnjmar, Assigned: cand)

References

Details

(Keywords: perf)

Attachments

(1 file, 5 obsolete files)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0

Expected results:

Use POWER8 and POWER9 ISA enhancements to improve the performance of GHASH.

Optimized implementations in the Cryptogams repository[1] may serve as useful references.
[1] https://github.com/dot-asm/cryptogams/

POWER is not one of our officially supported architectures [0], but we would happily review and accept patches for POWER from the community. We know it's a very useful architecture - it's just a matter of limited time on our side, and nothing to test with. Thanks!

[0] https://wiki.mozilla.org/NSS#Supported_Architectures

Status: UNCONFIRMED → NEW
Type: defect → enhancement
Ever confirmed: true
Keywords: perf
Priority: -- → P5
Hardware: Unspecified → Other
See Also: → 1566124

I have an early implementation working, but I hit a gcc bug in one of the crypto builtins. Waiting to hear more on that, possible workarounds and the root cause, before optimizing it further.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91275

I followed the ARM paper linked in the aarch64 implementation. POWER8 crypto instructions are close to what the paper used for aarch32, and early speedups are nice, but will wait for final optimizations before posting numbers and patches.

Attached patch ghash-ppc.patch (obsolete) — Splinter Review

The gcc bug is now fixed in master and pending for versions 7, 8 and 9. The patch is ready for review, attaching. I used git since I'm not familiar with mercurial, hope it works with the systems. Also not familiar with the r? system, I put jcj as a placeholder (?).

Benchmark of bltest -E -m aes_gcm -i tests/aes_gcm/plaintext10 \ -v tests/aes_gcm/iv10 -k tests/aes_gcm/key10 -5 10 on POWER8 3.3GHz.

NSS_DISABLE_HW_CRYPTO=1
mode in symmkey opreps cxreps context op time(sec) thrgput
aes_gcm_e 309Mb 192 5M 0 0.000 10000.000 10.001 30Mb

  mode          in symmkey  opreps  cxreps     context          op   time(sec)     thrgput

aes_gcm_e 829Mb 192 14M 0 0.000 10000.000 10.001 82Mb

Notable operf results, sw:
samples % image name symbol name
226033 59.3991 libfreeblpriv3.so bmul
80606 21.1824 libfreeblpriv3.so rijndael_encryptBlock128
28851 7.5817 libfreeblpriv3.so gcm_HashMult_sftw

hw:
213899 56.2037 libfreeblpriv3.so rijndael_encryptBlock128
45233 11.8853 libfreeblpriv3.so gcm_HashMult_hw

So the ghash part is ~5.6x faster.

Attachment #9098201 - Flags: review?(jjones)
Comment on attachment 9098201 [details] [diff] [review] ghash-ppc.patch Review of attachment 9098201 [details] [diff] [review]: ----------------------------------------------------------------- Attaching like this is OK. We'd prefer in the future that you upload via `moz-phab` or `arc` to phabricator, particularly for complex reviews, but this is OK. Kevin do you have time to review?
Attachment #9098201 - Flags: review?(jjones) → review?(kjacobs.bugzilla)

Thank you for the patch. I don't have access to any PowerPC hardware to test this, so for an initial step, could you please report the test vector results for this patch? These can be executed via NSS_TESTS=gtests GTESTS=pk11_gtest GTESTFILTER=*Gcm* NSS_CYCLES=standard tests/all.sh.

I used gtests/common/testvectors/gcm-vectors.h as test data when developing the patch.

HOST=nss DOMSUF=local BUILD_OPT=1 NSS_TESTS=gtests GTESTS=pk11_gtest GTESTFILTER=*Gcm* NSS_CYCLES=standard tests/all.sh
...snip...
Tests summary:
--------------
Passed:             241
Failed:             0
Failed with core:   0
ASan failures:      0
Unknown status:     0
Comment on attachment 9098201 [details] [diff] [review] ghash-ppc.patch Review of attachment 9098201 [details] [diff] [review]: ----------------------------------------------------------------- The patch looks correct using the algorithms from [0]. r+ based on the review and your test results. Thank you! Note: this will need a clang-format before landing. We can take care of that. [0] https://www.researchgate.net/publication/285612706_Implementing_GCM_on_ARMv8
Attachment #9098201 - Flags: review?(kjacobs.bugzilla) → review+
Comment on attachment 9098201 [details] [diff] [review] ghash-ppc.patch Review of attachment 9098201 [details] [diff] [review]: ----------------------------------------------------------------- ::: lib/freebl/gcm-ppc.c @@ +17,5 @@ > + vec_xst_be((vector unsigned char) ghash->x, 0, outbuf); > + return SECSuccess; > +} > + > +static vector unsigned long long vpmsumd(const vector unsigned long long a, It seems to me that `vector unsigned long long` might benefit from a typedef, if only to reduce line noise. @@ +26,5 @@ > + return __builtin_altivec_crypto_vpmsumd(a, b); > +#elif (__GNUC__ >= 10) || (__GNUC__ == 9 && __GNUC_MINOR__ >= 3) || \ > + (__GNUC__ == 8 && __GNUC_MINOR__ >= 4) || \ > + (__GNUC__ == 7 && __GNUC_MINOR__ >= 5) > + /* GCC versions not affected by bug 91275 */ This is a GCC bug? Maybe use a link rather than "bug XXXX" so we don't get confused with our bugs.
Attached patch ghash-v2.patch (obsolete) — Splinter Review

Uploading v2. Typedefs and changed the gcc bug to link, no code changes.

Attachment #9098201 - Attachment is obsolete: true
Attachment #9100123 - Flags: review?(mt)

(In reply to Kevin Jacobs [:kjacobs] from comment #6)

Thank you for the patch. I don't have access to any PowerPC hardware to test this, so for an initial step, could you please report the test vector results for this patch? These can be executed via NSS_TESTS=gtests GTESTS=pk11_gtest GTESTFILTER=*Gcm* NSS_CYCLES=standard tests/all.sh.

HI Kevin. Access to Power systems can be freely and easily obtained through http://osuosl.org/services/powerdev/.
I can be your advocate if necessary.

Comment on attachment 9100123 [details] [diff] [review] ghash-v2.patch Review of attachment 9100123 [details] [diff] [review]: ----------------------------------------------------------------- LGTM; modulo some formatting that we can fix up. ::: lib/freebl/gcm-ppc.c @@ +19,5 @@ > + return SECSuccess; > +} > + > +static vec_u64 vpmsumd(const vec_u64 a, > + const vec_u64 b) One line now. @@ +98,5 @@ > +gcm_HashInit_hw(gcmHashContext *ghash) > +{ > + ghash->x = (vec_u64) vec_splat_u32(0); > + ghash->h = (vec_u64) {ghash->h_low, > + ghash->h_high}; With the shorter names, there are a few cases like this that can go back to one line. ::: lib/freebl/gcm.h @@ +71,5 @@ > __m128i x, h; > #elif defined(__aarch64__) > uint64x2_t x, h; > +#elif defined(__powerpc64__) > + vector unsigned long long x, h; Want to use vec_u64 here too?
Attachment #9100123 - Flags: review?(mt) → review+
Attached patch ghash-v3.patch (obsolete) — Splinter Review

v3, addressing comments. Also rebased to master, solving one conflict in freebl.gyp.

Attachment #9100123 - Attachment is obsolete: true
Attachment #9100815 - Flags: review?(mt)
Comment on attachment 9100815 [details] [diff] [review] ghash-v3.patch Review of attachment 9100815 [details] [diff] [review]: ----------------------------------------------------------------- Sorry for not seeing this before, we should change USE_PPC_CRYPTO to USE_PPC_GCM or even USE_PPC_GHASH on the basis that this is only GHASH.
Attached patch ghash-v4.patch (obsolete) — Splinter Review

Are you sure? The other USE defines are named after instruction sets, like on ARM you have separate PMULL and AES. However on PPC both AES and the polynomial multiply were added in the same instruction set bit, PPC CRYPTO. So for PPC, if one is supported, they both are.

I do agree the env var to disable AES and GHASH should be separate, so the user can better benchmark specific parts or to check for bugs. v4 changes the env var, but not yet the USE define.

Attachment #9100815 - Attachment is obsolete: true
Attachment #9100815 - Flags: review?(mt)
Attachment #9101782 - Flags: review?(mt)
Comment on attachment 9101782 [details] [diff] [review] ghash-v4.patch Review of attachment 9101782 [details] [diff] [review]: ----------------------------------------------------------------- Another round, sorry. I did a thorough check of all the preprocessor directives and it looks like they are a little inconsistent throughout. I think that they need another pass to ensure that we don't have odd holes. Right now, I think that this will fail to build in some configurations. Other than that, this looks pretty good, just a few minor nits. ::: lib/freebl/freebl.gyp @@ +144,5 @@ > + 'cflags': [ > + '-mcrypto -maltivec' > + ], > + 'cflags_mozilla': [ > + '-mcrypto -maltivec' This should be: ``` 'cflags': [ # or 'cflags_mozilla' '-mcrypto', '-maltivec', ], ``` ::: lib/freebl/gcm-ppc.c @@ +103,5 @@ > + ghash->x = (vec_u64) vec_splat_u32(0); > + return SECSuccess; > +} > + > +#endif /* defined(__clang__) || (defined(__GNUC__) && __GNUC__ > 6) */ >= 5, to match the comment at the top. ::: lib/freebl/gcm.c @@ +23,4 @@ > #define USE_ARM_GCM > #endif > > +#if defined(__powerpc64__) && defined(IS_LITTLE_ENDIAN) So this test here doesn't match the one in `gcm-ppc.c`. That code only exists if you have clang or GCC 5+. Here you only try to use that code if you have a little-endian PPC CPU. In `freebl.gyp` you include the extra libraries if you have a big- OR little-endian PPC CPU. What is the actual test that applies here? The test should be either whatever you need to build and link, plus any additional checks that you might do at compile time for targets where the code won't work. It seems like you need clang/GCC 5+ to build and link the functions, even if they won't run. If the instructions a little-endian only, then you need to fix the target architecture in freebl.gyp; if they work on big endian or the CPU check is fine, remove the IS_LITTLE_ENDIAN check here. We currently build with GCC 4.8 on Linux x86/64, but we could raise the limit for PPC as a result of this bug if you wanted to keep things simple; we only support that old compiler for old RHEL releases that we maintain. I'd have to run a change like that by the NSS developers group though. ::: lib/freebl/gcm.h @@ +71,4 @@ > __m128i x, h; > #elif defined(__aarch64__) > uint64x2_t x, h; > +#elif defined(__powerpc64__) This will include the variables in a wider set of cases than where it is used. Maybe you can move the USE_PPC_CRYPTO check into the header file and use that for all the relevant code.
Attachment #9101782 - Flags: review?(mt)

Regarding LE: the instructions work in both endians, but the code may need changes for BE. I can only test LE. The intention to build always but to only run on LE is so that anyone interested in BE can more easily enable it.

BTW, seems that PPC is getting a higher standard review here than ARM ;) (those ARM variables are included in a wider set than USE_ARM_GCM).

Attached patch ghash-v5.patch (obsolete) — Splinter Review

v5:

  • Unify defines, only build on LE
  • Work around the ghash test being c++

PS: The ARM lib is also built on BE in the gyp, but only used on LE...

Attachment #9101782 - Attachment is obsolete: true
Attachment #9103467 - Flags: review?(mt)

(In reply to cand from comment #18)

BTW, seems that PPC is getting a higher standard review here than ARM ;) (those ARM variables are included in a wider set than USE_ARM_GCM).

:) Part of that is because we have active development on ARM and a test suite that runs regularly. If there is a problem with ARM, we've greater confidence that it will be fixed. Part of it is my attention span - I don't have a lot of time for reviewing this, so I've made you do multiple iterations and so I've had fresh eyes for every review.

That said, if you think that there is a problem with ARM, then point at it and we'll clobber it promptly.

I think that this patch is good. The workaround you added shouldn't be necessary if we move to -std=c99, but I need to check that this doesn't cause issues for us before I land this.

Assignee: nobody → cand
Comment on attachment 9103467 [details] [diff] [review] ghash-v5.patch Review of attachment 9103467 [details] [diff] [review]: ----------------------------------------------------------------- ::: lib/freebl/gcm.h @@ +32,5 @@ > > +#ifdef __powerpc64__ > +#include "altivec-types.h" > + > +/* The ghash freebl test tries to use this in C++, and gcc defines conflict. */ Just looking to check this in. With the recent change in NSS to use -std=c99, my understanding is that these defines won't be present and you can remove this block. Can you check that please?
Attachment #9103467 - Flags: review?(cand)

Please see gtests/freebl_gtest/*.cc. Those are C++, and only built with gyp (I build with make normally, only trying gyp when making changes to the gyp file, which is why I didn't notice it right after changing the header). There don't seem to be any -std= changes in the last couple weeks, which is about when I last rebased?

GCC 10 changes this header such that those defines are excluded for C++, but GCC 10 is still ways out.

https://github.com/gcc-mirror/gcc/blob/4c5d288b95dfd7f8f4d412eb23b680ca19f32be9/gcc/config/rs6000/altivec.h
This is the pre-gcc 10 altivec.h. The defines are there in -std=c99 and -std=c++11 for those gcc versions.

However if the -std= change has not yet landed, the patch may need changes in altivec-types.h. Maybe better to land the change first, so I can test with it?

My bad, I had thought that Bug 1590972 had landed. But it has now. Maybe rebase again.

As for the freebl tests only being built with gyp, that's a problem that we should track. (And you should build with gyp if you care about your time :)

Attached patch ghash-v6.patchSplinter Review

v6 rebased and adjusted to build with the new -std= options.

Attachment #9103467 - Attachment is obsolete: true
Attachment #9103467 - Flags: review?(mt)
Attachment #9103467 - Flags: review?(cand)
Attachment #9105526 - Flags: review?(mt)
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → 3.48

@cand, is there a way to show GHASH only improvement (like in OpenSSL)?

There is no direct GHASH benchmark in NSS, only the quoted AES-GCM one. The ARM implementation of GHASH used the same.

The algorithm used in OpenSSL is slightly different, NSS's GHASH takes pains to avoid side channels. I'm not sure if the OpenSSL algo has the same guarantees.

It looks like NSS does not expose GHASH in the public api headers, making external benchmarking a bit harder.

@cand thank you for your answers.
One more question, would you explain where do you see ~5.6x improvement in your benchmarks? Looking at:

Benchmark of `bltest -E -m aes_gcm -i tests/aes_gcm/plaintext10 \
-v tests/aes_gcm/iv10  -k tests/aes_gcm/key10 -5 10` on POWER8 3.3GHz.

NSS_DISABLE_HW_CRYPTO=1
      mode          in symmkey  opreps  cxreps     context          op   time(sec)     thrgput
 aes_gcm_e       309Mb     192      5M       0       0.000   10000.000      10.001        30Mb

      mode          in symmkey  opreps  cxreps     context          op   time(sec)     thrgput
 aes_gcm_e       829Mb     192     14M       0       0.000   10000.000      10.001        82Mb

I see at most ~2.8x improvement.

It was from the operf results. Perf counters are time-based, and in both cases the benchmark ran for 10s, so while not as accurate as a standalone benchmark, it's rougly correct.

Before, bmul had 226k hits and gcm_HashMult_sftw 28k (bmul being the software equivalent of vpmsumd), and after gcm_HashMult_hw had 45k hits. (226+28)/45 = ~5.6

One last question, do you know how the ppc64le implemention compares to ARM's?

https://hg.mozilla.org/projects/nss/rev/264f19e7ede7 has the ARM commit. It only has bltest numbers, and I don't have an ARM device to test with, so we don't have directly comparable numbers for GHASH only. Comparing the bltest ratios (829/309 ~ 2.68 for PPC, 659/265 ~ 2.48 for ARM), the PPC GHASH implementation is faster than ARM, but how much we can't tell with just the AES-GCM numbers.

At that ARM commit time, ARM did not yet have HW AES acceleration, only GHASH, just like PPC now. (ARM AES accel was committed on Oct 11)

If you are interested in benchmarking just GHASH, it should be possible to build a test that only runs AAD through the AEAD. Then AES isn't engaged.

Note: there is some overhead if you are comparing NSS against other implementations. We need the new PKCS#11 AEAD APIs to do this justice on longer sequences of messages that are protected with the same key.

Regressions: 1602386
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: