freebl: POWER GHASH Vector Acceleration
Categories
(NSS :: Libraries, enhancement, P5)
Tracking
(Not tracked)
People
(Reporter: johnjmar, Assigned: cand)
References
Details
(Keywords: perf)
Attachments
(1 file, 5 obsolete files)
11.86 KB,
patch
|
mt
:
review+
|
Details | Diff | Splinter Review |
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0
Expected results:
Use POWER8 and POWER9 ISA enhancements to improve the performance of GHASH.
Reporter | ||
Comment 1•5 years ago
|
||
Optimized implementations in the Cryptogams repository[1] may serve as useful references.
[1] https://github.com/dot-asm/cryptogams/
Comment 2•5 years ago
|
||
POWER is not one of our officially supported architectures [0], but we would happily review and accept patches for POWER from the community. We know it's a very useful architecture - it's just a matter of limited time on our side, and nothing to test with. Thanks!
I have an early implementation working, but I hit a gcc bug in one of the crypto builtins. Waiting to hear more on that, possible workarounds and the root cause, before optimizing it further.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91275
I followed the ARM paper linked in the aarch64 implementation. POWER8 crypto instructions are close to what the paper used for aarch32, and early speedups are nice, but will wait for final optimizations before posting numbers and patches.
The gcc bug is now fixed in master and pending for versions 7, 8 and 9. The patch is ready for review, attaching. I used git since I'm not familiar with mercurial, hope it works with the systems. Also not familiar with the r? system, I put jcj as a placeholder (?).
Benchmark of bltest -E -m aes_gcm -i tests/aes_gcm/plaintext10 \ -v tests/aes_gcm/iv10 -k tests/aes_gcm/key10 -5 10
on POWER8 3.3GHz.
NSS_DISABLE_HW_CRYPTO=1
mode in symmkey opreps cxreps context op time(sec) thrgput
aes_gcm_e 309Mb 192 5M 0 0.000 10000.000 10.001 30Mb
mode in symmkey opreps cxreps context op time(sec) thrgput
aes_gcm_e 829Mb 192 14M 0 0.000 10000.000 10.001 82Mb
Notable operf results, sw:
samples % image name symbol name
226033 59.3991 libfreeblpriv3.so bmul
80606 21.1824 libfreeblpriv3.so rijndael_encryptBlock128
28851 7.5817 libfreeblpriv3.so gcm_HashMult_sftw
hw:
213899 56.2037 libfreeblpriv3.so rijndael_encryptBlock128
45233 11.8853 libfreeblpriv3.so gcm_HashMult_hw
So the ghash part is ~5.6x faster.
Comment 5•5 years ago
|
||
Comment 6•5 years ago
|
||
Thank you for the patch. I don't have access to any PowerPC hardware to test this, so for an initial step, could you please report the test vector results for this patch? These can be executed via NSS_TESTS=gtests GTESTS=pk11_gtest GTESTFILTER=*Gcm* NSS_CYCLES=standard tests/all.sh
.
I used gtests/common/testvectors/gcm-vectors.h as test data when developing the patch.
HOST=nss DOMSUF=local BUILD_OPT=1 NSS_TESTS=gtests GTESTS=pk11_gtest GTESTFILTER=*Gcm* NSS_CYCLES=standard tests/all.sh
...snip...
Tests summary:
--------------
Passed: 241
Failed: 0
Failed with core: 0
ASan failures: 0
Unknown status: 0
Comment 8•5 years ago
|
||
Comment 9•5 years ago
|
||
Assignee | ||
Comment 10•5 years ago
|
||
Uploading v2. Typedefs and changed the gcc bug to link, no code changes.
Reporter | ||
Comment 11•5 years ago
|
||
(In reply to Kevin Jacobs [:kjacobs] from comment #6)
Thank you for the patch. I don't have access to any PowerPC hardware to test this, so for an initial step, could you please report the test vector results for this patch? These can be executed via
NSS_TESTS=gtests GTESTS=pk11_gtest GTESTFILTER=*Gcm* NSS_CYCLES=standard tests/all.sh
.
HI Kevin. Access to Power systems can be freely and easily obtained through http://osuosl.org/services/powerdev/.
I can be your advocate if necessary.
Comment 12•5 years ago
|
||
Assignee | ||
Comment 13•5 years ago
|
||
v3, addressing comments. Also rebased to master, solving one conflict in freebl.gyp.
Comment 14•5 years ago
|
||
Assignee | ||
Comment 15•5 years ago
|
||
Are you sure? The other USE defines are named after instruction sets, like on ARM you have separate PMULL and AES. However on PPC both AES and the polynomial multiply were added in the same instruction set bit, PPC CRYPTO. So for PPC, if one is supported, they both are.
I do agree the env var to disable AES and GHASH should be separate, so the user can better benchmark specific parts or to check for bugs. v4 changes the env var, but not yet the USE define.
Comment 16•5 years ago
|
||
Assignee | ||
Comment 17•5 years ago
|
||
Regarding LE: the instructions work in both endians, but the code may need changes for BE. I can only test LE. The intention to build always but to only run on LE is so that anyone interested in BE can more easily enable it.
Assignee | ||
Comment 18•5 years ago
|
||
BTW, seems that PPC is getting a higher standard review here than ARM ;) (those ARM variables are included in a wider set than USE_ARM_GCM).
Assignee | ||
Comment 19•5 years ago
|
||
v5:
- Unify defines, only build on LE
- Work around the ghash test being c++
PS: The ARM lib is also built on BE in the gyp, but only used on LE...
Comment 20•5 years ago
|
||
(In reply to cand from comment #18)
BTW, seems that PPC is getting a higher standard review here than ARM ;) (those ARM variables are included in a wider set than USE_ARM_GCM).
:) Part of that is because we have active development on ARM and a test suite that runs regularly. If there is a problem with ARM, we've greater confidence that it will be fixed. Part of it is my attention span - I don't have a lot of time for reviewing this, so I've made you do multiple iterations and so I've had fresh eyes for every review.
That said, if you think that there is a problem with ARM, then point at it and we'll clobber it promptly.
I think that this patch is good. The workaround you added shouldn't be necessary if we move to -std=c99
, but I need to check that this doesn't cause issues for us before I land this.
Comment 21•5 years ago
|
||
Assignee | ||
Comment 22•5 years ago
|
||
Please see gtests/freebl_gtest/*.cc. Those are C++, and only built with gyp (I build with make normally, only trying gyp when making changes to the gyp file, which is why I didn't notice it right after changing the header). There don't seem to be any -std= changes in the last couple weeks, which is about when I last rebased?
GCC 10 changes this header such that those defines are excluded for C++, but GCC 10 is still ways out.
Assignee | ||
Comment 23•5 years ago
|
||
https://github.com/gcc-mirror/gcc/blob/4c5d288b95dfd7f8f4d412eb23b680ca19f32be9/gcc/config/rs6000/altivec.h
This is the pre-gcc 10 altivec.h. The defines are there in -std=c99 and -std=c++11 for those gcc versions.
However if the -std= change has not yet landed, the patch may need changes in altivec-types.h. Maybe better to land the change first, so I can test with it?
Comment 24•5 years ago
|
||
My bad, I had thought that Bug 1590972 had landed. But it has now. Maybe rebase again.
As for the freebl tests only being built with gyp, that's a problem that we should track. (And you should build with gyp if you care about your time :)
Assignee | ||
Comment 25•5 years ago
|
||
v6 rebased and adjusted to build with the new -std= options.
Comment 26•5 years ago
|
||
Updated•5 years ago
|
Comment 27•5 years ago
|
||
Reporter | ||
Comment 28•5 years ago
|
||
@cand, is there a way to show GHASH only improvement (like in OpenSSL)?
Assignee | ||
Comment 29•5 years ago
|
||
There is no direct GHASH benchmark in NSS, only the quoted AES-GCM one. The ARM implementation of GHASH used the same.
The algorithm used in OpenSSL is slightly different, NSS's GHASH takes pains to avoid side channels. I'm not sure if the OpenSSL algo has the same guarantees.
Assignee | ||
Comment 30•5 years ago
|
||
It looks like NSS does not expose GHASH in the public api headers, making external benchmarking a bit harder.
Reporter | ||
Comment 31•5 years ago
|
||
@cand thank you for your answers.
One more question, would you explain where do you see ~5.6x improvement in your benchmarks? Looking at:
Benchmark of `bltest -E -m aes_gcm -i tests/aes_gcm/plaintext10 \
-v tests/aes_gcm/iv10 -k tests/aes_gcm/key10 -5 10` on POWER8 3.3GHz.
NSS_DISABLE_HW_CRYPTO=1
mode in symmkey opreps cxreps context op time(sec) thrgput
aes_gcm_e 309Mb 192 5M 0 0.000 10000.000 10.001 30Mb
mode in symmkey opreps cxreps context op time(sec) thrgput
aes_gcm_e 829Mb 192 14M 0 0.000 10000.000 10.001 82Mb
I see at most ~2.8x improvement.
Assignee | ||
Comment 32•5 years ago
|
||
It was from the operf results. Perf counters are time-based, and in both cases the benchmark ran for 10s, so while not as accurate as a standalone benchmark, it's rougly correct.
Before, bmul had 226k hits and gcm_HashMult_sftw 28k (bmul being the software equivalent of vpmsumd), and after gcm_HashMult_hw had 45k hits. (226+28)/45 = ~5.6
Reporter | ||
Comment 33•5 years ago
|
||
One last question, do you know how the ppc64le implemention compares to ARM's?
Assignee | ||
Comment 34•5 years ago
|
||
https://hg.mozilla.org/projects/nss/rev/264f19e7ede7 has the ARM commit. It only has bltest numbers, and I don't have an ARM device to test with, so we don't have directly comparable numbers for GHASH only. Comparing the bltest ratios (829/309 ~ 2.68 for PPC, 659/265 ~ 2.48 for ARM), the PPC GHASH implementation is faster than ARM, but how much we can't tell with just the AES-GCM numbers.
At that ARM commit time, ARM did not yet have HW AES acceleration, only GHASH, just like PPC now. (ARM AES accel was committed on Oct 11)
Comment 35•5 years ago
|
||
If you are interested in benchmarking just GHASH, it should be possible to build a test that only runs AAD through the AEAD. Then AES isn't engaged.
Note: there is some overhead if you are comparing NSS against other implementations. We need the new PKCS#11 AEAD APIs to do this justice on longer sequences of messages that are protected with the same key.
Description
•