freebl: POWER SHA-2 digest vector acceleration
Categories
(NSS :: Libraries, enhancement, P5)
Tracking
(Not tracked)
People
(Reporter: johnjmar, Assigned: beurdouche)
References
Details
Attachments
(2 files, 4 obsolete files)
47 bytes,
text/x-phabricator-request
|
Details | Review | |
59.99 KB,
patch
|
Details | Diff | Splinter Review |
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0
Expected results:
Would like to reach community to help improve the performance of SHA256
and SHA512 using POWER8 and POWER9 ISA enhancements, and to provide proof
that achieved performance of SHA-2 is close to optimal for the platform
(i.e matching or beating OpenSSL performance *)
Optimized assembly implementations in the Cryptogams repository [1] may serve
as useful reference. C-based intrinsic implementations are also acceptable
given prior conditions (*).
Financial bounty upon completion, community acceptance of patches, and reviewer input.
Moreover, free access to a ppc64le VM or direct Jenkins CI account can be
obtained at OSU Open Source Lab: https://osuosl.org/services/powerdev/
You can list me (johnjmar@linux.vnet.ibm.com) as the IBM Advocate.
In addition to the crypto instructions, IBM advertises an on-chip crypto accelerator engine. For SHA-512, it's apparently 20 times faster than the crypto instructions, which in turn are faster than plain C. Sadly, the on-chip accelerator requires kernel-level privileges, and Linux does not expose it currently to userspace.
I'll target the crypto instructions like OpenSSL does. Just noting, the hw has even more potential; IBM folks may want to target that, enabling the Power7+ crypto engine for Linux. The kernel already exposes crypto operations to userspace (see libkcapi), so in theory just enabling the kernel to use the engine would also allow userspace to.
Reporter | ||
Updated•5 years ago
|
Reporter | ||
Comment 2•5 years ago
|
||
(In reply to cand from comment #1)
In addition to the crypto instructions, IBM advertises an on-chip crypto accelerator engine. For SHA-512, it's apparently 20 times faster than the crypto instructions, which in turn are faster than plain C. Sadly, the on-chip accelerator requires kernel-level privileges, and Linux does not expose it currently to userspace.
@cand Unfortunately, we haven't actually been able to obtain the theoretical performance benefits from the NX crypto accelerator you mentioned. Hence, since Power 8, we recommend using the in-core VMX crypto instructions rather than the NX crypto. The NX is far away from the core and that creates a great deal of latency for AES and SHA2, which are designed to perform well in software. The data must be byte permuted on LE since NX is inherently BE and aligned, and counter overflows have to be handled in software, further diminishing acceleration gains.
I'll target the crypto instructions like OpenSSL does. Just noting, the hw has even more potential; IBM folks may want to target that, enabling the Power7+ crypto engine for Linux. The kernel already exposes crypto operations to userspace (see libkcapi), so in theory just enabling the kernel to use the engine would also allow userspace to.
Thank you
Updated•5 years ago
|
I have a C intrinsic-based implementation of SHA-256 that matches OpenSSL speed (full bench is a couple % slower due to infra differences), but only when compiled with PGO. Without PGO, it's 40% slower than OpenSSL (clearly gcc's default code generation is not great here, PGO normally has 10-20% improvements).
What's the NSS project's preference in cases like this? C is clearly more maintainable than asm, but most users will not build with PGO. Should I commit the C file separately, without adding it to the build files, and have the PGO-generated asm built? Generating the asm at build time would not work with cross-compiling, etc.
Comment 4•5 years ago
|
||
We will accept .asm
files in patches along with their source; we'll also need the build tooling to recreate the .asm
of course. In lib/freebl/mpi
our Montgomery full multiplication implementation works this way, in mont_mulf.*
. Obviously the barrier to landing is a little higher, as we need to verify that assembly as well.
needinfo to Martin as a double-check that I'm not speaking falsehoods here.
Comment 5•5 years ago
|
||
That sounds about right. What we've done in the past is check in build products where we have a process that can validate them from source. Then we only do minimal verification on the derivative code. This is harder in this case due to a (likely) reliance on less common architectures.
Attaching patch for review.
Benchmark on POWER8 3.3GHz
bltest -H -m sha256 -i tests/sha256/plaintext1 -o /dev/null -5 10
bltest -H -m sha512 -i tests/sha512/plaintext1 -o /dev/null -5 10
Before
mode in opreps cxreps context op time(sec) thrgput
sha256_e 721Mb 13M 0 0.000 10000.000 10.000 72Mb
sha512_e 929Mb 8M 0 0.000 10000.000 10.000 92Mb
After
sha256_e 1Gb 23M 0 0.000 10000.000 10.000 127Mb
sha512_e 2Gb 20M 0 0.000 10000.000 10.000 215Mb
openssl 1.1.1 (same sizes as the bltest test files)
type 57 bytes
sha256 133831.36k (130Mb)
type 113 bytes
sha512 221516.05k (216Mb)
Notes:
- the ppc build is currently failing along with other minority platforms, see bug 1574643
- it turns out PGO wasn't the magic part, but an option that -fprofile-use enables. The actual profile data is not needed, and only gives ~1% more performance
- openssl's infra is slightly faster, SHA parts should be close to equal now
- I couldn't match OpenSSL's SHA-512 speed using C. My best attempt was 17% slower; the asm code does some things with the key table indices I couldn't make gcc do. So for SHA-512 the patch uses the openssl/cryptogams implementation, whose license is compatible with MPL (much of the included Google code uses the same license)
The self tests in the sha512.c file all pass. I also ran
HOST=nss DOMSUF=local BUILD_OPT=1 NSS_TESTS=gtests GTESTS=cryptohi_gtest GTESTFILTER=*Hash* NSS_CYCLES=standard tests/all.sh
where all but one test pass, but that one test fails without my patch too.
certutil: unable to generate key(s)
: SEC_ERROR_PKCS11_GENERAL_ERROR: A PKCS #11 module returned CKR_GENERAL_ERROR, indicating that an unrecoverable error has occurred.
gtests.sh: #1: create certificate: dummy p256 sign - FAILED
Forgot to commit the gyp change. No other changes.
Reporter | ||
Comment 8•5 years ago
|
||
After applying patch 9126777 to local revision 15510 in a Power 9 system, build fails as follows (using gcc 8.3.0):
NSPR [1/5] configure ...
NSPR [2/5] make ...
NSPR [3/5] NOT building tests
NSPR [4/5] NOT running tests
NSPR [5/5] install ...
ninja: Entering directory `/home/ubuntu/libs/nss/out/Debug'
[644/1020] SOLINK /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so
FAILED: /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so.TOC
if [ ! -e /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so -o ! -e /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so.TOC ]; then cc -shared -Wl,-Bsymbolic -Wl,--version-script,obj/lib/freebl/freeblpriv3.gen/out.freebl_hash_vector.def -Wl,--gc-sections -Wl,-z,defs -z noexecstack -o /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so -Wl,-soname=libfreeblpriv3.so @/home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so.rsp && { readelf -d /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so | grep SONAME ; nm -gD -f p /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so | cut -f1-2 -d' '; } > /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so.TOC; else cc -shared -Wl,-Bsymbolic -Wl,--version-script,obj/lib/freebl/freeblpriv3.gen/out.freebl_hash_vector.def -Wl,--gc-sections -Wl,-z,defs -z noexecstack -o /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so -Wl,-soname=libfreeblpriv3.so @/home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so.rsp && { readelf -d /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so | grep SONAME ; nm -gD -f p /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so | cut -f1-2 -d' '; } > /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so.tmp && if ! cmp -s /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so.tmp /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so.TOC; then mv /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so.tmp /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so.TOC ; fi; fi
obj/lib/freebl/freeblpriv3.sha512.o: In function `SHA512_Compress':
/home/ubuntu/libs/nss/out/Debug/../../lib/freebl/sha512.c:1087: undefined reference to `sha512_block_p8'
collect2: error: ld returned 1 exit status
[669/1020] CXX obj/gtests/ssl_gtest/ssl_gtest.selfencrypt_unittest.o
ninja: build stopped: subcommand failed.
Just checking, did you pass the -g (rerun gyp) option to build.sh? If there was a previous build done, I think ninja does not rerun gyp, so any gyp file changes aren't taken into account.
I primarily tested using the Makefiles, not gyp, however the gyp build did work for me.
Reporter | ||
Comment 10•5 years ago
|
||
Ah, that did it. It built after nss/build.sh -g
.
However, looks like the optimizations for SHA-256 are not being picked up?
ubuntu@oprom8:~/libs$ p512=~/libs/nss/cmd/bltest/tests/sha512/plaintext1
ubuntu@oprom8:~/libs$ p256=~/libs/nss/cmd/bltest/tests/sha256/plaintext1
ubuntu@oprom8:~/libs$ bltest=~/libs/dist/Debug/bin/bltest
ubuntu@oprom8:~/libs$ $bltest -H -m sha512 -i $p512 -o /dev/null -p 5000000
# mode in opreps cxreps context op time(sec) thrgput
sha512_e 534Mb 5M 0 0.000 16165.000 16.167 33Mb
ubuntu@oprom8:~/libs$ $bltest -H -m sha256 -i $p256 -o /dev/null -p 5000000
# mode in opreps cxreps context op time(sec) thrgput
sha256_e 267Mb 5M 0 0.000 16292.000 16.294 16Mb
ubuntu@oprom8:~/libs$ cd nss/
ubuntu@oprom8:~/libs/nss$ patch -p1 < ../nss.9126777.patch
patching file lib/freebl/Makefile
patching file lib/freebl/cryptogams/LICENSE
patching file lib/freebl/cryptogams/ppc-xlate.pl
patching file lib/freebl/cryptogams/sha512p8-ppc.pl
patching file lib/freebl/freebl.gyp
Hunk #1 succeeded at 187 (offset 1 line).
patching file lib/freebl/sha512-p8.s
patching file lib/freebl/sha512.c
ubuntu@oprom8:~/libs/nss$ cd ../
ubuntu@oprom8:~/libs$ nss/build.sh -g
NSPR [1/5] configure ...
NSPR [2/5] make ...
NSPR [3/5] NOT building tests
NSPR [4/5] NOT running tests
NSPR [5/5] install ...
ninja: Entering directory `/home/ubuntu/libs/nss/out/Debug'
[1021/1021] STAMP obj/nss_tests.actions_depends.stamp
ubuntu@oprom8:~/libs$ $bltest -H -m sha512 -i $p512 -o /dev/null -p 5000000
# mode in opreps cxreps context op time(sec) thrgput
sha512_e 534Mb 5M 0 0.000 3043.000 3.043 175Mb
ubuntu@oprom8:~/libs$ $bltest -H -m sha256 -i $p256 -o /dev/null -p 5000000
# mode in opreps cxreps context op time(sec) thrgput
sha256_e 267Mb 5M 0 0.000 20653.000 20.654 12Mb
Comment 11•5 years ago
|
||
The Debug build does not use any optimizations at all. Please add the -o param to do a release build (for both before and after tests).
I did find I forgot to add the specific flag to gyp, and that turned out a bit harder than expected. Gyp does not support per-file cflags, and then the file needs to be compiled twice, once for the static freebl and once for the shared one. Adding the gyp flag changes as a separate patch, instead of squashing, so that people more familiar with gyp can review easily.
Comment 12•5 years ago
|
||
Reporter | ||
Comment 13•5 years ago
|
||
Thanks. Results are as follows in a Power 9 system (starting from scratch):
ubuntu@oprom8:~/libs$ bltest=~/libs/dist/Release/bin/bltest
ubuntu@oprom8:~/libs$ export LD_LIBRARY_PATH=~/libs/dist/Release/lib
ubuntu@oprom8:~/libs$ $bltest -H -m sha512 -i $p512 -o /dev/null -p 5000000
# mode in opreps cxreps context op time(sec) thrgput
sha512_e 534Mb 5M 0 0.000 5052.000 5.054 105Mb
ubuntu@oprom8:~/libs$ $bltest -H -m sha256 -i $p256 -o /dev/null -p 5000000
# mode in opreps cxreps context op time(sec) thrgput
sha256_e 267Mb 5M 0 0.000 3260.000 3.262 81Mb
ubuntu@oprom8:~/libs$ cd nss
ubuntu@oprom8:~/libs/nss$ patch -p1 < ../nss.9126777.patch
patching file lib/freebl/Makefile
patching file lib/freebl/cryptogams/LICENSE
patching file lib/freebl/cryptogams/ppc-xlate.pl
patching file lib/freebl/cryptogams/sha512p8-ppc.pl
patching file lib/freebl/freebl.gyp
Hunk #1 succeeded at 187 (offset 1 line).
patching file lib/freebl/sha512-p8.s
patching file lib/freebl/sha512.c
ubuntu@oprom8:~/libs/nss$ patch -p1 < ../nss.9127802.patch
patching file lib/freebl/freebl.gyp
Hunk #1 succeeded at 200 (offset 1 line).
Hunk #2 succeeded at 326 (offset 1 line).
Hunk #3 succeeded at 388 (offset 1 line).
patching file lib/freebl/freebl_base.gypi
ubuntu@oprom8:~/libs/nss$ cd ../
ubuntu@oprom8:~/libs$ rm -rf dist/
ubuntu@oprom8:~/libs$ nss/build.sh -o -g
NSPR [1/5] configure ...
NSPR [2/5] make ...
NSPR [3/5] NOT building tests
NSPR [4/5] NOT running tests
NSPR [5/5] install ...
ninja: Entering directory '/home/ubuntu/libs/nss/out/Release'
[1022/1022] STAMP obj/nss_tests.actions_depends.stamp
ubuntu@oprom8:~/libs$ $bltest -H -m sha512 -i $p512 -o /dev/null -p 5000000
# mode in opreps cxreps context op time(sec) thrgput
sha512_e 534Mb 5M 0 0.000 2791.000 2.792 191Mb
ubuntu@oprom8:~/libs$ $bltest -H -m sha256 -i $p256 -o /dev/null -p 5000000
# mode in opreps cxreps context op time(sec) thrgput
sha256_e 267Mb 5M 0 0.000 2325.000 2.326 114Mb
Reporter | ||
Comment 14•5 years ago
|
||
Hi, any updates on the patch review process?
Comment 15•5 years ago
|
||
Assignee | ||
Comment 16•5 years ago
|
||
Assignee | ||
Comment 17•5 years ago
|
||
Depends on D70519
Assignee | ||
Comment 18•5 years ago
|
||
I imported the patches in Phabricator and made a review.
Due to a feature/bug of Phabricator I couldn't set the state to "Request Changes" state
but my review is there as comments. There are a few changes to make and a few remarks but
I didn't see any major issues with that patch... : )
Comment 19•5 years ago
|
||
The scripts come from
https://github.com/dot-asm/cryptogams/tree/master/ppc
instead of OpenSSL directly, as there's the license difference to consider.
Regarding subdirs, scripts/ is fine with me, I'm just wondering where did the cryptogams/ subdir disappear to. It was in the patch, but it's not in the phabricator?
Assignee | ||
Comment 20•5 years ago
|
||
I removed it for some reason and forgot to re-add it. So I think that just renaming that folder to scripts
(so that we can add other things there in the future) in the next iteration of the patch would be completely fine.
Comment 21•5 years ago
|
||
This will have to be blocked on license review and the actions needed to satisfy that, both for NSS and Firefox.
Mike, is that still you? Can you add this submission to your resolution queue? Thanks.
Comment 22•5 years ago
|
||
That's still me, and this is a straightforward BSD-or-GPL license; because of that first set of BSD-style clauses, this is acceptably-licensed code.
We can't bring GPL'ed code in-tree, but GPL-or-[something acceptable] is routine, and this qualifies.
Comment 23•5 years ago
|
||
OK thanks, I assume we need to file bugs to add it to about:license
to satisfy the BSD binary acknowledgement?
And probably heads-up to RedHat about them needing to do the same.
Comment 24•5 years ago
|
||
v3:
- Rebase to current master
- Include gyp patch
- Replace -fprofile-use with the loop subflags
- Rename cryptogams/ -> scripts/
- Add a script for the generation args
- Factor out the ppc_crypto defines
- Factor out the sha256 unrolled rounds, move round define above the function
I don't know of any public PPC CI however, it's just tested on my personal machine.
Updated•5 years ago
|
Updated•5 years ago
|
Comment 25•5 years ago
|
||
Assignee | ||
Updated•5 years ago
|
Comment 26•5 years ago
|
||
Whoops - sorry for the delay. Yes, as usual you should add the relevant license text to toolkit/content/license.html along with a note about what files or folder it applies to.
Comment 27•5 years ago
|
||
Would it be possible to change this such that the POWER8 acceleration is used only on ppc64le systems but not ppc64 systems, i.e. big-endian machines?
The reason is that the majority of systems (all systems?) running POWER8 are actually little-endian while the big-endian systems are usually on POWER5 or lower.
Comment 28•5 years ago
|
||
It is LE-only? See the USE_PPC_CRYPTO define. The patch in bug 1642174 will fix the build issue on the asm file getting built on such systems - it wasn't called, just caused link time trouble due to ABI differences.
(we've seen one or two such recent-power BE reports, so at least those users are vocal about their existence ;))
Comment 29•5 years ago
|
||
Yep, already found that report and I can confirm that the fix works.
FWIW, according to statistics, we have similar numbers of users for both ppc64le and ppc64 in Debian. There is a surprising number of loyal Apple PowerMac users.
Description
•