Closed Bug 1613238 Opened 6 months ago Closed 4 months ago

freebl: POWER SHA-2 digest vector acceleration

Categories

(NSS :: Libraries, enhancement, P5)

Other
Unspecified
enhancement

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: johnjmar, Assigned: beurdouche)

References

(Regressed 1 open bug)

Details

Attachments

(2 files, 4 obsolete files)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0

Expected results:

Would like to reach community to help improve the performance of SHA256
and SHA512 using POWER8 and POWER9 ISA enhancements, and to provide proof
that achieved performance of SHA-2 is close to optimal for the platform
(i.e matching or beating OpenSSL performance *)

Optimized assembly implementations in the Cryptogams repository [1] may serve
as useful reference. C-based intrinsic implementations are also acceptable
given prior conditions (*).

Financial bounty upon completion, community acceptance of patches, and reviewer input.

Moreover, free access to a ppc64le VM or direct Jenkins CI account can be
obtained at OSU Open Source Lab: https://osuosl.org/services/powerdev/
You can list me (johnjmar@linux.vnet.ibm.com) as the IBM Advocate.

[1] https://github.com/dot-asm/cryptogams/

In addition to the crypto instructions, IBM advertises an on-chip crypto accelerator engine. For SHA-512, it's apparently 20 times faster than the crypto instructions, which in turn are faster than plain C. Sadly, the on-chip accelerator requires kernel-level privileges, and Linux does not expose it currently to userspace.

I'll target the crypto instructions like OpenSSL does. Just noting, the hw has even more potential; IBM folks may want to target that, enabling the Power7+ crypto engine for Linux. The kernel already exposes crypto operations to userspace (see libkcapi), so in theory just enabling the kernel to use the engine would also allow userspace to.

Summary: freebl: POWER SHA-2 stream digest vector acceleration → freebl: POWER SHA-2 digest vector acceleration

(In reply to cand from comment #1)

In addition to the crypto instructions, IBM advertises an on-chip crypto accelerator engine. For SHA-512, it's apparently 20 times faster than the crypto instructions, which in turn are faster than plain C. Sadly, the on-chip accelerator requires kernel-level privileges, and Linux does not expose it currently to userspace.

@cand Unfortunately, we haven't actually been able to obtain the theoretical performance benefits from the NX crypto accelerator you mentioned. Hence, since Power 8, we recommend using the in-core VMX crypto instructions rather than the NX crypto. The NX is far away from the core and that creates a great deal of latency for AES and SHA2, which are designed to perform well in software. The data must be byte permuted on LE since NX is inherently BE and aligned, and counter overflows have to be handled in software, further diminishing acceleration gains.

I'll target the crypto instructions like OpenSSL does. Just noting, the hw has even more potential; IBM folks may want to target that, enabling the Power7+ crypto engine for Linux. The kernel already exposes crypto operations to userspace (see libkcapi), so in theory just enabling the kernel to use the engine would also allow userspace to.

Thank you

Status: UNCONFIRMED → NEW
Ever confirmed: true
Priority: -- → P5
Hardware: Unspecified → Other
See Also: → 1613236, 1613235

I have a C intrinsic-based implementation of SHA-256 that matches OpenSSL speed (full bench is a couple % slower due to infra differences), but only when compiled with PGO. Without PGO, it's 40% slower than OpenSSL (clearly gcc's default code generation is not great here, PGO normally has 10-20% improvements).

What's the NSS project's preference in cases like this? C is clearly more maintainable than asm, but most users will not build with PGO. Should I commit the C file separately, without adding it to the build files, and have the PGO-generated asm built? Generating the asm at build time would not work with cross-compiling, etc.

We will accept .asm files in patches along with their source; we'll also need the build tooling to recreate the .asm of course. In lib/freebl/mpi our Montgomery full multiplication implementation works this way, in mont_mulf.*. Obviously the barrier to landing is a little higher, as we need to verify that assembly as well.

needinfo to Martin as a double-check that I'm not speaking falsehoods here.

Flags: needinfo?(mt)

That sounds about right. What we've done in the past is check in build products where we have a process that can validate them from source. Then we only do minimal verification on the derivative code. This is harder in this case due to a (likely) reliance on less common architectures.

Flags: needinfo?(mt)
Attached patch ppc-sha.patch (obsolete) — Splinter Review

Attaching patch for review.

Benchmark on POWER8 3.3GHz
bltest -H -m sha256 -i tests/sha256/plaintext1 -o /dev/null -5 10
bltest -H -m sha512 -i tests/sha512/plaintext1 -o /dev/null -5 10

Before
     mode          in  opreps  cxreps     context          op   time(sec)     thrgput
  sha256_e       721Mb     13M       0       0.000   10000.000      10.000        72Mb
  sha512_e       929Mb      8M       0       0.000   10000.000      10.000        92Mb
After
  sha256_e         1Gb     23M       0       0.000   10000.000      10.000       127Mb
  sha512_e         2Gb     20M       0       0.000   10000.000      10.000       215Mb

openssl 1.1.1 (same sizes as the bltest test files)

type             57 bytes
sha256          133831.36k (130Mb)

type            113 bytes
sha512          221516.05k (216Mb)

Notes:

  • the ppc build is currently failing along with other minority platforms, see bug 1574643
  • it turns out PGO wasn't the magic part, but an option that -fprofile-use enables. The actual profile data is not needed, and only gives ~1% more performance
  • openssl's infra is slightly faster, SHA parts should be close to equal now
  • I couldn't match OpenSSL's SHA-512 speed using C. My best attempt was 17% slower; the asm code does some things with the key table indices I couldn't make gcc do. So for SHA-512 the patch uses the openssl/cryptogams implementation, whose license is compatible with MPL (much of the included Google code uses the same license)

The self tests in the sha512.c file all pass. I also ran
HOST=nss DOMSUF=local BUILD_OPT=1 NSS_TESTS=gtests GTESTS=cryptohi_gtest GTESTFILTER=*Hash* NSS_CYCLES=standard tests/all.sh

where all but one test pass, but that one test fails without my patch too.

certutil: unable to generate key(s)
: SEC_ERROR_PKCS11_GENERAL_ERROR: A PKCS #11 module returned CKR_GENERAL_ERROR, indicating that an unrecoverable error has occurred.
gtests.sh: #1: create certificate: dummy p256 sign  - FAILED
Attachment #9126757 - Flags: review?(jjones)
Attached patch ppc-sha-v2.patch (obsolete) — Splinter Review

Forgot to commit the gyp change. No other changes.

Attachment #9126757 - Attachment is obsolete: true
Attachment #9126757 - Flags: review?(jjones)
Attachment #9126777 - Flags: review?(jjones)

After applying patch 9126777 to local revision 15510 in a Power 9 system, build fails as follows (using gcc 8.3.0):

NSPR [1/5] configure ...
NSPR [2/5] make ...
NSPR [3/5] NOT building tests
NSPR [4/5] NOT running tests
NSPR [5/5] install ...
ninja: Entering directory `/home/ubuntu/libs/nss/out/Debug'
[644/1020] SOLINK /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so
FAILED: /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so.TOC 
if [ ! -e /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so -o ! -e /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so.TOC ]; then cc -shared -Wl,-Bsymbolic -Wl,--version-script,obj/lib/freebl/freeblpriv3.gen/out.freebl_hash_vector.def -Wl,--gc-sections -Wl,-z,defs -z noexecstack -o /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so -Wl,-soname=libfreeblpriv3.so @/home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so.rsp && { readelf -d /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so | grep SONAME ; nm -gD -f p /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so | cut -f1-2 -d' '; } > /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so.TOC; else cc -shared -Wl,-Bsymbolic -Wl,--version-script,obj/lib/freebl/freeblpriv3.gen/out.freebl_hash_vector.def -Wl,--gc-sections -Wl,-z,defs -z noexecstack -o /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so -Wl,-soname=libfreeblpriv3.so @/home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so.rsp && { readelf -d /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so | grep SONAME ; nm -gD -f p /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so | cut -f1-2 -d' '; } > /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so.tmp && if ! cmp -s /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so.tmp /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so.TOC; then mv /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so.tmp /home/ubuntu/libs/dist/Debug/lib/libfreeblpriv3.so.TOC ; fi; fi
obj/lib/freebl/freeblpriv3.sha512.o: In function `SHA512_Compress':
/home/ubuntu/libs/nss/out/Debug/../../lib/freebl/sha512.c:1087: undefined reference to `sha512_block_p8'
collect2: error: ld returned 1 exit status
[669/1020] CXX obj/gtests/ssl_gtest/ssl_gtest.selfencrypt_unittest.o
ninja: build stopped: subcommand failed.

Just checking, did you pass the -g (rerun gyp) option to build.sh? If there was a previous build done, I think ninja does not rerun gyp, so any gyp file changes aren't taken into account.

I primarily tested using the Makefiles, not gyp, however the gyp build did work for me.

Ah, that did it. It built after nss/build.sh -g.
However, looks like the optimizations for SHA-256 are not being picked up?

ubuntu@oprom8:~/libs$ p512=~/libs/nss/cmd/bltest/tests/sha512/plaintext1
ubuntu@oprom8:~/libs$ p256=~/libs/nss/cmd/bltest/tests/sha256/plaintext1
ubuntu@oprom8:~/libs$ bltest=~/libs/dist/Debug/bin/bltest
ubuntu@oprom8:~/libs$ $bltest -H -m sha512 -i $p512 -o /dev/null -p 5000000
#     mode          in  opreps  cxreps     context          op   time(sec)     thrgput
  sha512_e       534Mb      5M       0       0.000   16165.000      16.167        33Mb
ubuntu@oprom8:~/libs$ $bltest -H -m sha256 -i $p256 -o /dev/null -p 5000000
#     mode          in  opreps  cxreps     context          op   time(sec)     thrgput
  sha256_e       267Mb      5M       0       0.000   16292.000      16.294        16Mb
ubuntu@oprom8:~/libs$ cd nss/
ubuntu@oprom8:~/libs/nss$ patch -p1 < ../nss.9126777.patch 
patching file lib/freebl/Makefile
patching file lib/freebl/cryptogams/LICENSE
patching file lib/freebl/cryptogams/ppc-xlate.pl
patching file lib/freebl/cryptogams/sha512p8-ppc.pl
patching file lib/freebl/freebl.gyp
Hunk #1 succeeded at 187 (offset 1 line).
patching file lib/freebl/sha512-p8.s
patching file lib/freebl/sha512.c
ubuntu@oprom8:~/libs/nss$ cd ../
ubuntu@oprom8:~/libs$ nss/build.sh -g
NSPR [1/5] configure ...
NSPR [2/5] make ...
NSPR [3/5] NOT building tests
NSPR [4/5] NOT running tests
NSPR [5/5] install ...
ninja: Entering directory `/home/ubuntu/libs/nss/out/Debug'
[1021/1021] STAMP obj/nss_tests.actions_depends.stamp
ubuntu@oprom8:~/libs$ $bltest -H -m sha512 -i $p512 -o /dev/null -p 5000000
#     mode          in  opreps  cxreps     context          op   time(sec)     thrgput
  sha512_e       534Mb      5M       0       0.000    3043.000       3.043       175Mb
ubuntu@oprom8:~/libs$ $bltest -H -m sha256 -i $p256 -o /dev/null -p 5000000
#     mode          in  opreps  cxreps     context          op   time(sec)     thrgput
  sha256_e       267Mb      5M       0       0.000   20653.000      20.654        12Mb

The Debug build does not use any optimizations at all. Please add the -o param to do a release build (for both before and after tests).

I did find I forgot to add the specific flag to gyp, and that turned out a bit harder than expected. Gyp does not support per-file cflags, and then the file needs to be compiled twice, once for the static freebl and once for the shared one. Adding the gyp flag changes as a separate patch, instead of squashing, so that people more familiar with gyp can review easily.

Attached patch gyp.patch (obsolete) — Splinter Review

Thanks. Results are as follows in a Power 9 system (starting from scratch):

ubuntu@oprom8:~/libs$ bltest=~/libs/dist/Release/bin/bltest
ubuntu@oprom8:~/libs$ export LD_LIBRARY_PATH=~/libs/dist/Release/lib
ubuntu@oprom8:~/libs$ $bltest -H -m sha512 -i $p512 -o /dev/null -p 5000000
#     mode          in  opreps  cxreps     context          op   time(sec)     thrgput
  sha512_e       534Mb      5M       0       0.000    5052.000       5.054       105Mb
ubuntu@oprom8:~/libs$ $bltest -H -m sha256 -i $p256 -o /dev/null -p 5000000
#     mode          in  opreps  cxreps     context          op   time(sec)     thrgput
  sha256_e       267Mb      5M       0       0.000    3260.000       3.262        81Mb

ubuntu@oprom8:~/libs$ cd nss
ubuntu@oprom8:~/libs/nss$ patch -p1 < ../nss.9126777.patch 
patching file lib/freebl/Makefile
patching file lib/freebl/cryptogams/LICENSE
patching file lib/freebl/cryptogams/ppc-xlate.pl
patching file lib/freebl/cryptogams/sha512p8-ppc.pl
patching file lib/freebl/freebl.gyp
Hunk #1 succeeded at 187 (offset 1 line).
patching file lib/freebl/sha512-p8.s
patching file lib/freebl/sha512.c
ubuntu@oprom8:~/libs/nss$ patch -p1 < ../nss.9127802.patch 
patching file lib/freebl/freebl.gyp
Hunk #1 succeeded at 200 (offset 1 line).
Hunk #2 succeeded at 326 (offset 1 line).
Hunk #3 succeeded at 388 (offset 1 line).
patching file lib/freebl/freebl_base.gypi
ubuntu@oprom8:~/libs/nss$ cd ../
ubuntu@oprom8:~/libs$ rm -rf dist/
ubuntu@oprom8:~/libs$ nss/build.sh -o -g
NSPR [1/5] configure ...
NSPR [2/5] make ...
NSPR [3/5] NOT building tests
NSPR [4/5] NOT running tests
NSPR [5/5] install ...
ninja: Entering directory '/home/ubuntu/libs/nss/out/Release'
[1022/1022] STAMP obj/nss_tests.actions_depends.stamp
ubuntu@oprom8:~/libs$ $bltest -H -m sha512 -i $p512 -o /dev/null -p 5000000
#     mode          in  opreps  cxreps     context          op   time(sec)     thrgput
  sha512_e       534Mb      5M       0       0.000    2791.000       2.792       191Mb
ubuntu@oprom8:~/libs$ $bltest -H -m sha256 -i $p256 -o /dev/null -p 5000000
#     mode          in  opreps  cxreps     context          op   time(sec)     thrgput
  sha256_e       267Mb      5M       0       0.000    2325.000       2.326       114Mb

Hi, any updates on the patch review process?

Flags: needinfo?(jjones)
Comment on attachment 9126777 [details] [diff] [review]
ppc-sha-v2.patch

Ben, I can't get to this given everything else. Can you?
Flags: needinfo?(jjones)
Attachment #9126777 - Flags: review?(jjones) → review?(bbeurdouche)

Depends on D70519

I imported the patches in Phabricator and made a review.
Due to a feature/bug of Phabricator I couldn't set the state to "Request Changes" state
but my review is there as comments. There are a few changes to make and a few remarks but
I didn't see any major issues with that patch... : )

Assignee: nobody → bbeurdouche

The scripts come from
https://github.com/dot-asm/cryptogams/tree/master/ppc
instead of OpenSSL directly, as there's the license difference to consider.

Regarding subdirs, scripts/ is fine with me, I'm just wondering where did the cryptogams/ subdir disappear to. It was in the patch, but it's not in the phabricator?

I removed it for some reason and forgot to re-add it. So I think that just renaming that folder to scripts (so that we can add other things there in the future) in the next iteration of the patch would be completely fine.

This will have to be blocked on license review and the actions needed to satisfy that, both for NSS and Firefox.

Mike, is that still you? Can you add this submission to your resolution queue? Thanks.

Flags: needinfo?(mhoye)

That's still me, and this is a straightforward BSD-or-GPL license; because of that first set of BSD-style clauses, this is acceptably-licensed code.

We can't bring GPL'ed code in-tree, but GPL-or-[something acceptable] is routine, and this qualifies.

Flags: needinfo?(mhoye)

OK thanks, I assume we need to file bugs to add it to about:license to satisfy the BSD binary acknowledgement?

And probably heads-up to RedHat about them needing to do the same.

Flags: needinfo?(mhoye)
Attached patch ppc-sha-v3.patchSplinter Review

v3:

  • Rebase to current master
  • Include gyp patch
  • Replace -fprofile-use with the loop subflags
  • Rename cryptogams/ -> scripts/
  • Add a script for the generation args
  • Factor out the ppc_crypto defines
  • Factor out the sha256 unrolled rounds, move round define above the function

I don't know of any public PPC CI however, it's just tested on my personal machine.

Attachment #9126777 - Attachment is obsolete: true
Attachment #9127802 - Attachment is obsolete: true
Attachment #9126777 - Flags: review?(bbeurdouche)
Attachment #9140376 - Flags: review?(bbeurdouche)
Attachment #9139726 - Attachment description: Bug 1613238 - POWER SHA-2 digest vector acceleration 1/2. → Bug 1613238 - POWER SHA-2 digest vector acceleration. r=jcj
Attachment #9139727 - Attachment is obsolete: true
Status: NEW → RESOLVED
Closed: 4 months ago
Resolution: --- → FIXED
Target Milestone: --- → 3.52
Attachment #9140376 - Flags: review?(bbeurdouche)
Regressions: 1635625

Whoops - sorry for the delay. Yes, as usual you should add the relevant license text to toolkit/content/license.html along with a note about what files or folder it applies to.

Flags: needinfo?(mhoye)
Regressions: 1642174

Would it be possible to change this such that the POWER8 acceleration is used only on ppc64le systems but not ppc64 systems, i.e. big-endian machines?

The reason is that the majority of systems (all systems?) running POWER8 are actually little-endian while the big-endian systems are usually on POWER5 or lower.

It is LE-only? See the USE_PPC_CRYPTO define. The patch in bug 1642174 will fix the build issue on the asm file getting built on such systems - it wasn't called, just caused link time trouble due to ABI differences.

(we've seen one or two such recent-power BE reports, so at least those users are vocal about their existence ;))

Yep, already found that report and I can confirm that the fix works.

FWIW, according to statistics, we have similar numbers of users for both ppc64le and ppc64 in Debian. There is a surprising number of loyal Apple PowerMac users.

You need to log in before you can comment on or make changes to this bug.