freebl: POWER AES-GCM Vector Acceleration
Categories
(NSS :: Libraries, enhancement, P5)
Tracking
(Not tracked)
People
(Reporter: johnjmar, Assigned: maamoun.tk)
References
Details
(Keywords: perf)
Attachments
(3 files, 6 obsolete files)
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0
Expected results:
Use POWER8 and POWER9 ISA enhancements to improve the performance of AES-GCM.
Reporter | ||
Comment 1•6 years ago
|
||
Optimized implementations in the Cryptogams repository[1] may serve as useful references.
[1] https://github.com/dot-asm/cryptogams/
Comment 2•6 years ago
|
||
POWER is not one of our officially supported architectures [0], but we would happily review and accept patches for POWER from the community. We know it's a very useful architecture - it's just a matter of limited time on our side, and nothing to test with. Thanks!
Comment 4•5 years ago
|
||
I don't believe there's anyone lined up to work on it, vbrkov - feel free to take it. Happy to review the patch!
I intended to work on this, but since I hit the gcc bug on the other issue, just in case it would also affect this, I wanted to wait for it to be fixed. So I haven't written any code for this issue, and you're welcome to do so.
@cand
I've started to do patch. Do you have any code for this? If yes then my work on it doesn't make sence.
No, I was just thinking of starting. If you're already working on it, then I won't.
Reporter | ||
Comment 9•5 years ago
|
||
@cand, @vbrkov, for clarification, let me restate the task description:
Using POWER8 and POWER9 ISA enhancements, improve and provide proof that achieved performance of AES-GCM is close to optimal for the platform (i.e matching or beating OpenSSL performance).
Optimized assembly implementations in the Cryptogams repository [1]
may serve as useful reference. C-based intrinsic implementations are discouraged. Financial bounty upon completion, community acceptance of patches, and bounty reviewer input.
Reporter | ||
Comment 10•5 years ago
|
||
(In reply to John Martinez from comment #9)
@cand, @vbrkov, for clarification, let me restate the task description:
Using POWER8 and POWER9 ISA enhancements, improve and provide proof that achieved performance of AES-GCM is close to optimal for the platform (i.e matching or beating OpenSSL performance).
Optimized assembly implementations in the Cryptogams repository
[1]
may serve as useful reference. C-based intrinsic implementations are discouraged. Financial bounty upon completion, community acceptance of patches, and bounty reviewer input.
One note of clarification:
For bounty purposes, as long as the intrinsic implementation matches or beats OpenSSL performance, it is acceptable.
Reporter | ||
Comment 11•5 years ago
|
||
Hello,
Is anyone working on this? Just want to confirm.
Comment 12•5 years ago
|
||
John,
Yes, I work on it. I have AES optimized implementation already, now have problems with endianness.
Reporter | ||
Comment 13•5 years ago
|
||
Hi @vbrkov. Have you made any progress since the last time I asked?
Comment 14•5 years ago
|
||
Hi @John, unfortunately I haven't done any progress on it yet (got my laptop broken). However I continue to work kn it this week.
Reporter | ||
Comment 15•5 years ago
|
||
Hi @vbrkov.
Do you have any updates?
Assignee | ||
Comment 16•5 years ago
|
||
This patch optimizes both AES_CTR and GCM for PowerPC64 to achieve the
optimal performance on this platform.
Before this patch GCM is slightly optimized to take advantage of hardware
crypto-specific instruction vpmsumd
. In order to achieve the optimal
performance this patch add new implementation of GCM with new features:
- Addressing 4 blocks (128-bit each) per loop (Before only one block per
loop is addressed) this improves the parallelism by applying the equation:
Digest = (((((((Digest⊕C0)*H)⊕C1)*H)⊕C2)H)⊕C3)H =>
Digest = ((Digest⊕C0)H4)⊕(C1H3)⊕(C2H2)⊕(C3H) [1] - Handling Bit-reflection of the multiplication product [1]
- Karatsuba Algorithm [1] (Used before this patch)
- Deferred Recombination of partial products [1]
- Multiplication-based reduction [2]
This patch also optimizes AES_CTR by using built-in AES instructionsvcipher..
with assistance of other vector instructions such asvadduwm
to increase the
counter value.
Benchmark on POWER8 3.425GHz
(measuring AES_GCM encryption performance for a 16KB buffer, repeated 10000 times)
bltest -E -m aes_gcm -p 10000 -o /dev/null -g 16 -b 16384
Before (C implementation of AES_CTR. Slightly GCM hardware optimization)
mode in symmkey opreps cxreps context op time(sec) thrgput
aes_gcm_e 156Mb 128 10T 0 0.000 1413.000 1.413 110Mb
After (AES_CTR and GCM optimizations)
mode in symmkey opreps cxreps context op time(sec) thrgput
aes_gcm_e 156Mb 128 10T 0 0.000 64.000 0.064 2Gb
Cycles per byte
Before
29.7
After
1.6
[1]
https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/communications-ia-galois-counter-mode-paper.pdf
[2]
https://www.intel.com/content/dam/www/public/us/en/documents/software-support/enabling-high-performance-gcm.pdf
Comment 17•4 years ago
|
||
Hi. The original submitter of this bug, John Martinez, left IBM. For now, I'll be the designated contact from IBM.
Assignee | ||
Comment 18•4 years ago
|
||
Hi George,
Thank you for participation. As you already know, there is an enhanced algorithm for optimizing GCM on PowerPC architecture that has been implemented in GNU nettle library, I'm planning to update this patch too to take advantage of the enhanced algorithm but it seems Mozilla writing a policy for such contributions, hope it takes no longer. Once the policy is in place I will update this patch to the new algorithm so you can review it.
Assignee | ||
Comment 19•4 years ago
|
||
This patch implements enhanced algorithm of optimizing GCM on PowerPC architecture, the enhanced algorithm already implemented and approved in GNU nettle library and it's now released in nettle version 3.7
. Considering AES-GCM is the most widely used TLS cipher in Firefox with 88% of total usage at the time of writhing this blog and now hits 98% of total TLS cipher usage in Firefox, I considered to implement the enhanced algorithm to squeeze the potential performance so PowerPC users could get the maximum experience when using products that depend on NSS.
To facilitate the task of reviewing the patch, I followed the same scheme used to optimize AES-GCM by intel. ppc-gcm-wrap.c is forked from intel-gcm-wrap.c for that reason. Also if Mozilla team has difficulties reviewing certain points in this patch, I believe that IBM team could help here, we had similar workflow in GNU nettle library to optimize AES-GCM and everything has gone smoothly.
Benchmark on POWER9 2.2GHz
(measuring AES_GCM encryption performance for a 16KB buffer, repeated 10000 times)
bltest -E -m aes_gcm -p 10000 -o /dev/null -g 16 -b 16384
Before (C implementation of AES_CTR and slightly GCM hardware optimization)
mode in symmkey opreps cxreps context op time(sec) thrgput
aes_gcm_e 156Mb 128 10T 0 0.000 2317.000 2.317 67Mb
After (GCM optimization)
mode in symmkey opreps cxreps context op time(sec) thrgput
aes_gcm_e 156Mb 128 10T 0 0.000 119.000 0.119 1Gb
Comment 20•4 years ago
|
||
Indeed, the GHASH optimization in Nettle greatly improved performance on ppc64le. We on the IBM side are happy to help test and review this patch.
Comment 21•4 years ago
|
||
Has a policy for platform-specific optimized code been posted?
Assignee | ||
Comment 22•4 years ago
|
||
I don't know, I saw it an initiative from Mozilla team here so I updated the patch.
Comment 23•4 years ago
|
||
Hi both,
Since Mamone updated the patch last, will he be the one to handle these changes? If so, can you please push the patch to Phabricator?
Georges, from an initial look at the patch, it is almost certain that I'll need a few other reviews than mine before I can land this in NSS.
Assignee | ||
Comment 24•4 years ago
|
||
Assignee | ||
Comment 25•4 years ago
|
||
Rebuilt and submitted to Phabricator.
Assignee | ||
Comment 26•4 years ago
|
||
Rebased*
Comment 27•4 years ago
|
||
Hi Benjamin. I've asked two developers from my team to create an account here to help with the review.
Comment 28•4 years ago
|
||
@mt, can you please have a quick look at the phabricator patch? I'd like a second risk assessment in taking handwritten GCM assembly for PPC in NSS. When doing so, please keep in mind that George and IBM have agreed provide additional reviewing and support for this patch. Thanks!
Comment 29•4 years ago
|
||
Hi - I work with George at IBM and can help review the powerpc64 assembly bits in the patch.
I will post my comments in the next few days.
Comment 30•4 years ago
|
||
Hi Christopher, thanks for having a look, this is very helpful.
Mamone,
Martin, posted a "no red flags" in the Phabricator patch, so we are good to proceed : )
I have a very intense week, so I cannot guarantee that I can review this week, but I'll make sure we proceed ASAP.
In the meantime, can you please update the patch to fix the coding-style by running mach clang-format
at the top-level of NSS (it uses Docker)?
(you can amend your current commit, as long as the same phab link is in the commit message, moz-phab submit
will update the same patch)
Updated•4 years ago
|
Assignee | ||
Comment 31•4 years ago
|
||
Assignee | ||
Comment 32•4 years ago
|
||
(In reply to Benjamin Beurdouche [:beurdouche] from comment #30)
Mamone,
Martin, posted a "no red flags" in the Phabricator patch, so we are good to proceed : )
I have a very intense week, so I cannot guarantee that I can review this week, but I'll make sure we proceed ASAP.
In the meantime, can you please update the patch to fix the coding-style by runningmach clang-format
at the top-level of NSS (it uses Docker)?
(you can amend your current commit, as long as the same phab link is in the commit message,moz-phab submit
will update the same patch)
Thank you both.
I submitted the formatted code to Phabricator but it created a new revision, is that what it's meant or I missed something?
Comment 33•4 years ago
|
||
(In reply to Mamone from comment #32)
[...]
I submitted the formatted code to Phabricator but it created a new revision, is that what it's meant or I missed something?
@Mamone, I got quite confused with Phabricator, too, but once you get used a little bit to it, it's nice.
My understanding (though not 100% sure) is that you need to
- use the same title "Bug 1566124 - Optimize AES-GCM for ppc64le"
- have https://phabricator.services.mozilla.com/D108221 within the commit message.
No need to mention "Format ppc-gcm-wrap.c and ppc-gcm.h using clang-format", its like you sending the initial commit in.
Assignee | ||
Comment 34•4 years ago
|
||
Assignee | ||
Comment 35•4 years ago
|
||
Updated•4 years ago
|
Updated•4 years ago
|
Assignee | ||
Comment 36•4 years ago
|
||
Rebased in a new reversion.
Assignee | ||
Comment 37•4 years ago
|
||
Assignee | ||
Comment 38•4 years ago
|
||
This bug can be closed now and marked as resolved, the patch has been landed to the main repository.
Assignee | ||
Comment 39•4 years ago
|
||
@bbeurdouche could you please close the bug since the reporter is no longer active.
Comment 40•4 years ago
|
||
Waiting for the landing in Firefox as the patch is quite large, I don't expect this patch will break anything though... : )
Updated•4 years ago
|
Updated•4 years ago
|
Comment 41•4 years ago
|
||
Comment 42•4 years ago
|
||
Georges, Mamone,
I landed the main branch of NSS 3.65, which contains this patch, in Firefox and it doesn't seem to cause trouble
for now, so I think we can call it resolved :)
Thanks Mamone for this contribution, I am sure the PPC community will enjoy this change!
Thanks to all that contributed to the discussion, and in particular to Christopher and John Paul for reviewing the patch! : )
Updated•4 years ago
|
Updated•4 years ago
|
Comment 43•4 years ago
|
||
I've tried to pick up nss 3.65 into fedora, and ppcle is failing all.sh. Setting NSS_DISABLE_PPC_GHASH=1 causes the tests to succeed. The test cases I saw failing were 2nd handshake client auth cases, the server would send the second SendCertificate message and the client would blow up trying to parse it. The whole message block (which includes other protocol messages) arrived corrupted on the client after decrypting with AES GCM. Corruption started at byte 6, continued for 104 bytes, 10 more bytes of good data, followed more bytes of corrupted data.
Builds with failing tests:
https://koji.fedoraproject.org/koji/taskinfo?taskID=68803260
Builds with PPC_GHASH turned off (which succeed):
https://koji.fedoraproject.org/koji/taskinfo?taskID=68875498
cpuinfo on the platform:
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 160
On-line CPU(s) list: 0-159
Model name: POWER9, altivec supported
Model: 2.2 (pvr 004e 1202)
Thread(s) per core: 4
Core(s) per socket: 20
Socket(s): 2
Frequency boost: enabled
CPU max MHz: 3800.0000
CPU min MHz: 2166.0000
Caches (sum of all):
L1d: 1.3 MiB (40 instances)
L1i: 1.3 MiB (40 instances)
L2: 10 MiB (20 instances)
L3: 200 MiB (20 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-79
NUMA node8 CPU(s): 80-159
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Mitigation; RFI Flush, L1D private per thread
Mds: Not affected
Meltdown: Mitigation; RFI Flush, L1D private per thread
Spec store bypass: Mitigation; Kernel entry/exit barrier (eieio)
Spectre v1: Mitigation; __user pointer sanitization, ori31 specula
tion barrier enabled
Spectre v2: Mitigation; Indirect branch cache disabled, Software l
ink stack flush
Srbds: Not affected
Tsx async abort: Not affected
For now I'm turning off ppc HW acceleration in Red hat releases, but we should get this fixed:
NOTE: ppcle is not one of the platforms in the mozilla CI for NSS, so when you submit patches for it you should at least make sure nss/tests/all.sh passes (cd nss/tests ; ./all.sh).
Bob
Assignee | ||
Comment 44•4 years ago
|
||
I thought running BLAPI self-test would be enough to verify the correctness of data ciphering since it contains samples of different that passes multiple cipher functions. However, I'll investigate what cause the data corruption and push a patch with a fix soon.
Comment 45•4 years ago
|
||
I released NSS 3.66 today (Firefox 90) before I could see this, so it is also affected by this issue.
Note that the release cycle for NSS 3.67 is also a very short to realign with the Fx release schedule.
NSS 3.67 will be freezed+beta on June 7 for a release on June 10, so the decision to back out the patch or not will be taken on June 7.
Updated•4 years ago
|
Assignee | ||
Comment 46•4 years ago
|
||
Assignee | ||
Comment 47•4 years ago
|
||
It seems the optimized core of AES_GCM mode for ppc64le processes the first 255-byte correctly and produce corrupted data for any additional bytes, I pushed a fix for this bug on Phabricator. Unfortunately, bltest is unable to catch such bug since all the sample messages it tests are less than 256-byte I needed to manually check the correctness in order to fix this bug. I'm running tests/all.sh to check if the failures are gone, it would take up some time to finish the test around ~2 hours on my power9 machine so I'll back with the result soon, meanwhile feel free to merge the last bug fix.
Assignee | ||
Comment 48•4 years ago
|
||
Assignee | ||
Comment 49•4 years ago
|
||
Apparently, the failures related to data corruption are gone, I'm not sure if the test reaches the end tho it stops after being running for 4 hours. However, reports like tests_results/security/localhost.1/gtests/pk11_gtests/report.xml are clean now, previously they had data corruption failures for many tests. It would be nice if someone give the test another run with the last two patches merged to confirm the result.
Comment 50•4 years ago
|
||
Mamome, if you want to do a quicker test, go to nss/tests/ssl and run ./ssl.sh. That's where I was seeing the failures.
Benjamin, I have a small patch that will just turn off hardware acceleration you can drop into 3.66.1,
I'll likely pick up mamome's current patch in nss 3.66 for RHEL if it's all good.
We should probably look into making sure blapi tests texts large blocks of code (I know we have that for hashing, but we probably didn't have the for stream and block ciphers).
bob
Comment 51•4 years ago
|
||
OK, I'm looking at the fix and it's small, simple and safe, so it's probably better to pick up the fix than to turn of hardware ppcle.
Assignee | ||
Comment 52•4 years ago
|
||
Thanks for the info.
I run tests/ssl.sh and all 1160 tests are passed, I'll give ssl.sh another run so I can save the output log to attach it here. By the way during fixing the bug I made a reviewing round for the whole patch to make sure it doesn't cause any trouble in the future.
Comment 53•4 years ago
|
||
(In reply to Mamone from comment #49)
Apparently, the failures related to data corruption are gone, I'm not sure if the test reaches the end tho it stops after being running for 4 hours. However, reports like tests_results/security/localhost.1/gtests/pk11_gtests/report.xml are clean now, previously they had data corruption failures for many tests. It would be nice if someone give the test another run with the last two patches merged to confirm the result.
FWIW, I checked out NSS_3_66_RTM, verified that the nss/tests/ssl.sh tests fail on a P9 machine, then applied your two patches (they apply without conflict) and did a clean build to verify that the nss/tests/ssl.sh tests now pass. Everything looks good on my machine.
(In reply to Robert Relyea from comment #51)
OK, I'm looking at the fix and it's small, simple and safe, so it's probably better to pick up the fix than to turn of hardware ppcle.
Yes - if there's anything I can do to further help prove these patches on POWER hardware please let me know as well.
Updated•4 years ago
|
Comment 54•4 years ago
|
||
@Mamone, @Christopher,
Thanks for the patch and the verification that it now works, I'll merge the two patches today...
Comment 55•4 years ago
|
||
https://hg.mozilla.org/projects/nss/rev/91ce0df5a56952c35714c9790c7b19190a18d658
https://hg.mozilla.org/projects/nss/rev/66b9da6f53d50417d75eb076c2d97fcfb2ee4029
Comment 56•4 years ago
|
||
Fx90, built from the tree, ppc64le on Fedora 34, all TLS transactions fail with SSL_ERROR_BAD_SERVER
. Either building with the NSS in Fedora (--with-system-nss
) or NSS_DISABLE_PPC_GHASH=1
fixes the problem, so I think this still has issues.
Assignee | ||
Comment 57•4 years ago
|
||
I've tested the latest patch of AES-GCM on ubuntu and it's built successfully with all TLS tests passed, I'll request a Fedora instance to check if there is a distribution-specific issue. More info or log files about the issue will be appreciated.
Comment 58•4 years ago
|
||
Actually, I see the problem. The fixes above aren't in Fx90. Sorry for the churn.
Comment 59•4 years ago
|
||
especially since these patches are already checked into system-nss on fedora 34;).
Description
•