Closed Bug 1566124 Opened 5 years ago Closed 3 years ago

freebl: POWER AES-GCM Vector Acceleration

Categories

(NSS :: Libraries, enhancement, P5)

Other
Unspecified
enhancement

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: johnjmar, Assigned: maamoun.tk)

References

Details

(Keywords: perf)

Attachments

(3 files, 6 obsolete files)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0

Expected results:

Use POWER8 and POWER9 ISA enhancements to improve the performance of AES-GCM.

Optimized implementations in the Cryptogams repository[1] may serve as useful references.
[1] https://github.com/dot-asm/cryptogams/

POWER is not one of our officially supported architectures [0], but we would happily review and accept patches for POWER from the community. We know it's a very useful architecture - it's just a matter of limited time on our side, and nothing to test with. Thanks!

[0] https://wiki.mozilla.org/NSS#Supported_Architectures

Status: UNCONFIRMED → NEW
Type: defect → enhancement
Ever confirmed: true
Keywords: perf
Priority: -- → P5
Hardware: Unspecified → Other
See Also: → 1566126

What is the status of issue? Do anyone work on it?

I don't believe there's anyone lined up to work on it, vbrkov - feel free to take it. Happy to review the patch!

I intended to work on this, but since I hit the gcc bug on the other issue, just in case it would also affect this, I wanted to wait for it to be fixed. So I haven't written any code for this issue, and you're welcome to do so.

vbrkov, are you on this?

@cand

I've started to do patch. Do you have any code for this? If yes then my work on it doesn't make sence.

No, I was just thinking of starting. If you're already working on it, then I won't.

@cand, @vbrkov, for clarification, let me restate the task description:

Using POWER8 and POWER9 ISA enhancements, improve and provide proof that achieved performance of AES-GCM is close to optimal for the platform (i.e matching or beating OpenSSL performance).

Optimized assembly implementations in the Cryptogams repository [1] may serve as useful reference. C-based intrinsic implementations are discouraged. Financial bounty upon completion, community acceptance of patches, and bounty reviewer input.

[1] https://github.com/dot-asm/cryptogams/

(In reply to John Martinez from comment #9)

@cand, @vbrkov, for clarification, let me restate the task description:

Using POWER8 and POWER9 ISA enhancements, improve and provide proof that achieved performance of AES-GCM is close to optimal for the platform (i.e matching or beating OpenSSL performance).

Optimized assembly implementations in the Cryptogams repository [1] may serve as useful reference. C-based intrinsic implementations are discouraged. Financial bounty upon completion, community acceptance of patches, and bounty reviewer input.

[1] https://github.com/dot-asm/cryptogams/

One note of clarification:
For bounty purposes, as long as the intrinsic implementation matches or beats OpenSSL performance, it is acceptable.

Hello,
Is anyone working on this? Just want to confirm.

John,

Yes, I work on it. I have AES optimized implementation already, now have problems with endianness.

Hi @vbrkov. Have you made any progress since the last time I asked?

Hi @John, unfortunately I haven't done any progress on it yet (got my laptop broken). However I continue to work kn it this week.

Hi @vbrkov.
Do you have any updates?

This patch optimizes both AES_CTR and GCM for PowerPC64 to achieve the
optimal performance on this platform.

Before this patch GCM is slightly optimized to take advantage of hardware
crypto-specific instruction vpmsumd. In order to achieve the optimal
performance this patch add new implementation of GCM with new features:

  • Addressing 4 blocks (128-bit each) per loop (Before only one block per
    loop is addressed) this improves the parallelism by applying the equation:
    Digest = (((((((Digest⊕C0)*H)⊕C1)*H)⊕C2)H)⊕C3)H =>
    Digest = ((Digest⊕C0)H4)⊕(C1H3)⊕(C2
    H2)⊕(C3
    H) [1]
  • Handling Bit-reflection of the multiplication product [1]
  • Karatsuba Algorithm [1] (Used before this patch)
  • Deferred Recombination of partial products [1]
  • Multiplication-based reduction [2]
    This patch also optimizes AES_CTR by using built-in AES instructions vcipher..
    with assistance of other vector instructions such as vadduwm to increase the
    counter value.

Benchmark on POWER8 3.425GHz
(measuring AES_GCM encryption performance for a 16KB buffer, repeated 10000 times)
bltest -E -m aes_gcm -p 10000 -o /dev/null -g 16 -b 16384

Before (C implementation of AES_CTR. Slightly GCM hardware optimization)
mode          in symmkey  opreps  cxreps     context          op   time(sec)     thrgput
 aes_gcm_e       156Mb     128     10T       0       0.000    1413.000       1.413       110Mb
After (AES_CTR and GCM optimizations)
mode          in symmkey  opreps  cxreps     context          op   time(sec)     thrgput
 aes_gcm_e       156Mb     128     10T       0       0.000      64.000       0.064         2Gb

Cycles per byte

Before
29.7
After
1.6

[1]
https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/communications-ia-galois-counter-mode-paper.pdf
[2]
https://www.intel.com/content/dam/www/public/us/en/documents/software-support/enabling-high-performance-gcm.pdf

Hi. The original submitter of this bug, John Martinez, left IBM. For now, I'll be the designated contact from IBM.

Hi George,
Thank you for participation. As you already know, there is an enhanced algorithm for optimizing GCM on PowerPC architecture that has been implemented in GNU nettle library, I'm planning to update this patch too to take advantage of the enhanced algorithm but it seems Mozilla writing a policy for such contributions, hope it takes no longer. Once the policy is in place I will update this patch to the new algorithm so you can review it.

This patch implements enhanced algorithm of optimizing GCM on PowerPC architecture, the enhanced algorithm already implemented and approved in GNU nettle library and it's now released in nettle version 3.7. Considering AES-GCM is the most widely used TLS cipher in Firefox with 88% of total usage at the time of writhing this blog and now hits 98% of total TLS cipher usage in Firefox, I considered to implement the enhanced algorithm to squeeze the potential performance so PowerPC users could get the maximum experience when using products that depend on NSS.
To facilitate the task of reviewing the patch, I followed the same scheme used to optimize AES-GCM by intel. ppc-gcm-wrap.c is forked from intel-gcm-wrap.c for that reason. Also if Mozilla team has difficulties reviewing certain points in this patch, I believe that IBM team could help here, we had similar workflow in GNU nettle library to optimize AES-GCM and everything has gone smoothly.

Benchmark on POWER9 2.2GHz
(measuring AES_GCM encryption performance for a 16KB buffer, repeated 10000 times)
bltest -E -m aes_gcm -p 10000 -o /dev/null -g 16 -b 16384

Before (C implementation of AES_CTR and slightly GCM hardware optimization)
mode          in symmkey  opreps  cxreps     context          op   time(sec)     thrgput
 aes_gcm_e       156Mb     128     10T       0       0.000    2317.000       2.317        67Mb
After (GCM optimization)
mode          in symmkey  opreps  cxreps     context          op   time(sec)     thrgput
 aes_gcm_e       156Mb     128     10T       0       0.000     119.000       0.119         1Gb
Attachment #9167483 - Attachment is obsolete: true

Indeed, the GHASH optimization in Nettle greatly improved performance on ppc64le. We on the IBM side are happy to help test and review this patch.

Has a policy for platform-specific optimized code been posted?

I don't know, I saw it an initiative from Mozilla team here so I updated the patch.

Hi both,

Since Mamone updated the patch last, will he be the one to handle these changes? If so, can you please push the patch to Phabricator?

Georges, from an initial look at the patch, it is almost certain that I'll need a few other reviews than mine before I can land this in NSS.

Assignee: nobody → maamoun.tk
Flags: needinfo?(maamoun.tk)
Flags: needinfo?(gcwilson)

Rebuilt and submitted to Phabricator.

Flags: needinfo?(maamoun.tk)

Rebased*

Hi Benjamin. I've asked two developers from my team to create an account here to help with the review.

Flags: needinfo?(gcwilson)

@mt, can you please have a quick look at the phabricator patch? I'd like a second risk assessment in taking handwritten GCM assembly for PPC in NSS. When doing so, please keep in mind that George and IBM have agreed provide additional reviewing and support for this patch. Thanks!

Flags: needinfo?(mt)

Hi - I work with George at IBM and can help review the powerpc64 assembly bits in the patch.
I will post my comments in the next few days.

Hi Christopher, thanks for having a look, this is very helpful.

Mamone,
Martin, posted a "no red flags" in the Phabricator patch, so we are good to proceed : )
I have a very intense week, so I cannot guarantee that I can review this week, but I'll make sure we proceed ASAP.
In the meantime, can you please update the patch to fix the coding-style by running mach clang-format at the top-level of NSS (it uses Docker)?
(you can amend your current commit, as long as the same phab link is in the commit message, moz-phab submit will update the same patch)

Flags: needinfo?(mt) → needinfo?(maamoun.tk)
Flags: needinfo?(bbeurdouche)

(In reply to Benjamin Beurdouche [:beurdouche] from comment #30)

Mamone,
Martin, posted a "no red flags" in the Phabricator patch, so we are good to proceed : )
I have a very intense week, so I cannot guarantee that I can review this week, but I'll make sure we proceed ASAP.
In the meantime, can you please update the patch to fix the coding-style by running mach clang-format at the top-level of NSS (it uses Docker)?
(you can amend your current commit, as long as the same phab link is in the commit message, moz-phab submit will update the same patch)

Thank you both.
I submitted the formatted code to Phabricator but it created a new revision, is that what it's meant or I missed something?

Flags: needinfo?(maamoun.tk)

(In reply to Mamone from comment #32)
[...]

I submitted the formatted code to Phabricator but it created a new revision, is that what it's meant or I missed something?

@Mamone, I got quite confused with Phabricator, too, but once you get used a little bit to it, it's nice.

My understanding (though not 100% sure) is that you need to

No need to mention "Format ppc-gcm-wrap.c and ppc-gcm.h using clang-format", its like you sending the initial commit in.

Attachment #9211898 - Attachment description: Bug 1566124 - Optimize AES-GCM for ppc64le → Bug 1566124 - Optimize AES-GCM for ppc64le (OLD)
Attachment #9211618 - Attachment description: WIP: Bug 1566124 - Format ppc-gcm-wrap.c and ppc-gcm.h using clang-format → WIP: Bug 1566124 - Format ppc-gcm-wrap.c and ppc-gcm.h using clang-format (OLD)
Attachment #9211898 - Attachment is obsolete: true
Attachment #9211618 - Attachment is obsolete: true
Attachment #9202925 - Attachment is obsolete: true

Rebased in a new reversion.

This bug can be closed now and marked as resolved, the patch has been landed to the main repository.

@bbeurdouche could you please close the bug since the reporter is no longer active.

Waiting for the landing in Firefox as the patch is quite large, I don't expect this patch will break anything though... : )

Attachment #9211900 - Attachment is obsolete: true
Attachment #9208738 - Attachment is obsolete: true
Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(bbeurdouche)
Resolution: --- → FIXED

Georges, Mamone,
I landed the main branch of NSS 3.65, which contains this patch, in Firefox and it doesn't seem to cause trouble
for now, so I think we can call it resolved :)

Thanks Mamone for this contribution, I am sure the PPC community will enjoy this change!
Thanks to all that contributed to the discussion, and in particular to Christopher and John Paul for reviewing the patch! : )

Attachment #9211618 - Attachment is obsolete: false
Attachment #9211618 - Attachment is obsolete: true
Regressions: 1712230

I've tried to pick up nss 3.65 into fedora, and ppcle is failing all.sh. Setting NSS_DISABLE_PPC_GHASH=1 causes the tests to succeed. The test cases I saw failing were 2nd handshake client auth cases, the server would send the second SendCertificate message and the client would blow up trying to parse it. The whole message block (which includes other protocol messages) arrived corrupted on the client after decrypting with AES GCM. Corruption started at byte 6, continued for 104 bytes, 10 more bytes of good data, followed more bytes of corrupted data.

Builds with failing tests:
https://koji.fedoraproject.org/koji/taskinfo?taskID=68803260

Builds with PPC_GHASH turned off (which succeed):
https://koji.fedoraproject.org/koji/taskinfo?taskID=68875498

cpuinfo on the platform:
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 160
On-line CPU(s) list: 0-159
Model name: POWER9, altivec supported
Model: 2.2 (pvr 004e 1202)
Thread(s) per core: 4
Core(s) per socket: 20
Socket(s): 2
Frequency boost: enabled
CPU max MHz: 3800.0000
CPU min MHz: 2166.0000
Caches (sum of all):
L1d: 1.3 MiB (40 instances)
L1i: 1.3 MiB (40 instances)
L2: 10 MiB (20 instances)
L3: 200 MiB (20 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-79
NUMA node8 CPU(s): 80-159
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Mitigation; RFI Flush, L1D private per thread
Mds: Not affected
Meltdown: Mitigation; RFI Flush, L1D private per thread
Spec store bypass: Mitigation; Kernel entry/exit barrier (eieio)
Spectre v1: Mitigation; __user pointer sanitization, ori31 specula
tion barrier enabled
Spectre v2: Mitigation; Indirect branch cache disabled, Software l
ink stack flush
Srbds: Not affected
Tsx async abort: Not affected

For now I'm turning off ppc HW acceleration in Red hat releases, but we should get this fixed:
NOTE: ppcle is not one of the platforms in the mozilla CI for NSS, so when you submit patches for it you should at least make sure nss/tests/all.sh passes (cd nss/tests ; ./all.sh).

Bob

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

I thought running BLAPI self-test would be enough to verify the correctness of data ciphering since it contains samples of different that passes multiple cipher functions. However, I'll investigate what cause the data corruption and push a patch with a fix soon.

I released NSS 3.66 today (Firefox 90) before I could see this, so it is also affected by this issue.
Note that the release cycle for NSS 3.67 is also a very short to realign with the Fx release schedule.

NSS 3.67 will be freezed+beta on June 7 for a release on June 10, so the decision to back out the patch or not will be taken on June 7.

Flags: needinfo?(bbeurdouche)

It seems the optimized core of AES_GCM mode for ppc64le processes the first 255-byte correctly and produce corrupted data for any additional bytes, I pushed a fix for this bug on Phabricator. Unfortunately, bltest is unable to catch such bug since all the sample messages it tests are less than 256-byte I needed to manually check the correctness in order to fix this bug. I'm running tests/all.sh to check if the failures are gone, it would take up some time to finish the test around ~2 hours on my power9 machine so I'll back with the result soon, meanwhile feel free to merge the last bug fix.

Apparently, the failures related to data corruption are gone, I'm not sure if the test reaches the end tho it stops after being running for 4 hours. However, reports like tests_results/security/localhost.1/gtests/pk11_gtests/report.xml are clean now, previously they had data corruption failures for many tests. It would be nice if someone give the test another run with the last two patches merged to confirm the result.

Mamome, if you want to do a quicker test, go to nss/tests/ssl and run ./ssl.sh. That's where I was seeing the failures.
Benjamin, I have a small patch that will just turn off hardware acceleration you can drop into 3.66.1,

I'll likely pick up mamome's current patch in nss 3.66 for RHEL if it's all good.

We should probably look into making sure blapi tests texts large blocks of code (I know we have that for hashing, but we probably didn't have the for stream and block ciphers).
bob

OK, I'm looking at the fix and it's small, simple and safe, so it's probably better to pick up the fix than to turn of hardware ppcle.

Thanks for the info.

I run tests/ssl.sh and all 1160 tests are passed, I'll give ssl.sh another run so I can save the output log to attach it here. By the way during fixing the bug I made a reviewing round for the whole patch to make sure it doesn't cause any trouble in the future.

(In reply to Mamone from comment #49)

Apparently, the failures related to data corruption are gone, I'm not sure if the test reaches the end tho it stops after being running for 4 hours. However, reports like tests_results/security/localhost.1/gtests/pk11_gtests/report.xml are clean now, previously they had data corruption failures for many tests. It would be nice if someone give the test another run with the last two patches merged to confirm the result.

FWIW, I checked out NSS_3_66_RTM, verified that the nss/tests/ssl.sh tests fail on a P9 machine, then applied your two patches (they apply without conflict) and did a clean build to verify that the nss/tests/ssl.sh tests now pass. Everything looks good on my machine.

(In reply to Robert Relyea from comment #51)

OK, I'm looking at the fix and it's small, simple and safe, so it's probably better to pick up the fix than to turn of hardware ppcle.

Yes - if there's anything I can do to further help prove these patches on POWER hardware please let me know as well.

Attachment #9224053 - Attachment description: WIP: Bug 1566124 - Fix AES_GCM mode on ppc64le for messages of length more than 255-byte → Bug 1566124 - Fix AES_GCM mode on ppc64le for messages of length more than 255-byte

@Mamone, @Christopher,
Thanks for the patch and the verification that it now works, I'll merge the two patches today...

Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Flags: needinfo?(bbeurdouche)
Resolution: --- → FIXED

Fx90, built from the tree, ppc64le on Fedora 34, all TLS transactions fail with SSL_ERROR_BAD_SERVER. Either building with the NSS in Fedora (--with-system-nss) or NSS_DISABLE_PPC_GHASH=1 fixes the problem, so I think this still has issues.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

I've tested the latest patch of AES-GCM on ubuntu and it's built successfully with all TLS tests passed, I'll request a Fedora instance to check if there is a distribution-specific issue. More info or log files about the issue will be appreciated.

Actually, I see the problem. The fixes above aren't in Fx90. Sorry for the churn.

Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Resolution: --- → FIXED

especially since these patches are already checked into system-nss on fedora 34;).

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: