1613235 - freebl: POWER ChaCha20 stream cipher vector acceleration

Reporter

Description

•

6 years ago

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0

Expected results:

Would like to reach community to help improve the performance of ChaCha20
using POWER8 and POWER9 ISA enhancements, and to provide proof that
achieved performance of ChaCha20 is close to optimal for the platform
(i.e matching or beating OpenSSL performance *)

Optimized assembly implementations in the Cryptogams repository [1] may serve
as useful reference. C-based intrinsic implementations are also acceptable
given prior conditions (*).

Financial bounty upon completion, community acceptance of patches, and reviewer
input.

Moreover, free access to a ppc64le VM or direct Jenkins CI account can be
obtained at OSU Open Source Lab: https://osuosl.org/services/powerdev/
You can list me (johnjmar@linux.vnet.ibm.com) as the IBM Advocate.

[1] https://github.com/dot-asm/cryptogams/

J.C. Jones [:jcj] (he/they)

Updated

•

6 years ago

Status: UNCONFIRMED → NEW

Ever confirmed: true

Priority: -- → P5

Hardware: Unspecified → Other

J.C. Jones [:jcj] (he/they)

Updated

•

6 years ago

Comment 1

•

6 years ago

Attached patch patch adds vector accelerated chacha20 (obsolete) — Details — Splinter Review

This patch adds chacha20 vsx implementation, patch ChaCha20Xor() for calling vsx version of chacha20 and gyp patch for building (i will add a make patch later).

aoeuh

Assignee

Comment 2

•

6 years ago

Attached patch patch adds performance test for chacha20 (obsolete) — Details — Splinter Review

There is no way to do performance test for vsx version, so i add this patch.

before:

bltest -E -m chacha20_ctr -i data -o /dev/null -k key -v iv -p 3
      mode          in  opreps  cxreps     context          op   time(sec)     thrgput
chacha20_ctr_e   210Mb 0Kb       3       0       0.000   22141.000      29.395   7Mb 147Kb

after:

bltest -E -m chacha20_ctr -i data -o /dev/null -k key -v iv -p 300
      mode          in  opreps  cxreps     context          op   time(sec)     thrgput
chacha20_ctr_e  20Gb 520Mb     300       0       0.000   13735.000      13.790   1Gb 498Mb

aoeuh

Assignee

Comment 3

•

6 years ago

Attached patch chacha20ppc.diff (obsolete) — Details — Splinter Review

make patch added and small bug fixed

Attachment #9138243 - Attachment is obsolete: true

Attachment #9138830 - Flags: review?(jjones)

aoeuh

Assignee

Comment 4

•

6 years ago

Attached patch bltest.diff (obsolete) — Details — Splinter Review

small bug fix

Attachment #9138247 - Attachment is obsolete: true

Attachment #9138837 - Flags: review?(jjones)

John Martinez

Reporter

Comment 5

•

6 years ago

Hello,
I'm working in a Power9 system and I was able to apply both patches and build project using gyp.
However, when trying to run bltest I get a segmentation error :

ubuntu@oprom8:~/libs/nss$ patch  -p1  < ../nss.9138837.patch
patching file cmd/bltest/blapitest.c
patching file lib/freebl/blapi.h
patching file lib/freebl/blapit.h
patching file lib/freebl/chacha20poly1305.c
Hunk #1 succeeded at 70 with fuzz 1 (offset -4 lines).
patching file lib/freebl/chacha20poly1305.h
patching file lib/freebl/ldvector.c
patching file lib/freebl/loader.c
patching file lib/freebl/loader.h
ubuntu@oprom8:~/libs/nss$ cd ../; rm -rf dist; nss/build.sh -o -g
NSPR [1/5] configure ...
NSPR [2/5] make ...
NSPR [3/5] NOT building tests
NSPR [4/5] NOT running tests
NSPR [5/5] install ...
ninja: Entering directory `/home/ubuntu/libs/nss/out/Release'
[1120/1120] STAMP obj/nss_tests.actions_depends.stamp
ubuntu@oprom8:~/libs$ iv=nss/cmd/bltest/tests/chacha20_poly1305/iv1
ubuntu@oprom8:~/libs$ k=nss/cmd/bltest/tests/chacha20_poly1305/key1
ubuntu@oprom8:~/libs$ p=nss/cmd/bltest/tests/chacha20_poly1305/plaintext1
ubuntu@oprom8:~/libs$ bltest=dist/Release/bin/bltest
ubuntu@oprom8:~/libs$ export LD_LIBRARY_PATH=dist/Release/lib
ubuntu@oprom8:~/libs$
ubuntu@oprom8:~/libs$ $bltest -E -m  chacha20_ctr -i $p -o /dev/null -k $k -v $iv -5 10
Segmentation fault (core dumped)

I'm currently investigating.

John Martinez

Reporter

Comment 6

•

6 years ago

Looks like chacha20_Encrypt is getting a null ctx ...

Reading symbols from dist/Debug/bin/bltest...done.
(gdb) run -E -m chacha20_ctr -i nss/cmd/bltest/tests/chacha20_poly1305/plaintext1 -o /dev/null -k nss/cmd/bltest/tests/chacha20_poly1305/key1 -v nss/cmd/bltest/tests/chacha20_poly1305/iv1 -5 10
Starting program: /home/ubuntu/libs/dist/Debug/bin/bltest -E -m chacha20_ctr -i nss/cmd/bltest/tests/chacha20_poly1305/plaintext1 -o /dev/null -k nss/cmd/bltest/tests/chacha20_poly1305/key1 -v nss/cmd/bltest/tests/chacha20_poly1305/iv1 -5 10
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff7a5f180 (LWP 71024)]

Thread 2 "bltest" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7a5f180 (LWP 71024)]
chacha20_Encrypt (cx=0x0, output=0x100053370 "", outputLen=0x7ffff7a5e5f0, maxOutputLen=265,
    input=0x100053250 "Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference"...,
    inputLen=265) at ../../cmd/bltest/blapitest.c:1147
1147        return ChaCha20_Xor(output, input, inputLen, ctx->key, ctx->iv+4,
(gdb) where
#0  chacha20_Encrypt (cx=0x0, output=0x100053370 "", outputLen=0x7ffff7a5e5f0, maxOutputLen=265,
    input=0x100053250 "Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference"...,
    inputLen=265) at ../../cmd/bltest/blapitest.c:1147   
#1  0x000000010000c8f8 in cipherDoOp (cipherInfo=0x100052eb0) at ../../cmd/bltest/blapitest.c:2433
#2  0x00000001000105c8 in ThreadExecTest (data=0x100052eb0) at ../../cmd/bltest/blapitest.c:3327
#3  0x00007ffff7bec260 in _pt_root (arg=0x100054390) at ../../../../pr/src/pthreads/ptthread.c:201
#4  0x00007ffff7e9885c in start_thread (arg=0x0) at pthread_create.c:463
#5  0x00007ffff7db9028 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:82
(gdb) list
1142                unsigned int maxOutputLen, const unsigned char *input,
1143                unsigned int inputLen)
1144    {
1145        ChaCha20Context *ctx = cx;
1146        *outputLen = inputLen;
1147        return ChaCha20_Xor(output, input, inputLen, ctx->key, ctx->iv+4,
1148                            *((PRUint32 *)ctx->iv));
1149
1150    }
1151

aoeuh

Assignee

Comment 7

•

6 years ago

iv must be 16 byte file, key must be 32 byte file. The plaintext file size is also limited, but I do not know how much. I think the problem is in the internals of bltest.

aoeuh

Assignee

Comment 8

•

6 years ago

ubuntu@oprom8:~/libs$ $bltest -E -m chacha20_ctr -i $p -o /dev/null -k $k -v $iv -5 10
-5 10 looks like a typo

aoeuh

Assignee

Comment 9

•

6 years ago

Oh, I realized the problem is exactly in one of the input files. ChaCha20_CreateContext returns 0 if the iv or key size does not match

John Martinez

Reporter

Comment 10

•

6 years ago

(In reply to aoeuh from comment #7)

iv must be 16 byte file, key must be 32 byte file. The plaintext file size is also limited, but I do not know how much. I think the problem is in the internals of bltest.

Looks like the iv I'm using from the nss test files is 12 bytes:

ubuntu@oprom8:~/libs$ wc  $iv $k
+ wc nss/cmd/bltest/tests/chacha20_poly1305/iv1 nss/cmd/bltest/tests/chacha20_poly1305/key1
 0  0 12 nss/cmd/bltest/tests/chacha20_poly1305/iv1
 0  3 32 nss/cmd/bltest/tests/chacha20_poly1305/key1

However, looking at https://tools.ietf.org/html/rfc7539, section 2.3 states that the nonce or iv in this case MUST be 96 bits (12 bytes).

(In reply to aoeuh from comment #8)

ubuntu@oprom8:~/libs$ $bltest -E -m chacha20_ctr -i $p -o /dev/null -k $k -v $iv -5 10
-5 10 looks like a typo

I'm running the test for 10 seconds. Per the help output

+ dist/Release/bin/bltest
bltest: you must enter a command!

    bltest [-DEHSVR]    List available cipher modes

    bltest -E -m mode   Encrypt a buffer
                [-i plaintext] [-o ciphertext] [-k key] [-v iv]
                [-b bufsize] [-g keysize] [-e exp] [-r rounds]
                [-w wordsize] [-p repetitions | -5 time_interval]
                [-4 th_num]
           -m   cipher mode to use
           -i   file which contains input buffer
           -o   file for output buffer
           -k   file which contains key
           -v   file which contains initialization vector
           -b   size of input buffer
           -g   key size (in bytes)
           -p   do performance test
           -4   run test in multithread mode. th_num number of parallel threads
           -5   run test for specified time interval(in seconds)

Anyway, I tested using a 16 byte iv (not sure if legal), and I see a ~3x improvement (previous vs. optmized):

iv=nss/cmd/bltest/tests/aes_cbc/iv16

#patch -p1  < ../nss.9138837.patch
ubuntu@oprom8:~/libs$ $bltest -E -m  chacha20_ctr -i $p -o /dev/null -k $k -v $iv -5 10
#     mode          in  opreps  cxreps     context          op   time(sec)     thrgput
chacha20_ctr_e   2Gb 227Mb      9M       0       0.000   10000.000      10.001 227Mb 582Kb


#patch -p1  < ../nss.9138837.patch
#patch -p1  < ../nss.9138830.patch
ubuntu@oprom8:~/libs$ $bltest -E -m  chacha20_ctr -i $p -o /dev/null -k $k -v $iv -5 10
#     mode          in  opreps  cxreps     context          op   time(sec)     thrgput
chacha20_ctr_e   6Gb 180Mb     25M       0       0.000   10000.000      10.001 632Mb 424Kb

aoeuh

Assignee

Comment 11

•

6 years ago

my first test was on debug build, my fault.

Results on integricloud power9 vps (1 core 4 threads i suppose):
before

bltest -E -m chacha20_ctr -i testfile -o resnss -k key -v iv -5 10
#     mode          in  opreps  cxreps     context          op   time(sec)     thrgput
chacha20_ctr_e   2Gb 852Mb      2T       0       0.000   10025.000      10.030 289Mb 135Kb

after

bltest -E -m chacha20_ctr -i testfile -o resnss -k key -v iv -5 10
#     mode          in  opreps  cxreps     context          op   time(sec)     thrgput
chacha20_ctr_e  14Gb 894Mb     15T       0       0.000   10005.000      10.007   1Gb 497Mb

openssl

openssl speed -evp chacha20
Doing chacha20 for 3s on 16 size blocks: 38326198 chacha20's in 2.89s
Doing chacha20 for 3s on 64 size blocks: 15884799 chacha20's in 2.88s
Doing chacha20 for 3s on 256 size blocks: 15408018 chacha20's in 3.00s
Doing chacha20 for 3s on 1024 size blocks: 4190731 chacha20's in 3.00s
Doing chacha20 for 3s on 8192 size blocks: 539112 chacha20's in 2.99s
Doing chacha20 for 3s on 16384 size blocks: 274003 chacha20's in 3.00s
OpenSSL 1.1.1d  10 Sep 2019
built on: Sat Oct 12 19:56:43 2019 UTC
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr) 
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-SVnnZZ/openssl-1.1.1d=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DAES_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
chacha20        212186.56k   352995.53k  1314817.54k  1430436.18k  1477058.70k  1496421.72k

However, looking at https://tools.ietf.org/html/rfc7539, section 2.3 states that the nonce or iv in this case MUST be 96 bits (12 bytes).

first 4 bytes is the counter (in little endian), openssl also uses 16 byte iv for chacha20.

Is there something I have to do?

John Martinez

Reporter

Comment 12

•

6 years ago

One more observation. Looking at perf traces for bltests chacha20_ctr and chacha20_poly1305 using unoptimized code, I see both tests sharing the same double_round function while performing the cipher stream.

For the same bltests using optimized code, I see the following:

# chacha20_poly1305_e
+   64.48%     0.00%  bltest  libc-2.27.so        [.] __clone                                                                         \u2592
+   64.48%     0.00%  bltest  libpthread-2.27.so  [.] start_thread                                                                    \u2592
+   64.48%     0.00%  bltest  libnspr4.so         [.] _pt_root                                                                        \u2592
+   64.48%     0.00%  bltest  bltest              [.] ThreadExecTest                                                                  \u2592
+   64.48%     0.05%  bltest  bltest              [.] cipherDoOp                                                                      \u2592
+   64.28%     0.11%  bltest  bltest              [.] chacha20_poly1305_Encrypt                                                       \u2592
+   64.12%     0.20%  bltest  bltest              [.] ChaCha20Poly1305_Seal                                                           \u2592
+   63.93%     0.11%  bltest  libfreeblpriv3.so   [.] ChaCha20Poly1305_Seal                                                           \u2592
+   62.14%     7.68%  bltest  libfreeblpriv3.so   [.] Hacl_Chacha20Poly1305_32_aead_encrypt                                           \u2592
+   53.73%    45.60%  bltest  libfreeblpriv3.so   [.] Hacl_Chacha20_chacha20_encrypt.localalias.0                                     \u2592
+    7.81%     7.77%  bltest  libfreeblpriv3.so   [.] double_round                                                                    \u2592
+    1.06%     1.06%  bltest  libfreeblpriv3.so   [.] Hacl_Poly1305_32_poly1305_finish                                                \u2592
+    1.00%     0.53%  bltest  libc-2.27.so        [.] __memcpy_chk

# chacha20_ctr_e
+   60.90%     0.00%  bltest  libc-2.27.so        [.] __clone                                                                         \u25c6
+   60.90%     0.00%  bltest  libpthread-2.27.so  [.] start_thread                                                                    \u2592
+   60.90%     0.00%  bltest  libnspr4.so         [.] _pt_root                                                                        \u2592
+   60.90%     0.00%  bltest  bltest              [.] ThreadExecTest                                                                  \u2592
+   60.80%     0.42%  bltest  bltest              [.] cipherDoOp                                                                      \u2592
+   59.94%     0.79%  bltest  bltest              [.] chacha20_Encrypt                                                                \u2592
+   59.13%     0.58%  bltest  bltest              [.] ChaCha20_Xor                                                                    \u2592
+   58.33%     2.24%  bltest  libfreeblpriv3.so   [.] ChaCha20_Xor                                                                    \u2592
+   56.39%     0.31%  bltest  libfreeblpriv3.so   [.] ChaCha20Xor                                                                     \u2592
+   23.81%    23.79%  bltest  libfreeblpriv3.so   [.] mainloop        <=  this used to be 'double_round' before                                                \u2592
+   23.28%    23.24%  bltest  libfreeblpriv3.so   [.] lastblock                                                                       \u2592
+    5.78%     5.78%  bltest  libfreeblpriv3.so   [.] chacha20vsx                                                                     \u2592
+    2.03%     2.03%  bltest  libfreeblpriv3.so   [.] xorlast                                                                         \u2592
+    1.18%     1.18%  bltest  libfreeblpriv3.so   [.] exitsub

The expectation here is that eventhough the authenticator poly1305 part is not optmized, one would still see some benefits running the chacha20_poly1305 bltest, which is not the case.

aoeuh

Assignee

Comment 13

•

6 years ago

Attached patch chacha20ppc.diff (obsolete) — Details — Splinter Review

Added functions Chacha20Poly1305_vsx_aead_encrypt and Chacha20Poly1305_vsx_aead_decrypt. Actually I just copy Hacl_Chacha20Poly1305_32.c and fix it a bit. I did not want to edit the file because it was verified.

Attachment #9138830 - Attachment is obsolete: true

Attachment #9138830 - Flags: review?(jjones)

John Martinez

Reporter

Comment 14

•

5 years ago

Hello,
After grabbing the latest NSS code available, I've verified the latest patch (9139949) using a Power9 system running baremetal linux (I had to manually patch both lib/freebl/ldvector.c and lib/freebl/Makefile due to the addition of SHA512 constructs in the latest version of the code)

Anyway, I see a 3x improvement in the chacha20_poly1305_e bltest and a 5x improvement in the newly created chacha20_ctr_e test. Notice I used a larger plaintext input for the tests.

#### BEFORE OPTIMIZATION PATCH
ubuntu@oprom8:~/libs/nss$ patch  -p1  < ../nss.9138837.patch # bltest
1 out of 1 hunk FAILED -- saving rejects to file lib/freebl/ldvector.c.rej
...
[1088/1088] STAMP obj/nss_tests.actions_depends.stamp
ubuntu@oprom8:~/libs$ iv=nss/cmd/bltest/tests/chacha20_poly1305/iv1
ubuntu@oprom8:~/libs$ k=nss/cmd/bltest/tests/chacha20_poly1305/key1
ubuntu@oprom8:~/libs$ p=nss/COPYING
ubuntu@oprom8:~/libs$ bltest=dist/Release/bin/bltest
ubuntu@oprom8:~/libs$ export LD_LIBRARY_PATH=dist/Release/lib
ubuntu@oprom8:~/libs$
ubuntu@oprom8:~/libs$ m=chacha20_poly1305 #chacha20_ctr
ubuntu@oprom8:~/libs$ $bltest -E -m $m -i $p -o /dev/null -k $k -v $iv -5 10
#     mode          in  opreps  cxreps     context          op   time(sec)     thrgput
chacha20_poly1305_e   2Gb 451Mb    144T       0       0.000   10000.000      10.003 249Mb 860Kb
ubuntu@oprom8:~/libs$
ubuntu@oprom8:~/libs$ iv=nss/cmd/bltest/tests/aes_cbc/iv16
ubuntu@oprom8:~/libs$ m=chacha20_ctr
ubuntu@oprom8:~/libs$ $bltest -E -m $m -i $p -o /dev/null -k $k -v $iv -5 10
#     mode          in  opreps  cxreps     context          op   time(sec)     thrgput
chacha20_ctr_e   2Gb 925Mb    172T       0       0.000   10000.000      10.002 297Mb 312Kb

#### AFTER OPTIMIZATION PATCH
ubuntu@oprom8:~/libs/nss$ patch -p1  < ../nss.9139949.patch # chacha20?
2 out of 2 hunks FAILED -- saving rejects to file lib/freebl/Makefile.rej
...
[1034/1034] STAMP obj/nss_tests.actions_depends.stamp
ubuntu@oprom8:~/libs$ iv=nss/cmd/bltest/tests/chacha20_poly1305/iv1
ubuntu@oprom8:~/libs$ k=nss/cmd/bltest/tests/chacha20_poly1305/key1
ubuntu@oprom8:~/libs$ p=nss/COPYING
ubuntu@oprom8:~/libs$ bltest=dist/Release/bin/bltest
ubuntu@oprom8:~/libs$ export LD_LIBRARY_PATH=dist/Release/lib
ubuntu@oprom8:~/libs$
ubuntu@oprom8:~/libs$ m=chacha20_poly1305 #chacha20_ctr
ubuntu@oprom8:~/libs$ $bltest -E -m $m -i $p -o /dev/null -k $k -v $iv -5 10
#     mode          in  opreps  cxreps     context          op   time(sec)     thrgput
chacha20_poly1305_e   7Gb 364Mb    436T       0       0.000   10000.000      10.002 753Mb 102Kb
ubuntu@oprom8:~/libs$
ubuntu@oprom8:~/libs$ iv=nss/cmd/bltest/tests/aes_cbc/iv16
ubuntu@oprom8:~/libs$ m=chacha20_ctr
ubuntu@oprom8:~/libs$ $bltest -E -m $m -i $p -o /dev/null -k $k -v $iv -5 10
#     mode          in  opreps  cxreps     context          op   time(sec)     thrgput
chacha20_ctr_e  14Gb 862Mb    880T       0       0.000   10000.000      10.001   1Gb 495Mb

aoeuh

Assignee

Comment 15

•

5 years ago

Attached patch chacha20ppc.diff — Details — Splinter Review

rebase to current master

Attachment #9139949 - Attachment is obsolete: true

aoeuh

Assignee

Comment 16

•

5 years ago

Attached patch bltest.diff — Details — Splinter Review

rebase to current master

Attachment #9138837 - Attachment is obsolete: true

Attachment #9138837 - Flags: review?(jjones)

John Martinez

Reporter

Comment 17

•

5 years ago

Hello, is there anyone that can review?
Thanks,

JM

Flags: needinfo?(jjones)

J.C. Jones [:jcj] (he/they)

Comment 18

•

5 years ago

I appreciate the work to bring this optimized code into NSS. To really consider this patch we’re going to need the source that generated the .s file, which you’ll find we generally land in-tree for other assembly for future verification purposes. Given the complexity of the patch, I’d also prefer to review these in Phabricator rather than inside Bugzilla’s Splinter.

That aside, we are writing a policy governing community contributions of platform-specific/optimized code and will post that on the Mozilla wiki, and there we can make clear the requirements for these sorts of contributions. As-is, if the assembly is doing something nefarious, I would be hard-pressed to figure that out. I’ll leave my needinfo here to return with a drafted policy statement - and it might be a good idea to wait for further code changes until we have that.

Assignee: nobody → aoeuh

Severity: normal → N/A

Status: NEW → ASSIGNED

Version: 3.4 → trunk

aoeuh

Assignee

Comment 19

•

5 years ago

To really consider this patch we’re going to need the source that generated the .s file

It is written by hand.

aoeuh

Assignee

Comment 20

•

5 years ago

Attached file Bug 1613235 - Add POWER ChaCha20 stream cipher vector acceleration. r=jcj (obsolete) — Details

aoeuh

Assignee

Comment 21

•

5 years ago

Hi, any updates?

Comment hidden (off-topic)

George Wilson

Comment 23

•

5 years ago

Hi. The original submitter of this bug, John Martinez, left IBM. For now, I'll be the designated contact from IBM.

George Wilson

Comment 24

•

5 years ago

Hi jc. Has the platform-specific code policy been posted? I understand the problems of assembly language but it's currently the way we're able to maximize performance. Is there anything I can do from my side to help with this patch?

abebeos

Comment 25

•

5 years ago

(In reply to George Wilson from comment #24)

Hi jc. Has the platform-specific code policy been posted? I understand the problems of assembly language but it's currently the way we're able to maximize performance. Is there anything I can do from my side to help with this patch?

Possibly your colleague David Edelsohn (https://github.com/edelsohn) has an idea about how to handle this months long "needinfo" status.

Benjamin Beurdouche [:beurdouche]

Comment 26

•

5 years ago

Hi all, JC accepted an offer from another company recently, so Kevin and I are gonna take over this discussion. This will happen after the holidays though. Because of many recent changes at Mozilla, the policy discussions have been delayed and I am gonna take that over in January as well. We’ll be in touch in January. Thanks for your patience and Happy holidays in the meantime !

Flags: needinfo?(bbeurdouche)

Benjamin Beurdouche [:beurdouche]

Updated

•

5 years ago

Flags: needinfo?(jc)

abebeos

Comment 27

•

5 years ago

(In reply to Benjamin Beurdouche [:beurdouche] from comment #26)

Hi all, JC accepted an offer from another company recently, so Kevin and I are gonna take over this discussion. This will happen after the holidays though. Because of many recent changes at Mozilla, the policy discussions have been delayed and I am gonna take that over in January as well. We’ll be in touch in January. Thanks for your patience and Happy holidays in the meantime !

As a long-term fan of both, Mozilla and Power, I am highly(!) disappointed by the way Mozilla processes this.

I would expect a far far more fluent contribution-process, and of course that a project respect the bounty-workers (which do not get payed for their hard work if their work is not merged).

There are situation where the contributor has a far higher knowledge-grade, so no reviewer can make a deep review (here, one needs to dig into hand-crafted assembler code). In such situation, one needs to trust the institution behind the work (here: IBM).

I believe that one can trust IBM.

Some more blocked issues:

https://bugzilla.mozilla.org/show_bug.cgi?id=1613236
https://bugzilla.mozilla.org/show_bug.cgi?id=1566124

Benjamin Beurdouche [:beurdouche]

Comment 28

•

5 years ago

That it came from IBM or not, this is code for an unsupported platform and it doesn't remotely have the level of quality that we expect from new code in NSS. Most new code for primitives in NSS is either machine generated or formally verified for functional correctness, secret independence and memory safety.

Mozilla does not have the resources to review and test new code for these unsupported platforms and I am don't think dumping code inside NSS without a serious plan is a good idea. That would increase our attack surface against the best interest of our users. If anything, a better plan would be to remove code for unsupported platforms, and everything that has been handwritten, even if we have to loose some performance there.

I am quite happy to be helpful and welcome lower priority contributions when we can, but there are times when it is tough, this is one of those.
I'll try to think about it again, but your comment did not serve these patches well and I would encourage gentle pings instead.

Mamone

Comment 29

•

5 years ago

Hi Benjamin,
I got your point here and I see your concerns are reasonable but isn't it better to find solutions to these concerns than prohibit users the perfect experience when using products that depend on NSS? I'm not just talking about Mozilla products because I'm aware of many applications that use NSS as the underlying security library.
PowerPC isn't supported platform here but there are a lot of potential users who can benefit from these optimizations. I wrote an optimization of AES-GCM for PowerPC platform and looked at the code here, I can tell you the quality of these patches meets the code level of supported platforms. I understand the lack of resources to review and test both patches but to be honest Mozilla has trusted intel before at least from what I have seen in it AES-GCM contribution, Mozilla couldn't dig into algorithms and details intel has committed to the patch and it accepted, yes x86 is supported platform but the situation isn't that different here. IBM has qualified crypto engineers who contributed similar patches for libraries such OpenSSL and they will happily review the current patches here.

Also, @abebeos is a passionate guy and has many contributions for PowerPC platform on other libraries and definitely didn't mean to not gentle, he just worry about the contributor works and how it gonna add considerable value to the platform.

regards,
Mamone

abebeos

Comment 30

•

5 years ago

(In reply to Benjamin Beurdouche [:beurdouche] from comment #28)

That it came from IBM or not, this is code for an unsupported platform and it doesn't remotely have the level of quality that we expect from new code in NSS. Most new code for primitives in NSS is either machine generated or formally verified for functional correctness, secret independence and memory safety.

I understand this argument, though I'm mostly arguing that IBM as a code-verifier can be trusted (see below).

Mozilla does not have the resources to review and test new code for these unsupported platforms and I am don't think dumping code inside NSS without a serious plan is a good idea.

Here's the plan:

let bounty-workers to the "heavy lifting"
let IBM verify the results (as it was already suggested here: https://bugzilla.mozilla.org/show_bug.cgi?id=1566124#c20)

That would increase our attack surface against the best interest of our users. If anything, a better plan would be to remove code for unsupported platforms, and everything that has been handwritten, even if we have to loose some performance there.

Then go on. Strip everything non-Intel away, state this clearly and prominently within the NSS repo ("Intel Only"), so everyone can save time.

(disclaimer: me exaggerating, to make my point...)

A possible alternative could be to say: "well, IBM, you need to provide the relevant (at least part-time) maintainer to make PowerPC a supported platform".

I am quite happy to be helpful and welcome lower priority contributions when we can, but there are times when it is tough, this is one of those.

Tough times? In the end, you're managers, so see to it that you delegate stuff out.

I'll try to think about it again, but your comment did not serve these patches well and I would encourage gentle pings instead.

Hey, this was actually my "gentle ping" tenor. Remember "diversity": persons have different expression styles (mine is usually clear of any diplomacy). Yes, I'm kind of brain-damaged, because diplomacy is needed much within today's OSS Project-landscape - even more than logic and abstractional/analytical abilities.

(disclaimer: I'm criticizing/suggesting "things", (here: project processes etc.), something that is clearly encouraged within the Mozilla Etiquette)

(and I'm an PowerPC&Mozilla fan, not related to IBM)

abebeos

Comment 31

•

5 years ago

And I just saw this:

the enhanced algorithm already implemented and approved in GNU nettle library and it's now released in nettle version 3.7.

source: https://bugzilla.mozilla.org/show_bug.cgi?id=1566124#c19 (all this sounds like a decent plan, as requested)

Benjamin Beurdouche [:beurdouche]

Comment 32

•

5 years ago

•

Edited

Hi all,

Ok, as I mentioned before, I am happy to help here, but as I have a no cycles to dedicate, let's try to efficiently handle this patch first
and then look at the GCM patch from Mamone. I don't think we would take the assembly for Poly1305, but fortunately no patch was provided and there is vectorization support in progress on the HACL* side, so I am sure that this code will be pretty fast...

I had a brief look at the patches on Sunday and at first sight, these changes look ok, but I'd like to do proper review in Phabricator.

@aoeuh, could you please split this work in two phabricator patches, one for blapi, one for chacha20?
(you can push two commits with moz-phab submit and it will separate in two stacked phabricator patches).

@George, once aoeuh is done, can you please make sure to have at least one or two person from IBM (or someone trusted) to:

have another look at the code to make sure it is correct
make sure this patch builds both with gyp (build.sh) and Makefiles on IBM platforms

Also, I'd like to make sure that some minimal help will be available from IBM if we hit issues, we won't have support for these machines in our CI, so we won't be able to test this. If we have reports that the code breaks something, for some reason, and there is no support, the only available option will be to revert that patch... :/ I am happy to discuss how we can see IBM engage more with us on supporting their platform in the long run, after we handle these changes.

When we are done with the changes, I'll push the changes to our CI to see if everything passes, it might require a bit more work but if not I'll land the changes to NSS.

Does that work for you all?

Heads up to you all and to @glaubitz who is handling separate changes there: https://bugzilla.mozilla.org/show_bug.cgi?id=1687164
(these seem pretty much disconnected for now, but we might collide at some point)

Flags: needinfo?(maamoun.tk)

Flags: needinfo?(glaubitz)

Flags: needinfo?(gcwilson)

Flags: needinfo?(aoeuh)

Flags: needinfo?(abebeos)

John Paul Adrian Glaubitz

Comment 33

•

5 years ago

I will try to get my changes in for #1687164 this week. I'm on vacation for two weeks now so plenty of time :).

Flags: needinfo?(glaubitz)

Mamone

Comment 34

•

5 years ago

(In reply to Benjamin Beurdouche [:beurdouche] from comment #32)

Also, I'd like to make sure that some minimal help will be available from IBM if we hit issues, we won't have support for these machines in our CI, so we won't be able to test this. If we have reports that the code breaks something, for some reason, and there is no support, the only available option will be to revert that patch... :/ I am happy to discuss how we can see IBM engage more with us on supporting their platform in the long run, after we handle these changes.

How does CI system handle the platform-specific code? Does it run natively or by emulating through qemu? What I'm thinking is tweaking the CI system to support PowerPC specific-code, I already have experience with gitlab CI and I may help here if it's open to contribute.

Benjamin Beurdouche [:beurdouche]

Comment 35

•

5 years ago

The CI runs natively and its under the control of another team so I don't have access to these machines.
I will ask but I am pretty sure that the rule is to only keep supported platforms in the CI, which makes sense since we won't maintain that code...

abebeos

Comment 36

•

5 years ago

(In reply to Mamone from comment #34)
[...]

How does CI system handle the platform-specific code? Does it run natively or by emulating through qemu? What I'm thinking is tweaking the CI system to support PowerPC specific-code, I already have experience with gitlab CI and I may help here if it's open to contribute.

My experience with this tells me that this is something that needs to be done by IBM. It looks like Mozilla opens up to contributions, so, in an ideal world, IBM would ensure to extend mozillas CI-system. This seems to be a separate issue/task.

Flags: needinfo?(abebeos)

George Wilson

Comment 37

•

5 years ago

Hi all, just catching up here. We can definitely help test and review submissions for Power. What are the requirements for a platform to be considered supported? NSS has run on Power and z for a long time - we've even sponsored successful Linux distro FIPS 140-2 validations of it on our platforms. What is needed for CI? I will discuss your requirements with my management.

Flags: needinfo?(gcwilson)

Benjamin Beurdouche [:beurdouche]

Comment 38

•

5 years ago

As I mentioned before, I think we should first handle this patch, see how it goes, then we can have the other discussions. : )

aoeuh

Assignee

Comment 39

•

5 years ago

Attached file Bug 1613235 - Add POWER ChaCha20 stream cipher vector acceleration. r=beurdouche — Details

aoeuh

Assignee

Comment 40

•

5 years ago

Attached file Bug 1613235 - Add POWER ChaCha20 stream cipher vector acceleration. r=beurdouche — Details

Depends on D107220

aoeuh

Assignee

Comment 41

•

5 years ago

rebased, added check for NSS_DISABLE_CRYPTO_VSX

Flags: needinfo?(aoeuh)

Benjamin Beurdouche [:beurdouche]

Comment 42

•

5 years ago

Attached file Bug 1613235 - Clang-format for: POWER ChaCha20 stream cipher vector acceleration — Details

Depends on D107221

Benjamin Beurdouche [:beurdouche]

Comment 43

•

5 years ago

https://hg.mozilla.org/projects/nss/rev/764124fddaa2a908ffa8222732d31829b71c357f
https://hg.mozilla.org/projects/nss/rev/4f7ba08bd991781a349083a5addda7216af4b160
https://hg.mozilla.org/projects/nss/rev/61e70233f80eeebb3eaf03a5a363f8204282ab4f

Status: ASSIGNED → RESOLVED

Closed: 5 years ago

Flags: needinfo?(maamoun.tk)

Flags: needinfo?(bbeurdouche)

Resolution: --- → FIXED

Benjamin Beurdouche [:beurdouche]

Comment 44

•

5 years ago

Thank you @aoeuh for the patch !
@Mamone, I'll look at https://bugzilla.mozilla.org/show_bug.cgi?id=1566124 next.

aoeuh

Assignee

Comment 45

•

5 years ago

Thanks Benjamin for the review!

Mike Hommey [:glandium]

Updated

•

5 years ago

Regressions: 1698320

Phabricator Automation

Updated

•

5 years ago

Attachment #9161494 - Attachment is obsolete: true

patch adds vector accelerated chacha20 6 years ago aoeuh 11.90 KB, patch		Details \| Diff \| Splinter Review
patch adds performance test for chacha20 6 years ago aoeuh 11.19 KB, patch		Details \| Diff \| Splinter Review
chacha20ppc.diff 6 years ago aoeuh 12.57 KB, patch		Details \| Diff \| Splinter Review
bltest.diff 6 years ago aoeuh 11.19 KB, patch		Details \| Diff \| Splinter Review
chacha20ppc.diff 6 years ago aoeuh 35.31 KB, patch		Details \| Diff \| Splinter Review
chacha20ppc.diff 5 years ago aoeuh 35.24 KB, patch		Details \| Diff \| Splinter Review
bltest.diff 5 years ago aoeuh 11.12 KB, patch		Details \| Diff \| Splinter Review
Bug 1613235 - Add POWER ChaCha20 stream cipher vector acceleration. r=jcj 5 years ago aoeuh 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1613235 - Add POWER ChaCha20 stream cipher vector acceleration. r=beurdouche 5 years ago aoeuh 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1613235 - Add POWER ChaCha20 stream cipher vector acceleration. r=beurdouche 5 years ago aoeuh 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1613235 - Clang-format for: POWER ChaCha20 stream cipher vector acceleration 5 years ago Benjamin Beurdouche [:beurdouche] 48 bytes, text/x-phabricator-request		Details \| Review