Open Bug 1609569 Opened 5 years ago Updated 2 years ago

ChaCha/Poly not hardware accelerated on Pentium Gold processors (non-AVX)

Categories

(NSS :: Libraries, defect, P3)

x86_64
All

Tracking

(Not tracked)

People

(Reporter: jcj, Unassigned)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: regression)

As mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=1605369#c20, some 2017-era Intel CPUs have SSE4 but do not have AVX, similar to 2007-era Penryn CPUs.

The new HACL* update in Bug 1574643 requires AVX for somewhat unknown reasons - the actual intrinsics are all available in SSE4, but apparently our optimizer is adding AVX instructions on its own, resulting in crashes (Bug 1605369).

We should see what we can do to not leave these Pentium Gold-series CPUs out of our accelerated goodness.
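One common way to keep non-AVX parts on an accelerated path is to pick the implementation at run time from what the CPU reports, and to keep AVX code generation confined to files that are only entered after that check. When a file is built with -mavx, GCC and Clang are free to emit VEX-encoded forms even for SSE intrinsics, which may be what is happening here, although that is not established in this bug. The following is a minimal sketch and not NSS or HACL* code: chacha_backend_t, pick_chacha_backend and chacha_backend_name are hypothetical names, and it assumes GCC or Clang, whose __builtin_cpu_supports builtin does the detection on x86.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical backend identifiers; these are not NSS or HACL* names. */
typedef enum {
    CHACHA_BACKEND_SCALAR = 0, /* portable C, no vector extensions */
    CHACHA_BACKEND_SSE41  = 1, /* 128-bit path for SSE4.1 CPUs without AVX */
    CHACHA_BACKEND_AVX    = 2, /* 128-bit path built with AVX enabled */
    CHACHA_BACKEND_AVX2   = 3  /* 256-bit path */
} chacha_backend_t;

/* Pick the widest backend the running CPU reports, falling back to
 * portable C when nothing better is available or detectable. */
chacha_backend_t pick_chacha_backend(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return CHACHA_BACKEND_AVX2;
    if (__builtin_cpu_supports("avx"))
        return CHACHA_BACKEND_AVX;
    if (__builtin_cpu_supports("sse4.1"))
        return CHACHA_BACKEND_SSE41;
#endif
    return CHACHA_BACKEND_SCALAR;
}

/* Human-readable name for logging which path was selected. */
const char *chacha_backend_name(chacha_backend_t b)
{
    assert((int)b <= (int)CHACHA_BACKEND_AVX2);
    switch (b) {
    case CHACHA_BACKEND_AVX2:  return "avx2";
    case CHACHA_BACKEND_AVX:   return "avx";
    case CHACHA_BACKEND_SSE41: return "sse4.1";
    default:                   return "scalar";
    }
}
```

On a Pentium Gold style CPU with SSE4 but no AVX, a selector like this would be expected to land on the SSE4.1 backend, provided that backend's object file was itself compiled without -mavx.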

If you can provide a reasonably small testcase and the command line indicating that we shouldn't be using avx, the clang folks are probably interested in a bug report.

I am going to follow along here, as I have a collection of systems with different CPUs that all report, of course, a different set of CPU flags. The primary offense here is the suspicion that anything that pre-dates 2017 may be thought of as "old", when we all have Oracle and HP systems from the previous two decades still running. In production. Just fine. I think this is just an issue with the CFLAGS in play at build time, and even the optimizer would not violate something as restrictive as:

-mno-mmx -mno-sse -mno-sse2 -mno-sse3 -mno-ssse3 -mno-sse4 -mno-sse4a \
-mno-sse4.1 -mno-sse4.2 -mno-avx -mno-avx2 -mno-avx512f -mno-avx512pf \
-mno-avx512er -mno-avx512cd -mno-avx512vl -mno-avx512bw -mno-avx512dq \
-mno-avx512ifma -mno-avx512vbmi

The list of "extensions" that can be disabled (at least by gcc) is quite long and can be seen at:

https://gcc.gnu.org/onlinedocs/gcc-9.2.0/gcc/x86-Options.html#x86-Options

I think the problem here is the CFLAGS in play at compile time, wherein we do not restrict the use of certain optional CPU features. The LLVM/Clang folks are a mystery to me, as their docs don't seem to be as clear. However, an example of truly excessive CFLAGS would be to build for a baseline AMD k8 Opteron target, which should result in a binary that runs (walks briskly) on everything from the past fifteen years:

boe13$ cat ../../src/hello/hello.c

/*********************************************************************
 * The Open Group Base Specifications Issue 6
 * IEEE Std 1003.1, 2004 Edition
 *********************************************************************/
#define _XOPEN_SOURCE 600

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    printf ("argc = %i\n", argc );
    printf ("argv[0] = %s\n", argv[0]);
    return (EXIT_SUCCESS);
}

boe13$ 
boe13$ uname -a 
Linux boe13.genunix.com 3.10.0-693.el7.x86_64 #1 SMP Thu Jul 6 19:56:57 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
boe13$ cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.4 (Maipo)
boe13$ 
boe13$ $CC $CFLAGS $CPPFLAGS $RESTRICT_FLAGS -o /tmp/foo ../../src/hello/hello.c
boe13$ 
boe13$ /tmp/foo
argc = 1
argv[0] = /tmp/foo
boe13$ 
boe13$ file /tmp/foo
/tmp/foo: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=5f50a618fb321bb509de87d2956fd28539f20747, not stripped
boe13$ 
boe13$ echo $CFLAGS 
 -std=iso9899:1999 -m64 -O0 -g -fno-fast-math -fno-builtin -march=k8 -mtune=k8 -Wall -pedantic -Wextra -pedantic-errors -malign-double -mpc80 -O0
boe13$ 
boe13$ echo $RESTRICT_FLAGS

-mno-mmx -mno-sse -mno-sse2 -mno-sse3 -mno-ssse3 -mno-sse4 -mno-sse4a -mno-sse4.1 -mno-sse4.2 -mno-avx -mno-avx2 -mno-avx512f -mno-avx512pf -mno-avx512er -mno-avx512cd -mno-avx512vl -mno-avx512bw -mno-avx512dq -mno-avx512ifma -mno-avx512vbmi -mno-sha -mno-aes -mno-pclmul -mno-clflushopt -mno-clwb -mno-fsgsbase -mno-rdrnd -mno-f16c -mno-fma -mno-pconfig -mno-wbnoinvd -mno-fma4 -mno-prfchw -mno-rdpid -mno-prefetchwt1 -mno-rdseed -mno-sgx -mno-xop -mno-lwp -mno-3dnow -mno-3dnowa -mno-popcnt -mno-abm -mno-adx -mno-bmi -mno-bmi2 -mno-lzcnt -mno-fxsr -mno-xsave -mno-xsaveopt -mno-xsavec -mno-xsaves -mno-rtm -mno-hle -mno-tbm -mno-mpx -mno-mwaitx -mno-clzero -mno-pku -mno-avx512vbmi2 -mno-gfni -mno-vaes -mno-vpclmulqdq -mno-avx512bitalg -mno-movdiri -mno-movdir64b -mno-avx512vpopcntdq -mno-avx5124fmaps -mno-avx512vnni -mno-avx5124vnniw
boe13$

That is offensive and it works. However I doubt that the resultant binary has much to say about performance.

At the very least one should be able to look at that diff list I came up with in bug 1605369:

3dnowprefetch art avx avx2 bmi1 bmi2 clflushopt epb 
f16c fma hwp hwp_act_window hwp_epp hwp_notify 
ibpb ibrs ida intel_pt rdseed smx stibp 
tsc_known_freq xgetbv1 xsavec xsaves
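A diff list like the one above can be reproduced from two "flags" lines copied out of /proc/cpuinfo. The following is a small sketch under that assumption; has_flag and print_missing_flags are hypothetical helper names, and the flag strings in the usage note are shortened illustrations rather than real machine output.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if the whitespace-separated list `flags` contains the
 * n-character token starting at `name` as a whole word. */
static int has_flag(const char *flags, const char *name, size_t n)
{
    const char *p = flags;
    while (*p) {
        size_t len;
        p += strspn(p, " \t\n");   /* skip separators */
        len = strcspn(p, " \t\n"); /* length of the next token */
        if (len == n && memcmp(p, name, n) == 0)
            return 1;
        p += len;
    }
    return 0;
}

/* Print every flag present in `newer` but absent from `older`,
 * one per line, and return how many were printed. */
int print_missing_flags(const char *older, const char *newer, FILE *out)
{
    int count = 0;
    const char *p = newer;
    assert(older != NULL && newer != NULL && out != NULL);
    while (*p) {
        size_t len;
        p += strspn(p, " \t\n");
        len = strcspn(p, " \t\n");
        if (len > 0 && !has_flag(older, p, len)) {
            fprintf(out, "%.*s\n", (int)len, p);
            count++;
        }
        p += len;
    }
    return count;
}
```

Calling print_missing_flags("sse4_1 sse4_2 popcnt", "sse4_1 sse4_2 popcnt avx avx2", stdout) prints avx and avx2 and returns 2. Matching is on whole tokens, so avx is not mistaken for avx2.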

Perhaps just the avx and avx2 could be "-mno-" prefixed into the CFLAGS.

At the same time, it would be a fun experiment to build for an Opteron k8 target with nearly every extension disabled. I will take a look at the LLVM/Clang side of life on FreeBSD and also Debian just to see what I can find there.

Dennis

Has Regression Range: --- → yes
Keywords: regression
Severity: normal → S3