Last Comment Bug 805604 - Efficient AES-GCM implementation that uses Intel's AES and PCLMULQDQ instructions (AES-NI) and the Advanced Vector Extension (AVX) architecture.
: Efficient AES-GCM implementation that uses Intel's AES and PCLMULQDQ instruct...
Status: RESOLVED FIXED
:
Product: NSS
Classification: Components
Component: Libraries (show other bugs)
: 3.14
: x86_64 Linux
: P2 enhancement (vote)
: 3.14.2
Assigned To: Shay Gueron
:
Mentors:
Depends on:
Blocks: 835050
  Show dependency treegraph
 
Reported: 2012-10-25 13:48 PDT by Wan-Teh Chang
Modified: 2013-03-03 00:06 PST (History)
8 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
Proposed patch by Shay Gueron (47.79 KB, patch)
2012-10-25 13:48 PDT, Wan-Teh Chang
rrelyea: review-
Details | Diff | Review
Efficient AES-GCM implementation that uses Intel's AES and PCLMULQDQ instructions (AES-NI) and the Advanced Vector Extension (AVX) architecture --- Rev. 2 (49.74 KB, patch)
2012-11-29 10:41 PST, Shay Gueron
rrelyea: review+
Details | Diff | Review
GCM patch as checked in. (47.72 KB, patch)
2013-01-14 18:35 PST, Robert Relyea
no flags Details | Diff | Review
nspr configure test for AVX support (1.24 KB, patch)
2013-01-15 12:47 PST, Kai Engert (:kaie)
no flags Details | Diff | Review
bustage fix: On Linux, only optimize if installed assembler is >= 2.19 (3.45 KB, patch)
2013-01-15 15:20 PST, Kai Engert (:kaie)
rrelyea: review-
Details | Diff | Review
Work around the problem with Clang's integrated assembler (bug 835050) (1008 bytes, patch)
2013-01-26 16:36 PST, Wan-Teh Chang
kaie: review+
rrelyea: superreview+
wtc: feedback? (shay.gueron)
wtc: checked‑in+
Details | Diff | Review
Use a GNU make feature to add an extra compiler or assembler flag to just one source file (2.20 KB, patch)
2013-01-28 12:02 PST, Wan-Teh Chang
kaie: review+
rrelyea: superreview+
wtc: checked‑in+
Details | Diff | Review
license patch [checked in] (2.70 KB, patch)
2013-01-31 11:34 PST, Robert Relyea
no flags Details | Diff | Review

Description Wan-Teh Chang 2012-10-25 13:48:57 PDT
Created attachment 675278 [details] [diff] [review]
Proposed patch by Shay Gueron

This bug report was split off from bug 373108. The attached patch was originally
submitted by Shay Gueron of Intel in bug 373108 comment 55, with the following
description:

Hello all - 

This patch offers an efficient implementations of AES-GCM. 
The implementation uses Intel's AES and PCLMULQDQ instructions (AES-NI), and is
designed for the current (and future) Intel Core Processors, with the AVX 
instruction set (the 2nd, the 3rd, and the (future) 4th Generation Intel Core).

The algorithms and methods that underlie this code are detailed in references 
[1-4]: 

[1] Shay Gueron, Michael E. Kounavis: Intel® Carry-Less Multiplication 
    Instruction and its Usage for Computing the GCM Mode (Rev. 2.01)
    http://software.intel.com/sites/default/files/article/165685/clmul-wp-rev-2.01-2012-09-21.pdf
[2] S. Gueron, M. E. Kounavis: Efficient Implementation of the Galois Counter
    Mode Using a Carry-less Multiplier and a Fast Reduction Algorithm. 
    Information Processing Letters 110: 549–553 (2010).
[3] S. Gueron: AES Performance on the 2nd Generation Intel® Core™ Processor
    Family (to be posted) (2012).
[4] S. Gueron: Fast GHASH computations for speeding up AES-GCM (to be 
    published) (2012).
     

The patch assumes NSS version 3.14 RC0 

The performance 
===============

The performance of this patch was measured by using the built-in bltest utility
as follows:
bltest -E -m aes_gcm -p 10000 -o ct.txt -g 16 -b 16384 
(measuring authenticated encryption performance for a 16KB buffer, repeated 
10000 times)


                                          NSS 3.14 RC0   This patch   Speedup
----------------------------------------
Performance in GB/sec:
----------------------------------------
Core i7-2600K @3.4GHz ("Sandy Bridge")    0.057 GB/sec    1.17 GB/sec  20.5x
Core i7-3770 @3.4GHz  ("Ivy Bridge")      0.059 GB/sec    1.19 GB/sec  20.2x
----------------------------------------
Performance in Cycles per byte (C/B):
----------------------------------------
Core i7-2600K @3.4GHz ("Sandy Bridge")      55.42 C/B      2.70 C/B    20.5x
Core i7-3770 @3.4GHz  ("Ivy Bridge")        53.67 C/B      2.66 C/B    20.2x
----------------------------------------


Comments:
=========

This implementation is designed for the Intel(R) Core(TM) processors that 
support AES and PCLMULQDQ instructions, as well as the AVX instructions set.
(It also works, with high performance, on AMD Bulldozer processor that has 
these instructions).
Its AES-GCM performance is the fastest TLS authenticated encryption combination 
that is provided by the NSS library.

The patch integrates seamlessly with the existing NSS implementation: it 
selects the appropriate code path by checking the CPUID bits to detect AES-NI, 
PCLMULQDQ, and AVX support. For processors that do not support these 
instructions, the implementation resorts to the original GCM code.

The code was tested on NSS version 3.14 RC0.

Thanks to Wan-Teh Chang, Bob Relyea, Brian Smith, and Eric Rescorla for 
preliminary discussions on the patch and its integration. 


Developers and authors:
************************************************************************
Shay Gueron (1, 2), and Vlad Krasnov (1)

(1) Intel Corporation, Israel Development Center, Haifa, Israel
(2) University of Haifa, Israel
************************************************************************
Copyright(c) 2012, Intel Corp.
Comment 1 Robert Relyea 2012-11-06 15:57:48 PST
OK, I'm still the in the process of reviewing the actual code, but I do have some review comments, and lots of questions, plus there are a few things that would require a new patch, so I thought I'd get my comments and questions out right now. I currently only have partial reviews of the code under USE_HW_AES in gcm.c and intel-gcm.s.

---------------------------------------------------------------------------------------------

gcm.c

First: The patch includes some changes that will back out changes wtc has made to the trunk of NSS. These obviously need to be reverted. They all the changes to gcm.c not included inside the #ifdef USE_HW_AES ifdefs.

Now some general style comments:
Pretty much everything inside #ifdef USE_HW_AES are both self-contained, and specific to both AES and Intel HW, so it think it makes sense for them to be in their oen .c file (intel-aes-gcm.c or something like that). It also makes since to move the new function declarations for them from gcm.h to intel-gcm.h. The upshot is the new patch probably shouldn't have any changes to gcm.c (though I won't preclude it). The remaining comments for for the #ifdef USE_HW_AES portion of gcm.c, which I'm recommending to go into a new file.

Minor nit. We probably should add AES to the names of these functions since they are all AES specific (that is if we ever supported, say RC-5 GCM, we wouldn't be able to use these functions).

Minor nit. There are a lot of places where we use the magic number '16' in the code. I think it's fine, just as long as we have a comment that says that the blocksize is 16 to explain it. Replacing the magic number with a AES_BLOCKSIZE #define is fine as well.

NIT: We have a lot of places where an intermediate variable would add to readability,, like

   if (gcmParams->ulIvLen%16) {
       for (j=0; j < gcmParams->ulIvLen%16; j++) ..

I'm pretty sure the compiler will notice we did the same calculation and collapse it, but I just feel more comfortable adding a temp:
    unsigned in remainder = gcmParams->ulIvLen%16;
        .
        .
    if (remainder) {
       for (j=0; j < remainder; j++ )...
Be sure to keep 'C' style declarations rather than C++ style here.

NIT: There are a lot of memory copyies with for loops rather than PORT_Memcpy (which NSS #defines to the OS memcpy, which is often a compilier intrinsic). PORT_Memcpy should be both faster and more readable than a for loop copying byte at a time.

NIT: same comment as above except replace PORT_Memcpy with PORT_Memset when setting memory.

Bug: intel_gcmTAG returns the whole tag, which is fine, but intel_GCM_EncryptUpdate only handles truncating the tag to a byte boundary rather than a bit boundary (see gcm_GetTag in gcm.c for an example). Same issue in intel_GCM_DecryptUpdate, except there is less of an issue because we are comparing rather than returning, so missing the last 1-7 bits in the compare would only lead to a very rare false positive.

I still haven't reviewed the _mm_ compiler intrinsics, but other than that the rest of the c code is fine.
--------------------------------------------------------------------------------------------

rijndael.c

Only one style issue: at line 1005 in the existing code, we use the cpuid command to find out if aes_hw is available. The commands fills in a static global that is only visible to aes_InitContext. I think it would be better to just make that a file wide static global, and fill in all the globals from that one cpuid command rather than replicating the code again in AES_InitContext() (actually, it may make more since to move both tests to AES_InitContext and make sure they are ran before aes_InitContext(). In anycase it should be realtively easy to make this call once for both functions and set all the cpuid variables we need.

Minor NIT: the new block needs to be indented to match the standard indent for the case statement.

The rest of the rijndael.c changes are fine,

-------------------------------------------------------------------------------------

The Makefile changes are fine.

-------------------------------------------------------------------------------------

intel-aes.s

I've barely scratched the surface here, and I have lots of questions:

First there are lots of use of 'v' instructions (vmovdqu versus movdqu, vpbufb versus pbufb, etc.) I'm not clear what the real reason for the 'v' version versus the non-'v' version. I know in some cases the 'v' version can handle 256 bit registers (ymm0 versus xmm0), but I don't think we are using any here 256 bit registers in this code. I also know some 'v' versions have non-destructive implementations of the functions (A op B put into C rather than A op B put into B), but I see a lot of use of 'v' instructions where the 2nd and 3rd operand are the same (so they function the same as the non-'v' instructions. Is there some sort of intrinsic performance advantage to the 'v' instructions?

in intel_gcmTag, we use the 'u' (unaligned, verses 'a' or aligned) form of the move instructions. The data we operate on are elements in our context, so we could arrange for those elements to be properly aligned. If we did so, would there be a performance with using the 'a' form for these?

in GFMUL, it looks like we use the pclmulqdq instruction rather than shifts to do our mod reduction. I was unable to find an example of this in your paper, or a description for how it works. I'm working on a commented version of this to help explain how it works in case someone needs to modify it. Your paper, BTW, was extremely useful in helping me understand how the multiply function worked.

I still have the bulk of the .s file to review.
Comment 2 Shay Gueron 2012-11-09 20:45:34 PST
Bob, 
Thanks for the detailed review. Here are some comments and answers. 

>>> gcm.c
>>> First: The patch includes some changes that will back out changes wtc has 
>>> made to the trunk of NSS. 

The patch was built on the latest available version (as declared). 
But - no problem to build it on top of another version, and not to override changes made by Wan-Teh. 

The nits: no problem to modify. 

>>> Bug: intel_gcmTAG returns the whole tag, which is fine, but 
>>> intel_GCM_EncryptUpdate only 
>>> handles truncating the tag to a byte boundary rather than a bit boundary 
>>> (see gcm_GetTag in gcm.c for an example). 
>>> Same issue in intel_GCM_DecryptUpdate, except there is less of an issue because >>> we are 
>>> comparing rather than returning, so missing the last 1-7 bits in the compare would 
>>> only lead to a very rare false positive.

No - this is not a bug. The NIST spec (SP 800-38D) states on page 9 that 

"The bit length of the tag, denoted t, is a security parameter, as discussed in Appendix B. In general, t may be any one of the following five values: 128, 120, 112, 104, or 96. For certain applications, t may be 64 or 32..." 

So, truncation the tag to a bit-length that is not divisible by 8 is not adhering to the spec, and thus does not need to be supported (as is the case in the discussed patch). Furthermore, on that page of the spec it is stated: 

"An implementation shall not support values for t that are different from the seven choices in the preceding paragraph."

In this sense, I should even go more strictly and not allow truncation to “any” byte-length (only to the 7 allowed caes). 

I believe that the NSS implementation should do this as well. 

>>> rijndael.c
Style and Nits: no problem to modify 


>>> intel-aes.s
>>> I've barely scratched the surface here, and I have lots of questions:
>>> First there are lots of use of 'v' instructions 
>>> (vmovdqu versus movdqu, vpbufb versus pbufb, 

The AVX instructions are used for their non-destrcutive destination property: In the code, there are many instructions where the 2nd and 3rd operand are NOT the same, which saves moves, and greatly improves performance (by ~10%.). For the other cases, where the second and third operands are the same: there is a performance penalty associated with mixing SSE and AVX instructions, and to avoid this - all of them are coded as AVX. 

>>> in intel_gcmTag, we use the 'u' (unaligned, verses 'a' or aligned) form of 
>>> the move instructions. 

There is no difference in the performance of aligned and unalighned moves. 

>>> in GFMUL, it looks like we use the pclmulqdq instruction rather than shifts to 
>>> do our mod reduction. 
>>> I was unable to find an example of this in your paper, or a description for 
>>> how it works. I'm working on a commented version of this to help explain how it >>> works in case 
>>> someone needs to modify it. Your paper, BTW, was extremely useful in helping 
>>> me understand how the multiply function worked.

This is a new reduction method, and will be described and proved in my coming paper (Ref. [4] cited in the patch body). It uses a very short sequence of instructions and its efficiency increases with the performance of the pclmuqdq instruction.
Comment 3 Robert Relyea 2012-11-12 12:04:47 PST
No - this is not a bug. The NIST spec (SP 800-38D) states on page 9 that 

"The bit length of the tag, denoted t, is a security parameter, as discussed in Appendix B. In general, t may be any one of the following five values: 128, 120, 112, 104, or 96. For certain applications, t may be 64 or 32..." 

> So, truncation the tag to a bit-length that is not divisible by 8 is not adhering to the spec,
> and thus does not need to be supported (as is the case in the discussed patch). Furthermore,
> on that page of the spec it is stated: 
>
> "An implementation shall not support values for t that are different from the seven choices
> in the preceding paragraph."
> 
> In this sense, I should even go more strictly and not allow truncation to “any” byte-length
> (only to the 7 allowed caes). 

We're implementing to the PKCS #11 v2.30 draft 7. The draft specified ulTagbits to be 0 to 128 bits. Unfortunately the reference to the GCM description is [GCM]. It's not clear if it should McGrew/Viega or the NIST document. If it's the NIST document we should just disallow the unsupported lengths.

> 
> I believe that the NSS implementation should do this as well. 

The difference is why I flagged the issue. We shouldn't have different behavior between the NSS generic GCM and the accelerated-version.

bob
Comment 4 Shay Gueron 2012-11-12 13:40:04 PST
Indeed, there is a discrepancy between the NIST spec [NIST] and the PKCS spec [PKCS#11]. 

Here is my analysis of the situation: 

1. The reference to GCM in [PKCS#11] points to McGrew/Viega paper from 2004. This paper was a (initial) proposal. 
2. While NIST was soliciting comments, it was pointed out that truncating the MAC tag to a short tag has serious security implications [Ferguson]. 
3. This comment [Ferguson] is cited in the NIST spec [NIST] from 2009. While the restriction of the truncation to only 7 sizes is not motivated in [NIST], I assume that it is affected by the attack by [Ferguson]. 
4. Allowing general truncation, to any bit length from 0 to 128, would include lengths that are proven to be undesired. 

I therefore tend to suggest converging to the NIST spec and support truncation only to the 7 allowed values 128, 120, 112, 104, 96, 64, 32.

Of course, I agree that whatever is decided should be consistent in both NSS implementations. 

Shay 

[NIST] NIST Special Publication 800-38 (2007)

[PKCS#11] PKCS #11 v2.30 draft 7 (2009)

[Ferguson] Ferguson, N., Authentication Weaknesses in GCM, Natl. Inst. Stand. Technol. [Web page], 
http://csrc.nist.gov/groups/ST/toolkit/BCM/documents/comments/CWC-GCM/Ferguson2.pdf
(2005)
Comment 5 Robert Relyea 2012-11-12 17:15:52 PST
Let's just go with the truncation. I'll add a separate patch to make sure the args are full bytes. That is the simplest code, and most likely configuration, given the NIST recommendation.

bob
Comment 6 Robert Relyea 2012-11-13 17:02:39 PST
Comment on attachment 675278 [details] [diff] [review]
Proposed patch by Shay Gueron

r-

Shay is already working on a new patch to address the comments I already made in this bug, as well as the following I've sent him:

I only found one bug and one potential problem in the .s file....

The potential problem is the vpxor (PT), TMP1, TMP0 (at line 888 in intel_gcmENC near the .LLast 4 label) you have a comment that says "We assume that although the last block is partial, we can still read the whole block". The potential problem is this buffer comes all the way from the user. It may be the case that the user has passed us a buffer that is right at the end of the page. This is likely to be extremely rare, which means when we hit it we will get unexpected crashes that could be difficult to debug. The write to (CT) is fine because, as you said in your comment, we have to have space for the authentication block, so we know CT can hold at least another 128 bits (16 bytes) of additional data. I think we need to bit the bullet and handle loading the right number of bytes of (PT) before the xor.

The bug is the converse case in intel_gcmDEC (line 1257 near .LDECLast3). Here the vpxor (CT) is fine, since we know that CT includes the authdata. The problem here is we can't write past the end of PT. In some cases the buffer that is passed to us is exactly what is allocated. Writing past the end could trash the memory allocator data structures. The other case that there is a problem is when CT and PT are the same (It is permissible to pass the same pointer in). In this case we will trash our authentication data before we actually use it.

The rest of the .s file looks fine to me.

I also asked him for some specific comments in the .s file to guide future reviewers/modifiers of the code.

bob
Comment 7 Shay Gueron 2012-11-29 10:41:26 PST
Created attachment 686679 [details] [diff] [review]
Efficient AES-GCM implementation that uses Intel's AES and PCLMULQDQ instructions (AES-NI) and the Advanced Vector Extension (AVX) architecture --- Rev. 2

Hello everyone – 
Here is Rev. 2 of the patch, taking in all the comments from Bob’s. 
The changes (as requested):

1.	Patch applied to trunk (HEAD).
2.	Moved all the definitions to intel-gcm.h
3.	Moved all C code to a new file intel-gcm-wrap.c
4.	Changed the names of the functions to aes_gcm (to avoid confusion)
5.	Replaced 16 with AES_BLOCK_SIZE
6.	Added intermediate variables to improve readability
7.	Now use PORT_memcpy and PORT_memset, instead loops
8.	NOT CHANGED: Tag returned at byte granularity, like it was 
     Note: our discussions led to this decision and the “non AES-NI”  NSS AES-GCM code will patch and behave similarly
9.	rinjdael.c:  The CPU id check is now global, and is performed once
10.	Indents fixed
11.	Partial blocks are now copied at the exact boundary of the buffer, do not assume there is more space
12.	Regarding the –msse4 flag. Now using–mssse3 flag.  

Please review this version an see if it answers all the concerns.
Thanks, Shay
Comment 8 Robert Relyea 2012-11-29 17:26:00 PST
Comment on attachment 686679 [details] [diff] [review]
Efficient AES-GCM implementation that uses Intel's AES and PCLMULQDQ instructions (AES-NI) and the Advanced Vector Extension (AVX) architecture --- Rev. 2

Make sure it gets in my review queue.
Comment 9 Robert Relyea 2013-01-09 17:11:03 PST
Comment on attachment 686679 [details] [diff] [review]
Efficient AES-GCM implementation that uses Intel's AES and PCLMULQDQ instructions (AES-NI) and the Advanced Vector Extension (AVX) architecture --- Rev. 2

r+ rrelyea...
I'm going to r+ this but there still is a caveat.  We can't add the -msse3 in general either. We can add it specifically for the intel-aes.s file because we don't call that .o unless we've already veried we have sse3 (or actuall sse4).

bob
Comment 10 Robert Relyea 2013-01-14 18:35:28 PST
Created attachment 702114 [details] [diff] [review]
GCM patch as checked in.

The changes to the r+ patch were:

1) fixed incorrect addressing of a fixed buffer which lead to wrong data being hashed in intel-gcm-wrap.c
2) add missing jmp statements at the bottom of assembler loops in intel-gcm.s
3) gcm_decrypt was hashing the decrypted data not the encrypted data when dealing with the partial block. That is fixed.
Comment 11 Robert Relyea 2013-01-14 18:36:32 PST
Checking in Makefile;
/cvsroot/mozilla/security/nss/lib/freebl/Makefile,v  <--  Makefile
new revision: 1.124; previous revision: 1.123
done
RCS file: /cvsroot/mozilla/security/nss/lib/freebl/intel-gcm.h,v
done
Checking in intel-gcm.h;
/cvsroot/mozilla/security/nss/lib/freebl/intel-gcm.h,v  <--  intel-gcm.h
initial revision: 1.1
done
RCS file: /cvsroot/mozilla/security/nss/lib/freebl/intel-gcm.s,v
done
Checking in intel-gcm.s;
/cvsroot/mozilla/security/nss/lib/freebl/intel-gcm.s,v  <--  intel-gcm.s
initial revision: 1.1
done
RCS file: /cvsroot/mozilla/security/nss/lib/freebl/intel-gcm-wrap.c,v
done
Checking in intel-gcm-wrap.c;
/cvsroot/mozilla/security/nss/lib/freebl/intel-gcm-wrap.c,v  <--  intel-gcm-wrap.c
initial revision: 1.1
done
Checking in manifest.mn;
/cvsroot/mozilla/security/nss/lib/freebl/manifest.mn,v  <--  manifest.mn
new revision: 1.66; previous revision: 1.65
done
Checking in rijndael.c;
/cvsroot/mozilla/security/nss/lib/freebl/rijndael.c,v  <--  rijndael.c
new revision: 1.29; previous revision: 1.28
done
Comment 12 Kai Engert (:kaie) 2013-01-15 08:16:01 PST
The patch fails to build on Linux 64 bit, but only on RHEL 5.8.

32 bit works on the same machine.
64 bit works on other Linux machines.

The error log begins with the following, but shows a lot more of similar messages:

gcc -o Linux2.6_x86_64_glibc_PTH_64_DBG.OBJ/Linux_SINGLE_SHLIB/intel-gcm.o -g -D_POSIX_SOURCE -D_BSD_SOURCE -D_XOPEN_SOURCE -fPIC -DLINUX2_1  -Wall -Werror-implicit-function-declaration -Wno-switch -pipe -DLINUX -Dlinux -DHAVE_STRERROR -DXP_UNIX -DSHLIB_SUFFIX=\"so\" -DSHLIB_PREFIX=\"lib\" -DSHLIB_VERSION=\"3\" -DSOFTOKEN_SHLIB_VERSION=\"3\" -DRIJNDAEL_INCLUDE_TABLES -DDEBUG -UNDEBUG -DDEBUG_tinderbox -D_REENTRANT -DNSS_ENABLE_ECC -DNSS_ECC_MORE_THAN_SUITE_B -DUSE_UTIL_DIRECTLY -DNSS_USE_64 -DFREEBL_NO_DEPEND -DNSS_X86_OR_X64 -DNSS_X64 -DNSS_BEVAND_ARCFOUR -DMPI_AMD64 -DMP_ASSEMBLY_MULTIPLY -DNSS_USE_COMBA -DMP_CHAR_STORE_SLOW -DMP_IS_LITTLE_ENDIAN -DUSE_HW_AES -DMP_API_COMPATIBLE -I../../../../dist/Linux2.6_x86_64_glibc_PTH_64_DBG.OBJ/include -I../../../../dist/public/nss -I../../../../dist/private/nss -Impi -Iecl  -march=opteron -m64 -fPIC -Wa,--noexecstack -c intel-gcm.s
intel-gcm.s: Assembler messages:
intel-gcm.s:34: Error: no such instruction: `vmovdqu (Tp),T'
intel-gcm.s:35: Error: no such instruction: `vpshufb .Lbswap_mask(%rip),T,T'
intel-gcm.s:36: Error: no such instruction: `vpxor TMP0,TMP0,TMP0'
intel-gcm.s:39: Error: no such instruction: `vpinsrq $0,Mlen,TMP0,TMP0'
intel-gcm.s:40: Error: no such instruction: `vpinsrq $1,Alen,TMP0,TMP0'
intel-gcm.s:41: Error: no such instruction: `vpxor TMP0,T,T'
intel-gcm.s:42: Error: no such instruction: `vmovdqu (Htbl),TMP0'
...

Full log:
http://tinderbox.mozilla.org/showlog.cgi?log=NSS/1358220910.1358220998.17222.gz&fulltext=1
Comment 13 Kai Engert (:kaie) 2013-01-15 08:19:25 PST
[root@buildnss01 ~]# gcc -v
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-libgcj-multifile --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --disable-plugin --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic --host=x86_64-redhat-linux
Thread model: posix
gcc version 4.1.2 20080704 (Red Hat 4.1.2-52)

[root@buildnss01 ~]# as --version
GNU assembler 2.17.50.0.6-20.el5_8.3 20061020
Copyright 2005 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License.  This program has absolutely no warranty.
This assembler was configured for a target of `x86_64-redhat-linux'.
Comment 14 Robert Relyea 2013-01-15 10:12:10 PST
Hmm... do we care? We don't build this version of freebl on RHEL-5. If we need tinderbox to work, let's try installing a newer assembler (I'm running 2.30). Can you install http://download.devel.redhat.com/brewroot/packages/binutils/2.20.51.0.2/5.36.el6/x86_64/binutils-2.20.51.0.2-5.36.el6.x86_64.rpm on tinderbox to keep our tinderboxen running..

bob
Comment 15 Kai Engert (:kaie) 2013-01-15 10:26:12 PST
I'd prefer a solution that enables or disables the use of these new optimizations based on the available build toolchain.

Not being able to build the latest upstream code on a supported RHEL branch seems unfortunate. Can we really be sure that we'll never upgrade RHEL 5.x to this code?

(No, RHEL 5 and RHEL 6 use different package signatures, we cannot simply install .el6 packages into an .el5 system, at least we'd have to rebuild. But will it really work to replace that core part of the system without breaking the environment? I'm worried and would rather not do that, in order to keep a sane RHEL 5.x testing environment.)
Comment 16 Ryan Sleevi 2013-01-15 11:00:35 PST
As a somewhat related note, given that by using GAS syntax, we also lose any ability to use the unrolled versions on Windows, would it be worthwhile considering using yasm for a single unified syntax and for x86 & x64 - which also matches the MozillaBuild system of requiring a minimum version of yasm?
Comment 17 Kai Engert (:kaie) 2013-01-15 11:12:38 PST
In my opinion, a new test should be added to NSPR's configure (the only place where we appear to test for toolchain capabilities), and compile this minimal file:

  .set  Tp, %rsi
  .set T,%xmm0
    vmovdqu  (Tp), T

and regarding of success/failure of the assembler, define a variable that will enable or disable the new code.


I propose to reopen this bug and back out the patch, until the code has been changed to work everywhere out of the box.
Comment 18 Brian Smith (:briansmith, :bsmith, use NEEDINFO?) 2013-01-15 11:34:08 PST
(In reply to Ryan Sleevi from comment #16)
> As a somewhat related note, given that by using GAS syntax, we also lose any
> ability to use the unrolled versions on Windows, would it be worthwhile
> considering using yasm for a single unified syntax and for x86 & x64 - which
> also matches the MozillaBuild system of requiring a minimum version of yasm?

If using yasm would result in the asm code being portable across Windows (MSVC), Linux, and Mac builds, then I definitely support switching to YASM.

(In reply to Kai Engert (:kaie) from comment #17)
> In my opinion, a new test should be added to NSPR's configure 
> and regarding of success/failure of the assembler, define a variable that
> will enable or disable the new code.

I would rather that the build fails when we don't have all the required tools installed, instead of having the code silently perform worse. Also, I think Wan-Teh we should keep NSS-specific stuff out of NSPR's build system, unless/until we decide to unify the two build systems into one.
Comment 19 Kai Engert (:kaie) 2013-01-15 12:03:46 PST
(In reply to Brian Smith (:bsmith) from comment #18)
> I would rather that the build fails when we don't have all the required
> tools installed, instead of having the code silently perform worse.

Your proposal is contrary to the classic idea of having portable code, that is supposed to work everyhwere. A new optimization mustn't mean the end of support for older platforms.


>  Also, I
> think Wan-Teh we should keep NSS-specific stuff out of NSPR's build system,
> unless/until we decide to unify the two build systems into one.

Testing for environment capabilities, which includes running small test compilations, is typically done in a "configure" script. I'm not aware of such scripts inside NSS, that's why I proposed to do that in NSPR.

I don't see how a generic test for "have AVX assembler instructions" would hurt if it's contained in NSPR.
Comment 20 Kai Engert (:kaie) 2013-01-15 12:47:56 PST
Created attachment 702474 [details] [diff] [review]
nspr configure test for AVX support
Comment 21 Kai Engert (:kaie) 2013-01-15 14:39:42 PST
Comment on attachment 702474 [details] [diff] [review]
nspr configure test for AVX support

This approach doesn't seem to help.

I had assumed that we're able to set variables during NSPR configure, and reuse such variables in NSS makefiles. But it seems I'm wrong.

I found a changelog file for the Gnu assembler, which says that support for AVX has been added in version 2.19

I propose that the NSS makefile get enhanced to check for that version number, and it the available AS is older, then disable the use of the new optimization.
Comment 22 Robert Relyea 2013-01-15 15:11:25 PST
> Can we really be sure that we'll never upgrade RHEL 5.x to this code?

I am *extremely* sure. We are at softoken 3.11.9, and the rules for moving RHEL 5 forward is getting more locked down after 5.9. It's remotely possible that it may move to 3.12.9, but to move to softoken 3.14, we would have to wait until another FIPS validation.

> (No, RHEL 5 and RHEL 6 use different package signatures, we cannot simply install .el6
> packages into an .el5 system, at least we'd have to rebuild)

It's both worse than that and better than that. el6 uses SHA256 as the cpio hash rather than md5, so even the source rpm doesn't work. Also the .el6 depends on a newer glibc, so the .el6 package is out.

Fortunately there is a RHEL 5 package that contains the newer as: binutils220. I've already installed it on the rhel5 box.
Comment 23 Robert Relyea 2013-01-15 15:12:50 PST
> I propose that the NSS makefile get enhanced to check for that version number, and it the 
> available AS is older, then disable the use of the new optimization.

I don't think that's necessary since we have a newer as that's officially supported on RHEL-5.

bob
Comment 24 Kai Engert (:kaie) 2013-01-15 15:20:52 PST
Created attachment 702549 [details] [diff] [review]
bustage fix: On Linux, only optimize if installed assembler is >= 2.19

patch tested on the failing machine and on a working machine
Comment 25 Robert Relyea 2013-01-15 17:06:07 PST
are you using buildnss01 for testing? If so, please note that the assembler has been updated on that platform per comment 22.
Comment 26 Robert Relyea 2013-01-15 17:13:20 PST
Comment on attachment 702549 [details] [diff] [review]
bustage fix: On Linux, only optimize if installed assembler is >= 2.19

r- This patch turns off AVX always.

You need a -DHAVE_AVX on the command line or the compile steps won't add the necessary code.

bob
Comment 27 Wan-Teh Chang 2013-01-26 16:36:05 PST
Created attachment 706820 [details] [diff] [review]
Work around the problem with Clang's integrated assembler (bug 835050)

Shay, Bob: the integrated assembler in Clang cannot handle how
intel-gcm.s uses the .set directive to give symbolic names to registers.
See bug 835050. I am going to work around the problem with this patch.
Let me know if you have other ideas. Thanks.

Patch checked in on the NSS trunk (NSS 3.14.2).

Checking in Makefile;
/cvsroot/mozilla/security/nss/lib/freebl/Makefile,v  <--  Makefile
new revision: 1.125; previous revision: 1.124
done
Comment 28 Robert Relyea 2013-01-28 10:44:19 PST
Comment on attachment 706820 [details] [diff] [review]
Work around the problem with Clang's integrated assembler (bug 835050)

r+ rrelyea
Comment 29 Wan-Teh Chang 2013-01-28 12:02:15 PST
Created attachment 707229 [details] [diff] [review]
Use a GNU make feature to add an extra compiler or assembler flag to just one source file

In bug 835050 comment 9, Jan Beich suggested a succinct way to add an
extra compiler or assembler flag to just one source file. I verified
that technique works. I am using GNU Make 3.81.

Since I am quite familiar with GNU Make but I didn't know about this
trick, I suspect it is a new GNU Make feature.

This patch also fixes two minor issues:
- Add or remove white space in variable definitions.
- Fix a typo (-msse4 vs. -mssse3) in a comment.
Comment 30 Kai Engert (:kaie) 2013-01-30 07:21:59 PST
Comment on attachment 707229 [details] [diff] [review]
Use a GNU make feature to add an extra compiler or assembler flag to just one source file

r=kaie
Comment 31 Wan-Teh Chang 2013-01-30 11:55:34 PST
Comment on attachment 707229 [details] [diff] [review]
Use a GNU make feature to add an extra compiler or assembler flag to just one source file

The GNU Make feature I used in this patch is called target-specific
variable values. I found it documented in the GNU Make 3.77 manual:
ftp://ftp.gnu.org/old-gnu/Manuals/make-3.77/html_node/make_69.html#SEC68

GNU Make 3.77 was released in May 1998. So it is safe to use this
feature.

Patch checked in on the NSS trunk (NSS 3.14.2).

Checking in Makefile;
/cvsroot/mozilla/security/nss/lib/freebl/Makefile,v  <--  Makefile
new revision: 1.126; previous revision: 1.125
done
Comment 32 Robert Relyea 2013-01-31 11:33:09 PST
Checking to update the license header...

/cvsroot/mozilla/security/nss/lib/freebl/intel-gcm-wrap.c,v  <--  intel-gcm-wrap.c
new revision: 1.2; previous revision: 1.1
done
Checking in intel-gcm.h;
/cvsroot/mozilla/security/nss/lib/freebl/intel-gcm.h,v  <--  intel-gcm.h
new revision: 1.2; previous revision: 1.1
done
Checking in intel-gcm.s;
/cvsroot/mozilla/security/nss/lib/freebl/intel-gcm.s,v  <--  intel-gcm.s
new revision: 1.2; previous revision: 1.1
done
Comment 33 Robert Relyea 2013-01-31 11:34:08 PST
Created attachment 708704 [details] [diff] [review]
license patch [checked in]
Comment 34 Robert Relyea 2013-02-05 15:38:51 PST
Comment on attachment 707229 [details] [diff] [review]
Use a GNU make feature to add an extra compiler or assembler flag to just one source file

r+ like it!
Comment 35 Shay Gueron 2013-02-17 06:03:19 PST
" "
Comment 36 Shay Gueron 2013-03-03 00:06:07 PST
The issue has been resolved by Bob/WTC.

Note You need to log in before you can comment on or make changes to this bug.