Last Comment Bug 459248 - Support Intel AES extensions.
: Support Intel AES extensions.
Status: RESOLVED FIXED
FIPS
:
Product: NSS
Classification: Components
Component: Libraries (show other bugs)
: 3.12.2
: x86_64 All
: P4 enhancement (vote)
: 3.12.2
Assigned To: Robert Relyea
:
:
Mentors:
Depends on:
Blocks: FIPS2008 536485
  Show dependency treegraph
 
Reported: 2008-10-09 09:41 PDT by Robert Relyea
Modified: 2010-01-14 15:09 PST (History)
5 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
patch for comments (48.71 KB, patch)
2008-10-09 09:44 PDT, Robert Relyea
no flags Details | Diff | Splinter Review
same patch, cvs diff -upN (48.63 KB, patch)
2008-10-09 14:01 PDT, Kai Engert (:kaie)
no flags Details | Diff | Splinter Review
New patch with optimized asm code and fixed has_intel_aes type (74.37 KB, patch)
2008-10-26 01:17 PDT, Ulrich Drepper
no flags Details | Diff | Splinter Review
patch equivalent to above, but without extraneous whitespace changes (Windows line endings) (67.73 KB, patch)
2008-10-30 18:47 PDT, Nelson Bolyard (seldom reads bugmail)
no flags Details | Diff | Splinter Review
patch equivalent to above, but with Unix line endings (65.81 KB, patch)
2008-10-30 18:55 PDT, Nelson Bolyard (seldom reads bugmail)
nelson: review-
Details | Diff | Splinter Review
Update patch based on comments. (66.85 KB, patch)
2008-11-26 16:52 PST, Robert Relyea
nelson: review+
Details | Diff | Splinter Review
Fix Tinderbox failures.... (13.77 KB, patch)
2008-12-03 15:07 PST, Robert Relyea
nelson: review-
Details | Diff | Splinter Review
rename s_mpi to freebl_ (13.77 KB, patch)
2008-12-03 18:19 PST, Robert Relyea
nelson: review+
Details | Diff | Splinter Review

Description Robert Relyea 2008-10-09 09:41:59 PDT
Intel supports new instructions to accellerate AES. these instructions are found here: 

http://software.intel.com/en-us/articles/advanced-encryption-standard-aes-instructions-set

A white paper on how use them is here:

http://software.intel.com/en-us/articles/advanced-encryption-standard-aes-instructions-set

an emulator which executes these instructions are here:

http://software.intel.com/en-us/articles/intelr-software-development-emulator
Comment 1 Robert Relyea 2008-10-09 09:44:06 PDT
Created attachment 342444 [details] [diff] [review]
patch for comments

From Ulrich Drepper:

Here's the first iteration of the patch.  I tested it using the SDE
emulator available from Intel

http://software.intel.com/en-us/articles/intel-software-development-emulator

The purpose of this patch is not to get it integrated yet (I'll have to
change the code to pipeline the encryption).  But I would like to get
some input on the integration of the optimization.

There are a few problems:

- - in the existing code, has a little problem.  When the PR_CallOnce
functions fail the code returns PR_FAILURE instead of SECFailure.  Not a
real bug, I fixed it anyway.

- - why do existing assembler files have the extension .s?  .s, as opposed
to .S, is not passed through the preprocessor.  This makes things a
whole lot less readable.

- - the code is obviously using the new instructions.  gas already
supports them for a while.  Does this mean we can require such a newer
binutils version?  The alternative is that I convert the instructions
into .byte pseudo-ops.

- - I've moved the PR_CallOnce calls into the en/decrypt functions because
we don't need it for the assembler functions.  This multiplies the code
(four instances instead of two) but this really shouldn't be an issue

- - I haven't changed the keysizes which are accepted.  We just don't use
assembler versions for blocksizes != 16 and keysizes other than 16, 24,
and 32.

- - the assembler code is quite different for the different key sizes.
Even more so with the upcoming optimized code.  This is why there are
seven functions per keysize.


Obviously I cannot provide any performance numbers at this time.
According to Intel it's well worth using the instruction, why else would
they go to the trouble of adding them.
Comment 2 Nelson Bolyard (seldom reads bugmail) 2008-10-09 13:02:16 PDT
Comment on attachment 342444 [details] [diff] [review]
patch for comments

Please resubmit this patch as a proper cvs unified diff, e.g. cvs diff -pu .  
No CVS account is required to do that.  A CVS diff can be made with an anonymous CVS account, IINM.
Comment 3 Kai Engert (:kaie) 2008-10-09 13:55:21 PDT
I'll attach a converted patch shortly.
Comment 4 Kai Engert (:kaie) 2008-10-09 14:01:25 PDT
Created attachment 342494 [details] [diff] [review]
same patch, cvs diff -upN

I don't understand why it seemed to important to have a patch produced by cvs. The contents of the patch are identical, only the diff command lines are different.

But here you are.
Comment 5 Nelson Bolyard (seldom reads bugmail) 2008-10-09 17:02:48 PDT
Kai, Thanks for the cvs patch.  To understand the difference in the patch 
generation methods, compare the pages shown by these URLs:

https://bugzilla.mozilla.org/attachment.cgi?action=diff&id=342444&collapsed=&headers=1&context=16

https://bugzilla.mozilla.org/attachment.cgi?action=diff&id=342494&collapsed=&headers=1&context=16
Comment 6 Julien Pierre 2008-10-09 17:12:23 PDT
Thanks for the contribution. I like the fact that the AES assembler code has been isolated to its own source file. This will make it easier to add this support on other x86 / x86_64 platforms.

I have some questions and comments about the patch :

1) It seems that this contribution is for the Linux x86 64-bit architecture only so far. Are the new Intel AES extensions are available in 64-bit mode only ? Or is it a conscious decision to skip the 32-bit implementation ?

2) Regarding PR_CallOnce, we already have a call to PR_CallOnce in every loader entry point with the following repeated code sequence :

  if (!vector && PR_SUCCESS != freebl_RunLoaderOnce())
      return NULL;

We could simply add code that needs to run once as part of freebl_RunLoaderOnce, without additional invocations of PR_CallOnce.

3) Right now, the bug is targeted for Linux only. The cpuid test code that checks for the aes-ni CPU capability has been placed in AES_InitContext.

I would suggest that we move this code to a much earlier step, in another source file. loader.c comes to mind .

The reason for that suggestion is that when we implement aes-ni on Solaris x86_64, we will most likely be required to load 2 different freebl shared library, one for the current C implementation that doesn't use aes-ni, and another freebl shared library that supports aes-ni. That's because Solaris flags object files and binaries by the instruction set they support, and libfreebl3.so will not load (dlopen will fail) on all x86_64 CPUs if it is flagged as containing some instructions that don't work on all CPUs. See bug 458480 I filed last week after a similar issue for the SSE2 contribution that was made for Linux 32-bits.

Linux doesn't have the requirement of having different shared libs per instruction set, so you aren't required to have 2 separate freebl shared libraries of it. I think that would still be a good idea if you did because we could then implement aes-ni the same way on all x86_64 operating systems.

Even if you don't want to have 2 freebl shared libs for Linux x86_64, you can still move up the aes-ni support test earlier into the loader.c code, and then link both AES implementations into the one freebl shared lib.
Comment 7 Julien Pierre 2008-10-09 17:49:36 PDT
Another possibility is to move the cpuid test code for aes-ni into the .s file and make an external call. This will probably port to more x86_64 platforms.
Comment 8 Nelson Bolyard (seldom reads bugmail) 2008-10-09 17:53:29 PDT
Julien, if Ulrich uses the  byte pseudo ops, as he suggested, there will
be no problems with dlopen failing.
Comment 9 Julien Pierre 2008-10-09 19:53:23 PDT
Nelson,

Using the .byte pseudo ops might cause the object file and shared lib to be flagged for the basic x86_64 instruction set rather than the one that supports aes_ni . This means dlopen would not fail indeed. But it could create problems down the line if we are using any new registers specific to aes_ni . Solaris might decide not to save and restore those registers during context switch if we don't tell it through the flags that we are using them.

I browsed the Intel docs but I'm not clear if the AES support is using any new registers or not. It seems that the new instructions work with existing xmm registers. Some xmm registers are part of the basic x86_64 instruction set, so Solaris would always save them during context switch regardless of the shared library flags. So, misflagging the Solaris freebl shared libs may not create any problem for the 64-bit case.

However, the same is not true in the 32-bit case. The basic instruction set supported by Solaris 32-bit does not include SSE2. So we couldn't guarantee that the xmm registers would be pushed and popped during context switch if we misflag the binaries. We should use different shared libs for the 32-bit Solaris case.
Comment 10 Ulrich Drepper 2008-10-14 22:49:48 PDT
A couple of quick comments, don't have enough time now.

- I have an optimized version of the assembler code.  It's currently reviewed and I'll submit it as soon as I got the comments

- re: 32-bit code.  Somebody can think about it.  But it's a problem.  If you look at the code (the current as well as the optimized code I'll ssubmit soon) you'll see that for good performance you need more than 8 registers.  Only in 64-bit mode you have 16 registers.  Yes, there will likely be some advantage even for the 32-bit code but I'm not motivated enough to do this work.

- the AES instructions are using the normal SSE registers.  No extensions there.  So, this means using .byte pseudo shouldn't be a problem, right?  The rest of the freebl code will not call into the asm code unless the CPU signals the instructions are available.
Comment 11 Nelson Bolyard (seldom reads bugmail) 2008-10-15 01:41:33 PDT
(In reply to comment #10)

> - the AES instructions are using the normal SSE registers.  No extensions
> there.  So, this means using .byte pseudo shouldn't be a problem, right?  

I agree.
Comment 12 Ulrich Drepper 2008-10-26 01:17:46 PDT
Created attachment 344797 [details] [diff] [review]
New patch with optimized asm code and fixed has_intel_aes type

Here's what I consider the final version.  It has the optimized asm code (parallelism through SIMD), one little but important bug fix in the C code, and it avoids the new asm opcodes.
Comment 13 Nelson Bolyard (seldom reads bugmail) 2008-10-30 18:47:27 PDT
Created attachment 345645 [details] [diff] [review]
patch equivalent to above, but without extraneous whitespace changes (Windows line endings)

The previous patch, which I was asked to review, apparently was produced
by some text editor that removed trailing spaces from all lines of text
that had any.  Consequently, that patch changes hundreds of lines to which 
it makes no visible change.  That makes reviewing difficult.  

So, I made this patch, which should be identical to the previous one, 
except that it doesn't change any lines whose sole change was the 
elimination of trailing whitespace.  I will review this one.

(Beware: this patch has Windows line endings :)
Comment 14 Nelson Bolyard (seldom reads bugmail) 2008-10-30 18:55:22 PDT
Created attachment 345647 [details] [diff] [review]
patch equivalent to above, but with Unix line endings

I think the file with Windows line endings confuses interdiff. 
So, this file is the same as the last one, but with Unix line endings,
I think.
Comment 15 Nelson Bolyard (seldom reads bugmail) 2008-10-30 18:58:08 PDT
Comment on attachment 345645 [details] [diff] [review]
patch equivalent to above, but without extraneous whitespace changes (Windows line endings)

Yes, the Windows line endings confused interdiff. converting to Unix line endings fixed it.
Comment 16 Nelson Bolyard (seldom reads bugmail) 2008-10-30 19:51:11 PDT
Comment on attachment 345647 [details] [diff] [review]
patch equivalent to above, but with Unix line endings

I have not yet done a complete review.  
These are preliminary comments in no particular order.
Some of these comments are to remind me to investigate further.

1.  This patch uses the symbol ebc in many pages.  I think this is
intended to be an abbreviation for "Electronic Code Book" mode, and as
such should be abbreviated ecb, not ebc.  Freebl uses the abbreviation
ecb for that everywhere else, and I'd like to see it be consistent.
I think doing a global search and replace on the patch file, changing all ebc to ecb would fix that.

2. The Makefile change causes intel-aes.s to be added to the list of 
assembler files, provided only that the CPU_ARCH is x86_64.  But the 
corresponding code inside rijndael.c uses the test:
#if defined __GNUC__ && defined __x86_64__

This makes me wonder:
2a: doesn't the "defined" c pre-processor op require parentheses around 
its argument?  
2b: Could this cause intel-aes.s to be built on some x86_64 systems 
that do not use gcc, Such as Solaris for x86, resulting in intel-aes.o being built but unused ?

3. I remember there being some discussion in email about the use of .s 
versus .S for assembly files.  As I recall, one of them uses the c pre-
processor and one does not.  I don't recall which is which any more.

The file intel-aes.s obviously needs to be processed by a c pre-processor.  
It also uses a mixture of c and c++ styles of comments.  SO, it could be 
an issue if that file is processed by a c preprocessor of a compiler that 
does not tolerate c++ comments.  So, the questions are:

3a: Will intel-aes.s be built on Solaris (which doesn't use gcc typically)
3b: Will it be built with the c pre-processor despite a .s suffix?
3c: Will the c++ style comments be a problem?

4. Moving the calls to init_rijndael_tables out of AES_{De,En}crypt and into the 4 functions rijndael_{de,en}crypt{CBC,ECB} means that the test
for whether (or not) the tables need to be built will be performed more
often for multi-part {en,de}cryption operations, being performed in each
part rather than once, I think.  It would be better to move them to 
AES_InitContext, or perhaps even better to a startup function for freebl.

Hmm.  Is that code that builds those tables at run time dead code?
In that case, let's just lose it.

5. comppring the latest patch to the first one, I see unexpected differences.  

a) a bunch of pshufd instructions are now shufps.  I wonder if those 
instructions will be recognized by all the relevant assemblers, or if 
any of them will need to be replaced with .byte pseudo-ops

b) additional instructions are somewhat expected.  Removed instructions
are less so.  One instruction present in the first patch and not the 
last is:
	movdqu	160(%rdi), %xmm12
It was just before label 4 in intel_aes_encrypt_ebc_128.
A similar instruction was removed from similar places in 6 functions.
Just tell this was not accidental, and I'll be satisfied.

I'm going to as Christophe to do a test build on all of our nightly build systems to see if this patch builds cleanly on all of them.
Comment 17 Ulrich Drepper 2008-11-01 09:22:35 PDT
(In reply to comment #16)
> 1.  This patch uses the symbol ebc in many pages.  I think this is
> intended to be an abbreviation for "Electronic Code Book" mode, and as
> such should be abbreviated ecb, not ebc.

Yes, my bad.


> 2. The Makefile change causes intel-aes.s to be added to the list of 
> assembler files, provided only that the CPU_ARCH is x86_64.  But the 
> corresponding code inside rijndael.c uses the test:
> #if defined __GNUC__ && defined __x86_64__

There is nothing gcc-specific in the C code left.  The Makefile can stay as it is.


> 2a: doesn't the "defined" c pre-processor op require parentheses around 
> its argument?  

The line is ignored without a preprocessor running.


> 2b: Could this cause intel-aes.s to be built on some x86_64 systems 
> that do not use gcc, Such as Solaris for x86, resulting in intel-aes.o being
> built but unused ?

Not if the __GNUC__ test in the C code is removed.



> 3. I remember there being some discussion in email about the use of .s 
> versus .S for assembly files.  As I recall, one of them uses the c pre-
> processor and one does not.  I don't recall which is which any more.
> 
> The file intel-aes.s obviously needs to be processed by a c pre-processor.  

No, it doesn't.  I left the preprocessor stuff in so that, when somebody extends the Makefiles, it can easily be enabled.

Since I don't have the time to do all the Makefile surgery I went with the flow and provided a .s file which doesn't have to run through the preprocessor.


> It also uses a mixture of c and c++ styles of comments.  SO, it could be 
> an issue if that file is processed by a c preprocessor of a compiler that 
> does not tolerate c++ comments.  So, the questions are:
> 
> 3a: Will intel-aes.s be built on Solaris (which doesn't use gcc typically)

I wouldn't know, I never touched Solaris x86.


> 3b: Will it be built with the c pre-processor despite a .s suffix?

See above.


> 3c: Will the c++ style comments be a problem?

I've never seen it causing problems.  If it does, it'll be obvious.



> 4. Moving the calls to init_rijndael_tables out of AES_{De,En}crypt and into
> the 4 functions rijndael_{de,en}crypt{CBC,ECB} means that the test
> for whether (or not) the tables need to be built will be performed more
> often for multi-part {en,de}cryption operations, being performed in each
> part rather than once, I think.  It would be better to move them to 
> AES_InitContext, or perhaps even better to a startup function for freebl.

That's an architectural decision I didn't want to make.  I kept the semantics identical to what was there before.  Feel free to change it.


> Hmm.  Is that code that builds those tables at run time dead code?
> In that case, let's just lose it.

You mean when the asm code is used?  In that case the tables certainly shouldn't be built.  That's easy enough to arrange.


> a) a bunch of pshufd instructions are now shufps.  I wonder if those 
> instructions will be recognized by all the relevant assemblers, or if 
> any of them will need to be replaced with .byte pseudo-ops

These are older instructions and shouldn't make any problem.  Again, if they do it's obvious.



> b) additional instructions are somewhat expected.  Removed instructions
> are less so.  One instruction present in the first patch and not the 
> last is:
>     movdqu    160(%rdi), %xmm12
> It was just before label 4 in intel_aes_encrypt_ebc_128.
> A similar instruction was removed from similar places in 6 functions.
> Just tell this was not accidental, and I'll be satisfied.

Look at the top of the function.  The value is loaded once there.
Comment 18 Nelson Bolyard (seldom reads bugmail) 2008-11-10 19:57:36 PST
Comment on attachment 345647 [details] [diff] [review]
patch equivalent to above, but with Unix line endings

Ulrich, From comment 17 I gather that there are now a number of changes you propose to make to this patch.  Please do, and I will review the new version.  
I will ask our release engineer to build and QA test that next patch against all our x86_64 platforms.  Passing that test will be essential to getting an R+ review.
Comment 19 Robert Relyea 2008-11-26 16:52:01 PST
Created attachment 350248 [details] [diff] [review]
Update patch based on comments.

I've done the following to Ulrich's patch:
1) I've made the code conditional on USE_HW_AES
2) I've added a license to the .h file.
3) I've changed the .s file to it the Solaris assembler happily accepts it (Mostly adding 'q' to several upcodes, and removing math in constants (8*16 -> 128)
4) I've tested the build on Solaris 64 (Solaris 10 x86_64).
5) I've tested the newly built .s file against the intel simulator on linux to make sure I didn't break anything.

bob
Comment 20 Nelson Bolyard (seldom reads bugmail) 2008-11-26 18:52:23 PST
Comment on attachment 350248 [details] [diff] [review]
Update patch based on comments.

Thanks Bob, Looks very good
Comment 21 Nelson Bolyard (seldom reads bugmail) 2008-12-02 15:21:51 PST
Bad news, Bob,
The checkin of your patch make tinderbox go red on at least one Solaris/x86 
box.  Also, your checkin comment cited the wrong bug number.  It cited 
bug 467298 instead of this bug.
Comment 22 Julien Pierre 2008-12-02 15:54:49 PST
Bob,

Did you really mean to write the following in freebl/Makefile ?

DEFINES != -DUSE_HW_AES
Comment 23 Robert Relyea 2008-12-03 11:03:57 PST
No, and ASFILES was ASFLAGS.

I don't know why they didn't trip on my RHEL5 (perhaps I forgot to set USE_64 in my test build!).

I have a new patch that fixes this and the solaris asm failure. I want to test it on Solaris first before I attach it.

bob

RE: Bug number... sigh, I thought I was extra careful because I was copying the bug number by hand, rather than my normal cut and paste. This was because I was checking in from a VM and cut and paste doesn't work between the host and VM...
Comment 24 Robert Relyea 2008-12-03 15:07:06 PST
Created attachment 351273 [details] [diff] [review]
Fix Tinderbox failures....

Again, tested on RHEL5 and Solaris 10, I've verified the code is actually being called on these platforms, and again verifed RHEL5 in the emulator.

I've also tested them from this same patch. I do not have access to older 64 bit versions of these operating systems on intel.

bob
Comment 25 Robert Relyea 2008-12-03 15:08:38 PST
Comment on attachment 351273 [details] [diff] [review]
Fix Tinderbox failures....

Also, let me know if I misused the mpi naming convention, I copied the prefix from the other mpi function in that file.

bob
Comment 26 Nelson Bolyard (seldom reads bugmail) 2008-12-03 16:44:46 PST
Comment on attachment 351273 [details] [diff] [review]
Fix Tinderbox failures....

I gather that you decided the problem was the use of the in-line __asm 
feature, and you noticed that you had already solved this problem in mpi, 
so you decided to "borrow" (reuse) the solution from MPI.

> Also, let me know if I misused the mpi naming convention, 

Yes, I'm afraid you did, enough that r- is really warranted.
But I think it will be easy to fix with a global search-and-replace.

> I copied the prefix from the other mpi function in that file.

Yes, but the function whose convention you copied is not a public MPI 
function.  The "s_" prefix means "private to MPI, not to be called 
except from other MPI functions, may be changed at any time."  
So, it's not appropriate to use the s_mp prefix for public functions. 
Functions whose names begin with s_ were originally static functions,
back when mpi was one BIG .c file.  When it was split up, those 
functions' names were given the s_ prefix to remind developers to treat
them as static functions (private to MPI).

But my next observation is that this cpuid function doesn't follow MPI's
calling conventions in any way.  All the mp_ and s_mp_ functions have 
a common convention about arguments.  They all take mp_int *s (pointers
to mp_int structures).  (Some also take a single mp_digit argument.)
All the s_mpv_ functions take mp_digit *s (pointers to "vectors" of 
mp_digits). 

Your cpuid function doesn't follow either of those conventions.  It will 
be public, in the sense that it will be called by code outside of MPI, 
but it does not follow the MPI public or private conventions in any way, 
not in naming or in use of arguments.   It really has nothing to do with 
MPI or "bignum" arithmetic.  So, I suggest that you should *NOT* attempt 
to make it appear to follow the MPI convention at all.

I suggest you change the name to somthing like "freebl_cpuid".  That will 
show that it's really NOT part of the MPI family of functions.  
I'm OK with declaring it in mpi.h, even though it's not an MPI function.  
But I really don't want to "pollute" the name space of MPI functions with 
this function that doesn't follow any other MPI conventions.

So, please change the name of that function to not pollute the MPI name
spaces.  With that change, I think this patch will be ready.
Comment 27 Robert Relyea 2008-12-03 18:19:44 PST
Created attachment 351300 [details] [diff] [review]
rename s_mpi to freebl_
Comment 28 Robert Relyea 2008-12-03 18:22:21 PST
Comment on attachment 351300 [details] [diff] [review]
rename s_mpi to freebl_

OK, I've updated the s_mpi_cpuid to freebl_cpuid.

I originally though s_ meant assembler implementation, but then I wondered if it really meant private to mpi, thus my question pointing to the naming.
Comment 29 Nelson Bolyard (seldom reads bugmail) 2008-12-03 23:56:40 PST
Comment on attachment 351300 [details] [diff] [review]
rename s_mpi to freebl_

yeah, Let's give this a try.
Comment 30 Robert Relyea 2008-12-04 18:09:09 PST
(+103/57) Lines changed.
When 	Who 	File 	Rev 	+/- 	Description
2008-12-04 10:16	rrelyea%redhat.com 	mozilla/security/nss/lib/freebl/Makefile 	1.98 	6/0  	

Last Comment Bug 459248 - Support Intel AES extensions.

r=nelsonb

See if Solaris likes this better...
2008-12-04 10:16	rrelyea%redhat.com 	mozilla/security/nss/lib/freebl/rijndael.c 	1.24 	63/29 
2008-12-04 10:16	rrelyea%redhat.com 	mozilla/security/nss/lib/freebl/mpi/mpcpucache.c 	1.6 	11/11 
2008-12-04 10:16	rrelyea%redhat.com 	mozilla/security/nss/lib/freebl/mpi/mpcpucache_amd64.s 	1.3 	9/8 
2008-12-04 10:16	rrelyea%redhat.com 	mozilla/security/nss/lib/freebl/mpi/mpcpucache_x86.s 	1.3 	9/8 
2008-12-04 10:16	rrelyea%redhat.com 	mozilla/security/nss/lib/freebl/mpi/mpi.h 	1.23 	5/1 


Tinderbox is green on all platforms....
Comment 31 Julien Pierre 2009-02-06 18:19:50 PST
Updating platform, since this is for x86_64 and not only for Linux - actually both Linux and Solaris.

Note You need to log in before you can comment on or make changes to this bug.