The mpi library in NSS (mozilla/security/nss/lib/freebl/mpi) has some x86 assembly code. It is written in several formats for various assemblers: mpi_x86.s (GNU as, or gas) mpi_i86pc.s (Solaris x86 assembler) mpi_x86.asm (Microsoft MASM, or ml.exe) mpi_x86_asm.c (MSVC inline assembly code) We should be able to use this assembly code on Intel Macs. Hopefully mpi_x86.s (the gas version) can be used directly. If the non-executable stack directives at the end of that file cause problem on Intel Macs: # Magic indicating no need for an executable stack .section .note.GNU-stack, "", @progbits .previous let us know. We recently learned that we can use the gas command-line option --noexecstack to accomplish the same thing, so we can remove those directives from mpi_x86.s if necessary. (https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=180726#c17) Note that the x86 assembly code in mpi_x86.s probably only requires Pentium. If all the Intel Mac processors support certain instruction set extensions (e.g., MMX, SSE2), we can take advantage of that, too. To use assembly code on Intel Macs, you need to modify mozilla/security/nss/lib/freebl/Makefile. By searching for mpi_x86.s in that file, you can figure out how to make the necessary changes for Intel Macs.
The gas version can't be used directly for the following reasons: 1. .type @function does not work 2. numeric local labels must be in the range 0..9 3. .section @progbits does not work If you want to use the existing mpi_x86.s: #2 can be solved by either moving to local labels of the form Ln and eliminating the hints, or by renumbering the local labels to keep them in range. Renumbering should be safe on all platforms because of the use of the 'f' and 'b' hints. #1 can be solved by introducing in-asm #ifdefs, as long as the assembler is always called by the compiler driver after running the preprocessor. The same can be done for #3, as an alternative to --noexecstack. Any x86 Mac ever sold is guaranteed to support MMX, SSE, and SSE2. SSE3 is available in all shipping CPUs, but Apple hasn't committed to it as they have to MMX, SSE, and SSE2.
Mark, You're invited to attach a patch to this bug to solve this for X86 macs.
Created attachment 218305 [details] [diff] [review] Patch This implements the suggestions from comment 1. It also adds leading underscores to symbol names, because Darwin as does not do this automatically. I also made changes to reduce unnecessary allocation of stack space. These changes can be brought over to the other x86 mpi implementations. (Aside: I found it odd that div_2dx1d omits the frame pointer but the other functions don't.) mp_digit was defined as unsigned long long (64-bit) but this asm code assumes that mp_digits are 32 bits wide, so MP_USE_UINT_DIGIT is defined to force mp_digit to 32 bits. The patch also includes changes to the mpi tests to allow them to build (mpcpucache.o was missing) and run without requiring . to be in the $PATH.
Mark, thanks a lot for the patch. In bug 326482 we have a patch for mpi_x86.s to allow it to decide at run time if it can use SSE2 instructions. The patch is titled "Use SSE2 multiply instructions on intel processors." (attachment 215739 [details] [diff] [review]) So you can also create a Mac-specific version of mpi_x86.s that just uses the SSE2 assembly code in that patch.
Created attachment 218349 [details] raw mpi_sse2 code. Mark, here is the original SSE2 code I did for intel processors if you don't want to split out the SSE2 code from the current mpi_x86.s code. It doesn't have your Mac specific changes. bob
Created attachment 218372 [details] [diff] [review] Patch, raw sse2 Thanks, Bob! I took your raw sse2 version but kept the Darwin #ifdefs in case there are other platforms want to take advantage of the raw sse2 code. It's an incredible pain to run the mpi tests with the asm implementations. It looks like the Makefile in mpi has fallen out of a good state of repair since everything is built from freebl. Is there interest in fixing that Makefile? Would it be a better idea to move the mpi tests into the rest of the NSS test suite?
There are two files attached to this bug, both awaiting my review.. They seem to overlap each other, significantly. Is one intended to replace the other? Bob, Your attachment 215739 [details] [diff] [review] to bug 326482 adds SSE2 code to mpi. It already has two reviews, but is not yet checked in. Attachment 218372 [details] [diff] above seems to be essentially a duplicate of attachment 215739 [details] [diff] [review]. Was it your intent to duplicate that file for the MAC x86 platform? I'd certainly rather that we have only ONE source file containing that source code, perhaps ifdef'ed for different platforms if necessary. I'd rather not have multiple copies of the file checked in with very different names for slightly-different platforms.
Comment on attachment 218305 [details] [diff] [review] Patch Nelson, please review the SSE2 one unless someone has a good reason to not check in an SSE2-only implementation. In fact, it's possible, with a few more #ifdefs, to have a single asm source file cover many unixish platforms in any of three configurations: - never SSE2 - dynamic SSE2 detection at runtime - always SSE2 We might even be able to roll the i86pc implementation, currently used on Solaris, into the same file. Is that the way you'd like to proceed?
(In reply to comment #9) > In fact, it's possible, with a few more #ifdefs, to have a single asm source > file cover many unixish platforms in any of three configurations: > > - never SSE2 > - dynamic SSE2 detection at runtime > - always SSE2 > > We might even be able to roll the i86pc implementation, currently used on > Solaris, into the same file. Is that the way you'd like to proceed? Yes. Note that SSE2 is often a win on Intel CPUs, and seldom (er, much less often) on AMD. That's why we need to continue to support non-SSE2. A one-source-fits-all solution would be potentially best, assuming it's not so ifdeffed that it's more difficult to understand than separate source files.
Comment on attachment 218372 [details] [diff] [review] Patch, raw sse2 Oh, my previous comments were intended to convey r-. Sorry.
Comment on attachment 218372 [details] [diff] [review] Patch, raw sse2 >+.globl s_mpv_mul_d >+TYPE_FUNCTION(s_mpv_mul_d) >+s_mpv_mul_d: For each of these functions, we also need to add the .private_extern directive so that the symbols do not become global. See bug 568600. Is .private_extern an Apple extension of GCC?
Created attachment 650635 [details] [diff] [review] Patch, raw sse2, updated to current trunk I just discovered that NSS is still not using x86 assembly code on Mac OS X. I updated Mark's "Patch, raw sse2" (attachment 218372 [details] [diff] [review]) to the current NSS trunk. This patch will be used as a baseline for further changes.
Created attachment 650651 [details] [diff] [review] Patch, raw sse2, v2 (checked in) Mark, could you take a look at this before I ask Bob for a review? I experimented with some flags. 1. -DMP_NO_MP_WORD hurts performance, especially if no assembly code is used. 2. -DMP_CHAR_STORE_SLOW -DMP_IS_LITTLE_ENDIAN makes little difference, seems slightly slower (roughly 1/740). I made the following two changes to mpi_sse2.s: 1. I added the .private_extern directives to the functions so that they are not exported from libfreebl4.dylib. 2. I compared with the current SSE2 code in mpi_x86.s and modified mpi_sse2.s accordingly.
Created attachment 650654 [details] [diff] [review] Changes to mpi_sse2.s in Patch, raw sse2, v2 For some reason, patch interdiffs don't work. Here are the differences between the mpi_sse2.s files in the two patches.
Comment on attachment 650651 [details] [diff] [review] Patch, raw sse2, v2 (checked in) I have compared mpi_sse2.s with the SSE2 assembly code in mpi_x86.s carefully, and tested this patch. I am now confident to ask Bob to review it. I tried to make mpi_x86.s compilable on Mac OS X, but it is nontrivial. I suggest that we not let that block the use of x86 assembly code on Mac OS X. Here are the RSA performance numbers returned by the rsaperf -n none -p 20 command. Before this patch: Using freebl with 1024 bits key. Using hardcoded 1024 bits key. 6997 iterations in 20.001 seconds 349.82 operations/s . one operation every 2858 microseconds After this patch: Using freebl with 1024 bits key. Using hardcoded 1024 bits key. 14876 iterations in 20 seconds 743.79 operations/s . one operation every 1344 microseconds So this patch doubles the RSA performance.
Comment on attachment 650651 [details] [diff] [review] Patch, raw sse2, v2 (checked in) Patch checked in on the NSS trunk (NSS 3.14). Checking in mpi/all-tests; /cvsroot/mozilla/security/nss/lib/freebl/mpi/all-tests,v <-- all-tests new revision: 1.5; previous revision: 1.4 done Checking in Makefile; /cvsroot/mozilla/security/nss/lib/freebl/Makefile,v <-- Makefile new revision: 1.122; previous revision: 1.121 done RCS file: /cvsroot/mozilla/security/nss/lib/freebl/mpi/mpi_sse2.s,v done Checking in mpi/mpi_sse2.s; /cvsroot/mozilla/security/nss/lib/freebl/mpi/mpi_sse2.s,v <-- mpi_sse2.s initial revision: 1.1 done