Last Comment Bug 439199 - SSE2 instructions for bignum are not implemented on Windows 32-bit
: SSE2 instructions for bignum are not implemented on Windows 32-bit
Status: RESOLVED FIXED
FIPS
:
Product: NSS
Classification: Components
Component: Libraries (show other bugs)
: trunk
: x86 Windows XP
: P1 enhancement with 1 vote (vote)
: 3.12.3
Assigned To: glen beasley
:
Mentors:
Depends on:
Blocks: FIPS2008
  Show dependency treegraph
 
Reported: 2008-06-13 23:46 PDT by Makoto Kato [:m_kato] (PTO 6/20-21, 6/24)
Modified: 2009-03-10 14:18 PDT (History)
3 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
a patch for latest trunk (10.46 KB, patch)
2008-06-13 23:46 PDT, Makoto Kato [:m_kato] (PTO 6/20-21, 6/24)
no flags Details | Diff | Review
patch v2 (checked in) (10.87 KB, patch)
2008-06-22 01:48 PDT, Makoto Kato [:m_kato] (PTO 6/20-21, 6/24)
rrelyea: review+
Details | Diff | Review
patch - use SSE2 in 32-bit code on non-Intel CPUs where present (562 bytes, patch)
2009-03-10 00:33 PDT, Nelson Bolyard (seldom reads bugmail)
rrelyea: review+
julien.pierre: superreview+
Details | Diff | Review

Description Makoto Kato [:m_kato] (PTO 6/20-21, 6/24) 2008-06-13 23:46:57 PDT
Created attachment 325071 [details] [diff] [review]
a patch for latest trunk

Current NSS code has SSE2 optimization of MPI.  But this is GCC only.
Comment 1 Makoto Kato [:m_kato] (PTO 6/20-21, 6/24) 2008-06-22 01:48:26 PDT
Created attachment 326160 [details] [diff] [review]
patch v2 (checked in)

Although GCC version uses SSE2 for MPI on Intel CPU, MSVC version is no SSE2 code.
I ported it to MSVC (x86).
Comment 2 Julien Pierre 2008-06-23 20:16:12 PDT
Comment on attachment 326160 [details] [diff] [review]
patch v2 (checked in)

Makoto,

I don't think we want to keep calling s_mpi_is_sse2 repeatedly during every computation.

I know that appears to be the way it's done on Linux, but I still think it's wrong.

This is the sort of test that should be done at initialization of freebl.
Comment 3 glen beasley 2008-10-16 11:27:15 PDT
If this bug is completed by Nov17 2008 it will be included in the FIPS2008 validation otherwise it will be dropped for a later release.
Comment 4 Nelson Bolyard (seldom reads bugmail) 2008-10-20 16:58:47 PDT
Comment on attachment 326160 [details] [diff] [review]
patch v2 (checked in)

This patch has already received a negative review, so it doesn't need another review.  
A new patch is needed.
Comment 5 Robert Relyea 2009-01-22 14:21:11 PST
I completely disagree with Julien's evaluation, and if that's his only issue I'm perfectly willing to give this patch an r+.

bob
Comment 6 Robert Relyea 2009-01-22 14:33:05 PST
BTW Julien, I've explained to you before.... and it's painfully obvious from reading the code: Neither the Linux patch, nor this was calls s_mpi_is_sse2 on every computation. If it did it *would* kill performance. It's called once on the first multiply and never again for the life of the process.
Comment 7 Robert Relyea 2009-02-27 16:43:05 PST
Comment on attachment 326160 [details] [diff] [review]
patch v2 (checked in)

r+ with one change.

is_sse should be initialized to -1
Comment 8 glen beasley 2009-03-06 09:40:25 PST
 cvs commit -m "439199 SSE2 Win 32 instructions for bignum r=
Bob patch from Makoto kato"
Checking in mpi_x86_asm.c;
/cvsroot/mozilla/security/nss/lib/freebl/mpi/mpi_x86_asm.c,v  <--  mpi_x86_asm.c
new revision: 1.3; previous revision: 1.2
done
Comment 9 Nelson Bolyard (seldom reads bugmail) 2009-03-06 13:04:14 PST
I'm reopening this bug because:
1) The patch was committed without making the correction Bob mentioned in
comment 7.

2) I agree with Julien that checking the CPU type on EVERY call to 
s_mpv_mul_d is very undesirable. This is a very low level function that 
gets called many times in a single multiplication of two bignums.  

3) There are some CPUs (AMD in particular) where the SSE2 implementation
is SLOWER than the non-SSE2 implementation.  On those CPUs, we don't want
to use SSE2, even if the CPU is capable of it.
Comment 10 Robert Relyea 2009-03-06 14:41:10 PST
> 1) The patch was committed without making the correction Bob mentioned in
> comment 7.

This is legitimate. The patch does need this change or it's a no-op.


> 2) I agree with Julien that checking the CPU type on EVERY call to 
> s_mpv_mul_d is very undesirable. This is a very low level function that 
> gets called many times in a single multiplication of two bignums. 

For hopefully the last time, we do NOT call s_mpi_is_sse2 on EVERY CALL!! The code clearly caches this.  I have done performance measurements on this code and cannot detect the cost of the additional branch (particularly in a function that's going to call MULTIPLY in an inner loop upteen times). Both Intel and AMD does a reasonably good job at branch prediction, so in the steady state, we don't even clear the pipleline.

I did the homework on this years ago and this exact code is performing quite well on linux, and has been for years. 

At this point I believe the burden is on others to prove there is a problem with this field tested code.

> 3) There are some CPUs (AMD in particular) where the SSE2 implementation
> is SLOWER than the non-SSE2 implementation.  On those CPUs, we don't want
> to use SSE2, even if the CPU is capable of it.

This should have been brought up earlier when I asked for other objections to the patch.
Comment 11 Nelson Bolyard (seldom reads bugmail) 2009-03-06 14:42:24 PST
Now that this patch has  been committed, fixing the problems is high 
priority.
Comment 12 glen beasley 2009-03-06 15:31:27 PST
Sorry about forgetting bob's change request. 

< static int is_sse = 0;
---
> static int is_sse = -1;

cvs commit -m "439199 SSE2 Win 32 instructions for bignum r=bob is_sse should be -1"  
Enter passphrase for key '/Users/gb/.ssh/id_dsa': 
Checking in mpi_x86_asm.c;
/cvsroot/mozilla/security/nss/lib/freebl/mpi/mpi_x86_asm.c,v  <--  mpi_x86_asm.c
new revision: 1.4; previous revision: 1.3
done


> 3) There are some CPUs (AMD in particular) where the SSE2 implementation
> is SLOWER than the non-SSE2 implementation.  On those CPUs, we don't want
> to use SSE2, even if the CPU is capable of it.

I will leave the bug open, to see if we can do performance testing to see 
how serious this issue is.
Comment 13 Julien Pierre 2009-03-06 17:18:29 PST
Nelson,

Re: comment 9 issue 3,

I don't think the SSE2 is slower than the regular multiply in 32 bit even on AMD CPUs. I believe that's only the case in 64-bit mode.
Comment 14 Julien Pierre 2009-03-06 18:59:59 PST
I have both an Intel Q6600 (core 2 quad 2.4 GHz) and AMD Phenom 9750 (2.4 GHz quad-core) at home. They are both running Vista x64 with 8 GB RAM.

I did two NSS builds, one of 3.11 (before this patch) and one of the trunk, with VC9. Both were optimized 32 bit builds, targeted for WINNT.

I ran the following command as a benchmark :
rsaperf -n none -p 30 -t 4

This tests 1024-bit RSA private key ops with 4 threads for 30 seconds. Here were the results :

3.11, Intel Q6600 : 1468 ops/s
trunk, Intel Q6600 : 1822 ops/s

3.11, Phenom 9750 : 646 ops/s
trunk, Phenom 9750 : 647 ops/s

This shows that the SSE2 helps significantly on the Intel chip.

The results on AMD are dysmal compared to Intel, especially considering these are both quad-core chips running at the same clock speed.

What is very surprising is that the SSE2 code seems to yield the same result as the old regular multiply code on the AMD. I am wondering if the SSE2 capability of the AMD CPU was properly detected.

FYI, I was running the same NSS bits on both machines.
Comment 15 Julien Pierre 2009-03-06 19:44:20 PST
As I suspected, SSE2 was not detected on the AMD chip.
This is due to the 3 lines of code starting at http://mxr.mozilla.org/security/source/security/nss/lib/freebl/mpi/mpcpucache.c#680

I commented that check and rebuilt the trunk build. The result is :

trunk, Phenom 9750 : 775 ops/s

So, using SSE2 actually improved the performance on this AMD chip in 32 bit mode.
Comment 16 Nelson Bolyard (seldom reads bugmail) 2009-03-07 00:59:41 PST
My guess is that AMDs CPUs have changed the relative performance of the 
base multiply instruction and the SSE2 instructions in the last 3 years.
I was pleasantly surprised to see that s_mpi_is_sse2 already checks for 
Intel.  Maybe we should remove that check now.
Comment 17 Julien Pierre 2009-03-07 01:33:45 PST
Nelson,

I don't think the AMD CPUs have really changed. In 32-bit mode, the SSE2 method is always faster on all chips, I believe even on older AMD chips. We had some SSE2 32-bit code working in our lab years ago with Saul, but we never bothered integrating it. It always helped more with Intel than with AMD, but it never hurt AMD in 32-bit mode.

However, in 64 bit mode, it was a different story, and on AMD chips the SSE2 method was slower than using the 64-bit multiply instruction.

Since this bug is concerned only about 32-bits, I think we should remove the Intel check. There is a separate source file on 64 bits with another check for SSE2.
Comment 18 Nelson Bolyard (seldom reads bugmail) 2009-03-10 00:33:12 PDT
Created attachment 366521 [details] [diff] [review]
patch - use SSE2 in 32-bit code on non-Intel CPUs where present

Is this all that remains to be done for this bug?
Comment 19 Robert Relyea 2009-03-10 14:14:35 PDT
Comment on attachment 366521 [details] [diff] [review]
patch - use SSE2 in 32-bit code on non-Intel CPUs where present

r+ based on the AMD SSE2 feedback.

I think that's it.

bob
Comment 20 Julien Pierre 2009-03-10 14:15:40 PDT
Comment on attachment 366521 [details] [diff] [review]
patch - use SSE2 in 32-bit code on non-Intel CPUs where present

I think so.

FYI, I tried this patch on my older (2004) 2-way AMD opteron 246 2 GHz. This is the generation of CPUs we used in our previous performance round.

Using 2 threads, rsaperf gave :

without patch : 533 ops/s with patch : 628 ops/s

So, the SSE2 is beneficial even on those old opterons.
Comment 21 Nelson Bolyard (seldom reads bugmail) 2009-03-10 14:18:41 PDT
Checking in mpcpucache.c; new revision: 1.7; previous revision: 1.6

Note You need to log in before you can comment on or make changes to this bug.