Closed Bug 439199 Opened 16 years ago Closed 15 years ago

SSE2 instructions for bignum are not implemented on Windows 32-bit

Categories

(NSS :: Libraries, enhancement, P1)

x86
Windows XP
enhancement

Tracking

(Not tracked)

RESOLVED FIXED
3.12.3

People

(Reporter: m_kato, Assigned: glenbeasley)

References

Details

(Whiteboard: FIPS)

Attachments

(2 files, 1 obsolete file)

Attached patch a patch for latest trunk (obsolete) — Splinter Review
Current NSS code has SSE2 optimization of MPI.  But this is GCC only.
Although GCC version uses SSE2 for MPI on Intel CPU, MSVC version is no SSE2 code.
I ported it to MSVC (x86).
Attachment #325071 - Attachment is obsolete: true
Attachment #326160 - Flags: review?(nelson)
Comment on attachment 326160 [details] [diff] [review]
patch v2 (checked in)

Makoto,

I don't think we want to keep calling s_mpi_is_sse2 repeatedly during every computation.

I know that appears to be the way it's done on Linux, but I still think it's wrong.

This is the sort of test that should be done at initialization of freebl.
Attachment #326160 - Flags: superreview-
Blocks: FIPS2008
Priority: -- → P4
Whiteboard: FIPS
Summary: Enable SSE2 for MSVC++ → SSE2 instructions for bignum are not implemented on Windows 32-bit
Severity: normal → enhancement
If this bug is completed by Nov17 2008 it will be included in the FIPS2008 validation otherwise it will be dropped for a later release.
Comment on attachment 326160 [details] [diff] [review]
patch v2 (checked in)

This patch has already received a negative review, so it doesn't need another review.  
A new patch is needed.
Attachment #326160 - Flags: review?(nelson)
I completely disagree with Julien's evaluation, and if that's his only issue I'm perfectly willing to give this patch an r+.

bob
BTW Julien, I've explained to you before.... and it's painfully obvious from reading the code: Neither the Linux patch, nor this was calls s_mpi_is_sse2 on every computation. If it did it *would* kill performance. It's called once on the first multiply and never again for the life of the process.
Attachment #326160 - Flags: review?(rrelyea)
Comment on attachment 326160 [details] [diff] [review]
patch v2 (checked in)

r+ with one change.

is_sse should be initialized to -1
Attachment #326160 - Flags: superreview-
Attachment #326160 - Flags: review?(rrelyea)
Attachment #326160 - Flags: review+
 cvs commit -m "439199 SSE2 Win 32 instructions for bignum r=
Bob patch from Makoto kato"
Checking in mpi_x86_asm.c;
/cvsroot/mozilla/security/nss/lib/freebl/mpi/mpi_x86_asm.c,v  <--  mpi_x86_asm.c
new revision: 1.3; previous revision: 1.2
done
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Target Milestone: --- → 3.12.2
Target Milestone: 3.12.2 → 3.12.3
I'm reopening this bug because:
1) The patch was committed without making the correction Bob mentioned in
comment 7.

2) I agree with Julien that checking the CPU type on EVERY call to 
s_mpv_mul_d is very undesirable. This is a very low level function that 
gets called many times in a single multiplication of two bignums.  

3) There are some CPUs (AMD in particular) where the SSE2 implementation
is SLOWER than the non-SSE2 implementation.  On those CPUs, we don't want
to use SSE2, even if the CPU is capable of it.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
> 1) The patch was committed without making the correction Bob mentioned in
> comment 7.

This is legitimate. The patch does need this change or it's a no-op.


> 2) I agree with Julien that checking the CPU type on EVERY call to 
> s_mpv_mul_d is very undesirable. This is a very low level function that 
> gets called many times in a single multiplication of two bignums. 

For hopefully the last time, we do NOT call s_mpi_is_sse2 on EVERY CALL!! The code clearly caches this.  I have done performance measurements on this code and cannot detect the cost of the additional branch (particularly in a function that's going to call MULTIPLY in an inner loop upteen times). Both Intel and AMD does a reasonably good job at branch prediction, so in the steady state, we don't even clear the pipleline.

I did the homework on this years ago and this exact code is performing quite well on linux, and has been for years. 

At this point I believe the burden is on others to prove there is a problem with this field tested code.

> 3) There are some CPUs (AMD in particular) where the SSE2 implementation
> is SLOWER than the non-SSE2 implementation.  On those CPUs, we don't want
> to use SSE2, even if the CPU is capable of it.

This should have been brought up earlier when I asked for other objections to the patch.
Now that this patch has  been committed, fixing the problems is high 
priority.
Assignee: m_kato → glen.beasley
Priority: P4 → P1
Sorry about forgetting bob's change request. 

< static int is_sse = 0;
---
> static int is_sse = -1;

cvs commit -m "439199 SSE2 Win 32 instructions for bignum r=bob is_sse should be -1"  
Enter passphrase for key '/Users/gb/.ssh/id_dsa': 
Checking in mpi_x86_asm.c;
/cvsroot/mozilla/security/nss/lib/freebl/mpi/mpi_x86_asm.c,v  <--  mpi_x86_asm.c
new revision: 1.4; previous revision: 1.3
done


> 3) There are some CPUs (AMD in particular) where the SSE2 implementation
> is SLOWER than the non-SSE2 implementation.  On those CPUs, we don't want
> to use SSE2, even if the CPU is capable of it.

I will leave the bug open, to see if we can do performance testing to see 
how serious this issue is.
Nelson,

Re: comment 9 issue 3,

I don't think the SSE2 is slower than the regular multiply in 32 bit even on AMD CPUs. I believe that's only the case in 64-bit mode.
I have both an Intel Q6600 (core 2 quad 2.4 GHz) and AMD Phenom 9750 (2.4 GHz quad-core) at home. They are both running Vista x64 with 8 GB RAM.

I did two NSS builds, one of 3.11 (before this patch) and one of the trunk, with VC9. Both were optimized 32 bit builds, targeted for WINNT.

I ran the following command as a benchmark :
rsaperf -n none -p 30 -t 4

This tests 1024-bit RSA private key ops with 4 threads for 30 seconds. Here were the results :

3.11, Intel Q6600 : 1468 ops/s
trunk, Intel Q6600 : 1822 ops/s

3.11, Phenom 9750 : 646 ops/s
trunk, Phenom 9750 : 647 ops/s

This shows that the SSE2 helps significantly on the Intel chip.

The results on AMD are dysmal compared to Intel, especially considering these are both quad-core chips running at the same clock speed.

What is very surprising is that the SSE2 code seems to yield the same result as the old regular multiply code on the AMD. I am wondering if the SSE2 capability of the AMD CPU was properly detected.

FYI, I was running the same NSS bits on both machines.
As I suspected, SSE2 was not detected on the AMD chip.
This is due to the 3 lines of code starting at http://mxr.mozilla.org/security/source/security/nss/lib/freebl/mpi/mpcpucache.c#680

I commented that check and rebuilt the trunk build. The result is :

trunk, Phenom 9750 : 775 ops/s

So, using SSE2 actually improved the performance on this AMD chip in 32 bit mode.
My guess is that AMDs CPUs have changed the relative performance of the 
base multiply instruction and the SSE2 instructions in the last 3 years.
I was pleasantly surprised to see that s_mpi_is_sse2 already checks for 
Intel.  Maybe we should remove that check now.
Nelson,

I don't think the AMD CPUs have really changed. In 32-bit mode, the SSE2 method is always faster on all chips, I believe even on older AMD chips. We had some SSE2 32-bit code working in our lab years ago with Saul, but we never bothered integrating it. It always helped more with Intel than with AMD, but it never hurt AMD in 32-bit mode.

However, in 64 bit mode, it was a different story, and on AMD chips the SSE2 method was slower than using the 64-bit multiply instruction.

Since this bug is concerned only about 32-bits, I think we should remove the Intel check. There is a separate source file on 64 bits with another check for SSE2.
Attachment #326160 - Attachment description: a patch v2 → patch v2 (checked in)
Is this all that remains to be done for this bug?
Attachment #366521 - Flags: review?(rrelyea)
Attachment #366521 - Attachment description: patch - use SSE2 on non-Intel CPUs where present → patch - use SSE2 in 32-bit code on non-Intel CPUs where present
Comment on attachment 366521 [details] [diff] [review]
patch - use SSE2 in 32-bit code on non-Intel CPUs where present

r+ based on the AMD SSE2 feedback.

I think that's it.

bob
Attachment #366521 - Flags: review?(rrelyea) → review+
Comment on attachment 366521 [details] [diff] [review]
patch - use SSE2 in 32-bit code on non-Intel CPUs where present

I think so.

FYI, I tried this patch on my older (2004) 2-way AMD opteron 246 2 GHz. This is the generation of CPUs we used in our previous performance round.

Using 2 threads, rsaperf gave :

without patch : 533 ops/s with patch : 628 ops/s

So, the SSE2 is beneficial even on those old opterons.
Attachment #366521 - Flags: superreview+
Checking in mpcpucache.c; new revision: 1.7; previous revision: 1.6
Status: REOPENED → RESOLVED
Closed: 15 years ago15 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.