Closed Bug 439199 Opened 17 years ago Closed 16 years ago

SSE2 instructions for bignum are not implemented on Windows 32-bit

Tracking

(Not tracked)

Status:

RESOLVED FIXED

Milestone:

3.12.3

People

(Reporter: m_kato, Assigned: glenbeasley)

References

Details

(Whiteboard: FIPS)

Attachments

(2 files, 1 obsolete file)

a patch for latest trunk, 17 years ago, Makoto Kato [:m_kato], 10.46 KB, patch
patch v2 (checked in), 17 years ago, Makoto Kato [:m_kato], 10.87 KB, patch, rrelyea : review+
patch - use SSE2 in 32-bit code on non-Intel CPUs where present, 16 years ago, Nelson Bolyard (seldom reads bugmail), 562 bytes, patch, rrelyea : review+, julien.pierre : superreview+

Makoto Kato [:m_kato]

Reporter

Description

•

17 years ago

Attached patch a patch for latest trunk (obsolete) — Details — Splinter Review

The current NSS code has an SSE2 optimization for MPI, but it is GCC-only.

Makoto Kato [:m_kato]

Reporter

Comment 1

•

17 years ago

Attached patch patch v2 (checked in) — Details — Splinter Review

Although the GCC version uses SSE2 for MPI on Intel CPUs, the MSVC version has no SSE2 code. I ported it to MSVC (x86).

Attachment #325071 - Attachment is obsolete: true

Attachment #326160 - Flags: review?(nelson)

Julien Pierre

Comment 2

•

17 years ago

Comment on attachment 326160: patch v2 (checked in)

Makoto, I don't think we want to keep calling s_mpi_is_sse2 repeatedly during every computation. I know that appears to be the way it's done on Linux, but I still think it's wrong. This is the sort of test that should be done at initialization of freebl.

Attachment #326160 - Flags: superreview-

glen beasley

Assignee

Updated

•

17 years ago

Blocks: FIPS2008

Robert Relyea

Updated

•

17 years ago

Priority: -- → P4

glen beasley

Assignee

Updated

•

17 years ago

Whiteboard: FIPS

Julien Pierre

Updated

•

17 years ago

Summary: Enable SSE2 for MSVC++ → SSE2 instructions for bignum are not implemented on Windows 32-bit

Julien Pierre

Updated

•

17 years ago

Severity: normal → enhancement

glen beasley

Assignee

Comment 3

•

17 years ago

If this bug is completed by Nov17 2008 it will be included in the FIPS2008 validation otherwise it will be dropped for a later release.

Nelson Bolyard (seldom reads bugmail)

Comment 4

•

17 years ago

Comment on attachment 326160: patch v2 (checked in)

This patch has already received a negative review, so it doesn't need another review. A new patch is needed.

Attachment #326160 - Flags: review?(nelson)

Robert Relyea

Comment 5

•

17 years ago

I completely disagree with Julien's evaluation, and if that's his only issue I'm perfectly willing to give this patch an r+. bob

Robert Relyea

Comment 6

•

17 years ago

BTW Julien, I've explained this to you before, and it's painfully obvious from reading the code: neither the Linux patch nor this one calls s_mpi_is_sse2 on every computation. If it did, it *would* kill performance. It's called once on the first multiply and never again for the life of the process.

glen beasley

Assignee

Updated

•

16 years ago

Attachment #326160 - Flags: review?(rrelyea)

Robert Relyea

Comment 7

•

16 years ago

Comment on attachment 326160: patch v2 (checked in)

r+ with one change: is_sse should be initialized to -1.

Attachment #326160 - Flags: superreview-

Attachment #326160 - Flags: review?(rrelyea)

Attachment #326160 - Flags: review+

glen beasley

Assignee

Comment 8

•

16 years ago

cvs commit -m "439199 SSE2 Win 32 instructions for bignum r= Bob patch from Makoto kato"
Checking in mpi_x86_asm.c;
/cvsroot/mozilla/security/nss/lib/freebl/mpi/mpi_x86_asm.c,v <-- mpi_x86_asm.c
new revision: 1.3; previous revision: 1.2
done

Status: NEW → RESOLVED

Closed: 16 years ago

Resolution: --- → FIXED

Wan-Teh Chang

Updated

•

16 years ago

Target Milestone: --- → 3.12.2

Wan-Teh Chang

Updated

•

16 years ago

Target Milestone: 3.12.2 → 3.12.3

Nelson Bolyard (seldom reads bugmail)

Comment 9

•

16 years ago

I'm reopening this bug because:

1) The patch was committed without making the correction Bob mentioned in comment 7.

2) I agree with Julien that checking the CPU type on EVERY call to s_mpv_mul_d is very undesirable. This is a very low level function that gets called many times in a single multiplication of two bignums.

3) There are some CPUs (AMD in particular) where the SSE2 implementation is SLOWER than the non-SSE2 implementation. On those CPUs, we don't want to use SSE2, even if the CPU is capable of it.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Robert Relyea

Comment 10

•

16 years ago

> 1) The patch was committed without making the correction Bob mentioned in
> comment 7.

This is legitimate. The patch does need this change or it's a no-op.

> 2) I agree with Julien that checking the CPU type on EVERY call to
> s_mpv_mul_d is very undesirable. This is a very low level function that
> gets called many times in a single multiplication of two bignums.

For hopefully the last time, we do NOT call s_mpi_is_sse2 on EVERY CALL!! The code clearly caches this. I have done performance measurements on this code and cannot detect the cost of the additional branch (particularly in a function that's going to call MULTIPLY in an inner loop umpteen times). Both Intel and AMD do a reasonably good job at branch prediction, so in the steady state, we don't even clear the pipeline. I did the homework on this years ago, and this exact code is performing quite well on Linux, and has been for years. At this point I believe the burden is on others to prove there is a problem with this field-tested code.

> 3) There are some CPUs (AMD in particular) where the SSE2 implementation
> is SLOWER than the non-SSE2 implementation. On those CPUs, we don't want
> to use SSE2, even if the CPU is capable of it.

This should have been brought up earlier when I asked for other objections to the patch.

Nelson Bolyard (seldom reads bugmail)

Comment 11

•

16 years ago

Now that this patch has been committed, fixing the problems is high priority.

Assignee: m_kato → glen.beasley

Priority: P4 → P1

glen beasley

Assignee

Comment 12

•

16 years ago

Sorry about forgetting bob's change request.

< static int is_sse = 0;
---
> static int is_sse = -1;

cvs commit -m "439199 SSE2 Win 32 instructions for bignum r=bob is_sse should be -1"
Enter passphrase for key '/Users/gb/.ssh/id_dsa':
Checking in mpi_x86_asm.c;
/cvsroot/mozilla/security/nss/lib/freebl/mpi/mpi_x86_asm.c,v <-- mpi_x86_asm.c
new revision: 1.4; previous revision: 1.3
done

> 3) There are some CPUs (AMD in particular) where the SSE2 implementation
> is SLOWER than the non-SSE2 implementation. On those CPUs, we don't want
> to use SSE2, even if the CPU is capable of it.

I will leave the bug open, to see if we can do performance testing to see how serious this issue is.

Julien Pierre

Comment 13

•

16 years ago

Nelson, Re: comment 9 issue 3, I don't think SSE2 is slower than the regular multiply in 32-bit mode, even on AMD CPUs. I believe that's only the case in 64-bit mode.

Julien Pierre

Comment 14

•

16 years ago

I have both an Intel Q6600 (Core 2 Quad, 2.4 GHz) and an AMD Phenom 9750 (2.4 GHz quad-core) at home. Both are running Vista x64 with 8 GB RAM. I did two NSS builds with VC9, one of 3.11 (before this patch) and one of the trunk. Both were optimized 32-bit builds, targeted for WINNT.

I ran the following command as a benchmark:

rsaperf -n none -p 30 -t 4

This tests 1024-bit RSA private key ops with 4 threads for 30 seconds. Here are the results:

3.11, Intel Q6600 : 1468 ops/s
trunk, Intel Q6600 : 1822 ops/s
3.11, Phenom 9750 : 646 ops/s
trunk, Phenom 9750 : 647 ops/s

This shows that SSE2 helps significantly on the Intel chip. The results on AMD are dismal compared to Intel, especially considering these are both quad-core chips running at the same clock speed. What is very surprising is that the SSE2 code seems to yield the same result as the old regular multiply code on the AMD. I am wondering if the SSE2 capability of the AMD CPU was properly detected. FYI, I was running the same NSS bits on both machines.

Julien Pierre

Comment 15

•

16 years ago

As I suspected, SSE2 was not detected on the AMD chip. This is due to the 3 lines of code starting at http://mxr.mozilla.org/security/source/security/nss/lib/freebl/mpi/mpcpucache.c#680

I commented out that check and rebuilt the trunk build. The result is:

trunk, Phenom 9750 : 775 ops/s

So, using SSE2 actually improved the performance on this AMD chip in 32-bit mode.

Nelson Bolyard (seldom reads bugmail)

Comment 16

•

16 years ago

My guess is that AMD's CPUs have changed the relative performance of the base multiply instruction and the SSE2 instructions in the last 3 years. I was pleasantly surprised to see that s_mpi_is_sse2 already checks for Intel. Maybe we should remove that check now.

Julien Pierre

Comment 17

•

16 years ago

Nelson, I don't think the AMD CPUs have really changed. In 32-bit mode, the SSE2 method is always faster on all chips, I believe even on older AMD chips. We had some SSE2 32-bit code working in our lab years ago with Saul, but we never bothered integrating it. It always helped more with Intel than with AMD, but it never hurt AMD in 32-bit mode. However, in 64-bit mode, it was a different story, and on AMD chips the SSE2 method was slower than using the 64-bit multiply instruction. Since this bug is concerned only with 32-bit code, I think we should remove the Intel check. There is a separate source file for 64-bit builds with another check for SSE2.

Nelson Bolyard (seldom reads bugmail)

Updated

•

16 years ago

Attachment #326160 - Attachment description: a patch v2 → patch v2 (checked in)

Nelson Bolyard (seldom reads bugmail)

Comment 18

•

16 years ago

Attached patch patch - use SSE2 in 32-bit code on non-Intel CPUs where present — Details — Splinter Review

Is this all that remains to be done for this bug?

Attachment #366521 - Flags: review?(rrelyea)

Nelson Bolyard (seldom reads bugmail)

Updated

•

16 years ago

Attachment #366521 - Attachment description: patch - use SSE2 on non-Intel CPUs where present → patch - use SSE2 in 32-bit code on non-Intel CPUs where present

Robert Relyea

Comment 19

•

16 years ago

Comment on attachment 366521: patch - use SSE2 in 32-bit code on non-Intel CPUs where present

r+ based on the AMD SSE2 feedback. I think that's it. bob

Attachment #366521 - Flags: review?(rrelyea) → review+

Julien Pierre

Comment 20

•

16 years ago

Comment on attachment 366521: patch - use SSE2 in 32-bit code on non-Intel CPUs where present

I think so. FYI, I tried this patch on my older (2004) 2-way AMD Opteron 246 2 GHz. This is the generation of CPUs we used in our previous performance round. Using 2 threads, rsaperf gave:

without patch : 533 ops/s
with patch : 628 ops/s

So, the SSE2 is beneficial even on those old Opterons.

Attachment #366521 - Flags: superreview+

Nelson Bolyard (seldom reads bugmail)

Comment 21

•

16 years ago

Checking in mpcpucache.c; new revision: 1.7; previous revision: 1.6

Status: REOPENED → RESOLVED

Closed: 16 years ago → 16 years ago

Resolution: --- → FIXED
