Closed Bug 303507 Opened 19 years ago Closed 19 years ago

Enhance RSA performance using "comba" multiplication and squaring

Categories

(NSS :: Libraries, enhancement, P1)

3.10
Other
All
enhancement

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nelson, Assigned: saul.edwards.bugs)

Details

Attachments

(4 files, 2 obsolete files)

This is a replacement RFE for bug 303340

After comparing NSS's MPI bignum library to another one at length,
we concluded that the major advantage of the other one was due to 
the use of "Comba" algorithms for multiply and square functions.
The Comba algorithms are a win on CPUs with plenty of registers, 
such as the AMD64, and not on CPUs that don't such as the 32-bit x86.
So, this enhancement is potentially for more than one platform,
but not for all platforms.  Initially we will do this for AMD64.

The comba algorithms win by doing far fewer loads and (especially)
stores than the ordinary long multiplication algorithms, even when 
the latter are hand optimized assembly.

Rather than switch the entire bignum library, we decided to implement
Comba style functions for MPI.  We plan to use the public domain library 
TFM v0.03 source as a basis, and adapt it to fit our MPI library.  

We will probably have an assembly language version for AMD64.
Setting hardware to "other" since this initially is for AMD64.
Leaving OS at "all" because it should work on at least two AMD64 OSes.
Priority: -- → P1
Hardware: All → Other
Target Milestone: --- → 3.11
Attachment #192435 - Flags: review?(nelsonbhchan) → review?(nelson)
Comment on attachment 192435 [details] [diff] [review]
Add mp_comba.c for AMD64 Solaris/Linux, plus calls from existing mpi functions and Makefile changes

I find all the parts of this patch to be ok except one.  
The new code inserted into mp_mul and mp_sqr, which calls the new
comba routines, has been inserted into those functions *before* the 
point where those functions deal with input aliasing and ensure
that the space for the result is adequately allocated.	
The new code needs to be inserted just after the points in mp_mul
and mp_sqr where the aliasing and allocation checks have been done.

I'm sorry I didn't catch this earlier.	:(
Attachment #192435 - Flags: review?(nelson) → review-
Attachment #192435 - Attachment is obsolete: true
Attachment #192463 - Flags: review?(nelson)
Comment on attachment 192463 [details] [diff] [review]
Check sizes and for aliasing in mul and sqr operands

r=nelson@bolyard
The functions mp_mul and mp_sqr now ensure that the input lengths are
correct for the respective s_mp_comba functions, and that the output
allocation suffices.  I appears that the comba functions cannot fail,
so I'm ok with them being void functions.  

I'm satisfied that the .s file is properly generated from the .c file
by the "unnamed" compiler.  :)

Wan-Teh, please super review.  I will be happy to explain how this
code works, if you desire.
Attachment #192463 - Flags: superreview?(wtchang)
Attachment #192463 - Flags: review?(nelson)
Attachment #192463 - Flags: review+
Comment on attachment 192438 [details]
Associated assembly from mp_comba.c of above patch.  Zipped because it is slightly above the file upload limit.

Some suggestions:

For people who will maintain this file, please document precisely how this file
was generated in the comment.
Otherwise, they won't be able to reproduce this work.

Change "Studio 10" to "Sun Studio 10".

Change "AMD 64" to "AMD64".

It may be more correct to use "notice" instead of "license" because it is
public domain code.
Comment on attachment 192463 [details] [diff] [review]
Check sizes and for aliasing in mul and sqr operands

As far as I can tell, this patch is fine.  But to be a responsible
reviewer, I will try to do the best I can.  So I have some quesitons.

1. Makefile

Do we need to use a specific optimization level for mp_comba.c when
we use GCC?

2. mpi/mpi-priv.h: Some stylistic comments, which you may ignore.

The isPowerof2 macro name does not conform to the naming convention
in this file, which uses either all capital letters or all lowercase
letters for macro names.  (All caps is more common.)

I verified that the definition of isPowerof2 is correct.
You may consider omitting the (a) test if you are sure
none of the mp_int's you invoke this macro on has a zero
'used' field.

This file uses lowercase a, b, c as function parameters.  Your
comba functions use capital A, B, C as function parameters.

3. mpi/mpi.c

In the mp_sqr function, your outter if statement ensures
a->used <= 16, but you have an inner if statement that tests
a->used == 32, which can't be true.  Please resolve this
conflict.

Other general comments on changes to this file:

You use a->used and b->used.  The existing code in this file
uses MP_USED(a) and MP_USED(b).  It would be good to follow
that style.

It seems that the isPowerof2 test isn't that useful if the
'used' field is likely to be a power of 2.

Could you explain why you are calling the comba routines
at the right places in mp_mul and mp_sqr?

4. mpi/mp_comba.c

Are your contributions also public domain?  If yes, say so.
If not, add the MPL/GPL/LGPL header.

If you can provide a diff that shows your changes to the
original TFM v0.03 file, I can review it.  Otherwise, I'll
just trust that this file is correct.
Comment on attachment 192463 [details] [diff] [review]
Check sizes and for aliasing in mul and sqr operands

OK, I've completed my code review.

I found a potential memory leak.

In mp_mul and mp_sqr, after calling one of the comba functions, your
code should do "goto CLEANUP;" instead of "return MP_OKAY;".
Otherwise, the mp_int variable tmp may not be cleared.

It seems that the following code at the end of your mul_comba routines:

  C->sign = A->sign ^ B->sign;
  mp_clamp(C);

duplicates the folliwing code before the CLEANUP label in mp_mul:

  s_mp_clamp(c);

  if(SIGN(a) == SIGN(b) || s_mp_cmp_d(c, 0) == MP_EQ)
    SIGN(c) = ZPOS;
  else
    SIGN(c) = NEG;

(Similarly for your sqr_comba routines and mp_sqr.)

Why can't you use the existing code?  If your code is faster,
you should replace the existing code with yours.

mpi/mp_comba.c has some very long lines.  I assume those came
from the original TFM v0.03 code.

mpi/mp_comba.c is also full of things like a->dp, a->dp[1],
a->used, a->sign, that the original MPI code seems to use
macros for (MP_DIGITS(a), MP_DIGIT(a,1), MP_USED(a), MP_SIGN(a),
or without the MP_ prefix).  I don't want to be a nit picker, so
you decide if you want to change these.
Comment on attachment 192463 [details] [diff] [review]
Check sizes and for aliasing in mul and sqr operands

sr-.

Saul, Nelson, I don't mind you checking this in first
and addressing the bugs I found in a supplemental patch.
Attachment #192463 - Flags: superreview?(wtchang) → superreview-
Addresses the following issues:
1.) Remove test for <=16 in mp_sqr
2.) Change isPowof2 to macro IS_POWER_OF_2
3.) changed a->used to macro MP_USED(a)
4.) Removed the pot. momory leak by replacing "return MP_OKAY" with "goto
CLEANUP"

Did not address these issues:
1.) Variables a, b, c should be A, B, C
2.) Two definitions of clamp: I think this is a lengthy discussion to resolve
and I would like to get this patch in quickly.	Postpone.
3.) Instructions on reproducing the .s file: I would like to discuss this
through e-mail.
Attachment #192463 - Attachment is obsolete: true
Attachment #192734 - Flags: superreview?(wtchang)
Attachment #192734 - Flags: review?(nelson)
Comment on attachment 192734 [details] [diff] [review]
Address Wan-Teh's issues

r=nelson, provided that you make one change

>+#ifdef NSS_USE_COMBA
>+  if ((MP_USED(a) == MP_USED(b)) && IS_POWER_OF_2(b->used)) {

change that from IS_POWER_OF_2(b->used)
to		 IS_POWER_OF_2(MP_USED(b))

I suppose that really should be MPI_IS_POWER_OF_2, but it is
in a private header.  Change it if you wish.
Attachment #192734 - Flags: review?(nelson) → review+
Comment on attachment 192734 [details] [diff] [review]
Address Wan-Teh's issues

r=wtc.	I just wanted to point out that you didn't
address the last issue I raised in comment 9, either:

  mpi/mp_comba.c is also full of things like a->dp, a->dp[1],
  a->used, a->sign, that the original MPI code seems to use
  macros for (MP_DIGITS(a), MP_DIGIT(a,1), MP_USED(a), MP_SIGN(a),
  or without the MP_ prefix).  I don't want to be a nit picker, so
  you decide if you want to change these.

The reason I raised this issue again is that you made this
change in another file (mpi/mpi.c).
Attachment #192734 - Flags: superreview?(wtchang) → superreview+
Comment on attachment 192734 [details] [diff] [review]
Address Wan-Teh's issues

Saul: you should also address the copyright issue of
mpi/mp_comba.c.

The original code is public domain, but you've modified
it.  So is the resulting derivative work still in the
public domain?
Wan-Teh, the TFM code never uses the MP_USED() macro at all.  I think we 
want the code there to continue to look as much like the original TFM code
as possible, making as few changes to it as possible, so that it can 
easily be verified to be the same code.  I also think that consistency of
that style is more important within a source file than across source files.
That's also why we allowed extra long source lines.

I think we're looking for the right format for a message that says "our 
derivative work is public domain too."  But, IMO,t each contributor ought
not need to duplicate the public domain statement for his contribution.
Nelson: agreed.  Thanks for the clarification.
Checked in on trunk:

Checking in Makefile;
/cvsroot/mozilla/security/nss/lib/freebl/Makefile,v  <--  Makefile
new revision: 1.63; previous revision: 1.62
done
Checking in mpi/mp_comba.c;
/cvsroot/mozilla/security/nss/lib/freebl/mpi/mp_comba.c,v  <--  mp_comba.c
new revision: 1.2; previous revision: 1.1
done
Checking in mpi/mp_comba_amd64_sun.s;
/cvsroot/mozilla/security/nss/lib/freebl/mpi/mp_comba_amd64_sun.s,v  <-- 
mp_comba_amd64_sun.s
new revision: 1.2; previous revision: 1.1
done
Checking in mpi/mpi-priv.h;
/cvsroot/mozilla/security/nss/lib/freebl/mpi/mpi-priv.h,v  <--  mpi-priv.h
new revision: 1.19; previous revision: 1.18
done
Checking in mpi/mpi.c;
/cvsroot/mozilla/security/nss/lib/freebl/mpi/mpi.c,v  <--  mpi.c
new revision: 1.43; previous revision: 1.42
done
Status: NEW → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment on attachment 193743 [details] [diff] [review]
Consolidate OPTIMIZER to just Linux.mk, change default to -O3 for Linux and Solaris gcc builds

r=wtc.
Attachment #193743 - Flags: review?(wtchang) → review+
Checked into trunk:

Checking in Linux.mk;
/cvsroot/mozilla/security/coreconf/Linux.mk,v  <--  Linux.mk
new revision: 1.24; previous revision: 1.23
done
Checking in Linux2.1.mk;
/cvsroot/mozilla/security/coreconf/Linux2.1.mk,v  <--  Linux2.1.mk
new revision: 1.5; previous revision: 1.4
done
Checking in Linux2.2.mk;
/cvsroot/mozilla/security/coreconf/Linux2.2.mk,v  <--  Linux2.2.mk
new revision: 1.5; previous revision: 1.4
done
Checking in Linux2.4.mk;
/cvsroot/mozilla/security/coreconf/Linux2.4.mk,v  <--  Linux2.4.mk
new revision: 1.5; previous revision: 1.4
done
Checking in Linux2.5.mk;
/cvsroot/mozilla/security/coreconf/Linux2.5.mk,v  <--  Linux2.5.mk
new revision: 1.4; previous revision: 1.3
done
Checking in Linux2.6.mk;
/cvsroot/mozilla/security/coreconf/Linux2.6.mk,v  <--  Linux2.6.mk
new revision: 1.4; previous revision: 1.3
done
Checking in SunOS5.mk;
/cvsroot/mozilla/security/coreconf/SunOS5.mk,v  <--  SunOS5.mk
new revision: 1.17; previous revision: 1.16
done
Status: REOPENED → RESOLVED
Closed: 19 years ago19 years ago
Resolution: --- → FIXED
Comment on attachment 193743 [details] [diff] [review]
Consolidate OPTIMIZER to just Linux.mk, change default to -O3 for Linux and Solaris gcc builds

I thought Wan-Teh wanted to stay with -O2 (?), but in any case, it's better to
have this in just one place, not 5.
Attachment #193743 - Flags: superreview?(nelson) → superreview+
Sorry, Saul.

I found that I misunderstood Mozilla's configure script.
I thought that Mozilla is already using -O3 for GCC.  That's
one reason I thought it's okay to use -O3 to compile all of
NSS.

I just found that Mozilla is using -O (same as -O2), and
the official releases are compiled with -Os (optimize for
space), so I want to continue to use -O2 as the default for
GCC.

This means you need to come up with a new patch to just
compile libfreebl3.so with -O3.
Attachment #193879 - Flags: review?(saul.edwards.bugs)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Couldn't we just test for BUILD_SUN_PKG, which is set in all of Christophe's
builds for Solaris and Linux, and use -O3 globally ? I think this is simpler and
requires only a change to coreconf, and no more changes to the already complex
freebl Makefile .
BUILD_SUN_PKG should be only be used for build options
that are not suitable for regular NSS builds, such as
hardcoding the /usr/lib/mps pathname.  If -O3 is good,
we should use it in regular NSS builds, too.

GCC only has three optimization levels;  -O3 is the
highest level, and -O1 is the default.  So -O2 is
already pretty high for GCC.  Without the endorsement
of the Mozilla developers, I need more time to look
into the stability of code generated by -O3.  Using
-O3 only on freebl gets most of the performance benefit
and minimizes the risk.

I can take care of the mozilla/security/nss/lib/freebl/Makefile
change.  I think it'll only take two lines.
Wan-Teh,

We have been testing code built with -O3 for almost a year at Sun. I don't
recall that we ran into any optimizer bug.

Re: optimization levels, gcc actually has other many flags to control specific
optimizations, in addition to -O<level>, eg. -mno-omit-leaf-frame-pointer
-fno-omit-frame-pointer .
Comment on attachment 193879 [details] [diff] [review]
Change GCC's default OPTIMIZER back to -O2 (checked in)

I will submit the freebl-specific patch when the patches for 303508 are checked
in.
Attachment #193879 - Flags: review?(saul.edwards.bugs) → review+
Attachment #193879 - Attachment description: Change GCC's default OPTIMIZER back to -O2 → Change GCC's default OPTIMIZER back to -O2 (checked in)
I believe the only remaining thing left to do in this bug is the -O3 optimizer setting. Wan-Teh, have you found the information you needed about gcc -O3 code stability since you wrote comment #24 ?

My opinion is that it should be the default, preferably for all of NSS, but if not, certainly for freebl. Currently, everything is at -O2 . I know of no NSS test that has ever failed because of -O3 vs -O2, and the code is almost always faster with -O3 . This isn't a big deal for Sun since we hand-generated assembly files made with gcc -O3 for the files with the biggest wins, and we are not using gcc for our C sources in official builds. If you don't care about NSS performance on Linux and other platforms that use gcc, you can leave it at -O2, and close this bug as fixed.
The main issue of this bug has been resolved. Wan-Teh, if you care about the gcc -O3 flag on Linux, this can be dealt with in a separate bug. Per our meeting today, marking fixed.
Status: REOPENED → RESOLVED
Closed: 19 years ago19 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: