Closed Bug 285932 Opened 15 years ago Closed 15 years ago

Need faster SHA1 implementation, especially for AMD64

Categories

(NSS :: Libraries, enhancement, P2)

enhancement

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nelson, Assigned: nelson)

References

Details

Attachments

(5 files, 5 obsolete files)

A certain https server benchmark uses SSL 3.0 and SSL_RSA_RC4_128_WITH_SHA1.
It does full handshakes for about 1 in 5 connections.  
It makes an average of 10 http requests per connection using http keep-alive.
It sends an average of ~6K bytes to the server per request, and
receives an average of ~15K bytes from the server per request.

When the benchmarked server uses NSS built on AMD64 built in 64-bit mode, 
RSA is not the dominant consumer of CPU cycles.  SHA1 uses over twice as 
much CPU as RSA, and is the dominant consumer.  

There are at least 2 low-hanging fruit here:
a) switching from -O2 to -O3 on GCC speeds up NSS's existing SHA1 code on 
AMD64 by about 2x.  
b) using 64-bit variables (rather than 32-bit ones) to store all frequently
updated values gets another large performance boost on AMD64.  It has been
suggested that this is because AMD64 does Read-Modify-Write cycles when it
writes an amount less than 64-bits to memory.  However this change reduces
performance on some 64-bit CPUs, so it cannot be used on all 64-bit CPUs.

I am working on a new implementation.  So far I have tested it on X86,
x86-64 and Sparc v8+ and v9, and it is correct and is faster than the 
old NSS code on all of them.
Wan-Teh, is there a way to force NSPR to use the structure form of a 
PRUint54, even on CPUs/systems that HAVE_LONG_LONG?  

It seems that NSPR presently finds HAVE_LONG_LONG to be true even on 
CPUs that do not have 64-bit registers.  I think I could get the code
to be faster on those machines if it used the struct form.  I'd rather 
not have to abandon the PRUint64 type to get separate 32-bit variables 
for the high and low parts of the value.  
Priority: -- → P2
Target Milestone: --- → 3.10
Attached patch Early draft of patch (obsolete) — Splinter Review
This is the present state of my new code.  I'm not ready to check it in,
but your review comments are welcome and invited anyway.
I already see a few review issues:
a) inconsistent types, uint32 vs PRUint32
b) a new feature test macro that should be all upper case but isn't.

I want to test this on all platforms I can before committing, testing it
for correctness and for performance.  Don't want any regressions like 
the one I initially saw on 64-bit UltraSparc (now fixed).  

I may also want to put this into a new .c source file, since it is 
effectively a rewrite and is no longer Kocher's original code, except
for the definition of the F1 function.
Nelson,

NSPR's HAVE_LONG_LONG macro means whether the *compiler*
supports a 64-bit integer type.  It has nothing to do with
whether the CPU is 64-bit.

Because C99 added the long long type to the C Standard
officially, I expect that HAVE_LONG_LONG will be defined
on all platforms soon.  I believe it is already defined
on all platforms NSPR supports now.

I don't know exactly what you need.  Do you just need a
fast way to get the high and low parts of a 64-bit integer?
Can you do that by casting PRUint64 to a union of the
integer and structure forms of PRUint64?
The next step for this patch is to test it against the code now on the 
NSS trunk on all supported combinations of platforms and compilers to 
see if it is slower than the existing NSS code on any of them.  

So far I have tested in on 
- AMD64/Solaris10 with GCC 3.4.4, in both 32-bit mode and 64-bit mode and 
- UltraSparc/S9 with WorkShop 6.2, in 32-bit and 64-bit modes, and
in all those cases, I have found the new code to be faster than the old.
But there are many more combinations of CPUs and compilers to test.
QA Contact: bishakhabanerjee → jason.m.reid
This is planned for 3.11.  We have a new SHA1 implementation on the 
PERFORMANCE_HACKS_BRANCH that has proven to be faster than the old one
on all platforms on which it has been tested.  I expect to put it into
NSS 3.11.
Status: NEW → ASSIGNED
Target Milestone: 3.10 → 3.11
Attached patch current draft of patch (obsolete) — Splinter Review
This is the current version of the sha_fast code.  
I don't know if its the same as the previous patch or not.
The previous patch omitted the change to one file
Attachment #191549 - Attachment is obsolete: true
Comment on attachment 191550 [details] [diff] [review]
complete current draft of patch 

Please review this patch.  It is only a little different than the first draft.

With regard to the previous comment about HAVE_LONG_LONG, I have found that
even on 32-bit CPUs, the code generated is usually quite good for the very
simple operations this code is performing.  So, I don't think I need to 
change it to a different test that actually determines the CPU register size.
Perhaps we could unifdef this code, to remove the code used when HAVE_LONG_LONG
is undefined.  But for now, I don't think it hurts to leave it there.  

This patch changes one PORT_ZFree call to PORT_Free.  I believe that's
allowable
even in FIPS mode, with my current understanding of the requirements.  

You will notice one new function, SHA1_Clone.  It's related to another change
that will be part of another bug/patch coming soon.
Attachment #191550 - Flags: review?(wtchang)
THe patch for bug 303334 depends on the prior application of this patch.
Blocks: 303334
comments on this patch:

1) There are big blocks of similiar looking code that handles alignments. It's
possible to understand when compared the the original simple code, but could be
quite confusing without it. (not fatal).

2) The #ifdef controlling the creation of the temp variable is strictly based on
whether or not the ROTL macro needs it, but it is possible that the HTON macro
would also need it. The current code works because a new machine HTON macro is
defines whenever an machine ROTL macro is defined. The whole macro thing should
be  sorted at at some point (some macros are defines a head of time, then
changed on the fly, other  macros are pre-defined and the generic version is
defined if the pre-defined version isn't). (not fatal).

3) lenB is uninitiallize in the Update function if HAVE_LONG_LONG is not
defined. This should be fixed.

I'd particularly like to see 3) fixed immediately after check-in, and 2) fixed
after the carpool.

bob
Bob your description of issue #1 is sufficiently vague that I don't know
what code you're talking about.  Please cite some line numbers (from the 
branch) or something.  

We've got a pipeline of patches, 5 deep, where each patch depends on the
prior patches having been applied exactly as is.  Any change to an early
patch immediately invalidates all other patches in the pipeline, which adds
days or weeks to the schedule.  So, I'll fix any bugs at the end of the
pipeline. Hopefully that will be within the next week.
I presumed so, which is why I asked for the fixes after the carpool.

RE comment 1) My mind was failing me. I assumed the following block was repeated
in SHA1_End, but it turns out a similiar set of ifdefs were used there, but the
code in the ifdefs. The block is easy to see how it developed, but it's hard to
see what is going on if you don't have the  benefit of the original code. It's
probably not a big issue since most of freebl is this kind of code. It would be
nice to see something a little more compact. Here's the code in the patch:

#if defined(SHA_ALLOW_UNALIGNED_ACCESS)
  while (len >= 64U) {
    len    -= 64U;
    shaCompress(&ctx->H[H2X], (PRUint32 *)dataIn);
    dataIn += 64U;
  }
#else
  if ((ptrdiff_t)dataIn % sizeof(PRUint32)) {
    while (len >= 64U) {
      memcpy(ctx->B, dataIn, 64);
      len    -= 64U;
      shaCompress(&ctx->H[H2X], ctx->W);
      dataIn += 64U;
    }
  } else {
    while (len >= 64U) {
      len    -= 64U;
      shaCompress(&ctx->H[H2X], (PRUint32 *)dataIn);
      dataIn += 64U;
    }
  }
#endif

Something like:
/*
 * the following loop is implemented twice, once for aligned data buffers
 * which omits a data copy and once for unaligned data buffers which does a
 * memory copy. The test is made outside the loop (requiring to loop
 * implementations to prevent incurring the cost of the test in each 
 * iteration of this performance sensitive function.
 * If SHA_ALLOW_UNALIGNED_ACCESS is defined, then we never do the copy.
 */
#if !defined(SHA_ALLOW_UNALIGNED_ACCESS)
 if ((ptrdiff_t)dataIn % sizeof(PRUint32)) {
    while (len >= 64U) {
      memcpy(ctx->B, dataIn, 64);
      len    -= 64U;
      shaCompress(&ctx->H[H2X], ctx->W);
      dataIn += 64U;
    }
  } else 
#endif
  {
    while (len >= 64U) {
      len    -= 64U;
      shaCompress(&ctx->H[H2X], (PRUint32 *)dataIn);
      dataIn += 64U;
    }
  }

Anyway it's not a big deal. I can see why we need a least 2 inline copies of this.

bob
Comment on attachment 191550 [details] [diff] [review]
complete current draft of patch 

r=wtc.	But I have a lot of suggested changes and
questions.  I will submit my review comments as
an attachments.
Attachment #191550 - Flags: review?(wtchang) → review+
Comment on attachment 191550 [details] [diff] [review]
complete current draft of patch 

I forgot one comment.

> void 
> SHA1_End(SHA1Context *ctx, unsigned char *hashout,
>          unsigned int *pDigestLen, unsigned int maxDigestLen)
> {
...
>-#define A lenB
>+#define tmp lenB
...
> }
> 
> #undef A
> #undef B
>+#undef tmp

"#under A" should be removed because we no longer have
"#define A".
Just in case it wasn't clear, r+ for the patch with fixes 2 & 3 to go in after
the carpool.
This text attachment contains my 
responses to all previous comments,
answers to previously asked questions, and
proposals for certain changes to be made after the carpool.
Attachment #177270 - Attachment is obsolete: true
Checking in prng_fips1861.c; new revision: 1.18; previous revision: 1.17
Checking in sha_fast.c;      new revision: 1.11; previous revision: 1.10
Checking in sha_fast.h;      new revision: 1.4;  previous revision: 1.3

Leaving bug open pending resolution of remaining review issues.
Attached patch Address most review feedback (obsolete) — Splinter Review
This patch addresses all the review feedback except bob's comment about
the tmp variable being needed by both ROTL and HTONL.  If we find any 
platform on which it does not build due to this, I will fix it.

Notes on this patch.
1. Changed the strategy for the context's "size" variable.  Now it's 
only 64 bits on platforms that HAVE_LONG_LONG, so there's never any 
need for code that uses LL macros or code that uses .hi and .lo members.
There are no remaining tests for HAVE_LONG_LONG in sha_fast.c. 
Consequently, the code is limited to hashing no more than 4 gigabytes
at a time on some boxes.  I think we can live with that limitation. :)

2. blapi.h does not use NSPR types at all.  It uses uint32 and uint64
in the declarations of many functions.	I didn't want to force all of
blapi.h and related freebl files to convert to NSPR types, so I left
that alone.  I changed sha_fast.c's internal type uses to use NSPR 
types, and also the types in sha_fast.h use them, but I left blapi.h
alone, and so had to leave the uint32 declaration in SHA1_HashBuf.h.

I have only tested this patch on one platform.	I will immediately begin
to test on others, too.
Attachment #191918 - Flags: superreview?(wtchang)
Attachment #191918 - Flags: review?(rrelyea)
As you know, we effectively have two implementations of SHA1, in sha_fast.c
and in prng_fips1861.c.  They share sha_fast.h, which defines SHA1_HTONL,
but they do not share the numerous platform dependent optimizations for 
SHA1_HTONL, which are presently found only in sha_fast.c.  

Should I move some of those platform-dependent optimizations to sha_fast.h? 
Should this be the subject of another bug?
I have verified that, when built with SHA_PUT_W_IN_STACK, sha_fast.c builds
and runs correctly, at least on x86.  More platforms will be tested.  
Seems our Makefiles isn't setting this on any platform.  
I think we're likely to want to change that soon.
uint32 and uint64 are obsolete NSPR types.  You can use
LXR to verify that uint32 and uint64 are not defined in
any freebl or NSS header files.  They are defined in
"obsolete/protypes.h", which is included by "prtypes.h".

It isn't hard to define your own 32-bit and 64-bit
unsigned integers if you want to be truly NSPR-free in
freebl.

In any case, I don't mind if you don't fix this pre-existing
problem.
Comment on attachment 191918 [details] [diff] [review]
Address most review feedback

r+ = relyea.

My only question is are we sure we never hash more than 4G of data on 32 bit
systems. I'm confident that we never call 'hashBuf' with an entry that big, but
could a data stream get that big? If we are sure this patch is definately
superior to an other patch we've had.

bob
Attachment #191918 - Flags: review?(rrelyea) → review+
I have no particular desire to make or keep blapi.h "truly NSPR-free".
It just happens to be that way now, and I think it should be self consistent.
I wouldn't object to someone converting all of blapi.h and all the code it 
declares and all of its users to use NSPR types, but that's not in the 
schedule now, and is beyond the scope of this bug.

I'm pretty confident we can live with a 4GB limit to the total amount of 
data fed into SHA1 over numerous SHA1_Update calls, at least one 32-bit 
systems.  SSL3/TLS1 will never need that much data in any one hash context.  
S/MIME could conceivably send a monstrous file in a signed message, but 
at least for now I think that's beyond the limits of most mailboxes.
Comment on attachment 191918 [details] [diff] [review]
Address most review feedback

>-#if defined(_LP64) && !defined(__sparc)
>+#if defined(_LP64) && !defined(__sparc) && defined(HAVE_LONG_LONG)

This change is a no-op because
    defined(_LP64) => 'long' is 64-bit => defined(HAVE_LONG_LONG)
Note: HAVE_LONG_LONG means the platform has a 64-bit integer type.
It doesn't mean the platform has the 'long long' type.

I'm wondering if by _LP64 you really meant IS_64 here.	NSPR defines
the IS_64 macro on all 64-bit platforms.  Not all 64-bit platforms
use the LP64 data model.  The most notable exception is 64-bit
Windows, which uses the IL32P64 model.

> struct SHA1ContextStr {
...
>-  PRUint64 size;          	/* 64-bit count of hashed bytes. */
>+  SHA_HW_t size;          	/* count of hashed bytes. */

Since almost all platforms (including 32-bit ones) support large
file systems now, are you sure you want to use a 32-bit size for
something that may be a large file?  This will mean one can't
implement the 'sha1sum' command using NSS.

> SECStatus
> SHA1_HashBuf(unsigned char *dest, const unsigned char *src, uint32 src_length)
> {
...
>-/*  memset(&ctx, 0, sizeof ctx); */
>+/*  memset(&ctx, 0, sizeof ctx); Intentionally NOT zeroing here. */
>     return SECSuccess;
> }

Please remove the memset call in the comment.

In prng_fips1861.c, we have

>     /* housekeeping */
>+    /* do we need to memset sha1cx here? XXX */
>     memset(x_j, 0, BSIZE);
>     memset(XVAL, 0, BSIZE);
>     return SECSuccess;
> }

I think the answer is no because the internal state of the PRNG
is more valuable than anything in sha1cx.

Let's not add this comment.  (It'll stay there forever.)
Wan-Teh, I take comment 25 to have been SR- 

There is no sha1sum command that uses nss, so I don't consider it important.
But since you do, I propose this alternative:

- ctx->size will be PRUint64 on all platforms on which HAVE_LONG_LONG is
defined, whether they are 64-bit or 32-bit, and will be PRUint32 on the others.

The point of this is that the code will always be able to treat ctx->size
as an integer type, to which simply integer assignments can be done, e.g.
ctx->size = 0   and never ctx->size.lo = 0, and never any LL macros.

As far as we know, all supported platforms now define HAVE_LONG_LONG so 
this should work on all our platforms.  OK with that?
Re: comment 20: I agree it is a good idea to move the
optimizations for SHA_HTONL in sha_fast.c to sha_fast.h
so that prng_fips1861.c can benefit from them.
Just declare the size member of struct SHA1ContextStr as PRUint64
and assume HAVE_LONG_LONG is always defined.  We already
made that leap when we added the 64-bit SECCertificateUsage
type.

Recall that the 'long long' type became a standard C type in 1999
and had been a popular language extension before then.  Looking
forward we can safely assume all platforms have a 64-bit integer
type today and will have 'long long' in a couple of years.
If someone ever needs to port NSS to an environment whose compiler
does not have a 64-bit integer type, it is a straightforward
porting job.
While moving the machine optimizations to sha_fast.h, I noticed that 
the hash output changes made in sha_fast.c had not also been made in
prng_fips1861.c.  This patch fixes that too.  I will request review 
when I have built and tested this on a few more platforms.
Attachment #191918 - Attachment is obsolete: true
That last patch worked on all platforms tested except 64-bit HPUX.
Attachment #192054 - Attachment is obsolete: true
Comment on attachment 192061 [details] [diff] [review]
address feedback and fix PRNG (v2)

This has been built on windows, on Solaris/Sparc (32+64, studio+gcc), on
Solaris/AMD64 (gcc), on HPUX/PARisc 32+64.

I think it's ready for review.
Attachment #192061 - Flags: review?(wtchang)
Comment on attachment 192061 [details] [diff] [review]
address feedback and fix PRNG (v2)

r=wtc, with suggested changes.

In sha_fast.h, we have

>+#include "prlong.h"

I believe it is not necessary to include "prlong.h" because
we aren't using any LL_ macros.  The PRUint64 type is defined
in "prtypes.h".  "prlong.h" defines the LL_ macros.

> struct SHA1ContextStr {
...
>-  PRUint64 size;          	/* 64-bit count of hashed bytes. */
>+  PRUint64 size;          	/* count of hashed bytes. */

Might want to restore the original comment.  Not a big deal.

>+#if defined(_MSC_VER) && defined(_X86_)
>+#if defined(IS_LITTLE_ENDIAN) 
>+#undef SHA_HTONL
...
>+#undef SHA_HTONL

These two instances of #under SHA1_HTONL are not necessary now
because we now define the optimized versions of SHA1_HTONL before
the general version.

>+#if !defined(SHA_ROTL_IS_DEFINED)
>+#define SHA_NEED_TMP_VARIABLE 1

Would be good to restore the comment (in a previous version of
this patch) that says the tmp variable should have the PRUint32
type.  That's useful information.

>+#if defined(_X86_) || defined(__x86_64__) || defined(__x86_64) 
>+#define SHA_ALLOW_UNALIGNED_ACCESS 1
>+#endif

This code is repeated.	Please remove the duplicated code.

>+#define STORE(n) ((PRUint32*)hashout)[n] = SHA_HTONL(ctx->H[n])
>+#define W u.w
>+#if defined(SHA_ALLOW_UNALIGNED_ACCESS)
>+    STORE(0);
>+    STORE(1);
>+    STORE(2);
>+    STORE(3);
>+    STORE(4);
>+#else /* ! unaligned access */
>+    if (!((ptrdiff_t)hashout % sizeof(PRUint32))) {
>+	STORE(0);
>+	STORE(1);
>+	STORE(2);
>+	STORE(3);
>+	STORE(4);
>+    } else {
>+    	/* hashout is not aligned */
>+#if defined(IS_LITTLE_ENDIAN) || defined( SHA1_USING_64_BIT )
>+	ctx->W[0] = SHA_HTONL(ctx->H[0]);
>+	ctx->W[1] = SHA_HTONL(ctx->H[1]);
>+	ctx->W[2] = SHA_HTONL(ctx->H[2]);
>+	ctx->W[3] = SHA_HTONL(ctx->H[3]);
>+	ctx->W[4] = SHA_HTONL(ctx->H[4]);
>+	memcpy(hashout, ctx->W, SHA1_LENGTH);
>+#else  /* 32-bit big-endian */
>+	memcpy(hashout, ctx->H, SHA1_LENGTH);
>+#endif /* 32-bit big endian */
>+    }
>+#endif /* ! unaligned access */
>+#undef STORE
>+#undef W

Is it possible to avoid duplicating this code in sha_fast.c
and prng_fips1861.c?  Just wondering.
Attachment #192061 - Flags: review?(wtchang) → review+
Comment on attachment 191778 [details]
responses to previous comments, proposals for changes

Nelson,

The C code optimizations you did are impressive but also
not obvious at all.  You should add a big warning to this
file that says "Hand-optimized C code. If it isn't broken,
don't fix it."	I'd also encourage you to add comments for
every C code optimization (e.g., the W_x stack variables and
the placement of the len -= 64U statement in the while loop).

>    It appears to me that NSC_DeriveKey does call SHA1_DestroyContext.
>    If it didn't, that would be a memory leak.  See 
>http://lxr.mozilla.org/security/source/security/nss/lib/softoken/pkcs11c.c#5110
>    Perhaps there is some other case in NSC_DeriveKey that fails to 
>    call it as needed, but that would be a definite leak.

Yes, there is.	In lib/softoken/pkcs11c.c, rev. 1.65, lines 4714-4745.
It calls PORT_Free instead of SHA1_DestroyContext, so I don't think
there is a memory leak.

>    It appears that the code for the SHA_PUT_W_IN_STACK case is now
>    incomplete.  I am willing to remove it if you like.

I think we should not add dead code that needs to be removed.
Wan-Teh, what NSPR header file would you suggest I include instead?  prtypes.h?

SHA_PUT_W_IN_STACK is not incomplete after all.  See comment 21 above.
Attachment #192061 - Attachment is obsolete: true
Attachment #192150 - Flags: review?(wtchang)
Comment on attachment 192150 [details] [diff] [review]
address feedback and fix PRNG (v3)

Nelson: Yes, you can just include "prtypes.h".
Comment on attachment 192150 [details] [diff] [review]
address feedback and fix PRNG (v3)

r=wtc.	You can include "prtypes.h" instead of "prlong.h".

>+#if defined(__GNUC__) 
>+/* __x86_64__  and __x86_64 are defined by GCC on x86_64 CPUs */

This comment is a little confusing because you also test
__x86_64__ and __x86_64 (to define SHA_ALLOW_UNALIGNED_ACCESS)
outside the #if defined(__GNUC__) block, and it makes me wonder
if __x86_64__  and __x86_64 are also defined by Sun's compiler on
x86_64 CPUs.
Attachment #192150 - Flags: review?(wtchang) → review+
Wan-Teh,

__x86_64 is defined by both gcc and Sun compilers. I used it in the NSPR port.
See mozilla/nsprpub/pr/include/md/_solaris.cfg and
mozilla/nsprpub/pr/include/md/_solaris.h .
Checking in prng_fips1861.c; new revision: 1.19; previous revision: 1.18
Checking in sha_fast.c;      new revision: 1.12; previous revision: 1.11
Checking in sha_fast.h;      new revision: 1.5;  previous revision: 1.4

Note that Saul will soon be attaching an assembler version of the SHA code 
for AMD64. This bug will be left unresolved until that work is also done.
Comment on attachment 192554 [details] [diff] [review]
Compiled sha_fast.c to assembler with gcc -O3; use this only on Solaris AMD64 builds with Sun Studio.

r=wtc.	The information on how the sha-fast-amd64-sun.s
was generated should be a comment in that file.
Attachment #192554 - Flags: superreview?(wtchang) → superreview+
Comment on attachment 192554 [details] [diff] [review]
Compiled sha_fast.c to assembler with gcc -O3; use this only on Solaris AMD64 builds with Sun Studio.

r=nelson
Attachment #192554 - Flags: review?(nelson) → review+
Comment on attachment 191918 [details] [diff] [review]
Address most review feedback

Cancelling review request for this obsolete patch.
Attachment #191918 - Flags: superreview?(wtchang)
Checked in to trunk:

Checking in Makefile;
/cvsroot/mozilla/security/nss/lib/freebl/Makefile,v  <--  Makefile
new revision: 1.62; previous revision: 1.61
done
Checking in manifest.mn;
/cvsroot/mozilla/security/nss/lib/freebl/manifest.mn,v  <--  manifest.mn
new revision: 1.39; previous revision: 1.38
done
RCS file: /cvsroot/mozilla/security/nss/lib/freebl/sha-fast-amd64-sun.s,v
done
Checking in sha-fast-amd64-sun.s;
/cvsroot/mozilla/security/nss/lib/freebl/sha-fast-amd64-sun.s,v  <-- 
sha-fast-amd64-sun.s
initial revision: 1.1
done
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.