Closed Bug 285932 Opened 20 years ago Closed 19 years ago

Need faster SHA1 implementation, especially for AMD64

Categories

(NSS :: Libraries, enhancement, P2)

enhancement

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nelson, Assigned: nelson)

References

Details

Attachments

(5 files, 5 obsolete files)

A certain https server benchmark uses SSL 3.0 and SSL_RSA_RC4_128_WITH_SHA1. It does full handshakes for about 1 in 5 connections. It makes an average of 10 http requests per connection using http keep-alive. It sends an average of ~6K bytes to the server per request, and receives an average of ~15K bytes from the server per request. When the benchmarked server uses NSS built on AMD64 built in 64-bit mode, RSA is not the dominant consumer of CPU cycles. SHA1 uses over twice as much CPU as RSA, and is the dominant consumer. There are at least 2 low-hanging fruit here: a) switching from -O2 to -O3 on GCC speeds up NSS's existing SHA1 code on AMD64 by about 2x. b) using 64-bit variables (rather than 32-bit ones) to store all frequently updated values gets another large performance boost on AMD64. It has been suggested that this is because AMD64 does Read-Modify-Write cycles when it writes an amount less than 64-bits to memory. However this change reduces performance on some 64-bit CPUs, so it cannot be used on all 64-bit CPUs. I am working on a new implementation. So far I have tested it on X86, x86-64 and Sparc v8+ and v9, and it is correct and is faster than the old NSS code on all of them.
Wan-Teh, is there a way to force NSPR to use the structure form of a PRUint54, even on CPUs/systems that HAVE_LONG_LONG? It seems that NSPR presently finds HAVE_LONG_LONG to be true even on CPUs that do not have 64-bit registers. I think I could get the code to be faster on those machines if it used the struct form. I'd rather not have to abandon the PRUint64 type to get separate 32-bit variables for the high and low parts of the value.
Priority: -- → P2
Target Milestone: --- → 3.10
Attached patch Early draft of patch (obsolete) — Splinter Review
This is the present state of my new code. I'm not ready to check it in, but your review comments are welcome and invited anyway. I already see a few review issues: a) inconsistent types, uint32 vs PRUint32 b) a new feature test macro that should be all upper case but isn't. I want to test this on all platforms I can before committing, testing it for correctness and for performance. Don't want any regressions like the one I initially saw on 64-bit UltraSparc (now fixed). I may also want to put this into a new .c source file, since it is effectively a rewrite and is no longer Kocher's original code, except for the definition of the F1 function.
Nelson, NSPR's HAVE_LONG_LONG macro means whether the *compiler* supports a 64-bit integer type. It has nothing to do with whether the CPU is 64-bit. Because C99 added the long long type to the C Standard officially, I expect that HAVE_LONG_LONG will be defined on all platforms soon. I believe it is already defined on all platforms NSPR supports now. I don't know exactly what you need. Do you just need a fast way to get the high and low parts of a 64-bit integer? Can you do that by casting PRUint64 to a union of the integer and structure forms of PRUint64?
The next step for this patch is to test it against the code now on the NSS trunk on all supported combinations of platforms and compilers to see if it is slower than the existing NSS code on any of them. So far I have tested in on - AMD64/Solaris10 with GCC 3.4.4, in both 32-bit mode and 64-bit mode and - UltraSparc/S9 with WorkShop 6.2, in 32-bit and 64-bit modes, and in all those cases, I have found the new code to be faster than the old. But there are many more combinations of CPUs and compilers to test.
QA Contact: bishakhabanerjee → jason.m.reid
This is planned for 3.11. We have a new SHA1 implementation on the PERFORMANCE_HACKS_BRANCH that has proven to be faster than the old one on all platforms on which it has been tested. I expect to put it into NSS 3.11.
Status: NEW → ASSIGNED
Target Milestone: 3.10 → 3.11
Attached patch current draft of patch (obsolete) — Splinter Review
This is the current version of the sha_fast code. I don't know if its the same as the previous patch or not.
The previous patch omitted the change to one file
Attachment #191549 - Attachment is obsolete: true
Comment on attachment 191550 [details] [diff] [review] complete current draft of patch Please review this patch. It is only a little different than the first draft. With regard to the previous comment about HAVE_LONG_LONG, I have found that even on 32-bit CPUs, the code generated is usually quite good for the very simple operations this code is performing. So, I don't think I need to change it to a different test that actually determines the CPU register size. Perhaps we could unifdef this code, to remove the code used when HAVE_LONG_LONG is undefined. But for now, I don't think it hurts to leave it there. This patch changes one PORT_ZFree call to PORT_Free. I believe that's allowable even in FIPS mode, with my current understanding of the requirements. You will notice one new function, SHA1_Clone. It's related to another change that will be part of another bug/patch coming soon.
Attachment #191550 - Flags: review?(wtchang)
THe patch for bug 303334 depends on the prior application of this patch.
Blocks: 303334
comments on this patch: 1) There are big blocks of similiar looking code that handles alignments. It's possible to understand when compared the the original simple code, but could be quite confusing without it. (not fatal). 2) The #ifdef controlling the creation of the temp variable is strictly based on whether or not the ROTL macro needs it, but it is possible that the HTON macro would also need it. The current code works because a new machine HTON macro is defines whenever an machine ROTL macro is defined. The whole macro thing should be sorted at at some point (some macros are defines a head of time, then changed on the fly, other macros are pre-defined and the generic version is defined if the pre-defined version isn't). (not fatal). 3) lenB is uninitiallize in the Update function if HAVE_LONG_LONG is not defined. This should be fixed. I'd particularly like to see 3) fixed immediately after check-in, and 2) fixed after the carpool. bob
Bob your description of issue #1 is sufficiently vague that I don't know what code you're talking about. Please cite some line numbers (from the branch) or something. We've got a pipeline of patches, 5 deep, where each patch depends on the prior patches having been applied exactly as is. Any change to an early patch immediately invalidates all other patches in the pipeline, which adds days or weeks to the schedule. So, I'll fix any bugs at the end of the pipeline. Hopefully that will be within the next week.
I presumed so, which is why I asked for the fixes after the carpool. RE comment 1) My mind was failing me. I assumed the following block was repeated in SHA1_End, but it turns out a similiar set of ifdefs were used there, but the code in the ifdefs. The block is easy to see how it developed, but it's hard to see what is going on if you don't have the benefit of the original code. It's probably not a big issue since most of freebl is this kind of code. It would be nice to see something a little more compact. Here's the code in the patch: #if defined(SHA_ALLOW_UNALIGNED_ACCESS) while (len >= 64U) { len -= 64U; shaCompress(&ctx->H[H2X], (PRUint32 *)dataIn); dataIn += 64U; } #else if ((ptrdiff_t)dataIn % sizeof(PRUint32)) { while (len >= 64U) { memcpy(ctx->B, dataIn, 64); len -= 64U; shaCompress(&ctx->H[H2X], ctx->W); dataIn += 64U; } } else { while (len >= 64U) { len -= 64U; shaCompress(&ctx->H[H2X], (PRUint32 *)dataIn); dataIn += 64U; } } #endif Something like: /* * the following loop is implemented twice, once for aligned data buffers * which omits a data copy and once for unaligned data buffers which does a * memory copy. The test is made outside the loop (requiring to loop * implementations to prevent incurring the cost of the test in each * iteration of this performance sensitive function. * If SHA_ALLOW_UNALIGNED_ACCESS is defined, then we never do the copy. */ #if !defined(SHA_ALLOW_UNALIGNED_ACCESS) if ((ptrdiff_t)dataIn % sizeof(PRUint32)) { while (len >= 64U) { memcpy(ctx->B, dataIn, 64); len -= 64U; shaCompress(&ctx->H[H2X], ctx->W); dataIn += 64U; } } else #endif { while (len >= 64U) { len -= 64U; shaCompress(&ctx->H[H2X], (PRUint32 *)dataIn); dataIn += 64U; } } Anyway it's not a big deal. I can see why we need a least 2 inline copies of this. bob
Comment on attachment 191550 [details] [diff] [review] complete current draft of patch r=wtc. But I have a lot of suggested changes and questions. I will submit my review comments as an attachments.
Attachment #191550 - Flags: review?(wtchang) → review+
Comment on attachment 191550 [details] [diff] [review] complete current draft of patch I forgot one comment. > void > SHA1_End(SHA1Context *ctx, unsigned char *hashout, > unsigned int *pDigestLen, unsigned int maxDigestLen) > { ... >-#define A lenB >+#define tmp lenB ... > } > > #undef A > #undef B >+#undef tmp "#under A" should be removed because we no longer have "#define A".
Just in case it wasn't clear, r+ for the patch with fixes 2 & 3 to go in after the carpool.
This text attachment contains my responses to all previous comments, answers to previously asked questions, and proposals for certain changes to be made after the carpool.
Attachment #177270 - Attachment is obsolete: true
Checking in prng_fips1861.c; new revision: 1.18; previous revision: 1.17 Checking in sha_fast.c; new revision: 1.11; previous revision: 1.10 Checking in sha_fast.h; new revision: 1.4; previous revision: 1.3 Leaving bug open pending resolution of remaining review issues.
Attached patch Address most review feedback (obsolete) — Splinter Review
This patch addresses all the review feedback except bob's comment about the tmp variable being needed by both ROTL and HTONL. If we find any platform on which it does not build due to this, I will fix it. Notes on this patch. 1. Changed the strategy for the context's "size" variable. Now it's only 64 bits on platforms that HAVE_LONG_LONG, so there's never any need for code that uses LL macros or code that uses .hi and .lo members. There are no remaining tests for HAVE_LONG_LONG in sha_fast.c. Consequently, the code is limited to hashing no more than 4 gigabytes at a time on some boxes. I think we can live with that limitation. :) 2. blapi.h does not use NSPR types at all. It uses uint32 and uint64 in the declarations of many functions. I didn't want to force all of blapi.h and related freebl files to convert to NSPR types, so I left that alone. I changed sha_fast.c's internal type uses to use NSPR types, and also the types in sha_fast.h use them, but I left blapi.h alone, and so had to leave the uint32 declaration in SHA1_HashBuf.h. I have only tested this patch on one platform. I will immediately begin to test on others, too.
Attachment #191918 - Flags: superreview?(wtchang)
Attachment #191918 - Flags: review?(rrelyea)
As you know, we effectively have two implementations of SHA1, in sha_fast.c and in prng_fips1861.c. They share sha_fast.h, which defines SHA1_HTONL, but they do not share the numerous platform dependent optimizations for SHA1_HTONL, which are presently found only in sha_fast.c. Should I move some of those platform-dependent optimizations to sha_fast.h? Should this be the subject of another bug?
I have verified that, when built with SHA_PUT_W_IN_STACK, sha_fast.c builds and runs correctly, at least on x86. More platforms will be tested. Seems our Makefiles isn't setting this on any platform. I think we're likely to want to change that soon.
uint32 and uint64 are obsolete NSPR types. You can use LXR to verify that uint32 and uint64 are not defined in any freebl or NSS header files. They are defined in "obsolete/protypes.h", which is included by "prtypes.h". It isn't hard to define your own 32-bit and 64-bit unsigned integers if you want to be truly NSPR-free in freebl. In any case, I don't mind if you don't fix this pre-existing problem.
Comment on attachment 191918 [details] [diff] [review] Address most review feedback r+ = relyea. My only question is are we sure we never hash more than 4G of data on 32 bit systems. I'm confident that we never call 'hashBuf' with an entry that big, but could a data stream get that big? If we are sure this patch is definately superior to an other patch we've had. bob
Attachment #191918 - Flags: review?(rrelyea) → review+
I have no particular desire to make or keep blapi.h "truly NSPR-free". It just happens to be that way now, and I think it should be self consistent. I wouldn't object to someone converting all of blapi.h and all the code it declares and all of its users to use NSPR types, but that's not in the schedule now, and is beyond the scope of this bug. I'm pretty confident we can live with a 4GB limit to the total amount of data fed into SHA1 over numerous SHA1_Update calls, at least one 32-bit systems. SSL3/TLS1 will never need that much data in any one hash context. S/MIME could conceivably send a monstrous file in a signed message, but at least for now I think that's beyond the limits of most mailboxes.
Comment on attachment 191918 [details] [diff] [review] Address most review feedback >-#if defined(_LP64) && !defined(__sparc) >+#if defined(_LP64) && !defined(__sparc) && defined(HAVE_LONG_LONG) This change is a no-op because defined(_LP64) => 'long' is 64-bit => defined(HAVE_LONG_LONG) Note: HAVE_LONG_LONG means the platform has a 64-bit integer type. It doesn't mean the platform has the 'long long' type. I'm wondering if by _LP64 you really meant IS_64 here. NSPR defines the IS_64 macro on all 64-bit platforms. Not all 64-bit platforms use the LP64 data model. The most notable exception is 64-bit Windows, which uses the IL32P64 model. > struct SHA1ContextStr { ... >- PRUint64 size; /* 64-bit count of hashed bytes. */ >+ SHA_HW_t size; /* count of hashed bytes. */ Since almost all platforms (including 32-bit ones) support large file systems now, are you sure you want to use a 32-bit size for something that may be a large file? This will mean one can't implement the 'sha1sum' command using NSS. > SECStatus > SHA1_HashBuf(unsigned char *dest, const unsigned char *src, uint32 src_length) > { ... >-/* memset(&ctx, 0, sizeof ctx); */ >+/* memset(&ctx, 0, sizeof ctx); Intentionally NOT zeroing here. */ > return SECSuccess; > } Please remove the memset call in the comment. In prng_fips1861.c, we have > /* housekeeping */ >+ /* do we need to memset sha1cx here? XXX */ > memset(x_j, 0, BSIZE); > memset(XVAL, 0, BSIZE); > return SECSuccess; > } I think the answer is no because the internal state of the PRNG is more valuable than anything in sha1cx. Let's not add this comment. (It'll stay there forever.)
Wan-Teh, I take comment 25 to have been SR- There is no sha1sum command that uses nss, so I don't consider it important. But since you do, I propose this alternative: - ctx->size will be PRUint64 on all platforms on which HAVE_LONG_LONG is defined, whether they are 64-bit or 32-bit, and will be PRUint32 on the others. The point of this is that the code will always be able to treat ctx->size as an integer type, to which simply integer assignments can be done, e.g. ctx->size = 0 and never ctx->size.lo = 0, and never any LL macros. As far as we know, all supported platforms now define HAVE_LONG_LONG so this should work on all our platforms. OK with that?
Re: comment 20: I agree it is a good idea to move the optimizations for SHA_HTONL in sha_fast.c to sha_fast.h so that prng_fips1861.c can benefit from them.
Just declare the size member of struct SHA1ContextStr as PRUint64 and assume HAVE_LONG_LONG is always defined. We already made that leap when we added the 64-bit SECCertificateUsage type. Recall that the 'long long' type became a standard C type in 1999 and had been a popular language extension before then. Looking forward we can safely assume all platforms have a 64-bit integer type today and will have 'long long' in a couple of years. If someone ever needs to port NSS to an environment whose compiler does not have a 64-bit integer type, it is a straightforward porting job.
While moving the machine optimizations to sha_fast.h, I noticed that the hash output changes made in sha_fast.c had not also been made in prng_fips1861.c. This patch fixes that too. I will request review when I have built and tested this on a few more platforms.
Attachment #191918 - Attachment is obsolete: true
That last patch worked on all platforms tested except 64-bit HPUX.
Attachment #192054 - Attachment is obsolete: true
Comment on attachment 192061 [details] [diff] [review] address feedback and fix PRNG (v2) This has been built on windows, on Solaris/Sparc (32+64, studio+gcc), on Solaris/AMD64 (gcc), on HPUX/PARisc 32+64. I think it's ready for review.
Attachment #192061 - Flags: review?(wtchang)
Comment on attachment 192061 [details] [diff] [review] address feedback and fix PRNG (v2) r=wtc, with suggested changes. In sha_fast.h, we have >+#include "prlong.h" I believe it is not necessary to include "prlong.h" because we aren't using any LL_ macros. The PRUint64 type is defined in "prtypes.h". "prlong.h" defines the LL_ macros. > struct SHA1ContextStr { ... >- PRUint64 size; /* 64-bit count of hashed bytes. */ >+ PRUint64 size; /* count of hashed bytes. */ Might want to restore the original comment. Not a big deal. >+#if defined(_MSC_VER) && defined(_X86_) >+#if defined(IS_LITTLE_ENDIAN) >+#undef SHA_HTONL ... >+#undef SHA_HTONL These two instances of #under SHA1_HTONL are not necessary now because we now define the optimized versions of SHA1_HTONL before the general version. >+#if !defined(SHA_ROTL_IS_DEFINED) >+#define SHA_NEED_TMP_VARIABLE 1 Would be good to restore the comment (in a previous version of this patch) that says the tmp variable should have the PRUint32 type. That's useful information. >+#if defined(_X86_) || defined(__x86_64__) || defined(__x86_64) >+#define SHA_ALLOW_UNALIGNED_ACCESS 1 >+#endif This code is repeated. Please remove the duplicated code. >+#define STORE(n) ((PRUint32*)hashout)[n] = SHA_HTONL(ctx->H[n]) >+#define W u.w >+#if defined(SHA_ALLOW_UNALIGNED_ACCESS) >+ STORE(0); >+ STORE(1); >+ STORE(2); >+ STORE(3); >+ STORE(4); >+#else /* ! unaligned access */ >+ if (!((ptrdiff_t)hashout % sizeof(PRUint32))) { >+ STORE(0); >+ STORE(1); >+ STORE(2); >+ STORE(3); >+ STORE(4); >+ } else { >+ /* hashout is not aligned */ >+#if defined(IS_LITTLE_ENDIAN) || defined( SHA1_USING_64_BIT ) >+ ctx->W[0] = SHA_HTONL(ctx->H[0]); >+ ctx->W[1] = SHA_HTONL(ctx->H[1]); >+ ctx->W[2] = SHA_HTONL(ctx->H[2]); >+ ctx->W[3] = SHA_HTONL(ctx->H[3]); >+ ctx->W[4] = SHA_HTONL(ctx->H[4]); >+ memcpy(hashout, ctx->W, SHA1_LENGTH); >+#else /* 32-bit big-endian */ >+ memcpy(hashout, ctx->H, SHA1_LENGTH); >+#endif /* 32-bit big endian */ >+ } >+#endif /* ! unaligned access */ >+#undef STORE >+#undef W Is it possible to avoid duplicating this code in sha_fast.c and prng_fips1861.c? Just wondering.
Attachment #192061 - Flags: review?(wtchang) → review+
Comment on attachment 191778 [details] responses to previous comments, proposals for changes Nelson, The C code optimizations you did are impressive but also not obvious at all. You should add a big warning to this file that says "Hand-optimized C code. If it isn't broken, don't fix it." I'd also encourage you to add comments for every C code optimization (e.g., the W_x stack variables and the placement of the len -= 64U statement in the while loop). > It appears to me that NSC_DeriveKey does call SHA1_DestroyContext. > If it didn't, that would be a memory leak. See >http://lxr.mozilla.org/security/source/security/nss/lib/softoken/pkcs11c.c#5110 > Perhaps there is some other case in NSC_DeriveKey that fails to > call it as needed, but that would be a definite leak. Yes, there is. In lib/softoken/pkcs11c.c, rev. 1.65, lines 4714-4745. It calls PORT_Free instead of SHA1_DestroyContext, so I don't think there is a memory leak. > It appears that the code for the SHA_PUT_W_IN_STACK case is now > incomplete. I am willing to remove it if you like. I think we should not add dead code that needs to be removed.
Wan-Teh, what NSPR header file would you suggest I include instead? prtypes.h? SHA_PUT_W_IN_STACK is not incomplete after all. See comment 21 above.
Attachment #192061 - Attachment is obsolete: true
Attachment #192150 - Flags: review?(wtchang)
Comment on attachment 192150 [details] [diff] [review] address feedback and fix PRNG (v3) Nelson: Yes, you can just include "prtypes.h".
Comment on attachment 192150 [details] [diff] [review] address feedback and fix PRNG (v3) r=wtc. You can include "prtypes.h" instead of "prlong.h". >+#if defined(__GNUC__) >+/* __x86_64__ and __x86_64 are defined by GCC on x86_64 CPUs */ This comment is a little confusing because you also test __x86_64__ and __x86_64 (to define SHA_ALLOW_UNALIGNED_ACCESS) outside the #if defined(__GNUC__) block, and it makes me wonder if __x86_64__ and __x86_64 are also defined by Sun's compiler on x86_64 CPUs.
Attachment #192150 - Flags: review?(wtchang) → review+
Wan-Teh, __x86_64 is defined by both gcc and Sun compilers. I used it in the NSPR port. See mozilla/nsprpub/pr/include/md/_solaris.cfg and mozilla/nsprpub/pr/include/md/_solaris.h .
Checking in prng_fips1861.c; new revision: 1.19; previous revision: 1.18 Checking in sha_fast.c; new revision: 1.12; previous revision: 1.11 Checking in sha_fast.h; new revision: 1.5; previous revision: 1.4 Note that Saul will soon be attaching an assembler version of the SHA code for AMD64. This bug will be left unresolved until that work is also done.
Comment on attachment 192554 [details] [diff] [review] Compiled sha_fast.c to assembler with gcc -O3; use this only on Solaris AMD64 builds with Sun Studio. r=wtc. The information on how the sha-fast-amd64-sun.s was generated should be a comment in that file.
Attachment #192554 - Flags: superreview?(wtchang) → superreview+
Comment on attachment 192554 [details] [diff] [review] Compiled sha_fast.c to assembler with gcc -O3; use this only on Solaris AMD64 builds with Sun Studio. r=nelson
Attachment #192554 - Flags: review?(nelson) → review+
Comment on attachment 191918 [details] [diff] [review] Address most review feedback Cancelling review request for this obsolete patch.
Attachment #191918 - Flags: superreview?(wtchang)
Checked in to trunk: Checking in Makefile; /cvsroot/mozilla/security/nss/lib/freebl/Makefile,v <-- Makefile new revision: 1.62; previous revision: 1.61 done Checking in manifest.mn; /cvsroot/mozilla/security/nss/lib/freebl/manifest.mn,v <-- manifest.mn new revision: 1.39; previous revision: 1.38 done RCS file: /cvsroot/mozilla/security/nss/lib/freebl/sha-fast-amd64-sun.s,v done Checking in sha-fast-amd64-sun.s; /cvsroot/mozilla/security/nss/lib/freebl/sha-fast-amd64-sun.s,v <-- sha-fast-amd64-sun.s initial revision: 1.1 done
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: