Closed Bug 331043 Opened 14 years ago Closed 12 years ago

Improve hash performance using _rotr intrinsic

Categories

(Core :: JavaScript Engine, defect, P1)

x86
Windows XP
defect

Tracking

()

RESOLVED FIXED

People

(Reporter: mmoy, Assigned: mmoy)

References

()

Details

(Keywords: perf)

Attachments

(4 files, 11 obsolete files)

16.20 KB, patch
Details | Diff | Splinter Review
1.94 KB, patch
Details | Diff | Splinter Review
2.39 KB, patch
Details | Diff | Splinter Review
7.81 KB, patch
brendan
: review+
brendan
: approval1.9+
Details | Diff | Splinter Review
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060318 Firefox/1.5.0.1 (mmoy CE 1.5.0.1 K8B-X69)
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060318 Firefox/1.5.0.1 (mmoy CE 1.5.0.1 K8B-X69)

There are about a dozen modules in the Mozilla code base that use the xor of a shift right and shift left to simulate a rotate as there is no rotate function
in the C language. This frequently looks like:

h = (h >> (JS_HASH_BITS - 4)) ^ (h << 4) ^ *s

Microsoft provides the intrinsic _rotr(tartet, number_of_bits) which 
results in generated code using the ror (rotate right) instruction. In
VS2002 and VS2003, two shifts, a mov and an xor are generated instead.

In VS2005, the rotate instruction is generated so Microsoft fixed the
bug. But I think that the 1.5 line is still using VS2002 so that line
would benefit from using the intrinsic. On VS2005, using the intrinsic
won't hurt.

I've been using patches to use _rotr since last summer with no problems.
I also use inline assembler for some routines.

Here's the assembler output from VS2003 compiled with -O2 and you can see
the difference between the intrinsic and the current Mozilla code. Hashing is
used a lot in Mozilla and this improvement should make for at least a small
improvement in overall performance. I've been using hash performance patches
since the Summer of 2004 though the older ones were inline assembler.

There is some additional performance gains possible using dword xors instead of
byte xors, especially when the length of the object to be hashed is known in advance but perhaps that's something for another bug.

; 6    : main() {

        push    ecx

; 7    :   unsigned int a, b, c, d;
; 8    :
; 9    :   scanf("%d", &a);

        lea     eax, DWORD PTR _a$[esp+4]
        push    eax
        push    OFFSET FLAT:??_C@_02DPKJAMEF@?$CFd?$AA@
        call    _scanf

; 10   :   b = a >> 28 | a << 4;
; 11   :   c = _rotr(a, 28);

        mov     eax, DWORD PTR _a$[esp+12]
        mov     ecx, eax
        mov     edx, eax
        ror     ecx, 28                                 ; 0000001cH
        shr     edx, 28                                 ; 0000001cH
        shl     eax, 4

; 12   :   printf("%d %d\n", b, c);

        push    ecx
        or      edx, eax
        push    edx
        push    OFFSET FLAT:??_C@_06IPGPIAII@?$CFd?5?$CFd?6?$AA@
        call    _printf

Reproducible: Always
If you're interested in this change, I can code up the dozen or so patches. I'd just have to add the conditionals for Windows.
Assignee: nobody → general
Component: General → JavaScript Engine
Product: Firefox → Core
QA Contact: general → general
Version: unspecified → Trunk
Flags: in-testsuite-
List of files that simulate rotate with left and right shifts related to hash:

    * nsprpub\lib\ds\plhash.c
    * js\src\jshash.c
    * js\src\jsdhash.c
    * xpcom\ds\nscrt.cpp
    * xpcom\glue\pldhash.c
    * js\src\liveconnect\jsj_hash.c
    * content\xul\document\src\nsElementMap.cpp
    * xpcom\ds\nsStaticNameTable.cpp
    * js\src\xpconnect\src\xpcmaps.cpp
    * netwerk\protocol\http\src\nsHttp.cpp
    * netwerk\cache\src\nsDiskCacheDevice.cpp
    * xpcom\glue\nsTHashtable.cpp
    * security\nss\lib\pki\pkistore.c
    * security\nss\lib\base\hash.c
    * js\src\jsscope.c
    * js\src\jsstr.c

Can someone assign this to me?
Assignee: general → mmoy
Note that xpcom/ds/pldhash.[ch] are sed-generated from js/src/jsdhash.[ch], so you should modify only the latter, then generate the former using

  sed -f plify_jsdhash.sed jsdhash.c > ../../xpcom/ds/pldhash.c
  sed -f plify_jsdhash.sed jsdhash.h > ../../xpcom/ds/pldhash.h

in js/src.

Also the only place that contains #ifdefs should be two header files, one in NSPR and the other in SpiderMonkey (js/src).  These two should mirror each other and define a macro, PR_ROTATE_RIGHT for NSPR and JS_ROTATE_RIGHT for SpiderMonkey.  The best header file for SpiderMonkey is arguably jsbit.h.  There may be an NSPR file with corresponding purpose, if not name.

/be
Status: UNCONFIRMED → NEW
Ever confirmed: true
VS2003 requires Oi optimization or O2 optimization to inline _rotr. Without Oi in VS2003, there's a callout to the __rotr routine and this would make this not worth
doing. I assume that VC6 has the same optimization switch requirement.

The last I checked, Mozilla uses -O1 -G6 as optimization switches. If this is incorrect, please let me know. I will check to see if there are #pragmas available to turn on inline optimization. And I will need to do some testing with VC6.
Good to know what VS7.1/8 do, as well.

I wonder why we aren't using O2 on Windows -- cc'ing bsmedberg, bryner, and vlad in case they know.

/be
O2 on windows flips the space/speed tradeoff towards "speed".  I believe the last time we evaluated this, we determined that the rather small speedup (3%?) wasn't worth the code bloat.
That was a vc6 test though, it would be useful to retest on our newer Windows compilers.
VS2005 requires -O2 or -Oi to get the inlined rotate command as well.
On VS2003 and VS2005, the ror inlines with:

#include <stdlib.h>
#pragma intrinsic(_rotr)

This approach doesn't have any requirements on the command line optimization flags.
(In reply to comment #8)
> O2 on windows flips the space/speed tradeoff towards "speed".  I believe the
> last time we evaluated this, we determined that the rather small speedup (3%?)
> wasn't worth the code bloat.

The official build image takes 6,999 KB (not sure whether that's 1.5 or 1.5.0.1 but it's probably 1.5.0.1 on my machine). Unofficial builders that use -O2 typically have image sizes in the 8,000 to 8,700 KB range with some of that due to -O2 and the rest due to private code or additional functionality or code.

Most of the unofficial builders also use -GL (Whole Program Optmization) which
adds to the image size by inlining across modules and routines. I also use
Profile Guided Optimization which improves performance by about 7%. I haven't
noticed larger code sizes with PGO but my comparisons are with GL and O2
assumed.
(In reply to comment #11)
> On VS2003 and VS2005, the ror inlines with:
> 
> #include <stdlib.h>
> #pragma intrinsic(_rotr)

NSS has been using _lrotr and _lrotl in rotate macros for years, FWIW.
http://lxr.mozilla.org/security/source/security/nss/lib/freebl/sha512.c#142
(In reply to comment #5)
> Also the only place that contains #ifdefs should be two header files, one in
> NSPR and the other in SpiderMonkey (js/src).  These two should mirror each
> other and define a macro, PR_ROTATE_RIGHT for NSPR and JS_ROTATE_RIGHT for
> SpiderMonkey.  The best header file for SpiderMonkey is arguably jsbit.h. 
> There may be an NSPR file with corresponding purpose, if not name.

how about prbit.h?
This applies against MOZILLA_1_8_BRANCH
Attachment #215598 - Attachment is obsolete: true
I'm going to try the patch on a trunk build next.
BTW, I verified the assembler code on all but two or three of the modules. The -FAs switch didn't pass on to some of the compile commands.
This is the trunk version of the patch. The previous version is for the MOZILLA_1_8_BRANCH.
I'm building the trunk right now.

I'm hoping that _X86_ covers Win64 as well. I'll try a Win64 build tonight if I have time.
Attached patch Trunk _rotr patch (obsolete) — Splinter Review
Trunkn patch modified to support Win64
Attached patch Trunk _rotr patch (obsolete) — Splinter Review
Trunkn patch modified to support Win64
Attachment #215741 - Attachment is obsolete: true
Attachment #215763 - Attachment is obsolete: true
Modified to support Win64
The second set of patches has been tested on Win64 (Release 1.5.0.1 - Branch Patch), Win32 Branch (MOZILLA_1_8_BRANCH) and Win32 Trunk in that I checked one
or two of the output assembler code and the ror (Rotate Right) instructions was
there.

How do I go about requesting reviews for the two patches?
What is the definition  of _rotr on Win64? Does it assumes 64 bit argument or 32 bit one?
(In reply to comment #24)
> What is the definition  of _rotr on Win64? Does it assumes 64 bit argument or
> 32 bit one?

That's a very good question.

http://msdn2.microsoft.com/en-us/library/5cc576c4(VS.80).aspx indicates that
_rotl and _rotr are 32-bit instructions while _rotl64 and _rotr64 are the
64-bit versions. There are also 8-bit and 16-bit versions of the intrinsic.

I had a look at the generated code and the register for the 64-bit version was
r8 while the register for the 32-bit version was r8d. I need to brush up on
AMD 64 assembler notation.

I wrote a small program that verified that the instruction produces results
for a 32-bit rotate.

And I've been using this intrinsic in my Community Edition Win64 1.5 and 
1.5.0.1 Release builds since Summer 2005.

Comment on attachment 215764 [details] [diff] [review]
Trunk _rotr patch

>+#if defined(_MSC_VER) && (defined(_X86_) || defined(_AMD64) || defined(_M_AMD64))
>+#include <stdlib.h>
>+#pragma intrinsic(_rotr)
>+#define JS_ROTATE_RIGHT(a, bits, length) _rotr(a, bits)
>+#else
>+#define JS_ROTATE_RIGHT(a, bits, length) (((a) >> (bits)) ^ ((a) << ((length) - (bits))))
>+#endif

Given that _rotr always assumes 32-bits argument and result, IMO the macros should be called JS_ROTATE_RIGHT32 etc. and do not take the length argument at all.
(In reply to comment #26)
> Given that _rotr always assumes 32-bits argument and result, IMO the macros
> should be called JS_ROTATE_RIGHT32 etc. and do not take the length argument at
> all.

The other half of the macro deals with non-Windows x86 platforms and the
assumption on those other platforms is that hashes are 32-bits as well (I 
think). The number of objects would have to be monsterous to suggest using 
64-bits.

The use of the length was to implicitly support some of the tags that indicate
integer length but changing the value to something other than 32 would work in
the Macro as a different intrinsic would be required.

I could change it to ROTATE_RIGHT32 or take out the length with the assumption
of 32 bits. I think that I'll leave this for discussion for a little while as
it it takes about 90 minutes to do the testing on the Branch, Trunk and Win64
and I'd rather keep to a minimum the number of times that I have to retest
everything.

One more comment is that the other side of the macro does not assume 32-bits in
the macro. I'd guess that the number of bits depends on the datatype that's
rotated.
(In reply to comment #28)
> One more comment is that the other side of the macro does not assume 32-bits in
> the macro. 

And this is bad: just imagine that new code would use JS_ROTATE_RIGHT(a, 4, sizeof(word-sized-variable)*8). It would work everywhere except Windows 64. Thus the reason for JS_ROTATE_RIGHT32 since it tells very explicitly what kind of rotation is going on. If some places would require a rotation over a 64 bit value, then JS_ROTATE_RIGHT64 can be added. 
(In reply to comment #29)
> And this is bad: just imagine that new code would use JS_ROTATE_RIGHT(a, 4,
> sizeof(word-sized-variable)*8). It would work everywhere except Windows 64.
> Thus the reason for JS_ROTATE_RIGHT32 since it tells very explicitly what kind
> of rotation is going on. If some places would require a rotation over a 64 bit
> value, then JS_ROTATE_RIGHT64 can be added. 

The development environment on Windows64 is the same as it is on Windows32 as far as datatypes go. So ints are 32 bits and everything else works as it would in a 32-bit environment. If you want to use 64-bit datatypes, then you'd use
long long or __int64. Which is what you'd probably do if you wanted to do 64-bit
operations in the 32-bit environment. But the compiler would simulate the 64-bit operations for you.

When I say the other side of the macro, I'm referring to the >> << case.

Interestingly enough, the case that you refer to (8 * sizeof x) does exist in
the Mozilla code base and works fine in the Windows 64 build.

Brendan suggested the names for the macros. How about just swiping the macro
used in sha512.c, rotr32, and attaching wither PR or JS for PR_rotr32 and JS_rotr32? Again, I'm looking for a concensus as I don't want to change the
names or the functionality more than once.

Michael: my name suggestion was just a suggested starting point.  Many similar macros in SpiderMonkey (js/src/*.[ch]) have implicit int32-based representation limits.

If on the other hand, the rotate macros want to work on a type whose bit size depends on compilation target, and if that type is not jsword/jsuword, then I agree with Igor that it's better to have the bit-size in the macro names.

The NSS names are kind of crufty-terse based on the MS names, and not ALL_CAPS (macros generally should be to follow prevailing style, unless free of side effects and multiple argument expansions).

So depending on which type is implicit in the rotate macros, we could have either JS_ROTATE_RIGHT as proposed, with one universal type which would have to be either uint32 or jsuword -- or JS_ROTATE_RIGHT_32 and _64.

If I'm reading things right here (recovering from a lost filesystem, also on Jury Duty now!), the type that's implicit in the macros is not well-defined in terms of jsuword (which is defined as the integral type big enough to hold any pointer type), so we should go with JS_ROTATE_RIGHT_32 and _64.

If others are happy with this reasoning and proposal after today, I say act on it.

Thanks for taking on this work.

/be
It was written in comment #25:

> http://msdn2.microsoft.com/en-us/library/5cc576c4(VS.80).aspx indicates that
> _rotl and _rotr are 32-bit instructions

and it was written in comment #30:

> (In reply to comment #29)
> Interestingly enough, the case that you refer to (8 * sizeof x) does exist in
> the Mozilla code base and works fine in the Windows 64 build.

Given these 2 comments with sizeof(x) == sizeof(word) the proposed definition of the macro would break Win64.
(In reply to comment #32)
> Given these 2 comments with sizeof(x) == sizeof(word) the proposed definition
> of the macro would break Win64.

The size of x apparently is 32 bits because the code has been working fine on
Windows 64 since last summer. The size of a word, at least the way it is
implemented in Mozilla, is apparently 32 bits as well. My Windows 64 builds
are at vector64.com if you want to see for yourself.

(In reply to comment #31)
> So depending on which type is implicit in the rotate macros, we could have
> either JS_ROTATE_RIGHT as proposed, with one universal type which would have to
> be either uint32 or jsuword -- or JS_ROTATE_RIGHT_32 and _64.
> 
> If I'm reading things right here (recovering from a lost filesystem, also on
> Jury Duty now!), the type that's implicit in the macros is not well-defined in
> terms of jsuword (which is defined as the integral type big enough to hold any
> pointer type), so we should go with JS_ROTATE_RIGHT_32 and _64.

On Windows 64, pointers are 64 bits but I'd guess that the jsuword is 32 bits.
My Windows 64 Trial License expired this morning and I just ordered the kit so
I can't spend much time on Windows 64 (it lets me login for an hour at a time)
but I should be able to verify that the correct code is generated. The Windows
64 license should arrive in a few days which will make it easier to do test
builds.

> If others are happy with this reasoning and proposal after today, I say act on
> it.

Thanks for your comments.
I guess I was not presize enough to write why I do not like the definition of the macros from the last patch. If one uses ROTATE(x, 4, sizeof(x) * 8) where x is int64, then it would work everywhere but Win64.

If one want universal macro, then it it should be something like:

#include <stdlib.h>
...

#if defined(_MSC_VER) && (defined(_X86_) || defined(_AMD64) || defined(_M_AMD64))
#pragma intrinsic(_rotr)
#define JS_ROTATE_RIGHT(a, bits) \
    (sizeof(a) == 4 ? _rotr(a, bits) : sizeof(a) == 8 ? _rotr64(a, bits) \
     : (((a) >> (bits)) ^ ((a) << (sizeof(a) * JS_BITS_PER_BYTE - (bits)))))
#else
#define JS_ROTATE_RIGHT(a, bits) \
    (((a) >> (bits)) ^ ((a) << (sizeof(a) * JS_BITS_PER_BYTE - (bits))))
#endif

Note also that <stdlib.h> is moved outside of #if so it would be impossible to forget to include stdlib.h in non-msvc builds.

But this also assumes that MSVC would eliminate constant expression from the code with options used with Mozilla builds. If not, then separated ROTATE_32/ROTATE_64 should be defined IMO.
(In reply to comment #34)
> On Windows 64, pointers are 64 bits but I'd guess that the jsuword is 32 bits.

SpiderMonkey assumes that sizeof(jsword) == sizeof(void*). In other word, jsword is an alias to intptr_t from C99.
(In reply to comment #36)
> (In reply to comment #34)
> > On Windows 64, pointers are 64 bits but I'd guess that the jsuword is 32 bits.
> 
> SpiderMonkey assumes that sizeof(jsword) == sizeof(void*). In other word,
> jsword is an alias to intptr_t from C99.

Or should be, but C99 support is still sadly lacking on a number of platforms we target, or targeted last year.

For a long time, only LP64 was supported on 64-bit targets.  That's the only sane model in my opinion, but of course MS came out with an LLP64 variant where long is 32 bits.  So we have had to add #ifdefs to fundamental typedefs that used to use long to mean "pointer-sized int".  Using C99 where we can autoconf-test for it and it works is great, so we should prefer that.

I'm still recovering from data loss, not sure where things stand in the trunk tip.

/be
(In reply to comment #37)
> That's the only sane model in my opinion, but of course MS came 
> out with an LLP64 variant where long is 32 bits.

I think that they looked at Alpha and Itanium and figured that they
needed something that minimized porting effort for their installed
base of developers and came up with what they did. There are some
performance benefits as some 32-bit instructions (like multiply and
divide) have lower latencies when working on 32-bit integers.
(In reply to comment #38)
> There are some
> performance benefits as some 32-bit instructions (like multiply and
> divide) have lower latencies when working on 32-bit integers.

Interesting. I read on Linux kernel mailing list that for Amd64 using 64-bit values for things like loop counters, indexes etc. is faster than 32 bit ones. Perhaps on Intel this is different.

Itanium, cough.  Alpha, so last millennium.

They did what they did, but I think it's bogus in light of the prehistory going back 30+ years where int could hold a pointer, and then 20+ years ago we went through the effort of purging that assumption in favor of long being big enough.

/be
(In reply to comment #39)
> Interesting. I read on Linux kernel mailing list that for Amd64 using 64-bit
> values for things like loop counters, indexes etc. is faster than 32 bit ones.
> Perhaps on Intel this is different.
 
Software Optimization Guide for AMD Athlon 64 and AMD Opteron Processors 3.4 32-Bit Legacy GPRs and Small Unsigned Integers

Use the 32-bit legacy general-purpose registers EAX through ESI instead of their 64-bit extensions to store unsigned integer values whose range never requires more than 32 bits, even if subsequent statements use the 32-bit value in a 64-bit operation.

...

In 64-bit mode, the machine-language representation of many instructions that operate on 64-bit register operands require a REX prefix byte, which increases the size of the code. However, instructions that operate on a 32-bit legacy register operand do not require the prefix and have the desirable side-effect of clearing the upper 32 bits of the extended register to zero. For example, using the AND instruction on ACX clears the upper half of RCX.

I didn't see anything in Chapter 3 General 64-Bit Optimizations that recommended using 64-bit operations over 32-bit operations unless you actually are dealing with 64-bit data.

In Appendix C, instructions that take longer to execute in 64-bit mode compared to 32-bit mode are BSF, DIV, IDIV, IMUL, SGDT, SIDT

The LOOP* instructions have latencies of 9 in 32-bit mode and 8 in 64-bit mode so this may be source of the information on your mailing list. Though I think
that most loops don't use the LOOP* instructions these days. The WBINVD is more
expensive in 32-bit than 64-bit mode but I've never heard of that instruction
before and it uses near 10,000 cycles.

I don't think that I have looked at the Pentium 4/Xeon docs for EM64T. The Intel
SOG that I have only refers to the 32-bit chips.
(In reply to comment #40)
> Itanium, cough.  Alpha, so last millennium.

A bunch of companies are spending $10 billion on Itanium and that's a lot
of good-paying software jobs. I don't know how successful it will be but
there's a lot of gravy out there right now.

Alpha had its problems but many of the hardware guys that worked on it
migrated over to Intel and AMD and some of the technology that we enjoy
today comes out of the work done on Alpha.

I think that the market has indicated to Microsoft that customers want the
easiest migration possible to 64-bits and that's what they have at the moment.
To get further improvements, you have to then get your hands dirty. But I think
that customers with legacy systems generally want to get the thing ported and
running and then work on incremental improvements later on.

> They did what they did, but I think it's bogus in light of the prehistory going
> back 30+ years where int could hold a pointer, and then 20+ years ago we went
> through the effort of purging that assumption in favor of long being big
> enough.

Microsoft put in a fair amount of effort into XP Pro x64 and there are still a
lot of driver issues out there that make it difficult to get up and running with
what you want to run on x64. I think that most people dual-boot and Microsoft
has implicitly acknowledged the issue of drivers with a relatively inexpensive
selling cost on x64. I paid $89 for xn x64 license today and this was a full
license; not an upgrade. The same thing on 32-bits would run somewhere between
$200 and $300.

So from an IT perspective, Microsoft's approach carries a lot of value to their
customers and developers. And of course AMD64 allows them to do this pretty
easily. From an engineering and portability perspective, it may not be pure.
But hey, they're Microsoft and they can quite often throw their weight around.
For operations happening in registers, 32-bit instructions can be minutely 
faster than 64.  This is true for both Intel and AMD 64 bit CPUs.  But for 
memory stores (reg->mem) on AMD Opterons (only), any stores other than 
64-bit aligned ones seem to require a read-modify-write memory cycle.  

We sped up a lot of crypto code for AMD Opteron by changing it to use 64-bit 
arrays in memory rather than 8-bit or 32-bit arrays.  Every such store becomes 
2-3x faster.  Overall RC4 performance went up over 2x by changing one array of
1024 uchars to 1024 ulongs, alone.  Technique was published by Mr Marc Bevand, 
http://etudiant.epita.fr/~bevand_m/ and converting to 64-bit values in memory 
for performance has become known as "Bevandizing".  
Packing in structs for space efficiency has often been opposed to minimizing latency when retiring stores.

Loads dominate stores.

Crypto, DSP, and otherwise numerically or combinatorially intensive code may need hand tuning.

All things considered, I still believe LP64 to be an easier migration target than LLP64, but I'm biased.  And, I'm not MS ;-).

/be
This works on the Trunk Win32 build. I'm going to try it on Win64 Release 1.5.0.1 next. And then the Win64 Trunk. I've never built the Win64 Trunk so I don't know if I will be able to get that to work.

I'm going to have to remove the 64-bit version for the Branch as VC6 doesn't support _rotr64. VS2003 does though.
Attachment #215715 - Attachment is obsolete: true
Attachment #215764 - Attachment is obsolete: true
The Trunk patch generates correct code on a Win64 Release 1.5.0.1 build so please have a look at that.

On the Branch side, I'm either going to need to dump the 64-bit version or
conditional code the 64-bit version based on the MSVC version number.

-----

On the alignment thing: it sounds like something to do with the cache coherency
channels as that's one of the few differences between the Opteron and the consumer CPUs. The SOG is pretty clear that you should naturally align 
everything in sight for best performance. Especially when you're working with
SIMD instructions.

The SOG indicates that you can get read/write performance on movlps/movhps
of two instructions per cycle if the moves are 8-byte aligned with latency of
1. If unaligned, the latency doubles.
How do I proceed on getting this into the code base?
Might I suggest the addition of a XX_ROTATE_RIGHT16 macro?

Apart from completeness, there are cases in the code of 16-bit SHIFT-and-OR where a simple rotate instruction would do.

For example, check out the SwapBytes() routines in xpfe/components/history/src/nsGlobalHistory.cpp and intl/uconv/ucvlatin/nsUnicodeToUCS2BE.cpp

It would be nice to see 

*p++ = (0x00FF & (aChar >> 8)) | (0xFF00 & (aChar << 8));

turned into a single rotate/move via

*p++ = XX_ROTATE_RIGHT16(aChar, 8);

Micheal,

I saw on your blog that you're looking at the 16-bit version of the rotate code.  Here are 4 examples, all SwapBytes(), doing byte-swapping on 16-bit Unicode chars:

mozilla/intl/uconv/ucvlatin/nsUCS2BEToUnicode.cpp
mozilla/intl/uconv/ucvlatin/nsUnicodeToUCS2BE.cpp
mozilla/toolkit/components/history/src/nsGlobalHistory.cpp
mozilla/xpfe/components/history/src/nsGlobalHistory.cpp

As noted above, they're all doing that godawfull shift/mask/or.  Those unfortuneate Windows users who regularly translate Unicode from non-native byte-ordering would surely appreciate a more efficient translation.  (Too bad MSVC doesn't like using the XCHG instruction.)
MSVC doesn't like to play with registers like al and ah in general so getting
it to do an xchg like that would be difficult. I haven't checked to see if
there's an intrinsic for that but I would doubt it.

One other thing about the xchg instruction is that it is latency 2 and VectorPath on the K8. On the Pentium 4, the latency is 1.5. On the other
hand, the rotate instructions have a 1 cycle latency on K8, P4 (except for Northwood where it is 4). I'm pretty sure that rotates are 1 cycle on Pentium M and Core though it isn't in my optimization guide for Intel.

There's no doubt that 16-bit rotate is the most efficient. At any rate, I'm
pretty busy at work right now but am doing test stuff at night on Mozilla.

(In reply to comment #49)
> Micheal,
> 
> I saw on your blog that you're looking at the 16-bit version of the rotate
> code.  Here are 4 examples, all SwapBytes(), doing byte-swapping on 16-bit
> Unicode chars:
> 
> mozilla/intl/uconv/ucvlatin/nsUCS2BEToUnicode.cpp
> mozilla/intl/uconv/ucvlatin/nsUnicodeToUCS2BE.cpp
> mozilla/toolkit/components/history/src/nsGlobalHistory.cpp
> mozilla/xpfe/components/history/src/nsGlobalHistory.cpp
> 
> As noted above, they're all doing that godawfull shift/mask/or.  Those
> unfortuneate Windows users who regularly translate Unicode from non-native
> byte-ordering would surely appreciate a more efficient translation.  (Too bad
> MSVC doesn't like using the XCHG instruction.)
> 

Two more places where the standard "(h >> 28) ^ (h << 4)" code can be replaced with the PR_ROTATE_RIGHT32 macro.

Patch modifies 2 files from the 1.8.0 branch.
This one also needs "perf" key word.
(In reply to comment #52)
> This one also needs "perf" key word.
> 

Keywords: perf
We are using VS2005 on trunk right now and cc in particular uses hashing quite a bit - so methinks we should definitely take a stab at this...
Flags: blocking1.9?
This bug is open against component JS, but the original patch touches several subsystems.

What is desired here, a JS-only patch or one that applies the use of the rotate instructions (via compiler intrinsics) across the mozilla source tree?
Comment 0 implies that VC2005 produces good instructions here, and we're using that on trunk, so is there any need for a fix?
VS2005 will generate the rotate instructions if an OR is used in the source, but not if an XOR is used.

Does generate rotate instruction:

        a >> 28 | a << 4

Does not generate rotate instruction:

        a >> 28 ^ a << 4

Most (all?) Mozilla code is using XOR, so the proposed changes really do provide some benefit.

(In reply to comment #56)
> Comment 0 implies that VC2005 produces good instructions here, and we're using
> that on trunk, so is there any need for a fix?
> 

I believe that comment is talking about the instructions emitted if the instrinsic's are used - which we currently don't use.  Is that right Michael?
(In reply to comment #55)
> This bug is open against component JS, but the original patch touches several
> subsystems.
> 
> What is desired here, a JS-only patch or one that applies the use of the rotate
> instructions (via compiler intrinsics) across the mozilla source tree?
> 

If it helps why not use it everywhere?
VS2005 uses the rotate instruction when modules are compiled at -Oi or -O2. According to the post from March 2006, Firefox builds at -O1 as -O2 increases code size. If the optimization level has moved to add -Oi or moved from -O1 to O2, then this code isn't needed outside of clarifying the code and putting the code for rotate into one central place. I would personally prefer that Firefox moved to -O2 but this isn't the best place to discuss that.

The idea, a long time ago, was to put the rotate intrinsics in one place so that they could be called from anywhere else and that it a change was required, that the change would only need be made in one place. There are other areas of code that could benefit from the intrinsic (they may be referenced above) used for a general rotate function.
(In reply to comment #56)
> Comment 0 implies that VC2005 produces good instructions here, and we're using
> that on trunk, so is there any need for a fix?
> 

VC2005 will produce it at -O2 and -Oi. We build at -O1, so we should take this patch for our hashtables.
The code is clearer with the patch anyway.  Let's take it.  It seems likely we need two separate patches, one for JS (spidermonkey), and one for the rest of the browser, to get this committed in a timely fashion?  Or at least, in my experience, it's been easier to drive patches home that way.  If there's disagreement, let me know.  I'll be glad to review the JS patch.
(In reply to comment #62)
> The code is clearer with the patch anyway.  Let's take it.  It seems likely we
> need two separate patches, one for JS (spidermonkey), and one for the rest of
> the browser, to get this committed in a timely fashion?  Or at least, in my
> experience, it's been easier to drive patches home that way.  If there's
> disagreement, let me know.  I'll be glad to review the JS patch.
> 

yea - separate JS and other patches might be good.  What do we need to do for GCC for linux/win?  Either way this looks like a good, easy, win.  So let's move on it.
Comment on attachment 215887 [details] [diff] [review]
Trunk patch to use _rotr for hash

>RCS file: /cvsroot/mozilla/js/src/jsbit.h,v
>retrieving revision 3.13
>diff -u -r3.13 jsbit.h
>--- js/src/jsbit.h	21 Feb 2006 11:28:28 -0000	3.13
>+++ js/src/jsbit.h	22 Mar 2006 04:13:56 -0000
>@@ -147,5 +147,28 @@
>     JS_END_MACRO
> #endif
> 
>+/*
>+** Macros for rotate right. There is no rotate operation in the C Language
>+** so the construct a >> 28 ^ a << 4 is frequently used instead. Most
>+** compilers convert this to a rotate instruction but MSVC doesn't without
>+** a little help. To get MSVC to generate a rotate instruction, we have
>+** to use the _rotr intrinsic and use a pragma to make _rotr inline.
>+**
>+** MSVC in VS2005 will do an inline rotate instruction on the above
>+** construct if the shifts are put together using or (|) but Mozilla
>+** uses xor (^) so we wouldn't pick up the improvement just by using
>+** VS2005.
>+*/

Please reformat like so:

/*
 * Macros for rotate right. There is no rotate operation in the C Language so
 * the construct a >> 28 ^ a << 4 is frequently used instead. Most compilers
 * convert this to a rotate instruction but MSVC doesn't without a little help.
 * To get MSVC to generate a rotate instruction, we have to use the _rotr
 * intrinsic and use a pragma to make _rotr inline.
 *
 * MSVC in VS2005 will do an inline rotate instruction on the above construct
 * if the shifts are put together using or (|) but Mozilla uses xor (^) so we
 * wouldn't pick up the improvement just by using VS2005.
 */

The ** comment style is old and copied from NSPR, so not to be imitated. I used vim Qj a bunch of times to get the wrapping to the right margin uniform.

I hope we can get this in with just my blanket r/sr+, but module owners should be cc'ed and have a chance to comment if they care.

/be
Attachment #215887 - Flags: superreview+
Attachment #215887 - Flags: review+
Attachment #215887 - Flags: approval1.9+
Flags: blocking1.9? → blocking1.9+
(In reply to comment #63)
> (In reply to comment #62)
> > The code is clearer with the patch anyway.  Let's take it.  It seems likely we
> > need two separate patches, one for JS (spidermonkey), and one for the rest of
> > the browser, to get this committed in a timely fashion?  Or at least, in my
> > experience, it's been easier to drive patches home that way.  If there's
> > disagreement, let me know.  I'll be glad to review the JS patch.
> > 
> 
> yea - separate JS and other patches might be good.  What do we need to do for
> GCC for linux/win?  Either way this looks like a good, easy, win.  So let's
> move on it.
> 

I think that GNU does the right thing without an intrinsic.
(In reply to comment #64)
...
> The ** comment style is old and copied from NSPR, so not to be imitated. I used
> vim Qj a bunch of times to get the wrapping to the right margin uniform.
> 
> I hope we can get this in with just my blanket r/sr+, but module owners should
> be cc'ed and have a chance to comment if they care.
> 
> /be

I need to set up a new FF3 environment to redo the patch and retest.
Priority: -- → P1
NSS does efficient rotates in code that does extensive rotation.  NSS has optimizations for several platforms, including MSVC and others.  See 
http://mxr.mozilla.org/security/search?string=_rot&find=fast&tree=security
Attached patch updated to current trunk (obsolete) — Splinter Review
Attachment #215887 - Attachment is obsolete: true
The following is the only difference shown by a diff-of-diffs between attachment 296638 [details] [diff] [review] and attachment 215887 [details] [diff] [review]:

374,377c374,377
< retrieving revision 1.20
< diff -u -r1.20 nsStaticNameTable.cpp
< --- xpcom/ds/nsStaticNameTable.cpp	18 Apr 2004 14:18:13 -0000	1.20
< +++ xpcom/ds/nsStaticNameTable.cpp	22 Mar 2006 04:14:27 -0000
---
> retrieving revision 1.24
> diff -u -p -3 -r1.24 nsStaticNameTable.cpp
> --- xpcom/ds/nsStaticNameTable.cpp	8 Jul 2007 07:08:50 -0000	1.24
> +++ xpcom/ds/nsStaticNameTable.cpp	10 Jan 2008 18:58:18 -0000
386,387c386,387
< @@ -82,7 +83,7 @@
<             NS_STATIC_CAST(const unsigned char*, key);
---
> @@ -112,14 +113,14 @@ caseInsensitiveStringHashKey(PLDHashTabl
>          for (const PRUnichar* s = tableKey->mKeyStr.m2b->get();
391a392,400
>      } else {
>          for (const unsigned char* s =
>                   reinterpret_cast<const unsigned char*>
>                                   (tableKey->mKeyStr.m1b->get());
>               *s != '\0';
>               s++)
> -            h = (h >> (PL_DHASH_BITS - 4)) ^ (h << 4) ^ (*s & ~0x20);
> +            h = PR_ROTATE_RIGHT32(h, PL_DHASH_BITS - 4) ^ (*s & ~0x20);
>      }
Attached patch with comment nit picked (obsolete) — Splinter Review
Asking for review again, since this could introduce subtle performance bugs, if got wrong.
Attachment #296368 - Attachment is obsolete: true
Attachment #296370 - Flags: review?(brendan)
Comment on attachment 296370 [details] [diff] [review]
with comment nit picked

>+ * Macros for rotate right. There is no rotate operation in the C Language
>+ * so the construct a >> 28 ^ a << 4 is frequently used instead. Most

Uber-nit, helps avoid ragged right margin too: parenthesize << and >> against ^ (not necessary but C's operator precedence hierarchy misplaces things, and gcc warns [or used to] if you don't over-parenthesize, so it's a good idea even in comments).

Ditto for clone in NSPR:

>+** Macros for rotate right. There is no rotate operation in the C Language
>+** so the construct a >> 28 ^ a << 4 is frequently used instead. Most

/be
Attachment #296370 - Flags: review?(brendan)
Attachment #296370 - Flags: review+
Attachment #296370 - Flags: approval1.9+
Crowder is this ready to go?
Attachment #296370 - Attachment is obsolete: true
Attachment #299100 - Flags: review+
Yeah, ready to go as soon as the tree is happy again.
Keywords: checkin-needed
Comment on attachment 299100 [details] [diff] [review]
updated to current trunk, and brendan's last nits picked

Brian, why don't you use | instead of ^ in the default definition
of the rotate right macros?  Using ^ contradicts your comments.
Wan-Teh Chang:  Not sure I see your issue, can you reference the code and comments you mean, specifically?  Also, the original patch belongs to Michael Moy (who I hope is still listening here), who I hope can defend his logic better than I can.
Brian, In comment #57, Steve Snyder wrote:

> Does generate rotate instruction:
>         a >> 28 | a << 4
> 
> Does not generate rotate instruction:
>         a >> 28 ^ a << 4

Suggesting that these rotate expressions should use '|' rather than '^'
but the patch in attachment 299100 [details] [diff] [review] persists in using '^'.
(In reply to comment #77)
> Suggesting that these rotate expressions should use '|' rather than '^'
> but the patch in attachment 299100 [details] [diff] [review] persists in using '^'.

Well, OR vs XOR does make a difference in code generation to get the same result.  But we're talking about an eccentricity of a specific compiler, MSVC.

Given that the purpose of the patch is to specify the use of the _rotr intrinsics when building with MSVC we shouldn't have to be concerned with whether better code is generated with OR or with XOR.  Put simply, _rotr() takes the OR / XOR issue off the table.
Nelson:  Right.  Therefore we must use the intrinsic, since we need "^".  The intrinsic tells the compiler to do the right thing, since it does not divine the correct instruction, as it does in the "|" case.
Using OR accumulates bits so long strings will tend towards all of the bits being turned on. Using XOR can flip bits on and off better distributing incoming bytes, words and longs through a longword.

I'm a bit short on Windows hardware for Mozilla at the moment and am in the middle of setting up a new system to finish this bug up.
I think we're pretty close to wrapping up this bug.  mmoy: can you check out my patch?  Next problem is getting the pieces landed; there is a limited subset of people who have access to all of these parts of the tree (brendan, only?).  I will split the patches, if Brendan can't land.
(In reply to comment #81)
> Next problem is getting the pieces landed; there is a limited subset of
> people who have access to all of these parts of the tree (brendan, only?).  I
> will split the patches, if Brendan can't land.

Brendan doesn't have access to all three restricted partitions; only gerv does, strangely enough (for licensing things). So, probably should just split the patches by putting the NSS and NSPR changes in a separate patch. Then wtc or kaie or somebody can land that part, and you can land everything else.
reed:  Feel like splitting it for me?  :)
There's no real need to split patches.  Wan-Teh or I can land the NSS & NSPR 
parts of that patch, as it is.  

One question though.  Do the parts outside of NSS & NSPR depend on the NSS 
parts being applied first?  (I'm too lazy to look. :)  IF so, then the other
parts of the patch should not land until the NSS & NSPR parts have landed, 
and the CVS tag for NSS used in mozilla builds is updated.
js/src relies on its own macros, so that chunk of the patch can land whenever.  For the rest of the tree, the NSPR hunks should probably land first.  After that, I think order does not matter.
(a >> 28) ^ (a << 4) is equal to
(a >> 28) | (a << 4) if a is unsigned,
and a needs to be unsigned for either expression to be equal to _rotr(a, 28).

So why don't we use | in the default definition of the ROTATE_RIGHT macros?
The comments in the patch suggest that | has a better chance of causing the
compiler to generate good code.

Test program:
#include <stdio.h>


int main()
{
    /*unsigned*/ int a = 0x87654321;
    printf("0x%08x\n", (a >> 28) ^ (a << 4));
    printf("0x%08x\n", (a >> 28) | (a << 4));
    printf("0x%08x\n", _rotr(a, 28));
    return 0;
}
Reed, I am a despot (https://despot.mozilla.org/despot.cgi) -- I have all the keys to the kingdoms.

/be
(In reply to comment #86)
> (a >> 28) ^ (a << 4) is equal to
> (a >> 28) | (a << 4) if a is unsigned,
> and a needs to be unsigned for either expression to be equal to _rotr(a, 28).
> 
> So why don't we use | in the default definition of the ROTATE_RIGHT macros?
> The comments in the patch suggest that | has a better chance of causing the
> compiler to generate good code.

In the past, I've run into problems with two or three cases of using ROR for the above hash code and I'm wondering if they were using signed integers which would imply a bug. I'm planning on going through all of the cases where this code is used to check for signed integers and should have results Saturday night EST.

Using | does look simpler than using the intrinsic. Perhaps with casts to ensure unsigned ints.
(In reply to comment #87)
> Reed, I am a despot (https://despot.mozilla.org/despot.cgi) -- I have all the
> keys to the kingdoms.

So am I, but that doesn't mean I go around and add myself to partitions that I don't own just because I want to check-in there. :)
Reed, lighten up -- reasoning based on who happens to be listed in despot is silly (Gerv wouldn't do random commits just because he got access to relicense, once upon a time). My point is that the commits could be done if needed.

/be
Comment on attachment 299100 [details] [diff] [review]
updated to current trunk, and brendan's last nits picked

I'd like to suggest a change to this patch.

This patch shows that ROTATE_LEFT32 is more appropriate
for our code base.  If we only provide macros for rotating
in one direction, we should provide ROTATE_LEFT32, then
code like:

    PR_ROTATE_RIGHT32(result, 28)
    JS_ROTATE_RIGHT32(h, JS_DHASH_BITS - 4)

would become

    PR_ROTATE_LEFT32(result, 4)
    JS_ROTATE_LEFT32(h, 4)

Another change I need to make to the NSPR portions of
this patch is to check the value of _MSC_VER as well
because NSPR still supports MSVC 6.0.
(In reply to comment #90)
> Reed, lighten up -- reasoning based on who happens to be listed in despot is
> silly (Gerv wouldn't do random commits just because he got access to relicense,
> once upon a time). My point is that the commits could be done if needed.

I left out "only" after "reasoning", and even then this may sound like a fast and loose approach to who commits changes -- not my intention.

/be
I added PR_ROTATE_LEFT32 and removed PR_ROTATE_RIGHT64.
Please let me know if you need 64-bit rotate macros.

I edited the comments to reflect this change and to call
out that the first argument |a| must be an unsigned 32-bit
integer type.

I believe the macro _AMD64 is missing a trailing underscore,
so I changed it to _AMD64_.  MSDN says the _rotl and _rotr
intrinsics are available on the "x86, Intel 64-bit, and AMD
64-bit" architectures, so perhaps we don't need to test any
architecture macro.

I compiled the code with MSVC 6.0 and 2005 successfully, so
I didn't add a check for MSVC version.

I checked in this patch on the NSPR trunk for NSPR 4.7.  I
will create a new NSPR CVS tag for the Mozilla trunk tomorrow.

Checking in pr/include/prbit.h;
/cvsroot/mozilla/nsprpub/pr/include/prbit.h,v  <--  prbit.h
new revision: 3.9; previous revision: 3.8
done
Checking in lib/ds/plhash.c;
/cvsroot/mozilla/nsprpub/lib/ds/plhash.c,v  <--  plhash.c
new revision: 3.11; previous revision: 3.10
done
Comment on attachment 299463 [details] [diff] [review]
NSPR patch (checked in)

I also use OR (|) instead of XOR (^).
Comment on attachment 299100 [details] [diff] [review]
updated to current trunk, and brendan's last nits picked

>+#if defined(_MSC_VER) && (defined(_X86_) || defined(_AMD64) || defined(_M_AMD64))

We need to test _M_IX86, which is defined by the compiler.  _X86_
is defined by <windows.h>, so it isn't defined if the file doesn't
include <windows.h>.
Thanks for doing the update. I went through all calls and saw unsigned in every case. I also checked the assembler code to ensure that there were no SARs. My AMD64 tags are from 2005 and the 2.0 line and they may have changed since then.
NSS changes committed on trunk
Checking in base/hash.c;    new revision: 1.10; previous revision: 1.9
Checking in pki/pkistore.c; new revision: 1.31; previous revision: 1.30
This is the macro test I ended up using:

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_AMD64) || \
    defined(_M_X64))

You might want to do the same in JavaScript's jsbit.h.
_M_AMD64 and _M_X64 mean the same thing, but _M_X64 is
the official one now.

I'd prefer shorter macro names such as PR_ROTL32 and JS_ROTR32.
Not a big deal though if you disagree.
I'd prefer to match wtc's changes in the JS macrology. Crowder, can do?

/be
Why keep ROTATE_RIGHT32?
Attached patch left is more right than right (obsolete) — Splinter Review
Changed js/src and xpconnect to use ROTATE_LEFT instead of ROTATE_RIGHT, abandoned 64-bit rotate macros and changed the preprocessor check to use the same guards as used in NSPR.  Still think there's no point in keeping JS_ROTATE_RIGHT32
Attachment #299100 - Attachment is obsolete: true
Attachment #299856 - Flags: review?(brendan)
Whiteboard: [needs landing]
Comment on attachment 299856 [details] [diff] [review]
left is more right than right

>+#pragma intrinsic(_rotr, _rotr64)

You need to list _rotl instead of _rotr64 here.

I agree that JavaScript doesn't need JS_ROTATE_RIGHT32.
I kept PR_ROTATE_RIGHT32 for completeness because NSPR
is a general-purpose library.  But jsbit.h is for JavaScript
internal use only, so you only need to define what you need.

You still have two calls to JS_ROTATE_RIGHT32 left.
Comment on attachment 299856 [details] [diff] [review]
left is more right than right

>+/*
>+ * Macros for rotate right. There is no rotate operation in the C Language
>+ * so the construct (a >> 28) ^ (a << 4) is frequently used instead. Most

Remember to update this block comment.  You can use what I did in NSPR
as a reference:
http://lxr.mozilla.org/nspr/source/nsprpub/pr/include/prbit.h#110
Comment on attachment 299856 [details] [diff] [review]
left is more right than right

Patch coming that fixes the things Wan-Teh mentions
Attachment #299856 - Attachment is obsolete: true
Attachment #299856 - Flags: review?(brendan)
Attached patch one more try (obsolete) — Splinter Review
Attachment #300095 - Flags: review?(brendan)
Comment on attachment 300095 [details] [diff] [review]
one more try

r=wtc.  There is a comment bug and a code bug.

>+ * the construct (a << 28) | (a >> 4) is used instead. Most compilers convert

Nit: say  (a << 4) | (a >> 28)

>+ * the _rotr intrinsic and use a pragma to make _rotr inline.

_rotr ==> _rotl (two occurrences)

>-    hash = (hash >> (JS_DHASH_BITS - 4)) ^ (hash << 4)
>+    hash = JS_ROTATE_LEFT32(hash, JS_DHASH_BITS - 4)

The second argument should be 4 (this is in jsscope.c).
Attachment #300095 - Flags: review+
Attachment #300095 - Attachment is obsolete: true
Attachment #300095 - Flags: review?(brendan)
Comment on attachment 300165 [details] [diff] [review]
more silly errors on my part fixed, thanks wtc

Still like to give Brendan a chance to eyeball this before I land it.
Attachment #300165 - Flags: review?(brendan)
Comment on attachment 300165 [details] [diff] [review]
more silly errors on my part fixed, thanks wtc

Looks good, thanks.

/be
Attachment #300165 - Flags: review?(brendan)
Attachment #300165 - Flags: review+
Attachment #300165 - Flags: approval1.9+
Checking in jsbit.h;
/cvsroot/mozilla/js/src/jsbit.h,v  <--  jsbit.h
new revision: 3.18; previous revision: 3.17
done
Checking in jsdhash.c;
/cvsroot/mozilla/js/src/jsdhash.c,v  <--  jsdhash.c
new revision: 1.48; previous revision: 1.47
done
Checking in jshash.c;
/cvsroot/mozilla/js/src/jshash.c,v  <--  jshash.c
new revision: 3.20; previous revision: 3.19
done
Checking in jsscope.c;
/cvsroot/mozilla/js/src/jsscope.c,v  <--  jsscope.c
new revision: 3.77; previous revision: 3.76
done
Checking in jsstr.c;
/cvsroot/mozilla/js/src/jsstr.c,v  <--  jsstr.c
new revision: 3.188; previous revision: 3.187
done
Checking in liveconnect/jsj_hash.c;
/cvsroot/mozilla/js/src/liveconnect/jsj_hash.c,v  <--  jsj_hash.c
new revision: 1.13; previous revision: 1.12
done
Checking in xpconnect/src/xpcmaps.cpp;
/cvsroot/mozilla/js/src/xpconnect/src/xpcmaps.cpp,v  <--  xpcmaps.cpp
new revision: 1.26; previous revision: 1.25
done
I believe that my landing means this bug is fixed.  Someone else should disabuse me of that notion, if I am incorrect.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Blocks: 415264
Blocks: 415262
You need to log in before you can comment on or make changes to this bug.