Closed Bug 356852 Opened 14 years ago Closed 10 years ago

Bitscanning operation can be made much faster with x86 instructions

Categories

(NSPR :: NSPR, enhancement, P2)

x86
Windows XP
enhancement

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: swsnyder, Assigned: swsnyder)

References

Details

(Keywords: perf)

Attachments

(1 file, 8 obsolete files)

User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.7) Gecko/20060911 SUSE/1.5.0.7-1.6 Firefox/1.5.0.7
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.7) Gecko/20060911 SUSE/1.5.0.7-1.6 Firefox/1.5.0.7

The Log2 and hibit/lobit functions in NSPR can be made much faster by using native x86 instructions rather than a series of shifts and compares.


Reproducible: Always

Steps to Reproduce:
1. Look at existing Log2 and hibit/lobit functions.
2. Be appalled at the contortions needed to reproduce what a single x86 opcode can do.
3.

Actual Results:  
Functions execution much slower than needed on x86 platforms.


Expected Results:  
Software should better utilize that hardware it is running on.


I selected "Windows XP" as the OS but any x86 build using GCC can be improved.  Patch was tested on Windows2K and on Linux.
Edited Summary for spelling.  Sigh.
Summary: Bitscanning operation can be made much faster with x86 intructions → Bitscanning operation can be made much faster with x86 instructions
This patch is broken out from a larger patch that enhanced bit-scan operations for both the JS and NSPR components.  I hope that this component-specific patch will be deemed acceptable.  See bug #351532 for background and benchmarking info.
Notes:

0. This patch is against NSPR v4.6.3, as found in Firefox v2.0RC3.

1. There will probably be objections that mozilla/dbm/src/h_log2.c is not part of NSPR.  True, but it is dependent upon NSPR, so I might as well use the enhanced functionality that the patch provides.  I won't break my heart to have the modification of this file removed, though it does vastly improve the horrific performance of __log2().

Status: UNCONFIRMED → NEW
Ever confirmed: true
Comment on attachment 242422 [details] [diff] [review]
Conditionally use x86 instructions for bit-scan operations

Steve,

mozilla/dbm is now part of NSS . For the dbm part, you will need a separate NSS bug. r- alone just for this - NSPR patches should only include changes to NSPR.

If you want to implement the new function in NSPR and use it from NSS/DBM, you'll have to make it a public NSPR function, and export it from the NSPR shared library. 

But you'll need to think about your interface some more. Generally, if you define a public NSPR interface, then the caller should not need to know about the internals of the implementation of the function, and specifically there should be no #ifdef at the caller level.

I think the same case could probably be made about your code in prdtoa.c, but it is less objectionable there, since that is internal NSPR code.
Attachment #242422 - Flags: review-
I definately do *not* want to establish a new interface in NSPR.

My goal is that those who call the effected (existing) functions and macros will experience faster execution of the same functionality.

Those who call PR_CeilingLog2(), PR_FloorLog2(),  hi0bits(), lo0bits() should percieve no changes in calling conventions or behavior.

I'll submit a patch that doesn't touch file mozilla/dbm/src/h_log2.c.
mozilla/dbm is now part of the NSS component.  It seems that
the best way to improve the __log2 function in dbm is to have
it call PR_CEILING_LOG2.

Although I am interested in such optimizations, your benchmarking
needs to show this optimization really matters to the users of Mozilla
clients.  For example, if you can show that Firefox spends 5% of the
time in these functions during startup, or if Firefox spends 4% of
the time when rendering a page with a lot of JavaScript, that'll make
a strong case to improve these functions at the cost of complicating
the source code with ifdefs.  I certainly believe your patch speeds
up these functions significantly, but I'm not sure if the performance
of these functions matters.  There is some value in using portable C
code.

If you are interested in these functions, I can recommend a book.
Hacker's Delight by Hanry S. Warren, Jr. has algorithms for many
such problems.  The log2 functions are discussed in Section 5-3,
Counting Leading 0's.  You can download a Revisions PDF file from
the book's web site, which has new algorithms for counting leading
0's since the book was published.  The book doesn't use assembly
code, so its algorithms won't replace your patch.  The book's goal
is to do the best you can with C code.  The C code is often written
in hand-tuned styles for a target compiler to generate efficient code.
It seems that our implementation is a variant of the algorithm in
Figure 5.7.
Hmm...

Bringing Firefox v2.0RC3 up to http://www.yahoo.com on Win2K invokes PR_CeilingLog2() ~10100 times and PR_FloorLog2() ~600 times.  I submit that any chunk of code that is called over 10,000 times in casual use is worthy of optimization.  (Pls see bug #351532 for relative performance stats.)

I agree that the use of the PR_CEILING_LOG2 macro is the way to go with __log2().  In fact, the JavaScript people have removed the guts from JS_CeilingLog2() and JS_FloorLog2() entirely and now use them simply as function wrappers around the equivalent macros.  Doing the same in my NSPR patch would have simplified it at the cost of making assumptions about what the NSPR owners want.

Finally, thanks for the book recommendation.  As someone who has tried to make his through Mozilla's nested macros and platform-specific #ifdef's I'm sympathetic to the plea for simple C code.  That said, when I see a torturous sequence of jumps and conditional shifts it is hard to ignore it when I know a single instruction would accomplish the same thing.
Changes from previous patch:

1. prbit.h: Renamed macros to indicate their 32-bitness.  Done in anticipation of NSPR support for 64-bit bit-scan operations (intrinsically supported by both MSVC and GCC).

2. prdtoa.c: In lo0bit() replaced table look-up with faster xor/shift.

3. prdtoa.c: Added grouping parenthesis to 8 lines in d2b() and b2d().  Done because the window full of operator-precedence warnings issued by MSVC is distracting.

4. prlog2.c: Removed guts of PR_CeilingLog2() and PR_FloorLog2(), replacing the code with invocations of respective macros.  Done for ease of maintainability since the function's and macro's code was the same.
Assignee: wtchang → swsnyder
Attachment #242422 - Attachment is obsolete: true
Attachment #242441 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Target Milestone: --- → 4.7
QA Contact: wtchang → nspr
Modifications on this 13-month anniversary of the last revision:

1. Removed the clean-up of warnings from d2b() and b2d().  They are not related to the issue that this bug addresses.

2. Renamed the wrappers around the MSVC intrinsic functions from __BitScan[Direction]32 to __prBitScan[Direction]32.  Done to avoid potential name conflicts with the same wrappers used in JavaScript (see Bug 356856).

3. Cleaner patching against the current v4.6.7 release (line number fuzz).
Attachment #244350 - Attachment is obsolete: true
Attachment #291038 - Flags: review?(wtc)
Attachment #291038 - Flags: review?(julien.pierre.boogz)
Julien, wtc, think you all would have time to review this and get it in before the Firefox 3 Beta 3 code freeze?
Blocks: 404824
Flags: blocking1.9?
Keywords: perf
Comment on attachment 291038 [details] [diff] [review]
Updated for impending review request

I will review this on Saturday.
Yea we want this for b3
Flags: blocking1.9? → blocking1.9+
Priority: -- → P1
Attached patch Steve Snyder's patch v1.1 (obsolete) — Splinter Review
This is the same as Steve's patch except for some formatting
changes.

I don't have time to finish the review this weekend.  Sorry.
Here are some comments.

1. For GCC, we should use the __builtin_ctz and __builtin_clz
intrinsics only on processors that have instructions for
counting the number of leading and trailing zeroes.  Otherwise
the intrinsics would be similar to our current implementations
of the PR_CEILING_LOG2 and PR_FLOOR_LOG2 macros, and defining
PR_CEILING_LOG2 and PR_FLOOR_LOG2 in terms of the clz intrinsic
may make them slower.

2. pr_bitscan_ctz32 is only used internally by prdtoa.c.
prdtoa.c is third-party code.  I want to avoid changing it
unless absolutely necessary.  So we may want to remove the
prdtoa.c changes and the pr_bitscan_ctz32 macro from this
patch.

Otherwise, this patch looks good.  I don't think this patch
is a must-have for Firefox 3 though, especially after the
bug it blocks (bug 404824) has been fixed with a different
solution (a lookup table of precomputed values).
Attachment #291038 - Attachment is obsolete: true
Attachment #299581 - Flags: superreview?(wtc)
Attachment #299581 - Flags: review?(julien.pierre.boogz)
Attachment #291038 - Flags: review?(wtc)
Attachment #291038 - Flags: review?(julien.pierre.boogz)
Attached patch Steve Snyder's patch v1.2 (obsolete) — Splinter Review
I did some more editing.
Attachment #299581 - Attachment is obsolete: true
Attachment #299642 - Flags: superreview?(wtc)
Attachment #299642 - Flags: review?(julien.pierre.boogz)
Attachment #299581 - Flags: superreview?(wtc)
Attachment #299581 - Flags: review?(julien.pierre.boogz)
(In reply to comment #14)

> Otherwise, this patch looks good.  I don't think this patch
> is a must-have for Firefox 3 though, especially after the
> bug it blocks (bug 404824) has been fixed with a different
> solution (a lookup table of precomputed values).
> 

I'd love to get it in anyway...
Movin to P2 - but we'd still love to take this
Priority: P1 → P2
Flags: tracking1.9+ → wanted-next+
Comment on attachment 299642 [details] [diff] [review]
Steve Snyder's patch v1.2

I think this is OK, but Wan-Teh should still review it before checkin.
Attachment #299642 - Flags: review?(julien.pierre.boogz) → review+
Wan-Teh if you get a review of this in we can land it for 1.9.1
Flags: wanted1.9.1?
Comment on attachment 299642 [details] [diff] [review]
Steve Snyder's patch v1.2

I built this patch on windows with MSVC and looked at the results.  
No optimizations were seen.  The reason was apparent after a closer
look at the code.  Observe:

>+#define _PR_HAVE_BUILTIN_BITSCAN32
>+#ifdef   PR_HAVE_BUILTIN_BITSCAN32 

So, I changed all the referencs from PR_HAVE_BUILTIN_BITSCAN32 to 
_PR_HAVE_BUILTIN_BITSCAN32 and built it again, and this time I 
got the expected changes.  

But this tells me that this patch wasn't tested, and certainly no
performance improvements were measured as having been caused by it.

Although Wan-Teh didn't give it an r-, he did make a comment that
clearly (to me) indicated that it needed a change before it would
be approved.  That change was to the portion of the code built 
under GNUC.  He requested that it only use the builtin intrinsics
on CPUs that actually have special instructions for this purpose,
rather than on all CPUs.  We're talking about some additional #ifs
in the following block of new code:

>+#elif (__GNUC__ > 3) || (__GNUC__ == 3 && __GNUC_MINOR__ >= 4)
>+
>+#define __pr_ntz32(val)  __builtin_ctz(val)
>+#define __pr_nlz32(val)  __builtin_clz(val)
>+#define _PR_HAVE_BUILTIN_BITSCAN32
>+
>+#endif /* MSVC || GCC */

I don't know what CPUs have this feature or I'd have taken a stab
at that myself.  I'll let Steve do that.  Steve, please ask me to
review the next version of your patch.
Attachment #299642 - Flags: superreview?(wtc) → superreview-
Attachment #299642 - Attachment is obsolete: true
Comment on attachment 433774 [details] [diff] [review]
Patches cleanly against NSPR v4.8.4


>+#elif (__GNUC__ >= 4) || (__GNUC__ == 3 && __GNUC_MINOR__ >= 4)
>+# define pr_bitscan_ctz32(val)  __builtin_ctz(val)
>+# define pr_bitscan_clz32(val)  __builtin_clz(val)
>+# define  PR_HAS_BUILTIN_BITSCAN32
>+#endif /* MSVC || GCC */

Where's the test that says "and this CPU has instructions that make
this intrinsic efficient" ?
(In reply to comment #22)
> (From update of attachment 433774 [details] [diff] [review])
> 
> >+#elif (__GNUC__ >= 4) || (__GNUC__ == 3 && __GNUC_MINOR__ >= 4)
> >+# define pr_bitscan_ctz32(val)  __builtin_ctz(val)
> >+# define pr_bitscan_clz32(val)  __builtin_clz(val)
> >+# define  PR_HAS_BUILTIN_BITSCAN32
> >+#endif /* MSVC || GCC */
> 
> Where's the test that says "and this CPU has instructions that make
> this intrinsic efficient" ?

Nelson, you're killing me.

The BSF/BSR instructions were introduced in the 80386, in 1985.  Any 32- or 64-bit x86-compatible CPU has these instructions.

http://pdos.csail.mit.edu/6.828/2008/readings/i386/BSF.htm

I think the code below speaks for itself.  I can gen VS2005 asm output too if further justification is required to bring this 25-year-old technology to NSPR.

Before patch:
-------------

PR_FloorLog2:
        movl    %edi, %edx      # n, j_
        movl    $16, %eax       #, log2
        shrl    $16, %edx       #, j_
        testl   %edx, %edx      # j_
        jne     .L18    #,
        movl    %edi, %edx      # n, j_
        xorb    %al, %al        #
.L18:
        movl    %edx, %ecx      # j_, j_.47
        shrl    $8, %ecx        #, j_.47
        testl   %ecx, %ecx      # j_.47
        jne     .L19    #,
        movl    %edx, %ecx      # j_, j_.47
        movl    %ecx, %edx      # j_.47, j_.48
        shrl    $4, %edx        #, j_.48
        testl   %edx, %edx      # j_.48
        jne     .L21    #,
.L27:
        movl    %ecx, %edx      # j_.47, j_.48
        movl    %edx, %ecx      # j_.48, j_.49
        shrl    $2, %ecx        #, j_.49
        testl   %ecx, %ecx      # j_.49
        jne     .L23    #,
.L28:
        movl    %edx, %ecx      # j_.48, j_.49
        shrl    %ecx    # tmp65
        cmpl    $1, %ecx        #, tmp65
        sbbl    $-1, %eax       #, log2
        ret

L19:
        addl    $8, %eax        #, log2
        movl    %ecx, %edx      # j_.47, j_.48
        shrl    $4, %edx        #, j_.48
        testl   %edx, %edx      # j_.48
        je      .L27    #,
.L21:
        addl    $4, %eax        #, log2
        movl    %edx, %ecx      # j_.48, j_.49
        shrl    $2, %ecx        #, j_.49
        testl   %ecx, %ecx      # j_.49
        je      .L28    #,
.L23:
        addl    $2, %eax        #, log2
        shrl    %ecx    # tmp65
        cmpl    $1, %ecx        #, tmp65
        sbbl    $-1, %eax       #, log2
        ret


After patch:
------------

PR_FloorLog2:
        orl     $1, %edi        #, tmp61
        movl    $31, %eax       #, tmp62
        bsrl    %edi, %edi      # tmp61, D.1995
        xorl    $31, %edi       #, D.1995
        subl    %edi, %eax      # D.1995, tmp62
        ret
(In reply to comment #22)
> (From update of attachment 433774 [details] [diff] [review])
> 
> >+#elif (__GNUC__ >= 4) || (__GNUC__ == 3 && __GNUC_MINOR__ >= 4)
> >+# define pr_bitscan_ctz32(val)  __builtin_ctz(val)
> >+# define pr_bitscan_clz32(val)  __builtin_clz(val)
> >+# define  PR_HAS_BUILTIN_BITSCAN32
> >+#endif /* MSVC || GCC */
> 
> Where's the test that says "and this CPU has instructions that make
> this intrinsic efficient" ?

A less snarky answer to your question: the beauty of using compiler intrinsics (as opposed to asm) is that we don't have to make that decision.  The compiler does it for us.

What instruction is actually behind the __builtin_clz intrinsic?  That's for the GCC developers to worry about.  On x86 it is the BSR instruction, on PowerPC, or ARM, or whatever it is a different instruction, or sequence of instructions as is appropriate for that platform.

The people at GCC and Microsoft have already done the heavy lifting for us.  Why not use the features that NSPR-supported compilers make available to us?
Besides x86, NSPR runs on UltraSparc, PPC, PA-Risc 2, iTanium, ARM7, s390, 
MIPS R4k, and probably others that aren't coming to mind just now (we 
recently dropped DEC VAX, and I think we dropped DEC Alpha a while ago).  
Enable the intrinsics only on the CPUs where it's known to be an improvement
due to special CPU instruction support.  Thanks.
The bug says "with x86 instructions", so let's limit it to X86 CPUs, until
we know of others that win, too.
Modified to address comment #26.
Attachment #433774 - Attachment is obsolete: true
Attachment #433777 - Flags: review?(nelson)
I'm prepared to give r+ to this latest patch, however ...
In comment 14, Wan-Teh expressed great reluctance to changing prdtoa.c.
Given all the problems we've had with that code lately, I don't blame him.
(For a list, see
https://bugzilla.mozilla.org/buglist.cgi?chfieldto=Now&query_format=advanced&chfieldfrom=365d&short_desc=prdtoa%20balloc&short_desc_type=anywords&product=Core&product=NSPR
so I'm thinking maybe we should not commit the part of this patch that modifies prdtoa.c.  

Just as you measured ~10100 calls to PR_CeilingLog2 during Firefox startup,
I will ask you to measure the number of calls to prtdtoa.c's functions hi0bits and lo0bits that occur during startup.  This will help us get some idea of the
value of perturbing prdtoa.c.
OK.

I expect the total call count to be considerably less than last measured, given that a lot of calls were eliminated by the fix for bug 404824.
Count of invocations, bring Firefox v3.6.2 up to http://www.yahoo.com/ on WinXP:

PR_CeilingLog2()              =   73
PR_FloorLog2()                =    0
hi0bits()                     =    1 
lo0bits()                     =   15 

macro PR_CEILING_LOG2         = 1640
   PL_DHashTableInit()        = 1537
   nsVoidArray::GrowArrayBy() =   90
   PL_DHashTableEnumerate()   =   13

macro PR_CEILING_LOG2         = not used

I also note that PL_InitArenaPool() is called ~950 times.  At the time this bug was created, PL_InitArenaPool() called PR_CeilingLog2() unconditionally on every call.  Now it uses tables look-ups.

Overall, quite a difference in invocation counts from the Gecko 1.8.0 olden days.
Comment on attachment 433849 [details] [diff] [review]
Steve's latest patch (1.4?) in CVS diff form

OK, so I'll commit this patch, except for the part affecting prdtoa.c.  The numbers suggest the prdtoa changes would not be significant anyway.
Attachment #433849 - Flags: review+
Attachment #433777 - Attachment is obsolete: true
Attachment #433777 - Flags: review?(nelson)
> macro PR_CEILING_LOG2         = not used

I'm guessing that should say 

 macro PR_FLOOR_LOG2         = not used

Am I right?
Comment on attachment 299642 [details] [diff] [review]
Steve Snyder's patch v1.2

Nelson, thanks for the code review.  To be fair to Steve, I think it was me who
introduced the _PR_HAVE_BUILTIN_BITSCAN32 vs. PR_HAVE_BUILTIN_BITSCAN32 typo.

When you check in the patch, could you change PR_HAS to PR_HAVE for consistency
with the existing macros?  Thanks.
(In reply to comment #33)
> > macro PR_CEILING_LOG2         = not used
> 
> I'm guessing that should say 
> 
>  macro PR_FLOOR_LOG2         = not used
> 
> Am I right?

Yes, that's right.  Cut 'n' paste + poor proofreading on my part.  Sorry.

It seems that both the floor function and macro just exist to balance to the ceiling equivalents.
Bug 356852: Bitscanning operation can be made much faster with x86 instructions
Patch contributed by Steve Snyder <swsnyder@snydernet.net>, r=nelson

Checking in include/prbit.h;   new revision: 3.13; previous revision: 3.12
Checking in src/misc/prdtoa.c; new revision: 4.15; previous revision: 4.14
Checking in src/misc/prlog2.c; new revision: 3.6;  previous revision: 3.5

Thanks for your contribution, Steve.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: 4.7 → 4.8.5
Blocks: 575534
You need to log in before you can comment on or make changes to this bug.