Closed Bug 356852 Opened 14 years ago Closed 10 years ago
Bitscanning operation can be made much faster with x86 instructions
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:220.127.116.11) Gecko/20060911 SUSE/18.104.22.168-1.6 Firefox/22.214.171.124 Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:126.96.36.199) Gecko/20060911 SUSE/188.8.131.52-1.6 Firefox/184.108.40.206 The Log2 and hibit/lobit functions in NSPR can be made much faster by using native x86 instructions rather than a series of shifts and compares. Reproducible: Always Steps to Reproduce: 1. Look at existing Log2 and hibit/lobit functions. 2. Be appalled at the contortions needed to reproduce what a single x86 opcode can do. 3. Actual Results: Functions execution much slower than needed on x86 platforms. Expected Results: Software should better utilize that hardware it is running on. I selected "Windows XP" as the OS but any x86 build using GCC can be improved. Patch was tested on Windows2K and on Linux.
Edited Summary for spelling. Sigh.
Summary: Bitscanning operation can be made much faster with x86 intructions → Bitscanning operation can be made much faster with x86 instructions
This patch is broken out from a larger patch that enhanced bit-scan operations for both the JS and NSPR components. I hope that this component-specific patch will be deemed acceptable. See bug #351532 for background and benchmarking info.
Notes: 0. This patch is against NSPR v4.6.3, as found in Firefox v2.0RC3. 1. There will probably be objections that mozilla/dbm/src/h_log2.c is not part of NSPR. True, but it is dependent upon NSPR, so I might as well use the enhanced functionality that the patch provides. I won't break my heart to have the modification of this file removed, though it does vastly improve the horrific performance of __log2().
Status: UNCONFIRMED → NEW
Ever confirmed: true
Comment on attachment 242422 [details] [diff] [review] Conditionally use x86 instructions for bit-scan operations Steve, mozilla/dbm is now part of NSS . For the dbm part, you will need a separate NSS bug. r- alone just for this - NSPR patches should only include changes to NSPR. If you want to implement the new function in NSPR and use it from NSS/DBM, you'll have to make it a public NSPR function, and export it from the NSPR shared library. But you'll need to think about your interface some more. Generally, if you define a public NSPR interface, then the caller should not need to know about the internals of the implementation of the function, and specifically there should be no #ifdef at the caller level. I think the same case could probably be made about your code in prdtoa.c, but it is less objectionable there, since that is internal NSPR code.
Attachment #242422 - Flags: review-
I definately do *not* want to establish a new interface in NSPR. My goal is that those who call the effected (existing) functions and macros will experience faster execution of the same functionality. Those who call PR_CeilingLog2(), PR_FloorLog2(), hi0bits(), lo0bits() should percieve no changes in calling conventions or behavior. I'll submit a patch that doesn't touch file mozilla/dbm/src/h_log2.c.
Changes from previous patch: 1. prbit.h: Renamed macros to indicate their 32-bitness. Done in anticipation of NSPR support for 64-bit bit-scan operations (intrinsically supported by both MSVC and GCC). 2. prdtoa.c: In lo0bit() replaced table look-up with faster xor/shift. 3. prdtoa.c: Added grouping parenthesis to 8 lines in d2b() and b2d(). Done because the window full of operator-precedence warnings issued by MSVC is distracting. 4. prlog2.c: Removed guts of PR_CeilingLog2() and PR_FloorLog2(), replacing the code with invocations of respective macros. Done for ease of maintainability since the function's and macro's code was the same.
Attachment #244350 - Attachment is obsolete: true
Julien, wtc, think you all would have time to review this and get it in before the Firefox 3 Beta 3 code freeze?
Comment on attachment 291038 [details] [diff] [review] Updated for impending review request I will review this on Saturday.
Yea we want this for b3
Flags: blocking1.9? → blocking1.9+
Priority: -- → P1
This is the same as Steve's patch except for some formatting changes. I don't have time to finish the review this weekend. Sorry. Here are some comments. 1. For GCC, we should use the __builtin_ctz and __builtin_clz intrinsics only on processors that have instructions for counting the number of leading and trailing zeroes. Otherwise the intrinsics would be similar to our current implementations of the PR_CEILING_LOG2 and PR_FLOOR_LOG2 macros, and defining PR_CEILING_LOG2 and PR_FLOOR_LOG2 in terms of the clz intrinsic may make them slower. 2. pr_bitscan_ctz32 is only used internally by prdtoa.c. prdtoa.c is third-party code. I want to avoid changing it unless absolutely necessary. So we may want to remove the prdtoa.c changes and the pr_bitscan_ctz32 macro from this patch. Otherwise, this patch looks good. I don't think this patch is a must-have for Firefox 3 though, especially after the bug it blocks (bug 404824) has been fixed with a different solution (a lookup table of precomputed values).
I did some more editing.
(In reply to comment #14) > Otherwise, this patch looks good. I don't think this patch > is a must-have for Firefox 3 though, especially after the > bug it blocks (bug 404824) has been fixed with a different > solution (a lookup table of precomputed values). > I'd love to get it in anyway...
Movin to P2 - but we'd still love to take this
Priority: P1 → P2
Comment on attachment 299642 [details] [diff] [review] Steve Snyder's patch v1.2 I think this is OK, but Wan-Teh should still review it before checkin.
Attachment #299642 - Flags: review?(julien.pierre.boogz) → review+
Wan-Teh if you get a review of this in we can land it for 1.9.1
Comment on attachment 299642 [details] [diff] [review] Steve Snyder's patch v1.2 I built this patch on windows with MSVC and looked at the results. No optimizations were seen. The reason was apparent after a closer look at the code. Observe: >+#define _PR_HAVE_BUILTIN_BITSCAN32 >+#ifdef PR_HAVE_BUILTIN_BITSCAN32 So, I changed all the referencs from PR_HAVE_BUILTIN_BITSCAN32 to _PR_HAVE_BUILTIN_BITSCAN32 and built it again, and this time I got the expected changes. But this tells me that this patch wasn't tested, and certainly no performance improvements were measured as having been caused by it. Although Wan-Teh didn't give it an r-, he did make a comment that clearly (to me) indicated that it needed a change before it would be approved. That change was to the portion of the code built under GNUC. He requested that it only use the builtin intrinsics on CPUs that actually have special instructions for this purpose, rather than on all CPUs. We're talking about some additional #ifs in the following block of new code: >+#elif (__GNUC__ > 3) || (__GNUC__ == 3 && __GNUC_MINOR__ >= 4) >+ >+#define __pr_ntz32(val) __builtin_ctz(val) >+#define __pr_nlz32(val) __builtin_clz(val) >+#define _PR_HAVE_BUILTIN_BITSCAN32 >+ >+#endif /* MSVC || GCC */ I don't know what CPUs have this feature or I'd have taken a stab at that myself. I'll let Steve do that. Steve, please ask me to review the next version of your patch.
Attachment #299642 - Flags: superreview?(wtc) → superreview-
Comment on attachment 433774 [details] [diff] [review] Patches cleanly against NSPR v4.8.4 >+#elif (__GNUC__ >= 4) || (__GNUC__ == 3 && __GNUC_MINOR__ >= 4) >+# define pr_bitscan_ctz32(val) __builtin_ctz(val) >+# define pr_bitscan_clz32(val) __builtin_clz(val) >+# define PR_HAS_BUILTIN_BITSCAN32 >+#endif /* MSVC || GCC */ Where's the test that says "and this CPU has instructions that make this intrinsic efficient" ?
(In reply to comment #22) > (From update of attachment 433774 [details] [diff] [review]) > > >+#elif (__GNUC__ >= 4) || (__GNUC__ == 3 && __GNUC_MINOR__ >= 4) > >+# define pr_bitscan_ctz32(val) __builtin_ctz(val) > >+# define pr_bitscan_clz32(val) __builtin_clz(val) > >+# define PR_HAS_BUILTIN_BITSCAN32 > >+#endif /* MSVC || GCC */ > > Where's the test that says "and this CPU has instructions that make > this intrinsic efficient" ? Nelson, you're killing me. The BSF/BSR instructions were introduced in the 80386, in 1985. Any 32- or 64-bit x86-compatible CPU has these instructions. http://pdos.csail.mit.edu/6.828/2008/readings/i386/BSF.htm I think the code below speaks for itself. I can gen VS2005 asm output too if further justification is required to bring this 25-year-old technology to NSPR. Before patch: ------------- PR_FloorLog2: movl %edi, %edx # n, j_ movl $16, %eax #, log2 shrl $16, %edx #, j_ testl %edx, %edx # j_ jne .L18 #, movl %edi, %edx # n, j_ xorb %al, %al # .L18: movl %edx, %ecx # j_, j_.47 shrl $8, %ecx #, j_.47 testl %ecx, %ecx # j_.47 jne .L19 #, movl %edx, %ecx # j_, j_.47 movl %ecx, %edx # j_.47, j_.48 shrl $4, %edx #, j_.48 testl %edx, %edx # j_.48 jne .L21 #, .L27: movl %ecx, %edx # j_.47, j_.48 movl %edx, %ecx # j_.48, j_.49 shrl $2, %ecx #, j_.49 testl %ecx, %ecx # j_.49 jne .L23 #, .L28: movl %edx, %ecx # j_.48, j_.49 shrl %ecx # tmp65 cmpl $1, %ecx #, tmp65 sbbl $-1, %eax #, log2 ret L19: addl $8, %eax #, log2 movl %ecx, %edx # j_.47, j_.48 shrl $4, %edx #, j_.48 testl %edx, %edx # j_.48 je .L27 #, .L21: addl $4, %eax #, log2 movl %edx, %ecx # j_.48, j_.49 shrl $2, %ecx #, j_.49 testl %ecx, %ecx # j_.49 je .L28 #, .L23: addl $2, %eax #, log2 shrl %ecx # tmp65 cmpl $1, %ecx #, tmp65 sbbl $-1, %eax #, log2 ret After patch: ------------ PR_FloorLog2: orl $1, %edi #, tmp61 movl $31, %eax #, tmp62 bsrl %edi, %edi # tmp61, D.1995 xorl $31, %edi #, D.1995 subl %edi, %eax # D.1995, tmp62 ret
(In reply to comment #22) > (From update of attachment 433774 [details] [diff] [review]) > > >+#elif (__GNUC__ >= 4) || (__GNUC__ == 3 && __GNUC_MINOR__ >= 4) > >+# define pr_bitscan_ctz32(val) __builtin_ctz(val) > >+# define pr_bitscan_clz32(val) __builtin_clz(val) > >+# define PR_HAS_BUILTIN_BITSCAN32 > >+#endif /* MSVC || GCC */ > > Where's the test that says "and this CPU has instructions that make > this intrinsic efficient" ? A less snarky answer to your question: the beauty of using compiler intrinsics (as opposed to asm) is that we don't have to make that decision. The compiler does it for us. What instruction is actually behind the __builtin_clz intrinsic? That's for the GCC developers to worry about. On x86 it is the BSR instruction, on PowerPC, or ARM, or whatever it is a different instruction, or sequence of instructions as is appropriate for that platform. The people at GCC and Microsoft have already done the heavy lifting for us. Why not use the features that NSPR-supported compilers make available to us?
Besides x86, NSPR runs on UltraSparc, PPC, PA-Risc 2, iTanium, ARM7, s390, MIPS R4k, and probably others that aren't coming to mind just now (we recently dropped DEC VAX, and I think we dropped DEC Alpha a while ago). Enable the intrinsics only on the CPUs where it's known to be an improvement due to special CPU instruction support. Thanks.
The bug says "with x86 instructions", so let's limit it to X86 CPUs, until we know of others that win, too.
Modified to address comment #26.
I'm prepared to give r+ to this latest patch, however ... In comment 14, Wan-Teh expressed great reluctance to changing prdtoa.c. Given all the problems we've had with that code lately, I don't blame him. (For a list, see https://bugzilla.mozilla.org/buglist.cgi?chfieldto=Now&query_format=advanced&chfieldfrom=365d&short_desc=prdtoa%20balloc&short_desc_type=anywords&product=Core&product=NSPR so I'm thinking maybe we should not commit the part of this patch that modifies prdtoa.c. Just as you measured ~10100 calls to PR_CeilingLog2 during Firefox startup, I will ask you to measure the number of calls to prtdtoa.c's functions hi0bits and lo0bits that occur during startup. This will help us get some idea of the value of perturbing prdtoa.c.
OK. I expect the total call count to be considerably less than last measured, given that a lot of calls were eliminated by the fix for bug 404824.
Count of invocations, bring Firefox v3.6.2 up to http://www.yahoo.com/ on WinXP: PR_CeilingLog2() = 73 PR_FloorLog2() = 0 hi0bits() = 1 lo0bits() = 15 macro PR_CEILING_LOG2 = 1640 PL_DHashTableInit() = 1537 nsVoidArray::GrowArrayBy() = 90 PL_DHashTableEnumerate() = 13 macro PR_CEILING_LOG2 = not used I also note that PL_InitArenaPool() is called ~950 times. At the time this bug was created, PL_InitArenaPool() called PR_CeilingLog2() unconditionally on every call. Now it uses tables look-ups. Overall, quite a difference in invocation counts from the Gecko 1.8.0 olden days.
Comment on attachment 433849 [details] [diff] [review] Steve's latest patch (1.4?) in CVS diff form OK, so I'll commit this patch, except for the part affecting prdtoa.c. The numbers suggest the prdtoa changes would not be significant anyway.
Attachment #433849 - Flags: review+
> macro PR_CEILING_LOG2 = not used I'm guessing that should say macro PR_FLOOR_LOG2 = not used Am I right?
Comment on attachment 299642 [details] [diff] [review] Steve Snyder's patch v1.2 Nelson, thanks for the code review. To be fair to Steve, I think it was me who introduced the _PR_HAVE_BUILTIN_BITSCAN32 vs. PR_HAVE_BUILTIN_BITSCAN32 typo. When you check in the patch, could you change PR_HAS to PR_HAVE for consistency with the existing macros? Thanks.
(In reply to comment #33) > > macro PR_CEILING_LOG2 = not used > > I'm guessing that should say > > macro PR_FLOOR_LOG2 = not used > > Am I right? Yes, that's right. Cut 'n' paste + poor proofreading on my part. Sorry. It seems that both the floor function and macro just exist to balance to the ceiling equivalents.
Bug 356852: Bitscanning operation can be made much faster with x86 instructions Patch contributed by Steve Snyder <firstname.lastname@example.org>, r=nelson Checking in include/prbit.h; new revision: 3.13; previous revision: 3.12 Checking in src/misc/prdtoa.c; new revision: 4.15; previous revision: 4.14 Checking in src/misc/prlog2.c; new revision: 3.6; previous revision: 3.5 Thanks for your contribution, Steve.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.