Closed
Bug 356852
Opened 18 years ago
Closed 14 years ago
Bitscanning operation can be made much faster with x86 instructions
Categories
(NSPR :: NSPR, enhancement, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
4.8.5
People
(Reporter: swsnyder, Assigned: swsnyder)
References
Details
(Keywords: perf)
Attachments
(1 file, 8 obsolete files)
5.44 KB,
patch
|
nelson
:
review+
|
Details | Diff | Splinter Review |
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.7) Gecko/20060911 SUSE/1.5.0.7-1.6 Firefox/1.5.0.7 Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.7) Gecko/20060911 SUSE/1.5.0.7-1.6 Firefox/1.5.0.7 The Log2 and hibit/lobit functions in NSPR can be made much faster by using native x86 instructions rather than a series of shifts and compares. Reproducible: Always Steps to Reproduce: 1. Look at existing Log2 and hibit/lobit functions. 2. Be appalled at the contortions needed to reproduce what a single x86 opcode can do. 3. Actual Results: Functions execution much slower than needed on x86 platforms. Expected Results: Software should better utilize that hardware it is running on. I selected "Windows XP" as the OS but any x86 build using GCC can be improved. Patch was tested on Windows2K and on Linux.
Assignee | ||
Comment 1•18 years ago
|
||
Edited Summary for spelling. Sigh.
Summary: Bitscanning operation can be made much faster with x86 intructions → Bitscanning operation can be made much faster with x86 instructions
Assignee | ||
Comment 2•18 years ago
|
||
This patch is broken out from a larger patch that enhanced bit-scan operations for both the JS and NSPR components. I hope that this component-specific patch will be deemed acceptable. See bug #351532 for background and benchmarking info.
Assignee | ||
Comment 3•18 years ago
|
||
Notes: 0. This patch is against NSPR v4.6.3, as found in Firefox v2.0RC3. 1. There will probably be objections that mozilla/dbm/src/h_log2.c is not part of NSPR. True, but it is dependent upon NSPR, so I might as well use the enhanced functionality that the patch provides. I won't break my heart to have the modification of this file removed, though it does vastly improve the horrific performance of __log2().
Updated•18 years ago
|
Status: UNCONFIRMED → NEW
Ever confirmed: true
Comment 4•18 years ago
|
||
Comment on attachment 242422 [details] [diff] [review] Conditionally use x86 instructions for bit-scan operations Steve, mozilla/dbm is now part of NSS . For the dbm part, you will need a separate NSS bug. r- alone just for this - NSPR patches should only include changes to NSPR. If you want to implement the new function in NSPR and use it from NSS/DBM, you'll have to make it a public NSPR function, and export it from the NSPR shared library. But you'll need to think about your interface some more. Generally, if you define a public NSPR interface, then the caller should not need to know about the internals of the implementation of the function, and specifically there should be no #ifdef at the caller level. I think the same case could probably be made about your code in prdtoa.c, but it is less objectionable there, since that is internal NSPR code.
Attachment #242422 -
Flags: review-
Assignee | ||
Comment 5•18 years ago
|
||
I definately do *not* want to establish a new interface in NSPR. My goal is that those who call the effected (existing) functions and macros will experience faster execution of the same functionality. Those who call PR_CeilingLog2(), PR_FloorLog2(), hi0bits(), lo0bits() should percieve no changes in calling conventions or behavior. I'll submit a patch that doesn't touch file mozilla/dbm/src/h_log2.c.
Assignee | ||
Comment 6•18 years ago
|
||
Comment 7•18 years ago
|
||
mozilla/dbm is now part of the NSS component. It seems that the best way to improve the __log2 function in dbm is to have it call PR_CEILING_LOG2. Although I am interested in such optimizations, your benchmarking needs to show this optimization really matters to the users of Mozilla clients. For example, if you can show that Firefox spends 5% of the time in these functions during startup, or if Firefox spends 4% of the time when rendering a page with a lot of JavaScript, that'll make a strong case to improve these functions at the cost of complicating the source code with ifdefs. I certainly believe your patch speeds up these functions significantly, but I'm not sure if the performance of these functions matters. There is some value in using portable C code. If you are interested in these functions, I can recommend a book. Hacker's Delight by Hanry S. Warren, Jr. has algorithms for many such problems. The log2 functions are discussed in Section 5-3, Counting Leading 0's. You can download a Revisions PDF file from the book's web site, which has new algorithms for counting leading 0's since the book was published. The book doesn't use assembly code, so its algorithms won't replace your patch. The book's goal is to do the best you can with C code. The C code is often written in hand-tuned styles for a target compiler to generate efficient code. It seems that our implementation is a variant of the algorithm in Figure 5.7.
Assignee | ||
Comment 8•18 years ago
|
||
Hmm... Bringing Firefox v2.0RC3 up to http://www.yahoo.com on Win2K invokes PR_CeilingLog2() ~10100 times and PR_FloorLog2() ~600 times. I submit that any chunk of code that is called over 10,000 times in casual use is worthy of optimization. (Pls see bug #351532 for relative performance stats.) I agree that the use of the PR_CEILING_LOG2 macro is the way to go with __log2(). In fact, the JavaScript people have removed the guts from JS_CeilingLog2() and JS_FloorLog2() entirely and now use them simply as function wrappers around the equivalent macros. Doing the same in my NSPR patch would have simplified it at the cost of making assumptions about what the NSPR owners want. Finally, thanks for the book recommendation. As someone who has tried to make his through Mozilla's nested macros and platform-specific #ifdef's I'm sympathetic to the plea for simple C code. That said, when I see a torturous sequence of jumps and conditional shifts it is hard to ignore it when I know a single instruction would accomplish the same thing.
Assignee | ||
Comment 9•18 years ago
|
||
Changes from previous patch: 1. prbit.h: Renamed macros to indicate their 32-bitness. Done in anticipation of NSPR support for 64-bit bit-scan operations (intrinsically supported by both MSVC and GCC). 2. prdtoa.c: In lo0bit() replaced table look-up with faster xor/shift. 3. prdtoa.c: Added grouping parenthesis to 8 lines in d2b() and b2d(). Done because the window full of operator-precedence warnings issued by MSVC is distracting. 4. prlog2.c: Removed guts of PR_CeilingLog2() and PR_FloorLog2(), replacing the code with invocations of respective macros. Done for ease of maintainability since the function's and macro's code was the same.
Assignee: wtchang → swsnyder
Attachment #242422 -
Attachment is obsolete: true
Attachment #242441 -
Attachment is obsolete: true
Status: NEW → ASSIGNED
Updated•18 years ago
|
Target Milestone: --- → 4.7
Updated•18 years ago
|
QA Contact: wtchang → nspr
Assignee | ||
Comment 10•17 years ago
|
||
Modifications on this 13-month anniversary of the last revision: 1. Removed the clean-up of warnings from d2b() and b2d(). They are not related to the issue that this bug addresses. 2. Renamed the wrappers around the MSVC intrinsic functions from __BitScan[Direction]32 to __prBitScan[Direction]32. Done to avoid potential name conflicts with the same wrappers used in JavaScript (see Bug 356856). 3. Cleaner patching against the current v4.6.7 release (line number fuzz).
Attachment #244350 -
Attachment is obsolete: true
Updated•17 years ago
|
Attachment #291038 -
Flags: review?(wtc)
Attachment #291038 -
Flags: review?(julien.pierre.boogz)
Comment 11•17 years ago
|
||
Julien, wtc, think you all would have time to review this and get it in before the Firefox 3 Beta 3 code freeze?
Updated•17 years ago
|
Comment 12•17 years ago
|
||
Comment on attachment 291038 [details] [diff] [review] Updated for impending review request I will review this on Saturday.
Comment 13•17 years ago
|
||
Yea we want this for b3
Flags: blocking1.9? → blocking1.9+
Priority: -- → P1
Comment 14•17 years ago
|
||
This is the same as Steve's patch except for some formatting changes. I don't have time to finish the review this weekend. Sorry. Here are some comments. 1. For GCC, we should use the __builtin_ctz and __builtin_clz intrinsics only on processors that have instructions for counting the number of leading and trailing zeroes. Otherwise the intrinsics would be similar to our current implementations of the PR_CEILING_LOG2 and PR_FLOOR_LOG2 macros, and defining PR_CEILING_LOG2 and PR_FLOOR_LOG2 in terms of the clz intrinsic may make them slower. 2. pr_bitscan_ctz32 is only used internally by prdtoa.c. prdtoa.c is third-party code. I want to avoid changing it unless absolutely necessary. So we may want to remove the prdtoa.c changes and the pr_bitscan_ctz32 macro from this patch. Otherwise, this patch looks good. I don't think this patch is a must-have for Firefox 3 though, especially after the bug it blocks (bug 404824) has been fixed with a different solution (a lookup table of precomputed values).
Attachment #291038 -
Attachment is obsolete: true
Attachment #299581 -
Flags: superreview?(wtc)
Attachment #299581 -
Flags: review?(julien.pierre.boogz)
Attachment #291038 -
Flags: review?(wtc)
Attachment #291038 -
Flags: review?(julien.pierre.boogz)
Comment 15•17 years ago
|
||
I did some more editing.
Attachment #299581 -
Attachment is obsolete: true
Attachment #299642 -
Flags: superreview?(wtc)
Attachment #299642 -
Flags: review?(julien.pierre.boogz)
Attachment #299581 -
Flags: superreview?(wtc)
Attachment #299581 -
Flags: review?(julien.pierre.boogz)
Comment 16•17 years ago
|
||
(In reply to comment #14) > Otherwise, this patch looks good. I don't think this patch > is a must-have for Firefox 3 though, especially after the > bug it blocks (bug 404824) has been fixed with a different > solution (a lookup table of precomputed values). > I'd love to get it in anyway...
Updated•16 years ago
|
Flags: tracking1.9+ → wanted-next+
Comment 18•16 years ago
|
||
Comment on attachment 299642 [details] [diff] [review] Steve Snyder's patch v1.2 I think this is OK, but Wan-Teh should still review it before checkin.
Attachment #299642 -
Flags: review?(julien.pierre.boogz) → review+
Comment 19•16 years ago
|
||
Wan-Teh if you get a review of this in we can land it for 1.9.1
Flags: wanted1.9.1?
Comment 20•14 years ago
|
||
Comment on attachment 299642 [details] [diff] [review] Steve Snyder's patch v1.2 I built this patch on windows with MSVC and looked at the results. No optimizations were seen. The reason was apparent after a closer look at the code. Observe: >+#define _PR_HAVE_BUILTIN_BITSCAN32 >+#ifdef PR_HAVE_BUILTIN_BITSCAN32 So, I changed all the referencs from PR_HAVE_BUILTIN_BITSCAN32 to _PR_HAVE_BUILTIN_BITSCAN32 and built it again, and this time I got the expected changes. But this tells me that this patch wasn't tested, and certainly no performance improvements were measured as having been caused by it. Although Wan-Teh didn't give it an r-, he did make a comment that clearly (to me) indicated that it needed a change before it would be approved. That change was to the portion of the code built under GNUC. He requested that it only use the builtin intrinsics on CPUs that actually have special instructions for this purpose, rather than on all CPUs. We're talking about some additional #ifs in the following block of new code: >+#elif (__GNUC__ > 3) || (__GNUC__ == 3 && __GNUC_MINOR__ >= 4) >+ >+#define __pr_ntz32(val) __builtin_ctz(val) >+#define __pr_nlz32(val) __builtin_clz(val) >+#define _PR_HAVE_BUILTIN_BITSCAN32 >+ >+#endif /* MSVC || GCC */ I don't know what CPUs have this feature or I'd have taken a stab at that myself. I'll let Steve do that. Steve, please ask me to review the next version of your patch.
Attachment #299642 -
Flags: superreview?(wtc) → superreview-
Assignee | ||
Comment 21•14 years ago
|
||
Attachment #299642 -
Attachment is obsolete: true
Comment 22•14 years ago
|
||
Comment on attachment 433774 [details] [diff] [review] Patches cleanly against NSPR v4.8.4 >+#elif (__GNUC__ >= 4) || (__GNUC__ == 3 && __GNUC_MINOR__ >= 4) >+# define pr_bitscan_ctz32(val) __builtin_ctz(val) >+# define pr_bitscan_clz32(val) __builtin_clz(val) >+# define PR_HAS_BUILTIN_BITSCAN32 >+#endif /* MSVC || GCC */ Where's the test that says "and this CPU has instructions that make this intrinsic efficient" ?
Assignee | ||
Comment 23•14 years ago
|
||
(In reply to comment #22) > (From update of attachment 433774 [details] [diff] [review]) > > >+#elif (__GNUC__ >= 4) || (__GNUC__ == 3 && __GNUC_MINOR__ >= 4) > >+# define pr_bitscan_ctz32(val) __builtin_ctz(val) > >+# define pr_bitscan_clz32(val) __builtin_clz(val) > >+# define PR_HAS_BUILTIN_BITSCAN32 > >+#endif /* MSVC || GCC */ > > Where's the test that says "and this CPU has instructions that make > this intrinsic efficient" ? Nelson, you're killing me. The BSF/BSR instructions were introduced in the 80386, in 1985. Any 32- or 64-bit x86-compatible CPU has these instructions. http://pdos.csail.mit.edu/6.828/2008/readings/i386/BSF.htm I think the code below speaks for itself. I can gen VS2005 asm output too if further justification is required to bring this 25-year-old technology to NSPR. Before patch: ------------- PR_FloorLog2: movl %edi, %edx # n, j_ movl $16, %eax #, log2 shrl $16, %edx #, j_ testl %edx, %edx # j_ jne .L18 #, movl %edi, %edx # n, j_ xorb %al, %al # .L18: movl %edx, %ecx # j_, j_.47 shrl $8, %ecx #, j_.47 testl %ecx, %ecx # j_.47 jne .L19 #, movl %edx, %ecx # j_, j_.47 movl %ecx, %edx # j_.47, j_.48 shrl $4, %edx #, j_.48 testl %edx, %edx # j_.48 jne .L21 #, .L27: movl %ecx, %edx # j_.47, j_.48 movl %edx, %ecx # j_.48, j_.49 shrl $2, %ecx #, j_.49 testl %ecx, %ecx # j_.49 jne .L23 #, .L28: movl %edx, %ecx # j_.48, j_.49 shrl %ecx # tmp65 cmpl $1, %ecx #, tmp65 sbbl $-1, %eax #, log2 ret L19: addl $8, %eax #, log2 movl %ecx, %edx # j_.47, j_.48 shrl $4, %edx #, j_.48 testl %edx, %edx # j_.48 je .L27 #, .L21: addl $4, %eax #, log2 movl %edx, %ecx # j_.48, j_.49 shrl $2, %ecx #, j_.49 testl %ecx, %ecx # j_.49 je .L28 #, .L23: addl $2, %eax #, log2 shrl %ecx # tmp65 cmpl $1, %ecx #, tmp65 sbbl $-1, %eax #, log2 ret After patch: ------------ PR_FloorLog2: orl $1, %edi #, tmp61 movl $31, %eax #, tmp62 bsrl %edi, %edi # tmp61, D.1995 xorl $31, %edi #, D.1995 subl %edi, %eax # D.1995, tmp62 ret
Assignee | ||
Comment 24•14 years ago
|
||
(In reply to comment #22) > (From update of attachment 433774 [details] [diff] [review]) > > >+#elif (__GNUC__ >= 4) || (__GNUC__ == 3 && __GNUC_MINOR__ >= 4) > >+# define pr_bitscan_ctz32(val) __builtin_ctz(val) > >+# define pr_bitscan_clz32(val) __builtin_clz(val) > >+# define PR_HAS_BUILTIN_BITSCAN32 > >+#endif /* MSVC || GCC */ > > Where's the test that says "and this CPU has instructions that make > this intrinsic efficient" ? A less snarky answer to your question: the beauty of using compiler intrinsics (as opposed to asm) is that we don't have to make that decision. The compiler does it for us. What instruction is actually behind the __builtin_clz intrinsic? That's for the GCC developers to worry about. On x86 it is the BSR instruction, on PowerPC, or ARM, or whatever it is a different instruction, or sequence of instructions as is appropriate for that platform. The people at GCC and Microsoft have already done the heavy lifting for us. Why not use the features that NSPR-supported compilers make available to us?
Comment 25•14 years ago
|
||
Besides x86, NSPR runs on UltraSparc, PPC, PA-Risc 2, iTanium, ARM7, s390, MIPS R4k, and probably others that aren't coming to mind just now (we recently dropped DEC VAX, and I think we dropped DEC Alpha a while ago). Enable the intrinsics only on the CPUs where it's known to be an improvement due to special CPU instruction support. Thanks.
Comment 26•14 years ago
|
||
The bug says "with x86 instructions", so let's limit it to X86 CPUs, until we know of others that win, too.
Assignee | ||
Comment 27•14 years ago
|
||
Modified to address comment #26.
Attachment #433774 -
Attachment is obsolete: true
Attachment #433777 -
Flags: review?(nelson)
Comment 28•14 years ago
|
||
Comment 29•14 years ago
|
||
I'm prepared to give r+ to this latest patch, however ... In comment 14, Wan-Teh expressed great reluctance to changing prdtoa.c. Given all the problems we've had with that code lately, I don't blame him. (For a list, see https://bugzilla.mozilla.org/buglist.cgi?chfieldto=Now&query_format=advanced&chfieldfrom=365d&short_desc=prdtoa%20balloc&short_desc_type=anywords&product=Core&product=NSPR so I'm thinking maybe we should not commit the part of this patch that modifies prdtoa.c. Just as you measured ~10100 calls to PR_CeilingLog2 during Firefox startup, I will ask you to measure the number of calls to prtdtoa.c's functions hi0bits and lo0bits that occur during startup. This will help us get some idea of the value of perturbing prdtoa.c.
Assignee | ||
Comment 30•14 years ago
|
||
OK. I expect the total call count to be considerably less than last measured, given that a lot of calls were eliminated by the fix for bug 404824.
Assignee | ||
Comment 31•14 years ago
|
||
Count of invocations, bring Firefox v3.6.2 up to http://www.yahoo.com/ on WinXP: PR_CeilingLog2() = 73 PR_FloorLog2() = 0 hi0bits() = 1 lo0bits() = 15 macro PR_CEILING_LOG2 = 1640 PL_DHashTableInit() = 1537 nsVoidArray::GrowArrayBy() = 90 PL_DHashTableEnumerate() = 13 macro PR_CEILING_LOG2 = not used I also note that PL_InitArenaPool() is called ~950 times. At the time this bug was created, PL_InitArenaPool() called PR_CeilingLog2() unconditionally on every call. Now it uses tables look-ups. Overall, quite a difference in invocation counts from the Gecko 1.8.0 olden days.
Comment 32•14 years ago
|
||
Comment on attachment 433849 [details] [diff] [review] Steve's latest patch (1.4?) in CVS diff form OK, so I'll commit this patch, except for the part affecting prdtoa.c. The numbers suggest the prdtoa changes would not be significant anyway.
Attachment #433849 -
Flags: review+
Updated•14 years ago
|
Attachment #433777 -
Attachment is obsolete: true
Attachment #433777 -
Flags: review?(nelson)
Comment 33•14 years ago
|
||
> macro PR_CEILING_LOG2 = not used
I'm guessing that should say
macro PR_FLOOR_LOG2 = not used
Am I right?
Comment 34•14 years ago
|
||
Comment on attachment 299642 [details] [diff] [review] Steve Snyder's patch v1.2 Nelson, thanks for the code review. To be fair to Steve, I think it was me who introduced the _PR_HAVE_BUILTIN_BITSCAN32 vs. PR_HAVE_BUILTIN_BITSCAN32 typo. When you check in the patch, could you change PR_HAS to PR_HAVE for consistency with the existing macros? Thanks.
Assignee | ||
Comment 35•14 years ago
|
||
(In reply to comment #33) > > macro PR_CEILING_LOG2 = not used > > I'm guessing that should say > > macro PR_FLOOR_LOG2 = not used > > Am I right? Yes, that's right. Cut 'n' paste + poor proofreading on my part. Sorry. It seems that both the floor function and macro just exist to balance to the ceiling equivalents.
Comment 36•14 years ago
|
||
Bug 356852: Bitscanning operation can be made much faster with x86 instructions Patch contributed by Steve Snyder <swsnyder@snydernet.net>, r=nelson Checking in include/prbit.h; new revision: 3.13; previous revision: 3.12 Checking in src/misc/prdtoa.c; new revision: 4.15; previous revision: 4.14 Checking in src/misc/prlog2.c; new revision: 3.6; previous revision: 3.5 Thanks for your contribution, Steve.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated•14 years ago
|
Target Milestone: 4.7 → 4.8.5
You need to log in
before you can comment on or make changes to this bug.
Description
•