Closed Bug 351532 Opened 19 years ago Closed 19 years ago

Log2 function can be made faster with x86 intrinsics

Categories

(Core :: JavaScript Engine, defect)

x86
All
defect
Not set
normal

Tracking

()

RESOLVED WONTFIX

People

(Reporter: swsnyder, Unassigned)

References

Details

Attachments

(2 files)

User-Agent: Opera/9.01 (X11; Linux i686; U; en) Build Identifier: Gecko v1.8.0 There are 9 occurrances of Log2 functions (calculating a power of 2 in relation to a specified number) present in the Mozilla code. The functions are currently implemented as a series of compare/jump/add operations. These functions can be made much faster by using an x86 instruction (BSR) that exists for this purpose. Reproducible: Always Steps to Reproduce: 1.Run Mozilla v1.8 code. 2. 3. Actual Results: Code runs slower on x86 platforms than need be. Expected Results: Code should run as faster as possible on a given platform.
Uses the x86 BSR instruction (implemented as MSVC and GCC intrinsics) to speed up the Log2 calculations.
Edited title to make x86-specific nature clearer.
Summary: Log2 function can be made faster with intrinsics → Log2 function can be made faster with x86 intrinsics
What kind of improvement can one expect to see? This kind: Running __log2()... Comparing new/old results... matched! Timing old code... 81741399 kTicks Timing new code... 10808808 kTicks Running JS_CEILING_LOG2()... Comparing new/old results... matched! Timing old code... 18585866 kTicks Timing new code... 10787381 kTicks Running JS_FLOOR_LOG2()... Comparing new/old results... matched! Timing old code... 17114537 kTicks Timing new code... 6714373 kTicks Running JS_CeilingLog2()... Comparing new/old results... matched! Timing old code... 18604952 kTicks Timing new code... 10849871 kTicks Running JS_FloorLog2()... Comparing new/old results... matched! Timing old code... 17011702 kTicks Timing new code... 6678710 kTicks Note #0: The Log2 functions are called ~1300 times when bringing SeaMonkey v1.0.4 up to a blank page on Win2K. Bringing the same version of SM up to http://www.yahoo.com causes ~1700 calls. I didn't count the number of times that the macros were invoked, just the functions. Note #1: The timing of the PR_* functions/macros aren't shown above because the code is the same as the JS_* equivalents. Note #2: The improvement in __log2() is the most dramatic because it was the most poorly implemented. Where the other code is implemented as a sequence of compare/jump/add steps, __log2() used a for() loop for the calculations. This caused really bad exec times on P3/P4 systems. It seems to only be run a single time, though, at startup, so the pain isn't too great. Note #3: I made use of the MSVC _BitScanReverse intrinsic conditional on MSVC v7.1 (VS2003) because I am not certain that earlier verision of the compiler supported it. They probably did, but I don't know that for a fact. Note #4: The Floor functions benefit more (60% reduction in exec time) from BSR than the Ceiling functions (41% reduction) do. This is because the Floor functionality is reduced to a single BSR instruction, where the Ceiling functions need a little additional checking after BSR execution. Note #5: The exec times above were done with code compiled by MSVC v7.1. Roughly the same timings (and code generated) are seen with GCC v4.1 on Linux. The timings for both MSVC/Win and GCC/Linux were done on Pentium4 and Pentium3 machines.
well.... http://lxr.mozilla.org/seamonkey/source/js/src/jsbit.h#127 Trunk and 1.8 use __builtin_ctlz for JS_FLOOR_LOG2 already... (although not in NSPR it seems) dbm isn't Mozilla code, should probably not patch it here.
Assignee: general → general
Component: General → JavaScript Engine
Product: Mozilla Application Suite → Core
QA Contact: general → general
Version: unspecified → Trunk
(In reply to comment #4) > well.... > http://lxr.mozilla.org/seamonkey/source/js/src/jsbit.h#127 > > Trunk and 1.8 use __builtin_ctlz for JS_FLOOR_LOG2 already... (although not in > NSPR it seems) But that only for GNUC, it would be nice to extend that to include MSVC support.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Regarding the use of the GCC intrinsic: Microsoft and Gnu use different semantics for their respective implementations of the x86 BSR instruction. GCC implements it just like the actual BSR instruction operates: it takes a single imput and returns a single output. Microsoft's intrinsic takes 2 inputs, one of which is a pointer to storage for the result. What it returns is the status of the operation, an operation that can only fail if given 0 as an input (BSR(0) == undefined). If both implementations are used in a macro, as I have done, they have to be reconciled such that the macro takes the same input and returns the same output regardless of platform. Either the Gnu intrinsic has to be made to act like the Microsoft intrinsic or vice-versa. Since I implemented the use of the MS intrinsic first I used that as the "standard" behavior despite the fact that it is actually less standard than the Gnu implementation.
(In reply to comment #7) > If both implementations are used in a macro, as I have done, they have to be > reconciled such that the macro takes the same input and returns the same output > regardless of platform. It is not necessary to change the signatures of JS_CEILING_LOG2 and JS_FLOOR_LOG2 from [1]. They are already statements placing the result into the first argument so adding MS-specific ifdefs should be straightforward. On the other hand JS_CEILING_LOG2W and JS_FLOOR_LOG2W has to be changed to follow the same pattern as int32-counterparts do. Or does MSVC support declaring variables inside expressions? [1] http://lxr.mozilla.org/seamonkey/source/js/src/jsbit.h
*** Bug 349364 has been marked as a duplicate of this bug. ***
Attached patch Revision 2Splinter Review
Major changes from previous patch: 1. Where rev1 used small ASM routines in GCC to emulate Microsoft intrinsics, rev2 uses the actual GCC built-in functions, with MS intrinsics encapulated in small wrappers to present a GCC-like interface. Done primarily because GCC is closer to the actual x86 BSF/BSR instructions and secondarily for compatibility with existing GCC __builtin_xxx use. 2. Added use of bitscan macros in the lo0bits() and hi0bits() functions, both in JS and PR. Relative exec times to follow.
Updated relative execution times, using the rev2 patch. Comparison is original 1.8.0 code vs use of bitscan macros. Running test__log2()... Comparing old/new results... matched! Timing old code... 43949453 kTicks Timing new code... 6457092 kTicks Running testJS_CEILING_LOG2()... Comparing old/new results... matched! Timing old code... 10815129 kTicks Timing new code... 3564112 kTicks Running testJS_FLOOR_LOG2()... Comparing old/new results... matched! Timing old code... 10409788 kTicks Timing new code... 3131176 kTicks Running testJS_CeilingLog2()... Comparing old/new results... matched! Timing old code... 15793098 kTicks Timing new code... 6487000 kTicks Running testJS_FloorLog2()... Comparing old/new results... matched! Timing old code... 14017382 kTicks Timing new code... 5377323 kTicks Running testlo0bits()... Comparing old/new results... matched! Timing old code... 12376650 kTicks Timing new code... 7780646 kTicks Running testhi0bits()... Comparing old/new results... matched! Timing old code... 9478704 kTicks Timing new code... 5643684 kTicks Test code above was built with GCC 4.0.3. Similar relative exec times are seen with MSVC v7.1 and Intel's ICC v9.1.
I have separated the JS and NSPR modifications into 2 patches, each suitable (I hope) for committal into their respective code bases. For JS, see bug #356856 For NSPR, see bug #356852 The creation of these patches makes this bug obsolete, so I'm marking it as WONTFIX.
Status: NEW → RESOLVED
Closed: 19 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: