2.70 KB, patch
|Details | Diff | Splinter Review|
6.18 KB, patch
|Details | Diff | Splinter Review|
2.40 KB, patch
|Details | Diff | Splinter Review|
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:18.104.22.168) Gecko/20060413 Fedora/22.214.171.124-1fc4.sws Firefox/126.96.36.199 Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:188.8.131.52) Gecko/20060413 Fedora/184.108.40.206-1fc4.sws Firefox/220.127.116.11 The PR_AtomicIncrement() and PR_AtomicDecrement() routines are called very often. Bringing up Seamonkey v1.0.1 to a blank page involves a combined ~232,000 calls to these routines. The cost of these calls on Windows platforms can be greatly reduced by the use of inline intrinsic functions. Reproducible: Always
PR_AtomicIncrement() and PR_AtomicDecrement() act as wrappers, each encapulating a single call to a platform-specific inline ASM routine. But the PR_Atomic* calls themselves are not inlined. On an X86 platform the overhead (push param onto stack, call function, move param from stack to register, return from function, clean up stack) is greater than the few instructions of actual functionality (the platform-specific ASM code). Moveover the compiler cannot make optimum register allocations because the stack must be used for passing the parameter to the called routine. I am submitting a patch which conditionally replaces PR_AtomicIncrement() and PR_AtomicDecrement() with their MSC intrinsic equivelents, _InterlockedIncrement() and _InterlockedDecrement(). Here is an example of the original code generation, calling PR_AtomicDecrement(). The code was generated by the MSVC 7.1 compiler from unmodified Seamonkey v1.0.1 source: nsDirectoryService::Release: 007FE823 push esi 007FE824 mov esi,dword ptr [ebp+8] 007FE827 push edi 007FE828 lea edi,[esi+0Ch] 007FE82B push edi 007FE82C call dword ptr [__imp__PR_AtomicDecrement (8352B4h)] PR_AtomicDecrement: 30009D53 mov ecx,dword ptr [val] 30009D56 mov eax,0FFFFFFFFh 30009D5B lock xadd dword ptr [ecx],eax 30009D5F dec eax 30009D60 ret 007FE832 add esp,4 007FE835 test eax,eax 007FE837 jne nsDirectoryService::Release+4Eh (7FE86Eh) Here's an example, after patching, of the new code generated: nsStorageInputStream::Release: 008058D3 push esi 008058D4 mov esi,dword ptr [this] 008058D7 lea ecx,[esi+8] 008058DA or eax,0FFFFFFFFh 008058DD lock xadd dword ptr [ecx],eax 008058E1 dec eax 008058E2 jne nsStorageInputStream::Release+46h (805916h) The relevant code starts at 0x007FE828 in the first example and 0x008058D7 in the second. The number of instructions is reduced from 10 to 4, and the number of memory references is reduced from 5 to 1. Note that the return values of the intrinsic functions different from the PR_Atomic* functions when run on Win95 and WinNT v3.51. I'm told that the minimum version of Windows supported for trunk builds is Win2K, so that shouldn't prevent the acceptance of this patch.
This patch generally improves performance and reduces code size on x86 Win32 platform by replacing the PR_AtomicIncrement() and PR_AtomicDecrement() with their intrinsic equivalents.
Edited summary for Win32-specific nature of the patch.
Summary: Replace some atomic funtions with intrinsics → Replace some atomic functions with Win32 intrinsics
Wouldn't it be simpler to ask the compiler to inline PR_AtomicIncrement() and PR_AtomicDecrement()?
Component: General → XPCOM
Product: Firefox → Core
QA Contact: general → xpcom
Version: unspecified → Trunk
Darin, bsmedberg, what do you think of Steve's patch?
Assignee: nobody → wtchang
Status: UNCONFIRMED → NEW
Component: XPCOM → NSPR
Ever confirmed: true
Product: Core → NSPR
QA Contact: xpcom → wtchang
Target Milestone: --- → 4.7
Version: Trunk → 4.7
Wouldn't it be simpler to ask the compiler to inline PR_AtomicIncrement() and >PR_AtomicDecrement()? If that were done we would benefit from the lack of stack handling. But we would still be using the existing ASM code, along with the hard-coding of registers that that implies, with the compiler having to work around those decisions. With the intrinsics the compiler can select the registers and instructions available for code gen on the fly, based on the surrounding code. Simply inlining the existing code would be a half-way point between the existing scheme and my use of intrinsics. On the other hard, it would be more portable, as my patch addresses only the x86 Win32 platform.
I've discovered a problem resulting from my original patch. It seems that nspr4.dll is giving out the wrong value to an extension that is querying the version number. With the patched code, the browser itself works great and the About page shows the correct version numbers (Gecko & browser). However, the Roboform extension for Seamonkey complains at runtime that only versions 1.0.0 through 1.0.1 are supported and I am running version "rv:18.104.22.168". Hmmm... Replacing nspr4.dll with one built from unpatched code satisfies the Roboform version query. Perhaps the lack of references to the PR_AtomicIncrement() and PR_AtomicDecrement() routines means that something is not being pulled into the link of nspr4.dll? More later.
your removal of the functions from the .c file means that everything that was linked with the old nspr will fail with the new one...
Updated patch retains the PR_AtomicIncrement() and PR_AtomicDecrement() functions for compatibility with previous builds. These are not called by any Mozilla code; they just exist unused in the NSPR library to satisfy the expectations of 3rd-party code.
fwiw, NSPR is used by more than just Mozilla code.
Comment on attachment 219147 [details] [diff] [review] Replace some atomic ops with Win32 intrinsics breaking nspr compatibility is obviously not ok :)
Attachment #219147 - Flags: review?(wtchang) → review-
Comment #12: Jeez, you break a fundamental component and you never hear the end of it! :-) I've been using the revised patch on 1.8.0 builds for the previous 5 months (run on Win2K and WinXP only, of course) and have experienced no problems since that initial "setback." My SeaMonkey builds have all been completely stable and I've had no further complaints from 3rd-party extensions. It's not an improvement that users would notice, but it does provide a noteable general reduction in code size and clock cycles on Win32 builds. I'll take my performance gains wherever I can get 'em.
for reference, i meant to post a counter patch. but this got lost. i'll try to do that now.
instead of trying to hack the code, this offers a macro which either does the optimized thing or not. all code built against the newer nspr could use the new macro. a treewide s/r would upgrade a tree. please test.
+#if defined(_WIN32) && defined(_M_IX86) && (_MSC_VER >= 1310) && defined(_MT) +#define NSPR_USE_INTRINSIC_LOCKS +#else +#undef NSPR_USE_INTRINSIC_LOCKS + +#ifdef NSPR_USE_INTRINSIC_LOCKS that's missing an #endif, right?
Comment on attachment 219500 [details] [diff] [review] Retains PR_Atomic* functions for compatibility Wan-Teh, what do you think of this? I'm inclined to give this r+, except that I think the symbol _USE_INTRIN_ATOMICS_ should be NSPR_USE_INTRIN_ATOMICS to avoid name space pollution. I checked that nsprstub.c does (indirectly) include pratom.h.
Comment on attachment 239519 [details] [diff] [review] does this work I was going to suggest this, and was pleasantly surprised that it's exactly what timeless suggested. One reason PR_AtomicXXX are functions is that we can fix bugs or enhance the implementation. Macros will be expanded into the caller's code, and we can't change them without recompiling the caller's code. We should use the compiler intrinsics. To prevent breaking binary compatibility by accident, it's safer to add new macros or functions and not touch the existing PR_AtomicXXX functions.
Would love to get this into NSPR in time for Gecko 1.9/FF3...
(In reply to comment #8) > I've discovered a problem resulting from my original patch. It seems that > nspr4.dll is giving out the wrong value to an extension that is querying the > version number. > > With the patched code, the browser itself works great and the About page shows > the correct version numbers (Gecko & browser). However, the Roboform extension > for Seamonkey complains at runtime that only versions 1.0.0 through 1.0.1 are > supported and I am running version "rv:22.214.171.124". Hmmm... Replacing nspr4.dll > with one built from unpatched code satisfies the Roboform version query. > > Perhaps the lack of references to the PR_AtomicIncrement() and > PR_AtomicDecrement() routines means that something is not being pulled into the > link of nspr4.dll? > > More later. I did a bunch of these intrinsics things quite some time ago and a user ran into the same problem with Roboform. I don't recall if I took the change out or if I just left it there and the user stopped using my builds. I should go over my intrinsics patches at some point in case I have a few that haven't already been found and merged in.
(In reply to comment #20) [snip] The problem was that my original patch completely removed the original functions in favor of inline intrinsics. The problem was resolved in my 2nd patch by leaving the original functions (for binary compatibility) and using the inline intrinsics for new builds. Though I saw the problem with Roboform, it was correctly observed that my original patch broke the published API and was therefore unacceptable.
Comment on attachment 298515 [details] [diff] [review] matched if/endif timeless: thanks for the patch. Let's name the macro _PR_HAVE_INTRINSIC_ATOMIC_OPS. (There is an existing internal macro called _PR_HAVE_ATOMIC_OPS.) Please add an underscore (_) between ATOMIC and INCREMENT/DECREMENT. Are there compiler intrinsics for _InterlockedAdd and _InterlockedExchange? It is a little strange to only have intrinsics macros for two of the four PR_AtomicXXX functions.
http://msdn2.microsoft.com/en-us/library/w5405h95.aspx Alphabetical Listing of Intrinsic Functions http://msdn2.microsoft.com/en-us/library/51s265a6.aspx _InterlockedAdd Intrinsic Functions http://msdn2.microsoft.com/en-us/library/1s26w950.aspx _InterlockedExchange Intrinsic Functions
(In reply to comment #23) > Are there compiler intrinsics for _InterlockedAdd and _InterlockedExchange? > It is a little strange to only have intrinsics macros for two of the four > PR_AtomicXXX functions. There's an argument to be made for using these 2 intrinsics for completeness/consistency with Inc & Dec, but there is really no real-world performance gain. While the intrinsics are more flexible than the existing ASM code in terms of code generation, those 2 functions are called so rarely (at least by Firefox/SeaMonkey) that there is really no practical reason to prefer them over the existing code.
Re comment #23 and naming conventions: is there any advantage to noting the compiler that provides the intrinsic functions? These functions are compiler-specific. I'm thinking here of potential confusion of the analogous GCC intrinsics +# define _MD_ATOMIC_INCREMENT(val) __sync_add_and_fetch(val, 1) +# define _MD_ATOMIC_DECREMENT(val) __sync_sub_and_fetch(val, 1) +# define _MD_ATOMIC_ADD(ptr,val) __sync_add_and_fetch(ptr, val) +# define _MD_ATOMIC_SET(ptr,newval) __sync_lock_test_and_set(ptr, newval) (or similar) as replacements for the functions in os_Linux_x86.s.
I agree we should skip the intermediate HAVE_XXX macro and just do this: +#if defined(_WIN32) && (_MSC_VER >= 1310) +long __cdecl _InterlockedIncrement(PRInt32 volatile *Addend); +#pragma intrinsic(_InterlockedIncrement) + +long __cdecl _InterlockedDecrement(PRInt32 volatile *Addend); +#pragma intrinsic(_InterlockedDecrement) + +#define PR_ATOMIC_INCREMENT(val) _InterlockedIncrement(val) +#define PR_ATOMIC_DECREMENT(val) _InterlockedDecrement(val) +#else +#define PR_ATOMIC_INCREMENT(val) PR_AtomicIncrement(val) +#define PR_ATOMIC_DECREMENT(val) PR_AtomicDecrement(val) +#endif These intrinsics are available on all processor architectures and in single-threaded programs, so we don't need to test the _M_IX86 and _MT macros. The return values of the intrinsics don't match NSPR's specification on Windows 95 and NT 3.51, so we will be dropping support for these two old versions of Windows.
(In reply to comment #25) > While the intrinsics are more flexible than the existing ASM code in terms of > code generation, those 2 functions are called so rarely (at least by > Firefox/SeaMonkey) that there is really no practical reason to prefer them over > the existing code. The advantages to using Intrinsics over MSVC Inline Assembler are: - Intrinsics are generally portable and typically easier to understand - MSVC for x64 doesn't support MSVC inline assembler so using intrinsics simplifies the code - I don't know if the intrinsics are the same between GCC and MSVC but having only one code set would be nice for multiple platforms.
Add PR_ATOMIC_INCREMENT, PR_ATOMIC_DECREMENT, PR_ATOMIC_SET, and PR_ATOMIC_ADD. Implement these macros as compiler intrinsics for the latest versions of MSVC and GCC. Implement them as the PR_AtomicXXX functions by default. I've checked in this patch on the NSPR trunk (NSPR 4.7) for testing. Checking in pr/include/pratom.h; /cvsroot/mozilla/nsprpub/pr/include/pratom.h,v <-- pratom.h new revision: 3.9; previous revision: 3.8 done Checking in pr/tests/atomic.c; /cvsroot/mozilla/nsprpub/pr/tests/atomic.c,v <-- atomic.c new revision: 3.9; previous revision: 3.8 done
Comment on attachment 299498 [details] [diff] [review] Proposed patch (checked in) The patch to atomic.c contains a pattern of 6 lines of code that is repeated many times. I would have preferred that those lines appear in a macro with several arguments, and that each occurrence be a single macro invocation. Also, I would have written the statements of this form: result = result | ((EXPRESSION) ? 0 : 1); as the equivalent and simpler statement: result |= !(EXPRESSION); But this patch will do for test code. r=nelson
Attachment #299498 - Flags: superreview?(nelson) → superreview+
Nelson, I agree with both of your suggestions. I don't have time to implement them now, but I added a TODO comment. I also think of an issue -- the PR_AtomicXXX functions may be implemented with a lock even when the platform has atomic instructions because we don't have the time or expertise to write the assembly code. So if we define PR_ATOMIC_XXX with compiler intrinsics, the macros and the functions can't be used on the same variable because they use incompatible methods to achieve atomicity. But people will expect that the macros and the functions can be used interchangeably. So I documented this issue in pratom.h and restrict the use of GCC atomic intrinsics to the platforms where PR_AtomicXXX is truly atomic. You don't need to review this patch carefully. This is just FYI. I've checked in this patch on the NSPR trunk (NSPR 4.7). Checking in pr/include/pratom.h; /cvsroot/mozilla/nsprpub/pr/include/pratom.h,v <-- pratom.h new revision: 3.10; previous revision: 3.9 done Checking in pr/tests/atomic.c; /cvsroot/mozilla/nsprpub/pr/tests/atomic.c,v <-- atomic.c new revision: 3.10; previous revision: 3.9 done
Attachment #299608 - Flags: review?(nelson)
Comment on attachment 219500 [details] [diff] [review] Retains PR_Atomic* functions for compatibility Obsoleting this patch, which has been superseded.
Comment on attachment 299608 [details] [diff] [review] Supplemental patch 1 (checked in) I think this bug can now be resolved as fixed.
Attachment #299608 - Flags: review?(nelson) → review+
Severity: normal → enhancement
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Version: 4.7 → other
Comment on attachment 299498 [details] [diff] [review] Proposed patch (checked in) Removed myself from review in attempt to get Bugzilla to stop nagging me about this 4-year-old bug.
Attachment #299498 - Flags: review?(swsnyder) → review?
Attachment #299498 - Flags: review?
You need to log in before you can comment on or make changes to this bug.