Closed Bug 334826 Opened 14 years ago Closed 13 years ago

Replace some atomic functions with Win32 intrinsics

Categories

(NSPR :: NSPR, enhancement)

x86
Windows XP
enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: swsnyder, Assigned: wtc)

References

Details

(Keywords: perf)

Attachments

(3 files, 3 obsolete files)

User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.2) Gecko/20060413 Fedora/1.5.0.2-1fc4.sws Firefox/1.5.0.2
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.2) Gecko/20060413 Fedora/1.5.0.2-1fc4.sws Firefox/1.5.0.2

The PR_AtomicIncrement() and PR_AtomicDecrement() routines are called very often.  Bringing up Seamonkey v1.0.1 to a blank page involves a combined ~232,000 calls to these routines.  The cost of these calls on Windows platforms can be greatly reduced by the use of inline intrinsic functions.  


Reproducible: Always
PR_AtomicIncrement() and PR_AtomicDecrement() act as wrappers, each encapulating a single call to a platform-specific inline ASM routine.  But the PR_Atomic* calls themselves are not inlined.  On an X86 platform the overhead (push param onto stack, call function, move param from stack to register, return from function, clean up stack) is greater than the few instructions of actual functionality (the platform-specific ASM code).  Moveover the compiler cannot make optimum register allocations because the stack must be used for passing the parameter to the called routine.  

I am submitting a patch which conditionally replaces PR_AtomicIncrement() and PR_AtomicDecrement() with their MSC intrinsic equivelents, _InterlockedIncrement() and _InterlockedDecrement().  

Here is an example of the original code generation, calling PR_AtomicDecrement().  The code was generated by the MSVC 7.1 compiler from unmodified Seamonkey v1.0.1 source: 

nsDirectoryService::Release:
007FE823  push        esi
007FE824  mov         esi,dword ptr [ebp+8]
007FE827  push        edi
007FE828  lea         edi,[esi+0Ch]
007FE82B  push        edi
007FE82C  call        dword ptr [__imp__PR_AtomicDecrement (8352B4h)]
  PR_AtomicDecrement:
  30009D53  mov         ecx,dword ptr [val]
  30009D56  mov         eax,0FFFFFFFFh
  30009D5B  lock xadd   dword ptr [ecx],eax
  30009D5F  dec         eax
  30009D60  ret
007FE832  add         esp,4
007FE835  test        eax,eax
007FE837  jne         nsDirectoryService::Release+4Eh (7FE86Eh)

Here's an example, after patching, of the new code generated:

nsStorageInputStream::Release:
008058D3  push        esi
008058D4  mov         esi,dword ptr [this]
008058D7  lea         ecx,[esi+8]
008058DA  or          eax,0FFFFFFFFh
008058DD  lock xadd   dword ptr [ecx],eax
008058E1  dec         eax
008058E2  jne         nsStorageInputStream::Release+46h (805916h)

The relevant code starts at 0x007FE828 in the first example and 0x008058D7 in the second.  The number of instructions is reduced from 10 to 4, and the number of memory references is reduced from 5 to 1.  

Note that the return values of the intrinsic functions different from the PR_Atomic* functions when run on Win95 and WinNT v3.51.  I'm told that the minimum version of Windows supported for trunk builds is Win2K, so that shouldn't prevent the acceptance of this patch.  

This patch generally improves performance and reduces code size on x86 Win32 platform by replacing the PR_AtomicIncrement() and PR_AtomicDecrement() with their intrinsic equivalents.
Edited summary for Win32-specific nature of the patch.
Summary: Replace some atomic funtions with intrinsics → Replace some atomic functions with Win32 intrinsics
Wouldn't it be simpler to ask the compiler to inline PR_AtomicIncrement() and PR_AtomicDecrement()?
Component: General → XPCOM
Keywords: perf
Product: Firefox → Core
QA Contact: general → xpcom
Version: unspecified → Trunk
Darin, bsmedberg, what do you think of Steve's patch?
Attachment #219147 - Flags: review?(wtchang)
-> nspr
Assignee: nobody → wtchang
Status: UNCONFIRMED → NEW
Component: XPCOM → NSPR
Ever confirmed: true
Product: Core → NSPR
QA Contact: xpcom → wtchang
Target Milestone: --- → 4.7
Version: Trunk → 4.7
Wouldn't it be simpler to ask the compiler to inline PR_AtomicIncrement() and
>PR_AtomicDecrement()?

If that were done we would benefit from the lack of stack handling.  But we would still be using the existing ASM code, along with the hard-coding of registers that that implies, with the compiler having to work around those decisions.  

With the intrinsics the compiler can select the registers and instructions available for code gen on the fly, based on the surrounding code.

Simply inlining the existing code would be a half-way point between the existing scheme and my use of intrinsics.  On the other hard, it would be more portable, as my patch addresses only the x86 Win32 platform.
I've discovered a problem resulting from my original patch.  It seems that nspr4.dll is giving out the wrong value to an extension that is querying the version number.

With the patched code, the browser itself works great and the About page shows the correct version numbers (Gecko & browser).  However, the Roboform extension for Seamonkey complains at runtime that only versions 1.0.0 through 1.0.1 are supported and I am running version "rv:1.8.0.2".  Hmmm...  Replacing nspr4.dll with one built from unpatched code satisfies the Roboform version query.

Perhaps the lack of references to the PR_AtomicIncrement() and PR_AtomicDecrement() routines means that something is not being pulled into the link of nspr4.dll?  

More later.
your removal of the functions from the .c file means that everything that was linked with the old nspr will fail with the new one...
Updated patch retains the PR_AtomicIncrement() and PR_AtomicDecrement() functions for compatibility with previous builds.  These are not called by any Mozilla code; they just exist unused in the NSPR library to satisfy the expectations of 3rd-party code.
fwiw, NSPR is used by more than just Mozilla code.
Comment on attachment 219147 [details] [diff] [review]
Replace some atomic ops with Win32 intrinsics

breaking nspr compatibility is obviously not ok :)
Attachment #219147 - Flags: review?(wtchang) → review-
Comment #12: Jeez, you break a fundamental component and you never hear the end of it!  :-)  

I've been using the revised patch on 1.8.0 builds for the previous 5 months (run on Win2K and WinXP only, of course) and have experienced no problems since that initial "setback."  My SeaMonkey builds have all been completely stable and I've had no further complaints from 3rd-party extensions.

It's not an improvement that users would notice, but it does provide a noteable general reduction in code size and clock cycles on Win32 builds.  I'll take my performance gains wherever I can get 'em.
for reference, i meant to post a counter patch. but this got lost. i'll try to do that now.
Attached patch does this work (obsolete) — Splinter Review
instead of trying to hack the code, this offers a macro which either does the optimized thing or not.

all code built against the newer nspr could use the new macro.

a treewide s/r would upgrade a tree.

please test.
+#if defined(_WIN32) && defined(_M_IX86) && (_MSC_VER >= 1310) && defined(_MT)
+#define NSPR_USE_INTRINSIC_LOCKS
+#else
+#undef NSPR_USE_INTRINSIC_LOCKS
+
+#ifdef NSPR_USE_INTRINSIC_LOCKS

that's missing an #endif, right?
QA Contact: wtchang → nspr
Comment on attachment 219500 [details] [diff] [review]
Retains PR_Atomic* functions for compatibility

Wan-Teh, what do you think of this?
I'm inclined to give this r+, except that I think the symbol _USE_INTRIN_ATOMICS_ should be NSPR_USE_INTRIN_ATOMICS to avoid name space pollution. 

I checked that nsprstub.c does (indirectly) include pratom.h.
Attachment #219500 - Flags: review?(wtc)
Comment on attachment 239519 [details] [diff] [review]
does this work

I was going to suggest this, and was pleasantly surprised that
it's exactly what timeless suggested.

One reason PR_AtomicXXX are functions is that we can fix bugs
or enhance the implementation.  Macros will be expanded into the
caller's code, and we can't change them without recompiling
the caller's code.

We should use the compiler intrinsics.  To prevent breaking
binary compatibility by accident, it's safer to add new macros
or functions and not touch the existing PR_AtomicXXX functions.
Would love to get this into NSPR in time for Gecko 1.9/FF3...
Flags: blocking1.9+
(In reply to comment #8)
> I've discovered a problem resulting from my original patch.  It seems that
> nspr4.dll is giving out the wrong value to an extension that is querying the
> version number.
> 
> With the patched code, the browser itself works great and the About page shows
> the correct version numbers (Gecko & browser).  However, the Roboform extension
> for Seamonkey complains at runtime that only versions 1.0.0 through 1.0.1 are
> supported and I am running version "rv:1.8.0.2".  Hmmm...  Replacing nspr4.dll
> with one built from unpatched code satisfies the Roboform version query.
> 
> Perhaps the lack of references to the PR_AtomicIncrement() and
> PR_AtomicDecrement() routines means that something is not being pulled into the
> link of nspr4.dll?  
> 
> More later.
 
I did a bunch of these intrinsics things quite some time ago and a user ran into the same problem with Roboform. I don't recall if I took the change out or if I just left it there and the user stopped using my builds.

I should go over my intrinsics patches at some point in case I have a few that haven't already been found and merged in.
(In reply to comment #20)
[snip]

The problem was that my original patch completely removed the original functions in favor of inline intrinsics.  The problem was resolved in my 2nd patch by leaving the original functions (for binary compatibility) and using the inline intrinsics for new builds.

Though I saw the problem with Roboform, it was correctly observed that my original patch broke the published API and was therefore unacceptable.
Attached patch matched if/endif (obsolete) — Splinter Review
Attachment #239519 - Attachment is obsolete: true
Attachment #298515 - Flags: review?(wtc)
Comment on attachment 298515 [details] [diff] [review]
matched if/endif

timeless: thanks for the patch.

Let's name the macro _PR_HAVE_INTRINSIC_ATOMIC_OPS.  (There is
an existing internal macro called _PR_HAVE_ATOMIC_OPS.)

Please add an underscore (_) between ATOMIC and INCREMENT/DECREMENT.

Are there compiler intrinsics for _InterlockedAdd and _InterlockedExchange?
It is a little strange to only have intrinsics macros for two of the four
PR_AtomicXXX functions.
http://msdn2.microsoft.com/en-us/library/w5405h95.aspx
Alphabetical Listing of Intrinsic Functions

http://msdn2.microsoft.com/en-us/library/51s265a6.aspx
_InterlockedAdd Intrinsic Functions

http://msdn2.microsoft.com/en-us/library/1s26w950.aspx
_InterlockedExchange Intrinsic Functions
(In reply to comment #23)
> Are there compiler intrinsics for _InterlockedAdd and _InterlockedExchange?
> It is a little strange to only have intrinsics macros for two of the four
> PR_AtomicXXX functions.

There's an argument to be made for using these 2 intrinsics for completeness/consistency with Inc & Dec, but there is really no real-world performance gain.

While the intrinsics are more flexible than the existing ASM code in terms of code generation, those 2 functions are called so rarely (at least by Firefox/SeaMonkey) that there is really no practical reason to prefer them over the existing code.
Re comment #23 and naming conventions: is there any advantage to noting the compiler that provides the intrinsic functions?  These functions are compiler-specific.

I'm thinking here of potential confusion of the analogous GCC intrinsics

+# define _MD_ATOMIC_INCREMENT(val)   __sync_add_and_fetch(val, 1)
+# define _MD_ATOMIC_DECREMENT(val)   __sync_sub_and_fetch(val, 1)
+# define _MD_ATOMIC_ADD(ptr,val)     __sync_add_and_fetch(ptr, val)
+# define _MD_ATOMIC_SET(ptr,newval)  __sync_lock_test_and_set(ptr, newval)

(or similar) as replacements for the functions in os_Linux_x86.s.
I agree we should skip the intermediate HAVE_XXX macro and just do this:

+#if defined(_WIN32) && (_MSC_VER >= 1310)
+long __cdecl _InterlockedIncrement(PRInt32 volatile *Addend);
+#pragma intrinsic(_InterlockedIncrement)
+
+long __cdecl _InterlockedDecrement(PRInt32 volatile *Addend);
+#pragma intrinsic(_InterlockedDecrement)
+
+#define PR_ATOMIC_INCREMENT(val) _InterlockedIncrement(val)
+#define PR_ATOMIC_DECREMENT(val) _InterlockedDecrement(val)
+#else
+#define PR_ATOMIC_INCREMENT(val) PR_AtomicIncrement(val)
+#define PR_ATOMIC_DECREMENT(val) PR_AtomicDecrement(val)
+#endif

These intrinsics are available on all processor architectures and
in single-threaded programs, so we don't need to test the _M_IX86
and _MT macros.  The return values of the intrinsics don't match
NSPR's specification on Windows 95 and NT 3.51, so we will be
dropping support for these two old versions of Windows.
(In reply to comment #25)
> While the intrinsics are more flexible than the existing ASM code in terms of
> code generation, those 2 functions are called so rarely (at least by
> Firefox/SeaMonkey) that there is really no practical reason to prefer them over
> the existing code.

The advantages to using Intrinsics over MSVC Inline Assembler are:

- Intrinsics are generally portable and typically easier to understand
- MSVC for x64 doesn't support MSVC inline assembler so using intrinsics simplifies the code
- I don't know if the intrinsics are the same between GCC and MSVC but having only one code set would be nice for multiple platforms.

Add PR_ATOMIC_INCREMENT, PR_ATOMIC_DECREMENT, PR_ATOMIC_SET,
and PR_ATOMIC_ADD.  Implement these macros as compiler intrinsics
for the latest versions of MSVC and GCC.  Implement them as
the PR_AtomicXXX functions by default.

I've checked in this patch on the NSPR trunk (NSPR 4.7) for
testing.

Checking in pr/include/pratom.h;
/cvsroot/mozilla/nsprpub/pr/include/pratom.h,v  <--  pratom.h
new revision: 3.9; previous revision: 3.8
done
Checking in pr/tests/atomic.c;
/cvsroot/mozilla/nsprpub/pr/tests/atomic.c,v  <--  atomic.c
new revision: 3.9; previous revision: 3.8
done
Attachment #298515 - Attachment is obsolete: true
Attachment #299498 - Flags: superreview?(nelson)
Attachment #299498 - Flags: review?(swsnyder)
Attachment #298515 - Flags: review?(wtc)
Comment on attachment 299498 [details] [diff] [review]
Proposed patch (checked in)

The patch to atomic.c contains a pattern of 6 lines of code
that is repeated many times.  I would have preferred that 
those lines appear in a macro with several arguments, and 
that each occurrence be a single macro invocation.  Also, 
I would have written the statements of this form:
    result = result | ((EXPRESSION) ? 0 : 1);
as the equivalent and simpler statement:
    result |= !(EXPRESSION);
But this patch will do for test code.  r=nelson
Attachment #299498 - Flags: superreview?(nelson) → superreview+
Duplicate of this bug: 344307
Nelson, I agree with both of your suggestions.  I don't have time
to implement them now, but I added a TODO comment.

I also think of an issue -- the PR_AtomicXXX functions may be
implemented with a lock even when the platform has atomic
instructions because we don't have the time or expertise to
write the assembly code.  So if we define PR_ATOMIC_XXX with
compiler intrinsics, the macros and the functions can't be
used on the same variable because they use incompatible methods
to achieve atomicity.  But people will expect that the macros
and the functions can be used interchangeably.  So I documented
this issue in pratom.h and restrict the use of GCC atomic
intrinsics to the platforms where PR_AtomicXXX is truly atomic.

You don't need to review this patch carefully.  This is just FYI.
I've checked in this patch on the NSPR trunk (NSPR 4.7).

Checking in pr/include/pratom.h;
/cvsroot/mozilla/nsprpub/pr/include/pratom.h,v  <--  pratom.h
new revision: 3.10; previous revision: 3.9
done
Checking in pr/tests/atomic.c;
/cvsroot/mozilla/nsprpub/pr/tests/atomic.c,v  <--  atomic.c
new revision: 3.10; previous revision: 3.9
done
Attachment #299608 - Flags: review?(nelson)
Comment on attachment 219500 [details] [diff] [review]
Retains PR_Atomic* functions for compatibility

Obsoleting this patch, which has been superseded.
Attachment #219500 - Attachment is obsolete: true
Attachment #219500 - Flags: review?(wtc)
Comment on attachment 299608 [details] [diff] [review]
Supplemental patch 1 (checked in)

I think this bug can now be resolved as fixed.
Attachment #299608 - Flags: review?(nelson) → review+
Severity: normal → enhancement
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Version: 4.7 → other
Comment on attachment 299498 [details] [diff] [review]
Proposed patch (checked in)

Removed myself from review in attempt to get Bugzilla to stop nagging me about this 4-year-old bug.
Attachment #299498 - Flags: review?(swsnyder) → review?
You need to log in before you can comment on or make changes to this bug.