Replace some atomic functions with Win32 intrinsics

RESOLVED FIXED in 4.7

Status

enhancement
RESOLVED FIXED
13 years ago
7 years ago

People

(Reporter: swsnyder, Assigned: wtc)

Tracking

({perf})

Bug Flags:
blocking1.9 +

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(3 attachments, 3 obsolete attachments)

Reporter

Description

13 years ago
User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.2) Gecko/20060413 Fedora/1.5.0.2-1fc4.sws Firefox/1.5.0.2
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.2) Gecko/20060413 Fedora/1.5.0.2-1fc4.sws Firefox/1.5.0.2

The PR_AtomicIncrement() and PR_AtomicDecrement() routines are called very often.  Bringing up Seamonkey v1.0.1 to a blank page involves a combined ~232,000 calls to these routines.  The cost of these calls on Windows platforms can be greatly reduced by the use of inline intrinsic functions.  


Reproducible: Always
Reporter

Comment 1

13 years ago
PR_AtomicIncrement() and PR_AtomicDecrement() act as wrappers, each encapulating a single call to a platform-specific inline ASM routine.  But the PR_Atomic* calls themselves are not inlined.  On an X86 platform the overhead (push param onto stack, call function, move param from stack to register, return from function, clean up stack) is greater than the few instructions of actual functionality (the platform-specific ASM code).  Moveover the compiler cannot make optimum register allocations because the stack must be used for passing the parameter to the called routine.  

I am submitting a patch which conditionally replaces PR_AtomicIncrement() and PR_AtomicDecrement() with their MSC intrinsic equivelents, _InterlockedIncrement() and _InterlockedDecrement().  

Here is an example of the original code generation, calling PR_AtomicDecrement().  The code was generated by the MSVC 7.1 compiler from unmodified Seamonkey v1.0.1 source: 

nsDirectoryService::Release:
007FE823  push        esi
007FE824  mov         esi,dword ptr [ebp+8]
007FE827  push        edi
007FE828  lea         edi,[esi+0Ch]
007FE82B  push        edi
007FE82C  call        dword ptr [__imp__PR_AtomicDecrement (8352B4h)]
  PR_AtomicDecrement:
  30009D53  mov         ecx,dword ptr [val]
  30009D56  mov         eax,0FFFFFFFFh
  30009D5B  lock xadd   dword ptr [ecx],eax
  30009D5F  dec         eax
  30009D60  ret
007FE832  add         esp,4
007FE835  test        eax,eax
007FE837  jne         nsDirectoryService::Release+4Eh (7FE86Eh)

Here's an example, after patching, of the new code generated:

nsStorageInputStream::Release:
008058D3  push        esi
008058D4  mov         esi,dword ptr [this]
008058D7  lea         ecx,[esi+8]
008058DA  or          eax,0FFFFFFFFh
008058DD  lock xadd   dword ptr [ecx],eax
008058E1  dec         eax
008058E2  jne         nsStorageInputStream::Release+46h (805916h)

The relevant code starts at 0x007FE828 in the first example and 0x008058D7 in the second.  The number of instructions is reduced from 10 to 4, and the number of memory references is reduced from 5 to 1.  

Note that the return values of the intrinsic functions different from the PR_Atomic* functions when run on Win95 and WinNT v3.51.  I'm told that the minimum version of Windows supported for trunk builds is Win2K, so that shouldn't prevent the acceptance of this patch.  

Reporter

Comment 2

13 years ago
This patch generally improves performance and reduces code size on x86 Win32 platform by replacing the PR_AtomicIncrement() and PR_AtomicDecrement() with their intrinsic equivalents.
Reporter

Comment 3

13 years ago
Edited summary for Win32-specific nature of the patch.
Summary: Replace some atomic funtions with intrinsics → Replace some atomic functions with Win32 intrinsics

Comment 4

13 years ago
Wouldn't it be simpler to ask the compiler to inline PR_AtomicIncrement() and PR_AtomicDecrement()?

Updated

13 years ago
Component: General → XPCOM
Keywords: perf
Product: Firefox → Core
QA Contact: general → xpcom
Version: unspecified → Trunk

Comment 5

13 years ago
Darin, bsmedberg, what do you think of Steve's patch?

Updated

13 years ago
Attachment #219147 - Flags: review?(wtchang)

Comment 6

13 years ago
-> nspr
Assignee: nobody → wtchang
Status: UNCONFIRMED → NEW
Component: XPCOM → NSPR
Ever confirmed: true
Product: Core → NSPR
QA Contact: xpcom → wtchang
Target Milestone: --- → 4.7
Version: Trunk → 4.7
Reporter

Comment 7

13 years ago
Wouldn't it be simpler to ask the compiler to inline PR_AtomicIncrement() and
>PR_AtomicDecrement()?

If that were done we would benefit from the lack of stack handling.  But we would still be using the existing ASM code, along with the hard-coding of registers that that implies, with the compiler having to work around those decisions.  

With the intrinsics the compiler can select the registers and instructions available for code gen on the fly, based on the surrounding code.

Simply inlining the existing code would be a half-way point between the existing scheme and my use of intrinsics.  On the other hard, it would be more portable, as my patch addresses only the x86 Win32 platform.
Reporter

Comment 8

13 years ago
I've discovered a problem resulting from my original patch.  It seems that nspr4.dll is giving out the wrong value to an extension that is querying the version number.

With the patched code, the browser itself works great and the About page shows the correct version numbers (Gecko & browser).  However, the Roboform extension for Seamonkey complains at runtime that only versions 1.0.0 through 1.0.1 are supported and I am running version "rv:1.8.0.2".  Hmmm...  Replacing nspr4.dll with one built from unpatched code satisfies the Roboform version query.

Perhaps the lack of references to the PR_AtomicIncrement() and PR_AtomicDecrement() routines means that something is not being pulled into the link of nspr4.dll?  

More later.
your removal of the functions from the .c file means that everything that was linked with the old nspr will fail with the new one...
Reporter

Comment 10

13 years ago
Updated patch retains the PR_AtomicIncrement() and PR_AtomicDecrement() functions for compatibility with previous builds.  These are not called by any Mozilla code; they just exist unused in the NSPR library to satisfy the expectations of 3rd-party code.
fwiw, NSPR is used by more than just Mozilla code.

Comment 12

13 years ago
Comment on attachment 219147 [details] [diff] [review]
Replace some atomic ops with Win32 intrinsics

breaking nspr compatibility is obviously not ok :)
Attachment #219147 - Flags: review?(wtchang) → review-
Reporter

Comment 13

13 years ago
Comment #12: Jeez, you break a fundamental component and you never hear the end of it!  :-)  

I've been using the revised patch on 1.8.0 builds for the previous 5 months (run on Win2K and WinXP only, of course) and have experienced no problems since that initial "setback."  My SeaMonkey builds have all been completely stable and I've had no further complaints from 3rd-party extensions.

It's not an improvement that users would notice, but it does provide a noteable general reduction in code size and clock cycles on Win32 builds.  I'll take my performance gains wherever I can get 'em.

Comment 14

13 years ago
for reference, i meant to post a counter patch. but this got lost. i'll try to do that now.

Comment 15

13 years ago
Posted patch does this work (obsolete) — Splinter Review
instead of trying to hack the code, this offers a macro which either does the optimized thing or not.

all code built against the newer nspr could use the new macro.

a treewide s/r would upgrade a tree.

please test.
+#if defined(_WIN32) && defined(_M_IX86) && (_MSC_VER >= 1310) && defined(_MT)
+#define NSPR_USE_INTRINSIC_LOCKS
+#else
+#undef NSPR_USE_INTRINSIC_LOCKS
+
+#ifdef NSPR_USE_INTRINSIC_LOCKS

that's missing an #endif, right?
QA Contact: wtchang → nspr
Comment on attachment 219500 [details] [diff] [review]
Retains PR_Atomic* functions for compatibility

Wan-Teh, what do you think of this?
I'm inclined to give this r+, except that I think the symbol _USE_INTRIN_ATOMICS_ should be NSPR_USE_INTRIN_ATOMICS to avoid name space pollution. 

I checked that nsprstub.c does (indirectly) include pratom.h.
Attachment #219500 - Flags: review?(wtc)
Assignee

Comment 18

12 years ago
Comment on attachment 239519 [details] [diff] [review]
does this work

I was going to suggest this, and was pleasantly surprised that
it's exactly what timeless suggested.

One reason PR_AtomicXXX are functions is that we can fix bugs
or enhance the implementation.  Macros will be expanded into the
caller's code, and we can't change them without recompiling
the caller's code.

We should use the compiler intrinsics.  To prevent breaking
binary compatibility by accident, it's safer to add new macros
or functions and not touch the existing PR_AtomicXXX functions.

Comment 19

12 years ago
Would love to get this into NSPR in time for Gecko 1.9/FF3...
Flags: blocking1.9+

Comment 20

12 years ago
(In reply to comment #8)
> I've discovered a problem resulting from my original patch.  It seems that
> nspr4.dll is giving out the wrong value to an extension that is querying the
> version number.
> 
> With the patched code, the browser itself works great and the About page shows
> the correct version numbers (Gecko & browser).  However, the Roboform extension
> for Seamonkey complains at runtime that only versions 1.0.0 through 1.0.1 are
> supported and I am running version "rv:1.8.0.2".  Hmmm...  Replacing nspr4.dll
> with one built from unpatched code satisfies the Roboform version query.
> 
> Perhaps the lack of references to the PR_AtomicIncrement() and
> PR_AtomicDecrement() routines means that something is not being pulled into the
> link of nspr4.dll?  
> 
> More later.
 
I did a bunch of these intrinsics things quite some time ago and a user ran into the same problem with Roboform. I don't recall if I took the change out or if I just left it there and the user stopped using my builds.

I should go over my intrinsics patches at some point in case I have a few that haven't already been found and merged in.
Reporter

Comment 21

12 years ago
(In reply to comment #20)
[snip]

The problem was that my original patch completely removed the original functions in favor of inline intrinsics.  The problem was resolved in my 2nd patch by leaving the original functions (for binary compatibility) and using the inline intrinsics for new builds.

Though I saw the problem with Roboform, it was correctly observed that my original patch broke the published API and was therefore unacceptable.

Comment 22

12 years ago
Posted patch matched if/endif (obsolete) — Splinter Review
Attachment #239519 - Attachment is obsolete: true
Attachment #298515 - Flags: review?(wtc)
Assignee

Comment 23

12 years ago
Comment on attachment 298515 [details] [diff] [review]
matched if/endif

timeless: thanks for the patch.

Let's name the macro _PR_HAVE_INTRINSIC_ATOMIC_OPS.  (There is
an existing internal macro called _PR_HAVE_ATOMIC_OPS.)

Please add an underscore (_) between ATOMIC and INCREMENT/DECREMENT.

Are there compiler intrinsics for _InterlockedAdd and _InterlockedExchange?
It is a little strange to only have intrinsics macros for two of the four
PR_AtomicXXX functions.

Comment 24

12 years ago
http://msdn2.microsoft.com/en-us/library/w5405h95.aspx
Alphabetical Listing of Intrinsic Functions

http://msdn2.microsoft.com/en-us/library/51s265a6.aspx
_InterlockedAdd Intrinsic Functions

http://msdn2.microsoft.com/en-us/library/1s26w950.aspx
_InterlockedExchange Intrinsic Functions
Reporter

Comment 25

12 years ago
(In reply to comment #23)
> Are there compiler intrinsics for _InterlockedAdd and _InterlockedExchange?
> It is a little strange to only have intrinsics macros for two of the four
> PR_AtomicXXX functions.

There's an argument to be made for using these 2 intrinsics for completeness/consistency with Inc & Dec, but there is really no real-world performance gain.

While the intrinsics are more flexible than the existing ASM code in terms of code generation, those 2 functions are called so rarely (at least by Firefox/SeaMonkey) that there is really no practical reason to prefer them over the existing code.
Reporter

Comment 26

12 years ago
Re comment #23 and naming conventions: is there any advantage to noting the compiler that provides the intrinsic functions?  These functions are compiler-specific.

I'm thinking here of potential confusion of the analogous GCC intrinsics

+# define _MD_ATOMIC_INCREMENT(val)   __sync_add_and_fetch(val, 1)
+# define _MD_ATOMIC_DECREMENT(val)   __sync_sub_and_fetch(val, 1)
+# define _MD_ATOMIC_ADD(ptr,val)     __sync_add_and_fetch(ptr, val)
+# define _MD_ATOMIC_SET(ptr,newval)  __sync_lock_test_and_set(ptr, newval)

(or similar) as replacements for the functions in os_Linux_x86.s.
Assignee

Comment 27

12 years ago
I agree we should skip the intermediate HAVE_XXX macro and just do this:

+#if defined(_WIN32) && (_MSC_VER >= 1310)
+long __cdecl _InterlockedIncrement(PRInt32 volatile *Addend);
+#pragma intrinsic(_InterlockedIncrement)
+
+long __cdecl _InterlockedDecrement(PRInt32 volatile *Addend);
+#pragma intrinsic(_InterlockedDecrement)
+
+#define PR_ATOMIC_INCREMENT(val) _InterlockedIncrement(val)
+#define PR_ATOMIC_DECREMENT(val) _InterlockedDecrement(val)
+#else
+#define PR_ATOMIC_INCREMENT(val) PR_AtomicIncrement(val)
+#define PR_ATOMIC_DECREMENT(val) PR_AtomicDecrement(val)
+#endif

These intrinsics are available on all processor architectures and
in single-threaded programs, so we don't need to test the _M_IX86
and _MT macros.  The return values of the intrinsics don't match
NSPR's specification on Windows 95 and NT 3.51, so we will be
dropping support for these two old versions of Windows.

Comment 28

12 years ago
(In reply to comment #25)
> While the intrinsics are more flexible than the existing ASM code in terms of
> code generation, those 2 functions are called so rarely (at least by
> Firefox/SeaMonkey) that there is really no practical reason to prefer them over
> the existing code.

The advantages to using Intrinsics over MSVC Inline Assembler are:

- Intrinsics are generally portable and typically easier to understand
- MSVC for x64 doesn't support MSVC inline assembler so using intrinsics simplifies the code
- I don't know if the intrinsics are the same between GCC and MSVC but having only one code set would be nice for multiple platforms.

Assignee

Comment 29

12 years ago
Add PR_ATOMIC_INCREMENT, PR_ATOMIC_DECREMENT, PR_ATOMIC_SET,
and PR_ATOMIC_ADD.  Implement these macros as compiler intrinsics
for the latest versions of MSVC and GCC.  Implement them as
the PR_AtomicXXX functions by default.

I've checked in this patch on the NSPR trunk (NSPR 4.7) for
testing.

Checking in pr/include/pratom.h;
/cvsroot/mozilla/nsprpub/pr/include/pratom.h,v  <--  pratom.h
new revision: 3.9; previous revision: 3.8
done
Checking in pr/tests/atomic.c;
/cvsroot/mozilla/nsprpub/pr/tests/atomic.c,v  <--  atomic.c
new revision: 3.9; previous revision: 3.8
done
Attachment #298515 - Attachment is obsolete: true
Attachment #299498 - Flags: superreview?(nelson)
Attachment #299498 - Flags: review?(swsnyder)
Attachment #298515 - Flags: review?(wtc)
Comment on attachment 299498 [details] [diff] [review]
Proposed patch (checked in)

The patch to atomic.c contains a pattern of 6 lines of code
that is repeated many times.  I would have preferred that 
those lines appear in a macro with several arguments, and 
that each occurrence be a single macro invocation.  Also, 
I would have written the statements of this form:
    result = result | ((EXPRESSION) ? 0 : 1);
as the equivalent and simpler statement:
    result |= !(EXPRESSION);
But this patch will do for test code.  r=nelson
Attachment #299498 - Flags: superreview?(nelson) → superreview+
Assignee

Updated

12 years ago
Duplicate of this bug: 344307
Assignee

Comment 32

12 years ago
Nelson, I agree with both of your suggestions.  I don't have time
to implement them now, but I added a TODO comment.

I also think of an issue -- the PR_AtomicXXX functions may be
implemented with a lock even when the platform has atomic
instructions because we don't have the time or expertise to
write the assembly code.  So if we define PR_ATOMIC_XXX with
compiler intrinsics, the macros and the functions can't be
used on the same variable because they use incompatible methods
to achieve atomicity.  But people will expect that the macros
and the functions can be used interchangeably.  So I documented
this issue in pratom.h and restrict the use of GCC atomic
intrinsics to the platforms where PR_AtomicXXX is truly atomic.

You don't need to review this patch carefully.  This is just FYI.
I've checked in this patch on the NSPR trunk (NSPR 4.7).

Checking in pr/include/pratom.h;
/cvsroot/mozilla/nsprpub/pr/include/pratom.h,v  <--  pratom.h
new revision: 3.10; previous revision: 3.9
done
Checking in pr/tests/atomic.c;
/cvsroot/mozilla/nsprpub/pr/tests/atomic.c,v  <--  atomic.c
new revision: 3.10; previous revision: 3.9
done
Attachment #299608 - Flags: review?(nelson)
Comment on attachment 219500 [details] [diff] [review]
Retains PR_Atomic* functions for compatibility

Obsoleting this patch, which has been superseded.
Attachment #219500 - Attachment is obsolete: true
Attachment #219500 - Flags: review?(wtc)
Comment on attachment 299608 [details] [diff] [review]
Supplemental patch 1 (checked in)

I think this bug can now be resolved as fixed.
Attachment #299608 - Flags: review?(nelson) → review+
Assignee

Updated

12 years ago
Severity: normal → enhancement
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Version: 4.7 → other
Reporter

Comment 35

7 years ago
Comment on attachment 299498 [details] [diff] [review]
Proposed patch (checked in)

Removed myself from review in attempt to get Bugzilla to stop nagging me about this 4-year-old bug.
Attachment #299498 - Flags: review?(swsnyder) → review?
You need to log in before you can comment on or make changes to this bug.