Closed Bug 732043 Opened 12 years ago Closed 11 years ago

MFBT should provide a mozilla::Atomic template class encapsulating values accessed atomically

Categories

(Core :: MFBT, enhancement)

Tracking

RESOLVED FIXED
mozilla24

People

(Reporter: froydnj, Assigned: froydnj)

References

Details

Attachments

(3 files, 12 obsolete files)

2.98 KB, text/plain
5.34 KB, patch (Waldo: review+)
32.30 KB, patch (Waldo: review+)
Similar to pratom.h, but without some of the not-related-to-atomics cruft in there.  Also with extra stuff, such as atomic compare-and-exchange.
I've been poking around the C++11 memory model recently, and it probably makes some sense to emulate <atomic> a bit (or even use it where it exists!), but the question is how much of the memory ordering stuff to pull in, since it's not going to be supported pre-clang-3.1 or pre-gcc-4.7 (IIRC) (I don't know about MSVC, but it's primarily for x86, where memory ordering isn't an issue anyway :-P).
Basic straw proposal for a mozilla/Atomics.h:

template <T, memory ordering defaulting to sequentially consistent>
class Atomic {
public:
  T operator op=(T); (op = +, -, ^, |, &)
  T operator op() (post/pre inc/dec)
  operator T(); (implicit conversion to T)
  constructor/copy-construct/operator =
  bool CompareAndSwap(T &expected, T other); (strong only or weak only?)
};

Differences from C++11:
* No support for doing atomic operations on random memory locations, only on actual atomic operations. I think this is less foot-gunny (all operations on the location would be done via atomic operations).
* Memory ordering is specified on the atomic class and not as individual operations.
* Only sequentially consistent and relaxed memory models would be supported. Possibly release/acquire too (store ops would be release, loads acquire, and rmw would be rel_acquire). Definitely no support for release/consume.
* No explicit FetchAndPHI methods.
* No support for test-and-set--I don't think we need it?
* No support for memory fences--this could come in a later bug if desired.

This would be implemented by wrapping C++11 <atomic> if available. If not, all Atomic memory ordering would collapse to sequentially consistent, and we'd use (using gcc as an example):
template <typename T>
class Atomic {
  volatile T value;
public:
  T operator +=(T delta) {
     return __sync_add_and_fetch(&value, delta);
  }
// Copying from lock-free variant of gcc 4.5's atomic
  operator T() {
      __sync_synchronize();
      T val = value;
      __sync_synchronize();
      return val;
   }
  T operator=(T newvalue) {
      value = newvalue;
      __sync_synchronize();
      return newvalue;
   }
   bool CompareAndSwap(T &expected, T other) {
     T old = __sync_val_compare_and_swap(&value, expected, other);
     bool success = (old == expected);
     expected = old;
     return success;
   }
};
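For concreteness, here's a rough sketch of how call sites might use the proposed class (the Atomic name, default ordering, and CompareAndSwap shape are taken from the straw proposal above; the header path, counter, and constructor details are made up for illustration):

#include "mozilla/Atomics.h"  // hypothetical header from this proposal
#include <stdint.h>

// Shared between threads; sequentially consistent by default.
static mozilla::Atomic<uint32_t> sEventCount;

void recordEvent()
{
  sEventCount++;  // atomic post-increment
}

bool tryClaim(mozilla::Atomic<uint32_t>& flag)
{
  uint32_t expected = 0;
  // Strong CAS per the proposal: returns true if |flag| was 0 and is now 1.
  return flag.CompareAndSwap(expected, 1);
}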

Support notes: gcc 4.5 has <atomic> (although support for memory ordering needs 4.7). Clang 3.1 supports nearly-full atomics (no release/consume, no weak cmpxchg) and libc++ brings along <atomic>. MSVC10 (2010) does not support <atomic>, but 11 (2012) does. Documentation appears to indicate that the advanced memory model is only supported on Windows 8 (?). Since x86 is pretty much fully sequentially consistent, it probably doesn't matter that much.

Thoughts/Comments/Questions/Concerns/Bikesheds/Flames?
Target Milestone: --- → mozilla16
Version: unspecified → Trunk
There aren't a lot of types __sync_add_and_fetch supports, it might be better to assert on the sizeof(T).
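For illustration, such a guard could be a static assertion in the class itself (MOZ_STATIC_ASSERT is from mozilla/Assertions.h; the exact set of accepted sizes here is just a placeholder for whatever the implementation settles on):

#include "mozilla/Assertions.h"

template<typename T>
class Atomic
{
    // Reject sizes the __sync_* builtins can't be relied on to handle.
    MOZ_STATIC_ASSERT(sizeof(T) == 4 || sizeof(T) == 8,
                      "mozilla::Atomic only supports 32-bit and 64-bit types");
    volatile T mValue;
};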
(In reply to Mike Hommey [:glandium] from comment #3)
> There aren't a lot of types __sync_add_and_fetch supports, it might be
> better to assert on the sizeof(T).

If I didn't make this clear: Atomic<T> would only be well defined when T is an integral type or a pointer, so __sync_add_and_fetch should always be valid for any T that Atomic<T> would use.
(In reply to Joshua Cranmer [:jcranmer] from comment #2)
> * No support for doing atomic operations on random memory locations, only on
> actual atomic operations. I think this is less foot-gunny (all operations on
> the location would be done via atomic operations).

Do you mean "no support for doing atomic operations on random memory locations, only on Atomic<T> variables"?  At first glance, it seems to be hard to make pratom.h go away with this restriction.  (For instance, atomic XPCOM refcounting macros wouldn't work well with this scheme...or would require quite some work elsewhere.)

All the other differences vis-a-vis C++11 seem OK.  For conformity with C++11, I suppose operator {&,^,|}= are needed, though I don't think they'd get used.

Mike does raise a good point; not all processors support all sizes of Atomic<T> (e.g. Atomic<uint64_t> on some ARM/GCC combos, most other 32-bit processors, etc.).  But this is a minor point for the actual implementation, not a high-level detail.
(In reply to Nathan Froyd (:froydnj) from comment #5)
> (In reply to Joshua Cranmer [:jcranmer] from comment #2)
> > * No support for doing atomic operations on random memory locations, only on
> > actual atomic operations. I think this is less foot-gunny (all operations on
> > the location would be done via atomic operations).
> 
> Do you mean "no support for doing atomic operations on random memory
> locations, only on Atomic<T> variables"?  At first glance, it seems to be
> hard to make pratom.h go away with this restriction.  (For instance, atomic
> XPCOM refcounting macros wouldn't work well with this scheme...or would
> require quite some work elsewhere.)

Just modify the macros to make threadsafe addrefs/releases use Atomic<T>.

> Mike does raise a good point; not all processors support all sizes of
> Atomic<T> (e.g. Atomic<uint64_t> on some ARM/GCC combos, most other 32-bit
> processors, etc.).  But this is a minor point for the actual implementation,
> not a high-level detail.

Hmm, the documentation of gcc's builtins was a bit vague about how this works. It sounds like the builtins get compiled to calls to external functions, with a warning, if the target doesn't provide them (it doesn't indicate whether this would further induce a link error).
(In reply to Joshua Cranmer [:jcranmer] from comment #6)
> > Do you mean "no support for doing atomic operations on random memory
> > locations, only on Atomic<T> variables"?  At first glance, it seems to be
> > hard to make pratom.h go away with this restriction.  (For instance, atomic
> > XPCOM refcounting macros wouldn't work well with this scheme...or would
> > require quite some work elsewhere.)
> 
> Just modify the macros to make threadsafe addrefs/releases use Atomic<T>.

But the actual nsISupports.mRefCnt is not Atomic<T>, so I'm not sure how that helps.

> > Mike does raise a good point; not all processors support all sizes of
> > Atomic<T> (e.g. Atomic<uint64_t> on some ARM/GCC combos, most other 32-bit
> > processors, etc.).  But this is a minor point for the actual implementation,
> > not a high-level detail.
> 
> Hmm, the documentation of gcc's builtins was a bit vague about how this
> works. It sounds like the builtins get compiled to calls to external
> functions, with a warning, if the target doesn't provide them (it doesn't
> indicate whether this would further induce a link error).

I believe this can induce a link error; I don't believe the runtime libraries are appropriately generic in this regard.  __GCC_HAVE_SYNC_COMPARE_AND_SWAP_${N} can be checked for appropriate ${N}...but I see that GCC 4.4 for our Android toolchains doesn't have those macros available.  Grumble.  I wonder if 4.5 has those macros.  Might be another point for arguing a minimum GCC 4.5 in the Compiler Version Upgrade Party.
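A check of that shape might look like the following (the MOZ_* define is just a placeholder name):

// Only claim inline 4-byte atomics when GCC advertises compare-and-swap
// support for that size; otherwise the __sync_* calls may turn into
// out-of-line libgcc calls or fail to link.
#if defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_4)
#  define MOZ_HAVE_INLINE_ATOMICS_4 1
#endif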
gcc 4.4 for android has the functions, but they are libgcc functions, not inlined intrinsics.
Attached file Prototype of mozilla::Atomic (obsolete) —
This is a prototype that works with g++/clang++ with both -std=c++11 and without it. It's just the header, not a patch, because it's easier to test both branches of the #if (gcc w/o <atomic> and gcc with it).

For now, I've added all the volatile methods, but that really complicates things a fair amount (I can get rid of half the type-traiting in the C++11 version if we don't have the volatile-qualified methods).

What I still need:
1. Implement for MSVC. Note that __sync_synchronize should be equivalent to MemoryBarrier on MSVC...
2. Solve the not-all-platforms-have-64bit atomics issue. Perhaps something like this at the end might work:

// (sizeof can't actually be used in #if; in practice this would key off a
// platform-size macro such as HAVE_64BIT_OS.)
#if !defined(HAVE_64BIT_OS)
template <MemoryOrder Order> class Atomic<uint64_t, Order> {};
template <MemoryOrder Order> class Atomic<int64_t, Order> {};
#endif

Actually, MSVC looks like it's missing intrinsics for sub-word-size operations too.

3. Relax memory barriers on the load/store. Right now, they're implemented as if they're always sequentially consistent. Unordered should be able to get away with no barriers. MSVC documentation claims that VS2005 uses acq/rel semantics for volatile variables, so we shouldn't need barriers there; on gcc, we can remove the barrier after the load in acq/rel mode.

4. Implement Atomic<T*>.
Assignee: nobody → Pidgeot18
Status: NEW → ASSIGNED
Attachment #680483 - Flags: feedback?(nfroyd)
Target Milestone: mozilla16 → ---
Comment on attachment 680483 [details]
Prototype of mozilla::Atomic

I think this looks sane enough as a starting point.  I'm inclined to say that we don't need the volatile-qualified variants, as we don't use |volatile| that much.  (Much of the code that does, according to a glance at grep, seems to be using |volatile| as a quasi-locking primitive...  Or variables for inter-thread communication, which ought to be replaced with these anyway.)
Attachment #680483 - Flags: feedback?(nfroyd) → feedback+
Maybe I misunderstood something, but .. what are the |volatile| variants
for?  My understanding is that volatile is useless for inter-thread sync,
as per 
http://software.intel.com/en-us/blogs/2007/11/30/volatile-almost-useless-for-multi-threaded-programming
(In reply to Julian Seward from comment #11)
> Maybe I misunderstood something, but .. what are the |volatile| variants
> for?  My understanding is that volatile is useless for inter-thread sync,
> as per 
> http://software.intel.com/en-us/blogs/2007/11/30/volatile-almost-useless-for-
> multi-threaded-programming

Legitimate reasons to use volatile basically boil down to "I'm talking to device registers" or "I need things to be live in signal handlers."
Summary: MFBT should provide macros for common atomic operations → MFBT should provide a mozilla::Atomic template class encapsulating values accessed atomically
I haven't tested the MSVC end nor have I compiled this in a while, but here's a newer WIP if anyone wants to finish this.
Attachment #680483 - Attachment is obsolete: true
Blocks: 864035
Ehsan, were you going to work on this or shall I finish it up?
Flags: needinfo?(ehsan)
This patch brings Joshua's patch closer to MFBT style (*grumble* silly
MFBT indentation rules for classes and structs without accessibility
specifiers *grumble*), cleans up the Windows support, and makes pointer
arithmetic more consistent across implementations.

- AFAICS, Windows limits us to atomics of 32-bit values (and maybe
  64-bit values on Win64).  If I'm reading the docs correctly, a lot of
  the functions we need don't appear until Vista, Windows 7, or Windows
  8, depending.  (It's possible that the compiler actually supports
  intrinsics for some of these functions, even if they don't appear in
  the runtime?  Not sure about this, need to examine the docs more
  closely.)  If all this reading of the docs is correct, there should
  probably be a static assert added somewhere to prevent people from
  inadvertently burning the tree with their space-optimizing
  atomic<uint16_t> variables.

- The previous patch was inconsistent on whether arithmetic on atomic<T*>
  meant ordinary pointer arithmetic or arithmetic on the moral equivalent
  of char*s.  From reading what docs/headers I could, it appears that
  arithmetic on atomic<T*> is the same as arithmetic on T*.  I believe
  I have corrected this inconsistency.

Haven't tried to compile it; hopefully it still builds on Linux and Mac.
Attachment #732542 - Attachment is obsolete: true
Attached file Unorganized tests
These are the tests that I originally used when writing the patch to make sure I didn't do anything stupid. Testing the correctness of the memory ordering variables is extremely tricky, and so long as we don't have MFBT threads, pretty much impossible from the MFBT testsuite.

I think we'd only need to test uint32_t, size_t/uintptr_t, and T* for some T where sizeof(T) != 1, as 64-bit atomic types will fail on 32-bit ARM (and a few other non-tier-1 platforms), and sub-32-bit atomic types are at best deficient in Windows land.
Attached patch slightly more-tested patch (obsolete) — Splinter Review
This patch at least survives checking with Joshua's testcases (slightly
modified, will attach next) on Linux/x86-64.  MSVC was complaining about
the code, but I think I might have satisfied it; try will tell me
shortly.

clang, unfortunately, is causing problems: the comment near the top of
Atomics.h describes the first problem.  The second problem is that GCC
is somewhat lax in its enforcement of the __sync_* family of functions
and clang is somewhat more strict.  In cases like:

  return __sync_fetch_and_add(ptr, intval);

where |ptr| has type T** and intval has some integer-valued type, GCC
will accept it, whereas clang will not.  (clang really wants |intval| to
be of type T*, which is correct according to the GCC documentation, but
not really correct given how you'd want to use the function.)

I haven't decided yet whether it's worth writing a separate clang only
arm to the monster conditional or trying to write a local patch that
would hopefully get accepted upstream to fix this issue.  I suppose one
could just blindly cast |intval| to a pointer in the above, but I'm not
sure that would work out very well...
Attachment #740384 - Attachment is obsolete: true
Attached patch add tests for mfbt/Atomics.h (obsolete) — Splinter Review
Tests based on Joshua's.  Cosmetic changes only; adding tests to ensure
the logical operations work properly would be a good idea.
(In reply to comment #14)
> Ehsan, were you going to work on this or shall I finish it up?

I was hoping somebody else would volunteer...  And I'll go ahead and assume that you just did.  ;-)
Flags: needinfo?(ehsan)
(In reply to Nathan Froyd (:froydnj) from comment #17)
> clang, unfortunately, is causing problems: the comment near the top of
> Atomics.h describes the first problem.

Clang can use libstdc++ on Linux systems, so maybe the solution for clang is:

#ifdef __clang__
# if __has_include(<atomic>)
#   define USE_CXX11_ATOMICS
# endif
#endif

[We might still need a C++11 guard, since I think libstdc++ complains if it's not in C++11 mode.]
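Folding the C++11 guard into the sketch above might look like this (MOZ_HAVE_CXX11_ATOMICS is a placeholder name):

#ifdef __clang__
   // clang provides __has_include, so this is safe inside the __clang__ guard.
#  if __has_include(<atomic>) && __cplusplus >= 201103L
#    define MOZ_HAVE_CXX11_ATOMICS
#  endif
#endif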
(In reply to Joshua Cranmer [:jcranmer] from comment #20)
> (In reply to Nathan Froyd (:froydnj) from comment #17)
> > clang, unfortunately, is causing problems: the comment near the top of
> > Atomics.h describes the first problem.
> 
> Clang can use libstdc++ on Linux systems, so maybe the solution for clang is:
> 
> #ifdef __clang__
> # if __has_include(<atomic>)
> #   define USE_CXX11_ATOMICS
> # endif
> #endif
> 
> [We might still need a C++11 guard, since I think libstdc++ complains if
> it's not in C++11 mode.]

That could work for auto-upgrading us to use <atomic> when we switch our minimum deployment target to OS X 10.7.  But we'd still need something that works correctly on 10.6 (and works for anybody out there still compiling with GCC 4.4).  clang's support of the __sync_* functions doesn't match GCC's and appears to do weird things with reinterpret_casts thrown in, so something more complex is needed.
(In reply to Nathan Froyd (:froydnj) from comment #21)
> (In reply to Joshua Cranmer [:jcranmer] from comment #20)
> > (In reply to Nathan Froyd (:froydnj) from comment #17)
> > > clang, unfortunately, is causing problems: the comment near the top of
> > > Atomics.h describes the first problem.
> > 
> > Clang can use libstdc++ on Linux systems, so maybe the solution for clang is:
> > 
> > #ifdef __clang__
> > # if __has_include(<atomic>)
> > #   define USE_CXX11_ATOMICS
> > # endif
> > #endif
> > 
> > [We might still need a C++11 guard, since I think libstdc++ complains if
> > it's not in C++11 mode.]
> 
> That could work for auto-upgrading us to use <atomic> when we switch our
> minimum deployment target to OS X 10.7.  But we'd still need something that
> works correctly on 10.6 (and works for anybody out there still compiling
> with GCC 4.4).

To be a little more precise: compiling with libstdc++ from GCC 4.4 and with clang (an admittedly unusual combination).
Attached patch add mfbt/Atomics.h (obsolete) — Splinter Review
Finally, something that compiles and works on Windows.

I'm fairly certain all of this is correct, but it may not all be efficient.
In particular, the AtomicIntrinsics::{load,store} implementations are probably
not as efficient as they could be on non-<atomic> GCC/Clang versions and MSVC.
Since that's virtually all of our userbase, it's probably worth spending a
little more time getting those right.
Attachment #740504 - Attachment is obsolete: true
Attached patch add tests for mfbt/Atomics.h (obsolete) — Splinter Review
A greater amount of Atomics.h is now tested, including non-increment/decrement
for pointers and the compound assignment operators for integer values.
Attachment #740505 - Attachment is obsolete: true
Comment on attachment 741521 [details] [diff] [review]
add mfbt/Atomics.h

Asking for feedback, as I'd like to get the efficiency concerns worked out before asking for review.
Attachment #741521 - Flags: feedback?(jwalden+bmo)
Attached patch add mfbt/Atomics.h (obsolete) — Splinter Review
OK, after wading through documentation and explanations of memory ordering,
this version should be about as efficient as we can make it without wading
into the tedium and riskiness of per-processor defines.  I took the liberty
of making the Windows code x86-specific.
Attachment #741521 - Attachment is obsolete: true
Attachment #741521 - Flags: feedback?(jwalden+bmo)
Attachment #741979 - Flags: review?(jwalden+bmo)
Attached patch add tests for mfbt/Atomics.h (obsolete) — Splinter Review
Tests, solely updated for Unordered -> Relaxed renaming and more ordering
testing of the pointer specializations, now that we care about ordering.
Attachment #741524 - Attachment is obsolete: true
Attachment #741980 - Flags: review?(jwalden+bmo)
Comment on attachment 741979 [details] [diff] [review]
add mfbt/Atomics.h

Review of attachment 741979 [details] [diff] [review]:
-----------------------------------------------------------------

::: mfbt/Atomics.h
@@ +54,5 @@
> + *                  this atomic, then a thread B that reads the value that A
> + *                  wrote to this atomic can read the value that A wrote to C.
> + * SequentiallyConsistent - This is similar to ReleaseAcquire, but all threads
> + *                          agree on the same order for all reads/writes of all
> + *                          sequentially consistent atomic variable.

I think it's worth noting somewhere that sequentially consistent access is the default for atomics, and should be unless you have a compelling reason (i.e., performance) not to make it so.

@@ +397,5 @@
> +#  pragma intrinsic(_InterlockedCompareExchange64)
> +#  pragma intrinsic(_InterlockedExchange64)
> +
> +template <>
> +struct PrimitiveIntrinsics<8> {

If I'm understanding correctly, Win x86-64 doesn't have support for Atomic<uint32_t>?
(In reply to Joshua Cranmer [:jcranmer] from comment #28)
> I think it's worth noting somewhere that sequentially consistent access is
> the default for atomics, and should be unless you have a compelling reason
> (i.e., performance) not to make it so.

That's a good point, I'll do that.

> @@ +397,5 @@
> > +#  pragma intrinsic(_InterlockedCompareExchange64)
> > +#  pragma intrinsic(_InterlockedExchange64)
> > +
> > +template <>
> > +struct PrimitiveIntrinsics<8> {
> 
> If I'm understanding correctly, Win x86-64 doesn't have support for
> Atomic<uint32_t>?

Doh, yes, you're reading correctly.  It should.  I'll rearrange the #ifdeffery to make it so.
Attached patch add mfbt/Atomics.h (obsolete) — Splinter Review
Addressed Joshua's comments and some other miscellaneous issues.  Adding Julian
as a reviewer to get a second pair of eyes on the atomicity bits.
Attachment #741979 - Attachment is obsolete: true
Attachment #741979 - Flags: review?(jwalden+bmo)
Attachment #742326 - Flags: review?(jwalden+bmo)
Attached patch add tests for mfbt/Atomics.h (obsolete) — Splinter Review
Updated to get rid of warnings.
Attachment #741980 - Attachment is obsolete: true
Attachment #741980 - Flags: review?(jwalden+bmo)
Attachment #742327 - Flags: review?(jwalden+bmo)
Comment on attachment 742326 [details] [diff] [review]
add mfbt/Atomics.h

Adding Julian for real this time.
Attachment #742326 - Flags: review?(jseward)
(In reply to Joshua Cranmer [:jcranmer] from comment #6)
> (In reply to Nathan Froyd (:froydnj) from comment #5)
> > (In reply to Joshua Cranmer [:jcranmer] from comment #2)
> > > * No support for doing atomic operations on random memory locations, only on
> > > actual atomic operations. I think this is less foot-gunny (all operations on
> > > the location would be done via atomic operations).
> > 
> > Do you mean "no support for doing atomic operations on random memory
> > locations, only on Atomic<T> variables"?  At first glance, it seems to be
> > hard to make pratom.h go away with this restriction.  (For instance, atomic
> > XPCOM refcounting macros wouldn't work well with this scheme...or would
> > require quite some work elsewhere.)
> 
> Just modify the macros to make threadsafe addrefs/releases use Atomic<T>.

Ah, so I finally figured out what was bugging me about this.  If you have a class that is NS_DECL_ISUPPORTS, you inherit from that class, and then attempt to make this second class use thread-safe refcounting, you will lose: the declaration of mRefCnt by NS_DECL_ISUPPORTS isn't the correct type.  A quick glance through greps for NS_DECL_ISUPPORTS_INHERITED and NS_IMPL_THREADSAFE_* suggests that we don't have this corner case lurking anywhere in mozilla-central, but only actually changing everything around will bring any ugliness to light.

(You might not notice, depending on how the NS_IMPL_THREADSAFE_* macros are implemented: if they use, say, ++ or --, those operations will work just fine with plain old integers and the users will be none the wiser.  It might not be a bad idea to provide Atomic<T>::{add,sub,...} methods that can be used in such contexts so declaring mRefCnt as an integer and then attempting to use thread-safe refcounting with it will blow up appropriately.)
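A sketch of that idea (the add() method and the macro shape here are hypothetical, not something in the current patch):

// If the thread-safe macro spells its increment as a named method instead of
// ++, a plain-integer mRefCnt fails to compile instead of silently racing.
#define THREADSAFE_ADDREF_GUTS(refcnt_) \
  (refcnt_).add(1)  /* only compiles if |refcnt_| is a mozilla::Atomic */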
(In reply to Nathan Froyd (:froydnj) from comment #33)
> Ah, so I finally figured out what was bugging me about this.  If you have a
> class that is NS_DECL_ISUPPORTS, you inherit from that class, and then
> attempt to make this second class use thread-safe refcounting, you will
> lose: the declaration of mRefCnt by NS_DECL_ISUPPORTS isn't the correct
> type.  A quick glance through greps for NS_DECL_ISUPPORTS_INHERITED and
> NS_IMPL_THREADSAFE_* suggests that we don't have this corner case lurking
> anywhere in mozilla-central, but only actually changing everything around
> will bring any ugliness to light.

We don't have NS_IMPL_THREADSAFE_ISUPPORTS_INHERITED* macros like we do NS_IMPL_ISUPPORTS_INHERITED*, so anyone who wants this has to do this stuff by hand.

One idea I had was to eliminate the need for the distinction between NS_IMPL_THREADSAFE_ADDREF and NS_IMPL_ADDREF macros (which would imply that NS_IMPL_THREADSAFE_* would almost completely vanish). However, the "on wrong thread" assertion checks make that unfeasible without refactoring more of the header files.
(In reply to Nathan Froyd (:froydnj) from comment #32)
> Adding Julian for real this time.

I should say up front that I'm not well informed about C++11 atomics
nor about memory models, despite ranting on about them at the Paris
perf week back in March.  So I wouldn't place too much reliance on the
following comments.

* Overall, it looks plausible to me.

* It's not terribly clear which bits of the file are for external use
  and which are implementation.  This might confuse users of it?
  Maybe mark the external interface clearly?  Move the implementation
  into a different header, eg Atomics-impl.h ?  The latter might be
  a nice solution.

* The comment describing Relaxed/ReleaseAcquire/SequentiallyConsistent
  is quite hard to follow, and I think will defeat all but the most
  ardent users of this facility.  Can it be improved?  Do we need all
  3 variants?  I am concerned about offering users options they don't
  understand.  (Or maybe this just reflects my own ignorance about
  these memory models.)

* I find the use of "oldval" in the context of CAS confusing.  I think
  of CAS involving a pointer, an expected value and a new value.  Is
  the "old" value the expected value or the value in memory before the
  CAS happened?  Can you use "expval" instead of "oldval" ?  Also,
  would it be possible to make CAS return a bool indicating
  success/failure, rather than having to grapple with inferring that
  from the returned value vs the given parameters?

* The real challenge here is verifying (testing) the implementation.
  Would it be possible to write test cases that generate multiple
  threads, do a large number of operations, and check for unexpected
  races by looking at the result numbers?  (eg, 2 threads, each does 1
  million atomic incs in a loop, we expect final result to be exactly
  2 million).

+ // We only support 32-bit types on 32-bit Windows.  And so it goes elsewhere.

Goes elsewhere?  I didn't understand that.
(In reply to Julian Seward from comment #35)
> * It's not terribly clear which bits of the file are for external use
>   and which are implementation.  This might confuse users of it?
>   Maybe mark the external interface clearly?  Move the implementation
>   into a different header, eg Atomics-impl.h ?  The latter might be
>   a nice solution.

Anything in mozilla::detail (i.e., AtomicIntrinsics) is not for public consumption. mozilla::Atomic is effectively the only public export (and mozilla::MemoryOrdering, which really only ought to be used as a template parameter).
 
> * The comment describing Relaxed/ReleaseAcquire/SequentiallyConsistent
>   is quite hard to follow, and I think will defeat all but the most
>   ardent users of this facility.  Can it be improved?  Do we need all
>   3 variants?  I am concerned about offering users options they don't
>   understand.  (Or maybe this just reflects my own ignorance about
>   these memory models.)

Having originally written the documentation in question, here's my best summary of implications for people who don't understand this stuff.

Hardware and compilers cheat and do memory operations out of order. For actions that happen on the same thread, that thread can't tell that they happen out of order, but other threads can. The purpose of memory ordering is to constrain what kinds of reordering other threads can see.

Relaxed - no constraints. The atomic operation is just a memory access implemented atomically; this is useful for counters that may involve many threads but don't guard anything critical (which does not include reference counters!).

Sequentially-consistent - this is the complete opposite; effectively no reordering is allowed. Everything that happens before the atomic access is observed to happen before by any thread and everything that happens after can't be observed to happen before by any thread. If that's too confusing, the simplest explanation is "this is what you think should happen if you don't understand what really happens."

ReleaseAcquire - This is hard to explain. The closest parallel is to a lock. Imagine all memory operations to be protected by a phantom lock corresponding to this atomic variable. When you write to it, the lock is released, and when you read it, it is acquired. Thus, reading from the atomic variable guarantees that you can see all writes that happened before the atomic variable was last written. I think off-hand this is what thread-safe reference counters should be (freebie: no fences on x86, thanks to its strong memory model!), but I'd have to think a lot more before definitively subscribing to this.

[And if you think these explanations are confusing, try putting them into context of a memory model where it's not variables that have memory ordering implications but the atomic operations, so the same variable can be accessed under different operations. "Don't do that" is a very apt statement.]

In short: use sequentially-consistent unless you know what you're doing and wish to use something else.
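To make the ReleaseAcquire case concrete, here's a minimal sketch against the proposed interface (the Atomic and ReleaseAcquire names are from the patch; the flag/data variables and the constructor/initialization details are assumed):

#include "mozilla/Atomics.h"
#include <stdint.h>

static int sData;                                                  // plain memory
static mozilla::Atomic<uint32_t, mozilla::ReleaseAcquire> sReady;  // guards sData

void producerThread()
{
  sData = 42;   // ordinary write
  sReady = 1;   // release store: publishes sData along with it
}

void consumerThread()
{
  if (sReady == 1) {       // acquire load
    int observed = sData;  // guaranteed to see 42 if sReady read 1
    (void) observed;
  }
}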

> * The real challenge here is verifying (testing) the implementation.
>   Would it be possible to write test cases that generate multiple
>   threads, do a large number of operations, and check for unexpected
>   races by looking at the result numbers?  (eg, 2 threads, each does 1
>   million atomic incs in a loop, we expect final result to be exactly
>   2 million).

We don't have threads in MFBT, so we can't test this from MFBT. I actually even started devising some testcases to make sure that the memory ordering requirements are satisfied.
(In reply to Nathan Froyd (:froydnj) from comment #33)
> > Just modify the macros to make threadsafe addrefs/releases use Atomic<T>.
> 
> Ah, so I finally figured out what was bugging me about this.  If you have a
> class that is NS_DECL_ISUPPORTS, you inherit from that class, and then
> attempt to make this second class use thread-safe refcounting, you will
> lose: the declaration of mRefCnt by NS_DECL_ISUPPORTS isn't the correct
> type.  A quick glance though greps for NS_DECL_ISUPPORTS_INHERITED and
> NS_IMPL_THREADSAFE_* suggests that we don't have this corner case lurking
> anywhere in mozilla-central, but only actually changing everything around
> will bring any ugliness to light.
> 
> (You might not notice, depending on how the NS_IMPL_THREADSAFE_* macros are
> implemented: if they use, say, ++ or --, those operations will work just
> fine with plain old integers and the users will be none the wiser.  It might
> not be a bad idea to provide Atomic<T>::{add,sub,...} methods that can be
> used in such contexts so declaring mRefCnt as an integer and then attempting
> to use thread-safe refcounting with it will blow up appropriately.)

NS_IMPL_THREADSAFE_* could also static-assert that the type of the refcount is Atomic<T>, which wouldn't require any widespread change to verify.
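Something along these lines could express that check (decltype needs C++11, the IsSame trait is assumed to come from mozilla/TypeTraits.h, and the macro name and nsrefcnt stand-in are made up):

#include "mozilla/Assertions.h"
#include "mozilla/Atomics.h"
#include "mozilla/StandardInteger.h"
#include "mozilla/TypeTraits.h"

typedef uint32_t nsrefcnt;  // stand-in for the XPCOM typedef

// Hypothetical guard for the thread-safe macros: only compiles when the
// refcount member really is the atomic type.
#define ASSERT_THREADSAFE_REFCNT(refcnt_)                                  \
  MOZ_STATIC_ASSERT((mozilla::IsSame<decltype(refcnt_),                    \
                                     mozilla::Atomic<nsrefcnt> >::value),  \
                    "thread-safe refcounting requires Atomic<nsrefcnt>")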

I'm hoping I can look at the patches here, now, and finish them off today.  Or at least get them out of my queue.  :-)  But given the subject matter, that could be optimistic.  We'll see.

(In reply to Joshua Cranmer [:jcranmer] from comment #36)
> We don't have threads in MFBT, so we can't test this from MFBT. I actually
> even started devising some testcases to make sure that the memory ordering
> requirements are satisfied.

Bug 773491 had a bunch of work on this, but it got held up, as I recall, by WinXP not having a mutex primitive you could implement without including <windows.h> and all its goriness.  (I was thinking maybe it could be hacked around with |char data[sizeof(CRITICAL_SECTION)];| in the header and out-of-lining everything to a .cpp file on Windows, but I never looked closely enough to know if that was a truly reasonable hackaround.  Note that the NSPR methods are already out-of-line, so putting mutex stuff OOL wouldn't be a perf regression, or something.)
Comment on attachment 742326 [details] [diff] [review]
add mfbt/Atomics.h

Review of attachment 742326 [details] [diff] [review]:
-----------------------------------------------------------------

Generally this looks pretty nice.  Too bad we only have 32-bit and pointer-sized type support here, but oh well.  I guess NSPR and whatever we use now doesn't give us any more, so...

I was not at all particularly knowledgeable about a whole lot of this, before I began reading this patch.  I may have made mistakes, or missed things.  But hopefully not too many things.  :-)  More eyes here would definitely be good.

It's possible I'm misreading this, but there seems to be a good deal of inconsistency about whether the old value, or the new value, is returned by the various mutator methods.  It's possible it all happens to just work, and that inconsistencies offset each other just so, but I'm a little skeptical.  That's the biggest issue here.  There seems to be confusion about __sync_*_and_fetch versus __sync_fetch_and_* too.

Documentation and comments need beefing up a bit -- I suggested more verbose text, poke holes in it if you can.  :-)

Various style nits.

Let's avoid #including Windows headers -- I think it's likely possible here.

::: mfbt/Atomics.h
@@ +41,5 @@
> +
> +namespace mozilla {
> +
> +/**
> + * An enum of memory ordering possibilities for atomics.

Notwithstanding the bit about "complete discussion" being undesirable, we should clarify what "memory ordering" means.  Perhaps something like:

Memory ordering is the observable state of distinct values in memory.  (It's a separate concept from atomicity, which concerns whether an operation can ever be observed in an intermediate state.  Don't conflate the two!)  Given a sequence of operations in code on memory, it is *not* always the case that, at all times and on all cores, those operations will appear to have occurred in that exact sequence.  First, the compiler might reorder that sequence, if it thinks another ordering will be more efficient.  Second, the CPU may not expose so consistent a view of memory.  CPUs will often perform their own instruction reordering, above and beyond that performed by the compiler.  And each core has its own memory caches, and accesses (reads and writes both) to "memory" may only resolve to out-of-date cache entries -- not to the "most recently" performed operation in some global sense.  Any access to a value that may be used by multiple threads, potentially across multiple cores, must therefore have a memory ordering imposed on it, for all code on all threads/cores to have a sufficiently coherent worldview.

http://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync and http://en.cppreference.com/w/cpp/atomic/memory_order go into more detail on all this, including examples of how each mode works.

Note that for simplicity, not all of the modes in C++11 are supported.  These three modes are confusing enough as it is!  If your use case really really really really really needs the extra speed the additional modes provide, *and* you are tall enough to ride the atomics memory-ordering roller coaster (if you're not sure, you aren't), file a bug and we'll consider accommodating you.

@@ +45,5 @@
> + * An enum of memory ordering possibilities for atomics.
> + *
> + * Note that these ordering models have extremely different semantics. A
> + * complete discussion of how memory ordering works is beyond the scope of any
> + * documentation here, but here's a brief summary:

Let's put each of these descriptions in a comment next to exactly that initializer.

Not surprisingly, I agree with Julian that more information here would be good.  :-)  Also, this reads pretty close to the hardware level, and not so close to how people want to use atomic-y things.  The response to Julian's comment is better on this front.  I'll propose updated versions of all the descriptions below; please critique them as much as you want!

@@ +48,5 @@
> + * complete discussion of how memory ordering works is beyond the scope of any
> + * documentation here, but here's a brief summary:
> + *
> + * Relaxed - All modifications of this atomic have no impact on visibilty of
> + *           other memory. This is great for thread-local atomic counters.

Relaxed:
Relaxed ordering is the simplest memory ordering: none at all.  When the result of a write is observed, nothing may be inferred about other memory.  Writes ostensibly performed "before" on the writing thread may not yet be visible.  Writes performed "after" on the writing thread may already be visible, if the compiler or CPU reordered them.  (The latter can happen if reads and/or writes get held up in per-processor caches.)  Relaxed ordering basically means operations can always use cached values (as long as the actual updates to atomic values actually occur, correctly, eventually), so it's usually the fastest sort of atomic access.  For this reason, *it's also the most dangerous kind of access*.

Relaxed ordering is good for things like process-wide statistics counters that don't need to be consistent with anything else, so long as updates themselves are atomic.  (And so long as any observations of that value can tolerate being out-of-date -- if you need some sort of up-to-date value, you need some sort of other synchronizing operation.)  It's *not* good for locks, mutexes, reference counts, etc. that mediate access to other memory, or must be observably consistent with other memory.

x86 architectures don't take advantage of the optimization opportunities that relaxed ordering permits.  Thus it's possible that using relaxed ordering will "work" on x86 but fail elsewhere (ARM, say, which *does* implement non-sequentially-consistent relaxed ordering semantics).  Be extra-careful using relaxed ordering if you can't easily test non-x86 architectures!

@@ +51,5 @@
> + * Relaxed - All modifications of this atomic have no impact on visibilty of
> + *           other memory. This is great for thread-local atomic counters.
> + * ReleaseAcquire - If thread A writes memory to slot C and then does a store to
> + *                  this atomic, then a thread B that reads the value that A
> + *                  wrote to this atomic can read the value that A wrote to C.

ReleaseAcquire:
When an atomic value is updated with ReleaseAcquire ordering, and that new value is observed with ReleaseAcquire ordering, prior writes (atomic or not) are also observable.  What ReleaseAcquire *doesn't* give you is any observable ordering guarantees for ReleaseAcquire-ordered operations on different objects.  For example, if there are two cores that each perform ReleaseAcquire operations on separate objects, each core may or may not observe the operations made by the other core.  The only way the cores can be synchronized with ReleaseAcquire is if they both ReleaseAcquire-access the same object.  This implies that you can't necessarily describe some global total ordering of ReleaseAcquire operations.

ReleaseAcquire ordering is good for (as the name implies) atomic operations on values controlling ownership of things: reference counts, mutexes, and the like.

@@ +54,5 @@
> + *                  this atomic, then a thread B that reads the value that A
> + *                  wrote to this atomic can read the value that A wrote to C.
> + * SequentiallyConsistent - This is similar to ReleaseAcquire, but all threads
> + *                          agree on the same order for all reads/writes of all
> + *                          sequentially consistent atomic variable.

SequentiallyConsistent:
When an atomic value is updated with SequentiallyConsistent ordering, all writes are observable when the update is observed, just as with ReleaseAcquire ordering.  But, furthermore, a global total ordering of SequentiallyConsistent operations *can* be described.  For example, if two cores perform SequentiallyConsistent operations on separate objects, one core will observably perform its update (and all previous operations will have completed), then the other core will observably perform its update (and all previous operations will have completed).  (Although those previous operations aren't themselves ordered -- they could be intermixed, or ordered if they occur on atomic values with ordering requirements.)  SequentiallyConsistent is the *simplest and safest* ordering of atomic operations -- it's always as if one operation happens, then another, then another, in some order -- and every core observes updates to happen in that single order.  Because it has the most synchronization requirements, operations ordered this way also tend to be slowest.

SequentiallyConsistent ordering can be desirable when multiple threads observe objects, and they all have to agree on the observable order of changes to them.  It's also probably the hardest to describe its use cases.  If you don't know what order to use, use this one.
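One concrete case where the distinction matters (a Dekker-style sketch; the names are illustrative, and both flags use the default, sequentially consistent ordering):

#include "mozilla/Atomics.h"
#include <stdint.h>

static mozilla::Atomic<uint32_t> sFlagA;   // SequentiallyConsistent by default
static mozilla::Atomic<uint32_t> sFlagB;

void threadA()
{
  sFlagA = 1;
  if (sFlagB == 0) {
    // With sequential consistency, at most one thread gets here.
  }
}

void threadB()
{
  sFlagB = 1;
  if (sFlagA == 0) {
    // With only ReleaseAcquire ordering, both threads could get here, because
    // neither store is required to be visible before the other thread's load.
  }
}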

@@ +71,5 @@
> +namespace mozilla {
> +namespace detail {
> +
> +template <MemoryOrdering Order>
> +struct AtomicOrderConstraints {

No space after "template" (change this pervasively), and put the {} on one line.  New line or same line are both okay.

@@ +75,5 @@
> +struct AtomicOrderConstraints {
> +};
> +
> +template <>
> +struct AtomicOrderConstraints<Relaxed> {

{ should go on a new line, pervasively, for struct/class definitions.

@@ +78,5 @@
> +template <>
> +struct AtomicOrderConstraints<Relaxed> {
> +    static const std::memory_order AtomicRMWOrder = std::memory_order_relaxed;
> +    static const std::memory_order GetOrder = std::memory_order_relaxed;
> +    static const std::memory_order SetOrder = std::memory_order_relaxed;

We should probably use LoadOrder and StoreOrder for consistency with the spec terminology, and symmetry of calls to the atomic methods.

@@ +85,5 @@
> +template <>
> +struct AtomicOrderConstraints<ReleaseAcquire> {
> +    static const std::memory_order AtomicRMWOrder = std::memory_order_acq_rel;
> +    static const std::memory_order GetOrder = std::memory_order_acquire;
> +    static const std::memory_order SetOrder = std::memory_order_release;

Given the definition of memory_order_acq_rel -- and that it's acquire for loads and release for stores -- I think you could use acq_rel for all of these.  I'm a bit indifferent as to whether this is clearer and more correct-looking, or that is.  :-)  That, and we're all probably too stupid to really know which is actually better.

@@ +96,5 @@
> +    static const std::memory_order SetOrder = std::memory_order_seq_cst;
> +};
> +
> +template <typename T, MemoryOrdering Order>
> +struct AtomicIntrinsics {

Let's do this (modulo typos), to share the common code between these two:

template<typename T, MemoryOrdering Order, typename M>
struct AtomicIntrinsicsBase
{
    ...all the common stuff: add/sub/inc/dec/cas/load/store, with M as the |val/desired/expected| type...
};

template<typename T, MemoryOrdering Order>
struct AtomicIntrinsics : AtomicIntrinsicsBase<T, Order, T>
{
    ...add in or_/xor_/and_
};

template<typename PointerType, MemoryOrdering Order>
struct AtomicIntrinsics<PointerType*, Order> : AtomicIntrinsicsBase<PointerType*, Order, ptrdiff_t>
{};

It looks like similar sharing should be done in the other places where we have PointerType* specializations.

@@ +97,5 @@
> +};
> +
> +template <typename T, MemoryOrdering Order>
> +struct AtomicIntrinsics {
> +    typedef std::atomic<T> ValueType;

I might name this typedef (everywhere in the implemented-by-<atomic> version) AtomicT, to make clearer that it's the std::atomic type.

@@ +116,5 @@
> +      return ptr.fetch_and(val, OrderedOp::AtomicRMWOrder) & val;
> +    }
> +    static T inc(ValueType& ptr) { return add(ptr, 1); }
> +    static T dec(ValueType& ptr) { return sub(ptr, 1); }
> +    static T cas(ValueType& ptr, T oldval, T newval) {

Seeing as everyone wants a bool return indicating whether the value was the expected one, maybe we should just make all these intrinsic cas() methods return that?  I think that simplifies almost all of them very slightly.
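For the <atomic>-backed version, that could be as simple as (names as in the patch under review):

    static bool cas(ValueType& ptr, T& expected, T desired) {
      // compare_exchange_strong already returns the bool we want and writes
      // the observed value back into |expected| on failure.
      return ptr.compare_exchange_strong(expected, desired, OrderedOp::AtomicRMWOrder);
    }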

@@ +144,5 @@
> +    static T inc(ValueType& ptr) { return add(ptr, 1); }
> +    static T dec(ValueType& ptr) { return sub(ptr, 1); }
> +    static T cas(ValueType& ptr, T oldval, T newval) {
> +      ptr.compare_exchange_strong(oldval, newval, OrderedOp::AtomicRMWOrder);
> +      return oldval;

Hmm, only strong CAS?  I guess maybe weak CAS is something we should add in another type specifically designed for use in loops, and advertised as such, if it's to be added.

Rather than oldval/newval, the expected/desired naming from the spec seems like a really good idea here for readability.

@@ +157,5 @@
> +};
> +
> +} // namespace detail
> +} // namespace mozilla
> +#elif defined(__GNUC__)

A little whitespace around this #elif, please!

@@ +175,5 @@
> +struct Barrier<Relaxed> {
> +  static void beforeLoad() {}
> +  static void afterLoad() {}
> +  static void beforeStore() {}
> +  static void afterStore() {}

Contents should be indented two spaces further, here and a couple other places.

@@ +195,5 @@
> +  static void afterStore() { __sync_synchronize(); }
> +};
> +
> +template <typename T, MemoryOrdering Order>
> +struct AtomicIntrinsics {

I think you can do the same AtomicIntrinsicsShared thing mentioned above, here, too.  Maybe.  The __sync_* sizeof multiplication may mean no; if so, okay.

@@ +198,5 @@
> +template <typename T, MemoryOrdering Order>
> +struct AtomicIntrinsics {
> +    typedef T ValueType;
> +    static ValueType add(ValueType& ptr, ValueType val) {
> +      return __sync_fetch_and_add(&ptr, val);

A comment before all these __sync_fetch_and_* methods indicating that they all include a full memory barrier seems like a good thing, referring to http://gcc.gnu.org/onlinedocs/gcc-4.6.2/gcc/Atomic-Builtins.html (or some version) as authoritative.

According to those docs, aren't the old values returned here?  So don't you need + val, - val, & val, etc. like in the std::atomic<T> implementation?  Or actually, didn't you mean to use __sync_*_and_fetch?
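In other words, something like this, so the returned value matches the <atomic> branch:

    static ValueType add(ValueType& ptr, ValueType val) {
      // __sync_add_and_fetch returns the new value; __sync_fetch_and_add
      // returns the old one, so it would need "+ val" tacked on instead.
      return __sync_add_and_fetch(&ptr, val);
    }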

@@ +215,5 @@
> +    }
> +    static ValueType inc(ValueType& ptr) { return add(ptr, 1); }
> +    static ValueType dec(ValueType& ptr) { return sub(ptr, 1); }
> +    static ValueType cas(ValueType& ptr, ValueType oldval, ValueType newval) {
> +      return __sync_val_compare_and_swap(&ptr, oldval, newval);

Hmm.  I didn't notice this earlier, but.  Isn't it slightly more normal for compareAndSwap to return true/false to indicate whether the value was changed?

@@ +257,5 @@
> +    };
> +#  endif
> +    static ValueType add(ValueType& ptr, ptrdiff_t val) {
> +      typename SyncValue::Type amount = SyncValue::convert(val * sizeof(T));
> +      return __sync_fetch_and_add(&ptr, amount);

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52291 says that __sync_fetch_and_add are deprecated.  I assume they're not available in old-enough gcc for us?

Doesn't this need to be add-and-fetch (same comment as above)?  Ditto for the rest of the contents of this struct.

@@ +266,5 @@
> +    }
> +    static ValueType inc(ValueType& ptr) { return add(ptr, 1); }
> +    static ValueType dec(ValueType& ptr) { return sub(ptr, 1); }
> +    static ValueType cas(ValueType& ptr, ValueType oldval, ValueType newval) {
> +      return __sync_val_compare_and_swap(&ptr, oldval, newval);

If this returned bool, __sync_bool_compare_and_swap.

@@ +297,5 @@
> + */
> +
> +#  include <windef.h>
> +#  include <winbase.h>
> +#  include <intrin.h>

Judging by pratom.h's code, we can *probably* get away with not including these headers, and with manually declaring these functions.

http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/include/pratom.h#84

Let's do that here, too, if at all possible.  Don't forget extern "C".  Including Windows header files gunks up the global namespace incredibly, so we've been very hesitant to ever do it.
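For reference, the pratom.h-style approach looks roughly like this (prototypes per the MSDN/intrin.h declarations; only a couple shown here):

extern "C" {
  long __cdecl _InterlockedExchangeAdd(long volatile* addend, long value);
  long __cdecl _InterlockedCompareExchange(long volatile* dst, long newValue,
                                           long comparand);
}
#  pragma intrinsic(_InterlockedExchangeAdd)
#  pragma intrinsic(_InterlockedCompareExchange)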

@@ +336,5 @@
> + * These two functions increment or decrement, respectively, the value
> + * stored in |*ptr| by 1 and return the value previously stored in
> + * |*ptr|.  Note that the return value requirement is different from the
> + * return value provided by the obvious _Interlocked{Increment,Decrement}
> + * functions.

Consistency with the above code would have this return the new value -- which probably means it matches up with _InterlockedFoo, right?

@@ +345,5 @@
> + * This function performs a compare-and-swap operation on |*ptr|, expecting
> + * to find |oldval| and replacing it with |newval|.  Returns the value
> + * previously stored in |*ptr|.
> + *
> + * static Type cas(Type* ptr, Type oldval, Type newval);

This looks like returning bool for whether |*ptr == oldval| (please rename that to expected, too) would be nicer too.

@@ +349,5 @@
> + * static Type cas(Type* ptr, Type oldval, Type newval);
> + *
> + * This function atomically stores |val| into |*ptr| and must provide a
> + * full memory fence after the store to prevent compiler and hardware
> + * instruction reordering.

Before the store, right?  We want all memory effects before the store to be visible, if the result of the store is visible.  That requires a barrier before storing.

Oh, it *must* be before the store -- your barrier traits later on don't even have an after-store method.  :-)

@@ +379,5 @@
> +    static Type and_(Type* ptr, Type val) {
> +      return _InterlockedAnd(ptr, val);
> +    }
> +    static Type inc(Type* ptr) {
> +      return _InterlockedIncrement(ptr) - 1;

Consistency with the other incs wants this to not have -1, right?  Same for dec, mutatis mutandis.

@@ +385,5 @@
> +    static Type dec(Type* ptr) {
> +      return _InterlockedDecrement(ptr) + 1;
> +    }
> +    static Type cas(Type* ptr, Type oldval, Type newval) {
> +      return _InterlockedCompareExchange(ptr, newval, oldval);

I guess this is the only case where cas() returning bool doesn't match the platform.  Oh well, Windows can be slightly weird like normal.  :-)

@@ +426,5 @@
> +    static Type and_(Type* ptr, Type val) {
> +      return _InterlockedAnd64(ptr, val);
> +    }
> +    static Type inc(Type* ptr) {
> +      return _InterlockedIncrement64(ptr) - 1;

Same no-need for -1/+1 here.

@@ +432,5 @@
> +    static Type dec(Type* ptr) {
> +      return _InterlockedDecrement64(ptr) + 1;
> +    }
> +    static Type cas(Type* ptr, Type oldval, Type newval) {
> +      return _InterlockedCompareExchange64(ptr, newval, oldval);

Same bool inconsistency weirding here.

@@ +478,5 @@
> +    typedef T ValueType;
> +    typedef PrimitiveIntrinsics<sizeof(T)> Primitives;
> +    typedef typename Primitives::Type PrimType;
> +    MOZ_STATIC_ASSERT(sizeof(PrimType) == sizeof(T),
> +      "Selection of PrimitiveIntrinsics was wrong");

Align "..." with the first sizeof.

@@ +500,5 @@
> +      return applyBinaryFunction(&Primitives::add, ptr, val);
> +    }      
> +    static ValueType sub(ValueType& ptr, ValueType val) {
> +      return applyBinaryFunction(&Primitives::sub, ptr, val);
> +    }      

Kill the trailing whitespace.

@@ +552,5 @@
> +      ValueType oldval, newval;
> +      do {
> +        oldval = ptr;
> +        newval = oldval + val;
> +      } while (cas(ptr, oldval, newval) != oldval);

Hm, we really can't just cast the pointer to uintptr_t, do the atomic ops on that using all the _Interlocked stuff, then cast back?  That seems preferable to me, rather than having to CAS our way through these bits.  Or is there a reason we can't CAS?

@@ +567,5 @@
> +    static ValueType inc(ValueType& ptr) { return add(ptr, 1); }
> +    static ValueType dec(ValueType& ptr) { return sub(ptr, 1); }
> +    static ValueType cas(ValueType& ptr, ValueType oldval, ValueType newval) {
> +      PVOID result = 
> +        InterlockedCompareExchangePointer((PVOID*)&ptr, newval, oldval);

Just use void* here rather than PVOID, to avoid the Windows dependency.

@@ +619,5 @@
> +template <typename T, MemoryOrdering Order = SequentiallyConsistent>
> +class Atomic {
> +    // We only support 32-bit types on 32-bit Windows.  And so it goes elsewhere.
> +#if !defined(HAVE_64BIT_OS)
> +    MOZ_STATIC_ASSERT(sizeof(T) == 4, "mozilla/Atomics.h only supports 32-bit types");

Hmm, let's avoid the #if and just do

    MOZ_STATIC_ASSERT(sizeof(T) == 4 || (sizeof(uintptr_t) == 8 && sizeof(T) == 8),
                      "mozilla/Atomics.h only supports 32-bit and pointer-sized types");

@@ +636,5 @@
> +
> +  private:
> +    template <MemoryOrdering AnyOrder>
> +    Atomic(const Atomic<T, AnyOrder>& aCopy) MOZ_DELETE;
> +    Atomic<T, Order>& operator=(Atomic<T, Order>& aOther) MOZ_DELETE;

void operator=(const Atomic<T, Order>& aOther) MOZ_DELETE will trigger errors slightly earlier, if anyone tries this.

@@ +646,5 @@
> +    T operator |=(T delta) { return Intrinsics::or_(mValue, delta); }
> +    T operator &=(T delta) { return Intrinsics::and_(mValue, delta); }
> +
> +    T operator++(int) { return Intrinsics::inc(mValue); }
> +    T operator--(int) { return Intrinsics::dec(mValue); }

I believe these want to have -1/+1 added to them, because Intrinsics::inc/dec return the new value.  And the nullary ops below want the +1/-1 removed.
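In other words (assuming the intrinsics return the new value, per the earlier comments):

    T operator++(int) { return Intrinsics::inc(mValue) - 1; }  // postfix: old value
    T operator--(int) { return Intrinsics::dec(mValue) + 1; }
    T operator++() { return Intrinsics::inc(mValue); }         // prefix: new value
    T operator--() { return Intrinsics::dec(mValue); }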

@@ +668,5 @@
> +     *   aExpected = mValue;
> +     *   return false;
> +     * }
> +     */
> +    bool compareAndSwap(T& aExpected, T aValue) {

expected/desired for the naming here, please.

@@ +683,5 @@
> + * provided, as are atomic increment and decement operators. Also provided are
> + * the compound assignment operators for addition and subtraction.
> + *
> + * Atomic accesses are sequentially consistent by default.  You should use the
> + * default unless you have a compelling reason to do otherwise.

Let's make that "unless you know what you're doing and you have a compelling...".  :-)

@@ +688,5 @@
> + */
> +template <typename T, MemoryOrdering Order>
> +class Atomic<T*, Order> {
> +    MOZ_STATIC_ASSERT(IsIntegral<T>::value,
> +                      "can only have pointer-to-integral atomic variables");

Why?  This should all work just fine on pointers to arbitrary types; why would we limit people about this?

@@ +699,5 @@
> +
> +  private:
> +    template <MemoryOrdering AnyOrder>
> +    Atomic(const Atomic<T*, AnyOrder>& aCopy) MOZ_DELETE;
> +    Atomic<T*, Order>& operator=(Atomic<T*, Order>& aOther) MOZ_DELETE;

void operator= again

@@ +712,5 @@
> +
> +    T* operator++(int) { return Intrinsics::inc(mValue); }
> +    T* operator--(int) { return Intrinsics::dec(mValue); }
> +    T* operator++() { return Intrinsics::inc(mValue) + 1; }
> +    T* operator--() { return Intrinsics::dec(mValue) - 1; }

Same issue about moving around the +1 and -1.
Attachment #742326 - Flags: review?(jwalden+bmo) → review-
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #38)
> SequentiallyConsistent ordering can be desirable when multiple threads
> observe objects, and they all have to agree on the observable order of
> changes to them.  It's also probably the hardest to describe its use cases. 
> If you don't know what order to use, use this one.

I think most lock-free concurrent data structures require sequentially-consistent atomic accesses. At the very least, reasoning if such algorithms work in a release-acquire mode is extremely difficult, and I (as a reviewer) probably wouldn't accept such an implementation.

> @@ +144,5 @@
> > +    static T inc(ValueType& ptr) { return add(ptr, 1); }
> > +    static T dec(ValueType& ptr) { return sub(ptr, 1); }
> > +    static T cas(ValueType& ptr, T oldval, T newval) {
> > +      ptr.compare_exchange_strong(oldval, newval, OrderedOp::AtomicRMWOrder);
> > +      return oldval;
> 
> Hmm, only strong CAS?  I guess maybe weak CAS is something we should add in
> another type specifically designed for use in loops, and advertised as such,
> if it's to be added.

I based the subset to originally implement for all of this code based off of what Clang supports. This excludes the release/consume model (which really breaks your mind) and weak compare-and-swap. The purpose of weak CAS is to avoid extra loops when doing it in a loop; the LLVM developers are of the opinion that the codegen optimizer is or could be powerful enough to detect weak CAS in the most common cases, so they did not expose it in the IR. From a footgun perspective, erring on the side of strong CAS is always safe, but using weak CAS incorrectly won't be detectable on x86 machines.

> @@ +257,5 @@
> > +    };
> > +#  endif
> > +    static ValueType add(ValueType& ptr, ptrdiff_t val) {
> > +      typename SyncValue::Type amount = SyncValue::convert(val * sizeof(T));
> > +      return __sync_fetch_and_add(&ptr, amount);
> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52291 says that
> __sync_fetch_and_add are deprecated.  I assume they're not available in
> old-enough gcc for us?

Any gcc that has __atomic ought to have <atomic> as well. I think gcc 4.6 is when the C11/C++11 atomics were added.
(In reply to Joshua Cranmer [:jcranmer] from comment #39)
> I think most lock-free concurrent data structures require
> sequentially-consistent atomic accesses. At the very least, reasoning if
> such algorithms work in a release-acquire mode is extremely difficult, and I
> (as a reviewer) probably wouldn't accept such an implementation.

I don't doubt this.  :-)  That language was because I couldn't immediately and obviously think of a simply-described and reasonably intuitive use case.  The cppreference page talked about a multiple-consumer-single-producer situation, but that language is so vague it seems almost useless as an actual example.

> I based the subset to originally implement for all of this code based off of
> what Clang supports. This excludes the release/consume model (which really
> breaks your mind) and weak compare-and-swap. The purpose of weak CAS is to
> avoid extra loops when doing it in a loop; the LLVM developers are of the
> opinion that the codegen optimizer is or could be powerful enough to detect
> weak CAS in the most common cases, so they did not expose it in the IR. From
> a footgun perspective, erring on the side of strong CAS is always safe, but
> using weak CAS incorrectly won't be detectable on x86 machines.

Yeah.  Note I said "if it's to be added".  :-)  Although, I'm slightly surprised the CAS primitives would be amenable to heuristically-determined optimization.  Not that I've thought about it, but I'd think I'd want a well-defined primitive, rather than one with varying performance in different cases.

> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52291 says that
> > __sync_fetch_and_add are deprecated.  I assume they're not available in
> > old-enough gcc for us?
> 
> Any gcc that has __atomic ought to have <atomic> as well. I think gcc 4.6 is
> when the C11/C++11 atomics were added.

Okay.  So basically, since those would have <atomic> and we'd probably be using it in that case (modulo no -std=c++0x), we should keep using the sync builtins.
Or have multiple CAS primitives, some well-defined, some the optimizer is allowed to make strong or weak as it sees fit.
Comment on attachment 742326 [details] [diff] [review]
add mfbt/Atomics.h

Review of attachment 742326 [details] [diff] [review]:
-----------------------------------------------------------------

::: mfbt/Atomics.h
@@ +668,5 @@
> +     *   aExpected = mValue;
> +     *   return false;
> +     * }
> +     */
> +    bool compareAndSwap(T& aExpected, T aValue) {

It's not especially obvious that aExpected will be modified by this method, when it's a reference -- make it a pointer so that the possibility of modification is more clearly visible at call sites.
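
For illustration, here is how the two alternatives would read at a hypothetical call site (a sketch only; |counter| and both signatures are stand-ins for contrast):

  int32_t expected = 5;

  // With bool compareAndSwap(T& aExpected, T aValue):
  counter.compareAndSwap(expected, 6);   // nothing hints |expected| may be rewritten

  // With bool compareAndSwap(T* aExpected, T aValue):
  counter.compareAndSwap(&expected, 6);  // the & makes the possible write-back visible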
Comment on attachment 742327 [details] [diff] [review]
add tests for mfbt/Atomics.h

Review of attachment 742327 [details] [diff] [review]:
-----------------------------------------------------------------

::: mfbt/tests/TestAtomics.cpp
@@ +2,5 @@
> +#include "mozilla/Atomics.h"
> +#include "mozilla/DebugOnly.h"
> +#include "mozilla/StandardInteger.h"
> +
> +using namespace mozilla;

We've been listing uses manually, not opening the whole namespace, as a rule.

using mozilla::Atomic;
using mozilla::DebugOnly;
using mozilla::MemoryOrdering;

@@ +22,5 @@
> +  MOZ_ASSERT(atomic-- == 6, "Atomic post-decrement did not work");
> +  MOZ_ASSERT(atomic == 5, "Atomic post-decrement did not work");
> +
> +  // Test other arithmetic.
> +  atomic += T(5);

DebugOnly<T> result;

result = (atomic += T(5));
MOZ_ASSERT(atomic == T(10), "Atomic += did not work");
MOZ_ASSERT(result == T(10), "Atomic += returned the wrong value");

and the same (reusing result) for -=, ^=, |=, &=.

@@ +38,5 @@
> +  MOZ_ASSERT(value == T(5), "Atomic compare-and-swap swapped the wrong value");
> +
> +  // Test assignment
> +  MOZ_ASSERT((atomic = T(5)) == T(5) && atomic == T(5),
> +             "Atomic assignment failed");

This could use result too.

@@ +55,5 @@
> +}
> +
> +template <MemoryOrdering Order>
> +static void
> +TestPointer()

You should be able to common up everything but the ^=, |=, &= tests in a single method, then have that method be called for both the T and T* cases.
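
Something along these lines, perhaps (a rough sketch only; the helper name and deltas are illustrative):

// Shared core exercising the operations common to the integral and pointer
// specializations; callers pass a delta of the appropriate type.
template<typename T, typename DeltaType, MemoryOrdering Order>
static void
TestCommon(T initial, DeltaType delta)
{
  Atomic<T, Order> atomic(initial);
  DebugOnly<T> result;

  result = (atomic += delta);
  MOZ_ASSERT(atomic == initial + delta, "Atomic += did not work");
  MOZ_ASSERT(result == initial + delta, "Atomic += returned the wrong value");

  result = (atomic -= delta);
  MOZ_ASSERT(atomic == initial, "Atomic -= did not work");
  MOZ_ASSERT(result == initial, "Atomic -= returned the wrong value");
}

// e.g. TestCommon<uint32_t, uint32_t, Order>(5, 3);
//      TestCommon<int*, ptrdiff_t, Order>(array1, 2);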

@@ +76,5 @@
> +  atomic += 2;
> +  MOZ_ASSERT(atomic == array1 + 2, "Atomic += did not work");
> +  atomic -= 1;
> +  MOZ_ASSERT(atomic == array1 + 1, "Atomic -= did not work");
> +  atomic -= 1;

MOZ_ASSERT(atomic == array1, "Atomic -= did not work");

just for reader clarity in evaluating the subsequent code.

@@ +92,5 @@
> +  MOZ_ASSERT(value == array1, "Atomic compare-and-swap swapped the wrong value");
> +
> +  // Test stores
> +  atomic = array1;
> +  MOZ_ASSERT(atomic == array1, "Atomic assignment does not work");

This should use result = (atomic = array1) as well.
Attachment #742327 - Flags: review?(jwalden+bmo)
Thanks for the review.

(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #38)
> It's possible I'm misreading this, but there seems to be a good deal of
> inconsistency about whether the old value, or the new value, is returned by
> the various mutator methods.  It's possible it all happens to just work, and
> that inconsistencies offset each other just so, but I'm a little skeptical. 
> That's the biggest issue here.  There seems to be confusion about
> __sync_*_and_fetch versus __sync_fetch_and_* too.

This is because I *thought* I had tested this on an <atomic>-supporting machine, but I had not (I tested on my GCC 4.6 system, but the compiler #ifdeffery was bogus for GCC, so we wound up using the __sync* versions always).  So the non-<atomic> versions were consistent and passed the tests, but the <atomic> versions did not!  I have fixed up the <atomic> versions.

> @@ +349,5 @@
> > + * static Type cas(Type* ptr, Type oldval, Type newval);
> > + *
> > + * This function atomically stores |val| into |*ptr| and must provide a
> > + * full memory fence after the store to prevent compiler and hardware
> > + * instruction reordering.
> 
> Before the store, right?  We want all memory effects before the store to be
> visible, if the result of the store is visible.  That requires a barrier
> before storing.
> 
> Oh, it *must* be before the store -- your barrier traits later on don't even
> have an after-store method.  :-)

The barrier traits for Windows don't have an after store method because we don't need one: relaxed and acq/rel don't need a barrier after the store, and the Primitives::store (via _InterlockedExchange{,64}) will generate all the barriers and fences that we need.  (So says the documentation: http://msdn.microsoft.com/en-us/library/windows/desktop/ttk2z1ws%28v=vs.85%29.aspx )  The x86 memory model is strong enough to DTRT in this case.  I suppose it would be good to document that the call here must act as a barrier to compiler reordering, as _InterlockedExchange does.
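
In other words, the MSVC store path ends up shaped roughly like this (a hedged sketch only; the names approximate the patch):

  static void store(T& ptr, T val) {
    Barrier<Order>::beforeStore();  // compiler/CPU barrier where the ordering needs one
    Primitives::store(&ptr, val);   // _InterlockedExchange{,64} is itself a full fence
    // No afterStore() needed: the intrinsic already provides the trailing barrier.
  }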

> @@ +688,5 @@
> > + */
> > +template <typename T, MemoryOrdering Order>
> > +class Atomic<T*, Order> {
> > +    MOZ_STATIC_ASSERT(IsIntegral<T>::value,
> > +                      "can only have pointer-to-integral atomic variables");
> 
> Why?  This should all work just fine on pointers to arbitrary types; why
> would we limit people about this?

I think I believed non-integer types wouldn't work, and I'm not quite sure why.
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #38)
> @@ +215,5 @@
> > +    }
> > +    static ValueType inc(ValueType& ptr) { return add(ptr, 1); }
> > +    static ValueType dec(ValueType& ptr) { return sub(ptr, 1); }
> > +    static ValueType cas(ValueType& ptr, ValueType oldval, ValueType newval) {
> > +      return __sync_val_compare_and_swap(&ptr, oldval, newval);
> 
> Hmm.  I didn't notice this earlier, but.  Isn't it slightly more normal for
> compareAndSwap to return true/false to indicate whether the value was
> changed?

It is.  But we're defining these the way we do to parallel what the actual header does:

http://en.cppreference.com/w/cpp/atomic/atomic/compare_exchange

Since we're defining these with the intent of replacing pratom.h and since that header doesn't include atomic compare-and-exchange, I'm inclined to keep things closer to the C++ header to make for an easier transition.  Assuming, of course, we are just having this header stand in for <atomic> until we get universal compiler support, like with TypeTraits.h etc.  If we are having this header as a wrapper around <atomic> to warn you away from the crazier ordering modes, having to specify ordering with every operation, and the like, then we can do whatever we think is sane.

Which way would you like to see this header go?

(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #42)
> > +    bool compareAndSwap(T& aExpected, T aValue) {
> 
> It's not especially obvious that aExpected will be modified by this method,
> when it's a reference -- make it a pointer so that the possibility of
> modification is more clearly visible at call sites.

Here, though, I think making it a pointer is good regardless of whether we keep Atomics.h forever or we eventually use <atomic>: it's better in the former case and an easy fixup in the latter.  Changing the semantics of compareAndExchange, above, would not be such an easy change.
Poking Waldo about comment 45.
Flags: needinfo?(jwalden+bmo)
Assignee: Pidgeot18 → nfroyd
(In reply to Nathan Froyd (:froydnj) from comment #45)
> Assuming, of course, we are just having this header stand in for <atomic>
> until we get universal compiler support, like with TypeTraits.h etc.  If we
> are having this header as a wrapper around <atomic> to warn you away from
> the crazier ordering modes, having to specify ordering with every operation,
> and the like, then we can do whatever we think is sane.
> 
> Which way would you like to see this header go?

Hum.  Yeah, I dunno.  In cases with clear semantics, where someone else has obviously solved the what-memory-ordering-should-be-used problem (like with reference counting), it seems obvious to me that we *should* be doing std::atomic-style low-level ordering on all operations.  In cases with not-so-obvious semantics, something simpler is better.

The more I read about all this, the less sure I am that even all-relaxed, all-relacq, all-seqcst are necessarily fully correct modes to expose for use.  The first is somewhat obviously fine for very, very narrow classes of things.  But even those are pretty tricky to diagnose.  The third is obviously fine, because it doesn't add any complexity above the usual race concern.  The second is complicated enough that I kind of wonder whether anyone will actually understand it enough to use it correctly.  I kind of think we might do best, if we want to expose a simplified version, to only have a solely sequentially-consistent one.  Then, for expert cases, have another that emulates <atomic>'s interface.

But this is obviously also a bunch of scope-creep.  :-\  So, yeah.  Maybe jcranmer can say something more about the friendliness of the three-orders interface here, because I have no real certainty about anything of this.

And, perhaps the solution to warning people away from the memory-ordering pitfalls and such is to require very clear encapsulation of atomic field modifications, with review by someone who knows what they're doing.  Super-encapsulation might be sort of the wrong hammer here.  Bluh.
Flags: needinfo?(jwalden+bmo)
Attached patch add mfbt/Atomics.h (obsolete) — Splinter Review
OK, new patch.  Major changes:

- MOAR COMMENTS;
- Refactored intrinsics to share code when possible;
- Removed compare-and-exchange;
- Indentation all tidied.

I took out compare-and-exchange for a couple of reasons:

1. I don't think we have a good feel for what the interface ought to look like;
2. We don't have any code in the tree that would use it now or will be likely
   to use it in the future;
3. It makes our life when compiling for Windows simpler (the pointer version of
   compare-and-exchange is *documented* to be a macro for 32-bit; we'd have to
   copy that code in, which in addition to being ugly is possibly of questionable
   legal status, and is error-prone, fragile, yada yada).

I realize that you could apply point (2) to the logical ops.  I don't
have strong feelings about those (they're not that much extra code), but
if we'd like to ditch those, we could do that.

Tested on Linux.  Will push to try and double-check on an <atomic>-supporting
machine.
Attachment #742326 - Attachment is obsolete: true
Attachment #742326 - Flags: review?(jseward)
Attachment #749910 - Flags: review?(jwalden+bmo)
Improved tests.  I left the pointer and integer versions separate.  If you have
your heart set on merging them together, I can try to do that, but it seemed a
lot easier to keep them distinct.
Attachment #742327 - Attachment is obsolete: true
Attachment #749912 - Flags: review?(jwalden+bmo)
Also, if we wanted to ditch memory orderings, I'd vote for keeping Relaxed and SequentiallyConsistent.  Keep Relaxed so that in an <atomic> world, the compiler can see that certain variables may be accessed by multiple threads and data races are explicitly OK.  Keep SequentiallyConsistent because that's what people want a good percentage of the time.  Everything else is probably a little too subtle.  (I'm not even sure the documentation for ReleaseAcquire describes the proper use cases; if you want locks/mutexes/etc. you should almost certainly be using actual primitives for those--at least at the levels Gecko/JS people are working at--rather than rolling your own with atomic variables.)
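
A concrete example of the sort of variable Relaxed is good for (a sketch using the proposed interface; the counter itself is hypothetical):

// A statistics counter bumped from several threads and only ever read for
// reporting.  No other memory is published through it, so Relaxed is enough,
// while still avoiding the data race a plain uint32_t would have.
static Atomic<uint32_t, Relaxed> sChunksAllocated(0);

void
NoteChunkAllocated()
{
  sChunksAllocated++;  // atomic increment; no ordering guarantees implied
}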
Comment on attachment 749910 [details] [diff] [review]
add mfbt/Atomics.h

Review of attachment 749910 [details] [diff] [review]:
-----------------------------------------------------------------

::: mfbt/Atomics.h
@@ +56,5 @@
> + * their own instruction reordering, above and beyond that performed by
> + * the compiler.  And each core has its own memory caches, and accesses
> + * (reads and writes both) to "memory" may only resolve to out-of-date
> + * cache entries -- not to the "most recently" performed operation in
> + * some global sense.  Any access to a value that may be used by

From a pedantry perspective, this is wrong. Cache coherency guarantees that, at any given point in time [from the perspective of the global memory/interconnect bus], all processors will agree on the value for a memory location. Indeed, C++11 states that every memory location has a single global memory order. Thus reads never resolve to "out-of-date cache entries" in a strict sense.

The memory ordering problem (as its name implies) is really that, to ensure that everyone agrees on the value at any point in time, read and write operations can take varying amounts of times and happen pretty much asynchronously with respect to the calling thread. As a result, just because a thread requested two reads to happen in a specific order doesn't mean they'll happen in the same order from the point of view of the global interconnect.

@@ +70,5 @@
> + * supported.  These three modes are confusing enough as it is!  If your
> + * use case really really really really really needs the extra speed the
> + * additional modes provide, *and* you are tall enough to ride the
> + * atomics memory-ordering roller coaster (if you're not sure, you
> + * aren't), file a bug and we'll consider accommodating you.

We basically support x86, x86-64, and ARM as processors we care about. I think we also have active maintainers for PowerPC and SPARC. x86/x86-64 basically has release-acquire or sequentially consistent as accessible modes, while the others also support relaxed memory ordering constraints. The release-consume mode, the mode I did not originally add, is only distinct from release-acquire on Alpha processors, which I'm quite sure we don't care about in any capacity, let alone possible small performance improvements. C++11's enum has 6 values in it, but they basically fall into one of the four modes.

@@ +82,5 @@
> +   * the writing thread may already be visible, if the compiler or CPU
> +   * reordered them.  (The latter can happen if reads and/or writes get
> +   * held up in per-processor caches.)  Relaxed ordering means
> +   * operations can always use cached values (as long as the actual
> +   * updates to atomic values actually occur, correctly, eventually), so

Pedantically speaking, incorrect again.

@@ +119,5 @@
> +   * operations.
> +   *
> +   * ReleaseAcquire ordering is good for (as the name implies) atomic
> +   * operations on values controlling ownership of things: reference
> +   * counts, mutexes, and the like.

ReleaseAcquire is basically useful for things that only use a single atomic variable, not several in conjunction.
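
i.e. the classic single-flag publish pattern (a hedged sketch with the proposed class; names are illustrative):

void UseData(int);

static int sData;                                       // plain, non-atomic
static Atomic<uint32_t, ReleaseAcquire> sReady(0);      // the single atomic variable

// Producer thread:
void Publish() {
  sData = 42;   // ordinary write
  sReady = 1;   // release store publishes sData along with it
}

// Consumer thread:
void Consume() {
  if (sReady)        // acquire load
    UseData(sData);  // guaranteed to see 42 once sReady reads non-zero
}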

@@ +145,5 @@
> +   * SequentiallyConsistent ordering can be desirable when multiple
> +   * threads observe objects, and they all have to agree on the
> +   * observable order of changes to them.  It's also probably the
> +   * hardest to describe its use cases.  If you don't know what order to
> +   * use, use this one.

Lockless data structures use sequentially-consistent ordering.
(In reply to Nathan Froyd (:froydnj) from comment #48)
> 2. We don't have any code in the tree that would use it now or will be likely
>    to use it in the future;

I recall hearing in IRC once or twice people express a desire to have compare-and-swap. It's basically the only useful operation for Atomic<T*> (the other operations are provided to make Atomic<T*> "act like" T* and for completeness, but I can't think of any useful use-case for an atomic pointer that doesn't have compare-and-swap).
(In reply to Joshua Cranmer [:jcranmer] from comment #52)
> (In reply to Nathan Froyd (:froydnj) from comment #48)
> > 2. We don't have any code in the tree that would use it now or will be likely
> >    to use it in the future;
> 
> I recall hearing in IRC once or twice people express a desire to have
> compare-and-swap. It's basically the only useful operation for Atomic<T*>
> (the other operations are provided to make Atomic<T*> "act like" T* and for
> completeness, but I can't think of any useful use-case for an atomic pointer
> that doesn't have compare-and-swap).

I am willing to have people file bugs if they would find other operations useful.
This version compiles properly on Linux (with and without <atomic>).  Waiting for
try to come back up.
Attachment #749910 - Attachment is obsolete: true
Attachment #749910 - Flags: review?(jwalden+bmo)
Attachment #750001 - Flags: review?(jwalden+bmo)
Comment on attachment 749910 [details] [diff] [review]
add mfbt/Atomics.h

Review of attachment 749910 [details] [diff] [review]:
-----------------------------------------------------------------

Yee-haw!  Here goes nothin'.

::: mfbt/Atomics.h
@@ +56,5 @@
> + * their own instruction reordering, above and beyond that performed by
> + * the compiler.  And each core has its own memory caches, and accesses
> + * (reads and writes both) to "memory" may only resolve to out-of-date
> + * cache entries -- not to the "most recently" performed operation in
> + * some global sense.  Any access to a value that may be used by

Doesn't relaxed ordering specifically permit cores' caches to de-cohere, so that operations are observed in different orders on different threads, and different cores will *not* agree on the instantaneous value of a memory location?  If there was one thing I was certain about in researching this all, it was this point!

@@ +70,5 @@
> + * supported.  These three modes are confusing enough as it is!  If your
> + * use case really really really really really needs the extra speed the
> + * additional modes provide, *and* you are tall enough to ride the
> + * atomics memory-ordering roller coaster (if you're not sure, you
> + * aren't), file a bug and we'll consider accommodating you.

Yeah, consume/Alpha aren't relevant.  But release+acquire is different from release, and different from acquire, and it's that refinement that it may be sometimes advantageous to make usable, if you're very very very careful.

@@ +190,5 @@
> +template<typename T, MemoryOrdering Order>
> +struct IntrinsicBase
> +{
> +    typedef std::atomic<T> ValueType;
> +    typedef AtomicOrderConstraints<Order> OrderedOp;

Hmm.  Aren't these both going to be dependent name lookups in all the subclasses, therefore requiring IntrinsicBase:: qualification or similar?

@@ +225,5 @@
> +    static T* add(ValueType& ptr, ptrdiff_t val) {
> +      return ptr.fetch_add(val * sizeof(T), OrderedOp::AtomicRMWOrder);
> +    }
> +    static T* sub(ValueType& ptr, ptrdiff_t val) {
> +      return ptr.fetch_sub(val * sizeof(T), OrderedOp::AtomicRMWOrder);

I'd be pretty surprised if these need sizeof(T) in them, or indeed if these bits need T* specializations, but tests should pick up issues quickly, presumably.

@@ +238,5 @@
> +};
> +
> +template<typename T, MemoryOrdering Order>
> +struct AtomicIntrinsics : public IntrinsicMemoryOps<T, Order>,
> +                          public IntrinsicIncDec<T, Order>

You've got way more IntrinsicFoo breakouts than I would, but whatever.  :-)

@@ +339,5 @@
> +      Barrier<Order>::afterStore();
> +    }
> +    static T exchange(T& ptr, T val) {
> +      return __sync_lock_test_and_set(&ptr, val);
> +    }

Hmm.  So this is only an acquire, if docs and your comment above are to be believed.  Do we want/need more here?  I don't understand what operations are supposed to happen on an exchange operation to speak confidently here.

@@ +364,5 @@
> +     * __sync_fetch_and_{add,sub} will properly type-check.
> +     */
> +    static ValueType add(ValueType& ptr, ptrdiff_t val) {
> +      ValueType amount = reinterpret_cast<ValueType>(val * sizeof(T));
> +      return __sync_fetch_and_add(&ptr, amount);

Again I'm kind of surprised sizeofs are needed, but tests should smoke this out too.

@@ +424,5 @@
> +long __cdecl _InterlockedXor(long volatile* Destination, long Value);
> +long __cdecl _InterlockedAnd(long volatile* Destination, long Value);
> +long __cdecl _InterlockedIncrement(long volatile* Destination);
> +long __cdecl _InterlockedDecrement(long volatile* Destination);
> +long __cdecl _InterlockedExchange(long volatile *Destination, long Value);

dst/value as a nitpick.

@@ +463,5 @@
> + * These two functions increment or decrement, respectively, the value
> + * stored in |*ptr| by 1 and return the value previously stored in
> + * |*ptr|.  Note that the return value requirement is different from the
> + * return value provided by the obvious _Interlocked{Increment,Decrement}
> + * functions.

Might as well say "These functions are equivalent to add(ptr, 1) and sub(ptr, 1)." for brevity and clarity.

@@ +568,5 @@
> +    static Type and_(Type* ptr, Type val) {
> +      return _InterlockedAnd64(ptr, val);
> +    }
> +    static Type inc(Type* ptr) {
> +      return _InterlockedIncrement64(ptr) - 1;

Could this be just done with add(ptr, 1) for clarity (same for dec/sub)?  Seems optimal to avoid the increment/decrement intrinsic issues around what value they return, to me, and parallels the other implementations.  Same for the PrimitiveIntrinsics<4> versions, too, jumping back slightly.

@@ +622,5 @@
> +struct IntrinsicBase
> +{
> +    typedef T ValueType;
> +    typedef PrimitiveIntrinsics<sizeof(T)> Primitives;
> +    typedef typename Primitives::Type PrimType;

Again I'm surprised these work in subclasses without qualification.  This might be a case where MSVC's name lookup is buggy; I remember hearing that only clang did this stuff right, once.

@@ +653,5 @@
> +    }
> +};
> +
> +template<typename T>
> +struct IntrinsicApplyHelper : public IntrinsicBase<T>

Technically we should probably put the function pointers into the template parameters for guaranteed-est inlining, but hopefully MSVC is smart enough.  We can loop back if needed later.

@@ +810,5 @@
> +    Atomic(T aInit) : detail::AtomicBase<T, Order>(aInit) {}
> +
> +#define OP_ASSIGN(assignOp, op, intrinsic)                              \
> +  T operator assignOp(T delta) { return intrinsic(Base::mValue, delta) op delta; }
> +  OP_ASSIGN(+=, +, Base::Intrinsics::add)

I'd just write these all out if it were me.

@@ +835,5 @@
> + *
> + * There is one exception to the case of atomic memory accesses: providing an
> + * initial value of the atomic value is not guaranteed to be atomic. This is a
> + * deliberate design choice that enables static atomic variables to be declared
> + * without introducing extra static constructors.

In practice I think we're going to get static constructors here as long as the type's not POD or so, but I guess we can at least try.  :-\
Attachment #749910 - Attachment is obsolete: false
Comment on attachment 750001 [details] [diff] [review]
add mfbt/Atomics.h

Heh, so I guess I was reviewing an old patch.  That addresses most of the comments.  The changes look fine.  You might consider |using Base::operator=;| instead of removing AtomicBase::operator= and implementing it fully in the subclasses -- that seems slightly better to me, and I *think* I've seen it working before.
Attachment #750001 - Flags: review?(jwalden+bmo) → review+
Comment on attachment 749912 [details] [diff] [review]
add tests for mfbt/Atomics.h

Review of attachment 749912 [details] [diff] [review]:
-----------------------------------------------------------------

Whee, what could go wrong?

::: mfbt/tests/TestAtomics.cpp
@@ +1,1 @@
> +#include "mozilla/Assertions.h"

Add the usual license header.

@@ +1,4 @@
> +#include "mozilla/Assertions.h"
> +#include "mozilla/Atomics.h"
> +#include "mozilla/DebugOnly.h"
> +#include "mozilla/StandardInteger.h"

Given we just dropped MSVC<10 support, let's use <stdint.h> for this -- no need to keep adding more of these for someone to clean up.  :-)
Attachment #749912 - Flags: review?(jwalden+bmo) → review+
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #56)
> Comment on attachment 750001 [details] [diff] [review]
> add mfbt/Atomics.h
> 
> Heh, so I guess I was reviewing an old patch.  That addresses most of the
> comments.  The changes look fine.  You might consider |using
> Base::operator=;| instead of removing AtomicBase::operator= and implementing
> it fully in the subclasses -- that seems slightly better to me, and I
> *think* I've seen it working before.

Doh, sorry about that.

The operator= weirdness is because I was running into some extremely inscrutable errors with <atomic> and AtomicBase::operator=.  I'll try your suggestion before pushing.
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #55)
> Doesn't relaxed ordering specifically permit cores' caches to de-cohere, so
> that operations are observed in different orders on different threads, and
> different cores will *not* agree on the instantaneous value of a memory
> location?  If there was one thing I was certain about in researching this
> all, it was this point!

[intro.multithread].6 [1.10p6] clearly states:
All modifications to a particular atomic object M occur in some particular total order, called the modification order of M.

There is a non-normative note that says [atomics.order].1
[ Note: Atomic operations specifying memory_order_relaxed are relaxed with respect to memory ordering. Implementations must still guarantee that any given atomic access to a particular atomic object be indivisible with respect to all other atomic accesses to that object. —end note ]
I don't see any further normative discussion of memory_order_relaxed which would overrule the first paragraph and the later coherency clauses. The intent of memory_order_relaxed I think is to be able to express an appropriate processor memory instruction (load/store/atomic-* operand).

The wording is set up such that it describes all atomic objects, and the release/acquire/consume stuff works by setting up happens-before relationships (paragraph 2), and sequential consistency has some extra clauses to make it work (most of the rest of that section).

It may be a subtle point, but, *for each* object, there exists a global total order on the accesses to that object. There does not exist (in general) a global total order on all accesses.
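
The standard illustration of that distinction, with relaxed std::atomic (a sketch; the loads and stores would run on four separate threads):

#include <atomic>

std::atomic<int> x(0), y(0);

// Thread 1:  x.store(1, std::memory_order_relaxed);
// Thread 2:  y.store(1, std::memory_order_relaxed);
// Thread 3:  r1 = x.load(relaxed);  r2 = y.load(relaxed);   // can observe 1, 0
// Thread 4:  r3 = y.load(relaxed);  r4 = x.load(relaxed);   // can observe 1, 0
//
// Threads 3 and 4 may see the two stores in opposite orders: there is no
// single total order over accesses to *different* objects.  But all accesses
// to x alone (and to y alone) still fall into one modification order each.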
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #55)
> @@ +190,5 @@
> > +template<typename T, MemoryOrdering Order>
> > +struct IntrinsicBase
> > +{
> > +    typedef std::atomic<T> ValueType;
> > +    typedef AtomicOrderConstraints<Order> OrderedOp;
> 
> Hmm.  Aren't these both going to be dependent name lookups in all the
> subclasses, therefore requiring IntrinsicBase:: qualification or similar?

This has been fixed by copious usage of Base:: in an updated version of the header, which I think you have seen.

> @@ +225,5 @@
> > +    static T* add(ValueType& ptr, ptrdiff_t val) {
> > +      return ptr.fetch_add(val * sizeof(T), OrderedOp::AtomicRMWOrder);
> > +    }
> > +    static T* sub(ValueType& ptr, ptrdiff_t val) {
> > +      return ptr.fetch_sub(val * sizeof(T), OrderedOp::AtomicRMWOrder);
> 
> I'd be pretty surprised if these need sizeof(T) in them, or indeed if these
> bits need T* specializations, but tests should pick up issues quickly,
> presumably.

It turns out that GCC 4.6 has a broken <atomic> header; the test program:

#include <atomic>

void
adjust_pointer(std::atomic<int*>& p)
{
  p.fetch_add(1);
}

compiles to:

_Z14adjust_pointerRSt6atomicIPiE:
.LFB382:
	.cfi_startproc
	lock addq	$1, (%rdi)
	ret
	.cfi_endproc

which is Wrong.  GCC 4.5 and 4.7+ appear to be OK, so I've changed add/sub on pointer types to look like:

    static T* add(typename Base::ValueType& ptr, ptrdiff_t val) {
      return ptr.fetch_add(fixup(val), Base::OrderedOp::AtomicRMWOrder);
    }
    static T* sub(typename Base::ValueType& ptr, ptrdiff_t val) {
      return ptr.fetch_sub(fixup(val), Base::OrderedOp::AtomicRMWOrder);
    }
  private:
    static ptrdiff_t fixup(ptrdiff_t val) {
#if defined(__clang__) || defined(_MSC_VER)
      return val;
#elif defined(__GNUC__) && MOZ_GCC_VERSION_AT_LEAST(4, 6, 0) && \
      !MOZ_GCC_VERSION_AT_LEAST(4, 7, 0)
      return val * sizeof(T);
#else
      return val; 
#endif
    }

will add a comment explaining the situation.

> @@ +339,5 @@
> > +      Barrier<Order>::afterStore();
> > +    }
> > +    static T exchange(T& ptr, T val) {
> > +      return __sync_lock_test_and_set(&ptr, val);
> > +    }
> 
> Hmm.  So this is only an acquire, if docs and your comment above are to be
> believed.  Do we want/need more here?  I don't understand what operations
> are supposed to happen on an exchange operation to speak confidently here.

If we wanted, we could add __sync_synchronize calls here.  From a correctness point of view (and not a performance point of view), that would be proper.  From a pragmatic point of view, PR_ATOMIC_SET has always only provided acquire semantics (by virtue of using __sync_lock_test_and_set) on non-Win32 platforms and it hasn't hurt anything so far.  I'm inclined to leave as-is.
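
For the record, the stricter variant would look something like this (a sketch; not what the patch does):

  static T exchange(T& ptr, T val) {
    // __sync_lock_test_and_set is only an acquire barrier, so a full
    // barrier beforehand would be needed for stronger semantics.
    __sync_synchronize();
    return __sync_lock_test_and_set(&ptr, val);
  }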

This implementation should also slowly go away as we get newer compilers, though Android's stlport may hold us back here. :(

> @@ +364,5 @@
> > +     * __sync_fetch_and_{add,sub} will properly type-check.
> > +     */
> > +    static ValueType add(ValueType& ptr, ptrdiff_t val) {
> > +      ValueType amount = reinterpret_cast<ValueType>(val * sizeof(T));
> > +      return __sync_fetch_and_add(&ptr, amount);
> 
> Again I'm kind of surprised sizeofs are needed, but tests should smoke this
> out too.

The __sync_fetch_and_{add,sub} functions are dumb in this regard, insofar as they just take two values and smash them together without any semantic knowledge.  The original Intel versions only took integral types and it was a GCC extension to permit pointer types as well; I guess the full implications of allowing pointer types didn't occur to folks. =/

> @@ +568,5 @@
> > +    static Type and_(Type* ptr, Type val) {
> > +      return _InterlockedAnd64(ptr, val);
> > +    }
> > +    static Type inc(Type* ptr) {
> > +      return _InterlockedIncrement64(ptr) - 1;
> 
> Could this be just done with add(ptr, 1) for clarity (same for dec/sub)? 
> Seems optimal to avoid the increment/decrement intrinsic issues around what
> value they return, to me, and parallels the other implementations.  Same for
> the PrimitiveIntrinsics<4> versions, too, jumping back slightly.

Yeah, probably.  We weren't even using these!

> @@ +622,5 @@
> > +struct IntrinsicBase
> > +{
> > +    typedef T ValueType;
> > +    typedef PrimitiveIntrinsics<sizeof(T)> Primitives;
> > +    typedef typename Primitives::Type PrimType;
> 
> Again I'm surprised these work in subclasses without qualification.  This
> might be a case where MSVC's name lookup is buggy; I remember hearing that
> only clang did this stuff right, once.

GCC 4.7+ gets it right IIRC too.

> @@ +810,5 @@
> > +    Atomic(T aInit) : detail::AtomicBase<T, Order>(aInit) {}
> > +
> > +#define OP_ASSIGN(assignOp, op, intrinsic)                              \
> > +  T operator assignOp(T delta) { return intrinsic(Base::mValue, delta) op delta; }
> > +  OP_ASSIGN(+=, +, Base::Intrinsics::add)
> 
> I'd just write these all out if it were me.

Fair enough.

> @@ +835,5 @@
> > + *
> > + * There is one exception to the case of atomic memory accesses: providing an
> > + * initial value of the atomic value is not guaranteed to be atomic. This is a
> > + * deliberate design choice that enables static atomic variables to be declared
> > + * without introducing extra static constructors.
> 
> In practice I think we're going to get static constructors here as long as
> the type's not POD or so, but I guess we can at least try.  :-\

Pretty sure GCC won't fold these away (GCC in general does not eliminate static constructors even for POD classes), guessing clang will make a valiant effort, and not sure about MSVC.  We can try tweaking this if it becomes a problem.
Attachment #749910 - Attachment is obsolete: true
Depends on: 873649
Blocks: 873649
No longer depends on: 873649
No longer blocks: 873649
Depends on: 873649
https://hg.mozilla.org/mozilla-central/rev/8096f5bdab92
https://hg.mozilla.org/mozilla-central/rev/eab6eac1a371
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla24
Depends on: 876156
Depends on: 881163