Closed Bug 1142593 Opened 9 years ago Closed 9 years ago

Incorrect to cast T* to atomic<T>* even for integral T

Categories: (Core :: JavaScript Engine, defect)
Priority: Not set
Severity: normal

Tracking: RESOLVED FIXED, mozilla40
Tracking Status: firefox40 --- fixed

People: (Reporter: lth, Assigned: lth)

References: (Blocks 1 open bug)

Attachments: (1 file, 2 obsolete files)

Due to a misreading of the C++ spec, I implemented the C++ methods for the atomics as code that casts e.g. int* to atomic<int>* and then accesses the data via the latter pointer.  That will almost certainly work properly on all current Firefox platforms for data up to 32 bits, because all our platforms support atomic operations directly up to that size (modulo whatever the C++ compiler might do with such code if it deems it undefined behavior), but it is in actual fact incorrect: atomic<T> is not guaranteed to be the same size as T, even for integral T.
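
For concreteness, the problematic pattern looks roughly like this (illustrative only, not the exact patch code):

#include <atomic>
#include <cstdint>

int32_t
fetchAndAddBroken(int32_t *addr, int32_t val)
{
    // Nothing guarantees sizeof(std::atomic<int32_t>) == sizeof(int32_t),
    // and *addr was never constructed as a std::atomic, so this cast is
    // undefined behavior even when it happens to work in practice.
    std::atomic<int32_t> *ap = reinterpret_cast<std::atomic<int32_t> *>(addr);
    return ap->fetch_add(val, std::memory_order_seq_cst);
}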

The best fix is probably to provide an abstraction layer where the JIT generates the appropriate code, which is then called from C++, or where there is handwritten assembler that does the same.  This will have the highly desirable side benefit that we can guarantee that C++ code and jitted code use the same method for atomic access, something that has been lacking up to now.  For JIT-less platforms we can provide something lame, like a global spinlock (sketched below), or more hand-written assembler.
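
A minimal sketch of the global-spinlock fallback, assuming gcc-style __sync builtins (the names here are hypothetical):

#include <stdint.h>

static uint32_t gAtomicLock = 0;    // 0 = free, 1 = held

static inline void
LockGlobal()
{
    // Spin until we transition the lock from 0 to 1.
    while (__sync_val_compare_and_swap(&gAtomicLock, 0, 1) != 0)
        continue;
}

static inline void
UnlockGlobal()
{
    __sync_lock_release(&gAtomicLock);  // release-store of 0
}

template<typename T>
static inline T
FallbackFetchAndAdd(T *addr, T val)
{
    LockGlobal();
    T old = *addr;
    *addr = old + val;
    UnlockGlobal();
    return old;
}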
The RegionLock structure may appear surprising to the reviewer, since it is not used anywhere in this patch.  It, and the complexity around 64-bit accesses, is motivated by code that's coming in bug 1131613; I'll attach WIP code to that bug shortly.
Attachment #8579372 - Flags: review?(luke)
Comment on attachment 8579372 [details] [diff] [review]
Factor out the atomics functions, and make them platform-specific

Review of attachment 8579372 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/shared/AtomicOperations-x86-shared.h
@@ +8,5 @@
> +#define jit_shared_AtomicOperations_x86_shared_h
> +
> +#include "jit/AtomicOperations.h"
> +
> +/* This file and its companion files in other platform directories all

I think you should invert this a little.  Instead of having users #ifdef to pieces to pick the platform-specific header they want, they should #include AtomicOperations.h.  *That* file should include declarations of all these operations, with all the documentation there.  Then at the end of that header, do the platform-testing ifdefs and include the right platform-specific header that defines the methods just declared.  It seems very likely someone would first look in AtomicOperations.h for information on what any of these methods does, and to the architecture-specific headers for the right definition.  We should support that, not put it all in one random arch-specific header.
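
Concretely, the shape I have in mind is something like this (a hypothetical sketch only; the codegen macro names are assumptions):

// jit/AtomicOperations.h
#ifndef jit_AtomicOperations_h
#define jit_AtomicOperations_h

// All declarations, and all the documentation, live here, e.g.:

// Sequentially consistent load of *addr.
template<typename T>
inline T atomic_sc_load(T *addr);

// Sequentially consistent store of val to *addr.
template<typename T>
inline void atomic_sc_store(T *addr, T val);

// ...and at the very end, pick the platform header that defines them:
#if defined(JS_CODEGEN_X86) || defined(JS_CODEGEN_X64)
# include "jit/shared/AtomicOperations-x86-shared.h"
#elif defined(JS_CODEGEN_ARM)
# include "jit/arm/AtomicOperations-arm.h"
#else
# error "Atomic operations are not defined for this platform"
#endif

#endif // jit_AtomicOperations_h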
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #2)
> Comment on attachment 8579372 [details] [diff] [review]
> Factor out the atomics functions, and make them platform-specific
> 
> I think you should invert this a little.  [...]

The structure is what it is because I don't really anticipate more than one file needing access to those function definitions, and I really don't want AtomicOperations.h to pull in all the inline definitions everywhere it's needed.  So I probably won't do exactly what you suggest, but I'll consider some other options.
Attachment #8579372 - Flags: review?(luke)
Changes from the previous version:

- Definitions of all the functions and types are now in jit/AtomicOperations.h,
  and comments have moved into that file too, with xrefs updated
- There is jit/AtomicOperations-inl.h which includes the appropriate platform
  file with function bodies
- MSVC version has been slightly reworked to fit into the template-function
  mold
- atomic_sc_exchange() is now defined for 64-bit values (previously an oversight)

(A try run of this + the client code and test cases from bug 1131613 is green.)
Attachment #8579372 - Attachment is obsolete: true
Attachment #8580069 - Flags: review?(jwalden+bmo)
Blocks: 1146817
Comment on attachment 8580069 [details] [diff] [review]
Factor out the atomics functions, and make them platform-specific, v2

New patch coming.
Attachment #8580069 - Flags: review?(jwalden+bmo)
Changes from the previous version are local adjustments to 32-bit compilers:

- clang -m32 simply does not support 8-bit atomics with the __atomic builtins,
  so stop trying
- on 32-bit Windows, _InterlockedExchange64 is not defined, so compensate
  by falling back on a CMPXCHG loop (sketched below)
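
A sketch of that fallback (illustrative, not the exact patch code; _InterlockedCompareExchange64 is available on 32-bit MSVC and compiles to LOCK CMPXCHG8B):

#include <intrin.h>
#include <stdint.h>

static inline int64_t
ExchangeViaCmpxchg(int64_t *addr, int64_t val)
{
    int64_t oldval;
    do {
        oldval = *addr;  // racy read; the CAS below validates it
    } while (_InterlockedCompareExchange64((volatile __int64 *)addr, val, oldval) != oldval);
    return oldval;
}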

FWIW, try is green (with a long patch queue on top of this patch that makes use of the new APIs):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f03d3c7be9ec
Attachment #8580069 - Attachment is obsolete: true
Attachment #8582597 - Flags: review?(jwalden+bmo)
A small error in this patch: the change made to the Uint32 case in atomic_binop_impl should not have been made.
Comment on attachment 8582597 [details] [diff] [review]
Factor out the atomics functions, and make them platform-specific, v3

Review of attachment 8582597 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/builtin/AtomicsObject.cpp
@@ +112,5 @@
>  
>  void
>  js::atomics_fullMemoryBarrier()
>  {
> +    atomic_sc_fence();

I would agree with (if I'm reading it correctly, which is not at all in evidence!) https://docs.google.com/document/d/1NDGA_gZJ7M7w1Bh8S0AoDyEqwDdRh4uSoTPSNn77PFk/edit?disco=AAAAAVlAnIs by some Hans Boehm dude, that having to have memory barriers anywhere seems undesirable.  He sounds like he might know what he's talking about.  It's not clear to me why atomic operations on locations wouldn't be adequate, and with all the operations that *actually* observe/affect atomic locations being sequentially consistent, I'm not even sure how this is observable.

But anyway.  I should probably keep reading assuming atomic_sc_fence is useful/desirable for now.  :-)

::: js/src/jit/AtomicOperations.h
@@ +6,5 @@
> +
> +#ifndef jit_AtomicOperations_h
> +#define jit_AtomicOperations_h
> +
> +/* The atomic operations layer defines the following types and

/*
 * The atomic...

as far as style goes.

@@ +9,5 @@
> +
> +/* The atomic operations layer defines the following types and
> + * functions ("_sc_" means "sequentially consistent" and means these
> + * functions need to use appropriate atomic instructions on the
> + * hardware or insert explicit fences).

...and means such a function's operation must have "sequentially consistent" memory ordering.  See mfbt/Atomics.h for an explanation of this memory ordering.

Specifically, I'd rather not discuss "fences" here.  The reader *might* be somewhat aware of the meaning of sequentially-consistent already, from C/C++ memory models.  Best to stay away from the primitives occasionally used to implement such semantics at lower levels.  And the reference to Atomics.h tells the reader exactly where to look to understand SC if he doesn't already.  (Although to be sure, such readers would want to proceed super-cautiously for some time in any event.)
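
For flavor, mfbt-style usage looks roughly like this (a hedged sketch assuming mfbt/Atomics.h's Atomic template, where SequentiallyConsistent is the default ordering):

#include "mozilla/Atomics.h"

mozilla::Atomic<uint32_t, mozilla::SequentiallyConsistent> counter(0);

void
bump()
{
    counter++;  // one sequentially consistent read-modify-write
}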

@@ +26,5 @@
> + * out to code that's generated at run time.
> + */
> +
> +/* Execute a full memory barrier (LoadLoad+LoadStore+StoreLoad+StoreStore) */
> +void atomic_sc_fence();

Standard SpiderMonkey style would probably have:

class AtomicOperations
{
  public:
    inline void fence();

    inline bool eightByteIsLockFree();

    template<typename T>
    inline T load(T* addr);

    template<typename T>
    inline void store(T* addr, T val);

    ...
};

and so on, with camelCase naming for things.  That also nicely uses a real C++ construct for namespacing, versus poor-man's prefixing.

@@ +89,5 @@
> + *
> + * The implementations of acquire and release are system-specific
> + * and defined in jit/ARCH/AtomicOperations-ARCH.h.
> + */
> +struct RegionLock

Will the plan be (or is it already) for RegionLock to be something JIT code would poke at manually?  I reviewed a patch elsewhere for a spin lock; it could be getting time to have one in mfbt (although that one just yielded on failure, because its critical section was a bit longer than a couple of instructions).  But if you're going to touch it in JIT code, that's not really practical any more.

@@ +91,5 @@
> + * and defined in jit/ARCH/AtomicOperations-ARCH.h.
> + */
> +struct RegionLock
> +{
> +public:

public/private should be indented two spaces.

@@ +110,5 @@
> +    void release(void *addr);
> +
> +private:
> +    /* For now, a simple spinlock that covers the entire buffer */
> +    uint32_t spinlock;

Given I assume we're talking operations that access single locations (or I guess up to eight of them, big deal), this seems sensible.

::: js/src/jit/arm/AtomicOperations-arm.h
@@ +6,5 @@
> +
> +/* For documentation, see jit/AtomicOperations.h */
> +
> +#ifndef jit_arm_AtomicOperations_arm_h
> +#define jit_arm_AtomicOperations_arm_h

We're not trying to take advantage of any CPU-specific instructions for these things, yet.  Not sure how that would fit into any of this -- you'd need the compiler to generate instructions that might not be legal for the target CPU.  I guess a concern for another day.

@@ +77,5 @@
> +{
> +    MOZ_ASSERT(sizeof(T) < 8 || atomic_lockfree8());
> +# ifdef USE_SYNC_NOT_ATOMIC
> +    __sync_synchronize();
> +    *addr = val;

Does none of the ARM implementation need volatiles at all, in the sync-using cases?  (Same naming comments as from x86 apply here, too.)

::: js/src/jit/shared/AtomicOperations-x86-shared.h
@@ +36,5 @@
> +// It's not clear to me that Firefox is even supported for 486 CPUs.
> +// Windows XP requires a Pentium, and Apple has never shipped less
> +// than a Core-1.  Linux is a minor wild card.  I note we use the
> +// CPUID instruction freely, but that's not supported on all 486
> +// systems.  In practice I think we can count on a Pentium everywhere.

...uh, yeah, we can count on a Pentium.  I sort of doubt we have explicitly stated minimum requirements, but there's no possible way we're supporting non-Pentium Linux.  Just remove the paragraph.  If someone ever cares (we're never going to), the comments here already note the minimum-Pentium requirement.

@@ +56,5 @@
> +// the newer __atomic functions.  This will be required for GCC 4.6.x
> +// and earlier, and probably for Clang 3.1, should we need to use
> +// those versions.
> +
> +// #define USE_SYNC_NOT_ATOMIC

I find myself misreading this name every time I look at it, as something like "use sync" is not "atomic".  And it doesn't help to see an #ifndef on at least one use of it below.  Could you rename this to ATOMICS_IMPLEMENTED_WITH_SYNC_INTRINSICS or so?  I'd rather see a longer, clearer name than one that I continually misread.

Also, it'd be nice to lead with describing the default tactic: prefix the paragraph with this sentence: "The default implementation tactic for gcc/clang is to use the newer __atomic intrinsics added for use in C++11 <atomic>.  Where that isn't available, we use GCC's older __sync functions instead."
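
To put the two tactics side by side, here is a hedged sketch of a sequentially consistent load under each (conservatively fenced on the __sync path; not the exact patch code, and using the suggested macro name):

template<typename T>
inline T
load_seqcst(T *addr)
{
#ifdef ATOMICS_IMPLEMENTED_WITH_SYNC_INTRINSICS
    // Older __sync path: the volatile access inhibits compiler
    // reordering; the (conservative) full fence inhibits CPU reordering.
    T v = *static_cast<volatile T *>(addr);
    __sync_synchronize();
    return v;
#else
    // Newer __atomic path: the ordering is named explicitly.
    T v;
    __atomic_load(addr, &v, __ATOMIC_SEQ_CST);
    return v;
#endif
}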

@@ +110,5 @@
> +{
> +    MOZ_ASSERT(sizeof(T) < 8 || atomic_lockfree8());
> +# ifdef USE_SYNC_NOT_ATOMIC
> +    // The volatile is a signal to the compiler not to move the load,
> +    // on x86 it should be OK.

// Inhibit *compiler* reordering with a volatile load.  x86 peculiarities mean we don't have to specially inhibit CPU reordering: see the comment in atomic_sc_store.

@@ +111,5 @@
> +    MOZ_ASSERT(sizeof(T) < 8 || atomic_lockfree8());
> +# ifdef USE_SYNC_NOT_ATOMIC
> +    // The volatile is a signal to the compiler not to move the load,
> +    // on x86 it should be OK.
> +    T v = *(T volatile *)addr;

*static_cast<volatile T*>(addr), and elsewhere.

@@ +144,5 @@
> +# ifdef USE_SYNC_NOT_ATOMIC
> +    // The volatile is a signal to the compiler not to move the store,
> +    // on x86 it should be OK.
> +    *(T volatile *)addr = val;
> +    __sync_synchronize();

x86 being finicky about atomics, and *almost* s-c on its own, it seems a very good idea to explain this better.  I understand what it's doing because I've read through the link below, and various other commentary on this x86 quirk.  But I don't expect others would have nearly the context.  How does this comment try for size?

"""
The volatile access inhibits compiler reordering, but it doesn't affect what the CPU does.  x86 guarantees ordering *except* that loads may be reordered across older stores to different locations.  http://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/  Consider:

  uint8_t x = 0, y = 0; // to start

  thread1:
    sx: AtomicOperations::store(&x, 1);
    gy: uint8_t obs1 = AtomicOperations::load(&y);

  thread2:
    sy: AtomicOperations::store(&y, 1);
    gx: uint8_t obs2 = AtomicOperations::load(&x);

Sequential consistency requires a total global ordering of operations: sx-gy-sy-gx, sx-sy-gx-gy, sx-sy-gy-gx, sy-gx-sx-gy, sy-sx-gy-gx, sy-sx-gx-gy.  In every ordering at least one of sx-before-gx or sy-before-gy happens, so *at least one* of obs1/obs2 is 1.

If AtomicOperations::{load,store} were just volatile {load,store}, x86 could reorder gx/gy before each thread's prior store.  That would permit gx-gy-sx-sy: both loads would be 0!  Thus after a volatile store we must synchronize to ensure the store happens before the subsequent load.
"""

@@ +176,5 @@
> +    T v;
> +    do {
> +        // Here I assume the compiler will not hoist the load.  It
> +        // shouldn't, because the CAS could affect *addr.
> +        v = *addr;

Is there a reason not to *static_cast<volatile T*>(addr) here?

Or, actually.  Is there a reason not to make all the argument types in all these functions volatile T*?  People can still pass T* just fine to such methods.  And no need for casting anywhere in these functions: you'd get volatile behavior without extra casts.

@@ +181,5 @@
> +    } while (!__sync_bool_compare_and_swap(addr, v, val));
> +    return v;
> +# else
> +    T v;
> +    __atomic_exchange(addr, &val, &v);

Seems best to pass __ATOMIC_SEQ_CST explicitly as the last parameter, rather than have it be implied.

@@ +204,5 @@
> +# endif // LOCKFREE8
> +
> +template<typename T>
> +inline T
> +atomic_sc_compare_and_swap(T *addr, T oldval, T newval)

s/oldval/expected/?  It's unfortunate this variable has both this meaning and the "expected" meaning at different parts of the method.  But it seems to me a function signature should reflect the meaning of the variable at the start of the function, not the end (particularly as functions in general have many endings).

@@ +235,5 @@
> +template<typename T>
> +inline T
> +atomic_sc_fetch_and_add(T *addr, T val)
> +{
> +    static_assert(sizeof(T) <= 4, "Not available for 8-byte values yet");

Nitpick: lowercase the assertion message, as compiler message context typically wants that.

@@ +297,5 @@
> +RegionLock::acquire(void *addr)
> +{
> +# ifdef USE_SYNC_NOT_ATOMIC
> +    while (!__sync_bool_compare_and_swap(&spinlock, 0, 1))
> +        ;

continue; for the loop body is slightly more readable.

@@ +311,5 @@
> +inline void
> +RegionLock::release(void *addr)
> +{
> +# ifdef USE_SYNC_NOT_ATOMIC
> +    __sync_sub_and_fetch(&spinlock, 1); // Should turn into LOCK XADD

#ifdef DEBUG
    uint32_t after =
#endif
        __sync_sub_and_fetch(&spinlock, 1); // should turn into LOCK XADD
    MOZ_ASSERT(after == 0, "releasing unowned region lock?");

@@ +369,5 @@
> +}
> +
> +template<typename T>
> +inline T
> +atomic_sc_load(T *addr) {

{ on its own line.

@@ +373,5 @@
> +atomic_sc_load(T *addr) {
> +    MOZ_ASSERT(sizeof(T) < 8 || atomic_lockfree8());
> +    _ReadWriteBarrier();
> +    return *addr;
> +    _ReadWriteBarrier();

This can't be right.  Do you really want

  _RWB();
  T v = *addr;
  _RWB();
  return v;

?

@@ +492,5 @@
> +inline void
> +RegionLock::acquire(void *addr)
> +{
> +    while (_InterlockedCompareExchange((long*)&spinlock, /*newval=*/1, /*oldval=*/0) == 1)
> +        ;

continue;
Attachment #8582597 - Flags: review?(jwalden+bmo) → review+
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #8)
> Comment on attachment 8582597 [details] [diff] [review]
> Factor out the atomics functions, and make them platform-specific, v3

Thank you for the very good feedback.

Re the fences: yes, the Atomics.fence API will be removed, but it hasn't been urgent to make that change, hence it's being grandfathered in here.
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #8)

> Comment on attachment 8582597 [details] [diff] [review]
> Factor out the atomics functions, and make them platform-specific, v3

And yes, the idea is that the RegionLock will be accessible to the JIT, for open-coded access, so using an existing abstraction is probably not what we want to do yet.
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #8)
> Comment on attachment 8582597 [details] [diff] [review]
> Factor out the atomics functions, and make them platform-specific, v3

> @@ +77,5 @@
> > +{
> > +    MOZ_ASSERT(sizeof(T) < 8 || atomic_lockfree8());
> > +# ifdef USE_SYNC_NOT_ATOMIC
> > +    __sync_synchronize();
> > +    *addr = val;
> 
> Does none of the ARM implementation need volatiles at all, in the sync-using
> cases?  (Same naming comments as from x86 apply here, too.)

The __sync operations act as barriers to compiler reordering / reuse of memory operands (equivalent to _ReadWriteBarrier for MSVC), so volatile is not necessary.  This is poorly specified in the GCC docs, but the fourth paragraph here discusses it: https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html#g_t_005f_005fsync-Builtins.  The wording about "a full barrier" in the first and second sentences addresses compiler instruction reordering; this is further spelled out in section 7.4 of the Itanium ABI (also referenced from that doc), available e.g. here: http://www.uclibc.org/docs/psABI-ia64.pdf.

> @@ +176,5 @@
> > +    T v;
> > +    do {
> > +        // Here I assume the compiler will not hoist the load.  It
> > +        // shouldn't, because the CAS could affect *addr.
> > +        v = *addr;
> 
> Is there a reason not to *static_cast<volatile T*>(addr) here?

ISTR that I convinced myself that it was not necessary due to the atomic op that follows.

> Or, actually.  Is there a reason not to make all the argument types in all
> these functions volatile T*?  People can still pass T* just fine to such
> methods.  And no need for casting anywhere in these functions: you'd get
> volatile behavior without extra casts.

Reasons not to do so would be (a) that the arguments don't need to be volatile, so a volatile API is in a way an overspecification, and (b) there wouldn't be any savings if I have to cast away volatile lower down because some system API can't use them (that's speculative).

I'll ponder this some more.  My inclination is to keep the API non-volatile for now.
Blocks: 1131613
Another try run with the revised patch and the (now massive) queue from bug 1131613 on top is gloriously green, so this is ready to land:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c091834d93aa
https://hg.mozilla.org/mozilla-central/rev/fa58bda5bde2
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla40
Depends on: 1226277