Closed Bug 1394420 Opened 2 years ago Closed 9 months ago

Provide reliable implementation of jit/AtomicOperations for tier-1 platforms

Categories

(Core :: JavaScript Engine: JIT, enhancement, P2)

enhancement

Tracking

()

RESOLVED FIXED
mozilla66
Tracking Status
firefox66 --- fixed

People

(Reporter: lth, Assigned: lth)

References

(Blocks 2 open bugs)

Details

Attachments

(2 files, 8 obsolete files)

Our current implementations of jit/AtomicOperations.h use compiler intrinsics, such as _InterlockedCompareExchange for MSVC and __atomic and __sync for GCC and clang.  The MSVC implementations *might* be OK because they use intrinsics to prevent reordering in the compiler and by extension might tell the compiler not to assume anything about UB (who knows?).  The GCC/Clang implementations are known not to be OK, because those intrinsics are documented as assuming race-free programs and the whole point of our primitives is to support racy programs.  It's even known (believed?) that if a program is race free, TSO lets the C++ compiler engage in some reordering we want to avoid (by not emitting a post-fence for seq_cst stores), and we must fix this.

Basically, the primitives have to be rewritten in in-line assembler or code has to be generated by the JIT for them at startup for our tier-1 platforms.  They are not (should not be...) very performance-sensitive and the cost of a call for each of them is probably just fine.

We should not worry about tier-3 platforms, notably MIPS, or ambiguous-tier platforms, such as ARM64, at this time.
Priority: -- → P2
I have some initial sketches for how to jit-compile these primitives for tier-1 platforms, so taking.
Assignee: nobody → lhansen
Status: NEW → ASSIGNED
Blocks: 1409713
This is a sketch of the high level, not jitting yet.  Although this is now phrased in terms of x86-shared, it is *probably* the case that we can share most code here when we're done, with very modest ifdeffery / platform-specificity to handle corner cases, most notably x86 (everything turns into cmpxchg8b) and maybe mips32 (no 64-bit atomics at all).  In the worst case we'll share everything among other platforms and fork something for these two.

I haven't dealt with memcpy and memmove; in principle they are easy but we can't afford performance to be awful so ... we'll see.
Generates code on x86 and x64 and has a clear path to generalization for other tier-1 platforms, and is independent of the compiler we're using.
Attachment #9002887 - Attachment is obsolete: true
Supports all tier-1 platforms.  Not tested yet on ARM and ARM64 as I need to do that on hardware; jitted code is not compatible with simulators.
Attachment #9003186 - Attachment is obsolete: true
Attached patch arm64-fixes.patch (obsolete) — Splinter Review
Additional fixes for arm64.

Unfortunately arm32 fixes must wait until my dev system is in a usable state.

(FF build requirements changes -> Ubuntu 14 no longer sufficient -> Upgrade to Ubuntu 16 -> "Welcome to Emergency Mode!" -> now in systemd dependency hell.)
Now with ARM64 and not-very-well-tested-but-seems-to-be-working ARM support.
Attachment #9003460 - Attachment is obsolete: true
Attachment #9004203 - Attachment is obsolete: true
A totally gross hack to make things link on ARM because ARM code uses volatile in some places in the TypedArray code to work around an old compiler bug.  Ideally we should check that code to see if we can stop using volatile.  If that's not possible we need something better than this mess here.
My ARM system is still in a bad state and I've not been able to test the ARM code properly.

(It's in a different bad state today.  I have something that resembles Ubuntu 16 yet strangely boots an old kernel that is probably sitting on a secret flash drive somewhere.  Additionally the system is hard-float and so test cases don't generally pass at all, even without my changes.  We've known in the past that this is a problem but I can't find the bug for it.)
OK, so, I can cross-compile from x64 for ARM and run on my dev board, so some progress is to be expected here.  Basically two things remaining apart from debugging, quoting myself from above:

"I haven't dealt with memcpy and memmove; in principle they are easy but we can't afford performance to be awful so ... we'll see."

"A totally gross hack to make things link on ARM because ARM code uses volatile in some places in the TypedArray code to work around an old compiler bug.  Ideally we should check that code to see if we can stop using volatile.  If that's not possible we need something better than this mess here."
Depends on: 1495731
Attachment #9004851 - Attachment is obsolete: true
Apart from some minor cleanup and one more round of testing this is believed to be complete.

A follow-up patch will remove a lot of C++ code that now supports tier-1 platforms, leaving us with jitted atomics for tier-1 (and probably mips) and a feeling-lucky solution for everyone else.  This is pretty much the best we can do, and has the smallest maintenance burden we can hope for.

A follow-up to the follow-up might consider whether we want to jit the atomics even on the simulators, to simplify testing, but I sort of doubt that it's worth it.
Attachment #9004850 - Attachment is obsolete: true
Passes testing on x64 and x86 and has bug fixes and extensive commit comments.
Attachment #9015541 - Attachment is obsolete: true
Depends on: 1497778
See commit msg for most information.  Also be aware of the following patch, which cleans up a lot of the mess in the Atomics layer.

Review guidance / requests:

- nbp, check the JIT bits please
- waldo, please check overall fit with the engine and note performance remarks
  in the commit message
- froydnj, verify that this makes sense to you please
Attachment #9015623 - Attachment is obsolete: true
Attachment #9016313 - Flags: review?(nicolas.b.pierron)
Attachment #9016313 - Flags: review?(nfroyd)
Attachment #9016313 - Flags: review?(jwalden+bmo)
Consolidates many of the files in the atomics layer and prevents most future expansion of it.  Largely, this patch moves or removes files.  I have tried hard to carry over all the platform-specific bits from the files that are removed, even if that may strictly be redundant, as these files are fallbacks.
Attachment #9016314 - Flags: review?(nfroyd)
Comment on attachment 9016313 [details] [diff] [review]
bug1394420-jit-generate-atomics-v7.patch

Review of attachment 9016313 [details] [diff] [review]:
-----------------------------------------------------------------

I did not look super-closely at the generators, but from a superficial look, they looked fine.

::: js/src/jit/shared/AtomicOperations-shared-jit.cpp
@@ +295,5 @@
> +#endif
> +      default:
> +        MOZ_CRASH("Unknown size");
> +    }
> +    masm.memoryBarrier(sync.barrierAfter);

I think the semantics imposed by unconditionally generating barriers is stronger than we need on e.g. x86.  So this approach is not currently as performant as we could be?

@@ +302,5 @@
> +    return start;
> +}
> +
> +static uint32_t
> +GenStore(MacroAssembler& masm, Scalar::Type size, Synchronization sync)

`sync` is apparently unused in this function, which seems suspicious.

@@ +383,5 @@
> +#endif
> +          default:
> +            MOZ_CRASH("Unknown size");
> +        }
> +        offset += direction == CopyDir::DOWN ? 1 : -1;

Pull the check out of the loop, perhaps?

@@ +385,5 @@
> +            MOZ_CRASH("Unknown size");
> +        }
> +        offset += direction == CopyDir::DOWN ? 1 : -1;
> +    }
> +    masm.memoryBarrier(sync.barrierAfter);

We only need barriers before and after all the instructions, not in between?  (I'm not sure what the semantics are here, so asking mostly for my own edification?

@@ +750,5 @@
> +    uint32_t nop = GenNop(masm);
> +#endif
> +
> +    uint32_t load8SeqCst = GenLoad(masm, SIZE8, Synchronization::Full());
> +    uint32_t load16SeqCst = GenLoad(masm, SIZE16, Synchronization::Full());

Is it worth having local synchronization objects that can be passed in here, rather than continuously repeating Sychronization::FOO()?

@@ +875,5 @@
> +    AtomicCompilerFence = (void(*)())(code + nop);
> +#endif
> +
> +    AtomicLoad8SeqCst = (uint8_t(*)(const uint8_t* addr))(code + load8SeqCst);
> +    AtomicLoad16SeqCst = (uint16_t(*)(const uint16_t* addr))(code + load16SeqCst);

Drive-by comments, but this is not code I own, so feel free to ignore!

1. I think the use of variable names in the function type clutters things up.
2. I think all of these would be better served with a template like:

template<typename R, typename... Args>
R(*)(Args...)
Frob(uint8_t* ptr, uint32_t offset)
{
  return (R(*))(Args...)(code + offset);
}

if that's even valid syntax.  Then these turn into things like:

  AtomicLoad8SeqCst = Frob<uint8_t(const uint8_t*)>(code, load8SeqCst);

which seems a little more readable?  I guess the main shortening there comes from taking out the variable name, so maybe it's not worth much.
Attachment #9016313 - Flags: review?(nfroyd) → review+
Comment on attachment 9016313 [details] [diff] [review]
bug1394420-jit-generate-atomics-v7.patch

Review of attachment 9016313 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with AtomicOperations-shared-jit.h changes.

::: js/src/jit/shared/AtomicOperations-shared-jit.cpp
@@ +46,5 @@
> +
> +// Selected registers match the argument registers exactly, and none of them
> +// overlap the result register.
> +
> +static const LiveRegisterSet NonVolatileRegs;

nit: Prefix these with their namespaces: ::js::jit::NonVolatileRegs as there is already a NonVolatileRegs in wasm namespace.

@@ +167,5 @@
> +    ABIArg arg = iter->abi.next(t);
> +    switch (arg.kind()) {
> +      case ABIArg::GPR: {
> +        if (arg.gpr() != reg) {
> +            masm.movePtr(arg.gpr(), reg);

While I agree that with the predefined registers enumerated above these functions are going to behave correctly given the current set of arguments.
However, I suspect that this might become a unnoticed pit-fall for the next person adding a new function here, which is erasing the argument value with an already loaded argument.

Instead of having GenGprArg, I would recommend to add "getABIArg" to the MacroAssembler, using the MoveResolver like done in passABIArg. Then Add a function to resolve all the Move instructions [1].

[1] https://searchfox.org/mozilla-central/source/js/src/jit/x64/MacroAssembler-x64.cpp#342-352

@@ +875,5 @@
> +    AtomicCompilerFence = (void(*)())(code + nop);
> +#endif
> +
> +    AtomicLoad8SeqCst = (uint8_t(*)(const uint8_t* addr))(code + load8SeqCst);
> +    AtomicLoad16SeqCst = (uint16_t(*)(const uint16_t* addr))(code + load16SeqCst);

if-you-want-nit:

I wish that the C++ type system we could properly assert the function types against to the generate code.
Therefore I will suggest to change the offset to:

 - A structure to hold a typed offset:
> template <T>
> struct TypedOffset {
>     const uint32_t start;
>     TypedOffset(uint32_t start) : start_(start) {}
> };

 - Returned by each function:
> template <typename V>
> static TypedOffset<V(*)(const V*)>
> GenLoad(MacroAssembler& masm, Synchronization sync)
> {
>     Scalar::Type size = Scalar::FromType<V>();
>     ArgIterator iter;
>     uint32_t start = GenPrologue(masm, &iter);
>     GenGprArg(masm, MIRType::Pointer, &iter, AtomicPtrReg);
> 
>     // …
> }

 - A structure to cast TypedOffset into its typed pointer:
>    template <typename T>
>    SetFromTypedOffset(T& val, uint8_t* code, TypedOffset<T> offset) {
>        val = T(code + offset.start);
>    }

 - And used to set in the correct variable:
> auto load8SeqCst = GenLoad<uint8_t>(masm, Synchronization::Full());
> 
> […]
> 
> SetFromTypedOffset(AtomicLoad8SeqCst, code, load8SeqCst);

::: js/src/jit/shared/AtomicOperations-shared-jit.h
@@ +210,5 @@
> +        T oldval = *addr; /* good initial approximation */            \
> +        for (;;) {                                                    \
> +            T nextval = (T)AtomicCmpXchg64SeqCst((uint64_t*)addr,     \
> +                                                   (uint64_t)val,     \
> +                                                   (uint64_t)oldval); \

Swap oldval and val in this function call.

@@ +338,5 @@
> +        T oldval = *addr; /* Good initial approximation */              \
> +        for (;;) {                                                      \
> +            T nextval = (T)AtomicCmpXchg64SeqCst((uint64_t*)addr,       \
> +                                                   (uint64_t)(oldval + val), \
> +                                                   (uint64_t)oldval);   \

Swap these 2 lines in the function call: (oldval + val) is the new value.

@@ +408,5 @@
> +        T oldval = *addr;                                               \
> +        for (;;) {                                                      \
> +            T nextval = (T)AtomicCmpXchg64SeqCst((uint64_t*)addr,       \
> +                                                   (uint64_t)(oldval OP val), \
> +                                                   (uint64_t)oldval);   \

Swap these 2 lines in the function call to AtomicCmpXchg64SeqCst.
Attachment #9016313 - Flags: review?(nicolas.b.pierron) → review+
Attachment #9016314 - Flags: review?(nfroyd) → review+
Attachment #9016313 - Flags: review?(jwalden+bmo)

(In reply to Nathan Froyd [:froydnj] from comment #14)

Comment on attachment 9016313 [details] [diff] [review]
bug1394420-jit-generate-atomics-v7.patch

Review of attachment 9016313 [details] [diff] [review]:

I did not look super-closely at the generators, but from a superficial look,
they looked fine.

::: js/src/jit/shared/AtomicOperations-shared-jit.cpp
@@ +295,5 @@

+#endif

  •  default:
    
  •    MOZ_CRASH("Unknown size");
    
  • }
  • masm.memoryBarrier(sync.barrierAfter);

I think the semantics imposed by unconditionally generating barriers is
stronger than we need on e.g. x86. So this approach is not currently as
performant as we could be?

It doesn't do that. The sync object has a number of bitvectors specifying the barriers that are actually required for the context of the Synchronization. So if the Synchronization comes from an unordered access, barrierAfter will be zero, and no fences will be emitted. If the Synchronization comes from Synchronization::Load(), no code will be emitted on x86, due to the definition of memoryBarrier(). Only for Synchronization::Store and Synchronization::Full will there be a bit here that forces a fence emitted, as desired.

@@ +302,5 @@

  • return start;
    +}

+static uint32_t
+GenStore(MacroAssembler& masm, Scalar::Type size, Synchronization sync)

sync is apparently unused in this function, which seems suspicious.

Nice catch.

@@ +385,5 @@

  •        MOZ_CRASH("Unknown size");
    
  •    }
    
  •    offset += direction == CopyDir::DOWN ? 1 : -1;
    
  • }
  • masm.memoryBarrier(sync.barrierAfter);

We only need barriers before and after all the instructions, not in between?
(I'm not sure what the semantics are here, so asking mostly for my own
edification?

A good question. Right now all the callers of GenCopy() pass Synchronization::None(), so it's a moot point. For WebAssembly I expect the spec will say that the copy will have a barrier at the end. I'll remove the sync for now.

@@ +750,5 @@

  • uint32_t nop = GenNop(masm);
    +#endif
  • uint32_t load8SeqCst = GenLoad(masm, SIZE8, Synchronization::Full());
  • uint32_t load16SeqCst = GenLoad(masm, SIZE16, Synchronization::Full());

Is it worth having local synchronization objects that can be passed in here,
rather than continuously repeating Sychronization::FOO()?

Well... matter of taste? Let me think about it.

@@ +875,5 @@

  • AtomicCompilerFence = (void(*)())(code + nop);
    +#endif
  • AtomicLoad8SeqCst = (uint8_t()(const uint8_t addr))(code + load8SeqCst);
  • AtomicLoad16SeqCst = (uint16_t()(const uint16_t addr))(code + load16SeqCst);

Drive-by comments, but this is not code I own, so feel free to ignore!

  1. I think the use of variable names in the function type clutters things up.

Fair point, though they're a useful mnemonic device.

  1. I think all of these would be better served with a template like:

template<typename R, typename... Args>
R()(Args...)
Frob(uint8_t
ptr, uint32_t offset)
{
return (R(*))(Args...)(code + offset);
}

if that's even valid syntax. Then these turn into things like:

AtomicLoad8SeqCst = Frob<uint8_t(const uint8_t*)>(code, load8SeqCst);

which seems a little more readable? I guess the main shortening there comes
from taking out the variable name, so maybe it's not worth much.

Hm, yes. I guess the code's been written and is unlikely to be rewritten, so I may give this one a pass. Removing the variable names in the casts would help by itself, will give that a try.

(In reply to Nicolas B. Pierron [:nbp] from comment #15)

Comment on attachment 9016313 [details] [diff] [review]
bug1394420-jit-generate-atomics-v7.patch

Review of attachment 9016313 [details] [diff] [review]:

r=me with AtomicOperations-shared-jit.h changes.

::: js/src/jit/shared/AtomicOperations-shared-jit.cpp
@@ +46,5 @@

+// Selected registers match the argument registers exactly, and none of them
+// overlap the result register.
+
+static const LiveRegisterSet NonVolatileRegs;

nit: Prefix these with their namespaces: ::js::jit::NonVolatileRegs as there
is already a NonVolatileRegs in wasm namespace.

Renamed as AtomicNonVolatileRegs.

@@ +167,5 @@

> > +    ABIArg arg = iter->abi.next(t);
> > +    switch (arg.kind()) {
> > +      case ABIArg::GPR: {
> > +        if (arg.gpr() != reg) {
> > +            masm.movePtr(arg.gpr(), reg);

While I agree that with the predefined registers enumerated above these
functions are going to behave correctly given the current set of arguments.
However, I suspect that this might become a unnoticed pit-fall for the next
person adding a new function here, which is erasing the argument value with
an already loaded argument.

I agree this is a little brittle.

Instead of having GenGprArg, I would recommend to add "getABIArg" to the
MacroAssembler, using the MoveResolver like done in passABIArg. Then Add a
function to resolve all the Move instructions [1].

[1] https://searchfox.org/mozilla-central/source/js/src/jit/x64/MacroAssembler- x64.cpp#342-352

Given the very fixed functionality of this code and how long this patch has been languishing, I won't do so now, but I agree with this in principle. ISTR there were in fact some bugs during development resulting from the unchecked register constraints.

For now, I will:

  • beef up the comments a little to make it even clearer that these are hard constraints
  • add a pointer to the language you use above so that if the complexity of this code grows much (eg, MIPS will tend to need more registers) we will reconsider.
> ::: js/src/jit/shared/AtomicOperations-shared-jit.h
> @@ +210,5 @@
> > +        T oldval = *addr; /* good initial approximation */            \
> > +        for (;;) {                                                    \
> > +            T nextval = (T)AtomicCmpXchg64SeqCst((uint64_t*)addr,     \
> > +                                                   (uint64_t)val,     \
> > +                                                   (uint64_t)oldval); \
> 
> Swap oldval and val in this function call.

Nice catch, this was a cut-and-paste from MSVC code, which has reversed the arguments for this function according to what's normal.

Ditto the next two.

Pushed by lhansen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/2f5be1913934
jit-generate atomic ops to be called from c++.  r=nbp, r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/b2ffeeac7326
Consolidate feeling-lucky atomics.  r=froydnj
See Also: → 1521448

The problem here is that while ReturnReg (as defined in the assembler) is identical to the C/C++ return register on x86, ReturnReg64 is not, at least on Linux. So more register definitions are needed. There is a risk that some of this is platform-dependent of course. Investigating further.

Specifically, the C/C++ 64-bit return register is edx:eax, while our ReturnReg64 is edi:eax.

Flags: needinfo?(lhansen)

A better link: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=b2ffeeac7326d673ff09474ce3007e84687c5a62. The apparent ARM failures are emulator failures, thus the same bug. Thus it looks like only x86 is affected.

OK, several problems found:

  • on x86 (32-bit), the ReturnReg64 is incorrect, this is bug 1521512, I've worked around it
  • on ARM (32-bit), the LR is also a temp register that is used by the atomic core operations so the return address must be saved and restored
  • on ARM and ARM64, the argBase value (for reading arguments from the stack) incorrectly incorporated an adjustment for the pushed return address that is only appropriate for x86/x64

In general, prologue and epilogue are a little platform-dependent and therefore now have proper ifdef nestes with an #error for unsupported platforms, just like the register definitions.

Pushed by lhansen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/2d921f6ce0a9
jit-generate atomic ops to be called from c++.  r=nbp, r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/4e928a279f87
Consolidate feeling-lucky atomics.  r=froydnj
Status: ASSIGNED → RESOLVED
Closed: 9 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla66
Blocks: 1225033
You need to log in before you can comment on or make changes to this bug.