Closed Bug 1385998 Opened 3 years ago Closed 2 years ago

Reduce the cost of modifications to PseudoStack::stackPointer

Categories

(Core :: Gecko Profiler, enhancement, P3)

enhancement

Tracking

()

RESOLVED FIXED
mozilla60
Tracking Status
firefox60 --- fixed

People

(Reporter: mstange, Assigned: mstange)

References

Details

Attachments

(2 files)

At the moment, stackPointer is an Atomic<uint32_t, SequentiallyConsistent>, and we are modifying it using stackPointer++ and stackPointer--. Those modifications compile to "lock inc" and "lock dec" instructions on x86, which are expensive.

We can do better.
Summary: Reduce the overhead of the PseudoStack stackPointer modifications → Reduce the overhead of modifications to PseudoStack::stackPointer
Status: NEW → ASSIGNED
Summary: Reduce the overhead of modifications to PseudoStack::stackPointer → Reduce the cost of modifications to PseudoStack::stackPointer
See bug 785440 comment 18 for numbers which prove the performance relevance of these patches.
Comment on attachment 8892614 [details]
Bug 1385998 - Don't use atomic increments / decrements on stackPointer.

https://reviewboard.mozilla.org/r/163592/#review169306

I appreciate the high comment lines to modified lines ratio in patches that touch atomic operations!

::: js/public/ProfilingStack.h:230
(Diff revision 1)
>          }
>  
>          // This must happen at the end! The compiler will not reorder this
>          // update because stackPointer is Atomic.
> -        stackPointer++;
> +        // Do the read and the write as two separate statements, in order to
> +        // make it clear that we don't need an atomic increment here.

Maybe "...we don't need an atomic read-modify-write operation here, which would be more expensive on x86 than the separate operations done here".  WDYT?  I don't know what the instruction sequences look like on ARM--this could be more expensive there; I guess if we were using `std::atomic`, this would be more like a relaxed load followed by an release store, which would express intent better, but I'm not sure we want to go that far.

And if we change this text, we'll need to change the comments below as well.  Or pull things out into a separate method--not sure which is better.
Attachment #8892614 - Flags: review?(nfroyd) → review+
Comment on attachment 8892615 [details]
Bug 1385998 - Use ReleaseAcquire memory ordering when modifying the PseudoStack.

https://reviewboard.mozilla.org/r/163594/#review169294

I think I buy that `ReleaseAcquire` for the stack pointer is correct; I am less sure about the comment below.

::: js/public/ProfilingStack.h:33
(Diff revision 1)
> +    // All the fields below are Atomic<...,ReleaseAcquire>, in order to make
> +    // sure that whenever we write to such a field, all previous writes from
> +    // the current thread are visible to other threads. This is important so
> +    // that a previous decrement of the PseudoStack's stackPointer is visible
> +    // to the sampler thread before we update information in the ProfileEntry
> +    // that the stackPointer was pointing at.

Thanks for all the commentary in these patches!  The second sentence doesn't seem to follow from the first, though: the first is talking about these files in ProfileEntry and then the second sentence says that all these things in ProfileEntry are necessary so that operations on PseudoStack are visible.  The types of things in ProfileEntry have nothing to do with how operations on the PseudoStack work, no?  Unless you mean that the writes to these variables should be making the decrement of the PseudoStack pointer visible, which seems like some sketchy action at a distance to me and which should already be handled by the appropriate Atomicness of the stack pointer itself.

I think the "before we update information in the ProfileEntry" qualification is trying to tie this all together, but my reading is that it does not do so effectively.  Can you try explaining it differently?

::: js/public/ProfilingStack.h:169
(Diff revision 1)
>      }
>  
>      // Note that the pointer returned might be invalid.
>      JSScript* rawScript() const {
>          MOZ_ASSERT(isJs());
> -        return (JSScript*)spOrScript;
> +        return (JSScript*)(void*)spOrScript;

It would be splendid if the casts here were made into C++-style casts, if possible.

::: js/src/vm/GeckoProfiler.cpp:481
(Diff revision 1)
>  
>  JS_PUBLIC_API(JSScript*)
>  ProfileEntry::script() const
>  {
>      MOZ_ASSERT(isJs());
> -    auto script = reinterpret_cast<JSScript*>(spOrScript);
> +    auto script = reinterpret_cast<JSScript*>((void*)spOrScript);

Likewise here for the casts.
Attachment #8892615 - Flags: review?(nfroyd) → review-
(In reply to Nathan Froyd [:froydnj] from comment #6)
> Unless you mean that the writes to
> these variables should be making the decrement of the PseudoStack pointer
> visible, which seems like some sketchy action at a distance to me

That's exactly what I mean, yes. I agree it's a bit sketchy.

> and which
> should already be handled by the appropriate Atomicness of the stack pointer
> itself.

Is it, though? If it is, I'd be extremely happy. See the stackoverflow question for the case that I'm concerned about.
(In reply to Nathan Froyd [:froydnj] from comment #6)
> Unless you mean that the writes to
> these variables should be making the decrement of the PseudoStack pointer
> visible, which seems like some sketchy action at a distance to me and which
> should already be handled by the appropriate Atomicness of the stack pointer
> itself.

I can use stackPointer.exchange(oldStackPointer - 1) in order to do the decrement, which is a release-acquire read-modify-write operation. And http://en.cppreference.com/w/cpp/atomic/memory_order says about that: "No memory reads or writes in the current thread can be reordered before or after this store."

Would that work?
(In reply to Markus Stange [:mstange] from comment #8)
> I can use stackPointer.exchange(oldStackPointer - 1) in order to do the
> decrement, which is a release-acquire read-modify-write operation.

Oh no, using exchange causes an xchng instruction to be emitted, which is slow. I guess I'll just do another acquire-load after the decrement.
The person on the stackoverflow question suggests using std::atomic_signal_fence instead. What do you think about that?
Thread suspension seems to be a required element of any testcase that demonstrates the problem I'm having here, and on Linux, we suspend the thread using a signal handler, so on the surface it seems like the right idea.
I don't know if std::atomic_signal_fence would have any effect on Windows.
(In reply to Markus Stange [:mstange] from comment #14)
> The person on the stackoverflow question suggests using
> std::atomic_signal_fence instead. What do you think about that?
> Thread suspension seems to be a required element of any testcase that
> demonstrates the problem I'm having here, and on Linux, we suspend the
> thread using a signal handler, so on the surface it seems like the right
> idea.
> I don't know if std::atomic_signal_fence would have any effect on Windows.

Let me make sure I understand the problem by restating things.

* We're going to write to various things in a ProfileEntry at the stack pointer;
* We're then going to increment the stack pointer to indicate where the next frame/entry should go.
* We want to ensure two things:
  * The compiler is not free to reorder reads and writes from before the increment to after the increment
  * The compiler is not free to reorder reads and writes from after the increment to before the increment
* We also care about enforcing the same restrictions on the *hardware*, correct?
* We want some sort of high performance here, i.e. we want to use the weakest memory barrier instructions we possibly can, none if at all possible.
* The patch proposes using release stores (and acquire loads) to manipulate the stack pointer--in particular, updating the stack pointer with a release store.
* We could do all of this with sequentially consistent operations, but that imposes an unacceptable performance cost.
* The release store guarantees that all prior writes will a) not be reordered after the store and b) will become visible to other threads.
* The release store *doesn't* guarantee that operations from *after* the store won't be reordered *before* the store (I think this is your Stack Overflow question, basically).
* So we need something else to make that second guarantee for us.

Is that correct?

SO suggests atomic_signal_fence, which the standard says:

Effects: equivalent to atomic_thread_fence(order), except that the resulting ordering constraints
are established only between a thread and a signal handler executed in the same thread.

Note: atomic_signal_fence can be used to specify the order in which actions performed by the thread
become visible to the signal handler.

Note: compiler optimizations and reorderings of loads and stores are inhibited in the same way as with
atomic_thread_fence, but the hardware fence instructions that atomic_thread_fence would have
inserted are not emitted.

Do we care about the lack of hardware ordering, since we don't get that with atomic_signal_fence?

Is the directionality of atomic_signal_fence (see first note, above) the correct direction for what we're doing here?
Flags: needinfo?(mstange)
Thank you for the thorough review.

(In reply to Nathan Froyd [:froydnj] from comment #15)
> (In reply to Markus Stange [:mstange] from comment #14)
> > The person on the stackoverflow question suggests using
> > std::atomic_signal_fence instead. What do you think about that?
> > Thread suspension seems to be a required element of any testcase that
> > demonstrates the problem I'm having here, and on Linux, we suspend the
> > thread using a signal handler, so on the surface it seems like the right
> > idea.
> > I don't know if std::atomic_signal_fence would have any effect on Windows.
> 
> Let me make sure I understand the problem by restating things.
> 
> * We're going to write to various things in a ProfileEntry at the stack
> pointer;
> * We're then going to increment the stack pointer to indicate where the next
> frame/entry should go.
> * We want to ensure two things:
>   * The compiler is not free to reorder reads and writes from before the
> increment to after the increment

This is precisely correct.

>   * The compiler is not free to reorder reads and writes from after the
> increment to before the increment

Here it's the *decrement* that I care about, not the increment. So make that:

 * The compiler is not free to reorder reads and writes from after the
decrement to before the decrement

We don't want the things that are written to the ProfileEntry by the *next* pseudo stack frame that we'll be pushing to be visible to the sampler thread before the decrement of the stack pointer is visible to the sampler thread.

> * We also care about enforcing the same restrictions on the *hardware*,
> correct?

Yes. But I'm less certain about what that means in practice.
I think on x86, the hardware already guarantees everything we need. (Sorry for the slightly imprecise wording...)
On arm, it probably does not. But if we're on arm, then we're on Android. And on Android, we use the same method to interrupt the thread as on Linux: we send a signal to the thread, and then sem_post to a semaphore in order to let the sampler thread know that the thread is paused (and sem_wait until the sampler is done). See http://searchfox.org/mozilla-central/rev/13148faaa91a1c823a7d68563d9995480e714979/tools/profiler/core/platform-linux-android.cpp#137
And my expectation would be that sem_post enforces sequential consistency.

> * We want some sort of high performance here, i.e. we want to use the
> weakest memory barrier instructions we possibly can, none if at all possible.
> * The patch proposes using release stores (and acquire loads) to manipulate
> the stack pointer--in particular, updating the stack pointer with a release
> store.
> * We could do all of this with sequentially consistent operations, but that
> imposes an unacceptable performance cost.
> * The release store guarantees that all prior writes will a) not be
> reordered after the store and b) will become visible to other threads.
> * The release store *doesn't* guarantee that operations from *after* the
> store won't be reordered *before* the store (I think this is your Stack
> Overflow question, basically).
> * So we need something else to make that second guarantee for us.

This part is entirely correct.

> SO suggests atomic_signal_fence, which the standard says:
> 
> Effects: equivalent to atomic_thread_fence(order), except that the resulting
> ordering constraints
> are established only between a thread and a signal handler executed in the
> same thread.
> 
> Note: atomic_signal_fence can be used to specify the order in which actions
> performed by the thread
> become visible to the signal handler.
> 
> Note: compiler optimizations and reorderings of loads and stores are
> inhibited in the same way as with
> atomic_thread_fence, but the hardware fence instructions that
> atomic_thread_fence would have
> inserted are not emitted.
> 
> Do we care about the lack of hardware ordering, since we don't get that with
> atomic_signal_fence?

I don't know. My guess would be that the answer to this question depends on how SuspendThread (Windows) and thread_suspend (macOS) work at the hardware level. Which is something I don't know.

> Is the directionality of atomic_signal_fence (see first note, above) the
> correct direction for what we're doing here?

I think so. If we insert the fence just after the decrement of the stack pointer, then the signal handler will either not see the decrement and also won't see anything that would have been written after the decrement, or it will see the decrement and doesn't care about things that were written after the decrement.
Flags: needinfo?(mstange)
I would be happy to go back to the first version of this patch, to what you called "sketchy action" - use release-stores to put the data into ProfileEntry. It's located in an unexpected place, but it's easier to reason about (I think) and seems sound to me.
(In reply to Markus Stange [:mstange] from comment #16)
> Thank you for the thorough review.
> 
> (In reply to Nathan Froyd [:froydnj] from comment #15)
> > (In reply to Markus Stange [:mstange] from comment #14)
> > > The person on the stackoverflow question suggests using
> > > std::atomic_signal_fence instead. What do you think about that?
> > > Thread suspension seems to be a required element of any testcase that
> > > demonstrates the problem I'm having here, and on Linux, we suspend the
> > > thread using a signal handler, so on the surface it seems like the right
> > > idea.
> > > I don't know if std::atomic_signal_fence would have any effect on Windows.
> > 
> > Let me make sure I understand the problem by restating things.
> > 
> > * We're going to write to various things in a ProfileEntry at the stack
> > pointer;
> > * We're then going to increment the stack pointer to indicate where the next
> > frame/entry should go.
> > * We want to ensure two things:
> >   * The compiler is not free to reorder reads and writes from before the
> > increment to after the increment
> 
> This is precisely correct.
> 
> >   * The compiler is not free to reorder reads and writes from after the
> > increment to before the increment
> 
> Here it's the *decrement* that I care about, not the increment. So make that:
> 
>  * The compiler is not free to reorder reads and writes from after the
> decrement to before the decrement
> 
> We don't want the things that are written to the ProfileEntry by the *next*
> pseudo stack frame that we'll be pushing to be visible to the sampler thread
> before the decrement of the stack pointer is visible to the sampler thread.

OK, wait.  We have increments *and* decrements?  Where are the decrements coming from?

> > * We also care about enforcing the same restrictions on the *hardware*,
> > correct?
> 
> Yes. But I'm less certain about what that means in practice.
> I think on x86, the hardware already guarantees everything we need. (Sorry
> for the slightly imprecise wording...)

What, exactly, do we need?  It looks like the constraints on Intel are:

* Reads are not reordered with other reads
* Writes are not reordered with older reads

which I think means that reads never "float up" to before other operations;

* Writes are not reordered with other writes, except for a few cases that we don't care about
* Reads *may* be reordered with older writes to different locations, but not with older writes to the same location

I think that means that we just need to stop reads from being reordered--which I think is identical to the problem you describe above about things not being reordered with the stack pointer decrement.  So I think we really do need some sort of hardware barrier.

> On arm, it probably does not. But if we're on arm, then we're on Android.
> And on Android, we use the same method to interrupt the thread as on Linux:
> we send a signal to the thread, and then sem_post to a semaphore in order to
> let the sampler thread know that the thread is paused (and sem_wait until
> the sampler is done). See
> http://searchfox.org/mozilla-central/rev/
> 13148faaa91a1c823a7d68563d9995480e714979/tools/profiler/core/platform-linux-
> android.cpp#137
> And my expectation would be that sem_post enforces sequential consistency.
> 
> > * We want some sort of high performance here, i.e. we want to use the
> > weakest memory barrier instructions we possibly can, none if at all possible.
> > * The patch proposes using release stores (and acquire loads) to manipulate
> > the stack pointer--in particular, updating the stack pointer with a release
> > store.
> > * We could do all of this with sequentially consistent operations, but that
> > imposes an unacceptable performance cost.
> > * The release store guarantees that all prior writes will a) not be
> > reordered after the store and b) will become visible to other threads.
> > * The release store *doesn't* guarantee that operations from *after* the
> > store won't be reordered *before* the store (I think this is your Stack
> > Overflow question, basically).
> > * So we need something else to make that second guarantee for us.
> 
> This part is entirely correct.
> 
> > SO suggests atomic_signal_fence, which the standard says:
> > 
> > Effects: equivalent to atomic_thread_fence(order), except that the resulting
> > ordering constraints
> > are established only between a thread and a signal handler executed in the
> > same thread.
> > 
> > Note: atomic_signal_fence can be used to specify the order in which actions
> > performed by the thread
> > become visible to the signal handler.
> > 
> > Note: compiler optimizations and reorderings of loads and stores are
> > inhibited in the same way as with
> > atomic_thread_fence, but the hardware fence instructions that
> > atomic_thread_fence would have
> > inserted are not emitted.
> > 
> > Do we care about the lack of hardware ordering, since we don't get that with
> > atomic_signal_fence?
> 
> I don't know. My guess would be that the answer to this question depends on
> how SuspendThread (Windows) and thread_suspend (macOS) work at the hardware
> level. Which is something I don't know.
> 
> > Is the directionality of atomic_signal_fence (see first note, above) the
> > correct direction for what we're doing here?
> 
> I think so. If we insert the fence just after the decrement of the stack
> pointer, then the signal handler will either not see the decrement and also
> won't see anything that would have been written after the decrement, or it
> will see the decrement and doesn't care about things that were written after
> the decrement.

I think if we can rely on some sort of memory-ordering properties of x86 and knowledge of exactly how the profiler works, we can use the compiler barrier for x86 only.  For ARM, I guess we can rely on how the profiler works, perhaps with some BIG SCARY COMMENTS about how you should not modify those particular mechanisms unless you are willing to modify this unrelated code over here and vice versa.  Or do we need the compiler barrier for ARM as well?  For everything else, we can use the strong atomic_thread_fence(sequential_consistency.  WDYT?
Flags: needinfo?(mstange)
Priority: -- → P3
Comment on attachment 8892615 [details]
Bug 1385998 - Use ReleaseAcquire memory ordering when modifying the PseudoStack.

Canceling review to move this out of my queue while the ni? waits for an answer.
Attachment #8892615 - Flags: review?(nfroyd)
Sorry for the extremely long delay in replying to this comment.

(In reply to Nathan Froyd [:froydnj] from comment #18)
> OK, wait.  We have increments *and* decrements?  Where are the decrements
> coming from?

We use the pseudo stack to keep track of functions that are running. So when the function starts, we need to push an entry, and when the function ends, we need to pop an entry. These pushes and pops need to do increments and decrements of the pseudo stack "stack pointer".

Here's the code I used to demonstrate this in the stackoverflow issue:

  theStack.stackFrames[1] = StackFrame{ "someFunction", 30 };      // A
  theStack.stackTop.store(1, std::memory_order_seq_cst);           // B
  someFunction();                                                  // C
  theStack.stackTop.store(0, std::memory_order_seq_cst);           // D

  theStack.stackFrames[1] = StackFrame{ "someOtherFunction", 35 }; // E
  theStack.stackTop.store(1, std::memory_order_seq_cst);           // F
  someOtherFunction();                                             // G
  theStack.stackTop.store(0, std::memory_order_seq_cst);           // H

In that code,
 theStack.stackTop.store(1, ...) is the "increment", and
 theStack.stackTop.store(0, ...) is the "decrement".
I used stores with 1 and 0 instead of increments and decrements so that we don't get distracted by thinking about atomic increments and decrements: Only one thread ever writes to the stack pointer. So we only need to think about the atomicity and ordering of the write, not of the read on this thread.

> > > * We also care about enforcing the same restrictions on the *hardware*,
> > > correct?
> > 
> > Yes. But I'm less certain about what that means in practice.
> > I think on x86, the hardware already guarantees everything we need. (Sorry
> > for the slightly imprecise wording...)
> 
> What, exactly, do we need?

Quoting from the stack overflow question again:

The central requirement is: When the sampler thread suspends the target thread and reads stackTop == 1, then the information in stackFrames[1] needs to be fully present and consistent. This means:

 - When B is observed, A must also be observed. ("Don't increment stackTop before putting the stack frame in place.")
 - When E is observed, D must also be observed. ("When putting the next frame's information in place, the previous stack frame must have been exited.")

The first item can be achieved by doing the stack pointer increment as a release-store. This gives us the guarantee that any previous writes (in our case, putting the pseudoframe information into the pseudostack) do not get reordered after the stack pointer write.
The second item is not achieved by making the stack pointer *decrement* a release-store, because we need to have a guarantee about writes that happen *after* the stack pointer write in program order.

>  It looks like the constraints on Intel are:
> 
> * Reads are not reordered with other reads
> * Writes are not reordered with older reads
> 
> which I think means that reads never "float up" to before other operations;
> 
> * Writes are not reordered with other writes, except for a few cases that we
> don't care about
> * Reads *may* be reordered with older writes to different locations, but not
> with older writes to the same location
> 
> I think that means that we just need to stop reads from being
> reordered--which I think is identical to the problem you describe above
> about things not being reordered with the stack pointer decrement.  So I
> think we really do need some sort of hardware barrier.

The reason I thought we wouldn't need a hardware barrier is the following: In the first patch which I attached to this bug, which you referred to as "sketchy action at a distance", I achieved all the guarantees I needed using only ReleaseAcquire atomics, and when I looked at the generated x86 code I didn't see any lock or xchng instructions.
Flags: needinfo?(mstange)
I've gone back to the approach that I used in the first version of this patch: using release-writes for the fields of a ProfileEntry. I think this solution is the easiest to reason about, even if the effects are a bit non-local.
(In reply to Markus Stange [:mstange] from comment #20)
> >  It looks like the constraints on Intel are:
> > 
> > * Reads are not reordered with other reads
> > * Writes are not reordered with older reads
> > 
> > which I think means that reads never "float up" to before other operations;
> > 
> > * Writes are not reordered with other writes, except for a few cases that we
> > don't care about
> > * Reads *may* be reordered with older writes to different locations, but not
> > with older writes to the same location
> > 
> > I think that means that we just need to stop reads from being
> > reordered--which I think is identical to the problem you describe above
> > about things not being reordered with the stack pointer decrement.  So I
> > think we really do need some sort of hardware barrier.
> 
> The reason I thought we wouldn't need a hardware barrier is the following:
> In the first patch which I attached to this bug, which you referred to as
> "sketchy action at a distance", I achieved all the guarantees I needed using
> only ReleaseAcquire atomics, and when I looked at the generated x86 code I
> didn't see any lock or xchng instructions.

I am unsure whether "the code looks the way I want it to" actually matches "the code actually does what it's supposed to" in this case.  In particular, we haven't explicitly ordered the hardware to do what we want here, no?  (Based on a quick and possibly incomplete re-read of previous discussion.)
Comment on attachment 8892615 [details]
Bug 1385998 - Use ReleaseAcquire memory ordering when modifying the PseudoStack.

https://reviewboard.mozilla.org/r/163594/#review218282

We really should have talked more about this in Austin, that's my fault for not being on the ball with that.  Comment below.

::: js/public/ProfilingStack.h:34
(Diff revision 4)
> +    // All the fields below are Atomic<...,ReleaseAcquire>. This is needed so
> +    // that writes to these fields are release-writes, which ensures that
> +    // earlier writes in this thread don't get reordered after the writes to
> +    // these fields. In particular, the decrement of the stack pointer in
> +    // PseudoStack::pop() is a write that *must* happen before the values in
> +    // this ProfileEntry are changed. Otherwise, the sampler thread might see
> +    // an inconsistent state where the stack pointer still points to a
> +    // ProfileEntry which has already been popped off the stack and whose
> +    // fields have now been partially repopulated with new values.

Thank you for expanding this comment.

I think we want some "WARNING WARNING WARNING"-esque text somewhere in this patch, either here or in `PseudoStack` or (probably best) both, because this is going back to the "sketchy action at a distance" original patch.  I think--based on my hazy remembering of our conversation in Austin and my attempt to swap the discussion in this bug back in--is that if we *didn't* want the sketchy action at a distance version, we'd be forced to insert explicit hardware barriers at certain points to guarantee that the hardware would do what we wanted.  Indeed, that's the state of things today, with the sequentially consistent stack pointer.

But that's no good for the common target that we care about, x86/64, because the hardware actually provides the guarantees that we care about (I think?), so the explicit barrier is just slowing things down.  So this patch is more-or-less a hack to convince the compiler to DTRT and also generate code that runs fast on the hardware that we care about.  (Assuming this particular patch correctly tells the hardware what's going on, see comment 24.)  Is that a correct summary of what we're attempting to do here?

It's worth noting that marking all of these `ReleaseAcquire` will actually slow things down on ARM/64, right?  Because `ReleaseAcquire` there will insert memory barriers that *weren't* there before.  Do we think that's acceptable, or should we do something like:

```
#ifdef x86ish
#define PROFILE_ENTRY_CONSISTENCY mozilla::ReleaseAcquire
#define PSEUDOSTACK_STACK_POINTER_CONSISTENCY mozilla::ReleaseAcquire
#else
#define PROFILE_ENTRY_CONSISTENCY mozilla::Relaxed
#define PSEUDOSTACK_STACK_POINTER_CONSISTENCY mozilla::SequentiallyConsistent
#endif
```

with a scary comment about what we're doing here?

If we do go with some variant of this patch, it'd be good to discuss the alternatives in a comment somewhere so people who want to modify this code know exactly what they're getting into.
Attachment #8892615 - Flags: review?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #24)
> > The reason I thought we wouldn't need a hardware barrier is the following:
> > In the first patch which I attached to this bug, which you referred to as
> > "sketchy action at a distance", I achieved all the guarantees I needed using
> > only ReleaseAcquire atomics, and when I looked at the generated x86 code I
> > didn't see any lock or xchng instructions.
> 
> I am unsure whether "the code looks the way I want it to" actually matches
> "the code actually does what it's supposed to" in this case.  In particular,
> we haven't explicitly ordered the hardware to do what we want here, no? 
> (Based on a quick and possibly incomplete re-read of previous discussion.)

I believe that we have explicitly ordered the *compiler* to do what we want: the stores to the stack pointer and the profile entry fields are all release-stores now and I believe this should ensure correct ordering.
And since the compiler is not generating any explicit hardware barriers on x86, I think we can infer that the x86 hardware doesn't *need* any explicit hardware barriers to do what we want. (I haven't read up on x86 to confirm that this is actually the case, but I don't think I have a reason to distrust the compiler on this.)

(In reply to Nathan Froyd [:froydnj] from comment #25)
> I think we want some "WARNING WARNING WARNING"-esque text somewhere in this
> patch, either here or in `PseudoStack` or (probably best) both, because this
> is going back to the "sketchy action at a distance" original patch.

Ok, I'll add some more comments. When I touched this patch again last week, I realized that the "distance" isn't actually all that large; all the pieces that are involved with it are defined in the same file and are pretty close to each other.

> I think--based on my hazy remembering of our conversation in Austin and my
> attempt to swap the discussion in this bug back in--is that if we *didn't*
> want the sketchy action at a distance version, we'd be forced to insert
> explicit hardware barriers at certain points to guarantee that the hardware
> would do what we wanted.  Indeed, that's the state of things today, with the
> sequentially consistent stack pointer.

I believe that is true.

> But that's no good for the common target that we care about, x86/64, because
> the hardware actually provides the guarantees that we care about (I think?),
> so the explicit barrier is just slowing things down.  So this patch is
> more-or-less a hack to convince the compiler to DTRT and also generate code
> that runs fast on the hardware that we care about.

Agreed.

> (Assuming this
> particular patch correctly tells the hardware what's going on, see comment
> 24.)  Is that a correct summary of what we're attempting to do here?

Yes.

> It's worth noting that marking all of these `ReleaseAcquire` will actually
> slow things down on ARM/64, right?  Because `ReleaseAcquire` there will
> insert memory barriers that *weren't* there before.

You're right! This is something I hadn't considered before.

> Do we think that's
> acceptable, or should we do something like:
> 
> ```
> #ifdef x86ish
> #define PROFILE_ENTRY_CONSISTENCY mozilla::ReleaseAcquire
> #define PSEUDOSTACK_STACK_POINTER_CONSISTENCY mozilla::ReleaseAcquire
> #else
> #define PROFILE_ENTRY_CONSISTENCY mozilla::Relaxed
> #define PSEUDOSTACK_STACK_POINTER_CONSISTENCY mozilla::SequentiallyConsistent
> #endif
> ```
> 
> with a scary comment about what we're doing here?

Hmm. I'd prefer to wait until we've done some performance testing on ARM before doing this. But it's a good option to keep in mind.

> If we do go with some variant of this patch, it'd be good to discuss the
> alternatives in a comment somewhere so people who want to modify this code
> know exactly what they're getting into.

Ok, I'll try. Thanks!
I've updated the patch and added more comments. Sorry for the extremely slow iteration time here, Nathan...
Comment on attachment 8892615 [details]
Bug 1385998 - Use ReleaseAcquire memory ordering when modifying the PseudoStack.

https://reviewboard.mozilla.org/r/163594/#review224324

Feel free to take a long time circling back around if it means we always get comments like the one in this patch!
Attachment #8892615 - Flags: review?(nfroyd) → review+
Pushed by mstange@themasta.com:
https://hg.mozilla.org/integration/autoland/rev/2f1f2a979ff1
Don't use atomic increments / decrements on stackPointer. r=froydnj
https://hg.mozilla.org/integration/autoland/rev/f5dbc10dacf5
Use ReleaseAcquire memory ordering when modifying the PseudoStack. r=froydnj
https://hg.mozilla.org/mozilla-central/rev/2f1f2a979ff1
https://hg.mozilla.org/mozilla-central/rev/f5dbc10dacf5
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla60
You need to log in before you can comment on or make changes to this bug.