Closed Bug 1080584 Opened 5 years ago Closed 5 years ago

Clean up allocateFromArenaInline() some more

Categories

(Core :: JavaScript: GC, defect)


Tracking


RESOLVED FIXED
mozilla35

People

(Reporter: ehoogeveen, Assigned: ehoogeveen)

References

Details

Attachments

(2 files, 2 obsolete files)

Completely contradicting what I said on IRC yesterday, I decided to take a stab at this. "some more" because of bug 1007549!
This compiles locally for me, which will have to do for now, as try is closed. Some notes about this patch:

* Mostly orthogonal to bug 1074961, but it will need some light rebasing after that lands (due to the renaming).

* Names need work, suggestions appreciated. s/allocateFromArena/allocateFromArenaWrapped/ wasn't my finest hour ;)

* This removes a lot of the big comments in favor of smaller ones and letting the code document itself. I felt the old comments were adding more clutter than clarity. Setting the background finalization state to BFS_DONE might need a comment of some kind, preferably something better than 'look at the comments above the definition of BackgroundFinalizeState'.

* I'm fairly certain that the backgroundFinalizationIsRunning check is no longer needed. We *used* to update the arena lists directly during background finalization, but now that we use SortedArenaList as a temporary container, this should no longer be an issue. At the same time, I don't think we'll ever have any partially full arenas available during background finalization now, so there's no advantage either. It's just a bit less state.

* I replaced the null check after chunk->allocateArena with an assert - surely it would be a bug if the chunk returned by pickChunk couldn't hold another arena. I only skimmed pickChunk to confirm this though.

* I split the common logic off into allocateFromArenaInner (again, name might need work). This does introduce a runtime check to differentiate the two cases (a pre-existing arena or a newly allocated one) - but keeping that part of the logic in allocateFromArena was a lot uglier. I considered adding another parameter to allocateFromArenaInner or making it into a function template, but it seemed kind of ugly. Suggestions welcome if you think the branch will be a problem (if the compiler inlines the two uses, the branch should be 100% predictable).

* I renamed the AutoMaybeStartBackgroundAllocation to the somewhat less redundant and shorter maybeAllocOutsideGCLock. I considered alternatives like 'allocateUponDestruction', but I'm not sure what the right interpretation really is.
Attachment #8502539 - Flags: review?(terrence)
Depends on: 1074961
Ah, a few more things:

* Currently the arenaAllocatedDuringGC() call uses runtimeFromMainThread() in the fast path, but runtimeFromAnyThread() (through the existing rt pointer) in the slow path. When combining the logic I went with runtimeFromAnyThread() since I'm not sure we're even guaranteed to be on the main thread here, but that might need some scrutiny.

* Is it possible for different threads to call allocateFromArena at the same time? Parallel JS apparently uses a different ArenaLists for each individual thread, but are there other cases where this might happen?

* One other thing I was considering: if the old comment is correct, we only take the GC lock before pickChunk because of parallel JS. If that's true, we could templatize allocFromArena and just call the right version from each refillFreeList variant. The only difference would be that the main thread version could avoid taking the GC lock altogether if background finalization is idle.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #2)
> The only difference would be that the main thread
> version could avoid taking the GC lock altogether if background finalization
> is idle.

Actually I guess that wouldn't work if parallel JS was active. You could probably check for that, but that might get messy.
Attached patch allocateFromArena (obsolete) — Splinter Review
Since it can be hard to tell with a diff, here's what it looks like afterward. Also, some trailing spaces snuck into the patch; fixed that locally.
Comment on attachment 8502539 [details] [diff] [review]
Clean up ArenaLists::allocateFromArenaInline some more (and rename it).

Review of attachment 8502539 [details] [diff] [review]:
-----------------------------------------------------------------

\o/ This is great! I just have a couple of naming requests, and a few quick pointers on making that branch static. Given this is our core arena allocation routine, we should probably go to the effort, even if GGC takes most of the pressure off.

::: js/src/jsgc.cpp
@@ +1925,5 @@
> +
> +void *
> +ArenaLists::allocateFromArena(JS::Zone *zone, AllocKind thingKind,
> +                              AutoMaybeStartBackgroundAllocation &maybeAllocOutsideGCLock)
> +{   

Trailing whitespace.

@@ +1935,5 @@
>          maybeLock.lock(rt);
> +        if (backgroundFinalizeState[thingKind] == BFS_JUST_FINISHED)
> +            backgroundFinalizeState[thingKind] = BFS_DONE;
> +    }
> +    

Trailing whitespace.

@@ +1958,2 @@
>      aheader = chunk->allocateArena(zone, thingKind);
> +    MOZ_ASSERT(aheader);

I agree! We're under the GC lock here even. There really is no way this could fail.

@@ +1969,5 @@
> +{
> +    size_t thingSize = Arena::thingSize(thingKind);
> +
> +    FreeSpan span;
> +    if (aheader->hasFreeThings()) {

Actually, making this branch static is /extremely/ easy, modulo PJS. Please change the decl of this method to be:

enum ArenaAllocMode { HasFreeThings = true, IsEmpty = false };
template <ArenaAllocMode hasFreeThings> 
inline void* allocateFromArenaInner(JS::Zone *zone, ArenaHeader *aheader, AllocKind thingKind);

Then you can resolve this branch statically on hasFreeThings and have everything boil down to the same code as is currently present.

For the pick chunk case, we'll obviously want:
allocateFromArenaInner<IsEmpty>(zone, aheader, thingKind);

But I'm a bit confused about the other case. It looks like the current code is doing:
allocateFromArenaInner<HasFreeThings>(zone, aheader, thingKind);

But it seems that to handle PJS you'd really want it to be something like:
if (aheader->isEmpty()) {
    MOZ_ASSERT(InParallelSection());
    allocateFromArenaInner<IsEmpty>(zone, aheader, thingKind);
} else {
    allocateFromArenaInner<HasFreeThings>(zone, aheader, thingKind);
}

I'm not sure how the code works as is. If you think it would be better as the above, please do change it as such: I'm sure we'll come to our collective senses and remove PJS eventually. When we do, the assert will find this case and make it easy to fix the code to be fully static at that time.

::: js/src/jsgc.h
@@ +853,5 @@
>      inline void forceFinalizeNow(FreeOp *fop, AllocKind thingKind);
>      inline void queueForForegroundSweep(FreeOp *fop, AllocKind thingKind);
>      inline void queueForBackgroundSweep(FreeOp *fop, AllocKind thingKind);
>  
> +    void *allocateFromArenaWrapped(JS::Zone *zone, AllocKind thingKind);

I think this should continue being allocateFromArena: AutoMaybeStartBackgroundAllocation is an internal detail, so we should (1) move this decl up to the public section and (2) unfriend Nursery and FJSNursery (only friended for this method IIRC). I think the other allocateFromArena can continue being called allocateFromArena, as it's the same call, but with slightly more detail exposed (and private so it's only callable from GCRuntime).

@@ +855,5 @@
>      inline void queueForBackgroundSweep(FreeOp *fop, AllocKind thingKind);
>  
> +    void *allocateFromArenaWrapped(JS::Zone *zone, AllocKind thingKind);
> +    void *allocateFromArena(JS::Zone *zone, AllocKind thingKind,
> +                            AutoMaybeStartBackgroundAllocation &maybeAllocOutsideGCLock);

The patch in bug 1074961 uses maybeStartBGAlloc, which I think is even shorter.
Attachment #8502539 - Flags: review?(terrence) → review+
Carrying forward r=terrence. Rebased on top of attachment 8502638 [details] [diff] [review] from bug 1074961.

(In reply to Terrence Cole [:terrence] from comment #6)
> > +    MOZ_ASSERT(aheader);
> 
> I agree! We're under the GC lock here even. There really is no way this
> could fail.

For the record, this turned out to be wrong. Chunk::allocateArena can fail for various reasons unrelated to the state of the chunk; I added a comment to this effect.

> Actually, making this branch static is /extremely/ easy, modulo PJS. Please
> change the decl of this method to be:

Done. The declaration in jsgc.h became too long for a single line, so I broke with the usual style a bit and placed |template <...>| on its own line.

> AutoMaybeStartBackgroundAllocation is an internal detail, so we should (1)
> move this decl up to the public section and (2) unfriend Nursery and
> FJSNursery (only friended for this method IIRC). I think the other
> allocateFromArena can continue being called allocateFromArena, as it's the
> same call, but with slightly more detail exposed (and private so it's only
> callable from GCRuntime).

Done and done (and done).

> The patch in bug 1074961 uses maybeStartBGAlloc, which I think is even
> shorter.

Done.
Attachment #8502539 - Attachment is obsolete: true
Attachment #8502576 - Attachment is obsolete: true
Attachment #8502802 - Flags: review+
This compiles locally, but I need to clone mozilla-inbound on my VM before I can send it to try; doing that now.

Writing this patch brought me great joy.
Attachment #8502809 - Flags: review?(terrence)
Try is having some issues, so let's hope that works. jit-tests seem to be running just fine locally though.
Comment on attachment 8502809 [details] [diff] [review]
Part 2 v1: Remove BFS_JUST_FINISHED since it doesn't really add any protection.

Review of attachment 8502809 [details] [diff] [review]:
-----------------------------------------------------------------

Since I started working at mozilla, I've come to believe that the probability of a bug existing is proportional to the length of the explanatory comment. Given the number of barely-explicable words this patch removes, I'm inclined to think it must fix at least one bug. ;-)
Attachment #8502809 - Flags: review?(terrence) → review+
Figured I'd check Octane to be safe, difference with either of the patches is in the noise:

baseline: 33536, 33645, 32366, 32625, 33591, 32570
part 1:   33658, 32969, 32831, 32853, 33059, 33664
both:     32413, 33538, 32882, 33258, 33472, 33647
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/a95bb1fdcd5b
https://hg.mozilla.org/mozilla-central/rev/8551357e330e
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #8)
> Created attachment 8502809 [details] [diff] [review]
> Part 2 v1: Remove BFS_JUST_FINISHED since it doesn't really add any
> protection.

My understanding of this is that the BFS_JUST_FINISHED state exists so that we can execute a memory barrier but not have to take a lock the first time we allocate after the background thread has finished sweeping.  With this removed, it looks to me like we don't have the necessary synchronization here between the two threads.

The backgroundFinalizeState member is atomic, but with ReleaseAcquire memory ordering.  I believe this means we will see updates to this member performed by other threads, but not necessarily anything else.  Changing this to SequentiallyConsistent would mean we see all updates performed by other threads every time we read this value, so that would work, but at the cost of more overhead in the case where we've already observed that the background thread has finished sweeping.

So without it I don't think the main thread will necessarily see updates to arenaLists made by the background thread.

While I'd love to get rid of this code, I don't think we are safe without it.  What do you think -- is there some other way that this is safe?
Flags: needinfo?(emanuel.hoogeveen)
(In reply to Jon Coppeard (:jonco) from comment #15)
> My understanding of this is that the BFS_JUST_FINISHED state exists so that
> we can execute a memory barrier but not have to take a lock the first time
> we allocate after the background thread has finished sweeping.

That's what the description in backgroundFinalize made it sound like, but in fact it just made us take the GC lock in allocateFromArenaInline. There was certainly no explicit memory barrier on that path.

> The backgroundFinalizeState member is atomic, but with ReleaseAcquire memory
> ordering.  I believe this means we will see updates to this member performed
> by other threads, but not necessarily anything else.  Changing this to
> SequentiallyConsistent would lead us to see all updates performed by other
> threads every time we read this value, so that would work but would lead to
> more overhead in the case where we've already observed that the background
> thread has finished sweeping.
> 
> So without it I don't think the main thread will necessarily see updates to
> arenaLists made by the background thread.

IIUC this generally isn't an issue on strongly ordered platforms like x86, but might be a problem on ARM. I think if we actually want this safety though, we should insert an explicit memory barrier before releasing the GC lock in backgroundFinalize. That way we would only do it when needed and not have to worry about the overhead in allocateFromArena.

Terrence, does that sound reasonable? My threading-fu isn't strong enough that I can be 100% sure, though I'm pretty certain what we *used* to have here wasn't useful.
Flags: needinfo?(emanuel.hoogeveen) → needinfo?(terrence)
Oh, though if taking the GC lock isn't enough to ensure a memory barrier, backgroundFinalize might also get an out-of-date view of the ArenaList it's merging the finalized arenas back into.

If taking the lock *does* cause a memory barrier, then I guess taking it in allocateFromArenaInline might have done something - but it seems like a weird thing to rely on at best.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #16)
> That's what the description in backgroundFinalize made it sound like, but in
> fact it just made us take the GC lock in allocateFromArenaInline. There was
> certainly no explicit memory barrier on that path.

My description was off -- we need to take the lock the first time after the background thread finishes sweeping, and that provides the memory barrier.

> IIUC this generally isn't an issue on strongly ordered platforms like x86,
> but might be a problem on ARM.

Sure it's not a problem on x86, but I'm pretty sure it is a problem on ARM.

> I think if we actually want this safety
> though, we should insert an explicit memory barrier before releasing the GC
> lock in backgroundFinalize. That way we would only do it when needed and not
> have to worry about the overhead in allocateFromArena.

There needs to be appropriate memory barriers on both thread so that changes are propagated correctly.  Taking/releasing a lock will always execute a memory barrier, so the background thread is fine.  I think we still need something on the main thread though.
(In reply to Jon Coppeard (:jonco) from comment #18)
> Taking/releasing a lock will always execute a
> memory barrier, so the background thread is fine.  I think we still need
> something on the main thread though.

Won't a single memory barrier synchronize all threads? If we're already getting a memory barrier on both PR_Lock and PR_Unlock, then along with the atomic BFS I don't think the main thread can get an out-of-date view.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #19)
> If we're already getting a memory barrier on both PR_Lock and PR_Unlock,

We are - they invoke pthread_mutex_lock and pthread_mutex_unlock, which are both specced [1] to synchronize memory with other threads.

[1] http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_11
Hmm, there's one edge case I think we should handle though: currently we set the BFS to BFS_DONE *before* releasing the GC lock, which potentially leaves a period before the memory barrier where allocateFromArena could see BFS == BFS_DONE.

I think the easiest way to deal with this is to simply wrap the relevant bit of backgroundFinalize in a local block scope. I'll whip up a patch.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #19)
> Won't a single memory barrier synchronize all threads? 

No, I don't think that's how barriers work.  They operate in pairs, and you need one on both threads.

There's some information below, particularly the section "SMP Barrier Pairing":

https://www.kernel.org/doc/Documentation/memory-barriers.txt

> [PR_Lock and PR_Unlock] invoke pthread_mutex_lock and pthread_mutex_unlock, which are both 
> specced [1] to synchronize memory with other threads.

Well the assumption here is that the other threads will also lock/unlock the same mutex to achieve synchronization.  We're explicitly trying to avoid having to do that here.
Depends on: 1081952
Ugh, that's terrifying. However, after bug 1081952 I think we're okay here: 

(1) While BFS == BFS_RUN, we always take the GC lock in allocateFromArena. That means that if we allocate any arenas during background finalization, we'll have memory barriers on both sides, ensuring that backgroundFinalize will see the allocated arenas.

(2) If allocateFromArena runs while backgroundFinalize is merging the lists, BFS == BFS_RUN and it will wait on the GC lock, which means we'll have memory barriers on both sides and allocateFromArena will see the merged list.

(3) If allocateFromArena runs after backgroundFinalize releases the GC lock but before BFS = BFS_DONE, it'll take the GC lock and we'll have memory barriers on both sides.

(4) If allocateFromArena runs after backgroundFinalize releases the GC lock and after it sets BFS to BFS_DONE, we won't get a memory barrier on the main thread. However, because BFS is set to BFS_DONE *after* the memory barrier happens on the background thread, the memory barrier guarantees that all the stores that happened before it (and by extension before BFS = BFS_DONE) will be globally visible. Therefore the main thread is guaranteed to see the ArenaList as the background thread left it.

But right now, it seems like things are broken. (4) is also pretty subtle (and I really hope I didn't misinterpret anything), but not necessarily any more subtle than the BFS_JUST_FINISHED situation.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #23)

I agree with points 1 to 3...

> (4) If allocateFromArena runs after backgroundFinalize releases the GC lock
> and after it sets BFS to BFS_DONE, we won't get a memory barrier on the main
> thread. However, because BFS is set to BFS_DONE *after* the memory barrier
> happens on the background thread, the memory barrier  guarantees that all the
> stores that happened before it (and by extension before BFS = BFS_DONE) will
> be globally visible. 

As discussed on IRC, I'm still not convinced this is correct since there's no barrier on the main thread to force us to see changes to arenaLists.
I looked up some descriptions of the underlying instructions.

The x86 description of MFENCE is very vague, saying merely that "This serializing operation guarantees that every load and store instruction that precedes in program order the MFENCE instruction is globally visible before any load or store instruction that follows the MFENCE instruction is globally visible." [1] On the other hand, x86 is strongly ordered, so everything has ReleaseAcquire semantics and none of this matters.

The ARM description of DMB is more specific: "It ensures that all explicit memory accesses that appear in program order before the DMB instruction are observed before any explicit memory accesses that appear in program order after the DMB instruction." [2] As far as I can tell, DSB and ISB are supersets of DMB, adding more serialization.

Since, after bug 1081952, we set the background finalize state *after* the memory barrier, every store prior to the barrier should be visible when the background finalize state is acquired.

I'll be honest, though, I keep changing my mind about whether it's too subtle to rely on. I think the BFS_JUST_FINISHED stuff was messier, but less subtle (if poorly understood), since it didn't rely on the ordering between the memory barrier and the atomic store. On the other hand, it's not that hard to understand why the store *should* go after the memory barrier - there are just more subtleties involved than might be apparent.

[1] http://x86.renejeschke.de/html/file_module_x86_id_170.html
[2] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204g/CIHJFGFE.html
I learned a ton from that discussion! Thanks.

How much of a performance hit would making the state SequentiallyConsistent have, exactly? We don't update it very often, so unless reading it totally destroys our performance on uncontended threads, I'd certainly prefer the saner semantics it offers.
Flags: needinfo?(terrence)
To summarise further IRC discussion: I was wrong about how release/acquire atomics work, and the main thread will see any changes made by the background thread before it sets the atomic.

So I now think this is fine as long as we set the atomic as the last thing we do on the background thread.
Updated the patch in bug 1081952 to reflect our revised understanding :)
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #25)
> 
> The x86 description of MFENCE is very vague, saying merely that "This
> serializing operation guarantees that every load and store instruction that
> precedes in program order the MFENCE instruction is globally visible before
> any load or store instruction that follows the MFENCE instruction is
> globally visible." [1] On the other hand, x86 is strongly ordered, so
> everything has ReleaseAcquire semantics and none of this matters.

The x86 is not actually strongly ordered, and that fact is not fully appreciated.  (I made the same mistake myself just weeks back.)  In particular, if a store precedes a load in the program order, the hardware may perform the read before it performs the write.  MFENCE can be used to insert the necessary StoreLoad fence.

I believe you need the MFENCE for sequential consistency, but not for release-acquire consistency (see ref below).

> The ARM description of DMB is more specific: "It ensures that all explicit
> memory accesses that appear in program order before the DMB instruction are
> observed before any explicit memory accesses that appear in program order
> after the DMB instruction." [2] As far as I can tell, DSB and ISB are
> supersets of DMB, adding more serialization.

Right.  It's not 100% clear to me yet whether DMB is enough for sequential consistency or whether it actually requires an ISB [sic].  Apple's Clang sources emit two DMB for a C++11 seq cst atomic (one before, one after), but another source (again see ref below) uses an ISB after.  Caveat: Comments in the Clang code suggests it's been customized for Apple CPUs.

This is the "ref below":
http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

The "Implementation and Hardware Notes" section of the following doc contains more information, plus points to other useful references:
https://docs.google.com/document/d/1NDGA_gZJ7M7w1Bh8S0AoDyEqwDdRh4uSoTPSNn77PFk/edit?usp=sharing

(For practical purposes MIPS and POWER can be treated as ARM.)
Flags: qe-verify-