Closed Bug 1434598 Opened 6 years ago Closed 6 years ago

Investigate splitting atoms table to reduce lock contention

Categories

(Core :: JavaScript: GC, enhancement, P2)

Tracking

RESOLVED FIXED
mozilla63
Tracking Status
firefox63 --- fixed

People

(Reporter: jonco, Assigned: jonco)

References

Details

Attachments

(7 files, 6 obsolete files)

42.00 KB, patch, jonco: review+
14.59 KB, patch, jonco: review+
39.47 KB, patch, jonco: review+
50.25 KB, patch, sfink: review+
15.41 KB, patch, jonco: review+
8.62 KB, patch, sfink: review+
5.55 KB, text/plain
As suggested by Steve, to reduce contention on the exclusive access lock caused by off-thread parsing, we could split the atoms table into several smaller tables, each with its own lock. The table to use would be selected based on the key's hash. AIUI this is how Java's ConcurrentHashMap works.

This could give a big boost to off-thread parsing performance.

One complication is that for incremental sweeping we already have two tables: one that we can add new atoms to while we sweep the other. We would probably want to abstract that first and then add a concurrent map template wrapper over it.
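For illustration, a minimal sketch of the idea (hypothetical names and standard-library containers, not the actual SpiderMonkey types):

    // Sketch only: the key's hash selects one of N sub-tables, each with its
    // own lock, so threads touching different sub-tables never contend.
    #include <array>
    #include <cstddef>
    #include <functional>
    #include <mutex>
    #include <string>
    #include <unordered_set>

    class PartitionedAtomTable {
        static constexpr std::size_t NumSubTables = 16;  // power of two

        struct SubTable {
            std::mutex lock;
            std::unordered_set<std::string> atoms;
        };
        std::array<SubTable, NumSubTables> subTables_;

        SubTable& subTableFor(std::size_t hash) {
            // Use the top bits of the hash; the per-table buckets already use
            // the low bits, so this keeps the two choices independent.
            return subTables_[hash >> (sizeof(std::size_t) * 8 - 4)];
        }

      public:
        const std::string& atomize(const std::string& chars) {
            std::size_t hash = std::hash<std::string>{}(chars);
            SubTable& sub = subTableFor(hash);
            std::lock_guard<std::mutex> guard(sub.lock);
            return *sub.atoms.insert(chars).first;  // lookup-or-insert under one lock
        }
    };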
Thinking more about this, splitting the table would only help with looking up existing atoms. You would still need to take a single lock to create a new atom, unless the storage was split up too.
When I was thinking about this after our meeting, I was wondering if you knew whether the contention is from both reading and writing, or just reading? As in, would it help to have a rwlock type thing to reduce contention for reads? (I'm a little skeptical, since it seems like parsing would be creating a lot of atoms, not just looking them up, but I really have no idea.)

Another thing that comes to mind is the way we handle offthread compilation zones by having a merge process. I wonder how hard it would be to have offthread parsing have its own storage layer for atoms (layered on top of the shared one), then merge those atoms into the shared table while updating all parser-created pointers. But I don't know anything about how the parser works, and this is different from what offthread compilation does (it has its own zone).
(In reply to Steve Fink [:sfink] [:s:] from comment #2)
Here is a breakdown of what happens in AtomizeAndCopyChars for the tp6 talos tests I was looking at:

static string        185202  36.81%
per-zone cache hit   222904  44.30%
permanent atom         5803   1.15%
(lock taken here)
existing atom found   34502   6.86%
new atom created      54773  10.89%

So if we take the lock, we're more likely to write than read.

Allocating the atoms locally to each parse zone and then merging them could work, but that makes merging a potential bottleneck.
Do you know how many of the new atoms are for string literals vs. identifiers etc.? Lazy parsing should be able to get away with not atomizing string literals, I think.
(In reply to Jan de Mooij [:jandem] from comment #4)
That's a great idea. I suspect there are several different reasons we make things into atoms, but I don't know what they all are or whether they all apply to lazy parsing.

Another aspect is that we may be able to get away with making strings have a single instance that is unique within the zone, rather than globally.
(In reply to Steve Fink [:sfink] [:s:] from comment #2)
I thought some more about allocating atoms locally to the parse zone and then merging and came to the conclusion that this could be a big win, although it's complicated to make it work.

We can atomise locally to a parse zone by using the per-zone atom cache as the local atoms table, and always allocating a new atom when one is not found there (excluding static/permanent atoms). So we don't need to take the exclusive lock while parsing.

After parsing we have a merge phase. This runs on the parse thread and iterates through the parse zone's atom arenas. We check each atom against the main atoms table. If it exists, we finalize the local atom and forward the cell to the main-thread version. We also mark the cell as free within the arena. When we've done this for all cells in an arena, we merge the arena into the real atoms zone. This merge phase requires the exclusive access lock.

Then we have an update phase where we trace the generated scripts to update references to the main-thread versions where necessary. This doesn't require a lock and happens off-thread. It depends on atoms not having been collected since the previous phase, but off-thread parsing already requires this.

Finally we merge as normal.

Pros: New atoms are allocated off-thread and merged. The lock is required for less time because it's not held for atom allocation, only table insertion. Also we can do the insertion in batches, which improves efficiency.

Cons: Atoms that already exist in the main atoms table are re-allocated and then discarded, although this happens off-thread.
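To make the scheme concrete, here is a simplified model of the merge and update phases (all types and helpers are hypothetical stand-ins, not the real GC machinery):

    // Sketch: merge runs under the lock and forwards duplicates; the update
    // phase then runs off-thread, following forwarding pointers.
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct Atom {
        std::string chars;
        Atom* forwardedTo = nullptr;
    };

    using SharedAtomTable = std::unordered_map<std::string, Atom*>;

    // Merge phase: called with the shared-table lock held.
    void mergeParseZoneAtoms(SharedAtomTable& shared, std::vector<Atom*>& local) {
        for (Atom* atom : local) {
            auto it = shared.find(atom->chars);
            if (it != shared.end())
                atom->forwardedTo = it->second;    // duplicate: forward, free later
            else
                shared.emplace(atom->chars, atom); // unique: adopt into shared table
        }
    }

    // Update phase: no lock needed; rewrite each reference in the generated
    // scripts through the forwarding pointer set up by the merge.
    Atom* followForwarding(Atom* atom) {
        return atom->forwardedTo ? atom->forwardedTo : atom;
    }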
The other big con is what you said at the beginning, that it's complicated. :-/ It's the "right" thing to do conceptually, though.
Priority: -- → P5
I came back to this bug because it turns out that bholley is facing the same issue with offthread CSS parsing and Gecko atoms. He's investigating the approach from comment 0 with separate storage -- or in other words, just having N independent atoms tables, partitioned by string hash.
(In reply to Steve Fink [:sfink] [:s:] from comment #8)
> I came back to this bug because it turns out that bholley is facing the same
> issue with offthread CSS parsing and Gecko atoms. He's investigating using
> comment 0 with separate storage -- or in other words, just having N
> independent atoms tables, partitioned by string hash.

Turns out this works great. Mutex overhead went from ~150ms per thread to ~2ms per thread. I recommend the approach.
(In reply to Bobby Holley (:bholley) from comment #9)
Great, thanks for the data point!
Assignee: nobody → jcoppeard
See [1] for documentation around the parameters we chose for the Gecko atom table.

[1] https://searchfox.org/mozilla-central/rev/78dbe34925f04975f16cb9a5d4938be714d41897/xpcom/ds/nsAtomTable.cpp#294
Depends on: 1467842
Currently we take the exclusive access lock for anything that allocates in the atoms zone.  This is:
 - atoms and the atoms table
 - symbols and the symbol registry
 - various JITRuntime thunks

The plan for this bug is to:
 - allow parallel allocation into the atoms zone
 - remove locking for things accessed by the main thread only
 - split the atoms table into many small tables and use a lock per table
 - remove the exclusive access lock

Parallel allocation works by keeping a set of free lists for the atoms zone in every JSContext. Each context thus allocates into its own set of arenas, and allocation only needs to take the GC lock when it allocates a new arena.

Parallel allocation into the atoms zone by helper threads is OK because 1) new allocations are not visible to the main thread until they are placed in some table (and that's protected by a lock), and 2) we don't GC the atoms zone while there are any helper threads running that could allocate. 

When we split the atoms table we will sometimes need to take the locks for all sub-tables, e.g. when we want to trace the heap for memory reporting. This is not required for GC.
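A rough sketch of the parallel-allocation part (hypothetical names; the real arena and free-list machinery also handles alloc kinds, alignment, and GC bookkeeping):

    // Sketch: each context bump-allocates out of its own arena without any
    // locking; only fetching a fresh arena takes the GC lock.
    #include <cstddef>
    #include <mutex>
    #include <vector>

    struct Arena {
        char data[4096];
        std::size_t used = 0;
    };

    class AtomsZoneAllocator {
        std::mutex gcLock_;           // taken only to hand out a new arena
        std::vector<Arena*> arenas_;  // arenas intentionally leak in this sketch

      public:
        Arena* allocateArena() {
            std::lock_guard<std::mutex> guard(gcLock_);
            arenas_.push_back(new Arena());
            return arenas_.back();
        }
    };

    // One of these would live in each JSContext.
    class ContextFreeList {
        Arena* current_ = nullptr;

      public:
        void* allocate(AtomsZoneAllocator& zone, std::size_t size) {
            if (!current_ || current_->used + size > sizeof(current_->data))
                current_ = zone.allocateArena();           // slow path: GC lock
            void* cell = current_->data + current_->used;  // fast path: no lock
            current_->used += size;
            return cell;
        }
    };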
This patch separates out FreeLists from ArenaLists and stores a pointer to the FreeLists for the current Zone in JSContext rather than a pointer to the ArenaLists.  arenaAllocatedDuringGC() moves to Arena and ArenaLists gets a back pointer to the Zone.

There is some refactoring of allocation to pass a reference to the current FreeLists into refillFreeListAndAllocate, which is renamed from allocateFromArena.
Attachment #8987583 - Flags: review?(sphink)
Patch to allow parallel allocation in the atoms zone without holding the exclusive access lock. This is safe as long as things allocated on separate threads are not visible to each other except through some appropriately protected shared data structure, which is still the case. With the exception of atoms allocated by parallel parsing, everything in the atoms zone is allocated on the main thread.

This gives each JSContext a separate FreeLists structure which can be used to allocate in the atoms zone.  The GC lock is required for allocating arenas, but when an arena has been allocated we can allocate out of it without taking any locks.

This expands the background finalize state to be a concurrent use flag that indicates whether the GC lock is necessary in ArenaLists::refillFreeListAndAllocate. We set this when parallel parsing is running.

I removed the atoms zone check in CheckZone - this is handled like all other zones now.  I added a custom check for ArenaLists access that allows access from the main thread with the exclusive access lock held and which otherwise works like CheckZone.
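The concurrent-use flag amounts to something like this (a sketch with hypothetical names, not the actual patch code):

    // Sketch: refilling free lists only takes the GC lock while helper
    // threads may also be allocating, i.e. while parallel parsing is running.
    #include <cstddef>
    #include <mutex>
    #include <new>

    class ArenaListsModel {
        std::mutex gcLock_;
        bool concurrentUse_ = false;  // set while parallel parsing is active

      public:
        void setConcurrentUse(bool active) { concurrentUse_ = active; }

        void* refillFreeListAndAllocate(std::size_t size) {
            if (concurrentUse_) {
                // Helper threads may also be refilling: serialize on the GC lock.
                std::lock_guard<std::mutex> guard(gcLock_);
                return allocateFromNewArena(size);
            }
            // Only the main thread can be here: no lock needed.
            return allocateFromNewArena(size);
        }

      private:
        void* allocateFromNewArena(std::size_t size) { return ::operator new(size); }
    };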
Attachment #8987593 - Flags: review?(sphink)
Attached patch bug1434598-3-remove-some-locking (obsolete) — Splinter Review
Remove some locking that's no longer needed.

The symbol registry is only ever accessed from the main thread. JitRuntime initialisation doesn't need the exclusive access lock any more. I'm not sure why GCMarker needed a lock.
Attachment #8987594 - Flags: review?(sphink)
More patches to follow...
Exciting stuff!
Comment on attachment 8987583 [details] [diff] [review]
bug1434598-1-use-free-list-in-context

Review of attachment 8987583 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/gc/Allocator.cpp
@@ +237,5 @@
>  GCRuntime::tryNewTenuredThing(JSContext* cx, AllocKind kind, size_t thingSize)
>  {
>      // Bump allocate in the arena's current free-list span.
> +
> +    T* t = reinterpret_cast<T*>(cx->freeLists()[kind]->allocate(thingSize));

This is fine, though for some reason I was expecting cx->freeLists(kind)->allocate(thingSize). I guess the fact that there is a fixed array of free lists indexed by kind feels like an implementation detail or something? Anyway, not a big deal and it might just be me; do it whichever way makes more sense to you.

I guess it's partly a question of whether the FreeLists type is a useful abstraction. I think it probably is some of the time, but it looks like >90% of the time we immediately index into FreeLists, which implies to me that FreeSpan is the important concept to operate on. (Or maybe there should be a FreeList which is either a typedef of FreeSpan, or struct FreeList { FreeSpan* head; } or whatever.)

::: js/src/gc/ArenaList.h
@@ +228,5 @@
>  
> +    FreeLists();
> +
> +    FreeSpan* freeList(AllocKind i) const { return freeLists_[i]; }
> +    FreeSpan* operator[](AllocKind i) const { return freeLists_[i]; }

If FreeList/FreeSpan becomes the primary type, this operator[] probably ought to go away.
Attachment #8987583 - Flags: review?(sphink) → review+
Comment on attachment 8987593 [details] [diff] [review]
bug1434598-2-parallel-atoms-alloc

Review of attachment 8987593 [details] [diff] [review]:
-----------------------------------------------------------------

I *think* this is all good.

::: js/src/vm/JSContext.h
@@ +282,5 @@
>      const JS::Zone* atomsZone(const js::AutoAccessAtomsZone& access) {
>          return runtime_->atomsZone(access);
>      }
> +
> +    // Methods to access runtime data that must be protected by locks.

This is probably in the next patch, but it looks odd to move this past two things that take an AutoAccessAtomsZone but not the third.

::: js/src/vm/Realm.h
@@ +913,5 @@
>      AutoRealm(const AutoRealm&) = delete;
>      AutoRealm& operator=(const AutoRealm&) = delete;
>  };
>  
> +class MOZ_RAII AutoAtomsZone // TODO: Consider renaming to indicate this is for allocation only.

Hm. AutoAllocAtoms? I don't know which name is better; either is fine with me.
Attachment #8987593 - Flags: review?(sphink) → review+
Comment on attachment 8987594 [details] [diff] [review]
bug1434598-3-remove-some-locking

Review of attachment 8987594 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/vm/SymbolType.cpp
@@ +23,2 @@
>  {
> +    AutoAtomsZone az(cx);

Hm, maybe this *would* be more clear with a full-on AutoAllocFromAtomsZone or something. I'm not sure.

@@ +75,3 @@
>  
> +    // p is still valid here because we have held the lock since the
> +    // lookupForAdd call, and newInternal can't GC.

No lock anymore. I guess this should be

  // p is still valid because we only access the symbol registry from the main thread, and newInternal can't GC.

?
Attachment #8987594 - Flags: review?(sphink) → review+
Refactored patch to give FreeLists a nicer API that doesn't make it feel like an array of FreeSpans.
Attachment #8987583 - Attachment is obsolete: true
Attachment #8987847 - Flags: review+
Moved comment as suggested and renamed AutoAtomsZone to AutoAllocInAtomsZone.
Attachment #8987593 - Attachment is obsolete: true
Attachment #8987848 - Flags: review+
Fixed comment.
Attachment #8987594 - Attachment is obsolete: true
Attachment #8987850 - Flags: review+
Uploading correct version this time.
Attachment #8987847 - Attachment is obsolete: true
Attachment #8987870 - Flags: review+
Attached patch bug1434598-4-split-atoms-table (obsolete) — Splinter Review
Patch to implement the atoms table with multiple hash tables each with their own lock.

I changed the way permanent atoms are initialised. Currently they are added to the atoms table as normal during initialisation, and at some point the entire atoms table becomes the permanent atoms table and a new atoms table is created. This patch adds a path to add atoms to the permanent atoms table if we're currently initialising.

Incremental sweeping of atoms works the same way as currently, with an auxiliary table for every atoms hash table.

There's an auto class to take all atoms table locks for the cases that need it (heap tracing).  This replaces the use of AutoLockForExclusiveAccess in AutoTraceSession.

I removed FrozenAtomSet and just return a const ref for the permanent atoms, but I could put this back if you think it's important.
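The all-locks guard is presumably something along these lines (a sketch, not the actual class):

    // Sketch of an RAII guard that locks every sub-table so the whole atoms
    // zone can be iterated, e.g. for heap tracing. Hypothetical names.
    #include <array>
    #include <cstddef>
    #include <mutex>

    template <std::size_t N>
    class AutoLockAllAtomsTables {
        std::array<std::mutex, N>& locks_;

      public:
        explicit AutoLockAllAtomsTables(std::array<std::mutex, N>& locks)
            : locks_(locks)
        {
            // Lock in a fixed (index) order so two threads taking all the
            // locks cannot deadlock against each other.
            for (auto& lock : locks_)
                lock.lock();
        }

        ~AutoLockAllAtomsTables() {
            for (auto it = locks_.rbegin(); it != locks_.rend(); ++it)
                it->unlock();
        }

        AutoLockAllAtomsTables(const AutoLockAllAtomsTables&) = delete;
        AutoLockAllAtomsTables& operator=(const AutoLockAllAtomsTables&) = delete;
    };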
Attachment #8987876 - Flags: review?(sphink)
Remove the exclusive access lock.
Attachment #8987877 - Flags: review?(sphink)
Comment on attachment 8987876 [details] [diff] [review]
bug1434598-4-split-atoms-table

Review of attachment 8987876 [details] [diff] [review]:
-----------------------------------------------------------------

Basically r=me, but if you decide to remove FrozenAtomSet then I want that part to be reviewed by njn (I'm good with the rest), or if you decide to keep it then I'd like to see it again after that change.

::: js/src/vm/AtomsTable.h
@@ +17,2 @@
>  
>  namespace js {

Here or somewhere, could you add a block comment describing the partitioning scheme?

@@ +17,4 @@
>  
>  namespace js {
>  
> +// To allow iterating over cells in the atoms zone take all atoms table locks.

Either add a comma after zone, or swap the two clauses. ("Take all locks to allow...")

@@ -68,5 @@
>  
> -// This class is a wrapper for AtomSet that is used to ensure the AtomSet is
> -// not modified. It should only expose read-only methods from AtomSet.
> -// Note however that the atoms within the table can be marked during GC.
> -class FrozenAtomSet

Based on the thread in bug 979293, removing this makes me pretty nervous with respect to tsan. I think you should r?njn for the removal.

@@ +85,5 @@
> +    // Use a low initial capacity for atom hash tables to avoid penalizing runtimes
> +    // which create a small number of atoms.
> +    static const size_t InitialTableSize = 16;
> +
> +    struct Set

My one big, obnoxious bit of feedback: I would much prefer the term "partition" rather than "set". Mathematically, partitions are non-overlapping subsets that cover the whole set. "Set" (or "subset", which would be slightly better) implies neither.

https://en.wikipedia.org/wiki/Partition_of_a_set

It's a lot longer, but it makes it much easier for me to see what's going on at a glance. I'm fine with "part" as a short form of "partition" wherever that helps.

@@ +96,5 @@
> +        // The atoms in this set.
> +        AtomSet atoms;
> +
> +        // Set of atoms added while the |atoms| set is being swept.
> +        AtomSet* atomsAddedWhileSweeping;

This makes me wonder if a general MVCCT<T> or AddOnlyMVCC<T> would be useful for various things... but I guess it would still need to be tailored based on the specific mutating APIs of T, so... bleh.

@@ -75,4 @@
>  
>    public:
> -    // This constructor takes ownership of the passed-in AtomSet.
> -    explicit FrozenAtomSet(AtomSet* set) { mSet = set; }

...though I wonder if today we would make this

    class FrozenAtomSet : private AtomSet {
      public:
        explicit FrozenAtomSet(AtomSet&& set) ...

to essentially just relabel the AtomSet as a FrozenAtomSet to eliminate a pointer indirection. Also, I *think* that means you can just have a series of public: using AtomSet::foo; statements to re-export the desired portions of the API. I usually prefer delegation, but for this purpose, it doesn't feel like the right choice.

@@ +157,5 @@
>  
> +    size_t sizeOfIncludingThis(mozilla::MallocSizeOf mallocSizeOf) const;
> +
> +  private:
> +    size_t getSetIndex(const AtomHasher::Lookup& lookup);

Can you comment this, eg

    // Map a string to a partition index, based on the string's hash.

(I'm not sure whether to call the lookup a string or an atom.)

::: js/src/vm/JSAtom.cpp
@@ +132,5 @@
>  FOR_EACH_COMMON_PROPERTYNAME(CONST_CHAR_STR)
>  #undef CONST_CHAR_STR
>  
> +// Use a low initial capacity for the permanent atoms table to avoid penalizing
> +// runtimes which create a small number of atoms.

While you're changing this comment, can you s/which/that/? Sorry, it's my inner grammar nitpicker coming out.

@@ +606,5 @@
>  
> +    AtomSet::Ptr pp = cx->permanentAtoms().readonlyThreadsafeLookup(lookup);
> +    if (pp) {
> +        JSAtom* atom = pp->asPtr(cx);
> +        if (zonePtr && !zone->atomCache().add(*zonePtr, AtomStateEntry(atom, false))) {

Is Zone::atomCache_ a true cache, or is it necessary for correctness during off-thread parsing or compilation? For some reason, I had thought it was necessary for correctness at some point, but I'm not seeing it now. It looks like it's purely an optimization to reduce locking overhead, as the comments indicate.

@@ +616,3 @@
>      }
>  
> +    // Validate the length before taking a lock, as throwing an exception here

"...taking an atoms partition lock, ..."?

@@ +710,5 @@
> +PermanentlyAtomizeAndCopyChars(JSContext* cx, const CharT* tbchars, size_t length,
> +                               const Maybe<uint32_t>& indexValue,
> +                               const AtomHasher::Lookup& lookup)
> +{
> +    MOZ_ASSERT(!cx->isPermanentAtomsInitialized());

This reads a little funny, since normally initialization does not include population. isPermanentAtomsConstructed or isPermanentAtomsPopulated might be better. But it doesn't seem worth the bother.

::: js/src/vm/JSContext.h
@@ +1193,5 @@
>  // accessed by the main thread and by off-thread parsing. There are two
>  // situations in which it is safe:
>  //
> +//  - the current thread holds all atoms table locks (off-thread parsing may be
> +//    running and this must also take one of these lock for access)

*locks

Also, s/this//. 'this' has an ambiguous antecedent; when I first read it, I thought it was referring to the current thread, which didn't make sense.
Attachment #8987876 - Flags: review?(sphink)
Comment on attachment 8987877 [details] [diff] [review]
bug1434598-5-remove-exlusive-access-lock

Review of attachment 8987877 [details] [diff] [review]:
-----------------------------------------------------------------

\o/ This is really great. The exclusive access lock was starting to feel like "uh... we're just going to lock here, to be safe, in order to protect... something." Atoms were the last thing that really made sense to lock for.
Attachment #8987877 - Flags: review?(sphink) → review+
(In reply to Steve Fink [:sfink] [:s:] (PTO June 31) from comment #27)

Thanks for all the comments.  I'll put FrozenAtomSet back the way it was and rename Set to Partition (this is better because there's no ambiguity with the HashSet).

> Is Zone::atomCache_ a true cache, or is it necessary for correctness during
> off-thread parsing or compilation? 

It is necessary now, since my patch to allow atoms GC while the main thread is parsing. We mark cached atoms for zones that are parsing, as this is a superset of the atoms the parser may be accessing.
Patch updated with review comments.
Attachment #8987876 - Attachment is obsolete: true
Attachment #8988209 - Flags: review?(sphink)
Rebased patch.
Attachment #8987877 - Attachment is obsolete: true
Attachment #8988210 - Flags: review+
This seems of higher importance than P5 implies.
Priority: P5 → P2
Comment on attachment 8988209 [details] [diff] [review]
bug1434598-4-split-atoms-table v2

Review of attachment 8988209 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/vm/AtomsTable.h
@@ +19,5 @@
> + * The atoms table is a mapping from strings to JSAtoms that supports concurrent
> + * access and incremental sweeping.
> + *
> + * The table is partitioned based on the key into multiple sub-tables. Each
> + * sub-table is protected by lock to ensure safety when accessed by helper

by *a lock

@@ +28,4 @@
>  
>  namespace js {
>  
> +// Take all atoms table locks too allow iterating over cells in the atoms zone.

*to

::: js/src/vm/Runtime.h
@@ +777,5 @@
> +    bool permanentAtomsPopulated() const {
> +        return permanentAtoms_;
> +    }
> +
> +    // For internal use, return the permanent atoms table while it being

it *is
Attachment #8988209 - Flags: review?(sphink) → review+
A final patch to ensure that the atom allocation path is all inlined (as it was previously) and to fix caching of permanent atoms created during initialisation.
Attachment #8989785 - Flags: review?(sphink)
Attached file benchmark_results.txt
Benchmark results for atoms-heavy parsing workload.

This shows a modest improvement on Windows and Linux and a big win on MacOS (where it avoids that OS's crummy behaviour under lock contention).
Comment on attachment 8989785 [details] [diff] [review]
bug1434598-6-atoms-tidyup

Review of attachment 8989785 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/vm/JSAtom.cpp
@@ +254,5 @@
>  {
>      Mutex* lock = nullptr;
>  
>    public:
> +    MOZ_ALWAYS_INLINE explicit AutoLock(JSRuntime* rt, Mutex& aLock) {

Whoa, does this MOZ_ALWAYS_INLINE make a difference? It was already inline.

@@ +767,5 @@
>          return nullptr;
>      }
>  
> +    if (zonePtr &&
> +        MOZ_UNLIKELY(!cx->zone()->atomCache().add(*zonePtr, AtomStateEntry(atom, false))))

Oops, sorry.
Attachment #8989785 - Flags: review?(sphink) → review+
(In reply to Jon Coppeard (:jonco) from comment #35)
> This shows a modest improvement on Windows and Linux and a big win on MacOS
> (where it avoids that OS's crummy behaviour under lock contention).

Wow, osx is a totally different world!
(In reply to Steve Fink [:sfink] [:s:] from comment #36)
> Whoa, does this MOZ_ALWAYS_INLINE make a difference? It was already inline.

I'm not sure. This is to make doubly sure everything is inlined. Overall, the patch made a difference, but I didn't check before and after for each individual change. It does feel a bit like sprinkling magic pixie dust, I know.

> Wow, osx is a totally different world!

Yeah, it really doesn't like lock contention.
Pushed by jcoppeard@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/245e2c17b272
Refactor allocation to work from a free list stored in the JSContext r=sfink
https://hg.mozilla.org/integration/mozilla-inbound/rev/5113d07ed7c6
Allow concurrent allocation in atoms zone r=sfink
https://hg.mozilla.org/integration/mozilla-inbound/rev/0583016547e9
Remove some locking that is no longer required r=sfink
https://hg.mozilla.org/integration/mozilla-inbound/rev/250b5f292dae
Partition atoms table into multiple sub-tables each with its own lock r=sfink
https://hg.mozilla.org/integration/mozilla-inbound/rev/6b959f8ced42
Remove the exclusive access lock r=sfink
https://hg.mozilla.org/integration/mozilla-inbound/rev/b5e740d8b4f5
Make sure atom allocation is always inlined and fix caching of permanent atoms during initialisation r=sfink
Pushed by jcoppeard@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/09a7192014ab
Fix static analysis bustage on a CLOSED TREE r=me
(In reply to Jon Coppeard (:jonco) from comment #38)
> (In reply to Steve Fink [:sfink] [:s:] from comment #36)
> > Whoa, does this MOZ_ALWAYS_INLINE make a difference? It was already inline.
> 
> I'm not sure.  This is to make doubly sure everything is inlined.  Overall,
> the patch made a difference but I didn't check before and after for each
> individual change.  It does feel a bit like sprinkling magic pixie dust I
> know.

Yeah, that's fine. I was just wondering if you already had concrete evidence that it was needed, in which case I would file that away as Something I Should Watch Out For. I have no objection to doing it here just to be extra-sure.
Depends on: 1478402