Closed Bug 1440824 Opened 6 years ago Closed 6 years ago

Split the Gecko atom table into multiple subtables to reduce locking contention

Categories

(Core :: XPCOM, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla60
Tracking Status
firefox60 --- fixed

People

(Reporter: bholley, Assigned: bholley)

References

Details

Attachments

(3 files, 1 obsolete file)

In bug 1346988, I'm working on moving CSS parsing to a background thread pool to speed up page load. One problem I've encountered is locking contention in the atom table. CSS parsing is very heavy on atomization, and atomizing requires a mutex to protect operations on the hashtable. In my initial measurements, moving ~50ms of parsing work to 6 background threads causes _each_ thread to spend ~150ms acquiring the mutex.

There are lockless hashtable implementations, but they're extremely complex. A simpler approach is to use the hash key to segment the table into N subtables, each of which is protected by a separate mutex.

I've implemented this and it works quite well. I can bring the locking overhead down to a couple of milliseconds per thread with 16 sub-tables, and under a millisecond with 128 or 256. Will post patches shortly.
MozReview-Commit-ID: 4uMktcaYwWW
Attachment #8953699 - Flags: review?(nfroyd)
MozReview-Commit-ID: E1bcchzuMOu
Attachment #8953700 - Flags: review?(nfroyd)
MozReview-Commit-ID: Hj8gKPap0cR
Attachment #8953701 - Flags: review?(nfroyd)
Comment on attachment 8953700 [details] [diff] [review]
Part 2 - Overhaul the atom infrastructure to support multiple subtables. v1

Review of attachment 8953700 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/ds/nsAtomTable.cpp
@@ +234,5 @@
> +  Mutex* mLock;
> +  void GCLocked(GCKind aKind);
> +  void SizeOfIncludingThisLocked(MallocSizeOf aMallocSizeOf, size_t* aSize);
> +
> +  AtomTableEntry* Add(AtomTableKey& key)

Nit: aKey

@@ +256,5 @@
> +  already_AddRefed<nsAtom> Atomize(const nsAString& aUTF16String);
> +  already_AddRefed<nsAtom> Atomize(const nsACString& aUTF8String);
> +  already_AddRefed<nsAtom> AtomizeMainThread(const nsAString& aUTF16String);
> +  void RegisterStaticAtoms(const nsStaticAtomSetup* aSetup, uint32_t aCount);
> +  size_t RaceySlowCount();

Nit: Racy

@@ +265,5 @@
> +
> +private:
> +  // XXXbholley: We enable multiple subtables in the next patch.
> +  const static size_t kNumSubTables = 1;
> +  nsAtomSubTable mSubTables[kNumSubTables];

Please comment (and/or statically assert) that this number should be/is a power of two.

@@ +331,3 @@
>  //
> +// So we first make the simplifying assumption that the atoms are more or less
> +// evenly-distributed across the subtables (which is the case emperically).

Nit: empirically.

@@ +369,5 @@
> +  // There are a few considerations around how we select subtables.
> +  //
> +  // First, we want entries to be evenly distributed across the subtables. This
> +  // can be achieved by using any bits in the hash key, assuming the key itself
> +  // is evenly-distributed. Emperical measurements indicate that this method

Nit: Empirical

@@ +379,5 @@
> +  // every entry it observes, leading to pessimal performance. In this case,
> +  // we're using PLDHashTable, whose primary hash function uses the N leftmost
> +  // bits of the hash value (where N is the log2 capacity of the table). This
> +  // means we should prefer the rightmost bits here.
> +  return mSubTables[aKey.mHash % kNumSubTables];

It's very likely that compilers will implement this using a mask instead of a modulo. But is it worth doing that explicitly -- `aKey.mHash & (kNumSubTables - 1)` -- to avoid any doubt?

@@ +385,3 @@
>  
>  void
> +nsAtomTable::SizeOfIncludingThis(MallocSizeOf aMallocSizeOf, size_t* aSize)

SizeOf functions that increment counters are usually called "AddSizeOf{In,Ex}cludingThis" because that makes it clearer.
 
But since this is only taking a single measurement, it would be better to remove the outparam and use the return value.

@@ +433,5 @@
> +  MOZ_ASSERT_IF(aKind == GCKind::Shutdown, gUnusedAtomCount == 0);
> +}
> +
> +size_t
> +nsAtomTable::RaceySlowCount()

A function with a provocative name like this deserves a comment, either here or on the declaration. In what way is it racy? What are the possibly consequences? (I can tell by reading the code, but it would be nice to have extra explanation.)

Also, NS_GetNumberOfAtoms() relies on this function. It appears to only be used in tests. Its declaration also deserves a comment about its reliability.

@@ +655,5 @@
>             : 0;
>  }
>  
> +void
> +nsAtomSubTable::SizeOfIncludingThisLocked(MallocSizeOf aMallocSizeOf, size_t* aSize)

Again, please use the return value instead of an outparam.

@@ +660,2 @@
>  {
> +  mLock->AssertCurrentThreadOwns();

A nice idiom we use in the Gecko Profiler: in functions like these that require a lock to be held, we have a "proof of lock" argument -- a reference to the AutoLock:

https://searchfox.org/mozilla-central/rev/b469db5c90d618f4b202d5ef280e1a78224eb43b/tools/profiler/core/platform.cpp#178-180

as used here, for example:

https://searchfox.org/mozilla-central/rev/b469db5c90d618f4b202d5ef280e1a78224eb43b/tools/profiler/core/platform.cpp#1621

Might be worth doing that here, and for GCLocked()? Types are generally better than assertions.
Attachment #8953700 - Flags: feedback+
Comment on attachment 8953701 [details] [diff] [review]
Part 3 - Enable multiple hashtables for atoms. v1

Review of attachment 8953701 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/ds/nsAtomTable.cpp
@@ +268,5 @@
> +  // parsing by increasing the number of subtables up to 128. This has been
> +  // measured to have neglible impact on the performance of initialization, GC,
> +  // and shutdown, and only costs about 20k of memory relative to a single
> +  // subtable.
> +  const static size_t kNumSubTables = 128;

With bug 1436250 in mind I wonder if a smaller value would be better. What does the perf vs. memory usage curve look like?
(In reply to Nicholas Nethercote [:njn] from comment #5) 
> @@ +660,2 @@
> >  {
> > +  mLock->AssertCurrentThreadOwns();
> 
> A nice idiom we use in the Gecko Profiler: in functions like these that
> require a lock to be held, we have a "proof of lock" argument -- a reference
> to the AutoLock:

It was actually in the old code, but I think it's overkill in the new setup, and make things unnecessarily verbose. In the new setup, the *Locked methods are only called in one place, by the outer table (they're only separate methods for readability). The proof-of-lock stuff is most useful when you have multiple callers and nested *Locked functions.

(In reply to Nicholas Nethercote [:njn] from comment #6)
> > +  // parsing by increasing the number of subtables up to 128. This has been
> > +  // measured to have neglible impact on the performance of initialization, GC,
> > +  // and shutdown, and only costs about 20k of memory relative to a single
> > +  // subtable.
> > +  const static size_t kNumSubTables = 128;
> 
> With bug 1436250 in mind I wonder if a smaller value would be better. What
> does the perf vs. memory usage curve look like?

Memory overhead seems roughly linear with the number of tables, at least in the ranges I measured (up to 256). IIRC the locking overhead was ~3-6ms with 64 subtables, and ~1-2ms with 128 subtables.

I don't think it's worth taking a measurable slowdown in the milliseconds to save 10k per process. There's lower-hanging fruit, and if we're going to chase wins at that level, I think we'll want some kind of "unimportant process" heuristic so that we don't compromise performance on stuff that matters.
Actually, that memory measurement was for tp6-facebook, not an empty page. Let me remeasure.
Remeasuring with a more-or-less empty page, I get 280,784 bytes with 64 tables vs 282,576 bytes with 128 tables. So even if 10k isn't negligible, 2k certainly is. :-)
> > "proof of lock"
> 
> It was actually in the old code, but I think it's overkill in the new setup,
> and make things unnecessarily verbose.

It's an extra argument, but it saves the need for an assertion. *shrug*
(In reply to Nicholas Nethercote [:njn] from comment #10)
> > > "proof of lock"
> > 
> > It was actually in the old code, but I think it's overkill in the new setup,
> > and make things unnecessarily verbose.
> 
> It's an extra argument, but it saves the need for an assertion. *shrug*

I guess another point is that I still want the assertion, since there are multiple locks involved, and we want to make sure we have the right one.
Attachment #8953699 - Flags: review?(nfroyd) → review+
Comment on attachment 8953700 [details] [diff] [review]
Part 2 - Overhaul the atom infrastructure to support multiple subtables. v1

Review of attachment 8953700 [details] [diff] [review]:
-----------------------------------------------------------------

I generally agree with all of njn's comments.  I like the style of aProofOfLock arguments, but it's true that with multiple locks, that style breaks down, because you would like to be sure the correct lock is being held (and we don't have a MutexAutoLock::AssertIsHoldingLock(const Mutex&) or similar).

::: xpcom/ds/nsAtomTable.cpp
@@ +225,5 @@
>  
> +// In order to reduce locking contention for concurrent atomization, we segment
> +// the atom table into N subtables, each with a separate lock. If the hash
> +// values we use to select the subtable are evenly distributed, this reduces the
> +// probability of contention by a factor of N. See bug 1440824.

Do we want to cite any prior work here, e.g. Java's ConcurrentHashTable?

@@ +230,5 @@
> +class nsAtomSubTable
> +{
> +  friend class nsAtomTable;
> +  PLDHashTable* mTable;
> +  Mutex* mLock;

Why are these pointers rather than ordinary values?  Is this just so detecting accesses after atom table shutdown is easier?

If we do want to keep these as pointers, could they be made UniquePtrs instead?

@@ +269,5 @@
> +  nsAtomSubTable mSubTables[kNumSubTables];
> +};
> +
> +// Static singleton instance for the atom table.
> +static nsAtomTable gAtomTable;

Ah, this would be a reasonable point for having the pointers in nsAtomSubTable above: we don't have to worry about static constructors for this instance if nsAtomSubTable has trivial constructors only.

@@ +343,5 @@
> +void nsAtomTable::Init()
> +{
> +  MOZ_ASSERT(!IsInitialized());
> +  for (size_t i = 0; i < kNumSubTables; ++i) {
> +    nsAtomSubTable& entry = mSubTables[i];

Nit: we could do:

for (auto& table : mSubTables) {
  ...
}

here and everywhere else in this file.  Also nit: I think we should call the temporary `table`, rather than `entry` to avoid confusion between it and hash table entries.

@@ +356,5 @@
> +  MOZ_ASSERT(IsInitialized());
> +  for (size_t i = 0; i < kNumSubTables; ++i) {
> +    nsAtomSubTable& entry = mSubTables[i];
> +    delete entry.mTable;
> +    entry.mTable = nullptr;

Do we want to be super-paranoid and lock before deleting this?

@@ +421,5 @@
> +  //   we might see a gUnusedAtomCount value in between, say, AddRef()
> +  //   incrementing mRefCnt and it decrementing gUnusedAtomCount.
> +  //
> +  // So, we don't bother asserting that there are no unused atoms at the end of
> +  // a regular GC. But we can (and do) assert thist just after the last GC at

Nit: "assert this"

@@ +660,3 @@
>  {
> +  mLock->AssertCurrentThreadOwns();
> +  *aSize += mTable->ShallowSizeOfIncludingThis(aMallocSizeOf);

It's worth noting that mLock can take up a reasonable amount of space (the platform-specific mutexes take 64 bytes on Mac, I think, 40 bytes on Linux, and something similarly-sized on Windows, not to mention whatever bits we include on top of that).  So measuring mLock here somehow would be worthwhile.

It's a little peculiar to have a SizeOfIncludingThis method that doesn't actually measure the memory occupied by `this`...though I guess the memory isn't allocated.  Perhaps this method should be SizeOfExcludingThisLocked for that reason?  (And similarly for the nsAtomTable method?)

@@ +689,5 @@
>      uint32_t stringLen = NS_strlen(string);
>  
>      uint32_t hash;
> +    AtomTableKey key(string, stringLen, &hash);
> +    nsAtomSubTable& entry = SelectSubTable(key);

Naming nit here.

@@ +741,3 @@
>    uint32_t hash;
> +  AtomTableKey key(aUTF8String.Data(), aUTF8String.Length(), &hash);
> +  nsAtomSubTable& entry = SelectSubTable(key);

Naming nit here.

@@ +782,3 @@
>    uint32_t hash;
> +  AtomTableKey key(aUTF16String.Data(), aUTF16String.Length(), &hash);
> +  nsAtomSubTable& entry = SelectSubTable(key);

Naming nit here.

@@ +823,5 @@
>        return retVal.forget();
>      }
>    }
>  
> +  nsAtomSubTable& entry = SelectSubTable(key);

Naming nit here.
Attachment #8953700 - Flags: review?(nfroyd) → review+
Comment on attachment 8953701 [details] [diff] [review]
Part 3 - Enable multiple hashtables for atoms. v1

Review of attachment 8953701 [details] [diff] [review]:
-----------------------------------------------------------------

Given that we weren't measuring mLock in the previous patch, I am less sure about the size numbers.  Let's go with the 20k number quoted in the comments, saying that's what we measured for an increase vs. a single subtable.  That didn't include:

* 1 lock * 64 bytes/lock * 128 sub-tables = 8k (plus a little more for BlockingResourceBase, sadly, another 1-2k)
* 8 bytes/subtable * 128 sub-tables = 1k (for the gAtomTable itself--not malloc'd memory, but implicitly taking up space in memory via the binary)

which raises that 20k to nearly 30k.  Is that worth it?

30k * 10-ish processes (content, file://, EME, addon, etc.) starts to sound fairly large...

Where did the overhead numbers mentioned in comment 7 come from?
OTOH, my current session has 2.1MB for the main atom table in the main process and between 2.9 and 5.8MB for the main atom table in content processes.  So maybe a 0.5-1%-ish increase in the size of the atom table is nothing to get too worried about...
(In reply to Nicholas Nethercote [:njn] from comment #5)
> Nit: aKey
> Nit: Racy

Fixed.

> Please comment (and/or statically assert) that this number should be/is a
> power of two.

Fixed.


> Nit: empirically.
> Nit: Empirical

Fixed.

> It's very likely that compilers will implement this using a mask instead of
> a modulo. But is it worth doing that explicitly -- `aKey.mHash &
> (kNumSubTables - 1)` -- to avoid any doubt?

Ok.

> But since this is only taking a single measurement, it would be better to
> remove the outparam and use the return value.

Fixed.

> A function with a provocative name like this deserves a comment, either here
> or on the declaration. In what way is it racy? What are the possibly
> consequences? (I can tell by reading the code, but it would be nice to have
> extra explanation.)
> 
> Also, NS_GetNumberOfAtoms() relies on this function. It appears to only be
> used in tests. Its declaration also deserves a comment about its reliability.

Fixed.


> Again, please use the return value instead of an outparam.

Fixed.

> Might be worth doing that here, and for GCLocked()? Types are generally
> better than assertions.

Per previous discussion I'm think it's better to skip that here.



(In reply to Nathan Froyd [:froydnj] from comment #12)
> I generally agree with all of njn's comments.

Addressed, per above.

> I like the style of
> aProofOfLock arguments, but it's true that with multiple locks, that style
> breaks down, because you would like to be sure the correct lock is being
> held (and we don't have a MutexAutoLock::AssertIsHoldingLock(const Mutex&)
> or similar).

Yep.

> Do we want to cite any prior work here, e.g. Java's ConcurrentHashTable?

Ok.

> 
> @@ +230,5 @@
> > +class nsAtomSubTable
> > +{
> > +  friend class nsAtomTable;
> > +  PLDHashTable* mTable;
> > +  Mutex* mLock;
> 
> Why are these pointers rather than ordinary values?  Is this just so
> detecting accesses after atom table shutdown is easier?

It was. But upon reflection, I think it's preferable to put everything inline in nsAtomTable, and then make gAtomTable a pointer. I've gone ahead and done this.

> @@ +343,5 @@
> > +void nsAtomTable::Init()
> > +{
> > +  MOZ_ASSERT(!IsInitialized());
> > +  for (size_t i = 0; i < kNumSubTables; ++i) {
> > +    nsAtomSubTable& entry = mSubTables[i];
> 
> Nit: we could do:
> 
> for (auto& table : mSubTables) {
>   ...
> }
> 
> here and everywhere else in this file.  Also nit: I think we should call the
> temporary `table`, rather than `entry` to avoid confusion between it and
> hash table entries.

Done and done.

> 
> @@ +356,5 @@
> > +  MOZ_ASSERT(IsInitialized());
> > +  for (size_t i = 0; i < kNumSubTables; ++i) {
> > +    nsAtomSubTable& entry = mSubTables[i];
> > +    delete entry.mTable;
> > +    entry.mTable = nullptr;
> 
> Do we want to be super-paranoid and lock before deleting this?

I don't think so, since we'd be deleting the lock straight after. In any case, it's a bit of a moot point now because the shutdown function is gone, replaced with a default destructor.

> Nit: "assert this"

Fixed.

> It's worth noting that mLock can take up a reasonable amount of space (the
> platform-specific mutexes take 64 bytes on Mac, I think, 40 bytes on Linux,
> and something similarly-sized on Windows, not to mention whatever bits we
> include on top of that).  So measuring mLock here somehow would be
> worthwhile.
> 
> It's a little peculiar to have a SizeOfIncludingThis method that doesn't
> actually measure the memory occupied by `this`...though I guess the memory
> isn't allocated.  Perhaps this method should be SizeOfExcludingThisLocked
> for that reason?  (And similarly for the nsAtomTable method?)

This is all fixed under the new setup.

> Naming nit here.
> Naming nit here.
> Naming nit here.
> Naming nit here.

Fixed.
Flagging for re-review mostly for the new inline array setup, plus the memory
reporter changes. Everything else is a pretty straightforward addressing of
review comments.

MozReview-Commit-ID: E1bcchzuMOu
Attachment #8954124 - Flags: review?(nfroyd)
Attachment #8953700 - Attachment is obsolete: true
Attachment #8954124 - Flags: review?(nfroyd) → review+
Comment on attachment 8953701 [details] [diff] [review]
Part 3 - Enable multiple hashtables for atoms. v1

Review of attachment 8953701 [details] [diff] [review]:
-----------------------------------------------------------------

On reflection, I think the size concerns here are valiant, but probably not the lowest-hanging fruit on the tree.
Attachment #8953701 - Flags: review?(nfroyd) → review+
(In reply to Nathan Froyd [:froydnj] from comment #17)
> On reflection, I think the size concerns here are valiant, but probably not
> the lowest-hanging fruit on the tree.

Agreed. That said, I wanted to clarify the numbers a bit. Comment 13 is based on the numbers of a non-empty content process, which doesn't so much matter. What _does_ matter is the overhead in an empty content process, because that's the fixed overhead that hurts us when we scale up the number of content processes.

I did a more thorough writeup in a comment for part 3. Here it is:

  // Another important consideration is memory, since we're adding fixed overhead
  // per content process, which we try to avoid. Measuring a mostly-empty page [1]
  // with various numbers of subtables, we get the following deep sizes for the
  // atom table:
  //       1 subtable:  278K
  //       8 subtables: 279K
  //      16 subtables: 282K
  //      64 subtables: 286K
  //     128 subtables: 290K
  //
  // So 128 subtables costs us 12K relative to a single table, and 4K relative
  // to 64 subtables. Conversely, measuring parallel (6 thread) CSS parsing on
  // tp6-facebook, a single table provides ~150ms of locking overhead per thread,
  // 64 subtables provides ~2-3ms of overhead, and 128 subtables provides <1ms.
  // And so while either 64 or 128 subtables would probably be acceptable,
  // achieving a measurable redution in contention for 4k of fixed memory
  // overhead is probably worth it.
  //
  // [1] The numbers will look different for content processes with complex
  // pages loaded, but in those cases the actual atoms will dominate memory
  // usage and the overhead of extra tables will be negligible. We're mostly
  // interested in the fixed cost for nearly-empty content processes.
>  // achieving a measurable redution in contention for 4k of fixed memory

Nit: reduction
Pushed by bholley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/973afb5d4909
Move some code around. r=froydnj
https://hg.mozilla.org/integration/autoland/rev/5d83e440bbb0
Overhaul the atom infrastructure to support multiple subtables. r=froydnj
https://hg.mozilla.org/integration/autoland/rev/ef3ac3531192
Enable multiple hashtables for atoms. r=froydnj
You need to log in before you can comment on or make changes to this bug.