Open Bug 1402910 Opened 7 years ago Updated 2 years ago

drag gecko hashtables into the current century

Categories

(Core :: XPCOM, enhancement, P3)

enhancement

Tracking

()

Performance Impact medium
Tracking Status
firefox58 --- wontfix
firefox59 --- wontfix

People

(Reporter: froydnj, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: perf, perf:responsiveness)

Attachments

(2 files, 2 obsolete files)

The current consensus on how to build a fast hashtable seems to be linear probing (for cache-friendliness) combined with Robin Hood hashing (to deal with long probe chains).  See several articles around the web:

https://www.sebastiansylvan.com/post/robin-hood-hashing-should-be-your-default-hash-table-implementation/
http://codecapsule.com/2013/11/11/robin-hood-hashing/
http://codecapsule.com/2013/11/17/robin-hood-hashing-backward-shift-deletion/
https://probablydance.com/2014/05/03/i-wrote-a-fast-hash-table/
https://probablydance.com/2017/02/26/i-wrote-the-fastest-hashtable/

Rust's hashtables also use this strategy, and they seem to have had decent success with it.

Our PLDHashTable uses quadratic probing, which is cache-unfriendly; every probe is essentially a cache miss.

Over the past month or so, I've been prototyping a hashtable that we could at least start using for the underlying substrate of nsTHashtable.  (This is attractive because it enables us to switch large numbers of Gecko hashtable instances over with minimal effort; switching the users of raw PLDHashTable would be a more involved effort.)  ASan Linux try finally turned green after fixing a few last thinkos today, so it is at least ready for an initial round of feedback.  Basic, simplistic benchmarks that should be taken with a large grain of salt indicate the new hashtable is ~2x faster.

The question of whether nsTHashtable and friends (and whatever the underlying hashtable is) should be templated, instead of using function pointers, is a separate, future discussion.  The question of whether we could just FFI to Rust's hashtables, since that would probably be no worse than what we have today, is an interesting one, but I have run into enough little thinkos about our current hashtable implementation that I think such a project would run into some serious difficulties.
Attached patch part 1 - add QMEHashTable (obsolete) — Splinter Review
This is where almost all of the interesting code is.  The files are largely a
copy-and-paste of PLDHashtable.{h,cpp}, with types and algorithms changed where
appropriate.  We abuse unified compilation to pull in the hash table
concurrency checking mechanisms.

Things should be familiar to anybody who has spent some time in PLDHashTable.
Notable changes:

* QMEHashTable::SearchTable has been changed to do linear probing + Robin Hood
  hashing.  It's also been templated so that things like
  PLDHashTable::FindFreeEntry are no longer needed.  The logic is maybe a
  little hard to follow because of this, but I think this is better than
  maintaining similar algorithms with minor variations in two different places.

* QMEHashTable::RawRemove has been changed to not use tombstones (entries with
  mKeyHash having kCollisionFlag set), instead shifting entries backward where
  appropriate.  I am less convinced of the goodness of this part, as I can't
  convince myself that some patterns of inserts and removes will not cause
  terrible behavior.  Various other things (e.g. ComputeKeyHash) have been
  altered from their PLDHashTable counterparts to deal with kCollisionFlag now
  being unnecessary.

* Iterators are slightly different as well, because we can no longer blindly
  advance the internal pointer each time: removes may deposit live elements at
  the entry we were currently looking at, so we need to take that into account.

A non-exhaustive list of little things that remain to be done:

* Load factor adjustments; Robin Hood hashing enables higher load factors to be
  used with little degradation in performance.  Rust's hashtables, for
  instance, use a 90% load factor before triggering resizes.

* Reinsertion of existing entries in SearchTable currently use three moveEntry
  calls to perform swaps; it's possible that having a swapEntries function
  pointer in QMEHashTableOps would be more efficient.  Doing so involves a lot
  of code modification elsewhere, though.

* Some benchmarking to see what the most efficient way to iterate over entries
  in SearchTable and RawRemove would be good: should we be using the current
  strategy (which requires multiplications each iteration) or directly
  increment raw pointers into the table (which requires a branch or cmov every
  iteration to check for wraparound)?

* Other decisions from PLDHashTable could be revisited: the comment over
* CapacityFromHashShift is a perfect example.

* The memory for temporary storage could be allocated contiguously with the
  memory for the entry storage; I'm not sure whether this would be more or less
  efficient.

* QMEHashTable can be shrunk using techniques that PLDHashTable has recently
  adopted.

Credit goes to Ehsan for initially suggesting this idea!
Attachment #8911870 - Flags: feedback?(n.nethercote)
Attachment #8911870 - Flags: feedback?(erahm)
This is the easy part, whereby we move a bunch of things over unconditionally.
I have not benchmarked this part with e.g. a talos run, but doing a Speedometer
run with this would be enlightening.
> Our PLDHashTable uses quadratic probing

No, it uses double hashing.

> QMEHashTable

What does QME stand for?

> * Load factor adjustments; Robin Hood hashing enables higher load factors to
> be
>   used with little degradation in performance.  Rust's hashtables, for
>   instance, use a 90% load factor before triggering resizes.

I'd suggest trying 87.5% (7/8) and 93.75% (15/16), which are computable without using integer division.
Comment on attachment 8911870 [details] [diff] [review]
part 1 - add QMEHashTable

Review of attachment 8911870 [details] [diff] [review]:
-----------------------------------------------------------------

This is excellent, thank you for working on it. I've done a high-level pass, looking at API stuff and comments, but I haven't looked closely at nitty-gritty parts of the code, such as lookup and insertion. I like that you have lots of comments.

What's the rationale for introducing QMEHashTable rather than just modifying PLDHashTable? AFAICT the API is much the same (though I might be overlooking something). Seems like it's just adding more code for no particular reason. It's not in the service of a gradual transition strategy because you're changing nsTHashtable to use QMEHashTable in the next patch, which affects lots of tables anyway. (Modify PLDHashTable makes reviewing easier, too :)

> * Iterators are slightly different as well, because we can no longer blindly
>   advance the internal pointer each time: removes may deposit live elements at
>   the entry we were currently looking at, so we need to take that into account.

Another possibility is to record the removed elements in a Vector and then wait until iteration ends before doing the actual removal. It's possible that is simpler, though it does require extra time and memory for the Vector.

::: xpcom/ds/QMEHashTable.cpp
@@ +22,5 @@
> +
> +// TODO: thread-safety checks
> +
> +static bool
> +QSizeOfEntryStore(uint32_t aCapacity, uint32_t aEntrySize, uint32_t* aNbytes)

Why the Q prefix?

@@ +35,5 @@
> +// - table must be at most 75% full;
> +// - capacity must be a power of two;
> +// - capacity cannot be too small.
> +static inline void
> +QBestCapacity(uint32_t aLength, uint32_t* aCapacityOut,

ditto

@@ +198,5 @@
> +  bool addingEntry = Reason == ForAdd || Reason == ForAddDuringResize;
> +  // Try to ensure that we don't add new reasons without handling them.
> +  MOZ_ASSERT_IF(!addingEntry, Reason == ForSearchOrRemove);
> +
> +  // QMEHashTable employs linear probing with Robin Hood-style hashing.

Does Robin Hood-style hashing not imply linear probing? (Genuine question, I'm not sure.)

@@ +206,5 @@
> +  //
> +  // As we go along, if there's a bucket whose distance from *its*
> +  // desired initial bucket is lower than the distance we've probed from
> +  // our desired initial bucket, we'll put our inserted entry there, and
> +  // continue inserting the now-vacated element.

I'm going to nitpick the language here a bit. I don't think we need to introduce the term "bucket"; we already use "entry" in much the same way and I don't think having two terms is good. We have to be a bit careful about distinguishing e.g. empty entries vs use entries, but I think it can be done. "Entry index" can be used to refer to the actual index of an entry.

@@ +208,5 @@
> +  // desired initial bucket is lower than the distance we've probed from
> +  // our desired initial bucket, we'll put our inserted entry there, and
> +  // continue inserting the now-vacated element.
> +  //
> +  // We are guaranteed to find a new entry.

Do you have your editor set to break comment lines at column 72? This block comment could/should be a lot wider. Likewise for other comments throughout this patch.

@@ +215,5 @@
> +  uint32_t bucketIndex = IdealBucketIndex(aKeyHash);
> +
> +  uint32_t probeLength = 0;
> +  PLDHashEntryHdr* temporary = static_cast<PLDHashEntryHdr*>(mTemporaryEntry);
> +  bool reinserting = false;

I'm not quite sure what this bool means. A comment might help.

@@ +235,5 @@
> +        mOps->moveEntry(this, temporary, entry);
> +        entry->mKeyHash = tempHash;
> +        return retval;
> +      }
> +      return (addingEntry) ? entry : nullptr;

Redundant parens.

@@ +242,5 @@
> +    // What we do here depends on what state we're in:
> +    //
> +    // - If we are reinserting some entry from the table, then we do nothing,
> +    //   because the reinserted entry cannot possibly match any of the other
> +    //   entries in the hashtable.

This is a bit unclear. Is this during resize? As above, I'm not sure what "reinserting" means.

@@ +289,5 @@
> +    //     we still have work to do, so continue through the loop, with
> +    //     a new entry.
> +    // 3. We are re-inserting elements, so:
> +    // 3a. Swap the temporary entry with the offending element.
> +    // 3b. Continue inserting the new temporary entry.

The numbering here is inconsistent: there's 2a/2b/2c without 2, but there's 3a/3b with 3.

Also, I find this list confusing. Is 3/3a/3b an alternative to 1/2a/2b/2c? Making it slightly more pseudocode-y might help, esp. if there should be an "else" in there?

@@ +305,5 @@
> +          retval = entry;
> +        } else {
> +          // We have two entries that we need to swap: the current entry,
> +          // `entry` and the temporary entry.  We *also* have a free entry
> +          // in `retval` that we can use for temporary storage.

The list after the colon reads like it has three elements, not two. This might be clearer: "the current entry (`entry`) and the temporary entry".

@@ +558,5 @@
> +  // When we remove an entry, we'll scan forwards to find an entry that is
> +  // either empty, or in its desired place.  Then we'll shift all entries
> +  // between our removed entry and the found entry backwards, lowering their
> +  // distance from their desired entry slot.  Doing this improves lookup and
> +  // insertion performance.

And it saves space because we don't have wasted entries.

@@ +563,5 @@
> +  //
> +  // What about accidentally quadratic concerns here?  For instance, what if
> +  // we had a *really* long probe chain?  Such chains are unlikely, given our
> +  // Robin Hood insertion strategy above.  XXX more robust explanation.
> +  

trailing whitespace

@@ +601,5 @@
> +
> +    emptyBucket = nextBucket;
> +    emptyEntry = nextEntry;
> +  }
> +  

trailing whitespace

::: xpcom/ds/QMEHashTable.h
@@ +60,5 @@
> +  // strategy, so we need to do something a little different: we'll keep
> +  // a count of the maximum number of entries we've ever seen in the table.
> +  // Using that value and the current number of entries, we can compute
> +  // how many entries have been removed.
> +  uint32_t            mHighWaterEntryCount;

Do we need to track this? In PLDHashTable we track the number of removed entries because they take up space and so need to be taken into account by the table loading calculations. But in QMEHashTable they don't take up space. So I think all this logic relating to the number of removed entries can be completely removed.

@@ +62,5 @@
> +  // Using that value and the current number of entries, we can compute
> +  // how many entries have been removed.
> +  uint32_t            mHighWaterEntryCount;
> +  EntryStore          mEntryStore;    // (Lazy) entry storage and generation.
> +  uint32_t            mSizeMask;

What's this for? Needs a comment.

@@ +71,5 @@
> +  // caller owns the memory for the inserted entry and therefore we can't
> +  // scribble over it.  So we need a small entry that is obviously ours.
> +  //
> +  // XXX: a better design for this class would obviate the need for this.
> +  void*               mTemporaryEntry;

It's a shame the effect this has on sizeof(QMEHashTable). AFAICT this field is only used in SearchTable() -- can we use stack allocation instead? The size depends on the table (mEntrySize) so you can't use a normal local variable but we have precedent in the tree for using alloca() in layout/style/nsRuleNode.{h,cpp}.

::: xpcom/tests/gtest/HashTableBench.cpp
@@ +11,5 @@
> +using namespace mozilla;
> +
> +// A trivial hash function is good enough here. It's also super-fast for the
> +// GrowToMaxCapacity test because we insert the integers 0.., which means it's
> +// collision-free.

Please use a realistic hash function! Probably mozilla::HashGeneric().

@@ +41,5 @@
> +  PLDHashTable table(&trivialPOps, sizeof(PLDHashEntryStub), kHashTableSize);
> +
> +  for (size_t i = 0; i < kHashTableSize; ++i) {
> +    bool success = table.Add(reinterpret_cast<const void*>(i), fallible);
> +    ASSERT_TRUE(success);

This benchmark needs strengthening. You're only measuring insertion, not lookup or removal. And as well as measuring on this simple table, measuring on another table with a larger PLDHashEntry will stress the cache in different ways.

@@ +66,5 @@
> +
> +MOZ_GTEST_BENCH(XPCOM, XPCOM_QMEHashTable_Bench, QMEHashBench);
> +MOZ_GTEST_BENCH(XPCOM, XPCOM_PLDHashTable_Bench, PLDHashBench);
> +
> +TEST(QMEHashTable, Insertion)

This test could also be strengthened:
- Insert many more items to test resizing more;
- More interleaving of insertion, lookup, and removal.
Attachment #8911870 - Flags: feedback?(n.nethercote) → feedback+
(In reply to Nicholas Nethercote [:njn] from comment #4)
> What's the rationale for introducing QMEHashTable rather than just modifying
> PLDHashTable? AFAICT the API is much the same (though I might be overlooking
> something). Seems like it's just adding more code for no particular reason.
> It's not in the service of a gradual transition strategy because you're
> changing nsTHashtable to use QMEHashTable in the next patch, which affects
> lots of tables anyway. (Modify PLDHashTable makes reviewing easier, too :)

Adding a new class means that I don't have to change the entire world every time I want to make changes in the new implementation.  I can move nsTHashtable to the new implementation without changing all the extant users of PLDHashTable--adding a new "virtual" method to the ops table, for example.  I was also unsure of how many clients might be silently depending on implementation details of PLDHashTable.  Some of dbaron's experiments with modifying PLDHashTable have turned up subtle dependencies, and I wanted to try and minimize that.  (I realize that moving over many of the hashtables through nsTHashtable is probably incompatible with that desire!)  Having both implementations in-tree also makes benchmarking much easier during development.

> > * Iterators are slightly different as well, because we can no longer blindly
> >   advance the internal pointer each time: removes may deposit live elements at
> >   the entry we were currently looking at, so we need to take that into account.
> 
> Another possibility is to record the removed elements in a Vector and then
> wait until iteration ends before doing the actual removal. It's possible
> that is simpler, though it does require extra time and memory for the Vector.

We could also attempt to find an unused entry before starting iteration, and start iterating *backwards* through the table from there; that would ensure we're never moving live elements into an entry we're currently on.

> ::: xpcom/ds/QMEHashTable.cpp
> @@ +22,5 @@
> > +
> > +// TODO: thread-safety checks
> > +
> > +static bool
> > +QSizeOfEntryStore(uint32_t aCapacity, uint32_t aEntrySize, uint32_t* aNbytes)
> 
> Why the Q prefix?

So it won't collide with static functions in PLDHashTable.cpp; I wasn't sure how much I would have to modify at first.

> @@ +198,5 @@
> > +  bool addingEntry = Reason == ForAdd || Reason == ForAddDuringResize;
> > +  // Try to ensure that we don't add new reasons without handling them.
> > +  MOZ_ASSERT_IF(!addingEntry, Reason == ForSearchOrRemove);
> > +
> > +  // QMEHashTable employs linear probing with Robin Hood-style hashing.
> 
> Does Robin Hood-style hashing not imply linear probing? (Genuine question,
> I'm not sure.)

I think you could do it without doing linear probing, though it'd be awkward: computing distance from your ideal entry is no longer straightforward, for instance.

> @@ +206,5 @@
> > +  //
> > +  // As we go along, if there's a bucket whose distance from *its*
> > +  // desired initial bucket is lower than the distance we've probed from
> > +  // our desired initial bucket, we'll put our inserted entry there, and
> > +  // continue inserting the now-vacated element.
> 
> I'm going to nitpick the language here a bit. I don't think we need to
> introduce the term "bucket"; we already use "entry" in much the same way and
> I don't think having two terms is good. We have to be a bit careful about
> distinguishing e.g. empty entries vs use entries, but I think it can be
> done. "Entry index" can be used to refer to the actual index of an entry.

Yeah, I will try and adjust the language around this a bit.

> @@ +208,5 @@
> > +  // desired initial bucket is lower than the distance we've probed from
> > +  // our desired initial bucket, we'll put our inserted entry there, and
> > +  // continue inserting the now-vacated element.
> > +  //
> > +  // We are guaranteed to find a new entry.
> 
> Do you have your editor set to break comment lines at column 72? This block
> comment could/should be a lot wider. Likewise for other comments throughout
> this patch.

I do, can try to fix.

> @@ +242,5 @@
> > +    // What we do here depends on what state we're in:
> > +    //
> > +    // - If we are reinserting some entry from the table, then we do nothing,
> > +    //   because the reinserted entry cannot possibly match any of the other
> > +    //   entries in the hashtable.
> 
> This is a bit unclear. Is this during resize? As above, I'm not sure what
> "reinserting" means.

I thought this was clear; we're reinserting entries when we're doing the "Robin Hood" part of "Robin Hood hashing".  I'll try to clarify this.

> ::: xpcom/ds/QMEHashTable.h
> @@ +60,5 @@
> > +  // strategy, so we need to do something a little different: we'll keep
> > +  // a count of the maximum number of entries we've ever seen in the table.
> > +  // Using that value and the current number of entries, we can compute
> > +  // how many entries have been removed.
> > +  uint32_t            mHighWaterEntryCount;
> 
> Do we need to track this? In PLDHashTable we track the number of removed
> entries because they take up space and so need to be taken into account by
> the table loading calculations. But in QMEHashTable they don't take up
> space. So I think all this logic relating to the number of removed entries
> can be completely removed.

OK, I think I can buy that.

> @@ +71,5 @@
> > +  // caller owns the memory for the inserted entry and therefore we can't
> > +  // scribble over it.  So we need a small entry that is obviously ours.
> > +  //
> > +  // XXX: a better design for this class would obviate the need for this.
> > +  void*               mTemporaryEntry;
> 
> It's a shame the effect this has on sizeof(QMEHashTable). AFAICT this field
> is only used in SearchTable() -- can we use stack allocation instead? The
> size depends on the table (mEntrySize) so you can't use a normal local
> variable but we have precedent in the tree for using alloca() in
> layout/style/nsRuleNode.{h,cpp}.

Stack allocation seems like a better idea, especially after we've enforced an 8-bit entry size akin to what PLDHashTable does.  Or even just having a char[256] on the stack, since we have a reasonable maximum entry size.

> ::: xpcom/tests/gtest/HashTableBench.cpp
> @@ +11,5 @@
> > +using namespace mozilla;
> > +
> > +// A trivial hash function is good enough here. It's also super-fast for the
> > +// GrowToMaxCapacity test because we insert the integers 0.., which means it's
> > +// collision-free.
> 
> Please use a realistic hash function! Probably mozilla::HashGeneric().

This is a good point, though testing with a hash function that returns numbers in-order is not a bad stress test of the hashtable, too.

> @@ +41,5 @@
> > +  PLDHashTable table(&trivialPOps, sizeof(PLDHashEntryStub), kHashTableSize);
> > +
> > +  for (size_t i = 0; i < kHashTableSize; ++i) {
> > +    bool success = table.Add(reinterpret_cast<const void*>(i), fallible);
> > +    ASSERT_TRUE(success);
> 
> This benchmark needs strengthening. You're only measuring insertion, not
> lookup or removal. And as well as measuring on this simple table, measuring
> on another table with a larger PLDHashEntry will stress the cache in
> different ways.
> 
> @@ +66,5 @@
> > +
> > +MOZ_GTEST_BENCH(XPCOM, XPCOM_QMEHashTable_Bench, QMEHashBench);
> > +MOZ_GTEST_BENCH(XPCOM, XPCOM_PLDHashTable_Bench, PLDHashBench);
> > +
> > +TEST(QMEHashTable, Insertion)
> 
> This test could also be strengthened:
> - Insert many more items to test resizing more;
> - More interleaving of insertion, lookup, and removal.

Will try writing several more sophisticated benchmarks, probably cribbing from some of the Robin Hood hashing blog posts, and seeing what happens.
(In reply to Nathan Froyd [:froydnj] from comment #5)
> (In reply to Nicholas Nethercote [:njn] from comment #4)
> > ::: xpcom/ds/QMEHashTable.h
> > @@ +60,5 @@
> > > +  // strategy, so we need to do something a little different: we'll keep
> > > +  // a count of the maximum number of entries we've ever seen in the table.
> > > +  // Using that value and the current number of entries, we can compute
> > > +  // how many entries have been removed.
> > > +  uint32_t            mHighWaterEntryCount;
> > 
> > Do we need to track this? In PLDHashTable we track the number of removed
> > entries because they take up space and so need to be taken into account by
> > the table loading calculations. But in QMEHashTable they don't take up
> > space. So I think all this logic relating to the number of removed entries
> > can be completely removed.
> 
> OK, I think I can buy that.

Actually, wait, I don't.  The removed entries take up the same amount of space in both cases.  We're keeping track of them in the PLDHashTable case so we know when our table was large at some point in the past, but at the present time we don't have enough entries to really justify keeping it that large.  Why wouldn't we do the same thing for QMEHashTable?

How would we make ShrinkIfAppropriate() work in this new world where we don't track stats about removed entries?  We'd just remove the constraint on a quarter of all entries being removed?  The same question can be asked for the check in Add(), though there I guess we'd never compress the table?  (I'm not quite sure how we "compress" the table when we're using a deltaLog2 of zero--that is, not changing the size of the table at all--but maybe I am missing something subtle there.)
(In reply to Nicholas Nethercote [:njn] from comment #3)
> > QMEHashTable
> 
> What does QME stand for?

QME doesn't stand for anything, it's PLD++, in the same way that IBM is HAL++.  (One *really* wonders whether one side of the latter is deliberate or just a happy accident.)
> > Why the Q prefix?
> 
> So it won't collide with static functions in PLDHashTable.cpp; I wasn't sure
> how much I would have to modify at first.

Is the collision because of unified compilation? The C++ One Definition Rule? Something else?

> Stack allocation seems like a better idea, especially after we've enforced
> an 8-bit entry size akin to what PLDHashTable does.  Or even just having a
> char[256] on the stack, since we have a reasonable maximum entry size.

Oh excellent, an unintended benefit of my having restricted mEntrySize.

> > Please use a realistic hash function! Probably mozilla::HashGeneric().
> 
> This is a good point, though testing with a hash function that returns
> numbers in-order is not a bad stress test of the hashtable, too.

Seems to me like it's the opposite of a stress test -- aren't you guaranteed to have no collisions? Anyway, this is already a microbenchmark, and hence unrealistic; introducing additional unrealism is not going to help things :)
> Actually, wait, I don't.  The removed entries take up the same amount of
> space in both cases.  We're keeping track of them in the PLDHashTable case
> so we know when our table was large at some point in the past, but at the
> present time we don't have enough entries to really justify keeping it that
> large.  Why wouldn't we do the same thing for QMEHashTable?
> 
> How would we make ShrinkIfAppropriate() work in this new world where we
> don't track stats about removed entries?  We'd just remove the constraint on
> a quarter of all entries being removed? The same question can be asked for
> the check in Add(), though there I guess we'd never compress the table? 
> (I'm not quite sure how we "compress" the table when we're using a deltaLog2
> of zero--that is, not changing the size of the table at all--but maybe I am
> missing something subtle there.)

deltaLog2==0 means that we "resize" the table without changing its capacity. I.e. it effectively just clears out all the tombstones. We don't need that operation ever in QMEHashTable.

Let's consider pseudocode.

PLD Add:

> if (load > 75% (incl. tombstones))
>   if (tombstone load > 25%)
>     rehash table to the same capacity, eliminating tombstones (takes us back to 50% or less capacity)
>   else
>     rehash table to twice the capacity

PLD Remove:

> if (tombstone load > 25%) || ((not at min capacity) && (non-tombstone load < 25%))
>   rehash table to same or smaller capacity (chosen capacity depends on non-tombstone load, can be less than half)

With tombstones, we have entries that take up space uselessly but can be eliminated with effort. I.e. this is all about preventing us from ending up with overly-large tables that contain mostly tombstones when we remove lots of elements.

(Now that I think about it... the tombstone load check during Remove is clearly useful, because that's when tombstones are created, so we can obviously cross the 25% tombstone threshold. But the tombstone load check during Add might not be necessary, because I can't see how we can exceed the 25% tombstone threshold when adding entries... if we were above that threshold we would have already rehashed on a previous Remove. I will check this empirically.)

Without tombstones, all live entries are useful -- simply rehashing the table to the same capacity will never change its load factor -- so things are much simpler.

QME Add:

> if (load > 75%)  // or whatever maximum load factor is appropriate
>   rehash table to twice the capacity

QME Remove:

> if (load < 25%) && (not at min capacity)  // or whatever minimum load factor is appropriate
>   rehash table to half the capacity or less

(I think the "or less" is necessary in the case where an iterator is used to remove most or all the elements, and the ShrinkIfAppropriateCheck() happens at the end of that iteration.)

Much nicer!
> But the tombstone load check
> during Add might not be necessary, because I can't see how we can exceed the
> 25% tombstone threshold when adding entries... if we were above that
> threshold we would have already rehashed on a previous Remove. I will check
> this empirically.

Checked:

> 50760 counts:
> (  1)    40252 (79.3%, 79.3%): PLDAdd 1
> (  2)     5245 (10.3%, 89.6%): PLDRem -1
> (  3)     4860 ( 9.6%, 99.2%): PLDRem 0
> (  4)      165 ( 0.3%, 99.5%): PLDRem -2
> (  5)       91 ( 0.2%, 99.7%): PLDRem -5
> (  6)       51 ( 0.1%, 99.8%): PLDRem -6
> (  7)       43 ( 0.1%, 99.9%): PLDRem -3
> (  8)       36 ( 0.1%,100.0%): PLDRem -4
> (  9)        9 ( 0.0%,100.0%): PLDRem -8
> ( 10)        3 ( 0.0%,100.0%): PLDRem -10
> ( 11)        3 ( 0.0%,100.0%): PLDRem -7
> ( 12)        2 ( 0.0%,100.0%): PLDRem -11

The "PLDAdd 0" case never occurs in practice. Although, in theory the RawRemove() function could be used to subvert things, because it removes entries without calling ShrinkIfAppropriate(). But that function is so little-used that we could still easily remove the tombstone condition in Add. But given that QMEHashTable will likely replace PLDHashTable soon, it's probably not worth it. (Though there is also js::HashTable...)
Been writing some benchmarks today, staring at `perf` output, and thinking.  The benchmarks are:

* Insert N random integers with a trivial hash function
* Insert N sorted integers with a trivial hash function

And then, using a sophisticated (mozilla::HashGeneric) hash function:

* Insert N random integers
* Insert N sorted integers
* Insert N random integers and then search for all of them
* Insert N random integers and then remove all of them (using ::Remove())
* Insert N random integers and then remove all of them, without resizing (using ::RawRemove())
* Insert N random integers and then iterate over the table
* Insert N random integers and then iterate over the table, removing them as we go

N is one million in all of the below timings; the integers are 64-bit integers.  Timings for the above, using 75% max load in QMEHashTable:

 121.703 ± 12.950 ms    HashIntegers.QME_Insert_Trivial_Random
 135.815 ± 12.020 ms    HashIntegers.PLD_Insert_Trivial_Random
  63.296 ±  2.754 ms    HashIntegers.QME_Insert_Trivial_Sorted
 164.017 ±  4.706 ms    HashIntegers.PLD_Insert_Trivial_Sorted

 108.120 ±  2.949 ms    HashIntegers.QME_Insert_Generic_Random
 134.776 ±  2.498 ms    HashIntegers.PLD_Insert_Generic_Random
  70.527 ±  2.413 ms    HashIntegers.QME_Insert_Generic_Sorted
 129.989 ±  2.141 ms    HashIntegers.PLD_Insert_Generic_Sorted

 109.770 ±  0.798 ms    HashIntegers.QME_Search_Generic_Random
 121.557 ±  0.591 ms    HashIntegers.PLD_Search_Generic_Random

 177.122 ±  0.549 ms    HashIntegers.QME_Remove_Generic_Random
 225.674 ±  0.942 ms    HashIntegers.PLD_Remove_Generic_Random
 138.908 ±  0.808 ms    HashIntegers.QME_RawRemove_Generic_Random
 149.394 ±  0.665 ms    HashIntegers.PLD_RawRemove_Generic_Random

  81.745 ±  0.828 ms    HashIntegers.QME_Iterate_Generic_Random
  87.305 ±  1.140 ms    HashIntegers.PLD_Iterate_Generic_Random
  89.772 ±  0.398 ms    HashIntegers.QME_IterateRemove_Generic_Random
  85.097 ±  0.743 ms    HashIntegers.PLD_IterateRemove_Generic_Random

QMEHashTable is roughly 5-10% faster on most things, about 20% faster on random inserts, and ~twice as fast on inserting sorted inputs.  Iteration seems to be a wash, but we are much faster when removing outside of iteration.

I cannot explain the Search results: those benchmarks are identical to the Insert benchmarks, but with the additional work of searching...yet PLDHashTable is *faster* on the Search benchmark than the Insert benchmarks.

Sadly, bumping the max load to 87.5% in QMEHashTable produces disappointing results:

 156.047 ±  8.152 ms    HashIntegers.QME_Insert_Trivial_Random
 134.613 ±  1.949 ms    HashIntegers.PLD_Insert_Trivial_Random
  65.857 ±  1.282 ms    HashIntegers.QME_Insert_Trivial_Sorted
 168.841 ±  1.409 ms    HashIntegers.PLD_Insert_Trivial_Sorted

 150.063 ±  2.002 ms    HashIntegers.QME_Insert_Generic_Random
 136.244 ±  2.831 ms    HashIntegers.PLD_Insert_Generic_Random
  78.378 ±  2.474 ms    HashIntegers.QME_Insert_Generic_Sorted
 137.879 ±  5.011 ms    HashIntegers.PLD_Insert_Generic_Sorted

 109.219 ±  3.439 ms    HashIntegers.QME_Search_Generic_Random
 120.113 ±  3.274 ms    HashIntegers.PLD_Search_Generic_Random

 174.887 ±  1.367 ms    HashIntegers.QME_Remove_Generic_Random
 220.440 ±  1.425 ms    HashIntegers.PLD_Remove_Generic_Random
 136.646 ±  0.568 ms    HashIntegers.QME_RawRemove_Generic_Random
 146.116 ±  1.244 ms    HashIntegers.PLD_RawRemove_Generic_Random

  79.622 ±  0.810 ms    HashIntegers.QME_Iterate_Generic_Random
  86.621 ±  0.759 ms    HashIntegers.PLD_Iterate_Generic_Random
  88.592 ±  0.849 ms    HashIntegers.QME_IterateRemove_Generic_Random
  84.390 ±  0.515 ms    HashIntegers.PLD_IterateRemove_Generic_Random

Notice that QMEHashTable is now much slower for inserting random numbers, 10-20% slower than PLDHashTable.  I'd guess this stems from longer hash chains and additional work performing Robin Hood hashing when inserting as the load factor increases.
Can you profile on more realistic table sizes? Maybe just run through some web pages and see what the largest / average sizes are?
(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #12)
> Can you profile on more realistic table sizes? Maybe just run through some
> web pages and see what the largest / average sizes are?

Shrinking the table sizes to 16 (and running 50k iterations to make the results take actual time) gives basically the same results.

One notable finding is that switching the order of the first two tests makes the QMEHashTable version come out on top; perhaps there's some cache effects (warming up the pages that the memory allocator is going to hand out?) at play here.
> QMEHashTable is roughly 5-10% faster on most things, about 20% faster on
> random inserts, and ~twice as fast on inserting sorted inputs.

Great results!

I have some suggestions for alternative things to measure, all prompted by thinking about how real hash tables are likely to be used in Firefox.

- As erahm says, 1 million is a big hashtable! Could you try say, inserting 1000 and then doing a million searches?

- Do any of your searches fail? If not, that case should be tested, probably just as often as successful searches.

- It would be good to try a table with entries larger than integers.

If you want to go all out you could somehow record the usage patterns for a particularly hot hash table, such as the cycle collector's, and then replay that in the benchmark.
Depends on: 1415980
I think in order for us to have a realistic benchmark for this, we should look at the mPtrToNodeMap hashtable defined here: <https://searchfox.org/mozilla-central/rev/769222fadff46164f8cc0dc7a0bae5a60dc2f335/xpcom/base/nsCycleCollector.cpp#889>.  This hashtable can get relatively large and it only stores pointers.  Based on the performance measurements we did in the QF project, this is probably our most expensive hashtable in terms of searches.

It would be nice to collect some metrics on how many entries it can contain in a typical browsing session and construct a benchmark based on that.  I think once we have a working implementation of QMEHashTable, we probably want to replace mPtrToNodeMap with that class as a first task, so doing this measurement would be useful to assess that possibility as well.
CCing jesup since mccr8 told me he's run into perf issues with the CC hashtable as well.
When leaks in sites get bad, we can see 100ms+ ChangeTable calls (rehashing/reinserting all the data) as it grows the table.  This typically is only 5-10% of the total time for CC (most is traversing and inserting into the hashtable), but it's still significant.
Keywords: perf
Whiteboard: [qf]
Whiteboard: [qf] → [qf:f64][qf:p1]
Whiteboard: [qf:f64][qf:p1] → [qf:p1:f64]
Depends on: 1474789
Resetting the qf flag; I don't think it's clear that this is a P1 bug until the measurements from comment 16 or similar are done.  We don't have a good sense of how much performance this would win yet.

If somebody wants to do those measurements, I'm happy to work on getting the dependent bugs landed and sending WIP patches.
Whiteboard: [qf:p1:f64] → [qf]
(In reply to Nathan Froyd [:froydnj] from comment #20)
> Resetting the qf flag; I don't think it's clear that this is a P1 bug until
> the measurements from comment 16 or similar are done.  We don't have a good
> sense of how much performance this would win yet.
> 
> If somebody wants to do those measurements, I'm happy to work on getting the
> dependent bugs landed and sending WIP patches.

Ignoring the parts about measuring the number of entries, here are the measurements for time spent since process startup in all PLD hash table lookups, and only in the CC hashtable lookups in both parent and content processes for:

A fresh browser session and content process:

PLD hashtable lookup overhead: 25.924723ms content=0
CC hashtable lookup overhead: 1.516823ms content=0

PLD hashtable lookup overhead: 3.326255ms content=1
CC hashtable lookup overhead: 0.000000ms content=1

The same session after loading YouTube, starting to play a video, going to Amazon, browsing a bit, going to Facebook (but not logging in), going back to Amazon (about a minute in total of browsing):

PLD hashtable lookup overhead: 312.702908ms content=0
CC hashtable lookup overhead: 42.262718ms content=0

PLD hashtable lookup overhead: 1808.481920ms content=1
CC hashtable lookup overhead: 2491.989567ms content=1


Note that the CC numbers include the entire times of add/remove operations, not just the look-ups, which is why they're higher than the total PLD hash table numbers. Also, times for each individual operation were collected with 1 ns resolution, minus whatever slop TimeStamp adds. I didn't count the total number of operations.
Whiteboard: [qf] → [qf:p1:f64]
Not sure CC hashtable is the most important to look at, or at least not the only case to look at. Hashtable overhead shows badly often in some hot paths too - say any path needing string atomization which isn't optimized for mainthread usage only.

This is qf:p1 because we need faster hashtables in general, if possible.
(if we end up doing hashtable perf work in some other bug, that should be then qf:p1)
IIRC we decided to pick the CC hashtable as the first target for this project because it can typically get large, and as such the hashtable lookup times for it typically show up in profiles.  But I agree with Olli that it's not the only hashtable to look at.
This rebases QMEHashTable onto a more-or-less recent working tree and I believe
it also addresses many of njn's comment 4.
Attachment #8911870 - Attachment is obsolete: true
Attachment #8911870 - Flags: feedback?(erahm)
Again, rebased.  If you want to turn off BROKEN_HASH_KEYS in QMEHashTable.cpp,
you'll need a tree with bug 1415980 applied.
Attachment #8911872 - Attachment is obsolete: true
njn wanted more recent code; ni to him to notify that it's available.

Note there are also some optimizations worth attempting: iteration on QMEHashTables should be done in reverse, from higher-address entries to lower-address entries, so we don't have to have the mRemovedCurrentElement special case.  Bonus points if we delayed shifting entries down until we exited a contiguous run of removed entries, though I don't know how complex that is to implement.

Undefining BROKEN_HASH_KEYS should make things slightly faster, especially if all the virtual functions for QMEHashTable were inlined properly.  Bug 1415980 is meant to make that easier to do, but I think there are still a few botched entries somewhere.
Flags: needinfo?(n.nethercote)
Comment on attachment 8996096 [details] [diff] [review]
part 1 - add QMEHashTable

Review of attachment 8996096 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/ds/QMEHashTable.cpp
@@ +217,5 @@
> +}
> +
> +template<QMEHashTable::SearchReason Reason>
> +PLDHashEntryHdr*
> +QMEHashTable::SearchTable(const void* aKey, PLDHashNumber aKeyHash)

Please try making this function MOZ_ALWAYS_INLINE -- it was a big win (on some platforms at least) for PLDHashTable in bug 1477632.

::: xpcom/tests/gtest/HashTableBench.cpp
@@ +5,5 @@
> +
> +#include "gtest/gtest.h"
> +#include "gtest/MozGTestBench.h"
> +#include "PLDHashTable.h"
> +#include "QMEHashTable.h"

We now have xpcom/rust/gtest/bench-collections/Bench.cpp... I think it would make sense to remove this file and add QME there.
Flags: needinfo?(n.nethercote)
Another thing: PLDHashTable.cpp now has a bunch of `recordreplay` stuff in it that I don't understand. I'm guessing QMEHashTable should have it too?
Whiteboard: [qf:p1:f64] → [qf:p2:responsiveness]

Just curious if this project is something we care to resume any time in the near future? I'm going through old layout/style system bugs that were on my radar, and bug 1474789 potentially blocks this, but I'm unsure it's anything we want to invest time fixing.

Flags: needinfo?(nfroyd)

(In reply to Sean Voisen (:svoisen) from comment #30)

Just curious if this project is something we care to resume any time in the near future? I'm going through old layout/style system bugs that were on my radar, and bug 1474789 potentially blocks this, but I'm unsure it's anything we want to invest time fixing.

It's not something I plan on working on in the near future. I'm a little concerned about changes like "reading the hashtable now involves writing to the hashtable", which I think breaks a lot of other things, not just stylo, and I think the hash table state-of-the-art has moved on a bit anyway.

Flags: needinfo?(nfroyd)
Performance Impact: --- → P2
Whiteboard: [qf:p2:responsiveness]
Priority: -- → P3
Assignee: froydnj+bz → nobody
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: