1402910 - drag gecko hashtables into the current century

Reporter

Description

8 years ago

The current consensus on how to build a fast hashtable seems to be linear probing (for cache-friendliness) combined with Robin Hood hashing (to deal with long probe chains). See several articles around the web:

https://www.sebastiansylvan.com/post/robin-hood-hashing-should-be-your-default-hash-table-implementation/
http://codecapsule.com/2013/11/11/robin-hood-hashing/
http://codecapsule.com/2013/11/17/robin-hood-hashing-backward-shift-deletion/
https://probablydance.com/2014/05/03/i-wrote-a-fast-hash-table/
https://probablydance.com/2017/02/26/i-wrote-the-fastest-hashtable/

Rust's hashtables also use this strategy, and they seem to have had decent success with it.

Our PLDHashTable uses quadratic probing, which is cache-unfriendly; every probe is essentially a cache miss.

Over the past month or so, I've been prototyping a hashtable that we could at least start using for the underlying substrate of nsTHashtable. (This is attractive because it enables us to switch large numbers of Gecko hashtable instances over with minimal effort; switching the users of raw PLDHashTable would be a more involved effort.) ASan Linux try finally turned green after fixing a few last thinkos today, so it is at least ready for an initial round of feedback.

Basic, simplistic benchmarks that should be taken with a large grain of salt indicate the new hashtable is ~2x faster.

The question of whether nsTHashtable and friends (and whatever the underlying hashtable is) should be templated, instead of using function pointers, is a separate, future discussion.

The question of whether we could just FFI to Rust's hashtables, since that would probably be no worse than what we have today, is an interesting one, but I have run into enough little thinkos about our current hashtable implementation that I think such a project would run into some serious difficulties.

Nathan Froyd [:froydnj]

Reporter

Comment 1

8 years ago

Attached patch part 1 - add QMEHashTable (obsolete) — Details — Splinter Review

This is where almost all of the interesting code is. The files are largely a copy-and-paste of PLDHashTable.{h,cpp}, with types and algorithms changed where appropriate. We abuse unified compilation to pull in the hash table concurrency checking mechanisms. Things should be familiar to anybody who has spent some time in PLDHashTable.

Notable changes:

* QMEHashTable::SearchTable has been changed to do linear probing + Robin Hood hashing. It's also been templated so that things like PLDHashTable::FindFreeEntry are no longer needed. The logic is maybe a little hard to follow because of this, but I think this is better than maintaining similar algorithms with minor variations in two different places.

* QMEHashTable::RawRemove has been changed to not use tombstones (entries with mKeyHash having kCollisionFlag set), instead shifting entries backward where appropriate. I am less convinced of the goodness of this part, as I can't convince myself that some patterns of inserts and removes will not cause terrible behavior. Various other things (e.g. ComputeKeyHash) have been altered from their PLDHashTable counterparts to deal with kCollisionFlag now being unnecessary.

* Iterators are slightly different as well, because we can no longer blindly advance the internal pointer each time: removes may deposit live elements at the entry we were currently looking at, so we need to take that into account.

A non-exhaustive list of little things that remain to be done:

* Load factor adjustments; Robin Hood hashing enables higher load factors to be used with little degradation in performance. Rust's hashtables, for instance, use a 90% load factor before triggering resizes.

* Reinsertion of existing entries in SearchTable currently uses three moveEntry calls to perform swaps; it's possible that having a swapEntries function pointer in QMEHashTableOps would be more efficient. Doing so involves a lot of code modification elsewhere, though.

* Some benchmarking would be good, to see what the most efficient way to iterate over entries in SearchTable and RawRemove is: should we be using the current strategy (which requires multiplications each iteration) or directly increment raw pointers into the table (which requires a branch or cmov every iteration to check for wraparound)?

* Other decisions from PLDHashTable could be revisited: the comment over CapacityFromHashShift is a perfect example.

* The memory for temporary storage could be allocated contiguously with the memory for the entry storage; I'm not sure whether this would be more or less efficient.

* QMEHashTable can be shrunk using techniques that PLDHashTable has recently adopted.

Credit goes to Ehsan for initially suggesting this idea!
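
For readers who have not met the two techniques named above, here is a minimal sketch of Robin Hood insertion over linear probing together with backward-shift removal (no tombstones). It is not taken from the patch: the class name `RobinHoodSet`, the identity hash, the `std::optional` slot representation and the rule of always keeping one entry empty are all assumptions made to keep the example short.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Toy Robin Hood set of uint32_t keys. All names are hypothetical and the
// hash is the identity; this is not QMEHashTable's layout or API.
class RobinHoodSet {
 public:
  // `capacity` must be a power of two.
  explicit RobinHoodSet(uint32_t capacity)
      : mSlots(capacity), mMask(capacity - 1) {}

  // Index of the entry holding `key`, if present.
  std::optional<uint32_t> Find(uint32_t key) const {
    uint32_t index = key & mMask;
    for (uint32_t probed = 0;; ++probed) {
      const std::optional<uint32_t>& slot = mSlots[index];
      // An empty entry, or a resident closer to its ideal index than we have
      // probed, means the key cannot be any further along.
      if (!slot || Distance(*slot, index) < probed) return std::nullopt;
      if (*slot == key) return index;
      index = (index + 1) & mMask;
    }
  }

  // Returns false if the key is present or the table is at its limit.
  // One entry is always kept empty so that the probe loops terminate.
  bool Insert(uint32_t key) {
    if (Find(key) || mCount + 1 >= mSlots.size()) return false;
    uint32_t index = key & mMask;
    for (uint32_t probed = 0;; ++probed) {
      std::optional<uint32_t>& slot = mSlots[index];
      if (!slot) {
        slot = key;
        ++mCount;
        return true;
      }
      uint32_t residentDistance = Distance(*slot, index);
      if (residentDistance < probed) {
        // The resident is "richer" (closer to home) than the key in hand:
        // take its place and carry on inserting the displaced key.
        std::swap(key, *slot);
        probed = residentDistance;
      }
      index = (index + 1) & mMask;
    }
  }

  // Backward-shift removal: pull following entries one step towards their
  // ideal index until an empty entry or an entry already at home is found.
  bool Remove(uint32_t key) {
    std::optional<uint32_t> found = Find(key);
    if (!found) return false;
    uint32_t hole = *found;
    for (;;) {
      uint32_t next = (hole + 1) & mMask;
      if (!mSlots[next] || Distance(*mSlots[next], next) == 0) break;
      mSlots[hole] = mSlots[next];
      hole = next;
    }
    mSlots[hole].reset();
    --mCount;
    return true;
  }

 private:
  // How far `index` is from the ideal index of `key`, with wraparound.
  uint32_t Distance(uint32_t key, uint32_t index) const {
    return (index - (key & mMask)) & mMask;
  }

  std::vector<std::optional<uint32_t>> mSlots;
  uint32_t mMask;
  uint32_t mCount = 0;
};
```

With a capacity of 8, inserting 0, 8, 16, 1 and then 24 makes 24 displace 1 (which had probed less far), and removing 0 afterwards shifts every remaining key back by one entry.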

Attachment #8911870 - Flags: feedback?(n.nethercote)

Attachment #8911870 - Flags: feedback?(erahm)

Nathan Froyd [:froydnj]

Reporter

Comment 2

8 years ago

Attached patch part 2 - make nsTHashtable use QMEHashTable underneath (obsolete) — Details — Splinter Review

This is the easy part, whereby we move a bunch of things over unconditionally. I have not benchmarked this part with e.g. a talos run, but doing a Speedometer run with this would be enlightening.

Nicholas Nethercote [inactive]

Comment 3

8 years ago

> Our PLDHashTable uses quadratic probing

No, it uses double hashing.

> QMEHashTable

What does QME stand for?

> * Load factor adjustments; Robin Hood hashing enables higher load factors to be
> used with little degradation in performance. Rust's hashtables, for
> instance, use a 90% load factor before triggering resizes.

I'd suggest trying 87.5% (7/8) and 93.75% (15/16), which are computable without using integer division.
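
To make the last point concrete, here is a small sketch of how those two thresholds reduce to a shift and a subtraction when the capacity is a power of two. The function names are hypothetical and do not come from PLDHashTable or the patch; the results are exact only for capacities of at least 8 and 16 respectively.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers; `capacity` is assumed to be a power of two.

// 7/8 of capacity: capacity - capacity/8, with the division done as a shift.
constexpr uint32_t MaxLoad7of8(uint32_t capacity) {
  return capacity - (capacity >> 3);
}

// 15/16 of capacity: capacity - capacity/16.
constexpr uint32_t MaxLoad15of16(uint32_t capacity) {
  return capacity - (capacity >> 4);
}

// True once the table should grow, using the 7/8 threshold.
constexpr bool NeedsGrow(uint32_t entryCount, uint32_t capacity) {
  return entryCount >= MaxLoad7of8(capacity);
}

static_assert(MaxLoad7of8(1024) == 896, "7/8 of 1024");
static_assert(MaxLoad15of16(1024) == 960, "15/16 of 1024");
```

A 90% threshold, by contrast, would need something like `capacity * 9 / 10`, which involves a real division (or a multiply-and-shift the compiler has to synthesise).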

Nicholas Nethercote [inactive]

Comment 4

8 years ago

Comment on attachment 8911870 [details] [diff] [review]
part 1 - add QMEHashTable

Review of attachment 8911870 [details] [diff] [review]:
-----------------------------------------------------------------

This is excellent, thank you for working on it. I've done a high-level pass, looking at API stuff and comments, but I haven't looked closely at nitty-gritty parts of the code, such as lookup and insertion. I like that you have lots of comments.

What's the rationale for introducing QMEHashTable rather than just modifying PLDHashTable? AFAICT the API is much the same (though I might be overlooking something). Seems like it's just adding more code for no particular reason. It's not in the service of a gradual transition strategy because you're changing nsTHashtable to use QMEHashTable in the next patch, which affects lots of tables anyway. (Modify PLDHashTable makes reviewing easier, too :)

> * Iterators are slightly different as well, because we can no longer blindly
> advance the internal pointer each time: removes may deposit live elements at
> the entry we were currently looking at, so we need to take that into account.

Another possibility is to record the removed elements in a Vector and then wait until iteration ends before doing the actual removal. It's possible that is simpler, though it does require extra time and memory for the Vector.

::: xpcom/ds/QMEHashTable.cpp
@@ +22,5 @@
> +
> +// TODO: thread-safety checks
> +
> +static bool
> +QSizeOfEntryStore(uint32_t aCapacity, uint32_t aEntrySize, uint32_t* aNbytes)

Why the Q prefix?

@@ +35,5 @@
> +// - table must be at most 75% full;
> +// - capacity must be a power of two;
> +// - capacity cannot be too small.
> +static inline void
> +QBestCapacity(uint32_t aLength, uint32_t* aCapacityOut,

ditto

@@ +198,5 @@
> + bool addingEntry = Reason == ForAdd || Reason == ForAddDuringResize;
> + // Try to ensure that we don't add new reasons without handling them.
> + MOZ_ASSERT_IF(!addingEntry, Reason == ForSearchOrRemove);
> +
> + // QMEHashTable employs linear probing with Robin Hood-style hashing.

Does Robin Hood-style hashing not imply linear probing? (Genuine question, I'm not sure.)

@@ +206,5 @@
> + //
> + // As we go along, if there's a bucket whose distance from *its*
> + // desired initial bucket is lower than the distance we've probed from
> + // our desired initial bucket, we'll put our inserted entry there, and
> + // continue inserting the now-vacated element.

I'm going to nitpick the language here a bit. I don't think we need to introduce the term "bucket"; we already use "entry" in much the same way and I don't think having two terms is good. We have to be a bit careful about distinguishing e.g. empty entries vs use entries, but I think it can be done. "Entry index" can be used to refer to the actual index of an entry.

@@ +208,5 @@
> + // desired initial bucket is lower than the distance we've probed from
> + // our desired initial bucket, we'll put our inserted entry there, and
> + // continue inserting the now-vacated element.
> + //
> + // We are guaranteed to find a new entry.

Do you have your editor set to break comment lines at column 72? This block comment could/should be a lot wider. Likewise for other comments throughout this patch.

@@ +215,5 @@
> + uint32_t bucketIndex = IdealBucketIndex(aKeyHash);
> +
> + uint32_t probeLength = 0;
> + PLDHashEntryHdr* temporary = static_cast<PLDHashEntryHdr*>(mTemporaryEntry);
> + bool reinserting = false;

I'm not quite sure what this bool means. A comment might help.

@@ +235,5 @@
> + mOps->moveEntry(this, temporary, entry);
> + entry->mKeyHash = tempHash;
> + return retval;
> + }
> + return (addingEntry) ? entry : nullptr;

Redundant parens.

@@ +242,5 @@
> + // What we do here depends on what state we're in:
> + //
> + // - If we are reinserting some entry from the table, then we do nothing,
> + // because the reinserted entry cannot possibly match any of the other
> + // entries in the hashtable.

This is a bit unclear. Is this during resize? As above, I'm not sure what "reinserting" means.

@@ +289,5 @@
> + // we still have work to do, so continue through the loop, with
> + // a new entry.
> + // 3. We are re-inserting elements, so:
> + // 3a. Swap the temporary entry with the offending element.
> + // 3b. Continue inserting the new temporary entry.

The numbering here is inconsistent: there's 2a/2b/2c without 2, but there's 3a/3b with 3.

Also, I find this list confusing. Is 3/3a/3b an alternative to 1/2a/2b/2c? Making it slightly more pseudocode-y might help, esp. if there should be an "else" in there?

@@ +305,5 @@
> + retval = entry;
> + } else {
> + // We have two entries that we need to swap: the current entry,
> + // `entry` and the temporary entry. We *also* have a free entry
> + // in `retval` that we can use for temporary storage.

The list after the colon reads like it has three elements, not two. This might be clearer: "the current entry (`entry`) and the temporary entry".

@@ +558,5 @@
> + // When we remove an entry, we'll scan forwards to find an entry that is
> + // either empty, or in its desired place. Then we'll shift all entries
> + // between our removed entry and the found entry backwards, lowering their
> + // distance from their desired entry slot. Doing this improves lookup and
> + // insertion performance.

And it saves space because we don't have wasted entries.

@@ +563,5 @@
> + //
> + // What about accidentally quadratic concerns here? For instance, what if
> + // we had a *really* long probe chain? Such chains are unlikely, given our
> + // Robin Hood insertion strategy above. XXX more robust explanation.
> +

trailing whitespace

@@ +601,5 @@
> +
> + emptyBucket = nextBucket;
> + emptyEntry = nextEntry;
> + }
> +

trailing whitespace

::: xpcom/ds/QMEHashTable.h
@@ +60,5 @@
> + // strategy, so we need to do something a little different: we'll keep
> + // a count of the maximum number of entries we've ever seen in the table.
> + // Using that value and the current number of entries, we can compute
> + // how many entries have been removed.
> + uint32_t mHighWaterEntryCount;

Do we need to track this? In PLDHashTable we track the number of removed entries because they take up space and so need to be taken into account by the table loading calculations. But in QMEHashTable they don't take up space. So I think all this logic relating to the number of removed entries can be completely removed.

@@ +62,5 @@
> + // Using that value and the current number of entries, we can compute
> + // how many entries have been removed.
> + uint32_t mHighWaterEntryCount;
> + EntryStore mEntryStore; // (Lazy) entry storage and generation.
> + uint32_t mSizeMask;

What's this for? Needs a comment.

@@ +71,5 @@
> + // caller owns the memory for the inserted entry and therefore we can't
> + // scribble over it. So we need a small entry that is obviously ours.
> + //
> + // XXX: a better design for this class would obviate the need for this.
> + void* mTemporaryEntry;

It's a shame the effect this has on sizeof(QMEHashTable). AFAICT this field is only used in SearchTable() -- can we use stack allocation instead? The size depends on the table (mEntrySize) so you can't use a normal local variable but we have precedent in the tree for using alloca() in layout/style/nsRuleNode.{h,cpp}.

::: xpcom/tests/gtest/HashTableBench.cpp
@@ +11,5 @@
> +using namespace mozilla;
> +
> +// A trivial hash function is good enough here. It's also super-fast for the
> +// GrowToMaxCapacity test because we insert the integers 0.., which means it's
> +// collision-free.

Please use a realistic hash function! Probably mozilla::HashGeneric().

@@ +41,5 @@
> + PLDHashTable table(&trivialPOps, sizeof(PLDHashEntryStub), kHashTableSize);
> +
> + for (size_t i = 0; i < kHashTableSize; ++i) {
> + bool success = table.Add(reinterpret_cast<const void*>(i), fallible);
> + ASSERT_TRUE(success);

This benchmark needs strengthening. You're only measuring insertion, not lookup or removal. And as well as measuring on this simple table, measuring on another table with a larger PLDHashEntry will stress the cache in different ways.

@@ +66,5 @@
> +
> +MOZ_GTEST_BENCH(XPCOM, XPCOM_QMEHashTable_Bench, QMEHashBench);
> +MOZ_GTEST_BENCH(XPCOM, XPCOM_PLDHashTable_Bench, PLDHashBench);
> +
> +TEST(QMEHashTable, Insertion)

This test could also be strengthened:
- Insert many more items to test resizing more;
- More interleaving of insertion, lookup, and removal.
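
The "record the removed elements in a Vector and remove after iteration" idea from the review can be sketched as below. This is only an illustration: `RemoveIfDeferred` is a hypothetical helper, `std::vector` stands in for mozilla::Vector, and `std::unordered_set` stands in for the hashtable, so none of it reflects the actual iterator API in the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical helper: collect the keys to remove while iterating, then erase
// them once iteration has finished, so the table is never mutated (and no
// entries are shifted) underneath a live iterator.
template <typename Table, typename Pred>
std::size_t RemoveIfDeferred(Table& table, Pred shouldRemove) {
  std::vector<typename Table::key_type> doomed;
  for (const auto& key : table) {
    if (shouldRemove(key)) {
      doomed.push_back(key);  // extra time and memory, as the review notes
    }
  }
  for (const auto& key : doomed) {
    table.erase(key);
  }
  return doomed.size();
}
```

The cost is one extra pass over the doomed keys plus the temporary storage, in exchange for iterators that can blindly advance again.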

Attachment #8911870 - Flags: feedback?(n.nethercote) → feedback+

Nathan Froyd [:froydnj]

Reporter

Comment 5

8 years ago

(In reply to Nicholas Nethercote [:njn] from comment #4)
> What's the rationale for introducing QMEHashTable rather than just modifying
> PLDHashTable? AFAICT the API is much the same (though I might be overlooking
> something). Seems like it's just adding more code for no particular reason.
> It's not in the service of a gradual transition strategy because you're
> changing nsTHashtable to use QMEHashTable in the next patch, which affects
> lots of tables anyway. (Modify PLDHashTable makes reviewing easier, too :)

Adding a new class means that I don't have to change the entire world every time I want to make changes in the new implementation. I can move nsTHashtable to the new implementation without changing all the extant users of PLDHashTable--adding a new "virtual" method to the ops table, for example.

I was also unsure of how many clients might be silently depending on implementation details of PLDHashTable. Some of dbaron's experiments with modifying PLDHashTable have turned up subtle dependencies, and I wanted to try and minimize that. (I realize that moving over many of the hashtables through nsTHashtable is probably incompatible with that desire!)

Having both implementations in-tree also makes benchmarking much easier during development.

> > * Iterators are slightly different as well, because we can no longer blindly
> > advance the internal pointer each time: removes may deposit live elements at
> > the entry we were currently looking at, so we need to take that into account.
>
> Another possibility is to record the removed elements in a Vector and then
> wait until iteration ends before doing the actual removal. It's possible
> that is simpler, though it does require extra time and memory for the Vector.

We could also attempt to find an unused entry before starting iteration, and start iterating *backwards* through the table from there; that would ensure we're never moving live elements into an entry we're currently on.

> ::: xpcom/ds/QMEHashTable.cpp
> @@ +22,5 @@
> > +
> > +// TODO: thread-safety checks
> > +
> > +static bool
> > +QSizeOfEntryStore(uint32_t aCapacity, uint32_t aEntrySize, uint32_t* aNbytes)
>
> Why the Q prefix?

So it won't collide with static functions in PLDHashTable.cpp; I wasn't sure how much I would have to modify at first.

> @@ +198,5 @@
> > + bool addingEntry = Reason == ForAdd || Reason == ForAddDuringResize;
> > + // Try to ensure that we don't add new reasons without handling them.
> > + MOZ_ASSERT_IF(!addingEntry, Reason == ForSearchOrRemove);
> > +
> > + // QMEHashTable employs linear probing with Robin Hood-style hashing.
>
> Does Robin Hood-style hashing not imply linear probing? (Genuine question,
> I'm not sure.)

I think you could do it without doing linear probing, though it'd be awkward: computing distance from your ideal entry is no longer straightforward, for instance.

> @@ +206,5 @@
> > + //
> > + // As we go along, if there's a bucket whose distance from *its*
> > + // desired initial bucket is lower than the distance we've probed from
> > + // our desired initial bucket, we'll put our inserted entry there, and
> > + // continue inserting the now-vacated element.
>
> I'm going to nitpick the language here a bit. I don't think we need to
> introduce the term "bucket"; we already use "entry" in much the same way and
> I don't think having two terms is good. We have to be a bit careful about
> distinguishing e.g. empty entries vs use entries, but I think it can be
> done. "Entry index" can be used to refer to the actual index of an entry.

Yeah, I will try and adjust the language around this a bit.

> @@ +208,5 @@
> > + // desired initial bucket is lower than the distance we've probed from
> > + // our desired initial bucket, we'll put our inserted entry there, and
> > + // continue inserting the now-vacated element.
> > + //
> > + // We are guaranteed to find a new entry.
>
> Do you have your editor set to break comment lines at column 72? This block
> comment could/should be a lot wider. Likewise for other comments throughout
> this patch.

I do, can try to fix.

> @@ +242,5 @@
> > + // What we do here depends on what state we're in:
> > + //
> > + // - If we are reinserting some entry from the table, then we do nothing,
> > + // because the reinserted entry cannot possibly match any of the other
> > + // entries in the hashtable.
>
> This is a bit unclear. Is this during resize? As above, I'm not sure what
> "reinserting" means.

I thought this was clear; we're reinserting entries when we're doing the "Robin Hood" part of "Robin Hood hashing". I'll try to clarify this.

> ::: xpcom/ds/QMEHashTable.h
> @@ +60,5 @@
> > + // strategy, so we need to do something a little different: we'll keep
> > + // a count of the maximum number of entries we've ever seen in the table.
> > + // Using that value and the current number of entries, we can compute
> > + // how many entries have been removed.
> > + uint32_t mHighWaterEntryCount;
>
> Do we need to track this? In PLDHashTable we track the number of removed
> entries because they take up space and so need to be taken into account by
> the table loading calculations. But in QMEHashTable they don't take up
> space. So I think all this logic relating to the number of removed entries
> can be completely removed.

OK, I think I can buy that.

> @@ +71,5 @@
> > + // caller owns the memory for the inserted entry and therefore we can't
> > + // scribble over it. So we need a small entry that is obviously ours.
> > + //
> > + // XXX: a better design for this class would obviate the need for this.
> > + void* mTemporaryEntry;
>
> It's a shame the effect this has on sizeof(QMEHashTable). AFAICT this field
> is only used in SearchTable() -- can we use stack allocation instead? The
> size depends on the table (mEntrySize) so you can't use a normal local
> variable but we have precedent in the tree for using alloca() in
> layout/style/nsRuleNode.{h,cpp}.

Stack allocation seems like a better idea, especially after we've enforced an 8-bit entry size akin to what PLDHashTable does. Or even just having a char[256] on the stack, since we have a reasonable maximum entry size.

> ::: xpcom/tests/gtest/HashTableBench.cpp
> @@ +11,5 @@
> > +using namespace mozilla;
> > +
> > +// A trivial hash function is good enough here. It's also super-fast for the
> > +// GrowToMaxCapacity test because we insert the integers 0.., which means it's
> > +// collision-free.
>
> Please use a realistic hash function! Probably mozilla::HashGeneric().

This is a good point, though testing with a hash function that returns numbers in-order is not a bad stress test of the hashtable, too.

> @@ +41,5 @@
> > + PLDHashTable table(&trivialPOps, sizeof(PLDHashEntryStub), kHashTableSize);
> > +
> > + for (size_t i = 0; i < kHashTableSize; ++i) {
> > + bool success = table.Add(reinterpret_cast<const void*>(i), fallible);
> > + ASSERT_TRUE(success);
>
> This benchmark needs strengthening. You're only measuring insertion, not
> lookup or removal. And as well as measuring on this simple table, measuring
> on another table with a larger PLDHashEntry will stress the cache in
> different ways.
>
> @@ +66,5 @@
> > +
> > +MOZ_GTEST_BENCH(XPCOM, XPCOM_QMEHashTable_Bench, QMEHashBench);
> > +MOZ_GTEST_BENCH(XPCOM, XPCOM_PLDHashTable_Bench, PLDHashBench);
> > +
> > +TEST(QMEHashTable, Insertion)
>
> This test could also be strengthened:
> - Insert many more items to test resizing more;
> - More interleaving of insertion, lookup, and removal.

Will try writing several more sophisticated benchmarks, probably cribbing from some of the Robin Hood hashing blog posts, and seeing what happens.
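
As a rough illustration of the "char[256] on the stack" option mentioned above, the sketch below swaps two raw entries through a fixed-size stack buffer. It is an assumption-laden toy: `kMaxEntrySize` and `SwapRawEntries` are hypothetical names, and the entries are copied with `memcpy`, whereas the real table moves entries through its `moveEntry` op, which a plain byte copy does not model.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical upper bound on the entry size, mirroring the idea of an
// 8-bit entry size limit.
constexpr std::size_t kMaxEntrySize = 256;

// Swap two raw entries of `entrySize` bytes using stack scratch space, so
// the table object itself needs no heap-allocated temporary entry.
// Assumption: entries are safe to relocate with memcpy.
inline void SwapRawEntries(void* a, void* b, std::size_t entrySize) {
  assert(entrySize <= kMaxEntrySize);
  alignas(std::max_align_t) unsigned char scratch[kMaxEntrySize];
  std::memcpy(scratch, a, entrySize);
  std::memcpy(a, b, entrySize);
  std::memcpy(b, scratch, entrySize);
}
```

Compared with alloca(), the fixed buffer always reserves the maximum size, but it avoids a variable-sized stack frame.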

Nathan Froyd [:froydnj]

Reporter

Comment 6

8 years ago

(In reply to Nathan Froyd [:froydnj] from comment #5)
> (In reply to Nicholas Nethercote [:njn] from comment #4)
> > ::: xpcom/ds/QMEHashTable.h
> > @@ +60,5 @@
> > > + // strategy, so we need to do something a little different: we'll keep
> > > + // a count of the maximum number of entries we've ever seen in the table.
> > > + // Using that value and the current number of entries, we can compute
> > > + // how many entries have been removed.
> > > + uint32_t mHighWaterEntryCount;
> >
> > Do we need to track this? In PLDHashTable we track the number of removed
> > entries because they take up space and so need to be taken into account by
> > the table loading calculations. But in QMEHashTable they don't take up
> > space. So I think all this logic relating to the number of removed entries
> > can be completely removed.
>
> OK, I think I can buy that.

Actually, wait, I don't. The removed entries take up the same amount of space in both cases. We're keeping track of them in the PLDHashTable case so we know when our table was large at some point in the past, but at the present time we don't have enough entries to really justify keeping it that large. Why wouldn't we do the same thing for QMEHashTable?

How would we make ShrinkIfAppropriate() work in this new world where we don't track stats about removed entries? We'd just remove the constraint on a quarter of all entries being removed? The same question can be asked for the check in Add(), though there I guess we'd never compress the table? (I'm not quite sure how we "compress" the table when we're using a deltaLog2 of zero--that is, not changing the size of the table at all--but maybe I am missing something subtle there.)
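
One way to picture the "just remove the constraint on removed entries" option raised above is a shrink check that looks only at the live entry count. This is a hypothetical policy written for illustration: `kMinCapacity`, `ShouldShrink` and the one-quarter threshold are assumptions, not the behaviour of PLDHashTable::ShrinkIfAppropriate() or of the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical minimum below which the table is never shrunk.
constexpr uint32_t kMinCapacity = 8;

// Hypothetical policy: with no tombstones to account for, decide purely on
// how sparsely populated the table currently is (at most a quarter full).
// `capacity` is assumed to be a power of two.
constexpr bool ShouldShrink(uint32_t entryCount, uint32_t capacity) {
  return capacity > kMinCapacity && entryCount <= (capacity >> 2);
}

static_assert(ShouldShrink(256, 1024), "a quarter full: shrink");
static_assert(!ShouldShrink(257, 1024), "just over a quarter: keep");
```

A check like this forgets that the table was once large, which is exactly the information the high-water count was meant to keep; whether that matters is the open question in this comment.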

Nathan Froyd [:froydnj]

Reporter

Comment 7

8 years ago

(In reply to Nicholas Nethercote [:njn] from comment #3)
> > QMEHashTable
>
> What does QME stand for?

QME doesn't stand for anything, it's PLD++, in the same way that IBM is HAL++. (One *really* wonders whether one side of the latter is deliberate or just a happy accident.)

Nicholas Nethercote [inactive]

8 years ago

CCing jesup since mccr8 told me he's run into perf issues with the CC hashtable as well.

Randell Jesup [:jesup] (needinfo me)

Comment 18

8 years ago

These benchmarks may be interesting: https://tessil.github.io/2016/08/29/benchmark-hopscotch-map.html https://tessil.github.io//other/hash_table_benchmark.html

Randell Jesup [:jesup] (needinfo me)

7 years ago

Attachment #8911872 - Attachment is obsolete: true

Nathan Froyd [:froydnj]

Reporter

Comment 31

6 years ago

(In reply to Sean Voisen (:svoisen) from comment #30)

> Just curious if this project is something we care to resume any time in the near future? I'm going through old layout/style system bugs that were on my radar, and bug 1474789 potentially blocks this, but I'm unsure it's anything we want to invest time fixing.

It's not something I plan on working on in the near future. I'm a little concerned about changes like "reading the hashtable now involves writing to the hashtable", which I think breaks a lot of other things, not just stylo, and I think the hash table state-of-the-art has moved on a bit anyway.

Flags: needinfo?(nfroyd)

Dave Hunt [:davehunt] [he/him] ⌚BST

Updated

4 years ago

Performance Impact: --- → P2

Keywords: perf:responsiveness

Whiteboard: [qf:p2:responsiveness]

Nika Layzell [:nika] (ni? for response)

Updated

4 years ago

Priority: -- → P3

Nika Layzell [:nika] (ni? for response)

Updated

4 years ago

Assignee: froydnj+bz → nobody

BMO Automation

Updated

3 years ago

Severity: normal → S3

Nika Layzell [:nika] (ni? for response)

Comment 32

1 year ago

I don't think we're going to adjust things in exactly the way described in this bug at this point. We might still be interested in doing improvements to the hashtable algorithm in the future (I'm sure there's room for improvement there), but closing for now.

Status: NEW → RESOLVED

Closed: 1 year ago

Resolution: --- → WONTFIX

Attachments:

part 1 - add QMEHashTable | 8 years ago | Nathan Froyd [:froydnj] | 47.86 KB, patch | n.nethercote: feedback+
part 2 - make nsTHashtable use QMEHashTable underneath | 8 years ago | Nathan Froyd [:froydnj] | 8.60 KB, patch
part 1 - add QMEHashTable | 7 years ago | Nathan Froyd [:froydnj] | 52.12 KB, patch
part 2 - make nsTHashtable use QMEHashTable underneath | 7 years ago | Nathan Froyd [:froydnj] | 7.02 KB, patch