Closed Bug 729952 Opened 13 years ago Closed 13 years ago

Need a better hash function for atoms

Categories: Core :: XPCOM, defect
Platform: x86 macOS
Priority: Not set
Severity: normal
Status: RESOLVED FIXED
Target Milestone: mozilla13

People

(Reporter: bzbarsky, Assigned: justin.lebar+bug)

Attachments

(5 files, 6 obsolete files)

7.63 KB, patch
130.66 KB, text/plain
5.27 KB, patch
6.39 KB, patch (bzbarsky: review+)
6.60 KB, patch (Waldo: review+)

I just tried measuring our atom hash collisions: out of 11000 unique atomized strings, close to 900 collided. That's sucky in the extreme, and it would suck even more for purposes of bug 705877.

The atom hashing code uses nsCRT::HashCode(PRUnichar*) and nsCRT::HashCodeAsUTF8. When fixing it, we need to keep the latter working as it does now. I _think_ the current behavior, where nsCRT::HashCode(PRUnichar*) behaves the same as nsCRT::HashCode(char*) when called on ASCII input, does not need to be preserved if we change the hashing behavior in nsCRT.

There is some discussion of hash functions and their performance and behavior in bug 290032. There is also a good comparative writeup at http://burtleburtle.net/bob/hash/doobs.html (note the links to SpookyHash and lookup3.c). What we use right now is what that writeup calls "Rotating", and as it notes, it's clearly crappy. WebKit's string hashing uses what the writeup above calls "Paul Hsieh's hash", fwiw.
Attached patch: Measurement code
Test methodology: applied this, ran for a bit while logging to a file, then quit the browser. Then pipe the log file through:

  grep STR | sort | uniq | sed 's/.*|,[^,]*,//' | sort | uniq -c | sort -n

and cried at all the 5s visible in my terminal.
In bug 676071, I am importing SpookyHash into ANGLE. SpookyHash seems to be a good 128-bit (it can of course be truncated to smaller sizes) non-cryptographic public-domain hash function: http://burtleburtle.net/bob/hash/spooky.html

Aside from that, the only other bit I can contribute to this discussion is the following formula, which tells roughly how many hashes one can do with a perfect N-bit hash function while staying under a given probability P of collision:

  2^(N/2) * sqrt(2*P)

This is an approximation, valid for not-too-large P (say P < 0.2), of the formula given here: http://en.wikipedia.org/wiki/Birthday_problem#Cast_as_a_collision_problem

It can help decide whether a good 32-bit hash function is enough or not for a given use case. For example, if you are happy with a probability P = 1e-4 of collision, a 32-bit hash allows one to safely do:

  2^(32/2) * sqrt(2e-4) ~= 900 hashes
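(A quick sanity check of that bound, assuming an ideal 32-bit hash; this snippet is illustrative and not part of the original discussion.)

  #include <cmath>
  #include <cstdio>

  int main()
  {
    const double N = 32.0;   // hash width in bits
    const double P = 1e-4;   // acceptable collision probability
    // Approximate number of hashes that stay under probability P of any collision.
    double hashes = std::pow(2.0, N / 2.0) * std::sqrt(2.0 * P);
    std::printf("about %.0f hashes\n", hashes);   // prints roughly 927 for these values
    return 0;
  }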
"keep working as now" means "UTF8 and UTF16 strings need to hash to the same value"? It's not clear to me why we have so many different functions to do the same thing, but can we please move this off of nsCRT while we're at it? Do you have a set of realworld sample strings we can run tests against?
> "keep working as now" means "UTF8 and UTF16 strings need to hash to the same value"? Yes, if they represent the same Unicode string. The atom table depends on this behavior. I believe it's the sole consumer of nsCRT::HashCodeAsUTF8, outside of tests. > but can we please move this off of nsCRT while we're at it? That would be just fine by me. > Do you have a set of realworld sample strings we can run tests against? I can attach the set of strings that came out of the procedure in comment 1. I basically started the browser, loaded gmail, web.mit.edu, some github pages, and a news site or two, then quit the browser to get that string set.
Attached file 13k strings or so
Data is UTF-8. Note that there is one non-ASCII string at the end there.
I'm a bit concerned about using a 128-bit hash in hot code when we need only 32 (or even 64) bits. But maybe it's not a big deal. Let's run murmur and spooky over bz's data to see if one is obviously better in terms of collisions?
One other note: for atoms specifically, the typical string being hashed is short, not least because people minify their class names and the like. In particular, for the attached list I get numbers like so (first column is count, second column is length):

    66  1
  1797  2
  1383  3
  1279  4
  1272  5
   707  6
   525  7
   539  8
   970  9
   647 10
   566 11
   558 12
   473 13
   409 14
   289 15
   346 16
   283 17
   230 18
   207 19
   186 20
   159 21
   122 22
    99 23
    91 24
    63 25
    69 26
    53 27
    42 28
    26 29
    23 30
    20 31
    10 32
    13 33
     8 34
     5 35
     4 36
     6 37
     2 38
     1 39
    46 40
     2 42
     2 43

So for hashing performance we're more interested in low startup costs than in good behavior on long inputs. For hashing behavior we want functions that do a good job spreading a small number of bytes over a 32-bit int. For example, our current hash when given a 3-byte input produces a 32-bit number which has 0s in the top 16 bits. And if you restrict the set of possible byte values to the expected [a-zA-Z0-9-_] then I bet we end up using even fewer bits than that, which is why collision rates are so high.
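(For reference, a sketch of the "Rotating" scheme that writeup describes, which is essentially what the old nsCRT loop does; the real code lives in nsCRT.cpp, this is just an illustration of why short inputs underuse the hash space.)

  #include <cstddef>
  #include <cstdint>

  uint32_t RotatingHash(const unsigned char* data, size_t len)
  {
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++) {
      h = ((h << 4) | (h >> 28)) ^ data[i];   // rol(h, 4) ^ byte, no multiply
    }
    return h;
  }
  // For a 3-byte ASCII input, each byte is < 0x80 and gets rotated left by 4 at
  // most twice, so the result never reaches the top half of the word -- which is
  // exactly the "0s in the top 16 bits" behavior described above.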
How about: if a string is 4 bytes or less, just use the value of those bytes as the hash value? For longer strings, use one of the "complex" hash algorithms, e.g. (ignore endianness, it doesn't matter):

  "A"     -> 0x41000000
  "AB"    -> 0x41420000
  "ABC"   -> 0x41424300
  "ABCD"  -> 0x41424344
  "ABCDE" -> spooky("ABCDE")

I'm pretty sure that the |PRUint32 HashCode(const char* str, PRUint32* resultingStrLen = nsnull)| variant isn't important, and the rest pass in a length, so we can switch on it without reading the data first...
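(A minimal sketch of that idea; SpookyHash32 here is a hypothetical wrapper for whichever full hash function ends up being chosen.)

  #include <cstddef>
  #include <cstdint>

  uint32_t SpookyHash32(const unsigned char* data, size_t len);   // hypothetical fallback

  uint32_t HashPossiblyShortString(const unsigned char* data, size_t len)
  {
    if (len <= 4) {
      uint32_t h = 0;
      for (size_t i = 0; i < len; i++) {
        // "A" -> 0x41000000, "AB" -> 0x41420000, "ABCD" -> 0x41424344
        h |= uint32_t(data[i]) << (24 - 8 * i);
      }
      return h;
    }
    return SpookyHash32(data, len);
  }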
That's worth testing, at least.
Heh, spookyhash has exactly one collision on bz's whole dataset.
That's a bit more like it! ;)
Actually, none of the three hash functions have any collisions on the test set. I was hashing the empty string twice.
Time to hash bz's atoms list on my mac:

  spookyhash          234us
  murmurhash3_x86_32  187us
  cityhash            133us

The other versions of murmurhash (x86_128 and x64_128) are all slower than spookyhash. Seeing as none of these has any collisions, I think we have a winner.
(In reply to Justin Lebar [:jlebar] from comment #14)
> Time to hash bz's atoms list on my mac:
>
>   spookyhash          234us
>   murmurhash3_x86_32  187us
>   cityhash            133us

What's the noise level in such ultra-short benchmarks? Are these results consistent across runs? I would run this benchmark 10k times and average (or, perhaps better: keep the best score)
One other question, just so I know how much to cry... What's the timing like for our existing hash function?
I'm running it 1000 times and taking the average. The noise is +/- 3ms or so. I threw the code online in case anyone else wants to benchmark. You should just need to run |make|, although I make no guarantees. I'll add the current hash function in the morning; I am, as they say here, le tired. https://github.com/jlebar/hashtest
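(For anyone reproducing this without the repo, the measurement loop amounts to something like the following sketch; this is an assumption about the methodology, not the actual hashtest code.)

  #include <chrono>
  #include <cstdint>
  #include <string>
  #include <vector>

  uint32_t HashUnderTest(const std::string& s);   // whichever hash is being measured

  // Hash every string in the list |reps| times and return the mean time per
  // pass in microseconds.
  double MeanPassMicroseconds(const std::vector<std::string>& strings, int reps)
  {
    volatile uint32_t sink = 0;   // keep the compiler from discarding the work
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; r++) {
      for (const std::string& s : strings) {
        sink = sink ^ HashUnderTest(s);
      }
    }
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::micro> total = end - start;
    return total.count() / reps;
  }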
(In reply to Justin Lebar [:jlebar] from comment #14)
> cityhash 133us

Keep in mind cityhash relies on the special x86 CRC instruction. If you don't have a processor that supports it (and many of our users do not, especially on mobile), it is nowhere near as fast.
These are the times for the non-CRC variant.
The current nsCRT hash is 64us, so twice as fast as cityhash.
(In reply to Benjamin Smedberg [:bsmedberg] from comment #3)
> can we please move this off of nsCRT while we're at it?

Where would you like it to live?
There are existing declarations in nsHashKeys.h that we could just use, or have a new header/cpp specifically for these hashing functions.
Waldo, how would you feel about checking this in to mfbt? I'd like this hash function to be accessible everywhere. It's one third-party cpp file, plus a header file. Both have some Mozilla-specific modifications, but I don't want to convert the whole thing into Mozilla/mfbt style, because that will make it harder to take upstream changes.
I tried bsmedberg's suggestion of using the string's chars, cast as a uint32_t, as the hashcode when its length is <= 4. This gets rid of all but 11 of the collisions with the current hashcode. Unfortunately, it's slower (much slower for nscrt), at least in a naive implementation (I used a case statement for the length branch).

The following times are all with the length <= 4 wrapping:

  spookyhash  219us
  cityhash    172us
  murmurhash  219us
  nscrt       205us

Here are the original (unwrapped) times again:

> spookyhash  234us
> murmurhash  187us
> cityhash    133us
> nscrt        64us
Comment on attachment 600450 [details] [diff] [review]
Part 1, WIPv1: Add cityhash to mfbt

Asking for feedback on the general idea of having third-party code in mfbt. I need to clean up the comments and so on in here before review.
Attachment #600450 - Flags: feedback?(jwalden+bmo)
Attached patch: Add cityhash (obsolete)
Attachment #600725 - Flags: review?(jwalden+bmo)
Attachment #600449 - Attachment is obsolete: true
Attachment #600450 - Attachment is obsolete: true
Attachment #600450 - Flags: feedback?(jwalden+bmo)
Justin, what's the actual plan for HashCodeAsUTF8 given the APIs in that patch?
I wasn't sure if you needed a hash function that speaks unicode. If so, I think I can rewrite cityhash to take a data transform function. I'm not sure exactly what you need, but I presume an arbitrary input transform would be sufficient? We'll have to see how fast I can make it.
Comment on attachment 600725 [details] [diff] [review]
Add cityhash

If I just add a multiply to the current nsCRT hashcode function, it has 0 collisions on bz's dataset and runs just as fast as the current code. That seems much simpler than rewriting CityHash as streaming code.
Attachment #600725 - Flags: review?(jwalden+bmo)
> I wasn't sure if you needed a hash function that speaks unicode. See comment 3 and comment 4...
(In reply to Boris Zbarsky (:bz) from comment #32)
> > I wasn't sure if you needed a hash function that speaks unicode.
> See comment 3 and comment 4...

I see; I didn't understand that exchange earlier. :)

Multiplying by golden ratio when adding a char seems to work just fine -- in retrospect, using just rol and xor seems pretty nuts. I'm working on converting all callers to this better hash, but I'll post a smaller patch asap.
Note that the atom table code currently hashes all strings in UTF-16 space. I.e., we hash nsStrings "as they are", and nsCStrings are hashed using HashCodeAsUTF16, which uses a UTF-8 iterator to extract individual Unicode characters and add their UTF-16 equivalent to the hash.

http://mxr.mozilla.org/mozilla-central/source/xpcom/ds/nsCRT.cpp#278
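(In other words, the UTF-8 path walks code points and feeds their UTF-16 encoding into the same mixing step. A rough sketch; DecodeNextUTF8 and AddToHash are stand-ins for the real iterator and mixing function, not actual nsCRT names.)

  #include <cstdint>

  uint32_t AddToHash(uint32_t hash, uint32_t value);            // per-code-unit mixing step
  uint32_t DecodeNextUTF8(const char** iter, const char* end);  // hypothetical UTF-8 iterator

  uint32_t HashUTF8AsUTF16(const char* str, const char* end)
  {
    uint32_t hash = 0;
    while (str < end) {
      uint32_t cp = DecodeNextUTF8(&str, end);   // advances |str| past one code point
      if (cp <= 0xFFFF) {
        hash = AddToHash(hash, cp);              // BMP: a single UTF-16 code unit
      } else {
        cp -= 0x10000;                           // otherwise mix in the surrogate pair
        hash = AddToHash(hash, 0xD800 + (cp >> 10));
        hash = AddToHash(hash, 0xDC00 + (cp & 0x3FF));
      }
    }
    return hash;
  }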
I've been trying to reproduce bz's collisions list without much success.

> I basically started the browser, loaded gmail, web.mit.edu, some github pages, and a news site or
> two, then quit the browser to get that string set.

I've instrumented nsCRT's hashcode functions, plus the ones in nsTHashtable and pldhash. I'm printing the input and output of every hash function, and I see just a few (~10) hash collisions in total.
Hmm. I might have loaded the Facebook front page and twitter as well, possibly..... In any case, the attached string list is definitely the set of strings that were in atoms when those atoms were destroyed in my build.
Oh, d'oh, I was missing a |sort| before one of my |uniq|s. Much better now.
Attachment #600725 - Attachment description: Patch v1 → Add cityhash
(In reply to Justin Lebar [:jlebar] from comment #37)
> Oh, d'oh, I was missing a |sort| before one of my |uniq|s. Much better now.

Or perhaps much worse? :)

In any case, I can verify that

> #define ADD_TO_HASHVAL(hashval, c) \
>   hashval = (GOLDEN_RATIO * PR_ROTATE_LEFT32(hashval, 4)) ^ (c);

is much, much, much better than without the GOLDEN_RATIO call. In these tables, e.g. "1097 3" means that we had 1097 three-way collisions.

Originally:

  39662 1
   1956 2
   1097 3
    891 4
    266 5
     15 6
      1 7

With golden ratio multiply:

  47195 1
      1 2

(These are from two separate runs, so the strings aren't exactly the same. But I loaded the same sites, via session-restore. And you can see that the number of strings -- sum the first column -- is roughly the same.)
I intend to expand upon these functions in a followup bug, but this is the bare minimum required functionality to get something working.
I used the following to generate the collisions histogram:

  dist/bin/firefox | grep XXX > hashes && cut -f '2-3' hashes | sort | uniq | cut -f 2 | sort | uniq -c | sort -n | sed 's/^ *//' | cut -f 1 -d ' ' | sort | uniq -c
Attachment #600725 - Attachment is obsolete: true
Attachment #600962 - Attachment is obsolete: true
Attachment #600963 - Attachment is obsolete: true
Attachment #600963 - Flags: review?(bzbarsky)
Attachment #600965 - Flags: review?(bzbarsky)
Attachment #600966 - Flags: review?(jwalden+bmo)
Attachment #600964 - Attachment description: Debugging printf's. → Part 3 - Debugging printf's.
Blocks: 730895
> but can we please move this off of nsCRT while we're at it? I'm going to do this in bug 730895, if you don't mind, so bz's work isn't blocked on a large-ish code change.
Comment on attachment 600966 [details] [diff] [review]
Part 1, v1.1: Add a better hash function to mfbt. r?waldo

Should mozilla::GoldenRatioU32 be static? Or does it not matter?

Also, worth documenting why the rotation is by 5 bits, assuming there's a reason.
Comment on attachment 600965 [details] [diff] [review]
Part 2, v1.1: Use a better hash function in nsCRT, nsTHashtable, and pldhash.

r=me
Attachment #600965 - Flags: review?(bzbarsky) → review+
(In reply to Boris Zbarsky (:bz) from comment #45)
> Comment on attachment 600966 [details] [diff] [review]
> Part 1, v1.1: Add a better hash function to mfbt. r?waldo
>
> Should mozilla::GoldenRatioU32 be static? Or does it not matter?

I don't know how |static const| would be different from |const| in a header, actually. I guess it's a matter of the const being exported out of the compilation unit?

> Also, worth documenting why the rotation is by 5 bits, assuming there's a
> reason.

No good reason. An odd shift makes much more sense to me than an even shift; I'll comment that.

Also, I think I should change

  (GoldenRatio * rol(hash, 5)) ^ value

to

  GoldenRatio * (rol(hash, 5) ^ value).

With the former, the hash of a single char is that char's numeric value (assuming hash is initially 0), whereas with the latter, we get an arbitrary value. This would matter if the bloom filter doesn't multiply hash values by something before using them, since the bloom filter looks at the high 16 and low 16 bits of the hashcode separately.
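(Concretely, the shape under discussion looks something like the following sketch; this is not the final mfbt API, and the constant is the usual 32-bit golden-ratio value, which happens to be odd.)

  #include <cstdint>

  static const uint32_t GoldenRatioU32 = 0x9E3779B9U;   // 2**32 / phi, rounded

  inline uint32_t RotateLeft32(uint32_t value, uint8_t bits)
  {
    return (value << bits) | (value >> (32 - bits));
  }

  inline uint32_t AddToHash(uint32_t hash, uint32_t value)
  {
    // Multiply *after* xor'ing in the new value, so that even a one-character
    // string ends up with a well-spread hash rather than just the character's
    // numeric value.
    return GoldenRatioU32 * (RotateLeft32(hash, 5) ^ value);
  }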
To get even more entropy you should do:

  uint64_t res = GoldenRatio * (rol(hash, 5) ^ value);
  return (uint32_t)res ^ (uint32_t)(res >> 32);

which *should* compile to something basically as fast on x86, but I'm not sure that's true in practice, and I'm definitely not sure about ARM.
The bloom filter is using the pldhash hashcode, which is already multiplied by golden ratio on entry. The good thing is that GoldenRatio^N != 1 for small N...
Which may be worth documenting too.
I am not an expert on hash functions, but I'm hesitant to do something like that because:

* It's not clearly fast on all platforms.
* I've never seen a commercial hash function do that.
* The low 32 bits of |res| are random, but the number of non-zero high-order bits is determined by the magnitude of |rol(hash, 5) ^ value|. So if rol(hash, 5) and value happen to be equal in their high-order bits, |rol(hash, 5) ^ value| will be small, which automatically means that many of the top 32 bits of |res| will be 0.

If we wanted more entropy, we could multiply by a second constant. Here's the internal loop of murmurhash3's 32-bit hash (c1 and c2 are constants), for example:

  k1 *= c1;
  k1 = ROTL32(k1,15);
  k1 *= c2;

  h1 ^= k1;
  h1 = ROTL32(h1,13);
  h1 = h1*5+0xe6546b64;
> The good thing is that GoldenRation^N != 1 for small N... To be clear, this is because GoldenRatio is odd. We care about the Carmichael function [1]; in our case, lambda(2^32) == 2^30, so 2^30 is the smallest N such that a^N == 1, for any odd |a|. I'm not sure how much group theory we want to include in this file... :) [1] http://en.wikipedia.org/wiki/Carmichael_function
> The good thing is that GoldenRation^N != 1 for small N... Anyway this isn't particularly relevant because we have that rotate() function in place, which means we're never actually doing GoldenRatio^N.
(In reply to Jonas Sicking (:sicking) from comment #48)
> Which *should* compile to something basically as fast on x86, but I'm not
> sure that's true in practice, and I'm definitely not sure about ARM.

You have to use a long multiply on ARM, which, e.g., on a Cortex A8 is three cycles instead of two, but has the same latency (4 cycles). But after that it's just one extra instruction for the xor.

(In reply to Justin Lebar [:jlebar] from comment #51)
> |rol(hash, 5) ^ value| will be small, which automatically means that many of
> the top 32 bits of |res| will be 0.

This shouldn't be a serious issue, since you're xor'ing with a bunch of bits that don't have this problem. I've seen it done in, e.g., Marsaglia's classic "MWC" PRNG (which has many of the same requirements as a hash function): https://groups.google.com/group/sci.math.num-analysis/msg/eb4ddde782b17051

But that also multiplies by two constants. With just a single multiplicand, I'd worry that some repeating pattern in the interior of the product would cancel most of the bits. However, before we worry about adding more entropy, how does the current approach fare in things like Bob Jenkins's frog test?

(In reply to Justin Lebar [:jlebar] from comment #52)
> I'm not sure how much group theory we want to include in this file... :)

We want to include all the group theory (someone was telling me just the other week they didn't understand when it ever gets used in real life... and didn't believe me when I said "all the time").
> We care about the Carmichael function [1]; in our case, lambda(2^32) == 2^30, so 2^30 is > the smallest N such that a^N == 1, for any odd |a|. 2^30 is the smallest N such that a^N = 1 no matter what |a| is. For any particular value of a, the smallest n such that a^n = 1 will be a factor of 2^30, but may be much smaller. In the extreme cases, 1^1 = 1 and (2^32 - 1)^2 = 1, say. What really matters here is what the order of GoldenRatio is in the multiplicative group Z mod 2^32. A brute-force search suggests this order is 2^29, which is good enough for our purposes. > I'm not sure how much group theory we want to include in this file... :) I think saying that 2^29 is the smallest N for which GoldenRatio^N == 1 mod 2^32 is good enough.
> which means we're never actually doing GoldenRatio^N. Well, we're doing it for N=2 if you do it to produce your hashcode and then pldhash does it on the input hashcode.
Heh, okay, I think we all are having too much fun with this (myself included!). :) I propose we table this hash bikes--er, analysis until I have bug 730895 in hand. This will push all our hashing (or as much as I can muster) through the new algorithm. We can then analyze using all forms of amphibian tests. As is, this new hash function appears to be a clear win.
Attachment #600966 - Attachment is obsolete: true
Attachment #601024 - Flags: review?(jwalden+bmo)
Attachment #600966 - Flags: review?(jwalden+bmo)
See also http://isthe.com/chongo/tech/comp/fnv/, which is very similar to what we're using, but has theoretically well-chosen magic numbers.
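(For comparison, 32-bit FNV-1a from that page has the same xor-then-multiply shape, just with different constants. A sketch from memory of the published algorithm, not code from this bug.)

  #include <cstddef>
  #include <cstdint>

  uint32_t FNV1a32(const unsigned char* data, size_t len)
  {
    uint32_t hash = 2166136261U;     // FNV offset basis
    for (size_t i = 0; i < len; i++) {
      hash ^= data[i];
      hash *= 16777619U;             // 32-bit FNV prime
    }
    return hash;
  }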
(In reply to Boris Zbarsky from comment #55) > What really matters here is what the order of GoldenRatio is in the > multiplicative group Z mod 2^32. For odd numbers, the order depends on the first bit transition after bit 1. For N > 2, (k×2^N±1)² = (k²×2^2N±k×2^(N+1)+1) = k'×2^(N+1)+1 which is of the previous form for N'=N+1, so that squaring enough times will reach 1 mod 2^32. GoldenRatio is of this form where N = 3 so that 32-3=29 squarings suffice. For N = 1 or 2, (k×2³±3)² = (k²×2^6±k×2^4+9) which is of the previous form for N'=3, so that 30 squarings are required in the worst case. (Put another way, GoldenRatio has two square roots mod 2^32.) I didn't have to consider odd powers because the lowest two set bits are always the same in any odd power.
(In reply to neil@parkwaycc.co.uk from comment #60)
> GoldenRatio has two square roots mod 2^32.

Of course it doesn't, it either has zero or four square roots.
Comment on attachment 601024 [details] [diff] [review]
Part 1, v1.2: Add a better hash function to mfbt. r?waldo

Review of attachment 601024 [details] [diff] [review]:
-----------------------------------------------------------------

::: mfbt/Attributes.h
@@ +320,5 @@
> # define MOZ_FINAL /* no support */
> #endif
> 
> +/**
> + * MOZ_WARN_UNUSED_RESULT tells the compiler to emit a warning if a function's

Just curious, what motivated adding this? I can't think of cases where the result of a hash function would accidentally not be used.

@@ +322,5 @@
> 
> +/**
> + * MOZ_WARN_UNUSED_RESULT tells the compiler to emit a warning if a function's
> + * return value is not used by the caller. This currently works with GCC and
> + * clang.

I seem to recall most documentation comments don't call out which compilers support various things, so I'd say don't do that. It's kind of don't-repeat-yourself as well, come to think of it.

@@ +337,5 @@
> +#if defined(__GNUC__)
> +#define MOZ_WARN_UNUSED_RESULT __attribute__ ((warn_unused_result))
> +#else
> +#define MOZ_WARN_UNUSED_RESULT
> +#endif

Indent the defines two spaces, like all the others in the file:

#if ...
#  define ...
#else
#  define ...
#endif

Please check for defined(__clang__) || defined(__GNUC__) to be explicit about clang support here, rather than relying on its pretending to be gcc.

::: mfbt/HashFunctions.h
@@ +1,1 @@
> +/* -*- Mode: C++; tab-width: 4; indent-tabs-mode: nil; c-basic-offset: 4 -*-

s/4/2/g

@@ +35,5 @@
> + * and other provisions required by the GPL or the LGPL. If you do not delete
> + * the provisions above, a recipient may use your version of this file under
> + * the terms of any one of the MPL, the GPL or the LGPL.
> + *
> + * ***** END LICENSE BLOCK ***** */

Whee, MPL2 is actually required for new files now, not that I knew it for all the mfbt headers added recently.

@@ +43,5 @@
> +#ifndef mozilla_HashFunctions_h_
> +#define mozilla_HashFunctions_h_
> +
> +#include "mozilla/Attributes.h"
> +#include "mozilla/Types.h"

If StandardInteger.h supplies everything you need -- I think it does, looks like you're just using uint32_t -- use that instead.

@@ +45,5 @@
> +
> +#include "mozilla/Attributes.h"
> +#include "mozilla/Types.h"
> +
> +#ifdef __cplusplus

Unless SpiderMonkey gets around to dropping C compatibility sooner than I think it will, we may come to regret this. I guess we can cross that bridge when we reach it.

@@ +56,5 @@
> +
> +inline uint32_t
> +RotateLeft32(uint32_t value, uint8_t bits)
> +{
> +  return (value << bits) | (value >> (32 - bits));

Might as well MOZ_ASSERT(bits < 32); just to be safe.

@@ +93,5 @@
> + *
> + * evaluates to |value|.
> + *
> + * (Number-theoretic aside: Because any odd number |m| is relatively prime to
> + * our modulus (2^32), the list

Use ** for exponentiation to distinguish it from xor, particularly since these algorithms are mixing both concepts.

I keep thinking that if you're going to have a "number-theoretic aside", you can do better than this. :-P Like saying they form a group under modulo multiplication, or something -- except that's not actually what you're saying (although I think they come to the same thing if you squint a little). I dunno, I've thought about this longer than I should have. :-)
Attachment #601024 - Flags: review?(jwalden+bmo) → review+
> Just curious, what motivated adding this? I can't think of cases where the result of a hash > function would accidentally not be used. I actually wrote this bug myself. It's > PRUint32 hash = HashString(...) > AddToHash(hash, some, other, stuff); > return hash; > Whee, MPL2 is actually required for new files now, not that I knew it for all the mfbt headers > added recently. You have no idea how happy I am to make this change. :)
> I keep thinking that if you're going to have a "number-theoretic aside", you can do better than > this. :-P Like saying they form a group under modulo multiplication, or something -- except that's > not actually what you're saying (although I think they come to the same thing if you squint a > little). I think we're saying is that for odd x, the group (x * Z/2^32) (that is, multiply each element in Z mod 2^32 by x) is isomorphic to Z/2^32.
Status: NEW → ASSIGNED
> Use ** for exponentiation to distinguish it from xor, particularly since these algorithms are mixing > both concepts. Gah, I didn't change this before pushing. Too many things going through my brain today. I've changed it in my patches to bug 729940.
Depends on: 732914
Try run for 284fb758b848 is complete.
Detailed breakdown of the results available here:
  https://tbpl.mozilla.org/?tree=Try&rev=284fb758b848
Results (out of 279 total builds):
  success: 214
  warnings: 64
  failure: 1
Builds (or logs if builds failed) available at:
  http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jlebar@mozilla.com-284fb758b848
Try run for 284fb758b848 is complete.
Detailed breakdown of the results available here:
  https://tbpl.mozilla.org/?tree=Try&rev=284fb758b848
Results (out of 283 total builds):
  success: 218
  warnings: 64
  failure: 1
Builds (or logs if builds failed) available at:
  http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jlebar@mozilla.com-284fb758b848