Closed Bug 729952 Opened 13 years ago Closed 13 years ago

Need a better hash function for atoms

Categories: Core :: XPCOM, defect
Platform: x86 macOS
Priority: Not set
Severity: normal
Status: RESOLVED FIXED
Target Milestone: mozilla13

People

(Reporter: bzbarsky, Assigned: justin.lebar+bug)

Attachments

(5 files, 6 obsolete files)

7.63 KB, patch
130.66 KB, text/plain
5.27 KB, patch
6.39 KB, patch (bzbarsky: review+)
6.60 KB, patch (Waldo: review+)

I just tried measuring our atom hash collisions: out of 11000 unique atomized strings, close to 900 collided. That's sucky in the extreme, and it would suck even more for purposes of bug 705877.

The atom hashing code uses nsCRT::HashCode(PRUnichar*) and nsCRT::HashCodeAsUTF8. When fixing it, we need to keep the latter working as it does now. I _think_ the current behavior, where nsCRT::HashCode(PRUnichar*) behaves the same as nsCRT::HashCode(char*) when called on ASCII input, does not need to be preserved if we change the hashing behavior in nsCRT.

There is some discussion of hash functions and their performance and behavior in bug 290032. There is also a good comparative writeup at http://burtleburtle.net/bob/hash/doobs.html (note the links to SpookyHash and lookup3.c). What we use right now is what that writeup calls "Rotating", and as it notes, it's clearly crappy. WebKit's string hashing uses what the writeup above calls "Paul Hsieh's hash", fwiw.
Attached patch: Measurement code
Test methodology: applied this, ran for a bit while logging to a file, then quit the browser. Then pipe the log file through:

  grep STR | sort | uniq | sed 's/.*|,[^,]*,//' | sort | uniq -c | sort -n

and cried at all the 5s visible in my terminal.
In bug 676071, I am importing SpookyHash into ANGLE. SpookyHash seems to be a good 128-bit (it can of course be truncated to smaller sizes) non-cryptographic public-domain hash function: http://burtleburtle.net/bob/hash/spooky.html

Aside from that, the only other bit I can contribute to this discussion is the following formula, which tells roughly how many hashes one can do with a perfect N-bit hash function while staying under a given probability P of collision:

  2^(N/2) * sqrt(2*P)

This is an approximation, valid for not-too-large P (say P < 0.2), of the formula given here: http://en.wikipedia.org/wiki/Birthday_problem#Cast_as_a_collision_problem

It can help decide whether a good 32-bit hash function is enough or not for a given use case. For example, if you are happy with a probability P = 1e-4 of collision, a 32-bit hash allows one to safely do:

  2^(32/2) * sqrt(2e-4) ~= 900 hashes
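(A quick sanity check of that bound, assuming an ideal 32-bit hash; this snippet is illustrative and not part of the original discussion.)

  #include <cmath>
  #include <cstdio>

  int main()
  {
    const double N = 32.0;   // hash width in bits
    const double P = 1e-4;   // acceptable collision probability
    // Approximate number of hashes that stay under probability P of any collision.
    double hashes = std::pow(2.0, N / 2.0) * std::sqrt(2.0 * P);
    std::printf("about %.0f hashes\n", hashes);   // prints roughly 927 for these values
    return 0;
  }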
"keep working as now" means "UTF8 and UTF16 strings need to hash to the same value"? It's not clear to me why we have so many different functions to do the same thing, but can we please move this off of nsCRT while we're at it? Do you have a set of realworld sample strings we can run tests against?
> "keep working as now" means "UTF8 and UTF16 strings need to hash to the same value"? Yes, if they represent the same Unicode string. The atom table depends on this behavior. I believe it's the sole consumer of nsCRT::HashCodeAsUTF8, outside of tests. > but can we please move this off of nsCRT while we're at it? That would be just fine by me. > Do you have a set of realworld sample strings we can run tests against? I can attach the set of strings that came out of the procedure in comment 1. I basically started the browser, loaded gmail, web.mit.edu, some github pages, and a news site or two, then quit the browser to get that string set.
Attached file 13k strings or so
Data is UTF-8. Note that there is one non-ASCII string at the end there.
I'm a bit concerned about using a 128-bit hash in hot code when we need only 32 (or even 64) bits. But maybe it's not a big deal. Let's run murmur and spooky over bz's data to see if one is obviously better in terms of collisions?
One other note: for atoms specifically, the typical string being hashed is short, not least because people minify their class names and the like. In particular, for the attached list I get numbers like so (first column is count, second column is length):

    66  1
  1797  2
  1383  3
  1279  4
  1272  5
   707  6
   525  7
   539  8
   970  9
   647 10
   566 11
   558 12
   473 13
   409 14
   289 15
   346 16
   283 17
   230 18
   207 19
   186 20
   159 21
   122 22
    99 23
    91 24
    63 25
    69 26
    53 27
    42 28
    26 29
    23 30
    20 31
    10 32
    13 33
     8 34
     5 35
     4 36
     6 37
     2 38
     1 39
    46 40
     2 42
     2 43

So for hashing performance we're more interested in low startup costs than in good behavior on long inputs. For hashing behavior we want functions that do a good job spreading a small number of bytes over a 32-bit int. For example, our current hash when given a 3-byte input produces a 32-bit number which has 0s in the top 16 bits. And if you restrict the set of possible byte values to the expected [a-zA-Z0-9-_] then I bet we end up using even fewer bits than that, which is why collision rates are so high.
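(For reference, a sketch of the "Rotating" scheme that writeup describes, which is essentially what the old nsCRT loop does; the real code lives in nsCRT.cpp, this is just an illustration of why short inputs underuse the hash space.)

  #include <cstddef>
  #include <cstdint>

  uint32_t RotatingHash(const unsigned char* data, size_t len)
  {
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++) {
      h = ((h << 4) | (h >> 28)) ^ data[i];   // rol(h, 4) ^ byte, no multiply
    }
    return h;
  }
  // For a 3-byte ASCII input, each byte is < 0x80 and gets rotated left by 4 at
  // most twice, so the result never reaches the top half of the word -- which is
  // exactly the "0s in the top 16 bits" behavior described above.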
How about: if a string is 4 bytes or less, just use the value of those bytes as the hash value? For longer strings, use one of the "complex" hash algorithms, e.g. (ignore endianness, it doesn't matter):

  "A"     -> 0x41000000
  "AB"    -> 0x41420000
  "ABC"   -> 0x41424300
  "ABCD"  -> 0x41424344
  "ABCDE" -> spooky("ABCDE")

I'm pretty sure that the |PRUint32 HashCode(const char* str, PRUint32* resultingStrLen = nsnull)| variant isn't important, and the rest pass in a length, so we can switch on it without reading the data first...
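(A minimal sketch of that idea; SpookyHash32 here is a hypothetical wrapper for whichever full hash function ends up being chosen.)

  #include <cstddef>
  #include <cstdint>

  uint32_t SpookyHash32(const unsigned char* data, size_t len);   // hypothetical fallback

  uint32_t HashPossiblyShortString(const unsigned char* data, size_t len)
  {
    if (len <= 4) {
      uint32_t h = 0;
      for (size_t i = 0; i < len; i++) {
        // "A" -> 0x41000000, "AB" -> 0x41420000, "ABCD" -> 0x41424344
        h |= uint32_t(data[i]) << (24 - 8 * i);
      }
      return h;
    }
    return SpookyHash32(data, len);
  }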
That's worth testing, at least.
Heh, spookyhash has exactly one collision on bz's whole dataset.
That's a bit more like it! ;)
Actually, none of the three hash functions have any collisions on the test set. I was hashing the empty string twice.
Time to hash bz's atoms list on my mac:

  spookyhash          234us
  murmurhash3_x86_32  187us
  cityhash            133us

The other versions of murmurhash (x86_128 and x64_128) are all slower than spookyhash. Seeing as none of these has any collisions, I think we have a winner.
(In reply to Justin Lebar [:jlebar] from comment #14)
> Time to hash bz's atoms list on my mac:
>
>   spookyhash          234us
>   murmurhash3_x86_32  187us
>   cityhash            133us

What's the noise level in such ultra-short benchmarks? Are these results consistent across runs? I would run this benchmark 10k times and average (or, perhaps better: keep the best score)
One other question, just so I know how much to cry... What's the timing like for our existing hash function?
I'm running it 1000 times and taking the average. The noise is +/- 3ms or so. I threw the code online in case anyone else wants to benchmark. You should just need to run |make|, although I make no guarantees. I'll add the current hash function in the morning; I am, as they say here, le tired. https://github.com/jlebar/hashtest
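(For anyone reproducing this without the repo, the measurement loop amounts to something like the following sketch; this is an assumption about the methodology, not the actual hashtest code.)

  #include <chrono>
  #include <cstdint>
  #include <string>
  #include <vector>

  uint32_t HashUnderTest(const std::string& s);   // whichever hash is being measured

  // Hash every string in the list |reps| times and return the mean time per
  // pass in microseconds.
  double MeanPassMicroseconds(const std::vector<std::string>& strings, int reps)
  {
    volatile uint32_t sink = 0;   // keep the compiler from discarding the work
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; r++) {
      for (const std::string& s : strings) {
        sink = sink ^ HashUnderTest(s);
      }
    }
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::micro> total = end - start;
    return total.count() / reps;
  }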
(In reply to Justin Lebar [:jlebar] from comment #14)
> cityhash 133us

Keep in mind cityhash relies on the special x86 CRC instruction. If you don't have a processor that supports it (and many of our users do not, especially on mobile), it is nowhere near as fast.
These are the times for the non-CRC variant.
The current nsCRT hash is 64us, so twice as fast as cityhash.
(In reply to Benjamin Smedberg [:bsmedberg] from comment #3)
> can we please move this off of nsCRT while we're at it?

Where would you like it to live?
There are existing declarations in nsHashKeys.h that we could just use, or have a new header/cpp specifically for these hashing functions.
Waldo, how would you feel about checking this in to mfbt? I'd like this hash function to be accessible everywhere. It's one third-party cpp file, plus a header file. Both have some Mozilla-specific modifications, but I don't want to convert the whole thing into Mozilla/mfbt style, because that will make it harder to take upstream changes.
I tried bsmedberg's suggestion of using the string's chars, cast as a uint32_t, as the hashcode when its length is <= 4. This gets rid of all but 11 of the collisions with the current hashcode. Unfortunately, it's slower (much slower for nscrt), at least in a naive implementation (I used a case statement for the length branch).

The following times are all with the length <= 4 wrapping:

  spookyhash  219us
  cityhash    172us
  murmurhash  219us
  nscrt       205us

Here are the original (unwrapped) times again:

> spookyhash  234us
> murmurhash  187us
> cityhash    133us
> nscrt        64us
Comment on attachment 600450 [details] [diff] [review]
Part 1, WIPv1: Add cityhash to mfbt

Asking for feedback on the general idea of having third-party code in mfbt. I need to clean up the comments and so on in here before review.
Attachment #600450 - Flags: feedback?(jwalden+bmo)
Attached patch: Add cityhash (obsolete)
Attachment #600725 - Flags: review?(jwalden+bmo)
Attachment #600449 - Attachment is obsolete: true
Attachment #600450 - Attachment is obsolete: true
Attachment #600450 - Flags: feedback?(jwalden+bmo)
Justin, what's the actual plan for HashCodeAsUTF8 given the APIs in that patch?
I wasn't sure if you needed a hash function that speaks unicode. If so, I think I can rewrite cityhash to take a data transform function. I'm not sure exactly what you need, but I presume an arbitrary input transform would be sufficient? We'll have to see how fast I can make it.
Comment on attachment 600725 [details] [diff] [review]
Add cityhash

If I just add a multiply to the current nsCRT hashcode function, it has 0 collisions on bz's dataset and runs just as fast as the current code. That seems much simpler than rewriting CityHash as streaming code.
Attachment #600725 - Flags: review?(jwalden+bmo)
> I wasn't sure if you needed a hash function that speaks unicode. See comment 3 and comment 4...
(In reply to Boris Zbarsky (:bz) from comment #32)
> > I wasn't sure if you needed a hash function that speaks unicode.
> See comment 3 and comment 4...

I see; I didn't understand that exchange earlier. :)

Multiplying by golden ratio when adding a char seems to work just fine -- in retrospect, using just rol and xor seems pretty nuts. I'm working on converting all callers to this better hash, but I'll post a smaller patch asap.
Note that the atom table code currently hashes all strings in UTF-16 space. I.e., we hash nsStrings "as they are", and nsCStrings are hashed using HashCodeAsUTF16, which uses a UTF-8 iterator to extract individual Unicode characters and add their UTF-16 equivalent to the hash.

http://mxr.mozilla.org/mozilla-central/source/xpcom/ds/nsCRT.cpp#278
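(In other words, the UTF-8 path walks code points and feeds their UTF-16 encoding into the same mixing step. A rough sketch; DecodeNextUTF8 and AddToHash are stand-ins for the real iterator and mixing function, not actual nsCRT names.)

  #include <cstdint>

  uint32_t AddToHash(uint32_t hash, uint32_t value);            // per-code-unit mixing step
  uint32_t DecodeNextUTF8(const char** iter, const char* end);  // hypothetical UTF-8 iterator

  uint32_t HashUTF8AsUTF16(const char* str, const char* end)
  {
    uint32_t hash = 0;
    while (str < end) {
      uint32_t cp = DecodeNextUTF8(&str, end);   // advances |str| past one code point
      if (cp <= 0xFFFF) {
        hash = AddToHash(hash, cp);              // BMP: a single UTF-16 code unit
      } else {
        cp -= 0x10000;                           // otherwise mix in the surrogate pair
        hash = AddToHash(hash, 0xD800 + (cp >> 10));
        hash = AddToHash(hash, 0xDC00 + (cp & 0x3FF));
      }
    }
    return hash;
  }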
I've been trying to reproduce bz's collisions list without much success.

> I basically started the browser, loaded gmail, web.mit.edu, some github pages, and a news site or
> two, then quit the browser to get that string set.

I've instrumented nsCRT's hashcode functions, plus the ones in nsTHashtable and pldhash. I'm printing the input and output of every hash function, and I see just a few (~10) hash collisions in total.
Hmm. I might have loaded the Facebook front page and twitter as well, possibly..... In any case, the attached string list is definitely the set of strings that were in atoms when those atoms were destroyed in my build.
Oh, d'oh, I was missing a |sort| before one of my |uniq|s. Much better now.
Attachment #600725 - Attachment description: Patch v1 → Add cityhash
(In reply to Justin Lebar [:jlebar] from comment #37)
> Oh, d'oh, I was missing a |sort| before one of my |uniq|s. Much better now.

Or perhaps much worse? :)

In any case, I can verify that

> #define ADD_TO_HASHVAL(hashval, c) \
>   hashval = (GOLDEN_RATIO * PR_ROTATE_LEFT32(hashval, 4)) ^ (c);

is much, much, much better than without the GOLDEN_RATIO call. In these tables, e.g. "1097 3" means that we had 1097 three-way collisions.

Originally:

  39662 1
   1956 2
   1097 3
    891 4
    266 5
     15 6
      1 7

With golden ratio multiply:

  47195 1
      1 2

(These are from two separate runs, so the strings aren't exactly the same. But I loaded the same sites, via session-restore. And you can see that the number of strings -- sum the first column -- is roughly the same.)
I intend to expand upon these functions in a followup bug, but this is the bare minimum required functionality to get something working.
I used the following to generate the collisions histogram:

  dist/bin/firefox | grep XXX > hashes && cut -f '2-3' hashes | sort | uniq | cut -f 2 | sort | uniq -c | sort -n | sed 's/^ *//' | cut -f 1 -d ' ' | sort | uniq -c
Attachment #600725 - Attachment is obsolete: true
Attachment #600962 - Attachment is obsolete: true
Attachment #600963 - Attachment is obsolete: true
Attachment #600963 - Flags: review?(bzbarsky)
Attachment #600965 - Flags: review?(bzbarsky)
Attachment #600966 - Flags: review?(jwalden+bmo)
Attachment #600964 - Attachment description: Debugging printf's. → Part 3 - Debugging printf's.
Blocks: 730895
> but can we please move this off of nsCRT while we're at it? I'm going to do this in bug 730895, if you don't mind, so bz's work isn't blocked on a large-ish code change.
Comment on attachment 600966 [details] [diff] [review]
Part 1, v1.1: Add a better hash function to mfbt. r?waldo

Should mozilla::GoldenRatioU32 be static? Or does it not matter?

Also, worth documenting why the rotation is by 5 bits, assuming there's a reason.
Comment on attachment 600965 [details] [diff] [review]
Part 2, v1.1: Use a better hash function in nsCRT, nsTHashtable, and pldhash.

r=me
Attachment #600965 - Flags: review?(bzbarsky) → review+
(In reply to Boris Zbarsky (:bz) from comment #45)
> Comment on attachment 600966 [details] [diff] [review]
> Part 1, v1.1: Add a better hash function to mfbt. r?waldo
>
> Should mozilla::GoldenRatioU32 be static? Or does it not matter?

I don't know how |static const| would be different from |const| in a header, actually. I guess it's a matter of the const being exported out of the compilation unit?

> Also, worth documenting why the rotation is by 5 bits, assuming there's a
> reason.

No good reason. An odd shift makes much more sense to me than an even shift; I'll comment that.

Also, I think I should change

  (GoldenRatio * rol(hash, 5)) ^ value

to

  GoldenRatio * (rol(hash, 5) ^ value).

With the former, the hash of a single char is that char's numeric value (assuming hash is initially 0), whereas with the latter, we get an arbitrary value. This would matter if the bloom filter doesn't multiply hash values by something before using them, since the bloom filter looks at the high 16 and low 16 bits of the hashcode separately.
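(Concretely, the shape under discussion looks something like the following sketch; this is not the final mfbt API, and the constant is the usual 32-bit golden-ratio value, which happens to be odd.)

  #include <cstdint>

  static const uint32_t GoldenRatioU32 = 0x9E3779B9U;   // 2**32 / phi, rounded

  inline uint32_t RotateLeft32(uint32_t value, uint8_t bits)
  {
    return (value << bits) | (value >> (32 - bits));
  }

  inline uint32_t AddToHash(uint32_t hash, uint32_t value)
  {
    // Multiply *after* xor'ing in the new value, so that even a one-character
    // string ends up with a well-spread hash rather than just the character's
    // numeric value.
    return GoldenRatioU32 * (RotateLeft32(hash, 5) ^ value);
  }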
To get even more entropy you should do:

  uint64_t res = GoldenRatio * (rol(hash, 5) ^ value);
  return (uint32_t)res ^ (uint32_t)(res >> 32);

which *should* compile to something basically as fast on x86, but I'm not sure that's true in practice, and I'm definitely not sure about ARM.
The bloom filter is using the pldhash hashcode, which is already multiplied by golden ratio on entry. The good thing is that GoldenRatio^N != 1 for small N...
Which may be worth documenting too.
I am not an expert on hash functions, but I'm hesitant to do something like that because:

* It's not clearly fast on all platforms.
* I've never seen a commercial hash function do that.
* The low 32 bits of |res| are random, but the number of non-zero high-order bits is determined by the magnitude of |rol(hash, 5) ^ value|. So if rol(hash, 5) and value happen to be equal in their high-order bits, |rol(hash, 5) ^ value| will be small, which automatically means that many of the top 32 bits of |res| will be 0.

If we wanted more entropy, we could multiply by a second constant. Here's the internal loop of murmurhash3's 32-bit hash (c1 and c2 are constants), for example:

  k1 *= c1;
  k1 = ROTL32(k1,15);
  k1 *= c2;

  h1 ^= k1;
  h1 = ROTL32(h1,13);
  h1 = h1*5+0xe6546b64;
> The good thing is that GoldenRation^N != 1 for small N... To be clear, this is because GoldenRatio is odd. We care about the Carmichael function [1]; in our case, lambda(2^32) == 2^30, so 2^30 is the smallest N such that a^N == 1, for any odd |a|. I'm not sure how much group theory we want to include in this file... :) [1] http://en.wikipedia.org/wiki/Carmichael_function
> The good thing is that GoldenRation^N != 1 for small N... Anyway this isn't particularly relevant because we have that rotate() function in place, which means we're never actually doing GoldenRatio^N.
(In reply to Jonas Sicking (:sicking) from comment #48)
> Which *should* compile to something basically as fast on x86, but I'm not
> sure that's true in practice, and I'm definitely not sure about ARM.

You have to use a long multiply on ARM, which, e.g., on a Cortex A8 is three cycles instead of two, but has the same latency (4 cycles). But after that it's just one extra instruction for the xor.

(In reply to Justin Lebar [:jlebar] from comment #51)
> |rol(hash, 5) ^ value| will be small, which automatically means that many of
> the top 32 bits of |res| will be 0.

This shouldn't be a serious issue, since you're xor'ing with a bunch of bits that don't have this problem. I've seen it done in, e.g., Marsaglia's classic "MWC" PRNG (which has many of the same requirements as a hash function): https://groups.google.com/group/sci.math.num-analysis/msg/eb4ddde782b17051

But that also multiplies by two constants. With just a single multiplicand, I'd worry that some repeating pattern in the interior of the product would cancel most of the bits. However, before we worry about adding more entropy, how does the current approach fare in things like Bob Jenkins's frog test?

(In reply to Justin Lebar [:jlebar] from comment #52)
> I'm not sure how much group theory we want to include in this file... :)

We want to include all the group theory (someone was telling me just the other week they didn't understand when it ever gets used in real life... and didn't believe me when I said "all the time").
> We care about the Carmichael function [1]; in our case, lambda(2^32) == 2^30, so 2^30 is > the smallest N such that a^N == 1, for any odd |a|. 2^30 is the smallest N such that a^N = 1 no matter what |a| is. For any particular value of a, the smallest n such that a^n = 1 will be a factor of 2^30, but may be much smaller. In the extreme cases, 1^1 = 1 and (2^32 - 1)^2 = 1, say. What really matters here is what the order of GoldenRatio is in the multiplicative group Z mod 2^32. A brute-force search suggests this order is 2^29, which is good enough for our purposes. > I'm not sure how much group theory we want to include in this file... :) I think saying that 2^29 is the smallest N for which GoldenRatio^N == 1 mod 2^32 is good enough.
> which means we're never actually doing GoldenRatio^N. Well, we're doing it for N=2 if you do it to produce your hashcode and then pldhash does it on the input hashcode.
Heh, okay, I think we all are having too much fun with this (myself included!). :) I propose we table this hash bikes--er, analysis until I have bug 730895 in hand. This will push all our hashing (or as much as I can muster) through the new algorithm. We can then analyze using all forms of amphibian tests. As is, this new hash function appears to be a clear win.
Attachment #600966 - Attachment is obsolete: true
Attachment #601024 - Flags: review?(jwalden+bmo)
Attachment #600966 - Flags: review?(jwalden+bmo)
See also http://isthe.com/chongo/tech/comp/fnv/, which is very similar to what we're using, but has theoretically well-chosen magic numbers.
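(For comparison, 32-bit FNV-1a from that page has the same xor-then-multiply shape, just with different constants. A sketch from memory of the published algorithm, not code from this bug.)

  #include <cstddef>
  #include <cstdint>

  uint32_t FNV1a32(const unsigned char* data, size_t len)
  {
    uint32_t hash = 2166136261U;     // FNV offset basis
    for (size_t i = 0; i < len; i++) {
      hash ^= data[i];
      hash *= 16777619U;             // 32-bit FNV prime
    }
    return hash;
  }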
(In reply to Boris Zbarsky from comment #55) > What really matters here is what the order of GoldenRatio is in the > multiplicative group Z mod 2^32. For odd numbers, the order depends on the first bit transition after bit 1. For N > 2, (k×2^N±1)² = (k²×2^2N±k×2^(N+1)+1) = k'×2^(N+1)+1 which is of the previous form for N'=N+1, so that squaring enough times will reach 1 mod 2^32. GoldenRatio is of this form where N = 3 so that 32-3=29 squarings suffice. For N = 1 or 2, (k×2³±3)² = (k²×2^6±k×2^4+9) which is of the previous form for N'=3, so that 30 squarings are required in the worst case. (Put another way, GoldenRatio has two square roots mod 2^32.) I didn't have to consider odd powers because the lowest two set bits are always the same in any odd power.
(In reply to neil@parkwaycc.co.uk from comment #60)
> GoldenRatio has two square roots mod 2^32.

Of course it doesn't, it either has zero or four square roots.
Comment on attachment 601024 [details] [diff] [review]
Part 1, v1.2: Add a better hash function to mfbt. r?waldo

Review of attachment 601024 [details] [diff] [review]:
-----------------------------------------------------------------

::: mfbt/Attributes.h
@@ +320,5 @@
> # define MOZ_FINAL /* no support */
> #endif
> 
> +/**
> + * MOZ_WARN_UNUSED_RESULT tells the compiler to emit a warning if a function's

Just curious, what motivated adding this? I can't think of cases where the result of a hash function would accidentally not be used.

@@ +322,5 @@
> 
> +/**
> + * MOZ_WARN_UNUSED_RESULT tells the compiler to emit a warning if a function's
> + * return value is not used by the caller. This currently works with GCC and
> + * clang.

I seem to recall most documentation comments don't call out which compilers support various things, so I'd say don't do that. It's kind of don't-repeat-yourself as well, come to think of it.

@@ +337,5 @@
> +#if defined(__GNUC__)
> +#define MOZ_WARN_UNUSED_RESULT __attribute__ ((warn_unused_result))
> +#else
> +#define MOZ_WARN_UNUSED_RESULT
> +#endif

Indent the defines two spaces, like all the others in the file:

#if ...
#  define ...
#else
#  define ...
#endif

Please check for defined(__clang__) || defined(__GNUC__) to be explicit about clang support here, rather than relying on its pretending to be gcc.

::: mfbt/HashFunctions.h
@@ +1,1 @@
> +/* -*- Mode: C++; tab-width: 4; indent-tabs-mode: nil; c-basic-offset: 4 -*-

s/4/2/g

@@ +35,5 @@
> + * and other provisions required by the GPL or the LGPL. If you do not delete
> + * the provisions above, a recipient may use your version of this file under
> + * the terms of any one of the MPL, the GPL or the LGPL.
> + *
> + * ***** END LICENSE BLOCK ***** */

Whee, MPL2 is actually required for new files now, not that I knew it for all the mfbt headers added recently.

@@ +43,5 @@
> +#ifndef mozilla_HashFunctions_h_
> +#define mozilla_HashFunctions_h_
> +
> +#include "mozilla/Attributes.h"
> +#include "mozilla/Types.h"

If StandardInteger.h supplies everything you need -- I think it does, looks like you're just using uint32_t -- use that instead.

@@ +45,5 @@
> +
> +#include "mozilla/Attributes.h"
> +#include "mozilla/Types.h"
> +
> +#ifdef __cplusplus

Unless SpiderMonkey gets around to dropping C compatibility sooner than I think it will, we may come to regret this. I guess we can cross that bridge when we reach it.

@@ +56,5 @@
> +
> +inline uint32_t
> +RotateLeft32(uint32_t value, uint8_t bits)
> +{
> +  return (value << bits) | (value >> (32 - bits));

Might as well MOZ_ASSERT(bits < 32); just to be safe.

@@ +93,5 @@
> + *
> + * evaluates to |value|.
> + *
> + * (Number-theoretic aside: Because any odd number |m| is relatively prime to
> + * our modulus (2^32), the list

Use ** for exponentiation to distinguish it from xor, particularly since these algorithms are mixing both concepts.

I keep thinking that if you're going to have a "number-theoretic aside", you can do better than this. :-P Like saying they form a group under modulo multiplication, or something -- except that's not actually what you're saying (although I think they come to the same thing if you squint a little). I dunno, I've thought about this longer than I should have. :-)
Attachment #601024 - Flags: review?(jwalden+bmo) → review+
> Just curious, what motivated adding this? I can't think of cases where the result of a hash > function would accidentally not be used. I actually wrote this bug myself. It's > PRUint32 hash = HashString(...) > AddToHash(hash, some, other, stuff); > return hash; > Whee, MPL2 is actually required for new files now, not that I knew it for all the mfbt headers > added recently. You have no idea how happy I am to make this change. :)
> I keep thinking that if you're going to have a "number-theoretic aside", you can do better than > this. :-P Like saying they form a group under modulo multiplication, or something -- except that's > not actually what you're saying (although I think they come to the same thing if you squint a > little). I think we're saying is that for odd x, the group (x * Z/2^32) (that is, multiply each element in Z mod 2^32 by x) is isomorphic to Z/2^32.
Status: NEW → ASSIGNED
> Use ** for exponentiation to distinguish it from xor, particularly since these algorithms are mixing > both concepts. Gah, I didn't change this before pushing. Too many things going through my brain today. I've changed it in my patches to bug 729940.
Depends on: 732914
Try run for 284fb758b848 is complete.
Detailed breakdown of the results available here:
  https://tbpl.mozilla.org/?tree=Try&rev=284fb758b848
Results (out of 279 total builds):
  success: 214
  warnings: 64
  failure: 1
Builds (or logs if builds failed) available at:
  http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jlebar@mozilla.com-284fb758b848
Try run for 284fb758b848 is complete.
Detailed breakdown of the results available here:
  https://tbpl.mozilla.org/?tree=Try&rev=284fb758b848
Results (out of 283 total builds):
  success: 218
  warnings: 64
  failure: 1
Builds (or logs if builds failed) available at:
  http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jlebar@mozilla.com-284fb758b848