Closed Bug 970152 - Don't hash past the first 1000 bytes of a string
Categories: Core :: MFBT, defect
Status: RESOLVED WONTFIX (opened 11 years ago, closed 11 years ago)
People: Reporter: n.nethercote; Assignee: n.nethercote
Attachments: 1 file, 1 obsolete file (patch, 5.68 KB, review- from Waldo)
We spend a non-trivial amount of time hashing extremely long strings in order
to put them in hash tables. A while back jlebar had an idea about only hashing
part of such strings. I don't think anything came of it at the time, but I just
tried it out and it appears to be worthwhile.
Consider SunSpider. During its execution, we call HashKnownLength() 509,192
times. The average string size (in bytes) is 12.8, but there's a long tail. The
following table shows what fraction of the bytes processed by HashKnownLength()
is done for strings of varying lengths.
  length    bytes hashed   fraction
  0+        6551978        100%
  10+       6128982        93.5%
  100+      2046114        31.2%
  1000+     1646374        25.1%
  10000+    1414048        21.6%
  100000+   1201432        18.3%
I.e. 21.6% of the work is done for strings of size 10,000+, and 18.3% of the
work is done for strings of size 100,000+. The longest string is 600,112 bytes.
Very long strings are rare, but because they are so big, they account for a
significant fraction of the hashing effort.
I did the same measurements for a few minutes of browsing at techcrunch.com,
nytimes.com, and bbc.co.uk. We hashed 5,505,563 strings, with an average string
size of 14.5 bytes, and the longest string was 96,710 bytes.
  length    bytes hashed   fraction
  0+        79580839       100%
  10+       76318870       95.9%
  100+      15336229       19.3%
  1000+     9961271        12.5%
  10000+    8274210        10.4%
So if we, say, limited ourselves to just hashing the first 999 bytes, we'd do
25.1% less hashing in SunSpider, and 12.5% less in vanilla browsing.
The hashing mostly comes from SpiderMonkey, where we atomize every identifier
and string that appears in user code; long strings in user code appear to be
the major source of very long strings.
BTW, HashUntilZero() gets used much less than HashKnownLength() -- e.g. it's
not used at all by SpiderMonkey.
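For concreteness, here's a minimal sketch of the idea (not the attached patch; it assumes MFBT's AddToHash() from mfbt/HashFunctions.h, and the cap constant's name is made up):

  #include <stddef.h>
  #include <stdint.h>
  #include "mozilla/HashFunctions.h"  // for mozilla::AddToHash()

  // Hypothetical cap on how many characters get fed into the hash.
  static const size_t kMaxCharsHashed = 1000;

  // Same per-character mixing as HashKnownLength(), just cut off early so
  // huge strings don't dominate hashing time.
  template<typename CharT>
  static uint32_t
  HashStringCapped(const CharT* str, size_t length)
  {
    uint32_t hash = 0;
    size_t n = length < kMaxCharsHashed ? length : kMaxCharsHashed;
    for (size_t i = 0; i < n; i++)
      hash = mozilla::AddToHash(hash, str[i]);
    return hash;
  }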
Comment 1 • 11 years ago (Assignee)
This patch makes HashKnownLength() and HashUntilZero() only hash the first 1000
chars. This is about a 1.01x speed-up for SunSpider on my desktop machine.
regexp-dna is typically 1.05--1.08x faster, and string-tagcloud and
string-unpack-code are typically 1.02--1.04x faster. Unsurprisingly, these are
the tests with very long strings.
If I time just the parsing of SunSpider (parsemark has the whole suite
concatenated into a single file), I get about a 1.10--1.12x speed-up.
If you're wondering how I chose 1000 chars, here are instruction counts for
parsing SunSpider with different limits:
- no limit 74.461
- 100000 73.439
- 10000 71.960
- 1000 71.622
- 100 71.539
- 10 71.975
- 8,9 72.384
- 6,7 73.498
- 4,5 76.603
- 2,3 95.338
- 0,1 604.685
100 was the best result, but 1000 is almost as good and will be more robust
against bad cases. The instruction count reduction going from no limit to 1000
is 1.04x; I assume the remainder of the 1.10--1.12x speed-up I saw is due to
fewer memory accesses. (During parsing, we tokenize the string and then
immediately atomize it; two consecutive iterations over a huge string will
trash the caches.) ARM devices typically have slower memory subsystems, so
improvements there might be bigger.
Speaking of bad cases: for this to hurt performance, we'd need to be hashing
*many* strings that are identical for the first 1000 bytes but differ after
that, and that also aren't much longer than 1000 chars. That seems *extremely*
unlikely.
My only other concern is this: these functions are only used for hash tables,
right? I.e. nobody's using these for any kind of cryptographic/security
purposes? I sure hope so. If this is a concern, I could keep the existing
HashString() functions and instead add some HashStringLimited() functions, and
replace existing HashString() calls where appropriate.
Attachment #8373111 - Flags: review?(jwalden+bmo)
Comment 2 • 11 years ago
It's a little scary to have a central utility function behave differently than advertised in this manner. I'd much rather have a new HashFirst1000Chars function and opt in at sites we want to optimize. Case in point: dom/asmjscache/AsmJSCache.cpp uses HashString(), and I think it would be rather likely for there to be two separate cache lookups whose asm.js modules differ only in bytes after the first 1000 chars (which are mostly boilerplate for large modules).
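(For illustration only, not from the bug: any prefix-limited hash makes such inputs collide by construction. HashStringCapped here is the hypothetical helper sketched in the description above.)

  #include <cassert>
  #include <string>

  void PrefixCollisionExample()
  {
    // Two asm.js-style sources sharing >1000 chars of boilerplate prefix and
    // differing only afterwards hash identically under a prefix-only hash.
    std::string boilerplate(2000, 'a');               // stand-in for the shared prefix
    std::string moduleA = boilerplate + "return f1;";
    std::string moduleB = boilerplate + "return f2;";
    assert(HashStringCapped(moduleA.data(), moduleA.length()) ==
           HashStringCapped(moduleB.data(), moduleB.length()));
  }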
Comment 3 • 11 years ago
(Putting "1000" in the name is probably a bad idea.)
Comment 4 • 11 years ago
Maybe make that an optional argument?
Comment 5 • 11 years ago
(In reply to :Ehsan Akhgari (needinfo? me!) (slow responsiveness, emailacopolypse) from comment #4)
> Maybe make that an optional argument?
I think optional arguments are bad here for reasons similar to Luke's in comment 2. How about HashPrefixOfString?
Comment 6 • 11 years ago (Assignee)
While we're bikeshedding... I actually started working on a patch to do this yesterday, and I used the names HashStringLimited(), HashKnownLengthLimited(), and HashUntilZeroLimited(). I'll finish it up and post an updated patch.
Comment 7 • 11 years ago (Assignee)
> dom/asmjscache/AsmJSCache.cpp uses HashString() and I think it
> would be rather likely for there to be two separate cache lookups that have
> an asm.js module that differs only in bytes after the first 1000 chars
> (which are mostly boilerplate for large modules).
Is this a direct-mapped cache that just overwrites old entries when a collision occurs? If so, then yes, I can see that using the limited hash would be bad.
Comment 8 • 11 years ago (Assignee)
Updated version that introduces HashStringLimited(), rather than changing
HashString(), and changes jsatom.h to use it.
I thought about giving it an extra argument instead of hard-coding the limit,
but I'm a little leery of having a non-constant multiple and divide, even if
the RH operands are constant.
There are other places in the browser where we still hash big strings -- I'm
seeing stuff relating to fonts and images. But they can be done in a follow-up.
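(Sketch of what the jsatom.h call site might look like; the helper and type names below are assumptions for illustration, not the actual patch.)

  // Hypothetical jsatom.h-style helper that opts into the limited hash:
  static inline js::HashNumber
  HashAtomChars(const jschar* chars, size_t length)
  {
    return mozilla::HashStringLimited(chars, length);
  }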
Attachment #8373512 - Flags: review?(jwalden+bmo)
Updated • 11 years ago (Assignee)
Attachment #8373111 - Attachment is obsolete: true
Attachment #8373111 - Flags: review?(jwalden+bmo)
Comment 9 • 11 years ago (Assignee)
This script generates a pathologically bad case for this optimization:
s = '0123456789' * 100
for i in range(10000):
    print('var x%d = "%s_%d";' % (i, s, i))
The resulting script runs vastly slower when this optimization is present. This doesn't particularly worry me, because it's obscure and if I want to write JS code that runs slowly there are plenty of simpler ways to do it. But Kannan is concerned, and possibly Brendan.
So is there some other way to achieve the same benefit? Hashing incredibly long, probably-unique strings isn't smart. For example, if the atoms table was a tree rather than a hash table, we wouldn't have this problem.
Comment 10 • 11 years ago
Maybe use a separate tree for indexing atoms over N characters?
Comment 11 • 11 years ago (Assignee)
> Maybe use a separate tree for indexing atoms over N characters?
That's a nice idea. The only problem is that we don't have any generic tree implementations, and I don't particularly want to write one :(
Comment 12 • 11 years ago
Comment on attachment 8373512 [details] [diff] [review]
Introduce HashStringLimited(), and use it in jsatom.h.
Review of attachment 8373512 [details] [diff] [review]:
-----------------------------------------------------------------
Particularly given this is targeted at speeding up hashing of long strings, I don't think we should provide *UntilZero versions of these methods. Callers should be required to hand in the length of the contents they're hashing. If they can't compute it more easily than O(n), there's something seriously wrong with the optimality of the relevant code.

This also has an additional benefit: when the length exceeds the limit, we can add *the length* into the final hash. That means we don't have to worry about long, similar strings (like Luke's asm.js code with boilerplate prefixes) trivially colliding.

That, plus there only being a single use of this, militates strongly, IMO, for providing only ptr+length methods here.
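(A minimal sketch of the length-mixing suggestion above, for illustration; it is not part of the attached patch.)

  template<typename CharT>
  static uint32_t
  HashKnownLengthLimited(const CharT* str, size_t length)
  {
    uint32_t hash = 0;
    size_t n = length < limitedStringHashSize ? length : limitedStringHashSize;
    for (size_t i = 0; i < n; i++)
      hash = AddToHash(hash, str[i]);
    // When we truncated, fold the full length in so long strings that share a
    // prefix but have different lengths don't trivially collide.
    if (length > limitedStringHashSize)
      hash = AddToHash(hash, length);
    return hash;
  }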
::: mfbt/HashFunctions.h
@@ +296,5 @@
> hash = AddToHash(hash, str[i]);
> return hash;
> }
>
> +static const size_t limitedStringHashSize = 1000;
So the limit is in amount of memory, and not in amount of characters? That's probably a little unintuitive, but it's not unreasonable.
limitedStringMemoryExamined seems like a better name to me. "hash size" does not connote the meaning I actually observe here.
@@ +329,5 @@
>
> /**
> + * The HashString overloads below do just what you'd expect. However, if you
> + * might be hashing very long strings and a tiny potential increase in
> + * collision rate doesn't matter, consider using HashStringLimited() instead.
"tiny potential increase in collision rate" isn't clear enough about when it's okay to use HashStringLimited. Instead, say "might be hashing very long strings sharing the same long prefix". It's not long strings that are the problem, it's long strings with the same prefix.
Or at least, you should say this if until-zero overloads are provided, which I think they shouldn't be.
@@ +401,5 @@
> + * avoid slow hash computations, they don't consider the entire string if the
> + * string's size is greater than |limitedStringHashSize|.
> + *
> + * If you have the string's length, you might as well call the overload which
> + * includes the length. It may be marginally faster.
If you were to have the until-zero overload, I'd say: don't be wishy-washy, say flat-out "If you have the string's length, use the faster overload that includes the length". Removing it eliminates the need for this suggestion.
@@ +431,5 @@
> +{
> + return detail::HashKnownLengthLimited(str, length);
> +}
> +
> +#ifdef MOZ_CHAR16_IS_NOT_WCHAR
Hmm, I should probably go apply template SFINAE to this to get rid of the ifdefs here, sometime. Not now.
Attachment #8373512 - Flags: review?(jwalden+bmo) → review-
Comment 13 • 11 years ago (Assignee)
> > Maybe use a separate tree for indexing atoms over N characters?
>
> That's a nice idea. The only problem is that we don't have any generic tree
> implementations and I don't particularly want to write one :(
... and I just noticed we have both mfbt/SplayTree.h and js/src/ds/SplayTree.h, and the latter was added to the tree almost exactly one month before the former. Each one appears to have exactly one use. I filed bug 981372 about removing one of them.
Comment 14 • 11 years ago (Assignee)
I no longer think this is worth doing.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → WONTFIX