Bug 970152 (Closed): Opened 11 years ago, Closed 11 years ago

Don't hash past the first 1000 bytes of a string

Categories

(Core :: MFBT, defect)

Tracking

RESOLVED WONTFIX

People

(Reporter: n.nethercote, Assigned: n.nethercote)

Details

Attachments

(1 file, 1 obsolete file)

We spend a non-trivial amount of time hashing extremely long strings in order to put them in hash tables. A while back jlebar had an idea about only hashing part of such strings. I don't think anything came of it at the time, but I just tried it out and it appears to be worthwhile.

Consider SunSpider. During its execution, we call HashKnownLength() 509,192 times. The average string size (in bytes) is 12.8, but there's a long tail. The following table shows what fraction of the bytes processed by HashKnownLength() is done for strings of varying lengths.

  0+       6551978  100%
  10+      6128982  93.5%
  100+     2046114  31.2%
  1000+    1646374  25.1%
  10000+   1414048  21.6%
  100000+  1201432  18.3%

I.e. 21.6% of the work is done for strings of size 10,000+, and 18.3% of the work is done for strings of size 100,000+. The longest string is 600,112 bytes. Very long strings are rare, but because they are so big, they account for a significant fraction of the hashing effort.

I did the same measurements for a few minutes of browsing at techcrunch.com, nytimes.com, and bbc.co.uk. We hashed 5,505,563 strings, with an average string size of 14.5 bytes, and the longest string was 96,710 bytes.

  0+       79580839  100%
  10+      76318870  95.9%
  100+     15336229  19.3%
  1000+     9961271  12.5%
  10000+    8274210  10.4%

So if we, say, limited ourselves to just hashing the first 999 bytes, we'd do 25.1% less hashing in SunSpider, and 12.5% less in vanilla browsing.

The hashing mostly comes from SpiderMonkey, where we atomize every identifier and string that appears in user code; long strings in user code appear to be the major source of very long strings.

BTW, HashUntilZero() gets used much less than HashKnownLength() -- e.g. it's not used at all by SpiderMonkey.
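(A minimal sketch of the arithmetic behind the tables above, assuming the per-call string lengths were recorded during the run; the helper name PrintLengthDistribution is made up for illustration and is not part of the actual instrumentation.)

  #include <cstdint>
  #include <cstdio>
  #include <vector>

  // For each length threshold, report the total bytes hashed for strings at
  // least that long, and that total as a fraction of all bytes hashed.
  void
  PrintLengthDistribution(const std::vector<size_t>& lengths)
  {
    const size_t thresholds[] = { 0, 10, 100, 1000, 10000, 100000 };
    uint64_t total = 0;
    for (size_t len : lengths)
      total += len;
    for (size_t t : thresholds) {
      uint64_t bytes = 0;
      for (size_t len : lengths)
        if (len >= t)
          bytes += len;
      printf("%zu+ %llu %.1f%%\n", t, (unsigned long long)bytes,
             100.0 * double(bytes) / double(total));
    }
  }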
This patch makes HashKnownLength() and HashUntilZero() only hash the first 1000 chars.

This is about a 1.01x speed-up for SunSpider on my desktop machine. regexp-dna is typically 1.05--1.08x faster, and string-tagcloud and string-unpack-code are typically 1.02--1.04x faster. Unsurprisingly, these are the tests with very long strings. If I time just the parsing of SunSpider (parsemark has the whole suite concatenated into a single file), I get about a 1.10--1.12x speed-up.

If you're wondering how I chose 1000 chars, here are instruction counts for parsing SunSpider with different limits.

- no limit  74.461
- 100000    73.439
- 10000     71.960
- 1000      71.622
- 100       71.539
- 10        71.975
- 8,9       72.384
- 6,7       73.498
- 4,5       76.603
- 2,3       95.338
- 0,1      604.685

100 was the best result, but 1000 is almost as good and will be more robust against bad cases. The instruction count reduction going from no limit to 1000 is 1.04x; I assume the remainder of the 1.10--1.12x speed-up I saw is due to fewer memory accesses. (During parsing, we tokenize the string and then immediately atomize it; two consecutive iterations over a huge string will trash the caches.) ARM devices typically have slower memory subsystems, so improvements there might be bigger.

Speaking of bad cases: for this to hurt performance, we'd need to be hashing *many* strings that are identical for the first 1000 bytes, but differ after that, but also aren't much longer than 1000 chars. That seems *extremely* unlikely.

My only other concern is this: these functions are only used for hash tables, right? I.e. nobody's using these for any kind of cryptographic/security purposes? I sure hope so. If this is a concern, I could keep the existing HashString() functions and instead add some HashStringLimited() functions, and replace existing HashString() calls where appropriate.
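(For illustration, a minimal sketch of the shape of this change, assuming the loop structure quoted later in the review. It uses MFBT's existing mozilla::AddToHash() mixing helper and borrows the HashKnownLengthLimited/limitedStringHashSize names that appear later in this bug; it is a sketch, not the attached patch.)

  #include <stddef.h>
  #include <stdint.h>
  #include "mozilla/HashFunctions.h"  // for mozilla::AddToHash()

  // Only the first limitedStringHashSize characters contribute to the hash.
  static const size_t limitedStringHashSize = 1000;

  template<typename T>
  uint32_t
  HashKnownLengthLimited(const T* str, size_t length)
  {
    uint32_t hash = 0;
    size_t n = length < limitedStringHashSize ? length : limitedStringHashSize;
    for (size_t i = 0; i < n; i++)
      hash = mozilla::AddToHash(hash, str[i]);  // same per-char mixing as HashKnownLength()
    return hash;
  }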
Attachment #8373111 - Flags: review?(jwalden+bmo)
It's a little scary to have a central utility function behave differently than advertised in this manner. I'd much rather have a new HashFirst1000Chars function and opt in at sites we want to optimize. Case in point: dom/asmjscache/AsmJSCache.cpp uses HashString(), and I think it would be rather likely for there to be two separate cache lookups that have an asm.js module that differs only in bytes after the first 1000 chars (which are mostly boilerplate for large modules).
(Putting "1000" in the name is probably a bad idea.)
Maybe make that an optional argument?
(In reply to :Ehsan Akhgari (needinfo? me!) (slow responsiveness, emailacopolypse) from comment #4)
> Maybe make that an optional argument?

I think optional arguments are bad here for reasons similar to Luke's in comment 2. How about HashPrefixOfString?
While we're bikeshedding... I actually started working on a patch to do this yesterday, and I used the names HashStringLimited(), HashKnownLengthLimited(), and HashUntilZeroLimited(). I'll finish it up and post an updated patch.
> dom/asmjscache/AsmJSCache.cpp uses HashString() and I think it
> would be rather likely for there to be two separate cache lookups that have
> an asm.js module that differs only in bytes after the first 1000 chars
> (which are mostly boilerplate for large modules).

Is this a direct-mapped cache that just overwrites old entries when a collision occurs? If so, then yes, I can see that using the limited hash would be bad.
Updated version that introduces HashStringLimited(), rather than changing HashString(), and changes jsatom.h to use it. I thought about giving it an extra argument instead of hard-coding the limit, but I'm a little leery of having a non-constant multiply and divide, even if the RH operands are constant.

There are other places in the browser where we still hash big strings -- I'm seeing stuff relating to fonts and images. But they can be done in a follow-up.
Attachment #8373512 - Flags: review?(jwalden+bmo)
Attachment #8373111 - Attachment is obsolete: true
Attachment #8373111 - Flags: review?(jwalden+bmo)
This script generates a pathologically bad case for this optimization:

  s = '0123456789' * 100
  for i in range(10000):
      print('var x%d = "%s_%d";' % (i, s, i))

Every generated string shares the same 1000-character prefix and differs only after it, so with the limit in place they all hash identically. The resulting script runs vastly slower when this optimization is present.

This doesn't particularly worry me, because it's obscure, and if I want to write JS code that runs slowly there are plenty of simpler ways to do it. But Kannan is concerned, and possibly Brendan.

So is there some other way to achieve the same benefit? Hashing incredibly long, probably-unique strings isn't smart. For example, if the atoms table were a tree rather than a hash table, we wouldn't have this problem.
Maybe use a separate tree for indexing atoms over N characters?
> Maybe use a separate tree for indexing atoms over N characters?

That's a nice idea. The only problem is that we don't have any generic tree implementations and I don't particularly want to write one :(
Comment on attachment 8373512 [details] [diff] [review]
Introduce HashStringLimited(), and use it in jsatom.h.

Review of attachment 8373512 [details] [diff] [review]:
-----------------------------------------------------------------

Particularly given this is targeted at speeding up hashing of long strings, I don't think we should provide *UntilZero versions of these methods. Callers should be required to hand in the length of the contents they're hashing. If they can't compute it more easily than O(n), there's something seriously wrong with the optimality of the relevant code.

This also has an additional benefit: when the length exceeds the limit, we can add *the length* into the final hash. That means we don't have to worry about long, similar strings (like Luke's asm.js code with boilerplate prefixes) trivially colliding.

That, plus there only being a single use of this, militates strongly, IMO, for providing only ptr+length methods here.

::: mfbt/HashFunctions.h
@@ +296,5 @@
>      hash = AddToHash(hash, str[i]);
>    return hash;
>  }
> 
> +static const size_t limitedStringHashSize = 1000;

So the limit is in amount of memory, and not in amount of characters? That's probably a little unintuitive, but it's not unreasonable. limitedStringMemoryExamined seems like a better name to me. "hash size" does not connote the meaning I actually observe here.

@@ +329,5 @@
> 
>  /**
> + * The HashString overloads below do just what you'd expect. However, if you
> + * might be hashing very long strings and a tiny potential increase in
> + * collision rate doesn't matter, consider using HashStringLimited() instead.

"tiny potential increase in collision rate" isn't clear enough about when it's okay to use HashStringLimited. Instead, say "might be hashing very long strings sharing the same long prefix". It's not long strings that are the problem, it's long strings with the same prefix. Or at least, you should say this if until-zero overloads are provided, which I think they shouldn't be.

@@ +401,5 @@
> + * avoid slow hash computations, they don't consider the entire string if the
> + * string's size is greater than |limitedStringHashSize|.
> + *
> + * If you have the string's length, you might as well call the overload which
> + * includes the length. It may be marginally faster.

If you were to have the until-zero overload, I'd say: don't be wishy-washy, say flat-out "If you have the string's length, use the faster overload that includes the length". Removing it eliminates the need for this suggestion.

@@ +431,5 @@
> +{
> +  return detail::HashKnownLengthLimited(str, length);
> +}
> +
> +#ifdef MOZ_CHAR16_IS_NOT_WCHAR

Hmm, I should probably go apply template SFINAE to this to get rid of the ifdefs here, sometime. Not now.
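(For illustration, a sketch of the reviewer's suggestion above, building on the earlier HashKnownLengthLimited sketch: when the string exceeds the limit, fold the full length into the hash so that long strings sharing the same 1000-character prefix still hash differently whenever their lengths differ. This is an assumed follow-up shape, not landed code.)

  template<typename T>
  uint32_t
  HashKnownLengthLimited(const T* str, size_t length)
  {
    uint32_t hash = 0;
    size_t n = length < limitedStringHashSize ? length : limitedStringHashSize;
    for (size_t i = 0; i < n; i++)
      hash = mozilla::AddToHash(hash, str[i]);
    if (length > limitedStringHashSize)
      hash = mozilla::AddToHash(hash, length);  // mix in the total length, per the review comment
    return hash;
  }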
Attachment #8373512 - Flags: review?(jwalden+bmo) → review-
> > Maybe use a separate tree for indexing atoms over N characters?
> 
> That's a nice idea. The only problem is that we don't have any generic tree
> implementations and I don't particularly want to write one :(

... and I just noticed we have both mfbt/SplayTree.h and js/src/ds/SplayTree.h, and the latter was added to the tree almost exactly one month before the former. Each one appears to have exactly one use. I filed bug 981372 about removing one of them.
I no longer think this is worth doing.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → WONTFIX