Closed Bug 970152 - Don't hash past the first 1000 bytes of a string
Categories: Core :: MFBT, defect
Status: RESOLVED WONTFIX (opened 11 years ago, closed 11 years ago)
People: Reporter: n.nethercote; Assignee: n.nethercote
Attachments: 1 file, 1 obsolete file (patch, 5.68 KB, review- from Waldo)
We spend a non-trivial amount of time hashing extremely long strings in order
to put them in hash tables. A while back jlebar had an idea about only hashing
part of such strings. I don't think anything came of it at the time, but I just
tried it out and it appears to be worthwhile.
Consider SunSpider. During its execution, we call HashKnownLength() 509,192
times. The average string size (in bytes) is 12.8, but there's a long tail. The
following table shows what fraction of the bytes processed by HashKnownLength()
is done for strings of varying lengths.
  length    bytes hashed   fraction
  0+        6551978        100%
  10+       6128982        93.5%
  100+      2046114        31.2%
  1000+     1646374        25.1%
  10000+    1414048        21.6%
  100000+   1201432        18.3%
I.e. 21.6% of the work is done for strings of size 10,000+, and 18.3% of the
work is done for strings of size 100,000+. The longest string is 600,112 bytes.
Very long strings are rare, but because they are so big, they account for a
significant fraction of the hashing effort.
I did the same measurements for a few minutes of browsing at techcrunch.com,
nytimes.com, and bbc.co.uk. We hashed 5,505,563 strings, with an average string
size of 14.5 bytes, and the longest string was 96,710 bytes.
  length    bytes hashed   fraction
  0+        79580839       100%
  10+       76318870       95.9%
  100+      15336229       19.3%
  1000+     9961271        12.5%
  10000+    8274210        10.4%
So if we, say, limited ourselves to just hashing the first 999 bytes, we'd do
25.1% less hashing in SunSpider, and 12.5% less in vanilla browsing.
The hashing mostly comes from SpiderMonkey, where we atomize every identifier
and string that appears in user code; long strings in user code appear to be
the major source of very long strings.
BTW, HashUntilZero() gets used much less than HashKnownLength() -- e.g. it's
not used at all by SpiderMonkey.
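For concreteness, here's a minimal sketch of the idea (not the attached patch; it assumes MFBT's AddToHash() from mfbt/HashFunctions.h, and the cap constant's name is made up):

  #include <stddef.h>
  #include <stdint.h>
  #include "mozilla/HashFunctions.h"  // for mozilla::AddToHash()

  // Hypothetical cap on how many characters get fed into the hash.
  static const size_t kMaxCharsHashed = 1000;

  // Same per-character mixing as HashKnownLength(), just cut off early so
  // huge strings don't dominate hashing time.
  template<typename CharT>
  static uint32_t
  HashStringCapped(const CharT* str, size_t length)
  {
    uint32_t hash = 0;
    size_t n = length < kMaxCharsHashed ? length : kMaxCharsHashed;
    for (size_t i = 0; i < n; i++)
      hash = mozilla::AddToHash(hash, str[i]);
    return hash;
  }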
Comment 1 • 11 years ago (Assignee)
This patch makes HashKnownLength() and HashUntilZero() only hash the first 1000
chars. This is about a 1.01x speed-up for SunSpider on my desktop machine.
regexp-dna is typically 1.05--1.08x faster, and string-tagcloud and
string-unpack-code are typically 1.02--1.04x faster. Unsurprisingly, these are
the tests with very long strings.
If I time just the parsing of SunSpider (parsemark has the whole suite
concatenated into a single file), I get about a 1.10--1.12x speed-up.
If you're wondering how I chose 1000 chars, here are instruction counts for
parsing SunSpider with different limits:
- no limit 74.461
- 100000 73.439
- 10000 71.960
- 1000 71.622
- 100 71.539
- 10 71.975
- 8,9 72.384
- 6,7 73.498
- 4,5 76.603
- 2,3 95.338
- 0,1 604.685
100 was the best result, but 1000 is almost as good and will be more robust
against bad cases. The instruction count reduction going from no limit to 1000
is 1.04x; I assume the remainder of the 1.10--1.12x speed-up I saw is due to
fewer memory accesses. (During parsing, we tokenize the string and then
immediately atomize it; two consecutive iterations over a huge string will
trash the caches.) ARM devices typically have slower memory subsystems, so
improvements there might be bigger.
Speaking of bad cases: for this to hurt performance, we'd need to be hashing
*many* strings that are identical for the first 1000 bytes but differ after
that, and that also aren't much longer than 1000 chars. That seems *extremely*
unlikely.
My only other concern is this: these functions are only used for hash tables,
right? I.e. nobody's using these for any kind of cryptographic/security
purposes? I sure hope so. If this is a concern, I could keep the existing
HashString() functions and instead add some HashStringLimited() functions, and
replace existing HashString() calls where appropriate.
Attachment #8373111 - Flags: review?(jwalden+bmo)
Comment 2 • 11 years ago
It's a little scary to have a central utility function behave differently than advertised in this manner. I'd much rather have a new HashFirst1000Chars function and opt in at sites we want to optimize. Case in point: dom/asmjscache/AsmJSCache.cpp uses HashString(), and I think it would be rather likely for there to be two separate cache lookups whose asm.js modules differ only in bytes after the first 1000 chars (which are mostly boilerplate for large modules).
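(For illustration only, not from the bug: any prefix-limited hash makes such inputs collide by construction. HashStringCapped here is the hypothetical helper sketched in the description above.)

  #include <cassert>
  #include <string>

  void PrefixCollisionExample()
  {
    // Two asm.js-style sources sharing >1000 chars of boilerplate prefix and
    // differing only afterwards hash identically under a prefix-only hash.
    std::string boilerplate(2000, 'a');               // stand-in for the shared prefix
    std::string moduleA = boilerplate + "return f1;";
    std::string moduleB = boilerplate + "return f2;";
    assert(HashStringCapped(moduleA.data(), moduleA.length()) ==
           HashStringCapped(moduleB.data(), moduleB.length()));
  }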
Comment 3 • 11 years ago
(Putting "1000" in the name is probably a bad idea.)
Comment 4 • 11 years ago
Maybe make that an optional argument?
Comment 5 • 11 years ago
(In reply to :Ehsan Akhgari (needinfo? me!) (slow responsiveness, emailacopolypse) from comment #4)
> Maybe make that an optional argument?
I think optional arguments are bad here for reasons similar to Luke's in comment 2. How about HashPrefixOfString?
Comment 6 • 11 years ago (Assignee)
While we're bikeshedding... I actually started working on a patch to do this yesterday, and I used the names HashStringLimited(), HashKnownLengthLimited(), and HashUntilZeroLimited(). I'll finish it up and post an updated patch.
Comment 7 • 11 years ago (Assignee)
> dom/asmjscache/AsmJSCache.cpp uses HashString() and I think it
> would be rather likely for there to be two separate cache lookups that have
> an asm.js module that differs only in bytes after the first 1000 chars
> (which are mostly boilerplate for large modules).
Is this a direct-mapped cache that just overwrites old entries when a collision occurs? If so, then yes, I can see that using the limited hash would be bad.
Comment 8 • 11 years ago (Assignee)
Updated version that introduces HashStringLimited(), rather than changing
HashString(), and changes jsatom.h to use it.
I thought about giving it an extra argument instead of hard-coding the limit,
but I'm a little leery of having a non-constant multiple and divide, even if
the RH operands are constant.
There are other places in the browser where we still hash big strings -- I'm
seeing stuff relating to fonts and images. But they can be done in a follow-up.
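(Sketch of what the jsatom.h call site might look like; the helper and type names below are assumptions for illustration, not the actual patch.)

  // Hypothetical jsatom.h-style helper that opts into the limited hash:
  static inline js::HashNumber
  HashAtomChars(const jschar* chars, size_t length)
  {
    return mozilla::HashStringLimited(chars, length);
  }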
Attachment #8373512 - Flags: review?(jwalden+bmo)
Updated • 11 years ago (Assignee)
Attachment #8373111 - Attachment is obsolete: true
Attachment #8373111 - Flags: review?(jwalden+bmo)
Comment 9 • 11 years ago (Assignee)
This script generates a pathologically bad case for this optimization:
s = '0123456789' * 100
for i in range(10000):
    print('var x%d = "%s_%d";' % (i, s, i))
The resulting script runs vastly slower when this optimization is present. This doesn't particularly worry me, because it's obscure and if I want to write JS code that runs slowly there are plenty of simpler ways to do it. But Kannan is concerned, and possibly Brendan.
So is there some other way to achieve the same benefit? Hashing incredibly long, probably-unique strings isn't smart. For example, if the atoms table was a tree rather than a hash table, we wouldn't have this problem.
Comment 10 • 11 years ago
Maybe use a separate tree for indexing atoms over N characters?
Comment 11 • 11 years ago (Assignee)
> Maybe use a separate tree for indexing atoms over N characters?
That's a nice idea. The only problem is that we don't have any generic tree implementations, and I don't particularly want to write one :(
Comment 12 • 11 years ago
Comment on attachment 8373512 [details] [diff] [review]
Introduce HashStringLimited(), and use it in jsatom.h.
Review of attachment 8373512 [details] [diff] [review]:
-----------------------------------------------------------------
Particularly given this is targeted at speeding up hashing of long strings, I don't think we should provide *UntilZero versions of these methods. Callers should be required to hand in the length of the contents they're hashing. If they can't compute it more easily than O(n), there's something seriously wrong with the optimality of the relevant code.

This also has an additional benefit: when the length exceeds the limit, we can add *the length* into the final hash. That means we don't have to worry about long, similar strings (like Luke's asm.js code with boilerplate prefixes) trivially colliding.

That, plus there only being a single use of this, militates strongly, IMO, for providing only ptr+length methods here.
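(A minimal sketch of the length-mixing suggestion above, for illustration; it is not part of the attached patch.)

  template<typename CharT>
  static uint32_t
  HashKnownLengthLimited(const CharT* str, size_t length)
  {
    uint32_t hash = 0;
    size_t n = length < limitedStringHashSize ? length : limitedStringHashSize;
    for (size_t i = 0; i < n; i++)
      hash = AddToHash(hash, str[i]);
    // When we truncated, fold the full length in so long strings that share a
    // prefix but have different lengths don't trivially collide.
    if (length > limitedStringHashSize)
      hash = AddToHash(hash, length);
    return hash;
  }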
::: mfbt/HashFunctions.h
@@ +296,5 @@
> hash = AddToHash(hash, str[i]);
> return hash;
> }
>
> +static const size_t limitedStringHashSize = 1000;
So the limit is in amount of memory, and not in amount of characters? That's probably a little unintuitive, but it's not unreasonable.
limitedStringMemoryExamined seems like a better name to me. "hash size" does not connote the meaning I actually observe here.
@@ +329,5 @@
>
> /**
> + * The HashString overloads below do just what you'd expect. However, if you
> + * might be hashing very long strings and a tiny potential increase in
> + * collision rate doesn't matter, consider using HashStringLimited() instead.
"tiny potential increase in collision rate" isn't clear enough about when it's okay to use HashStringLimited. Instead, say "might be hashing very long strings sharing the same long prefix". It's not long strings that are the problem, it's long strings with the same prefix.
Or at least, you should say this if until-zero overloads are provided, which I think they shouldn't be.
@@ +401,5 @@
> + * avoid slow hash computations, they don't consider the entire string if the
> + * string's size is greater than |limitedStringHashSize|.
> + *
> + * If you have the string's length, you might as well call the overload which
> + * includes the length. It may be marginally faster.
If you were to have the until-zero overload, I'd say: don't be wishy-washy, say flat-out "If you have the string's length, use the faster overload that includes the length". Removing it eliminates the need for this suggestion.
@@ +431,5 @@
> +{
> + return detail::HashKnownLengthLimited(str, length);
> +}
> +
> +#ifdef MOZ_CHAR16_IS_NOT_WCHAR
Hmm, I should probably go apply template SFINAE to this to get rid of the ifdefs here, sometime. Not now.
Attachment #8373512 - Flags: review?(jwalden+bmo) → review-
Comment 13 • 11 years ago (Assignee)
> > Maybe use a separate tree for indexing atoms over N characters?
>
> That's a nice idea. The only problem is that we don't have any generic tree
> implementations and I don't particularly want to write one :(
... and I just noticed we have both mfbt/SplayTree.h and js/src/ds/SplayTree.h, and the latter was added to the tree almost exactly one month before the former. Each one appears to have exactly one use. I filed bug 981372 about removing one of them.
Comment 14 • 11 years ago (Assignee)
I no longer think this is worth doing.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → WONTFIX