Closed
Bug 64592
Opened 24 years ago
Closed 24 years ago
nsCRT::HashCode on strings accumulates bits badly
Categories
(Core :: XPCOM, defect, P1)
Core
XPCOM
Tracking
()
RESOLVED
FIXED
mozilla0.8.1
People
(Reporter: brendan, Assigned: scc)
References
()
Details
Attachments
(4 files)
3.06 KB,
text/plain
|
Details | |
3.13 KB,
patch
|
Details | Diff | Splinter Review | |
5.08 KB,
patch
|
Details | Diff | Splinter Review | |
112.58 KB,
text/plain
|
Details |
It uses << to strength-reduce *= 37 on the accumulator, which saves information from only the last ~6 characters, yet it wastes cycles loading all chars from a longer string. Worse, using + rather than ^ reduces entropy in the high order bits as carries propagate. It should rotate and ^, a la http://lxr.mozilla.org/mozilla/source/nsprpub/lib/ds/plhash.c#524 and clones (but those Java-esque hash functions should accumulate every character toward the end of longer strings, to get the more frequently different suffix chars of URIs and filenames -- or perhaps the whole Java-esque sampling should be tossed, and every char should be accumulated -- separate bug on this issue coming up). /be
Assignee | ||
Updated•24 years ago
|
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla0.9.1
Assignee | ||
Comment 1•24 years ago
|
||
important, but behind strings; note, we'll want to update the |HashSink| identically when we change this code.
Assignee | ||
Updated•24 years ago
|
Component: XPCOM → String
Assignee | ||
Comment 2•24 years ago
|
||
See related bug #70085. Jag is converting a bunch of |ns[C]String&| parameters to |const nsAReadableString&| and |nsAWritableString&| as appropriate (which helps me to fix bug #46892), he needs a hash that works on readables right away. It's just convenient to fix this bug and part of 70085 in answer to his need.
Assignee | ||
Comment 3•24 years ago
|
||
Moving the hashing part of 70085 to this bug (because this bug _doesn't_ really depend on a complete fix for that bug). Therefore, as part of this bug, put the fixed hashing functionality into |nsCharTraits|.
No longer depends on: 70085
Assignee | ||
Comment 4•24 years ago
|
||
Assignee | ||
Updated•24 years ago
|
Target Milestone: mozilla0.9 → mozilla0.8.1
Assignee | ||
Comment 5•24 years ago
|
||
Assignee | ||
Comment 6•24 years ago
|
||
I'm posting this comment from a browser that includes this fix. I'll email brendan for an sr=, maybe waterson or jag for an r=?
Comment 7•24 years ago
|
||
sr=vidur
Comment 8•24 years ago
|
||
Unless you want to be up all night fixing build bustage, I'd move the declaration of ``c'' out of the while loop's condition. r=waterson
Reporter | ||
Comment 9•24 years ago
|
||
Just curious why you convert the "UCS2" (do we really handle surrogate pairs and so on in our PRUnichar usage elsewhere?) to UTF8 and then accumulate? The 0x80 constants or'd in don't really help, and there are still "gap bits" accumulated that could be squeezed out. /be
Assignee | ||
Comment 10•24 years ago
|
||
Assignee | ||
Comment 11•24 years ago
|
||
Addressing Brendan's questions: hopefully the (new) comments make this clear; with this behavior, equivalent strings from different representations among ASCII, UTF-8, UCS-2, and UTF-16, will hash to the same value. If there are no surrogate pairs, we never get into that code. In particular, if the value fits in ASCII (as one would expect it currently usually does) then no extra tag bits are inserted. The one-byte case is _fairly_ streamlined. The only annoyance along that path is the no-op lookups in |sBytePrefix| and |sShift| arrays. Is it wishful thinking to hope those two arrays will be held in the on-chip data cache for the duration of the loop?
Comment 12•24 years ago
|
||
> do we really handle surrogate pairs and
> so on in our PRUnichar usage elsewhere?
I don't know what "and so on" means, but no, we don't support surrogates
elsewhere (yet). We only support them in our UTF-8 converters.
Reporter | ||
Comment 13•24 years ago
|
||
erik: "and so on" means I'm still a Unicode ignoramus :-) -- if surrogates are the only cause of multiple adjacent uint16 elements being required to represent a character, I'm happy. scc: I wouldn't worry about those array element loads yet -- let's quantify and see whether this is not noise. One question tho: where is the input file you passed to your hash function test? I got results favoring P.J. Weinberger's function using /usr/dict/words, but I don't understand the test algorithm yet, so I'm not sure it tests what we want (multiplicative hash does not use divide by a prime). /be
Assignee | ||
Comment 14•24 years ago
|
||
Comment 15•24 years ago
|
||
brendan: There are endless debates about the meaning of the term "character" in I18N circles, but there are cases other than surrogates where you need more than one 16-bit unit to create some "thing". For example, in Arabic there are compulsory ligatures (where 2 or more Unicodes combine to form a single glyph from a font). Ligatures such as "fi" in English are not compulsory. Also, Unicode has the concept of a combining mark, such as an accent. These marks are encoded after their base character, and the rendering system is expected to draw them somewhere in the vicinity of their base glyph (or a precomposed glyph may be used). Anyway, I won't go into any more details, but for Mozilla we are just beginning to see some of these, namely Arabic ligatures. Surrogates are not supported yet.
Assignee | ||
Comment 16•24 years ago
|
||
Fix checked in to "nsCRT.cpp" version 3.38
Updated•3 years ago
|
Component: String → XPCOM
You need to log in
before you can comment on or make changes to this bug.
Description
•