It uses << to strength-reduce *= 37 on the accumulator, which retains information from only the last ~6 characters, yet wastes cycles loading every character of a longer string. Worse, using + rather than ^ reduces entropy in the high-order bits as carries propagate. It should rotate and ^, a la http://lxr.mozilla.org/mozilla/source/nsprpub/lib/ds/plhash.c#524 and clones (but those Java-esque hash functions should accumulate every character toward the end of longer strings, to get the more frequently different suffix chars of URIs and filenames -- or perhaps the whole Java-esque sampling should be tossed, and every char should be accumulated -- separate bug on this issue coming up). /be
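The rotate-and-xor scheme referred to above (as in NSPR's |PL_HashString|) can be sketched as follows; the function and helper names here are illustrative, not the actual nsCRT signatures:

```cpp
#include <cassert>
#include <cstdint>

// Rotate-left on a 32-bit accumulator: unlike a plain shift, no bits
// fall off the top, so every input character keeps influencing the hash.
static inline uint32_t RotateLeft32(uint32_t h, unsigned n) {
    return (h << n) | (h >> (32 - n));
}

// Sketch of the PL_HashString-style loop: rotate, then XOR in each
// character. XOR avoids the carry propagation that + causes, which is
// what loses entropy in the high-order bits of an add-and-multiply hash.
uint32_t RotXorHash(const char* s) {
    uint32_t h = 0;
    while (*s)
        h = RotateLeft32(h, 4) ^ static_cast<unsigned char>(*s++);
    return h;
}
```

Note that this accumulates every character, so long strings differing only in their suffixes (URIs, filenames) still hash differently.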
Important, but behind strings. Note: we'll want to update |HashSink| identically when we change this code.
See related bug #70085. Jag is converting a bunch of |ns[C]String&| parameters to |const nsAReadableString&| and |nsAWritableString&| as appropriate (which helps me fix bug #46892); he needs a hash that works on readables right away. It's just convenient to fix this bug, and part of 70085, in answer to his need.
Moving the hashing part of 70085 to this bug (because this bug _doesn't_ really depend on a complete fix for that bug). Therefore, as part of this bug, put the fixed hashing functionality into |nsCharTraits|.
Created attachment 26596 [details] Here's a test harness that compares several hash functions ... feel free to add your own
Created attachment 27222 [details] [diff] [review] fixes |nsCRT|s implementations to do what |PL_Hash...| does, and as a bonus unifies wide and narrow hashing
I'm posting this comment from a browser that includes this fix. I'll email brendan for an sr=, maybe waterson or jag for an r=?
Unless you want to be up all night fixing build bustage, I'd move the declaration of ``c'' out of the while loop's condition. r=waterson
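The portability fix waterson asks for looks something like this (a sketch with illustrative names, not the actual patch; |char16_t| stands in for |PRUnichar| here). Declaring a variable inside a while condition broke some compilers of the era, and a bare assignment in a condition draws "assignment used as truth value" warnings:

```cpp
#include <cassert>
#include <cstdint>

typedef char16_t PRUnichar;  // stand-in for NSPR's 16-bit code unit

// Non-portable form that risked build bustage:
//   while ( PRUnichar c = *s++ ) { ... }
// Portable form: hoist the declaration of |c| out of the condition, and
// parenthesize the assignment with an explicit comparison so compilers
// don't warn about assignment-as-condition.
uint32_t HashWide(const PRUnichar* s) {
    uint32_t h = 0;
    PRUnichar c;                       // declaration moved out of the loop
    while ( (c = *s++) != 0 )          // wrapped assignment, no warning
        h = (h << 4 | h >> 28) ^ uint32_t(c);   // rotate-and-xor step
    return h;
}
```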
Just curious why you convert the "UCS2" (do we really handle surrogate pairs and so on in our PRUnichar usage elsewhere?) to UTF8 and then accumulate? The 0x80 constants or'd in don't really help, and there are still "gap bits" accumulated that could be squeezed out. /be
Created attachment 27252 [details] [diff] [review] move declarations out of the loop condition (as per waterson); wrap assignment to prevent warnings; add comments; also fix |nsCRT::BufferHashCode|
Addressing Brendan's questions: hopefully the (new) comments make this clear. With this behavior, equivalent strings in different representations among ASCII, UTF-8, UCS-2, and UTF-16 will hash to the same value. If there are no surrogate pairs, we never get into that code. In particular, if the value fits in ASCII (as one would currently expect it usually does), no extra tag bits are inserted. The one-byte case is _fairly_ streamlined; the only annoyance along that path is the no-op lookups in the |sBytePrefix| and |sShift| arrays. Is it wishful thinking to hope those two arrays will be held in the on-chip data cache for the duration of the loop?
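The encoding-unifying idea can be sketched as follows, under the assumption (mine, not the patch's) that the wide loop feeds the accumulator the UTF-8 bytes each code unit would encode to; an ASCII code unit then contributes exactly the same single byte as the narrow path, so equivalent strings hash identically. Names are illustrative, and surrogate pairs are not handled in this sketch:

```cpp
#include <cassert>
#include <cstdint>

// Accumulate one byte with a rotate-and-xor step.
static inline uint32_t AddByte(uint32_t h, uint8_t b) {
    return ((h << 4) | (h >> 28)) ^ b;
}

// Narrow path: hash an ASCII/UTF-8 string byte by byte.
uint32_t HashNarrow(const char* s) {
    uint32_t h = 0;
    while (*s)
        h = AddByte(h, uint8_t(*s++));
    return h;
}

// Wide path: feed in the UTF-8 bytes each UCS-2 code unit encodes to.
// The ASCII case (< 0x80) is the streamlined one-byte path: no prefix
// or tag bits are inserted, matching HashNarrow exactly.
uint32_t HashWideAsUTF8(const char16_t* s) {
    uint32_t h = 0;
    for (char16_t c; (c = *s++) != 0; ) {
        if (c < 0x80) {
            h = AddByte(h, uint8_t(c));                        // 1 byte
        } else if (c < 0x800) {
            h = AddByte(h, uint8_t(0xC0 | (c >> 6)));          // 2 bytes
            h = AddByte(h, uint8_t(0x80 | (c & 0x3F)));
        } else {
            h = AddByte(h, uint8_t(0xE0 | (c >> 12)));         // 3 bytes
            h = AddByte(h, uint8_t(0x80 | ((c >> 6) & 0x3F)));
            h = AddByte(h, uint8_t(0x80 | (c & 0x3F)));
        }
    }
    return h;
}
```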
> do we really handle surrogate pairs and
> so on in our PRUnichar usage elsewhere?

I don't know what "and so on" means, but no, we don't support surrogates elsewhere (yet). We only support them in our UTF-8 converters.
erik: "and so on" means I'm still a Unicode ignoramus :-) -- if surrogates are the only cause of multiple adjacent uint16 elements being required to represent a character, I'm happy.

scc: I wouldn't worry about those array-element loads yet -- let's quantify and see whether the cost is more than noise. One question tho: where is the input file you passed to your hash-function test? I got results favoring P.J. Weinberger's function using /usr/dict/words, but I don't understand the test algorithm yet, so I'm not sure it tests what we want (a multiplicative hash does not divide by a prime). /be
Created attachment 27300 [details] here's the data I tested against (1 url per line; files pulled from other bugs)
brendan: There are endless debates about the meaning of the term "character" in I18N circles, but there are cases other than surrogates where you need more than one 16-bit unit to create some "thing". For example, in Arabic there are compulsory ligatures (where 2 or more Unicode characters combine to form a single glyph from a font); ligatures such as "fi" in English are not compulsory.

Also, Unicode has the concept of a combining mark, such as an accent. These marks are encoded after their base character, and the rendering system is expected to draw them somewhere in the vicinity of their base glyph (or a precomposed glyph may be used).

Anyway, I won't go into any more details, but for Mozilla we are just beginning to see some of these, namely Arabic ligatures. Surrogates are not supported yet.
Fix checked in to "nsCRT.cpp" version 3.38