Closed Bug 64592 Opened 24 years ago Closed 24 years ago

nsCRT::HashCode on strings accumulates bits badly

Categories

(Core :: XPCOM, defect, P1)

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla0.8.1

People

(Reporter: brendan, Assigned: scc)


Attachments

(4 files)

It uses << to strength-reduce *= 37 on the accumulator, which preserves
information from only the last ~6 characters, yet it wastes cycles loading
every character of a longer string.  Worse, using + rather than ^ reduces
entropy in the high-order bits as carries propagate.

It should rotate and ^, a la
http://lxr.mozilla.org/mozilla/source/nsprpub/lib/ds/plhash.c#524 and clones
(but those Java-esque hash functions should accumulate every character toward
the end of longer strings, to get the more frequently different suffix chars of
URIs and filenames -- or perhaps the whole Java-esque sampling should be tossed,
and every char should be accumulated -- separate bug on this issue coming up).

/be
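For concreteness, a minimal sketch of the rotate-and-XOR accumulation brendan
points at, modelled on PL_HashString in plhash.c (illustrative only, not the
nsCRT code as checked in):

#include <stdint.h>

// Fold each byte in with a rotate-left-by-4 and XOR: the rotation wraps
// bits around instead of shifting them out, so early characters keep
// influencing all 32 bits, and no carries leak into the high-order bits.
static uint32_t RotateXorHash(const char* s)
{
    uint32_t h = 0;
    for (; *s; ++s)
        h = ((h << 4) | (h >> 28)) ^ (uint8_t)*s;
    return h;
}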
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla0.9.1
Important, but behind strings; note that we'll want to update the |HashSink|
identically when we change this code.
Component: XPCOM → String
Depends on: 70090
No longer depends on: 70090
Blocks: 70090
See related bug #70085.  Jag is converting a bunch of |ns[C]String&| parameters
to |const nsAReadableString&| and |nsAWritableString&| as appropriate (which
helps me to fix bug #46892), and he needs a hash that works on readables right
away.  It's just convenient to fix this bug and part of 70085 in answer to his
need.
Depends on: 70085
Priority: -- → P1
Target Milestone: mozilla0.9.1 → mozilla0.9
Moving the hashing part of 70085 to this bug (because this bug _doesn't_ really
depend on a complete fix for that bug).  Therefore, as part of this bug, put the
fixed hashing functionality into |nsCharTraits|.
No longer depends on: 70085
Target Milestone: mozilla0.9 → mozilla0.8.1
I'm posting this comment from a browser that includes this fix.  I'll email
brendan for an sr=, maybe waterson or jag for an r=?
Keywords: approval, patch, review
sr=vidur 
Unless you want to be up all night fixing build bustage, I'd move the 
declaration of ``c'' out of the while loop's condition. r=waterson
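A sketch of the two forms waterson is contrasting -- declaring a variable in a
while condition tripped some compilers of the day, so the portable form hoists
the declaration (the loop body here is illustrative):

#include <stdint.h>

typedef uint16_t unit_t;   // stand-in for PRUnichar in this sketch

static uint32_t HashUnits(const unit_t* s)
{
    uint32_t h = 0;

    // Non-portable at the time:  while ( unit_t c = *s++ ) { ... }
    // Portable form: declare |c| before the loop, assign in the condition.
    unit_t c;
    while ((c = *s++) != 0)
        h = ((h << 4) | (h >> 28)) ^ c;
    return h;
}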
Just curious why you convert the "UCS2" (do we really handle surrogate pairs and
so on in our PRUnichar usage elsewhere?) to UTF8 and then accumulate?  The 0x80
constants or'd in don't really help, and there are still "gap bits" accumulated
that could be squeezed out.

/be
Addressing Brendan's questions: hopefully the (new) comments make this clear;
with this behavior, equivalent strings from different representations among
ASCII, UTF-8, UCS-2, and UTF-16 will hash to the same value.  If there are no
surrogate pairs, we never get into that code.  In particular, if the value fits
in ASCII (as one would expect it currently usually does) then no extra tag bits
are inserted.  The one-byte case is _fairly_ streamlined.  The only annoyance
along that path is the no-op lookups in the |sBytePrefix| and |sShift| arrays.
Is it wishful thinking to hope those two arrays will be held in the on-chip
data cache for the duration of the loop?
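To make the behavior scc describes concrete, here is a rough sketch (not the
checked-in code, and without the |sBytePrefix|/|sShift| table trick) of hashing
a UTF-16 buffer by re-encoding each code point as UTF-8 on the fly, so the same
text hashes identically whether it arrives as ASCII, UTF-8, UCS-2, or UTF-16:

#include <stddef.h>
#include <stdint.h>

// Rotate-left-by-4, then XOR in one UTF-8 byte.
static inline uint32_t AddByte(uint32_t h, uint8_t b)
{
    return ((h << 4) | (h >> 28)) ^ b;
}

static uint32_t HashUTF16AsUTF8(const uint16_t* s, size_t len)
{
    uint32_t h = 0;
    for (size_t i = 0; i < len; ++i) {
        uint32_t c = s[i];

        // Combine a surrogate pair into a single code point.
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < len &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            c = 0x10000 + ((c - 0xD800) << 10) + (s[i + 1] - 0xDC00);
            ++i;
        }

        if (c < 0x80) {                  // ASCII: one byte, no tag bits
            h = AddByte(h, (uint8_t)c);
        } else if (c < 0x800) {          // two-byte UTF-8 sequence
            h = AddByte(h, (uint8_t)(0xC0 | (c >> 6)));
            h = AddByte(h, (uint8_t)(0x80 | (c & 0x3F)));
        } else if (c < 0x10000) {        // three-byte UTF-8 sequence
            h = AddByte(h, (uint8_t)(0xE0 | (c >> 12)));
            h = AddByte(h, (uint8_t)(0x80 | ((c >> 6) & 0x3F)));
            h = AddByte(h, (uint8_t)(0x80 | (c & 0x3F)));
        } else {                         // four-byte UTF-8 sequence
            h = AddByte(h, (uint8_t)(0xF0 | (c >> 18)));
            h = AddByte(h, (uint8_t)(0x80 | ((c >> 12) & 0x3F)));
            h = AddByte(h, (uint8_t)(0x80 | ((c >> 6) & 0x3F)));
            h = AddByte(h, (uint8_t)(0x80 | (c & 0x3F)));
        }
    }
    return h;
}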
> do we really handle surrogate pairs and
> so on in our PRUnichar usage elsewhere?

I don't know what "and so on" means, but no, we don't support surrogates
elsewhere (yet). We only support them in our UTF-8 converters.
erik: "and so on" means I'm still a Unicode ignoramus :-) -- if surrogates are
the only cause of multiple adjacent uint16 elements being required to represent
a character, I'm happy.

scc: I wouldn't worry about those array element loads yet -- let's quantify and
see whether the effect is more than noise.  One question tho: where is the input
file you passed to your hash function test?  I got results favoring P.J.
Weinberger's function using /usr/dict/words, but I don't understand the test
algorithm yet, so I'm not sure it tests what we want (a multiplicative hash does
not use division by a prime).

/be
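As an aside on the parenthetical: a multiplicative hash maps the accumulated
code to a bucket by multiplying by a constant and keeping the top bits, rather
than reducing modulo a prime, so power-of-two table sizes work fine.  A minimal
sketch of the classic Fibonacci/Knuth scheme (illustrative, not the NSPR code):

#include <stdint.h>

// Assumes 1 <= tableSizeLog2 <= 31.
static uint32_t BucketIndex(uint32_t hashCode, unsigned tableSizeLog2)
{
    const uint32_t kGoldenRatio = 0x9E3779B9u;   // ~ 2^32 / golden ratio
    return (hashCode * kGoldenRatio) >> (32 - tableSizeLog2);
}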
brendan: There are endless debates about the meaning of the term "character" in
I18N circles, but there are cases other than surrogates where you need more than
one 16-bit unit to create some "thing". For example, in Arabic there are
compulsory ligatures (where 2 or more Unicodes combine to form a single glyph
from a font). Ligatures such as "fi" in English are not compulsory. Also,
Unicode has the concept of a combining mark, such as an accent. These marks
are encoded after their base character, and the rendering system is expected to
draw them somewhere in the vicinity of their base glyph (or a precomposed glyph
may be used).

Anyway, I won't go into any more details, but for Mozilla we are just beginning
to see some of these, namely Arabic ligatures. Surrogates are not supported yet.
Fix checked in to "nsCRT.cpp" version 3.38
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Keywords: approval, patch, review
Resolution: --- → FIXED
Component: String → XPCOM