nsCRT::HashCode on strings accumulates bits badly

RESOLVED FIXED in mozilla0.8.1

Status

()

Core
String
P1
normal
RESOLVED FIXED
17 years ago
17 years ago

People

(Reporter: brendan, Assigned: Scott Collins)

Tracking

Trunk
mozilla0.8.1
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

(URL)

Attachments

(4 attachments)

(Reporter)

Description

17 years ago
It uses << to strength-reduce *= 37 on the accumulator, which saves information
from only the last ~6 characters, yet it wastes cycles loading all chars from a
longer string.  Worse, using + rather than ^ reduces entropy in the high order
bits as carries propagate.

It should rotate and ^, a la
http://lxr.mozilla.org/mozilla/source/nsprpub/lib/ds/plhash.c#524 and clones
(but those Java-esque hash functions should accumulate every character toward
the end of longer strings, to get the more frequently different suffix chars of
URIs and filenames -- or perhaps the whole Java-esque sampling should be tossed,
and every char should be accumulated -- separate bug on this issue coming up).

/be
(Assignee)

Updated

17 years ago
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla0.9.1
(Assignee)

Comment 1

17 years ago
important, but behind strings; note, we'll want to update the |HashSink|
identically when we change this code.
(Assignee)

Updated

17 years ago
Component: XPCOM → String
(Assignee)

Updated

17 years ago
Depends on: 70090
(Assignee)

Updated

17 years ago
No longer depends on: 70090
(Assignee)

Updated

17 years ago
Blocks: 70090
(Assignee)

Comment 2

17 years ago
See related bug #70085.  Jag is converting a bunch of |ns[C]String&| parameters
to |const nsAReadableString&| and |nsAWritableString&| as appropriate (which
helps me to fix bug #46892), he needs a hash that works on readables right away.
 It's just convenient to fix this bug and part of 70085 in answer to his need.
Depends on: 70085
Priority: -- → P1
Target Milestone: mozilla0.9.1 → mozilla0.9
(Assignee)

Comment 3

17 years ago
Moving the hashing part of 70085 to this bug (because this bug _doesn't_ really
depend on a complete fix for that bug).  Therefore, as part of this bug, put the
fixed hashing functionality into |nsCharTraits|.
No longer depends on: 70085
(Assignee)

Comment 4

17 years ago
Created attachment 26596 [details]
Here's a test harness that compares several hash functions ... feel free to add your own
(Assignee)

Updated

17 years ago
Target Milestone: mozilla0.9 → mozilla0.8.1
(Assignee)

Comment 5

17 years ago
Created attachment 27222 [details] [diff] [review]
fixes |nsCRT|s implementations to do what |PL_Hash...| does, and as a bonus unifies wide and narrow hashing
(Assignee)

Comment 6

17 years ago
I'm posting this comment from a browser that includes this fix.  I'll email
brendan for an sr=, maybe waterson or jag for an r=?
Keywords: approval, patch, review

Comment 7

17 years ago
sr=vidur 

Comment 8

17 years ago
Unless you want to be up all night fixing build bustage, I'd move the 
declaration of ``c'' out of the while loop's condition. r=waterson
(Reporter)

Comment 9

17 years ago
Just curious why you convert the "UCS2" (do we really handle surrogate pairs and
so on in our PRUnichar usage elsewhere?) to UTF8 and then accumulate?  The 0x80
constants or'd in don't really help, and there are still "gap bits" accumulated
that could be squeezed out.

/be
(Assignee)

Comment 10

17 years ago
Created attachment 27252 [details] [diff] [review]
move declarations out of the loop condition (as per waterson); wrap assignment to prevent warnings; add comments; also fix |nsCRT::BufferHashCode|
(Assignee)

Comment 11

17 years ago
Addressing Brendan's questions: hopefully the (new) comments make this clear;
with this behavior, equivalent strings from different representations among
ASCII, UTF-8, UCS-2, and UTF-16, will hash to the same value.  If there are no
surrogate pairs, we never get into that code.  In particular, if the value fits
in ASCII (as one would expect it currently usually does) then no extra tag bits
are inserted.  The one-byte case is _fairly_ streamlined.  The only annoyance
along that path is the no-op lookups in |sBytePrefix| and |sShift| arrays.  Is
it wishful thinking to hope those two arrays will be held in the on-chip data
cache for the duration of the loop?

Comment 12

17 years ago
> do we really handle surrogate pairs and
> so on in our PRUnichar usage elsewhere?

I don't know what "and so on" means, but no, we don't support surrogates
elsewhere (yet). We only support them in our UTF-8 converters.
(Reporter)

Comment 13

17 years ago
erik: "and so on" means I'm still a Unicode ignoramus :-) -- if surrogates are
the only cause of multiple adjacent uint16 elements being required to represent
a character, I'm happy.

scc: I wouldn't worry about those array element loads yet -- let's quantify and
see whether this is not noise.  One question tho: where is the input file you
passed to your hash function test?  I got results favoring P.J. Weinberger's
function using /usr/dict/words, but I don't understand the test algorithm yet,
so I'm not sure it tests what we want (multiplicative hash does not use divide
by a prime).

/be
(Assignee)

Comment 14

17 years ago
Created attachment 27300 [details]
here's the data I tested against (1 url per line; files pulled from other bugs)

Comment 15

17 years ago
brendan: There are endless debates about the meaning of the term "character" in
I18N circles, but there are cases other than surrogates where you need more than
one 16-bit unit to create some "thing". For example, in Arabic there are
compulsory ligatures (where 2 or more Unicodes combine to form a single glyph
from a font). Ligatures such as "fi" in English are not compulsory. Also,
Unicode has the concept of a combining mark, such as an accent. These marks
are encoded after their base character, and the rendering system is expected to
draw them somewhere in the vicinity of their base glyph (or a precomposed glyph
may be used).

Anyway, I won't go into any more details, but for Mozilla we are just beginning
to see some of these, namely Arabic ligatures. Surrogates are not supported yet.
(Assignee)

Comment 16

17 years ago
Fix checked in to "nsCRT.cpp" version 3.38
Status: ASSIGNED → RESOLVED
Last Resolved: 17 years ago
Keywords: approval, patch, review
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.