Closed Bug 64592 Opened 24 years ago Closed 24 years ago

nsCRT::HashCode on strings accumulates bits badly

Categories

(Core :: XPCOM, defect, P1)

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla0.8.1

People

(Reporter: brendan, Assigned: scc)


Attachments

(4 files)

It uses << to strength-reduce *= 37 on the accumulator, which preserves
information from only the last ~6 characters, yet it wastes cycles loading
every character of a longer string.  Worse, using + rather than ^ reduces
entropy in the high-order bits as carries propagate.

It should rotate and ^, a la
http://lxr.mozilla.org/mozilla/source/nsprpub/lib/ds/plhash.c#524 and clones
(but those Java-esque hash functions should accumulate every character toward
the end of longer strings, to get the more frequently different suffix chars of
URIs and filenames -- or perhaps the whole Java-esque sampling should be tossed,
and every char should be accumulated -- separate bug on this issue coming up).

/be
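For concreteness, a minimal sketch of the rotate-and-XOR accumulation brendan
points at, modelled on PL_HashString in plhash.c (illustrative only, not the
nsCRT code as checked in):

#include <stdint.h>

// Fold each byte in with a rotate-left-by-4 and XOR: the rotation wraps
// bits around instead of shifting them out, so early characters keep
// influencing all 32 bits, and no carries leak into the high-order bits.
static uint32_t RotateXorHash(const char* s)
{
    uint32_t h = 0;
    for (; *s; ++s)
        h = ((h << 4) | (h >> 28)) ^ (uint8_t)*s;
    return h;
}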
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla0.9.1
Important, but behind strings; note that we'll want to update the |HashSink|
identically when we change this code.
Component: XPCOM → String
Depends on: 70090
No longer depends on: 70090
Blocks: 70090
See related bug #70085.  Jag is converting a bunch of |ns[C]String&| parameters
to |const nsAReadableString&| and |nsAWritableString&| as appropriate (which
helps me to fix bug #46892), and he needs a hash that works on readables right
away.  It's just convenient to fix this bug and part of 70085 in answer to his
need.
Depends on: 70085
Priority: -- → P1
Target Milestone: mozilla0.9.1 → mozilla0.9
Moving the hashing part of 70085 to this bug (because this bug _doesn't_ really
depend on a complete fix for that bug).  Therefore, as part of this bug, put the
fixed hashing functionality into |nsCharTraits|.
No longer depends on: 70085
Target Milestone: mozilla0.9 → mozilla0.8.1
I'm posting this comment from a browser that includes this fix.  I'll email
brendan for an sr=, maybe waterson or jag for an r=?
Keywords: approval, patch, review
sr=vidur 
Unless you want to be up all night fixing build bustage, I'd move the 
declaration of ``c'' out of the while loop's condition. r=waterson
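A sketch of the two forms waterson is contrasting -- declaring a variable in a
while condition tripped some compilers of the day, so the portable form hoists
the declaration (the loop body here is illustrative):

#include <stdint.h>

typedef uint16_t unit_t;   // stand-in for PRUnichar in this sketch

static uint32_t HashUnits(const unit_t* s)
{
    uint32_t h = 0;

    // Non-portable at the time:  while ( unit_t c = *s++ ) { ... }
    // Portable form: declare |c| before the loop, assign in the condition.
    unit_t c;
    while ((c = *s++) != 0)
        h = ((h << 4) | (h >> 28)) ^ c;
    return h;
}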
Just curious why you convert the "UCS2" (do we really handle surrogate pairs and
so on in our PRUnichar usage elsewhere?) to UTF8 and then accumulate?  The 0x80
constants or'd in don't really help, and there are still "gap bits" accumulated
that could be squeezed out.

/be
Addressing Brendan's questions: hopefully the (new) comments make this clear;
with this behavior, equivalent strings from different representations among
ASCII, UTF-8, UCS-2, and UTF-16 will hash to the same value.  If there are no
surrogate pairs, we never get into that code.  In particular, if the value fits
in ASCII (as one would expect it currently usually does) then no extra tag bits
are inserted.  The one-byte case is _fairly_ streamlined.  The only annoyance
along that path is the no-op lookups in the |sBytePrefix| and |sShift| arrays.
Is it wishful thinking to hope those two arrays will be held in the on-chip
data cache for the duration of the loop?
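To make the behavior scc describes concrete, here is a rough sketch (not the
checked-in code, and without the |sBytePrefix|/|sShift| table trick) of hashing
a UTF-16 buffer by re-encoding each code point as UTF-8 on the fly, so the same
text hashes identically whether it arrives as ASCII, UTF-8, UCS-2, or UTF-16:

#include <stddef.h>
#include <stdint.h>

// Rotate-left-by-4, then XOR in one UTF-8 byte.
static inline uint32_t AddByte(uint32_t h, uint8_t b)
{
    return ((h << 4) | (h >> 28)) ^ b;
}

static uint32_t HashUTF16AsUTF8(const uint16_t* s, size_t len)
{
    uint32_t h = 0;
    for (size_t i = 0; i < len; ++i) {
        uint32_t c = s[i];

        // Combine a surrogate pair into a single code point.
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < len &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            c = 0x10000 + ((c - 0xD800) << 10) + (s[i + 1] - 0xDC00);
            ++i;
        }

        if (c < 0x80) {                  // ASCII: one byte, no tag bits
            h = AddByte(h, (uint8_t)c);
        } else if (c < 0x800) {          // two-byte UTF-8 sequence
            h = AddByte(h, (uint8_t)(0xC0 | (c >> 6)));
            h = AddByte(h, (uint8_t)(0x80 | (c & 0x3F)));
        } else if (c < 0x10000) {        // three-byte UTF-8 sequence
            h = AddByte(h, (uint8_t)(0xE0 | (c >> 12)));
            h = AddByte(h, (uint8_t)(0x80 | ((c >> 6) & 0x3F)));
            h = AddByte(h, (uint8_t)(0x80 | (c & 0x3F)));
        } else {                         // four-byte UTF-8 sequence
            h = AddByte(h, (uint8_t)(0xF0 | (c >> 18)));
            h = AddByte(h, (uint8_t)(0x80 | ((c >> 12) & 0x3F)));
            h = AddByte(h, (uint8_t)(0x80 | ((c >> 6) & 0x3F)));
            h = AddByte(h, (uint8_t)(0x80 | (c & 0x3F)));
        }
    }
    return h;
}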
> do we really handle surrogate pairs and
> so on in our PRUnichar usage elsewhere?

I don't know what "and so on" means, but no, we don't support surrogates
elsewhere (yet). We only support them in our UTF-8 converters.
erik: "and so on" means I'm still a Unicode ignoramus :-) -- if surrogates are
the only cause of multiple adjacent uint16 elements being required to represent
a character, I'm happy.

scc: I wouldn't worry about those array element loads yet -- let's quantify and
see whether the effect is more than noise.  One question tho: where is the input
file you passed to your hash function test?  I got results favoring P.J.
Weinberger's function using /usr/dict/words, but I don't understand the test
algorithm yet, so I'm not sure it tests what we want (a multiplicative hash does
not use division by a prime).

/be
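As an aside on the parenthetical: a multiplicative hash maps the accumulated
code to a bucket by multiplying by a constant and keeping the top bits, rather
than reducing modulo a prime, so power-of-two table sizes work fine.  A minimal
sketch of the classic Fibonacci/Knuth scheme (illustrative, not the NSPR code):

#include <stdint.h>

// Assumes 1 <= tableSizeLog2 <= 31.
static uint32_t BucketIndex(uint32_t hashCode, unsigned tableSizeLog2)
{
    const uint32_t kGoldenRatio = 0x9E3779B9u;   // ~ 2^32 / golden ratio
    return (hashCode * kGoldenRatio) >> (32 - tableSizeLog2);
}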
brendan: There are endless debates about the meaning of the term "character" in
I18N circles, but there are cases other than surrogates where you need more than
one 16-bit unit to create some "thing". For example, in Arabic there are
compulsory ligatures (where 2 or more Unicodes combine to form a single glyph
from a font). Ligatures such as "fi" in English are not compulsory. Also,
Unicode has the concept of a combining mark, such as an accent. These marks
are encoded after their base character, and the rendering system is expected to
draw them somewhere in the vicinity of their base glyph (or a precomposed glyph
may be used).

Anyway, I won't go into any more details, but for Mozilla we are just beginning
to see some of these, namely Arabic ligatures. Surrogates are not supported yet.
Fix checked in to "nsCRT.cpp" version 3.38
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Keywords: approval, patch, review
Resolution: --- → FIXED
Component: String → XPCOM