Closed Bug 304951 Opened 19 years ago Closed 19 years ago

Hebrew detector too confident

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla1.8beta4

People

(Reporter: bugs.caleb, Assigned: shooshx)

References

Details

(Keywords: fixed1.8, regression, testcase)

Attachments

(2 files)

I've bumped into a bugzilla page where the reporter had 2 hebrew chars as part
of his name, and that cuased the Universal detection to change encodings for the
whole page.
Attached file the page
This "testcase" needs some major reduction, but it shows the bug pretty
clearly.
(In reply to comment #0)
> I've bumped into a bugzilla page where the reporter had 2 hebrew chars as part
> of his name

It should be noted that these are NOT really Hebrew characters. these are two
Finnish characters (encoded in Latin-1) which were incorrectly identified as
Hebrew characters.
I just ran the prober with debug printing on this page and got the following output:
  MBCS inactive: [UTF8] (confidence is too low).
  MBCS inactive: [SJIS] (confidence is too low).
  MBCS inactive: [EUCJP] (confidence is too low).
  MBCS inactive: [GB18030] (confidence is too low).
  MBCS inactive: [EUCKR] (confidence is too low).
  MBCS inactive: [Big5] (confidence is too low).
  MBCS inactive: [EUCTW] (confidence is too low).
 SBCS Group Prober --------begin status
  SBCS: 0.010 [windows-1251]
  SBCS: 0.010 [KOI8-R]
  SBCS: 0.010 [ISO-8859-5]
  SBCS: 0.010 [x-mac-cyrillic]
  SBCS: 0.010 [IBM866]
  SBCS: 0.010 [IBM855]
  SBCS: 0.010 [ISO-8859-7]
  SBCS: 0.010 [windows-1253]
  SBCS: 0.010 [ISO-8859-5]
  SBCS: 0.010 [windows-1251]
  HEB: 0 - 0 [Logical-Visual score]
  SBCS: 0.508 [windows-1255]
  SBCS: 0.508 [windows-1255]
 SBCS Group found best match [windows-1255] confidence 0.508128.
 Latin1Prober: 0.500 [windows-1252]
Universal Charset Detector report charset windows-1255 .

significant part of this is:
  SBCS: 0.508 [windows-1255]
  SBCS: 0.508 [windows-1255]
 SBCS Group found best match [windows-1255] confidence 0.508128.
 Latin1Prober: 0.500 [windows-1252]

meaning only 0.8% difference between the hebrew and latin
these 0.8% are exactly the two offending letters which are two 0xE4 in the name
of the reporter of the bug and which correspond to the hebrew letter HEH in
windows-1255.
possible solutions I can see for this bug...
- make MINIMUM_THRESHOLD in
http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/nsUniversalDetector.cpp#99
higher. 0.20 is way too low in my opinion. increasing it to 0.55 for instance
solves this bug but possibly causes other regressions for other languages
- tell nsUniversalDetector and in general whoever makes a decision according to
model conficence to not make the decision if the difference between two probers
is lower then some threshold, say 1%. implemeted trivially, this is almost
certain to cause problems to the two hebrew detectors which often both get 99%
in hebrew pages. 
- a possible combination of the above two methods. tell the detector to not made
the decision if the difference is lower the 1% AND both scores are low.
it should be noted that this needs to be done generally, with any number of
scores, not just two.

these are my thoughs about it for the moment.
I don't really understand how charset-detection works, but there seems to be
something wrong here. A page which is 99% ASCII is *much* more likely to be
Latin1 than Hebrew (or any non-Latin alphabet). It seems like the detector is
only looking at the non-ASCII chcracters, which I think is wrong.
I don't know if this is a flaw with this particular detector or with the
detection mechanism in general, but IMO it's something that should be fixed.
you have a very good point there. something IS fundumentally wrong.
I have infact just noticed that there a serious mistake in the Hebrew Model in
LangHebrewModel.cpp which nither me nor simon noticed.

after a short debug session I've determined that the problem is infact not the
letters themselves but the pair "/xE4/x20" which appears in the text the prober
is given and is not ignored as it should be (since it contains a space).

LangHebrewModel.cpp is the unchaged model simon produced with the model maker
(model maker produced by the original author of this library). 
In the Language models already existing, I noticed that it is mentioned that the
models have been slightly altered from what the tool produced. I believe this
alteration is exactly what gives the hebrew model its problems.

I haven't fully understood all of this alteration but I believe its base is a
complete rewrite of 0x00-0x3f of the CharToOrderMap table. this rewrite, among
other things assigns 0x20 (space) order 253 (close to last) instead of order 0
(first, most common) which is where the current model assigns him. this
effectively excluded space from the pair check and saves the day for this test case.
another evidance that there is infact a change on the original model from the
tool is this comment from LangBulgarianModel.cpp:

/****************************************************************
255: Control characters that usually does not exist in any text
254: Carriage/Return
253: symbol (punctuation) that does not belong to word
252: 0 - 9
*****************************************************************/
as I understand it, this is infact a rudimentary mapping of the so called
alteration to the original CharToOrderMap table. however I suspect that it
doesn't describe the alretation fully.

to test this theory I copied 0x00-0x3f from the bulgaian model to simons hebrew
one and voila, it worked perfectly. space is now 253, last pair ignored and the
hebrew model prober stays with 1% like the rest of the probers:

  MBCS inactive: [UTF8] (confidence is too low).
  MBCS inactive: [SJIS] (confidence is too low).
  MBCS inactive: [EUCJP] (confidence is too low).
  MBCS inactive: [GB18030] (confidence is too low).
  MBCS inactive: [EUCKR] (confidence is too low).
  MBCS inactive: [Big5] (confidence is too low).
  MBCS inactive: [EUCTW] (confidence is too low).
 SBCS Group Prober --------begin status
  SBCS: 0.010 [windows-1251]
  SBCS: 0.010 [KOI8-R]
  SBCS: 0.010 [ISO-8859-5]
  SBCS: 0.010 [x-mac-cyrillic]
  SBCS: 0.010 [IBM866]
  SBCS: 0.010 [IBM855]
  SBCS: 0.010 [ISO-8859-7]
  SBCS: 0.010 [windows-1253]
  SBCS: 0.010 [ISO-8859-5]
  SBCS: 0.010 [windows-1251]
  HEB: 0 - 0 [Logical-Visual score]
  SBCS: 0.010 [windows-1255]
  SBCS: 0.010 [windows-1255]
 SBCS Group found best match [windows-1251] confidence 0.010000.
 Latin1Prober: 0.500 [windows-1252]

I would want to discuss this further with simon before making this into a patch.
Attached patch patch 1Splinter Review
this patch fixes the bug by applying the appropriate modification needed in
win1255_CharToOrderMap

the chage contains:
- rewrite of 0x00-0x40, 0x5b-0x60, 0x7b-0x7f (inclusive) to the values from any
of the other old tables. erasing the original values from the model to 252-255.
this makes the model engine disreguard these characters as it intentionally
should work.
- add a comment about values 252-255, this comment is copied from one of the
other models as well.

this solves the bug by making the model engine disreguard the space which gave
the testcase a high hebrew score, higher then latin-1.

testing:
the original test case of this bug is not showing any signs of being recognized
as hebrew.
also, I've tested all of my previous test cases from 86999 and they work
perfectly.

the modification made to the old language models is not really documented
anywhere and I had to use a bit of guess work and reverse engeneering to do it
right. there's no doubt this should be documented somewhere proprly but I'm
pretty sure this is not in the scope of this patch. it should be a part of the
model-making tool, whenever it is checked in.
Attachment #193395 - Flags: review?(smontagu)
Comment on attachment 193395 [details] [diff] [review]
patch 1

I should have spotted this when creating the language model. r=me.
Attachment #193395 - Flags: review?(smontagu) → review+
Assignee: smontagu → shoosh20012001
The middle line below seems wrong:

+255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,  //10
++253,253,253,253,253,253,253,253,253,253,253,253,253,253,253,253,  //20
+252,252,252,252,252,252,252,252,252,252,253,253,253,253,253,253,  //30
if you are referring to the odd "+" in the begining of the line then this is not
a mistake. this is how it should be and how it is in all the other language models.
I'm guessing that it's there to show to importance of this being what it is :)
Status: NEW → ASSIGNED
Attachment #193395 - Flags: superreview?(roc)
Attachment #193395 - Flags: superreview?(roc) → superreview+
We really need that documentation checked in somewhere!
Attachment #193395 - Flags: approval1.8b4?
That documentation by itself is useless since its context is the language model
maker tool which produces these tables. Simon and I are the only active
developers who currently have this code and in its current state it is not ready
for checkin.  current state being - only compiles on windows with VC6 and with
very little documentation.
In the coming weeks I would make my best effort to prepare this code for checkin
and document it to my best of understanding.
at any rate, this current bug is not the place to discuss this code. it should
have a bug of its own. 
Attachment #193395 - Flags: approval1.8b4? → approval1.8b4+
Trunk:
Checking in src/LangHebrewModel.cpp;
/cvsroot/mozilla/extensions/universalchardet/src/LangHebrewModel.cpp,v  <-- 
LangHebrewModel.cpp
new revision: 1.2; previous revision: 1.1
done

1.8 Branch:
Checking in src/LangHebrewModel.cpp;
/cvsroot/mozilla/extensions/universalchardet/src/LangHebrewModel.cpp,v  <-- 
LangHebrewModel.cpp
new revision: 1.2; previous revision: 1.1
done
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Keywords: fixed1.8
Resolution: --- → FIXED
OS: Windows XP → All
Hardware: PC → All
Target Milestone: --- → mozilla1.8beta4
Just a FYI: Now the page https://bugzilla.mozilla.org/show_bug.cgi?id=1156 is
detected as Korean (EUC-KR) :/
That seems to be a return to the status quo at least, because it's detected as
Korean in DPA2 also.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: