Closed Bug 306272 Opened 19 years ago Closed 18 years ago

a short chunk of text in UTF-8 misdetected as EUC-KR by universal autodetector

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla1.8.1beta2

People

(Reporter: hhschwab, Assigned: smontagu)

References

()

Details

(Keywords: fixed1.8.1, intl, testcase)

Attachments

(2 files)

Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.9a1) Gecko/20050826 SeaMonkey/1.1a; also seen in Deer Park.

Steps to reproduce:
1. Check: View -> Character Encoding -> Auto-Detect -> Universal
2. Load Bug 1156 or the upcoming testcase
3. Notice that the font size is bigger than normal
4. Page Info says: Encoding: EUC-KR
5. See: View -> Character Encoding -> Korean (EUC-KR)
Attached file testcase
<html><head> <title>306272</title> </head><body> <a href="mailto:Antti.Nayha@somewhere.fi">Antti N&#52292;yh&#52292; &lt;Antti.Nayha@somewhere.fi&gt;</a> </body></html>
also seen in Mozilla 1.4.2, so no recent regression. Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4.2) Gecko/20040426
Keywords: testcase
So it seems like the Korean autodetector suffers from the same problem that the new Hebrew detector had (bug 304951). Perhaps the same fix will apply?
OS: Windows 98 → All
Hardware: PC → All
This is similar to a group of bugs with misdetection of Latin-1 or UTF-8 as double-byte character sets. See especially bug 168526 comment 22 and following. In this specific case U+00E4 LATIN SMALL LETTER A WITH DIAERESIS is encoded in UTF-8 as 0xC3 0xA4, which is the EUC-KR encoding of U+CC44 HANGUL SYLLABLE CHIEUCH-AE.
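The byte-level collision described above can be verified with a short Python snippet (an illustrative sketch using only the standard library codecs):

```python
# U+00E4 LATIN SMALL LETTER A WITH DIAERESIS encodes to two bytes in UTF-8
utf8_bytes = "\u00e4".encode("utf-8")
print(utf8_bytes.hex())          # c3a4

# The same byte pair is a valid EUC-KR sequence: a Hangul syllable
hangul = utf8_bytes.decode("euc-kr")
print(hangul, hex(ord(hangul)))  # 채 0xcc44
```

With no other evidence in a short page, the detector has no way to tell from the bytes alone which interpretation was intended.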
Keywords: intl
Summary: finnish email address renders bugzilla bug in korean font → a short chunk of text in UTF-8 misdetected as EUC-KR by universal autodetector
Bug 310227 shows up in Camino 2005092804 (v1.0a1+) as Japanese (EUC) due to the autodetector. The cause/character is essentially the same (Latin small letter u with diaeresis in UTF-8); does it need its own bug?
Blocks: 264871
As you say, that is essentially the same issue as this one, and I don't think it needs its own bug.
This particular misdetection in the testcase would have been avoided if our universal detector could take into account the frequency of character triplets (within a 'word'), which should vary with languages. A Latin-Hangul-Latin(-Hangul) sequence within a single word is extremely rare (if it occurs at all) in actual Korean text. How hard would it be to apply a 'triplet check' to the 'top contenders' to adjust their scores (and to break ties or near-ties)?
A few notes: bug 304951 is totally unrelated to this bug; it was about a mistake in the charset map table. If anything, this bug is more related to bug 306224. I'm not very familiar with the EUC probers, but I think that the solution suggested in comment #7 is a promising one. That solution, i.e. making sure whole words are all in the same charset, could probably also fix bug 306224 to a certain degree. I still need to think further about implementation ideas, though.
(In reply to comment #8)
> suggested in comment #7 is a promising one. That solution, i.e. making sure
> whole words are all the same charset, could also probably fix 306224 to a
> certain degree.

I guess you meant that a single word has to be made up of characters belonging to a *single* script. However, that does not work very well if applied blindly.

First, it's rather hard to delimit words in Chinese and Japanese text, where spaces are not used between words. We have some APIs for that, but I guess we don't want to use them here; we just want naive word delimiting on whitespace characters. Second, multiple scripts can be present in a single word delimited that way, because punctuation marks and numbers can be used together with non-Latin scripts.

That's why I talked specifically about 'character triplets'. Perhaps we can build a statistical model for the frequency of character triplets, using a method similar to the one used to build the language models for octet pairs and single octets (if I understand that method correctly). This approach could be rather expensive because it needs to work at the character level rather than the octet level (that is, we have to call our encoding converters). So I guess we need to limit its application to cases where we have a relatively short chunk of text to deal with (how short is short is to be determined), because that's where the current approach fails most often, and/or to cases where the top-ranked encodings have scores close to each other (the threshold has to be determined as well). In other words, we'd better use this only as a tie/near-tie breaker.
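The tie-breaking idea above can be sketched roughly as follows. This is only an illustration, not the detector's code: the coarse script classifier, the penalty function, and their names are all invented for this example.

```python
def script_of(ch):
    """Very coarse script classifier (illustrative only)."""
    cp = ord(ch)
    if 0xAC00 <= cp <= 0xD7A3:          # Hangul syllables block
        return "hangul"
    if ch.isalpha() and cp < 0x0250:    # basic Latin ranges
        return "latin"
    return "other"                      # digits, punctuation, the rest

def mixed_triplet_penalty(text):
    """Count mixed-script triplets inside whitespace-delimited words.

    A high count suggests a decoded candidate is implausible (e.g. as
    Korean), so the detector could lower that candidate's score.
    """
    penalty = 0
    for word in text.split():
        scripts = [script_of(c) for c in word if script_of(c) != "other"]
        for a, b, c in zip(scripts, scripts[1:], scripts[2:]):
            if len({a, b, c}) > 1:      # scripts alternate within a word
                penalty += 1
    return penalty

# What EUC-KR decoding of the testcase's UTF-8 bytes produces:
garbled = "N\uCC44yh\uCC44"             # "N채yh채"
print(mixed_triplet_penalty(garbled))             # 3: implausible as Korean
print(mixed_triplet_penalty("\uC548\uB155 world"))  # 0: no mixed triplet
```

A detector could apply such a penalty only to the top few candidates when their scores are nearly tied, keeping the common case cheap.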
Another testcase for this bug is a patch where I just added the UTF-8 version of my name, "Håkan". See attachment 227240 [details] [diff] [review].
*** Bug 345262 has been marked as a duplicate of this bug. ***
Minimized testcase from the dupe (thank you, Simon):

Steps to reproduce:
1. Visit data:text/html,%3Ctitle%3EEUC-JP%20instead%20of%20UTF-8%3C%2Ftitle%3E%0A%C3%BC%20%C3%A4%20%C3%BC%20%C3%A4
2. Visit data:text/html,%3Ctitle%3EUTF-8%20correctly%3C%2Ftitle%3E%0A%C3%BC%20%C3%A4%20%C3%BC%20%C3%A4%20%C3%BC

Actual result: The first file is auto-detected as EUC-JP, the second as UTF-8.
Expected result: Both files are auto-detected as UTF-8.
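Decoding the percent-encoded payloads shows why the two pages behave differently: the first contains only four non-ASCII characters (ü ä ü ä), the second five. A quick check (standard library only):

```python
from urllib.parse import unquote

# The two data: URL payloads from the steps above
page1 = unquote("%3Ctitle%3EEUC-JP%20instead%20of%20UTF-8%3C%2Ftitle%3E"
                "%0A%C3%BC%20%C3%A4%20%C3%BC%20%C3%A4")
page2 = unquote("%3Ctitle%3EUTF-8%20correctly%3C%2Ftitle%3E"
                "%0A%C3%BC%20%C3%A4%20%C3%BC%20%C3%A4%20%C3%BC")

# Count the non-ASCII characters each page carries
print(sum(ord(c) > 127 for c in page1))  # 4
print(sum(ord(c) > 127 for c in page2))  # 5
```

With so little multi-byte data, the distribution-based probers have almost nothing to go on, which is where the misdetection occurs.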
Attached patch patchSplinter Review
I noticed that, unlike the other confidence-getters, CharDistributionAnalysis::GetConfidence() doesn't use any minimum threshold for the number of characters. That tends to inflate the confidence of the detectors that use it (Shift-JIS, EUC-JP, GB18030, EUC-KR and Big5).
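The shape of the fix can be sketched as follows. This is a Python illustration, not the actual C++ patch to CharDistributionAnalysis::GetConfidence(); the function name, the threshold value, and the constants here are invented for the example.

```python
MINIMUM_DATA_THRESHOLD = 4   # invented value; the real constant lives
                             # in the C++ detector
SURE_NO = 0.01               # "don't know" floor returned on scant data

def get_confidence(total_chars, freq_chars, typical_ratio=0.5):
    """Sketch of a distribution-based confidence getter with a floor.

    Until enough multi-byte characters have been seen, return a low
    'no opinion' confidence so that 1-4 characters can no longer make
    a double-byte prober win outright over UTF-8.
    """
    if total_chars <= 0 or freq_chars <= MINIMUM_DATA_THRESHOLD:
        return SURE_NO                       # not enough data to judge
    ratio = freq_chars / total_chars
    return min(ratio / typical_ratio, 0.99)  # cap below certainty
```

The design point is that the threshold only suppresses overconfident answers on tiny inputs; once enough characters accumulate, the getter behaves as before.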
Attachment #231746 - Flags: review?(jshin1987)
Comment on attachment 231746 [details] [diff] [review] patch r=jshin. Maybe the comment above the line being modified should be changed a little to reflect the change.
Attachment #231746 - Flags: review?(jshin1987) → review+
Attachment #231746 - Flags: superreview?(rbs)
Comment on attachment 231746 [details] [diff] [review] patch sr=rbs
Attachment #231746 - Flags: superreview?(rbs) → superreview+
Fix checked in.
Status: NEW → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Comment on attachment 231746 [details] [diff] [review] patch Drivers: This patch prevents pages containing 1 to 4 UTF-8 characters from often being mis-detected as an Asian multi-byte charset (when no charset is specified and the universal auto-detector is used) by specifying a simple minimum threshold. Risk: low.
Attachment #231746 - Flags: approval1.8.1?
Comment on attachment 231746 [details] [diff] [review] patch a=drivers for the 181 branch
Attachment #231746 - Flags: approval1.8.1? → approval1.8.1+
Whiteboard: [checkin needed (1.8 branch)]
Target Milestone: --- → mozilla1.8.1beta2
Keywords: fixed1.8.1
Whiteboard: [checkin needed (1.8 branch)]
Thanks for fixing this bug Simon!
Flags: in-testsuite+