Closed
Bug 306272
Opened 19 years ago
Closed 18 years ago
a short chunk of text in UTF-8 misdetected as EUC-KR by universal autodetector
Categories
(Core :: Internationalization, defect)
RESOLVED
FIXED
mozilla1.8.1beta2
People
(Reporter: hhschwab, Assigned: smontagu)
Details
(Keywords: fixed1.8.1, intl, testcase)
Attachments
(2 files)
232 bytes,
text/html
1.20 KB,
patch
jshin1987: review+
rbs: superreview+
beltzner: approval1.8.1+
Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.9a1) Gecko/20050826 SeaMonkey/1.1a
Also seen in Deer Park.
Steps to reproduce:
1. Check View -> Character Encoding -> Auto-Detect -> Universal
2. Load bug 1156 or the upcoming testcase
3. Notice that the font size is bigger than normal
4. Page Info says: Encoding: EUC-KR
5. See: View -> Character Encoding -> Korean (EUC-KR)
Reporter
Comment 1•19 years ago
<html><head>
<title>306272</title>
</head><body>
<a href="mailto:Antti.Nayha@somewhere.fi">Antti N채yh채
<Antti.Nayha@somewhere.fi></a>
</body></html>
Reporter
Comment 2•19 years ago
Also seen in Mozilla 1.4.2, so this is not a recent regression.
Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4.2) Gecko/20040426
Keywords: testcase
Comment 3•19 years ago
So it seems like the Korean autodetector suffers from the same problem that the
new Hebrew detector had (bug 304951). Perhaps the same fix will apply?
OS: Windows 98 → All
Hardware: PC → All
Assignee
Comment 4•19 years ago
This is similar to a group of bugs with misdetection of Latin-1 or UTF-8 as
double-byte character sets. See especially bug 168526 comment 22 and following.
In this specific case U+00E4 LATIN SMALL LETTER A WITH DIAERESIS is encoded in
UTF-8 as 0xC3 0xA4, which is the EUC-KR encoding of U+CC44 HANGUL SYLLABLE
CHIEUCH-AE.
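The ambiguity is easy to see with Python's codecs (purely an illustration, not Mozilla code): the same two bytes are a valid character under both encodings, so byte validity alone cannot disambiguate them.

```python
# UTF-8 for U+00E4 (ä) is byte-for-byte also a valid EUC-KR code point.
raw = b"\xc3\xa4"

as_utf8 = raw.decode("utf-8")    # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
as_euckr = raw.decode("euc_kr")  # U+CC44, the Hangul syllable in the report

# Both decodes succeed, so the detector must fall back on statistics,
# which are unreliable for such a short chunk of text.
print(f"U+{ord(as_utf8):04X}", f"U+{ord(as_euckr):04X}")
```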
Updated•19 years ago
Keywords: intl
Summary: finnish email address renders bugzilla bug in korean font → a short chunk of text in UTF-8 misdetected as EUC-KR by universal autodetector
Bug 310227 shows up in Camino 2005092804 (v1.0a1+) as Japanese (EUC) due to the autodetector. The cause/character is essentially the same (Latin small letter u with diaeresis in UTF-8); does it need its own bug?
Blocks: 264871
Assignee
Comment 6•19 years ago
As you say, that is essentially the same issue as this one, and I don't think it
needs its own bug.
Comment 7•19 years ago
This particular misdetection in the testcase would have been avoided if our universal detector took into account the frequency of character triplets (within a 'word'), which should vary with 'languages'. A Latin-Hangul-Latin(-Hangul) sequence in a single word is extremely rare (if it occurs at all) in actual Korean text. How hard would it be to apply a 'triplet check' to the 'top contenders' to adjust their 'scores' (and to break ties or near-ties)?
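The triplet idea can be sketched in a few lines of Python (purely illustrative; `script_of` and `has_implausible_triplet` are invented names, and a real detector would score triplet frequencies statistically rather than apply a hard rule):

```python
import unicodedata

def script_of(ch: str) -> str:
    """Very rough script classifier for this sketch: Latin letters,
    precomposed Hangul syllables, or 'other' (digits, punctuation...)."""
    if "LATIN" in unicodedata.name(ch, ""):
        return "latin"
    if "\uac00" <= ch <= "\ud7a3":
        return "hangul"
    return "other"

def has_implausible_triplet(word: str) -> bool:
    """True if the word contains a Latin-Hangul-Latin (or reverse)
    triplet -- a pattern virtually absent from real Korean text."""
    scripts = [script_of(c) for c in word]
    for a, b, c in zip(scripts, scripts[1:], scripts[2:]):
        if (a, b, c) in {("latin", "hangul", "latin"),
                         ("hangul", "latin", "hangul")}:
            return True
    return False

# The testcase's name, mis-decoded as EUC-KR, mixes scripts in one word:
assert has_implausible_triplet("N채yh채")    # Latin-Hangul-Latin
assert not has_implausible_triplet("한국어")  # pure Hangul is plausible
```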
Comment 8•19 years ago
A few notes:
Bug 304951 is totally unrelated to this bug; it was about a mistake in the charset map table. If anything, this bug is more closely related to bug 306224.
I'm not very familiar with the EUC probers, but I think the solution suggested in comment #7 is a promising one. That solution, i.e. making sure whole words are all in the same charset, could probably also fix bug 306224 to a certain degree.
I still need to think further about implementation ideas, though.
Comment 9•19 years ago
(In reply to comment #8)
> suggested in comment #7 is a promising one. That solution, i.e. making sure
> whole words are all the same charset, could also probably fix 306224 to a
> certain degree.
I guess you meant that a single word has to be made up of characters belonging to a *single* script. However, that does not work very well if applied blindly.
First of all, it's rather hard to delimit words in Chinese and Japanese text, where spaces are not used to separate words. We have some APIs for that, but I guess we don't want to use them here; we just want naive word delimitation on whitespace characters.
Secondly, multiple scripts can be present in a single word delimited that way, because punctuation marks and numbers can be used together with non-Latin scripts.
That's why I talked specifically about 'character triplets'. Perhaps we can
build a statistical model for the frequency of character triplets using a method
similar to that used to build language models for octet pairs and octets (if I
understand the method involved correctly).
This approach could be rather expensive because it needs to work at the character level rather than the 'octet level' (that is, we have to call our encoding converters). So I guess we need to limit its application to cases where we have a relatively short chunk of text (how short is 'short' is to be determined), because that's where the current approach fails most often, and/or when the encodings high in the list (as ranked by our current method) have scores close to each other (the threshold has to be determined as well). In other words, we'd better use this only as a 'tie/near-tie' breaker.
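The proposed near-tie breaker could take roughly this shape (a hypothetical Python sketch: the names, thresholds, and scorer interface are all invented here, and the real detector is C++ code in Mozilla's universal charset detector):

```python
# Only run the expensive character-level check when the input is short and
# the top detector confidences are nearly tied, as proposed above.
NEAR_TIE_MARGIN = 0.05   # how close two confidences count as a near-tie
SHORT_TEXT_LIMIT = 64    # only bother for short inputs, where errors cluster

def pick_encoding(candidates, raw_bytes, char_level_score):
    """candidates: (encoding, confidence) pairs, best first.
    char_level_score: an expensive character-level scorer, consulted
    only to break ties/near-ties on short inputs."""
    best, runner_up = candidates[0], candidates[1]
    if (len(raw_bytes) <= SHORT_TEXT_LIMIT
            and best[1] - runner_up[1] < NEAR_TIE_MARGIN):
        # Re-rank just the top contenders with the costly check.
        top_two = candidates[:2]
        return max(top_two,
                   key=lambda c: char_level_score(raw_bytes, c[0]))[0]
    return best[0]
```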
Comment 10•18 years ago
Another testcase for this bug is a patch where I just added the UTF-8 version of my name "Håkan". See attachment 227240.
Assignee
Comment 11•18 years ago
*** Bug 345262 has been marked as a duplicate of this bug. ***
Assignee
Comment 12•18 years ago
Minimized testcase from the dupe (thank you, Simon):
Steps to reproduce:
1. Visit
data:text/html,%3Ctitle%3EEUC-JP%20instead%20of%20UTF-8%3C%2Ftitle%3E%0A%C3%BC%20%C3%A4%20%C3%BC%20%C3%A4
2. Visit
data:text/html,%3Ctitle%3EUTF-8%20correctly%3C%2Ftitle%3E%0A%C3%BC%20%C3%A4%20%C3%BC%20%C3%A4%20%C3%BC
Actual result:
The first file is auto-detected as EUC-JP the second as UTF-8.
Expected result:
Both files are auto-detected as UTF-8.
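Decoding the two data: URLs (a quick check, not part of the bug report itself) shows that the page bodies differ only in length: four non-ASCII UTF-8 characters versus five, which is exactly the kind of gap a minimum-data threshold in the detector would separate.

```python
from urllib.parse import unquote

# The two data: URL payloads from the minimized testcase, percent-decoded
# (unquote treats the %C3%BC / %C3%A4 escapes as UTF-8 by default).
first = unquote("%3Ctitle%3EEUC-JP%20instead%20of%20UTF-8%3C%2Ftitle%3E"
                "%0A%C3%BC%20%C3%A4%20%C3%BC%20%C3%A4")
second = unquote("%3Ctitle%3EUTF-8%20correctly%3C%2Ftitle%3E"
                 "%0A%C3%BC%20%C3%A4%20%C3%BC%20%C3%A4%20%C3%BC")

# Count the non-ASCII characters in each page.
print(sum(ord(c) > 127 for c in first))   # 4 -> misdetected as EUC-JP
print(sum(ord(c) > 127 for c in second))  # 5 -> detected as UTF-8
```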
Assignee
Comment 13•18 years ago
I noticed that, unlike the other confidence getters, CharDistributionAnalysis::GetConfidence() doesn't use any minimum threshold for the number of characters. That tends to inflate the confidence of the detectors that use it (Shift-JIS, EUC-JP, GB18030, EUC-KR and Big5).
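The idea behind the fix can be paraphrased in Python (illustration only: the real code is the C++ CharDistributionAnalysis::GetConfidence in Mozilla's universal charset detector, and the constant values below are invented, not the patch's):

```python
SURE_YES = 0.99
SURE_NO = 0.01
MINIMUM_DATA_THRESHOLD = 4  # hypothetical minimum count of frequent chars

def get_confidence(total_chars: int, freq_chars: int,
                   typical_ratio: float) -> float:
    # With too few characters seen, refuse to express any confidence, so
    # a page holding just a handful of multi-byte sequences cannot push
    # an EUC-* prober to the top of the list.
    if total_chars <= 0 or freq_chars <= MINIMUM_DATA_THRESHOLD:
        return SURE_NO
    # Otherwise, score by how closely the ratio of frequent characters
    # matches the distribution typical for the candidate encoding.
    if total_chars != freq_chars:
        r = freq_chars / ((total_chars - freq_chars) * typical_ratio)
        if r < SURE_YES:
            return r
    return SURE_YES
```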
Attachment #231746 - Flags: review?(jshin1987)
Comment 14•18 years ago
Comment on attachment 231746 (patch)
r=jshin
Maybe the comment above the modified line should be updated slightly to reflect the change.
Attachment #231746 - Flags: review?(jshin1987) → review+
Assignee
Updated•18 years ago
Attachment #231746 - Flags: superreview?(rbs)
Comment 15•18 years ago
Comment on attachment 231746 (patch)
sr=rbs
Attachment #231746 - Flags: superreview?(rbs) → superreview+
Assignee
Comment 16•18 years ago
Fix checked in.
Status: NEW → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Comment 17•18 years ago
Comment on attachment 231746 (patch)
Drivers: by adding a simple minimum threshold, this patch prevents pages containing only 1 to 4 UTF-8 characters from frequently being mis-detected as an Asian multi-byte charset (when no charset is specified and the universal auto-detector is used). Risk: low.
Attachment #231746 - Flags: approval1.8.1?
Comment 18•18 years ago
Comment on attachment 231746 (patch)
a=drivers for the 1.8.1 branch
Attachment #231746 - Flags: approval1.8.1? → approval1.8.1+
Updated•18 years ago
Whiteboard: [checkin needed (1.8 branch)]
Target Milestone: --- → mozilla1.8.1beta2
Assignee
Updated•18 years ago
Keywords: fixed1.8.1
Whiteboard: [checkin needed (1.8 branch)]
Comment 19•18 years ago
Thanks for fixing this bug, Simon!
Assignee
Updated•17 years ago
Flags: in-testsuite+