Closed Bug 306272 Opened 19 years ago Closed 18 years ago

a short chunk of text in UTF-8 misdetected as EUC-KR by universal autodetector

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla1.8.1beta2

People

(Reporter: hhschwab, Assigned: smontagu)

References

()

Details

(Keywords: fixed1.8.1, intl, testcase)

Attachments

(2 files)

Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.9a1) Gecko/20050826 SeaMonkey/1.1a; also seen in Deer Park.

Steps to reproduce:
1. Check: View -> Character Encoding -> Auto-Detect -> Universal
2. Load Bug 1156 or the upcoming testcase
3. Notice that the font size is bigger than normal
4. Page Info says: Encoding: EUC-KR
5. See: View -> Character Encoding -> Korean (EUC-KR)
Attached file testcase
<html><head> <title>306272</title> </head><body> <a href="mailto:Antti.Nayha@somewhere.fi">Antti N&#52292;yh&#52292; &lt;Antti.Nayha@somewhere.fi&gt;</a> </body></html>
also seen in Mozilla 1.4.2, so no recent regression. Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4.2) Gecko/20040426
Keywords: testcase
So it seems like the Korean autodetector suffers from the same problem that the new Hebrew detector had (bug 304951). Perhaps the same fix will apply?
OS: Windows 98 → All
Hardware: PC → All
This is similar to a group of bugs with misdetection of Latin-1 or UTF-8 as double-byte character sets. See especially bug 168526 comment 22 and following. In this specific case U+00E4 LATIN SMALL LETTER A WITH DIAERESIS is encoded in UTF-8 as 0xC3 0xA4, which is the EUC-KR encoding of U+CC44 HANGUL SYLLABLE CHIEUCH-AE.
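The byte-level collision described above can be verified with a short Python snippet (an illustrative sketch using only the standard library codecs):

```python
# U+00E4 LATIN SMALL LETTER A WITH DIAERESIS encodes to two bytes in UTF-8
utf8_bytes = "\u00e4".encode("utf-8")
print(utf8_bytes.hex())          # c3a4

# The same byte pair is a valid EUC-KR sequence: a Hangul syllable
hangul = utf8_bytes.decode("euc-kr")
print(hangul, hex(ord(hangul)))  # 채 0xcc44
```

With no other evidence in a short page, the detector has no way to tell from the bytes alone which interpretation was intended.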
Keywords: intl
Summary: finnish email address renders bugzilla bug in korean font → a short chunk of text in UTF-8 misdetected as EUC-KR by universal autodetector
Bug 310227 shows up in Camino 2005092804 (v1.0a1+) as Japanese (EUC) due to the autodetector. The cause/character is essentially the same (Latin small letter u with diaeresis in UTF-8); does it need its own bug?
Blocks: 264871
As you say, that is essentially the same issue as this one, and I don't think it needs its own bug.
This particular misdetection in the testcase would have been avoided if our universal detector could take into account the frequency of character triplets (within a 'word'), which should vary with languages. A Latin-Hangul-Latin(-Hangul) sequence within a single word is extremely rare (if it occurs at all) in actual Korean text. How hard would it be to apply a 'triplet check' to the 'top contenders' to adjust their scores (and to break ties or near-ties)?
A few notes: bug 304951 is totally unrelated to this bug; it was about a mistake in the charset map table. If anything, this bug is more related to bug 306224. I'm not very familiar with the EUC probers, but I think that the solution suggested in comment #7 is a promising one. That solution, i.e. making sure whole words are all in the same charset, could probably also fix bug 306224 to a certain degree. I still need to think further about implementation ideas, though.
(In reply to comment #8)
> suggested in comment #7 is a promising one. That solution, i.e. making sure
> whole words are all the same charset, could also probably fix 306224 to a
> certain degree.

I guess you meant that a single word has to be made up of characters belonging to a *single* script. However, that does not work very well if applied blindly.

First, it's rather hard to delimit words in Chinese and Japanese text, where spaces are not used between words. We have some APIs for that, but I guess we don't want to use them here; we just want naive word delimiting on whitespace characters. Second, multiple scripts can be present in a single word delimited that way, because punctuation marks and numbers can be used together with non-Latin scripts.

That's why I talked specifically about 'character triplets'. Perhaps we can build a statistical model for the frequency of character triplets, using a method similar to the one used to build the language models for octet pairs and single octets (if I understand that method correctly). This approach could be rather expensive because it needs to work at the character level rather than the octet level (that is, we have to call our encoding converters). So I guess we need to limit its application to cases where we have a relatively short chunk of text to deal with (how short is short is to be determined), because that's where the current approach fails most often, and/or to cases where the top-ranked encodings have scores close to each other (the threshold has to be determined as well). In other words, we'd better use this only as a tie/near-tie breaker.
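The tie-breaking idea above can be sketched roughly as follows. This is only an illustration, not the detector's code: the coarse script classifier, the penalty function, and their names are all invented for this example.

```python
def script_of(ch):
    """Very coarse script classifier (illustrative only)."""
    cp = ord(ch)
    if 0xAC00 <= cp <= 0xD7A3:          # Hangul syllables block
        return "hangul"
    if ch.isalpha() and cp < 0x0250:    # basic Latin ranges
        return "latin"
    return "other"                      # digits, punctuation, the rest

def mixed_triplet_penalty(text):
    """Count mixed-script triplets inside whitespace-delimited words.

    A high count suggests a decoded candidate is implausible (e.g. as
    Korean), so the detector could lower that candidate's score.
    """
    penalty = 0
    for word in text.split():
        scripts = [script_of(c) for c in word if script_of(c) != "other"]
        for a, b, c in zip(scripts, scripts[1:], scripts[2:]):
            if len({a, b, c}) > 1:      # scripts alternate within a word
                penalty += 1
    return penalty

# What EUC-KR decoding of the testcase's UTF-8 bytes produces:
garbled = "N\uCC44yh\uCC44"             # "N채yh채"
print(mixed_triplet_penalty(garbled))             # 3: implausible as Korean
print(mixed_triplet_penalty("\uC548\uB155 world"))  # 0: no mixed triplet
```

A detector could apply such a penalty only to the top few candidates when their scores are nearly tied, keeping the common case cheap.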
Another testcase for this bug is a patch where I just added the UTF-8 version of my name, "Håkan". See attachment 227240 [details] [diff] [review].
*** Bug 345262 has been marked as a duplicate of this bug. ***
Minimized testcase from the dupe (thank you, Simon):

Steps to reproduce:
1. Visit data:text/html,%3Ctitle%3EEUC-JP%20instead%20of%20UTF-8%3C%2Ftitle%3E%0A%C3%BC%20%C3%A4%20%C3%BC%20%C3%A4
2. Visit data:text/html,%3Ctitle%3EUTF-8%20correctly%3C%2Ftitle%3E%0A%C3%BC%20%C3%A4%20%C3%BC%20%C3%A4%20%C3%BC

Actual result: The first file is auto-detected as EUC-JP, the second as UTF-8.
Expected result: Both files are auto-detected as UTF-8.
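Decoding the percent-encoded payloads shows why the two pages behave differently: the first contains only four non-ASCII characters (ü ä ü ä), the second five. A quick check (standard library only):

```python
from urllib.parse import unquote

# The two data: URL payloads from the steps above
page1 = unquote("%3Ctitle%3EEUC-JP%20instead%20of%20UTF-8%3C%2Ftitle%3E"
                "%0A%C3%BC%20%C3%A4%20%C3%BC%20%C3%A4")
page2 = unquote("%3Ctitle%3EUTF-8%20correctly%3C%2Ftitle%3E"
                "%0A%C3%BC%20%C3%A4%20%C3%BC%20%C3%A4%20%C3%BC")

# Count the non-ASCII characters each page carries
print(sum(ord(c) > 127 for c in page1))  # 4
print(sum(ord(c) > 127 for c in page2))  # 5
```

With so little multi-byte data, the distribution-based probers have almost nothing to go on, which is where the misdetection occurs.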
Attached patch patchSplinter Review
I noticed that, unlike the other confidence-getters, CharDistributionAnalysis::GetConfidence() doesn't use any minimum threshold for the number of characters. That tends to inflate the confidence of the detectors that use it (Shift-JIS, EUC-JP, GB18030, EUC-KR and Big5).
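The shape of the fix can be sketched as follows. This is a Python illustration, not the actual C++ patch to CharDistributionAnalysis::GetConfidence(); the function name, the threshold value, and the constants here are invented for the example.

```python
MINIMUM_DATA_THRESHOLD = 4   # invented value; the real constant lives
                             # in the C++ detector
SURE_NO = 0.01               # "don't know" floor returned on scant data

def get_confidence(total_chars, freq_chars, typical_ratio=0.5):
    """Sketch of a distribution-based confidence getter with a floor.

    Until enough multi-byte characters have been seen, return a low
    'no opinion' confidence so that 1-4 characters can no longer make
    a double-byte prober win outright over UTF-8.
    """
    if total_chars <= 0 or freq_chars <= MINIMUM_DATA_THRESHOLD:
        return SURE_NO                       # not enough data to judge
    ratio = freq_chars / total_chars
    return min(ratio / typical_ratio, 0.99)  # cap below certainty
```

The design point is that the threshold only suppresses overconfident answers on tiny inputs; once enough characters accumulate, the getter behaves as before.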
Attachment #231746 - Flags: review?(jshin1987)
Comment on attachment 231746 [details] [diff] [review] patch r=jshin. Maybe the comment above the line being modified should be changed a little to reflect the change.
Attachment #231746 - Flags: review?(jshin1987) → review+
Attachment #231746 - Flags: superreview?(rbs)
Comment on attachment 231746 [details] [diff] [review] patch sr=rbs
Attachment #231746 - Flags: superreview?(rbs) → superreview+
Fix checked in.
Status: NEW → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Comment on attachment 231746 [details] [diff] [review] patch Drivers: This patch prevents pages containing 1 to 4 UTF-8 characters from often being mis-detected as an Asian multi-byte charset (when no charset is specified and the universal auto-detector is used) by specifying a simple minimum threshold. Risk: low.
Attachment #231746 - Flags: approval1.8.1?
Comment on attachment 231746 [details] [diff] [review] patch a=drivers for the 181 branch
Attachment #231746 - Flags: approval1.8.1? → approval1.8.1+
Whiteboard: [checkin needed (1.8 branch)]
Target Milestone: --- → mozilla1.8.1beta2
Keywords: fixed1.8.1
Whiteboard: [checkin needed (1.8 branch)]
Thanks for fixing this bug Simon!
Flags: in-testsuite+