Closed Bug 1105839 Opened 11 years ago Closed 11 years ago

Charset detector wrongly determines the text as UTF-8

Categories

(Core :: Internationalization, defect)

x86_64
FreeBSD
defect
Not set
normal

Tracking

()

RESOLVED WONTFIX

People

(Reporter: yuri, Unassigned, NeedInfo)

Details

Attachments

(1 file)

User Agent: Mozilla/5.0 (X11; FreeBSD amd64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537.36

Steps to reproduce:

(used the port of the Mozilla code, https://github.com/BYVoid/uchardet)

The file has these bytes:
> 00000000 78 78 78 e2 80 99 78 78 78 0a 63 68 61 72 20 27 |xxx...xxx.char '|
> 00000010 e2 27 20 28 69 6e 0a 4d 69 6c 6f c5 a1 5f 46 6f |.' (in.Milo.._Fo|
> 00000020 72 6d 61 6e 0a |rman.|

Please note that it has 3 non-ASCII areas:
> 1. e2 80 99: U+2019 RIGHT SINGLE QUOTATION MARK
> 2. e2: could start a 3-byte UTF-8 sequence, but the following bytes 27 20 are not valid continuation bytes
> 3. c5 a1: U+0161 LATIN SMALL LETTER S WITH CARON

However, uchardet determines that it is UTF-8:
> $ uchardet < xxx
> UTF-8

FreeBSD file(1) identifies this file as:
> $ file xxx
> xxx: C source, Non-ISO extended-ASCII text

I am not sure how this text should be classified, but it isn't valid UTF-8 for sure.
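The reporter's claim that the file is not well-formed UTF-8 can be checked with a strict decoder. The following is a minimal sketch, not part of the original report: the byte string is copied from the hexdump above, and the helper name `first_invalid_utf8` is hypothetical.

```python
# Bytes copied from the hexdump in the report (37 bytes in total).
DATA = bytes.fromhex(
    "78 78 78 e2 80 99 78 78 78 0a 63 68 61 72 20 27"
    "e2 27 20 28 69 6e 0a 4d 69 6c 6f c5 a1 5f 46 6f"
    "72 6d 61 6e 0a"
)


def first_invalid_utf8(data: bytes):
    """Return (offset, byte) of the first byte that breaks strict UTF-8
    decoding, or None if the whole input is well-formed UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as exc:
        return exc.start, data[exc.start]
    return None


if __name__ == "__main__":
    result = first_invalid_utf8(DATA)
    if result is None:
        print("well-formed UTF-8")
    else:
        offset, byte = result
        print(f"invalid UTF-8 at offset {offset:#04x}: byte {byte:#04x}")
```

The lone e2 at offset 0x10 is the only place where strict decoding fails; the e2 80 99 and c5 a1 sequences decode normally.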
Yuri, can you actually reproduce this issue using a recent copy of Firefox? The code in the GitHub repo looks to be 3 years old, and I doubt it reflects the code we currently ship in Firefox. Issues in uchardet should be reported to its owner, not here. Of course, if this still happens in Firefox, we should look into it...
Component: Untriaged → Internationalization
Flags: needinfo?(yuri)
Product: Firefox → Core
Attached file Reporter's testcase
We no longer have a universal detector in Firefox.

With the Japanese detector, the testcase is detected as UTF-8:
> xxx’xxx
> char '�' (in
> Miloš_Forman

With the Russian and Ukrainian detectors, it's apparently detected as iso-8859-5:
> xxxтАЩxxx
> char 'т' (in
> Milo┼б_Forman
I don't think there's anything we can do better here. The detector has to return *something*, and in a case like this "UTF-8 with an error" seems like not a bad guess.
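To illustrate what "UTF-8 with an error" means for this testcase, here is a small sketch that decodes the attached bytes as UTF-8 and substitutes U+FFFD for anything malformed. It uses Python's built-in codec, not the Firefox detector or decoder, so it only approximates what the browser does; the function name `decode_lossy` is hypothetical.

```python
# Bytes copied from the hexdump in the original report.
DATA = bytes.fromhex(
    "78 78 78 e2 80 99 78 78 78 0a 63 68 61 72 20 27"
    "e2 27 20 28 69 6e 0a 4d 69 6c 6f c5 a1 5f 46 6f"
    "72 6d 61 6e 0a"
)


def decode_lossy(data: bytes):
    """Decode as UTF-8, replacing malformed input with U+FFFD.

    Returns the decoded text and the number of replacement characters,
    which gives a rough measure of how badly the UTF-8 guess fits.
    """
    text = data.decode("utf-8", errors="replace")
    return text, text.count("\ufffd")


if __name__ == "__main__":
    text, errors = decode_lossy(DATA)
    print(text, end="")
    print(f"replacement characters: {errors}")
```

Under this reading only one byte out of the whole file is lost, while both multi-byte sequences (the quotation mark and the š) come out intact, which is consistent with UTF-8 being a reasonable guess here.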
Status: UNCONFIRMED → RESOLVED
Closed: 11 years ago
Resolution: --- → WONTFIX