Bug 1105839
Opened 11 years ago
Closed 11 years ago
Charset detector wrongly determines the text as UTF-8
Categories
(Core :: Internationalization, defect)
RESOLVED
WONTFIX
People
(Reporter: yuri, Unassigned, NeedInfo)
Attachments
(1 file: 36 bytes, text/plain; charset=foo)
User Agent: Mozilla/5.0 (X11; FreeBSD amd64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537.36
Steps to reproduce:
(Using uchardet, a port of the Mozilla code: https://github.com/BYVoid/uchardet)
The file has these bytes:
> 00000000 78 78 78 e2 80 99 78 78 78 0a 63 68 61 72 20 27 |xxx...xxx.char '|
> 00000010 e2 27 20 28 69 6e 0a 4d 69 6c 6f c5 a1 5f 46 6f |.' (in.Milo.._Fo|
> 00000020 72 6d 61 6e 0a |rman.|
Note that it has three non-ASCII areas:
> 1. e2 80 99: U+2019 RIGHT SINGLE QUOTATION MARK
> 2. e2: could start a 3-byte UTF-8 sequence, but the following bytes 27 20 are not valid continuation bytes, so no UTF-8 character is formed
> 3. c5 a1: U+0161 LATIN SMALL LETTER S WITH CARON
However, uchardet reports that it is UTF-8:
> $ uchardet < xxx
> UTF-8
FreeBSD file(1) identifies this file as:
> $ file xxx
> xxx: C source, Non-ISO extended-ASCII text
I am not sure how this text should be classified, but it is certainly not valid UTF-8.
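The byte-level argument above can be checked with a strict UTF-8 decode. The following is a minimal sketch, not part of uchardet or Firefox: the bytes are copied from the hex dump in this report, and the helper name `first_utf8_error` is made up for illustration.

```python
# Bytes copied from the hex dump in this report.
DATA = bytes.fromhex(
    "78 78 78 e2 80 99 78 78 78 0a 63 68 61 72 20 27 "
    "e2 27 20 28 69 6e 0a 4d 69 6c 6f c5 a1 5f 46 6f "
    "72 6d 61 6e 0a"
)


def first_utf8_error(data: bytes):
    """Return the offset of the first byte that breaks strict UTF-8
    decoding, or None if the data is valid UTF-8.
    (Hypothetical helper, named for this example only.)"""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as err:
        return err.start
    return None


offset = first_utf8_error(DATA)
if offset is None:
    print("valid UTF-8")
else:
    print(f"invalid UTF-8: byte {DATA[offset]:#04x} at offset {offset:#x}")

# Lenient decoding substitutes U+FFFD for the stray byte and keeps going,
# which is roughly what "UTF-8 with an error" looks like when rendered.
print(DATA.decode("utf-8", errors="replace"))
```

The strict decode fails at the lone e2 that opens the second row of the dump, while the other two non-ASCII sequences decode cleanly.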
Comment 1 • 11 years ago
Yuri, can you actually reproduce this issue using a recent copy of Firefox? It looks like the code in the GitHub repo is 3 years old, and I doubt it reflects the code we currently ship in Firefox. Issues in uchardet should be reported to its owner, not here. Of course, if this still happens in Firefox, we should look into it...
Component: Untriaged → Internationalization
Flags: needinfo?(yuri)
Product: Firefox → Core
Comment 2 • 11 years ago
We no longer have a universal detector in Firefox. With the Japanese detector, the testcase is detected as UTF-8:
xxx’xxx
char '�' (in Miloš_Forman
With the Russian and Ukrainian detectors, it's apparently detected as ISO-8859-5:
xxxтАЩxxx
char 'т' (in Milo┼б_Forman
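To compare renderings like the two quoted above outside a browser, the same bytes can be decoded under several candidate charsets. This is only a sketch using CPython's built-in codecs (`utf-8`, `cp866`, `iso8859_5`); the choice of candidates is an assumption made for illustration. In CPython's tables, the Cyrillic rendering quoted above is what `cp866` (IBM866) produces for these bytes, whereas `iso8859_5` maps 80 and 99 to C1 control characters. Which label the detector actually reported is not something this sketch can establish.

```python
# Bytes copied from the hex dump in the report.
DATA = bytes.fromhex(
    "78 78 78 e2 80 99 78 78 78 0a 63 68 61 72 20 27 "
    "e2 27 20 28 69 6e 0a 4d 69 6c 6f c5 a1 5f 46 6f "
    "72 6d 61 6e 0a"
)

# Candidate charsets, chosen for illustration only.
CANDIDATES = ("utf-8", "cp866", "iso8859_5")

# Decode leniently so that an invalid byte shows up as U+FFFD
# instead of raising an exception.
RENDERINGS = {
    codec: DATA.decode(codec, errors="replace") for codec in CANDIDATES
}

for codec, text in RENDERINGS.items():
    # ascii() escapes control and non-ASCII characters so they stay visible.
    print(f"{codec:>10}: {ascii(text)}")
```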
Comment 3 • 11 years ago
I don't think there's anything we can do better here. The detector has to return *something*, and in a case like this "UTF-8 with an error" seems like not a bad guess.
Status: UNCONFIRMED → RESOLVED
Closed: 11 years ago
Resolution: --- → WONTFIX