Bug 760050 - Character encoding auto-detect fails on UTF-8 text file (regarded as TIS-620)
Product: Core
Classification: Components
Component: Internationalization
Version: Trunk
Platform: x86_64 Linux
Importance: -- normal with 3 votes
Assigned To: Nobody; OK to take it and work on it
: Makoto Kato [:m_kato]
Reported: 2012-05-31 03:55 PDT by Vincent Lefevre
Modified: 2014-09-23 13:40 PDT

Text file showing the bug (10 bytes, text/plain; charset=)
2012-05-31 03:55 PDT, Vincent Lefevre
Same file with BOM (13 bytes, text/plain)
2013-06-14 06:46 PDT, Gingerbread Man

Description Vincent Lefevre 2012-05-31 03:55:25 PDT
Created attachment 628670
Text file showing the bug

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:15.0) Gecko/15.0 Firefox/15.0a1
Build ID: 20120530144327

Steps to reproduce:

Open a UTF-8 text file (same as attached) with the "file:" URL scheme, containing:
Test: §9

Character encoding choice is set to Auto-Detect → Universal.

Actual results:

Firefox regards the file as encoded in TIS-620 (Thai).

Expected results:

Firefox should have regarded the file as encoded in UTF-8. Since UTF-8 is the most standard encoding, the Universal detector should always try it first.
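
For reference, the misreading can be reproduced outside the browser. The sketch below assumes the attachment holds the text above followed by a single newline (which matches its 10-byte size). It uses Python's standard codecs, not Firefox's detector, to show how the same bytes read under each encoding mentioned in this report.

```python
# Assumed contents of the attachment: "Test: §9" plus a trailing newline.
data = "Test: §9\n".encode("utf-8")

print(len(data))    # size in bytes
print(data.hex())   # the section sign is encoded as the two bytes c2 a7

# The same bytes read under the encodings mentioned in this report.
as_utf8 = data.decode("utf-8")
as_thai = data.decode("tis_620")    # 0xC2 and 0xA7 both map to Thai letters
as_latin1 = data.decode("latin-1")  # ISO-8859-1

print(repr(as_utf8))
print(repr(as_thai))
print(repr(as_latin1))
```

Both misreadings are valid in their own encodings, so a detector has to prefer UTF-8 explicitly rather than rely on the other decoders failing.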
Comment 1 Vincent Lefevre 2012-05-31 03:59:07 PDT
Note: the bug seems to occur only with the "file:" URL scheme, not with "http:". So do not open the attached text file directly; save it first and open the local copy.
Comment 2 Gingerbread Man 2013-06-14 06:46:11 PDT
Created attachment 762658
Same file with BOM

Your file is missing the byte order mark (BOM). All browsers I've tried open it as Western (Windows-1252). Notepad++ identifies it as "ANSI as UTF8".

"Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in."

Saving it with BOM makes all browsers treat it as UTF-8.
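
As a rough illustration of why the BOM changes the outcome: the UTF-8 BOM is the three bytes EF BB BF, so a consumer can identify the encoding from the first bytes alone. The helper name `sniff_utf8_bom` is hypothetical and the file contents are assumed (the text from the description plus a newline). This is a sketch using Python's standard library, not any browser's code.

```python
import codecs

def sniff_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)  # b"\xef\xbb\xbf"

# Assumed file contents, without and with a BOM.
plain = "Test: §9\n".encode("utf-8")
with_bom = codecs.BOM_UTF8 + plain

print(len(plain), len(with_bom))
print(sniff_utf8_bom(plain), sniff_utf8_bom(with_bom))

# The "utf-8-sig" codec strips a leading BOM if one is present.
print(repr(with_bom.decode("utf-8-sig")))
```

The three extra bytes account for the size difference between the two attachments (10 and 13 bytes).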
Comment 3 Vincent Lefevre 2013-06-14 07:49:33 PDT
The user shouldn't be forced to use the BOM, as various Unix tools (less, xterm, the whole concept of streams and redirections...) can't handle it nicely. Recall that the BOM is optional for UTF-8 files; see also the "The reasons the standard does not advocate the UTF-8 BOM" text on Wikipedia. So the drawbacks of requiring a BOM would be significant, while a good user-level choice would be sufficient in practice. In particular, when the browser runs under a UTF-8 based locale and the files are viewed locally (with the "file:" URL scheme), UTF-8 should be preferred to other encodings by default (and/or this should be configurable).
Comment 4 Zack Weinberg (:zwol) 2013-08-12 14:06:40 PDT
[...] discourages use of BOMs on UTF-8 (with the rationale that it may interfere with other expectations for the initial characters in the file, e.g. #!).

On today's internet, the encoding detector really ought to assume UTF-8 until proven otherwise.
Comment 5 Henri Sivonen (:hsivonen) 2014-09-23 01:31:43 PDT
This has been resolved by no longer offering the "Universal" detector.
Comment 6 Vincent Lefevre 2014-09-23 02:07:18 PDT
The original bug is no longer there. However, Firefox now interprets UTF-8 files as ISO-8859-1 when no charset information is available (e.g. with the "file:" URL scheme). This isn't satisfactory either.

Note: I'm not talking about HTTP, where there can be some default (it was ISO-8859-1 in the past).
Comment 7 Henri Sivonen (:hsivonen) 2014-09-23 02:50:02 PDT
See bug 815551 comment 5.

FWIW, I'm very much against autodetecting UTF-8 on http[s] URLs but I think we should do it for file: URLs, since in the file: case we don't have HTTP headers but do have all the bytes in advance.

Feel free to argue for UTF-8 autodetection on file: URLs in bug 815551.
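
A minimal sketch of what UTF-8 detection for local files could look like, assuming the whole file is available up front as noted above. The function name `decode_local_text` and the windows-1252 fallback are assumptions made for illustration; this is not Gecko's implementation.

```python
def decode_local_text(data: bytes, fallback: str = "windows-1252"):
    """Decode as UTF-8 if the whole input is valid UTF-8, else use the fallback.

    Returns a (text, encoding_name) pair.
    """
    try:
        # Strict decoding raises on any invalid byte sequence.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # errors="replace" because windows-1252 leaves a few bytes undefined.
        return data.decode(fallback, errors="replace"), fallback

print(decode_local_text(b"Test: \xc2\xa79\n"))  # valid UTF-8
print(decode_local_text(b"caf\xe9\n"))          # lone 0xE9 is not valid UTF-8
```

Non-ASCII text in a legacy 8-bit encoding tends not to form valid UTF-8 sequences by accident, which is why a strict check over the complete file is a reasonable signal here.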
Comment 8 Vincent Lefevre 2014-09-23 05:34:42 PDT
Bug 815551 is about the HTML parser. This bug is about text/plain files opened with the "file:" URL scheme, for which there is no way to specify the encoding (except possibly a BOM, which has major drawbacks in other contexts and so is not used in practice, as already said). This is very different from HTML files.
Comment 9 Vincent Lefevre 2014-09-23 13:40:32 PDT
Bug 815551 was actually for HTML files served via HTTP and wontfixed for this reason (see the first comments). So, I've submitted a new bug for text/plain files with unknown charset: Bug 1071816.
