Character encoding auto-detect fails on UTF-8 text file (regarded as TIS-620)




5 years ago
3 years ago


(Reporter: Vincent Lefevre, Unassigned)



Firefox Tracking Flags

(Not tracked)



(2 attachments)



5 years ago
Created attachment 628670 [details]
Text file showing the bug

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:15.0) Gecko/15.0 Firefox/15.0a1
Build ID: 20120530144327

Steps to reproduce:

Open a UTF-8 text file (same as attached) with the "file:" URL scheme, containing:
Test: §9

Character encoding choice is set to Auto-Detect → Universal.

Actual results:

Firefox regards the file as encoded in TIS-620 (Thai).

Expected results:

Firefox should have regarded the file as encoded in UTF-8. Since this is the most standard encoding, it should always be tried first with Universal.
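
For illustration, here is a minimal Python sketch of why a detector can mistake this file for Thai. It assumes the file holds exactly "Test: §9" plus a trailing newline (the attachment's exact bytes are not reproduced in this report), and the `decodings` helper is a hypothetical name used only for this example.

```python
# Assumed file content: "Test: §9" plus a trailing newline, saved as UTF-8.
data = "Test: §9\n".encode("utf-8")

def decodings(raw: bytes) -> dict:
    """Decode the same bytes under the three encodings mentioned in this bug."""
    names = ("utf-8", "tis-620", "windows-1252")
    return {name: raw.decode(name) for name in names}

result = decodings(data)
for name, text in result.items():
    # The section sign is stored as the two bytes C2 A7 in UTF-8;
    # both bytes are also valid characters in TIS-620 and windows-1252.
    print(f"{name:>12}: {text!r}")
```

The byte sequence is valid in all three encodings, so a detector that does not try UTF-8 first has to guess from very little evidence.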

Comment 1

5 years ago
Note: the bug seems to occur only with the "file:" URL scheme, not with "http:" (so do not open the attached text file directly; save it first).
Attachment #628670 - Attachment mime type: text/plain → text/plain; charset=

Comment 2

4 years ago
Created attachment 762658 [details]
Same file with BOM

Your file is missing the byte order mark (BOM). All browsers I've tried open it as Western (Windows-1252). Notepad++ identifies it as "ANSI as UTF8".

"Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in."

Saving it with BOM makes all browsers treat it as UTF-8.
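
As a rough sketch of what the BOM changes at the byte level, the Python snippet below encodes the same text with and without it. The sample text and the `has_utf8_bom` helper are assumptions made for this example; they are not taken from either attachment.

```python
import codecs

text = "Test: §9\n"                      # assumed sample content
without_bom = text.encode("utf-8")
with_bom = text.encode("utf-8-sig")      # same bytes, prefixed with EF BB BF

def has_utf8_bom(raw: bytes) -> bool:
    """Return True if the data starts with the UTF-8 byte order mark."""
    return raw.startswith(codecs.BOM_UTF8)

print(has_utf8_bom(without_bom), has_utf8_bom(with_bom))
# The "utf-8-sig" decoder strips a leading BOM if present and accepts
# data without one, so both variants decode to the same text.
print(with_bom.decode("utf-8-sig") == without_bom.decode("utf-8-sig"))
```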

Comment 3

4 years ago
The user shouldn't be forced to use the BOM, as various Unix tools (less, xterm, the whole concept of streams and redirections...) can't handle it nicely. Let's recall that the BOM is optional for UTF-8 files; see also the "The reasons the standard does not advocate the UTF-8 BOM" text on Wikipedia. So the drawbacks would be significant, while a good user-level choice would be sufficient in practice. In particular, when the browser is run under a UTF-8 based locale and the files are viewed locally (with the "file:" URL scheme), UTF-8 should be preferred to other encodings by default (and/or this should be configurable).

discourages use of BOMs on UTF-8 (with the rationale that it may interfere with other expectations for the initial characters in the file, e.g. #!).

On today's internet, the encoding detector really ought to assume UTF-8 until proven otherwise.
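
One way to read "assume UTF-8 until proven otherwise" is to attempt a strict UTF-8 decode and fall back only when it fails. The sketch below is an illustration under that assumption, not Firefox's detector; `decode_utf8_first` and the windows-1252 fallback are choices made for this example.

```python
def decode_utf8_first(raw: bytes, fallback: str = "windows-1252") -> tuple:
    """Try strict UTF-8; use the fallback encoding only if that fails."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Invalid UTF-8 is the "proof otherwise"; fall back to a legacy
        # encoding (hypothetical default, chosen for this sketch).
        return raw.decode(fallback, errors="replace"), fallback

print(decode_utf8_first("Test: §9\n".encode("utf-8")))
print(decode_utf8_first(b"Test: \xa79\n"))  # lone A7 byte is not valid UTF-8
```

Non-ASCII text in a legacy single-byte encoding rarely forms valid UTF-8 sequences by accident, which is why a strict decode is a reasonable first test.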
This has been resolved by no longer offering the "Universal" detector.
Last Resolved: 3 years ago
Resolution: --- → WORKSFORME

Comment 6

3 years ago
The original bug is no longer there. However, Firefox now interprets UTF-8 files as ISO-8859-1 when no charset information is available (e.g. with the "file:" URL scheme). This isn't satisfactory either.

Note: I'm not talking about HTTP, where there can be some default (it was ISO-8859-1 in the past).
See bug 815551 comment 5.

FWIW, I'm very much against autodetecting UTF-8 on http[s] URLs but I think we should do it for file: URLs, since in the file: case we don't have HTTP headers but do have all the bytes in advance.

Feel free to argue for UTF-8 autodetection on file: URLs in bug 815551.
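
The policy suggested above (sniff for UTF-8 on file: URLs only, where all the bytes are available up front) could be sketched in Python as follows. This is only an illustration of the idea, not browser code; `choose_encoding` and its windows-1252 default are assumptions made for this example.

```python
from urllib.parse import urlsplit

def choose_encoding(url: str, raw: bytes, default: str = "windows-1252") -> str:
    """Pick an encoding for undeclared text: sniff UTF-8 only for file: URLs."""
    if urlsplit(url).scheme != "file":
        # http[s]: no content sniffing in this sketch; keep the default.
        return default
    try:
        raw.decode("utf-8")   # the whole file is available, so check all of it
        return "utf-8"
    except UnicodeDecodeError:
        return default

sample = "Test: §9\n".encode("utf-8")
print(choose_encoding("file:///tmp/test.txt", sample))
print(choose_encoding("http://example.org/test.txt", sample))
```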

Comment 8

3 years ago
Bug 815551 is about the HTML parser. This bug is about text/plain files opened with the "file:" URL scheme, for which there is no way to specify the encoding (except possibly a BOM, which has major drawbacks in other contexts and is therefore not used in practice, as already said). This is very different from HTML files.

Comment 9

3 years ago
Bug 815551 was actually for HTML files served via HTTP and wontfixed for this reason (see the first comments). So, I've submitted a new bug for text/plain files with unknown charset: Bug 1071816.