Support loading BOMless UTF-8 text/plain files from file: URLs

Status: NEW
Assignee: Unassigned
Component: Core :: HTML: Parser
Reported: 3 years ago
Modified: a month ago

People: (Reporter: Vincent Lefevre, Unassigned)

Tracking: (Depends on: 1 bug)
Version: 32 Branch
Hardware: x86_64 Linux
Points: ---

Firefox Tracking Flags: (Not tracked)
(Reporter)

Description

3 years ago
User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:32.0) Gecko/20100101 Firefox/32.0
Build ID: 20140917194002

Steps to reproduce:

Open a UTF-8 text file using a "file:" URL scheme.


Actual results:

The file is regarded as encoded in windows-1252 (according to what is displayed and to "View Page Info").


Expected results:

The file should be regarded as encoded in UTF-8. Alternatively, the encoding could be guessed.

The reason is that UTF-8 is much more used nowadays, and is the default in various contexts. I've also tried with other browsers (Opera, lynx, w3m and elinks), and all of them regarded the file as encoded in UTF-8. Only Firefox chose a wrong charset.

Note: Bug 815551 is a similar bug by its title, but for HTML ("HTML: Parser" component) with http URLs (see the first comments), which is obviously very different.
(Reporter)

Comment 1

3 years ago
I forgot to give a reference to bug 760050: there was auto-detection in the past but it didn't work correctly. So, if it is chosen to guess the encoding, UTF-8 should be favored.
Comment 2

3 years ago
(In reply to Vincent Lefevre from comment #0)
> The file is regarded as encoded in windows-1252 (according to what is
> displayed and to "View Page Info").

This depends on the locale, but, yes, it's windows-1252 for most locales.

> The file should be regarded as encoded in UTF-8.

Simply saying that all file: URLs resolving to plain text files should be treated as UTF-8 might be too simple. We might get away with it, though. (What's your use case for opening local .txt files in a browser, BTW?)

> Alternatively, the encoding could be guessed.

Since we can assume local files to be finite and all bytes available soon, we could do the following:
 1) Add a method to the UTF-8 decoder to ask it if it has seen an error yet.
 2) When loading text/plain (or text/html; see below) from file:, start with the UTF-8 decoder and let the converted buffers queue up without parsing them. After each buffer, ask the UTF-8 decoder if it has seen an error already.
 3) If the UTF-8 decoder says it has seen an error, throw away the converted buffers, seek to the beginning of file and parse normally with the fallback encoding.
 4) If end of the byte stream is reached without the UTF-8 decoder having seen an error, let the parser process the buffer queue.

The hardest part would probably be the "seek to the beginning of file" bit, considering that we've already changed the state of the channel to deliver to a non-main thread. I'm not sure if channels for file: URLs support seeking. (Seeking with the existing channel instead of asking the docshell to renavigate avoids problems with scenarios where existing local HTML expects iframed content to behave a certain way as far as events go.)
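
To illustrate the shape of that scheme in isolation (a standalone Rust sketch, not the actual nsHtml5StreamParser code; the Guess type and the function name are made up for this example):

// Sketch of steps 1-4 above: buffer the whole local file, speculatively treat it
// as UTF-8, and fall back only if a genuinely malformed sequence shows up.
// (Re-validating from the start on every chunk is quadratic; a real
// implementation would validate incrementally.)
use std::str;

enum Guess {
    Utf8(String),      // step 4: whole file decoded cleanly as UTF-8
    Fallback(Vec<u8>), // step 3: caller re-decodes with the fallback encoding
}

fn sniff_local_text(chunks: impl Iterator<Item = Vec<u8>>) -> Guess {
    let mut buf: Vec<u8> = Vec::new();
    for chunk in chunks {
        buf.extend_from_slice(&chunk);
        if let Err(e) = str::from_utf8(&buf) {
            // error_len() == Some(_) means a real malformed sequence, not just a
            // multi-byte character split across the chunk boundary.
            if e.error_len().is_some() {
                return Guess::Fallback(buf);
            }
        }
    }
    match String::from_utf8(buf) {
        Ok(s) => Guess::Utf8(s),
        Err(e) => Guess::Fallback(e.into_bytes()), // truncated sequence at EOF
    }
}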

> The reason is that UTF-8 is much more used nowadays, and is the default in
> various contexts. I've also tried with other browsers (Opera, lynx, w3m and
> elinks), and all of them regarded the file as encoded in UTF-8. Only Firefox
> chose a wrong charset.

Lynx, w3m and elinks don't really have the sort of compatibility goals that Firefox has. Was the Presto-Opera or Blink-Opera? With fresh profile? What do Chrome and IE do?

> Note: Bug 815551 is a similar bug by its title, but for HTML ("HTML: Parser"
> component)

text/plain goes through the HTML parser, too. As far as implementation goes, if there's any level of content-based guessing, not having that code apply to HTML, too, would require an extra condition.

Any code to be written here would live in nsHtml5StreamParser.cpp.

> with http URL's (see the first comments), which is obviously very
> different.

(I still think we shouldn't add any new content-based detection for content loaded from http and that we should never use UTF-8 as the fallback for content loaded from http.)
Status: UNCONFIRMED → NEW
Component: File Handling → HTML: Parser
Ever confirmed: true
Summary: text/plain files with unknown charset should be regarded as encoded in UTF-8 → Support loading BOMless UTF-8 text/plain files from file: URLs
(Reporter)

Comment 3

3 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #2)
> (In reply to Vincent Lefevre from comment #0)
> > The file is regarded as encoded in windows-1252 (according to what is
> > displayed and to "View Page Info").
> 
> This depends on the locale, but, yes, it's windows-1252 for most locales.

I'm using a UTF-8 locale, so it's very surprising that, if this depends on the locale, an incompatible charset is chosen.

> Simply saying that all file: URLs resolving to plain text files should be
> treated as UTF-8 might be too simple. We might get away with it, though.
> (What's your use case for opening local .txt files in a browser, BTW?)

I sometimes retrieve parts of websites locally (possibly with some URL update and other corrections). This may include .txt files, linked from HTML files. The simplest solution is to view them via "file:" URL's. This is precisely how I discovered bug 760050 (which was a variant of this one): some .txt files (in English) contained the non-ASCII "§" character.

> The hardest part would probably be the "seek to the beginning of file" bit,
> considering that we've already changed the state of the channel to deliver
> to a non-main thread. I'm not sure if channels for file: URLs support
> seeking. (Seeking with the existing channel instead of asking the docshell
> to renavigate avoids problems with scenarios where existing local HTML
> expects iframed content to behave a certain way as far as events go.)

If this is a problem and the goal is to differentiate UTF-8 from windows-1252 (or ISO-8859-1 under Unix, since windows-1252 is not used in practice), I'd say that if the first non-ASCII bytes correspond to a valid UTF-8 sequence, then the file could be regarded as in UTF-8 with a very high probability, in particular if the locale is a UTF-8 one. This shouldn't need seeking because all the past bytes are ASCII, which is common to UTF-8 and windows-1252. Note that if there's an UTF-8 decoding error later, this doesn't necessarily mean that Firefox did something wrong, because UTF-8 files with invalid sequences also occur in practice.
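
To illustrate the idea with a rough standalone sketch (made-up names, not Gecko code): the ASCII prefix decodes identically in UTF-8 and windows-1252, so only the first non-ASCII sequence needs to be examined, and the answer is unknown only while that sequence is still incomplete.

// Returns None while everything seen so far is ASCII or the first non-ASCII
// sequence is not complete yet; otherwise says whether it is well-formed UTF-8.
fn first_non_ascii_looks_like_utf8(bytes: &[u8]) -> Option<bool> {
    let start = bytes.iter().position(|&b| b >= 0x80)?;
    let len = match bytes[start] {
        0xC2..=0xDF => 2,        // two-byte sequence
        0xE0..=0xEF => 3,        // three-byte sequence
        0xF0..=0xF4 => 4,        // four-byte sequence
        _ => return Some(false), // not a valid UTF-8 lead byte
    };
    if start + len > bytes.len() {
        return None;             // need more bytes to decide
    }
    Some(std::str::from_utf8(&bytes[start..start + len]).is_ok())
}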

> > The reason is that UTF-8 is much more used nowadays, and is the default in
> > various contexts. I've also tried with other browsers (Opera, lynx, w3m and
> > elinks), and all of them regarded the file as encoded in UTF-8. Only Firefox
> > chose a wrong charset.
> 
> Lynx, w3m and elinks don't really have the sort of compatibility goals that
> Firefox has.

What compatibility goals (at least in the case of a UTF-8 locale)?

> Was the Presto-Opera or Blink-Opera? With fresh profile?

Opera 12 under GNU/Linux. Not a fresh profile, but I use Opera only for some testing.

> What do Chrome and IE do?

Chromium uses windows-1252. But I wonder whether the issue has already been discussed.
There's no IE under GNU/Linux.
Comment 4

3 years ago
(In reply to Vincent Lefevre from comment #3)
> (In reply to Henri Sivonen (:hsivonen) from comment #2)
> > (In reply to Vincent Lefevre from comment #0)
> > > The file is regarded as encoded in windows-1252 (according to what is
> > > displayed and to "View Page Info").
> > 
> > This depends on the locale, but, yes, it's windows-1252 for most locales.
> 
> I'm using a UTF-8 locale, so it's very surprising that, if this depends on
> the locale, an incompatible charset is chosen.

I meant the Firefox UI language--not the *nix system locale notion.

> > Simply saying that all file: URLs resolving to plain text files should be
> > treated as UTF-8 might be too simple. We might get away with it, though.
> > (What's your use case for opening local .txt files in a browser, BTW?)
> 
> I sometimes retrieve parts of websites locally (possibly with some URL
> update and other corrections). This may include .txt files, linked from HTML
> files. The simplest solution is to view them via "file:" URL's. This is
> precisely how I discovered bug 760050 (which was a variant of this one):
> some .txt files (in English) contained the non-ASCII "§" character.

I see. For this use case, it seems to me the problem applies also to text/html and not just to text/plain when using a spidering program that doesn't rewrite HTML to include <meta charset>.

> > The hardest part would probably be the "seek to the beginning of file" bit,
> > considering that we've already changed the state of the channel to deliver
> > to a non-main thread. I'm not sure if channels for file: URLs support
> > seeking. (Seeking with the existing channel instead of asking the docshell
> > to renavigate avoids problems with scenarios where existing local HTML
> > expects iframed content to behave a certain way as far as events go.)
> 
> If this is a problem and the goal is to differentiate UTF-8 from
> windows-1252 (or ISO-8859-1 under Unix, since windows-1252 is not used in
> practice), I'd say that if the first non-ASCII bytes correspond to a valid
> UTF-8 sequence, then the file could be regarded as in UTF-8 with a very high
> probability, in particular if the locale is a UTF-8 one. This shouldn't need
> seeking because all the past bytes are ASCII, which is common to UTF-8 and
> windows-1252.

What if the first non-ASCII byte sequence is a copyright sign in the page footer? If we're going to do this for local files, let's make use of the fact that we have all the bytes available.

> Note that if there's an UTF-8 decoding error later, this
> doesn't necessarily mean that Firefox did something wrong, because UTF-8
> files with invalid sequences also occur in practice.

Catering to such pages is *so* WONTFIX if this feature gets implemented at all.
 
> > > The reason is that UTF-8 is much more used nowadays, and is the default in
> > > various contexts. I've also tried with other browsers (Opera, lynx, w3m and
> > > elinks), and all of them regarded the file as encoded in UTF-8. Only Firefox
> > > chose a wrong charset.
> > 
> > Lynx, w3m and elinks don't really have the sort of compatibility goals that
> > Firefox has.
> 
> What compatibility goals (at least in the case of a UTF-8 locale)?

Being able to read old files that were readable before.

> > Was the Presto-Opera or Blink-Opera? With fresh profile?
> 
> Opera 12 under GNU/Linux. Not a fresh profile, but I use Opera only for some
> testing.

That's Presto-Opera. And no proof of the defaults if it's not known to be a fresh profile.
(Reporter)

Comment 5

3 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #4)
> I see. For this use case, it seems to me the problem applies also to
> text/html and not just to text/plain when using a spidering program that
> doesn't rewrite HTML to include <meta charset>.

At worst, for HTML, the <meta charset> declaration can be added with another tool, a shell/Perl script or whatever. This shouldn't confuse any HTML reader.

> > If this is a problem and the goal is to differentiate UTF-8 from
> > windows-1252 (or ISO-8859-1 under Unix, since windows-1252 is not used in
> > practice), I'd say that if the first non-ASCII bytes correspond to a valid
> > UTF-8 sequence, then the file could be regarded as in UTF-8 with a very high
> > probability, in particular if the locale is a UTF-8 one. This shouldn't need
> > seeking because all the past bytes are ASCII, which is common to UTF-8 and
> > windows-1252.
> 
> What if the first non-ASCII byte sequence is a copyright sign in the page
> footer?

I don't know the internals, but ideally one should be able to send all the ASCII bytes ASAP, and deal with the encoding only when it matters. This is what I meant.

> If we're going to do this for local files, let's make use of the
> fact that we have all the bytes available.

If you have all the bytes, yes. Firefox doesn't seem to support files like file:///dev/stdin (which would be rather useless for text/plain anyway).

> > Note that if there's an UTF-8 decoding error later, this
> > doesn't necessarily mean that Firefox did something wrong, because UTF-8
> > files with invalid sequences also occur in practice.
> 
> Catering to such pages is *so* WONTFIX if this feature gets implemented at
> all.

I meant that it could just be obtained by accident. But conversely, if you have the ef bb bf byte sequence at the beginning of a file and an invalid UTF-8 sequence later, I wouldn't consider that recognizing the BOM sequence as UTF-8 was a mistake.
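
(For what it's worth, the BOM check mentioned here is just a three-byte prefix test; a trivial sketch in Rust:)

// EF BB BF at the very start of the byte stream is the UTF-8 byte order mark.
fn has_utf8_bom(bytes: &[u8]) -> bool {
    bytes.starts_with(&[0xEF, 0xBB, 0xBF])
}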

> > What compatibility goals (at least in the case of a UTF-8 locale)?
> 
> Being able to read old files that were readable before.

??? Most UTF-8 text/plain files were readable with old Firefox (actually I've just tried with Iceweasel 24.8.0 under Debian, but I doubt Debian has changed anything here), but it is no longer the case with new Firefox versions!

Note also the fact that Firefox also broke old HTML rendering with the new HTML5 parser.

> > Opera 12 under GNU/Linux. Not a fresh profile, but I use Opera only for some
> > testing.
> 
> That's Presto-Opera. And no proof of the defaults if it's not known to be a
> fresh profile.

I could do more tests later (that's on another machine), possibly with newer versions. But note that I haven't modified the profile explicitly.

BTW, in any case, the default settings wouldn't matter very much (except that I think that UTF-8 makes more sense, at least in UTF-8 *nix locales), as long as one can modify the settings to get UTF-8.

Comment 6

2 years ago
> > What compatibility goals (at least in the case of a UTF-8 locale)?
> Being able to read old files that were readable before.

While admirable, I'd be careful about sticking to that goal unconditionally.  UTF-8 is the future and the future is now.  Users will increasingly view FF as "broken" if FF can't render mainstream text files correctly (well, text files containing multi-byte UTF-8 sequences).

By analogy, it's great to support IE6 compatibility because old websites were coded to assume IE6.  But it would be suicide to do so at the expense of correctly rendering modern websites.

Just my 2¢

Comment 7

2 years ago
Another use case: I create presentations as HTML. I regularly view them locally, prior to web-hosting. Each time I see weird characters due to file:// not recognizing encoding, despite it being specified.
Comment 8

2 years ago
(In reply to LAFK from comment #7)
> Another use case: I create presentations as HTML. I regularly view them
> locally, prior to web-hosting. Each time I see weird characters due to
> file:// not recognizing encoding, despite it being specified.

What do you mean "despite it being specified"? Either the UTF-8 BOM (both text/html and text/plain) or <meta charset=utf-8> (text/html) should work. I.e. the ways of it "being specified" should work.

Comment 9

2 years ago
I'm using meta charset. However, please do ignore this: I failed to open a " properly, and adding the " fixed the problem. I found this out when I made a minimal file to replicate the bug to attach here.
Comment 10

I just had to dust off the "Text Encoding" hamburger menu widget for the first time in *years* due to this bug, so, here's my vote for doing *something* to detect UTF-8 in text/plain from file:.

(Use case: viewing a text file containing a bunch of emoji.  For whatever damn reason, neither 'less' nor Emacs recognize all the shiny new characters as printable, but Firefox does, once you poke the encoding.)

Comment 11

11 months ago
All platforms other than Windows include the encoding in their locale settings.  Thus, if you don't like autodetection, please obey the locale settings instead of hard-coding an archaic Windows code page.  Using the user's locale would be consistent with every single maintained program other than a browser these days.

Or, heck, just hard-code UTF-8.  A couple of years ago I gathered some data by mining Debian bug reports for the locale settings included by reportbug.  Around 3% of bug reports had no configured locale, and around half a percent used a non-UTF-8 encoding.  I believe these numbers are still a large overestimate, as a text-mode program like reportbug is often run over ssh on a headless box where users often have no reason to care about the locale.  A GUI-capable system, on the other hand, is almost always installed via an installer which doesn't even support non-UTF-8.
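
To illustrate the locale-based option as a rough sketch (the environment-variable precedence and name parsing are simplified here, and this is of course not how Gecko actually reads the locale):

// Derive a plain-text fallback from the POSIX locale, e.g. "de_DE.UTF-8@euro".
// LC_ALL overrides LC_CTYPE, which overrides LANG; the codeset follows the '.'.
fn locale_fallback_encoding() -> &'static str {
    let locale = ["LC_ALL", "LC_CTYPE", "LANG"]
        .iter()
        .filter_map(|name| std::env::var(name).ok())
        .find(|v| !v.is_empty())
        .unwrap_or_default();
    let codeset = locale
        .split('.').nth(1).unwrap_or("")  // "UTF-8@euro"
        .split('@').next().unwrap_or(""); // "UTF-8"
    if codeset.eq_ignore_ascii_case("UTF-8") || codeset.eq_ignore_ascii_case("utf8") {
        "UTF-8"
    } else {
        "windows-1252" // simplification: keep the current fallback for anything else
    }
}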

Comment 12

9 months ago
Another use case:
We have been cut'n'pasting UTF-8 text from PuTTY terminals to a web server daily for many years now, to document workflows and modifications. Years ago a regression was introduced in Firefox; since then we always have to change the character set manually to view and print documents with special characters like umlauts correctly.

A proposal:
It may be difficult and a lot of work to parse and guess the character set perfectly. But why not just provide an about:config variable (e.g. plain_text.charset.overwrite=utf-8)? I guess this is easy to implement and would be of great help to some Firefox users, at least for the time being.
Comment 13

9 months ago
Regarding the system locale:
The browser behavior here is aimed at being able to view legacy stuff saved to the local disk from the Web. It's not primarily about viewing content created on the local system. And in any case, the general approach that makes sense for HTML is sad for text/plain.

(In reply to Thomas Koch from comment #12)
> It may be difficult and a bunch of work to parse and guess the character set
> in a perfect manner. But why not just provide an about:config variable (e.g.
> plain_text.charset.overwrite=utf-8)? I guess this is easy to implement and
> would be of great help to some of the Firefox users. At least for the time
> being.

I'm open to reviewing an interim patch to that effect (for file: URLs only), but instead of writing that patch myself, I'm focusing on getting encoding_rs to a point that allows this bug to be fixed as described in comment 2. (Unlike Gecko's current uconv, which requires a caller either to handle malformed sequences itself or to lose knowledge of whether there were malformed sequences, encoding_rs allows the caller to know whether there were malformed sequences even when encoding_rs takes over handling the malformed sequences by emitting the REPLACEMENT CHARACTER.)
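
For reference, the simplest form of that with encoding_rs, assuming the whole file is already in memory (the streaming Decoder methods report similar per-buffer information; the fallback is hard-coded here just for the sketch):

use encoding_rs::{UTF_8, WINDOWS_1252};

// decode() also honors a BOM if present and says whether any malformed sequences
// had to be replaced with U+FFFD -- exactly the signal the scheme in comment 2 needs.
fn decode_local_text(bytes: &[u8]) -> std::borrow::Cow<'_, str> {
    let (text, _actual_encoding, had_errors) = UTF_8.decode(bytes);
    if !had_errors {
        text
    } else {
        WINDOWS_1252.decode(bytes).0
    }
}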
Depends on: 1261841
(Reporter)

Comment 14

9 months ago
(In reply to Henri Sivonen (:hsivonen) from comment #13)
> Regarding the system locale:
> The browser behavior here is aimed at being able to view legacy stuff saved
> to the local disk from the Web.

The browser has already broken the view of legacy stuff by assuming windows-1252, while legacy stuff under Unix is ISO-8859-1 (control characters in the range 0x80-0x9f should have remained invisible, and at least copied back to ISO-8859-1, otherwise this breaks charset handling). So, there isn't much reason to continue to support legacy stuff, in particular if it breaks the view of current stuff. Nowadays, text/plain data saved from the web uses UTF-8 in most cases (or is just plain ASCII, which can be seen as a particular case of UTF-8 anyway), and BOM is almost never used (I think I have never seen it except for test cases). In the few cases where another encoding has been used, the user may have converted the file to UTF-8 after the download, because this is what other tools expect (either because these tools expect the encoding specified by the locale or because they always expect UTF-8). So, IMHO, even hardcoding the charset to UTF-8 would be OK, and certainly better than the current status.

Comment 15

9 months ago
Please respect your users and obey either (a) the current locale or (b) some "about:config" variable for plain text default encoding. Hard-coding UTF-8 without configurability should be considered a provisional option, only.

Comment 16

4 months ago
If you're prepared to build Firefox yourself, a UTF-8 default fallback can be achieved with some trivial source code edits.
This works for HTML as well as plain text.
It has nothing to do with auto-detection or other esoteric stuff; it's just a plain [user] choice between defaulting to 'Western/windows-1252' or UTF-8 for an unidentified encoding.

1] Remove the block on setting UTF-8:

sed -i 's|(mFallback).*$|(mFallback)) {|;/UTF-8/d' dom/encoding/FallbackEncoding.cpp


2] a) Add Unicode option to Preferences|Content|Fonts & Colours|Advanced|"Fallback Text Encoding" drop-down menu:

sed -i '104i<!ENTITY languages.customize.Fallback.unicode     "Unicode">' browser/locales/en-US/chrome/browser/preferences/fonts.dtd
sed -i '272i\            <menuitem label="\&languages.customize.Fallback.unicode;"     value="UTF-8"/>' browser/components/preferences/fonts.xul

   b) ... and for any localization:

sed -i '104i<!ENTITY languages.customize.Fallback.unicode     "Unicode">' browser/chrome/browser/preferences/fonts.dtd


I did this as well, but it may not be necessary.
Having done 1], Unicode can be selected through the menu entry added in 2] to set UTF-8.
3] Set [about:config] option 'intl.charset.fallback.override' to default to UTF-8:

sed -i 's|fallback.override.*$|fallback.override",      "UTF-8");|' modules/libpref/init/all.js


This works for Firefox 51.0, for en-GB - I'm assuming the en-US fonts.dtd patch would work as well.

Comment 17

2 months ago
https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
The Unicode Standard permits the BOM in UTF-8,[3] but does not require or recommend its use.[4] 
See also https://bugs.chromium.org/p/chromium/issues/detail?id=703006 .