Closed Bug 1071816 Opened 6 years ago Closed 2 years ago

Support loading BOMless UTF-8 text/plain files from file: URLs

Component: Core :: DOM: HTML Parser (defect)
Version: 32 Branch
Platform: x86_64 Linux
Severity: normal
Status: RESOLVED FIXED
Target milestone: mozilla66
People: (Reporter: vincent-moz, Assigned: hsivonen)
References: (Depends on 1 open bug)
Attachments: (2 files, 5 obsolete files)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:32.0) Gecko/20100101 Firefox/32.0
Build ID: 20140917194002

Steps to reproduce:

Open a UTF-8 text file using a "file:" URL scheme.


Actual results:

The file is regarded as encoded in windows-1252 (according to what is displayed and to "View Page Info").


Expected results:

The file should be regarded as encoded in UTF-8. Alternatively, the encoding could be guessed.

The reason is that UTF-8 is much more used nowadays, and is the default in various contexts. I've also tried with other browsers (Opera, lynx, w3m and elinks), and all of them regarded the file as encoded in UTF-8. Only Firefox chose a wrong charset.

Note: Bug 815551 is a similar bug by its title, but for HTML ("HTML: Parser" component) with http URL's (see the first comments), which is obviously very different.
I forgot to give a reference to bug 760050: there was auto-detection in the past but it didn't work correctly. So, if it is chosen to guess the encoding, UTF-8 should be favored.
(In reply to Vincent Lefevre from comment #0)
> The file is regarded as encoded in windows-1252 (according to what is
> displayed and to "View Page Info").

This depends on the locale, but, yes, it's windows-1252 for most locales.

> The file should be regarded as encoded in UTF-8.

Simply saying that all file: URLs resolving to plain text files should be treated as UTF-8 might be too simple. We might get away with it, though. (What's your use case for opening local .txt files in a browser, BTW?)

> Alternatively, the encoding could be guessed.

Since we can assume local files to be finite and all bytes available soon, we could do the following:
 1) Add a method to the UTF-8 decoder to ask it if it has seen an error yet.
 2) When loading text/plain (or text/html; see below) from file:, start with the UTF-8 decoder and let the converted buffers queue up without parsing them. After each buffer, ask the UTF-8 decoder if it has seen an error already.
 3) If the UTF-8 decoder says it has seen an error, throw away the converted buffers, seek to the beginning of file and parse normally with the fallback encoding.
 4) If end of the byte stream is reached without the UTF-8 decoder having seen an error, let the parser process the buffer queue.

The hardest part would probably be the "seek to the beginning of file" bit, considering that we've already changed the state of the channel to deliver to a non-main thread. I'm not sure if channels for file: URLs support seeking. (Seeking with the existing channel instead of asking the docshell to renavigate avoids problems with scenarios where existing local HTML expects iframed content to behave a certain way as far as events go.)
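The four steps above can be modeled, illustratively rather than as Gecko code, roughly as follows (the function name and the windows-1252 default are assumptions for the example):

```python
# Illustrative model (not Gecko code) of the four steps: decode the whole
# local file with a strict UTF-8 decoder first; on any malformed sequence,
# "seek back" and redecode with the fallback encoding.

def decode_local_file(data: bytes, fallback: str = "windows-1252"):
    """Return (text, encoding_used) for a local text/plain payload."""
    try:
        # Steps 1-2: run the UTF-8 decoder over all buffers, watching for errors.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Steps 3-4: an error was seen, so throw the UTF-8 output away and
        # decode from the beginning with the fallback encoding.
        return data.decode(fallback), fallback
```

A file that decodes cleanly (step 4) keeps UTF-8; a file with any malformed sequence falls back, mirroring the seek-to-the-beginning step.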

> The reason is that UTF-8 is much more used nowadays, and is the default in
> various contexts. I've also tried with other browsers (Opera, lynx, w3m and
> elinks), and all of them regarded the file as encoded in UTF-8. Only Firefox
> chose a wrong charset.

Lynx, w3m and elinks don't really have the sort of compatibility goals that Firefox has. Was the Presto-Opera or Blink-Opera? With fresh profile? What do Chrome and IE do?

> Note: Bug 815551 is a similar bug by its title, but for HTML ("HTML: Parser"
> component)

text/plain goes through the HTML parser, too. As far as implementation goes, if there's any level of content-based guessing, not having that code apply to HTML, too, would require an extra condition.

Any code to be written here would live in nsHtml5StreamParser.cpp.

> with http URL's (see the first comments), which is obviously very
> different.

(I still think we shouldn't add any new content-based detection for content loaded from http and that we should never use UTF-8 as the fallback for content loaded from http.)
Status: UNCONFIRMED → NEW
Component: File Handling → HTML: Parser
Ever confirmed: true
Summary: text/plain files with unknown charset should be regarded as encoded in UTF-8 → Support loading BOMless UTF-8 text/plain files from file: URLs
(In reply to Henri Sivonen (:hsivonen) from comment #2)
> (In reply to Vincent Lefevre from comment #0)
> > The file is regarded as encoded in windows-1252 (according to what is
> > displayed and to "View Page Info").
> 
> This depends on the locale, but, yes, it's windows-1252 for most locales.

I'm using a UTF-8 locale, so it's very surprising that, if this depends on the locale, an incompatible charset is chosen.

> Simply saying that all file: URLs resolving to plain text files should be
> treated as UTF-8 might be too simple. We might get away with it, though.
> (What's your use case for opening local .txt files in a browser, BTW?)

I sometimes retrieve parts of websites locally (possibly with some URL update and other corrections). This may include .txt files, linked from HTML files. The simplest solution is to view them via "file:" URL's. This is precisely how I discovered bug 760050 (which was a variant of this one): some .txt files (in English) contained the non-ASCII "§" character.

> The hardest part would probably be the "seek to the beginning of file" bit,
> considering that we've already changed the state of the channel to deliver
> to a non-main thread. I'm not sure if channels for file: URLs support
> seeking. (Seeking with the existing channel instead of asking the docshell
> to renavigate avoids problems with scenarios where existing local HTML
> expects iframed content to behave a certain way as far as events go.)

If this is a problem and the goal is to differentiate UTF-8 from windows-1252 (or ISO-8859-1 under Unix, since windows-1252 is not used in practice), I'd say that if the first non-ASCII bytes correspond to a valid UTF-8 sequence, then the file could be regarded as in UTF-8 with a very high probability, in particular if the locale is a UTF-8 one. This shouldn't need seeking because all the past bytes are ASCII, which is common to UTF-8 and windows-1252. Note that if there's an UTF-8 decoding error later, this doesn't necessarily mean that Firefox did something wrong, because UTF-8 files with invalid sequences also occur in practice.
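The heuristic described here can be sketched as follows (illustrative Python only, not Firefox code; as the next comment notes, it is weak when the first non-ASCII byte happens to appear late or to form accidentally valid UTF-8):

```python
# Illustrative sketch: ASCII bytes could be emitted immediately, and the
# UTF-8-vs-windows-1252 decision made only at the first non-ASCII sequence,
# so no seeking would be needed.

def sniff_at_first_non_ascii(data: bytes) -> str:
    """Guess "utf-8" or "windows-1252" from the first non-ASCII sequence."""
    for i, b in enumerate(data):
        if b >= 0x80:
            # A UTF-8 sequence is at most 4 bytes; validate a window starting
            # at the first non-ASCII byte.
            try:
                data[i:i + 4].decode("utf-8")
                return "utf-8"
            except UnicodeDecodeError as e:
                # A sequence truncated by the window (or EOF) but valid so
                # far still looks like UTF-8.
                if e.reason == "unexpected end of data":
                    return "utf-8"
                return "windows-1252"
    return "utf-8"  # pure ASCII decodes identically either way
```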

> > The reason is that UTF-8 is much more used nowadays, and is the default in
> > various contexts. I've also tried with other browsers (Opera, lynx, w3m and
> > elinks), and all of them regarded the file as encoded in UTF-8. Only Firefox
> > chose a wrong charset.
> 
> Lynx, w3m and elinks don't really have the sort of compatibility goals that
> Firefox has.

What compatibility goals (at least in the case of a UTF-8 locale)?

> Was the Presto-Opera or Blink-Opera? With fresh profile?

Opera 12 under GNU/Linux. Not a fresh profile, but I use Opera only for some testing.

> What do Chrome and IE do?

Chromium uses windows-1252. But I wonder whether the issue has already been discussed.
There's no IE under GNU/Linux.
(In reply to Vincent Lefevre from comment #3)
> (In reply to Henri Sivonen (:hsivonen) from comment #2)
> > (In reply to Vincent Lefevre from comment #0)
> > > The file is regarded as encoded in windows-1252 (according to what is
> > > displayed and to "View Page Info").
> > 
> > This depends on the locale, but, yes, it's windows-1252 for most locales.
> 
> I'm using a UTF-8 locale, so that it's very surprising this if it depends on
> the locale, an incompatible charset is chosen.

I meant the Firefox UI language--not the *nix system locale notion.

> > Simply saying that all file: URLs resolving to plain text files should be
> > treated as UTF-8 might be too simple. We might get away with it, though.
> > (What's your use case for opening local .txt files in a browser, BTW?)
> 
> I sometimes retrieve parts of websites locally (possibly with some URL
> update and other corrections). This may include .txt files, linked from HTML
> files. The simplest solution is to view them via "file:" URL's. This is
> precisely how I discovered bug 760050 (which was a variant of this one):
> some .txt files (in English) contained the non-ASCII "§" character.

I see. For this use case, it seems to me the problem applies also to text/html and not just to text/plain when using a spidering program that doesn't rewrite HTML to include <meta charset>.

> > The hardest part would probably be the "seek to the beginning of file" bit,
> > considering that we've already changed the state of the channel to deliver
> > to a non-main thread. I'm not sure if channels for file: URLs support
> > seeking. (Seeking with the existing channel instead of asking the docshell
> > to renavigate avoids problems with scenarios where existing local HTML
> > expects iframed content to behave a certain way as far as events go.)
> 
> If this is a problem and the goal is to differentiate UTF-8 from
> windows-1252 (or ISO-8859-1 under Unix, since windows-1252 is not used in
> practice), I'd say that if the first non-ASCII bytes correspond to a valid
> UTF-8 sequence, then the file could be regarded as in UTF-8 with a very high
> probability, in particular if the locale is a UTF-8 one. This shouldn't need
> seeking because all the past bytes are ASCII, which is common to UTF-8 and
> windows-1252.

What if the first non-ASCII byte sequence is a copyright sign in the page footer? If we're going to do this for local files, let's make use of the fact that we have all the bytes available.

> Note that if there's an UTF-8 decoding error later, this
> doesn't necessarily mean that Firefox did something wrong, because UTF-8
> files with invalid sequences also occur in practice.

Catering to such pages is *so* WONTFIX if this feature gets implemented at all.
 
> > > The reason is that UTF-8 is much more used nowadays, and is the default in
> > > various contexts. I've also tried with other browsers (Opera, lynx, w3m and
> > > elinks), and all of them regarded the file as encoded in UTF-8. Only Firefox
> > > chose a wrong charset.
> > 
> > Lynx, w3m and elinks don't really have the sort of compatibility goals that
> > Firefox has.
> 
> What compatibility goals (at least in the case of a UTF-8 locale)?

Being able to read old files that were readable before.

> > Was the Presto-Opera or Blink-Opera? With fresh profile?
> 
> Opera 12 under GNU/Linux. Not a fresh profile, but I use Opera only for some
> testing.

That's Presto-Opera. And it's no proof of the defaults if it's not known to be a fresh profile.
(In reply to Henri Sivonen (:hsivonen) from comment #4)
> I see. For this use case, it seems to me the problem applies also to
> text/html and not just to text/plain when using a spidering program that
> doesn't rewrite HTML to include <meta charset>.

At worst, for HTML, this can be done with another tool, a shell/Perl script or whatever. This shouldn't confuse any HTML reader.

> > If this is a problem and the goal is to differentiate UTF-8 from
> > windows-1252 (or ISO-8859-1 under Unix, since windows-1252 is not used in
> > practice), I'd say that if the first non-ASCII bytes correspond to a valid
> > UTF-8 sequence, then the file could be regarded as in UTF-8 with a very high
> > probability, in particular if the locale is a UTF-8 one. This shouldn't need
> > seeking because all the past bytes are ASCII, which is common to UTF-8 and
> > windows-1252.
> 
> What if the first non-ASCII byte sequence is a copyright sign in the page
> footer?

I don't know the internals, but ideally one should be able to send all the ASCII bytes ASAP, and deal with the encoding only when it matters. This is what I meant.

> If we're going to do this for local files, let's make use of the
> fact that we have all the bytes available.

If you have all the bytes, yes. Firefox doesn't seem to support files like file:///dev/stdin anyway (which would be rather useless for text/plain files in any case).

> > Note that if there's an UTF-8 decoding error later, this
> > doesn't necessarily mean that Firefox did something wrong, because UTF-8
> > files with invalid sequences also occur in practice.
> 
> Catering to such pages is *so* WONTFIX if this feature gets impemented at
> all.

I meant that it could just be obtained by accident. But conversely, if you have the ef bb bf byte sequence at the beginning of a file and an invalid UTF-8 sequence later, I wouldn't consider that recognizing the BOM sequence as UTF-8 was a mistake.

> > What compatibility goals (at least in the case of a UTF-8 locale)?
> 
> Being able to read old files that were readable before.

??? Most UTF-8 text/plain files were readable with old Firefox (actually I've just tried with Iceweasel 24.8.0 under Debian, but I doubt Debian has changed anything here), but it is no longer the case with new Firefox versions!

Note also that Firefox broke old HTML rendering with the new HTML5 parser.

> > Opera 12 under GNU/Linux. Not a fresh profile, but I use Opera only for some
> > testing.
> 
> That's Presto-Opera. And no proof of the defaults if it's not known to be a
> fresh profile.

I could do more tests later (that's on another machine), possibly with newer versions. But note that I haven't modified the profile explicitly.

BTW, in any case, the default settings wouldn't matter very much (except that I think that UTF-8 makes more sense, at least in UTF-8 *nix locales), as long as one can modify the settings to get UTF-8.
> > What compatibility goals (at least in the case of a UTF-8 locale)?
> Being able to read old files that were readable before.

While admirable, I'd be careful about sticking to that goal unconditionally.  UTF-8 is the future, and the future is now.  Users will increasingly view FF as "broken" if FF can't render mainstream text files correctly (well, text files containing multi-byte UTF-8 sequences).

By analogy, it's great to support IE6 compatibility because old websites were coded to assume IE6.  But it would be suicide to do so at the expense of correctly rendering modern websites.

Just my 2¢
Another use case: I create presentations as HTML. I regularly view them locally, prior to web-hosting. Each time I see weird characters due to file:// not recognizing encoding, despite it being specified.
(In reply to LAFK from comment #7)
> Another use case: I create presentations as HTML. I regularly view them
> locally, prior to web-hosting. Each time I see weird characters due to
> file:// not recognizing encoding, despite it being specified.

What do you mean "despite it being specified"? Either the UTF-8 BOM (both text/html and text/plain) or <meta charset=utf-8> (text/html) should work. I.e. the ways of it "being specified" should work.
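The two ways of it "being specified" that Henri lists can be sketched as byte checks (illustrative only; the real <meta> prescan is far more involved than this substring test):

```python
# A UTF-8 BOM works for both text/plain and text/html, while <meta charset>
# only applies to HTML.

def declares_utf8(data: bytes, content_type: str) -> bool:
    if data.startswith(b"\xef\xbb\xbf"):  # UTF-8 BOM
        return True
    if content_type == "text/html":
        # Crude stand-in for prescanning the first 1024 bytes for <meta>.
        return b"<meta charset=utf-8>" in data[:1024].lower()
    return False
```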
I'm using meta charset. However, please do ignore this: I failed to open " properly. Adding " fixed the problem; I found it out when I made a minimal file to replicate the bug to attach it here.
I just had to dust off the "Text Encoding" hamburger menu widget for the first time in *years* due to this bug, so, here's my vote for doing *something* to detect UTF-8 in text/plain from file:.

(Use case: viewing a text file containing a bunch of emoji.  For whatever damn reason, neither 'less' nor Emacs recognize all the shiny new characters as printable, but Firefox does, once you poke the encoding.)
All platforms other than Windows include encoding in their locale settings.  Thus, if you don't like autodetection, please obey locale settings instead of hard-coding an archaic Windows code page.  Using the user's locale would be consistent with every single maintained program other than a browser these days.

Or, heck, just hard-code UTF-8.  A couple of years ago I gathered some data by mining Debian bug reports for locale settings included by reportbug.  Around 3% of bug reports had no configured locale, and around half a percent used a non-UTF-8 one.  I believe these numbers are still a large overestimate, as a text-mode program like reportbug is often run over ssh on a headless box where users often have no reason to care about the locale.  A GUI-capable system, on the other hand, is almost always installed via an installer which doesn't even support non-UTF-8.
Another use case:
We have been copy-pasting UTF-8 text daily from PuTTY terminals to a web server to document workflows and modifications for many years now. Years ago a regression was introduced in Firefox; since then we always have to change the character set manually to view and print documents with special characters like umlauts correctly.

A proposal:
It may be difficult and a bunch of work to parse and guess the character set in a perfect manner. But why not just provide an about:config variable (e.g. plain_text.charset.overwrite=utf-8)? I guess this is easy to implement and would be of great help to some of the Firefox users. At least for the time being.
Regarding the system locale:
The browser behavior here is aimed at being able to view legacy stuff saved to the local disk from the Web. It's not about primarily viewing content created on the local system. And in any case, the general approach that makes sense for HTML is sad for text/plain.

(In reply to Thomas Koch from comment #12)
> It may be difficult and a bunch of work to parse and guess the character set
> in a perfect manner. But why not just provide an about:config variable (e.g.
> plain_text.charset.overwrite=utf-8)? I guess this is easy to implement and
> would be of great help to some of the Firefox users. At least for the time
> being.

I'm open to reviewing an interim patch to that effect (for file: URLs only), but instead of writing that patch myself, I'm focusing on getting encoding_rs to a point that allows this bug to be fixed as described in comment 2. (Unlike Gecko's current uconv, which requires a caller either to handle malformed sequences itself or to lose knowledge of whether there were malformed sequences, encoding_rs allows the caller to know whether there were malformed sequences even when encoding_rs takes over handling the malformed sequences by emitting the REPLACEMENT CHARACTER.)
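The encoding_rs property described here (the caller learns that malformed sequences occurred even while they are replaced with the REPLACEMENT CHARACTER) can be mimicked in Python with a custom error handler. This is an analogy, not the encoding_rs API:

```python
import codecs

# Decode with U+FFFD substitution while still learning whether any
# malformed sequence occurred.

class RecordingReplacer:
    """Error handler that substitutes U+FFFD and remembers it did so."""

    def __init__(self):
        self.had_errors = False

    def __call__(self, exc):
        self.had_errors = True
        return ("\ufffd", exc.end)  # replacement text, position to resume at

handler = RecordingReplacer()
codecs.register_error("replace-and-record", handler)

text = b"ok \xff ko".decode("utf-8", errors="replace-and-record")
```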
Depends on: encoding_rs
(In reply to Henri Sivonen (:hsivonen) from comment #13)
> Regarding the system locale:
> The browser behavior here is aimed at being able to view legacy stuff saved
> to the local disk from the Web.

The browser has already broken the view of legacy stuff by assuming windows-1252, while legacy stuff under Unix is ISO-8859-1 (control characters in the range 0x80-0x9f should have remained invisible, and at least been copied back to ISO-8859-1; otherwise this breaks charset handling). So there isn't much reason to continue to support legacy stuff, in particular if it breaks the view of current stuff.

Nowadays, text/plain data saved from the web uses UTF-8 in most cases (or is just plain ASCII, which can be seen as a particular case of UTF-8 anyway), and a BOM is almost never used (I think I have never seen it except in test cases). In the few cases where another encoding has been used, the user may have converted the file to UTF-8 after the download, because this is what other tools expect (either because these tools expect the encoding specified by the locale or because they always expect UTF-8). So, IMHO, even hardcoding the charset to UTF-8 would be OK, and certainly better than the current status.
Please respect your users and obey either (a) the current locale or (b) some "about:config" variable for plain text default encoding. Hard-coding UTF-8 without configurability should be considered a provisional option, only.
If you're prepared to build Firefox yourself, UTF-8 default fallback can be achieved with some trivial source code edits.
Works for html as well as plain text.
This has nothing to do with auto-detection or other esoteric stuff, it's just a plain [user] choice between defaulting to 'Western/windows-1252' or UTF-8 for an unidentified encoding.

1] Remove the block on setting UTF-8:

sed -i 's|(mFallback).*$|(mFallback)) {|;/UTF-8/d' dom/encoding/FallbackEncoding.cpp


2] a) Add Unicode option to Preferences|Content|Fonts & Colours|Advanced|"Fallback Text Encoding" drop-down menu:

sed -i '104i<!ENTITY languages.customize.Fallback.unicode     "Unicode">' browser/locales/en-US/chrome/browser/preferences/fonts.dtd
sed -i '272i\            <menuitem label="\&languages.customize.Fallback.unicode;"     value="UTF-8"/>' browser/components/preferences/fonts.xul

   b) ... and for any localization:

sed -i '104i<!ENTITY languages.customize.Fallback.unicode     "Unicode">' browser/chrome/browser/preferences/fonts.dtd


I did this as well, but it may not be necessary.
Having done 1], Unicode can be selected through the menu entry added in 2] to set UTF-8.
3] Set [about:config] option 'intl.charset.fallback.override' to default to UTF-8:

sed -i 's|fallback.override.*$|fallback.override",      "UTF-8");|' modules/libpref/init/all.js


This works for Firefox 51.0, for en-GB - I'm assuming the en-US fonts.dtd patch would work as well.
https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
The Unicode Standard permits the BOM in UTF-8,[3] but does not require or recommend its use.[4] 
See also https://bugs.chromium.org/p/chromium/issues/detail?id=703006 .
Duplicate of this bug: 1407594
Duplicate of this bug: 1419200
Edited copypaste of bug 1407594 comment 15 (note the last paragraph!):

> So, what can be done for this bug to be actually properly fixed in next
> release or at least release after next, instead of bug staying open for
> years without progress?

1) Wait for the fix for bug 980904 to land.
2) Then volunteer to write the following code:
  * Activate the following machinery in the HTML parser iff the URL is a file: URL and the encoding wasn't decided from BOM or <meta>:
    * Instantiate a decoder for UTF-8 instead of the fallback encoding.
    * Block tree op flushes to the main thread. Keep them accumulating the way they accumulate during speculative parsing.
    * Whenever bytes arrive from the "network" (i.e. file) stash them into a linked list of buffer copies in nsHtml5StreamParser.
    * Decode the bytes.
    * If there was an error:
      - Throw away the tree op queue.
      - Instantiate a decoder for the non-UTF-8 encoding that would have been used normally.
      - Unblock tree op delivery to the main thread.
      - Replay the stashed-away bytes to the decoder and the tokenizer.
    * When the EOF is reached, notify the main thread about the encoding being UTF-8 as if it had been discovered via <meta>.
    * Deliver the pending tree ops to the main thread.

In the interim, I'd r+ a boolean off-by-default pref to assume UTF-8 for text/plain and <meta>less text/html from file: URLs only.
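The stash-and-replay machinery sketched above can be modeled with plain Python objects standing in for nsHtml5StreamParser's buffer list and tree-op queue (all names here are illustrative, not Gecko code):

```python
import codecs

# Toy model: stash incoming buffers, decode speculatively as UTF-8 with the
# output held back, and on the first malformed sequence replay the stash
# through the fallback decoder.

class SpeculativeUtf8Stream:
    def __init__(self):
        self.stash = []      # copies of every incoming buffer
        self.pending = []    # stand-in for the blocked tree-op queue
        self.decoder = codecs.getincrementaldecoder("utf-8")()
        self.encoding = "utf-8"

    def _fall_back(self):
        # Error seen: throw away the queued output, switch decoders, and
        # replay the stashed bytes from the beginning.
        self.encoding = "windows-1252"
        self.decoder = codecs.getincrementaldecoder("windows-1252")()
        self.pending = [self.decoder.decode(b) for b in self.stash]

    def feed(self, buf: bytes) -> None:
        self.stash.append(buf)
        try:
            self.pending.append(self.decoder.decode(buf))
        except UnicodeDecodeError:
            self._fall_back()

    def finish(self) -> str:
        # EOF: flush the decoder; a sequence truncated at EOF also falls back.
        try:
            self.pending.append(self.decoder.decode(b"", final=True))
        except UnicodeDecodeError:
            self._fall_back()
        return "".join(self.pending)
```

Keeping every buffer in the stash until EOF is what makes the memory cost proportional to file size, which is the OOM concern raised in the follow-up comments.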
WRT bug 1419200 comment 2:

> No, the legacy on the Web still varies by locale, which we approximate from the TLD and, failing that, from the UI locale.

Suppose I request the *same* HTML file, with no charset declared in <head> or Content-Type, over HTTP; if both TLD and UI locale fail to detect that it's encoded in UTF-8, the document will be decoded as windows-1251. But if I load the same document from file://, UTF-8 will be tried first and succeed in decoding the doc as UTF-8? If this is the case, a whole lot of authors are bound to be confused. Why can't HTTP try UTF-8 first, before TLD/UI locale? Chrome and Safari seem to be fine with it.
(In reply to Henri Sivonen (:hsivonen) (away from Bugzilla until 2017-12-04) from comment #20)
>     * Block tree op flushes to the main thread. Keep them accumulating the
> way they accumulate during speculative parsing.
>     * Whenever bytes arrive from the "network" (i.e. file) stash them into a
> linked list of buffer copies in nsHtml5StreamParser.
>     * Decode the bytes.
>     * If there was an error:
>       - Throw away the tree op queue.
>       - Instantiate a decoder for the non-UTF-8 encoding that would have
> been used normally.
>       - Unblock tree op delivery to the main thread.
>       - Replay the stashed-away bytes to the decoder and the tokenizer.
>     * When the EOF is reached, notify the main thread about the encoding
> being UTF-8 as if it had been discovered via <meta>.
>     * Deliver the pending tree ops to the main thread.

Doesn't it cause OOM if the file is huge (such as a log file)? I'd be uncomfortable if I can't view huge files at all until Firefox reaches EOF.
> Doesn't it cause OOM if the file is huge (such as a log file)? I'd be
> uncomfortable if I can't view huge files at all until Firefox reaches EOF.

If the file is huge enough for this to be a concern, and you _still_ haven't seen a single decoding failure, it's beyond obvious that the file indeed is UTF-8 (or pure 7-bit, which is also valid UTF-8).  In a recent discussion (among Mozilla and Chromium folks; I don't have a link), there was a debate about whether looking at the first 1024 bytes from the network is enough.  The biggest alternative anyone even mentioned was 4096.  Taking more from a locally-available file would be reasonable, but you really don't need to check a gigabyte.
(In reply to Adam Borowski from comment #23)
> Taking more from a
> locally-available file would be reasonable, but you really don't need to
> check a gigabyte.

But that's what Henri is proposing.
(In reply to Masatoshi Kimura [:emk] from comment #24)
> But that's what Henri is proposing.

Yeah, exactly -- I agree with what you said, I merely mentioned some size scales to make it more obvious what's wrong.

But I really don't get what the problem here is: on any Unix system, there is a well-defined setting (LC_CTYPE/LANG/LC_ALL) that all locale-aware programs but Firefox obey.  What you guys are trying to do is override the user's preferences, because some files might still use an ancient encoding.  This might make sense, but only in _addition_ to supporting UTF-8, not as the primary choice.  By trying to do "better", Firefox still fails the vast majority of cases.
(In reply to Adam Borowski from comment #25)
> But I really don't get what the problem here is: on any Unix system, there
> is a well-defined setting (LC_CTYPE/LANG/LC_ALL) that all locale-aware
> programs but Firefox obey.

Firefox does *not* obey UTF-8 locales. It has a built-in small list[1] and the fallback encoding is determined by that list. The list does not contain UTF-8.

[1] https://dxr.mozilla.org/mozilla-central/rev/4affa6e0a8c622e4c4152872ffc14b73103830ac/dom/encoding/localesfallbacks.properties
I think Adam is saying Firefox should obey LC_*. I think this is a much larger issue. LANG is used to determine the language used for UI chrome. The reason Firefox doesn't obey that is because, in this day and age, Firefox still builds locale-specific packages as if it were the '80s, when the 500k or so from other language bundles mattered. To be fair, lots of programs allow you to override LANG/LC_CTYPE/LC_* locally; it's just that Firefox is a bit egregious. Respecting POSIX locale environment variables or not, I think Firefox should still default to UTF-8, given that most OSes, such as macOS or Linux desktop environments, never change LC_CTYPE: they all set it to UTF-8 and leave character encoding overrides to the individual applications.

So, I think it's okay that Firefox doesn't obey LANG for now due to build issues, and I don't think disobeying LC_CTYPE matters much these days as long as Firefox defaults to UTF-8. As for the other LC_* variables, I don't see any use case for them, except maybe LC_ALL, but that only matters if Firefox looks at LANG and LC_CTYPE at all.
(In reply to Yuen Ho Wong from comment #27)
> So, I think it's okay that Firefox doesn't obey LANG for now due to building
> issues, I don't think disobeying LC_CTYPE matter much these days as long as
> Firefox defaults to UTF-8. As to the other LC_*, I don't see any use case
> for them, except maybe LC_ALL, but it only matter if Firefox looks at LANG
> and LC_CTYPE at all.

If Firefox considers the locales, then it is important to honor LC_ALL when set, as it overrides LC_CTYPE, i.e. the considered charset should be the same as the one given by the "locale charmap" command. The order is, by precedence: LC_ALL, then LC_CTYPE, then LANG.
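The precedence described here is what `locale charmap` effectively resolves; a minimal sketch (illustrative helper name, not a Firefox API):

```python
# POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.

def effective_ctype(env: dict) -> str:
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            return value
    return "C"  # the POSIX default locale when nothing is set
```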
That's what I said. For now, I think Firefox should completely disregard LANG and LC_* but treat UTF-8 as the highest priority when considering a list of fallback character encodings. Or, bundle all the language strings into just one build per platform, so Firefox can start obeying LANG and LC_ALL. I don't know how feasible the latter option is, as I'm not familiar with Firefox's build process and Mozilla's build infrastructure.
Firefox obeys the language part of LANG/LC_ALL (at least as shipped by Debian), just not the encoding part.

By now, though, adding support for varying encodings is probably pointless.  I've gathered some data: of all Debian bug reports in 2016 that were filed with reportbug and included locale data, only 0.8% used something that's not UTF-8, and there's a strong downward trend compared to previous data.  Bug reporters are strongly correlated with technical ability, and it takes some knowledge to set a modern Unix system to a non-UTF-8 encoding, so it's a safe assumption that the percentage of regular users with non-UTF-8 is well below 0.8%.

Here's a graph: Oct 2004 - Jan 2017, max=51%, 1 horizontal dot = 1 month.
⠀⠀⢠⠀⠀⠀⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⣄⣾⠀⠀⣦⣿⣿⡄⠀⡆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⢠⣿⣿⣄⡇⣿⣿⣿⡇⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣿⣿⣿⣿⣷⣿⣿⣿⣇⡄⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⡇⡀⢀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⢸⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣸⣿⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣶⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣧⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣾⣧⣤⣶⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇⡀⠀⠀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⣤⣾⣇⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣧⣿⣦⣠⣷⣄⣰⣠⡀⢀⠀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⣾⣷⣷⣠⡀⠀⠀⡀⢸⣿⡄⠀⠀⠀⠀⠀⡄⣠⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣾⣿⣾⣿⣿⣶⣴⣤⣠⣷⣷⣿⣧⣰⣴⡄⣤⣀⣀⣀⣀⣠⣄⣀⣄⢀⢰⡀⣀⣀⣀⢀

As text-displaying programs (other than Firefox) that don't obey the system locale have gone the way of the dodo, users are accustomed to locally stored files in encodings other than UTF-8 ending up as mojibake.  It's only Firefox that goes the other way: mojibake for UTF-8, ok for _an_ ancient encoding that may or may not match the file.

Elsewhere, support for such encodings has been bit-rotting and is being dropped; yet currently Firefox doesn't even recognize what's becoming the only option.
(In reply to Masatoshi Kimura [:emk] from comment #22)
> (In reply to Henri Sivonen (:hsivonen) (away from Bugzilla until 2017-12-04)
> from comment #20)
> >     * Block tree op flushes to the main thread. Keep them accumulating the
> > way they accumulate during speculative parsing.
> >     * Whenever bytes arrive from the "network" (i.e. file) stash them into a
> > linked list of buffer copies in nsHtml5StreamParser.
> >     * Decode the bytes.
> >     * If there was an error:
> >       - Throw away the tree op queue.
> >       - Instantiate a decoder for the non-UTF-8 encoding that would have
> > been used normally.
> >       - Unblock tree op delivery to the main thread.
> >       - Replay the stashed-away bytes to the decoder and the tokenizer.
> >     * When the EOF is reached, notify the main thread about the encoding
> > being UTF-8 as if it had been discovered via <meta>.
> >     * Deliver the pending tree ops to the main thread.
> 
> Doesn't it cause OOM if the file is huge (such as a log file)?

It would temporarily consume more RAM than now, yes. Do we really need to care about files whose size falls in the range where they don't OOM now but would OOM with my proposal?

> I'd be
> uncomfortable if I can't view huge files at all until Firefox reaches EOF.

How often do you view such files? Are huge local files really a use case we need to optimize for? Does it really matter if huge local log files take a bit more time to start displaying than they do now if, in exchange, we address the problem that currently, if you save (verbatim) a UTF-8 file whose encoding was given on the HTTP layer, it opens up the wrong way?

I think we should solve the problem for files whose size is in the range of typical Web pages and let opening of huge log files work according to what follows from optimizing for normal-sized files.
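The buffer-decode-replay scheme quoted from comment 20 above boils down to a deterministic rule: if the buffered bytes are well-formed UTF-8, commit to UTF-8; otherwise redecode with the legacy fallback. A minimal sketch of that rule (illustrative only, not the actual nsHtml5StreamParser code; IsValidUtf8 and ChooseEncoding are made-up names):

```cpp
#include <cstddef>
#include <string>

// Returns true if `data` is entirely well-formed UTF-8 (rejects lone
// continuation bytes, truncated sequences, overlong forms, UTF-16
// surrogates, and code points above U+10FFFF).
bool IsValidUtf8(const std::string& data) {
  size_t i = 0;
  const size_t len = data.size();
  while (i < len) {
    unsigned char c = static_cast<unsigned char>(data[i]);
    size_t extra;
    unsigned long cp;
    if (c < 0x80) { ++i; continue; }
    else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; }
    else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; }
    else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; }
    else return false;                         // invalid lead byte
    if (i + extra >= len) return false;        // truncated sequence
    for (size_t k = 1; k <= extra; ++k) {
      unsigned char cc = static_cast<unsigned char>(data[i + k]);
      if ((cc & 0xC0) != 0x80) return false;   // not a continuation byte
      cp = (cp << 6) | (cc & 0x3F);
    }
    static const unsigned long kMinForLength[] = {0, 0x80, 0x800, 0x10000};
    if (cp < kMinForLength[extra]) return false;     // overlong encoding
    if (cp > 0x10FFFF) return false;                 // out of Unicode range
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate half
    i += extra + 1;
  }
  return true;
}

// Pick the decoder the way the buffered proposal suggests: try UTF-8
// first, fall back to the locale-dependent legacy encoding on error.
std::string ChooseEncoding(const std::string& data,
                           const std::string& fallback = "windows-1252") {
  return IsValidUtf8(data) ? "UTF-8" : fallback;
}
```

Since decoding as windows-1252 can never fail (every byte maps to some character), validating UTF-8 up front is the only check needed to decide between the two.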

(In reply to Yuen Ho Wong from comment #27)
> I think Adam is saying Firefox should obey LC_*.

We're not going to take LC_* as indication of the encoding of file *content* (we do use it for interpreting file *paths* at present). On Windows, the analog of LC_* is never UTF-8, so any solution relying on LC_* would leave the problem unsolved on Windows.
(In reply to Henri Sivonen (:hsivonen) (away from Bugzilla until 2017-12-04) from comment #31)
> they do now if in exchange we address
> the problem that currently if you save (verbatim) a UTF-8 file whose
> encoding was given on the HTTP layer, it opens up the wrong way?

Personally it is not a problem because the Japanese auto-detector detects UTF-8. Waiting for EOF will degrade the experience for me.

Do we really have to wait for EOF to make the UTF-8 detection perfect?

> On Windows, the
> analog of LC_* is never UTF-8, so any solution relying on LC_* would leave
> the problem unsolved on Windows.

Windows 10 Insider Preview added a new option "Beta: Use Unicode UTF-8 for worldwide language support" to enable UTF-8 system locales, by the way.
(In reply to Adam Borowski from comment #30)
It seems chromium strictly adheres to RFC standards though,
https://bugs.chromium.org/p/chromium/issues/detail?id=785209
(In reply to Dan Jacobson from comment #33)
> It seems chromium strictly adheres to RFC standards though,
> https://bugs.chromium.org/p/chromium/issues/detail?id=785209

This issue is about HTTP and buggy HTML. Off-topic here.
Although this is not the final solution, at least nobody disagrees with this.
Keywords: leave-open
Comment on attachment 8934161 [details]
Bug 1071816 - Add a pref to fallback to UTF-8 for files from file: URLs.

https://reviewboard.mozilla.org/r/204220/#review210746
Attachment #8934161 - Flags: review?(hsivonen) → review+
Pushed by VYV03354@nifty.ne.jp:
https://hg.mozilla.org/integration/autoland/rev/ba48231d04a8
Add a pref to fallback to UTF-8 for files from file: URLs. r=hsivonen
A use case that might have become more common is looking at local text files that can have some rendering, such as Markdown. Extensions handle the transformation to HTML pretty well and Firefox does the nice rendering, but the encoding remains a problem.


As far as I can tell there is no workaround for extensions either:
- I found no function setting the charset of an open document.
- webRequest.onHeadersReceived is not triggered for local files, so no headers with a UTF-8 charset can be introduced.
- TextEncoder only supports UTF-8, so you can't re-encode the mis-decoded text back to its original bytes and then decode those as UTF-8.
Duplicate of this bug: 1468461
(In reply to Masatoshi Kimura [:emk] from comment #32)
> Personally it is not a problem because the Japanese auto-detector detects UTF-8.
[...]

In short, the current workaround: in about:config, set intl.charset.detector to "ja_parallel_state_machine".
Why do we support this *ONLY* for the file: scheme?

Don't non-document files (JavaScript, CSS) in all remote protocols that don't receive an explicit encoding in the Content-Type response header also need this?
If I directly visit a non-ASCII JavaScript or CSS file through a remote URL, say https://example.org/abc.css or https://example.org/abc.js, I will most likely see mojibake for UTF-8-encoded content, because in the real world most web servers aren't configured well enough to respond with an explicit encoding in Content-Type.
I suggest changing the bug title to cover remote non-document files also.
The discussion not limited to file:// is here (see also first comment): https://bugzilla.mozilla.org/show_bug.cgi?id=815551
Thank you for the information. I saw bug 815551 before, but it's not exactly what I meant.

Maybe I worded it poorly; I am mainly talking about defaulting the encoding to UTF-8 (just as the patch in this bug does), not just about auto-detection.

This bug's summary seems to be about detection, but its patch right now is about defaulting the encoding to UTF-8.
(In reply to 張俊芝(Zhang Junzhi) from comment #44)
> Why do we support this *ONLY* for the file: scheme?

Because the handling with "file:" is not standardized, while with HTTP, it is standardized. In particular, with HTTP, an unspecified charset means ISO-8859-1. So, HTTP should remain off-topic here, and could be discussed in another bug (but it might be wontfix).
Thank you for the information.

I didn't know that the standard says HTTP contents default to ISO-8859-1 if unspecified.

I just saw that Chrome defaults BOMless HTTP contents to UTF-8 (at least that's the case in Chrome on my Linux), so does that mean Chrome has implemented it in a non-standard way?
That's no longer true.  Old RFCs specified the HTTP default to be ISO-8859-1, but those RFCs were superseded a long time ago.

Current one is RFC 7231 which says:
#   The default charset of ISO-8859-1 for text media types has been
#   removed; the default is now whatever the media type definition says.
#   Likewise, special treatment of ISO-8859-1 has been removed from the
#   Accept-Charset header field.  (Section 3.1.1.3 and Section 5.3.3)

Thus, Chrome's default of UTF-8 is reasonable, and matches current practice.

Of course, that's for http://; this bug is for file://. But outside of Windows we do have a strongly specified setting, i.e. LANG/LC_CTYPE/LC_ALL, for which support for anything but UTF-8 is rapidly disappearing.  And the newest point release of Windows 10 finally allows setting the system locale's encoding to UTF-8 (Control Panel|Region|Administrative|Change system locale...|Use Unicode UTF-8 for worldwide language support), so even this last bastion is crumbling.
At least on Windows, Chrome does not default to UTF-8 for unlabelled HTTP plain text exactly because of the compatibility concern with legacy content. (But it defaults to UTF-8 for file:// plain text.)
(In reply to Adam Borowski from comment #51)
> That's no longer true.  Old RFCs specified HTTP default to be ISO-8859-1,
> but those RFCs have been superseded long time ago.

No, as I've just said in bug 815551, that's not long ago: only 4 years, while there are web pages / web server configurations that have not been rewritten since, e.g. https://members.loria.fr/PZimmermann/cm10/reglement (ISO-8859-1, no charset explicitly specified, last modified in 2010, i.e. 8 years ago).

This was *guaranteed* to work in the past. This should still be the case nowadays.

> Current one is RFC 7231 which says:
> #   The default charset of ISO-8859-1 for text media types has been
> #   removed; the default is now whatever the media type definition says.
> #   Likewise, special treatment of ISO-8859-1 has been removed from the
> #   Accept-Charset header field.  (Section 3.1.1.3 and Section 5.3.3)
> 
> Thus, Chrome's default of UTF-8 is reasonable, and matches current practice.

Current practice, perhaps, but not old, legacy practice, for which documents are still available.

But really, the only good solution for the new text-based documents (in particular) is to provide the charset on the server side (whatever their media type), which is always possible with HTTP, unlike with "file://".

Defaulting to UTF-8 for HTTP would really be bad, as the (client-side) user cannot control anything, unlike with "file://". If ISO-8859-1 is no longer the default chosen by the client, the charset should be autodetected.

Only for "file://", defaulting to UTF-8 may be OK, though autodetect would be much better.
To avoid multichanneling, it's better to keep discussion about HTTP in bug 815551.  In that case, autodetect may indeed be a good option.

Not so for file:// -- Firefox is the only program I'm aware of that assumes an ancient locale, at least on mainstream Unix systems.  People like me file bugs for inadequate UTF-8 support quite aggressively, and the work is pretty much done, with stragglers having been kicked out of Debian (for example, I transitioned aterm and rxvt to rxvt-unicode, and kept kterm and xvt out of Buster, to list terminal emulators only).  On the other hand, unlike a decade ago, I don't even bother implementing support for ancient locales in any of my new programs, and no one reported this as a problem.

Thus, for file:// on non-Windows, the following options would be reasonable, in my opinion:
1. LANG/LC_CTYPE/LC_ALL
2. hard-coding UTF-8
3. autodetect
in this order of preference.  Note that I consider autodetect to be worse than even hard-coded UTF-8, these days!

But the biggest problem is that, for ordinary users, UTF-8 currently doesn't work at all.  That new preference (intl.charset.fallback.utf8_for_file) is a good step forward, but for anyone except those who read this bug report, dig in about:config, or happen to hear about it somewhere, a hidden preference that defaults to false might as well not exist.  Thus, something that works in every other part of the user's system doesn't work in Firefox.
(In reply to Adam Borowski from comment #54)
> Thus, for file:// on non-Windows, the following options would be reasonable,
> in my opinion:
> 1. LANG/LC_CTYPE/LC_ALL

Yes, but in a clean way instead of looking at the values of these environment variables (because the locale names are not standardized). Something like nl_langinfo(CODESET).

> 2. hard-coding UTF-8
> 3. autodetect
> in this order of preference.  Note that I consider autodetect to be worse
> than even hard-coded UTF-8, these days!

By default, yes. Some UTF-8 files unfortunately have spurious ISO-8859-1 characters in them (or partly binary data), and autodetection may incorrectly regard an UTF-8 file as an ISO-8859-1 file. But it would be nice if a user could enable autodetect.
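Option 1 in the clean form Vincent describes would look something like this on POSIX systems: a sketch using nl_langinfo(CODESET) rather than parsing LANG/LC_* by hand (the function name CurrentCodeset is made up for this example; langinfo.h does not exist on Windows):

```cpp
#include <clocale>
#include <langinfo.h>  // POSIX-only; not available on Windows
#include <string>

// Ask the C library for the locale's character encoding instead of
// parsing LANG/LC_CTYPE/LC_ALL values by hand, since locale names
// are not standardized but the codeset query is.
std::string CurrentCodeset() {
  std::setlocale(LC_CTYPE, "");          // adopt the user's environment locale
  const char* cs = nl_langinfo(CODESET);
  return cs ? cs : "";                   // typically "UTF-8" on modern systems
}
```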
> 1. LANG/LC_CTYPE/LC_ALL
> 2. hard-coding UTF-8

There is no material difference between these two on non-Windows. Firefox doesn't even support file paths that aren't valid UTF-8, so Firefox doesn't even fully support (on non-Windows) the case of running with a non-UTF-8 locale.

I encourage putting the energy into implementing comment 20 or adding a front-end checkbox for the pref that landed in comment 39 instead of designing other solutions.

To be clear, I want to do comment 20. I just chronically don't have the time to. Maybe in the second half of this year...
Would it help if I got mainstream Linux distributions to officially declare non-UTF8 locales as unsupported?  (Currently, they're merely bitrotten in practice.)  That would simplify this issue on anything that's not Windows (I don't think OSX supports ancient locales anymore either).

So everyone but Windows users would be done here; Windows users are behind as the changeover only started (UTF-8 is merely supported but not even the default yet, and not on non-telemetried-to-death versions of Windows).
(In reply to Adam Borowski from comment #57)
> Would it help if I got mainstream Linux distributions to officially declare
> non-UTF8 locales as unsupported?

For this bug or Firefox purposes generally, no. The behavior here doesn't and won't depend on the glibc locale setting. If anything in Firefox still depends on the glibc codeset, it should be filed as a bug.

Maybe we should flip the default value for the pref that got landed in comment 39 if the behavior described in comment 20 doesn't get implemented soon, but looking at the glibc codeset is not coming back.
(In reply to Henri Sivonen (:hsivonen) from comment #56)
> To be clear, I want to do comment 20. I just chronically don't have the time
> to. Maybe in the second half of this year...

I expect to be able to work on this in 2018H2.
Assignee: nobody → hsivonen
Status: NEW → ASSIGNED
Depends on: 1511972
Depends on: 1512155
Thanks @hsivonen for working on it!
What does "2018H2" mean?
(In reply to ggrossetie from comment #62)
> What does "2018H2" mean ?

Second half of 2018.
Depends on: 1512713
Comment on attachment 9030009 [details] [diff] [review]
Decode unlabeled file: URLs as UTF-8 and redecode if there's an error in the first 50 MB, v5

Review of attachment 9030009 [details] [diff] [review]:
-----------------------------------------------------------------

> emk, do you have a Phabricator account?

No, its Windows support is still horrible.

> if there's an error in the first 50 MB

Nice compromise :)

::: parser/html/nsHtml5StreamParser.cpp
@@ +911,5 @@
> +            currentURI->SchemeIs("view-source", &isViewSource);
> +            if (isViewSource) {
> +              nsCOMPtr<nsINestedURI> nested = do_QueryInterface(currentURI);
> +              nsCOMPtr<nsIURI> temp;
> +              nested->GetInnerURI(getter_AddRefs(temp));

Why is this unwrapping only one-level? How about using NS_GetInnermostURI?

::: parser/nsCharsetSource.h
@@ +9,4 @@
>  #define kCharsetUninitialized 0
>  #define kCharsetFromFallback 1
>  #define kCharsetFromTopLevelDomain 2
> +#define kCharsetFromFileURLGuess 3

Let's change this to an enum while we are here.
(In reply to Masatoshi Kimura [:emk] from comment #70)
> > emk, do you have a Phabricator account?
> 
> No, its Windows support is still horrible.

Windows support isn't needed on the reviewer side.

> Why is this unwrapping only one-level? How about using NS_GetInnermostURI?

Using NS_GetInnermostURI now.

Also fixed the recordreplay reporting.

> ::: parser/nsCharsetSource.h
> @@ +9,4 @@
> >  #define kCharsetUninitialized 0
> >  #define kCharsetFromFallback 1
> >  #define kCharsetFromTopLevelDomain 2
> > +#define kCharsetFromFileURLGuess 3
> 
> Let's change this to an enum while we are here.

I'd rather change it to an enum in a separate bug. The number travels through XPIDL interfaces, for example.
Attachment #9030009 - Attachment is obsolete: true
Attachment #9030009 - Flags: review?(VYV03354)
Attachment #9030225 - Flags: review?(VYV03354)
Blocks: 977540
Comment on attachment 9030225 [details] [diff] [review]
Decode unlabeled file: URLs as UTF-8 and redecode if there's an error in the first 50 MB, v6

Review of attachment 9030225 [details] [diff] [review]:
-----------------------------------------------------------------

::: parser/html/nsHtml5StreamParser.cpp
@@ +905,5 @@
> +      nsCOMPtr<nsIURI> originalURI;
> +      rv = channel->GetOriginalURI(getter_AddRefs(originalURI));
> +      if (NS_SUCCEEDED(rv)) {
> +        bool originalIsResource;
> +        originalURI->SchemeIs("resource", &originalIsResource);

Are nested resource: URLs handled correctly? (such as view-source:resource:)
r=me with an answer or a fix.
Attachment #9030225 - Flags: review?(VYV03354) → review+
(In reply to Masatoshi Kimura [:emk] from comment #72)
> Are nested resource: URLs handled correctly? (such as view-source:resource:)

resource: and view-source:resource: go down different code paths. resource: is fast-tracked to UTF-8. view-source:resource: is subject to UTF-8 detection.

I think this is acceptable considering that:

 * Viewing the source of resource: URLs isn't something that end users are expected to do.
 * As long as our resource: data is actually in UTF-8, as it should be, the view-source:resource: case also ends up as UTF-8, just with a bit more buffering.

> r=me with an anwser or a fix.

Thanks.
Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5a6f372f62c1
Support loading unlabeled/BOMless UTF-8 text/html and text/plain files from file: URLs. r=emk.
Forgot to remove leave-open...
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Keywords: leave-open
Resolution: --- → FIXED
Target Milestone: --- → mozilla66
Henri, does it deserve a note in the developer release notes for 66? https://developer.mozilla.org/fr/docs/Mozilla/Firefox/Releases/66
Flags: needinfo?(hsivonen)
(In reply to Pascal Chevrel:pascalc from comment #77)
> Henri, does is deserve a note in developers release notes for 66?
> https://developer.mozilla.org/fr/docs/Mozilla/Firefox/Releases/66

Thanks for the reminder. Added.
Flags: needinfo?(hsivonen)
Filed bug 1513513 as a follow-up.
Depends on: 1514728
Duplicate of this bug: 1519680
Duplicate of this bug: 1534006
Depends on: 1538190
Duplicate of this bug: 1477983

See duplicate bug 1477983 for additional discussion about standards, how Firefox perpetuates confusion about character sets, about the need for good character set conversion, and about this error. These problems with UTF-8 support have gone on for too many years.

The spec forbids sniffing for UTF-8; it's not at our discretion. Even a Web-Platform test is present so that UTF-8 is not sniffed:
https://github.com/web-platform-tests/wpt/pull/14455

I guess this makes sense, as strings should be private (they could be encoded in some secret way that Firefox has no business knowing about). All the more reason that UTF-8 should be assumed and bugs like this one should be fixed promptly, not allowed to exist 5 years later, in my opinion.

(In reply to Masatoshi Kimura [:emk] from comment #84)
> The spec forbids sniffing for UTF-8; it's not at our discretion. Even a Web-Platform test is present so that UTF-8 is not sniffed:
> https://github.com/web-platform-tests/wpt/pull/14455

Do you mean that the WHATWG has decided that all text files are expected to be in windows-1252? Wow!

(In reply to Masatoshi Kimura [:emk] from comment #84)
> The spec forbids sniffing for UTF-8; it's not at our discretion. Even a Web-Platform test is present so that UTF-8 is not sniffed:
> https://github.com/web-platform-tests/wpt/pull/14455

To clarify, does that test specifically apply to content from file: URLs, rather than just to content served over the web?

(In reply to Smylers from comment #87)
> (In reply to Masatoshi Kimura [:emk] from comment #84)
> > The spec forbids sniffing for UTF-8; it's not at our discretion. Even a Web-Platform test is present so that UTF-8 is not sniffed:
> > https://github.com/web-platform-tests/wpt/pull/14455
> To clarify, does that test specifically apply to content from file: URLs, rather than just to content served over the web?

I believe it's meant to apply to non-file URLs only.

(In reply to Vincent Lefevre from comment #86)
> (In reply to Masatoshi Kimura [:emk] from comment #84)
> > The spec forbids sniffing for UTF-8; it's not at our discretion. Even a Web-Platform test is present so that UTF-8 is not sniffed:
> > https://github.com/web-platform-tests/wpt/pull/14455
> Do you mean that the WHATWG has decided that all text files are expected to be in windows-1252? Wow!

Unlabeled files are expected to be legacy files in legacy encodings (just like files that don't opt into standards-mode behavior with a doctype are expected to be legacy and quirky). Considering the configuration that test cases are run in (generic TLD and en-US browser localization), windows-1252 is the applicable legacy encoding.

Newly-created files are expected to be UTF-8 and labeled (just like newly-created files are expected to be non-quirky and to have the HTML5 doctype to say so). The file: URL case is different when it comes to the labeling expectation, because a) one of the labeling mechanisms (HTTP headers) is not available and b) we don't need to support incremental loading of local files.

You can't label a text file. And it makes no sense to apply a different encoding to local files (which on most computers shipped today are exclusively UTF-8) than to the same files carried over the net. The latter, even including legacy stuff, is already 94% UTF-8.

So if one of the competing standards bodies declares it wants Windows-1252, what about having a config option WHATWGLY_CORRECT that defaults to off, and doing a sane thing otherwise?

(In reply to Adam Borowski from comment #89)
> You can't label a text file.

On common non-file transports you can: with "; charset=utf-8" appended to the text/plain type. (Also, the BOM is an option on any transport, but has other issues.)

> And it makes no sense to apply a different encoding to local files (which on most computers shipped today are exclusively UTF-8) than to the same files carried over the net.

It does when a file carried over the net loses its Content-Type header when saved locally, and sniffing for UTF-8 locally is feasible in a way that doesn't apply to network streams in the context of the Web's incremental rendering requirements.

> So if one of the competing standards bodies declares it wants Windows-1252, what about having a config option WHATWGLY_CORRECT that defaults to off, and doing a sane thing otherwise?

Do I understand correctly that you'd want to assume UTF-8 for unlabeled content carried over the network and break unlabeled legacy content in order to give newly-authored content the convenience of not having to declare UTF-8?
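For concreteness, where an author does control the server, the "; charset=utf-8" labeling mentioned above can be configured once server-side; a hypothetical nginx snippet (directives from ngx_http_charset_module; the location path is illustrative):

```nginx
# Label plain-text responses as UTF-8 so browsers need not guess.
location /notes/ {
    default_type text/plain;
    charset utf-8;  # appends "; charset=utf-8" to the Content-Type header
}
```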

Adding that "; charset=utf-8" is not possible within most common user interfaces. And even if the content's author can actually access the web server's configuration, that requirement is an obscure detail that's not well publicized. And it shouldn't be: an assumption of plain ASCII or Windows-1252 might have been reasonable in the 1980s, but today it is well overdue to make such basics work out of the box.

> Do I understand correctly that you'd want to assume UTF-8 for unlabeled content carried over the network and break unlabeled legacy content in order to give newly-authored content the convenience of not having to declare UTF-8?

I'm afraid that yes. It is nasty to potentially break historic content, but the alternative is to 1. educate users, and 2. require them to do manual steps; 2. is unnecessary work while 1. is not feasible.

A today's node.php developer doesn't even know what "encoding" is.

> today it is well overdue to make such basics work out of the box

Doing this on the browser side is incompatible with not breaking existing content. If legacy content expects something unreasonable, reasonable behavior needs to be opt-in, resulting in boilerplate for all new content. UTF-8 isn't the only case. Other obvious cases are the standards mode and viewport behavior. So newly-authored HTML needs to start with <!DOCTYPE html><meta charset="utf-8"><meta content="width=device-width, initial-scale=1" name="viewport">. It's sad that new content bears this burden instead of old content, but that's how backward compatibility works.

If you are already at peace with putting <!DOCTYPE html> and <meta content="width=device-width, initial-scale=1" name="viewport"> in your template, I suggest just treating <meta charset="utf-8"> as yet another template bit and not trying to fight it.

It doesn't work for text/plain, which is something of an afterthought in the Web Platform compared to text/html. However, text/plain is also significantly less common than text/html, so it kinda works out in the aggregate even though it's annoying for people who actually do serve text/plain.

Web servers have tried to change their out-of-the-box experience. For example, it takes special effort to get nginx not to state any charset for text/html. This has its own set of problems when the Web server is upgraded without making the corresponding content changes. (Previously with Apache, similar issues led to browsers not trusting server-claimed text/plain to actually be text.)

In any case, this is off-topic for this bug.

> A today's node.php developer doesn't even know what "encoding" is.

At least with Node or PHP they are in control of their HTTP headers.

In response to comment 88, "Unlabeled files are expected to be legacy files in legacy encodings": this is an unrealistic expectation. While it certainly may apply to many novice programmers or users stuck with old data or applications, it most certainly does not apply to modern users or modern data, both of which already use UTF-8 as a de facto convention or standard. Clearly, as time goes on, support for legacy data must fall on the users of such data. They must be responsible for declaring it as using a Windows code page or whatever other nonstandard encoding it perpetuates.

Perhaps over-reaching assumptions like this one by senior Firefox developers and/or by standards organizations are at the root of Firefox's problems in the area of character encoding.

"What problems, plural, (as opposed to the single issue of whether unlabeled should mean UTF-8) are you referring to?" Henri, I don't normally deal with Firefox character encoding problems in my software work, so you need to excuse me if I can't remember all the issues, but one of them is the large body of complaints about Firefox > Menu > View > Text Encoding, and the submenus of Text Encoding that sometimes are available. You might try some Web searching to find other issues with character encoding. I would be very surprised if my memory is wrong and there are no other problems in this area.

I deal with text/plain files daily (local and remote) and every single one of them is UTF-8 and has been for at least a decade. It's long past time that Firefox defaults to this modern encoding (like other browsers do).

It is not practical to convert these text/plain files into html or to encode them using a legacy encoding as that would (a) be significant added effort required daily (by several people), and (b) break at least three other parts of the workflow.

(In reply to Chris McKenna from comment #95)
> I deal with text/plain files daily (local and remote) and every single one of them is UTF-8 and has been for at least a decade. It's long past time that Firefox defaults to this modern encoding (like other browsers do).

This bug is about local files only, and has been closed because the encoding is now detected by sniffing (see changeset 5a6f372f62c1 and Firefox 66 Release Notes #HTML). This makes sense for a whole lot of reasons cited above, most prominently that the transport does not convey the charset and that the full file is available. So your problem should be fixed here on Firefox 66+.

For other transports I’m sure bug 815551 is more appropriate, though the key arguments against using UTF-8 by default remain:

  • the charset can be specified:
    • either by the transport (i.e. Content-Type HTTP headers)
    • or in the file in case of html <meta charset="utf-8">
  • forcing new content to be more specific than old content is by design, that’s how backwards compatibility works. In other words, bluntly defaulting to UTF-8 breaks legacy content,
    • (in a different thread) magically guessing UTF-8 is encouraging people not to specify the charset,
  • autodetecting is not desirable over the network, for a number of reasons including:
    • incremental loading,
    • interoperability,
    • For html, WHATWG forbids it (see the dedicated Web-Platform test and the Note “User agents are generally discouraged from attempting to autodetect encodings for resources obtained over the network” in the specs).

If you can address these points (maybe specifically for text/plain?) I suggest you open a new bug about it.

While I hold to a more general point of view, that of not hesitating to move on to the most stable version of the best technology (and doing so as close to automatically as possible, given constraints such as dependencies on standards, hardware, affordability, and other software), I also see some very good reasons for maintaining back compatibility with a given set of software versions: one excellent use case is the company, charity, or organization that has as its primary function something other than technology and simply cannot afford to port their current working software to a new platform, or even a new set of software versions. We can even assume that such organizations are running donated software on donated hardware and are providing a vital service to the world. Given no expertise and no money to hire expertise, such organizations can easily get stuck with versions that stop working.

While I do not subscribe to the outmoded point of view that software should be automatically back-compatible forever, because new features will always and eventually be needed, I do recognize (as should anyone in software) that many organizations and individuals simply must continue using their current hardware and software long past its supposed end-of-life. Windows XP has many current users, in spite of the fact that many new software products will not run on XP.

An intelligent and balanced point of view recognizes the validity of the points made in comment 95, but adds that workarounds must always be provided for those who use obsolete technologies. Such workarounds can be as simple as providing an extension to a Firefox Web Developer tool (or some other semi-hidden part of Firefox) that allows for the manual (and possibly programmatic) selection of a (now) nonstandard character encoding for a web page (including obsolete Code Pages and the most-used obsolete general encodings). While casual users of Mozilla-based tools would not see such an option (since it won't matter and will just confuse them), it should be available somewhere to spare organizations and individuals a porting expense they may not be able to afford.

Thus, Firefox should always provide reasonable and optional compatibility features to help its less knowledgeable and less wealthy users, even if they make absolutely no sense to cutting-edge developers who are eager for the world to discover the latest in elegant and functional features that technology and standards can offer.

(In reply to Cimbali from comment #96)
> - (in a different thread) magically guessing UTF-8 is encouraging people not to specify the charset,

Since this has come up before, note that there are contexts where the user cannot specify the charset (and the MIME type), e.g. with VCS repository viewing (though with some VCS such as Subversion, one can specify the MIME type and the charset). However, I think that's more the job of the server to guess the MIME type and the charset: it has the local file, it can cache the result, and this ensures the same behavior with all clients. So, nothing to change in Firefox.
