Support loading BOMless UTF-8 text/plain files from file: URLs

RESOLVED FIXED in mozilla66

Status

defect
RESOLVED FIXED
Opened: 5 years ago
Last modified: 2 months ago

People

(Reporter: vincent-moz, Assigned: hsivonen)

Tracking

(Depends on 1 bug)

Version: 32 Branch
Target Milestone: mozilla66
Hardware: x86_64 Linux
Points: ---

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments, 5 obsolete attachments)

Reporter

Description

5 years ago
User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:32.0) Gecko/20100101 Firefox/32.0
Build ID: 20140917194002

Steps to reproduce:

Open a UTF-8 text file using a "file:" URL scheme.


Actual results:

The file is regarded as encoded in windows-1252 (according to what is displayed and to "View Page Info").


Expected results:

The file should be regarded as encoded in UTF-8. Alternatively, the encoding could be guessed.

The reason is that UTF-8 is much more used nowadays, and is the default in various contexts. I've also tried with other browsers (Opera, lynx, w3m and elinks), and all of them regarded the file as encoded in UTF-8. Only Firefox chose a wrong charset.

Note: Bug 815551 is a similar bug by its title, but for HTML ("HTML: Parser" component) with http URL's (see the first comments), which is obviously very different.
Reporter

Comment 1

5 years ago
I forgot to give a reference to bug 760050: there was auto-detection in the past but it didn't work correctly. So, if it is chosen to guess the encoding, UTF-8 should be favored.
Assignee

Comment 2

5 years ago
(In reply to Vincent Lefevre from comment #0)
> The file is regarded as encoded in windows-1252 (according to what is
> displayed and to "View Page Info").

This depends on the locale, but, yes, it's windows-1252 for most locales.

> The file should be regarded as encoded in UTF-8.

Simply saying that all file: URLs resolving to plain text files should be treated as UTF-8 might be too simple. We might get away with it, though. (What's your use case for opening local .txt files in a browser, BTW?)

> Alternatively, the encoding could be guessed.

Since we can assume local files to be finite and all bytes available soon, we could do the following:
 1) Add a method to the UTF-8 decoder to ask it if it has seen an error yet.
 2) When loading text/plain (or text/html; see below) from file:, start with the UTF-8 decoder and let the converted buffers queue up without parsing them. After each buffer, ask the UTF-8 decoder if it has seen an error already.
 3) If the UTF-8 decoder says it has seen an error, throw away the converted buffers, seek to the beginning of file and parse normally with the fallback encoding.
 4) If end of the byte stream is reached without the UTF-8 decoder having seen an error, let the parser process the buffer queue.

The hardest part would probably be the "seek to the beginning of file" bit, considering that we've already changed the state of the channel to deliver to a non-main thread. I'm not sure if channels for file: URLs support seeking. (Seeking with the existing channel instead of asking the docshell to renavigate avoids problems with scenarios where existing local HTML expects iframed content to behave a certain way as far as events go.)
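As a rough illustration of steps 1-4 above (a sketch only, in plain Rust standing in for the real decoder plumbing; load_local_text and decode_with_fallback are made-up helpers, not Gecko or encoding_rs API), the whole-file variant would look something like:

use std::fs;

// Sketch only: decide between UTF-8 and a legacy fallback for a whole local file.
// String::from_utf8 stands in for "a decoder that can report whether it saw an
// error"; decode_with_fallback is a hypothetical placeholder for the normal
// locale-dependent fallback decode (e.g. windows-1252).
fn load_local_text(path: &str) -> std::io::Result<String> {
    let bytes = fs::read(path)?; // local files are finite, so all bytes are available
    Ok(match String::from_utf8(bytes) {
        Ok(utf8_text) => utf8_text, // no malformed sequence anywhere in the file
        Err(err) => decode_with_fallback(&err.into_bytes()), // step 3: redecode
    })
}

fn decode_with_fallback(bytes: &[u8]) -> String {
    // Placeholder: ISO-8859-1-style mapping of each byte to U+0000..U+00FF.
    bytes.iter().map(|&b| b as char).collect()
}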

> The reason is that UTF-8 is much more used nowadays, and is the default in
> various contexts. I've also tried with other browsers (Opera, lynx, w3m and
> elinks), and all of them regarded the file as encoded in UTF-8. Only Firefox
> chose a wrong charset.

Lynx, w3m and elinks don't really have the sort of compatibility goals that Firefox has. Was the Presto-Opera or Blink-Opera? With fresh profile? What do Chrome and IE do?

> Note: Bug 815551 is a similar bug by its title, but for HTML ("HTML: Parser"
> component)

text/plain goes through the HTML parser, too. As far as implementation goes, if there's any level of content-based guessing, not having that code apply to HTML, too, would require an extra condition.

Any code to be written here would live in nsHtml5StreamParser.cpp.

> with http URL's (see the first comments), which is obviously very
> different.

(I still think we shouldn't add any new content-based detection for content loaded from http and that we should never use UTF-8 as the fallback for content loaded from http.)
Status: UNCONFIRMED → NEW
Component: File Handling → HTML: Parser
Ever confirmed: true
Summary: text/plain files with unknown charset should be regarded as encoded in UTF-8 → Support loading BOMless UTF-8 text/plain files from file: URLs
Reporter

Comment 3

5 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #2)
> (In reply to Vincent Lefevre from comment #0)
> > The file is regarded as encoded in windows-1252 (according to what is
> > displayed and to "View Page Info").
> 
> This depends on the locale, but, yes, it's windows-1252 for most locales.

I'm using a UTF-8 locale, so it's very surprising that, if this depends on the locale, an incompatible charset is chosen.

> Simply saying that all file: URLs resolving to plain text files should be
> treated as UTF-8 might be too simple. We might get away with it, though.
> (What's your use case for opening local .txt files in a browser, BTW?)

I sometimes retrieve parts of websites locally (possibly with some URL update and other corrections). This may include .txt files, linked from HTML files. The simplest solution is to view them via "file:" URL's. This is precisely how I discovered bug 760050 (which was a variant of this one): some .txt files (in English) contained the non-ASCII "§" character.

> The hardest part would probably be the "seek to the beginning of file" bit,
> considering that we've already changed the state of the channel to deliver
> to a non-main thread. I'm not sure if channels for file: URLs support
> seeking. (Seeking with the existing channel instead of asking the docshell
> to renavigate avoids problems with scenarios where existing local HTML
> expects iframed content to behave a certain way as far as events go.)

If this is a problem and the goal is to differentiate UTF-8 from windows-1252 (or ISO-8859-1 under Unix, since windows-1252 is not used in practice), I'd say that if the first non-ASCII bytes correspond to a valid UTF-8 sequence, then the file could be regarded as in UTF-8 with a very high probability, in particular if the locale is a UTF-8 one. This shouldn't need seeking because all the past bytes are ASCII, which is common to UTF-8 and windows-1252. Note that if there's an UTF-8 decoding error later, this doesn't necessarily mean that Firefox did something wrong, because UTF-8 files with invalid sequences also occur in practice.
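For concreteness, the suggested check could look roughly like the following (a plain Rust sketch using only std; the function name is made up, and this is not how Gecko would actually implement it):

fn first_non_ascii_sequence_is_valid_utf8(bytes: &[u8]) -> Option<bool> {
    // None: everything seen so far is ASCII, so the question doesn't arise yet.
    let start = bytes.iter().position(|&b| b >= 0x80)?;
    Some(match std::str::from_utf8(&bytes[start..]) {
        Ok(_) => true,
        // valid_up_to() > 0 means at least the first non-ASCII sequence decoded cleanly.
        Err(e) => e.valid_up_to() > 0,
    })
}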

> > The reason is that UTF-8 is much more used nowadays, and is the default in
> > various contexts. I've also tried with other browsers (Opera, lynx, w3m and
> > elinks), and all of them regarded the file as encoded in UTF-8. Only Firefox
> > chose a wrong charset.
> 
> Lynx, w3m and elinks don't really have the sort of compatibility goals that
> Firefox has.

What compatibility goals (at least in the case of a UTF-8 locale)?

> Was the Presto-Opera or Blink-Opera? With fresh profile?

Opera 12 under GNU/Linux. Not a fresh profile, but I use Opera only for some testing.

> What do Chrome and IE do?

Chromium uses windows-1252. But I wonder whether the issue has already been discussed.
There's no IE under GNU/Linux.
Assignee

Comment 4

5 years ago
(In reply to Vincent Lefevre from comment #3)
> (In reply to Henri Sivonen (:hsivonen) from comment #2)
> > (In reply to Vincent Lefevre from comment #0)
> > > The file is regarded as encoded in windows-1252 (according to what is
> > > displayed and to "View Page Info").
> > 
> > This depends on the locale, but, yes, it's windows-1252 for most locales.
> 
> I'm using a UTF-8 locale, so it's very surprising that, if this depends on
> the locale, an incompatible charset is chosen.

I meant the Firefox UI language--not the *nix system locale notion.

> > Simply saying that all file: URLs resolving to plain text files should be
> > treated as UTF-8 might be too simple. We might get away with it, though.
> > (What's your use case for opening local .txt files in a browser, BTW?)
> 
> I sometimes retrieve parts of websites locally (possibly with some URL
> update and other corrections). This may include .txt files, linked from HTML
> files. The simplest solution is to view them via "file:" URL's. This is
> precisely how I discovered bug 760050 (which was a variant of this one):
> some .txt files (in English) contained the non-ASCII "§" character.

I see. For this use case, it seems to me the problem applies also to text/html and not just to text/plain when using a spidering program that doesn't rewrite HTML to include <meta charset>.

> > The hardest part would probably be the "seek to the beginning of file" bit,
> > considering that we've already changed the state of the channel to deliver
> > to a non-main thread. I'm not sure if channels for file: URLs support
> > seeking. (Seeking with the existing channel instead of asking the docshell
> > to renavigate avoids problems with scenarios where existing local HTML
> > expects iframed content to behave a certain way as far as events go.)
> 
> If this is a problem and the goal is to differentiate UTF-8 from
> windows-1252 (or ISO-8859-1 under Unix, since windows-1252 is not used in
> practice), I'd say that if the first non-ASCII bytes correspond to a valid
> UTF-8 sequence, then the file could be regarded as in UTF-8 with a very high
> probability, in particular if the locale is a UTF-8 one. This shouldn't need
> seeking because all the past bytes are ASCII, which is common to UTF-8 and
> windows-1252.

What if the first non-ASCII byte sequence is a copyright sign in the page footer? If we're going to do this for local files, let's make use of the fact that we have all the bytes available.

> Note that if there's an UTF-8 decoding error later, this
> doesn't necessarily mean that Firefox did something wrong, because UTF-8
> files with invalid sequences also occur in practice.

Catering to such pages is *so* WONTFIX if this feature gets implemented at all.
 
> > > The reason is that UTF-8 is much more used nowadays, and is the default in
> > > various contexts. I've also tried with other browsers (Opera, lynx, w3m and
> > > elinks), and all of them regarded the file as encoded in UTF-8. Only Firefox
> > > chose a wrong charset.
> > 
> > Lynx, w3m and elinks don't really have the sort of compatibility goals that
> > Firefox has.
> 
> What compatibility goals (at least in the case of a UTF-8 locale)?

Being able to read old files that were readable before.

> > Was the Presto-Opera or Blink-Opera? With fresh profile?
> 
> Opera 12 under GNU/Linux. Not a fresh profile, but I use Opera only for some
> testing.

That's Presto-Opera. And no proof of the defaults if it's not known to be a fresh profile.
Reporter

Comment 5

5 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #4)
> I see. For this use case, it seems to me the problem applies also to
> text/html and not just to text/plain when using a spidering program that
> doesn't rewrite HTML to include <meta charset>.

At worst, for HTML, this can be done with another tool, a shell/Perl script or whatever. This shouldn't confuse any HTML reader.

> > If this is a problem and the goal is to differentiate UTF-8 from
> > windows-1252 (or ISO-8859-1 under Unix, since windows-1252 is not used in
> > practice), I'd say that if the first non-ASCII bytes correspond to a valid
> > UTF-8 sequence, then the file could be regarded as in UTF-8 with a very high
> > probability, in particular if the locale is a UTF-8 one. This shouldn't need
> > seeking because all the past bytes are ASCII, which is common to UTF-8 and
> > windows-1252.
> 
> What if the first non-ASCII byte sequence is a copyright sign in the page
> footer?

I don't know the internals, but ideally one should be able to send all the ASCII bytes ASAP, and deal with the encoding only when it matters. This is what I meant.

> If we're going to do this for local files, let's make use of the
> fact that we have all the bytes available.

If you have all the bytes, yes. Firefox doesn't seem to support files like file:///dev/stdin anyway (a rather useless case for text/plain files).

> > Note that if there's an UTF-8 decoding error later, this
> > doesn't necessarily mean that Firefox did something wrong, because UTF-8
> > files with invalid sequences also occur in practice.
> 
> Catering to such pages is *so* WONTFIX if this feature gets implemented at
> all.

I meant that it could just be obtained by accident. But conversely, if you have the ef bb bf byte sequence at the beginning of a file and an invalid UTF-8 sequence later, I wouldn't consider that recognizing the BOM sequence as UTF-8 was a mistake.

> > What compatibility goals (at least in the case of a UTF-8 locale)?
> 
> Being able to read old files that were readable before.

??? Most UTF-8 text/plain files were readable with old Firefox (actually I've just tried with Iceweasel 24.8.0 under Debian, but I doubt Debian has changed anything here), but it is no longer the case with new Firefox versions!

Note also the fact that Firefox also broke old HTML rendering with the new HTML5 parser.

> > Opera 12 under GNU/Linux. Not a fresh profile, but I use Opera only for some
> > testing.
> 
> That's Presto-Opera. And no proof of the defaults if it's not known to be a
> fresh profile.

I could do more tests later (that's on another machine), possibly with newer versions. But note that I haven't modified the profile explicitly.

BTW, in any case, the default settings wouldn't matter very much (except that I think that UTF-8 makes more sense, at least in UTF-8 *nix locales), as long as one can modify the settings to get UTF-8.

Comment 6

4 years ago
> > What compatibility goals (at least in the case of a UTF-8 locale)?
> Being able to read old files that were readable before.

While admirable, I'd be careful about sticking to that goal unconditionally.  UTF-8 is the future and the future is now.  Users will increasingly view FF as "broken" if FF can't render mainstream text files correctly (well, text files with multi-byte UTF-8 encodings).

By analogy, it's great to support IE6 compatibility because old websites were coded to assume IE6. But it's suicide to do so at the expense of correctly rendering modern websites.

Just my 2¢

Comment 7

4 years ago
Another use case: I create presentations as HTML. I regularly view them locally, prior to web-hosting. Each time I see weird characters due to file:// not recognizing encoding, despite it being specified.
Assignee

Comment 8

4 years ago
(In reply to LAFK from comment #7)
> Another use case: I create presentations as HTML. I regularly view them
> locally, prior to web-hosting. Each time I see weird characters due to
> file:// not recognizing encoding, despite it being specified.

What do you mean "despite it being specified"? Either the UTF-8 BOM (both text/html and text/plain) or <meta charset=utf-8> (text/html) should work. I.e. the ways of it "being specified" should work.

Comment 9

4 years ago
I'm using meta charset. However, please do ignore this: I failed to open " properly, and adding " fixed the problem. I found this out when I made a minimal file to replicate the bug to attach here.
Comment 10

I just had to dust off the "Text Encoding" hamburger menu widget for the first time in *years* due to this bug, so, here's my vote for doing *something* to detect UTF-8 in text/plain from file:.

(Use case: viewing a text file containing a bunch of emoji.  For whatever damn reason, neither 'less' nor Emacs recognize all the shiny new characters as printable, but Firefox does, once you poke the encoding.)

Comment 11

3 years ago
All platforms other than Windows include encoding in their locale settings.  Thus, if you don't like autodetection, please obey locale settings instead of hard-coding an archaic Windows code page.  Using the user's locale would be consistent with every single maintained program other than a browser these days.

Or, heck, just hard-code UTF-8.  A couple of years ago I gathered some data by mining Debian bug reports for locale settings included by reportbug.  Around 3% of bug reports had no configured locale, and around half a percent used non-UTF-8.  I believe that these numbers are still a large overestimation, as a text-mode program like reportbug is often run over ssh on a headless box where users often have no reason to care about locale.  A GUI-capable system, on the other hand, is almost always installed via an installer which doesn't even support non-UTF-8.

Comment 12

3 years ago
Another use case:
We cut'n'paste UTF-8 text from PuTTY terminals to a web server daily to document workflows or modifications, and have done so for many years now. Years ago a regression was introduced in Firefox; since then we always have to change the character set manually to view and print documents with special characters like umlauts correctly.

A proposal:
It may be difficult and a bunch of work to parse and guess the character set in a perfect manner. But why not just provide an about:config variable (e.g. plain_text.charset.overwrite=utf-8)? I guess this is easy to implement and would be of great help to some of the Firefox users. At least for the time being.
Assignee

Comment 13

3 years ago
Regarding the system locale:
The browser behavior here is aimed at being able to view legacy stuff saved to the local disk from the Web. It's not about primarily viewing content created on the local system. And in any case, the general approach that makes sense for HTML is sad for text/plain.

(In reply to Thomas Koch from comment #12)
> It may be difficult and a bunch of work to parse and guess the character set
> in a perfect manner. But why not just provide an about:config variable (e.g.
> plain_text.charset.overwrite=utf-8)? I guess this is easy to implement and
> would be of great help to some of the Firefox users. At least for the time
> being.

I'm open to reviewing an interim patch to that effect (for file: URLs only), but instead of writing that patch myself, I'm focusing on getting encoding_rs to a point that allows this bug to be fixed as described in comment 2. (Unlike Gecko's current uconv, which requires a caller either to handle malformed sequences itself or to lose knowledge of whether there were malformed sequences, encoding_rs allows the caller to know whether there were malformed sequences even when encoding_rs takes over handling the malformed sequences by emitting the REPLACEMENT CHARACTER.)
Depends on: encoding_rs
Reporter

Comment 14

3 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #13)
> Regarding the system locale:
> The browser behavior here is aimed at being able to view legacy stuff saved
> to the local disk from the Web.

The browser has already broken the view of legacy stuff by assuming windows-1252, while legacy stuff under Unix is ISO-8859-1 (control characters in the range 0x80-0x9f should have remained invisible, and at least copied back to ISO-8859-1, otherwise this breaks charset handling). So, there isn't much reason to continue to support legacy stuff, in particular if it breaks the view of current stuff. Nowadays, text/plain data saved from the web uses UTF-8 in most cases (or is just plain ASCII, which can be seen as a particular case of UTF-8 anyway), and BOM is almost never used (I think I have never seen it except for test cases). In the few cases where another encoding has been used, the user may have converted the file to UTF-8 after the download, because this is what other tools expect (either because these tools expect the encoding specified by the locale or because they always expect UTF-8). So, IMHO, even hardcoding the charset to UTF-8 would be OK, and certainly better than the current status.

Comment 15

3 years ago
Please respect your users and obey either (a) the current locale or (b) some "about:config" variable for plain text default encoding. Hard-coding UTF-8 without configurability should be considered a provisional option, only.

Comment 16

2 years ago
If you're prepared to build Firefox yourself, UTF-8 default fallback can be achieved with some trivial source code edits.
Works for html as well as plain text.
This has nothing to do with auto-detection or other esoteric stuff, it's just a plain [user] choice between defaulting to 'Western/windows-1252' or UTF-8 for an unidentified encoding.

1] Remove the block on setting UTF-8:

sed -i 's|(mFallback).*$|(mFallback)) {|;/UTF-8/d' dom/encoding/FallbackEncoding.cpp


2] a) Add Unicode option to Preferences|Content|Fonts & Colours|Advanced|"Fallback Text Encoding" drop-down menu:

sed -i '104i<!ENTITY languages.customize.Fallback.unicode     "Unicode">' browser/locales/en-US/chrome/browser/preferences/fonts.dtd
sed -i '272i\            <menuitem label="\&languages.customize.Fallback.unicode;"     value="UTF-8"/>' browser/components/preferences/fonts.xul

   b) ... and for any localization:

sed -i '104i<!ENTITY languages.customize.Fallback.unicode     "Unicode">' browser/chrome/browser/preferences/fonts.dtd


I did this as well, but it may not be necessary.
Having done 1], Unicode can be selected through the menu entry added in 2] to set UTF-8.
3] Set [about:config] option 'intl.charset.fallback.override' to default to UTF-8:

sed -i 's|fallback.override.*$|fallback.override",      "UTF-8");|' modules/libpref/init/all.js


This works for Firefox 51.0, for en-GB - I'm assuming the en-US fonts.dtd patch would work as well.

Comment 17

2 years ago
https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
The Unicode Standard permits the BOM in UTF-8,[3] but does not require or recommend its use.[4] 
See also https://bugs.chromium.org/p/chromium/issues/detail?id=703006 .
Assignee

Updated

2 years ago
Duplicate of this bug: 1407594
Assignee

Updated

2 years ago
Duplicate of this bug: 1419200
Assignee

Comment 20

2 years ago
Edited copypaste of bug 1407594 comment 15 (note the last paragraph!):

> So, what can be done for this bug to be actually properly fixed in next
> release or at least release after next, instead of bug staying open for
> years without progress?

1) Wait for the fix for bug 980904 to land.
2) Then volunteer to write the following code:
  * Activate the following machinery in the HTML parser iff the URL is a file: URL and the encoding wasn't decided from BOM or <meta>:
    * Instantiate a decoder for UTF-8 instead of the fallback encoding.
    * Block tree op flushes to the main thread. Keep them accumulating the way they accumulate during speculative parsing.
    * Whenever bytes arrive from the "network" (i.e. file) stash them into a linked list of buffer copies in nsHtml5StreamParser.
    * Decode the bytes.
    * If there was an error:
      - Throw away the tree op queue.
      - Instantiate a decoder for the non-UTF-8 encoding that would have been used normally.
      - Unblock tree op delivery to the main thread.
      - Replay the stashed-away bytes to the decoder and the tokenizer.
    * When the EOF is reached, notify the main thread about the encoding being UTF-8 as if it had been discovered via <meta>.
    * Deliver the pending tree ops to the main thread.

In the interim, I'd r+ a boolean off-by-default pref to assume UTF-8 for text/plain and <meta>less text/html from file: URLs only.
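For illustration only, the incremental part of the machinery above could be sketched like this in plain Rust (std only; Utf8Sniffer and its method names are made up, the tree-op queue and the real encoding_rs decoders are abstracted away, and the actual implementation would live in nsHtml5StreamParser.cpp):

struct Utf8Sniffer {
    stash: Vec<Vec<u8>>, // copies of every chunk, kept for possible replay
    pending: Vec<u8>,    // trailing bytes of a possibly incomplete sequence
    failed: bool,        // a definite malformed sequence has been seen
}

impl Utf8Sniffer {
    fn new() -> Self {
        Utf8Sniffer { stash: Vec::new(), pending: Vec::new(), failed: false }
    }

    // Called for each buffer arriving from the file.
    fn push(&mut self, chunk: &[u8]) {
        self.stash.push(chunk.to_vec());
        if self.failed {
            return;
        }
        self.pending.extend_from_slice(chunk);
        let (valid_up_to, definite_error) = match std::str::from_utf8(&self.pending) {
            Ok(_) => (self.pending.len(), false),
            // error_len() == None means "ran out of input mid-sequence", which can
            // still become valid once the next chunk arrives.
            Err(e) => (e.valid_up_to(), e.error_len().is_some()),
        };
        if definite_error {
            self.failed = true; // the caller would now replay self.stash with the fallback decoder
        } else {
            self.pending.drain(..valid_up_to); // keep only the incomplete tail
        }
    }

    // Called at EOF: true means the whole stream was valid UTF-8.
    fn is_utf8(&self) -> bool {
        !self.failed && self.pending.is_empty()
    }
}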

Comment 21

2 years ago
WRT bug 1419200 comment 2:

> No, the legacy on the Web still varies by locale, which we approximate from the TLD and, failing that, from the UI locale.

Suppose I request the *same* HTML file with no charset declared in <head> or Content-Type over HTTP, and both the TLD and the UI locale fail to detect that it's encoded in UTF-8, so the document will be decoded as windows-1251; but if I load the same document from file://, UTF-8 will be tried first and succeed in decoding the doc as UTF-8? If this is the case, a whole lot of authors are bound to be confused. Why can't HTTP try UTF-8 before the TLD/UI locale? Chrome and Safari seem to be fine with it.

Comment 22

2 years ago
(In reply to Henri Sivonen (:hsivonen) (away from Bugzilla until 2017-12-04) from comment #20)
>     * Block tree op flushes to the main thread. Keep them accumulating the
> way they accumulate during speculative parsing.
>     * Whenever bytes arrive from the "network" (i.e. file) stash them into a
> linked list of buffer copies in nsHtml5StreamParser.
>     * Decode the bytes.
>     * If there was an error:
>       - Throw away the tree op queue.
>       - Instantiate a decoder for the non-UTF-8 encoding that would have
> been used normally.
>       - Unblock tree op delivery to the main thread.
>       - Replay the stashed-away bytes to the decoder and the tokenizer.
>     * When the EOF is reached, notify the main thread about the encoding
> being UTF-8 as if it had been discovered via <meta>.
>     * Deliver the pending tree ops to the main thread.

Doesn't it cause OOM if the file is huge (such as a log file)? I'd be uncomfortable if I can't view huge files at all until Firefox reaches EOF.

Comment 23

2 years ago
> Doesn't it cause OOM if the file is huge (such as a log file)? I'd be
> uncomfortable if I can't view huge files at all until Firefox reaches EOF.

If the file is huge enough for this to be a concern, and you _still_ haven't seen a single decoding failure, it's beyond obvious that the file indeed is UTF-8 (or pure 7-bit which is also valid UTF-8).  In a recent discussion (among Mozilla and Chromium folks, I don't have a link), there was a debate whether looking at first 1024 bytes from the network is enough.  The biggest alternative anyone even mentioned was 4096.  Taking more from a locally-available file would be reasonable, but you really don't need to check a gigabyte.
Comment 24

2 years ago
(In reply to Adam Borowski from comment #23)
> Taking more from a
> locally-available file would be reasonable, but you really don't need to
> check a gigabyte.

But that's what Henri is proposing.

Comment 25

2 years ago
(In reply to Masatoshi Kimura [:emk] from comment #24)
> But that's what Henri is proposing.

Yeah, exactly -- I agree with what you said, I merely mentioned some size scales to make it more obvious what's wrong.

But I really don't get what the problem here is: on any Unix system, there is a well-defined setting (LC_CTYPE/LANG/LC_ALL) that all locale-aware programs but Firefox obey.  What you guys are trying to do is override the user's preferences, because some files might still use an ancient encoding.  This might make sense, but only in _addition_ to supporting UTF-8, not as the primary choice.  By trying to do "better", Firefox still fails the vast majority of cases.
Comment 26

2 years ago
(In reply to Adam Borowski from comment #25)
> But I really don't get what the problem here is: on any Unix system, there
> is a well-defined setting (LC_CTYPE/LANG/LC_ALL) that all locale-aware
> programs but Firefox obey.

Firefox does *not* obey UTF-8 locales. It has a built-in small list[1] and the fallback encoding is determined by that list. The list does not contain UTF-8.

[1] https://dxr.mozilla.org/mozilla-central/rev/4affa6e0a8c622e4c4152872ffc14b73103830ac/dom/encoding/localesfallbacks.properties

Comment 27

2 years ago
I think Adam is saying Firefox should obey LC_*. I think this is a much larger issue. LANG is used to determine the language used for UI chrome. The reason Firefox doesn't obey that is because, in this day and age, Firefox still builds locale-specific packages like it was the '80s, when the 500k or so from other language bundles mattered. To be fair, lots of programs allow you to override LANG/LC_CTYPE/LC_* locally; it's just that Firefox is a bit egregious. Respecting POSIX locale environment variables or not, I think Firefox should still default to UTF-8, given that most OSes such as macOS or Linux desktop environments never change LC_CTYPE; they all set it to UTF-8 and leave character encoding overrides to the individual applications.

So, I think it's okay that Firefox doesn't obey LANG for now due to the building issues, and I don't think disobeying LC_CTYPE matters much these days as long as Firefox defaults to UTF-8. As for the other LC_*, I don't see any use case for them, except maybe LC_ALL, but that only matters if Firefox looks at LANG and LC_CTYPE at all.
Reporter

Comment 28

2 years ago
(In reply to Yuen Ho Wong from comment #27)
> So, I think it's okay that Firefox doesn't obey LANG for now due to building
> issues, I don't think disobeying LC_CTYPE matter much these days as long as
> Firefox defaults to UTF-8. As to the other LC_*, I don't see any use case
> for them, except maybe LC_ALL, but it only matter if Firefox looks at LANG
> and LC_CTYPE at all.

If Firefox considers the locales, then it is important to honor LC_ALL when set, as it overrides LC_CTYPE, i.e. the considered charset should be the same as the one given by the "locale charmap" command. The order is, by precedence: LC_ALL, then LC_CTYPE, then LANG.

Comment 29

2 years ago
That's what I said. For now, I think Firefox should completely disregard LANG and LC_* but treat UTF-8 as the highest priority when considering a list of fallback character encodings. Or, bundle all the language strings into just one build per platform, so Firefox can start obeying LANG and LC_ALL. I don't know how feasible the latter option is, as I'm not familiar with Firefox's build process and Mozilla's build infrastructure.

Comment 30

2 years ago
Firefox obeys the language part of LANG/LC_ALL (at least as shipped by Debian), just not the encoding part.

By now, though, adding support for varying encodings is probably pointless.  I've gathered some data: of all Debian bug reports in 2016 that were filed with reportbug and included locale data, only 0.8% used something that's not UTF-8, and there's a strong downwards trend compared to previous data.  Bug reporters are strongly correlated with technical ability, and it takes some knowledge to set a modern Unix system to a non-UTF8 encoding, thus it's a safe assumption that the percentage of regular users with non-UTF8 is a good way less than 0.8%.

Here's a graph: Oct 2004 - Jan 2017, max=51%, 1 horizontal dot = 1 month.
⠀⠀⢠⠀⠀⠀⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⣄⣾⠀⠀⣦⣿⣿⡄⠀⡆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⢠⣿⣿⣄⡇⣿⣿⣿⡇⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣿⣿⣿⣿⣷⣿⣿⣿⣇⡄⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⡇⡀⢀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⢸⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣸⣿⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣶⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣧⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣾⣧⣤⣶⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇⡀⠀⠀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⣤⣾⣇⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣧⣿⣦⣠⣷⣄⣰⣠⡀⢀⠀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⣾⣷⣷⣠⡀⠀⠀⡀⢸⣿⡄⠀⠀⠀⠀⠀⡄⣠⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣾⣿⣾⣿⣿⣶⣴⣤⣠⣷⣷⣿⣧⣰⣴⡄⣤⣀⣀⣀⣀⣠⣄⣀⣄⢀⢰⡀⣀⣀⣀⢀

As text-displaying programs other than Firefox that don't obey system locale (or even hard-code UTF-8) have gone the way of the dodo, users are accustomed to locally stored files in encodings other than UTF-8 ending in mojibake.  It's only Firefox that goes the other way: mojibake for UTF-8, ok for _an_ ancient encoding that may or may not match the file.

Elsewhere, support for such encodings has been bit-rotting and is being dropped; yet currently Firefox doesn't even recognize what's becoming the only option.
Assignee

Comment 31

2 years ago
(In reply to Masatoshi Kimura [:emk] from comment #22)
> (In reply to Henri Sivonen (:hsivonen) (away from Bugzilla until 2017-12-04)
> from comment #20)
> >     * Block tree op flushes to the main thread. Keep them accumulating the
> > way they accumulate during speculative parsing.
> >     * Whenever bytes arrive from the "network" (i.e. file) stash them into a
> > linked list of buffer copies in nsHtml5StreamParser.
> >     * Decode the bytes.
> >     * If there was an error:
> >       - Throw away the tree op queue.
> >       - Instantiate a decoder for the non-UTF-8 encoding that would have
> > been used normally.
> >       - Unblock tree op delivery to the main thread.
> >       - Replay the stashed-away bytes to the decoder and the tokenizer.
> >     * When the EOF is reached, notify the main thread about the encoding
> > being UTF-8 as if it had been discovered via <meta>.
> >     * Deliver the pending tree ops to the main thread.
> 
> Doesn't it cause OOM if the file is huge (such as a log file)?

It would temporarily consume more RAM than now, yes. Do we really need to care about files whose size falls in the range where they don't OOM now but would OOM with my proposal?

> I'd be
> uncomfortable if I can't view huge files at all until Firefox reaches EOF.

How often do you view such files? Are huge local files really a use case we need to optimize for? Does it really matter if huge local log files take a bit more time to start displaying than they do now if in exchange we address the problem that currently if you save (verbatim) a UTF-8 file whose encoding was given on the HTTP layer, it opens up the wrong way?

I think we should solve the problem for files whose size is in the range of typical Web pages and let opening of huge log files work according to what follows from optimizing for normal-sized files.

(In reply to Yuen Ho Wong from comment #27)
> I think Adam is saying Firefox should obey LC_*.

We're not going to take LC_* as indication of the encoding of file *content* (we do use it for interpreting file *paths* at present). On Windows, the analog of LC_* is never UTF-8, so any solution relying on LC_* would leave the problem unsolved on Windows.
Comment 32

2 years ago
(In reply to Henri Sivonen (:hsivonen) (away from Bugzilla until 2017-12-04) from comment #31)
> they do now if in exchange we address
> the problem that currently if you save (verbatim) a UTF-8 file whose
> encoding was given on the HTTP layer, it opens up the wrong way?

Personally it is not a problem because the Japanese auto-detector detects UTF-8. Waiting for EOF will degrade the experience for me.

Do we really have to wait for EOF to make the UTF-8 detection perfect?

> On Windows, the
> analog of LC_* is never UTF-8, so any solution relying on LC_* would leave
> the problem unsolved on Windows.

Windows 10 Insider Preview added a new option "Beta: Use Unicode UTF-8 for worldwide language support" to enable UTF-8 system locales, by the way.

Comment 33

2 years ago
(In reply to Adam Borowski from comment #30)
It seems chromium strictly adheres to RFC standards though,
https://bugs.chromium.org/p/chromium/issues/detail?id=785209
Reporter

Comment 34

2 years ago
(In reply to Dan Jacobson from comment #33)
> It seems chromium strictly adheres to RFC standards though,
> https://bugs.chromium.org/p/chromium/issues/detail?id=785209

This issue is about HTTP and buggy HTML. Off-topic here.
Although this is not the final solution, at least nobody disagrees with this.
Keywords: leave-open
Assignee

Comment 37

2 years ago
mozreview-review
Comment on attachment 8934161 [details]
Bug 1071816 - Add a pref to fallback to UTF-8 for files from file: URLs.

https://reviewboard.mozilla.org/r/204220/#review210746
Attachment #8934161 - Flags: review?(hsivonen) → review+

Comment 38

2 years ago
Pushed by VYV03354@nifty.ne.jp:
https://hg.mozilla.org/integration/autoland/rev/ba48231d04a8
Add a pref to fallback to UTF-8 for files from file: URLs. r=hsivonen

Comment 40

a year ago
A use case that might have become more common is looking at local text files that can have some rendering, such as Markdown. Extensions handle the transformation to HTML pretty well, and Firefox does the nice rendering, but the encoding remains a problem.


As far as I can tell there is no work-around for extensions either:
- I found no function setting the charset of an open document.
- webRequest.onHeadersReceived is not triggered for local files so no headers with a utf-8 charset can be introduced.
- TextEncoder only supports UTF-8, so you can't re-encode to a bytestring and decode that as UTF-8.
Duplicate of this bug: 1468461
Reporter

Comment 42

11 months ago
(In reply to Masatoshi Kimura [:emk] from comment #32)
> Personally it is not a problem because the Japanese auto-detector detects UTF-8.
[...]

In short, the current workaround: in about:config, set intl.charset.detector to "ja_parallel_state_machine".
Comment 44

11 months ago
Why do we support this *ONLY* for the file: scheme?

Don't non-document files (JavaScript, CSS) on all remote protocols that don't receive an explicit encoding in the Content-Type response header also need this?
If I am directly visiting a non-ASCII JavaScript or CSS file through a remote URL, say https://example.org/abc.css or https://example.org/abc.js, I will most likely see mojibake for content that is encoded in UTF-8, because in the real world most web servers aren't configured well enough to respond with an explicit encoding in Content-Type.
I suggest changing the bug title to cover remote non-document files as well.

Comment 47

11 months ago
The discussion not limited to file:// is here (see also first comment): https://bugzilla.mozilla.org/show_bug.cgi?id=815551
Comment 48

11 months ago
Thank you for the information. I saw bug 815551 before, but it's not exactly what I meant.

Maybe my wording was poor; I am mainly talking about defaulting the encoding to UTF-8 (just as the patch in this bug does), not just about auto-detection.

This bug seems to be about detection judging from its summary, but its patch right now is about defaulting the encoding to UTF-8.
Reporter

Comment 49

11 months ago
(In reply to 張俊芝(Zhang Junzhi) from comment #44)
> Why do we support this *ONLY* for the file: scheme?

Because the handling with "file:" is not standardized, while with HTTP, it is standardized. In particular, with HTTP, an unspecified charset means ISO-8859-1. So, HTTP should remain off-topic here, and could be discussed in another bug (but it might be wontfix).
Comment 50

11 months ago
Thank you for the information.

I didn't know that the standard says HTTP contents default to ISO-8859-1 if unspecified.

I just saw that Chrome defaults BOMless HTTP contents to UTF-8 (at least that's the case in Chrome on my Linux), so does that mean Chrome has implemented it in a non-standard way?

Comment 51

11 months ago
That's no longer true.  Old RFCs specified HTTP default to be ISO-8859-1, but those RFCs have been superseded long time ago.

Current one is RFC 7231 which says:
#   The default charset of ISO-8859-1 for text media types has been
#   removed; the default is now whatever the media type definition says.
#   Likewise, special treatment of ISO-8859-1 has been removed from the
#   Accept-Charset header field.  (Section 3.1.1.3 and Section 5.3.3)

Thus, Chrome's default of UTF-8 is reasonable, and matches current practice.

Of course, that's for http://; this bug is for file:// -- but outside of Windows we do have a strongly specified setting, ie, LANG/LC_CTYPE/LC_ALL, for which support for anything but UTF-8 is rapidly disappearing.  And, newest point release of Windows 10 finally allows setting system locale's encoding to UTF-8 (Control Panel|Region|Administrative|Change system locale...|Use Unicode UTF-8 for worldwide language support), so even this last bastion is crumbling.
Comment 52

11 months ago
At least on Windows, Chrome does not default to UTF-8 for unlabelled HTTP plain text exactly because of the compatibility concern with legacy content. (But it defaults to UTF-8 for file:// plain text.)
Reporter

Comment 53

11 months ago
(In reply to Adam Borowski from comment #51)
> That's no longer true.  Old RFCs specified HTTP default to be ISO-8859-1,
> but those RFCs have been superseded long time ago.

No, as I've just said in bug 815551, that's not a long time ago. Only 4 years, while there are web pages / web server configurations that have not been rewritten since, e.g.: https://members.loria.fr/PZimmermann/cm10/reglement (ISO-8859-1, no charset explicitly specified, last modified in 2010, i.e. 8 years ago).

This was *guaranteed* to work in the past. This should still be the case nowadays.

> Current one is RFC 7231 which says:
> #   The default charset of ISO-8859-1 for text media types has been
> #   removed; the default is now whatever the media type definition says.
> #   Likewise, special treatment of ISO-8859-1 has been removed from the
> #   Accept-Charset header field.  (Section 3.1.1.3 and Section 5.3.3)
> 
> Thus, Chrome's default of UTF-8 is reasonable, and matches current practice.

Current practice, perhaps, but not old, legacy practice, for which documents are still available.

But really, the only good solution for the new text-based documents (in particular) is to provide the charset on the server side (whatever their media type), which is always possible with HTTP, unlike with "file://".

Defaulting to UTF-8 for HTTP would really be bad, as the (client-side) user cannot control anything, unlike with "file://". If ISO-8859-1 is no longer the default chosen by the client, the charset should be autodetected.

Only for "file://", defaulting to UTF-8 may be OK, though autodetect would be much better.

Comment 54

11 months ago
To avoid multichanneling, better to keep talk about http to bug 815551.  In that case, autodetect may indeed be a good option.

Not so for file:// -- Firefox is the only program I'm aware of that assumes an ancient locale, at least on mainstream Unix systems.  People like me file bugs for inadequate UTF-8 support quite aggressively, and the work is pretty much done, with stragglers having been kicked out of Debian (for example, I transitioned aterm and rxvt to rxvt-unicode, and kept kterm and xvt out of Buster, to list terminal emulators only).  On the other hand, unlike a decade ago, I don't even bother implementing support for ancient locales in any of my new programs, and no one reported this as a problem.

Thus, for file:// on non-Windows, the following options would be reasonable, in my opinion:
1. LANG/LC_CTYPE/LC_ALL
2. hard-coding UTF-8
3. autodetect
in this order of preference.  Note that I consider autodetect to be worse than even hard-coded UTF-8, these days!

But, the biggest problem is that, for ordinary users, UTF-8 currently doesn't work at all.  That new preference (intl.charset.fallback.utf8_for_file) is a good step forward, but save for those who read this bug report, dig in about:config, or happen to hear about it somewhere, a hidden preference that defaults to false might as well not be there.  Thus, something that works in every other bit of the user's system doesn't work in Firefox.
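(For anyone who wants to use that interim pref in the meantime: it can be flipped in about:config, or, assuming the usual user.js mechanism, with a line like user_pref("intl.charset.fallback.utf8_for_file", true); in the profile's user.js.)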
Reporter

Comment 55

11 months ago
(In reply to Adam Borowski from comment #54)
> Thus, for file:// on non-Windows, the following options would be reasonable,
> in my opinion:
> 1. LANG/LC_CTYPE/LC_ALL

Yes, but in a clean way instead of looking at the values of these environment variables (because the locale names are not standardized). Something like nl_langinfo(CODESET).

> 2. hard-coding UTF-8
> 3. autodetect
> in this order of preference.  Note that I consider autodetect to be worse
> than even hard-coded UTF-8, these days!

By default, yes. Some UTF-8 files unfortunately have spurious ISO-8859-1 characters in them (or partly binary data), and autodetection may incorrectly regard an UTF-8 file as an ISO-8859-1 file. But it would be nice if a user could enable autodetect.
Assignee

Comment 56

11 months ago
> 1. LANG/LC_CTYPE/LC_ALL
> 2. hard-coding UTF-8

There is no material difference between these two on non-Windows. Firefox doesn't even support file paths that aren't valid UTF-8, so Firefox doesn't even fully support (on non-Windows) the case of running with a non-UTF-8 locale.

I encourage putting the energy into implementing comment 20 or adding a front-end checkbox for the pref that landed in comment 39 instead of designing other solutions.

To be clear, I want to do comment 20. I just chronically don't have the time to. Maybe in the second half of this year...

Comment 57

11 months ago
Would it help if I got mainstream Linux distributions to officially declare non-UTF8 locales as unsupported?  (Currently, they're merely bitrotten in practice.)  That would simplify this issue on anything that's not Windows (I don't think OSX supports ancient locales anymore either).

So everyone but Windows users would be done here; Windows users are behind as the changeover only started (UTF-8 is merely supported but not even the default yet, and not on non-telemetried-to-death versions of Windows).
Assignee

Comment 58

11 months ago
(In reply to Adam Borowski from comment #57)
> Would it help if I got mainstream Linux distributions to officially declare
> non-UTF8 locales as unsupported?

For this bug or Firefox purposes generally, no. The behavior here doesn't and won't depend on the glibc locale setting. If anything in Firefox still depends on the glibc codeset, it should be filed as a bug.

Maybe we should flip the default value for the pref that got landed in comment 39 if the behavior described in comment 20 doesn't get implemented soon, but looking at the glibc codeset is not coming back.
Assignee

Comment 59

11 months ago
(In reply to Henri Sivonen (:hsivonen) from comment #56)
> To be clear, I want to do comment 20. I just chronically don't have the time
> to. Maybe in the second half of this year...

I expect to be able to work on this in 2018H2.
Assignee

Updated

6 months ago
Assignee: nobody → hsivonen
Status: NEW → ASSIGNED
Assignee

Updated

6 months ago
Depends on: 1511972
Assignee

Updated

6 months ago
Depends on: 1512155

Comment 62

6 months ago
Thanks @hsivonen for working on it!
What does "2018H2" mean ?
Assignee

Comment 64

6 months ago
(In reply to ggrossetie from comment #62)
> What does "2018H2" mean ?

Second half of 2018.
Assignee

Updated

6 months ago
Depends on: 1512713
Comment 70

Comment on attachment 9030009 [details] [diff] [review]
Decode unlabeled file: URLs as UTF-8 and redecode if there's an error in the first 50 MB, v5

Review of attachment 9030009 [details] [diff] [review]:
-----------------------------------------------------------------

> emk, do you have a Phabricator account?

No, its Windows support is still horrible.

> if there's an error in the first 50 MB

Nice compromise :)

::: parser/html/nsHtml5StreamParser.cpp
@@ +911,5 @@
> +            currentURI->SchemeIs("view-source", &isViewSource);
> +            if (isViewSource) {
> +              nsCOMPtr<nsINestedURI> nested = do_QueryInterface(currentURI);
> +              nsCOMPtr<nsIURI> temp;
> +              nested->GetInnerURI(getter_AddRefs(temp));

Why is this unwrapping only one-level? How about using NS_GetInnermostURI?

::: parser/nsCharsetSource.h
@@ +9,4 @@
>  #define kCharsetUninitialized 0
>  #define kCharsetFromFallback 1
>  #define kCharsetFromTopLevelDomain 2
> +#define kCharsetFromFileURLGuess 3

Let's change this to an enum while we are here.
Assignee

Comment 71

5 months ago
(In reply to Masatoshi Kimura [:emk] from comment #70)
> > emk, do you have a Phabricator account?
> 
> No, its Windows support is still horrible.

Windows support isn't needed on the reviewer side.

> Why is this unwrapping only one-level? How about using NS_GetInnermostURI?

Using NS_GetInnermostURI now.

Also fixed the recordreplay reporting.

> ::: parser/nsCharsetSource.h
> @@ +9,4 @@
> >  #define kCharsetUninitialized 0
> >  #define kCharsetFromFallback 1
> >  #define kCharsetFromTopLevelDomain 2
> > +#define kCharsetFromFileURLGuess 3
> 
> Let's change this to an enum while we are here.

I'd rather not change it to an enum as part of this bug. The number travels through XPIDL interfaces, for example.
Attachment #9030009 - Attachment is obsolete: true
Attachment #9030009 - Flags: review?(VYV03354)
Attachment #9030225 - Flags: review?(VYV03354)
Assignee

Updated

5 months ago
Blocks: 977540
Comment 72

5 months ago
Comment on attachment 9030225 [details] [diff] [review]
Decode unlabeled file: URLs as UTF-8 and redecode if there's an error in the first 50 MB, v6

Review of attachment 9030225 [details] [diff] [review]:
-----------------------------------------------------------------

::: parser/html/nsHtml5StreamParser.cpp
@@ +905,5 @@
> +      nsCOMPtr<nsIURI> originalURI;
> +      rv = channel->GetOriginalURI(getter_AddRefs(originalURI));
> +      if (NS_SUCCEEDED(rv)) {
> +        bool originalIsResource;
> +        originalURI->SchemeIs("resource", &originalIsResource);

Are nested resource: URLs handled correctly? (such as view-source:resource:)
r=me with an answer or a fix.
Attachment #9030225 - Flags: review?(VYV03354) → review+
Assignee

Comment 73

5 months ago
(In reply to Masatoshi Kimura [:emk] from comment #72)
> Are nested resource: URLs handled correctly? (such as view-source:resource:)

resource: and view-source:resource: go down different code paths. resource: is fast-tracked to UTF-8. view-source:resource: is subject to UTF-8 detection.

I think this is acceptable considering that:

 * Viewing the source of resource: URLs isn't something that end users are expected to do.
 * As long as our resource: data is actually in UTF-8, as it should be, the view-source:resource: case also ends up as UTF-8--just with a bit more buffering.

> r=me with an answer or a fix.

Thanks.

Comment 74

5 months ago
Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5a6f372f62c1
Support loading unlabeled/BOMless UTF-8 text/html and text/plain files from file: URLs. r=emk.
Forgot to remove leave-open...
Status: ASSIGNED → RESOLVED
Last Resolved: 5 months ago
Keywords: leave-open
Resolution: --- → FIXED
Target Milestone: --- → mozilla66
Comment 77

5 months ago
Henri, does it deserve a note in the developer release notes for 66? https://developer.mozilla.org/fr/docs/Mozilla/Firefox/Releases/66
Flags: needinfo?(hsivonen)
Assignee

Comment 78

5 months ago
(In reply to Pascal Chevrel:pascalc from comment #77)
> Henri, does it deserve a note in the developer release notes for 66?
> https://developer.mozilla.org/fr/docs/Mozilla/Firefox/Releases/66

Thanks for the reminder. Added.
Flags: needinfo?(hsivonen)
Assignee

Comment 79

5 months ago
Filed bug 1513513 as a follow-up.
Assignee

Updated

5 months ago
Depends on: 1514728

Updated

4 months ago
Duplicate of this bug: 1519680
Duplicate of this bug: 1534006

Updated

2 months ago
Depends on: 1538190