Closed Bug 815551 Opened 12 years ago Closed 10 years ago

Autodetect UTF-8 by default

Categories

(Core :: DOM: HTML Parser, enhancement)


Tracking


VERIFIED WONTFIX

People

(Reporter: smontagu, Unassigned)

References

Details

(Keywords: intl)

http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding paragraph 8 says:

 "The UTF-8 encoding has a highly detectable bit pattern. Documents that contain bytes with values greater than 0x7F which match the UTF-8 pattern are very likely to be UTF-8, while documents with byte sequences that do not match it are very likely not. User-agents are therefore encouraged to search for this common encoding."

Implementing this would require some trivial changes to the autodetector to allow a "utf-8 only" option, plus making that the default in all.js.
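
For illustration, a minimal sketch of such a "utf-8 only" check over a byte buffer (Python, with names invented here; not actual Gecko code). A real detector inside the parser would additionally have to work incrementally and tolerate multi-byte sequences split across network buffers.

def looks_like_utf8(data: bytes) -> bool:
    """Return True if `data` contains at least one byte above 0x7F and
    every such byte participates in a well-formed UTF-8 sequence (the
    "highly detectable bit pattern" the spec refers to). Pure ASCII
    returns False: there is nothing to detect, because ASCII decodes the
    same way in UTF-8 and in windows-1252."""
    if not any(b > 0x7F for b in data):
        return False
    try:
        data.decode("utf-8")  # strict decode: fails on any ill-formed sequence
        return True
    except UnicodeDecodeError:
        return False

# Example: a curly apostrophe (U+2019) encoded as UTF-8 vs. as windows-1252
print(looks_like_utf8("it\u2019s".encode("utf-8")))          # True
print(looks_like_utf8("it\u2019s".encode("windows-1252")))   # False
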
I think we shouldn’t do this with the general autodetector code, because it can cause late reloads and we shouldn’t make such performance-sensitive badness look OK to Web authors. Instead, we should make sure Web authors (even those who don’t read Web performance guides in detail) are incented to declare UTF-8. (It’s probably a bad idea that the existing detectors support detecting UTF-8 instead of just detecting among legacy encodings.)

Autodetecting UTF-8 within the first 1024 bytes *might* make sense, but I think adding more magic DWIM around encoding stuff is presumptively bad.
(In reply to Henri Sivonen (:hsivonen) from comment #1)
> Autodetecting UTF-8 within the first 1024 bytes *might* make sense, but I
> think adding more magic DWIM around encoding stuff is presumptively bad.

Actually, detecting within the first 1024 bytes isn’t good, either, because detection would mysteriously succeed or fail on English-language pages depending on how soon the first apostrophe or dash makes its appearance.

If we really wanted to add more magic (and I’m inclined to say we shouldn’t want to), I think having a separate detection step would be the wrong way to go. The better way would be to have a multiencoding decoder: ASCII bytes would be decoded as ASCII and then upon seeing the first non-ASCII byte, the decoder would buffer up a bit of data to make its guess between Windows-1252 and UTF-8, lock its decision and then continue in either the Windows-1252 or UTF-8 mode.
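
For illustration, a rough sketch of that multiencoding-decoder idea (Python; the function name and probe size are made up for this sketch and do not correspond to anything Gecko implements):

import codecs

def decode_with_late_guess(data: bytes, probe_size: int = 1024) -> str:
    """Sketch of the "multiencoding decoder" idea above: treat the leading
    ASCII run as ASCII, and only at the first non-ASCII byte look at a
    bounded window to choose between UTF-8 and windows-1252, then lock
    that choice for the rest of the stream. `probe_size` is an arbitrary
    illustrative bound."""
    first_non_ascii = next((i for i, b in enumerate(data) if b > 0x7F), None)
    if first_non_ascii is None:
        return data.decode("ascii")  # pure ASCII: nothing to guess

    probe = data[first_non_ascii:first_non_ascii + probe_size]
    # An incremental decode with final=False tolerates a multi-byte sequence
    # truncated by the probe window; only genuinely ill-formed bytes raise.
    try:
        codecs.getincrementaldecoder("utf-8")().decode(probe, final=False)
        encoding = "utf-8"
    except UnicodeDecodeError:
        encoding = "windows-1252"

    # Decision is locked: decode the whole document with the chosen encoding.
    return data.decode(encoding, errors="replace")

The point of the design is that the windows-1252/UTF-8 decision is made once, at the first non-ASCII byte, and never revisited, so there is no late reload.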

But I think we shouldn’t do this without a super-compelling reason.
I should note that autodetection and letting authors not declare their encodings is what got us in the current mess.
Due to the conflict between autodetection and incremental loading, I think this is WONTFIX for network-originating documents. However, with file: URLs, we know the data is finite and available soon, so I think it would make sense to do this for file: URLs.

smontagu, do you prefer me morphing this bug to file: URL-only or WONTFIXing and filing a new bug about file: URLs only?
Actually, by your own logic that "adding more magic DWIM around encoding stuff is presumptively bad", I would be against having different detection behaviour for file: and network URLs. It can already be confusing that the "same" document served from the network or downloaded and reloaded locally can behave differently because of the absence of meta-information from the HTTP headers. Let's not vary autodetection behaviour as well.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX
Agreed. Adding magic solely for file URLs isn't worth it.
Status: RESOLVED → VERIFIED
I'm considering reopening this bug. We shouldn't punish users just because a small population of those users may also be Web authors.
I'm still against doing this on the Web, but I want to reopen for file: URLs.
How often do users open file URLs that are HTML in Firefox when they are not developers? We should first get data on that I think.
I'm not going to limit this to file: URLs. I will not limit this to HTML either. Rather, plain-text will be more problematic.
I'm motivated primarily by bug 910192 comment 41. I will not agree to remove the UTF-8 detector from the ja detector. We should have a less hacky way to auto-detect UTF-8 for non-Japanese locales.
We cannot base your decisions on anecdotal evidence.
Then please base the spec encouragement on something better than anecdotal evidence. Otherwise, please remove the nonsensical encouragement from the spec.
I'm not sure what you're talking about. I meant our (not your) above by the way, sorry for that.
Ah, I missed you filed a bug. Yes that should be fixed, thanks for filing!
(In reply to Anne (:annevk) from comment #9)
> How often do users open file URLs that are HTML in Firefox when they are not
> developers? We should first get data on that I think.

I do it every day, opening HTML email parts from Mutt.

This often initially opens with the wrong encoding in Firefox (and then I have to find the menu item for tweaking it, causing a reload, before Mutt deletes the temporary .html file that it only created for passing to Firefox).

(In reply to  Simon Montagu :smontagu from comment #5)
> I would be against having different detection behaviour for file: and network URLs. It can already be confusing that the "same" document served from the network or downloaded and reloaded locally can have different behaviour because of the absence of meta-information from the HTTP headers. Let's not vary autodetection behaviour as well.

I think it's precisely because of the absence of the HTTP Content-Type: header
with file: URLs that it _does_ make sense to do more autodetection with them.
It means that pages which work fine over HTTP, because they are served with the
UTF-8 Content-Type: header, will also work locally. Currently they work over
the web, but break when viewed directly.
I've submitted bug 1071816 for the particular case of text/plain files with unknown charset (e.g. "file:" URL scheme).
(In reply to Vincent Lefevre from comment #16)
> I've submitted bug 1071816 for the particular case of text/plain files with
> unknown charset (e.g. "file:" URL scheme).

Let's debate the file: case there. The http: case remains a clear WONTFIX.
(In reply to Henri Sivonen (:hsivonen) from comment #17)
> (In reply to Vincent Lefevre from comment #16)
> > I've submitted bug 1071816 for the particular case of text/plain files with
> > unknown charset (e.g. "file:" URL scheme).
> 
> Let's debate the file: case there.

Bug 1071816 only covers text/plain, so it doesn't include opening local HTML files — see comment #15 above. This can also happen if you do ‘Save Page As’ on somebody else's web page in order to debug it, so you then have a local .html file without a Content-Type: header.

> The http: case remains a clear WONTFIX.

Sure. Can this bug be re-opened for a decision on local .html files (the summary doesn't say this bug is specific to http:), or should I create another new bug for that case?
(In reply to Henri Sivonen (:hsivonen) from comment #17)
> (In reply to Vincent Lefevre from comment #16)
> > I've submitted bug 1071816 for the particular case of text/plain files with
> > unknown charset (e.g. "file:" URL scheme).
> 
> Let's debate the file: case there. The http: case remains a clear WONTFIX.

Yes, but the charset is *always* known in the "http:" case (either explicitly or implicitly). There may be other URL schemes with unknown charset ("ftp:"?), though I don't use them.

(In reply to Smylers from comment #18)
> Bug 1071816 only covers text/plain,

Yes, this was intentional, because unlike with HTML files, there's no way to specify a charset (a BOM in text/plain files being controversial, as it yields various problems with other tools working on text/plain files, and thus is not used in practice).

The case of HTML files could still be discussed in a bug other than 1071816, because it's still annoying to have to modify HTML content retrieved from an external source (downloaded, received by e-mail...). In such a case, bug 720664 should also be revisited, otherwise this won't completely solve the problem.
Almost a year has passed... Is there a way to

1. Open local plain-text files as UTF-8 by default.
2. Open local HTML without meta tags as UTF-8 by default (those I get by saving pages which are served with the correct HTTP headers, but those are lost in the process).
3. Show FTP filenames in UTF-8 by default.

?

All of these could be done with an older version of Firefox by choosing UTF-8 as the default character encoding. But no more. The world has moved to UTF-8 almost everywhere, but Firefox is still stuck in a 20th-century legacy-encodings mindset by default!? Why do I always have to change the encoding by hand now? Why is there no option in about:config to override this?
I can't view text/html MIME parts in Firefox anymore because of this issue.
It's quite consistent with the latest Mozilla orientation towards the "average" user and "typical" use cases, which don't include plain text, local pages, FTP, etc.
This comment seems off-topic, but still somewhat related to this bug:

Why don't we simply default the encoding to UTF-8 for any file without a BOM? Chrome has already done so.

Nowadays, I highly doubt that files without a BOM are encoded in anything other than UTF-8. I think we are more likely to see mojibake if we don't default the encoding of BOMless files to UTF-8.
Files without a BOM are likely to be encoded in UTF-8, at least on the Internet, I think.
I would like to hear your opinions on this before I file a new bug for defaulting BOMless files to UTF-8.
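
For reference, BOM sniffing itself is trivial; the debate here is only about what to assume when there is no BOM. A minimal sketch (Python, illustrative only):

def sniff_bom(head: bytes):
    """Return the encoding signalled by a byte order mark at the start of
    `head`, or None for the BOMless case discussed above. These are the
    three BOMs HTML's encoding sniffing recognizes."""
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if head.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if head.startswith(b"\xfe\xff"):
        return "utf-16-be"
    return None  # no BOM: this is where the UTF-8-vs-legacy default applies
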
As I just mentioned in bug 1071816 (which discusses file://, not http://), the default for http has been changed by RFC 7231 from the historical ISO-8859-1 to unspecified. Thus you need to either autodetect or assume an encoding; so if we have a reason not to autodetect, following Chrome and assuming UTF-8 is reasonable.

> Yes, but the charset is *always* known in the "http:" case

Per the above (RFC 7231), the charset for http is specifically defined to be no longer known.
RFC 7231 is rather new (4 years old, while it's common to see web pages that haven't been modified for longer than that). Forcing the default charset to UTF-8 is likely to break legacy web servers. Here's an example:

  https://members.loria.fr/PZimmermann/cm10/reglement

No charset is provided in the HTTP headers and this text/plain file uses ISO-8859-1. Last modified on May 17, 2010.

So, autodetect is a must.
BTW, I think that if the intent was to default to UTF-8, RFC 7231 would have said that the default is UTF-8, not that the charset is unknown. So, I think that the idea behind RFC 7231 is that there should be alternate ways to detect the charset from the client side.
Autodetect would be the best idea, yeah. According to W3Techs, global usage share is 91.7% for UTF-8 and 4.0% for ISO-8859-1, so assuming ISO-8859-1 is wrong in 96% of cases (and I guess that most of that 4.0% is really 7-bit ASCII only).

As for the switch, you can't change the standard from "hard-code A" to "hard-code B" immediately, so they had no other option than to legislate "unknown" for the time being.

Thus, by my reading of the standard, autodetect is indeed the only "legal" option. Only if it can't be done (as was argued 6 years ago) can a default be assumed, but in that case it shouldn't be ISO-8859-1 anymore.
Maybe it's at least reasonable to default to UTF-8 for certain MIME types, for instance CSS and JavaScript.

Files of those types without a specified encoding are always encoded in the same encoding as the HTML document that includes them.

But few HTML documents are encoded in ISO-8859-1 nowadays. So if you visit non-ASCII JS or CSS files directly, chances are you will see mojibake.
(In reply to Adam Borowski from comment #29)
> Autodetect would be the best idea, yeah.  According to W3Techs, global use
> share is 91.7% for UTF-8, 4.0% for ISO-8859-1, thus assuming ISO-8859-1 is
> wrong in 96% cases (and I guess that most of that 4.0% is really 7-bit ASCII
> only).

But this is for all documents, not just those with an actually unknown charset (no declared charset)? A large majority of UTF-8 documents probably declare the charset explicitly, while this is probably less the case for ISO-8859-1 ones.

Note also that for some users, among the documents with no declared charset, this will be ISO-8859-1 most of the time.

> As for the switch, you can't change the standard from "hard-code A" to
> "hard-code B" immediately, thus they had no other option than to legislate
> "unknown" for the time being.

Because they take into account legacy web servers, I assume. But then, this should still be the case for user agents.

> Thus, by my reading of the standard, autodetect is indeed the only "legal"
> option.  Only if it can't be done (as was argued 6 years ago),

6 years ago, the default specified by the RFCs was ISO-8859-1. Now this has changed, and autodetect should be reconsidered with a high priority.

> a default can be assumed, but in this case it shouldn't be ISO-8859-1 anymore.

I disagree, or at least this should be configurable on a per-domain basis. Otherwise this will be a major regression for some users.

Note also that if some major browser changes to UTF-8, one risk is that webmasters will not fix the configuration of their servers (the correct solution for new documents being to declare the charset), so that users will have more and more work selecting the charset on their side.
(In reply to 張俊芝(Zhang Junzhi) from comment #30)
> Maybe at least it's reasonable to default utf-8 to some certain mime types,
> for instance, CSS, Javascript.

Yes, perhaps for such MIME types. I would guess that old HTML documents don't use CSS or JavaScript, or use them with ASCII only.
The whole point of UTF-8, besides being able to mix charsets, is to not require such metadata anymore.
(In reply to Adam Borowski from comment #33)
> The whole point of UTF-8, beside being able to mix charsets, is to not
> require such metadata anymore.

Yes, but:
* There are still old documents in other encodings. They could be converted to UTF-8, but this would take time and/or could break things (in particular, opening them locally in Firefox, due to bug 1071816!).
* There are files that must not be converted to UTF-8. This includes some Perl scripts, as converting them to UTF-8 would change their semantics, and much work may be needed to keep the same behavior.

And if UTF-8 is always used on some web server, it should be easy to declare the UTF-8 charset globally.
> And if UTF-8 is always used on some web server, it should be easy to declare the UTF-8 charset globally.

I requested such a global setting in Apache: https://bugs.debian.org/668858 but it has been denied, with the following response:

> Default is now not to send any charset in headers. It's up to the browser
> to choose one. Browsers sometimes fail to detect UTF-8, that's true. But this
> is a problem in the browser. Sending sometimes wrong information might help,
> but this is not the proper way to fix things.
> 
> I'm now tagging your bug "wontfix".
> 
> Please report a bug against your browser if unicode auto detection does not
> work. In iceweasel 10, this might be related to Default Character Encoding
> (Preferences/Content/Font & Colours (!)/Advanced/Character encoding); or
> make sure UTF-8 is first choice in View / Character Encoding / Customize list.
(In reply to Adam Borowski from comment #35)
> > And if UTF-8 is always used on some web server, it should be easy to declare the UTF-8 charset globally.
> 
> I requested such a global setting in Apache: https://bugs.debian.org/668858
> but it has been denied, with the following response:
[...]

So, this is not set up on default installations, but the webmaster can still do it (or it can be done via the .htaccess file) with:

AddDefaultCharset UTF-8
Yeah, but the maintainer argues it's the browser's job, and that I should report a bug on the browser if Unicode autodetection does not work. Which is exactly the bug we're on...
(In reply to Adam Borowski from comment #37)
> Yeah but the maintainer argues it's the browser's job, and that I should
> report a bug on the browser if unicode auto detection does not work.  Which
> is exactly this bug we're on...

Yes, but this is a specific case where choosing a default charset (either on the server side or on the client side) will not be OK, thus only autodetection in the browser will work. (One may also argue that autodetection should instead be done on the server side. BTW, this is what I do for my small web server, but statically, and with manual checking.)
To be clear, the reason HTTP no longer defines a default is because it's up to individual MIME types. As far as navigating to resources goes, which this bug is about, that's governed by HTML. HTML doesn't allow for autodetecting UTF-8.
(In reply to 張俊芝(Zhang Junzhi) from comment #22)
> Why don't we simply default the encoding to UTF-8 for any file without a
> BOM? Chrome has already done so.

What evidence is this assertion based on? Testing on Linux with the en-US locale, Chrome autodetects UTF-8 vs. windows-1252 for file: URLs and defaults to windows-1252 (without autodetecting UTF-8) for https: URLs for .com domains. (Demo: https://hsivonen.com/test/moz/charset/ )

I support doing the same in Firefox. See bug 1071816 comment 20. I haven't had time to implement it, but I'd r+ a patch implementing the feature described in that comment.

The most constructive thing anyone can do is to implement a patch along the lines described in that comment. Please, please, direct your energy towards writing that code instead of debating what should be done.

> The whole point of UTF-8, beside being able to mix charsets, is to not require such metadata anymore.

Let's recap the key points:

 * A Web browser should be able to browse the Web as it exists. This is why, for unlabeled legacy content, there's a need to default to a locale-dependent legacy encoding. The locale of the *content* is primarily guessed from the top-level domain. However, for multi-locale top-level domains, such as .com, the browser UI locale is used as the basis of guessing instead. (A toy sketch of this guessing order follows the list below.)

 * A Web browser should not try to enable newly-authored content to avoid having to declare its encoding. Magic to this end is necessarily brittle in the face of incremental rendering. It's bad to let newly-authored pages depend on such a brittle mechanism. While new authoring should use UTF-8, letting Web authors not say so is not a goal. Just like letting Web authors omit <!DOCTYPE html> is not a goal, letting authors omit <meta charset=utf-8> is not a goal.

 * For local files, the entire file is available, so the situation differs from https: pages, which are potentially infinite.

 * RFCs are irrelevant on these points.
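
To make the first bullet concrete, a toy sketch of that guessing order (Python; the table entries are illustrative examples only, not Gecko's actual mapping):

# Illustrative toy tables only; the real TLD-to-encoding mapping lives in
# Gecko and is much larger.
TLD_FALLBACK = {
    "jp": "shift_jis",
    "ru": "windows-1251",
    "gr": "iso-8859-7",
}
UI_LOCALE_FALLBACK = {
    "en-US": "windows-1252",
    "ja": "shift_jis",
}

def guess_fallback_encoding(tld: str, ui_locale: str) -> str:
    """Pick the legacy fallback for *unlabeled* content: prefer a
    locale-affiliated top-level domain; for generic TLDs such as .com,
    fall back to the browser UI locale's legacy encoding."""
    if tld in TLD_FALLBACK:
        return TLD_FALLBACK[tld]
    return UI_LOCALE_FALLBACK.get(ui_locale, "windows-1252")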

> Yeah but the maintainer argues it's the browser's job, and that I should report a bug on the browser if unicode auto detection does not work.

The Debian Apache maintainer is right that it's a problem for the server to just claim that all files are UTF-8, in case they aren't. For example, it's generally a bad idea to set such a server-wide setting on a server that hosts a lot of legacy content.

The maintainer is wrong that it's a browser bug if the browser doesn't guess.

The correct responsible party is the author of the content.
(In reply to Anne (:annevk) from comment #39)
> To be clear, the reason HTTP no longer defines a default is because it's up
> to individual MIME types. As far as navigating to resources goes, which this
> bug is about, that's governed by HTML. HTML doesn't allow for autodetecting
> UTF-8.

This does not seem to be what is described here:
https://html.spec.whatwg.org/multipage/parsing.html#the-input-byte-stream
(In reply to Henri Sivonen (:hsivonen) from comment #40)
> The correct responsible party is the author of the content.

No, the responsible party is the old HTTP standard, which said that ISO-8859-1 was the default. The author of the content could have relied on that. The issue is that new standards unnecessarily broke that (if the goal were to specify that the charset could be unknown at the MIME type level, charset=unknown would have been sufficient without breaking the old standards).
(In reply to Vincent Lefevre from comment #42)
> (In reply to Henri Sivonen (:hsivonen) from comment #40)
> > The correct responsible party is the author of the content.
> 
> No, the responsible party is the old HTTP standard, which said that
> ISO-8859-1 was the default. 

That default hasn't had practical relevance for the last 20 years or so.