Closed Bug 521039 Opened 15 years ago Closed 7 years ago

Character encoding for CSS is specified to default to UTF-8, not ISO-8859-1

Categories

(Core :: CSS Parsing and Computation, defect)

defect
Not set
normal

RESOLVED INVALID

People

(Reporter: zwol, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: compat, css2, Whiteboard: DUPEME)

Attachments

(4 files)

http://www.w3.org/TR/CSS2/syndata.html#charset specifies that the default character encoding for CSS (if there is no BOM and no @charset rule) is UTF-8.  However, we default to ISO-8859-1 instead.

This is trivially fixed -- we need only change the string literals near http://mxr.mozilla.org/mozilla-central/source/layout/style/nsCSSLoader.cpp#690 -- and it *should not* affect web compatibility in practice, since CSS as she is spoke tends to be pure 7-bit ASCII anyway (so the choice has no effect).  I'd like to change it, because we're looking at making the CSS parser do all its work in UTF-8 and not stack an encoding converter on the input stream unless the input isn't UTF-8.  Changing the default would mean that the common case [no BOM, no @charset rule, no characters outside ASCII] doesn't stack an encoding converter.

However, I'm filing this as a separate bug out of an abundance of caution -- maybe there's some web-compat issue I don't know about.  Also, I bet it's a dupe of some bug with a small number, but I can't find one.
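
To illustrate why the choice has no effect on pure 7-bit ASCII stylesheets, here is a minimal Python sketch. It is not Gecko code; the function name and both sample stylesheets are made up for illustration. It decodes the same bytes as UTF-8 and as ISO-8859-1 and compares the results.

```python
def decodes_identically(data: bytes) -> bool:
    """Return True if UTF-8 and ISO-8859-1 decoding yield the same text."""
    try:
        as_utf8 = data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8 at all, so the two defaults cannot agree.
        return False
    # ISO-8859-1 maps every byte to a code point, so this never raises.
    return as_utf8 == data.decode("iso-8859-1")


# Hypothetical stylesheets: one pure ASCII, one with a non-ASCII character.
ascii_sheet = b"body { color: red; }"
non_ascii_sheet = 'q::before { content: "\u00ab"; }'.encode("utf-8")

print(decodes_identically(ascii_sheet))
print(decodes_identically(non_ascii_sheet))
```

Only the second sheet is sensitive to the default, which is the uncommon case described above.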
Keywords: compat, css2, qawanted
Whiteboard: DUPEME
> Changing the default would mean that the common case [no BOM, no @charset rule, no characters outside ASCII] doesn't stack an encoding converter.

Sadly no, since the common case would just fall back to the document charset... and that has a tendency to not be UTF-8.

I doubt we have an existing bug on this, and this seems like a perfectly reasonable change to make.  In practice, the document charset will never be empty for web stuff, so compat issues are unlikely.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Just thought I'd add my observations from a recent test (ff 3.6.8/Linux):

If you directly load a CSS file in a tab, it will indeed default to ISO-8859-1 encoding if no other is specified (header, @charset, etc), as you observe.

However, when a CSS file is referenced by another document and has no charset specified of its own, it will in fact assume the charset of the parent. So in fact the only thing left to do is set the default encoding to UTF-8 when the parent (if any) has no charset.

This may be harder than you thought, though, because parents always eventually get a charset; one is decided for them if they don't provide their own (I think that's what Boris is saying). So, the CSS child will also always get a charset completely determined by the parent document's content and not its own content, and will never have the chance to "default" to UTF-8 unless there is no parent at all (which is the scenario where it matters the least, for a browser).

While CSS2 and later specs establish a process for determining the charset of a CSS file, there is no such process for CSS1. And since there's no way of telling which level of CSS you're parsing before you have parsed it, that means the default can be undefined. However, UTF-8 is as valid a default as any. So there's no standards conflict, but it's not technically wrong as it is, so if you fear a compatibility issue, you can leave it alone.

Chrome (5.0.375.55/Linux), Opera (10.10.4724/Linux), Internet Explorer 8, and Safari (5.0.7533.16) all behave exactly the same way as Firefox currently does, except that in Safari the default for CSS seems to always be ISO-8859-1 regardless of the parent document. (I couldn't think of an easy way to test IE < 8, because those versions don't support the CSS content property and no version of IE renders CSS without a parent, so it might not matter.) Safari also doesn't render CSS without a parent. I didn't test Safari in further detail; that's a bug report for another day.

I think if any changes here were going to be an issue, we would already have seen people complaining about it. The change that Zack mentioned would affect no documents at all except for CSS which has no parent document; so if it improves things, then you might as well do it!

As for the issue of CSS that is included by a parent where both are without a charset, the issue is murky as to what the correct behavior is. Since the standard says to use:

4. charset of referring style sheet or document (if any)

The "if any" clause implies that if none is specified, then use the next rule ("5. Assume UTF-8"). But it could be argued that, because it doesn't say anything about where the parent gets its charset from - being assigned a default by the browser is technically as valid a source as one being specified by the author - that point 4 is satisfied.

Rule 5 would then never be invoked except where there is no parent, so it seems that this is perhaps not the intention. On the other hand, this would be consistent with how inline CSS is handled where the 'parent' has no explicit charset; one is assigned by the browser according to what the 'parent's' content is, and the inline CSS uses that same charset. But then again, this is a special rule for files that are already separate files from their parents, so that probably negates the previous argument. Gosh I love/hate standards! :-)
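
The two readings of rule 4 can be put side by side in a small Python sketch. This only restates the argument above; the function, its parameters, and the idea of a flag recording whether the parent's charset was explicit are all hypothetical and do not correspond to anything in the Firefox source.

```python
from typing import Optional


def inherit_or_default(parent_charset: Optional[str],
                       parent_was_explicit: bool,
                       strict_reading: bool) -> str:
    """Choose between rule 4 (parent charset) and rule 5 (assume UTF-8).

    strict_reading=False: any parent charset satisfies rule 4, even one
    the browser guessed (what browsers were observed to do).
    strict_reading=True: only an author-specified parent charset counts,
    so a guessed one falls through to rule 5.
    """
    if parent_charset is None:
        # No referring document at all: rule 5 applies under both readings.
        return "utf-8"
    if strict_reading and not parent_was_explicit:
        # The parent's charset was only a guess; do not inherit it.
        return "utf-8"
    return parent_charset


# A parent whose charset was autodetected as windows-1252:
print(inherit_or_default("windows-1252", False, strict_reading=False))
print(inherit_or_default("windows-1252", False, strict_reading=True))
```

The strict reading is the one that needs the extra piece of information (how the parent got its charset), which is where the work would be.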

Sorry to waffle. Summary:
* The change that Zack mentioned would not affect this bug in any important way (only the unimportant case of a parentless CSS file). It can certainly be done anyhow if desired; it'll break nothing and be technically more correct.

As for cases where there is a parent:
* Leaving it as it is would be more compatible, and more consistent with the handling of inline CSS (if anyone cares about that).
* Changing it would be more honestly compliant with the (arguably ambiguous) spec.
* Neither is likely to have drastic effects on anyone at all, although changing it is more likely to break things than leaving it alone (still unlikely, though).
* All other browsers behave the same as Firefox currently does (with a tiny exception for Safari, which behaves unambiguously incorrectly).
* Changing it may involve a lot of work; it would require a way for the document to know how the parents' charset was determined (not familiar enough with the source to know if there is any easy way to do that).

Personally, I'd change it for the sake of standards compliance, but I'd make it an extremely low priority.

(Note: I've done no testing whatsoever on how this interacts with CSS documents that have BOMs, which are considered higher up the decision-making-tree and so should have no effect on this - assuming that they are handled correctly already…)

One more thing: in every case and with all browsers: a user manually selecting the encoding to use for a parent will have that choice of charset cascade down to the CSS files which inherited their charset from their parent because of their lack of charset definition. This should be kept in mind if anything is changed (I'm not sure what the correct way to handle that would be - I suppose a user selection is an unambiguous definition so there is no need for the CSS file to "default" to anything, meaning it would inherit the parent charset and not get its own UTF-8 default - but the spec makes no allowances for that and it does use the word "must" for its own rules, so, whatever. I guess you're not technically changing the charset but the "encoding parameter"; or perhaps it qualifies as a "similar parameters in other protocols").

Attachments:

If you open 521039utf8.css you will see a CSS file with UTF-8 content. 521039default.css is the same but without an @charset rule. Because 521039default.css has no @charset rule, by opening that you will see it rendered in ISO-8859-1 (the first part of this bug - it should be UTF-8).

521039eucjp.html is the same as 521039default.html except that it contains a charset parameter.

521039eucjp.html shows the rendering of the content property from 521039default.css, which assumes the charset of the parent (EUC-JP) and displays rubbish. This is a deliberate test case and the CSS is wrong: it is encoded as UTF-8 but is (correctly) interpreted as EUC-JP because that's what the parent is explicitly declared as. This proves that Firefox currently behaves correctly in this important case.

521039default.html shows why it is nontrivial to fix; the charset of the parent becomes Windows-1252 (autodetected), and therefore so does the content of 521039default.css which means that there is no opportunity for the CSS to fall back to UTF-8 when it has a parent. This is the reason why it has pretty much never been a problem before; but it is also the reason why it will be hard to change without a lot of work that may not be worth it (the workaround for the author is simple - specify your charset!).

Note that in every case, the documents that specify their charset will always be correct (according to that charset), regardless of everything else, in all browsers.
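
The effect seen in the two HTML test pages can be mimicked outside the browser with a short Python sketch. The stylesheet text here is a made-up stand-in, not the content of the attachments, and Python's codec names (euc_jp, cp1252) are used as approximations of the browser's EUC-JP and Windows-1252 decoders.

```python
# Hypothetical stylesheet, saved as UTF-8 with no @charset rule.
original = 'q::before { content: "日本語"; }'
raw = original.encode("utf-8")


def misread(data: bytes, assumed_charset: str) -> str:
    """Decode bytes with the charset inherited from the parent document.

    Undecodable bytes become U+FFFD instead of raising, which is roughly
    what a browser shows as "rubbish".
    """
    return data.decode(assumed_charset, errors="replace")


as_euc_jp = misread(raw, "euc_jp")   # parent explicitly declared EUC-JP
as_cp1252 = misread(raw, "cp1252")   # parent autodetected as Windows-1252
as_utf8 = misread(raw, "utf-8")      # what the author actually intended

print(as_euc_jp == original, as_cp1252 == original, as_utf8 == original)
```

Only the UTF-8 reading round-trips; both inherited charsets garble the content string.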
Thanks for the detailed analysis.  Did you look at the interaction with charsets specified in <link rel="stylesheet" type="text/css;charset=..."> (which I think we may have just disabled) and in HTTP content type headers?

I'm starting to think that having *any* mechanism where the parent document can influence the encoding of child documents is a Bad Idea, although if we can get rid of UTF-7, perhaps not a security hole anymore.
> Did you look at the interaction with charsets specified in <link rel="stylesheet" type="text/css;charset=..."> (which I think we may have just disabled) and in HTTP content type headers?

Yes - these parameters are higher up the decision-making-tree in handling charset cases so I didn't mention them. Firefox always obeys any explicit charset instruction, including in these cases, which is 100% correct behavior.

It's a good thing, too, because it means that you can link CSS files that you didn't write/can't edit/don't host and give them a different charset from your own document, if the CSS author (and host if applicable) didn't give it one. There is absolutely no problem here with charsets for any author who understands what they are and specifies them correctly, which only leaves the other 99% :-( .

> I'm starting to think that having *any* mechanism where the parent document can influence the encoding of child documents is a Bad Idea

Well, maybe; but it is exactly what the spec says to do, and it is the most compatible option because odds are good that HTML authors will also be CSS authors. Especially if they're using things like the content property, which is one of the few cases where it matters (others include non-ASCII-compatible charsets like EBCDIC, which are amazingly rare, and possibly one or two other rare CSS properties which I haven't thought of).

Essentially, if the author doesn't say what charset to use, you have to guess. It's a much better bet that it will be the same as the parent, than whatever other heuristics you might apply. The only question mark is: what if I already had to guess for the parent? That's what my waffle is trying to get at: does it make sense to inherit a guess, or fallback to UTF-8? Currently, Firefox (and everyone else) inherits from the parent. The spec might say you should use UTF-8 (it's hard to tell). In the real world, it probably doesn't matter.

As I said: if the author at any point actually says "use this charset" by any of the several available methods, then Firefox and all other browsers will do so, so I see no problem with charset inheritance from a parent document from a compatibility point-of-view. I also don't see a problem from a security point-of-view; any effect that this would likely create could already be achieved by an attacker just specifying the charset incorrectly. That's all we're talking about.
> Did you look at the interaction with charsets specified in <link rel="stylesheet" type="text/css;charset=..."> (which I think we may have just disabled)

Sorry, hold the phone on that one. It's not always true.

Let me do some more tests; I'll be back.

In order:

* 1. The HTTP Content-Type header is always obeyed if present. Correct.

* 2. BOMs and/or the @charset rule are then considered. Correct.
 * 2.a. If both exist and are consistent, they are obeyed. Correct.
 * 2.b. If the BOM indicates one charset, but the @charset rule indicates another, then they are resolved according to spec (they usually invalidate the css file, which is correct).
 * 2.b.i A parentless CSS file is rendered as text/plain because the CSS instructions are not being parsed. Therefore, its encoding will be autodetected, and the BOM will almost certainly determine the result. The @charset rule does not apply to text/plain documents, so it is never used in such a case. Maybe incorrect, but unimportant.
 * 2.b.ii For a child CSS document, though, the CSS document will not be applied at all. Correct.

* 3. The value of the charset parameter of the referring Link element. Correct.
 * 3.a. Note that this is true if you use: 
<link type="text/css" charset="whatever"…
 with individual parameters, but NOT if you use: 
<link type="text/css; charset=whatever"…
 in the format of a Content-Type header (with the charset an extension in the type parameter instead of being in its own charset parameter). This makes sense, but may be outside specification. The HTML spec says that the type parameter contains %ContentType; which in the DTD is as defined by RFC 2045, which allows it. We're wandering very far away from the subject of the bug at this point, though, and it seems to be a security thing, so we probably don't need to worry about that here and now.

* 4. The charset of the parent is used. Correct, but:

* 5. Because of 4, it never happens that a default is assumed, where there is a parent. (Arguably) Incorrect. When there isn't a parent, it will be ISO-8859-1 instead of UTF-8. Incorrect.
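
The order observed above can be summarised as a rough Python sketch. It models only the decision order reported in this comment, not the actual loader code; every name in it is hypothetical, and treating any BOM/@charset mismatch as a conflict is a simplification (the comment says such mismatches "usually" invalidate the file).

```python
from typing import Optional


class ConflictingEncodingDeclarations(Exception):
    """Raised when the BOM and the @charset rule name different charsets."""


def pick_charset(http: Optional[str] = None,
                 bom: Optional[str] = None,
                 at_charset: Optional[str] = None,
                 link: Optional[str] = None,
                 parent: Optional[str] = None,
                 default: str = "iso-8859-1") -> str:
    """Return the charset for a style sheet, following steps 1 to 5."""
    # 1. The HTTP Content-Type charset always wins.
    if http:
        return http
    # 2. BOM and/or @charset; a disagreement means the sheet is not applied.
    if bom and at_charset and bom.lower() != at_charset.lower():
        raise ConflictingEncodingDeclarations(f"{bom} vs {at_charset}")
    if bom or at_charset:
        return bom or at_charset
    # 3. charset attribute on the referring <link> element.
    if link:
        return link
    # 4. Charset of the parent document (always set when there is a parent).
    if parent:
        return parent
    # 5. Only reached for parentless CSS; observed ISO-8859-1, spec says UTF-8.
    return default


print(pick_charset(parent="windows-1252"))
print(pick_charset())
print(pick_charset(default="utf-8"))
```

Zack's proposed change corresponds to switching the default argument to "utf-8"; as step 4 shows, that branch is only reached when there is no parent.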

Analysis:

* 2.b.i is probably a bug but seems to have no real-world impact (maybe it might affect Composer?). If it's to be fixed it should probably have a very low priority (my opinion).

* 3. Maybe a bug, but doesn't seem to affect anybody. I don't understand the security implications, if there are any, but it's not a recent change as far as I can tell. If it's to be fixed it should probably have a very low priority (my opinion).

* 5. Half of this bug, as originally defined by Zack, is solved by Zack's original solution. I agree; it's a simple fix that's unlikely to cause any issues at all (my opinion). The other half requires a definitive interpretation of the spec and heaps of work for, likely, no benefit. If it's to be fixed it should probably have a very low priority (my opinion).

I've kind of hijacked Zack's very simple bug here. Should these be split into separate bugs at this point (even if they just get marked WONTFIX so that they don't come up again)?


Compatibility Notes:

2.b.i current behavior is inconsistent with Chrome (5.0.375.55/Linux) and Opera (10.10.4724/Linux); they take the @charset rule into account even when showing parentless CSS.

2.b.ii current behavior is inconsistent with Chrome (5.0.375.55/Linux), Opera (10.10.4724/Linux), Internet Explorer 8, and Safari (5.0.7533.16) - they disobey the spec and render stylesheets with conflicting BOMs and @charset rules in many cases. Chrome, Opera and Safari obey the BOM, Internet Explorer obeys the @charset rule.

Note that this only applies to encodings which are supported. No browser, including Firefox, seems to support any variant of EBCDIC, IBM1026 or GSM 03.38, and so all of them ignored such stylesheets (because there was no intelligible content).
> If you directly load a CSS file in a tab, it will indeed default to ISO-8859-1

Yes, because it's treated as text/plain in that situation, not as CSS.  Of course this is not the case Zack was interested in, since he cared only about things going through the CSSLoader.
I could make a case for having the text/plain renderer notice @charset and default to utf-8 when the channel content type was text/css.  But that would be a separate bug.
Blocks: mimesniff
Couldn't find any duplicates, older or newer. Leaving "dupeme" for now, just in case someone else has more luck with this.
Keywords: qawanted
Resolving this as INVALID per https://drafts.csswg.org/css-syntax/#determine-the-fallback-encoding. If you want to remove the concept of "environment encoding" please change the standard first.
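
For reference, the fallback steps in the linked draft, as I read them, can be sketched in Python roughly as follows. This is an approximation under stated assumptions: codecs.lookup stands in for the Encoding Standard's label lookup (so the returned names are Python codec names), the function and parameter names are made up, and BOM handling, which happens in the separate decode step, is left out.

```python
import codecs
import re
from typing import Optional

# '@charset "' + label bytes (0x00-0x7F except the quote) + '";' at the start.
_AT_CHARSET = re.compile(rb'@charset "([\x00-\x21\x23-\x7f]*)";')


def _lookup(label: str) -> Optional[str]:
    """Map a label to a codec name, or None if it is not recognised."""
    label = label.strip()
    if not label:
        return None
    try:
        return codecs.lookup(label).name
    except (LookupError, ValueError):
        return None


def fallback_encoding(data: bytes,
                      http_label: Optional[str] = None,
                      environment_encoding: Optional[str] = None) -> str:
    """Return the fallback encoding for a style sheet's byte stream."""
    # 1. A recognised label from HTTP (or an equivalent protocol) wins.
    if http_label:
        enc = _lookup(http_label)
        if enc:
            return enc
    # 2. Look for a literal @charset rule within the first 1024 bytes.
    match = _AT_CHARSET.match(data[:1024])
    if match:
        enc = _lookup(match.group(1).decode("ascii"))
        if enc in ("utf-16", "utf-16-be", "utf-16-le"):
            # An ASCII-readable @charset cannot really be UTF-16.
            return "utf-8"
        if enc:
            return enc
    # 3. The environment encoding supplied by the referring document.
    if environment_encoding:
        return environment_encoding
    # 4. Otherwise UTF-8.
    return "utf-8"


print(fallback_encoding(b"body{}", environment_encoding="windows-1252"))
print(fallback_encoding(b"body{}"))
```

Step 3 is the "environment encoding" mentioned above: UTF-8 is only reached when the referring document provides nothing, which matches the inherited-charset behaviour discussed earlier in this bug.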
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INVALID