Auto-Detect uses a first "charset" wrong.
Steps:
1. Set Auto-Detect to Japanese.
2. Go to http://www.jrtbinm.co.jp/topics/topics.html
Actual: Garbage. Character Coding becomes Shift_JIS.
Expected: Character Coding should become EUC-JP.
This page contains the following declarations, in this order:
<meta http-equiv="Content-Style-Type" content="text/css; charset=Shift_JIS">
<meta HTTP-EQUIV="Content-Type" CONTENT="text/html;CHARSET=EUC-JP">
If "Content-Type" is declared first, there is no problem: http://bugzilla.mozilla.gr.jp/attachment.cgi?id=1403&action=view
Confirming with 2002120507-MachO/MacOS 10.2.2.
Confirming also on 2002112800/FreeBSD. Changing to All.
OS: MacOS X → All
Hardware: Macintosh → All
Sorry. This is not related to Auto-Detect. Changing summary.
Summary: Auto-Detect uses a first "charset" wrong → First "charset" is honored wrong
The bug is in nsParser::DetectMetaTag which doesn't check the "http-equiv" part...
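To illustrate the missing check, here is a minimal Python sketch (not the actual nsParser code). The class `MetaCharsetFinder` and the function `detect_meta_charset` are hypothetical names made up for this example. The sketch only accepts a charset from a <meta> tag whose http-equiv value is Content-Type, and it assumes the first such tag wins.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Hypothetical helper: records the charset of the first
    <meta http-equiv="Content-Type"> tag and ignores every other <meta>."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        # HTMLParser already lowercases attribute names; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        # The check described as missing: skip any <meta> whose http-equiv
        # is something other than Content-Type (e.g. Content-Style-Type).
        if attr.get("http-equiv", "").strip().lower() != "content-type":
            return
        # Look for a charset parameter after the MIME type.
        for param in attr.get("content", "").split(";")[1:]:
            key, _, value = param.partition("=")
            if key.strip().lower() == "charset" and value.strip():
                self.charset = value.strip().strip("\"'")
                return


def detect_meta_charset(html_text):
    """Return the charset declared via meta Content-Type, or None."""
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    return finder.charset
```

Fed the two <meta> lines from the original report, this sketch skips the Content-Style-Type declaration and returns "EUC-JP".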
Apparently, Content-Style-Type can define both internal and external stylesheets. At least the definition below seems to allow both types: http://www.w3.org/TR/html401/present/styles.html#h-14.2.1
This particular web page does not specify an external stylesheet. A single document cannot be in 2 encodings, so without an external stylesheet we seem to face a conflict here: the stylesheet is declared as "Shift_JIS" but the HTML file itself is in "EUC-JP", and an internal stylesheet's encoding must be the same as the document encoding.
The above definition of Content-Style-Type says that if there are multiple instances of Content-Style-Type meta declarations, the last one determines the content-type. I have not seen explicit coverage of the same issue with regard to character encoding, e.g. http://www.w3.org/TR/html401/charset.html#doc-char-set (see section 5.2.2). We might however take the last charset declaration in the document (if there is none from the server) whether it comes from the Content-Style-Type or Content-Type meta tags -- in case there is no external stylesheet.
By the way, I don't think THIS way of defining a different charset for external stylesheets is a good thing. I also believe that defining a charset in Content-Style-Type itself is problematic because it would also apply to internal style sheets. What if there is a native language font name in the style sheet and you define the encoding in Content-Style-Type to be different from the document encoding? That will cause a problem. On the other hand, if the Content-Style-Type charset is the same as the document's charset, then there is no reason to specify it there.
Here are 2 better approaches:
1. Specify the external stylesheet's charset, when it differs from the document's, in <link .... charset=" ">. This way we can assume the document encoding for any stylesheet definitions within the document.
2. Specify the charset in the external CSS file using @charset ..
Yes, we should fix something in this bug, but this is also an evangelism issue, it seems to me. We should tell the site to delete the Content-Style-Type line since it is serving no good purpose in this case.
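The two approaches above could be combined roughly as follows. This is only a sketch under stated assumptions: `stylesheet_encoding` is a hypothetical function, and the precedence it uses (@charset in the CSS file first, then the charset attribute on <link>, then the document encoding) is an assumption made for illustration, not something decided in this bug. Server-supplied charsets are left out.

```python
import re

# An @charset rule is only meaningful at the very start of the CSS file.
_CHARSET_RULE = re.compile(rb'^@charset "([^"]*)";')


def stylesheet_encoding(css_bytes, link_charset, document_encoding):
    """Pick an encoding for an external stylesheet (hypothetical helper).

    css_bytes         -- raw bytes of the external CSS file
    link_charset      -- value of <link charset="...">, or None
    document_encoding -- encoding of the referring HTML document
    """
    match = _CHARSET_RULE.match(css_bytes)
    if match:
        # Approach 2: the CSS file declares its own encoding.
        return match.group(1).decode("ascii", "replace")
    if link_charset:
        # Approach 1: the linking document states the stylesheet encoding.
        return link_charset
    # Otherwise assume the stylesheet shares the document encoding.
    return document_encoding
```

Internal style sheets never go through this function; they are simply read in the document encoding.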
Summary: First "charset" is honored wrong → Parser does not check Content-Type value before setting document charset
I guess I should retract the following statement above: "We might however take the last charset declaration in the document (if there is none from the server) whether it comes from the Content-Style-Type or Content-Type meta tags -- in case there is no external stylesheet." Instead, if the Content-Style-Type declaration does not refer to an external stylesheet, the explicitly stated document encoding should override it. One other case to deal with would be when there are one or more Content-Style-Type charsets declared but there is no explicit server or document-based charset declaration. Should the style charset be taken for the entire document? Or should we fall back on the default browser encoding or let auto-detection take over?
I'm not sure that Content-Style-Type applies to external stylesheets, since the stylesheet language of an external stylesheet is unambiguously determined by the server-supplied MIME type header... but then again there's always HTTP/0.9. In any case, I would say that applying the Content-Style-Type charset to the whole document is a bit odd... If we do decide to do it, we can just make it a new level of charset hint with precedence below that of the Content-Type meta...
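The "new level of charset hint" idea could look something like the sketch below. The enum `CharsetSource`, its member names, and `pick_charset` are hypothetical and are not Mozilla's actual constants. Placing the Content-Style-Type hint above auto-detection is an assumption, since that question is still open in this bug; the only ordering taken from the comment above is that it ranks below the Content-Type meta.

```python
from enum import IntEnum


class CharsetSource(IntEnum):
    """Hypothetical hint levels; a higher value takes precedence."""
    BROWSER_DEFAULT = 0
    AUTO_DETECTED = 1
    META_CONTENT_STYLE_TYPE = 2   # the proposed new, weaker hint level
    META_CONTENT_TYPE = 3
    HTTP_HEADER = 4


def pick_charset(hints):
    """Return the charset from the highest-precedence hint, or None.

    hints -- iterable of (CharsetSource, charset_name) pairs
    """
    hints = list(hints)
    if not hints:
        return None
    _source, charset = max(hints, key=lambda hint: hint[0])
    return charset
```

With both meta declarations from the reported page supplied as hints, the Content-Type value (EUC-JP) is chosen. With only the Content-Style-Type hint present, Shift_JIS would be used for the whole document, which is the case described above as "a bit odd".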
Content-Style-Type only applies to style attributes and can't have a charset value since attributes must be parsed by the document encoding.
> Content-Style-Type only applies to style attributes and can't have a charset
> value since attributes must be parsed by the document encoding.
I was waiting for a comment like this! As I described above, specifying charset in Content-Style-Type is very problematic. However, a page with Content-Style-Type charset specified along with Content-Type charset validates as an HTML 4.01 document. This threw me off.
The validator does not check the contents of attributes.
Fixed by the HTML5 parser.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED