Closed Bug 184082 Opened 22 years ago Closed 8 years ago

Parser does not check Content-Type value before setting document charset

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

RESOLVED FIXED

People

(Reporter: harunaga, Assigned: smontagu)

Details

Auto-Detect wrongly honors the first "charset" it encounters.

Steps:
1. Set Auto-Detect to Japanese.
2. Go to http://www.jrtbinm.co.jp/topics/topics.html

Actual:
Garbage.
Character Coding becomes Shift_JIS.

Expected:
Character Coding should become EUC-JP.

This page is marked up as follows:
<meta http-equiv="Content-Style-Type" content="text/css; charset=Shift_JIS">
<meta HTTP-EQUIV="Content-Type" CONTENT="text/html;CHARSET=EUC-JP">

If the "Content-Type" meta appears first, there is no problem.
http://bugzilla.mozilla.gr.jp/attachment.cgi?id=1403&action=view

Confirming with 2002120507-MachO/MacOS 10.2.2.
Confirming also on 2002112800/FreeBSD.
Changing to All.
OS: MacOS X → All
Hardware: Macintosh → All
Sorry. This is not related to Auto-Detect.
Changing summary.
Summary: Auto-Detect uses a first "charset" wrong → First "charset" is honored wrong
The bug is in nsParser::DetectMetaTag, which doesn't check the "http-equiv" part...
ccing shanjian
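
To illustrate the missing "http-equiv" check, here is a minimal Python sketch. It is not the Gecko code: MetaCharsetScanner, detect_meta_charset and the check_http_equiv flag are hypothetical names made up for this example. With the check disabled, the scanner takes the charset from whichever meta comes first, which reproduces the behaviour reported for this page.

```python
import re
from html.parser import HTMLParser

# Pulls the value out of "...; charset=XYZ" (case-insensitive).
CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([^\s;\"']+)", re.IGNORECASE)


class MetaCharsetScanner(HTMLParser):
    """Hypothetical scanner: remembers the charset of the first qualifying <meta>."""

    def __init__(self, check_http_equiv):
        super().__init__()
        self.check_http_equiv = check_http_equiv
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag != "meta" or self.charset is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        http_equiv = attr.get("http-equiv", "").strip().lower()
        if self.check_http_equiv and http_equiv != "content-type":
            return  # e.g. Content-Style-Type: not a document charset declaration
        match = CHARSET_RE.search(attr.get("content", ""))
        if match:
            self.charset = match.group(1)


def detect_meta_charset(markup, check_http_equiv=True):
    scanner = MetaCharsetScanner(check_http_equiv)
    scanner.feed(markup)
    return scanner.charset


# The two meta lines from the reported page, in their original order.
PAGE = (
    '<meta http-equiv="Content-Style-Type" content="text/css; charset=Shift_JIS">\n'
    '<meta HTTP-EQUIV="Content-Type" CONTENT="text/html;CHARSET=EUC-JP">\n'
)

buggy = detect_meta_charset(PAGE, check_http_equiv=False)
fixed = detect_meta_charset(PAGE, check_http_equiv=True)
```

Here buggy ends up as the Content-Style-Type charset and fixed as the Content-Type one.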
Apparently, Content-Style-Type can define both internal and external
stylesheets. At least the definition below seems to allow both types:

http://www.w3.org/TR/html401/present/styles.html#h-14.2.1

This particular web page does not specify an external stylesheet. A single
document cannot be in 2 encodings and so without an external stylesheet,
we seem to face a conflict here. The stylesheet is in "Shift_JIS" but the 
HTML file itself is in "EUC-JP". This is because internal stylesheet
encoding must be the same as the document encoding.

The above definition of Content-Style-Type says that if there are
multiple instances of Content-Style-Type meta declarations, the last
one determines the content-type. 
I have not seen explicit coverage of the same issue with regard to
Character encoding, e.g.

http://www.w3.org/TR/html401/charset.html#doc-char-set  (See section 5.2.2)

We might however take the last charset declaration in the document (if there 
is none from the server) whether it comes from the Content-Style-Type or 
Content-Type meta tags -- in case there is no external stylesheet.

By the way, I don't think THIS way of defining a different charset for 
external stylesheets is a good thing. I also believe that defining a
charset in Content-Style-Type is itself problematic because this would
also apply to internal style sheets. What if there is a native language
font name in the style sheet and you define the encoding of the
Content-Style-Type to be different from the document encoding? That will
cause a problem. On the other hand, if the Content-Style-Type charset is
the same as the document's charset, then there is no reason to specify 
it there.

Here are 2 better approaches:

1. Specify an external stylesheet charset to be different in
   <link .... charset="  ">. This way we can assume the document
   encoding for any stylesheet definitions within the document.
2. Specify the charset in the external CSS file using @charset ..

Yes, we should fix something in this bug, but this is also an evangelism issue,
it seems to me. We should tell the site to delete the Content-Style-Type
line since it is serving no good purpose in this case.
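
As a rough illustration of the two approaches listed above, the Python sketch below picks an encoding for an external stylesheet. The function name sniff_css_charset and its parameters are hypothetical. The ordering (@charset first, then the charset attribute of the link, then the document encoding) is my reading of the CSS 2.1 rules and should be treated as an assumption. HTTP headers and BOMs are left out to keep it short.

```python
import re

# An @charset rule only counts if it is the very first thing in the file.
AT_CHARSET_RE = re.compile(rb'^@charset "([^"]*)";')


def sniff_css_charset(css_bytes, link_charset=None, document_charset=None):
    """Return (charset, source) for an external stylesheet (illustrative only)."""
    match = AT_CHARSET_RE.match(css_bytes)
    if match:
        return match.group(1).decode("ascii", "replace"), "@charset"
    if link_charset:
        # From <link ... charset="..."> in the referring document.
        return link_charset, "link"
    # Fall back to the encoding of the referring document.
    return document_charset, "document"


with_rule = sniff_css_charset(b'@charset "Shift_JIS";\nbody { margin: 0 }',
                              document_charset="EUC-JP")
with_link = sniff_css_charset(b"body { margin: 0 }",
                              link_charset="Shift_JIS",
                              document_charset="EUC-JP")
inherited = sniff_css_charset(b"body { margin: 0 }", document_charset="EUC-JP")
```

Either way the document itself stays in EUC-JP, and no Content-Style-Type charset is needed.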
Summary: First "charset" is honored wrong → Parser does not check Content-Type value before setting document charset
I guess I should retract the following statement above:

"We might however take the last charset declaration in the document (if there 
is none from the server) whether it comes from the Content-Style-Type or 
Content-Type meta tags -- in case there is no external stylesheet."

Instead, if the Content-Style-Type does not refer to an external
stylesheet, the explicitly stated document encoding should override
it. 

One other case to deal with would be when there is one or more
Content-Style-Type charsets declared but there is no explicit
server or document-based charset declaration. Should the style-charset
be taken for the entire document? Or should we fall back on the default
browser encoding or let auto-detection take over?
I'm not sure that content-style-type applies to external stylesheets, since the
stylesheet language of an external stylesheet is unambiguously determined by the
server-supplied MIME type header... but then again there's always HTTP/0.9.

In any case, I would say that applying the content-style-type charset to the
whole document is a bit odd...  If we do decide to do it, we can just make it a
new level of charset hint with precedence below that of the content-type meta...
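
A sketch of what such an extra hint level could look like, in Python. The enum members and pick_charset are hypothetical and are not Gecko's actual charset source constants. Where auto-detection sits relative to the style-type hint is an open question in this bug, so its position here is only an assumption.

```python
from enum import IntEnum


class CharsetSource(IntEnum):
    """Hypothetical precedence levels; a larger value wins."""
    DEFAULT = 0
    AUTO_DETECT = 1
    META_CONTENT_STYLE_TYPE = 2  # the proposed new, weaker hint
    META_CONTENT_TYPE = 3
    HTTP_HEADER = 4


def pick_charset(hints):
    """hints: list of (CharsetSource, charset name); return the strongest hint."""
    return max(hints, key=lambda hint: hint[0])


# The page from this bug: both metas present, nothing from the server.
both = pick_charset([
    (CharsetSource.DEFAULT, "ISO-8859-1"),
    (CharsetSource.META_CONTENT_STYLE_TYPE, "Shift_JIS"),
    (CharsetSource.META_CONTENT_TYPE, "EUC-JP"),
])

# A page with only the Content-Style-Type charset.
style_only = pick_charset([
    (CharsetSource.DEFAULT, "ISO-8859-1"),
    (CharsetSource.META_CONTENT_STYLE_TYPE, "Shift_JIS"),
])
```

With this ordering the Content-Type meta always beats the style-type hint, and the style-type charset is only used when nothing stronger is available.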
Content-Style-Type only applies to style attributes and can't have a charset
value since attributes must be parsed by the document encoding.
> Content-Style-Type only applies to style attributes and can't have a charset
> value since attributes must be parsed by the document encoding.

I was waiting for a comment like this! As I described above, specifying
charset in Content-Style-Type is very problematic. 

However, a page with Content-Style-Type charset specified along 
with Content-Type charset validates as an HTML 4.01 document. This
threw me off.

The validator does not check the contents of attributes.
QA Contact: amyy → i18n
Fixed by the HTML5 parser.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED