Closed Bug 238488 Opened 21 years ago Closed 21 years ago

charset in HTTP header takes precedence over charset from meta tag in html

Categories

(Core :: DOM: HTML Parser, defect)

All
Linux
defect
Not set
major

Tracking

()

RESOLVED INVALID

People

(Reporter: P, Unassigned)

References

Details

User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040124
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040124

Apache 2.0, on Fedora Core 1 at least, ships with the following default config:

AddDefaultCharset UTF-8

This causes all non-UTF-8 pages served from that server to be rendered incorrectly in Gecko-based browsers. The default Apache config file has comments intimating that Mozilla is currently doing the wrong thing. It has been independently causing problems here: http://mail.gnome.org/archives/gtk-doc-list/2002-November/msg00026.html

Reproducible: Always

Steps to Reproduce:
1. Put AddDefaultCharset UTF-8 in Apache's httpd.conf
2. Get an ISO-8859-1 encoded page (with characters that are invalid as UTF-8, e.g. á)

Actual Results:
The ISO-8859-1 page is interpreted as UTF-8, and any bytes that are invalid UTF-8 are rendered as question marks.

Expected Results:
The Content-Type specified in the HTML should take precedence over the one specified in the HTTP header.
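
The reported symptom can be reproduced outside the browser. The following is a minimal sketch, assuming a page body that contains only the character á; the helper name decode_as is hypothetical and this is not Gecko's decoder.

```python
# Illustrative sketch only: decode_as and the sample byte are assumptions,
# not code from Gecko or Apache.

def decode_as(raw: bytes, charset: str) -> str:
    """Decode raw bytes with the given charset, replacing invalid sequences."""
    return raw.decode(charset, errors="replace")


# "á" (U+00E1) is the single byte 0xE1 in ISO-8859-1.
page_bytes = "\u00e1".encode("iso-8859-1")

# Decoded with the charset the author intended, the character survives.
as_latin1 = decode_as(page_bytes, "iso-8859-1")

# Decoded as UTF-8 (what the HTTP header claims), 0xE1 is an incomplete
# multi-byte sequence, so it becomes the replacement character U+FFFD.
as_utf8 = decode_as(page_bytes, "utf-8")
```

Browsers draw the replacement character as a question mark, usually inside a diamond, which matches the "question marks" in the report.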
http://www.w3.org/International/questions/qa-encoding-alts.html

When trying to figure out the character encoding of a resource, user agents will try, in this order:
1. The HTTP Content-Type header sent by the server
2. The XML declaration (only for XHTML documents ??? Served as xml+xhtml???)
3. The HTML/XHTML meta element.
...
Preferred approach
Since the HTTP Content-Type header has precedence...

This isn't a spec, just an I18N Q&A page, but unless a spec says otherwise it seems pretty obvious that the HTTP charset is the first choice.
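
The order quoted from the Q&A page can be sketched roughly as below. This is an illustration under assumptions, not how Gecko does it: the function names, the regular expressions, the 1024-byte window and the windows-1252 fallback are all hypothetical choices.

```python
import re
from typing import Optional


def charset_from_content_type(value: Optional[str]) -> Optional[str]:
    """1. charset parameter of the HTTP Content-Type header, if any."""
    if not value:
        return None
    m = re.search(r'charset\s*=\s*"?([^\s;"]+)', value, re.IGNORECASE)
    return m.group(1).lower() if m else None


def charset_from_xml_decl(head: bytes) -> Optional[str]:
    """2. encoding pseudo-attribute of a leading XML declaration, if any."""
    m = re.match(rb'<\?xml[^>]*encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']', head)
    return m.group(1).decode("ascii").lower() if m else None


def charset_from_meta(head: bytes) -> Optional[str]:
    """3. charset found inside a meta element, if any."""
    m = re.search(rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9._-]+)',
                  head, re.IGNORECASE)
    return m.group(1).decode("ascii").lower() if m else None


def pick_charset(http_content_type: Optional[str], body: bytes,
                 fallback: str = "windows-1252") -> str:
    """Try the three sources in the quoted order; the first hit wins."""
    head = body[:1024]  # assumption: only look near the start of the body
    return (charset_from_content_type(http_content_type)
            or charset_from_xml_decl(head)
            or charset_from_meta(head)
            or fallback)
```

With this ordering, a page served as "text/html; charset=UTF-8" is treated as UTF-8 even if its meta element says iso-8859-1, which is the behaviour the reporter is seeing.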
Right, there is a mismatch between Apache and Gecko with respect to this. IMHO Apache is correct, i.e. the HTTP header specifies the *default* encoding, which can be specialised by the HTML document. Why the httpd should set a default encoding is described here: http://www.cert.org/tech_tips/malicious_code_mitigation.html But that also says the httpd should be doing any translations required, and I don't think Apache is doing that. Another data point is that Konqueror and Internet Explorer do give precedence to the charset in the HTML file, unlike Gecko.

Here's the comment from httpd.conf:

Specify a default charset for all pages sent out. This is always a good idea and opens the door for future internationalisation of your web site, should you ever want it. Specifying it as a default does little harm. There are also some security reasons in browsers, related to javascript and URL parsing which encourage you to always set a default char set.
The Apache documentation is much clearer than the (somewhat unhelpful) notes in the httpd.conf.

AddDefaultCharset Directive
This directive specifies the name of the character set that will be added to any response that does not have any parameter on the content type in the HTTP headers. This will override any character set specified in the body of the document via a META tag.

http://httpd.apache.org/docs-2.0/mod/core.html#adddefaultcharset
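
As a rough model of the sentence quoted from the Apache documentation (not Apache's actual code), the directive can be pictured like this. The function name is hypothetical, and reading "any parameter" as "a charset parameter" is an assumption.

```python
# Rough model of the quoted documentation sentence, not Apache source code.
# Assumption: "parameter" is taken to mean a charset parameter.

def add_default_charset(content_type: str, default: str = "UTF-8") -> str:
    """Append the default charset unless the header already names one."""
    params = [p.strip().lower() for p in content_type.split(";")[1:]]
    if any(p.startswith("charset=") for p in params):
        return content_type  # an explicit charset is left alone
    return f"{content_type}; charset={default}"
```

The server never looks at the document body here, so a META tag inside the page cannot stop the header from being added.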
Fair enough. As I see it there are 2 problems:
1. Apache should translate any inappropriate characters for AddDefaultCharset.
2. Internet Explorer should ignore the meta tag (can anyone confirm Explorer does this? I've just seen it posted that it does).

Note: IMHO Apache should not AddDefaultCharset if there is a value in the meta tag.
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → INVALID
Just referencing the Apache info (confirming that Mozilla is correct):
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=23421
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=14513
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=23692

What a mess. Opera/Mozilla use the HTTP-header value and Konqueror/Explorer use the meta tag value.
*** Bug 258154 has been marked as a duplicate of this bug. ***
*** Bug 291949 has been marked as a duplicate of this bug. ***
*** Bug 293148 has been marked as a duplicate of this bug. ***
All I have to say is that the W3C documentation ( http://www.w3.org/TR/html401/charset.html#h-5.2.2 ) must be rewritten. It contains logical errors and we are reproducing them.
This is highly annoying for non-English-speaking users (it's not a matter of accents any more; no text is readable), as we have to change the encoding EACH AND EVERY TIME the page reloads (imagine forums, link-hopping etc). May I suggest a solution? Add an option like "Use *this* encoding for *this* page all the time". This will bring no complications to an already complicated issue, and it will provide peace of mind to all of us users who don't use iso-8859-1 or UTF-8. It won't use auto-detection (which I agree is not the correct approach). Most users already know about the encoding issue (albeit not in an expert way; they just know where to look to make the page appear correctly), and when the authors fix the page to behave correctly, the user will just turn the option off *for this page only* and have FF interpret the encoding using all the standard ways.
I'm aware that this bug is rather old and that it has been resolved as invalid, but I think it should be reconsidered.

According to http://www.w3.org/TR/html401/charset.html#h-5.2.2 the HTTP header has preference over the <meta http-equiv="Content-Type"> tag. However, this behavior is troublesome, and I'd consider it "a bug in the standard". The reason is that the <meta> tag can be included inside the same file where the encoding is used, whereas the Content-Type may be a feature of the HTTP server that is not as easy to change. So in case of mismatch between the HTTP header and the HTML content, I'd be more inclined to think that the valid choice is the one indicated in the HTML file (why would it be there otherwise?).

This doesn't totally go against the specifications, since "In addition to this list of priorities, the user agent may use heuristics and user settings". That is, in case of a mismatch, the browser could apply an encoding of its choice (it's not clear whether these heuristics can be anywhere on the list of priorities or only after them).

Furthermore, the HTML5 standard (http://www.w3.org/TR/html5/syntax.html#determining-the-character-encoding) lists these steps:

> 2. The user agent may wait for more bytes of the resource to be available
> 4. If the transport layer specifies a character encoding, and it is supported, return that encoding
> 5. Optionally prescan the byte stream to determine its encoding

Steps 2 and 5 seem to be identical except for some sort of timeout condition, and both seem to point to an algorithm explained later that involves <meta http-equiv="Content-Type"> and <meta charset=> tags. I didn't understand clearly what these steps mean, but they seem to suggest that the browser could choose to perform the charset identification with priority over the HTTP header.

And most important, I think that usability should prevail over standards compliance, so if the browser has to display webpages wrong to display them "right" (i.e. what both the user and webmaster would expect them to be), then I'd rather choose to override the standard for the benefit of the user. Maybe this could be changed via a setting or an addon, although I'd prefer Firefox to do things "right" out of the box.
(In reply to cousteau from comment #13)
> However, this behavior is troublesome, and I'd consider it "a bug in the
> standard".

Maybe, but it's consistently implemented by all browsers, so the consistency is worth more than supposed benefits from shaking things up.

> The reason is that the <meta> tag can be included inside the same file where
> the encoding is used, whereas the Content-Type may be a feature of the HTTP
> server that is not as easy to change.

Right. http://www.intertwingly.net/slides/2004/devcon/69.html

> So in case of mismatch between the
> HTTP header and the HTML content, I'd be more inclined to think that the
> valid choice is the one indicated in the HTML file (why would it be there
> otherwise?).

HTML documents contain all kinds of crazy things for illogical reasons.

> This doesn't totally go against the specifications

It does go completely against the spec we are implementing: http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding

> > 4. If the transport layer specifies a character encoding, and it is supported, return that encoding

This means HTTP. The part "return that encoding" terminates the algorithm.

> but they seem to suggest
> that the browser could choose to perform the charset identification with
> priority over the HTTP header.

No, they don't, since the algorithm terminates at step 4 when the HTTP header states a supported encoding label.

> And most important, I think that usability should prevail over standards
> compliance, so if the browser has to display webpages wrong to display them
> "right" (i.e. what both the user and webmaster would expect them to be),
> then I'd rather choose to override the standard in benefit of the user.

If we changed the behavior now, we'd potentially break pages that say the right thing on the HTTP level and say something else in <meta>. See above about crazy things and illogical reasons for why there is content like this.
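
The point about "return that encoding" ending the algorithm can be illustrated with a small sketch. It covers only the two steps quoted in this thread and is not the full spec algorithm; the function names, the use of Python's codecs registry as the "supported" check, and the windows-1252 default are assumptions.

```python
import codecs
from typing import Callable, Optional


def is_supported(label: str) -> bool:
    """Assumption: 'supported' means Python's codec registry knows the label."""
    try:
        codecs.lookup(label)
        return True
    except LookupError:
        return False


def determine_encoding(transport_label: Optional[str],
                       prescan: Callable[[], Optional[str]],
                       default: str = "windows-1252") -> str:
    # Step 4: a supported transport-layer (HTTP) label ends the algorithm here.
    if transport_label and is_supported(transport_label):
        return transport_label.lower()
    # Step 5: the prescan (which looks at <meta>) is only reached when
    # step 4 did not return.
    found = prescan()
    if found and is_supported(found):
        return found.lower()
    return default
```

Passing a prescan callback that records its calls shows that it is never invoked when the HTTP header carries a supported label, so the <meta> value has no chance to override it.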