Closed
Bug 238488
Opened 21 years ago
Closed 21 years ago
charset in HTTP header takes precedence over charset from meta tag in html
Categories
(Core :: DOM: HTML Parser, defect)
RESOLVED
INVALID
People
(Reporter: P, Unassigned)
References
Details
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040124
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040124
Apache 2.0, on Fedora Core 1 at least, has the following default config:
AddDefaultCharset UTF-8
This has the effect of causing all non-UTF-8 pages served from that
server to be rendered incorrectly on Gecko-based browsers.
The default apache config file has comments intimating that
mozilla is currently doing the wrong thing.
It's been independently causing problems here:
http://mail.gnome.org/archives/gtk-doc-list/2002-November/msg00026.html
Reproducible: Always
Steps to Reproduce:
1. put AddDefaultCharset UTF-8 in apache's httpd.conf
2. get an ISO-8859-1 encoded page (containing characters whose bytes are invalid in UTF-8, e.g. á)
Actual Results:
iso-8859-1 page is interpreted as UTF8 and then
any UTF8 invalid characters are just rendered as question marks
Expected Results:
The Content-Type specified in the html should take precedence
over that specified in the HTTP-header
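The "Actual Results" above can be reproduced outside a browser. The following is only a minimal sketch, assuming Python's built-in codecs as a stand-in for the browser's decoder; the sample text and variable names are made up for illustration. Python substitutes U+FFFD where the report describes question marks.

```python
# Illustrative only: an ISO-8859-1 page decoded under the wrong charset label.
# The sample text is hypothetical ("cafe" with e-acute, a space, a-acute).
page_bytes = "caf\u00e9 \u00e1".encode("iso-8859-1")  # b'caf\xe9 \xe1'

# What the page author intended (the charset named in the meta tag).
as_latin1 = page_bytes.decode("iso-8859-1")

# What happens when the UTF-8 label from the HTTP header wins: the bytes 0xE9
# and 0xE1 do not form valid UTF-8 sequences here, so each one is replaced
# by U+FFFD, the replacement character.
as_utf8 = page_bytes.decode("utf-8", errors="replace")

print(as_latin1)
print(as_utf8)
```

Plain ASCII text survives either way, which is why only the accented characters are visibly damaged on a mostly English page.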
Comment 1•21 years ago
http://www.w3.org/International/questions/qa-encoding-alts.html
When trying to figure out the character encoding of a resource, user agents will
try, in this order:
1. The HTTP Content-Type header sent by the server
2. The XML declaration (only for XHTML documents ??? Served as xml+xhtml???)
3. The HTML/XHTML meta element.
...
Preferred approach
Since the HTTP Content-Type header has precedence...
This isn't a spec, just an I18N Q&A page, but unless a spec says otherwise it
seems pretty obvious that the HTTP charset is the first choice.
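The precedence order quoted above can be sketched as a simple "first declared value wins" lookup. This is a hedged illustration, not Gecko's code: the function name, its parameters and the fallback value are all assumptions introduced here.

```python
from typing import Optional


def pick_encoding(http_charset: Optional[str],
                  xml_decl_encoding: Optional[str],
                  meta_charset: Optional[str],
                  fallback: str = "iso-8859-1") -> str:
    """Return the first declared encoding, in the order listed on the Q&A page.

    `fallback` is a hypothetical default for the case where nothing is
    declared; the quoted text does not say what should happen then.
    """
    # 1. HTTP Content-Type header, 2. XML declaration, 3. meta element.
    for candidate in (http_charset, xml_decl_encoding, meta_charset):
        if candidate:
            return candidate
    return fallback


# The situation from this bug: the server says UTF-8, the page says ISO-8859-1.
print(pick_encoding("UTF-8", None, "ISO-8859-1"))
```

Under this ordering the meta element only matters when the server sends no charset at all, which is the behaviour the report objects to.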
Reporter
Comment 2•21 years ago
Right, there is a mismatch between Apache and Gecko with respect to this.
IMHO Apache is correct, i.e. the HTTP header specifies the *default*
encoding, which can be specialised by the HTML document.
It's described here why the httpd should set a default encoding:
http://www.cert.org/tech_tips/malicious_code_mitigation.html
But that also says httpd should be doing any translations required,
and I don't think apache is doing that?
Another data point is that konq or Internet Exploder do give
precedence to the charset in the html file, unlike gecko.
Here's the comment from httpd.conf:
Specify a default charset for all pages sent out. This is
always a good idea and opens the door for future internationalisation
of your web site, should you ever want it. Specifying it as
a default does little harm. There are also some security
reasons in browsers, related to javascript and URL parsing
which encourage you to always set a default char set.
Comment 3•21 years ago
The Apache documentation is much clearer than the (somewhat unhelpful) notes in
the httpd.conf.
AddDefaultCharset Directive
This directive specifies the name of the character set that will be added to
any response that does not have any parameter on the content type in the HTTP
headers. This will override any character set specified in the body of the
document via a META tag.
http://httpd.apache.org/docs-2.0/mod/core.html#adddefaultcharset
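To make the quoted sentence concrete, here is a rough model of what it describes. It is a sketch of the documentation text only, not of Apache's actual implementation; the function name and its default argument are hypothetical.

```python
def add_default_charset(content_type: str, default_charset: str = "UTF-8") -> str:
    """Model the quoted rule: add a charset only if the Content-Type value
    carries no parameter at all. Hypothetical helper, not Apache code."""
    if ";" in content_type:
        # A parameter (e.g. an explicit charset) is already present; leave it.
        return content_type
    return f"{content_type}; charset={default_charset}"


print(add_default_charset("text/html"))
print(add_default_charset("text/html; charset=ISO-8859-1"))
```

The content of the page is never inspected in this model, so a meta tag in the body cannot influence the header. The same documentation page also lists `AddDefaultCharset Off` as a way to disable the directive.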
Reporter
Comment 4•21 years ago
Fair enough. As I see it there are 2 problems.
1. Apache should translate any inappropriate characters for AddDefaultCharset
2. Internet explorer should ignore the meta tag (can anyone confirm
exploder does this? I've just seen it posted that it does).
Note IMHO apache should not AddDefaultCharset if there is a value in the meta tag.
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → INVALID
Reporter
Comment 5•21 years ago
Just referencing the apache info (confirming that mozilla is correct)
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=23421
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=14513
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=23692
What a mess. Opera/Mozilla use the HTTP-header value
and Konqueror/Explorer use the meta tag value
Comment 6•21 years ago
*** Bug 258154 has been marked as a duplicate of this bug. ***
Comment 7•20 years ago
*** Bug 291949 has been marked as a duplicate of this bug. ***
Comment 8•20 years ago
*** Bug 293148 has been marked as a duplicate of this bug. ***
Comment 9•20 years ago
All I have to say is that the W3C documentation (
http://www.w3.org/TR/html401/charset.html#h-5.2.2 ) must be rewritten. It
contains logical errors and we are reproducing them.
Comment 10•20 years ago
This is highly annoying for non-English-speaking users: it's no longer a matter of accents, no text is readable at all, and we have to change the encoding EACH AND EVERY TIME the page reloads (imagine forums, link-hopping etc). May I suggest a solution?
Add an option like "Use *this* encoding for *this* page all the time". This will bring no complications to an already complicated issue, and it will provide peace of mind to all us users who don't use iso-8859-1 or UTF-8.
It won't use auto-detection (which I agree is not the correct approach), most users already know the issue about the encoding (albeit not in an expert way, they just know where to search to make the page appear correctly), and when the authors make the page to behave correctly, the user will just turn the option off *for this page only*, and have FF interpret the encoding using all the standard ways.
Reporter
Comment 11•18 years ago
Just updating the links to the apache bugs referenced above:
http://issues.apache.org/bugzilla/show_bug.cgi?id=23421
http://issues.apache.org/bugzilla/show_bug.cgi?id=14513
http://issues.apache.org/bugzilla/show_bug.cgi?id=23692
Comment 13•11 years ago
I'm aware that this bug is rather old and that it has been resolved as invalid, but I think it should be reconsidered.
According to http://www.w3.org/TR/html401/charset.html#h-5.2.2 the HTTP header has preference over the <meta http-equiv="Content-Type"> tag. However, this behavior is troublesome, and I'd consider it "a bug in the standard".
The reason is that the <meta> tag can be included inside the same file where the encoding is used, whereas the Content-Type may be a feature of the HTTP server that is not as easy to change. So in case of mismatch between the HTTP header and the HTML content, I'd be more inclined to think that the valid choice is the one indicated in the HTML file (why would it be there otherwise?).
This doesn't totally go against the specifications, since "In addition to this list of priorities, the user agent may use heuristics and user settings". That is, in case of a mismatch, the browser could apply an encoding of its choice (it's not clear whether these heuristics can sit anywhere in the list of priorities or only after them).
Furthermore, the HTML5 standard (http://www.w3.org/TR/html5/syntax.html#determining-the-character-encoding) lists these steps:
> 2. The user agent may wait for more bytes of the resource to be available
> 4. If the transport layer specifies a character encoding, and it is supported, return that encoding
> 5. Optionally prescan the byte stream to determine its encoding
Steps 2 and 5 seem to be identical except for some sort of timeout condition, and both seem to point to an algorithm explained later that involves <meta http-equiv="Content-Type"> and <meta charset=> tags. I didn't understand clearly what these steps mean, but they seem to suggest that the browser could choose to perform the charset identification with priority over the HTTP header.
And most important, I think that usability should prevail over standards compliance, so if the browser has to display webpages wrong to display them "right" (i.e. what both the user and webmaster would expect them to be), then I'd rather choose to override the standard in benefit of the user.
Maybe this could be changed via a setting or an addon, although I'd prefer Firefox to do things "right" out of the box.
Comment 14•11 years ago
(In reply to cousteau from comment #13)
> However, this behavior is troublesome, and I'd consider it "a bug in the
> standard".
Maybe, but it's consistently implemented by all browsers, so the consistency is worth more than supposed benefits from shaking things up.
> The reason is that the <meta> tag can be included inside the same file where
> the encoding is used, whereas the Content-Type may be a feature of the HTTP
> server that is not as easy to change.
Right. http://www.intertwingly.net/slides/2004/devcon/69.html
> So in case of mismatch between the
> HTTP header and the HTML content, I'd be more inclined to think that the
> valid choice is the one indicated in the HTML file (why would it be there
> otherwise?).
HTML documents contain all kinds of crazy things for illogical reasons.
> This doesn't totally go against the specifications
It does go completely against the spec we are implementing: http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding
> > 4. If the transport layer specifies a character encoding, and it is supported, return that encoding
This means HTTP. The part "return that encoding" terminates the algorithm.
> but they seem to suggest
> that the browser could choose to perform the charset identification with
> priority over the HTTP header.
No, they don't, since the algorithm terminates at step 4 when the HTTP header states a supported encoding label.
> And most important, I think that usability should prevail over standards
> compliance, so if the browser has to display webpages wrong to display them
> "right" (i.e. what both the user and webmaster would expect them to be),
> then I'd rather choose to override the standard in benefit of the user.
If we changed the behavior now, we'd potentially break pages that say the right thing on the HTTP level and say something else in <meta>. See above about crazy things and illogical reasons for why there is content like this.
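The "return that encoding" point about step 4 can be illustrated with a short sketch. It is not the browser's implementation: Python's codec registry stands in for the set of encodings a browser supports, and the function names and the `prescan_result` parameter are assumptions made for this example.

```python
import codecs
from email.message import Message
from typing import Optional


def transport_charset(content_type_header: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    # Returns the charset parameter lowercased, or None if it is absent.
    return msg.get_content_charset()


def determine_encoding(content_type_header: str,
                       prescan_result: Optional[str]) -> Optional[str]:
    """Sketch of step 4 followed by the later prescan step.

    `prescan_result` is a hypothetical stand-in for whatever a <meta>
    prescan of the byte stream would have found.
    """
    label = transport_charset(content_type_header)
    if label is not None:
        try:
            codecs.lookup(label)  # "and it is supported"
        except LookupError:
            pass  # unsupported label: fall through to the later steps
        else:
            return label  # step 4 returns here; <meta> is never consulted
    return prescan_result


print(determine_encoding("text/html; charset=UTF-8", "iso-8859-1"))
print(determine_encoding("text/html", "iso-8859-1"))
```

In this sketch the meta-derived value is only used when the header has no charset or names one that is not supported, which matches the reading given in comment 14.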