User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-GB; rv:188.8.131.52) Gecko/20101027 Ubuntu/10.10 (maverick) Firefox/3.6.12
Build Identifier:

Browsers have long substituted a Windows code page equivalent for the standard encoding stated in the `Content-Type: ... charset` parameter, e.g. a page served as ISO-8859-1 is really handled as code page 1252. HTML5 section 184.108.40.206 documents and standardises this behaviour.

`<input type="hidden" name="_charset_"/>` is an ugly IE extension to state the encoding used to submit form data, supported by Firefox and also subsequently standardised by HTML5 section 220.127.116.11.

When a `_charset_` field is submitted on a form whose submission encoding (from the page charset or from `accept-charset`) is a substituted encoding, the value is given as the pre-substitution encoding and not the real encoding that is actually being used to submit the form. This will confuse any server-side software that tries to use the exact named encoding, and seems not to be mandated by HTML5 (e.g. 18.104.22.168.3; the submitted charset name is not the preferred MIME name of the encoding used).

Reproducible: Always

Steps to Reproduce:
<form accept-charset="ISO-8859-1">
<input type="hidden" name="_charset_">
<input type="text" name="test" value="“test”">
</form>

Actual Results:
...?_charset_=ISO-8859-1&test=%93test%94

Expected Results:
...?_charset_=windows-1252&test=%93test%94

IE and Opera give the Expected Results.
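For anyone wanting to check the expected query string outside a browser, here is a rough Python sketch. It is not Gecko code: `SUBSTITUTIONS` and `build_query` are hypothetical names, and the table holds only the one substitution discussed in this report. It fills `_charset_` with the encoding actually used for the percent-encoding, which is the behaviour the report asks for.

```python
from urllib.parse import urlencode

# Hypothetical, deliberately tiny substitution table (assumption: only the
# ISO-8859-1 -> windows-1252 case from this report is modelled).
SUBSTITUTIONS = {"iso-8859-1": "windows-1252"}


def build_query(fields, declared_charset):
    """Encode form fields, reporting the post-substitution encoding in _charset_."""
    declared = declared_charset.strip().lower()
    effective = SUBSTITUTIONS.get(declared, declared)
    data = dict(fields)
    if "_charset_" in data:
        # State the encoding really used, not the declared one.
        data["_charset_"] = effective
    return urlencode(data, encoding=effective)


query = build_query({"_charset_": "", "test": "\u201ctest\u201d"}, "ISO-8859-1")
print(query)
```

Note that a strict ISO-8859-1 encoder cannot represent U+201C or U+201D at all, so the `%93` and `%94` bytes in both the actual and expected results can only have come from windows-1252.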
We don't actually alias ISO-8859-1 to windows-1252; we just use the same underlying code to handle them. Simon, maybe we should actually alias them? This might be a web compat issue due to other places expecting to see ISO-8859-1....
Yes, I think we should, and we should handle all the substitutions listed at http://www.w3.org/TR/html5/parsing.html#character-encodings-0 in the same way.
If we do that, we should grep our source for the new aliases and change them to the canonical names; I'm fairly certain we have places where we assign ISO-8859-1 to member variables that expect a canonical charset name.
That's actually rather scary... Maybe we should retain ISO-8859-1 as a canonical name which nothing is aliased to, as a safety net.
How would that work? To fix this bug as reported, ISO-8859-1 needs to be an alias for windows-1252, no?
Yes, but if we just change the alias and don't remove the existing ISO-8859-1 decoder, "ISO-8859-1" will still function as a canonical charset name, at least for the purposes of GetUnicodeDecoderRaw. Were you thinking of other cases?
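To illustrate the safety-net idea, here is a minimal Python sketch in which alias resolution and raw decoder lookup are separate tables. The names `ALIASES`, `RAW_DECODERS`, `resolve_alias` and `get_decoder_raw` are made up for illustration and only loosely mirror GetUnicodeDecoderRaw; the real Gecko data structures may look quite different.

```python
import codecs

# Hypothetical alias table: labels resolve to a canonical name.
ALIASES = {
    "iso-8859-1": "windows-1252",
    "latin1": "windows-1252",
    "windows-1252": "windows-1252",
}

# Hypothetical raw decoder registry, keyed by name with no alias resolution.
# The true ISO-8859-1 decoder stays registered even though nothing aliases to it.
RAW_DECODERS = {
    "windows-1252": codecs.getdecoder("cp1252"),
    "iso-8859-1": codecs.getdecoder("latin-1"),
}


def resolve_alias(label):
    """Map a charset label to its canonical name."""
    return ALIASES[label.strip().lower()]


def get_decoder_raw(name):
    """Look up a decoder by exact name, bypassing the alias table."""
    return RAW_DECODERS[name.lower()]


via_alias = get_decoder_raw(resolve_alias("ISO-8859-1"))(b"\x93")[0]
via_raw = get_decoder_raw("ISO-8859-1")(b"\x93")[0]
```

Here `via_alias` is U+201C (the windows-1252 reading of byte 0x93), while `via_raw` is the C1 control U+0093, so callers that skip alias resolution would still get the old ISO-8859-1 behaviour.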
Oh, I see. I dunno that the relevant places are relying on GetUnicodeDecoderRaw, but I agree that we should keep the decoder. The relevant callsites _did_ rely on string equality compares of canonical charset names against ISO-8859-1, iirc. It's been a while.
Don't get me wrong -- I'm by no means disagreeing with comment 3.
OK, sounds like we agree on that and on comment 4, then. ;)
Maybe the real ISO-8859-1 decoder should be behind some kind of special getter, like the encodings recently banned from being exposed to Web content. It would be rather counter-intuitive to have GetUnicodeDecoderRaw return a decoder for an argument that doesn't alias onto itself.
This got fixed as a side effect of moving to Encoding Standard-compliant label handling.
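For reference, Encoding Standard label handling can be sketched roughly as below. This is a Python illustration, not the Gecko implementation: `LABELS` is a small hand-picked subset of the standard's label table, and `get_encoding_name` is a hypothetical helper name. The standard strips leading and trailing ASCII whitespace from a label and matches it ASCII case-insensitively.

```python
import string

# Partial label table (assumption: a small subset of the Encoding Standard's
# labels, chosen for illustration only).
LABELS = {
    "ascii": "windows-1252",
    "us-ascii": "windows-1252",
    "iso-8859-1": "windows-1252",
    "iso8859-1": "windows-1252",
    "latin1": "windows-1252",
    "l1": "windows-1252",
    "cp1252": "windows-1252",
    "x-cp1252": "windows-1252",
    "windows-1252": "windows-1252",
    "iso-8859-9": "windows-1254",
    "latin5": "windows-1254",
    "windows-1254": "windows-1254",
}

ASCII_WHITESPACE = "\t\n\x0c\r "
ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def get_encoding_name(label):
    """Return the encoding name for a label, or None if the label is unknown."""
    key = label.strip(ASCII_WHITESPACE).translate(ASCII_LOWER)
    return LABELS.get(key)


print(get_encoding_name(" ISO-8859-1 "))
```

With a table like this there is no separate substitution step: ISO-8859-1 is simply one of several labels for windows-1252, so the name reported in `_charset_` and the encoder actually used cannot disagree.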