_charset_ field does not reflect encoding substitutions




7 years ago
2 years ago


(Reporter: Andrew Clover, Assigned: smontagu)







7 years ago
User-Agent:       Mozilla/5.0 (X11; U; Linux x86_64; en-GB; rv: Gecko/20101027 Ubuntu/10.10 (maverick) Firefox/3.6.12
Build Identifier: 

Browsers have long substituted a stated standard encoding in the `Content-Type: ... charset` parameter with a Windows code page equivalent, e.g. a page served as ISO-8859-1 is really handled as code page 1252. HTML5 section documents and standardises this behaviour.

`<input type="hidden" name="_charset_"/>` is an ugly IE extension to state the encoding used to submit form data, supported by Firefox and also subsequently standardised by HTML5 section

When a `_charset_` field is submitted on a form whose submission encoding (from the page charset or from `accept-charset`) is a substituted encoding, the value is given as the pre-substitution encoding and not the real encoding that is actually being used to submit the form.

This will confuse any server-side software that tries to use the exact named encoding, and seems not to be mandated by HTML5 (e.g. the submitted charset name is not the preferred MIME name of the encoding used).
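
A minimal Python sketch of the mismatch described above, from the server's point of view. This is not Firefox code: `decode_form_value` is a hypothetical helper, and the sketch assumes the browser really encodes the form data as windows-1252 while `_charset_` still reports ISO-8859-1.

```python
# Bytes a browser would send for the value “test” if it actually
# encodes the form data as windows-1252 (code page 1252).
submitted = "\u201ctest\u201d".encode("cp1252")


def decode_form_value(raw: bytes, charset_field: str) -> str:
    """Hypothetical server-side helper that trusts _charset_ literally."""
    return raw.decode(charset_field)


# Decoding with the name the browser reported yields C1 control characters.
as_reported = decode_form_value(submitted, "ISO-8859-1")

# Decoding with the encoding that was really used recovers the curly quotes.
as_actual = decode_form_value(submitted, "windows-1252")

print(repr(as_reported))  # '\x93test\x94'
print(repr(as_actual))    # '“test”'
```

A server that takes the reported name at face value ends up with U+0093 and U+0094 instead of the quotation marks the user typed.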

Reproducible: Always

Steps to Reproduce:
<form accept-charset="ISO-8859-1">
<input type="hidden" name="_charset_">
<input type="text" name="test" value="&#x201C;test&#x201D;">
</form>

Actual Results:  

Expected Results:  

IE and Opera give the Expected Results.


7 years ago
Severity: normal → minor
Version: unspecified → Trunk
We don't actually alias ISO-8859-1 to windows-1252; we just use the same underlying code to handle them.

Simon, maybe we should actually alias them?  This might be a web compat issue due to other places expecting to see ISO-8859-1....
Assignee: nobody → smontagu
Component: HTML: Form Submission → Internationalization
QA Contact: form-submission → i18n

Comment 2

7 years ago
Yes, I think we should, and we should handle all the substitutions listed at http://www.w3.org/TR/html5/parsing.html#character-encodings-0 in the same way.
If we do that, we should grep our source for the new aliases and change them to the canonical names; I'm fairly certain we have places where we assign ISO-8859-1 to member variables that expect a canonical charset name.
Ever confirmed: true

Comment 4

7 years ago
That's actually rather scary... Maybe we should retain ISO-8859-1 as a canonical name which nothing is aliased to, as a safety net.
How would that work?  To fix this bug as reported, ISO-8859-1 needs to be an alias for windows-1252, no?

Comment 6

7 years ago
Yes, but if we just change the alias and don't remove the existing ISO-8859-1 decoder, "ISO-8859-1"  will still function as a canonical charset name, at least for the purposes of GetUnicodeDecoderRaw. Were you thinking of other cases?
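
A rough Python sketch of the arrangement discussed in this comment, under the assumption that label resolution and raw decoder lookup are two separate tables. The names `ALIASES`, `RAW_DECODERS`, `canonical_name` and `get_decoder_raw` are hypothetical stand-ins for illustration only and do not reflect the actual Gecko data structures or the signature of GetUnicodeDecoderRaw.

```python
# Hypothetical label table: after the change, the ISO-8859-1 labels
# resolve to windows-1252 as the canonical name.
ALIASES = {
    "iso-8859-1": "windows-1252",
    "latin1": "windows-1252",
    "windows-1252": "windows-1252",
}

# Hypothetical raw decoder table: the real ISO-8859-1 decoder is kept,
# even though no label resolves to that name any more.
RAW_DECODERS = {
    "windows-1252": lambda data: data.decode("cp1252"),
    "ISO-8859-1": lambda data: data.decode("latin-1"),
}


def canonical_name(label: str) -> str:
    """Resolve a charset label to its canonical name (case-insensitive)."""
    return ALIASES[label.strip().lower()]


def get_decoder_raw(name: str):
    """Look up a decoder by exact name, bypassing alias resolution."""
    return RAW_DECODERS[name]


# The label now resolves to windows-1252 ...
resolved = canonical_name("ISO-8859-1")
# ... but a raw lookup by the old name still finds the old decoder.
old = get_decoder_raw("ISO-8859-1")(b"\x93")
new = get_decoder_raw(resolved)(b"\x93")
```

In this sketch, code that compares a resolved canonical name against the string "ISO-8859-1" never matches, which is the kind of callsite the earlier comment suggests grepping for.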
Oh, I see.

I dunno that the relevant places are relying on GetUnicodeDecoderRaw, but I agree that we should keep the decoder.  

The relevant callsites _did_ rely on string equality compares of canonical charset names against ISO-8859-1, iirc.  It's been a while.

Comment 8

7 years ago
Don't get me wrong -- I'm by no means disagreeing with comment 3.
OK, sounds like we agree on that and on comment 4, then.  ;)
Maybe the real ISO-8859-1 decoder should be behind some kind of special getter like the encodings recently banned from being exposed to Web content.

It would be rather counter-intuitive to have GetUnicodeDecoderRaw return a decoder for an argument that doesn't alias onto itself.
This got fixed as a side effect of moving to Encoding Standard-compliant label handling.
Last Resolved: 2 years ago
Resolution: --- → WORKSFORME