The specs say that we should add the charset parameter to the Content-Type header when doing an HTTP POST for a form submission, but we have found that a lot of server-side software cannot handle this. See the following: http://bugzilla.mozilla.org/show_bug.cgi?id=7533 MSIE has undoubtedly also figured this out, and they have come up with a good workaround, which is to indicate the charset via a special field in the form submission itself. That field's name is "_charset_" (including underscores). We should first of all consider whether to follow MSIE's example here. My vote is to do so!
Reassigning to EricP.
Another advantage of the _charset_ field is that it also works with HTTP GET. We cannot add a charset parameter to the Content-Type header for HTTP GET, because there is no Content-Type header in the case of GET (since there is no body following the headers). A *lot* of forms use GET.
Since GET requests are always straight 7-bit ASCII that encodes standard ISO-Latin-1 text, I would have thought that this was pretty irrelevant.
[oops, reinstating cc line that I clobbered with my last comment]
No, in the real world, form GETs also use charsets other than us-ascii and iso-8859-1. Those octets are of course "URL-encoded" (%XX hex encoded) so that it looks like ASCII to the casual observer, but in fact the charset underneath could be just about anything. And *that* charset is what we are talking about here.
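To make that concrete, here is a small Python sketch (the sample character is an arbitrary choice for illustration, not something from this bug) showing that the same text produces different %XX octets depending on which charset was applied before URL-encoding:

```python
from urllib.parse import quote, unquote_to_bytes

text = "\u00e9"  # "e" with acute accent; arbitrary sample character

# The same character percent-encodes differently under each charset.
as_latin1 = quote(text, encoding="iso-8859-1")
as_utf8 = quote(text, encoding="utf-8")

print(as_latin1)  # %E9
print(as_utf8)    # %C3%A9

# The receiver only sees ASCII. The octets underneath are ambiguous
# unless the charset is known from somewhere else.
raw = unquote_to_bytes(as_utf8)
right = raw.decode("utf-8")       # the original character
wrong = raw.decode("iso-8859-1")  # two unrelated Latin-1 characters
```

Both URLs look like plain ASCII on the wire, which is why the server cannot tell them apart without some extra hint.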
This seems like a hack... are any sites out there using the _charset_ field now? Will we break them by not including a _charset_ field? What would happen if someone named a form element "_charset_" and set its value to "Mickey Mouse figurines" or something?
Gagan, are you familiar with MSIE's work-around for this issue?
According to RFC1630:

# Where the local naming scheme uses ASCII characters which are not
# allowed in the URI, these may be represented in the URL by a
# percent sign "%" immediately followed by two hexadecimal digits
# (0-9, A-F) giving the ISO Latin 1 code for that character.
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Does this not mean that URIs can only contain a single character encoding?

Similarly in HTML4, section 17.13.1:

# Note. The "get" method restricts form data set values to ASCII
# characters. Only the "post" method (with
# enctype="multipart/form-data") is specified to cover the entire
# [ISO10646] character set.
Re-adding gagan to CC.
Pollmann, yes, it is a bit of a hack. I don't know if there are any sites out there using _charset_. Even if there are, we wouldn't "break" them by omitting _charset_, since there are many installed clients that omit _charset_ (such as the old Netscape clients). However, people have often asked us to indicate the charset in form submissions, and we did at one point try to add the charset parameter to Content-Type for POSTs, but quickly found that it broke. Now that MSIE5 has found a pretty good way to indicate the charset, I'm suggesting that we consider doing that too. If someone happened to already have a form field called _charset_ and set its value to "Mickey Mouse", our code would break that form, but I claim that such cases would be rare, and it would be easier to get that small number of people to modify their applications than to get the large number of people with software that can't grok charset in Content-Type to alter theirs.
Ian, RFC1630 does attempt to restrict URIs to iso-8859-1, but it is a well known fact that HTML forms have been violating that rule for years. Similar comment for HTML4's 17.13.1.
Here is the full text of a recent email from MS sent to the Unicode mailing list, describing their _charset_ field:

Subject: Re: HTML forms and UTF-8
Date: 8 Nov 99 03:29:38 GMT
From: email@example.com (Chris Wendt)
To: firstname.lastname@example.org

François has a typo in here:

  <FORM ACTION="..." METHOD="..." ACCEPT_CHARSET="UTF-8">

should be

  <FORM ACTION="..." METHOD="..." ACCEPT-CHARSET="UTF-8">

(note the dash instead of underscore)

Internet Explorer 5:
1) Sets a hidden field named "_charset_" to the encoding the FORM data was submitted in. ("_charset_" name includes the underscores).
2) Submits in UTF-8 if Accept-Charset="UTF-8" is given in the <form> and input is found which does not fit into the form page's encoding.
3) If Accept-Charset is not specified or not set to UTF-8, Internet Explorer 5 will submit in the form page's encoding, given that support for that encoding is present on the client machine.
4) Submits the form data which does not fit into the used encoding as HTML4 Numeric Character References.

Internet Explorer 4 shares feature 4) above and always submits in the form page's encoding.

You could prepopulate the _charset_ field with the form page's encoding so it always gets returned to your CGI.

Chris..

----- Original Message -----
From: François Yergeau <email@example.com>
To: Unicode List <firstname.lastname@example.org>
Sent: Sunday, November 07, 1999 2:38 PM
Subject: RE: HTML forms and UTF-8

> From: Glen Perkins [mailto:Glen.Perkins@nativeguide.com]
> Date: Sunday, 7 November 1999 16:14
>
> What is the best approach for getting data submitted by an HTML form into
> Unicode (presumably UTF-8) encoding?

See http://www.w3.org/TR/html40/interact/forms.html#h-17.3 for the official way to do that. Warning: it doesn't work.

> I'd like to be able to roll out forms in any number of languages/scripts and
> have the data returned to the same CGI program (perl_mod or whatever) in the
> same encoding, UTF-8, or else determine the encoding of the returning data
> and convert to UTF-8 immediately as the first step in the CGI/server side
> processing program.

Good idea, but be prepared for a rough ride :-(

The traditional way that forms work is that the data is returned in the same encoding as the page containing the form. This kind of works when there is a single page in a single encoding (no transcoding proxy, for instance) handled by a single CGI script. But it breaks in many cases and does not allow multilingual content (except when the page is in Unicode, of course).

RFC 2070 tried to improve that by introducing an Accept-Charset attribute on the INPUT element within forms. HTML 4 (the reference above) adopted that but moved the attribute to the FORM element. The wording accompanying it is pretty bad: the fact that it is supposed to influence browser behaviour is very unclear. Anyway, to tell the browser you want the data in UTF-8, you're supposed to say this:

  <FORM ACTION="..." METHOD="..." ACCEPT_CHARSET="UTF-8">
  . . .
  </FORM>

The problem is that most browsers will not listen, and still send the data in the page's encoding. People usually deal with that by using a hidden field in the form (<INPUT TYPE="hidden">). You can put some text in there that will not be shown to the user but that will be returned as part of the form data. By looking at the bytes of that text, your CGI can determine its encoding (knowing the characters in advance) and that is also the encoding of the rest of the data.

If that smells like a hack, looks like a hack and moves like a hack, that's because it *is* a hack. But that's the best we have in the current broken architecture of Web forms.

--
François Yergeau
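The hidden-field trick described at the end of that email could be sketched in Python roughly as follows. The probe text, the candidate charset list, and the function name are all assumptions made for illustration; a real CGI would pick a probe suited to the charsets it expects.

```python
from typing import Optional

# Hypothetical probe: text the page author placed in a hidden field.
# "e acute" plus the euro sign, chosen because the candidate charsets
# below all encode this pair differently.
PROBE = "\u00e9\u20ac"
CANDIDATES = ["utf-8", "windows-1252", "iso-8859-15"]  # assumed list


def sniff_charset(probe_bytes: bytes) -> Optional[str]:
    """Return the first candidate whose encoding of PROBE matches the bytes."""
    for name in CANDIDATES:
        try:
            if PROBE.encode(name) == probe_bytes:
                return name
        except UnicodeEncodeError:
            continue  # PROBE is not representable in this charset
    return None  # no candidate matched; the charset stays unknown
```

The sketch only works when the probe characters are encoded differently in every charset the server needs to tell apart, which is part of why the email calls the approach a hack.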
Wow... a lot of activity on this bug in one day! But I like what has been discussed here (and other associates of this bug). I like the idea of a magic field (_charset_) to specify the charset that a server is willing to accept. Alternatively, we could have done something with the HTTP headers or META tags on the form's page to keep it "in spec", but this would work too. As for the potential case of someone setting a hidden field of charset, we should allow our value to be overridden by that value (after all, the web author knows better what he/she is doing). This will not break those sites. And the minuscule section of websites that need both the _charset_ value and have a field that uses charset as a field name can be told to easy-fix it.
There is a slight misunderstanding in the previous comment. The _charset_ field is not to indicate the charset that a server is willing to accept. It is the charset that the client is sending out.
Playing devil's advocate against myself, if server-side software authors would like to receive the charset info from the client, another way to do this is to ask for multipart/form-data in the FORM element's ENCTYPE. This type of form submission allows you to attach the charset parameter to each part. (We need to test this too, though, to see if a bunch of sites will break when adding charset to *these* Content-Type headers!) If we decide to go this way, we will be HTML4-compliant, and it is all very clean. So this is an advantage. One possible disadvantage of not sending out _charset_ could be that some sites start using _charset_ to work with MSIE5, and that those authors then complain about Mozilla not supporting it. We can try to bang those people over the head with the spec, but that might not work. It depends on how popular MSIE5 becomes, and how popular _charset_ becomes.
I got our browser to submit _charset_ fields. Unfortunately, I have it doing this for every form, so I didn't check it in yet.

Okay, to clarify: in which cases, exactly, does IE 5.0 send out the _charset_ field? I couldn't get it to in these cases.

http://blueviper/forms/utf-8.html

<HTML>
<BODY>
Standard GET:
<FORM ACTION="http://pollmann.net/echo.cgi" METHOD=GET>
<INPUT TYPE=HIDDEN VALUE="Foo"><INPUT TYPE=SUBMIT>
</FORM>
Standard POST:
<FORM ACTION="http://pollmann.net/echo.cgi" METHOD=POST>
<INPUT TYPE=HIDDEN VALUE="Foo"><INPUT TYPE=SUBMIT>
</FORM>
UTF-8 GET:
<FORM ACTION="http://pollmann.net/echo.cgi" METHOD=GET ACCEPT_CHARSET="UTF-8">
<INPUT TYPE=HIDDEN VALUE="Foo"><INPUT TYPE=SUBMIT>
</FORM>
Standard form:
<FORM ACTION="http://pollmann.net/echo.cgi" METHOD=POST ACCEPT_CHARSET="UTF-8">
<INPUT TYPE=HIDDEN VALUE="Foo"><INPUT TYPE=SUBMIT>
</FORM>
</BODY>
</HTML>

None of these forms submit a _charset_ field. It doesn't make sense for us to go off and implement our own take on this, so I want to make sure we do this iff IE does it.
I just tried it out on MSIE5, and discovered that the form must include an INPUT element of TYPE HIDDEN with NAME _charset_. Otherwise, IE5 doesn't return that field. When that field is present, IE5 returns it with the charset value filled in.

<INPUT TYPE="HIDDEN" NAME="_charset_">
QA Contact update.
Moving off to M16 - please speak up if you need this for M14, thanks!
I'm very unsure about this, but is this related to bug 5313?
This bug has been marked "future" because the original netscape engineer working on this is over-burdened. If you feel this is an error, that you or another known resource will be working on this bug, or if it blocks your work in some way -- please attach your concern to the bug for reconsideration.
Adding ftang and momoi to Cc list. We need to make sure we are going to ship a browser with sufficient I18N capability and compliance in the HTML forms area. The _charset_ field is one thing that several W3C I18N IG participants and I identified as being a good short-term solution. Frank, who owns this? Kat, do we have good tests and testing in this area?
I created 2 basic echo test templates (for POST and GET) to cover scenario #1 quoted in your comment (Erik van der Poel, 1999-11-12 22:38). There is already a test case for #3 on our test server also. In addition to these, it should be easy to add cases which use _charset_ along with 1) pages marked with meta charset tags, and 2) pages marked with HTTP charset info from servers. We can also add a case for scenario #2. I have not confirmed that scenario #4 works with IE5 when the hidden _charset_ exists: the form refuses to execute. It does work without it. I'll be happy to work with anyone assigned to verify this bug.
Updating QA contact.
Bulk reassigning form bugs to Alex
Remove future; we should reconsider this. We should only add the "_charset_" field if there is no such field.
momoi, can you figure out how important this is for us?
Can most servers (last couple revs of IIS and Apache especially) now handle this properly? _charset_ workaround might be unnecessary.
Created attachment 94597
Patch

This should do it. Any <input type=hidden name=_charset_> will have its value hijacked and sent as the actual charset being used. If accept-charset is used to send the form, that value will be sent. This works across all enctypes and methods.
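The behaviour described for the patch can be modelled in a few lines of Python. This is only an illustrative sketch and not the patch itself: `Control` and `build_form_data` are hypothetical names, and the exact-match comparison on the field name is an assumption.

```python
from typing import List, NamedTuple, Tuple


class Control(NamedTuple):
    """Hypothetical stand-in for a form control."""
    type: str
    name: str
    value: str


def build_form_data(controls: List[Control], charset: str) -> List[Tuple[str, str]]:
    """Return (name, value) pairs, replacing hidden _charset_ values.

    `charset` is whatever encoding the form is actually submitted in
    (for example the accept-charset value, when that is used).
    """
    pairs = []
    for c in controls:
        if c.type.lower() == "hidden" and c.name == "_charset_":
            pairs.append((c.name, charset))  # author's value is overwritten
        else:
            pairs.append((c.name, c.value))  # submitted unchanged
    return pairs
```

Under this model a form with no hidden _charset_ control sends no such field, and a _charset_ control of any other type is submitted as the author wrote it.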
Taking so I don't lose track.
Created attachment 94602
Testcase

This contains three testcases:
1. form with no accept-charset containing _charset_ (sends default charset)
2. form with accept-charset=UTF-8 and two _charset_s (sends two copies of UTF-8)
3. form with accept-charset=UTF-8 and no _charset_ (sends nothing)
4. form with accept-charset=UTF-8 and _charset_ that is input type=text (sends _charset_ as it normally is)
5. form with enctype=multipart/form-data
6. form with enctype=text/plain
7. form with method=get
OK, so it was 7 testcases. Sue me. :P
Comment on attachment 94597
Patch

r=ftang
Comment on attachment 94597
Patch

sr=bzbarsky
Fix checked in
verifying on build 2002-08-27-08-trunk linux red hat
This is a reply to an old bug, but I'm writing server-side software and have problems with form post encodings. A question: shouldn't you know the encoding before you can even parse the body and see what the _charset_ field holds?
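For application/x-www-form-urlencoded data, one possible answer is that the body is pure ASCII once %XX escaping is applied, so it can be split into fields as bytes before any charset is chosen. A rough Python sketch, assuming the client filled in `_charset_` with an ASCII-compatible charset and falling back to an assumed default otherwise (the function name and the default are hypothetical):

```python
import codecs
from typing import Dict, List
from urllib.parse import unquote_to_bytes


def decode_form(body: bytes, default: str = "iso-8859-1") -> Dict[str, List[str]]:
    """Split a urlencoded body as bytes, then decode it using _charset_."""
    raw = []
    for item in body.split(b"&"):
        if not item:
            continue
        name, _, value = item.partition(b"=")
        # "+" means space in form encoding; undo it before %XX decoding.
        raw.append((unquote_to_bytes(name.replace(b"+", b" ")),
                    unquote_to_bytes(value.replace(b"+", b" "))))

    # The field name and the charset label are ASCII, so they can be
    # read without knowing the encoding of the other values.
    charset = default
    for name, value in raw:
        if name == b"_charset_" and value:
            label = value.decode("ascii", errors="replace")
            try:
                codecs.lookup(label)  # accept only labels Python knows
                charset = label
            except (LookupError, ValueError):
                pass  # unknown label: keep the default
            break

    fields: Dict[str, List[str]] = {}
    for name, value in raw:
        key = name.decode(charset, errors="replace")
        fields.setdefault(key, []).append(value.decode(charset, errors="replace"))
    return fields
```

For example, `decode_form(b"_charset_=UTF-8&q=caf%C3%A9")` decodes `q` as UTF-8, while the same call without the `_charset_` field falls back to the default. This does not cover multipart/form-data, where part boundaries and headers have to be parsed first.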