
Hixie (not reading bugmail)

Reporter

Comment 2

•

26 years ago

Another advantage of the _charset_ field is that it also works with HTTP GET. We cannot add a charset parameter to the Content-Type header for HTTP GET, because there is no Content-Type header in the case of GET (since there is no body following the headers). A *lot* of forms use GET.

Comment 3

•

26 years ago

Since GET requests always carry plain 7-bit ASCII that encodes standard ISO-Latin-1 text, I would have thought that this was pretty irrelevant.

Hixie (not reading bugmail)

Comment 4

•

26 years ago

[oops, reinstating cc line that I clobbered with my last comment]

Reporter

Comment 5

•

26 years ago

No, in the real world, form GETs also use charsets other than us-ascii and iso-8859-1. Those octets are of course "URL-encoded" (%XX hex encoded) so that it looks like ASCII to the casual observer, but in fact the charset underneath could be just about anything. And *that* charset is what we are talking about here.
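To make the point above concrete, here is a minimal Python sketch (not code from this bug) showing that the same form value produces different "%XX" sequences depending on the charset underneath. The helper name `encode_form_value` is made up for illustration; `quote` and `unquote` are the standard `urllib.parse` functions.

```python
from urllib.parse import quote, unquote


def encode_form_value(text: str, charset: str) -> str:
    """Percent-encode text after converting it to bytes in the given charset."""
    return quote(text, safe="", encoding=charset)


value = "é"
latin1_form = encode_form_value(value, "iso-8859-1")  # one byte: 0xE9
utf8_form = encode_form_value(value, "utf-8")         # two bytes: 0xC3 0xA9

print(latin1_form)  # %E9
print(utf8_form)    # %C3%A9

# Both strings are pure ASCII on the wire, but decoding the UTF-8 octets
# as ISO-8859-1 silently yields the wrong text.
misread = unquote(utf8_form, encoding="iso-8859-1")
print(misread)  # Ã©
```

Without some out-of-band hint, the receiving script has no reliable way to tell which of the two encodings it was given.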

Comment 6

•

26 years ago

This seems like a hack... are any sites out there using the _charset_ field now? Will we break them by not including a _charset_ field? What would happen if someone named a form element "_charset_" and set its value to "Mickey Mouse figurines" or something?

Hixie (not reading bugmail)

Comment 7

•

26 years ago

Gagan, are you familiar with MSIE's work-around for this issue?

Comment 8

•

26 years ago

According to RFC1630:

# Where the local naming scheme uses ASCII characters which are not
# allowed in the URI, these may be represented in the URL by a
# percent sign "%" immediately followed by two hexadecimal digits
# (0-9, A-F) giving the ISO Latin 1 code for that character.
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Does this not mean that URIs can only contain a single character encoding? Similarly in HTML4, section 17.13.1:

# Note. The "get" method restricts form data set values to ASCII
# characters. Only the "post" method (with
# enctype="multipart/form-data") is specified to cover the entire
# [ISO10646] character set.

Comment 9

•

26 years ago

Re-adding gagan to CC.

Reporter

Comment 10

•

26 years ago

Pollmann, yes, it is a bit of a hack. I don't know if there are any sites out there using _charset_. Even if there are, we wouldn't "break" them by omitting _charset_, since there are many installed clients that omit _charset_ (such as the old Netscape clients). However, people have often asked us to indicate the charset in form submissions, and we did at one point try to add the charset parameter to Content-Type for POSTs, but quickly found that it broke. Now that MSIE5 has found a pretty good way to indicate the charset, I'm suggesting that we consider doing that too. If someone happened to already have a form field called _charset_ and set its value to "Mickey Mouse", our code would break that form, but I claim that such cases would be rare, and it would be easier to get that small number of people to modify their applications than to get the large number of people with software that can't grok charset in Content-Type to alter theirs.

Reporter

Comment 11

•

26 years ago

Ian, RFC1630 does attempt to restrict URIs to iso-8859-1, but it is a well known fact that HTML forms have been violating that rule for years. Similar comment for HTML4's 17.13.1.

Reporter

Comment 12

•

26 years ago

Here is the full text of a recent email from MS sent to the Unicode mailing list, describing their _charset_ field:

Subject: Re: HTML forms and UTF-8
Date: 8 Nov 99 03:29:38 GMT
From: christw@microsoft.com (Chris Wendt)
To: unicode@unicode.org

François has a typo in here:

<FORM ACTION="..." METHOD="..." ACCEPT_CHARSET="UTF-8">

should be

<FORM ACTION="..." METHOD="..." ACCEPT-CHARSET="UTF-8">

(note the dash instead of underscore)

Internet Explorer 5:

1) Sets a hidden field named "_charset_" to the encoding the FORM data was submitted in. ("_charset_" name includes the underscores).
2) Submits in UTF-8 if Accept-Charset="UTF-8" is given in the <form> and input is found which does not fit into the form page's encoding.
3) If Accept-Charset is not specified or not set to UTF-8, Internet Explorer 5 will submit in the form page's encoding, given that support for that encoding is present on the client machine.
4) Submits the form data which does not fit into the used encoding as HTML4 Numeric Character References.

Internet Explorer 4 shares feature 4) above and always submits in the form page's encoding.

You could prepopulate the _charset_ field with the form page's encoding so it always gets returned to your CGI.

Chris..

----- Original Message -----
From: François Yergeau <yergeau@alis.com>
To: Unicode List <unicode@unicode.org>
Sent: Sunday, November 07, 1999 2:38 PM
Subject: RE: HTML forms and UTF-8

> De: Glen Perkins [mailto:Glen.Perkins@nativeguide.com]
> Date: dimanche 7 novembre 1999 16:14
>
> What is the best approach for getting data submitted by an HTML form into
> Unicode (presumably UTF-8) encoding?

See http://www.w3.org/TR/html40/interact/forms.html#h-17.3 for the official way to do that. Warning: it doesn't work.

> I'd like to be able to roll out forms in any number of languages/scripts and
> have the data returned to the same CGI program (perl_mod or whatever) in the
> same encoding, UTF-8, or else determine the encoding of the returning data
> and convert to UTF-8 immediately as the first step in the CGI/server side
> processing program.

Good idea, but be prepared for a rough ride :-(

The traditional way that forms work is that the data is returned in the same encoding as the page containing the form. This kind of works when there is a single page in a single encoding (no transcoding proxy, for instance) handled by a single CGI script. But it breaks in many cases and does not allow multilingual content (except when the page is in Unicode, of course).

RFC 2070 tried to improve that by introducing an Accept-Charset attribute on the INPUT element within forms. HTML 4 (the reference above) adopted that but moved the attribute to the FORM element. The wording accompanying it is pretty bad: the fact that it is supposed to influence browser behaviour is very unclear. Anyway, to tell the browser you want the data in UTF-8, you're supposed to say this:

<FORM ACTION="..." METHOD="..." ACCEPT_CHARSET="UTF-8">
. . .
</FORM>

The problem is that most browsers will not listen, and still send the data in the page's encoding.

People usually deal with that by using a hidden field in the form (<INPUT TYPE="hidden">). You can put some text in there that will not be shown to the user but that will be returned as part of the form data. By looking at the bytes of that text, your CGI can determine its encoding (knowing the characters in advance) and that is also the encoding of the rest of the data.

If that smells like a hack, looks like a hack and moves like a hack, that's because it *is* a hack. But that's the best we have in the current broken architecture of Web forms.

--
François Yergeau
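Rules 2) to 4) of the email above can be sketched roughly as follows. This is only an illustration in Python of the behaviour as described, not IE5's actual code; the function names are hypothetical, and the "support for that encoding is present on the client machine" condition of rule 3) is not modelled.

```python
def choose_submission_charset(values, page_charset, accept_charset=None):
    """Pick the submission charset per rules 2) and 3) as described in the email."""

    def fits(text):
        # True when the text can be represented in the page's encoding.
        try:
            text.encode(page_charset)
            return True
        except UnicodeEncodeError:
            return False

    wants_utf8 = (
        accept_charset is not None
        and accept_charset.strip().lower() == "utf-8"
    )
    # Rule 2: UTF-8 only if requested AND some input does not fit the page charset.
    if wants_utf8 and not all(fits(v) for v in values):
        return "utf-8"
    # Rule 3: otherwise fall back to the form page's encoding.
    return page_charset


def encode_value(text, charset):
    """Rule 4: characters outside the charset become numeric character references."""
    return text.encode(charset, errors="xmlcharrefreplace")


values = ["abc", "日本"]
print(choose_submission_charset(values, "iso-8859-1", "UTF-8"))  # utf-8
print(choose_submission_charset(values, "iso-8859-1"))           # iso-8859-1
print(encode_value("日本", "iso-8859-1"))  # b'&#26085;&#26412;'
```

Note that the numeric character references in rule 4) are indistinguishable from a user who literally typed "&#26085;", which is one reason a reliable charset indication is preferable.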

Gagan

Comment 13

•

26 years ago

Wow... a lot of activity on this bug in one day! But I like what has been discussed here (and in other bugs associated with this one). I like the idea of a magic field (_charset_) to specify the charset that a server is willing to accept (alternatively, we could have done something with the HTTP headers or META tags on the form's page to keep it "in spec"), but this would work too. As for the potential case of someone setting a hidden field named _charset_, we should allow our value to be overridden by the author's value (after all, the web author knows better what he/she is doing). This will not break those sites. And the minuscule share of websites that need both the _charset_ value and a field of their own named _charset_ can be told to apply the easy fix.

Reporter

Comment 14

•

26 years ago

There is a slight misunderstanding in the previous comment. The _charset_ field is not to indicate the charset that a server is willing to accept. It is the charset that the client is sending out.
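Since _charset_ names the charset the client actually used, a server-side script could use it to decode the rest of the submission. The following Python sketch shows one way this might look; `decode_form` and its ISO-8859-1 fallback are assumptions for illustration, not anything specified in this bug.

```python
import codecs
from urllib.parse import parse_qsl


def decode_form(query: str, fallback: str = "iso-8859-1") -> dict:
    """Decode a URL-encoded form using the client-reported _charset_ field."""
    # First pass: ISO-8859-1 maps every byte to a character, so this never
    # fails and lets us read the ASCII-only _charset_ value safely.
    raw = dict(parse_qsl(query, keep_blank_values=True, encoding="iso-8859-1"))
    charset = raw.get("_charset_") or fallback
    try:
        codecs.lookup(charset)
    except LookupError:
        # Not a known charset name, e.g. an author-supplied "Mickey Mouse".
        charset = fallback
    # Second pass: decode the same octets with the reported charset.
    return dict(
        parse_qsl(query, keep_blank_values=True, encoding=charset, errors="replace")
    )


print(decode_form("q=%C3%A9&_charset_=UTF-8")["q"])       # é
print(decode_form("q=%E9&_charset_=iso-8859-1")["q"])     # é
```

For brevity the sketch collapses repeated field names into a single dict entry; a real handler would keep the list of pairs.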

Reporter

Comment 15

•

26 years ago

Playing devil's advocate against myself, if server-side software authors would like to receive the charset info from the client, another way to do this is to ask for multipart/form-data in the FORM element's ENCTYPE. This type of form submission allows you to attach the charset parameter to each part. (We need to test this too, though, to see if a bunch of sites will break when adding charset to *these* Content-Type headers!) If we decide to go this way, we will be HTML4-compliant, and it is all very clean. So this is an advantage. One possible disadvantage of not sending out _charset_ could be that some sites start using _charset_ to work with MSIE5, and that those authors then complain about Mozilla not supporting it. We can try to bang those people over the head with the spec, but that might not work. It depends on how popular MSIE5 becomes, and how popular _charset_ becomes.
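For comparison, here is a rough Python sketch of what a multipart/form-data body with a charset parameter on each part might look like. The `build_multipart` helper and the fixed boundary string are hypothetical; the sketch assumes ASCII field names and that the boundary never occurs in the data, and it says nothing about whether deployed servers accept such parts.

```python
def build_multipart(fields, charset, boundary="----sketch-boundary"):
    """Return (Content-Type header value, body bytes) for (name, value) pairs."""
    delimiter = b"--" + boundary.encode("ascii")
    lines = []
    for name, value in fields:
        lines.append(delimiter)
        lines.append(
            b'Content-Disposition: form-data; name="' + name.encode("ascii") + b'"'
        )
        # The per-part charset label discussed in the comment above.
        lines.append(b"Content-Type: text/plain; charset=" + charset.encode("ascii"))
        lines.append(b"")  # blank line separates part headers from part body
        lines.append(value.encode(charset))
    lines.append(delimiter + b"--")  # closing delimiter
    lines.append(b"")
    return f"multipart/form-data; boundary={boundary}", b"\r\n".join(lines)


header, body = build_multipart([("q", "é")], "utf-8")
print(header)
print(body.decode("utf-8"))
```

Unlike the _charset_ field, this only applies to POST, since a GET request has no body to carry the parts.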

Updated

•

26 years ago

Status: NEW → ASSIGNED

Target Milestone: M14

Comment 16

•

26 years ago

I got our browser to submit _charset_ fields. Unfortunately, I have it doing this for every form, so I haven't checked it in yet. Okay, to clarify: in exactly which cases does IE 5.0 send out the _charset_ field? I couldn't get it to do so in these cases:

http://blueviper/forms/utf-8.html

<HTML>
<BODY>
Standard GET:
<FORM ACTION="http://pollmann.net/echo.cgi" METHOD=GET>
<INPUT TYPE=HIDDEN VALUE="Foo"><INPUT TYPE=SUBMIT>
</FORM>
Standard POST:
<FORM ACTION="http://pollmann.net/echo.cgi" METHOD=POST>
<INPUT TYPE=HIDDEN VALUE="Foo"><INPUT TYPE=SUBMIT>
</FORM>
UTF-8 GET:
<FORM ACTION="http://pollmann.net/echo.cgi" METHOD=GET ACCEPT_CHARSET="UTF-8">
<INPUT TYPE=HIDDEN VALUE="Foo"><INPUT TYPE=SUBMIT>
</FORM>
Standard form:
<FORM ACTION="http://pollmann.net/echo.cgi" METHOD=POST ACCEPT_CHARSET="UTF-8">
<INPUT TYPE=HIDDEN VALUE="Foo"><INPUT TYPE=SUBMIT>
</FORM>
</BODY>
</HTML>

None of these forms submit a _charset_ field. It doesn't make sense for us to go off and implement our own take on this, so I want to make sure we do this iff IE does it.

Reporter

Comment 17

•

26 years ago

I just tried it out on MSIE5, and discovered that the form must include an INPUT element of TYPE HIDDEN with NAME _charset_. Otherwise, IE5 doesn't return that field. When that field is present, IE5 returns it with the charset value filled in:

<INPUT TYPE="HIDDEN" NAME="_charset_">

gerardok

Comment 18

•

26 years ago

QA Contact update.

Comment 19

•

26 years ago

Moving off to M16 - please speak up if you need this for M14, thanks!

Updated

•

26 years ago

Target Milestone: M14 → M16

Sitsofe Wheeler

Comment 20

•

26 years ago

I'm very unsure about this, but is this related to bug 5313?

Comment 21

•

26 years ago

Rescheduling

Target Milestone: M16 → M21

Comment 22

•

26 years ago

This bug has been marked "future" because the original Netscape engineer working on this is over-burdened. If you feel this is an error, that you or another known resource will be working on this bug, or if it blocks your work in some way -- please attach your concern to the bug for reconsideration.

Target Milestone: M21 → Future

Reporter

Comment 23

•

26 years ago

Adding ftang and momoi to Cc list. We need to make sure we are going to ship a browser with sufficient I18N capability and compliance in the HTML forms area. The _charset_ field is one thing that several W3C I18N IG participants and I identified as being a good short-term solution. Frank, who owns this? Kat, do we have good tests and testing in this area?

Katsuhiko Momoi

Comment 24

•

26 years ago

I created 2 basic echo test templates (for POST and GET) to cover scenario #1 quoted in your comment (Erik van der Poel, 1999-11-12 22:38). There is already a test case for #3 on our test server also. In addition to these, it should be easy to add cases which use _charset_ along with 1) pages marked with meta charset tags, and 2) pages marked with HTTP charset info from servers. We can also add a case for scenario #2. I have not confirmed that scenario #4 works with IE5 when the hidden _charset_ field exists: the form refuses to execute. It does work without it. I'll be happy to work with anyone assigned to verify this bug.

Vladimir Ermakov

Comment 25

•

25 years ago

Updating QA contact.

QA Contact: ckritzer → vladimire

Daniel Bratell

Updated

•

24 years ago

Blocks: 70838

Kevin McCluskey (gone)

Comment 26

•

24 years ago

Bulk reassigning form bugs to Alex

Assignee: pollmann → alexsavulov

Status: ASSIGNED → NEW

Alexandru Savulov

Updated

•

24 years ago

Summary: add support for _charset_ field in form submissions → add support for _charset_ field in form submissions [form sub]

Frank Tang

Comment 27

•

24 years ago

Removing Future; we should reconsider this. We should only add the "_charset_" field if no such field already exists.

Assignee: alexsavulov → nhotta

Target Milestone: Future → ---

nhottanscp

Updated

•

24 years ago

Status: NEW → ASSIGNED

Frank Tang

Comment 28

•

24 years ago

momoi, can you figure out how important this is for us?

Assignee

Comment 29

•

24 years ago

Can most servers (last couple revs of IIS and Apache especially) now handle this properly? _charset_ workaround might be unnecessary.

nhottanscp

Updated

•

24 years ago

Target Milestone: --- → mozilla1.2

Assignee

Comment 30

•

23 years ago

Attached patch: Patch

This should do it. Any <input type=hidden name=_charset_> will have its value hijacked and sent as the actual charset being used. If accept-charset is used to send the form, that value will be sent. This works across all enctypes and methods.
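The behaviour described above could be modelled along these lines. This is a Python sketch for illustration only, not the attached patch; the `Control` class and `build_query` function are made-up names, and accept-charset is simplified to a single charset name rather than a list.

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass
class Control:
    type: str   # e.g. "hidden", "text"
    name: str
    value: str


def build_query(controls, page_charset, accept_charset=None):
    """Build a URL-encoded form data set, filling in hidden _charset_ fields."""
    # Assumption: accept-charset, when present, names the submission charset.
    charset = accept_charset or page_charset
    pairs = []
    for control in controls:
        if control.type.lower() == "hidden" and control.name == "_charset_":
            # Replace whatever the author put there with the real charset.
            pairs.append((control.name, charset))
        else:
            pairs.append((control.name, control.value))
    return urlencode(pairs, encoding=charset, errors="xmlcharrefreplace")


form = [Control("hidden", "_charset_", ""), Control("text", "q", "é")]
print(build_query(form, "iso-8859-1"))           # _charset_=iso-8859-1&q=%E9
print(build_query(form, "iso-8859-1", "UTF-8"))  # _charset_=UTF-8&q=%C3%A9
```

A form without a hidden _charset_ control is submitted unchanged, and a _charset_ control of any other type keeps its author-supplied value.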

Assignee

Comment 31

•

23 years ago

Taking so I don't lose track.

Assignee: nhotta → jkeiser

Status: ASSIGNED → NEW

Whiteboard: [FIX]

Assignee

Updated

•

23 years ago

Status: NEW → ASSIGNED

Assignee

Comment 32

•

23 years ago

Attached file: Testcase

This contains three testcases:

1. form with no accept-charset containing _charset_ (sends default charset)
2. form with accept-charset=UTF-8 and two _charset_s (sends two copies of UTF-8)
3. form with accept-charset=UTF-8 and no _charset_ (sends nothing)
4. form with accept-charset=UTF-8 and _charset_ that is input type=text (sends _charset_ as it normally is)
5. form with enctype=multipart/form-data
6. form with enctype=text/plain
7. form with method=get

Boris Zbarsky [:bzbarsky]

Assignee

Comment 33

•

23 years ago

OK, so it was 7 testcases. Sue me. :P

Frank Tang

Comment 34

•

23 years ago

Comment on attachment 94597 (Patch): r=ftang

Attachment #94597 - Flags: review+

Comment 35

•

23 years ago

Comment on attachment 94597 (Patch): sr=bzbarsky

Attachment #94597 - Flags: superreview+