Bug 18643 - add support for _charset_ field in form submissions [form sub]
Status: VERIFIED FIXED
Whiteboard: [FIX]
Product: Core
Classification: Components
Component: HTML: Form Submission
Version: Trunk
Platform: All All
Importance: P3 normal with 1 vote
Target Milestone: mozilla1.2alpha
Assigned To: John Keiser (jkeiser)
QA Contact: Vladimir Ermakov
Blocks: 70838
Reported: 1999-11-11 20:52 PST by Erik van der Poel
Modified: 2011-04-21 14:27 PDT


Attachments
Patch (1.10 KB, patch)
2002-08-08 22:35 PDT, John Keiser (jkeiser)
ftang: review+
bzbarsky: superreview+
Testcase (2.48 KB, text/html)
2002-08-08 23:45 PDT, John Keiser (jkeiser)

Description Erik van der Poel 1999-11-11 20:52:20 PST
The specs say that we should add the charset parameter to the Content-Type
header when doing an HTTP POST for a form submission, but we have found that a
lot of server-side software cannot handle this. See the following:

http://bugzilla.mozilla.org/show_bug.cgi?id=7533

MSIE has undoubtedly also figured this out, and they have come up with a good
workaround, which is to indicate the charset via a special field in the form
submission itself. That field's name is "_charset_" (including underscores).

We should first of all consider whether to follow MSIE's example here. My vote
is to do so!
Comment 1 karnaze (gone) 1999-11-11 21:02:59 PST
Reassigning to EricP.
Comment 2 Erik van der Poel 1999-11-12 10:11:59 PST
Another advantage of the _charset_ field is that it also works with HTTP GET.
We cannot add a charset parameter to the Content-Type header for HTTP GET,
because there is no Content-Type header in the case of GET (since there is no
body following the headers). A *lot* of forms use GET.
Comment 3 Hixie (not reading bugmail) 1999-11-12 10:41:59 PST
Since GET requests are always straight 7-bit ASCII that encodes standard
ISO-Latin-1 text, I would have thought that that was pretty irrelevant.
Comment 4 Hixie (not reading bugmail) 1999-11-12 11:04:59 PST
[oops, reinstating cc line that I clobbered with my last comment]
Comment 5 Erik van der Poel 1999-11-12 13:56:59 PST
No, in the real world, form GETs also use charsets other than us-ascii and
iso-8859-1. Those octets are of course "URL-encoded" (%XX hex encoded) so that
it looks like ASCII to the casual observer, but in fact the charset underneath
could be just about anything. And *that* charset is what we are talking about
here.
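
To illustrate the point about URL-encoded octets, here is a minimal sketch using Python's standard urllib.parse (a choice made for this example, not something used in the bug). It shows that the same character produces different %XX escapes depending on the charset the form data was encoded in, and that decoding with the wrong charset gives the wrong text without any error.

```python
from urllib.parse import quote, unquote

text = "é"

# The same character yields different octets, and therefore different
# %XX escapes, depending on the charset used to encode the form data.
as_latin1 = quote(text, encoding="iso-8859-1")
as_utf8 = quote(text, encoding="utf-8")
print(as_latin1)  # %E9
print(as_utf8)    # %C3%A9

# Both strings are pure ASCII on the wire, but decoding with the wrong
# charset silently yields different text.
misread = unquote(as_utf8, encoding="iso-8859-1")
print(misread)    # Ã©
```

Nothing in the GET URL itself says which of the two interpretations is right, which is the gap the _charset_ field is meant to fill.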
Comment 6 Eric Pollmann 1999-11-12 16:41:59 PST
This seems like a hack...  are any sites out there using the _charset_ field
now?  Will we break them by not including a _charset_ field?  What would happen
if someone named a form element "_charset_" and set its value to "Mickey Mouse
figurines" or something?
Comment 7 Eric Pollmann 1999-11-12 16:43:59 PST
Gagan, are you familiar with MSIE's work-around for this issue?
Comment 8 Hixie (not reading bugmail) 1999-11-12 17:00:59 PST
According to RFC1630:
#     Where the local naming scheme uses ASCII characters which are not
#     allowed in the URI, these may be represented in the URL by a
#     percent sign "%" immediately followed by two hexadecimal digits
#     (0-9, A-F) giving the ISO Latin 1 code for that character.
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Does this not mean that URIs can only contain a single character encoding?

Similarly in HTML4, section 17.13.1:
#   Note. The "get" method restricts form data set values to ASCII
#   characters. Only the "post" method (with
#   enctype="multipart/form-data") is specified to cover the entire
#   [ISO10646] character set.
Comment 9 Eric Pollmann 1999-11-12 17:03:59 PST
Re-adding gagan to CC.
Comment 10 Erik van der Poel 1999-11-12 22:06:59 PST
Pollmann, yes, it is a bit of a hack. I don't know if there are any sites out
there using _charset_. Even if there are, we wouldn't "break" them by omitting
_charset_, since there are many installed clients that omit _charset_ (such as
the old Netscape clients). However, people have often asked us to indicate the
charset in form submissions, and we did at one point try to add the charset
parameter to Content-Type for POSTs, but quickly found that it broke. Now that
MSIE5 has found a pretty good way to indicate the charset, I'm suggesting that
we consider doing that too. If someone happened to already have a form field
called _charset_ and set its value to "Mickey Mouse", our code would break that
form, but I claim that such cases would be rare, and it would be easier to get
that small number of people to modify their applications than to get the large
number of people with software that can't grok charset in Content-Type to alter
theirs.
Comment 11 Erik van der Poel 1999-11-12 22:26:59 PST
Ian, RFC1630 does attempt to restrict URIs to iso-8859-1, but it is a well
known fact that HTML forms have been violating that rule for years. Similar
comment for HTML4's 17.13.1.
Comment 12 Erik van der Poel 1999-11-12 22:38:59 PST
Here is the full text of a recent email from MS sent to the Unicode mailing
list, describing their _charset_ field:

Subject: Re: HTML forms and UTF-8
Date: 8 Nov 99 03:29:38 GMT
From: christw@microsoft.com (Chris Wendt)
To: unicode@unicode.org

François has a typo in here:

  <FORM ACTION="..." METHOD="..." ACCEPT_CHARSET="UTF-8">

should be

  <FORM ACTION="..." METHOD="..." ACCEPT-CHARSET="UTF-8">

(note the dash instead of underscore)

Internet Explorer 5:

1) Sets a hidden field named "_charset_" to the encoding the FORM data was
submitted in. ("_charset_" name includes the underscores).
2) Submits in UTF-8 if Accept-Charset="UTF-8" is given in the <form> and
input is found which does not fit into the form page's encoding.
3) If Accept-Charset is not specified or not set to UTF-8, Internet Explorer
5 will submit in the form page's encoding, given that support for that
encoding is present on the client machine.
4) Submits the form data which does not fit into the used encoding as HTML4
Numeric Character References.

Internet Explorer 4 shares feature 4) above and always submits in the form
page's encoding.

You could prepopulate the _charset_ field with the form page's encoding so
it always gets returned to your CGI.

Chris..
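
Points 2) to 4) above can be modelled roughly as follows. This is only a sketch of the rules as the email states them, written in Python for illustration; the function names are hypothetical and it ignores the "encoding is present on the client machine" condition.

```python
from typing import List, Optional


def fits(text: str, charset: str) -> bool:
    """Return True if every character of text can be encoded in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def choose_charset(page_charset: str, accept_charset: Optional[str],
                   values: List[str]) -> str:
    """Pick the submission charset following rules 2) and 3)."""
    wants_utf8 = (accept_charset is not None
                  and accept_charset.strip().lower() == "utf-8")
    # Rule 2: switch to UTF-8 only when the form asks for it AND some
    # input does not fit the page's own encoding.
    if wants_utf8 and not all(fits(v, page_charset) for v in values):
        return "utf-8"
    # Rule 3: otherwise stay with the page's encoding.
    return page_charset


def encode_value(value: str, charset: str) -> bytes:
    """Rule 4: characters that do not fit become numeric character references."""
    return value.encode(charset, errors="xmlcharrefreplace")


print(choose_charset("iso-8859-1", "UTF-8", ["日本"]))
print(encode_value("日本", "iso-8859-1"))
```

Under rule 1), the value returned by choose_charset is what would end up in the hidden _charset_ field.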

----- Original Message -----
From: François Yergeau <yergeau@alis.com>
To: Unicode List <unicode@unicode.org>
Sent: Sunday, November 07, 1999 2:38 PM
Subject: RE: HTML forms and UTF-8


> De: Glen Perkins [mailto:Glen.Perkins@nativeguide.com]
> Date: dimanche 7 novembre 1999 16:14
>
> What is the best approach for getting data submitted by an
> HTML form into
> Unicode (presumably UTF-8) encoding?

See http://www.w3.org/TR/html40/interact/forms.html#h-17.3 for the official
way to do that.  Warning: it doesn't work.

> I'd like to be able to roll out forms in any number of
> languages/scripts and
> have the data returned to the same CGI program (perl_mod or
> whatever) in the
> same encoding, UTF-8, or else determine the encoding of the
> returning data
> and convert to UTF-8 immediately as the first step in the
> CGI/server side
> processing program.

Good idea, but be prepared for a rough ride :-(

The traditional way that forms work is that the data is returned in the same
encoding as the page containing the form.  This kind of works when there is
a single page in a single encoding (no transcoding proxy, for instance)
handled by a single CGI script. But it breaks in many cases and does not
allow multilingual content (except when the page is in Unicode, of course).

RFC 2070 tried to improve that by introducing an Accept-Charset attribute on
the INPUT element within forms.  HTML 4 (the reference above) adopted that
but moved the attribute to the FORM element.  The wording accompanying it is
pretty bad: the fact that it is supposed to influence browser behaviour is
very unclear. Anyway, to tell the browser you want the data in UTF-8, you're
supposed to say this:

 <FORM ACTION="..." METHOD="..." ACCEPT_CHARSET="UTF-8">
 .
 .
 .
 </FORM>

The problem is that most browsers will not listen, and still send the data
in the page's encoding.

People usually deal with that by using a hidden field in the form (<INPUT
TYPE="hidden">).  You can put some text in there that will not be shown to
the user but that will be returned as part of the form data.  By looking at
the bytes of that text, your CGI can determine its encoding (knowing the
characters in advance) and that is also the encoding of the rest of the
data.  If that smells like a hack, looks like a hack and moves like a hack,
that's because it *is* a hack.  But that's the best we have in the current
broken architecture of Web forms.

--
François Yergeau
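
The hidden-field hack described in the last paragraph of the quoted message can be sketched like this. The probe text, the candidate list and the function name are all assumptions made for the example; a real CGI would pick a probe whose byte sequences differ across the encodings it expects.

```python
from typing import Optional, Sequence

# Hypothetical probe text the page author puts in a hidden field.
SENTINEL = "日本語"
# Hypothetical set of encodings the CGI expects to receive.
CANDIDATES = ("utf-8", "shift_jis", "euc_jp", "iso2022_jp")


def sniff_charset(raw_sentinel: bytes,
                  sentinel: str = SENTINEL,
                  candidates: Sequence[str] = CANDIDATES) -> Optional[str]:
    """Guess the submission charset from the bytes of a known hidden field.

    raw_sentinel is the hidden field's value after %XX decoding but
    before any character decoding.
    """
    for charset in candidates:
        if sentinel.encode(charset) == raw_sentinel:
            return charset
    return None


print(sniff_charset("日本語".encode("shift_jis")))
```

The rest of the submission is then decoded with whatever charset the probe matched, which is exactly the guesswork that an explicit _charset_ value avoids.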
Comment 13 Gagan 1999-11-13 01:36:59 PST
Wow... a lot of activity on this bug in one day! I like what has been
discussed here (and in the other bugs associated with this one). I like the idea
of a magic field (_charset_) to specify the charset that a server is willing to
accept; alternatively, we could have done something with the HTTP headers or
META tags on the form's page to keep it "in spec", but this would work too. As
for the potential case of someone setting a hidden field of charset, we should
allow our value to be overridden by that value (after all, the web author knows
better what he/she is doing). This will not break those sites. And the minuscule
section of websites that need both the _charset_ value and a field of their own
that uses charset as its name can be told to easy-fix it.
Comment 14 Erik van der Poel 1999-11-13 08:02:59 PST
There is a slight misunderstanding in the previous comment. The _charset_ field
is not to indicate the charset that a server is willing to accept. It is the
charset that the client is sending out.
Comment 15 Erik van der Poel 1999-11-13 08:15:59 PST
Playing devil's advocate against myself, if server-side software authors would
like to receive the charset info from the client, another way to do this is to
ask for multipart/form-data in the FORM element's ENCTYPE. This type of form
submission allows you to attach the charset parameter to each part. (We need to
test this too, though, to see if a bunch of sites will break when adding charset
to *these* Content-Type headers!)

If we decide to go this way, we will be HTML4-compliant, and it is all very
clean. So this is an advantage.

One possible disadvantage of not sending out _charset_ could be that some sites
start using _charset_ to work with MSIE5, and that those authors then complain
about Mozilla not supporting it. We can try to bang those people over the head
with the spec, but that might not work. It depends on how popular MSIE5 becomes,
and how popular _charset_ becomes.
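
For comparison, here is roughly what the multipart/form-data alternative looks like from the server side. This is a sketch only: it uses Python's standard email package as a generic MIME parser, the boundary and field name are made up, and it assumes the client really does attach a charset parameter to each part.

```python
import email
from typing import Dict

# Hypothetical request body with one field whose part declares its charset.
body = (
    b"--XyZ\r\n"
    b'Content-Disposition: form-data; name="q"\r\n'
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"\r\n"
    + "café".encode("iso-8859-1")
    + b"\r\n--XyZ--\r\n"
)


def decode_parts(content_type: str, raw_body: bytes,
                 default_charset: str = "utf-8") -> Dict[str, str]:
    """Decode each form part using the charset from its own Content-Type."""
    raw = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n" + raw_body
    message = email.message_from_bytes(raw)
    fields: Dict[str, str] = {}
    if not message.is_multipart():
        return fields
    for part in message.get_payload():
        name = part.get_param("name", header="content-disposition")
        # Fall back to a default when the part carries no charset parameter.
        charset = part.get_content_charset() or default_charset
        fields[name] = part.get_payload(decode=True).decode(charset, errors="replace")
    return fields


fields = decode_parts("multipart/form-data; boundary=XyZ", body)
print(fields)
```

The per-part charset removes the guesswork, but only for forms that opt in to this enctype, and only if servers tolerate the extra parameter.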
Comment 16 Eric Pollmann 1999-11-19 16:43:59 PST
I got our browser to submit _charset_ fields.  Unfortunately, I have it doing
this for every form, so I haven't checked it in yet.

Okay, to clarify - in which cases, exactly, does IE 5.0 send out the _charset_
field?  I couldn't get it to in these cases.

http://blueviper/forms/utf-8.html

<HTML>
 <BODY>
  Standard GET:
  <FORM ACTION="http://pollmann.net/echo.cgi" METHOD=GET>
   <INPUT TYPE=HIDDEN VALUE="Foo"><INPUT TYPE=SUBMIT>
  </FORM>
  Standard POST:
  <FORM ACTION="http://pollmann.net/echo.cgi" METHOD=POST>
   <INPUT TYPE=HIDDEN VALUE="Foo"><INPUT TYPE=SUBMIT>
  </FORM>
  UTF-8 GET:
  <FORM ACTION="http://pollmann.net/echo.cgi" METHOD=GET ACCEPT_CHARSET="UTF-8">
   <INPUT TYPE=HIDDEN VALUE="Foo"><INPUT TYPE=SUBMIT>
  </FORM>
  Standard form:
  <FORM ACTION="http://pollmann.net/echo.cgi" METHOD=POST
ACCEPT_CHARSET="UTF-8">
   <INPUT TYPE=HIDDEN VALUE="Foo"><INPUT TYPE=SUBMIT>
  </FORM>
 </BODY>
</HTML>

None of these forms submit a _charset_ field.  It doesn't make sense for us to
go off and implement our own take on this, so I want to make sure we do this iff
IE does it.
Comment 17 Erik van der Poel 1999-11-20 15:38:59 PST
I just tried it out on MSIE5, and discovered that the form must include an
INPUT element of TYPE HIDDEN with NAME _charset_. Otherwise, IE5 doesn't return
that field. When that field is present, IE5 returns it with the charset value
filled in.

  <INPUT TYPE="HIDDEN" NAME="_charset_">
Comment 18 gerardok 1999-12-02 16:45:59 PST
QA Contact update.
Comment 19 Eric Pollmann 2000-02-08 11:56:24 PST
Moving off to M16 - please speak up if you need this for M14, thanks!
Comment 20 Sitsofe Wheeler 2000-02-23 02:40:50 PST
I'm very unsure about this, but is this related to bug 5313?
Comment 21 Eric Pollmann 2000-04-11 01:23:09 PDT
Rescheduling
Comment 22 Eric Pollmann 2000-06-10 05:34:50 PDT
This bug has been marked "future" because the original Netscape engineer working
on this is over-burdened. If you feel this is an error, that you or another
known resource will be working on this bug, or if it blocks your work in some way
-- please attach your concern to the bug for reconsideration.
Comment 23 Erik van der Poel 2000-06-10 11:19:55 PDT
Adding ftang and momoi to Cc list. We need to make sure we are going to ship a
browser with sufficient I18N capability and compliance in the HTML forms area.
The _charset_ field is one thing that several W3C I18N IG participants and I
identified as being a good short-term solution. Frank, who owns this? Kat, do
we have good tests and testing in this area?
Comment 24 Katsuhiko Momoi 2000-06-11 15:54:48 PDT
I created 2 basic echo test templates (for POST and GET) to cover
scenario #1 quoted in your comment:

Erik van der Poel 1999-11-12 22:38

There is already a test case for #3 on our test server also.

In addition to these, it should be easy to add cases which
use _charset_ along with 1) pages marked with meta charset tags,
and 2) pages marked with HTTP charset info from servers.

We can also add a case for scenario #2.
I have not been able to confirm that scenario #4 works with IE5 when
the hidden _charset_ field exists: the form refuses to execute.
It does work without it.

I'll be happy to work with anyone assigned to verify this bug.

Comment 25 Vladimir Ermakov 2000-09-25 14:44:56 PDT
Updating QA contact.
Comment 26 Kevin McCluskey (gone) 2001-10-05 14:34:50 PDT
Bulk reassigning form bugs to Alex
Comment 27 Frank Tang 2001-12-20 11:06:25 PST
Removing future; we should reconsider this. We should only add the "_charset_"
field if there is no such field.

Comment 28 Frank Tang 2002-01-10 17:47:06 PST
momoi, can you figure out how important this is for us?
Comment 29 John Keiser (jkeiser) 2002-01-17 18:59:48 PST
Can most servers (last couple revs of IIS and Apache especially) now handle this
properly?  _charset_ workaround might be unnecessary.
Comment 30 John Keiser (jkeiser) 2002-08-08 22:35:20 PDT
Created attachment 94597
Patch

This should do it.  Any <input type=hidden name=_charset_> will have its value
hijacked and sent as the actual charset being used.  If accept-charset is used
to send the form, that value will be sent.

This works across all enctypes and methods.
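
The behaviour described above can be modelled in a few lines. This is a Python sketch of the described behaviour only, not the patch itself; the Control type and function name are hypothetical, and matching the field name case-sensitively is an assumption.

```python
from typing import List, NamedTuple, Tuple


class Control(NamedTuple):
    """A simplified form control: input type, name and current value."""
    type: str
    name: str
    value: str


def build_form_data(controls: List[Control],
                    submission_charset: str) -> List[Tuple[str, str]]:
    """Build name/value pairs, overriding hidden _charset_ fields.

    submission_charset is the charset actually used to send the form
    (the accept-charset value if that was used, else the default).
    """
    pairs = []
    for control in controls:
        if control.type.lower() == "hidden" and control.name == "_charset_":
            # The author's value is replaced by the real charset.
            pairs.append((control.name, submission_charset))
        else:
            pairs.append((control.name, control.value))
    return pairs


form = [Control("hidden", "_charset_", ""), Control("text", "q", "hello")]
print(build_form_data(form, "UTF-8"))
```

Note that nothing is added when the form has no hidden _charset_ input, and a non-hidden control with that name is left alone.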
Comment 31 John Keiser (jkeiser) 2002-08-08 22:36:23 PDT
Taking so I don't lose track.
Comment 32 John Keiser (jkeiser) 2002-08-08 23:45:00 PDT
Created attachment 94602
Testcase

This contains three testcases:
1. form with no accept-charset containing _charset_ (sends default charset)
2. form with accept-charset=UTF-8 and two _charset_s (sends two copies of
UTF-8)
3. form with accept-charset=UTF-8 and no _charset_ (sends nothing)
4. form with accept-charset=UTF-8 and _charset_ that is input type=text (sends
_charset_ as it normally is)
5. form with enctype=multipart/form-data
6. form with enctype=text/plain
7. form with method=get
Comment 33 John Keiser (jkeiser) 2002-08-08 23:45:31 PDT
OK, so it was 7 testcases.  Sue me. :P
Comment 34 Frank Tang 2002-08-09 14:56:59 PDT
Comment on attachment 94597
Patch

r=ftang
Comment 35 Boris Zbarsky [:bz] (still a bit busy) 2002-08-09 22:29:43 PDT
Comment on attachment 94597
Patch

sr=bzbarsky
Comment 36 John Keiser (jkeiser) 2002-08-09 23:22:04 PDT
Fix checked in
Comment 37 Vladimir Ermakov 2002-08-27 14:04:59 PDT
verifying on build 2002-08-27-08-trunk linux red hat
Comment 38 Ilkka Huotari 2011-04-21 14:27:44 PDT
This is a reply to an old bug, but I'm writing server-side software and am having problems with form post encodings.

A question: shouldn't you know the encoding before you can even parse the body and see what the _charset_ field holds?
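
One possible way around this for application/x-www-form-urlencoded bodies, sketched in Python under stated assumptions: the body on the wire is ASCII with %XX escapes, so it can be split and percent-decoded to raw bytes before any charset is known, and the _charset_ name and value are themselves plain ASCII. The function name and the UTF-8 fallback are choices made for this example, and the approach does not cover multipart bodies.

```python
import codecs
from typing import List, Tuple
from urllib.parse import quote, unquote_to_bytes


def parse_urlencoded(body: bytes, fallback: str = "utf-8") -> List[Tuple[str, str]]:
    """Parse a urlencoded body, decoding text with the charset named in _charset_."""
    # Step 1: split and percent-decode to raw bytes; no charset needed yet.
    raw_pairs = []
    for chunk in body.split(b"&"):
        if not chunk:
            continue
        name, _, value = chunk.partition(b"=")
        raw_pairs.append((unquote_to_bytes(name.replace(b"+", b" ")),
                          unquote_to_bytes(value.replace(b"+", b" "))))
    # Step 2: look for the ASCII-named _charset_ field.
    charset = fallback
    for name, value in raw_pairs:
        if name == b"_charset_":
            charset = value.decode("ascii", errors="replace") or fallback
            break
    # Step 3: fall back if the client named a charset we do not know.
    try:
        codecs.lookup(charset)
    except LookupError:
        charset = fallback
    return [(n.decode(charset, errors="replace"), v.decode(charset, errors="replace"))
            for n, v in raw_pairs]


body = ("_charset_=Shift_JIS&q=" + quote("日本", encoding="shift_jis")).encode("ascii")
print(parse_urlencoded(body))
```

The key design point is delaying character decoding until after the field has been located, so the parser never needs to know the encoding up front.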
