Closed Bug 241540 Opened 20 years ago Closed 9 years ago

No charset encoding sent for application/x-www-form-urlencoded data

Categories

(Core :: DOM: Core & HTML, defect)

x86
Linux
defect
Not set
normal

Tracking

()

RESOLVED INVALID

People

(Reporter: moore, Unassigned)

References

Details

(Keywords: intl)

User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040413 Debian/1.6-5
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040413 Debian/1.6-5

When submitting form data, the browser fails to add the charset to the MIME
type. This is problematic because it leaves the application on the server side
to guess the encoding (Latin-1, UTF-8, etc.). In the case where the form tag
does not specify an accept-charset attribute, this is somewhat acceptable, as
the HTML 4 spec states:

    The default value for this attribute is the reserved string
    "UNKNOWN". User agents may interpret this value as the character
    encoding that was used to transmit the document containing this
    FORM element.

But not specifying it still leaves the application guessing somewhat. The
situation is worse when the form does supply an encoding and supplies more than
one option, e.g.:

   <form method="post" action="/newevent" accept-charset='ISO-8859-1,utf-8'>

In this case the server-side application will have no idea how to properly
interpret the submitted data. This can be solved by adding the encoding
to the Content-Type string, e.g.:

  Content-Type: application/x-www-form-urlencoded; charset=utf-8

This is especially an issue for people trying to write web applications that
support i18n and l10n.
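
To make the ambiguity concrete, here is a minimal Python sketch (the field name
and value are made up for illustration) showing that the same submitted bytes
give two different strings depending on which charset the server guesses:

```python
from urllib.parse import unquote_to_bytes

# Percent-encoded form data as a browser using UTF-8 would submit "name=é".
raw = unquote_to_bytes("name=%C3%A9")

# Without a declared charset the server has to pick one of these.
as_utf8 = raw.decode("utf-8")          # the intended text
as_latin1 = raw.decode("iso-8859-1")   # mojibake: two characters instead of one

print(repr(as_utf8), repr(as_latin1))
```

Both decodes succeed without error, so the server cannot detect a wrong guess
from the data alone.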

Relevant spec references:
http://www.w3.org/TR/REC-html40/interact/forms.html#adef-accept-charset

    accept-charset = charset list [CI]        
      This attribute specifies the list of character encodings for        
      input data that is accepted by the server processing this form.        
      The value is a space- and/or comma-delimited list of charset   
      values. The client must interpret this list as an exclusive-or        
      list, i.e., the server is able to accept any single character        
      encoding per entity received.                 

      The default value for this attribute is the reserved string        
      "UNKNOWN". User agents may interpret this value as the character          
      encoding that was used to transmit the document containing this        
      FORM element.



ftp://ftp.isi.edu/in-notes/rfc2616.txt
7.2.1 Type

   When an entity-body is included with a message, the data type of that
   body is determined via the header fields Content-Type and Content-
   Encoding. These define a two-layer, ordered encoding model:

       entity-body := Content-Encoding( Content-Type( data ) )

   Content-Type specifies the media type of the underlying data.
   Content-Encoding may be used to indicate any additional content
   codings applied to the data, usually for the purpose of data
   compression, that are a property of the requested resource. There is
   no default encoding.

   Any HTTP/1.1 message containing an entity-body SHOULD include a
   Content-Type header field defining the media type of that body. If
   and only if the media type is not given by a Content-Type field, the
   recipient MAY attempt to guess the media type via inspection of its
   content and/or the name extension(s) of the URI used to identify the
   resource. If the media type remains unknown, the recipient SHOULD
   treat it as type "application/octet-stream".


14.17 Content-Type

   The Content-Type entity-header field indicates the media type of the
   entity-body sent to the recipient or, in the case of the HEAD method,
   the media type that would have been sent had the request been a GET.

       Content-Type   = "Content-Type" ":" media-type

   Media types are defined in section 3.7. An example of the field is

       Content-Type: text/html; charset=ISO-8859-4

   Further discussion of methods for identifying the media type of an
   entity is provided in section 7.2.1.





Reproducible: Always
Steps to Reproduce:
1. Create a form with accept-charset='ISO-8859-1,utf-8'
2. Submit content


Actual Results:  
content type is sent as "Content-Type: application/x-www-form-urlencoded" with
no encoding information

Expected Results:  
Send the content type as:

 Content-Type: application/x-www-form-urlencoded; charset=utf-8

or

  Content-Type: application/x-www-form-urlencoded; charset=ISO-8859-1

depending on the encoding actually used.

I have tested this in a recent nightly build as well:

  Mozilla 1.8a: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8a) Gecko/20040423
This is a duplicate.  We used to send this information, but this apparently
broke a number of server-side applications and we had to disable this feature to
make form submission in Mozilla at all usable....
Whiteboard: DUPEME
(In reply to comment #1)
> This is a duplicate.  We used to send this information, but this apparently
> broke a number of server-side applications and we had to disable this feature to
> make form submission in Mozilla at all usable....

Would it be possible to add the charset in the case that the attribute
was specified on the form? My guess is that most apps that break when it is
present would not have that attribute set.

-Jonathn
Summary: Mozilla dose not provied charset encoding information for application/x-www-form-urlencoded data → No charset encoding sent for application/x-www-form-urlencoded data
More relevant reading on this issue, again from
ftp://ftp.isi.edu/in-notes/rfc2616.txt

It appears that Mozilla's policy of not putting the charset on the type is
at least partially in line with the standard:

 (From 3.7 ftp://ftp.isi.edu/in-notes/rfc2616.txt)
   Note that some older HTTP applications do not recognize media type
   parameters. When sending data to older HTTP applications,
   implementations SHOULD only use media type parameters when they are
   required by that type/subtype definition.

This is of course an unfortunate stipulation, as there is no clear way to
determine what an older HTTP application is, especially when submitting a
request to a server the client may never have talked to previously.

Mozilla performs as expected on forms with no accept-charset with respect to the
next quote:

 (From http://www.w3.org/TR/REC-html40/interact/forms.html#adef-accept-charset)
   The default value for this attribute is the reserved string        
   "UNKNOWN". User agents may interpret this value as the character          
   encoding that was used to transmit the document containing this        
   FORM element.

If the document containing the form was in UTF-8, Mozilla responds in UTF-8.
Where Mozilla falls down is in conforming to the next quote:

 (From 3.7.1 ftp://ftp.isi.edu/in-notes/rfc2616.txt)
   When no explicit charset
   parameter is provided by the sender, media subtypes of the "text"
   type are defined to have a default charset value of "ISO-8859-1" when
   received via HTTP. Data in character sets other than "ISO-8859-1" or
   its subsets MUST be labeled with an appropriate charset value.

This clearly states that if the encoding is not ISO-8859-1 (Latin-1), you
must include an encoding parameter in the media type.

I think the proper behavior for Mozilla should be to never send the encoding
parameter when the text is in Latin-1, but to always send it otherwise (as
the spec requires).

Why is this UNCONFIRMED?  There's no question that this behavior exists, nor that
it violates the W3C standard.
-> form submission
Assignee: darin → form-submission
Component: Networking: HTTP → HTML: Form Submission
QA Contact: core.networking.http
bz is right that we used to add it for a brief while, but were forced to remove
it because at that time the majority of server-side programs couldn't cope with
it. A quick Bugzilla search didn't lead me to where it's discussed extensively.
The revision history (before January 2002) of nsFormSubmission.cpp was lost (the
file was either moved or newly made). The code to add 'charset' is currently
blocked. See

http://lxr.mozilla.org/seamonkey/source/content/html/content/src/nsFormSubmission.cpp#495
Keywords: intl
The ifdef was added in bug 7533 (took some Attic-digging in CVS to find that). 
If the server software mentioned in that bug and the numerous duplicates has
been fixed, we should consider flipping that ifdef...
Whiteboard: DUPEME
This is an automated message, with ID "auto-resolve01".

This bug has had no comments for a long time. Statistically, we have found that
bug reports that have not been confirmed by a second user after three months are
highly unlikely to be the source of a fix to the code.

While your input is very important to us, our resources are limited and so we
are asking for your help in focussing our efforts. If you can still reproduce
this problem in the latest version of the product (see below for how to obtain a
copy) or, for feature requests, if it's not present in the latest version and
you still believe we should implement it, please visit the URL of this bug
(given at the top of this mail) and add a comment to that effect, giving more
reproduction information if you have it.

If it is not a problem any longer, you need take no action. If this bug is not
changed in any way in the next two weeks, it will be automatically resolved.
Thank you for your help in this matter.

The latest beta releases can be obtained from:
Firefox:     http://www.mozilla.org/projects/firefox/
Thunderbird: http://www.mozilla.org/products/thunderbird/releases/1.5beta1.html
Seamonkey:   http://www.mozilla.org/projects/seamonkey/
Status: UNCONFIRMED → NEW
Ever confirmed: true
I just took a few days to untangle an encoding problem with Tomcat.  I resolved this problem by doing request.setCharacterEncoding("UTF-8") in a filter before the request was handled, and all form submissions work fine now (for me).  I'd rather not have to do this hack, and I wish someone would look into fixing this bug.
Did you read comment 7?  There's no problem with changing the code to do this, except that then web sites break.
And also note bug 289060 comment 8, which points out that doing this would actually be a spec violation.
Right, what I was asking was whether it was time to find out if ColdFusion et al. were fixed since bug 7533 was filed in 1999, but then I read bug 289060 comment 8 and realized it was the spec that was broken, not Firefox or ColdFusion.
Should someone maybe mark this as WONTFIX or INVALID?  It seems to me that the proper way to 'fix' this would be to submit an RFC to the IETF to get application/x-www-form-urlencoded to accept a charset= parameter just like text content types, but until that happens this bug might as well be closed.
When trying a POST from a UTF-8 encoded page, Firefox sends the following for "e grave u" and "a grave u" (èu and àu):
+++
Content-Type: application/x-www-form-urlencoded
Content-Length: 34

dataname=%C3%A8u&datavalue=%C3%A0u
+++
Which looks ok according to http://www.w3.org/TR/2003/REC-xforms-20031014/slice11.html

It does the same from an ISO-8859-1 page, which looks weird but still seems correct.
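
For reference, a small Python sketch of how a server could decode the body shown above. It only uses the standard library; which charset to pass is exactly the thing the server has to know out of band:

```python
from urllib.parse import parse_qs

# The request body from the capture above (34 ASCII characters).
body = "dataname=%C3%A8u&datavalue=%C3%A0u"

# Decoding the percent-escaped bytes as UTF-8 recovers the typed text.
fields_utf8 = parse_qs(body, encoding="utf-8")

# Decoding the same bytes as ISO-8859-1 also "works", but yields mojibake.
fields_latin1 = parse_qs(body, encoding="iso-8859-1")

print(fields_utf8)
print(fields_latin1)
```

The length of the body matches the Content-Length of 34 in the capture.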
Assignee: form-submission → nobody
QA Contact: form-submission
Wow, no fix in 8 years...

And this is a real bug: If the HTTP header says the file is encoded in ISO-8859-1 the common way to override this with HTML is:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Firefox then reads the body as UTF-8, which is fine, but the charset used in forms is still ISO-8859-1, so you have to add accept-charset="utf-8" to the form just for Firefox (other browsers automatically use UTF-8 or send the charset with the Content-Type).

So: Why the hell is nobody fixing this bug?
> So: Why the hell is nobody fixing this bug?

Had you actually read the bug, you would know why.  Please do so now instead of applying profanity to the problem.
That said, the issue you describe in your fourth paragraph has nothing to do with this bug that I can tell, and I can't reproduce it.  I suggest you file a separate bug on that issue.  Feel free to cc me and point to a web page that shows the problem.
I did read it, but: "but was forced to remove it because at that time, the majority of server-side programs couldn't cope with it" <- this was 8 years ago. Maybe give it another try? Anyway, filing another report...
> this was 8 years ago.

Unless there's data that something has changed, the assumption is it hasn't.  Breaking things for our users to test that assumption given lack of any indication that something has in fact changed is just a bad idea.
Please close this (very) old bug as WONTFIX because the current implementation is historically proven and expected.

Advice for Website-Developers: 
It is advised that Form-Tags include an "accept-charset"-attribute with exactly one charset (which the server assumes). The "accept-charset" attribute is practically required if different parts of the website are delivered in different charsets.

Advice for serverside-Developers implementing the form-action:
(The Olde Legacy Way: Charset of the received content-type is not evaluated, and assumed to be in a fixed, pre-decided charset. All website-forms should include an accept-charset with exactly this fixed, pre-decided charset. Make sure the application properly accepts a received content-type with attributes, even if attributes are ignored.)
If the received content-type includes a charset, then evaluate the data with the given charset. If the received content-type does not include a charset, there are multiple alternatives:
  * assume a fixed, pre-decided charset. All website-forms should include an "accept-charset"-attribute with this fixed, pre-decided charset.
  * if no charset has been agreed on beforehand, then assume the charset of the website. If there is no website yet, then unilaterally settle on UTF-8 (the default charset of RFC3986).
  * for REST services that accept application/x-www-form-urlencoded, typically without a website form, assume UTF-8. (As per RFC3986. RFC2616 3.7.1, which recommends ISO-Latin-1, does not apply here because application/x-www-form-urlencoded is not a "text" media type.)
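
A rough Python sketch of the server-side advice above, using only the standard library. The names FALLBACK_CHARSET, charset_from_content_type and decode_form are hypothetical, and the UTF-8 fallback is an assumption standing in for whatever charset the site has pre-decided:

```python
from email.message import Message
from urllib.parse import parse_qs

# Assumption: the site's fixed, pre-decided charset.
FALLBACK_CHARSET = "utf-8"


def charset_from_content_type(header_value, fallback=FALLBACK_CHARSET):
    """Return the charset parameter of a Content-Type value, or the fallback."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the value and returns failobj if absent.
    return msg.get_content_charset(failobj=fallback)


def decode_form(header_value, body, fallback=FALLBACK_CHARSET):
    """Decode a urlencoded body (bytes) using the received or assumed charset."""
    charset = charset_from_content_type(header_value, fallback)
    # The body itself is ASCII; the charset applies to the %XX escapes.
    return parse_qs(body.decode("ascii"), encoding=charset)


print(decode_form("application/x-www-form-urlencoded", b"v=%C3%A8u"))
print(decode_form("application/x-www-form-urlencoded; charset=ISO-8859-1", b"v=%E8u"))
```

A Content-Type with parameters is accepted either way, and a charset is only honoured when the client actually sent one. A real handler would also want to reject charset names it does not recognise.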
Not sending the charset is correct; see <http://www.w3.org/TR/html5/iana.html#application/x-www-form-urlencoded> -- this media type does *not* have a charset parameter.
Resolving as INVALID meaning "works as expected" per latest comments.
You can use a special _charset_ parameter if you need the character encoding:
https://html.spec.whatwg.org/multipage/forms.html#attr-fe-name-charset
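
As a sketch of how that could look on the server: assuming the form contains a hidden input named _charset_ with no value attribute (which the browser fills in with the encoding it used), the body can be parsed twice. The function name decode_with_charset_field and the UTF-8 fallback are made up for illustration:

```python
from urllib.parse import parse_qs


def decode_with_charset_field(body, fallback="utf-8"):
    """Decode a urlencoded body using the charset named in its _charset_ field."""
    # First pass: ISO-8859-1 maps every byte to a character, so this never
    # fails and is enough to read the ASCII charset name.
    probe = parse_qs(body, encoding="iso-8859-1")
    charset = probe.get("_charset_", [fallback])[0]
    # Second pass: decode all fields with the charset the browser reported.
    return parse_qs(body, encoding=charset)


print(decode_with_charset_field("_charset_=UTF-8&name=%C3%A9"))
print(decode_with_charset_field("_charset_=ISO-8859-1&name=%E9"))
```

The charset name comes from the client, so a real handler should check it against a list of supported encodings before using it.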
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → INVALID
Component: HTML: Form Submission → DOM: Core & HTML