Closed Bug 289060 Opened 19 years ago Closed 17 years ago

add a charset to 'Content-Disposition: form-data; name="yourFormFieldName"' when posting multipart/form-data

Categories

(Core Graveyard :: File Handling, defect)

x86
Windows XP
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: hauser, Unassigned)

Details

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.6) Gecko/20050317 Firefox/1.0.2
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.6) Gecko/20050317 Firefox/1.0.2

If an HTML page containing a form is UTF-8 encoded, Firefox rightfully sends the
strings entered by the user back in that encoding.
Unfortunately, it does not declare that the content is encoded as such (see
http://www.ietf.org/rfc/rfc1867.txt and http://www.ietf.org/rfc/rfc1521.txt)


-- RFC1521: Quote ------------

7.1  The Text Content-Type

   The text Content-Type is intended for sending material which is
   principally textual in form.  It is the default Content-Type.  A
   "charset" parameter may be used to indicate the character set of the
   body text for some text subtypes, notably including the primary
   subtype, "text/plain", which indicates plain (unformatted) text.  The
   default Content-Type for Internet mail is "text/plain; charset=us-
   ascii".

-- RFC1521: End of quote -----

Reproducible: Always

Actual Results:  
no charset sent

Expected Results:  
send charset for anything but us-ascii

see also http://issues.apache.org/bugzilla/show_bug.cgi?id=20813
also see http://issues.apache.org/bugzilla/show_bug.cgi?id=34297 for how to
gracefully handle the status quo in the struts MVC
This is an automated message, with ID "auto-resolve01".

This bug has had no comments for a long time. Statistically, we have found that
bug reports that have not been confirmed by a second user after three months are
highly unlikely to be the source of a fix to the code.

While your input is very important to us, our resources are limited and so we
are asking for your help in focussing our efforts. If you can still reproduce
this problem in the latest version of the product (see below for how to obtain a
copy) or, for feature requests, if it's not present in the latest version and
you still believe we should implement it, please visit the URL of this bug
(given at the top of this mail) and add a comment to that effect, giving more
reproduction information if you have it.

If it is not a problem any longer, you need take no action. If this bug is not
changed in any way in the next two weeks, it will be automatically resolved.
Thank you for your help in this matter.

The latest beta releases can be obtained from:
Firefox:     http://www.mozilla.org/projects/firefox/
Thunderbird: http://www.mozilla.org/products/thunderbird/releases/1.5beta1.html
Seamonkey:   http://www.mozilla.org/projects/seamonkey/
This bug has been automatically resolved after a period of inactivity (see above
comment). If anyone thinks this is incorrect, they should feel free to reopen it.
Status: UNCONFIRMED → RESOLVED
Closed: 19 years ago
Resolution: --- → EXPIRED
Status: RESOLVED → UNCONFIRMED
Resolution: EXPIRED → ---
Status: UNCONFIRMED → NEW
Ever confirmed: true
Assignee: bross2 → file-handling
Component: General → File Handling
Depends on: 116346
Product: Firefox → Core
QA Contact: general → ian
Version: unspecified → Trunk
Note that this caused major issues with various server-side stuff when it was tried back in the day.  See bug 7533.
Web Forms 2 says to use the _charset_ field. It doesn't mention adding charset= either.
For multipart/form-data posts, the charset should be defined on the Content-Type header, not Content-Disposition.  Firefox does not appear to send a Content-Type header with individual form fields, assuming the default (text/plain) is sufficient.  It would be helpful for applications if this header were added with the charset parameter indicating which character encoding we used.
I agree with David Nesting, the charset parameter should go with a Content-Type header on the individual parts of the MIME body. This is what the spec says.

And I disagree with Boris Zbarsky saying that this caused major issues. I reviewed the bug reports and none of them mentions problems with the enctype multipart/form-data; all seem to have used application/x-www-form-urlencoded. Additionally, these issues were 8 years ago.

I also disagree with the conclusions drawn on these bug reports. But first a summary; and I will restrict myself to HTML4:

The standard knows about forms to be submitted with
1) HTTP GET (always application/x-www-form-urlencoded)
2) POST application/x-www-form-urlencoded
3) POST multipart/form-data

For 1) there is technically no way to attach meta-data to it, as the form data gets attached as the "query" to the URI. While it is defined how every possible octet can be included in a URI, application/x-www-form-urlencoded restricts itself to US-ASCII when it comes to transforming characters into octets. So the octet/byte representation of a character outside US-ASCII is not specified by application/x-www-form-urlencoded.
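
To illustrate the ambiguity described above, here is a minimal Python sketch (not part of the original comment). It percent-encodes the same character under two charsets; the helper name `percent_encode` and the sample value are assumptions made for illustration only.

```python
from urllib.parse import quote, unquote_to_bytes


def percent_encode(text: str, charset: str) -> str:
    """Convert text to octets using `charset`, then percent-escape every octet."""
    return quote(text, safe="", encoding=charset)


value = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, outside US-ASCII

as_utf8 = percent_encode(value, "utf-8")
as_latin1 = percent_encode(value, "iso-8859-1")

print(as_utf8)    # %C3%A9
print(as_latin1)  # %E9

# The receiver only sees octets. Without a declared charset it has to guess,
# and a wrong guess silently produces different text.
raw = unquote_to_bytes(as_utf8)
right_guess = raw.decode("utf-8")
wrong_guess = raw.decode("iso-8859-1")
print(right_guess == value)  # True
print(wrong_guess == value)  # False
```

Both encoded strings are valid application/x-www-form-urlencoded values, which is why the format alone cannot tell the server which charset was used.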

Numbers 2) and 3), using POST, have a way to specify meta-data. They "bootstrap" on the HTTP Content-Type header, which is sent with a POST and describes the "form" of the HTTP POST body.

Unfortunately, number 2) specifies application/x-www-form-urlencoded which has no way defined to attach any other meta-data. Mozilla/Firefox did something like:

Content-Type: application/x-www-form-urlencoded; charset=...

which was WRONG from the very beginning. The charset attribute cannot be attached to any content type at will; it is basically only defined for text/... types.
Illustrating example:
Content-Type: image/jpeg; charset=...
is wrong too, as images have no charsets. Some people would argue that it should have the same meaning as for e.g. text/html, but that interpretation would yield something different. See this example:
Content-Type: text/html; charset=us-ascii
...<html> ... <p> &#8226;

The charset is describing the coding of the HTML, not of what the entity reference #8226 in the HTML means (which would be outside of ASCII anyway).
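
The two layers in that example can be sketched in Python (an illustration added here, not taken from the original comment; the sample fragment is an assumption). The charset is applied first, to turn octets into markup; the character reference is resolved afterwards by the HTML layer, regardless of the charset.

```python
import html

# Octets of an HTML fragment; every octet is within US-ASCII.
body = b"<p>&#8226;</p>"

# Layer 1: the charset parameter says how to turn octets into markup text.
markup = body.decode("us-ascii")

# Layer 2: character references are resolved by HTML itself and may name
# characters that the transport charset cannot represent directly.
text = html.unescape(markup)

print(markup)                # <p>&#8226;</p>
print(text == "<p>\u2022</p>")  # True: &#8226; is U+2022 BULLET
```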

So, as the x-www-form-urlencoded content type is always within ASCII, a charset attribute is useless. And the meaning of the percent-escaped octets in that format is defined by the x-www-form-urlencoded spec only, not by its presentation charset.

So let's go with number 3) and do it right this time. multipart/form-data is a MIME type. These are outlined in RFC2045. MIME multipart types allow the inclusion of multiple parts (you guessed it!) and the inclusion of meta-data for every part. Firefox/Mozilla doesn't include a Content-Type header for these parts, so it defaults to "text/plain; charset=us-ascii".

Sending octets outside the 0-127 range in a multipart/... without Content-Type: header violates RFC2045 and forces the reader to guess.

The correct behavior would be to include in every non-ascii-only part:
Content-Type: text/plain; charset=...
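
A minimal Python sketch of what such a request body could look like (added for illustration; this is not Firefox's implementation). The function name `build_form_data`, the sample field names, and the simplification that field names are ASCII-only are all assumptions.

```python
import uuid


def build_form_data(fields: dict, charset: str = "utf-8") -> tuple:
    """Return (Content-Type header value, body bytes) for a multipart/form-data post.

    Parts whose value is not pure US-ASCII get an explicit
    'Content-Type: text/plain; charset=...' header. Field names are
    assumed to be ASCII in this sketch.
    """
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    lines = []
    for name, value in fields.items():
        lines.append(delimiter)
        lines.append(f'Content-Disposition: form-data; name="{name}"'.encode("ascii"))
        if not value.isascii():
            # Declare the encoding only where the default (us-ascii) does not hold.
            lines.append(f"Content-Type: text/plain; charset={charset}".encode("ascii"))
        lines.append(b"")  # blank line separates part headers from part content
        lines.append(value.encode(charset))
    lines.append(delimiter + b"--")  # closing delimiter
    lines.append(b"")
    return f"multipart/form-data; boundary={boundary}", b"\r\n".join(lines)


content_type, body = build_form_data({"city": "Z\u00fcrich", "id": "42"})
print(content_type)
print(body.decode("utf-8"))
```

Only the "city" part carries a Content-Type header here; the ASCII-only "id" part relies on the default.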

It is shocking to see no support for HTTP/1.1, HTML4 and MIME in Seamonkey/Firefox; the first two standards are now over 7 years old, MIME over 10.

Taking _charset_ into the game: it is a "solution" that involves modifying the original HTML form, including a hidden field with the name "_charset_". This hidden field gets "automatically" assigned a value from the browser, the charset in use. It is like writing with your favorite font in a jpeg-image 'This is a jpeg,' as this name/value pair gets transported together with the data.
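
For comparison, a server relying on the _charset_ convention might decode a submission roughly as follows. This is a hedged Python sketch, not taken from any real framework: the function `decode_fields`, the `fallback` parameter, and the assumption that the raw field octets are already split out by name are all hypothetical.

```python
def decode_fields(raw_fields: dict, fallback: str = "utf-8") -> dict:
    """Decode raw form field octets using the charset named in the _charset_ field.

    The charset name travels inside the data it describes; this works only
    because charset names themselves are plain ASCII.
    """
    charset = raw_fields.get("_charset_", b"").decode("ascii") or fallback
    return {
        name: value.decode(charset)
        for name, value in raw_fields.items()
        if name != "_charset_"
    }


raw = {
    "_charset_": b"iso-8859-1",
    "city": "Z\u00fcrich".encode("iso-8859-1"),
}
decoded = decode_fields(raw)
print(decoded["city"] == "Z\u00fcrich")  # True
```

If the page author forgets the hidden field, the sketch falls back to a guess, which is the situation the comment objects to.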
I opened the bug 379858 so that this issue carries a proper subject.
A related issue here is the field names themselves.  Since multipart/form-data parts carry the field names in the part's headers, RFC2047 must be used to encode these.  This is completely independent of any charset parameter on the Content-Type of each field's value.

Firefox currently appears to provide field names in the submission's character encoding, without indicating it, just as it does for the data itself.

Unfortunately, fixing this particular aspect of this bug makes me more nervous.  You ought to be able to add a Content-Type header with a charset parameter, because you're just declaring something that you're already doing, and nobody expects this header today, so the damage should be minimal.  But encoding field names correctly would seem to change existing behavior for a lot of applications that use non-ASCII characters in their form fields, and haven't paid attention to RFC2388/RFC2047.
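
For readers unfamiliar with RFC2047, the following Python sketch shows what an encoded-word for a non-ASCII field name looks like, using the standard library's `email.header` module. It is an illustration only: the sample field name is an assumption, and how a browser would place the result inside the Content-Disposition header is not shown.

```python
from email.header import Header, decode_header

field_name = "gr\u00f6\u00dfe"  # hypothetical non-ASCII form field name

# Produce an RFC 2047 encoded-word: pure ASCII, with the charset named inline.
encoded = Header(field_name, "utf-8").encode()
print(encoded)

# A receiver that understands RFC 2047 can recover the name and its charset.
raw_bytes, charset = decode_header(encoded)[0]
recovered = raw_bytes.decode(charset)
print(recovered == field_name)  # True
```

A server that does not implement RFC2047 would see the literal encoded-word as the field name, which is the compatibility risk described above.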
Also see bug 116346.
Has anyone considered submitting an RFC to address the charset= issue raised in comment #8?
This got fixed by the patch for bug 116346.
Status: NEW → RESOLVED
Closed: 19 years ago → 17 years ago
Resolution: --- → FIXED
Joseph, what information not in comment #8 would you expect in such an RFC?

Robert
I'd like the RFC to make charset= legal for the application/x-www-form-urlencoded mime type.
Product: Core → Core Graveyard