Form data should have Content-Type header when its enctype attribute is
Original Report in Bugzilla-jp:
Let me clarify the situation a little bit.
The test case supplied above sends form input data directly from
the form rather than via file uploads.
Current Mozilla builds actually append appropriate Content-Type headers
without the charset parameter if the user attaches a file to a
multipart/form-data type of form.
So this report is talking about only the case in which data is sent
directly from the form. It is my understanding that there has been a
long tradition of browsers sending text data from form in the same
encoding/charset as the web page on which the form resides. In other
words, servers expect the data back in the same encoding.
Both HTML 4.01 cited above and RFC 2388 say that Content-Type header
In case Mozilla generates Content-Type header for file uploads, it
intentionally omits the charset parameter because there is no
clear way to determine the content encoding of the file being
uploaded. We discussed before creating a dialog which asks for
the charset of the uploaed file but dropped the idea because most
users may know what this question means in the first place.
This gives us a good opportunity to review the current specs
in this area for a variety of cases/data.
There were some spelling errors in my comment above. Let me correct
them as below:
"We discussed before creating a dialog which asks for
the charset of the uploaed file but dropped the idea because most
users may know what this question means in the first place."
should read instead:
"We discussed before creating a dialog which asks for
the charset of the uploaded file but dropped the idea because users
may not know what this question means in the first place."
I remember the reason we did not dare to add it is that, in experiments, it
broke too many web sites.
So.... is the request to put:
Content-Type: text/plain; charset=foo
(where "foo" is the encoding of the page the form was in) on all the
non-file-control data parts?
setting milestone and component
I know Pollmann tested charset and *that* broke a bunch of websites; but what
about plain old Content-Type? Is there anything wrong with that?
(Note that even if there /was/ something wrong with it, it may be fixed by now.
Later versions of Apache are spread all across the web; and IIS has gone
through many revisions since those tests were done.)
See bug 18643 for charset discussion.
*** Bug 126407 has been marked as a duplicate of this bug. ***
RFC 1867 clearly specifies that the parts of a multipart/form-data should have a
Content-Type, see the citation in bug 126407. This is particularly important to
identify the charset of input fields.
I agree with Martin. It's important to have C-T with charset when submitting a
form (when uploading a text/* file, Kat's comment #2 makes sense, although I tend
to think allowing user control wouldn't be that bad UI-wise) because a user
can override the default MIME charset (in View|Character Coding).
I thought Mozilla supported this, and wrote to that effect on www-international
list, but it turned out that I was wrong. At the URL given in the URL field,
this can be tested. In addition to RFC 1867, HTML 4.01 is clear about the need
to add C-T header (when C-T is NOT the default 'text/plain; charset=US-ASCII' or
C-T-E is NOT the default 7bit). My interpretation of HTML 4.01 is different from
that of Kat here. The repeated references to RFC 2045 and the following sentence
have to be interpreted as requiring C-T/C-T-E for all the cases _other than_
"text/plain; charset=US-ASCII" and "7bit":
As with all multipart MIME types, each part has an optional "Content-Type"
header that defaults to "text/plain". User agents should supply the
"Content-Type" header, accompanied by a "charset" parameter.
In the above, I believe 'optional' is a bit misleading. The intent is likely to
have been that it's optional only when its value is the default value
'text/plain; charset=US-ASCII'. Otherwise, I believe it's mandatory.
Now the question is whether we'd still have a problem (comment #4 and comment
#7) with many CGI programs/web servers/server side scripts (jsp, php, asp) if we
add C-T and C-T-E header fields to each part of multipart/form-data. It's likely
that we do, but .....
I wish HTML 4.01 had been a lot more explicit about the need for C-T header
field for non-default cases instead of just referring to RFC 2045.
 a thread of articles beginning with
How easy would it be to add this functionality as a preference turned off by default so people could at least test what it breaks?
Probably pretty easy. I'll be happy to review if someone posts a patch.
Created attachment 261020
Use a pref to decide to attach the charset to content type
I've added a pref and if it is set, charset is appended only in the multipart/form-data case
Is this what people were looking for?
The above patch appears to append the charset parameter only to the HTTP request's Content-Type. A charset parameter here has no meaning and its behavior is not defined by any specification. I believe the request is to add it to each *part* of the multipart/form-data entity:
Content-Type: multipart/form-data; boundary="foo"
Content-Disposition: form-data; name="field"
Content-Type: text/plain; charset=utf-8
Today, this Content-Type header is absent entirely. It ought to be safe to *add* it (since nobody expects it) without causing too many problems.
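To make the requested wire format concrete, here is a minimal standalone sketch in plain C++ of a body whose parts each carry their own Content-Type. It is not the Mozilla implementation: `BuildMultipartBody` is a hypothetical helper, it assumes the values are already encoded in `charset`, and it does no escaping of field names.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, not Mozilla code: serializes plain (non-file) form
// fields as a multipart/form-data body, giving each part its own
// Content-Type header with a charset parameter.
// Assumes values are already encoded in `charset`; names are not escaped.
std::string BuildMultipartBody(
    const std::vector<std::pair<std::string, std::string>>& fields,
    const std::string& boundary,
    const std::string& charset) {
  assert(!boundary.empty());
  const std::string crlf = "\r\n";
  std::string body;
  for (const auto& field : fields) {
    body += "--" + boundary + crlf;
    body += "Content-Disposition: form-data; name=\"" + field.first + "\"" + crlf;
    body += "Content-Type: text/plain; charset=" + charset + crlf;
    body += crlf;                          // blank line ends the part headers
    body += field.second + crlf;
  }
  body += "--" + boundary + "--" + crlf;   // closing delimiter
  return body;
}
```

The charset lives on each part, while the request-level header stays `Content-Type: multipart/form-data; boundary=...` with no charset of its own.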
See also bug 379858, which may be a duplicate of this one. Bug 379858 comment 1 contains a simple patch that implements the behavior described above.
> It ought to be safe to *add* it (since nobody expects it)
See comment 7. It _ought_ to be safe to, but in practice, given the number of broken web servers out there, any change like that requires serious testing.
*** Bug 379858 has been marked as a duplicate of this bug. ***
Created attachment 264033
Adds Content-Type with charset to each form-data part
(As requested, copied from bug 379858 comment 1:)
This is a perhaps naive attempt at adding the requisite header for each
form-data part of a multipart/form-data submission.
I have also created a tool at http://fastolfe.net/2007/05/06/post-charsets for
testing browser behavior. The tool will treat anything ambiguous as US-ASCII,
to make ambiguous cases obvious (invalid characters are replaced). A non-ASCII
submission with a normal build of Firefox will see the submission garbled,
while a submission with a patched Firefox works correctly.
This patch does NOT address:
* non-ASCII form field names
* application/x-www-form-urlencoded submissions
* non-ASCII form values that cannot be encoded in the chosen character encoding
(The latter case causes Firefox to replace the character with an HTML entity,
which IMO is also broken behavior.)
I liked my bug 379858 more because it had a better subject... (-:
I will attach a copy of bug 289060 comment 8 here:
I agree with David Nesting, the charset parameter should go with a Content-Type
header on the individual parts of the MIME body. This is what the spec says.
And I disagree with Boris Zbarsky saying that this caused major issues. I
reviewed the bug reports and none of them is mentioning problems with the
enctype multipart/form-data, all seemed to have used
application/x-www-form-urlencoded. Additionally, these issues were 8 years
I also disagree with the conclusions drawn on these bug reports. But first a
summary; and I will restrict myself to HTML4:
The standard knows about forms to be submitted with
1) HTTP GET (always application/x-www-form-urlencoded)
2) POST application/x-www-form-urlencoded
3) POST multipart/form-data
For 1) there is technically no way to attach meta-data to it, as the form data
gets attached as the "query" to the URI. Although it is defined how all possible
octets can be included in a URI, application/x-www-form-urlencoded restricts
itself to US-ASCII when it comes to transforming characters to octets. So the
octet/byte representation of a character outside US-ASCII is not specified with
Number 2) and 3), using POST, have a way to specify meta-data. They "bootstrap"
on the HTTP Content-Type header which is sent with a POST, telling about the
"form" of the HTTP POST body.
Unfortunately, number 2) specifies application/x-www-form-urlencoded which has
no way defined to attach any other meta-data. Mozilla/Firefox did something
Content-Type: application/x-www-form-urlencoded; charset=...
which was WRONG from the very beginning. The charset attribute can't be attached
to any content-type at will; it is basically only defined for text/... types.
Content-Type: image/jpeg; charset=...
is wrong too, as images have no charsets. Some people would argue that it
should have the same meaning as for e.g. text/html, but that interpretation
would yield a different thing. See this example:
Content-Type: text/html; charset=us-ascii
...<html> ... <p> &#8226;
The charset is describing the coding of the HTML, not of what the entity
reference #8226 in the HTML means (which would be outside of ASCII anyway).
So, as the x-www-form-urlencoded content-type is always within ASCII, a charset
attribute is useless. And the meaning of the percent-escaped stuff in that form
does describe the x-www-form-urlencoded spec only and not its presentation
So let's go with number 3) and do it right this time. multipart/form-data is a
MIME type. These are outlined in RFC2045. MIME multipart types allow the
inclusion of multiple parts (you guessed it!) and the inclusion of meta-data
for every part. Firefox/Mozilla doesn't include a Content-Type header for these
parts, so it defaults to "text/plain; charset=us-ascii".
Sending octets outside the 0-127 range in a multipart/... without Content-Type:
header violates RFC2045 and forces the reader to guess.
The correct behavior would be to include in every non-ascii-only part:
Content-Type: text/plain; charset=...
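As a rough illustration of the rule just stated (only parts with octets outside 0-127 need the header), here is a small standalone C++ sketch. `IsAsciiOnly` and `PartContentTypeHeader` are hypothetical names chosen for this example, not existing Mozilla functions, and the value is assumed to be already encoded in `charset`.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// True when every octet is in the 0-127 range, i.e. the part is covered by
// the MIME default of "text/plain; charset=us-ascii".
bool IsAsciiOnly(const std::string& bytes) {
  return std::all_of(bytes.begin(), bytes.end(),
                     [](unsigned char c) { return c <= 0x7F; });
}

// Hypothetical helper: returns the header line a part would need, or an
// empty string when the MIME default already describes the part correctly.
std::string PartContentTypeHeader(const std::string& encodedValue,
                                  const std::string& charset) {
  assert(!charset.empty());
  if (IsAsciiOnly(encodedValue)) {
    return std::string();
  }
  return "Content-Type: text/plain; charset=" + charset + "\r\n";
}
```

Whether to omit the header for ASCII-only parts or always send it is a design choice; this sketch only mirrors the "every non-ascii-only part" wording above.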
It is shocking to see no support for HTTP11/HTML4/MIME in Seamonkey/Firefox;
the first two standards now over 7 years old, MIME over 10.
Taking _charset_ into the game: it is a "solution" that involves modifying the
original HTML form, including a hidden field with the name "_charset_". This
hidden field gets "automatically" assigned a value from the browser, the
charset in use. It is like writing with your favorite font in a jpeg-image
'This is a jpeg,' as this name/value pair gets transported together with the
> I reviewed the bug reports and none of them is mentioning problems with the
> enctype multipart/form-data, all seemed to have used
Ah, excellent. In that case, yeah, we should do this for the multipart/form-data POSTs. Thanks for looking into that!
Comment on attachment 264033
Adds Content-Type with charset to each form-data part
>+ + NS_LITERAL_CSTRING("Content-Type: text/plain; charset=")
>+ + mCharset
>+ + NS_LITERAL_CSTRING(CRLF)
So the only concern I have here is that if mEncoder is null we'll end up using UTF8, not mCharset, for the encoding. We could maybe set mCharset to "UTF-8" in the constructor if mEncoder is null, or we could null-check here (because mCharset is used for some weird bidi stuff that I don't quite understand).
Simon, would it be safe to just reset mCharset if it's a charset we don't have an encoder for?
I think that the worst that would happen is that it might break the weird bidi stuff which nobody understands and is probably broken anyway because it makes some very unsafe assumptions about correlation between the document character set and the characters that might be included in the form submission.
Yeah, that stuff was the part I was worried about. OK, then.
David, want to make that change? Just reset mCharset to UTF-8 in the constructor if mEncoder is null?
(In reply to comment #22)
> Just reset mCharset to UTF-8 in the
> constructor if mEncoder is null?
I suggest doing it in GetSubmissionFromForm() if GetEncoder() fails.
How can I test this null encoder case? When I attempt to use a bogus charset in the form submission, mCharset contains "UTF-8".
I don't think there's an easy way to test it. You'd need some charset for which we have a decoder (so we can load the page as that charset) but do not have an encoder... I guess you could hack nsFormSubmission::GetSubmitCharset to return a bogus charset. That should work.
After getting GetSubmitCharset to return a bogus charset, I couldn't get a form to submit at all, even without my other changes. If we intend this situation to result in a useful POST, I don't think it's working that way today.
Assuming that is a goal, though, and it just isn't working right now for other reasons, is this the type of check that should be done in GetSubmissionFromForm()?
// Get unicode encoder
nsFormSubmission::GetEncoder(aForm, charset, getter_AddRefs(encoder));
+ if (encoder == nsnull)
If that looks reasonable, I'll post an updated patch. It seems to work OK, but like I said, I can't get meaningful behavior either way in the null encoder (bogus GetSubmitCharset charset) case.
I'd do |if (!encoder)|, but other than that that looks like what I wanted, yes.
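The fallback being discussed can be sketched outside the Mozilla tree roughly as follows. This is only an illustration in plain C++: `EffectiveSubmitCharset` and the set of encodable charsets are hypothetical stand-ins for the real encoder lookup, which this sketch does not reproduce.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical stand-in for "GetEncoder() failed": if no encoder exists for
// the requested charset, the data will be sent as UTF-8, so the charset
// advertised in each part's Content-Type must be reset to UTF-8 as well.
std::string EffectiveSubmitCharset(const std::set<std::string>& encodable,
                                   const std::string& requested) {
  assert(!requested.empty());
  if (encodable.count(requested) == 0) {
    return "UTF-8";  // no encoder: label the parts with what is really sent
  }
  return requested;
}
```

The point is that the label and the bytes never disagree: whatever encoding is actually used for the values is the one named in the charset parameter.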
Created attachment 264215
Adds Content-Type with charset to each form-data part
This patch expands upon the previous by also forcing the mCharset to UTF-8 when no encoder is available.
Comment on attachment 264215
Adds Content-Type with charset to each form-data part
Looks good to me. sicking, would you sr?
Checked in. David, thanks for the patch!
(For what it's worth, whatever tool you're using is producing broken diff files -- they're missing spaces at the beginning of empty context lines. Took me a few minutes to figure out why this wasn't applying.)
So it turns out that this broke existing sites. Some of the known ones are referenced in bug 384270. So the big question is, is the fix worth the bustage, and how much of the bustage is there out in the wild that we don't yet know about. I'm leaning towards backing this out to fix what broke. Or is there anything else that could be done to leave parts of this in w/o breaking existing sites (or at least not as many of them)?
David, are you willing to get in touch with the various back-end folks whose software doesn't deal with this (Eve, etc) and see whether we can do a limited form of this that won't break them?
Given the "It's probably a Minefield bug, let's see if they fix it in the beta" attitude in the Eve forum I'm not that hopeful... :( But maybe we'll get something from them.
The JIRA dev team accepts that this behaviour in Minefield is standard
compliant and that this is a bug we should and will deal with.
However, there are > 6000 JIRA instances out there as of now, including quite a
few major public ones. The process of updating them all is going to take some
time, so the symptoms are likely to persist for quite some time (> FF3
release). This is likely to be the case for the other back-end software as
We would certainly prefer if there was an option to turn this behaviour on/off
- with off as standard, and then turn it on by default in a later release.
Jed, if it's off by default nothing will change and we still won't be able to enable it in a future release. I'm glad to hear that you guys will fix your end, but as you said there are other back-end packages, most of which will never even hear about the problem if the behavior defaults to off....
Is there by chance any aspect of this behavior that could be preserved without breaking existing JIRA installs?
Boris, we do understand the conundrum - we would also like to see the change. Unfortunately, there is very little that can be done about existing installs with the current FF3 behaviour that does not necessitate an upgrade or patch. We currently fail reasonably spectacularly.
BTW. what is the release time-frame for FF3?
(In reply to comment #33)
> We would certainly prefer if there was an option to turn this behaviour on/off
> - with off as standard, and then turn it on by default in a later release.
I second this as a call for more time.
Can you give a brief explanation of why this breaks your code? What new codepath does this cause?
The issue with JIRA also affects Confluence as we use the same underlying multipart parser. We also accept that it is Confluence that is broken with regard to this and not Minefield.
I'd like to propose that a switch be introduced so that web applications may opt in to have these data submitted as part of a form post. This would aid transition for broken implementations while allowing interested (and working) servers to use the new functionality. Maybe a meta element switch?
<meta name="form.include.multipart.content-type" content="true" />
or something similar (instead of a global switch for the page you might want to have a space delimited list of form ids to enable it on).
I think it is great that this capability has been included as it has often caused me frustration when authoring web apps in the past but the pragmatist in me suggests that we need to phase this in (and not just for our sake).
We will of course look to get upcoming releases of Confluence fixed.
Again, can someone please explain how exactly this is breaking the servers?
I'm curious to understand how it is failing.
(In reply to comment #39)
> Again, can someone please explain how exactly this is breaking the servers?
> I'm curious to understand how it is failing.
Michael, in the case of Jira or Confluence, it simply means that *any* form containing an upload button is unusable.
As you can imagine, Jira contains an upload button on almost any page. In other words, you cannot use Jira or Confluence with current Gran Paradiso. Indeed, I stopped using Gran Paradiso immediately after I understood that I could switch off these problems by using Firefox. Likewise, this would prevent me from upgrading to Firefox 3 if it contained the same change.
> Michael, in the case of Jira or Confluence, it simply means that *any* form
> containing an upload button is unusable.
Jochen, unfortunately, I think this does not answer Michael's question. He did not ask *what* exactly breaks, but *how* exactly it breaks. I.e. what specific algorithm on the server is invoked that works if Content-type is not included, but fails if it is included. E.g. what specific if condition in what specific source file of what specific library starts to misbehave.
well, when we investigate and fix we'll provide you the diff if you like.
The actual library is the pell-multipart-request plugin for webwork, our fork of which is here: https://svn.atlassian.com/svn/public/contrib/tools/pell-multipart-request/trunk
We have not investigated the actual errant code yet as the fix is not scheduled and the most relevant thing right now is the fact that it occurs at all.
We may not even fix pell-multipart-request but write our own multipart handler from scratch.
(In reply to comment #42)
> We may not even fix pell-multipart-request but write our own multipart handler
> from scratch.
OT: Before doing that, please consider using one of the multipart related Apache libraries, like commons-fileupload, or Mime4J.
I am the author of the streaming API for commons-fileupload and the author of the pull parser API for Mime4J and absolutely willing to support, possibly as part of a contract, or as part of my Apache work. Helping you will ultimately help me.
(In reply to comment #42)
From inspection, it looks like the problem is in /src/main/java/http/utils/multipartrequest/MultipartRequest.java:MultipartRequest.parse, specifically
// At the top of loop, we assume that the Content-Disposition line is next, otherwise we are at the end.
This assumption now breaks; the first thing in the part will be Content-type, not Content-disposition.
It seems that switching the order of the headers (i.e. putting Content-type after Content-disposition) might restore interoperability: the library later does expect that Content-type may follow before the actual data. In particular, a comment says
// FIX 1.14 IE Problem still: Check for content-type and extra line even though no file specified.
So apparently, MSIE already sends Content-type in other parts (at least in some releases under some circumstances), so if Firefox does the same, interoperability should be good for all sites that also support MSIE.
Notice that the library explicitly supports Content-type being sent for file uploads (which it detects by checking for the presence of the filename= parameter in Content-disposition).
For Firefox, I would recommend that just the order of headers is switched.
For pell-multipart-request, the right fix would be to read all header lines in each part until an empty line is seen, and extract content-disposition and content-type while doing so.
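A parser along those lines might look like the sketch below. It is written in standalone C++ for illustration only (pell-multipart-request itself is Java), `ParsePartHeaders` is a hypothetical name, and the sketch ignores folded continuation lines and duplicate headers.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <map>
#include <string>

// Hypothetical helper: reads header lines from the start of one part until
// the first empty line and returns them keyed by lower-cased header name.
// The result does not depend on the order in which the headers were sent.
std::map<std::string, std::string> ParsePartHeaders(const std::string& part) {
  std::map<std::string, std::string> headers;
  std::size_t pos = 0;
  while (pos < part.size()) {
    std::size_t eol = part.find("\r\n", pos);
    if (eol == std::string::npos) eol = part.size();
    assert(eol >= pos);
    if (eol == pos) break;                       // empty line: headers are done
    const std::string line = part.substr(pos, eol - pos);
    pos = eol + 2;                               // skip past the CRLF
    const std::size_t colon = line.find(':');
    if (colon == std::string::npos) continue;    // skip malformed lines
    std::string name = line.substr(0, colon);
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    const std::size_t start = line.find_first_not_of(" \t", colon + 1);
    headers[name] = (start == std::string::npos) ? std::string() : line.substr(start);
  }
  return headers;
}
```

With this approach, a part that sends Content-Type before Content-Disposition yields the same map as one that sends them the other way round, so the ordering assumption disappears.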
Martin, thanks for looking into this!
This is actually quite interesting. For file upload fields, we send:
NS_LITERAL_CSTRING("Content-Disposition: form-data; name=\"")
+ nameStr + NS_LITERAL_CSTRING("\"; filename=\"")
+ filenameStr + NS_LITERAL_CSTRING("\"" CRLF)
+ NS_LITERAL_CSTRING("Content-Type: ") + aContentType
+ NS_LITERAL_CSTRING(CRLF CRLF);
We also send:
NS_LITERAL_CSTRING("Content-Transfer-Encoding: binary" CRLF);
before that, but only if the browser.forms.submit.backwards_compatible preference is false. It defaults to true. See bug 58189 and bug 83065 for that sordid story. Perhaps we should restore that behavior by default and make sure that header comes after Content-Disposition (so that pell-multipart-request's stupid assumptions are satisfied) but before Content-Type (so that PHP's stupid assumptions are satisfied, if it's still making those stupid assumptions). This is a separate bug, in any case.
Moving on, for other form fields, this patch made us send:
+ NS_LITERAL_CSTRING("Content-Type: text/plain; charset=")
+ NS_LITERAL_CSTRING("Content-Disposition: form-data; name=\"")
+ nameStr + NS_LITERAL_CSTRING("\"" CRLF CRLF)
So indeed, the ordering is different. Let's switch that and see how compat looks?
Created attachment 275968
Comment on attachment 275968
Yeah, let's get this in and tested ASAP. r+sr=jst
The patch works with the Arstechnica forums (EVE), nice work devs. :)
please take a look at this one too:
Better see http://forums.mozillazine.org/viewtopic.php?t=574762
It is about: http://www.adslgr.com/forum/ a vBulletin forum with a similar failure.
yes, that's my thread,
anyone have an answer?
I'm not sure what sort of answer you're looking for. The thread has no indication of the actual steps to reproduce the problem (especially steps that could be followed by someone who does not know modern Greek well).
If you're still having a problem on that site with builds from this morning, check whether the issue started when the first patch for this bug got checked in? That would tell us whether this bug is even relevant to your problem.
I don't know when this bug started,
one thing I know though is that it started when I began using minefield,
worked fine with fx 126.96.36.199 and gran paradiso!
Someone that knows greek can follow these steps in order to reproduce it:
1. Login (or register if you don't have an account, then login) to http://www.adslgr.com
2. Goto any thread in the forum and try to post a quick reply clicking the submit button -> you'll get a please wait (must be div or something) message and the page hangs in there (no post takes place).
However, if you go through the normal reply process, everything is ok.
> I don't know when this bug started,
In that case, please file a new bug so we can figure out what caused the problem, get blocking flags set as needed, etc.
Note that this was hardly the only form submission change since the 1.8 branch.
> Login (or register if you don't have an account,
That's basically a non-starter, for what it's worth. Would you be willing to narrow down when the problem started using builds from http://archive.mozilla.org/pub/firefox/nightly/ and ftp://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/ ? You'll want the dated -trunk builds. Again, put the resulting information in the new bug you file. And please cc me on that bug
Hrm... this bug is showing back up in the nightly build...
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a8pre) Gecko/2007081905 Minefield/3.0a8pre
The Ars Technica forums no longer work [again]...
I just tested with the 8/20 nightly and latest hourly:
Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.9a8pre) Gecko/2007082013 Minefield/3.0a7 ID:2007082013
And it still WFM. I looked back through Bonsai before testing and nothing jumped out at me, did it give the exact same error about MESSAGE_BODY being a required field or whatever?
It auto-upgraded to:
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a8pre) Gecko/2007082005 Minefield/3.0a8pre
It's still broken for me, getting "TOPIC_MESSAGE_OID is a mandatory field. You must enter a value for it." when editing a post, and "MESSAGE_BODY is a mandatory field. You must enter a value for it." when posting a new message. 'Quick Reply' still works correctly as expected. The exact same messages I had been seeing prior to the fix.
OK. So what are the two nightly (or even better hourly) builds between which the problem reappeared?
Changing the order apparently causes bug 392982... Trying to figure out why.
The question in comment 31 remains unanswered. Does fixing this bug actually fix any real-world problems? Or are we simply doing it to do what the spec says?
It is at this point obvious that this bug is causing multiple sites to break, so there needs to be some significant value added for the change to be worth it to us and our users.
(In reply to comment #61)
> Does fixing this bug actually fix any real-world problems?
Most definitely. Adding a Content-type makes it possible to add a charset= parameter. This, in turn, makes it possible to specify the encoding used to transmit the fields of the form. It resolves long-standing issues in entering non-ASCII data into forms, even if the page encoding is unknown or does not support the characters being entered. Past bugs that are addressed with the patch are Bug 324964 and Bug 135762; there probably have been more reports of this issue over the years.
How do other browsers deal with this issue? I'm very unhappy about breaking as many sites as this potentially breaks.
Couldn't sites that want to support other encodings use enctype attribute?
> Couldn't sites that want to support other encodings use enctype attribute?
No. enctype specifies the Content-type for the entire POST message, not for the individual parts. It is "multipart/form-data" in all cases that are relevant for this bug - see the bug title. Please study all relevant specifications carefully.
Well, no matter what the specs say we need to come up with a solution that doesn't break loads of sites.
If the entire message is encoded using the encoding in enctype, aren't the individual parts going to encoded in that encoding too?
(In reply to comment #65)
> Well, no matter what the specs say we need to come up with a solution that
> doesn't break loads of sites.
Is there any proof that the version proposed in comment #46 breaks a lot of sites?
> If the entire message is encoded using the encoding in enctype, aren't the
> individual parts going to encoded in that encoding too?
Please, PLEASE read the specs before making statements like that. The enctype does not include an encoding.
> Is there any proof that the version proposed in comment #46
It breaks Yahoo Mail at least (and therefore any site that uses the same server-side setup). And it's only been in the trunk for less than two weeks, which means it's not gotten any real testing yet. Note that breaking "lots" of sites is equivalent to breaking a few (or one) high-profile sites for compat purposes.
Now I'm hopeful that Yahoo rolled their own thing and will fix it, but if that's not the case, this patch will need to come out.
Two other notes.
1) We're at a point in the release cycle where the focus is on blockers, and this bug is not one of them. So effort to make this stuff work will need to come from people who deeply care about it. I suggest contacting Yahoo and seeing what they're up to, for a start.
2) If it turns out that we can't just enable it, the next obvious thing to try is a way for pages to opt into it. That could even get standardized by the HTML WG.
Even if Yahoo fixes their thing I'm very worried that there are loads of other form libraries out there that do the same thing. If high profile professional sites like yahoo use sloppy parsing, you can bet that there are tons of home-rolled parsing libraries that do too.
The burden of proof really goes the other way, we should have proof that the patch does not break sites. Especially with formats as old as this one. And extra especially now once we have seen that multiple sites break from various versions of the patch.
I see this bug report to get reopened because of bug 392982, so let me
outline the key points of this bug report:
-implement the standard
-avoid breaking a bunch of sites that can't handle the standard
I want to point out that this bug is not about implementing something else,
a new non-standard thing or whatever. That's because:
a) some proposed non-standard solutions (e.g. adding a proprietary
HTTP-header) do not conflict with the standard solution itself,
so no need to mix them
b) there is already one non-standard solution for the problem
("_charset_" form field); I'm not going to fight for a second.
c) my bug 379858 got closed referring to this one. I'm definitely going to
reopen it if this one is drawn to something different
So the real problem is that some sites choke when the browser talks
standard to them. There is actually no provable complete solution to this
problem as _any_ visible change could break a site. - If you can't find one,
I can make one! (This is why I disagree with Jonas.)
But that is not important. Important is to make sure that big, well known, old
applications (web sites) see the old browser behavior if known to fail on the
Why "big, well known and old sites" only?
-new apps will get tested with standard browsers like Firefox and the bug
will be seen from the very beginning
-"unknown" sites usually assume "the browser is right, the app is wrong"
-small sites are unknown sites... (-: ... or have a flexible development
team that corrects the problem in time
How to detect these sites? A (manual) work intensive solution would be a
(domain-/url-)blacklist. It is especially effective for the "old" criterion.
As time passes the blacklist will grow slower and later on needs no
maintaining at all as we can all assume that after some months/years any
site in question is either not old or not well known. <-:
But I actually have a better idea, as I prefer solutions that need no
manual work at all: check if the page with the form to submit has parsing
errors. (I would like to say "renders in quirks mode", but that is not the
Pro: Yahoo Mail and any big corporate sites fail that test for sure (-:
Contra: most other sites, especially new sites, do probably fail, too...
Conclusion: anyone keen on standards gets his/her solution, while anyone else
sees the old behavior. - Problem solved.
I have even more fine tuning in mind, but I will come back to that in
my next comment.
Feel free to post patches to implement the behavior you think should be happening. Then we can discuss it.
There are three components to a form submission: (1) the referrer, (2) the browser, and (3) the form processor. (1) and (3) may not be under the control of the same entity. If you are a site that gets many POSTs from 3rd-party sites, you can't possibly get all of them to include the _charset_ parameter in their forms unless you block their submissions until they do.
By placing the character encoding either in the MIME headers of the multipart/form-data content, or within a (non-standard) HTTP header, it's not necessary for the form to "opt-in" for the form processor to benefit.
Making this feature work as-is, but only with forms on pages rendered in standards compliance mode, helps only for "intra-site" form submissions. The real problem this feature is meant to solve is with form submissions made by unpredictable 3rd-party sites. The fact that the referring page is or is not standards-compliant may have nothing to do with how the form processor itself is written, which is really the barrier we seem to be facing today.
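To make the idea of labeling each part concrete, here is a minimal Python sketch of a multipart/form-data body in which every text part carries its own charset in the MIME headers. The function name build_form_data and the label_parts switch are made up for illustration, field names are assumed to be plain ASCII, and this is not how Mozilla's form submission code is written:

```python
def build_form_data(fields, charset, boundary, label_parts=True):
    """Encode (name, value) text fields as a multipart/form-data body.

    Hypothetical helper: with label_parts=True each part declares its
    charset; with label_parts=False the parts are sent unlabeled.
    """
    delimiter = b"--" + boundary.encode("ascii")
    lines = []
    for name, value in fields:
        lines.append(delimiter)
        # Assumption: field names are ASCII and need no escaping.
        lines.append(
            f'Content-Disposition: form-data; name="{name}"'.encode("ascii")
        )
        if label_parts:
            lines.append(
                f"Content-Type: text/plain; charset={charset}".encode("ascii")
            )
        lines.append(b"")  # blank line separates part headers from content
        lines.append(value.encode(charset))
    lines.append(delimiter + b"--")  # closing delimiter
    lines.append(b"")
    return b"\r\n".join(lines)
```

With label_parts=True the form processor can read the charset from each part without the referring page having to opt in; with label_parts=False the body carries no encoding information at all.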
I'm not a big fan of the parsing error solution. First off, as David points out, it doesn't really solve the problem. Second, it seems very unpredictable and illogical for a web developer that changing a completely separate part of the page changes the form submission format. What would happen if Yahoo fixed their web pages? Should we punish them by "breaking" their form submissions?
There is no value in implementing standards for the sake of implementing standards. We implement standards to move the web forward. This standard is known to break sites, making us, and probably many other browser vendors, very hesitant to implement it.
As I've stated before, I don't want to ship a beta with yahoo broken. So if someone wants another solution, please provide a patch soon. Probably within a week.
Created attachment 286068 [details] [diff] [review]
Backout of the previous attachment.
This patch is the reverse of the previous attachment in this bug. This is being backed out because it caused regression bug 392982. I'm attaching this here partly to test a build with the previous patch backed out; there are no real differences between this patch and the reverse of the previous attachment.
Reopening since this got backed out. See bug 392982 for quite a bit of discussion around what this caused and how to possibly re-land this. Clearing blocking1.9+ on this bug as I don't think we'll have the time to look into a fix for this that doesn't cause bug 392982 in time for 1.9.
jst, I think you need to back out both patches that went in for this bug, not just the second one.... Otherwise you reintroduce bug 384270.
Reinstating the blocking flag, and nominating for beta blocking, since now we're in a known-broken state that we shouldn't be shipping for beta. Once the first attachment is backed out, we should undo the blocker settings.
Ok, backing out the other patch then too...
Created attachment 286076 [details] [diff] [review]
Backout of both fixes that went in for this bug.
Boris, please have a look at this patch, this is a combined backout of the two fixes for this bug (already checked in).
Clearing blocker flags again as both parts of this bug are now backed out.
Yeah, that second backout patch looks good.
I believe this bug can be closed. HTML5 now explicitly forbids the Content-Type header:
"The parts of the generated multipart/form-data resource that correspond to non-file fields must not have a Content-Type header specified."
So this is either not a bug, or the HTML5 specification needs revision.
So this is what Firefox is currently sending to my server [you will be able to guess which parts have been altered by me]:
POST http://[removed] HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0
Accept-Encoding: gzip, deflate
Content-Type: multipart/form-data; boundary=---------------------------294571387113960
Content-Disposition: form-data; name="utf-8"
[some bytes which happen to be a utf-8 sequence]
Content-Disposition: form-data; name="format"
[some bytes which happen to be ascii text]
Content-Disposition: form-data; name="text"
[some bytes which happen to be a utf-8 sequence]
Please enlighten me: How is the server supposed to know that the encoding of the MIME parts is UTF-8? The MIME spec clearly states that in the absence of a Content-Type header, the correct content type is "text/plain; charset=us-ascii" (as stated in a 13-year-old comment).
What really bugs me is [...]
> How is the server supposed to know that the encoding of the MIME parts is UTF-8?
By assuming it's the encoding of the page that the form was on. Yes, this sucks. When we tried to fix it, we discovered that too many servers are too broken to allow us to send that information in the POST data.
If you have constructive suggestions for communicating that information, please raise them with the spec...
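For what it's worth, the guessing a server ends up doing can be sketched roughly as follows in Python. This assumes the multipart body has already been split into (name, declared_charset, raw_bytes) tuples by some parser; decode_fields and its arguments are hypothetical names, not part of any real framework:

```python
def decode_fields(parts, page_encoding):
    """Decode form fields using the best available charset hint.

    parts: list of (name, declared_charset_or_None, raw_bytes).
    Order of preference: a charset declared on the part itself, then the
    value of a `_charset_` field if the form included one, then the
    encoding of the page the form was served in (the traditional guess).
    """
    form_charset = None
    for name, _, raw in parts:
        if name == "_charset_":
            form_charset = raw.decode("ascii", errors="replace").strip() or None

    decoded = {}
    for name, declared, raw in parts:
        charset = declared or form_charset or page_encoding
        try:
            decoded[name] = raw.decode(charset)
        except (LookupError, UnicodeDecodeError):
            # Unknown or wrong label: fall back to the page encoding and
            # mark undecodable bytes instead of failing the request.
            decoded[name] = raw.decode(page_encoding, errors="replace")
    return decoded
```

Only the last fallback is available for the request shown above, since its parts carry no Content-Type header and the form has no `_charset_` field. Repeated field names overwrite each other in this sketch.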