
Forms of UTF-16LE documents are encoded in UTF-8, although _charset_ declares UTF-16LE

RESOLVED FIXED in mozilla13

Status

Core
HTML: Form Submission
RESOLVED FIXED
6 years ago
5 years ago

People

(Reporter: Loïc, Assigned: bz)

Tracking

(Blocks: 1 bug)

Version: unspecified
Target Milestone: mozilla13
Points: ---
Bug Flags: in-testsuite +

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(3 attachments, 1 obsolete attachment)

(Reporter)

Description

6 years ago
User-Agent:       Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12
Build Identifier: Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12

The following HTML document should be saved as UTF-16LE, named form.utf16le.html, and then opened as a file in Firefox:

<?xml version="1.0" encoding="UTF-16LE"?>
<html>
  <head>
     <title>encoding bug</title>
     <meta http-equiv='Content-Type' content='text/html; charset=UTF-16LE' />
  </head>
  <body>
    <form method='get'>
      <input type='text' name='oops' value='aïe' />
      <input type='hidden' name='_charset_' />
    </form>
  </body>
</html>

Submitting it (with the return key) produces an inconsistent URI.

Reproducible: Always

Steps to Reproduce:
1. open file:///.../form.utf16le.html
2. submit the form (with the return key)
Actual Results:  
file:///.../form.utf16le.html?oops=a%C3%AFe&_charset_=UTF-16LE
(but %C3%AF is the UTF-8 encoding of ï, and the whole query string is UTF-8, which contradicts _charset_=UTF-16LE).

Expected Results:  
file:///.../form.utf16le.html?oops=a%C3%AFe&_charset_=UTF-8
or
file:///.../form.utf16le.html?o%00o%00p%00s=a%00%EF%00e%00
or
file:///.../form.utf16le.html?o%00o%00p%00s=a%00%EF%00e%00&_charset_=UTF-16LE
or
? (open)
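
The byte-level difference between the actual and expected values above can be reproduced with a minimal Python sketch. It only illustrates how the field value "aïe" percent-encodes under each encoding; it says nothing about how Gecko builds the query string, and the variable names are illustrative.

```python
from urllib.parse import quote

value = "aïe"

# UTF-8: ï (U+00EF) becomes the two bytes C3 AF.
utf8_form = quote(value.encode("utf-8"))

# UTF-16LE: every code unit is two bytes, low byte first, so each
# ASCII letter is followed by a NUL byte.
utf16le_form = quote(value.encode("utf-16-le"))

print(utf8_form)     # a%C3%AFe
print(utf16le_form)  # a%00%EF%00e%00
```

The first form is what Firefox actually sends; the second is what a literal UTF-16LE encoding of the value would look like.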

Note the logical difficulty of encoding the string value "UTF-16LE" in UTF-16LE itself; here I used US-ASCII to encode the _charset_ parameter, because otherwise it would be of little use. I am not sure that _charset_ makes sense for encodings that are not US-ASCII compatible.

While it is not clear what the best strategy is, the adopted strategy should be documented.

This bug is related to #169575. I agree, it would be simpler to always use UTF-8 for query string parameters. Unfortunately, history cannot be changed... And Firefox still declares ISO-8859-1 as its preferred encoding!

In an older release, I observed _charset_=UTF-16 instead of _charset_=UTF-16LE. While LE may be the default of the OS, BE is the default for network communication, so this is confusing. I suggest never using UTF-16 as a _charset_ value without an explicit LE or BE.

Updated

6 years ago
Version: unspecified → 3.6 Branch
Reporter, Firefox 4.0.1 has been released, and it features significant improvements over previous releases. Can you please update to Firefox 4.0.1 or later, and retest your bug? Please also create a fresh profile (http://support.mozilla.com/kb/Managing+profiles), update your plugins (Flash, Java, Quicktime, Reader, etc.) and update your graphics driver and operating system to the latest versions available.

If you still continue to see this issue, please comment. If you do not, please close this bug as RESOLVED > WORKSFORME

filter: prefirefox4uncobugs
(Reporter)

Comment 2

5 years ago
Same behaviour with FF 8.0
Confirming. Our behavior violates the application/x-www-form-urlencoded encoding algorithm.
http://dev.w3.org/html5/spec/Overview.html#application-x-www-form-urlencoded-encoding-algorithm
file:///.../form.utf16le.html?oops=a%C3%AFe&_charset_=UTF-8
should be sent per spec.
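
The rule this comment relies on can be sketched roughly as follows, assuming the spec's behavior is "fall back to UTF-8 when the document encoding is not ASCII-compatible, and put the encoding actually used into a field named _charset_". The helper names and the set of encodings below are hypothetical and are not Gecko's implementation.

```python
from urllib.parse import urlencode

# Assumption: only the UTF-16 family is treated as non-ASCII-compatible here.
NON_ASCII_COMPATIBLE = {"UTF-16", "UTF-16LE", "UTF-16BE"}

def pick_submission_charset(document_charset: str) -> str:
    """Return the encoding used for the submission (hypothetical helper)."""
    if document_charset.upper() in NON_ASCII_COMPATIBLE:
        return "UTF-8"
    return document_charset

def build_query(fields, document_charset: str) -> str:
    """Encode (name, value) pairs, filling _charset_ with the encoding used."""
    charset = pick_submission_charset(document_charset)
    pairs = [(name, charset if name == "_charset_" else value)
             for name, value in fields]
    return urlencode(pairs, encoding=charset)

print(build_query([("oops", "aïe"), ("_charset_", "")], "UTF-16LE"))
# oops=a%C3%AFe&_charset_=UTF-8
```

The point of the sketch is that the value of _charset_ and the bytes in the query string come from the same variable, so they cannot disagree.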
Status: UNCONFIRMED → NEW
Ever confirmed: true
Component: General → HTML: Form Submission
Product: Firefox → Core
QA Contact: general → form-submission
Version: 3.6 Branch → unspecified
Created attachment 595661 [details]
Test case as an attachment
> file:///.../form.utf16le.html?oops=a%C3%AFe&_charset_=UTF-8
> should be sent per spec.
Although the spec does not define the behavior for the file scheme, our behavior is also invalid on http(s).
http://dev.w3.org/html5/spec/Overview.html#form-submission-algorithm
Mmm... Gotta love black-hole components like "Firefox:General".  ;)

Do we also want to send windows-1252 when we're using that instead of ISO-8859-1?
Assignee: nobody → bzbarsky
OS: Linux → All
Hardware: x86_64 → All
Whiteboard: [need review]
Created attachment 595772 [details] [diff] [review]
Proposed fix that changes both cases

Jonas, please let me know whether you think we should keep sending ISO-8859-1 instead of windows-1252.
Attachment #595772 - Flags: review?(jonas)
Note that the spec says nothing about ISO-8859-1 vs windows-1252 here...
Per the Encoding Standard, ISO-8859-1 is just an alias of windows-1252. But that is a more general problem than this bug.
> -  if (charset.EqualsLiteral("ISO-8859-1")) {
> -    charset.AssignLiteral("windows-1252");
> -  }
I think we will have to continue to encode form contents as "windows-1252" for Web compat even if we treat "ISO-8859-1" as different from "windows-1252" right now.
> I think we will have to continue to encode form contents as "windows-1252" for Web compat 

Sure.  The question is which string should go in the _charset_ in the URL.  Should it be _charset_=windows-1252, or should it be _charset_=ISO-8859-1?  My patch makes it be the former, but it's easy to do the latter...
(In reply to Boris Zbarsky (:bz) from comment #11)
> > I think we will have to continue to encode form contents as "windows-1252" for Web compat 
> Sure.
Ah, I overlooked the code added in GetSubmissionFromForm.
> The question is which string should go in the _charset_ in the URL. 
> Should it be _charset_=windows-1252, or should it be _charset_=ISO-8859-1? 
> My patch makes it be the former, but it's easy to do the latter...
IMO it should be _charset_=windows-1252 to comply with the Encoding Standard until we implement separate alias sets for browser and mail.
Created attachment 595883 [details]
ISO-8859-1 test case
Created attachment 595890 [details]
ISO-8859-1 test case with a windows-1252 specific character

For the record,
IE9: ?oops=a%EFe%99&_charset_=iso-8859-1
Firefox without patch : ?oops=a%EFe%99&_charset_=ISO-8859-1
WebKit: ?oops=a%EFe%99&_charset_=
Opera, Firefox with patch: ?oops=a%EFe%99&_charset_=windows-1252
Hm, it may break compatibility with existing content (unless that content already accounts for Opera).
Attachment #595883 - Attachment is obsolete: true
Yeah, that was my worry...  Thank you for the data-gathering!

Jonas, thoughts?
I think we should land with _charset_=windows-1252 to get an indication of whether it's feasible to start treating ISO-8859-1 as an alias for Windows-1252 (i.e. to see if anything important breaks). If we see breakage, we'll probably need to change Anne's spec to say something more complicated than treating ISO-8859-1 as a plain alias for Windows-1252.
Comment on attachment 595772 [details] [diff] [review]
Proposed fix that changes both cases

I don't know enough about the windows-1252 issue to have an opinion.

The rest looks good though.
Attachment #595772 - Flags: review?(jonas) → review+
Simon, thoughts?

I'm not sure I want to try dealing with possible compat fallout here, honestly.  :(

What do other browsers send in this situation?
I'm not sure why people suspect that changing ISO-8859-1 to windows-1252 in the URL is likely to make things break. In the cases from comment 14,

Firefox without patch : ?oops=a%EFe%99&_charset_=ISO-8859-1
Firefox with patch: ?oops=a%EFe%99&_charset_=windows-1252

"Firefox without patch" is already treating ISO-8859-1 as an alias for windows-1252, since 0x99 is ™ in windows-1252 but not in ISO-8859-1; but "Firefox with patch" is correct without any aliasing.

Or did I totally misunderstand the question?
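
The claim that byte 0x99 only yields ™ under windows-1252 can be checked with a short Python sketch. It uses Python's cp1252 and latin-1 codecs as stand-ins for windows-1252 and ISO-8859-1; that mapping is an assumption of this example, not something stated in the bug.

```python
# Bytes of the query value from the test case: a, ï, e, 0x99.
raw = b"a\xefe\x99"

as_1252 = raw.decode("cp1252")     # 0x99 maps to U+2122 (trademark sign)
as_latin1 = raw.decode("latin-1")  # 0x99 maps to U+0099 (a C1 control)

# Going the other way, the trademark sign cannot be encoded in latin-1.
try:
    "aïe™".encode("latin-1")
    encodable_in_latin1 = True
except UnicodeEncodeError:
    encodable_in_latin1 = False
```

So a query string containing %99 for ™ can only have been produced with windows-1252, whatever label is sent in _charset_.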
The question is what the server will do with the "Firefox with patch" query string.  If it's explicitly checking for ISO-8859-1 it could break... and I wouldn't bet on servers not doing that.  :(

Ignore my question about other browsers from comment 18; it's answered in comment 14...
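
The compatibility worry about servers that inspect _charset_ can be illustrated with a hypothetical server-side decoder in Python. Nothing here comes from a real server; the label table, the function names and the fallback for an empty label are all assumptions made for the example.

```python
from urllib.parse import parse_qs

# Hypothetical label table: both labels decode with the windows-1252 codec.
CODECS = {"iso-8859-1": "cp1252", "windows-1252": "cp1252", "utf-8": "utf-8"}

def strict_server_accepts(query: str) -> bool:
    """A server that only recognizes the literal label ISO-8859-1."""
    label = parse_qs(query, encoding="latin-1").get("_charset_", [""])[0]
    return label == "ISO-8859-1"

def tolerant_decode(query: str, default: str = "windows-1252") -> dict:
    """A server that maps known labels to a codec before decoding."""
    # The label itself is ASCII, so a first pass with latin-1 is safe.
    first = parse_qs(query, encoding="latin-1", keep_blank_values=True)
    label = first.get("_charset_", [""])[0] or default
    codec = CODECS.get(label.lower())
    if codec is None:
        raise ValueError(f"unsupported _charset_ label: {label!r}")
    return parse_qs(query, encoding=codec, keep_blank_values=True)

old = "oops=a%EFe%99&_charset_=ISO-8859-1"
new = "oops=a%EFe%99&_charset_=windows-1252"
```

With these definitions, strict_server_accepts(old) is true and strict_server_accepts(new) is false, while tolerant_decode returns "aïe™" for the oops field in both cases. The first kind of server is the one that would break if the label changed.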
So we're weighing the possibility of the server explicitly checking for ISO-8859-1 against the possibility of the server explicitly decoding ISO-8859-1 (which would be a problem with the status quo already)?

I say go for _charset_=ISO-8859-1 to maintain compatibility with IE and our old behaviour.
> So we're weighing the possibility of the server explicitly checking for ISO-8859-1
> against the possibility of the server explicitly decoding ISO-8859-1 (which would be a
> problem with the status quo already)?

Yes.

> I say go for _charset_=ISO-8859-1 to maintain compatibility with IE and our old behaviour.

OK.  I'll file a followup for considering changing the ISO-8859-1 bit.
http://hg.mozilla.org/integration/mozilla-inbound/rev/b1eb49737d05

Filed bug 732326.
Blocks: 732326
Flags: in-testsuite+
Whiteboard: [need review]
Target Milestone: --- → mozilla13
https://hg.mozilla.org/mozilla-central/rev/b1eb49737d05
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED