Bug 615595 - Forms of UTF-16LE documents are encoded in UTF-8, although _charset_ declares UTF-16LE
Status: RESOLVED FIXED
Product: Core
Classification: Components
Component: HTML: Form Submission (show other bugs)
Version: unspecified
Platform: All All
Importance: -- normal
Target Milestone: mozilla13
Assigned To: Boris Zbarsky [:bz]
Blocks: 732326
Reported: 2010-11-30 09:54 PST by Loïc
Modified: 2012-03-02 06:34 PST (History)
bzbarsky: in‑testsuite+


Attachments
Test case as an attachment (706 bytes, text/html; charset=utf-16le)
2012-02-08 22:56 PST, Masatoshi Kimura [:emk]
no flags
Proposed fix that changes both cases (5.23 KB, patch)
2012-02-09 09:08 PST, Boris Zbarsky [:bz]
jonas: review+
ISO-8859-1 test case (356 bytes, text/html; charset=ISO-8859-1)
2012-02-09 15:22 PST, Masatoshi Kimura [:emk]
no flags
ISO-8859-1 test case with a windows-1252 specific character (357 bytes, text/html; charset=ISO-8859-1)
2012-02-09 15:44 PST, Masatoshi Kimura [:emk]
no flags

Description Loïc 2010-11-30 09:54:09 PST
User-Agent:       Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12
Build Identifier: Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12

The following HTML document should be converted to UTF-16LE, saved as form.utf16le.html, and then opened as a local file in Firefox:

<?xml version="1.0" encoding="UTF-16LE"?>
<html>
  <head>
     <title>encoding bug</title>
     <meta http-equiv='Content-Type' content='text/html; charset=UTF-16LE' />
  </head>
  <body>
    <form method='get'>
      <input type='text' name='oops' value='aïe' />
      <input type='hidden' name='_charset_' />
    </form>
  </body>
</html>

Submitting it (with the Return key) produces an inconsistent URI.

Reproducible: Always

Steps to Reproduce:
1. open file:///.../form.utf16le.html
2. submit the form (with the Return key)
Actual Results:  
file:///.../form.utf16le.html?oops=a%C3%AFe&_charset_=UTF-16LE
(but %C3%AF are UTF-8 bytes, as is the whole query string).

Expected Results:  
file:///.../form.utf16le.html?oops=a%C3%AFe&_charset_=UTF-8
or
file:///.../form.utf16le.html?o%00o%00p%00s%00=a%00%EF%00e%00
or
file:///.../form.utf16le.html?o%00o%00p%00s%00=a%00%EF%00e%00&_charset_=UTF-16LE
or
? (open)

Note the logical difficulty of encoding the string value "UTF-16LE" in UTF-16LE itself; here I used US-ASCII to encode the _charset_ parameter, because otherwise it would be of little use. I am not sure that _charset_ makes sense for encodings that are not US-ASCII compatible.
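
For reference, the two byte-level encodings of the test value can be reproduced with a minimal Python sketch. It only percent-encodes the raw bytes of 'aïe' under each encoding; it makes no claim about how Gecko builds the query string.

```python
from urllib.parse import quote

value = "aïe"

# Percent-encode the raw bytes of the value under each candidate encoding.
utf8_query = quote(value.encode("utf-8"), safe="")
utf16le_query = quote(value.encode("utf-16-le"), safe="")

print(utf8_query)     # a%C3%AFe
print(utf16le_query)  # a%00%EF%00e%00
```

The first form is what Firefox actually sends; the second is what a literal UTF-16LE encoding of the value would look like.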

While it is not clear what the best strategy is, the adopted strategy should be documented.

This bug is related to #169575. I agree, it would be simpler to always use UTF-8 for query string parameters. Unfortunately, the history cannot be changed... And Firefox still declares ISO-8859-1 as preferred encoding!

In an older release, I observed _charset_=UTF-16 instead of _charset_=UTF-16LE. While LE may be the default of the OS, BE is the default for network communication, so this is confusing. I suggest never using UTF-16 as the _charset_ value without an explicit LE or BE.
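
The byte-order ambiguity can be illustrated with a small Python sketch (an illustration only, unrelated to the browser code): the same two bytes decode to different characters depending on the assumed byte order, and Python's plain "utf-16" codec prepends a byte order mark in the platform's native order.

```python
raw = "a".encode("utf-16-le")      # the two bytes 0x61 0x00

as_le = raw.decode("utf-16-le")    # read back with the right byte order
as_be = raw.decode("utf-16-be")    # same bytes, opposite byte order

# The unqualified codec adds a BOM; its byte order depends on the platform.
with_bom = "a".encode("utf-16")

print(repr(as_le), repr(as_be), with_bom[:2])
```

Here as_le is 'a' while as_be is U+6100, a different character entirely.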
Comment 1 Tyler Downer [:Tyler] 2011-06-03 17:54:48 PDT
Reporter, Firefox 4.0.1 has been released, and it features significant improvements over previous releases. Can you please update to Firefox 4.0.1 or later, and retest your bug? Please also create a fresh profile (http://support.mozilla.com/kb/Managing+profiles), update your plugins (Flash, Java, Quicktime, Reader, etc) and update your graphics driver and operating system to the latest versions available.

If you still continue to see this issue, please comment. If you do not, please close this bug as RESOLVED > WORKSFORME

filter: prefirefox4uncobugs
Comment 2 Loïc 2012-02-08 14:42:24 PST
Same behaviour with FF 8.0
Comment 3 Masatoshi Kimura [:emk] 2012-02-08 22:50:25 PST
Confirming, our behavior violates the application/x-www-form-urlencoded encoding algorithm.
http://dev.w3.org/html5/spec/Overview.html#application-x-www-form-urlencoded-encoding-algorithm
file:///.../form.utf16le.html?oops=a%C3%AFe&_charset_=UTF-8
should be sent per spec.
Comment 4 Masatoshi Kimura [:emk] 2012-02-08 22:56:17 PST
Created attachment 595661
Test case as an attachment
Comment 5 Masatoshi Kimura [:emk] 2012-02-08 23:00:39 PST
> file:///.../form.utf16le.html?oops=a%C3%AFe&_charset_=UTF-8
> should be sent per spec.
Although the spec hasn't defined the behavior on file scheme, our behavior is invalid even on http(s).
http://dev.w3.org/html5/spec/Overview.html#form-submission-algorithm
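
The behaviour described in comments 3 and 5 can be sketched roughly as follows. This is a simplified Python illustration, not the spec algorithm or the Gecko code: the function names, the (name, value, is_hidden) field tuples, and the short list of non-ASCII-compatible encodings are all assumptions made for the example.

```python
from urllib.parse import quote_plus

# Assumption for this sketch: only the UTF-16 family is treated as
# non-ASCII-compatible.
NON_ASCII_COMPATIBLE = {"utf-16", "utf-16le", "utf-16be"}


def pick_submission_encoding(document_encoding: str) -> str:
    """Fall back to UTF-8 when the document encoding is not ASCII-compatible."""
    if document_encoding.lower() in NON_ASCII_COMPATIBLE:
        return "UTF-8"
    return document_encoding


def encode_form(fields, document_encoding: str) -> str:
    """fields is a list of (name, value, is_hidden) tuples (hypothetical shape)."""
    encoding = pick_submission_encoding(document_encoding)
    pairs = []
    for name, value, is_hidden in fields:
        # A hidden _charset_ field reports the encoding actually used.
        if name == "_charset_" and is_hidden:
            value = encoding
        pairs.append(
            quote_plus(name, encoding=encoding)
            + "="
            + quote_plus(value, encoding=encoding)
        )
    return "&".join(pairs)


fields = [("oops", "aïe", False), ("_charset_", "", True)]
print(encode_form(fields, "UTF-16LE"))  # oops=a%C3%AFe&_charset_=UTF-8
```

With a UTF-16LE document the sketch yields the query string that comment 3 says should be sent.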
Comment 6 Boris Zbarsky [:bz] 2012-02-09 09:02:57 PST
Mmm... Gotta love black-hole components like "Firefox:General".  ;)

Do we also want to send windows-1252 when we're using that instead of ISO-8859-1?
Comment 7 Boris Zbarsky [:bz] 2012-02-09 09:08:44 PST
Created attachment 595772
Proposed fix that changes both cases

Jonas, please let me know if you think we should keep sending ISO-8859-1 instead of windows-1252?
Comment 8 Boris Zbarsky [:bz] 2012-02-09 09:13:58 PST
Note that the spec says nothing about ISO-8859-1 vs windows-1252 here...
Comment 9 Masatoshi Kimura [:emk] 2012-02-09 10:59:26 PST
Per the Encoding Standard, ISO-8859-1 is just an alias of windows-1252. But that is a more general problem than this bug.
Comment 10 Masatoshi Kimura [:emk] 2012-02-09 11:13:52 PST
> -  if (charset.EqualsLiteral("ISO-8859-1")) {
> -    charset.AssignLiteral("windows-1252");
> -  }
I think we will have to continue to encode form contents as "windows-1252" for Web compat even if we treat "ISO-8859-1" as different from "windows-1252" right now.
Comment 11 Boris Zbarsky [:bz] 2012-02-09 11:52:50 PST
> I think we will have to continue to encode form contents as "windows-1252" for Web compat 

Sure.  The question is which string should go in the _charset_ in the URL.  Should it be _charset_=windows-1252, or should it be _charset_=ISO-8859-1?  My patch makes it be the former, but it's easy to do the latter...
Comment 12 Masatoshi Kimura [:emk] 2012-02-09 15:12:05 PST
(In reply to Boris Zbarsky (:bz) from comment #11)
> > I think we will have to continue to encode form contents as "windows-1252" for Web compat 
> Sure.
Ah, I overlooked the code added in GetSubmissionFromForm.
> The question is which string should go in the _charset_ in the URL. 
> Should it be _charset_=windows-1252, or should it be _charset_=ISO-8859-1? 
> My patch makes it be the former, but it's easy to do the latter...
IMO it should be _charset_=windows-1252 to comply with the Encoding Standard until we implement different alias sets for browser and mail.
Comment 13 Masatoshi Kimura [:emk] 2012-02-09 15:22:59 PST
Created attachment 595883
ISO-8859-1 test case
Comment 14 Masatoshi Kimura [:emk] 2012-02-09 15:44:06 PST
Created attachment 595890
ISO-8859-1 test case with a windows-1252 specific character

For the record,
IE9: ?oops=a%EFe%99&_charset_=iso-8859-1
Firefox without patch : ?oops=a%EFe%99&_charset_=ISO-8859-1
WebKit: ?oops=a%EFe%99&_charset_=
Opera, Firefox with patch: ?oops=a%EFe%99&_charset_=windows-1252
Hm, it may break compatibility with existing content (unless that content already accounts for Opera).
Comment 15 Boris Zbarsky [:bz] 2012-02-09 16:21:16 PST
Yeah, that was my worry...  Thank you for the data-gathering!

Jonas, thoughts?
Comment 16 Henri Sivonen (:hsivonen) 2012-02-20 01:12:34 PST
I think we should land with _charset_=windows-1252 to get an indication of whether it's feasible to start treating ISO-8859-1 as an alias for Windows-1252 (i.e. to see if anything important breaks). If we see breakage, we'll probably need to change Anne's spec to say something more complicated than treating ISO-8859-1 as a plain alias for Windows-1252.
Comment 17 Jonas Sicking (:sicking) No longer reading bugmail consistently 2012-02-26 20:37:06 PST
Comment on attachment 595772
Proposed fix that changes both cases

I don't know enough about the windows-1252 issue to have an opinion.

The rest looks good though.
Comment 18 Boris Zbarsky [:bz] 2012-02-26 23:18:51 PST
Simon, thoughts?

I'm not sure I want to try dealing with possible compat fallout here, honestly.  :(

What do other browsers send in this situation?
Comment 19 Simon Montagu :smontagu 2012-03-01 08:46:34 PST
I'm not sure why people suspect that changing ISO-8859-1 to windows-1252 in the URL is likely to make things break. In the cases from attachment 14,

Firefox without patch : ?oops=a%EFe%99&_charset_=ISO-8859-1
Firefox with patch: ?oops=a%EFe%99&_charset_=windows-1252

"Firefox without patch" is already treating ISO-8859-1 as an alias for windows-1252, since 0x99 is ™ in windows-1252 but not in ISO-8859-1; but "Firefox with patch" is correct without any aliasing.

Or did I totally misunderstand the question?
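
The point about 0x99 can be checked with a short Python sketch. It assumes Python's "cp1252" and "latin-1" codecs as stand-ins for windows-1252 and strict ISO-8859-1, and says nothing about Gecko's own encoders.

```python
from urllib.parse import quote

text = "aïe™"

# windows-1252 has a slot for the trademark sign at 0x99.
cp1252_bytes = text.encode("cp1252")
query_value = quote(cp1252_bytes, safe="")

# Strict ISO-8859-1 cannot represent it at all.
try:
    text.encode("latin-1")
    latin1_can_encode = True
except UnicodeEncodeError:
    latin1_can_encode = False

# In strict ISO-8859-1, byte 0x99 is a C1 control character, not a symbol.
decoded_as_latin1 = b"\x99".decode("latin-1")

print(query_value)        # a%EFe%99
print(latin1_can_encode)  # False
```

So a query value of a%EFe%99 can only have come from a windows-1252 encoder, whatever label is placed in _charset_.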
Comment 20 Boris Zbarsky [:bz] 2012-03-01 08:52:38 PST
The question is what the server will do with the "Firefox with patch" query string.  If it's explicitly checking for ISO-8859-1 it could break... and I wouldn't bet on servers not doing that.  :(

Ignore my question about other browsers from comment 18; it's answered in comment 14...
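
To make the server-side concern concrete, here is a hypothetical Python sketch of a handler that reads _charset_ and decodes the rest of the query accordingly. The label table and the decode_query helper are invented for illustration; a server that compared the label against the literal string "ISO-8859-1" instead of normalizing it like this is the case that could break.

```python
from urllib.parse import parse_qsl

# Hypothetical label table: both labels map to the same decoder, as browsers do.
LABEL_TO_CODEC = {
    "iso-8859-1": "cp1252",
    "windows-1252": "cp1252",
    "utf-8": "utf-8",
}


def decode_query(query: str) -> dict:
    # First pass: latin-1 never fails, and the _charset_ value is plain ASCII.
    first_pass = dict(parse_qsl(query, keep_blank_values=True, encoding="latin-1"))
    label = first_pass.get("_charset_", "").strip().lower()
    # Assumption: unknown or empty labels fall back to UTF-8.
    codec = LABEL_TO_CODEC.get(label, "utf-8")
    # Second pass: decode every field with the declared encoding.
    return dict(parse_qsl(query, keep_blank_values=True, encoding=codec))


old = decode_query("oops=a%EFe%99&_charset_=ISO-8859-1")
new = decode_query("oops=a%EFe%99&_charset_=windows-1252")
print(old["oops"], new["oops"])
```

Both query strings from comment 14 decode to the same value under this normalization.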
Comment 21 Simon Montagu :smontagu 2012-03-01 09:24:10 PST
So we're weighing the possibility of the server explicitly checking for ISO-8859-1 against the possibility of the server explicitly decoding ISO-8859-1 (which would be a problem with the status quo already)?

I say go for _charset_=ISO-8859-1 to maintain compatibility with IE and our old behaviour.
Comment 22 Boris Zbarsky [:bz] 2012-03-01 09:39:21 PST
> So we're weighing the possibility of the server explicitly checking for ISO-8859-1
> against the possibility of the server explicitly decoding ISO-8859-1 (which would be a
> problem with the status quo already)?

Yes.

> I say go for _charset_=ISO-8859-1 to maintain compatibility with IE and our old behaviour.

OK.  I'll file a followup for considering changing the ISO-8859-1 bit.
Comment 24 Marco Bonardo [::mak] 2012-03-02 06:34:59 PST
https://hg.mozilla.org/mozilla-central/rev/b1eb49737d05