Last Comment Bug 224820 - POST/GET data gets corrupted in pages in non-byte oriented character encodings(UTF-16/32)
: POST/GET data gets corrupted in pages in non-byte oriented character encoding...
Status: RESOLVED FIXED
: dataloss, intl
Product: Core
Classification: Components
Component: HTML: Form Submission (show other bugs)
: Trunk
: x86 All
: -- critical (vote)
: ---
Assigned To: Jungshik Shin
:
:
Mentors:
http://members.lycos.nl/slu/test.php
: 248336 294774 (view as bug list)
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2003-11-05 12:07 PST by Kester Everts
Modified: 2005-08-07 08:57 PDT (History)
10 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
a tentative patch (yet to be polished) (35.42 KB, patch)
2003-11-08 02:55 PST, Jungshik Shin
no flags Details | Diff | Splinter Review
Charset Test Page Source (4.17 KB, text/plain)
2003-11-08 04:12 PST, Kester Everts
no flags Details
another (experimental) patch (38.30 KB, patch)
2003-11-09 02:24 PST, Jungshik Shin
no flags Details | Diff | Splinter Review
patch update (23.11 KB, patch)
2005-05-21 11:31 PDT, Jungshik Shin
no flags Details | Diff | Splinter Review
patch update (23.47 KB, patch)
2005-05-21 23:47 PDT, Jungshik Shin
bzbarsky: review-
Details | Diff | Splinter Review
patch update (with bz's concerns addressed) (23.74 KB, patch)
2005-06-01 06:58 PDT, Jungshik Shin
cbiesinger: review-
bzbarsky: superreview+
Details | Diff | Splinter Review
patch update addressing cbie's comments (24.60 KB, patch)
2005-07-01 09:12 PDT, Jungshik Shin
cbiesinger: review+
jshin1987: superreview+
asa: approval1.8b4+
Details | Diff | Splinter Review

Description Kester Everts 2003-11-05 12:07:17 PST
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; nl-NL; rv:1.6a) Gecko/20031030
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; nl-NL; rv:1.6a) Gecko/20031030

I am using the UTF-16 charset on my page. When data gets POSTed from a form, the
POST data gets corrupted.
You can see this on my demonstation page: http://members.lycos.nl/slu/test.php.
Note that when the entire page is gargabe, you will have to reload the page.
Everything is explained there.

Reproducible: Always

Steps to Reproduce:
1. Create a UTF-16 page
2. Make a form with method="post"
3. Look at the POST data

Actual Results:  
The field names are strangely truncated to 1 character and the values are not
really UTF-16 encoded.

Expected Results:  
Mozilla should NOT truncate the field names and should use the (url-encoded)
UTF-16 charset.
Comment 1 Boris Zbarsky [:bz] (still a bit busy) 2003-11-05 15:57:43 PST
All of the nsFormSubmission code uses char* pointers for the data it plans to
send to the server... the problem is that it needs to pass around lengths too;
otherwise things break with UTF-16 and the like.
Comment 2 Jungshik Shin 2003-11-05 20:44:26 PST
In addition, it seems like we need to add nsEscape/UnEscape() that accept  the
string length as a parameter. Currently, both use strlen(). nsISaveAsCharset
also needs some change.

Comment 3 Jungshik Shin 2003-11-06 08:45:23 PST
bz
do you know where nsIFormProcessor interface is implemented. Its interface is
baroque and I want to fix it, but the only implementation I can find is kind of
dummy (nsTestFormProcessor). 
Comment 4 Boris Zbarsky [:bz] (still a bit busy) 2003-11-06 11:10:22 PST
jshin, that's used for <keygen>.  See
http://lxr.mozilla.org/mozilla/source/security/manager/ssl/src/nsKeygenHandler.h#43
Comment 5 Jungshik Shin 2003-11-08 02:40:43 PST
I have a tentative patch, but we need to come up with what exactly we want when
we send form data in UTF-16/32 over POST. For instance, how should we represent
' ' (U+0020 SPACE) if charset for POST is UTF-16LE?  
Which of the following should we use?

0x2b  
0x2b 0x00
0x2b %00
%20 %00

How about   'a' (U+0061)? Which should we use? 

%61%00
a%00

The other way around, how about U+6100? 

%00a 
%00%61 


Kester, btw, can you attach your test.php (source) here? 

Probably, we have to treat non-ASCII-compatible encodings (UTF-16/32) specially.   
Comment 6 Jungshik Shin 2003-11-08 02:55:27 PST
Created attachment 135050 [details] [diff] [review]
a tentative patch (yet to be polished)
Comment 7 Kester Everts 2003-11-08 04:12:50 PST
Created attachment 135051 [details]
Charset Test Page Source

- The Multi-Byte String extension (mb_string) needs to be enabled (enabled by
default in some versions).
- You can test any charset that is supported by PHP. There is a list in the
document. I am using PHP 5, but I works correctly in PHP 4.3.2 too. Not tested
in other versions.
- Set the encoding of your editor to UTF-8, or you will not be able to see the
Chinese and Japanese characters. An alternative is explained in the document.
- Info on mb_string: http://www.php.net/mb_string
Comment 8 Jungshik Shin 2003-11-08 05:31:51 PST
Thanks. I wanted to see how your code deciphers url-encoded forms to arrays.
It's done in a single stroke with 'var_dump($_POST)'.  I'm afraid I have to
either look at PHP4/5 source code or come up with a way to get my hands on 'raw'
data sent by a client (Mozilla and IE). 

Anyway, with my patch applied and UTF-16LE, Mozilla hands over 'a%00' for 'a'
(or '%61%00'). At your test page, I got back 'a\0'  and character count of '3'(I
expected 1)  For '一' (U+4E00), Mozilla hands over '%00%4E' (or '%00N'  Note
that 'N' is U+004E) and got back  '\0N'. 

What do you think Mozilla has to produce for 'a' (U+0061) and '一' (U+4E00) if
character encoding is UTF-16LE? 

The section on form submission in HTML 4 is clearly byte-oriented ('a' is always
represented by a single octet 0x61) and can hardly be a guide for
non-byte-oriented character encodings. For UTF-16/UTF-32, if we had to escape
all characters, whatever little merit of using UTF-16 over UTF-8 (space saving
for CJK) would be lost.  


Richard, what do you think about this? 
Comment 9 Kester Everts 2003-11-08 06:04:42 PST
JS, you can change line 94 (<?php var_dump($_POST); ?>) to
<?php readfile("php://input"); ?> to see the raw POST data. You can also use
both if you want :)
Comment 10 Jungshik Shin 2003-11-08 07:32:11 PST
Thanks. My php installation is early 4.x and I couldn't make it work with your
test file. Could you add that statement (for raw POST data dump) at
http://members.lycos.nl/slu/test.php so that I can check what MS IE sends.  I
suspect MS IE may send data in '%uxxxx' form when UTF-16LE/BE is selected. 

In case of Mozilla, with my patch, Mozilla escapes byte-by-byte even for
UTF-16LE/BE. For instance, '%41%00' for 'A', '%00%4E' for '一'(U+4E00), '%01%AC'
for '각'(U+AC01) (UTF-16LE). PHP, however, doesn't interpret it as UTF-16LE.

Mozilla has a pref. entry to turn on/off 'charset' parameter in content-type
header of application/x-www-url-encoded. That's off by default because a lot of
servers choke on it, but I turned it on but that brought no change. 


BTW, please ignore my comment on the merit (or lack thereof) of using UTF-16. 
Comment 11 Kester Everts 2003-11-08 07:36:07 PST
I was working on that :)
I have updated my tool with some additional features:
- Show raw post data and character grouping
- Quick change of charset

To be found at: http://members.lycos.nl/slu/test2.php
Comment 12 Jungshik Shin 2003-11-08 08:16:45 PST
Thanks a lot. 

It turns out that MS IE sends POST data in url-escaped UTF-8.  As you may know,
by default, the option 'send all URLs by UTF-8' is on with MS IE. I didn't know
that the option also controls how POST data is encoded. Somehow, PHP interprets
it correctly.... 

Mozilla with my patch (slightly modified from the one uploaded here) behaves as
I intended. It turns all octets (of UTF-16LE/BE) to '%xx' forms.  Usually, the
server-side program expects its form data (post or get) to come in the encoding
of the original form. In this case, it's UTF-16LE/BE so that Mozilla sends POST
data in url-escaped UTF-16LE/BE. It even adds 'charset=UTF-16LE' (UTF-16BE) to
'Content-Type: application/x-www-url-encoded'. Still, PHP doesn't interpret it
as UTF-16LE/BE.   It's clearly a bug on PHP side that it doesn't honor 'charset'
parameter in C-T header.  

I'm not sure exactly what to do here. I'm a bit tempted to go ahead and refine
my patch to ask for r/sr (partly because it does some clean-up of the form
submission code). 

However, it doesn't help those who use PHP.  It's interesting to see what other
server side tools do (e.g. JSP, asp, Perl:CGI module etc)

BTW, there's a patch (made a long time ago) that implements an option 'send URL
always in UTF-8' a la MS IE. I guess I have to try to move it forward so that
people who want to turn it on can do so. 
Comment 13 Jungshik Shin 2003-11-08 08:20:26 PST
Just in case it's not clear, if the option 'send all URLs by UTF-8'  is turned
on, POST and GET data are always sent by MS IE in url-escaped UTF-8
__regardless__ of the character encoding used in the form page. 
Comment 14 Jungshik Shin 2003-11-09 02:24:06 PST
Created attachment 135111 [details] [diff] [review]
another (experimental) patch

This patch does what I had in mind when I came to this bug for the first time
in addition to cleaning up some code in nsFormSubmission. However, it'd be of
little practical value (in light of  what PHP does). Still cleaning up the code
is worth pursuing, isn't it? 

BTW, I was wrong about the following:

> It even adds 'charset=UTF-16LE' (UTF-16BE) to 'Content-Type:
application/x-www-form-url-encoded'. 

That part of the code is blocked (I thought it's controlled by
form.backwards_compatible?? pref. entry). To experiment, I had to activate that
block of the code. That block was disabled because the majority of the
server-side programs choke on it  two years ago. Things have changed and we may
have to consider what to do with them. Adding another pref. entry is an option
or we can overload another form-submission related prefs.
Comment 15 Jungshik Shin 2003-11-10 09:29:34 PST
In addition to MS IE, Opera transmits POST data in url-encoded UTF-8
(fromUTF-16LE/BE pages). I'm not sure which way to take. I'm writing to w3-i18n
WG to seek some guide. smontagu, what do you think?
Comment 16 Jungshik Shin 2003-12-23 01:43:20 PST
I haven't heard anything back from W3C-I18N WG. 

Kester , do you have any opinion? If PHP5 expects UTF-8 from UTF-16 encoded
pages and MS IE and Opera do the same, we may have to follow the suit. I'll test
JSP and Perl:CGI module later.  Alternatively, we can add a pref. entry (with NO
UI) to specify the way UTF-16 form is handled. The default would be to convert
to UTF-8 before submitting. 
Comment 17 Kester Everts 2003-12-23 09:44:05 PST
Well, since PHP is a widely used form processor and major browsers (MSIE, Opera)
do it, I think it's a good idea to convert the POST and GET data of ANY
Unicode-encoded page into the UTF-8 charset.

Indeed this pref entry should not be visible in the UI. If users mess with this
setting, some form submissions will fail, and they do not know why.
And, since MSIE and Opera convert by default, I agree that the default would be
to convert to UTF-8.

You mentionned earlier in one of your comments that MSIE always sends POST and
GET data in UTF-8, regardless of the char encoding (with the option enabled).
That is not entirely true. It only does that with the Unicode charsets. Mozilla
should act the same way.

<offtopic>

It took me about an hour to write this comment, because I was thinking a lot
about how all these charset problems should be solved... I could only produce
this (well-thought) announcement:

People of the World, USE one of the Unicode charsets for ALL purposes! And a
message to developers: please make your applications support multibyte characters!

JS, this is certainly nothing personal :)

</offtopic>
Comment 18 Boris Zbarsky [:bz] (still a bit busy) 2004-07-16 12:07:25 PDT
Note that this is broken for GET too.  See bug 248336.
Comment 19 Boris Zbarsky [:bz] (still a bit busy) 2004-07-16 12:08:29 PDT
*** Bug 248336 has been marked as a duplicate of this bug. ***
Comment 20 Jungshik Shin 2004-07-16 13:26:30 PDT
what's been holding me here is that there's no spec. to follow. I posted to
W3-I18N WG list last fall, but haven't heard back anything. 

A working-draft released by WHATWG doesn't offer much new here, either. 
Comment 21 Dennis Milns 2004-07-16 15:44:16 PDT
(In reply to comment #17)
I was the one who found the bug with GET as well. You suggest that we use only 
one charset - this is not possible in my case as the application is desigh=ned 
a set of Unicode charset - the choice is the user's. I think Mozilla should 
follow IE's and support UTF-8 in the same way - that is only where necessary. 
Although in my case the entire application is in Unicode.

Dennis
Comment 22 Jungshik Shin 2004-07-16 16:26:32 PDT
(In reply to comment #21)
> (In reply to comment #17)
> You suggest that we use only 
> one charset - this is not possible in my case as the application is desigh=ned 
> a set of Unicode charset - the choice is the user's. I think Mozilla should 
> follow IE's and support UTF-8 in the same way - that is only where necessary. 

I'm afraid I'm not following you. Mozilla does support UTF-8 very well. The
problem here is HTML4 authors and any other spec. writers have given a thought
to how to deal with __non-byte-oriented__ character encodings like UTF-16 and
UTF-32 when it comes to the form submission.  As already mentioned, MS IE and
Opera send form data in UTF-8 when the encoding of a page of which a form is a
part is UTF-16. That may be regarded as sensible, but web user agent is only one
part of the equation. We also have to see what server side program does. PHP 5.x
works well with that approach, but I haven't yet managed to check out other
server-side tools. (comment #16)

Comment 23 Dennis Milns 2004-07-17 03:10:06 PDT
(In reply to comment #22)
> (In reply to comment #21)
> > (In reply to comment #17)
> > You suggest that we use only 
> > one charset - this is not possible in my case as the application is 
desigh=ned 
> > a set of Unicode charset - the choice is the user's. I think Mozilla 
should 
> > follow IE's and support UTF-8 in the same way - that is only where 
necessary. 
> I'm afraid I'm not following you. Mozilla does support UTF-8 very well. The
> problem here is HTML4 authors and any other spec. writers have given a 
thought
> to how to deal with __non-byte-oriented__ character encodings like UTF-16 and
> UTF-32 when it comes to the form submission.  As already mentioned, MS IE and
> Opera send form data in UTF-8 when the encoding of a page of which a form is 
a
> part is UTF-16. That may be regarded as sensible, but web user agent is only 
one
> part of the equation. We also have to see what server side program does. PHP 
5.x
> works well with that approach, but I haven't yet managed to check out other
> server-side tools. (comment #16)

Sorry - I did mean UTF-16. But note the normal browser handles UTF-16 fine - 
it is only the FORMS POST/GET processor which is wrong. See the application in 
the attachment to Bug 248336 which is UTF-16 encoded.

Dennis
Comment 24 Jungshik Shin 2004-07-17 05:39:29 PDT
Re comment #23
 
> it is only the FORMS POST/GET processor which is wrong. See the 
> application in  the attachment to Bug 248336 which is UTF-16 encoded.

Would you please read comment #16, comment #8, comment #12, comment #14, comment
#15, etc? Everybody knows/agrees that there's something to be done here
(otherwise, this bug would have been closed. note that I even made two
experimental patches), but exactly what to do remains uncertain partly because
the spec authors didn't expect this situation (using non-byte-oriented character
encodings in the form submission). GET is probably less problematic than POST.
However,even for GET, we have to "decide" whether to convert UTF-16 to UTF-8 (as
is done by MS IE/Opera) or just use UTF-16 and url-encode  strings. 

In the meantime, I'd just convert UTF-16 files in your applications to UTF-8. I
don't see what you gain by using UTF-16 instead of UTF-8 in your application.
Comment 25 Dennis Milns 2004-07-19 01:58:49 PDT
(In reply to comment #24)
Thanks for your suggestions. Frankly I don't know how my applications became 
UTF-16 encoded - I did not do this overtly. It was not an easy thing to 
convert the program to UTF-8 but I found a painless method. Yes, the program 
now works, and as far as I am concerned this is the end of my problems with 
Mozilla.

Again - thanks

Dennis
Comment 26 Hixie (not reading bugmail) 2004-07-19 06:23:26 PDT
jshin: I don't understand the problem you are trying to solve. Could you send an 
e-mail to whatwg@whatwg.org explaining what you think it is that the specs leave 
undefined, so that we can define it in Web Forms 2.0? Thanks!
Comment 27 Jungshik Shin 2004-07-19 07:06:28 PDT
Thanks for the reminder. That's been in my TODO list for a while (I subscribed
to the list a few weeks ago). Anyway, the following provision in section 5.3 of
WHATWG's Web Form 2.0 clears the way. Sorry for missing that.  This is
consistent with what Opera and MS IE do now so that I'll implement this (perhaps
a couple of weeks from now)

> encode all the characters in the form data set, UTF-8. Character encodings
that > are not mostly supersets of US-ASCII must not be used (this includes UTF-16 
> and EBCDIC) even if specified in accept-charset attribute.

BTW, it'd be nice to have a similar provision against usin UTF-16/UTF-32 (and
perhaps EBCDIC) in section 5.5 as well. 
Comment 28 Hixie (not reading bugmail) 2004-07-20 01:31:04 PDT
Please send any requests to the list. (I don't see why you would want to not 
allow UTF-16 in section 5.5.)
Comment 29 djpohly 2005-05-19 17:06:08 PDT
*** Bug 294774 has been marked as a duplicate of this bug. ***
Comment 30 Kester Everts 2005-05-20 06:44:50 PDT
Comment on attachment 135051 [details]
Charset Test Page Source

<?php 

/******* Mozilla Test Page *******
Original File Name: test.php
Author: Kester Everts, Nov 2003 - May 2005
Register Globals: Off
*********************************/

// default to utf-8
$enc = isset($_GET['enc']) ? $_GET['enc'] : 'utf-8';

// the internal encoding used by the PHP parser is utf-8
mb_internal_encoding('utf-8');

/* the output encoding used by PHP is set through the enc parameter
PHP recognizes the following charsets:
UCS-4, UCS-4BE, UCS-4LE, UCS-2, UCS-2BE, UCS-2LE, UTF-32, UTF-32BE,
UTF-32LE, UCS-2LE, UTF-16, UTF-16BE, UTF-16LE, UTF-8, UTF-7, ASCII,
EUC-JP, SJIS, eucJP-win, SJIS-win, ISO-2022-JP, JIS, ISO-8859-1,
ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6,
ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-10, ISO-8859-13,
ISO-8859-14, ISO-8859-15, byte2be, byte2le, byte4be, byte4le,
BASE64, 7bit, 8bit and UTF7-IMAP
*/
mb_http_output($enc);

// if PHP does not recognize the charset, use utf-8
$enc = (mb_http_output() == 'pass') ? 'UTF-8' : mb_http_output();

// buffer the output until the page is entirely parsed, then
// let it go through the mb_output_handler() to convert it to
// any charset (the value of parameter enc)
// btw, make sure that there is not another output_handler
// defined in your php.ini that makes this malfunctional
ob_start('mb_output_handler');

// lets begin...
echo '<?xml version="1.0" encoding="'.htmlentities($enc).'">'."\r\n";
?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
	"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Testing form data corruption in the <?php echo htmlentities($enc); ?>
charset</title>
<style type="text/css">
body {
	font-family: Verdana, Arial, Helvetica, sans-serif;
	font-size: small;
	color: #0000CC;
	background-color: #FFFFFF;
}
</style>
<script type="text/javascript">
<!--
function changeCharset() {
	// get the drop down handle
	var el = document.getElementById("charset");

	// get the number of the selected option
	var n = el.selectedIndex;

	// get the value of the selected option
	var val = el.options[n].value;

	// use that value to set the charset
	location.href = "charsettest.php?enc=" + val;
}

var group = true;

function groupChars() {
	var el = document.getElementById("rawdata");
	var btn = document.getElementById("groupbtn");
	if(group) {
		var str = el.childNodes.item(0).nodeValue;
		str = str.replace(/\r\n|\r|\n|<br \/>|<br\/>|<br>/mgi, "")
		.replace(/(%.{2})/mg, "[$1] ")
		.replace(/(=)/mg, " $1 ")
		.replace(/(&amp;|&)/mg, "\r\n$1")
		.replace(/(.{75})/mg,"$1\r\n");
		el.childNodes.item(0).nodeValue = str;
		btn.value = "Degroup chars";
		group = false;
	}
	else {
		el.childNodes.item(0).nodeValue = original;
		btn.value = "Group chars"
		group = true;
	}
}
//-->
</script>
</head>

<body>
<h1>Using <?php echo htmlentities($enc); ?>...</h1>
<p>Fill in some data in the form below. You may fill in and test some Japanese
and Chinese characters to see what happens.<br />
Chinese: 木村被問候妻女變臉發惡.<br />
Japanese: ウェブ全体から検索.
</p>
<!--
If the Japanese and Chinese chars do not render correctly, use this instead:
Chinese:
&#26408;&#26449;&#34987;&#21839;&#20505;&#22971;&#22899;&#35722;&#33225;&#30332
;&#24801;
Japanese:
&#12454;&#12455;&#12502;&#20840;&#20307;&#12363;&#12425;&#26908;&#32034;
-->
<p><fieldset><legend>Unicode test form</legend><form
action="charsettest.php?enc=<?php echo htmlentities($enc); ?>" method="post">
	<label for="textfield">Name</label>
	<input type="text" name="name" accesskey="n" tabindex="1" id="name" />
    <br />
    <label for="email">E-mail</label>
    <input type="text" name="email" accesskey="e" tabindex="2" id="email" />
    <br />
    <input type="submit" tabindex="3" accesskey="s" /> 
    You are submitting this form in <span style="font-weight:bold;"><?php echo
htmlentities($enc); ?></span>.
</form></fieldset></p>
<p>Below will show POST data from <acronym title="Hypertext
PreProcessor">PHP</acronym>'s <var>$_POST</var> variable (only if you have
posted something).<hr />
<pre><?php var_dump($_POST); ?></pre>
This is what PHP thinks is submitted. PHP's charset detection can only detect
UTF-8, UTF-7, ASCII, EUC-JP,SJIS, eucJP-win, SJIS-win, JIS and ISO-2022-JP.
<hr />
Raw post data:<br />
<pre style="letter-spacing:1px; width:90%;" id="rawdata"><?php $raw =
file_get_contents("php://input");
echo htmlspecialchars(wordwrap($raw, 76, "\r\n", true), ENT_NOQUOTES, 'utf-8');
?></pre>
This should be equal to the output of an HTTP sniffer.<br /><input
id="groupbtn" type="button" value="Group chars" onclick="groupChars()"
style="display: inline" />
<script type="text/javascript">
try {
	original =
document.getElementById("rawdata").childNodes.item(0).nodeValue;
}
catch(e) {
	document.getElementById("groupbtn").style.display='none';
}
</script>
<hr />
</p>
<p>
You can switch the charset by modifying the <var>enc</var> parameter in the
<acronym title="Uniform Resource Locator">URL</acronym> or by use
of this:<br />
<label for="select">Charset</label>
<select id="charset" accesskey="c" tabindex="4" onchange="changeCharset()">
	<option value="utf-8">Choose charset...</option>
	<option value="utf-8">Unicode (UTF-8)</option>
	<option value="utf-16le">Unicode (UTF-16 Little Endian)</option>
	<option value="utf-16be">Unicode (UTF-16 Big Endian)</option>
	<option value="utf-32le">Unicode (UTF-32 Little Endian)</option>
	<option value="utf-32be">Unicode (UTF-32 Big Endian)</option>
	<option value="utf-7">Unicode (UTF-7)</option>
	<option value="iso-8859-1">Latin 1 (ISO-8859-1)</option>
	<option value="iso-8859-15">Latin 9 (ISO-8859-15)</option>
</select> 
</p>
</body>
</html>
Comment 31 Kester Everts 2005-05-20 06:46:15 PDT
Comment on attachment 135051 [details]
Charset Test Page Source

New test page: http://test.silcana.net/charsettest.php
Comment 32 Jungshik Shin 2005-05-21 11:31:27 PDT
Created attachment 184192 [details] [diff] [review]
patch update

Now I use UTF-8 for UTF-16/32.
Comment 33 Jungshik Shin 2005-05-21 23:47:05 PDT
Created attachment 184222 [details] [diff] [review]
patch update

I tested the patch with a slightly modified test case (I added a textfield with
'non-ASCII' name) and it worked fine.
Comment 34 Jungshik Shin 2005-05-22 08:58:48 PDT
Comment on attachment 184222 [details] [diff] [review]
patch update

bz, can you review? If you like it, would you put sr as well? I'm afraid
there's no active owner of this code.
Comment 35 Boris Zbarsky [:bz] (still a bit busy) 2005-05-22 09:24:26 PDT
Comment on attachment 184222 [details] [diff] [review]
patch update

I can review this, but it might take a few days.
Comment 36 Boris Zbarsky [:bz] (still a bit busy) 2005-05-31 15:02:25 PDT
Comment on attachment 184222 [details] [diff] [review]
patch update

>Index: content/html/content/src/nsFormSubmission.cpp

> @@ -342,46 +343,44 @@ nsFSURLEncoded::AddNameValuePair(nsIDOMH

>+  if (!processedValue.IsEmpty())
>+    rv = URLEncode(processedValue, convValue);
>+  else 
>     rv = URLEncode(aValue, convValue);

Style for the file (with good reason!) is that all if/else clauses have braces,
even one-line ones.

Also, why do we have this check?  To handle cases when ProcessValue failed or
something?  Shouldn't we check rv, then?  How can a form processor actually
send an empty value if it wants to, with this code?

>@@ -719,65 +715,62 @@ nsFSMultipartFormData::ProcessAndEncode(

>+  if (!processedValue.IsEmpty()) {

Again, what if the form processor wanst to send an empty string?

> nsFSTextPlain::AddNameValuePair(nsIDOMHTMLElement* aSource,

>+  if (!processedValue.IsEmpty()) {

And here.

No need for the explicit (void) casts when you're ignoring an rv, fwiw.....

>@@ -1310,14 +1307,21 @@ nsFormSubmission::GetEncoder(nsGenericHT

>+  if (!charset.Compare("UTF-16", PR_FALSE, 6) ||
>+      !charset.Compare("UTF-32", PR_FALSE, 6))

Was there a reason not to use StringBeginsWith() here?

>+nsFormSubmission::UnicodeToNewBytes(const nsAString& aStr, 
>+  else
>+    newBuffer = aStr;

Curly braces, please.

>+nsFormSubmission::EncodeVal(const nsAString& aStr, nsACString& 

>+  LossyCopyUTF16toASCII(aStr, aOut);

Why not at least fall back to using UTF-8 here?  Just to not change the old
behavior, or some other reason?

> nsFormSubmission::ProcessValue(nsIDOMHTMLElement* aSource,
>-#ifdef DEBUG
>-    nsresult rv =
>-#endif

Please leave this ifdef -- it's there to prevent a build warning in opt builds.
Comment 37 Jungshik Shin 2005-06-01 06:58:19 PDT
Created attachment 185009 [details] [diff] [review]
patch update (with bz's concerns addressed)

asking Simon for r and Boris for sr.

> >+  if (!processedValue.IsEmpty())
> >+	rv = URLEncode(processedValue, convValue);
> >+  else
> >	rv = URLEncode(aValue, convValue);
>
> Style for the file (with good reason!) is that all if/else clauses have
braces,
> even one-line ones.
 
 Fixed.
 
> Also, why do we have this check?  To handle cases when ProcessValue failed or

> something?  Shouldn't we check rv, then?  How can a form processor actually
> send an empty value if it wants to, with this code?
 
  It's possible to send an empty value because |aValue| would be empty in that
case.
Anyway, I changed to check rv.
 
 
> No need for the explicit (void) casts when you're ignoring an rv, fwiw.....
 
  fixed.
 
> >+  if (!charset.Compare("UTF-16", PR_FALSE, 6) ||
>
> Was there a reason not to use StringBeginsWith() here?
 
  Fixed.
 
> >+  LossyCopyUTF16toASCII(aStr, aOut);
>
> Why not at least fall back to using UTF-8 here?  Just to not change the old
> behavior, or some other reason?
 
  Not to change the old behavior, but falling back to UTF-8 seems better.
 
 
> > nsFormSubmission::ProcessValue(nsIDOMHTMLElement* aSource,
> >-#ifdef DEBUG
> >-	nsresult rv =
> >-#endif

>
> Please leave this ifdef -- it's there to prevent a build warning in opt
builds.
 
  It's not necessary because I do use 'rv' in opt builds.

I tested both get and post.
Comment 38 Boris Zbarsky [:bz] (still a bit busy) 2005-06-01 22:16:14 PDT
Comment on attachment 185009 [details] [diff] [review]
patch update (with bz's concerns addressed)

Looks great.  Thanks for fixing this!
Comment 39 Jungshik Shin 2005-06-22 20:24:09 PDT
ping Simon :-) can you review? I'll update the comment about RFC 2231 (in light
of recent RFCs about the form submission) in my tree.
Comment 40 Jungshik Shin 2005-06-26 03:39:35 PDT
Comment on attachment 185009 [details] [diff] [review]
patch update (with bz's concerns addressed)

asking cbie for r
Comment 41 Christian :Biesinger (don't email me, ping me on IRC) 2005-06-30 05:17:52 PDT
Comment on attachment 185009 [details] [diff] [review]
patch update (with bz's concerns addressed)

content/html/content/src/nsFormSubmission.cpp
 nsFSURLEncoded::URLEncode(const nsAString& aStr, nsCString& aEncoded)
[...]
+  char* escapedBuf = nsEscape(encodedBuf.get(), url_XPAlphas);
   aEncoded.Adopt(escapedBuf);

Shouldn't this return an error if nsEscape returns null?

@@ -1307,16 +1308,25 @@ nsFormSubmission::GetEncoder(nsGenericHT
+  if(charset.EqualsLiteral("ISO-8859-1")) {

if you're touching this line, please also add a space before (

Also: Why does a case-sensitive comparison suffice? Does the caller convert the
charset name to upper case?

Ah.. the comment below says so... but I'd move the first line of it before the
ISO-8859-1/windows-1252 part.

What about UCS-2 as charset, and UCS-4 as charset name?

+	 temp.SetCharAt(newBuffer.CharAt(len), z++);

this won't work. SetCharAt fails if the new index is > the length if the
string.

-  if(aStr && aStr[0]) {
+  if (!newBuffer.IsEmpty()) {

hmm, you're changing what is checked, is that ok?

+    if (formControl && formControl->GetType() == NS_FORM_INPUT_HIDDEN)
+	 CopyASCIItoUTF16(mCharset, aResult);
+	 return NS_OK;

?? Either indentation is wrong, or you have missing braces. (looks like the
latter)
Comment 42 Jungshik Shin 2005-07-01 09:00:34 PDT
(In reply to comment #41)

> +  char* escapedBuf = nsEscape(encodedBuf.get(), url_XPAlphas);
>    aEncoded.Adopt(escapedBuf);

> Shouldn't this return an error if nsEscape returns null?

Perhaps, yes. There's another place in the file that needs that check (not in
the current patch). I'll change that, too.


> Also: Why does a case-sensitive comparison suffice? Does the caller convert the
> charset name to upper case?

It's not just case-insenstive. By the time the program flows reaches there, all
character encoding names are already resolved to *our* internal canonical names
as listed in charsetalias.properties file.
 

> What about UCS-2 as charset, and UCS-4 as charset name?

See above. 


> +	 temp.SetCharAt(newBuffer.CharAt(len), z++);
> 
> this won't work. SetCharAt fails if the new index is > the length if the
> string.

 fixed in an upcoming patch.
> 
> -  if(aStr && aStr[0]) {
> +  if (!newBuffer.IsEmpty()) {
> 
> hmm, you're changing what is checked, is that ok?

Please, read the patch more carefully. I'm not changing what's checked. In the
current code, aStr is assigned a new value in three or four places. I removed
all of them and checking newBuffer, instead.

> +    if (formControl && formControl->GetType() == NS_FORM_INPUT_HIDDEN)
> +	 CopyASCIItoUTF16(mCharset, aResult);
> +	 return NS_OK;
> 
> ?? Either indentation is wrong, or you have missing braces. (looks like the
> latter)

Thanks for catching it. Fixed. 

Comment 43 Jungshik Shin 2005-07-01 09:12:18 PDT
Created attachment 187947 [details] [diff] [review]
patch update addressing cbie's comments

I updated it per cbie's comments (as I noted in my previous comment)
carrying over bz's sr here.
Comment 44 Jungshik Shin 2005-07-01 10:43:07 PDT
(In reply to comment #43)
> Created an attachment (id=187947) [edit]
> patch update addressing cbie's comments
> 
> I updated it per cbie's comments (as I noted in my previous comment)

I fixed other instances of SetCharAt without setting the length in advance in my
tree. 

Comment 45 Jungshik Shin 2005-07-06 18:45:51 PDT
Comment on attachment 187947 [details] [diff] [review]
patch update addressing cbie's comments

Forgot to specify cbie as a reviewer (see comment #44 as well)
Comment 46 Christian :Biesinger (don't email me, ping me on IRC) 2005-07-07 04:44:02 PDT
> See above. 

hm? what does UCS-2 get normalized to? It's different from UTF-16...
Comment 47 Christian :Biesinger (don't email me, ping me on IRC) 2005-07-07 04:48:12 PDT
> hm? what does UCS-2 get normalized to? It's different from UTF-16...

oh, or do we treat those as equivalent?

> Please, read the patch more carefully. 

sorry about that, I missed the assignment to aStr.

Comment 48 Jungshik Shin 2005-07-09 05:32:32 PDT
Comment on attachment 187947 [details] [diff] [review]
patch update addressing cbie's comments

asking for a to trunk

This is to support UTF-16/32 in the form submission. While fixing that, I also
sanitized the code. I tested this patch quite a bit and it works rather well.
Although it affects virtually all users (it affects the form submission code),
I'd say that the risk is rather low considering my test result. Even if there's
a regression, we'd better take the risk in 1.1a2 and fix it as soon as possible
(because we want this *feature* in 1.1). If there's a regression, it'll be
discovered and reported quickly.
Comment 49 Jungshik Shin 2005-08-07 08:57:52 PDT
sorry I haven't marked this as resolved/fixed. The patch was landed a while ago.

Note You need to log in before you can comment on or make changes to this bug.