Closed Bug 212380 Opened 17 years ago Closed 14 years ago
Bad encoding - Mozilla sends characters from forms only in Unicode!!!
User-Agent: Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.4) Gecko/20030701
Build Identifier: Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.4) Gecko/20030701
Very bad bug. Users can't search the web!!! Example of the problem: a web server sends a page in a Cyrillic (Windows-1251) encoding. The page contains forms. When the user enters text into a form and presses the "Submit" button, Mozilla encodes the form's text in Unicode (not in Cyrillic!!!) and sends the form data to the server, so the server receives UTF-encoded form data. That's not a good idea for most servers :) because they need Cyrillic-encoded data to work properly. You may see an example at http://ya.ru, a Russian search engine. Enter some text in cyrillic at the search field and see a result :)
Reproducible: Always
My proposal to solve the problem: send form data in the current page encoding, not in Unicode, or introduce a flag in Options that sets this behaviour.
not a blocker.
Severity: blocker → major
Note that this URL doesn't contain a charset tag. The auto-detector has determined that it's probably Windows-1251. According to the HTML4 standard, section 17.13.1:
# Note. The "get" method restricts form data set values to ASCII
# characters. Only the "post" method (with
# enctype="multipart/form-data") is specified to cover the entire
# [ISO10646] character set.
So a 'get' form has to use ASCII (Iso-Latin-1), a 'post' has to use Unicode ! But Internet Exploder and Mozilla allow a form to specify what charset has to be used, see bug 18643:
<FORM ACTION="..." METHOD="..." ACCEPT-CHARSET="...">
The charset is also passed to the form in a "_charset_" field.
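For illustration only, here is a minimal Python sketch of what encoding a 'get' form's data in a chosen charset looks like on the wire. The helper `encode_form` and the field names are hypothetical, and this is a rough model of the ACCEPT-CHARSET / "_charset_" behaviour described above, not Mozilla's actual code.

```python
from urllib.parse import urlencode


def encode_form(fields, charset):
    """Percent-encode form fields as bytes in `charset` (hypothetical helper)."""
    data = dict(fields)
    # Assumed model of the "_charset_" convention: if the form carries such
    # a field, fill it with the name of the charset used for submission.
    if "_charset_" in data:
        data["_charset_"] = charset
    return urlencode(data, encoding=charset)


form = {"text": "тест", "_charset_": ""}
print(encode_form(form, "windows-1251"))  # text=%F2%E5%F1%F2&_charset_=windows-1251
print(encode_form(form, "koi8-r"))        # text=%D4%C5%D3%D4&_charset_=koi8-r
print(encode_form(form, "utf-8"))         # text=%D1%82%D0%B5%D1%81%D1%82&_charset_=utf-8
```

The same four letters produce three different query strings, which is why the server has to know (or guess) which charset the browser used.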
Jo Hermans, Mozilla encodes data in Unicode in the 'get' method too. See the example on http://ya.ru: I put the word 'test' in Russian, 'тест', and as a result obtained the variable 'text' in the query URL:
http://www.yandex.ru/yandsearch?rpt=rad&text=%26%23212%3B%26%23197%3B%26%23211%3B%26%23212%3B
It's unicoded word 'тест' :) Althou method is 'get'
It's Mozilla's bug
> Enter some text in cyrillic at the search field and see a result :)
I just did the same test as comment 3. I entered the word "test" in Russian. Then I clicked Submit. The following URL was submitted:
http://www.yandex.ru/yandsearch?rpt=rad&text=%D2%E5%F1%F2
Which is most certainly not UTF-8 encoding of the text I typed.... As a matter of fact, it's the page encoding. This is with a current Linux trunk build. So:
1) Is this a Solaris-only problem?
2) Is this something that got fixed since 1.4?
3) Is this something that's different between my setup and your setup?
I saw the bug on Mozilla 1.4 and Netscape 7.0, but Netscape 4.78 works fine. The bug does not depend on my local settings. I saw this problem only on Solaris.
So this is not a problem on Linux?
>http://www.yandex.ru/yandsearch?rpt=rad&text=%26%23212%3B%26%23197%3B%26%23211%3B%26%23212%3B
>It's unicoded word 'тест' :) Althou method is 'get'
No, it's not; it's url-encoded NCRs in windows-1251 encoding. Boris's http://www.yandex.ru/yandsearch?rpt=rad&text=%D2%E5%F1%F2 is the same thing in url-encoded KOI8-R. So neither one is Unicode, but they can't both be the page encoding.
The possible cause is that the Solaris X server doesn't support "UTF-8" for the clipboard selection (actually it does support it, but with 'Compound Text Encoding') and the reporter typed it by copy'n'paste. At the moment, the exact X11 terms to use are escaping me (see bug 9449 and bug 150131). Alternatively, it's simply because the reporter sets 'View|Character Coding' to 'KOI8-R' instead of 'Windows-1251' while entering 'тест'. Can you try it again with 'View | Character Coding' set to 'Windows-1251'? Solaris 9 (8 as well?) supports ru_RU.windows-1251 (or something like it; try 'locale -a | grep 1251') in addition to ru_RU.KOI8-R. Can you launch Mozilla under ru_RU.windows-1251 and see what difference it makes? BTW, I would not use either of that. Instead, I would use ru_RU.UTF-8 locale.
> http://www.yandex.ru/yandsearch?rpt=rad&text=%26%23212%3B%26%23197%3B%26%23211%3B%26%23212%3B
> It's unicoded word 'тест'
Well, it's not. It's url-escaped NCRs for 'тест' in KOI8-R. With url-escaping removed, we have &#212;&#197;&#211;&#212;. Note that NCRs have to use Unicode code points, but the above URL (before url-escaping) uses KOI8-R code points (0xD4 0xC5 0xD3 0xD4). What Boris got must have been '%F2%E5%F1%F2' instead of '%D2%E5%F1%F2'. 'тест' represented in Windows-1251 is 0xF2 0xE5 0xF1 0xF2.
> get' form has to use ASCII (Iso-Latin-1),
US-ASCII (ISO 646) and ISO Latin-1 (ISO-8859-1) are DIFFERENT. ISO 646 also exists as national standards (with one or more code points replaced by national standards bodies if necessary) in virtually all countries, but ISO-8859-1 does not.
> a 'post' has to use Unicode !
No, you don't have to. You can use any valid MIME charset by specifying the 'charset' parameter in the Content-Type header of any text/* subpart of 'multipart/form-data' (see RFC 2388).
P.S. You have to view this bug in KOI8-R. I thought the reporter used Windows-1251 in comment #3, but it turned out that it was KOI8-R.
Bugzilla should enforce UTF-8 in comments (there's a workable migration plan proposed by Markus Kuhn) to avoid this kind of problem. I'm using KOI8-R in my comment as well, to avoid putting this bug into multiple encodings.
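The decoding argument above can be checked with a short Python sketch. The function name `decode_ncr_bytes` is made up for this illustration; it simply undoes the url-escaping and then treats each NCR number as a raw byte in a candidate charset, which is an assumption about what the browser did, not a confirmed description of the Mozilla code path.

```python
import html
import re
from urllib.parse import unquote

# Value of the 'text' parameter from the reporter's URL.
query_value = "%26%23212%3B%26%23197%3B%26%23211%3B%26%23212%3B"


def decode_ncr_bytes(escaped, charset):
    """Undo url-escaping, then read each NCR number as one byte in `charset`."""
    ncrs = unquote(escaped)
    raw = bytes(int(n) for n in re.findall(r"&#(\d+);", ncrs))
    return ncrs, raw.decode(charset)


ncrs, as_koi8r = decode_ncr_bytes(query_value, "koi8-r")
as_unicode = html.unescape(ncrs)  # what the NCRs mean if read as real Unicode code points

print(ncrs)        # &#212;&#197;&#211;&#212;
print(as_koi8r)    # тест
print(as_unicode)  # ÔÅÓÔ
```

Read as proper NCRs the value is four Latin-1 letters; only when the numbers are taken as KOI8-R byte values does the Russian word come back.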
Ooops. I'm sorry I forgot to set View|Character coding to KOI8-R before posting my comment. To view my comment #9 (Russian word 'тест'), 'View | Character Coding' has to be set to EUC-KR. Due to my mistake, this bug is now in mixed encodings.
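To see why a comment posted under the wrong 'Character Coding' becomes unreadable, here is a small Python sketch (an illustration only, using Python's built-in codecs rather than anything from Bugzilla or Mozilla). It encodes the word once as EUC-KR and then reads the same bytes back under several charsets.

```python
# Bytes as they would be stored if the word were submitted as EUC-KR.
raw = "тест".encode("euc_kr")
print(raw.hex(" "))  # ac e4 ac d6 ac e3 ac e4

# The same bytes, viewed with different 'Character Coding' settings.
views = {
    charset: raw.decode(charset, errors="replace")
    for charset in ("euc_kr", "koi8-r", "windows-1251", "latin-1")
}
for charset, text in views.items():
    print(f"{charset:>12}: {text}")
```

Only the charset that matches the one used at submission time gives the word back; every other setting shows unrelated characters, which is what "mixed encodings" in one bug page amounts to.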
>What Boris got must have been '%F2%E5%F1%F2' instead of '%D2%E5%F1%F2'.
No, %D2 is just the upper case form of %F2 in windows-1251
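A quick Python check of the upper/lower case point, as a sketch using only the standard library (the variable names are illustrative):

```python
from urllib.parse import quote, unquote_to_bytes

observed = unquote_to_bytes("%D2%E5%F1%F2")  # bytes from the Linux test URL

decoded = observed.decode("windows-1251")
print(decoded)  # Тест

lower = quote("тест", encoding="windows-1251")
upper = quote("Тест", encoding="windows-1251")
print(lower)  # %F2%E5%F1%F2
print(upper)  # %D2%E5%F1%F2

# The same bytes are not a valid UTF-8 sequence at all.
try:
    observed.decode("utf-8")
    is_utf8 = True
except UnicodeDecodeError:
    is_utf8 = False
print(is_utf8)  # False
```

In Windows-1251 the Cyrillic upper and lower case letters sit 0x20 apart, so 0xD2 and 0xF2 are the same letter in different case.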
>> the word "test" in Russian
>No, %D2 is just the upper case form of %F2 in windows-1251
So, what you typed was not 'test' but 'Test' in Russian :-) Or is Russian like German in uppercasing the first letter of nouns? BTW, see bug 135762. It seems that it's a bit more relevant to this bug than I thought at first.
> So, what you typed was not 'test' but 'Test' in Russian
Yep. I didn't even notice myself doing it.... In any case, reporter, have you set any IDN preferences that would cause your urls to be encoded as NCRs?
> Can you try it again with 'View | Character Coding' set to 'Windows-1251'?
Ok. I tried both Win-1251 and KOI8-R and got the above result :( But I noticed something: when I set the encoding to ISO-8859-1, the server returns a Win-1251-encoded page :)
locale -a | grep 1251 returns ru_RU.ANSI1251
> In any case, reporter, have you set any IDN preferences that would cause your
> urls to be encoded as NCRs?
Hm... In my preferences "Languages/Content" is set to English and "Default Character Coding" is Western (ISO-8859-1). But IMHO these settings do not affect the encoding of forms... I think it's a Solaris problem. As I mentioned above, the same problem is observed on Netscape 7.0.
Victor
PS: It's a pity that it's impossible to set Mozilla's preferences through the .Xdefaults file :(
> Can you launch Mozilla under ru_RU.windows-1251 and see what difference
> it makes? BTW, I would not use either of that. Instead, I would use
> ru_RU.UTF-8 locale.
You didn't try this, did you? Can you try all three cases below?
% env LC_ALL=ru_RU.ANSI1251 mozilla
% env LC_ALL=ru_RU.UTF-8 mozilla
% env LC_ALL=ru_RU.KOI8-R mozilla
BTW, how are you entering Cyrillic letters? Can you try 'locale' in your default setting and let us know the output?
> % env LC_ALL=ru_RU.ANSI1251 mozilla
Gdk-WARNING **: Missing charsets in FontSet creation
Gdk-WARNING **: ansi-1251
mozilla was not started :( -i flag did not help.
> % env LC_ALL=ru_RU.UTF-8 mozilla
> % env LC_ALL=ru_RU.KOI8-R mozilla
mozilla runs but the problem arose again :(
>BTW, how are you entering Cyrillic letters?
I switch the keyboard by pressing 'Num Lock' and type Cyrillic letters.
>Can you try 'locale' in your default
> setting and let us know the output?
locale:
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
locale -a:
POSIX common en_US.UTF-8 C iso_8859_1 bg_BG bg_BG.ISO8859-5 et et_EE et_EE.ISO8859-15 hr_HR hr_HR.ISO8859-2 lt lt_LT lt_LT.ISO8859-13 lv lv_LV lv_LV.ISO8859-13 mk_MK mk_MK.ISO8859-5 nr ro_RO ro_RO.ISO8859-2 ru ru.UTF-8 ru.koi8-r ru_RU ru_RU.ANSI1251 ru_RU.ISO8859-5 ru_RU.KOI8-R ru_RU.UTF-8 sh_BA sh_BA.ISO8859-2@bosnia sl_SI sl_SI.ISO8859-2 sq_AL sq_AL.ISO8859-2 sr_SP sr_YU sr_YU.ISO8859-5 tr tr_TR tr_TR.ISO8859-9 iso_8859_13 iso_8859_15 iso_8859_2 iso_8859_5 iso_8859_9 koi8-r
On google.com everything works fine!!!
This is an automated message, with ID "auto-resolve01". This bug has had no comments for a long time. Statistically, we have found that bug reports that have not been confirmed by a second user after three months are highly unlikely to be the source of a fix to the code. While your input is very important to us, our resources are limited and so we are asking for your help in focussing our efforts. If you can still reproduce this problem in the latest version of the product (see below for how to obtain a copy) or, for feature requests, if it's not present in the latest version and you still believe we should implement it, please visit the URL of this bug (given at the top of this mail) and add a comment to that effect, giving more reproduction information if you have it. If it is not a problem any longer, you need take no action. If this bug is not changed in any way in the next two weeks, it will be automatically resolved. Thank you for your help in this matter. The latest beta releases can be obtained from: Firefox: http://www.mozilla.org/projects/firefox/ Thunderbird: http://www.mozilla.org/products/thunderbird/releases/1.5beta1.html Seamonkey: http://www.mozilla.org/projects/seamonkey/
This bug has been automatically resolved after a period of inactivity (see above comment). If anyone thinks this is incorrect, they should feel free to reopen it.
Status: UNCONFIRMED → RESOLVED
Closed: 14 years ago
Resolution: --- → EXPIRED
Component: HTML: Form Submission → DOM: Core & HTML