212380 - Bad encoding - Mozilla sends characters from forms only in Unicode!!!

Reporter

Description

•

22 years ago

User-Agent:       Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.4) Gecko/20030701
Build Identifier: Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.4) Gecko/20030701

Very bad bug. Users can't search the web!!!

Example of the problem:

A web server sends a page in a Cyrillic (Windows-1251) encoding. The page contains
forms.
When the user enters text into a form and presses the "Submit" button, Mozilla encodes
the form's text in Unicode (not in Cyrillic!!!) and sends the form data to the server.
So the server receives UTF-encoded form data. That's not a good idea for most
servers :) because they need Cyrillic-encoded data to work properly.

You can see an example at http://ya.ru, a Russian search engine.
Enter some text in Cyrillic in the search field and see the result :)
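
To make the difference concrete, here is a minimal Python sketch (not Mozilla code) of how the same search term looks in a query string under each encoding discussed in this report. The helper name `encode_form_value` is made up for illustration.

```python
from urllib.parse import quote

def encode_form_value(text: str, charset: str) -> str:
    """Convert text to bytes in the given charset, then percent-encode them."""
    return quote(text.encode(charset), safe="")

word = "\u0442\u0435\u0441\u0442"  # "test" in Russian, all lowercase
for charset in ("windows-1251", "koi8-r", "utf-8"):
    print(charset, encode_form_value(word, charset))
```

A server that expects Windows-1251 can only interpret the first of the three results correctly; the other two decode to different text or to garbage.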



Reproducible: Always





My proposal to solve the problem:

Send the form data in the current page encoding, not in Unicode,
or introduce a flag in Options that sets this behaviour.

Olivier Cahagne

Comment 1

•

22 years ago

not a blocker.

URL: http://ya.ru/

Severity: blocker → major

Jo Hermans

Comment 2

•

22 years ago

Note that this url doesn't contain a charset-tag. The auto-detector has
determined that it's probably Windows-1251.

According to the HTML4 standard, section 17.13.1:
#   Note. The "get" method restricts form data set values to ASCII
#   characters. Only the "post" method (with
#   enctype="multipart/form-data") is specified to cover the entire
#   [ISO10646] character set.

So a 'get' form has to use ASCII (Iso-Latin-1), a 'post' has to use Unicode!

But Internet Exploder and Mozilla allow a form to specify which charset has to be
used, see bug 18643:

<FORM ACTION="..." METHOD="..." ACCEPT-CHARSET="...">

The charset is also passed to the form in a "_charset_" field.
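
As a rough illustration of how a server could make use of such a "_charset_" field, here is a hedged Python sketch. The function name `decode_form` and the fallback charset are assumptions made for this example; nothing here is taken from Mozilla or from any particular server.

```python
import codecs
from urllib.parse import parse_qs

def decode_form(query: str, fallback: str = "windows-1251") -> dict:
    """Decode a query string, honouring a "_charset_" field if one is present."""
    # First pass: latin-1 maps every byte to one character, so it never fails
    # and lets us read the ASCII-only "_charset_" value safely.
    raw = parse_qs(query, encoding="latin-1")
    charset = raw.get("_charset_", [fallback])[0]
    try:
        codecs.lookup(charset)      # reject charset names Python does not know
    except LookupError:
        charset = fallback
    # Second pass: decode the percent-escaped bytes with the chosen charset.
    return parse_qs(query, encoding=charset, errors="replace")

fields = decode_form("text=%F2%E5%F1%F2&_charset_=windows-1251")
print(fields["text"][0])
```

Without the field, the server has to guess, which is exactly where a mismatch like the one in this report shows up.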

qwerty

Reporter

Comment 3

•

22 years ago

Jo Hermans,

Mozilla encodes data in Unicode with the 'get' method too.

See the example at http://ya.ru

I entered the word 'test' in Russian ('тест') and got the following 'text' variable in
the query URL:

http://www.yandex.ru/yandsearch?rpt=rad&text=%26%23212%3B%26%23197%3B%26%23211%3B%26%23212%3B

It's the unicoded word 'тест' :) although the method is 'get'.

It's Mozilla's bug

Boris Zbarsky [:bzbarsky]

Comment 4

•

22 years ago

> Enter some text in cyrillic at the search field and see a result :)

I just did the same test as comment 3.  I entered the word "test" in Russian. 
Then I clicked Submit.  The following URL was submitted:

http://www.yandex.ru/yandsearch?rpt=rad&text=%D2%E5%F1%F2

Which is most certainly not UTF-8 encoding of the text I typed....  As a matter
of fact, it's the page encoding.  This is with a current Linux trunk build.

So:
1) Is this a Solaris-only problem?
2) Is this something that got fixed since 1.4?
3) Is this something that's different between my setup and your setup?
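
The claim that these bytes are not UTF-8 can be checked mechanically. The sketch below is an illustration only (the helper name `try_decodings` is made up); it tries the percent-escaped bytes against several candidate charsets.

```python
from urllib.parse import unquote_to_bytes

def try_decodings(escaped: str, charsets=("utf-8", "windows-1251", "koi8-r")) -> dict:
    """Return, per charset, the decoded text or None if the bytes are invalid."""
    raw = unquote_to_bytes(escaped)
    results = {}
    for charset in charsets:
        try:
            results[charset] = raw.decode(charset)
        except UnicodeDecodeError:
            results[charset] = None
    return results

print(try_decodings("%D2%E5%F1%F2"))
```

The UTF-8 entry comes back as None because 0xD2 must be followed by a continuation byte and 0xE5 is not one. Single-byte charsets such as Windows-1251 and KOI8-R accept the bytes, so a successful decode alone does not prove which of them was intended.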

qwerty

Reporter

Comment 5

•

22 years ago

I saw the bug in Mozilla 1.4 and Netscape 7.0, but Netscape 4.78 works fine.
The bug does not depend on my local settings.
I have seen this problem only on Solaris.

Boris Zbarsky [:bzbarsky]

Comment 6

•

22 years ago

So this is not a problem on Linux?

Simon Montagu :smontagu

Comment 7

•

22 years ago

>http://www.yandex.ru/yandsearch?rpt=rad&text=%26%23212%3B%26%23197%3B%26%23211%3B%26%23212%3B
>It's the unicoded word 'тест' :) although the method is 'get'.

No, it's not; it's url-encoded NCRs in windows-1251 encoding

Boris's http://www.yandex.ru/yandsearch?rpt=rad&text=%D2%E5%F1%F2 is the same
thing in url-encoded KOI8-R.

So neither one is unicode, but they can't both be the page encoding

Simon Montagu :smontagu

Comment 8

•

22 years ago

Sorry, wrong way round. The URI in comment 3 is KOI8-R and the one in comment 4
is Windows-1251

Jungshik Shin

Comment 9

•

22 years ago

One possible cause is that the Solaris X server doesn't support
"UTF-8" for the clipboard selection (actually it does support
it, but with 'Compound Text Encoding') and the reporter entered
the text by copy'n'paste. At the moment, the exact X11 terms to use
are escaping me (see bug 9449 and bug 150131).

Alternatively, it's simply because the reporter set 'View | Character
Coding' to 'KOI8-R' instead of 'Windows-1251' while entering
'тест'. Can you try it again with 'View | Character Coding' set to
'Windows-1251'?

Solaris 9 (8 as well?) supports ru_RU.windows-1251 (or something like it;
try 'locale -a | grep 1251') in addition to ru_RU.KOI8-R. Can you
launch Mozilla under ru_RU.windows-1251 and see what difference
it makes? BTW, I would not use either of those. Instead, I would use
the ru_RU.UTF-8 locale.

> http://www.yandex.ru/yandsearch?rpt=rad&text=%26%23212%3B%26%23197%3B%26%23211%3B%26%23212%3B

> It's the unicoded word 'тест'

 Well, it's not. It's url-escaped NCRs for 'тест' in KOI8-R.

With url-escaping removed, we have

  &#212;&#197;&#211;&#212;

Note that NCRs have to use Unicode codepoints, but the above URL
(before url-escaping) uses KOI8-R code points (0xD4 0xC5 0xD3 0xD4.)
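
The two readings of that URL can be reproduced with a short Python sketch. This is only an illustration of the analysis above, not of what Mozilla does internally; the variable names are made up.

```python
import re
from html import unescape
from urllib.parse import unquote

escaped = "%26%23212%3B%26%23197%3B%26%23211%3B%26%23212%3B"

ncrs = unquote(escaped)  # percent-escapes removed, NCRs left intact
numbers = [int(n) for n in re.findall(r"&#(\d+);", ncrs)]

# Read as the HTML rules require, each NCR names a Unicode code point:
as_unicode = unescape(ncrs)
# Read as raw KOI8-R byte values, the same numbers spell the Russian word:
as_koi8r = bytes(numbers).decode("koi8-r")

print(ncrs)
print(numbers)
print(as_unicode)
print(as_koi8r)
```

A conforming reader of the NCRs gets four accented Latin letters, so the server cannot recover the Cyrillic word unless it knows that the numbers are really KOI8-R byte values.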

What Boris got must have been  '%F2%E5%F1%F2' instead of '%D2%E5%F1%F2'.
'тест' represented in Windows-1251 is 0xF2 0xE5 0xF1 0xF2.



> 'get' form has to use ASCII (Iso-Latin-1),

  US-ASCII (ISO 646) and ISO Latin-1 (ISO-8859-1) are DIFFERENT. ISO 646 is also
a national standard (with one or more code points replaced by national standards
bodies if necessary) in virtually all countries, but ISO-8859-1 is not.
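
A quick way to see that the two are different character sets: US-ASCII defines only 128 code points, while ISO-8859-1 uses all 256 byte values. The helper `fits` below is a made-up name for this sketch.

```python
def fits(text: str, charset: str) -> bool:
    """True if every character of text can be encoded in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

# Plain ASCII, Latin-1 with an accented letter, and a Cyrillic word.
for sample in ("test", "caf\u00e9", "\u0442\u0435\u0441\u0442"):
    print(sample, fits(sample, "ascii"), fits(sample, "iso-8859-1"))
```

Neither charset can carry Cyrillic text, which is why a Russian form needs some other encoding regardless of the submission method.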

> a 'post' has to use Unicode !

  No, you don't have to. You can use any valid MIME charset by specifying
'charset' parameter in the Content-Type header of any text/* subpart of
'multipart/form-data' (see RFC 2388)
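
For illustration, a multipart/form-data body whose text part declares its charset could be assembled roughly as follows. This is a hand-rolled sketch under stated assumptions: the function name `build_multipart` and the fixed boundary string are hypothetical, field names are assumed to be ASCII, and a real client would pick a boundary that cannot occur in the payload.

```python
def build_multipart(fields: dict, charset: str, boundary: str = "example-boundary") -> bytes:
    """Build a multipart/form-data body whose text parts declare their charset."""
    lines = []
    for name, value in fields.items():
        lines.append(f"--{boundary}".encode("ascii"))
        lines.append(f'Content-Disposition: form-data; name="{name}"'.encode("ascii"))
        lines.append(f"Content-Type: text/plain; charset={charset}".encode("ascii"))
        lines.append(b"")                    # blank line ends the part headers
        lines.append(value.encode(charset))  # payload in the declared charset
    lines.append(f"--{boundary}--".encode("ascii"))
    lines.append(b"")
    return b"\r\n".join(lines)

body = build_multipart({"text": "\u0442\u0435\u0441\u0442"}, "windows-1251")
print(body)
```

Because each part labels its own charset, the receiver does not have to guess how the bytes were encoded.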


P.S. You have to view this bug in KOI8-R. I thought the reporter used
Windows-1251 in comment #3, but it turned out that it was KOI8-R.
Bugzilla should enforce UTF-8 in comments (there's a workable migration plan
proposed by Markus Kuhn) to avoid this kind of problem. I'm using KOI8-R in my
comment as well to avoid putting this bug in multiple encodings.

Jungshik Shin

Comment 10

•

22 years ago

Ooops. I'm sorry I forgot to set View|Character coding to KOI8-R before posting
my comment. To view my comment #9 (Russian word 'тест'), 'View | Character
Coding' has to be set to EUC-KR. Due to my mistake, this bug is now in mixed
encodings.

Simon Montagu :smontagu

Comment 11

•

22 years ago

>What Boris got must have been  '%F2%E5%F1%F2' instead of '%D2%E5%F1%F2'.

No, %D2 is just the upper case form of %F2 in windows-1251
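
This is easy to confirm with Python's windows-1251 codec. The snippet is only a sanity check of the statement above.

```python
upper = bytes([0xD2]).decode("windows-1251")
lower = bytes([0xF2]).decode("windows-1251")

# The two bytes decode to the same Cyrillic letter in different cases.
print(upper, lower, upper.lower() == lower)

# The upper- and lowercase forms of this letter sit 0x20 apart in windows-1251.
print(hex(0xF2 - 0xD2))
```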

Jungshik Shin

Comment 12

•

22 years ago

>> the word "test" in Russian
>No, %D2 is just the upper case form of %F2 in windows-1251

So, what you typed was not 'test' but 'Test' in Russian :-) Or is Russian like
German in uppercasing the first letter of nouns? 

BTW, see bug 135762. It seems that it's a bit more relevant to this bug than I
thought at first.

Boris Zbarsky [:bzbarsky]

Comment 13

•

22 years ago

> So, what you typed was not 'test' but 'Test' in Russian

Yep.  I didn't even notice myself doing it....

In any case, reporter, have you set any IDN preferences that would cause your
urls to be encoded as NCRs?

Roland Mainz

Updated

•

22 years ago

OS: SunOS → Solaris

qwerty

Reporter

Comment 14

•

22 years ago

> Can you try it again with 'View | Character Coding' set to
> 'Windows-1251'?

OK. I tried both Win-1251 and KOI8-R and got the same result as above :(
But I noticed something: when I set the encoding to ISO-8859-1, the server returns
a Win-1251-encoded page :)

locale -a | grep 1251 returns ru_RU.ANSI1251

> In any case, reporter, have you set any IDN preferences that would cause your
> urls to be encoded as NCRs?

Hm... In my preferences, "Languages/Content" is set to English and
"Default Character Coding" is Western (ISO-8859-1).

But IMHO these settings do not affect the encoding of forms...

I think it's a Solaris problem. As I mentioned above, the same problem is observed
in Netscape 7.0.

Victor

PS: It's a pity that it's impossible to set Mozilla preferences via the .Xdefaults
file :(

Jungshik Shin

Comment 15

•

22 years ago

> Can you launch Mozilla under ru_RU.windows-1251   and see what difference
> it makes? BTW, I would not use either of that. Instead, I would use
> ru_RU.UTF-8 locale.

  You didn't try this, did you?  Can you try all three cases below? 

  % env LC_ALL=ru_RU.ANSI1251  mozilla
  % env LC_ALL=ru_RU.UTF-8 mozilla
  % env LC_ALL=ru_RU.KOI8-R mozilla  

BTW, how are you entering Cyrillic letters? Can you try 'locale' in your default
setting and let us know the output?

qwerty

Reporter

Comment 16

•

22 years ago

>  % env LC_ALL=ru_RU.ANSI1251  mozilla

Gdk-WARNING **: Missing charsets in FontSet creation


Gdk-WARNING **:     ansi-1251

mozilla did not start :(

The -i flag did not help.

>  % env LC_ALL=ru_RU.UTF-8 mozilla
>  % env LC_ALL=ru_RU.KOI8-R mozilla  

mozilla runs, but the problem arose again :(

>BTW, how are you entering Cyrillic letters?

I switch the keyboard layout by pressing 'Num Lock' and type Cyrillic letters.

>Can you try 'locale' in your default
> setting and let us know the output? 

locale: 

LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=

locale -a:
POSIX
common
en_US.UTF-8
C
iso_8859_1
bg_BG
bg_BG.ISO8859-5
et
et_EE
et_EE.ISO8859-15
hr_HR
hr_HR.ISO8859-2
lt
lt_LT
lt_LT.ISO8859-13
lv
lv_LV
lv_LV.ISO8859-13
mk_MK
mk_MK.ISO8859-5
nr
ro_RO
ro_RO.ISO8859-2
ru
ru.UTF-8
ru.koi8-r
ru_RU
ru_RU.ANSI1251
ru_RU.ISO8859-5
ru_RU.KOI8-R
ru_RU.UTF-8
sh_BA
sh_BA.ISO8859-2@bosnia
sl_SI
sl_SI.ISO8859-2
sq_AL
sq_AL.ISO8859-2
sr_SP
sr_YU
sr_YU.ISO8859-5
tr
tr_TR
tr_TR.ISO8859-9
iso_8859_13
iso_8859_15
iso_8859_2
iso_8859_5
iso_8859_9
koi8-r

qwerty

Reporter

Comment 17

•

22 years ago

On google.com everything works fine!!!

Boris Zbarsky [:bzbarsky]

Updated

•

21 years ago

Blocks: 217807

Gervase Markham [:gerv]

Comment 18

•

19 years ago

This is an automated message, with ID "auto-resolve01".

This bug has had no comments for a long time. Statistically, we have found that
bug reports that have not been confirmed by a second user after three months are
highly unlikely to be the source of a fix to the code.

While your input is very important to us, our resources are limited and so we
are asking for your help in focussing our efforts. If you can still reproduce
this problem in the latest version of the product (see below for how to obtain a
copy) or, for feature requests, if it's not present in the latest version and
you still believe we should implement it, please visit the URL of this bug
(given at the top of this mail) and add a comment to that effect, giving more
reproduction information if you have it.

If it is not a problem any longer, you need take no action. If this bug is not
changed in any way in the next two weeks, it will be automatically resolved.
Thank you for your help in this matter.

The latest beta releases can be obtained from:
Firefox:     http://www.mozilla.org/projects/firefox/
Thunderbird: http://www.mozilla.org/products/thunderbird/releases/1.5beta1.html
Seamonkey:   http://www.mozilla.org/projects/seamonkey/

Gervase Markham [:gerv]

Comment 19

•

19 years ago

This bug has been automatically resolved after a period of inactivity (see above
comment). If anyone thinks this is incorrect, they should feel free to reopen it.

Status: UNCONFIRMED → RESOLVED

Closed: 19 years ago

Resolution: --- → EXPIRED

Nobody; OK to take it and work on it

Updated

•

6 years ago

Component: HTML: Form Submission → DOM: Core & HTML

Categories

(Core :: DOM: Core & HTML, defect)

People

(Reporter: pilylkin, Unassigned)
