Closed Bug 228779 Opened 21 years ago Closed 10 years ago

Submitted characters not included in the iso-8859-1 charset for iso-8859-1 documents should be always encoded as numeric character references

Categories

(Core :: DOM: Core & HTML, defect)

x86
Windows 2000
defect
Not set
normal

Tracking

()

VERIFIED WONTFIX

People

(Reporter: moz, Unassigned)

References

(Depends on 1 open bug)

Details

Attachments

(1 file)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.6b) Gecko/20031211 Firebird/0.7+
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.6b) Gecko/20031211 Firebird/0.7+

Take an HTML document which uses the iso-8859-1 charset and has a form
with a text input. Insert characters like the euro symbol or the oe
ligature, and submit the text. During submission those characters should be
encoded as numeric character references: &#8364; and &#339;.
Instead you get the windows-1252 equivalent when you read the data submitted :
0x80 and 0x9c, which are actually special control characters in iso-8859-1.
The numeric character references encoding works for rarer characters like the
sigma character (as seen in the test page provided).
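
The difference described above can be reproduced outside the browser. This is a minimal sketch that uses Python's built-in codecs as a stand-in for Mozilla's encoders (an assumption; the helper name encode_with_ncr is made up for illustration). It encodes the three characters mentioned in the report once as real iso-8859-1 and once as windows-1252, falling back to a numeric character reference when a character has no byte in the charset.

```python
# Characters from the report: euro sign, oe ligature, Greek small sigma.
SAMPLES = {"euro": "\u20ac", "oe": "\u0153", "sigma": "\u03c3"}


def encode_with_ncr(text: str, charset: str) -> bytes:
    """Encode text, replacing unencodable characters with decimal NCRs."""
    return text.encode(charset, errors="xmlcharrefreplace")


for name, ch in SAMPLES.items():
    strict = encode_with_ncr(ch, "iso-8859-1")    # what the report expects
    actual = encode_with_ncr(ch, "windows-1252")  # what the report observes
    print(f"{name:5}  iso-8859-1: {strict!r:12}  windows-1252: {actual!r}")
```

Only the sigma comes out the same in both columns, which matches the observation that NCR encoding already works for characters outside windows-1252.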

Reproducible: Always

Steps to Reproduce:
1. In an iso-8859-1 encoded document, submit text with a euro symbol.
2. Display the submitted data in the same document.
3. View the document source to check how the data is encoded.

Actual Results:  
The euro symbol was not encoded as a numeric character reference; you get a
character encoded as 0x80 instead (which is a special control character in
iso-8859-1).

Expected Results:  
Mozilla should have encoded the euro symbol as &#8364; because the euro symbol
is not included in the iso-8859-1 charset.

So far I have only tested with the euro and oe ligature, but this bug may not be
limited to those characters. For now I suspect that Mozilla confuses
windows-1252 with iso-8859-1. Further testing is needed.
My assumption about the windows-1252 and iso-8859-1 confusion was true.

The problem is in the nsFormSubmission::GetEncoder function, located in the file
content\html\content\src\nsFormSubmission.cpp

Those two lines should be removed :

  if(charset.Equals(NS_LITERAL_CSTRING("ISO-8859-1")))
    charset.Assign(NS_LITERAL_CSTRING("windows-1252"));

I don't see any reason for such a replacement. iso-8859-1 and windows-1252 might
look similar but they are two different charsets. For consistency, when the
page specifies iso-8859-1 for its encoding, no other charset should be used.
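
To make the effect of those two lines concrete, here is a rough Python model of the substitution. It is not the Mozilla code: submission_charset and encode_field are hypothetical names, Python codecs stand in for the real encoders, and the case-insensitive comparison is a simplification of the exact canonical-name check in the original.

```python
def submission_charset(document_charset: str, substitute: bool = True) -> str:
    """Return the charset used to encode submitted form data.

    substitute=True models the two quoted lines (ISO-8859-1 is swapped for
    windows-1252); substitute=False models the build without them.
    """
    if substitute and document_charset.upper() == "ISO-8859-1":
        return "windows-1252"
    return document_charset


def encode_field(value: str, document_charset: str, substitute: bool = True) -> bytes:
    """Encode one form value, using NCRs for characters the charset lacks."""
    charset = submission_charset(document_charset, substitute)
    return value.encode(charset, errors="xmlcharrefreplace")


if __name__ == "__main__":
    print(encode_field("\u20ac", "ISO-8859-1", substitute=True))
    print(encode_field("\u20ac", "ISO-8859-1", substitute=False))
```

With the substitution the euro sign leaves as the single byte 0x80; without it, it leaves as an NCR. Other document charsets are unaffected either way.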

I built a personal version of Mozilla without those two lines. It behaved fine
with the test page provided here.
The patch removes two faulty lines in nsFormSubmission::GetEncoder which
prevent Mozilla from properly encoding some characters not in iso-8859-1 into
numeric character references.
Won't this just regress bug 81203?
Yes, it will. What's 'the' correct behavior? First of all, using NCRs for
characters not covered by the current document encoding is 'MS IE's invention
extended by Mozilla followed by Konqueror/Safari'.

Now , the main issue here:  ISO-8859-1 vs Windows-1252, we all know that they're
different. However, the reality is that there are so many 'broken' web pages out
there that don't specify the document character encoding in any way or that are
mistagged as ISO-8859-1 while they're actually in Windows-1252. So, what does
Mozilla do? It regards ISO-8859-1 as synonymous with Windows-1252 when
interpreting documents. The form submission is a different issue and what does
it have to do?  It's a tough call. If we apply the patch here, I'm sure tomorrow
somebody will file a bug asking for taking it out.  On the other hand, we don't
do it for similar cases (TIS620 < ISO-8859-11 < Windows-874?, EUC-KR <
Windows-949, GB2312 < GBK < GB18030). 
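
The 'subset < superset' chains above can be probed with a small check. This sketch relies on Python's codec tables as a stand-in for the browser's converters (an assumption), and the sample characters are my own choice: the euro sign for the Thai pair and U+9555 for the Chinese pair.

```python
def can_encode(text: str, charset: str) -> bool:
    """Return True if every character of text has a byte form in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# (sample character, narrower charset, wider charset)
PAIRS = [
    ("\u20ac", "tis_620", "cp874"),  # euro sign: absent from TIS-620
    ("\u9555", "gb2312", "gbk"),     # a hanzi that GBK added over GB2312
]

for ch, narrow, wide in PAIRS:
    print(f"U+{ord(ch):04X}  {narrow}: {can_encode(ch, narrow)}  "
          f"{wide}: {can_encode(ch, wide)}")
```

In each pair the character is rejected by the narrower charset and accepted by the wider one, which is the same relationship as ISO-8859-1 and Windows-1252 for the euro sign.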

I really don't know what to do. Everybody, please switch to UTF-8 (or other
encoding forms of Unicode as you see fit) everywhere !
I think there was an error in resolving bug 81203.

Mozilla has never been an IE-compatible browser, and it must not become one.

I don't really disagree with transforming iso-8859-1 to windows-1252 but... Why
don't you copy IE rendering bugs as well? Why don't you reproduce IE security
holes as well? Why don't you implement document.all?

Compatibility has never been the way Mozilla was developed, and I think it
is a bad idea to let webmasters think that their website works when they
have specified a wrong charset.
>Yes, it is. What's 'the' correct behavior? First of all,  using NCRs for
>characters not covered by the current document encoding is 'MS IE's invention
>extended by Mozilla followed by Konqueror/Safari'. 

I think this is a good idea. So one can send all the characters he wants. I am
happy to be able to use the euro symbol in forms which use iso-8859-1 for example.

>Now , the main issue here:  ISO-8859-1 vs Windows-1252, we all know that 
>they're different. However, the reality is that there are so many 'broken' web
>pages out there that don't specify the document character encoding in any way
>or that are mistagged as ISO-8859-1 while they're actually in Windows-1252. 
>So, what does Mozilla do? It regards ISO-8859-1 as synonymous with Windows-1252
>when interpreting documents. 

As for interpreting documents, I think this behaviour is not really correct, but
interpreting iso-8859-1 as windows-1252 is necessary anyway because, as you
say, there are many broken pages which confuse the two. We all know the
Internet protocols rule: one should be tolerant when parsing received data, but
very strict about the format when sending data.

>The form submission is a different issue and what does
>it have to do?  It's a tough call. If we apply the patch here, I'm sure
>tomorrow somebody will file a bug asking for taking it out.  On the other
>hand, we don't do it for similar cases (TIS620 < ISO-8859-11 < Windows-874?,
>EUC-KR <Windows-949, GB2312 < GBK < GB18030). 

Yes, I consider the form issue a different topic from document
interpretation. And this is where I think Mozilla should be strict and stick to
the charset information provided by the document. I think it is really not
harmful to send real iso-8859-1 in forms. If a problem does occur (and I don't
have examples in mind), it is a server-side problem.

>I really don't know what to do. Everybody, please switch to UTF-8 (or other
>encoding forms of Unicode as you see fit) everywhere !

lol That would be a good solution :) I personally use the iso-8859-15 charset,
which is a slight improvement on iso-8859-1, and this bug 228779 does not occur
with this charset.

So here is a summary of my thoughts : 
1/ Mozilla should be tolerant when reading document data (this is already the case)
2/ Mozilla should be strict when sending data (this has to be improved a bit as
noted here).
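
The two points of the summary can be sketched as a pair of functions. This is only an illustration of the proposed policy, with Python codecs assumed as stand-ins and made-up function names: reading treats bytes labelled iso-8859-1 as windows-1252, while sending uses real iso-8859-1 and falls back to NCRs.

```python
def read_leniently(raw: bytes) -> str:
    """Tolerant reading: decode bytes labelled iso-8859-1 as windows-1252.

    Bytes that windows-1252 leaves undefined become U+FFFD.
    """
    return raw.decode("windows-1252", errors="replace")


def send_strictly(text: str) -> bytes:
    """Strict sending: real iso-8859-1, NCRs for everything outside it."""
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


if __name__ == "__main__":
    page_bytes = b"caf\xe9 \x9c"       # mistagged page content
    text = read_leniently(page_bytes)  # the oe ligature is recovered
    print(send_strictly(text))
```

A mistagged page is still displayed as its author intended, but what goes back to the server contains no byte in the 0x80 to 0x9F range.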

For now, I hope the nsFormSubmission::GetEncoder function is only used when
sending data and not when interpreting documents. Otherwise a better patch
should be applied. I will go deeper in the source code to help answering this
question.
re comment #5: I'm afraid you have no clue what is at issue here. It has little
to do with MS IE parity.

re comment #6: I know where you stand and I usually practice 'be generous in
what you accept and be strict in what you send out'. However, the reality of the
internet makes it hard to do the 'right' thing in some cases. (I would have
objected to the change made in bug 81203  if I had been there.) Your patch is
the exact mirror image of the patch for bug 81203, which means it's gonna make
some (probably many) people unhappy (at least get them surprised) if applied.
However, one thing has changed since. In 2001, Mozilla didn't turn characters
outside the repertoire of the 'form charset' into NCRs (which is a non-standard
[1] hack although widely used). It just turned them into question marks. Now it
does, so reversing the patch for bug 81203 (i.e. applying your patch) would
be less of a problem now than then. Let me think about it some more.

BTW, you don't have to worry about interpreting document. Your patch has no
impact there. 

[1] HTML4 and XHTML 1 failed to address I18N issues in the form submission so
that there's no 'standard' way, but just a couple of different 'practices'. 
Vincent Robert, how much do you know exactly about how Mozilla has been
developed?  Development has followed two main principles:

1)  If there is a spec follow it
2)  If there is no spec, be compatible with other browsers, unless there are
    very strong logical reasons not to (since we can do whatever we want, and
    there is no reason to not do what they do).

Frankly, this form submission behavior falls under item #2 as far as I can tell,
but I don't have other browsers on hand to test this hypothesis....

Hadrien Nilsson, saying "it's a server problem" is a convenient out, but the
fact remains that this patch will break pages that have worked with Mozilla for
years now.  That's the loss.  What is the benefit?
> 1)  If there is a spec follow it

does not necessarily apply to quirks mode...
Now that I have thought a bit more about it, using true iso-8859-1 and NCRs will
not lead to server errors. Server programs usually already know how to deal with
NCRs, so a few more will not harm them. Moreover, some server-side programs
want to manage data in the charset they expect. For instance, in PHP the native
charset is iso-8859-1, so I think sending windows-1252 is not a good solution.

Take this example of a server-side program which converts a string to
uppercase and stores it somewhere:

1/ Get the data from HTTP POST
2/ Convert the result from iso-8859-1 & NCR to some Unicode flavour string (like
UTF-8)
3/ ToUpper function applied to the Unicode string
4/ Store the result somewhere : database or a file

Here step 2 will fail if the server expects true iso-8859-1.

Of course many characters of iso-8859-1 and windows-1252 are similar, so it may
take long before a bug arises. But one day somebody will use the euro symbol or
the oe ligature, and will trash the database or the file with some strange
characters.
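
The four steps above can be sketched as follows. This is a hypothetical server-side handler, not code from any real application; the regular expression only handles decimal NCRs and does no range checking. It shows that step 2 does not raise an error on a windows-1252 byte: it silently produces a C1 control character, which is what ends up stored.

```python
import re
import unicodedata

NCR = re.compile(r"&#(\d+);")


def decode_submission(raw: bytes) -> str:
    """Step 2: iso-8859-1 bytes plus decimal NCRs to a Unicode string."""
    text = raw.decode("iso-8859-1")
    return NCR.sub(lambda m: chr(int(m.group(1))), text)


def process(raw: bytes) -> str:
    """Steps 2 and 3: decode, then convert to uppercase."""
    return decode_submission(raw).upper()


if __name__ == "__main__":
    good = process(b"&#339;uvre")  # oe ligature sent as an NCR
    bad = process(b"\x9cuvre")     # oe ligature sent as windows-1252
    print(repr(good), repr(bad), unicodedata.category(bad[0]))
```

The NCR form survives the round trip and is uppercased correctly; the 0x9c form stays an unprintable control character all the way to storage.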

How did I discover this bug? One day I posted a comment on a blog. It uses
iso-8859-1 and my comment contained an oe ligature. Some people could not see my
oe ligature; they just saw a symbol whose meaning was « unknown
character ». I looked at the problem and found that the blog comment system
stored and delivered an evil 0x9c character instead of the « &#339; » NCR :(

Thanks Jungshik Shin for thinking about it. Now I understand the history behind
the use of windows-1252. As you say, as now Mozilla form submission system knows
about NCR, we can revert back to a better behaviour.
Strange, my latest comment (Comment #10 From Hadrien Nilsson 2003-12-27 08:42)
wasn't sent by e-mail. Maybe this new one will flush the output. A bugzilla bug ? 

Just out of curiosity I made some tests with other browsers:
- Opera 7.22: converts characters outside the charset into question marks;
- Konqueror 3.1.4: same as Opera, you get question marks;
- Amaya 8.2: is totally lost :) Displays a parse error;
- MSIE 6: you get NCRs for any character outside windows-1252.
I could not test on Safari yet.
Getting NCR for any character outside iso-8859-1 (of course when the document
encoding IS iso-8859-1) seems to be a sensible choice for me for Mozilla.
Oh, and I post this comment with my modified Mozilla, &#8364; and &#339; should be encoded
as NCR here :)
When I wrote comment #4, I didn't realize that 'NCR-hack' had been added _after_
bug 81203. With that in place, it could be all right to revert back to pre-81203
state (at least, I'm   less sure that somebody will file a bug a day after we
apply the patch here.)

off-topic: it might be a good idea to warn users if characters not covered by
the 'form' charset are submitted. 

As for my wish for UTF-8, I'm  serious. I don't think there's any compelling
reason to keep using legacy character encodings in 2003.  
So IE6 acts just like Mozilla on that testcase, as I understand it?  (Just
making sure I understand the situation.)
>So IE6 acts just like Mozilla on that testcase, as I understand it?  (Just
>making sure I understand the situation.)
Currently, for that testcase, yes it does.
How about the client-side setting? Does it make any difference if you explicitly
set 'Encoding' in MS IE to 'Western(iso)' (ISO-8859-1) instead of just 'Western'
 (Windows-1252)? 
>How about the client-side setting? Does it make any difference if you explicitly
>set 'Encoding' in MS IE to 'Western(iso)' (ISO-8859-1) instead of just 'Western'
>(Windows-1252)? 

In my french MSIE 6 I've got the display options : 
- Europe occidentale (Windows)  {should be windows-1252}
- Europe occidentale (ISO)    {should be iso-8859-1}

Switching from one to another does not change anything, for MSIE iso-8859-1 ==
windows-1252
Related bug 232191:

"Character encoding of submitted form
silently influenced by that of original page."
I agree prima facie with the assumption that the patch won't regress bug 81203,
because NCRs will now be used, so subject to testing that assumption, I am OK
with this patch.

However, we should bear in mind that the "NCR hack" is controversial. I thought
there was a bug saying that we shouldn't do it, though I can't find it in
Bugzilla; and the proposed Web Forms spec at
http://www.hixie.ch/specs/html/forms/web-forms#unacceptableCharacters forbids it.
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040321 Firefox/0.8

It's confusing me a lot. Mozilla Firebird at least seems to do the following:

- Mask the Euro sign as 0x80
- Mask everything else with &#<unicode number>;
- Do not mask &#something; input from the user

The last point cracks me: this means I cannot distinguish between entities
transformed by the browser and entities the user put in. Additionally, Mozilla
seems to ignore the accept-charset parameter in the form tag. You can test it here:

http://selfforum.teamone.de/cgi-bin/test.pl

Can somebody explain?

Greetings,
 CK
(In reply to comment #19)

> The last point cracks me: this means I cannot differ between entities
> transformed by the browser and entities the user put in. Additionally, Mozilla
> seems to ignore the accept-charset parameter in the form tag. You can test it
> here:
> 
> http://selfforum.teamone.de/cgi-bin/test.pl
> 
> Can somebody explain?

Well, Mozilla ignores the accept-charset attribute in the form but IE6 and Opera
7.2 do not and so send back UTF-8 characters. Mozilla simply pays attention to
only the document encoding. I think it would be better to change this behavior. 

The above test form indicates a good compromise solution for the web. If a site
doesn't want to see Unicode entities returned for those characters outside of
ISO-8859-1, then adding UTF-8 as the second preferred charset in the
accept-charset attribute of the form seems like a very good practice for web
sites.

Mozilla sends Windows-1252/ISO-8859-1 characters as is, IE6 and Opera 7 send
UTF-8 characters for those outside the ISO-8859-1 range. Only Mozilla sends
Unicode entities for those characters outside Windows-1252 when there is UTF-8
in the accept-charset attribute. I think we should add the accept charset
support in forms to Mozilla. 

Should we revert to sending entities? I'm not sure. I am concerned about legacy
cases and breaking server side expectations. Also the new web form standard
seems to be against it as Simon says above. One thing we can do is to implement
the form accept charset and evangelize specifying both the local encoding and
UTF-8 always as the 2nd choice.
(In reply to comment #20)

> I think we should add the accept charset support in forms to Mozilla. 

This would be a very good idea, IMHO.

> Should we revert to sending entities?

I think sending entities without masking user input is a bad idea. Look,
what happens if a user enters &#77220; as text but also enters the character
with code point 77220? It's impossible to distinguish the two cases. So there
are IMHO two ways:

- Is there an accept-charset parameter?
  - Yes:
      for every charset in accept-charset:
        - Are all characters displayable in this charset?
          - Yes: Send the form with that charset
          - No: Are there any more charsets in accept-charset?
            - Yes: continue with the next charset in accept-charset
            - No: Either send unicode entities and mask every & the user put
                  in by &amp; or do not send unicode entities and mask every
                  not displayable character with 0x3F (question mark)

  - No: Either send unicode entities and mask every & the user put in by &amp;
        or send question marks (0x3F) for characters not displayable in the
        site charset

I think both (sending unicode entities and mask the user input or sending
question marks) would be an acceptable solution.
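
The decision tree above can be written down as a short function. This is a sketch under assumptions: Python codecs stand in for the browser's encoders, the function and parameter names are invented, and because the proposal does not say which charset to use once every candidate has failed, the first candidate is used here.

```python
def can_encode(text: str, charset: str) -> bool:
    """Return True if every character of text has a byte form in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def encode_submission(text, page_charset, accept_charsets=(), fallback="ncr"):
    """Return (charset, bytes) following the proposed decision tree.

    fallback="ncr" escapes user-typed '&' as '&amp;' and sends NCRs;
    fallback="question" sends '?' (0x3F) for unencodable characters.
    """
    candidates = list(accept_charsets) or [page_charset]
    for charset in candidates:
        if can_encode(text, charset):
            return charset, text.encode(charset)
    charset = candidates[0]  # assumption: not specified in the proposal
    if fallback == "ncr":
        masked = text.replace("&", "&amp;")
        return charset, masked.encode(charset, errors="xmlcharrefreplace")
    return charset, text.encode(charset, errors="replace")


if __name__ == "__main__":
    print(encode_submission("\u20ac", "iso-8859-1", ["iso-8859-1", "utf-8"]))
    print(encode_submission("a & \u03c3", "iso-8859-1"))
    print(encode_submission("a & \u03c3", "iso-8859-1", fallback="question"))
```

Note that the ampersand is only masked in the fallback branch, so a receiver still needs to know which branch was taken in order to undo it.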


> One thing we can do is to implement the form accept charset and evangelize
> specifying both the local encoding and UTF-8 always as the 2nd choice.

This would be the best, IMHO.

Greetings,
 CK
I have three points to add to your discussion.


1. Send the charset used for encoding the form data together with the 
content-type.

Opera uses this method, and I would consider it legal according to specs. RFC
2616, Section 3.7 explicitly allows adding parameters to media types.

This enables someone who cares about charsets to actually detect what encoding
was used. This could reveal what happened to the encoding when Mozilla decides
that the normal encoding is not satisfying enough (e.g. changes ISO-8859-1 to
Windows-1252, which is a bad idea nonetheless).
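
As an illustration of point 1, here is a sketch of a form POST that labels its own encoding. The function name build_post is hypothetical, and whether a given server actually reads the charset parameter is not something this thread establishes.

```python
from urllib.parse import urlencode


def build_post(fields: dict, charset: str):
    """Return (headers, body) for a urlencoded POST that states its charset."""
    # Percent-encode the bytes that `charset` produces for each name and value.
    body = urlencode(fields, encoding=charset).encode("ascii")
    headers = {
        "Content-Type": f"application/x-www-form-urlencoded; charset={charset}",
    }
    return headers, body


if __name__ == "__main__":
    headers, body = build_post({"text": "\u20ac"}, "windows-1252")
    print(headers["Content-Type"])
    print(body)
```

With the parameter present, a receiver that sees %80 can tell that it was produced by windows-1252 rather than guessing from the page it served.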


2. I must object to Christian's suggestion of converting any unencodable
character to its NCR code.

Doing this means that any "&" character entered on purpose has to be converted
to "&amp;", because otherwise you wouldn't be able to tell why
there is an ampersand: was it a Chinese character converted to "&#22770;", or
did the user enter this very string literally?

Example: "&#22770; is &#22770;" (the first letter is the Chinese sign)
would be converted to "&#22770; is &#22770;".

By converting the literal "&" to "&amp;" you would end up with
"&#22770; is &amp;#22770;".
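
The ambiguity in this example can be reproduced directly. The sketch below assumes Python's xmlcharrefreplace handler as a stand-in for the browser's NCR fallback and writes the reference with the "#" that NCR syntax requires; the submit helper is made up for illustration.

```python
def submit(text: str, escape_ampersand: bool = False) -> bytes:
    """Encode form text as iso-8859-1 with NCRs for unencodable characters."""
    if escape_ampersand:
        # Mask ampersands typed by the user before NCRs are generated.
        text = text.replace("&", "&amp;")
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


# The real character U+58F2 (decimal 22770) followed by the literal string.
typed = chr(22770) + " is &#22770;"

if __name__ == "__main__":
    print(submit(typed))                         # both halves look identical
    print(submit(typed, escape_ampersand=True))  # distinguishable again
    print(submit("Tom & Jerry", escape_ampersand=True))
```

The last line shows the cost that the comment objects to: once ampersands are escaped, every ordinary "&" arrives at the server as "&amp;".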

But this would REALLY break things, I think. Everybody is relying on the fact 
that an ampersand is sent as an ampersand, and not as an HTML entity.

You cannot encode characters entered into a form which do not have a binary 
representation within the allowed or chosen character encoding, and without a 
suitable escape code signaling a different encoding approach for a special 
character. There is no euro sign in ISO-8859-1. If you have to use the euro 
sign, use ISO-8859-15, that's why it is there. Users are aware that the euro 
sign has relatively poor support across systems. Sending form data touches only 
those users who set up a website and do want some kind of feedback - so this 
might be a topic not concerning "John Doe" who is just surfing the net but those 
who have to process form data.


3. Do not send a character encoding which is not stated in the document.

The accept-charset-attribute should have highest priority. But it allows 
conflicting encoding (which is why sending the used encoding is important).

HTML defines its contents as "a space separated list of character encodings, as 
per RFC 2045". This allows values like

<form accept-charset="Windows-1252 UTF-8" ...>

Assuming you have such a form, and enter the following two characters: "Ö€"
(which is &Ouml;&euro; in entities).

Using the Windows-1252 encoding, the Ö (&Ouml;) translates to 0xD6, the €
(&euro;) translates to 0x80. So the browser would send "0xD6 0x80".

But those two bytes are a valid UTF-8-encoding for Unicode character 0x0580 
(Armenian small letter reh). And since UTF-8-encoding has been defined as a 
valid encoding, too, you cannot decide which has been used.

The other way round: If only "Windows-1252" is in accept-charset, and the user
enters this very Armenian letter, it cannot be encoded in Windows-1252. If you
were to encode it using UTF-8 as a fallback (which has been suggested in comment
#10), the server can validly assume the user entered Ö€. Using any other charset
can lead to similar confusions.
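
The two-byte collision described here is easy to check. A minimal sketch, using Python's codecs and the unicodedata module:

```python
import unicodedata

raw = bytes([0xD6, 0x80])

# The same two bytes are valid in both charsets of the example form.
as_cp1252 = raw.decode("windows-1252")
as_utf8 = raw.decode("utf-8")

print(repr(as_cp1252), [f"U+{ord(c):04X}" for c in as_cp1252])
print(repr(as_utf8), unicodedata.name(as_utf8))
```

Both decodes succeed without error, so a server that accepts either charset has nothing in the bytes themselves to tell it which one the browser used.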

Stick to accept-charset if defined, and to it ONLY. And tell the server which 
encoding you used - at least if there are multiple valid charsets given.

If no accept-charset is defined, use the charset of the page.

If a character cannot be encoded using the requested charset, you might display 
an error message to the user. And you cannot send this character, but have to 
replace it with something else (the question mark seems to be a good choice).

Thanks for considering. :)

Sven
Bug #29271 suggests a good solution for this whole "inappropriate encoding for
characters to send" thing.

Plus: It's an assigned bug, Priority 3, and should have been solved for version
0.9.8 - or so the bug history says...

Depends on: 29271
I have a comment which I had hastily added to bug #288904 comment #14.

I've just spent 2 days tracking down a bug in minimo whereby form submission
didn't work. I built minimo from current CVS sources, and have it running on
Familiar Linux v0.8.4-rc2 on an iPAQ hx4700.

I've now found the problem. It was a gratuitous substitution of character set
ISO-8859-1 by windows-1252 in the following file:

content/html/content/src/nsFormSubmission.cpp

  // canonical name is passed so that we just have to check against
  // *our* canonical names listed in charsetaliases.properties
  if (charset.EqualsLiteral("ISO-8859-1")) {
    charset.AssignLiteral("windows-1252");
  }

I don't know why this code is there. The comment about
charsetaliases.properties means nothing to me because I can find no reference
to it elsewhere.

While Windows and even Linux installed on a desktop PC will have Codepage 1252
installed somewhere, an embedded Linux distribution installed on a portable
device may not.

I fixed form submission in my minimo build by simply deleting the above 5
lines.

I might equally have installed Codepage 1252. But my feeling is that the
software should respect its own default character sets!
Why do browsers (FireFox, IE, Opera) send &#8364; instead of € for example?
Is there an RFC for this or is this the behaviour suggested by the W3C?
Where can i read about it?

Anyway: a manually entered "&#8364;" is not different from a generated one. No chance to safely decode things on the server-side.
Assignee: form-submission → nobody
QA Contact: form-submission
I don't know if it's a good idea to bump this bug, but here are my 2 cents:

I personally think that any input data *should* be submitted strictly according to the page's character encoding.
Display compatibility *should* only apply to display concerns, not to data integrity.

On the question about entities, I also think there should not be any difference between a browser-generated entity and a user-entered one. If a website needs to tell a browser-generated entity from a manual entry, it is up to that website to encode the ampersand as an entity on user input.
The Firefox behavior is Web-compatible and complies with the Encoding Standard (https://encoding.spec.whatwg.org/).
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX
@Henri:
I don't buy it. Can you explain why this is web-compatible?

The way I remember it, Firefox will submit a cp1252 encoding of a euro symbol (and other characters) to a server-side application that decodes it with ISO-8859-1 - which will obviously lead to problems on the server side.

What is this funny URL you posted? Where does it say that a browser is allowed to encode data using cp1252 even though it is asked to encode using iso-8859-1?
All browsers do that, not just Firefox (and sites depend on it).

https://encoding.spec.whatwg.org/#names-and-labels defines "iso-8859-1" as label for windows-1252, which is defined here: https://encoding.spec.whatwg.org/#windows-1252
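
A rough sketch of how such a label lookup works: the label is stripped of ASCII whitespace, lowercased and looked up in a table. The table below is only a partial excerpt typed from memory of the standard's label list, and get_encoding is an illustrative name, not an API of any library.

```python
# Partial excerpt; the Encoding Standard lists more labels and encodings.
LABELS = {
    "windows-1252": {
        "ascii", "cp1252", "iso-8859-1", "iso8859-1", "iso_8859-1",
        "l1", "latin1", "us-ascii", "windows-1252", "x-cp1252",
    },
    "utf-8": {"unicode-1-1-utf-8", "utf-8", "utf8"},
}

ASCII_WHITESPACE = "\t\n\x0c\r "


def get_encoding(label: str):
    """Map a charset label to an encoding name, or None if unknown."""
    key = label.strip(ASCII_WHITESPACE).lower()
    for name, labels in LABELS.items():
        if key in labels:
            return name
    return None


if __name__ == "__main__":
    for label in ("ISO-8859-1", " latin1 ", "utf8", "no-such-charset"):
        print(repr(label), "->", get_encoding(label))
```

Under this scheme "iso-8859-1" is simply another name for windows-1252, so there is no separate iso-8859-1 encoder for a form submission to select.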

Obviously content should migrate to utf-8. The sooner the better.
Status: RESOLVED → VERIFIED
From what I understand of web compatibility, the iso-8859-1 → windows-1252 compatibility should ensure the chars 0x80 to 0x9F, which are control chars in iso-8859-1, are displayed as if the encoding was windows-1252, so that old broken sites with the iso-8859-1 charset are displayed correctly. I understand this as display compatibility, so I can't understand why this rule should be applied to user input. Moreover, I can't remember other browsers having the same kind of trouble on user input.
There is no trouble. All browsers work the same way.
This would imply that Tomcat (for example) has a bug. If I specify iso-8859-1 encoding, it actually uses iso-8859-1 instead of cp1252 as it should. If the list is part of some standard then I should definitely report that bug. Could you point me to the place in the standard where it says that form data has to be sent using an encoding that is obtained from mapping labels (which look like names of encodings but really aren't) to encodings?

Back then, when I first encountered that bug, Opera was the only browser with a reasonable (but obviously non-standard compliant) behavior. But Chrome wasn't around yet and Opera now switched to WebKit.

Also, there should be a punchbag with some general face on it representing people that admin "standards" that are basically documentations of what has been done wrong in the past.
Could you please keep it civil?

Opera changed its behavior on this before it changed to Chromium because of compatibility issues.

https://html.spec.whatwg.org/multipage/forms.html#application/x-www-form-urlencoded-encoding-algorithm defines it for <form enctype=application/x-www-form-urlencoded> (default). You should also find the other algorithms there.
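
As a rough approximation of that algorithm, the following sketch serializes one name/value pair: the text is encoded with the form's encoding, characters the encoding cannot represent become decimal NCRs, and the result is percent-encoded. It uses Python's quote_plus, whose set of unescaped punctuation is not identical to the standard's, so treat it as an illustration rather than a conforming implementation.

```python
from urllib.parse import quote_plus


def urlencode_field(name: str, value: str, charset: str) -> str:
    """Serialize one form field as name=value in urlencoded form."""
    def enc(s: str) -> str:
        # Unencodable characters become "&#NNNN;", which is then
        # percent-encoded like any other "&", "#" and ";".
        return quote_plus(s, safe="", encoding=charset,
                          errors="xmlcharrefreplace")
    return f"{enc(name)}={enc(value)}"


if __name__ == "__main__":
    # Euro sign is encodable in windows-1252; Greek sigma is not.
    print(urlencode_field("q", "\u20ac \u03c3", "windows-1252"))
```

The euro sign goes out as %80 while the sigma goes out as a percent-encoded NCR, which is the behaviour the original report observed.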
Component: HTML: Form Submission → DOM: Core & HTML