Closed Bug 288904 Opened 19 years ago Closed 16 years ago

Do not display codepage 1252 (Windows-1252) characters when document is parsed as ISO-8859-1

Component: Core :: Internationalization
Severity: enhancement
Status: RESOLVED DUPLICATE of bug 99426
Reporter: 32768; Assignee: smontagu
Whiteboard: parity-Konqueror
Attachments: 4 files, 1 obsolete file

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8b2) Gecko/20050329 Firefox/1.0+
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8b2) Gecko/20050329 Firefox/1.0+

When a document is parsed as ISO-8859-1, characters in the range 128-159
(0x80-0x9F) are displayed using Microsoft Codepage 1252 (Windows-1252) extended
characters.

This character range is reserved in ISO-8859-1 (it corresponds to the C1
control codes) and is not valid for displayable text.

Reproducible: Always

Steps to Reproduce:
1. Open testcase
2. Make sure browser is treating it as ISO-8859-1


Actual Results:  
Characters which are invalid in ISO-8859-1 are displayed as Windows-1252 characters

Expected Results:  
Characters which are invalid in ISO-8859-1 should be removed or replaced with a
replacement character (such as '?')

Not all browsers/platforms which support ISO-8859-1 also support the codepage
1252 extensions.  Rendering these characters when the ISO-8859-1 character
encoding is used will encourage web authors to use characters that won't
appear on other platforms or browsers, and are invalid in ISO-8859-1.
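
To make the two interpretations concrete, here is a minimal sketch (Python,
illustrative only; the byte values 0x93/0x94, cp1252 curly quotes, are chosen
as an example and do not come from the testcase):

    smart_quotes = b"\x93quoted\x94"  # 0x93/0x94: curly quotes in cp1252

    # Decoded as Windows-1252, the 0x80-0x9F range yields printable text:
    print(smart_quotes.decode("windows-1252"))    # “quoted”

    # Decoded strictly as ISO-8859-1, the same bytes map to C1 control
    # codes (U+0093, U+0094), which have no glyphs and should not render:
    print(repr(smart_quotes.decode("latin-1")))   # '\x93quoted\x94'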
Attached file Testcase (obsolete) —
Attached file Testcase
Attachment #179536 - Attachment is obsolete: true
Assignee: nobody → smontagu
Component: Layout: Fonts and Text → Internationalization
QA Contact: layout.fonts-and-text → amyy
Whiteboard: DUPEME
(In reply to comment #0)

> Not all browsers/platforms which support ISO-8859-1 also support the codepage
> 1252 extensions.  

  Would they work if 'charset=windows-1252' is specified? If not, I don't think
changing our behavior will help them much.

> Rendering these characters when the ISO-8859-1 character
> encoding is used will encourage web authorers to use characters that won't
> appear in other platforms or browsers, and are invalid in ISO-8859-1.

I'm not as big a fan of 'be generous in what you accept and be strict in what
you emit' as I used to be. And I understand your good intent very well.
However, I'm afraid this is one of the cases where we have to bend a little (we
gave in to the 'document.all' crowd). Note that it's not just
ISO-8859-1/Windows-1252 but also TIS-620/ISO-8859-11/Windows-874 and
EUC-KR/x-Windows-949.

OS: Windows XP → All
Hardware: PC → All
David Baron suggested some time ago in a bug which I can't now find that we
should make Windows-1252 our default encoding for pages with no encoding
specified, and interpret ISO-8859-1 strictly at least in standards mode. I very
much agree in theory, but I'm sure it would break lots and lots of sites.

It would be interesting to gather some up-to-date statistics on how many pages
that are declared as ISO-8859-1 actually contain codepoints in the 0x80-0x9F
range. Erik, I believe you wrote a tool once to perform similar analyses. Is it
part of mozilla/webtools?

Off the top of my head, I would guess "one third of the Web", and as much as
half if you include pages that have a META charset declaration as windows-1252
but ISO-8859-1 in the HTTP headers.
(In reply to comment #4)

Yes, the charset-sniffing Web crawler is mozilla/webtools/web-sniffer/robot.c
It looks for HTTP and HTML charsets but not for 0x80-0x9f. Note that its
crawl may not be "broad" enough to get a good sense. Let me know if you have
questions.

Another approach, to be used instead or in parallel, is to start a discussion
on, say, www-international@w3.org to see how other people feel about this.
(In reply to comment #4)
> David Baron suggested some time ago in a bug which I can't now find that we
> should make Windows-1252 our default encoding for pages with no encoding
> specified, 

Urh, when no encoding is specified, we use the encoding specified in the user
preference as you know well.

(In reply to comment #6)

> Urh, when no encoding is specified, we use the encoding specified in the user
> preference as you know well.

Yes. What I meant by "make Windows-1252 our default encoding ..." was "make
Windows-1252 the out-of-the-box value of the user preference that specifies the
default encoding ...".

(In reply to comment #4)
> David Baron suggested some time ago in a bug which I can't now find

Bug 193094 comment 6
(In reply to comment #7)

> Yes. What I meant by "make Windows-1252 our default encoding ..." was "make
> Windows-1252 the out-of-the-box value of the user preference that specifies the
> default encoding ...".

Sorry to be pedantic ;-), but that needs a bit more qualification, 'for Western
European language versions', because the out-of-the-box default value for that
user pref is localizable.

Anyway, if we split the ISO-8859-1 decoder (currently it is a Windows-1252
decoder) into two, a Windows-1252 decoder and an ISO-8859-1 decoder, we may
interpret ISO-8859-1 strictly for pages in standards mode. I have no idea as to
the number of pages that will break with this change, but applying the strict
interpretation only to standards mode will significantly reduce the number.
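
Behaviourally, the strict half of such a split might look like this sketch
(Python for brevity; the real decoders would be C++ in uconv, and replacing
invalid bytes with U+FFFD is an assumption of this sketch, not a decision
taken in this bug):

    # Strict ISO-8859-1: reject the 0x80-0x9F range instead of passing
    # it through to the cp1252 mapping.
    def decode_iso_8859_1_strict(data: bytes) -> str:
        return "".join(
            "\uFFFD" if 0x80 <= b <= 0x9F else chr(b)  # replace invalid C1 bytes
            for b in data
        )

    print(decode_iso_8859_1_strict(b"caf\xe9 \x80"))   # 'café ' + U+FFFD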


Perhaps this could be thought of as a user interface issue: having separate
menu items in the user application for ISO-8859-1, US-ASCII and Windows-1252 is
misleading, since they all have the same effect - the Windows-1252 encoding
is used for all of them.

Perhaps another bug, in the Firefox product, could be created that specifies
"ISO-8859-1, US-ASCII and Windows-1252 choices under View->Character Encoding
are redundant".
This is an automated message, with ID "auto-resolve01".

This bug has had no comments for a long time. Statistically, we have found that
bug reports that have not been confirmed by a second user after three months are
highly unlikely to be the source of a fix to the code.

While your input is very important to us, our resources are limited and so we
are asking for your help in focussing our efforts. If you can still reproduce
this problem in the latest version of the product (see below for how to obtain a
copy) or, for feature requests, if it's not present in the latest version and
you still believe we should implement it, please visit the URL of this bug
(given at the top of this mail) and add a comment to that effect, giving more
reproduction information if you have it.

If it is not a problem any longer, you need take no action. If this bug is not
changed in any way in the next two weeks, it will be automatically resolved.
Thank you for your help in this matter.

The latest beta releases can be obtained from:
Firefox:     http://www.mozilla.org/projects/firefox/
Thunderbird: http://www.mozilla.org/products/thunderbird/releases/1.5beta1.html
Seamonkey:   http://www.mozilla.org/projects/seamonkey/
Status: UNCONFIRMED → RESOLVED
Closed: 19 years ago
Resolution: --- → INVALID
*** Bug 333292 has been marked as a duplicate of this bug. ***
> Anyway, if we split the ISO-8859-1 decoder (currently, it's Windows-1252 decoder
> ) into two, Windows-1252 decoder and ISO-8859-1 decoder, we may interpret
> ISO-8859-1 strictly for pages in standard mode.

Note that the native uconv support *does* use different decoders for windows-1252 and iso-8859-1, so this bug does not appear there.

I believe myself this bug should be fixed. Reopening.
Status: RESOLVED → UNCONFIRMED
Resolution: INVALID → ---
Perhaps this is relevant, perhaps not.

I've just spent 2 days tracking down a bug in minimo whereby form submission didn't work. I built minimo from current CVS sources, and have it running on Familiar Linux v0.8.4-rc2 on an iPAQ hx4700.

I've now found the problem. It was a gratuitous substitution of character set ISO-8859-1 by windows-1252 in the following file:

content/html/content/src/nsFormSubmission.cpp

  1313    // canonical name is passed so that we just have to check against
  1314    // *our* canonical names listed in charsetaliases.properties
  1315    if (charset.EqualsLiteral("ISO-8859-1")) {
  1316      charset.AssignLiteral("windows-1252");
  1317    }

I don't know why this code is there. The comment about charsetaliases.properties means nothing to me because I can find no reference to it elsewhere.

While Windows and even Linux installed on a desktop PC will have Codepage 1252 installed somewhere, an embedded Linux distribution installed on a portable device may not.

I fixed form submission in my minimo build by simply deleting the above 5 lines.

I might equally have installed Codepage 1252. But my feeling is that the software should respect its own default character sets!
Comment 14 is not really the same issue. Could you open a new bug, please? (Possibly those lines should be #ifndef MOZ_USE_NATIVE_UCONV, but it needs careful consideration)
You're correct, I was too hasty. A more thorough search of Bugzilla revealed that my comment would have been more usefully added to bug #228779. Duly added.
Bug 372325 is related to this issue.
Attachment #257007 - Attachment mime type: text/html → text/html; charset=ISO-8859-1
Has it already been mentioned that &#128; is also displayed as € ?
All &#XYZ; references with XYZ between 0x80 and 0x9F are also decoded according to cp1252, no matter which charset the page actually uses.
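
A sketch of that remapping (Python, illustrative only; the U+FFFD fallback for
the five cp1252 bytes with no mapping is an assumption here, not necessarily
what the parser actually does):

    def remap_ncr(codepoint: int) -> str:
        # Resolve a numeric character reference the way described above.
        if 0x80 <= codepoint <= 0x9F:
            try:
                return bytes([codepoint]).decode("windows-1252")
            except UnicodeDecodeError:  # 0x81, 0x8D, 0x8F, 0x90, 0x9D unmapped
                return "\uFFFD"
        return chr(codepoint)

    print(remap_ncr(0x80))  # '€': what &#128; renders as, whatever the charset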
(In reply to comment #19)
> Has it already been mentioned that &#128; is also displayed as € ?

Yes, that's bug 372325, referred to in comment 17.
This is a dupe of bug 99426 but that call was made over five years ago so I dunno if someone wants to revisit this.
Whiteboard: DUPEME
(In reply to comment #21)
> This is a dupe of bug 99426 but that call was made over five years ago so I
> dunno if someone wants to revisit this.

apparently not.
Status: UNCONFIRMED → RESOLVED
Closed: 19 years ago16 years ago
QA Contact: amyy → i18n
Resolution: --- → DUPLICATE
For what it's worth, still broken in 3.0.3.

Personally, I'd like a config setting that I can turn on to make this strict,
as it is supposed to be, because I would like to see broken encodings as broken.
(In reply to comment #23)
> For what it's worth, still broken in 3.0.3.
> 
> Personally, I'd like a config setting that I can turn on to make this strict,
> as it is supposed to be, because I would like to see broken encodings as broken.

ISTR that some years ago (maybe about the time Fx1 came out), the opposite decision was made, i.e. the decision to render 0x80 to 0x9F according to Windows-1252 when the stated charset is iso-8859-1, in order to avoid user reactions along the lines of "Firefox is broken, look: it cannot render pages X, Y, Z correctly, while I made them with PageMaker and IE has no problem with them" (yeah, I know IE6 and PageMaker were even more broken then than they are now, but the average M$W user is not expected to know that).

In Latin1, those are control characters, which (or most of which) would be expected not to happen on a webpage. Maybe whether to render Latin1 as Windows-1252 or not could be made (or is?) dependent on whether the page is rendered in "Quirks" vs. "Strict" HTML mode?
> .. dependent on whether the page is rendered in "Quirks" vs. "Strict" HTML mode?

Nope.  Sorry.  Makes no difference.  This is quite broken no matter how one winds a set of rationalizations around it.  I want to see the broken stuff as broken.

Please add a strict option like:
intl.charset.iso-is-strict-and-mozilla-is-not-a-joke
Can someone explain the rationale of this?  There is no glyph at that codepoint; I want to see nothing.
The rationale includes at least in part the fact that on Windows, most programs (including Notepad and IE) treat iso-8859-1 as if it were Windows-1252. However, there must be something more than that, since even on Linux, characters in the range 0x80-0x9F which exist in Windows-1252 are displayed as Windows-1252 even with "View => Character Encoding => Western (ISO-8859-1)"; see attached example.

Another part of the rationale might be (I'm speculating here) that when the decision was made to render Latin1 as cp1252, both Windows and IE had significantly larger market shares than they do now, that most corporations didn't care about Linux and few did about Firefox, that significantly more people than now were publishing pages in cp1252 but with Content-Type lines saying iso-8859-1, and that it was thought that if, at that time, Firefox didn't bend to the stream it would be perceived as "broken" because it wouldn't be displaying "correctly" the euro signs, OE digraphs, z-caron and big-y-diaeresis letters the way "other browsers" did (well, the way IE did; I can't test Safari but, by now at least, Konqueror/KDE3 doesn't).
Whiteboard: parity-Konqueror
P.S. See also comment #4.
(In reply to comment #27)

> The rationale includes at least in part the fact that on Windows, most programs
> (including Notepad and IE) treat iso-8859-1 as if it were Windows-1252.

I wouldn't say that is accurate, as the system encoding on Windows is already
cp1252 for the USA.  To serve the file correctly requires
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252"> so
that it is viewed as intended.  But you knew that.

Where in the code are these hacks found so that I may fix it myself?
(In reply to comment #29)
> (In reply to comment #27)
> 
> > The rationale includes at least in part the fact that on Windows, most programs
> > (including Notepad and IE) treat iso-8859-1 as if it were Windows-1252.
> 
> I wouldn't say that is accurate, as the system encoding on Windows is already
> cp1252 for the USA.  To serve the file correctly requires
> <meta http-equiv="Content-Type" content="text/html; charset=windows-1252"> so
> that it is viewed as intended.  But you knew that.

Even on Windows, people exchange email and browse the Web. I've seen (and more than just rarely) pages containing cp1252 text which were served as iso-8859-1.

> 
> Where in the code are these hacks found so that I may fix it myself?

See comment #14. Maybe it's got bitrotted by now, I wouldn't know about that. But unless you compile your own browser, you would (IIUC) have to get your changes reviewed, superreviewed and approved before you could land them. You're free to try, but I don't think you would succeed, considering this has already (as bug 99426) been WONTFIXed before.
Shoot, I wouldn't submit it.  I'm just out for myself.  I want to see broken pages as broken and with the correct set of glyphs as defined in the standards.
Bug 372325 has been idle for 1 1/2 years, so I don't expect much from the dev folks anyway.
Attachment #340847 - Attachment mime type: text/html → text/html; charset=ISO-8859-1
If you'd like to hear my 50 cents:
first fix the sending (Bug 228779) and then, years later (or never), fix the displaying.

Sending invalid ISO-8859-1 is many times more unreasonable than just displaying invalid ISO-8859-1 for compatibility reasons.

If people complain, tell them in the most friendly way that they are wrong; keeping it this way just breaks things (databases containing data with \u0080 instead of € and such things) and it's just pure laziness.

Many websites are already UTF-8 anyway, so you might already be lucky and see fewer complaints than some years ago.
Thank you Simon.  I found the first yesterday.  I don't see how that one could be config-switchable, though.  Switching at the caller makes more sense, but I don't know the code yet.  The second looks easy to adapt a config switch into.

If you have some more thoughts, please post to Bug 372325
(In reply to comment #35)

> Sending invalid ISO-8859-1 is many times more unreasonable than just displaying
> invalid ISO-8859-1 for compatibility reasons.

Jon Postel's law: "Be conservative in what you do; be liberal in what you accept from others."