Closed Bug 288904 Opened 19 years ago Closed 16 years ago

Do not display codepage 1252 (Windows-1252) characters when document is parsed as ISO-8859-1

Component: Core :: Internationalization
Severity: enhancement
Status: RESOLVED DUPLICATE of bug 99426
Reporter: 32768; Assignee: smontagu
Whiteboard: parity-Konqueror
Attachments: 4 files, 1 obsolete file

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8b2) Gecko/20050329 Firefox/1.0+
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8b2) Gecko/20050329 Firefox/1.0+

When a document is parsed as ISO-8859-1, characters in the range 128-159
(0x80-0x9F) are displayed using Microsoft Codepage 1252 (Windows-1252) extended
characters.

This character range is reserved in ISO-8859-1 (it corresponds to the C1
control codes) and is not valid for displayable text.

Reproducible: Always

Steps to Reproduce:
1. Open testcase
2. Make sure browser is treating it as ISO-8859-1


Actual Results:  
Characters which are invalid in ISO-8859-1 are displayed as Windows-1252 characters

Expected Results:  
Characters which are invalid in ISO-8859-1 should be removed or replaced with a
replacement character (such as '?')

Not all browsers/platforms which support ISO-8859-1 also support the codepage
1252 extensions.  Rendering these characters when the ISO-8859-1 character
encoding is used will encourage web authors to use characters that won't
appear on other platforms or browsers, and are invalid in ISO-8859-1.
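
To make the two interpretations concrete, here is a minimal sketch (Python,
illustrative only; the byte values 0x93/0x94, cp1252 curly quotes, are chosen
as an example and do not come from the testcase):

    smart_quotes = b"\x93quoted\x94"  # 0x93/0x94: curly quotes in cp1252

    # Decoded as Windows-1252, the 0x80-0x9F range yields printable text:
    print(smart_quotes.decode("windows-1252"))    # “quoted”

    # Decoded strictly as ISO-8859-1, the same bytes map to C1 control
    # codes (U+0093, U+0094), which have no glyphs and should not render:
    print(repr(smart_quotes.decode("latin-1")))   # '\x93quoted\x94'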
Attached file Testcase (obsolete) —
Attached file Testcase
Attachment #179536 - Attachment is obsolete: true
Assignee: nobody → smontagu
Component: Layout: Fonts and Text → Internationalization
QA Contact: layout.fonts-and-text → amyy
Whiteboard: DUPEME
(In reply to comment #0)

> Not all browsers/platforms which support ISO-8859-1 also support the codepage
> 1252 extensions.  

  Would they work if 'charset=windows-1252' is specified? If not, I don't think
changing our behavior will help them much.

> Rendering these characters when the ISO-8859-1 character
> encoding is used will encourage web authorers to use characters that won't
> appear in other platforms or browsers, and are invalid in ISO-8859-1.

I'm not as big a fan of 'be generous in what you accept and be strict in what
you emit' as I used to be. And I understand your good intent very well.
However, I'm afraid this is one of the cases where we have to bend a little (we
gave in to the 'document.all' crowd). Note that it's not just
ISO-8859-1/Windows-1252 but also TIS-620/ISO-8859-11/Windows-874 and
EUC-KR/x-Windows-949.

OS: Windows XP → All
Hardware: PC → All
David Baron suggested some time ago in a bug which I can't now find that we
should make Windows-1252 our default encoding for pages with no encoding
specified, and interpret ISO-8859-1 strictly at least in standards mode. I very
much agree in theory, but I'm sure it would break lots and lots of sites.

It would be interesting to gather some up-to-date statistics on how many pages
that are declared as ISO-8859-1 actually contain codepoints in the 0x80-0x9F
range. Erik, I believe you wrote a tool once to perform similar analyses. Is it
part of mozilla/webtools?

Off the top of my head, I would guess "one third of the Web", and as much as
half if you include pages that have a META charset declaration as windows-1252
but ISO-8859-1 in the HTTP headers.
(In reply to comment #4)

Yes, the charset-sniffing Web crawler is mozilla/webtools/web-sniffer/robot.c
It looks for HTTP and HTML charsets but not for 0x80-0x9f. Note that its
crawl may not be "broad" enough to get a good sense. Let me know if you have
questions.

Another approach, to be used instead or in parallel, is to start a discussion
on, say, www-international@w3.org to see how other people feel about this.
(In reply to comment #4)
> David Baron suggested some time ago in a bug which I can't now find that we
> should make Windows-1252 our default encoding for pages with no encoding
> specified, 

Urh, when no encoding is specified, we use the encoding specified in the user
preference as you know well.

(In reply to comment #6)

> Urh, when no encoding is specified, we use the encoding specified in the user
> preference as you know well.

Yes. What I meant by "make Windows-1252 our default encoding ..." was "make
Windows-1252 the out-of-the-box value of the user preference that specifies the
default encoding ...".

(In reply to comment #4)
> David Baron suggested some time ago in a bug which I can't now find

Bug 193094 comment 6
(In reply to comment #7)

> Yes. What I meant by "make Windows-1252 our default encoding ..." was "make
> Windows-1252 the out-of-the-box value of the user preference that specifies the
> default encoding ...".

Sorry to be pedantic ;-), but that needs a bit more qualification, 'for Western
European language versions', because the out-of-the-box default value for that
user pref is localizable.

Anyway, if we split the ISO-8859-1 decoder (currently it is a Windows-1252
decoder) into two, a Windows-1252 decoder and an ISO-8859-1 decoder, we may
interpret ISO-8859-1 strictly for pages in standards mode. I have no idea as to
the number of pages that will break with this change, but applying the strict
interpretation only to standards mode will significantly reduce the number.
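
Behaviourally, the strict half of such a split might look like this sketch
(Python for brevity; the real decoders would be C++ in uconv, and replacing
invalid bytes with U+FFFD is an assumption of this sketch, not a decision
taken in this bug):

    # Strict ISO-8859-1: reject the 0x80-0x9F range instead of passing
    # it through to the cp1252 mapping.
    def decode_iso_8859_1_strict(data: bytes) -> str:
        return "".join(
            "\uFFFD" if 0x80 <= b <= 0x9F else chr(b)  # replace invalid C1 bytes
            for b in data
        )

    print(decode_iso_8859_1_strict(b"caf\xe9 \x80"))   # 'café ' + U+FFFD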


Perhaps this could be thought of as a user interface issue: having separate
menu items in the user application for ISO-8859-1, US-ASCII and Windows-1252 is
misleading, since they all have the same effect - the Windows-1252 encoding
is used for all of them.

Perhaps another bug, in the Firefox product, could be created that specifies
"ISO-8859-1, US-ASCII and Windows-1252 choices under View->Character Encoding
are redundant".
This is an automated message, with ID "auto-resolve01".

This bug has had no comments for a long time. Statistically, we have found that
bug reports that have not been confirmed by a second user after three months are
highly unlikely to be the source of a fix to the code.

While your input is very important to us, our resources are limited and so we
are asking for your help in focussing our efforts. If you can still reproduce
this problem in the latest version of the product (see below for how to obtain a
copy) or, for feature requests, if it's not present in the latest version and
you still believe we should implement it, please visit the URL of this bug
(given at the top of this mail) and add a comment to that effect, giving more
reproduction information if you have it.

If it is not a problem any longer, you need take no action. If this bug is not
changed in any way in the next two weeks, it will be automatically resolved.
Thank you for your help in this matter.

The latest beta releases can be obtained from:
Firefox:     http://www.mozilla.org/projects/firefox/
Thunderbird: http://www.mozilla.org/products/thunderbird/releases/1.5beta1.html
Seamonkey:   http://www.mozilla.org/projects/seamonkey/
Status: UNCONFIRMED → RESOLVED
Closed: 19 years ago
Resolution: --- → INVALID
*** Bug 333292 has been marked as a duplicate of this bug. ***
> Anyway, if we split the ISO-8859-1 decoder (currently, it's Windows-1252 decoder
> ) into two, Windows-1252 decoder and ISO-8859-1 decoder, we may interpret
> ISO-8859-1 strictly for pages in standard mode.

Note that the native uconv support *does* use different decoders for windows-1252 and iso-8859-1, so this bug does not appear there.

I believe myself this bug should be fixed. Reopening.
Status: RESOLVED → UNCONFIRMED
Resolution: INVALID → ---
Perhaps this is relevant, perhaps not.

I've just spent 2 days tracking down a bug in minimo whereby form submission didn't work. I built minimo from current CVS sources, and have it running on Familiar Linux v0.8.4-rc2 on an iPAQ hx4700.

I've now found the problem. It was a gratuitous substitution of character set ISO-8859-1 by windows-1252 in the following file:

content/html/content/src/nsFormSubmission.cpp

  1313    // canonical name is passed so that we just have to check against
  1314    // *our* canonical names listed in charsetaliases.properties
  1315    if (charset.EqualsLiteral("ISO-8859-1")) {
  1316      charset.AssignLiteral("windows-1252");
  1317    }

I don't know why this code is there. The comment about charsetaliases.properties means nothing to me because I can find no reference to it elsewhere.

While Windows and even Linux installed on a desktop PC will have Codepage 1252 installed somewhere, an embedded Linux distribution installed on a portable device may not.

I fixed form submission in my minimo build by simply deleting the above 5 lines.

I might equally have installed Codepage 1252. But my feeling is that the software should respect its own default character sets!
Comment 14 is not really the same issue. Could you open a new bug, please? (Possibly those lines should be #ifndef MOZ_USE_NATIVE_UCONV, but it needs careful consideration)
You're correct, I was too hasty. A more thorough search of Bugzilla revealed that my comment would have been more usefully added to bug #228779. Duly added.
Bug 372325 is related to this issue.
Attachment #257007 - Attachment mime type: text/html → text/html; charset=ISO-8859-1
Has it already been mentioned that &#128; is also displayed as € ?
All &#XYZ; references with XYZ between 0x80 and 0x9F are also decoded according to cp1252, no matter which charset the page actually uses.
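
A sketch of that remapping (Python, illustrative only; the U+FFFD fallback for
the five cp1252 bytes with no mapping is an assumption here, not necessarily
what the parser actually does):

    def remap_ncr(codepoint: int) -> str:
        # Resolve a numeric character reference the way described above.
        if 0x80 <= codepoint <= 0x9F:
            try:
                return bytes([codepoint]).decode("windows-1252")
            except UnicodeDecodeError:  # 0x81, 0x8D, 0x8F, 0x90, 0x9D unmapped
                return "\uFFFD"
        return chr(codepoint)

    print(remap_ncr(0x80))  # '€': what &#128; renders as, whatever the charset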
(In reply to comment #19)
> Has it already been mentioned that &#128; is also displayed as € ?

Yes, that's bug 372325, referred to in comment 17.
This is a dupe of bug 99426 but that call was made over five years ago so I dunno if someone wants to revisit this.
Whiteboard: DUPEME
(In reply to comment #21)
> This is a dupe of bug 99426 but that call was made over five years ago so I
> dunno if someone wants to revisit this.

apparently not.
Status: UNCONFIRMED → RESOLVED
Closed: 19 years ago16 years ago
QA Contact: amyy → i18n
Resolution: --- → DUPLICATE
For what it's worth, still broken in 3.0.3.

Personally, I'd like a config setting that I can turn on to make this strict,
as it is supposed to be, because I would like to see broken encodings as broken.
(In reply to comment #23)
> For what it's worth, still broken in 3.0.3.
> 
> Personally, I'd like a config setting that I can turn on to make this strict,
> as it is supposed to be, because I would like to see broken encodings as broken.

ISTR that some years ago (maybe about the time Fx1 came out), the opposite decision was made, i.e. the decision to render 0x80 to 0x9F according to Windows-1252 when the stated charset is iso-8859-1, in order to avoid user reactions along the lines of "Firefox is broken, look: it cannot render pages X, Y, Z correctly, while I made them with PageMaker and IE has no problem with them" (yeah, I know IE6 and PageMaker were even more broken then than they are now, but the average M$W user is not expected to know that).

In Latin1, those are control characters, which (or most of which) would be expected not to happen on a webpage. Maybe whether to render Latin1 as Windows-1252 or not could be made (or is?) dependent on whether the page is rendered in "Quirks" vs. "Strict" HTML mode?
> .. dependent on whether the page is rendered in "Quirks" vs. "Strict" HTML mode?

Nope.  Sorry.  Makes no difference.  This is quite broken no matter how one winds a set of rationalizations around it.  I want to see the broken stuff as broken.

Please add a strict option like:
intl.charset.iso-is-strict-and-mozilla-is-not-a-joke
Can someone explain the rationale of this?  There is no glyph at that codepoint; I want to see nothing.
The rationale includes at least in part the fact that on Windows, most programs (including Notepad and IE) treat iso-8859-1 as if it were Windows-1252. However, there must be something more than that, since even on Linux, characters in the range 0x80-0x9F which exist in Windows-1252 are displayed as Windows-1252 even with "View => Character Encoding => Western (ISO-8859-1)"; see attached example.

Another part of the rationale might be (I'm speculating here) that when the decision was made to render Latin1 as cp1252, both Windows and IE had significantly larger market shares than they do now, that most corporations didn't care about Linux and few did about Firefox, that significantly more people than now were publishing pages in cp1252 but with Content-Type lines saying iso-8859-1, and that it was thought that if, at that time, Firefox didn't bend to the stream it would be perceived as "broken" because it wouldn't be displaying "correctly" the euro signs, OE digraphs, z-caron and big-y-diaeresis letters the way "other browsers" did (well, the way IE did; I can't test Safari but, by now at least, Konqueror/KDE3 doesn't).
Whiteboard: parity-Konqueror
P.S. See also comment #4.
(In reply to comment #27)

> The rationale includes at least in part the fact that on Windows, most programs
> (including Notepad and IE) treat iso-8859-1 as if it were Windows-1252.

I wouldn't say that is accurate, as the system encoding on Windows is already
cp1252 for the USA.  To serve the file correctly requires
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252"> so
that it is viewed as intended.  But you knew that.

Where in the code are these hacks found so that I may fix it myself?
(In reply to comment #29)
> (In reply to comment #27)
> 
> > The rationale includes at least in part the fact that on Windows, most programs
> > (including Notepad and IE) treat iso-8859-1 as if it were Windows-1252.
> 
> I wouldn't say that is accurate, as the system encoding on Windows is already
> cp1252 for the USA.  To serve the file correctly requires
> <meta http-equiv="Content-Type" content="text/html; charset=windows-1252"> so
> that it is viewed as intended.  But you knew that.

Even on Windows, people exchange email and browse the Web. I've seen (and more than just rarely) pages containing cp1252 text which were served as iso-8859-1.

> 
> Where in the code are these hacks found so that I may fix it myself?

See comment #14. Maybe it's got bitrotted by now, I wouldn't know about that. But unless you compile your own browser, you would (IIUC) have to get your changes reviewed, superreviewed and approved before you could land them. You're free to try, but I don't think you would succeed, considering this has already (as bug 99426) been WONTFIXed before.
Shoot, I wouldn't submit it.  I'm just out for myself.  I want to see broken pages as broken and with the correct set of glyphs as defined in the standards.
Bug 372325 has been idle for 1 1/2 years, so I don't expect much from the dev folks anyway.
Attachment #340847 - Attachment mime type: text/html → text/html; charset=ISO-8859-1
If you'd like to hear my 50 cents:
first fix the sending (Bug 228779) and then, years later (or never), fix the displaying.

Sending invalid ISO-8859-1 is many times more unreasonable than just displaying invalid ISO-8859-1 for compatibility reasons.

If people complain, tell them in the most friendly way that they are wrong; keeping it this way just breaks things (databases containing data with \u0080 instead of € and such things) and it's just pure laziness.

Many websites are already UTF-8 anyway, so you might already be lucky and see fewer complaints than some years ago.
Thank you Simon.  I found the first yesterday.  I don't see how that one could be config-switchable, though.  Switching at the caller makes more sense, but I don't know the code yet.  The second looks easy to adapt a config switch into.

If you have some more thoughts, please post to Bug 372325
(In reply to comment #35)

> Sending invalid ISO-8859-1 is many times more unreasonable than just displaying
> invalid ISO-8859-1 for compatibility reasons.

Jon Postel's law: "Be conservative in what you do; be liberal in what you accept from others."