Closed Bug 372325 Opened 18 years ago Closed 18 years ago

&#151; under UTF-8 is not a —

Categories

(Firefox :: General, defect)

x86
Windows XP
defect
Not set
normal

Tracking

RESOLVED WONTFIX

People

(Reporter: davygrvy, Unassigned)

Details

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a3pre) Gecko/20070228 Minefield/3.0a3pre
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a3pre) Gecko/20070228 Minefield/3.0a3pre

The character represented by the entity &#151; is not \u2014 (—) when the claimed encoding is UTF-8. There is no glyph there. I think it is nice that FF is "fixing" Windows-1252 misrepresentations, but by doing so, aren't you perpetuating the problem? Please give me the empty square box instead so I can -RFC police- the offenders :)

Reproducible: Always

Steps to Reproduce:
1. Go to http://news.yahoo.com/
2. Look at almost any news article for the 'em dash' glyph
3. Open source and see it referenced as &#151;, not &#8212;
4. Scratch head trying to find why 'Windows-1252' leaked into UTF-8

Actual Results:
There's no glyph for &#151; under UTF-8

Expected Results:
Show me an empty square box there. Mozilla should be accurate. Be the RFC police! Do it well.
The web-authoring side of this issue is described @ http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
Odd Windows-1252 leakage into ISO-8859-1 is shown @ http://old.no/charmap/iso-8859-1.html. I'm seeing glyphs in the forbidden zones. Yes, — is displayed along with others. Notice I added the offending char (—) to this form just to see the behavior.
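A minimal Python sketch of the two readings of byte 0x97, for illustration only. It assumes CPython's built-in `iso-8859-1` and `cp1252` codecs and says nothing about how Gecko decodes pages internally; the variable names are made up for this example.

```python
# The single byte that Windows-1252 uses for the em dash.
raw = bytes([0x97])

# Strict ISO-8859-1: every byte maps to the code point of the same value,
# so 0x97 becomes U+0097, a C1 control character with no glyph.
as_latin1 = raw.decode("iso-8859-1")

# Windows-1252: the 0x80-0x9F range holds printable characters instead,
# and 0x97 becomes U+2014 (em dash).
as_cp1252 = raw.decode("cp1252")

print(f"iso-8859-1: U+{ord(as_latin1):04X}")
print(f"cp1252:     U+{ord(as_cp1252):04X}")
```

A browser that shows a dash for this byte on a page labeled ISO-8859-1 is behaving like the second decoding, which is the "leakage" described in the comment above.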
Comment #2 is bug 288904, and a separate issue.
The good thing, anyway, is that it comes back properly as \u2014 on this page, since the form entry read it correctly from cp1252. I hear from other people that the problem might be related to fonts under Windows: Win32 might be going under Mozilla and implicitly adding those glyphs there as a "favor", so code would need to be strict to remove that range. Umm.. add code to remove them, thanks.
(In reply to comment #0)
> The character represented by the entity &#151; is not \u2014 (—) when the
> claimed encoding is UTF-8. There is no glyph there.

This is true but doesn't go far enough: "when the claimed encoding is UTF-8" is redundant because numeric entities should *always* represent Unicode codepoints whatever the encoding of the document they appear in.

However I fear we are stuck with this incorrect behaviour for the sake of backward compatibility. I really don't think we want to start evangelizing http://news.yahoo.com and all the other authors who (ab)use entities in this way.
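The gap between the strict reading and the backward-compatible one can be sketched in Python. This is an illustration under one assumption: the standard library's `html.unescape`, which follows the HTML5 parsing rules, stands in here for the lenient browser behaviour. It is not Gecko's code.

```python
import html

# Strict reading: a numeric character reference names a Unicode code point,
# so 151 is U+0097, a C1 control character with no glyph.
strict = chr(151)

# Lenient reading: html.unescape applies the HTML5 rules, which remap
# references in the 0x80-0x9F range through Windows-1252.
lenient = html.unescape("&#151;")

# The reference that actually names the em dash by its code point.
correct = html.unescape("&#8212;")

print(f"strict:  U+{ord(strict):04X}")
print(f"lenient: U+{ord(lenient):04X}")
print(f"correct: U+{ord(correct):04X}")
```

Under the lenient rules the two references are indistinguishable to the reader, which is why pages using &#151; keep displaying a dash.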
Status: UNCONFIRMED → RESOLVED
Closed: 18 years ago
Resolution: --- → WONTFIX
> This is true but doesn't go far enough: "when the claimed encoding is UTF-8" is
> redundant because numeric entities should *always* represent Unicode codepoints
> whatever the encoding of the document they appear in.

Yes, I looked it up earlier tonight and found that numeric entities always refer to Unicode code points.

> However I fear we are stuck with this incorrect behavior for the sake of
> backward compatibility.

I say make a statement for the better and fix the cruft.
Is there an about:config setting I can add so I get strict ISO-8859-1 decoding instead of the current broken one?
No, but if you file a bug asking for one I will consider adding it ;-)
you rock :)