Closed
Bug 372325
Opened 17 years ago
Closed 17 years ago
&#151; under UTF-8 is not a —
Categories
(Firefox :: General, defect)
RESOLVED
WONTFIX
People
(Reporter: davygrvy, Unassigned)
Details
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a3pre) Gecko/20070228 Minefield/3.0a3pre
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a3pre) Gecko/20070228 Minefield/3.0a3pre

The character represented by the entity &#151; is not \u2014 (—) when the claimed encoding is UTF-8. There is no glyph there. I think it is nice that FF is "fixing" Windows-1252 misrepresentations, but by doing so, aren't you perpetuating the problem? Please give me the empty square box instead so I can -RFC police- the offenders :)

Reproducible: Always

Steps to Reproduce:
1. Go to http://news.yahoo.com/
2. Look at almost any news article for the 'em dash' glyph
3. Open source and see it referenced as &#151;, not &#8212;
4. Scratch head trying to find why 'Windows-1252' leaked into UTF-8

Actual Results:
There's no glyph for &#151; under UTF-8

Expected Results:
Show me an empty square box there. Mozilla should be accurate. Be the RFC police! Do it well.
Reporter
Comment 1 • 17 years ago
The web authorship issue of this is described @ http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
Reporter
Comment 2 • 17 years ago
Odd Windows-1252 leakage into ISO-8859-1 is shown @ http://old.no/charmap/iso-8859-1.html. I'm seeing glyphs in the forbidden zones. Yes, — is displayed along with others. Notice I added the offending char (—) to this form just to see the behavior.
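The "forbidden zone" mentioned above can be illustrated with a minimal Python sketch. It is not Gecko's decoder, only an illustration using Python's built-in codecs: the same byte 0x97 decodes to an unprintable C1 control character under strict ISO-8859-1, and to an em dash under Windows-1252 (Python codec name `cp1252`).

```python
import unicodedata

# Byte 0x97 (decimal 151) is the value behind the em dash discussed in this bug.
raw = bytes([0x97])

# Strict ISO-8859-1 maps every byte to the code point with the same number,
# so 0x97 becomes U+0097, a C1 control character with no glyph.
strict = raw.decode("iso-8859-1")

# Windows-1252 assigns printable characters to most of 0x80-0x9F;
# 0x97 becomes U+2014 EM DASH.
lenient = raw.decode("cp1252")

print(f"iso-8859-1: U+{ord(strict):04X} category={unicodedata.category(strict)}")
print(f"cp1252:     U+{ord(lenient):04X} name={unicodedata.name(lenient)}")
```

A browser that treats ISO-8859-1 labels as Windows-1252 behaves like the second decode, which would explain glyphs showing up in the 0x80-0x9F range on that chart.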
Comment 3 • 17 years ago
Comment #2 is bug 288904, and a separate issue.
Reporter
Comment 4 • 17 years ago
The good thing, anyway, is that it comes back properly as \u2014 on this page, since the form entry read it correctly from cp1252. I hear from other people that the problem might be related to fonts under Windows: Win32 might be going underneath Mozilla and implicitly adding those glyphs there as a "favor", so strict code would be required to remove that range. Umm.. add code to remove them, thanks.
Comment 5 • 17 years ago
(In reply to comment #0)
> The character represented by the entity &#151; is not \u2014 (—) when the
> claimed encoding is UTF-8. There is no glyph there.

This is true but doesn't go far enough: "when the claimed encoding is UTF-8" is redundant because numeric entities should *always* represent Unicode codepoints, whatever the encoding of the document they appear in.

However, I fear we are stuck with this incorrect behaviour for the sake of backward compatibility. I really don't think we want to start evangelizing http://news.yahoo.com and all the other authors who (ab)use entities in this way.
Status: UNCONFIRMED → RESOLVED
Closed: 17 years ago
Resolution: --- → WONTFIX
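The backward-compatible behaviour described in comment 5 can be seen in a small Python sketch. It uses the standard library's `html.unescape`, which follows the HTML5 character reference rules rather than Gecko's code, so it only illustrates the same compatibility mapping: the reference &#151; is resolved through Windows-1252 instead of being read as code point 151.

```python
import html

# HTML5-style parsing remaps numeric references in the 0x80-0x9F range
# through Windows-1252, so &#151; yields U+2014 rather than U+0097.
legacy = html.unescape("&#151;")

# The reference that actually names the em dash by its Unicode code point.
correct = html.unescape("&#8212;")

# What a strictly literal reading of "151" as a code point would give:
# U+0097, a C1 control character with no glyph.
literal = chr(151)

print(f"&#151;  -> U+{ord(legacy):04X}")
print(f"&#8212; -> U+{ord(correct):04X}")
print(f"chr(151) is U+{ord(literal):04X}")
```

Under the strict reading the reporter asks for, &#151; would produce `literal` and render as a missing glyph; the compatibility mapping is what makes `legacy` equal to `correct`.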
Reporter
Comment 6 • 17 years ago
> This is true but doesn't go far enough: "when the claimed encoding is UTF-8" is
> redundant because numeric entities should *always* represent Unicode codepoints
> whatever the encoding of the document they appear in.

Yes, I looked it up earlier tonight and found numeric entities are always Unicode referenced.

> However I fear we are stuck with this incorrect behavior for the sake of
> backward compatibility.

I say make a statement for the better and fix the cruft.
Reporter
Comment 7 • 17 years ago
Is there an about:config setting I can add so I get strict ISO-8859-1 decoding instead of the current broken one?
Comment 8 • 17 years ago
No, but if you file a bug asking for one I will consider adding it ;-)
Reporter
Comment 9 • 17 years ago
you rock :)