Closed
Bug 372325
Opened 18 years ago
Closed 18 years ago
&#151; under UTF-8 is not a —
Categories
(Firefox :: General, defect)
RESOLVED
WONTFIX
People
(Reporter: davygrvy, Unassigned)
Details
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a3pre) Gecko/20070228 Minefield/3.0a3pre
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a3pre) Gecko/20070228 Minefield/3.0a3pre
The character represented by the entity &#151; is not \u2014 (—) when the claimed encoding is UTF-8. There is no glyph there.
I think it is nice that FF is "fixing" Windows-1252 mis-representations, but by doing so, aren't you perpetuating the problem? Please give me the empty squarebox instead so I can -RFC police- the offenders :)
Reproducible: Always
Steps to Reproduce:
1. Go to http://news.yahoo.com/
2. Look at almost any news article for the 'em dash' glyph
3. Open source and see it referenced as &#151;, not —
4. Scratch head trying to find why 'Windows-1252' leaked into UTF-8
Actual Results:
There's no glyph for &#151; under UTF-8
Expected Results:
Show me an empty squarebox there. Mozilla should be accurate. Be the RFC police! Do it well.
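The mismatch described in this report can be sketched in a few lines of Python. This is only an illustration of the two readings of the number 151, not anything taken from Gecko; the variable names are made up for the example.

```python
import unicodedata

code_point = 151  # the number written in the numeric reference &#151;

# Reading 1: the number is a Unicode code point (what the HTML specs say).
as_unicode = chr(code_point)  # U+0097, a C1 control character with no glyph

# Reading 2: the number is a byte value in Windows-1252 (what authors meant).
as_cp1252 = bytes([code_point]).decode("cp1252")  # U+2014

print(f"U+{ord(as_unicode):04X}", unicodedata.category(as_unicode))
print(f"U+{ord(as_cp1252):04X}", unicodedata.name(as_cp1252))
```

Under the first reading the result is a control character (general category Cc), which is why the reporter expects an empty box; under the second it is the em dash.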
Comment 1 (Reporter) • 18 years ago
The web authorship issue of this is described @ http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
Comment 2 (Reporter) • 18 years ago
Odd Windows-1252 leakage into ISO-8859-1 is shown @ http://old.no/charmap/iso-8859-1.html
I'm seeing glyphs in the forbidden zones. Yes, — is displayed along with others. Notice I added the offending char (—) to this form just to see the behavior.
Comment 3 • 18 years ago
Comment #2 is bug 288904, and a separate issue.
Comment 4 (Reporter) • 18 years ago
The good thing, anyway, is that it comes back properly as \u2014 on this page, since the form entry converted it correctly from cp1252.
I hear from other people that the problem might be related to fonts under Windows: Win32 might be going underneath Mozilla and implicitly adding those glyphs as a "favor", which would require strict code to remove that range.
Umm.. add code to remove them, thanks.
Comment 5 • 18 years ago
(In reply to comment #0)
> The character represented by the entity &#151; is not \u2014 (—) when the
> claimed encoding is UTF-8. There is no glyph there.
This is true but doesn't go far enough: "when the claimed encoding is UTF-8" is redundant because numeric entities should *always* represent Unicode codepoints whatever the encoding of the document they appear in.
However I fear we are stuck with this incorrect behaviour for the sake of backward compatibility. I really don't think we want to start evangelizing http://news.yahoo.com and all the other authors who (ab)use entities in this way.
Status: UNCONFIRMED → RESOLVED
Closed: 18 years ago
Resolution: --- → WONTFIX
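The difference between the strict behaviour the reporter asks for and the backward-compatible behaviour kept here can be sketched as follows. `resolve_strict` is a hypothetical helper written for this illustration; `html.unescape` is Python's standard-library function, which happens to apply the same Windows-1252 remapping for references in the 128 to 159 range. Neither is Gecko's actual code.

```python
import html
import re


def resolve_strict(text: str) -> str:
    """Hypothetical strict resolver: every decimal reference is taken
    as a plain Unicode code point, with no Windows-1252 remapping.
    (Sketch only: out-of-range numbers are not handled.)"""
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)


sample = "dash: &#151;"

strict = resolve_strict(sample)   # last character is U+0097 (C1 control)
lenient = html.unescape(sample)   # last character is U+2014 (em dash)

print(f"strict:  U+{ord(strict[-1]):04X}")
print(f"lenient: U+{ord(lenient[-1]):04X}")
```

The strict version yields the glyphless control character; the lenient version silently repairs the author's mistake, which is the trade-off this WONTFIX resolution accepts.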
Comment 6 (Reporter) • 18 years ago
> This is true but doesn't go far enough: "when the claimed encoding is UTF-8" is
> redundant because numeric entities should *always* represent Unicode codepoints
> whatever the encoding of the document they appear in.
Yes, I looked it up earlier tonight and found that numeric entities always reference Unicode code points.
> However I fear we are stuck with this incorrect behavior for the sake of
> backward compatibility.
I say make a statement for the better and fix the cruft.
Comment 7 (Reporter) • 18 years ago
Is there an about:config setting I can add so I get strict ISO-8859-1 decoding instead of the current broken one?
Comment 8 • 18 years ago
No, but if you file a bug asking for one I will consider adding it ;-)
Comment 9 (Reporter) • 18 years ago
you rock :)