Closed Bug 372325 Opened 17 years ago Closed 17 years ago

&#151; under UTF-8 is not a —

Categories

(Firefox :: General, defect)

x86
Windows XP
defect
Not set
normal

RESOLVED WONTFIX

People

(Reporter: davygrvy, Unassigned)

Details

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a3pre) Gecko/20070228 Minefield/3.0a3pre
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a3pre) Gecko/20070228 Minefield/3.0a3pre

The character represented by the entity &#151; is not \u2014 (—) when the claimed encoding is UTF-8.  There is no glyph there.

I think it is nice that FF is "fixing" Windows-1252 mis-representations, but by doing so, aren't you perpetuating the problem?  Please give me the empty squarebox instead so I can -RFC police- the offenders :)
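
A minimal sketch of how such offenders could be found without relying on the browser: scan the page source for numeric character references that land in U+0080..U+009F, the range where Windows-1252 byte values tend to leak in. The function name find_c1_references and the sample markup are made up for illustration; they do not come from Yahoo's pages or from Mozilla code.

```python
import re

# Numeric character references, decimal (&#151;) or hexadecimal (&#x97;).
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def find_c1_references(html_source):
    """Return (reference, code point) pairs that point into U+0080..U+009F."""
    offenders = []
    for match in NUMERIC_REF.finditer(html_source):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if 0x80 <= code_point <= 0x9F:
            offenders.append((match.group(0), code_point))
    return offenders


# Hypothetical sample markup: one correct reference (&#8212;) and three suspect ones.
sample = "<p>Markets fell &#151; again &#8212; on Tuesday &#x93;quoted&#x94;</p>"
for ref, cp in find_c1_references(sample):
    print(f"{ref} -> U+{cp:04X} (C1 control range)")
```

The correct reference &#8212; is not reported, because 8212 (0x2014) lies outside the C1 range.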

Reproducible: Always

Steps to Reproduce:
1. Go to http://news.yahoo.com/
2. Look at almost any news article for the 'em dash' glyph
3. Open source and see it referenced as &#151;, not &#8212;
4. Scratch head trying to find why 'Windows-1252' leaked into UTF-8
Actual Results:  
There's no glyph for &#151; under UTF-8

Expected Results:  
Show me an empty squarebox there.  Mozilla should be accurate.  Be the RFC police!  Do it well.
The web authorship issue of this is described @ http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
odd Windows-1252 leakage into ISO-8859-1 shown @ http://old.no/charmap/iso-8859-1.html
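
To make the leakage concrete, here is a small sketch (using only Python's standard codecs, not anything from Mozilla's code) of what the single byte 0x97 means under each encoding discussed here. The variable names are illustrative.

```python
import unicodedata

raw = bytes([0x97])  # decimal 151

# ISO-8859-1 maps every byte straight to the code point of the same value.
as_latin1 = raw.decode("iso-8859-1")
# Windows-1252 reassigns most of 0x80..0x9F to printable characters.
as_cp1252 = raw.decode("cp1252")

print(f"ISO-8859-1:   U+{ord(as_latin1):04X}, category {unicodedata.category(as_latin1)}")
print(f"Windows-1252: U+{ord(as_cp1252):04X}, {unicodedata.name(as_cp1252)}")

# A lone 0x97 is not valid UTF-8 at all: it is a continuation byte with no lead byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(f"Valid as UTF-8: {utf8_ok}")
```

Under ISO-8859-1 the result is U+0097, a control character (category Cc) with no glyph; only the Windows-1252 reading yields U+2014, the em dash.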

I'm seeing glyphs in the forbidden zones.  Yes, &#151; is displayed along with others.  Notice I added the offending char (—) to this form just to see the behavior.
Comment #2 is bug 288904, and a separate issue.
The good thing, anyway, is that it comes back properly as \u2014 on this page, since the form entry converted it correctly from cp1252.

I hear from other people that the problem might be related to fonts under Windows: Win32 might be going underneath Mozilla and implicitly adding those glyphs as a "favor", which would require code that strictly removes that range.

Umm.. add code to remove them, thanks.
(In reply to comment #0)
> The character represented by the entity &#151; is not \u2014 (—) when the
> claimed encoding is UTF-8.  There is no glyph there.

This is true but doesn't go far enough: "when the claimed encoding is UTF-8" is redundant because numeric entities should *always* represent Unicode codepoints whatever the encoding of the document they appear in.
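
As a rough sketch of the two behaviours being contrasted, the hypothetical resolver below treats a decimal reference either strictly (the number is the Unicode code point) or with a Windows-1252 compatibility remap for 0x80..0x9F. The function name, the compat flag and the partial mapping table are assumptions for illustration and are not taken from Gecko. Python's own html.unescape is shown alongside because it also applies the compatibility remap.

```python
import html
import re

# Partial Windows-1252 table for bytes in 0x80..0x9F (illustration only, not complete).
CP1252_COMPAT = {
    0x85: "\u2026",  # horizontal ellipsis
    0x91: "\u2018",  # left single quotation mark
    0x92: "\u2019",  # right single quotation mark
    0x93: "\u201c",  # left double quotation mark
    0x94: "\u201d",  # right double quotation mark
    0x96: "\u2013",  # en dash
    0x97: "\u2014",  # em dash
}

DECIMAL_REF = re.compile(r"&#([0-9]+);")


def resolve_decimal_refs(text, compat=False):
    """Replace decimal references; remap via CP1252_COMPAT only when compat is True."""
    def replace(match):
        code_point = int(match.group(1))
        if compat and code_point in CP1252_COMPAT:
            return CP1252_COMPAT[code_point]
        if code_point > 0x10FFFF:
            return "\ufffd"  # outside Unicode: substitute the replacement character
        return chr(code_point)
    return DECIMAL_REF.sub(replace, text)


strict = resolve_decimal_refs("a&#151;b")               # 'a\x97b' (C1 control, no glyph)
lenient = resolve_decimal_refs("a&#151;b", compat=True)  # 'a\u2014b' (em dash)
stdlib = html.unescape("a&#151;b")                       # 'a\u2014b' (em dash)
print(repr(strict), repr(lenient), repr(stdlib))
```

The strict result is what the reporter asks for; the lenient result matches the backward-compatible behaviour described in this comment.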

However I fear we are stuck with this incorrect behaviour for the sake of backward compatibility. I really don't think we want to start evangelizing http://news.yahoo.com and all the other authors who (ab)use entities in this way.
Status: UNCONFIRMED → RESOLVED
Closed: 17 years ago
Resolution: --- → WONTFIX
> This is true but doesn't go far enough: "when the claimed encoding is UTF-8" is
> redundant because numeric entities should *always* represent Unicode codepoints
> whatever the encoding of the document they appear in.

Yes, I looked it up earlier tonight and found numeric entities are always Unicode referenced.

> However I fear we are stuck with this incorrect behavior for the sake of
> backward compatibility.

I say make a statement for the better and fix the cruft.
Is there an about:config setting I can add so I get strict ISO-8859-1 decoding instead of the current broken one?
No, but if you file a bug asking for one I will consider adding it ;-)
you rock :)