Unicode font mapping fails for Egyptian Hieroglyphs

RESOLVED INVALID

Status

RESOLVED INVALID
8 years ago

People

(Reporter: saqqara, Assigned: smontagu)

Tracking

Firefox Tracking Flags

(Not tracked)

Attachments

(1 attachment)

(Reporter)

Description

8 years ago
User-Agent:       Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)
Build Identifier: Mozilla/5.0 (Windows NT 6.0; rv:2.0b6) Gecko/20100101 Firefox/4.0b6

Web page text that contains Unicode 5.2 Egyptian Hieroglyphs but does not name a specific font (e.g. via CSS) displays the unknown-character glyph instead of picking a glyph from a suitable font installed on the host system. This looks like a character mapping bug.

Reproducible: Always

Steps to Reproduce:
1. Install an Egyptian font (such as Aegyptus, http://users.teilar.gr/~g1951d/).
2. Read page http://jtotobsc.blogspot.com/2010/09/quick-test-for-ancient-egyptian-in-web.html
3. Note the hieroglyphs are not displayed.


Expected Results:  
The 'unknown characters' should be mapped to a font that supports the Unicode 5.2 Egyptian Hieroglyphs.

Curiously, if you paste the hieroglyph characters into an edit box, or into the Navigation or Search toolbars, they display correctly. Only web pages fail. Also note that Wikipedia entries name Aegyptus in their CSS styling, which fools the casual observer into thinking that hieroglyphs work (they only work if that specific font is installed).
Assignee: nobody → smontagu
Component: General → Internationalization
Product: Firefox → Core
QA Contact: general → i18n
You can't use surrogate pairs for character references. Encode non-BMP code points directly.
For example, &#xD80C;&#xDD3F; should be &#x1313F; (𓄿).
Interestingly, the RSS feed uses the UTF-8 raw bytes correctly instead of character references.
http://jtotobsc.blogspot.com/feeds/posts/default?alt=rss
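The relationship between a surrogate pair and the single code point it stands for can be sketched in Python. This is only an illustration: the function names are hypothetical, and the sample values assume the pair U+D80C U+DD3F, which is the UTF-16 encoding of U+1313F.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid UTF-16 surrogate pair")
    # Each surrogate carries 10 bits of the value above 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def char_reference(code_point: int) -> str:
    """Format a code point as a hexadecimal numeric character reference."""
    return f"&#x{code_point:X};"


if __name__ == "__main__":
    code_point = combine_surrogates(0xD80C, 0xDD3F)
    print(char_reference(code_point))  # &#x1313F;
```

The page source should contain the single reference printed here, not one reference per surrogate.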
Status: UNCONFIRMED → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → INVALID
(Assignee)

Comment 2

8 years ago
Created attachment 479313 [details]
Testcase using the right entity codes

Note that, contrary to what you wrote in an update to http://jtotobsc.blogspot.com/2010/09/quick-test-for-ancient-egyptian-in-web.html, this is not a Firefox bug in handling entities for Unicode SMP characters. The bug is in Blogger, which uses the wrong values for the entities, as Kimura-san already explained. This testcase uses the correct values and displays fine.

There is a useful tool for converting Unicode characters to different forms at http://rishida.net/tools/conversion/
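For a quick local check without the online tool, a minimal Python sketch of one such conversion follows. The function name is hypothetical, and the sketch does not escape markup characters such as & or <. It relies on Python strings iterating by code point, so a non-BMP character produces a single reference.

```python
def to_character_references(text: str) -> str:
    """Replace every non-ASCII character with a hex character reference."""
    # ord() returns the full code point, so characters above U+FFFF
    # yield one reference and never a surrogate pair.
    return "".join(
        ch if ord(ch) < 0x80 else f"&#x{ord(ch):X};" for ch in text
    )


if __name__ == "__main__":
    print(to_character_references("A\U0001313F"))  # A&#x1313F;
```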
(Reporter)

Comment 3

8 years ago
I'll take your word for it that some specification somewhere says this construction is illegal. Where can I read this for myself?

I was fooled by the fact that Safari, Chrome and Internet Explorer all support the use of surrogate pairs for character references, so apparently I was not alone in assuming the construction is valid. I suggest Mozilla submit this to the W3C test suite etc., if not done already, in the interests of standards conformance.
(In reply to comment #3)
> I'll take your word for it that some specification somewhere says this
> construction is illegal. Where can I read this for myself?

HTML5: 8.2.4.70 Tokenizing character references
http://www.w3.org/TR/html5/tokenization.html#tokenizing-character-references
> Otherwise, if the number is in the range 0xD800 to 0xDFFF or is greater than 
> 0x10FFFF, then this is a parse error. Return a U+FFFD REPLACEMENT CHARACTER.
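
The quoted sentence can be sketched in Python as follows. This covers only the range check quoted here and none of the other cases in the HTML5 tokenizer; the function name and the return shape are assumptions made for illustration.

```python
REPLACEMENT_CHARACTER = "\uFFFD"


def resolve_numeric_reference(number: int) -> tuple[str, bool]:
    """Return (character, is_parse_error) for the quoted rule only."""
    # Surrogate code points and values beyond the Unicode range are
    # parse errors and resolve to U+FFFD.
    if 0xD800 <= number <= 0xDFFF or number > 0x10FFFF:
        return REPLACEMENT_CHARACTER, True
    return chr(number), False


if __name__ == "__main__":
    # Each half of a surrogate pair is rejected on its own.
    print(resolve_numeric_reference(0xD80C))
    print(resolve_numeric_reference(0xDD3F))
    # The real code point resolves normally.
    print(resolve_numeric_reference(0x1313F))
```

Under this rule a pair of surrogate references yields two replacement characters, which matches what the reporter saw on the page.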

XML: 4.1 Character and Entity References
http://www.w3.org/TR/xml/#sec-references
> [66]       CharRef       ::=       '&#' [0-9]+ ';'
>             | '&#x' [0-9a-fA-F]+ ';'    [WFC: Legal Character]
and the definition of Well-formedness constraint: Legal Character,
http://www.w3.org/TR/xml/#wf-Legalchar
> Characters referred to using character references MUST match the production
> for Char.
and the definition of Char production.
http://www.w3.org/TR/xml/#NT-Char
> [2]       Char       ::=       #x9 | #xA | #xD | [#x20-#xD7FF] | 
> [#xE000-#xFFFD] | [#x10000-#x10FFFF]    /* any Unicode character, excluding
> the surrogate blocks, FFFE, and FFFF. */
WFC violations are fatal errors.
http://www.w3.org/TR/xml/#dt-wfc
An XML processor must stop normal processing when it encounters a fatal error.
http://www.w3.org/TR/xml/#dt-fatal
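
The Char production and its effect on parsing can be checked with a short Python sketch. The helper names are hypothetical. The second helper assumes the standard library's expat-based parser, which is expected to reject surrogate character references as not well-formed.

```python
import xml.etree.ElementTree as ET


def is_xml_char(code_point: int) -> bool:
    """Test a code point against the XML 1.0 Char production [2]."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def is_well_formed(xml_text: str) -> bool:
    """Report whether the standard library parser accepts the document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print(is_xml_char(0xD80C))    # surrogate, outside Char
    print(is_xml_char(0x1313F))   # inside [#x10000-#x10FFFF]
    print(is_well_formed("<r>&#xD80C;&#xDD3F;</r>"))
    print(is_well_formed("<r>&#x1313F;</r>"))
```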

HTML 4: 20 SGML Declaration of HTML 4
http://www.w3.org/TR/html4/sgml/sgmldecl.html
> 55296   2048    UNUSED  -- SURROGATES --
Character numbers from 55296 (D800 in hex) to 57343 (DFFF in hex) are not used in HTML 4 documents.

> I was fooled by the fact that Safari, Chrome and Internet Explorer all support
> the use of surrogate pairs for character references so apparently I was not
> alone making the assumption the construction is valid - suggest Mozilla submit
> to W3C test suite etc. if not done already in interests of standards
> conformance.

Recently WebKit also implemented this rule.
http://trac.webkit.org/changeset/61234
> fast/parser/entity-surrogate-pairs-expected.txt:
>     * HTML5 doesn't allow entities to create surrogate pairs.