Bug 717529 - Support CSS escape sequences like `\d834\df06 ` (broken up in UTF-16 code units)
Status: RESOLVED INVALID
Product: Core
Classification: Components
Component: CSS Parsing and Computation
Version: unspecified
Platform: x86 Mac OS X
Importance: -- normal
Target Milestone: ---
Assigned To: Nobody; OK to take it and work on it
URL: http://jsfiddle.net/mathias/jY7ra/
Reported: 2012-01-12 00:15 PST by Mathias Bynens
Modified: 2012-02-01 10:25 PST

Description Mathias Bynens 2012-01-12 00:15:52 PST
Gecko is the only engine that doesn’t support CSS escape sequences of the form `\d834\df06 ` (broken up in UTF-16 code units). Even though I cannot find any mention of these in the spec (http://www.w3.org/TR/CSS21/syndata.html#characters), it would be better for interoperability if Gecko added support for these.

The example I’m using, `\d834\df06 `, should be identical to `\1d306 ` or `\01d306`, both of which are escape sequences for the “tetragram for centre” symbol (U+1D306).
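
As a rough illustration of why the two spellings would name the same character, the standard UTF-16 surrogate-pair arithmetic can be sketched in Python. The function name `combine_surrogates` is hypothetical and chosen only for this example; it is not taken from any browser engine.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a single Unicode code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#x}")
    # Each surrogate carries 10 bits of the offset above U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD834, 0xDF06)
print(f"U+{code_point:X}")  # U+1D306
```

An engine that accepts the broken-up form would presumably perform a step like this after reading the two escapes.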

Here’s a simple test case: http://jsfiddle.net/mathias/jY7ra/

Opera and IE8+ support both types of escape sequences.

Note that WebKit doesn’t support the standard CSS escape sequences for symbols outside the BMP; see https://bugs.webkit.org/show_bug.cgi?id=76152.
Comment 1 Jonathan Kew (:jfkthame) 2012-01-12 01:53:14 PST
Given that a backslash-hexadecimal CSS escape sequence is defined as providing the code number of _an ISO 10646 character_, not the code number of a UTF16 code unit, I don't think this should be supported. The correct way to represent a non-BMP character is to use a 5- or 6-hexdigit sequence.

Perhaps this could be clarified in http://www.w3.org/TR/css3-syntax; IMO, the range \d000 to \dfff should, if anything, be explicitly made invalid. cc'ing dbaron for any thoughts.
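
For comparison, here is a minimal Python sketch of how the two standard spellings of a non-BMP escape could be generated from a code point. The helper name `css_escape` and its `padded` parameter are assumptions made for this example, and excluding code point zero is a conservative choice rather than something stated in this thread.

```python
def css_escape(code_point: int, padded: bool = False) -> str:
    """Return a backslash-hex CSS escape for one Unicode code point."""
    if not 1 <= code_point <= 0x10FFFF:
        raise ValueError(f"outside the supported range: {code_point:#x}")
    if 0xD800 <= code_point <= 0xDFFF:
        # A surrogate is a UTF-16 code unit, not a character.
        raise ValueError(f"surrogate code point: {code_point:#x}")
    if padded:
        # Exactly six hex digits: no terminating whitespace is needed.
        return f"\\{code_point:06x}"
    # Fewer than six digits: a trailing space ends the escape.
    return f"\\{code_point:x} "


print(repr(css_escape(0x1D306)))               # '\\1d306 '
print(repr(css_escape(0x1D306, padded=True)))  # '\\01d306'
```

Both results correspond to the forms `\1d306 ` and `\01d306` mentioned in the description.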
Comment 2 Mathias Bynens 2012-01-12 02:02:35 PST
(In reply to Jonathan Kew (:jfkthame) from comment #1)
> Given that a backslash-hexadecimal CSS escape sequence is defined as
> providing the code number of _an ISO 10646 character_, not the code number
> of a UTF16 code unit, I don't think this should be supported. The correct
> way to represent a non-BMP character is to use a 5- or 6-hexdigit sequence.

Don’t get me wrong, I definitely agree that is the only correct way according to the spec; I filed this “bug” purely because of interoperability concerns. Since Opera, WebKit and IE8+ support this non-standard syntax, it would be nice if Firefox could support it as well.

Either way, the spec could use some tweaks, whether by clarifying it as you suggested or by changing it to reflect reality (i.e. defining the “broken-up UTF-16 code units” syntax that almost all browsers have implemented).
Comment 3 Mathias Bynens 2012-01-12 04:43:08 PST
Taking this to www-style: http://lists.w3.org/Archives/Public/www-style/2012Jan/0536.html
Comment 4 David Baron (:dbaron) 2012-01-12 09:53:45 PST
If we were to support this, we'd need a way to prevent unpaired surrogates from ending up in internal data structures -- that could be a security risk.

The code that currently prevents this is the ENSURE_VALID_CHAR() call inside nsCSSScanner::ParseAndAppendEscape (in layout/style/nsCSSScanner.cpp).  ENSURE_VALID_CHAR is defined in xpcom/string/public/nsCharTraits.h and converts surrogates (U+D800 to U+DFFF) and code points greater than U+10FFFF to U+FFFD.
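
The behaviour described above can be approximated in Python as follows. This is only a sketch of the stated rule, not Gecko's actual code; `ensure_valid_char` and `decode_hex_escapes` are hypothetical names, and the sketch assumes each escape is validated on its own before being appended.

```python
from typing import Iterable

REPLACEMENT_CHARACTER = 0xFFFD


def ensure_valid_char(code_point: int) -> int:
    """Map surrogates and out-of-range values to U+FFFD, as described above."""
    if 0xD800 <= code_point <= 0xDFFF or code_point > 0x10FFFF:
        return REPLACEMENT_CHARACTER
    return code_point


def decode_hex_escapes(hex_values: Iterable[str]) -> str:
    """Decode a series of hex escape bodies, validating each one separately."""
    return "".join(chr(ensure_valid_char(int(h, 16))) for h in hex_values)


# Each half of the broken-up form is a surrogate, so each becomes U+FFFD.
print(ascii(decode_hex_escapes(["d834", "df06"])))  # '\ufffd\ufffd'
# The single five-digit escape passes through unchanged.
print(ascii(decode_hex_escapes(["1d306"])))         # '\U0001d306'
```

Under these assumptions, no unpaired surrogate can reach the output string, which is the property the comment is concerned with.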
Comment 5 Mathias Bynens 2012-02-01 10:25:12 PST
This bug is invalid as per http://lists.w3.org/Archives/Public/www-style/2012Feb/0006.html.
